3. GPGPU 101 – GPU memory model

This is the next part in the series of GPU 101 posts I started some time ago. If you haven’t checked part 0part1 or part2, please do so. In the last part we discussed the modern GPUs thread model. In this one we will take a look at the memory model of the GPUs.

So, as we’ve seen so far the GPUs basically is a huge SIMD machine. Because we had to have locality for efficiency, the SIMD units were grouped. We called those groups SIMD groups. AMD calls them computation units (CU), Intel execution units (EU) and nVidia streaming multi-processors (SMX). They do have differences (thats why the vendors gave them different names), but their idea is the same – put memory and ALUs near each other to spare the power needed for moving data.

C++, Java, Python and all of the popular modern languages have no notion of different memory space. For a C++ developer a pointer is basically a pointer and it does not matter when or how that pointer is being accessed – you can always deference it, write or read from it and you don’t know explicitly what the price of that will be.

Regarding memory types, there are  different buzz words in CUDA and OpenCL, and the syntax differences between those too. We will stick to the OpenCL notation for the purposes of this part (but you can think of the source shown here as if it is written in GPU pseudo code). The table below is showing those differences.

Keywords regarding memory in OpenCL and CUDA

The types of memory that we care about on the GPU are registers and local memory, shared memory, global memory and constant memory. You might think that having so much types of memory would make the programming complicated. This is somewhat true, but in the end programming having that much memory types is not much more difficult.

A simplified view of SIMD Group / SMX / EU / CU.

The global memory is the slowest and largest memory (reaching tens of gigabytes) in the GPU. All threads (SIMD lanes) have the same view of it. Unfortunately, it can’t be used to implement synchronization, since there is no way to prevent race conditions among SIMD lanes (threads). This is one of the fundamental design decisions in the GPUs, so don’t expect that to change anytime soon. You can think of it more or less just as a regular memory in the CPU. We should note that the global memory can’t be allocated on the device itself**. A pointer to such memory has to be passed to the kernel as an argument. In OpenCL such pointer has to be marked with the __global prefix. Just like in x86, all reads from the global memory are cached with L1 and L2 caches, wheres the exact configuration of the caches is vendor specific. You can’t count on cache coherency.

In addition to the global memory, each and every GPU thread has access to a memory type called “registers“. This memory is private, so each thread has its own view of it. It is physically very close to the ALUs and it is the fastest memory in the GPU. Generally, the register memory a GPU has is huge compared to the CPU one. But we should consider the fact, that we are running thousands of threads (SIMD lanes) at the same time. And those registers should be divided along all SIMD lanes, so at the end each SIMD lane (thread) has few of them. So, what happens when we have a big and complicated GPU program (kernel) that needs more registers per thread ?

Well, the first that a GPU stack might do is to prepare fewer tasks for the chip, so those registers would be split among fewer threads. This is so-called “lowering the threads in flight“. But as we discussed in the previous chapter, the primary GPU way of hiding latency is through multithreading, and when we reduce the number of threads in flight, the performance might suffer.
Another approach would be to use one of the other slower memory types as registers memory. And this is done too and it is called “register spill“. Register are being spilled (stored) in the global memory. So, each thread can have its own part of the global memory. In CUDA, this part of the global memory is called “local*” memory (there is no special name for that in OpenCL). You do not have to manage those spills manually. So you can just write code as if there is infinite amount of registers and the compiler would figure out how many tasks to send to the GPU – if any, (or if it should reduce the number of threads in flight) and how many registers to spill. Some compilers are offering command line options to specify that manually.

Every stack variable a kernel uses is stored in the registers, if possible. On some platforms this does not apply for stack arrays, so we would consider them stored in the slow “local” memory (meaning that the stack arrays are just like spilled-into-local-memory regular stack variables). On other platforms, stack vars not fitting the registers are actually allocated in true local memory, chunked off from the SIMD shared mem pool. Here is a sample kernel using only registers:

It is hard to calculate the exact register usage by looking at the code, since the register are being recycled and reused (this reuse might be possible if there are no data dependencies). Some compilers have options to print the register usage of every function, though.

GPU memory types (keywords in CUDA)

The register usage is crucial for the performance of the GPU. Hiding latency with multithreading implies that a lot of threads should be ready for execution at any given time and if the data they need is far away, this might hurt. This also introduces another phenomena – you might end up in situation in which you are adding source code that is never being executed, but it still slows down the program execution a lot! The reason for this is, that the new source code might take up the registers you’ll need and make them spill into the local memory (or just reduce the threads in flight). There are several approaches to that and we will discuss them in the next chapters, but it is clear that having a GPU kernel that runs fast is one thing, and having a huge GPU kernel that runs fast is another.

Now, on to shared memory. It has several purposes. All SIMD lanes (threads) inside SIMD unit have the same view of it – meaning that all threads that we have grouped share that memory and they can communicate through it. Being able to do so, they can cooperate and help each other. The shared memory is almost as fast as a register. Because of that it is often used as manually managed cache – in most of the GPGPU apps the primary efforts are finding what kind of memory your thread might share and use that fact to pre-cache it. Lets do a simple kernel that finds the difference between every two elements of an array. The straightforward solution would be:

Every thread does one subtraction. If you do that on an average GPU with input data large enough, it would outperform the CPU version of this code by a lot. But we can do better. If we think about it, from the view of the SIMD group same memory will be read multiple times from the slow global memory. We can reduce those reads using shared memory as follows.

The second code is a bit more complex, since we do manually pre-load. But it easily can make the execution tens of times faster, since the reads from the global memory are reduced. The amount of shared memory a kernel might needs has to be known prior the kernel launch, so it either has to be declared with a constant amount inside the kernel, or to be explicitly declared with an API function when the kernel is launched from the host. This is needed, because just like registers, shared memory is a finite resource and can reduce the number of threads in flights – if a lot of it is needed by every thread block, you will have to run fewer of them.

Fun fact is that if you calculate FLOP/s per line of code, the example with shared memory is still times faster.

According to the OpenCL standard, the shared memory should be at least 16KB. It is interesting, that according to the same standard, the minimum size a thread block / SIMD unit might have is 1 (one). If you have such OpenCL implementation (with block size = 1), the shared memory will match the private(register, local) memory, since it will be shared along 1 thread. Doing shared memory optimizations on such systems may turn out to slow down the execution, because of the extra work being done (in the same manner, if you run block size that is not big enough for your hardware, the shared memory implementation of the task above might take longer to execute).

Memory types size in nVidia GTX 980 (log scale)

Constant memory is special read-only type of global memory, which on some devices can be cached better. Some vendor implement clever mechanisms with it too – for example when a SIMD lane reads constant var it might broadcast it to the other SIMD lane in the SIMD unit (some vendors like nVidia implement that broadcast on half a SIMD unit) and if any of those need it, it will get it for free. As you can see, if none of the other SIMD lanes needs that particular memory, constant memory can be in fact slower than global one (because of the extra work done for the broadcasting and caching). The amount of constant memory is limited (it is usually in terms of tens of kilobytes and is vendor specific). Also, it should be noted that in contrast to the common notion of constant memory, the constant memory on the GPU is constant during the kernel execution, meaning that you can change it before you launch your kernel (with API calls).

There is another type of memory in some GPUs named “texture” memory. If you happen to see such, you can just think of it as a read-only global memory with fancy addressing capabilities (and in some architectures, with better cache system). Texture memory is usually used to store 2D (or 3D) textures. When you do texture lookups, you often need to read not only the data stored at [x][y] position, but also the ones stored near that, like the ones at [x+1][y], [x][y+1], [x-1][y], etc. Because of that, the texture memory might be internally optimized for such read. For example, instead of storing linearly in the memory every row of the texture one after another, the data for the texture can be stored using an space filling curve pattern like Hilbert curve. Using such curve assures that the texture lookups will read data fields that are close to each other. The good part is that you don’t have to manage that data ordering manually. This is handled by the driver and the GPU.

There is no way to to do synchronization along all threads in the GPU, but most of the hardware today is capable of doing atomic operations in the global memory. So source like this is perfectly valid:

If there are a lot of threads trying to do atomic operation on the same memory, they will be serialized and the performance will suffer (a lot). Here is an improved version of the source above – instead of doing atomic_add of global memory, we will do it in a shared variable, and after that we will just update the global variable with the accumulated result. Thus we will reduce the number of threads trying to modify the same variable at the same time.

And lastly – as a rule of thumb – mixing pointers from different memory spaces is bad idea, so don’t do that. And by design, if you put memory_barrier() call in a place from where not all threads will go (for example a branch) your system will most likely hang (since it will wait forever for all threads to get on that barrier, and they will not since some went trough the true branch of the if, and some trough the false, so the GPU could not continue to execute all threads).

In the next part, we will continue with more real code.

* Again, please not that there is significant difference between the local memory in CUDA and OpenCL. We are using the term ‘local’ here as in CUDA (meaning, registers spilled to the global memory). OpenCL has no word for that (calls them spilled registers), so it uses the word ‘local’ as ‘shared’. Meaning, that in OpenCL the memory types are named private(or registers), local (or shared) and global (this one is the same). It is confusing. You could think that it is nVidia fault for not using the same keywords as the OpenCL standard, but than again, CUDA was born before OpenCL.

**Actually, you can preallocate global memory before kernel launch, and use memory manager on the device itself, thus having malloc/delete in the device code, but this is not a good idea since the GPUs are not designed to be good memory managers.

2. GPGPU 101 – GPU thread model

This is the next part in the series of GPGPU 101 posts I started some time ago. If
you haven’t checked part 0 or part1, please do so. In part 1 we’ve talked how the x86 processor approaches the problems of efficiency (locality) and performance (parallelism). Now, we will discuss how the modern GPUs are approaching those. In this part we will talk more about parallelism and in the next we will focus on locality.

As the name of the GPU (Graphics Process Units) suggests it was designed for processing graphics. And graphics tends to be an embarrassingly parallel problem – you have to calculate the color for tens of millions of pixels, and (most of the time) the color of every pixel is independent from the other pixels. But graphics are not the sole embarrassingly parallel problem and at one point people started using GPUs for stuff different from graphics. First it was used in a hacky fashion, since there was no API for general purpose programming and after that CUDA and OpenCL were born. As we mentioned, there were multiple GPUs designs and approaches, but today more or less all desktop GPU vendors (AMD, Intel and nVidia) have converged to similar architectures. They do have differences which should be taken into consideration when optimizing a specific program, but the general approach to writing programs for those GPUs is the same. Here are the overviews of the GPU approaches from Intel, nVidia and AMD.

CPU architecture – optimized for single thread performance

The CPU is designed to perform a single task as fast as possible (or “optimized for low latency“). In order to do so, it uses complex control logic (superscalar pipeline, branch predictor, register renaming) and complex (and big) data and instruction caches. On the other hand the GPUs have many simple cores, that are lacking all of the complex stuff in the CPUs (or “optimized for high throughput“). So, what does this means for us? …

In a nutshell, the GPU is a huge SIMD machine, so they are well suited for data parallelism tasks. The width of the SIMD unit depends on the vendor (4 in Intel GPUs, 32 in nVidia GPUs, 64 in AMD ones – wider lanes do not mean better performance of course, so keep that in mind). Some of the vendors are calling each lane of the SIMD unit a ‘core’. Usually, the vendors are combining some SIMD units into groups (4 SIMD units in nVidia are forming a “Streaming Multiprocessor”, 2 SIMD units in Intel are forming are “Execution Unit”, and in AMD 4 SIMD units are forming a “GCN Compute Unite (CU)”). We will call that “SIMD Group“. The idea of those groups is that the SIMD units inside them are sharing some kind of hardware (most common a cache unit, but also instruction fetch/decode system and others). This also provides some locality, which is good.

GPU architecture – optimized for throughput

In the basic diagram above, every little green square is a lane from a SIMD unit, which can perform a floating point (or other) operation over some data. All green boxes (or SIMD lanes, or “cores”) in one SIMD unit (surrounded by the whitest box) are executing the same instruction. We’ve grouped 4 SIMD units in a SIMD group and we have 2 such groups here.  So instead of multiple complex hardware, we have many simple cores. By the way, if you ever wondered how the vendors are calculating the FLOP/s on their processors – take the number of the SIMD lanes (cores), multiply that by 2 (since every core can executed up to 2 floating point operations per cycle with something called multiply-add or ‘mad‘) and multiply that by the frequency of the core. It is not hard to see that this theoretical FLOP/s peak is really, really very theoretical – there is no way to keep all units busy all the time and make them do mads all the time in any meaningful app.

But, as we discussed in the previous chapter, this complex hardware (branch predictor, register renaming etc.) is very important for the performance, since it keeps the pipelines busy. So if we are lacking that in the GPUs, isn’t the performance going to suffer a lot?

Well, obviously not. The trick that the GPU does is, that instead of trying to minimize the latency, it is trying to hide it with parallelism. It works as follows – every SIMD unit has a number of registers that it can use. The tasks we give to the GPU are in amount of hundreds of thousands. The GPU prepares some of those tasks in the hardware, so at any given moment there are multiple tasks ready to be executed (they share those registers). So, let say that we execute an instruction on a SIMD unit, and this instruction needs some memory from the DRAM also called global memory. In fact it is a bit different from the CPU DRAM and is named GDDRAM, but for our needs it is DRAM enough. On the CPU, if this instruction is not in the cache, we will have to wait and CPU will stall. If you have other thread ready to run, possibly the OS will switch to it (but this switch is expensive, since all the resources that were used by the stalled thread have to be swapped with the new one). The GPU on the other hand can simply choose another task to execute, since the data needed for it is already in the registers and waiting. So in some sense, we can say that the thread switch on the GPU is hardware implemented and basically free. The key here is that you have to give the GPU enough tasks in order to achieve high level of parallelism, that will be used to hide the latency.

But everything comes at a price, right. The price here are the registers. If your threads need a lot of data, at one point the registers would not be able to fit it all. At this point we have a so-called “register spill” to the global memory. Compared to the register, reads and writes to the global memory are ridiculously slow, so it is best to avoid that situation and keep the memory that every thread needs low.

Now, what about synchronization. If you have hundreds of thousands of threads, probably synchronizing them will be hell (after all synchronizing even 2 threads on the CPU is pretty hard). The answer is that the GPU is suited for tasks that don’t need synchronization at all – like graphics (real time or photorealistic), deep learning, simulations, etc. If you need synchronization on a global level – it is impossible. The only possible synchronization can be done is on a SIMD group level. But there are atomics on global level available. Tasks that require global synchronization usually are split into multiple GPU programs, thus offloading the synchronization job to the CPU.

And lastly, some programming. The thread model of the general purpose programming languages for the GPUs follows more or less the thread model of the hardware. For example in CUDA, threads are grouped in “thread blocks”. Threads in a block can talk to each other, even more they run on the same SIMD unit. How many threads are in a block is programmable, so if you have more than 32 (the size of the SIMD unit in nVidia), the thread block will be executed on multiple runs. And you can run multiple of those thread blocks (grouped in so-called “grid”). Both CUDA and OpenCL are offering multi-dimensional entities like groups of thread blocks and grids for convenience, but I’ve always preferred to use 1D spaces. Enough talking, lets do some coding, since it will make it much easier to understand.

Here is a program that sums up two arrays in C++.

And here it is in CUDA. Instead of having a for loop, every thread (or core) will sum one of the array fields.

Finally, some common keywords with the GPGPU apps. The CPU is usually called the “host“. The GPU is called the “device“. The host and the device are having physically different memories, and we have to do explicit copies from one to the other if we want to move stuff (more on that in the next part). The functions written for the GPU, that can be called from the host are called “kernels” (I’ve no idea why). Kernels can be only void functions (this makes sense, since they are executed from many threads, and if every one of them returns a result it could get messy). Functions, that can only be called from the GPU are called “device functions“.

Both OpenCL and CUDA are C based languages, and both of them offer extensions to get a unique per-thread id. CUDA has a C++ support too (though you don’t want to use virtual functions and other advanced stuff, but more on that in the next parts). Malloc and new can’t be used on the device. You have to allocate memory on the device before you start the kernel. This happens with a API call from the host (in fact, there is a way to use them with CUDA, but it is something that we will avoid and will talk about in the next parts).

How to synchronize the work between the host and the device ? What about having multiple devices ? How does a SIMD unit works, when there is a branch in the source and some threads go into the ‘true’ path, while others into the ‘false’ path? We look at all that (and more), again, in the next part.

CPU thermal profile during runtime

P.S. As you may have noticed the CPU has these very big cores located on one part of the chip, caches and memory on the other. In the GPUs on the other hand, you have kind of a better mixture of cores and memory (and the core are using much less power, since they don’t have any advanced features and a running at around 1GHZ). And we’ve talked how cooling/power is a major problem with current (and next) processors. But it is not only ‘heat’, it is heat per square mm. So in the CPUs, you have parts of the chip that are getting way hotter than the CPU vendors would like (caches require less power compared to ALU units). This makes the life (and the future) for the GPUs even brighter.

1. GPGPU 101 – x86

This is the next part of the GPGPU 101 posts I started some time ago. If you haven’t checked part 0, please do so. In this part we are going to make a brief overview of x86 (mainly based on this article).

So, in part 0 we arrived at the idea that in order to have performance and efficiency from now on, we need to have parallelism and locality. Now we need to introduce some context for our notions of parallelism and locality, so we can quantify those in any meaningful manner. First and foremost, we need basic familiarity with the environment. Which is often underestimated – the environment from the point-of-view a developer is immensely sophisticated. For a C++ developer for example, it starts with the compiler, goes all the way to the operating system and finishes in the hardware. So you have to know how the compiler works (and why it does what it does), what the operating system does (and why we need that) and have at least a basic understanding of the architecture for which you are writing (usually x86-64). If you happen to do scripting, instead of the compiler you have to know the interpreter (normally a Virtual Machine) so you basically have the same issues (along with probably some extra issues).

Software does not run in a magic fairy ether powered by the fevered dreams of CS PhDs” Branimir Karadzic

We will start with the processors. As I already mentioned this is the best article I’ve ever read about how a processor works – really short and hits most of the important stuff. I would say that it is critical for most of the programming needs a developer might have.

In a summary, there are a few ways a processor can be made. What the processor has to do however is fetch an instruction, decode it to understand what exactly has to be done, execute it and store the result somewhere. This process is organized in a pipeline manner, and so it is called pipelining. You can think of the pipeline described above as a 4-step serial operation. Often in real life, processor pipelines are much more complicated, but this explanation is good enough for our needs.

Processor Pipeline (image courtesy of lighterra)

If the stages in the pipeline are independent of each other, you can possibly execute them in parallel relatively easy (you can fetch one instruction, while decoding another, while executing a third one). Also you can possibly multiply that pipeline (create more parallel pipelines to the one you have), so you could process multiple instructions at once. This is called super-scalarity and it is implemented in every modern x86 processor. So here is a buzz-word: this super-scalarity is a way to do Instruction Level Parallelism (ILP) – meaning, you can execute multiple instructions at the same time with the mechanism, we just described. As we talked in the previous chapter, parallelism equals performance, so this is good – ILP gives us some performance here.

ILP works great if there are no data dependencies between the instructions. But what if there are? For example, if in the first pipeline we have an instruction to compute r=(A+B), but “A” is somewhere away so we have to wait for it, whereas in the second pipeline we have an instruction that needs that result, for example, to compute r/2 (the result of (A+B) divided by two). They will both have to wait. The same can happen if both pipelines have to execute an instruction, for which there is a specially dedicated hardware (for example, if you have some circuit that does hardware-implemented exp and both of the pipelines want to do exp(something)). Modern processors usually have a number of such calculation units because of that (so there are multiple floating-point-arithmetic computation units in your processor). We are going to talk more about those units later.

Basically, the text above gives a bird’s eye view of a x86 processor. We can think of the other stuff from the architecture as tools and mechanisms to prevent the processor from stalling and make sure it is as utilized as much as possible.

First, since the processor knows in advance what instructions it will have to execute, it can ask for the data they will need before they are actually in the execution part of the pipeline (and this is really clever). But what if there is a branch in the code and there is no way to know which path the code will take. There is a solution that addresses this issue – it is called speculative execution, and it employs branch predictors – every modern x86 has (at least) one. The branch predictor monitors which branches your control path comes across, and how those branches behave (statistically), which allows the predictor to speculate which direction a pending branch will take, so other units can prefetch the data this path needs in advance (preparing the data from both paths is not an option, since it will flood the data channels). So, basically if you have if-then-else in your code that is being executed a lot, and you are always going into the “if-true” clause (for example), this is practically free. But if your branch path is not that easily predictable, i.e. if it exhibits high entropy, the speculation may go wrong and this is going to cost you. Note that the bulk of the cost may not necessarily come from the instruction decoding and execution, but from the waiting for the operands data (we talked about data moving in part 0).

Cache hierarchy

There is a unit to take care of data moving being too slow. x86-64 processors have multi-level cache hierarchies (L1, L2 and L3, check image below). They are basically super-fast (and super-expensive) memory pools, that are close to the processor (actually, on the chip itself) and they store the data that you need the most (based on the frequency of usage, adjacency, etc). L1 is smallest, but fastest – L3 is largest but slowest. Those caches normally work on cache-line granularity – 64B in modern 64-bit CPUs. What that means is that when you read 1B from memory, the cache will load some adjacent 63B as well, starting from the greatest 64-byte aligned address, lesser-or-equal to the specified address. So, if you read up to 64 characters that are stored in memory arbitrarily apart from each other, you will likely cause multiple memory-to-cache transactions, compared to just one or two, if those characters were adjacent (more on the memory in the computer here). Apart from data, decoded instructions are also cached (on the figure below, L1d is a data L1 cache, L1i is instructions L1 cache). So, again, moving stuff around is expensive, so we need good data locality.

Modern x86 has a lot more inside, but there is at least one more thing that is worth mentioning, and this is so-called “out of order” (OoO) execution. Basically if there is an instruction that has to be executed from a pipeline, but it has to wait for data, this instruction can be put aside for a moment and another one can be executed. So the order of execution of the instruction in a pipeline is not guaranteed. But as you know when you write your programs, the order of execution matters (a lot!), so if you scramble some of the calculations you do, you will most certainly not have the result you are looking for. So the processor is not randomly doing the next instructions in the pipeline – it carefully examines what can be done and what not. Further more, it “renames” the data fields, which helps a lot to resolve the data-dependencies in the code. This introduces another buzz word, called “register renaming“.

Okay, so x86 has ILP, caches, branch predictors, OoO, Register Renaming. Putting all these efforts into keeping the pipelines busy. Actually, there is one more thing that is worth mentioning. It is the so-called (buzz word!) SIMD – Single Instruction Multiple Data. The idea is simple – if you are going to do all that instruction fetching, decoding and so on (and they can be really expensive because of x86’s archaic ISA origins), wouldn’t it be great if you can execute them on multiple data at the same time? For example, if you want to calculate the sum of two vectors (each having float x, y, z; as data members), if you can execute the same instruction (sum) on all of the elements those vector have (at once), that’d be great! And actually every modern processor has special hardware that can do exactly that. Modern compilers are trying to identify parts of your code that are suitable for such SIMD instructions (this is called “auto-vectorization“). Doing SIMD code is important – for example, libraries like Embree are heavily based on it. Pretty much every performance hungry application today has to use them. If you have ever wondered how V-Ray 3.1 got so much faster than 2.4, this is one of the reasons (of course, there are many more). SIMD has multiple versions and it can be basically found in every x86-64 system. So, SIMD is just like ILP, but instead on doing multiple instructions at a time, it does the same instruction on multiple data fields. As with the ILP, it is a way to do parallelism, and since parallelism equals performance, it is a way to boost the performance of your x86-64 code.

SIMD flow

If you want to get the most of SIMD, there is a way to write them yourself. The same goes for the memory access patterns – you can make sure you read/write as linearly as possible, have less data dependencies and so on. The same goes for branch organization and so on and so forth. And, as we talked in part 0, this is going to get more and more important (not less). Fortunately, you can do this fairly easily – at least in C/C++. I am not so sure about the modern languages like Java, C#, Javascript, Haskell and the others …

Remember that there are multiple computation units in the processor, because of the instruction level parallelism? What if there is not enough ILP and not all of them have something to do in the current super-pipelined task? This is where hyper-threading (known outside of x86 as Simultaneous Multi-threading, or SMT) comes. The idea is that you split every physical core into two (or more) logical cores. So you run multiple threads, say, two on the same core. And if one of them can’t occupy all of the hardware you have, the other one can use (some of) it. And no, hyper-threading does not make the processor linearly faster with the number of logical cores, it just helps better utilize its resources (and in some cases, if they are utilized enough or because of other reasons, it gives no benefits at all). Of course, modern x86-64 processors have multiple hardware cores on the same processor too – basically just multiplying the hardware of one core. Hyper-threading, multi-core and multi-processor systems are giving the last type of parallelism, which is on thread level. So we have ILP, SIMD and threads.

There we have it – the complicated x86 architecture at a glance, with its methods for parallelism (ILP, hyper-threading, SIMD, multi-cores) and locality (branch-predictor, caches).

However, there are two more things worth mentioning. First of all, since the compiler can see your code, it can identify the data dependencies during the compilation and spread the instructions for those further apart, thus generating longer sequences of data-independent instructions (and thus help the processor with the OoO task). If the compiler does that really well, the processor could do ILP without the hassle of register renaming, dependency monitoring and so on. There is actually a whole computer architecture approach based on this and it is called VLIW (Very Long Instruction Word). It is called like that because the data-dependency-free instructions that the compiler picks are packed in a single, very long ‘macro’ instruction – sometimes more than a hundred bytes (compared to up to 1-16 in not VLIW architecture, like x86). There were few attempts for such processors – Itanium from Intel, old AMD GPUs and nVidia Denver (which gets ARM instructions runtime (not compile time!), reorders them into VLIW ones and executes them in such a manner. Why nVidia decided to do that – I have no clue). The VLIW sounds great, since when we offload the processor of those tasks, it could do something else (at least, we are sparing transistors on the chip and this is important too). In real life many VLIW attempts have failed – compilers are getting way too complicated to produce, they start wasting way too much time (and memory) to compile and they often fail to generate good code (and during the compile time you know less about the program than at runtime) – meaning that often at the end they generate instructions that are executed in a serial manner. Still, VLIW has its applications and compilers in the wild. For example, everybody can take the LLVM compiler now and use it to generate code for AMD R600 – a VLIW architecture, relatively easy. More on all that in the next part of the GPGPU series.

And of course, there is an option not do all of that OoO with its complicated apparatus, and just use the transistors that would be needed for those to put more units for actual calculations. Those types of processors are called speed-demons (in contrast, complicated ones like x86 are called brainiacs). Speed demons have the potential to be more efficient (but only if there are good programmers and compilers, since they require more careful programming) – those smart features of the brainiacs come at a cost – some of them are power hungry or/and can’t be turned off when they are not needed. And these days a lot people had started talking not about performance, but performance per watt, so the power consumption is getting more and more important.

In the next part we will see how the GPUs are handling the instruction, data and thread level parallelism.

0. GPGPU 101 – Intro

So, if you want to get familiar with why we have to use CUDA and OpenCL and stuff like that, in the text below there are a few links to articles, talks and papers that explain that in length – you can read those in their order of appearance. I strongly advise everybody to at least check those. However, if you want to read just a summary of all that in one place, here you have it …

We have to start with a bit of history, since this is how every good story begins …

At one point early on during the computer era, people decided that they could show the result of the computer calculations on screen in other-than-alpha-numeric form, so their job could be more interactive (and less boring, I guess). Special hardware was needed for that – something that could convert large amounts (for that time) of data from the computer to something appropriate that the screen could understand, and thus the framebuffer was born. In some sense, probably this is the birth of the GPU and it happened long, long time ago (1970-something).

The GPUs have had a really crazy history since that time – till the Voodoo showed up in mid 90s (and OpenGL just before that), when everybody wanted (and did make) GPUs. There were all kinds of designs, approaches and implementations and it was time of really dynamic changes. Many of the companies did not manage to keep up with that, so many of them quit the business. These days we basically have nVidia, AMD and Intel, and in the mobile sector there are ARM (Mali), Imagination Technologies (PowerVR), Qualcomm (Adreno) and a few even smaller IP vendors. Here is pretty much the full story of it.

But the history that brings us to today is actually pretty cool. As it seems from the beginning of the computing era, a lot of chips designs were targeted at the CPU, but at some point everybody just started using x86. Yes, we had MIPS, Power/PowerPC, Sparc, etc, but they were nothing as x86 in terms of market share. More or less, we have been stuck with x86 for a long time. We have ARM now, which is quite different, though in some way it is just a cleaned-up, RISC variant of the same.

In comparison, GPUs have always had the freedom to completely change the architecture or perpetually try new stuff since they don’t have to support tons of legacy software. So there one can find all kind of crazy ideas, some working great, some not.

I would like to split the GPU history into three parts – before Voodoo, after Voodoo and after the GPGPU advent.

The Voodoo GPUs were so great back then (circa 1996), that they marked the whole period. They took the GPU ‘to the masses’ and everybody knew about Voodoo; I was 14-year-old and living in post-socialist country, but even I knew of that Voodoo stuff.

They were good, but not good enough to save their maker, 3dfx, from failing – in the dynamic situation back then it did not take many wrong steps to fail. 3dfx were bought by nVidia (circa 2000), which at that time had some great GPUs too. After a lot of market acquisitions, some time in the beginning of the 21st century there were only two major discrete desktop GPU makers left – nVidia and AMD, and two dominant APIs – DirectX and OpenGL. And actually I find that not so bad – the times when everybody had different approaches on the GPUs were the times when you have to write your code to fit tons of hardware (and this is not fun). This third period continues up to today, but something happened around ~2005, which marked the start of a new era – the GPGPU one, which is what we will focus on for the rest of this text.

Herb Sutter has great article about why that third period is so exciting and important. In my eyes, this article may have turned out to be as foreseeing and important for the computer industry as Dijkstra’s “Go To Statement Considered Harmful”.

Now, lets get away from the history for a moment and look closer at why this GPGPU stuff matters (a lot) …

In a nutshell, because of the way processors are made, we cannot make them a lot faster anymore (as we could every year, till 2005). So our programs can’t get faster and faster as the time goes by (or at least, they can’t get faster without some developer’s efforts).

There is Moore’s Law that everybody talks about – basically it is a rule-of-thumb that says that every 18-24 months or so, the number of transistors in a processor roughly doubles. The way this happens usually is that we start making half as big transistors, so we can put twice as many on the die. And on top of that those transistors can run at two times the frequency they had and because they are smaller, we can feed them considerably less voltage, and this resulted in the roughly same power consumption (but running faster). However, because of a physics phenomenon called “leakage” (if you want to know the details and much more on the topic, this talk is great) we can’t keep lowering the voltage and if we don’t lower the voltage we get four times the power consumption we had before. If you don’t believe that, ask Intel – they predicted 10ghz (and people were thinking a lot of crazy stuff, too) processors for 2011. By the way, because of the limitation of the speed of light, if we want to have a 100GHz single-clock-domain circuit, say, some CPU, the furthest distance an electron can travel during a clock is 2mm. Good luck cooling a chip that is 2mm big!

So, the processors today are power limited – we can’t cool them infinitely, we can’t make them use less power for higher frequencies. So people started saying that Moore’s Law was dead. Here is an image, that represents the number of transistor on Y axis, and the year on X.

As we can see, Moore’s Law was not dead by 2010, but at some point it will be – nothing exponential in nature lasts forever. But this day is not today – and perhaps won’t be in the next 10 years. We can still put more and more transistors, but we can’t make them run at higher frequencies. So the processors these days are getting more cores. Today’s expectations are that single core performance will improve somewhere between 2% and 15% per year (because of better architectures, optimizations in the process of their production and so on).

This makes x30 CPU performance for the next 50 years – for a comparison, for the last 30 years this number is x2500. At the same time, we expect the multi-core speed-up to be around 75% per year. And this happens today. The image below shows the theoretical GFLOPS of modern CPU (which has few very smart cores) vs GPU (which has thousands not-so-smart ones).

I think that most developers underestimate the above, and really bad so.

Let me give you an example – I worked in a company that made MMORPGs. Our code was single-threaded and slow, the game was slow, but we knew that at the moment we start selling that, computers would be fast enough and the game would run fine – MMORPGs usually take 3 to 5 years to make if you do your own engine (the game failed in the end, but that’s another story), so there was plenty of time. These days there are millions of people out there that still write code like that (embarrassingly single-threaded code, using languages and tools that are designed to be such), who still believe the same – that computers are fast enough or that they will get faster (Swift, Javascript – I am looking at you).

Well, they will not, and that’s why from now on the efficiency and performance of the programs will get more important than ever before (this is quite the opposite of what the majority of developers think). In the past, you could have a slow but sophisticated single-core and feature-complete program that runs sluggish on most machines. But you knew that after some time your program would run fine. This is the past – now everybody can design sophisticated programs (since we all have zillions of tools), but not everybody knows how to make them run faster, not to mention that the need for more and more computations has been outgrowing the speed at which processors are getting faster since the beginning of the computing.

“Concurrency is the next revolution in how we write software. The vast majority of programmers today dont grok concurrency, just as the vast majority of programmers 15 years ago didnt yet grok objects” Herb Sutter.*

So, lets get to the power (see the image below). As we noted above, processors are power limited. If we take one 20 mm^2 processor chip and fill it with ALUs only (for instance, FMAD – fused multipy-add, basically simple units that can do some floating point operations), this would give us 12 TFLOPS (using 28nm production process) and would result in 300 Watts of power (since every double FMAD uses 50pj). For a comparison today’s modern GPUs (which are considered really power efficient) are using around 200Watts for ~5 TFLOPS. That is because these units can’t do anything without data – they have to get input data, over which to do their computations, and store the output results somewhere. And moving data comes at its own costs – moving 64 bits at 1mm distance costs 25pj. If you have to move 2×64 bits, you are already wasting more (or the same) power for the data transfer (2×25), than for the calculations (50). These costs are growing linearly – if you have to move the data along 10mm, this would cost 250pj, and so on and so forth. For some cases (as in the DRAM of the GPU) it goes up to 10000pj. These joules can be converted to latency too, since when the data has to travel longer distance, it arrives later. It is clear that you want to have your data as close to the computation units as possible.

And when we look at the “modern” computer languages, we can see that they clearly are “above” all that – C++, Java, Ruby, Haskell, Go – all of them present the world as “flat” (read: fully-coherent) – from their perspective all memory is the same and accessing it costs the same, too. And I think they seem to be proud of that, in some manner. The same goes for x86 architecture itself, by the way – it has some instructions for pre-fetch and cache management, but they are just little patches over a large problem. It is good that we have all those caches in the x86, trying so badly to hide the latencies, but it is just that the developers sometimes (by nature) know more than any heuristics does. There are a lot (really, a lot) problems of such nature. So if there is a way to manually “manage the cache”, a lot of developers/programs could benefit.

“Performance = Parallelism”, “Efficiency = Locality”William Dally

Who targets what and how – in the next part.

* Parallelism is to run stuff in parallel – like having multiple cars on the highway, all going parallel on the road.
Concurrency is to be able to advance multiple tasks in a system. For example, if you have two cars that have to be moved form A to B, and one driver to do the job. The driver might drive the first car for some walkable distance, than go to the second car and drive it to go in front of the first one, than go back to the first car, etc. After a lot of time, both of the cars will be moved from A to B – this is concurrency. If you have two drivers for the job, this is parallelism.
The situation, at which some resource might be needed from different subjects at the same time – for example if the cars have to cross ajunction, they would have to wait for each other before they can cross it (or otherwise a crash will occur) is concurrency problem.
If you have parallelism, you have a concurrency too. Having heavy parallelism presupposes having to deal with concurrency problems.

I admit I was wrong

Okay, I hate that I have to do this, but it seems that if I made a blog about C++, OpenCL and CUDA in a language in which I personally know all the native speakers that do care about those technologies (GPGPU ones the most), I could not have really big audience (they are literally 5 people and Boris. Hey Boris!). So I will start blogging in english. I have bad english, but I will try to do my best.

Before a year or so I decided to write down a paper in which I wished to tell people more about how do I have spend a couple of months in Chaos. Basically it was a story of how we solved a problem, that we had (it is not a general solution that works everywhere and for everybody, but it did worked for us back then). And I learned a lesson – writing papers is boring and frustrating. You have to write a certain number of pages, you have to have references, abstract and images all around the place. Than you have to convince somebody that what you have written is worth it (“it made a popular raytracer faster is not enough of a proof, apparently). And dealing with the academic community was not hours of fun.

So, I am starting to blog about CUDA (and some C++ and OpenCL ofc, but we will start with more CUDA, I think). In english.
I will start with the boring stuff (what is CUDA and why I don’t write more about OpenCL) and I will move to the more interesting one (fast). I promise that even in the boring ones, I will do my best to tell spicy important details that I had hard time finding.

Stay tuned.

no except

Ето една интересна лекция на Meyers, в която той обяснява много добре какво е std::move, std::forward и noexcept :)

Към средата има една част, в която е показано колко по-бърз може да е кода, когато е с noexcept там, където е възможно. Освен това имаше показани примери, как чрез проста смяна на компилатора със C++11/14 enabled такъв, кода става по-бърз. Във връзка с noexcept се появи в един момент и тази графика.

Колко хубаво, помислих си .. Ще направим сега едно NOEXCEPT макро (което ще е #define NOEXCEPT noexcept когато компилираме със С++11/14 компилатор), ще прекомпилираме всичко което имаме (и вкъщи и в офиса) с него и ще получим ей така едни проценти забързване от нищото ..

Та изрових една моя стара (С++98) микро библиотека за математика (вектор, точка, матрица, quaternion, в общи линии това) и реших да завъртя един цикъл с тях (тя всъщност е толкова проста, че не очаквах каквито и да е разлики в резултатите).
Ползвам clang-600.0.51(llvm 3.5svn), компилирам с -Os (което е по дефаулт) (с -O3 резултатите са почти същите) и C++14 enabled.

Кода от теста се намира тук.

Интересно, че противно на приказките на Meyers, Chandler (from LLVM) и не знам си кой друг, на моята машина (I7-3615QM) кода върви по-бавно с между 2 и 5 процента, когато функциите имат noexcept.
Или аз го ползвам много не както трябва, или някой някъде малко послъгва …

No country for old men

Още един “benchmark” между ARM и Intel CPUs (идеята за него дойде след като разни JS benchmark-ове на iPad Mini Retina минаваха само 2-3-4 пъти по-бавно от колкото на i7 2600 … Логичният въпрос, който някой (Тео) зададе беше дали наистина е толкова по-бавен, на приложение като raytracer).

Тестваме smallpt (не че това е представителен raytracer, но е raytracer), naive quick sort, random mem access & floating point operations. Кода на тестовете е тук.

Тестваме I7-3720QM (2.6GHZ) (Macbook Pro Retina 2012) vs 64-bit A7 (1.3GHZ) (iPad Mini Retina).

Тестваме еднонишково.

Тестваме с clang-503.0.40 (based on LLVM 3.4svn) (с опции -O3, relaxed IEEE & strict aliasing).

Основният тест е smallpt, покрай него има и няколко проби с по-прости сметки.
Тестовете са правени по 5 пъти (да, само 5), максималната разликата в резултатите на Intel-a е в рамките на 8% (което е бая), а на iPad-a e в рамките на 2%.

По време на тестовете не е имало други пуснати приложения (а лаптопа не е бил в power-save mode, etc). Сиреч, почти може и да си помисли човек да им вярва.

Показани са усреднените стойности на времето, което е отнел всеки тест в секунди (тоест, по-ниските правоъгълничета означават по-добри резултати).

Ако някой има С/С++/Objective-C/Swift код който иска да бъде тестван, PM me :).

Като цяло, резултатите са такива, каквито се предполагаше че ще бъдат.

There is no country for old men.

p.s. през *почти* същите тестове мина и един iPhone 4, с ~ резултати Raytrace:240s, Sort:72s, Mem:56s, FPU 21s & iPhone 6 – Raytrace 13s, Sort 11s, Rand mem access 9.5s, FPU 1.92s.

VTable по поръчка

Ето в това видео от Going Native 2013 A. Alexandrescu разказва за някои от проблемите, които виртуалните функции в С++ създават. А именно : всеки обект държи по един (или повече) 64 битов (най-често) указател към виртуалната таблица, това измества разни неща от cache-а; винаги този указател седи в началото на обекта (а в началото е добре, да има неща които се ползват по-често. Това може да е указателят към виртуалната таблица, но може да е друго :)). Рядко в живота може човек да се докара до там, че да иска и това да забързв. Но се случва.

Идеята е да си напишем наша имплементация на виртуални функции, която да ползва не 64 битово число, ами 8 (така ще се ограничим по максимален брой виртуални функции в интерфейс или по максимален брой класове в йерархия, но 256 (2^8) често е повече от достатъчно).

Ето и примерна имплементация (с малко macro-та и нагласявки :)). На места е малко redundant; мисля че може да се постигне и по-приятен синтаксис.

Не успях да събера мотивация да направя тестове за да видя има ли забързване, в кои случаи има и колко е то. Според Alexanderscu Facebook работи 4% по-бързо заради тази магия :).
Предимство е, че така можем и да изберем къде да поместим този “char”, който ползваме като указател към виртуална таблица -> ако не се очаква виртуални методи да се викат често, навярно е по-добре да не е в началото на обекта.

Raytrace Interop

Oпитвам се напоследък да правя разни бързи неща с CUDA.
Често има проблем с вземането на резултата от GPUs app (то не става особено бързо, да започне да се случва асинхронно е голяма болка, а пък понякога трябва да става особено често), та един начин това да се случва е с CUDA/OpenGL interop.

Идеята (поне моята) е с някой CUDA kernel да рисувам нещо в някоя текстура, а с OpenGL само да си я рисувам на екрана. И така, понеже всичко си е на едно място (ze gpu), да спестя всякакво четене/писане от видео паметта.

В резултат на ровене из форуми, stackoverflow, tutorials && references, ето какво се получи :

1. YouTube (beware of the video compression)(out-of-core version).
2. GitHub.

Имам около 40fps, интериорна сцена с голяма лампа (cornell box), Macbook Pro, i7 2.6GHZ, nVidia 650M, 512×512 resolution, ray depth 8, rays per pixel 1.

Това, което се случва е : CPU-то инициализира CUDA & OpenGL, стартира CUDA kernel които трасира няколко лъча, резултата се записва в текстура, и накрая се рисува с OpenGL.

Целта на кода по-горе е да се ползва като бърз startup за някои CUDA driven apps (като raytracers :)).

Pros на моят вариант : почти всичко се смята на GPU-то (генериране на семпли, random числа, трасиране, събиране на резултата), ползва Halton редици (beware, nVidia държат патент над тях), ползва CUDA/OpenGL interop (сиреч, ъпдейта на картинката се случва максимално бързо).

Cons : това не е production, aми по-скоро naive имплементация на ray trace, може да се стане няколко пъти по-бързо и noise-free (няма shadow rays, complex sampling), GPU-тата се feed-ват от 1 CPU thread, etc.