This is the next part in the series of GPU 101 posts I started some time ago. If you haven’t checked part 0, part 1 or part 2, please do so. In the last part we discussed the thread model of modern GPUs. In this one we will take a look at their memory model.
So, as we’ve seen so far, the GPU is basically a huge SIMD machine. Because we need locality for efficiency, the SIMD units are grouped; we called those groups SIMD groups. AMD calls them compute units (CU), Intel execution units (EU) and nVidia streaming multiprocessors (SMX). They do have differences (that’s why the vendors gave them different names), but the idea is the same – put memory and ALUs near each other to spare the power needed for moving data.
C++, Java, Python and all of the popular modern languages have no notion of different memory spaces. For a C++ developer a pointer is just a pointer – it does not matter when or how that pointer is accessed, you can always dereference it, write to it or read from it, and you never know explicitly what the price of that will be.
Regarding memory types, CUDA and OpenCL use different buzzwords, and there are syntax differences between the two as well. We will stick to the OpenCL notation for the purposes of this part (but you can think of the source shown here as if it were written in GPU pseudo-code). The table below shows how the terms map to each other.

CUDA                                          OpenCL
registers                                     private memory
local memory (registers spilled to global)    no special name (spilled registers)
shared memory                                 local memory
global memory                                 global memory
constant memory                               constant memory
The types of memory that we care about on the GPU are registers and local memory, shared memory, global memory and constant memory. You might think that having so many types of memory would make the programming complicated. This is somewhat true, but in the end programming with that many memory types is not much more difficult.
The global memory is the slowest and largest memory (reaching tens of gigabytes) in the GPU. All threads (SIMD lanes) have the same view of it. Unfortunately, it can’t be used to implement general synchronization, since there is no way to prevent race conditions among SIMD lanes (threads). This is one of the fundamental design decisions in GPUs, so don’t expect that to change anytime soon. You can think of it more or less as regular CPU memory. We should note that global memory can’t be allocated on the device itself**. A pointer to such memory has to be passed to the kernel as an argument. In OpenCL such a pointer has to be marked with the __global prefix. Just like in x86, reads from global memory are cached in L1 and L2 caches, whereas the exact configuration of those caches is vendor specific. You can’t count on cache coherency.
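For illustration, here is a minimal host-side sketch of how a chunk of global memory gets into a kernel, using the standard OpenCL host API (the buffer, kernel and size are made up for the example):

//host: allocate 1MB of global memory and hand it to the kernel as an argument
cl_int err = 0;
cl_mem buffer = clCreateBuffer(context, CL_MEM_READ_WRITE, 1 << 20, NULL, &err);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffer);

//device: the kernel simply receives a __global pointer to that memory
//__kernel void process(__global float* data) { ... }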
In addition to the global memory, each and every GPU thread has access to a memory type called “registers“. This memory is private, so each thread has its own view of it. It is physically very close to the ALUs and it is the fastest memory in the GPU. Generally, the amount of register memory a GPU has is huge compared to a CPU. But we should consider the fact that we are running thousands of threads (SIMD lanes) at the same time, and those registers have to be divided among all SIMD lanes, so in the end each SIMD lane (thread) gets only a few of them. So, what happens when we have a big and complicated GPU program (kernel) that needs more registers per thread?
Well, the first thing a GPU stack might do is to schedule fewer tasks on the chip, so the registers are split among fewer threads. This is the so-called “lowering the threads in flight“. But as we discussed in the previous chapter, the primary GPU way of hiding latency is multithreading, so when we reduce the number of threads in flight, the performance might suffer.
Another approach is to use one of the other, slower memory types as register storage. This is done too and it is called “register spilling“. Registers are spilled (stored) to the global memory, so each thread gets its own small part of it. In CUDA, this part of the global memory is called “local*” memory (there is no special name for that in OpenCL). You do not have to manage those spills manually – you can just write code as if there were an infinite amount of registers and the compiler will figure out how many tasks to send to the GPU (or whether to reduce the number of threads in flight) and how many registers to spill. Some compilers offer command line options to control that manually.
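For example, CUDA’s nvcc has a --maxrregcount option, and NVIDIA’s OpenCL implementation exposes a vendor-specific build option with the same purpose (treat the exact flag as an assumption and check your vendor’s documentation):

//host: pass compiler options when building the OpenCL program;
//-cl-nv-maxrregcount is an NVIDIA-specific extension (cl_nv_compiler_options),
//other vendors have their own knobs
const char* options = "-cl-nv-maxrregcount=32";
clBuildProgram(program, 1, &device, options, NULL, NULL);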
Every stack variable a kernel uses is stored in registers, if possible. On some platforms this does not apply to stack arrays, so we should consider them stored in the slow “local” memory (meaning that stack arrays behave just like regular stack variables that were spilled into local memory). On other platforms, stack variables that don’t fit in the registers are actually allocated in true local memory, carved out of the SIMD unit’s shared memory pool. Here is a sample kernel using only registers:
__kernel void registerUsingKernel() {
    int firstRegister = 42;                //stored in a register
    int secondRegister = 196;              //stored in a register
    int uniqueThreadId = get_global_id(0); //stored in a register
    int result = firstRegister + secondRegister + uniqueThreadId; //stored in a register
    int registerStoredInTheLocalMemory[4]; //stack array - may end up in the slow "local" memory
}
It is hard to calculate the exact register usage by looking at the code, since registers are being recycled and reused (such reuse is possible when there are no data dependencies). Some compilers have options to print the register usage of every function, though.
The register usage is crucial for GPU performance. Hiding latency with multithreading implies that a lot of threads should be ready for execution at any given time, and if the data they need is far away, this hurts. This also introduces another phenomenon – you might end up in a situation in which you add source code that is never executed, yet it still slows down the program a lot! The reason is that the new source code might take up the registers you need and make them spill into the local memory (or just reduce the threads in flight). There are several approaches to that and we will discuss them in the next chapters, but it is clear that having a GPU kernel that runs fast is one thing, and having a huge GPU kernel that runs fast is another.
Now, on to shared memory. It has several purposes. All SIMD lanes (threads) inside a SIMD unit have the same view of it – meaning that all the threads we have grouped together share that memory and can communicate through it. Being able to do so, they can cooperate and help each other. Shared memory is almost as fast as a register, which is why it is often used as a manually managed cache – in most GPGPU apps the primary effort goes into finding what data your threads might share and using that fact to pre-cache it. Let’s write a simple kernel that finds the difference between every two adjacent elements of an array. The straightforward solution would be:
__kernel void difference(__global int* result, __global int* input) {
    int id = get_global_id(0); //store the unique thread id in a register
    if (id > 0) {
        int a = input[id];     //load from global memory to a register
        int b = input[id - 1]; //load from global memory to a register
        result[id] = a - b;    //store to global memory
    }
}
Every thread does one subtraction. If you do that on an average GPU with input data large enough, it will outperform the CPU version of this code by a lot. But we can do better. If we think about it, from the point of view of the SIMD group the same memory is read multiple times from the slow global memory. We can reduce those reads using shared memory as follows.
#define BLOCK_SIZE 256 //assumed work-group size; must match the size used at launch

__kernel void difference(__global int* result, __global int* input) {
    int tx = get_local_id(0);  //thread index within the thread block / SIMD unit
    int id = get_global_id(0); //unique thread index
    //__local marks shared memory in OpenCL, NOT local memory in terms of CUDA.
    //All threads in the thread block / SIMD unit have the same view of this memory.
    __local int sharedData[BLOCK_SIZE];
    //each thread pre-loads one of the input values into the shared memory -
    //the so-called manual caching to shared memory
    sharedData[tx] = input[id];
    //the barrier works inside the thread block / SIMD group and guarantees that all
    //threads from the block/group have reached this point before we move forward
    //(so we are sure all preloads from input into sharedData are done).
    //The barrier is hardware implemented and fast.
    barrier(CLK_LOCAL_MEM_FENCE);
    if (tx > 0) {
        int a = sharedData[tx];
        int b = sharedData[tx - 1];
        result[id] = a - b;
    } else if (id > 0) { //handle the corner case at the block boundary
        result[id] = sharedData[tx] - input[id - 1];
    }
}
The second version is a bit more complex, since we do the pre-loading manually. But it can easily make the execution tens of times faster, since the reads from the global memory are reduced. The amount of shared memory a kernel needs has to be known prior to the kernel launch, so it either has to be declared with a constant size inside the kernel, or to be explicitly specified with an API call when the kernel is launched from the host. This is needed because, just like registers, shared memory is a finite resource and can reduce the number of threads in flight – if a lot of it is needed by every thread block, you will have to run fewer of them.
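For the curious, this is roughly how the host-side variant looks with the standard OpenCL API, assuming we rewrite the kernel to take its shared buffer as a __local pointer argument (the names are made up for the illustration):

//kernel side: the size of the __local buffer is no longer known at compile time
//__kernel void difference(__global int* result, __global int* input, __local int* sharedData) { ... }

//host side: the size argument of clSetKernelArg is the number of bytes of
//shared (__local) memory to allocate for that argument; the data pointer must be NULL
size_t blockSize = 256; //assumed work-group size
clSetKernelArg(kernel, 0, sizeof(cl_mem), &resultBuffer);    //__global int* result
clSetKernelArg(kernel, 1, sizeof(cl_mem), &inputBuffer);     //__global int* input
clSetKernelArg(kernel, 2, blockSize * sizeof(cl_int), NULL); //__local int* sharedData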
A fun fact: even if you normalize by code size and calculate FLOP/s per line of code, the shared memory example is still several times faster.
According to the OpenCL standard, the shared memory has to be at least 16KB. Interestingly, according to the same standard, the minimum size a thread block / SIMD unit might have is 1 (one). If you happen to use such an OpenCL implementation (with block size = 1), the shared memory effectively becomes private (register) memory, since it is shared among a single thread. Doing shared memory optimizations on such systems may actually slow down the execution because of the extra work being done (in the same manner, if you run with a block size that is not big enough for your hardware, the shared memory implementation of the task above might take longer to execute).
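You can query what a particular device actually provides with the standard clGetDeviceInfo call, for example:

cl_ulong localMemSize = 0;
size_t maxWorkGroupSize = 0;
//CL_DEVICE_LOCAL_MEM_SIZE -> amount of shared (__local in OpenCL terms) memory per SIMD unit
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(localMemSize), &localMemSize, NULL);
//CL_DEVICE_MAX_WORK_GROUP_SIZE -> maximum number of threads in a thread block
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(maxWorkGroupSize), &maxWorkGroupSize, NULL);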
Constant memory is a special read-only type of global memory, which on some devices can be cached better. Some vendors implement clever mechanisms with it too – for example, when a SIMD lane reads a constant variable, the value might be broadcast to the other SIMD lanes in the SIMD unit (some vendors like nVidia implement that broadcast over half a SIMD unit), so if any of them needs it, it gets it for free. As you can see, if none of the other SIMD lanes needs that particular memory, constant memory can in fact be slower than global memory (because of the extra work done for the broadcasting and caching). The amount of constant memory is limited (it is usually in the order of tens of kilobytes and is vendor specific). Also, it should be noted that in contrast to the common notion of constant memory, the constant memory on the GPU is constant only during the kernel execution, meaning that you can change it before you launch your kernel (with API calls).
//this example contains pseudo-OpenCL code (simplified versions of the API calls) for the purposes of our examples

//cpp (host):
allocateDeviceBuffer(d_buf, 1024 * sizeof(int));
setConstant("offset", 42);
launchKernel("test", d_buf);

//kernel (device):
__constant int offset;

__kernel void test(__global int* dBuffer) {
    int id = get_global_id(0);
    if (id >= 1024) return; //the buffer holds 1024 ints, so valid indices are 0..1023
    dBuffer[id] = id + offset;
}
There is another type of memory in some GPUs named “texture” memory. If you happen to see it, you can think of it as a read-only global memory with fancy addressing capabilities (and, in some architectures, a better cache system). Texture memory is usually used to store 2D (or 3D) textures. When you do texture lookups, you often need to read not only the data stored at the [x][y] position, but also the data near it, like [x+1][y], [x][y+1], [x-1][y], etc. Because of that, the texture memory might be internally optimized for such reads. For example, instead of storing every row of the texture linearly in memory one after another, the data can be laid out along a space-filling curve such as the Hilbert curve. Such a curve ensures that lookups at nearby coordinates touch data that is close in memory. The good part is that you don’t have to manage that data ordering manually – it is handled by the driver and the GPU.
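In OpenCL, texture memory is exposed through image objects and samplers. Here is a small sketch of a kernel that reads neighbouring texels (the kernel name and the exact sampler settings are just for illustration):

__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE   |
                           CLK_FILTER_NEAREST;

__kernel void horizontalDiff(__read_only image2d_t src, __global float4* result, int width) {
    int x = get_global_id(0);
    int y = get_global_id(1);
    //neighbouring reads like these are what the texture cache / data layout is optimized for
    float4 center = read_imagef(src, smp, (int2)(x, y));
    float4 left   = read_imagef(src, smp, (int2)(x - 1, y));
    result[y * width + x] = center - left;
}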
There is no way to do synchronization across all threads in the GPU, but most of the hardware today is capable of doing atomic operations on the global memory. So code like this is perfectly valid:
__kernel void count196(__global int* ptr, int size, __global int* result) {
    int id = get_global_id(0);
    if (id < size) {
        if (ptr[id] == 196)
            atomic_add(result, 1);
    }
}
If a lot of threads try to do an atomic operation on the same memory, they get serialized and the performance suffers (a lot). Here is an improved version of the code above – instead of doing atomic_add on global memory, we do it on a shared variable, and after that we just update the global variable with the accumulated result. Thus we reduce the number of threads trying to modify the same variable at the same time.
__kernel void count196(__global int* ptr, int size, __global int* result) {
    __local int partialResult;
    int tx = get_local_id(0);
    if (tx == 0)
        partialResult = 0; //shared variables cannot be initialized at declaration, so one thread does it
    barrier(CLK_LOCAL_MEM_FENCE);

    int id = get_global_id(0);
    if (id < size && ptr[id] == 196)
        atomic_add(&partialResult, 1); //atomic on the fast shared variable

    barrier(CLK_LOCAL_MEM_FENCE);
    //only the first thread in the thread block updates the global result, thus we have
    //NUM_THREADS/BLOCK_SIZE atomic updates of the global variable (compared to NUM_THREADS)
    if (tx == 0 && partialResult)
        atomic_add(result, partialResult);
}
And lastly – as a rule of thumb – mixing pointers from different memory spaces is a bad idea, so don’t do that. Also, by design, if you put a barrier() call in a place that not all threads will reach (for example inside a branch), your system will most likely hang, since it will wait forever for all threads to arrive at that barrier – and they never will, because some went through the true branch of the if and some through the false one, so the GPU cannot continue executing all threads.
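To illustrate, here is a small sketch of the pattern to avoid (the kernel is made up just to show the problem):

__kernel void divergentBarrier(__global int* data) {
    int tx = get_local_id(0);
    if (tx < 32) {
        data[tx] *= 2;
        //WRONG: only threads with tx < 32 ever reach this barrier; the rest of the
        //thread block never arrives, so execution hangs (or is undefined)
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    //a barrier must instead be placed where every thread of the block executes it,
    //i.e. outside the branch
}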
In the next part, we will continue with more real code.
* Again, please note that there is a significant difference between the local memory in CUDA and in OpenCL. We are using the term ‘local’ here as in CUDA (meaning registers spilled to the global memory). OpenCL has no special word for that (it just calls them spilled registers) and uses the word ‘local’ for what CUDA calls ‘shared’. That is, in OpenCL the memory types are named private (or registers), local (or shared) and global (this one is the same). It is confusing. You could think it is nVidia’s fault for not using the same keywords as the OpenCL standard, but then again, CUDA was born before OpenCL.
**Actually, you can preallocate global memory before the kernel launch and use a memory manager on the device itself, thus having malloc/free-style allocation in the device code, but this is not a good idea since GPUs are not designed to be good memory managers.
p.s. and million thanks to Marto (blu) for the edit!