4. GPGPU 101 – APIs

This is the next part in the series of GPU 101 posts I started some time ago. If you haven’t checked part 0part1part2 or part3, please do so.

Let’s look a bit deeper into the APIs used for GPGPU programming. Basically, every GPGPU API we will discuss is C extension language, in which you write the kernels that run on the GPU itself, plus an API to call those kernels, send them arguments and receive results. Some APIs (like OpenCL) have API bindings to script languages like PyOpenCL (Python), jock (Java) and even for the browser.

Historically, the first popular API was from nVidia and it is called CUDA. CUDA is proprietary and runs on nVidia hardware only. Being such allowed it to evolve faster than other committee-driven APIs, since decisions for changes in it are much easier to take when there is no committee and you have all the control.

OpenCL showed few years after CUDA in attempt to prevent the chaos that would happen if everybody (read Intel, AMD, Imagination, ARM) did there own API (which in the end is what happened) and to provide a standard, which everybody (making CPU, GPU or whatever) could implement for their hardware. nVidia were not really happy about OpenCL, since programs written in it could possibly run on Intel and AMD GPUs or CPUs, but they supported it nevertheless.

The history of OpenCL so far is dark and full of terror – back then AMD were using VLIW architecture, which was everything but ready for GPGPU. VLIW can’t handle data dependencies well. In a game shader, you may have only few such, but in a general purpose app their number easily goes up, since the general purpose programs usually contain more complicated logic. GPUs from Intel on the other hand were small and not ready for that, too. nVidia had Tesla architecture, but they lack the will to make an OpenCL implementation that runs better than CUDA. There were lack of tools, too – both for debugging and profiling. Compilers were buggy, took a lot of time to compile and often crashed. Most of the runtime bugs in the code ended up in BSOD. Those compilers were often done in a quick ‘n’ dirty manner by modifying some C frontend and appending some shader-like backend. As you would tell, this didn’t end well. There were OpenCL implementation for CPUs as well, both from AMD and Intel, and both of them were really good (and run on all kind of x86 processors, of course). These OpenCL implementations for CPU allowed faster compiling, easier debugging and testing for quite some time – they were the glass of ice water for the GPU developer, who was in the hell named OpenCL.

But it evolved. AMD moved from VLIW to SIMD with the GCN architecture, which allowed them to do GPGPU very well, since SIMD is better suited for data dependencies and since there are already a lot of general purpose compilers that can be used for such approach (like llvm). Intel started making bigger/more powerful GPUs, too. nVidia improved their OpenCL implementation over time (as of 2015 it still runs slower compared to CUDA, but in spite of that, they have one of the best OpenCL implementations as of today – faster than most and quite stable). There are tools both for profiling and debugging, not as mature as C/C++, but they are here. BSOD is something that almost does not happen when there is bug in the GPGPU app.
CUDA evolved with that. It added support for recursion and virtual functions, C++ and even C++11. It has the best tools for debugging and profiling these days and gives deeper control. In the same time, using CUDA is much easier, since the API is simpler and does a lot of stuff for you. Also, it runs really fast. nVidia is making its compiler and drivers for SIMD-like architecture for a much longer time compared to its competitors, so it should be not much of a surprise that they have one of the best toolchains in the industry.
So, we have OpenCL which is a standard and runs everywhere, and CUDA, which was the first GPGPU API implementation. Two is not so bad after all, right ? Wrong.
Firstly, Google declined the support for OpenCL in Android. I don’t know if that has something with the fact that OpenCL was initially pushed by Apple, but here is a statement they made.

Android has no plans to support OpenCL. There is really nothing more to discuss.

They actually had support for OpenCL in Android 4.2, but decided to permanently drop it without any further explanation. They made their own GPGPU language named RenderScript, which as OpenCL can run both on CPU and GPU and automatically decides that. So far, it always chooses to run on the CPU, so I have no idea if that RenderScript is any worthy at the moment (or will it ever be).

Microsoft extended their DirectX API with support for GPGPU as well, named DirectCompute. It runs on Windows systems only, and just like RenderScript is quite similar to OpenCL and CUDA.

After starting the initiative for OpenCL, Apple also created another (!) API with syntax similar to the OpenCL, called Metal. Metal can be used as a low-level replacement both for OpenGL and OpenCL (meaning it has two parts – graphics and compute, just like DirectX) and can run both on iOS and OS X – OpenCL can run only on OS X (no iOS support). And as we speak for OpenCL and Apple – Apple is well known for one of the worst OpenCL  implementations, ever.

It is worth mentioning that some of these APIs require the kernel to be precompiled (like CUDA before version 7) – which is called offline compiling, when others require the kernel to be shipped as source with the app and to be build on-the-fly (like OpenCL) – which is called, of course, online compiling.

So, CUDA gives you speed, OpenCL better portability, RenderScript gives you Android and Metal gives you iOS. This sucks and I thought I can make it even worse, but more on that later.

We will use OpenCL for our first in-depth look of hello-gpgpu app, since it can run on most of the systems without need of special OS or hardware.
OpenCL works as follows – from the host (read CPU) you query for platforms (read vendors, like nVidia, Intel, AMD), then you query each vendor for device (you can have multiple OpenCL implementations, let say an Intel CPU one, multiple nVidia and AMD GPUs). After that you create something like context object . Each context is bound to exactly one device. Lastly, you have queues in which you add commands to be executed on the device (this would get clearer in the code). The source for doing that is ugly and old-school, so most people prefer to hide it in prettier and friendlier wrappers.
Also, from the host, you have to alloc, copy, dealloc device memory, call device kernels and wait for their results.

Don’t get scared with that messy looking code, just go through it. You will see why later. The source is modified version of the one from here.

We have the source of the kernel in (1). In (2) we query for devices and we setup context and queue for the first device from the first platform, after that in (3) we compile the source of the program using the OpenCL API call. In (4) we allocate memory on the device and we transfer the input host array in it. In (5) we call the kernel itself and in (6)  we wait for its completion and we transfer the result from the device to the host.

Some time ago I decided to wrap all that in a pretty-looking abstract layer over that OpenCL. It became easily clear, that that layer can be over OpenCL and CUDA at the same time – I could just wrap the API calls from both of those APIs and change which one to use only with compiler define. The language differences between the OpenCL kernels and CUDA were easily hidden with some macro-magic. It is true however, that only the common part of each API could be used. But using that wrapper, I could have the portability of OpenCL with the speed of CUDA.

However, it turned out to be even better (or worse, still not sure). It seems that having such a wrapper allows you to run the kernels natively – you have just to compile them as regular C++ code and implement all the methods of the abstraction layer. All that needs to be done is to implement the GPGPU extensions to C, which are not that many.

Unpretentiously, I call that GPAPI (General Purpose API) – you can write your source once and run it in OpenCL (so you get portability), CUDA (so you get speed) and Native(C++) – so you have debugging. Here is how the example from above looks like in GPAPI.


As a bonus, the sample code with the GPAPI will run the program on all the devices, that are on the system (in my Macbook Pro, there are 3 – Intel CPU, Intel GPU and nVidia GPU). If we want to make the OpenCL code run on all those devices, it would take at least hundred more lines.

The point here is, that the APIs and languages for GPGPU are practically the same. GPAPI could be easily ported to support RenderScript, Metal or DirectCompute. But this should not be any of a surprise – after all the hardware behind that is the same – SIMD machines controlled by a host CPU. In the next parts and code samples, I will use the macro magic words from GPAPI – they make sense (KERNEL, GLOBAL, SHARED, etc), and they can easily translated to whatever API you like using that reference file.

The native version of GPAPI is designed for debug purpose (it could be used to run CUDA on the CPU too, but I doubt somebody cares about that a lot). But it had to support everything that is common between CUDA and OpenCL, or at least, everything I need so far. There is curious approach to implementing shared memory – as we mentioned in the previous part, if the max thread block size is 1, the private memory matches the shared one (because it shared along 1 thread). And that is how it is implemented – the max thread block size when using Native mode in GPAPI is 1 (which is perfectly valid according the OpenCL standard). Private thread memory is implemented using C++ automatic/stack variables, and global memory is implemented using heap memory.

There are of course stuff in CUDA that are worth using and are not in OpenCL (and vice versa). We will see such cases in the next parts and we will make each and every one of them compilable in both APIs.

Stay tuned.

Write a Comment