This is the next part in the series of GPU 101 posts I started some time ago. If you haven’t checked part 0, part 1, part 2 or part 3, please do so.
Let’s look a bit deeper into the APIs used for GPGPU programming. Basically, every GPGPU API we will discuss consists of a C extension language, in which you write the kernels that run on the GPU itself, plus a host API to call those kernels, send them arguments and receive results. Some APIs (like OpenCL) also have bindings for scripting languages, such as PyOpenCL (Python) and JOCL (Java), and even bindings for the browser.
Historically, the first popular API was CUDA, from nVidia. CUDA is proprietary and runs on nVidia hardware only. Being proprietary allowed it to evolve faster than committee-driven APIs, since decisions about changes are much easier to make when there is no committee and you have all the control.
OpenCL appeared a few years after CUDA, in an attempt to prevent the chaos that would happen if everybody (read Intel, AMD, Imagination, ARM) did their own API (which in the end is what happened) and to provide a standard that everybody (making a CPU, GPU or whatever) could implement for their hardware. nVidia were not really happy about OpenCL, since programs written in it could possibly run on Intel and AMD GPUs or CPUs, but they supported it nevertheless.
The history of OpenCL so far is dark and full of terror – back then AMD were using a VLIW architecture, which was everything but ready for GPGPU. VLIW can’t handle data dependencies well. In a game shader you may have only a few of those, but in a general purpose app their number easily goes up, since general purpose programs usually contain more complicated logic. GPUs from Intel, on the other hand, were small and not ready for that either. nVidia had the Tesla architecture, but they lacked the will to make an OpenCL implementation that runs better than CUDA. There was a lack of tools, too – both for debugging and profiling. Compilers were buggy, took a lot of time to compile and often crashed. Most of the runtime bugs in the code ended up in a BSOD. Those compilers were often done in a quick ‘n’ dirty manner by modifying some C frontend and appending a shader-like backend. As you can tell, this didn’t end well. There were OpenCL implementations for CPUs as well, both from AMD and Intel, and both of them were really good (and ran on all kinds of x86 processors, of course). These CPU implementations allowed faster compiling, easier debugging and testing for quite some time – they were the glass of ice water for the GPU developer, who was in the hell named OpenCL.
But it evolved. AMD moved from VLIW to SIMD with the GCN architecture, which allowed them to do GPGPU very well, since SIMD is better suited for data dependencies and since there are already a lot of general purpose compilers that can be used for such an approach (like LLVM). Intel started making bigger and more powerful GPUs, too. nVidia improved their OpenCL implementation over time (as of 2015 it still runs slower compared to CUDA, but in spite of that, they have one of the best OpenCL implementations today – faster than most and quite stable). There are tools both for profiling and debugging – not as mature as those for C/C++, but they are here. A BSOD is something that almost never happens anymore when there is a bug in a GPGPU app.
CUDA evolved along with that. It added support for recursion and virtual functions, C++ and even C++11. It has the best tools for debugging and profiling these days and gives deeper control. At the same time, using CUDA is much easier, since the API is simpler and does a lot of stuff for you. Also, it runs really fast. nVidia has been making compilers and drivers for SIMD-like architectures for much longer than its competitors, so it should not be much of a surprise that they have one of the best toolchains in the industry.
So, we have OpenCL, which is a standard and runs everywhere, and CUDA, which was the first GPGPU API implementation. Two is not so bad after all, right? Wrong.
Firstly, Google declined to support OpenCL in Android. I don’t know if that has something to do with the fact that OpenCL was initially pushed by Apple, but here is a statement they made.
Android has no plans to support OpenCL. There is really nothing more to discuss.
They actually had support for OpenCL in Android 4.2, but decided to permanently drop it without any further explanation. They made their own GPGPU language named RenderScript, which, like OpenCL, can run both on the CPU and the GPU and automatically decides where. So far it always chooses to run on the CPU, so I have no idea if RenderScript is worth anything at the moment (or if it ever will be).
Microsoft extended their DirectX API with support for GPGPU as well, named DirectCompute. It runs on Windows only, and, just like RenderScript, it is quite similar to OpenCL and CUDA.
After starting the initiative for OpenCL, Apple also created another (!) API with syntax similar to OpenCL, called Metal. Metal can be used as a low-level replacement for both OpenGL and OpenCL (meaning it has two parts – graphics and compute, just like DirectX) and can run both on iOS and OS X, whereas OpenCL runs only on OS X (no iOS support). And speaking of OpenCL and Apple – Apple is well known for having one of the worst OpenCL implementations ever.
It is worth mentioning that some of these APIs require the kernels to be precompiled (like CUDA before version 7) – which is called offline compiling – while others require the kernels to be shipped as source with the app and built on the fly (like OpenCL) – which is called, of course, online compiling.
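As a rough illustration of the difference, here is how the two approaches look with OpenCL’s own entry points (a minimal sketch with error handling omitted; ctx, device, kernelSource, binarySize and binaryBytes are assumed to exist already):

cl_int err;

//online compiling: the kernel ships as source text and is built at run time
cl_program fromSource = clCreateProgramWithSource(ctx, 1, &kernelSource, NULL, &err);
err = clBuildProgram(fromSource, 1, &device, NULL, NULL, NULL);

//offline compiling: the kernel ships as a binary precompiled for a known device
cl_program fromBinary = clCreateProgramWithBinary(ctx, 1, &device, &binarySize,
                                                  &binaryBytes, NULL, &err);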
So, CUDA gives you speed, OpenCL gives you portability, RenderScript gives you Android and Metal gives you iOS. This sucks, and I thought I could make it even worse, but more on that later.
We will use OpenCL for our first in-depth look at a hello-gpgpu app, since it can run on most systems without the need for a special OS or hardware.
OpenCL works as follows – from the host (read CPU) you query for platforms (read vendors, like nVidia, Intel, AMD), then you query each platform for devices (you can have multiple OpenCL implementations, let’s say an Intel CPU one plus several nVidia and AMD GPUs). After that you create a context object. Each context is bound to exactly one device. Lastly, you have queues, which you fill with commands to be executed on the device (this will get clearer in the code). The source for doing all that is ugly and old-school, so most people prefer to hide it in prettier and friendlier wrappers.
Also, from the host you have to allocate, copy and deallocate device memory, call device kernels and wait for their results.
Don’t get scared by that messy-looking code, just go through it. You will see why later. The source is a modified version of the one from here.
//example of an OpenCL kernel that accepts 2 arrays of floats (in and out)
//and computes out[i] = in[i] * in[i]
#include <stdio.h>
#include <stdlib.h>

#ifdef __APPLE__
#   include "OpenCL/cl.h"
#else
#   include "CL/cl.h"
#endif

//macro we use to check if any of the OpenCL API calls returned an error
//if so, print the file name and line number of the API call and exit the program
#define CHECK_ERROR(X) __checkError(__FILE__, __LINE__, X)

inline void __checkError(const char* file, int line, cl_int error) {
    if (error != CL_SUCCESS) {
        printf("error %i in file %s, line %i\n", error, file, line);
        exit(EXIT_FAILURE);
    }
}

//the size of the buffers that we will alloc/dealloc on the devices
const size_t DATA_SIZE = 1024;

// (1) the source of the GPU kernel itself - it does simple x*x on arrays
const char* KERNEL_SOURCE = "\n" \
"__kernel void square(                      \n" \
"    __global const float* restrict input,  \n" \
"    __global float* restrict output,       \n" \
"    const unsigned int count)              \n" \
"{                                          \n" \
"    int i = get_global_id(0);              \n" \
"    if (i < count)                         \n" \
"        output[i] = input[i] * input[i];   \n" \
"}                                          \n" \
"\n";

int main(int argc, char** argv) {
    cl_int err = CL_SUCCESS; // error code returned from api calls

    //(2) allocate memory on the host and fill it with random data
    //we will later transfer this memory to the devices and use
    //it as the "input" parameter of the "square" kernel
    float data[DATA_SIZE];
    for (int i = 0; i < DATA_SIZE; i++)
        data[i] = rand() / (float)RAND_MAX;

    // Connect to a compute device
    cl_device_id device_id; // compute device id
    cl_uint numPlatforms = 0;

    //when called with second arg NULL, clGetPlatformIDs returns in its
    //third arg the number of platforms the system has
    err = clGetPlatformIDs(0, NULL, &numPlatforms);
    CHECK_ERROR(err);
    printf("Found %i platforms\n", (int)numPlatforms);
    if (numPlatforms == 0)
        exit(EXIT_FAILURE);

    //now, alloc one unique id per platform
    cl_platform_id* platforms = new cl_platform_id[numPlatforms];
    //and get the ids of all platforms the system has
    err = clGetPlatformIDs(numPlatforms, platforms, NULL);

    //print the name of each and every platform we have found
    for (int i = 0; i < numPlatforms; ++i) {
        char chBuffer[1024];
        err = clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME, 1024, chBuffer, NULL);
        CHECK_ERROR(err);
        printf("found platform '%i' = \"%s\"\n", i, chBuffer);
    }

    //take the first device (GPU or CPU) from the first platform
    err = clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_GPU | CL_DEVICE_TYPE_CPU,
                         1, &device_id, NULL);
    CHECK_ERROR(err);

    //Create a compute context to bind with the device
    cl_context context; // compute context
    context = clCreateContext(NULL /*no properties*/, 1 /*one device*/,
                              &device_id, NULL, NULL, &err);
    CHECK_ERROR(err);

    //Create a command queue. We will fill it with commands
    //that we want the device to execute
    cl_command_queue commands; // compute command queue
    commands = clCreateCommandQueue(context, device_id, 0, &err);
    CHECK_ERROR(err);

    // (3) Compile the kernel itself from the source buffer
    cl_program program; // compute program, may have multiple kernels in it
    //create a program id
    program = clCreateProgramWithSource(context, 1, (const char**)&KERNEL_SOURCE, NULL, &err);
    CHECK_ERROR(err);
    //compile the program
    err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
    if (err != CL_SUCCESS) {
        size_t len;
        char buffer[2048];
        printf("Error: Failed to build program executable!\n");
        clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG,
                              sizeof(buffer), buffer, &len);
        printf("%s\n", buffer);
        exit(EXIT_FAILURE);
    }

    //get the kernel we want from the compiled program
    cl_kernel kernel; // compute kernel
    kernel = clCreateKernel(program, "square", &err);
    CHECK_ERROR(err);

    //Create the input and output arrays in device memory for our calculation
    cl_mem input;  // handle to device memory used for the input array
    cl_mem output; // handle to device memory used for the output array

    //(4) allocate memory on the device
    //marking buffers as read-only/write-only can increase the performance of the kernel
    input = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(float) * DATA_SIZE, NULL, &err);
    CHECK_ERROR(err);
    output = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(float) * DATA_SIZE, NULL, &err);
    CHECK_ERROR(err);

    //Copy the data to the device memory
    err = clEnqueueWriteBuffer(commands, input, CL_TRUE, 0,
                               sizeof(float) * DATA_SIZE, data, 0, NULL, NULL);
    CHECK_ERROR(err);

    //Prepare to call the kernel
    //Set the arguments to our compute kernel
    const unsigned int count = DATA_SIZE;
    err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), &input);
    err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &output);
    err |= clSetKernelArg(kernel, 2, sizeof(unsigned int), &count);
    CHECK_ERROR(err);

    size_t global; // total number of work items
    size_t local;  // work group size

    //Get the maximum work group size for executing the kernel on the device
    err = clGetKernelWorkGroupInfo(kernel, device_id, CL_KERNEL_WORK_GROUP_SIZE,
                                   sizeof(local), &local, NULL);
    CHECK_ERROR(err);

    //(5) Execute the kernel over the entire range of our 1d input data set
    //using the maximum number of work group items for this device
    global = DATA_SIZE;
    err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
    CHECK_ERROR(err);

    //Wait for the enqueued commands to get serviced before reading back results
    clFinish(commands);

    float results[DATA_SIZE]; // results returned from the device
    //(6) Read back the results from the device to verify the output
    err = clEnqueueReadBuffer(commands, output, CL_TRUE, 0,
                              sizeof(float) * DATA_SIZE, results, 0, NULL, NULL);
    CHECK_ERROR(err);

    unsigned int correct = 0; // number of correct results returned
    //Validate the results
    for (int i = 0; i < DATA_SIZE; i++) {
        if (results[i] == data[i] * data[i])
            correct++;
    }

    //Print a brief summary detailing the results
    printf("Computed '%d/%d' correct values!\n", correct, (int)DATA_SIZE);

    //Shutdown and cleanup
    clReleaseMemObject(input);
    clReleaseMemObject(output);
    clReleaseProgram(program);
    clReleaseKernel(kernel);
    clReleaseCommandQueue(commands);
    clReleaseContext(context);
    delete[] platforms;

    return 0;
}
We have the source of the kernel in (1). In (2) we query for devices and set up a context and a queue for the first device from the first platform. After that, in (3), we compile the source of the program using the OpenCL API. In (4) we allocate memory on the device and transfer the input host array into it. In (5) we call the kernel itself, and in (6) we wait for its completion and transfer the result from the device back to the host.
Some time ago I decided to wrap all that in a pretty-looking abstraction layer over OpenCL. It quickly became clear that the same layer could sit on top of both OpenCL and CUDA at the same time – I could just wrap the API calls of both and switch between them with a compile-time define. The language differences between OpenCL and CUDA kernels were easily hidden with some macro magic. It is true, however, that only the common part of each API could be used. But using that wrapper, I could have the portability of OpenCL with the speed of CUDA.
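Just to give an idea of what that macro magic could look like, here is a minimal sketch of such a translation header. The macro names follow the kernel sample below (KERNEL, GLOBAL, SHARED, RESTRICT, globalID), but the build defines (GPAPI_OPENCL, GPAPI_CUDA) and the native helper are my assumptions, not necessarily what GPAPI really does.

//gpapi_lang.h - hypothetical sketch, not the actual GPAPI header
#if defined(GPAPI_OPENCL)
    //OpenCL C spellings
    #define KERNEL      __kernel
    #define GLOBAL      __global
    #define SHARED      __local
    #define RESTRICT    restrict
    #define globalID()  get_global_id(0)
#elif defined(GPAPI_CUDA)
    //CUDA C++ spellings
    #define KERNEL      extern "C" __global__
    #define GLOBAL      /* every pointer is global by default in CUDA */
    #define SHARED      __shared__
    #define RESTRICT    __restrict__
    #define globalID()  (blockIdx.x * blockDim.x + threadIdx.x)
#else
    //native C++ build - the "global id" is fed by the host loop that calls the kernel
    #define KERNEL
    #define GLOBAL
    #define SHARED
    #define RESTRICT
    extern int g_currentWorkItem;   //hypothetical helper set by the native launcher
    #define globalID()  g_currentWorkItem
#endif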
However, it turned out to be even better (or worse, I’m still not sure). It seems that having such a wrapper allows you to run the kernels natively, too – you just have to compile them as regular C++ code and implement all the methods of the abstraction layer. All that needs to be done is to implement the GPGPU extensions to C, which are not that many.
Unpretentiously, I call that GPAPI (General Purpose API) – you write your source once and run it with OpenCL (so you get portability), CUDA (so you get speed) or natively as C++ (so you get debugging). Here is how the example from above looks in GPAPI.
kernel.cl
KERNEL void square(GLOBAL int* RESTRICT a,
                   GLOBAL int* RESTRICT b,
                   unsigned int n) {
    //Get the global thread ID
    int id = globalID();
    if (id < n) {
        b[id] = a[id] * a[id];
    }
}
main.cpp
#include "gpapi.h" int main(int argc, const char *argv[]) { using namespace GPAPI; //load kernel source from a file std::string source = getProgramSource("/Developer/git/opencl/opencl/kernel.cl"); //devices will hold handles to all available GPAPI devices std::vector<Device*> devices; //call initGPAPI to init the devices initGPAPI(devices, source); //now prepare some host buffers that will be transfering to the devices int NUM_ELEMENTS = 1024; int *h_a = new int[NUM_ELEMENTS]; int *h_b = new int[NUM_ELEMENTS]; for (int i = 0; i < NUM_ELEMENTS; ++i) { h_a[i] = i; h_b[i] = i * 2; } const size_t bytes = NUM_ELEMENTS * sizeof(int); //calculate global and local size size_t globalSize, localSize; // Number of work items in each local work group localSize = 64; // Number of total work items - localSize must be divisor globalSize = ceil(NUM_ELEMENTS / (float)localSize) * localSize; for (int i = 0; i < devices.size(); ++i) { Device& device = *(devices[i]); //set the kernel name we want to call device.setKernel("square"); //our kernel has 4 args - 2 input buffers, 1 output buffer and a size //set those args device.addParam(h_a, bytes); device.addParam(h_b, bytes); Buffer *result = device.addParam(NULL, bytes); device.addParam(NUM_ELEMENTS); //launch the kernel device.launchKernel(globalSize, localSize); //wait for the result device.wait(); //copy back the data of the result from the device to the host result->download(device.getQueue(), device.getContext(), h_c, bytes); for (int i = 0; i < NUM_ELEMENTS; ++i) { printf("%i ", h_c[i]); } //clean up device.freeMem(); } freeGPAPI(devices); return 0; } |
As a bonus, the GPAPI sample code will run the program on all the devices on the system (on my MacBook Pro there are 3 – an Intel CPU, an Intel GPU and an nVidia GPU). If we wanted to make the plain OpenCL code run on all those devices, it would take at least a hundred more lines.
The point here is that the APIs and languages for GPGPU are practically the same. GPAPI could easily be ported to support RenderScript, Metal or DirectCompute. But this should not be much of a surprise – after all, the hardware behind them is the same: SIMD machines controlled by a host CPU. In the next parts and code samples I will use the macro magic words from GPAPI – they make sense (KERNEL, GLOBAL, SHARED, etc.) and they can easily be translated to whatever API you like using that reference file.
The native version of GPAPI is designed for debugging purposes (it could be used to run CUDA code on the CPU too, but I doubt anybody cares much about that). It has to support everything that is common between CUDA and OpenCL, or at least everything I have needed so far. There is a curious approach to implementing shared memory – as we mentioned in the previous part, if the max thread block size is 1, the private memory matches the shared memory (because it is shared among just one thread). And that is how it is implemented – the max thread block size when using Native mode in GPAPI is 1 (which is perfectly valid according to the OpenCL standard). Private thread memory is implemented using C++ automatic/stack variables, and global memory is implemented using heap memory.
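To make that concrete, here is a rough sketch of how a native launch with a thread block size of 1 could work – just a host loop that runs the kernel once per work item. The names (g_currentWorkItem, launchNative) are illustrative assumptions, not the actual GPAPI internals.

//hypothetical native "launch": one work item per loop iteration
int g_currentWorkItem = 0;          //what globalID() expands to in the native build

template <typename KernelFunc, typename... Args>
void launchNative(size_t globalSize, KernelFunc kernelFunc, Args... args) {
    for (size_t i = 0; i < globalSize; ++i) {
        g_currentWorkItem = (int)i; //set the "global thread id" for this work item
        kernelFunc(args...);        //private and "shared" memory are just the
                                    //kernel's own stack variables here
    }
}

//usage with the kernel above, compiled as plain C++:
//launchNative(NUM_ELEMENTS, square, d_a, d_b, (unsigned int)NUM_ELEMENTS);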
There are, of course, things in CUDA that are worth using and are not in OpenCL (and vice versa). We will see such cases in the next parts, and we will make each and every one of them compilable with both APIs.
Stay tuned.