2. GPGPU 101 – GPU thread model

This is the next part in the series of GPGPU 101 posts I started some time ago. If
you haven’t checked part 0 or part1, please do so. In part 1 we’ve talked how the x86 processor approaches the problems of efficiency (locality) and performance (parallelism). Now, we will discuss how the modern GPUs are approaching those. In this part we will talk more about parallelism and in the next we will focus on locality.

As the name of the GPU (Graphics Process Units) suggests it was designed for processing graphics. And graphics tends to be an embarrassingly parallel problem – you have to calculate the color for tens of millions of pixels, and (most of the time) the color of every pixel is independent from the other pixels. But graphics are not the sole embarrassingly parallel problem and at one point people started using GPUs for stuff different from graphics. First it was used in a hacky fashion, since there was no API for general purpose programming and after that CUDA and OpenCL were born. As we mentioned, there were multiple GPUs designs and approaches, but today more or less all desktop GPU vendors (AMD, Intel and nVidia) have converged to similar architectures. They do have differences which should be taken into consideration when optimizing a specific program, but the general approach to writing programs for those GPUs is the same. Here are the overviews of the GPU approaches from Intel, nVidia and AMD.

CPU architecture – optimized for single thread performance

The CPU is designed to perform a single task as fast as possible (or “optimized for low latency“). In order to do so, it uses complex control logic (superscalar pipeline, branch predictor, register renaming) and complex (and big) data and instruction caches. On the other hand the GPUs have many simple cores, that are lacking all of the complex stuff in the CPUs (or “optimized for high throughput“). So, what does this means for us? …

In a nutshell, the GPU is a huge SIMD machine, so they are well suited for data parallelism tasks. The width of the SIMD unit depends on the vendor (4 in Intel GPUs, 32 in nVidia GPUs, 64 in AMD ones – wider lanes do not mean better performance of course, so keep that in mind). Some of the vendors are calling each lane of the SIMD unit a ‘core’. Usually, the vendors are combining some SIMD units into groups (4 SIMD units in nVidia are forming a “Streaming Multiprocessor”, 2 SIMD units in Intel are forming are “Execution Unit”, and in AMD 4 SIMD units are forming a “GCN Compute Unite (CU)”). We will call that “SIMD Group“. The idea of those groups is that the SIMD units inside them are sharing some kind of hardware (most common a cache unit, but also instruction fetch/decode system and others). This also provides some locality, which is good.

GPU architecture – optimized for throughput

In the basic diagram above, every little green square is a lane from a SIMD unit, which can perform a floating point (or other) operation over some data. All green boxes (or SIMD lanes, or “cores”) in one SIMD unit (surrounded by the whitest box) are executing the same instruction. We’ve grouped 4 SIMD units in a SIMD group and we have 2 such groups here.  So instead of multiple complex hardware, we have many simple cores. By the way, if you ever wondered how the vendors are calculating the FLOP/s on their processors – take the number of the SIMD lanes (cores), multiply that by 2 (since every core can executed up to 2 floating point operations per cycle with something called multiply-add or ‘mad‘) and multiply that by the frequency of the core. It is not hard to see that this theoretical FLOP/s peak is really, really very theoretical – there is no way to keep all units busy all the time and make them do mads all the time in any meaningful app.

But, as we discussed in the previous chapter, this complex hardware (branch predictor, register renaming etc.) is very important for the performance, since it keeps the pipelines busy. So if we are lacking that in the GPUs, isn’t the performance going to suffer a lot?

Well, obviously not. The trick that the GPU does is, that instead of trying to minimize the latency, it is trying to hide it with parallelism. It works as follows – every SIMD unit has a number of registers that it can use. The tasks we give to the GPU are in amount of hundreds of thousands. The GPU prepares some of those tasks in the hardware, so at any given moment there are multiple tasks ready to be executed (they share those registers). So, let say that we execute an instruction on a SIMD unit, and this instruction needs some memory from the DRAM also called global memory. In fact it is a bit different from the CPU DRAM and is named GDDRAM, but for our needs it is DRAM enough. On the CPU, if this instruction is not in the cache, we will have to wait and CPU will stall. If you have other thread ready to run, possibly the OS will switch to it (but this switch is expensive, since all the resources that were used by the stalled thread have to be swapped with the new one). The GPU on the other hand can simply choose another task to execute, since the data needed for it is already in the registers and waiting. So in some sense, we can say that the thread switch on the GPU is hardware implemented and basically free. The key here is that you have to give the GPU enough tasks in order to achieve high level of parallelism, that will be used to hide the latency.

But everything comes at a price, right. The price here are the registers. If your threads need a lot of data, at one point the registers would not be able to fit it all. At this point we have a so-called “register spill” to the global memory. Compared to the register, reads and writes to the global memory are ridiculously slow, so it is best to avoid that situation and keep the memory that every thread needs low.

Now, what about synchronization. If you have hundreds of thousands of threads, probably synchronizing them will be hell (after all synchronizing even 2 threads on the CPU is pretty hard). The answer is that the GPU is suited for tasks that don’t need synchronization at all – like graphics (real time or photorealistic), deep learning, simulations, etc. If you need synchronization on a global level – it is impossible. The only possible synchronization can be done is on a SIMD group level. But there are atomics on global level available. Tasks that require global synchronization usually are split into multiple GPU programs, thus offloading the synchronization job to the CPU.

And lastly, some programming. The thread model of the general purpose programming languages for the GPUs follows more or less the thread model of the hardware. For example in CUDA, threads are grouped in “thread blocks”. Threads in a block can talk to each other, even more they run on the same SIMD unit. How many threads are in a block is programmable, so if you have more than 32 (the size of the SIMD unit in nVidia), the thread block will be executed on multiple runs. And you can run multiple of those thread blocks (grouped in so-called “grid”). Both CUDA and OpenCL are offering multi-dimensional entities like groups of thread blocks and grids for convenience, but I’ve always preferred to use 1D spaces. Enough talking, lets do some coding, since it will make it much easier to understand.

Here is a program that sums up two arrays in C++.

And here it is in CUDA. Instead of having a for loop, every thread (or core) will sum one of the array fields.

Finally, some common keywords with the GPGPU apps. The CPU is usually called the “host“. The GPU is called the “device“. The host and the device are having physically different memories, and we have to do explicit copies from one to the other if we want to move stuff (more on that in the next part). The functions written for the GPU, that can be called from the host are called “kernels” (I’ve no idea why). Kernels can be only void functions (this makes sense, since they are executed from many threads, and if every one of them returns a result it could get messy). Functions, that can only be called from the GPU are called “device functions“.

Both OpenCL and CUDA are C based languages, and both of them offer extensions to get a unique per-thread id. CUDA has a C++ support too (though you don’t want to use virtual functions and other advanced stuff, but more on that in the next parts). Malloc and new can’t be used on the device. You have to allocate memory on the device before you start the kernel. This happens with a API call from the host (in fact, there is a way to use them with CUDA, but it is something that we will avoid and will talk about in the next parts).

How to synchronize the work between the host and the device ? What about having multiple devices ? How does a SIMD unit works, when there is a branch in the source and some threads go into the ‘true’ path, while others into the ‘false’ path? We look at all that (and more), again, in the next parts.

CPU thermal profile during runtime

P.S. As you may have noticed the CPU has these very big cores located on one part of the chip, caches and memory on the other. In the GPUs on the other hand, you have kind of a better mixture of cores and memory (and the core are using much less power, since they don’t have any advanced features and a running at around 1GHZ). And we’ve talked how cooling/power is a major problem with current (and next) processors. But it is not only ‘heat’, it is heat per square mm. So in the CPUs, you have parts of the chip that are getting way hotter than the CPU vendors would like (caches require less power compared to ALU units). This makes the life (and the future) for the GPUs even brighter.

1. GPGPU 101 – x86

This is the next part of the GPGPU 101 posts I started some time ago. If you haven’t checked part 0, please do so. In this part we are going to make a brief overview of x86 (mainly based on this article).

So, in part 0 we arrived at the idea that in order to have performance and efficiency from now on, we need to have parallelism and locality. Now we need to introduce some context for our notions of parallelism and locality, so we can quantify those in any meaningful manner. First and foremost, we need basic familiarity with the environment. Which is often underestimated – the environment from the point-of-view a developer is immensely sophisticated. For a C++ developer for example, it starts with the compiler, goes all the way to the operating system and finishes in the hardware. So you have to know how the compiler works (and why it does what it does), what the operating system does (and why we need that) and have at least a basic understanding of the architecture for which you are writing (usually x86-64). If you happen to do scripting, instead of the compiler you have to know the interpreter (normally a Virtual Machine) so you basically have the same issues (along with probably some extra issues).

Software does not run in a magic fairy ether powered by the fevered dreams of CS PhDs” Branimir Karadzic

We will start with the processors. As I already mentioned this is the best article I’ve ever read about how a processor works – really short and hits most of the important stuff. I would say that it is critical for most of the programming needs a developer might have.

In a summary, there are a few ways a processor can be made. What the processor has to do however is fetch an instruction, decode it to understand what exactly has to be done, execute it and store the result somewhere. This process is organized in a pipeline manner, and so it is called pipelining. You can think of the pipeline described above as a 4-step serial operation. Often in real life, processor pipelines are much more complicated, but this explanation is good enough for our needs.

Processor Pipeline (image courtesy of lighterra)

If the stages in the pipeline are independent of each other, you can possibly execute them in parallel relatively easy (you can fetch one instruction, while decoding another, while executing a third one). Also you can possibly multiply that pipeline (create more parallel pipelines to the one you have), so you could process multiple instructions at once. This is called super-scalarity and it is implemented in every modern x86 processor. So here is a buzz-word: this super-scalarity is a way to do Instruction Level Parallelism (ILP) – meaning, you can execute multiple instructions at the same time with the mechanism, we just described. As we talked in the previous chapter, parallelism equals performance, so this is good – ILP gives us some performance here.

ILP works great if there are no data dependencies between the instructions. But what if there are? For example, if in the first pipeline we have an instruction to compute r=(A+B), but “A” is somewhere away so we have to wait for it, whereas in the second pipeline we have an instruction that needs that result, for example, to compute r/2 (the result of (A+B) divided by two). They will both have to wait. The same can happen if both pipelines have to execute an instruction, for which there is a specially dedicated hardware (for example, if you have some circuit that does hardware-implemented exp and both of the pipelines want to do exp(something)). Modern processors usually have a number of such calculation units because of that (so there are multiple floating-point-arithmetic computation units in your processor). We are going to talk more about those units later.

Basically, the text above gives a bird’s eye view of a x86 processor. We can think of the other stuff from the architecture as tools and mechanisms to prevent the processor from stalling and make sure it is as utilized as much as possible.

First, since the processor knows in advance what instructions it will have to execute, it can ask for the data they will need before they are actually in the execution part of the pipeline (and this is really clever). But what if there is a branch in the code and there is no way to know which path the code will take. There is a solution that addresses this issue – it is called speculative execution, and it employs branch predictors – every modern x86 has (at least) one. The branch predictor monitors which branches your control path comes across, and how those branches behave (statistically), which allows the predictor to speculate which direction a pending branch will take, so other units can prefetch the data this path needs in advance (preparing the data from both paths is not an option, since it will flood the data channels). So, basically if you have if-then-else in your code that is being executed a lot, and you are always going into the “if-true” clause (for example), this is practically free. But if your branch path is not that easily predictable, i.e. if it exhibits high entropy, the speculation may go wrong and this is going to cost you. Note that the bulk of the cost may not necessarily come from the instruction decoding and execution, but from the waiting for the operands data (we talked about data moving in part 0).

Cache hierarchy

There is a unit to take care of data moving being too slow. x86-64 processors have multi-level cache hierarchies (L1, L2 and L3, check image below). They are basically super-fast (and super-expensive) memory pools, that are close to the processor (actually, on the chip itself) and they store the data that you need the most (based on the frequency of usage, adjacency, etc). L1 is smallest, but fastest – L3 is largest but slowest. Those caches normally work on cache-line granularity – 64B in modern 64-bit CPUs. What that means is that when you read 1B from memory, the cache will load some adjacent 63B as well, starting from the greatest 64-byte aligned address, lesser-or-equal to the specified address. So, if you read up to 64 characters that are stored in memory arbitrarily apart from each other, you will likely cause multiple memory-to-cache transactions, compared to just one or two, if those characters were adjacent (more on the memory in the computer here). Apart from data, decoded instructions are also cached (on the figure below, L1d is a data L1 cache, L1i is instructions L1 cache). So, again, moving stuff around is expensive, so we need good data locality.

Modern x86 has a lot more inside, but there is at least one more thing that is worth mentioning, and this is so-called “out of order” (OoO) execution. Basically if there is an instruction that has to be executed from a pipeline, but it has to wait for data, this instruction can be put aside for a moment and another one can be executed. So the order of execution of the instruction in a pipeline is not guaranteed. But as you know when you write your programs, the order of execution matters (a lot!), so if you scramble some of the calculations you do, you will most certainly not have the result you are looking for. So the processor is not randomly doing the next instructions in the pipeline – it carefully examines what can be done and what not. Further more, it “renames” the data fields, which helps a lot to resolve the data-dependencies in the code. This introduces another buzz word, called “register renaming“.

Okay, so x86 has ILP, caches, branch predictors, OoO, Register Renaming. Putting all these efforts into keeping the pipelines busy. Actually, there is one more thing that is worth mentioning. It is the so-called (buzz word!) SIMD – Single Instruction Multiple Data. The idea is simple – if you are going to do all that instruction fetching, decoding and so on (and they can be really expensive because of x86’s archaic ISA origins), wouldn’t it be great if you can execute them on multiple data at the same time? For example, if you want to calculate the sum of two vectors (each having float x, y, z; as data members), if you can execute the same instruction (sum) on all of the elements those vector have (at once), that’d be great! And actually every modern processor has special hardware that can do exactly that. Modern compilers are trying to identify parts of your code that are suitable for such SIMD instructions (this is called “auto-vectorization“). Doing SIMD code is important – for example, libraries like Embree are heavily based on it. Pretty much every performance hungry application today has to use them. If you have ever wondered how V-Ray 3.1 got so much faster than 2.4, this is one of the reasons (of course, there are many more). SIMD has multiple versions and it can be basically found in every x86-64 system. So, SIMD is just like ILP, but instead on doing multiple instructions at a time, it does the same instruction on multiple data fields. As with the ILP, it is a way to do parallelism, and since parallelism equals performance, it is a way to boost the performance of your x86-64 code.

SIMD flow

If you want to get the most of SIMD, there is a way to write them yourself. The same goes for the memory access patterns – you can make sure you read/write as linearly as possible, have less data dependencies and so on. The same goes for branch organization and so on and so forth. And, as we talked in part 0, this is going to get more and more important (not less). Fortunately, you can do this fairly easily – at least in C/C++. I am not so sure about the modern languages like Java, C#, Javascript, Haskell and the others …

Remember that there are multiple computation units in the processor, because of the instruction level parallelism? What if there is not enough ILP and not all of them have something to do in the current super-pipelined task? This is where hyper-threading (known outside of x86 as Simultaneous Multi-threading, or SMT) comes. The idea is that you split every physical core into two (or more) logical cores. So you run multiple threads, say, two on the same core. And if one of them can’t occupy all of the hardware you have, the other one can use (some of) it. And no, hyper-threading does not make the processor linearly faster with the number of logical cores, it just helps better utilize its resources (and in some cases, if they are utilized enough or because of other reasons, it gives no benefits at all). Of course, modern x86-64 processors have multiple hardware cores on the same processor too – basically just multiplying the hardware of one core. Hyper-threading, multi-core and multi-processor systems are giving the last type of parallelism, which is on thread level. So we have ILP, SIMD and threads.

There we have it – the complicated x86 architecture at a glance, with its methods for parallelism (ILP, hyper-threading, SIMD, multi-cores) and locality (branch-predictor, caches).

However, there are two more things worth mentioning. First of all, since the compiler can see your code, it can identify the data dependencies during the compilation and spread the instructions for those further apart, thus generating longer sequences of data-independent instructions (and thus help the processor with the OoO task). If the compiler does that really well, the processor could do ILP without the hassle of register renaming, dependency monitoring and so on. There is actually a whole computer architecture approach based on this and it is called VLIW (Very Long Instruction Word). It is called like that because the data-dependency-free instructions that the compiler picks are packed in a single, very long ‘macro’ instruction – sometimes more than a hundred bytes (compared to up to 1-16 in not VLIW architecture, like x86). There were few attempts for such processors – Itanium from Intel, old AMD GPUs and nVidia Denver (which gets ARM instructions runtime (not compile time!), reorders them into VLIW ones and executes them in such a manner. Why nVidia decided to do that – I have no clue). The VLIW sounds great, since when we offload the processor of those tasks, it could do something else (at least, we are sparing transistors on the chip and this is important too). In real life many VLIW attempts have failed – compilers are getting way too complicated to produce, they start wasting way too much time (and memory) to compile and they often fail to generate good code (and during the compile time you know less about the program than at runtime) – meaning that often at the end they generate instructions that are executed in a serial manner. Still, VLIW has its applications and compilers in the wild. For example, everybody can take the LLVM compiler now and use it to generate code for AMD R600 – a VLIW architecture, relatively easy. More on all that in the next part of the GPGPU series.

And of course, there is an option not do all of that OoO with its complicated apparatus, and just use the transistors that would be needed for those to put more units for actual calculations. Those types of processors are called speed-demons (in contrast, complicated ones like x86 are called brainiacs). Speed demons have the potential to be more efficient (but only if there are good programmers and compilers, since they require more careful programming) – those smart features of the brainiacs come at a cost – some of them are power hungry or/and can’t be turned off when they are not needed. And these days a lot people had started talking not about performance, but performance per watt, so the power consumption is getting more and more important.

In the next part we will see how the GPUs are handling the instruction, data and thread level parallelism.

0. GPGPU 101 – Intro

So, if you want to get familiar with why we have to use CUDA and OpenCL and stuff like that, in the text below there are a few links to articles, talks and papers that explain that in length – you can read those in their order of appearance. I strongly advise everybody to at least check those. However, if you want to read just a summary of all that in one place, here you have it …

We have to start with a bit of history, since this is how every good story begins …

At one point early on during the computer era, people decided that they could show the result of the computer calculations on screen in other-than-alpha-numeric form, so their job could be more interactive (and less boring, I guess). Special hardware was needed for that – something that could convert large amounts (for that time) of data from the computer to something appropriate that the screen could understand, and thus the framebuffer was born. In some sense, probably this is the birth of the GPU and it happened long, long time ago (1970-something).

The GPUs have had a really crazy history since that time – till the Voodoo showed up in mid 90s (and OpenGL just before that), when everybody wanted (and did make) GPUs. There were all kinds of designs, approaches and implementations and it was time of really dynamic changes. Many of the companies did not manage to keep up with that, so many of them quit the business. These days we basically have nVidia, AMD and Intel, and in the mobile sector there are ARM (Mali), Imagination Technologies (PowerVR), Qualcomm (Adreno) and a few even smaller IP vendors. Here is pretty much the full story of it.

But the history that brings us to today is actually pretty cool. As it seems from the beginning of the computing era, a lot of chips designs were targeted at the CPU, but at some point everybody just started using x86. Yes, we had MIPS, Power/PowerPC, Sparc, etc, but they were nothing as x86 in terms of market share. More or less, we have been stuck with x86 for a long time. We have ARM now, which is quite different, though in some way it is just a cleaned-up, RISC variant of the same.

In comparison, GPUs have always had the freedom to completely change the architecture or perpetually try new stuff since they don’t have to support tons of legacy software. So there one can find all kind of crazy ideas, some working great, some not.

I would like to split the GPU history into three parts – before Voodoo, after Voodoo and after the GPGPU advent.

The Voodoo GPUs were so great back then (circa 1996), that they marked the whole period. They took the GPU ‘to the masses’ and everybody knew about Voodoo; I was 14-year-old and living in post-socialist country, but even I knew of that Voodoo stuff.

They were good, but not good enough to save their maker, 3dfx, from failing – in the dynamic situation back then it did not take many wrong steps to fail. 3dfx were bought by nVidia (circa 2000), which at that time had some great GPUs too. After a lot of market acquisitions, some time in the beginning of the 21st century there were only two major discrete desktop GPU makers left – nVidia and AMD, and two dominant APIs – DirectX and OpenGL. And actually I find that not so bad – the times when everybody had different approaches on the GPUs were the times when you have to write your code to fit tons of hardware (and this is not fun). This third period continues up to today, but something happened around ~2005, which marked the start of a new era – the GPGPU one, which is what we will focus on for the rest of this text.

Herb Sutter has great article about why that third period is so exciting and important. In my eyes, this article may have turned out to be as foreseeing and important for the computer industry as Dijkstra’s “Go To Statement Considered Harmful”.

Now, lets get away from the history for a moment and look closer at why this GPGPU stuff matters (a lot) …

In a nutshell, because of the way processors are made, we cannot make them a lot faster anymore (as we could every year, till 2005). So our programs can’t get faster and faster as the time goes by (or at least, they can’t get faster without some developer’s efforts).

There is Moore’s Law that everybody talks about – basically it is a rule-of-thumb that says that every 18-24 months or so, the number of transistors in a processor roughly doubles. The way this happens usually is that we start making half as big transistors, so we can put twice as many on the die. And on top of that those transistors can run at two times the frequency they had and because they are smaller, we can feed them considerably less voltage, and this resulted in the roughly same power consumption (but running faster). However, because of a physics phenomenon called “leakage” (if you want to know the details and much more on the topic, this talk is great) we can’t keep lowering the voltage and if we don’t lower the voltage we get four times the power consumption we had before. If you don’t believe that, ask Intel – they predicted 10ghz (and people were thinking a lot of crazy stuff, too) processors for 2011. By the way, because of the limitation of the speed of light, if we want to have a 100GHz single-clock-domain circuit, say, some CPU, the furthest distance an electron can travel during a clock is 2mm. Good luck cooling a chip that is 2mm big!

So, the processors today are power limited – we can’t cool them infinitely, we can’t make them use less power for higher frequencies. So people started saying that Moore’s Law was dead. Here is an image, that represents the number of transistor on Y axis, and the year on X.

As we can see, Moore’s Law was not dead by 2010, but at some point it will be – nothing exponential in nature lasts forever. But this day is not today – and perhaps won’t be in the next 10 years. We can still put more and more transistors, but we can’t make them run at higher frequencies. So the processors these days are getting more cores. Today’s expectations are that single core performance will improve somewhere between 2% and 15% per year (because of better architectures, optimizations in the process of their production and so on).

This makes x30 CPU performance for the next 50 years – for a comparison, for the last 30 years this number is x2500. At the same time, we expect the multi-core speed-up to be around 75% per year. And this happens today. The image below shows the theoretical GFLOPS of modern CPU (which has few very smart cores) vs GPU (which has thousands not-so-smart ones).

I think that most developers underestimate the above, and really bad so.

Let me give you an example – I worked in a company that made MMORPGs. Our code was single-threaded and slow, the game was slow, but we knew that at the moment we start selling that, computers would be fast enough and the game would run fine – MMORPGs usually take 3 to 5 years to make if you do your own engine (the game failed in the end, but that’s another story), so there was plenty of time. These days there are millions of people out there that still write code like that (embarrassingly single-threaded code, using languages and tools that are designed to be such), who still believe the same – that computers are fast enough or that they will get faster (Swift, Javascript – I am looking at you).

Well, they will not, and that’s why from now on the efficiency and performance of the programs will get more important than ever before (this is quite the opposite of what the majority of developers think). In the past, you could have a slow but sophisticated single-core and feature-complete program that runs sluggish on most machines. But you knew that after some time your program would run fine. This is the past – now everybody can design sophisticated programs (since we all have zillions of tools), but not everybody knows how to make them run faster, not to mention that the need for more and more computations has been outgrowing the speed at which processors are getting faster since the beginning of the computing.

“Concurrency is the next revolution in how we write software. The vast majority of programmers today dont grok concurrency, just as the vast majority of programmers 15 years ago didnt yet grok objects” Herb Sutter.*

So, lets get to the power (see the image below). As we noted above, processors are power limited. If we take one 20 mm^2 processor chip and fill it with ALUs only (for instance, FMAD – fused multipy-add, basically simple units that can do some floating point operations), this would give us 12 TFLOPS (using 28nm production process) and would result in 300 Watts of power (since every double FMAD uses 50pj). For a comparison today’s modern GPUs (which are considered really power efficient) are using around 200Watts for ~5 TFLOPS. That is because these units can’t do anything without data – they have to get input data, over which to do their computations, and store the output results somewhere. And moving data comes at its own costs – moving 64 bits at 1mm distance costs 25pj. If you have to move 2×64 bits, you are already wasting more (or the same) power for the data transfer (2×25), than for the calculations (50). These costs are growing linearly – if you have to move the data along 10mm, this would cost 250pj, and so on and so forth. For some cases (as in the DRAM of the GPU) it goes up to 10000pj. These joules can be converted to latency too, since when the data has to travel longer distance, it arrives later. It is clear that you want to have your data as close to the computation units as possible.

And when we look at the “modern” computer languages, we can see that they clearly are “above” all that – C++, Java, Ruby, Haskell, Go – all of them present the world as “flat” (read: fully-coherent) – from their perspective all memory is the same and accessing it costs the same, too. And I think they seem to be proud of that, in some manner. The same goes for x86 architecture itself, by the way – it has some instructions for pre-fetch and cache management, but they are just little patches over a large problem. It is good that we have all those caches in the x86, trying so badly to hide the latencies, but it is just that the developers sometimes (by nature) know more than any heuristics does. There are a lot (really, a lot) problems of such nature. So if there is a way to manually “manage the cache”, a lot of developers/programs could benefit.

“Performance = Parallelism”, “Efficiency = Locality”William Dally

Who targets what and how – in the next part.

* Parallelism is to run stuff in parallel – like having multiple cars on the highway, all going parallel on the road.
Concurrency is to be able to advance multiple tasks in a system. For example, if you have two cars that have to be moved form A to B, and one driver to do the job. The driver might drive the first car for some walkable distance, than go to the second car and drive it to go in front of the first one, than go back to the first car, etc. After a lot of time, both of the cars will be moved from A to B – this is concurrency. If you have two drivers for the job, this is parallelism.
The situation, at which some resource might be needed from different subjects at the same time – for example if the cars have to cross ajunction, they would have to wait for each other before they can cross it (or otherwise a crash will occur) is concurrency problem.
If you have parallelism, you have a concurrency too. Having heavy parallelism presupposes having to deal with concurrency problems.

I admit I was wrong

Okay, I hate that I have to do this, but it seems that if I made a blog about C++, OpenCL and CUDA in a language in which I personally know all the native speakers that do care about those technologies (GPGPU ones the most), I could not have really big audience (they are literally 5 people and Boris. Hey Boris!). So I will start blogging in english. I have bad english, but I will try to do my best.

Before a year or so I decided to write down a paper in which I wished to tell people more about how do I have spend a couple of months in Chaos. Basically it was a story of how we solved a problem, that we had (it is not a general solution that works everywhere and for everybody, but it did worked for us back then). And I learned a lesson – writing papers is boring and frustrating. You have to write a certain number of pages, you have to have references, abstract and images all around the place. Than you have to convince somebody that what you have written is worth it (“it made a popular raytracer faster is not enough of a proof, apparently). And dealing with the academic community was not hours of fun.

So, I am starting to blog about CUDA (and some C++ and OpenCL ofc, but we will start with more CUDA, I think). In english.
I will start with the boring stuff (what is CUDA and why I don’t write more about OpenCL) and I will move to the more interesting one (fast). I promise that even in the boring ones, I will do my best to tell spicy important details that I had hard time finding.

Stay tuned.

no except

Ето една интересна лекция на Meyers, в която той обяснява много добре какво е std::move, std::forward и noexcept :)

Към средата има една част, в която е показано колко по-бърз може да е кода, когато е с noexcept там, където е възможно. Освен това имаше показани примери, как чрез проста смяна на компилатора със C++11/14 enabled такъв, кода става по-бърз. Във връзка с noexcept се появи в един момент и тази графика.

Колко хубаво, помислих си .. Ще направим сега едно NOEXCEPT макро (което ще е #define NOEXCEPT noexcept когато компилираме със С++11/14 компилатор), ще прекомпилираме всичко което имаме (и вкъщи и в офиса) с него и ще получим ей така едни проценти забързване от нищото ..

Та изрових една моя стара (С++98) микро библиотека за математика (вектор, точка, матрица, quaternion, в общи линии това) и реших да завъртя един цикъл с тях (тя всъщност е толкова проста, че не очаквах каквито и да е разлики в резултатите).
Ползвам clang-600.0.51(llvm 3.5svn), компилирам с -Os (което е по дефаулт) (с -O3 резултатите са почти същите) и C++14 enabled.

Кода от теста се намира тук.

Интересно, че противно на приказките на Meyers, Chandler (from LLVM) и не знам си кой друг, на моята машина (I7-3615QM) кода върви по-бавно с между 2 и 5 процента, когато функциите имат noexcept.
Или аз го ползвам много не както трябва, или някой някъде малко послъгва …

No country for old men

Още един “benchmark” между ARM и Intel CPUs (идеята за него дойде след като разни JS benchmark-ове на iPad Mini Retina минаваха само 2-3-4 пъти по-бавно от колкото на i7 2600 … Логичният въпрос, който някой (Тео) зададе беше дали наистина е толкова по-бавен, на приложение като raytracer).

Тестваме smallpt (не че това е представителен raytracer, но е raytracer), naive quick sort, random mem access & floating point operations. Кода на тестовете е тук.

Тестваме I7-3720QM (2.6GHZ) (Macbook Pro Retina 2012) vs 64-bit A7 (1.3GHZ) (iPad Mini Retina).

Тестваме еднонишково.

Тестваме с clang-503.0.40 (based on LLVM 3.4svn) (с опции -O3, relaxed IEEE & strict aliasing).

Основният тест е smallpt, покрай него има и няколко проби с по-прости сметки.
Тестовете са правени по 5 пъти (да, само 5), максималната разликата в резултатите на Intel-a е в рамките на 8% (което е бая), а на iPad-a e в рамките на 2%.

По време на тестовете не е имало други пуснати приложения (а лаптопа не е бил в power-save mode, etc). Сиреч, почти може и да си помисли човек да им вярва.

Показани са усреднените стойности на времето, което е отнел всеки тест в секунди (тоест, по-ниските правоъгълничета означават по-добри резултати).

Ако някой има С/С++/Objective-C/Swift код който иска да бъде тестван, PM me :).

Като цяло, резултатите са такива, каквито се предполагаше че ще бъдат.

There is no country for old men.

p.s. през *почти* същите тестове мина и един iPhone 4, с ~ резултати Raytrace:240s, Sort:72s, Mem:56s, FPU 21s & iPhone 6 – Raytrace 13s, Sort 11s, Rand mem access 9.5s, FPU 1.92s.

VTable по поръчка

Ето в това видео от Going Native 2013 A. Alexandrescu разказва за някои от проблемите, които виртуалните функции в С++ създават. А именно : всеки обект държи по един (или повече) 64 битов (най-често) указател към виртуалната таблица, това измества разни неща от cache-а; винаги този указател седи в началото на обекта (а в началото е добре, да има неща които се ползват по-често. Това може да е указателят към виртуалната таблица, но може да е друго :)). Рядко в живота може човек да се докара до там, че да иска и това да забързв. Но се случва.

Идеята е да си напишем наша имплементация на виртуални функции, която да ползва не 64 битово число, ами 8 (така ще се ограничим по максимален брой виртуални функции в интерфейс или по максимален брой класове в йерархия, но 256 (2^8) често е повече от достатъчно).

Ето и примерна имплементация (с малко macro-та и нагласявки :)). На места е малко redundant; мисля че може да се постигне и по-приятен синтаксис.

Не успях да събера мотивация да направя тестове за да видя има ли забързване, в кои случаи има и колко е то. Според Alexanderscu Facebook работи 4% по-бързо заради тази магия :).
Предимство е, че така можем и да изберем къде да поместим този “char”, който ползваме като указател към виртуална таблица -> ако не се очаква виртуални методи да се викат често, навярно е по-добре да не е в началото на обекта.

Raytrace Interop

Oпитвам се напоследък да правя разни бързи неща с CUDA.
Често има проблем с вземането на резултата от GPUs app (то не става особено бързо, да започне да се случва асинхронно е голяма болка, а пък понякога трябва да става особено често), та един начин това да се случва е с CUDA/OpenGL interop.

Идеята (поне моята) е с някой CUDA kernel да рисувам нещо в някоя текстура, а с OpenGL само да си я рисувам на екрана. И така, понеже всичко си е на едно място (ze gpu), да спестя всякакво четене/писане от видео паметта.

В резултат на ровене из форуми, stackoverflow, tutorials && references, ето какво се получи :

1. YouTube (beware of the video compression)(out-of-core version).
2. GitHub.

Имам около 40fps, интериорна сцена с голяма лампа (cornell box), Macbook Pro, i7 2.6GHZ, nVidia 650M, 512×512 resolution, ray depth 8, rays per pixel 1.

Това, което се случва е : CPU-то инициализира CUDA & OpenGL, стартира CUDA kernel които трасира няколко лъча, резултата се записва в текстура, и накрая се рисува с OpenGL.

Целта на кода по-горе е да се ползва като бърз startup за някои CUDA driven apps (като raytracers :)).

Pros на моят вариант : почти всичко се смята на GPU-то (генериране на семпли, random числа, трасиране, събиране на резултата), ползва Halton редици (beware, nVidia държат патент над тях), ползва CUDA/OpenGL interop (сиреч, ъпдейта на картинката се случва максимално бързо).

Cons : това не е production, aми по-скоро naive имплементация на ray trace, може да се стане няколко пъти по-бързо и noise-free (няма shadow rays, complex sampling), GPU-тата се feed-ват от 1 CPU thread, etc.