So, if you want to get familiar with why we have to use CUDA and OpenCL and stuff like that, in the text below there are a few links to articles, talks and papers that explain that in length – you can read those in their order of appearance. I strongly advise everybody to at least check those. However, if you want to read just a summary of all that in one place, here you have it …
We have to start with a bit of history, since this is how every good story begins …
At one point early on during the computer era, people decided that they could show the result of the computer calculations on screen in other-than-alpha-numeric form, so their job could be more interactive (and less boring, I guess). Special hardware was needed for that – something that could convert large amounts (for that time) of data from the computer to something appropriate that the screen could understand, and thus the framebuffer was born. In some sense, probably this is the birth of the GPU and it happened long, long time ago (1970-something).
The GPUs have had a really crazy history since that time – till the Voodoo showed up in mid 90s (and OpenGL just before that), when everybody wanted (and did make) GPUs. There were all kinds of designs, approaches and implementations and it was time of really dynamic changes. Many of the companies did not manage to keep up with that, so many of them quit the business. These days we basically have nVidia, AMD and Intel, and in the mobile sector there are ARM (Mali), Imagination Technologies (PowerVR), Qualcomm (Adreno) and a few even smaller IP vendors. Here is pretty much the full story of it.
But the history that brings us to today is actually pretty cool. As it seems from the beginning of the computing era, a lot of chips designs were targeted at the CPU, but at some point everybody just started using x86. Yes, we had MIPS, Power/PowerPC, Sparc, etc, but they were nothing as x86 in terms of market share. More or less, we have been stuck with x86 for a long time. We have ARM now, which is quite different, though in some way it is just a cleaned-up, RISC variant of the same.
In comparison, GPUs have always had the freedom to completely change the architecture or perpetually try new stuff since they don’t have to support tons of legacy software. So there one can find all kind of crazy ideas, some working great, some not.
I would like to split the GPU history into three parts – before Voodoo, after Voodoo and after the GPGPU advent.
The Voodoo GPUs were so great back then (circa 1996), that they marked the whole period. They took the GPU ‘to the masses’ and everybody knew about Voodoo; I was 14-year-old and living in post-socialist country, but even I knew of that Voodoo stuff.
They were good, but not good enough to save their maker, 3dfx, from failing – in the dynamic situation back then it did not take many wrong steps to fail. 3dfx were bought by nVidia (circa 2000), which at that time had some great GPUs too. After a lot of market acquisitions, some time in the beginning of the 21st century there were only two major discrete desktop GPU makers left – nVidia and AMD, and two dominant APIs – DirectX and OpenGL. And actually I find that not so bad – the times when everybody had different approaches on the GPUs were the times when you have to write your code to fit tons of hardware (and this is not fun). This third period continues up to today, but something happened around ~2005, which marked the start of a new era – the GPGPU one, which is what we will focus on for the rest of this text.
Herb Sutter has great article about why that third period is so exciting and important. In my eyes, this article may have turned out to be as foreseeing and important for the computer industry as Dijkstra’s “Go To Statement Considered Harmful”.
Now, lets get away from the history for a moment and look closer at why this GPGPU stuff matters (a lot) …
In a nutshell, because of the way processors are made, we cannot make them a lot faster anymore (as we could every year, till 2005). So our programs can’t get faster and faster as the time goes by (or at least, they can’t get faster without some developer’s efforts).
There is Moore’s Law that everybody talks about – basically it is a rule-of-thumb that says that every 18-24 months or so, the number of transistors in a processor roughly doubles. The way this happens usually is that we start making half as big transistors, so we can put twice as many on the die. And on top of that those transistors can run at two times the frequency they had and because they are smaller, we can feed them considerably less voltage, and this resulted in the roughly same power consumption (but running faster). However, because of a physics phenomenon called “leakage” (if you want to know the details and much more on the topic, this talk is great) we can’t keep lowering the voltage and if we don’t lower the voltage we get four times the power consumption we had before. If you don’t believe that, ask Intel – they predicted 10ghz (and people were thinking a lot of crazy stuff, too) processors for 2011. By the way, because of the limitation of the speed of light, if we want to have a 100GHz single-clock-domain circuit, say, some CPU, the furthest distance an electron can travel during a clock is 2mm. Good luck cooling a chip that is 2mm big!
So, the processors today are power limited – we can’t cool them infinitely, we can’t make them use less power for higher frequencies. So people started saying that Moore’s Law was dead. Here is an image, that represents the number of transistor on Y axis, and the year on X.
As we can see, Moore’s Law was not dead by 2010, but at some point it will be – nothing exponential in nature lasts forever. But this day is not today – and perhaps won’t be in the next 10 years. We can still put more and more transistors, but we can’t make them run at higher frequencies. So the processors these days are getting more cores. Today’s expectations are that single core performance will improve somewhere between 2% and 15% per year (because of better architectures, optimizations in the process of their production and so on).
This makes x30 CPU performance for the next 50 years – for a comparison, for the last 30 years this number is x2500. At the same time, we expect the multi-core speed-up to be around 75% per year. And this happens today. The image below shows the theoretical GFLOPS of modern CPU (which has few very smart cores) vs GPU (which has thousands not-so-smart ones).
I think that most developers underestimate the above, and really bad so.
Well, they will not, and that’s why from now on the efficiency and performance of the programs will get more important than ever before (this is quite the opposite of what the majority of developers think). In the past, you could have a slow but sophisticated single-core and feature-complete program that runs sluggish on most machines. But you knew that after some time your program would run fine. This is the past – now everybody can design sophisticated programs (since we all have zillions of tools), but not everybody knows how to make them run faster, not to mention that the need for more and more computations has been outgrowing the speed at which processors are getting faster since the beginning of the computing.
“Concurrency is the next revolution in how we write software. The vast majority of programmers today dont grok concurrency, just as the vast majority of programmers 15 years ago didnt yet grok objects” Herb Sutter.*
So, lets get to the power (see the image below). As we noted above, processors are power limited. If we take one 20 mm^2 processor chip and fill it with ALUs only (for instance, FMAD – fused multipy-add, basically simple units that can do some floating point operations), this would give us 12 TFLOPS (using 28nm production process) and would result in 300 Watts of power (since every double FMAD uses 50pj). For a comparison today’s modern GPUs (which are considered really power efficient) are using around 200Watts for ~5 TFLOPS. That is because these units can’t do anything without data – they have to get input data, over which to do their computations, and store the output results somewhere. And moving data comes at its own costs – moving 64 bits at 1mm distance costs 25pj. If you have to move 2×64 bits, you are already wasting more (or the same) power for the data transfer (2×25), than for the calculations (50). These costs are growing linearly – if you have to move the data along 10mm, this would cost 250pj, and so on and so forth. For some cases (as in the DRAM of the GPU) it goes up to 10000pj. These joules can be converted to latency too, since when the data has to travel longer distance, it arrives later. It is clear that you want to have your data as close to the computation units as possible.
And when we look at the “modern” computer languages, we can see that they clearly are “above” all that – C++, Java, Ruby, Haskell, Go – all of them present the world as “flat” (read: fully-coherent) – from their perspective all memory is the same and accessing it costs the same, too. And I think they seem to be proud of that, in some manner. The same goes for x86 architecture itself, by the way – it has some instructions for pre-fetch and cache management, but they are just little patches over a large problem. It is good that we have all those caches in the x86, trying so badly to hide the latencies, but it is just that the developers sometimes (by nature) know more than any heuristics does. There are a lot (really, a lot) problems of such nature. So if there is a way to manually “manage the cache”, a lot of developers/programs could benefit.
“Performance = Parallelism”, “Efficiency = Locality”William Dally
Who targets what and how – in the next part.
* Parallelism is to run stuff in parallel – like having multiple cars on the highway, all going parallel on the road.
Concurrency is to be able to advance multiple tasks in a system. For example, if you have two cars that have to be moved form A to B, and one driver to do the job. The driver might drive the first car for some walkable distance, than go to the second car and drive it to go in front of the first one, than go back to the first car, etc. After a lot of time, both of the cars will be moved from A to B – this is concurrency. If you have two drivers for the job, this is parallelism.
The situation, at which some resource might be needed from different subjects at the same time – for example if the cars have to cross ajunction, they would have to wait for each other before they can cross it (or otherwise a crash will occur) is concurrency problem.
If you have parallelism, you have a concurrency too. Having heavy parallelism presupposes having to deal with concurrency problems.