This is the next part of the GPGPU 101 series I started some time ago. If you haven’t checked part 0, please do so. In this part we are going to give a brief overview of x86 (mainly based on this article).
So, in part 0 we arrived at the idea that in order to get performance and efficiency from now on, we need parallelism and locality. Now we need some context for those notions of parallelism and locality, so we can quantify them in a meaningful manner. First and foremost, we need basic familiarity with the environment. This is often underestimated – the environment from the point of view of a developer is immensely sophisticated. For a C++ developer, for example, it starts with the compiler, goes all the way through the operating system and ends in the hardware. So you have to know how the compiler works (and why it does what it does), what the operating system does (and why we need it), and have at least a basic understanding of the architecture you are writing for (usually x86-64). If you happen to do scripting, instead of the compiler you have to know the interpreter (normally a virtual machine), so you basically have the same issues (along with probably some extra ones).
“Software does not run in a magic fairy ether powered by the fevered dreams of CS PhDs” Branimir Karadzic
We will start with the processor. As I already mentioned, this is the best article I’ve ever read about how a processor works – really short, and it hits most of the important stuff. I would say it covers what is critical for most of the programming needs a developer might have.
In summary, there are a few ways a processor can be made. Whatever the approach, the processor has to fetch an instruction, decode it to understand what exactly has to be done, execute it and store the result somewhere. This process is organized in a pipelined manner, and so it is called pipelining. You can think of the pipeline described above as a 4-step serial operation. In real life, processor pipelines are much more complicated, but this explanation is good enough for our needs.
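To see why this pays off, here is a tiny back-of-the-envelope model (my own illustration, not from the article), assuming an ideal 4-stage pipeline where each stage takes one cycle and a new instruction enters every cycle:

```cpp
#include <cstdio>

int main() {
    // Ideal 4-stage pipeline: fetch, decode, execute, store.
    // Instruction i (0-based) enters in cycle i + 1 and leaves after cycle i + 4.
    const int stages = 4;
    const int instructions = 8;
    for (int i = 0; i < instructions; ++i)
        std::printf("instruction %d done after cycle %d\n", i, i + stages);
    // 8 instructions finish after 11 cycles instead of 8 * 4 = 32,
    // because all four stages work on different instructions at the same time.
    return 0;
}
```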
If the stages in the pipeline are independent of each other, you can execute them in parallel relatively easily (you can fetch one instruction while decoding another, while executing a third one). You can also multiply that pipeline (add more pipelines parallel to the one you have), so you can process multiple instructions at once. This is called superscalar execution and it is implemented in every modern x86 processor. So here is a buzzword: superscalar execution is a way to do Instruction Level Parallelism (ILP) – meaning you can execute multiple instructions at the same time with the mechanism we just described. As we discussed in the previous part, parallelism equals performance, so this is good – ILP gives us some performance here.
ILP works great if there are no data dependencies between the instructions. But what if there are? For example, the first pipeline may hold an instruction that computes r = (A + B), but A is far away in memory, so it has to wait for it, whereas the second pipeline holds an instruction that needs that result, for example to compute r / 2 (the sum divided by two). They will both have to wait. The same can happen if both pipelines have to execute an instruction for which there is specially dedicated hardware (for example, if there is a circuit that does a hardware-implemented exp and both pipelines want to do exp(something)). Modern processors usually have a number of such calculation units because of that (so there are multiple floating-point units in your processor). We are going to talk more about those units later.
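To make the data-dependency idea concrete, here is a minimal source-level sketch (the function names and the choice of four accumulators are mine): in the first version every addition depends on the previous one, so the extra execution units have nothing independent to work on; in the second one there are four independent chains.

```cpp
#include <cstddef>

// Every addition depends on the previous one: a single serial chain,
// so at most one floating-point unit is busy at any time.
float sum_dependent(const float* a, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

// Four independent partial sums: additions to different accumulators have
// no data dependency between them, so the processor (and the compiler)
// can keep several execution units busy at once.
float sum_independent(const float* a, std::size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; ++i) s0 += a[i];   // leftover elements
    return (s0 + s1) + (s2 + s3);
}
```

(The two versions may produce slightly different results, since floating-point addition is not associative.)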
Basically, the text above gives a bird’s eye view of an x86 processor. We can think of the rest of the architecture as tools and mechanisms to prevent the processor from stalling and make sure it is utilized as much as possible.
First, since the processor knows in advance which instructions it will have to execute, it can ask for the data they will need before they actually reach the execution part of the pipeline (and this is really clever). But what if there is a branch in the code and there is no way to know which path the code will take? There is a solution that addresses this issue – it is called speculative execution, and it employs branch predictors – every modern x86 has (at least) one. The branch predictor monitors which branches your control path comes across and how those branches behave (statistically), which allows it to speculate which direction a pending branch will take, so other units can prefetch the data that path needs in advance (preparing the data for both paths is not an option, since it would flood the data channels). So, basically, if you have an if-then-else in your code that is executed a lot, and you always go into the “if-true” clause (for example), this is practically free. But if your branch is not that easily predictable, i.e. it exhibits high entropy, the speculation may go wrong and this is going to cost you. Note that the bulk of the cost may not necessarily come from the instruction decoding and execution, but from waiting for the operand data (we talked about data movement in part 0).
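The classic way to see this effect (the code below is my own illustration, not from the article) is to run the same data-dependent branch once over random data and once over sorted data – the branch is identical, only its predictability changes:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Counts how many elements are above the threshold. The branch inside
// the loop is the same in both calls below; only its predictability differs.
std::size_t count_big(const std::vector<std::uint8_t>& v) {
    std::size_t count = 0;
    for (std::uint8_t x : v)
        if (x >= 128)        // taken about 50% of the time on random data
            ++count;
    return count;
}

int main() {
    std::vector<std::uint8_t> data(1 << 20);
    std::mt19937 rng(42);
    for (auto& x : data) x = static_cast<std::uint8_t>(rng());

    std::size_t a = count_big(data);   // random order: the predictor guesses wrong often
    std::sort(data.begin(), data.end());
    std::size_t b = count_big(data);   // sorted: long runs of taken / not taken, nearly free
    return a == b ? 0 : 1;             // same answer; very different timing if you measure the calls
}
```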
There is also hardware that helps when moving data is too slow. x86-64 processors have multi-level cache hierarchies (L1, L2 and L3, check the image below). These are basically super-fast (and super-expensive) memory pools that sit close to the processor (actually, on the chip itself) and store the data you need the most (based on frequency of usage, adjacency, etc). L1 is the smallest but fastest; L3 is the largest but slowest. The caches normally work at cache-line granularity – 64 bytes in modern x86-64 CPUs. What that means is that when you read 1 byte from memory, the cache will load the adjacent 63 bytes as well, starting from the greatest 64-byte-aligned address that is less than or equal to the specified address. So, if you read up to 64 characters that are stored arbitrarily far apart in memory, you will likely cause multiple memory-to-cache transactions, compared to just one or two if those characters were adjacent (more on the memory in the computer here). Apart from data, decoded instructions are also cached (in the figure below, L1d is the L1 data cache, L1i is the L1 instruction cache). So, again, moving stuff around is expensive, which is why we need good data locality.
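Here is a small sketch of what this means in practice (the matrix size and function names are mine): both functions sum the same row-major matrix, but the first one walks it in the order it is laid out in memory, so every 64-byte cache line it loads is fully used, while the second one jumps N * 4 bytes between consecutive accesses and wastes most of every line it touches.

```cpp
#include <cstddef>
#include <vector>

// A row-major N x N matrix of ints, stored in one contiguous buffer.
const std::size_t N = 4096;   // 4096 * 4096 * 4 bytes = 64 MB, far bigger than any cache

// Consecutive accesses are 4 bytes apart: every 64-byte cache line
// brought in from memory is fully used before we move on.
long long sum_row_major(const std::vector<int>& m) {
    long long s = 0;
    for (std::size_t row = 0; row < N; ++row)
        for (std::size_t col = 0; col < N; ++col)
            s += m[row * N + col];
    return s;
}

// Consecutive accesses are N * 4 bytes apart: almost every access
// pulls in a new cache line and uses only 4 of its 64 bytes.
long long sum_col_major(const std::vector<int>& m) {
    long long s = 0;
    for (std::size_t col = 0; col < N; ++col)
        for (std::size_t row = 0; row < N; ++row)
            s += m[row * N + col];
    return s;
}
```

With a matrix this size the two functions compute the same sum, but the column-major walk is typically several times slower, purely because of how it uses the cache lines.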
Modern x86 has a lot more inside, but there is at least one more thing worth mentioning, and this is the so-called “out of order” (OoO) execution. Basically, if an instruction in a pipeline has to wait for data, it can be put aside for a moment and another one can be executed. So the order in which the instructions in a pipeline are executed is not guaranteed. But as you know from writing programs, the order of execution matters (a lot!), so if you scramble some of the calculations you do, you will most certainly not get the result you are looking for. So the processor does not randomly pick the next instruction in the pipeline – it carefully examines what can be reordered and what cannot. Furthermore, it “renames” the registers the instructions use, which helps a lot in resolving the data dependencies in the code. This introduces another buzzword, “register renaming”.
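Register renaming is a hardware mechanism, so it cannot really be shown in C++, but here is a source-level analogy (entirely my own) of the kind of false dependency it removes: the two multiplications below are independent, yet they reuse the same name, and it is the renaming of that name (done by the compiler for variables, and by the processor for architectural registers) that lets them overlap.

```cpp
float false_dependency(float a, float b, float c, float d) {
    float t;            // one name, reused for two unrelated values
    t = a * b;
    float r1 = t + 1.0f;
    t = c * d;          // write-after-read on t: a naming conflict, not a real data dependency
    float r2 = t + 2.0f;
    // With renaming, each write to "t" gets its own physical storage,
    // so the two multiply/add pairs can execute in parallel.
    return r1 + r2;
}
```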
Okay, so x86 has ILP, caches, branch predictors, OoO and register renaming, all of it aimed at keeping the pipelines busy. Actually, there is one more thing worth mentioning. It is the so-called (buzzword!) SIMD – Single Instruction, Multiple Data. The idea is simple – if you are going to do all that instruction fetching, decoding and so on (and they can be really expensive because of x86’s archaic ISA origins), wouldn’t it be great if you could execute the instruction on multiple data at the same time? For example, if you want to calculate the sum of two vectors (each having float x, y, z; as data members), it would be great if you could execute the same instruction (add) on all of the elements those vectors have, at once. And actually every modern processor has special hardware that can do exactly that. Modern compilers try to identify parts of your code that are suitable for such SIMD instructions (this is called “auto-vectorization”). Writing SIMD code is important – for example, libraries like Embree are heavily based on it, and pretty much every performance-hungry application today has to use it. If you have ever wondered how V-Ray 3.1 got so much faster than 2.4, this is one of the reasons (of course, there are many more). SIMD has multiple versions and can be found in basically every x86-64 system. So, SIMD is just like ILP, but instead of executing multiple instructions at a time, it executes the same instruction on multiple data fields. As with ILP, it is a way to do parallelism, and since parallelism equals performance, it is a way to boost the performance of your x86-64 code.
If you want to get the most out of SIMD, you can write such code yourself (for example with compiler intrinsics) instead of relying on auto-vectorization alone. The same goes for memory access patterns – you can make sure you read/write as linearly as possible, have fewer data dependencies and so on – and for the way you organize your branches. And, as we discussed in part 0, this is going to get more and more important (not less). Fortunately, you can do all of this fairly easily – at least in C/C++. I am not so sure about modern languages like Java, C#, Javascript, Haskell and the others …
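To make the vector-sum example concrete, here is a minimal sketch using SSE intrinsics (the struct layout and the padding of the three components to four floats are my choices):

```cpp
#include <xmmintrin.h>   // SSE intrinsics, available with every x86-64 compiler

// A 3D vector padded to 4 floats so it maps directly onto one 128-bit SSE register.
struct alignas(16) Vec3 {
    float x, y, z, pad;
};

// Scalar version: three separate additions.
Vec3 add_scalar(const Vec3& a, const Vec3& b) {
    return { a.x + b.x, a.y + b.y, a.z + b.z, 0.0f };
}

// SIMD version: a single addps instruction adds all four lanes at once.
Vec3 add_simd(const Vec3& a, const Vec3& b) {
    Vec3 r;
    __m128 va = _mm_load_ps(&a.x);
    __m128 vb = _mm_load_ps(&b.x);
    _mm_store_ps(&r.x, _mm_add_ps(va, vb));
    return r;
}
```

In practice a compiler will often auto-vectorize the scalar version too, and serious SIMD code (like Embree) usually restructures the data so that one register holds, say, the x components of four different vectors – but the principle is the same: one instruction, several data elements.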
Remember that there are multiple computation units in the processor because of the instruction level parallelism? What if there is not enough ILP and not all of them have something to do in the current super-pipelined task? This is where hyper-threading (known outside of x86 as Simultaneous Multi-Threading, or SMT) comes in. The idea is that every physical core is split into two (or more) logical cores, so you run multiple threads – say, two – on the same core. If one of them cannot occupy all of the hardware you have, the other one can use (some of) it. And no, hyper-threading does not make the processor linearly faster with the number of logical cores; it just helps utilize its resources better (and in some cases – if they are already well utilized, or for other reasons – it gives no benefit at all). Of course, modern x86-64 processors also have multiple hardware cores on the same chip – basically just multiplying the hardware of one core. Hyper-threading, multi-core and multi-processor systems give us the last type of parallelism, which is on the thread level. So we have ILP, SIMD and threads.
There we have it – the complicated x86 architecture at a glance, with its methods for parallelism (ILP, hyper-threading, SIMD, multiple cores) and locality (branch prediction, caches).
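Thread-level parallelism is the one you control most directly from C++. Here is a minimal sketch (the workload and the way it is split are my own choices), assuming nothing but the standard library:

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

int main() {
    // One software thread per logical core; on a hyper-threaded CPU this is
    // typically twice the number of physical cores.
    const unsigned threads = std::max(1u, std::thread::hardware_concurrency());

    const std::uint64_t n = 100000000;           // sum the numbers 0 .. n-1
    std::vector<std::uint64_t> partial(threads, 0);
    std::vector<std::thread> pool;

    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&, t] {
            std::uint64_t local = 0;             // accumulate locally, write back once
            for (std::uint64_t i = t; i < n; i += threads)
                local += i;
            partial[t] = local;
        });
    }
    for (auto& th : pool) th.join();

    std::uint64_t total = 0;
    for (auto p : partial) total += p;
    return total == n * (n - 1) / 2 ? 0 : 1;     // sanity check of the result
}
```

Whether the extra logical cores buy anything here depends on the workload – a pure arithmetic loop like this one already saturates the execution units, so the hyper-threading benefit is usually small; code that spends time waiting on memory benefits more.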
However, there are two more things worth mentioning. First of all, since the compiler can see your code, it can identify the data dependencies during compilation and spread those instructions further apart, thus generating longer sequences of data-independent instructions (and thus helping the processor with the OoO task). If the compiler does that really well, the processor could do ILP without the hassle of register renaming, dependency monitoring and so on. There is actually a whole computer architecture approach based on this idea, called VLIW (Very Long Instruction Word). It gets its name because the data-dependency-free instructions that the compiler picks are packed into a single, very long ‘macro’ instruction – sometimes more than a hundred bytes (compared to 1-15 bytes per instruction in a non-VLIW architecture like x86). There have been a few attempts at such processors – Itanium from Intel, old AMD GPUs and nVidia Denver (which takes ARM instructions at runtime (not compile time!), reorders them into VLIW ones and executes them in that manner. Why nVidia decided to do that – I have no clue). VLIW sounds great, since when we offload the processor from those tasks it could do something else (at the very least, we are sparing transistors on the chip, and this is important too). In real life many VLIW attempts have failed – the compilers get way too complicated to produce, they waste way too much time (and memory) to compile, and they often fail to generate good code (at compile time you know less about the program than at runtime) – meaning that at the end they often generate instructions that are executed in a serial manner. Still, VLIW has its applications and compilers in the wild. For example, everybody can now take the LLVM compiler and use it to generate code for AMD R600 – a VLIW architecture – relatively easily. More on all that in the next part of the GPGPU series.
And of course, there is the option not to do all of that OoO machinery with its complicated apparatus, and instead spend the transistors it would need on more units for actual calculations. These types of processors are called speed demons (in contrast, complicated ones like x86 are called brainiacs). Speed demons have the potential to be more efficient, but only with good programmers and compilers, since they require more careful programming. The smart features of the brainiacs come at a cost – some of them are power hungry and/or cannot be turned off when they are not needed. And these days a lot of people have started talking not about performance, but about performance per watt, so power consumption is getting more and more important.
In the next part we will see how GPUs handle instruction, data and thread level parallelism.
P.S. A million thanks to Marto (blu) & Boris for the edit!