0. GPGPU 101 – Intro

So, if you want to get familiar with why we have to use CUDA, OpenCL and the like, the text below contains a few links to articles, talks and papers that explain it at length – you can read them in their order of appearance. I strongly advise everybody to at least check them out. However, if you just want a summary of all that in one place, here you have it …

We have to start with a bit of history, since this is how every good story begins …

At one point early in the computer era, people decided they could show the results of the computer's calculations on screen in other-than-alphanumeric form, so their job could be more interactive (and less boring, I guess). Special hardware was needed for that – something that could convert large (for the time) amounts of data from the computer into something the screen could understand, and thus the framebuffer was born. In some sense this is probably the birth of the GPU, and it happened a long, long time ago (1970-something).

GPUs have had a really crazy history since then – until Voodoo showed up in the mid-90s (and OpenGL just before that), when everybody wanted to make (and did make) GPUs. There were all kinds of designs, approaches and implementations, and it was a time of really dynamic change. Many companies did not manage to keep up and quit the business. These days we basically have nVidia, AMD and Intel, and in the mobile sector there are ARM (Mali), Imagination Technologies (PowerVR), Qualcomm (Adreno) and a few even smaller IP vendors. Here is pretty much the full story of it.

But the history that brings us to today is actually pretty cool. From the beginning of the computing era, a lot of chip designs targeted the CPU, but at some point everybody just started using x86. Yes, we had MIPS, Power/PowerPC, SPARC, etc., but they were nothing like x86 in terms of market share. More or less, we have been stuck with x86 for a long time. We have ARM now, which is quite different, though in some ways it is just a cleaned-up, RISC variant of the same idea.

In comparison, GPUs have always had the freedom to completely change their architecture and perpetually try new stuff, since they don't have to support tons of legacy software. So there you can find all kinds of crazy ideas, some working great, some not.

I would like to split GPU history into three parts – before Voodoo, after Voodoo, and after the advent of GPGPU.

The Voodoo GPUs were so great back then (circa 1996) that they marked the whole period. They took the GPU 'to the masses' and everybody knew about Voodoo; I was a 14-year-old living in a post-socialist country, and even I knew about that Voodoo stuff.

They were good, but not good enough to save their maker, 3dfx, from failing – in the dynamic situation back then it did not take many wrong steps to fail. 3dfx was bought by nVidia (circa 2000), which had some great GPUs of its own at that time. After a lot of acquisitions, sometime at the beginning of the 21st century only two major discrete desktop GPU makers were left – nVidia and AMD – and two dominant APIs – DirectX and OpenGL. And actually I find that not so bad – the times when everybody had a different approach to GPUs were the times when you had to write your code to fit tons of hardware (and that is not fun). This third period continues to this day, but something happened around 2005 that marked the start of a new era – the GPGPU one, which is what we will focus on for the rest of this text.

Herb Sutter has a great article about why that third period is so exciting and important. In my eyes, this article may turn out to be as prescient and important for the computer industry as Dijkstra's "Go To Statement Considered Harmful".

Now, let's step away from the history for a moment and look closer at why this GPGPU stuff matters (a lot) …

In a nutshell: because of the way processors are made, we cannot make them a lot faster anymore (as we could every year until 2005). So our programs can't keep getting faster as time goes by (or at least, they can't get faster without some effort from the developer).

There is Moore's Law that everybody talks about – basically, it is a rule of thumb that says that every 18-24 months or so the number of transistors in a processor roughly doubles. The way this usually happens is that we start making transistors half as big, so we can put twice as many on the die. On top of that, those transistors could run at twice the frequency, and because they were smaller we could feed them considerably less voltage, which resulted in roughly the same power consumption (while running faster). However, because of a physics phenomenon called "leakage" (if you want the details and much more on the topic, this talk is great), we can't keep lowering the voltage, and if we don't lower the voltage we get four times the power consumption we had before. If you don't believe that, ask Intel – they predicted 10GHz processors by 2011 (and people were thinking a lot of crazy stuff, too). By the way, because of the limitation of the speed of light, if we want a 100GHz single-clock-domain circuit – say, some CPU – the furthest a signal can travel during one clock is about 2mm. Good luck cooling a chip that is 2mm big!

So processors today are power limited – we can't cool them indefinitely, and we can't make them use less power at higher frequencies. So people started saying that Moore's Law was dead. Here is an image that shows the number of transistors on the Y axis and the year on the X axis.

As we can see, Moore's Law was not dead as of 2010, but at some point it will be – nothing exponential in nature lasts forever. That day is not today, though – and perhaps won't come in the next 10 years. We can still put in more and more transistors, but we can't make them run at higher frequencies. So processors these days are getting more cores. Today's expectation is that single-core performance will improve somewhere between 2% and 15% per year (because of better architectures, optimizations in the production process and so on).

That works out to about 30x CPU performance over the next 50 years – for comparison, over the last 30 years that number was 2500x. At the same time, we expect the multi-core speed-up to be around 75% per year. And this is happening today. The image below shows the theoretical GFLOPS of a modern CPU (which has a few very smart cores) vs a GPU (which has thousands of not-so-smart ones).

I think most developers underestimate the above, and badly so.

Let me give you an example – I worked at a company that made MMORPGs. Our code was single-threaded and slow, the game was slow, but we knew that by the time we started selling it, computers would be fast enough and the game would run fine – MMORPGs usually take 3 to 5 years to make if you do your own engine (the game failed in the end, but that's another story), so there was plenty of time. These days there are millions of people out there who still write code like that (embarrassingly single-threaded code, using languages and tools designed to be exactly that), and who still believe the same – that computers are fast enough or that they will get faster (Swift, Javascript – I am looking at you).

Well, they will not, and that's why from now on the efficiency and performance of programs will matter more than ever before (which is quite the opposite of what the majority of developers think). In the past, you could have a slow but sophisticated, feature-complete single-core program that ran sluggishly on most machines, and you knew that after some time it would run fine. That is the past – now everybody can design sophisticated programs (since we all have zillions of tools), but not everybody knows how to make them run faster; not to mention that the need for more and more computation has been outgrowing the speed at which processors get faster since the beginning of computing.

"Concurrency is the next revolution in how we write software. The vast majority of programmers today don't grok concurrency, just as the vast majority of programmers 15 years ago didn't yet grok objects." – Herb Sutter*

So, let's get to the power (see the image below). As we noted above, processors are power limited. If we take one 20 mm^2 processor chip and fill it with ALUs only (for instance FMADs – fused multiply-add units, basically simple units that can do floating point operations), this would give us 12 TFLOPS (using a 28nm production process) and would result in 300 Watts of power (since every double-precision FMAD uses 50pJ). For comparison, today's modern GPUs (which are considered really power efficient) use around 200 Watts for ~5 TFLOPS. That is because these units can't do anything without data – they have to get input data to compute on and store the output results somewhere. And moving data comes at its own cost – moving 64 bits over a distance of 1mm costs 25pJ. If you have to move 2×64 bits, you are already spending as much (or more) power on the data transfer (2×25pJ) as on the calculation itself (50pJ). These costs grow linearly – if you have to move the data 10mm, it costs 250pJ, and so on and so forth. For some cases (such as going all the way to the GPU's DRAM) it goes up to 10000pJ. These joules translate into latency too, since when data has to travel a longer distance, it arrives later. It is clear that you want your data as close to the computation units as possible.
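A quick back-of-the-envelope check of those numbers (my own arithmetic, assuming one FMAD counts as two floating point operations):

    300 W = 300 J/s = 3×10^14 pJ/s
    3×10^14 pJ/s ÷ 50 pJ per FMAD = 6×10^12 FMADs/s
    6×10^12 FMADs/s × 2 flops per FMAD = 12 TFLOPS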

And when we look at the "modern" computer languages, we can see that they are clearly "above" all that – C++, Java, Ruby, Haskell, Go – all of them present the world as "flat" (read: fully coherent) – from their perspective all memory is the same and accessing it costs the same, too. And they seem to be proud of that, in some manner. The same goes for the x86 architecture itself, by the way – it has some instructions for prefetching and cache management, but they are just little patches over a large problem. It is good that we have all those caches in x86 trying so hard to hide the latencies, but sometimes developers simply know more (by the nature of the problem) than any heuristic does. There are a lot (really, a lot) of problems of that kind. So if there were a way to manually "manage the cache", a lot of developers and programs could benefit.
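To make those "little patches" concrete, here is a tiny illustration of my own (not from the article) of the kind of manual cache hint x86 exposes – a software prefetch via _mm_prefetch; the distance of 64 elements is an arbitrary guess and in real code it has to be tuned:

    #include <xmmintrin.h> // _mm_prefetch
    #include <cstddef>

    float sumWithPrefetch(const float* data, std::size_t n) {
        float sum = 0.f;
        for (std::size_t i = 0; i < n; ++i) {
            // ask the CPU to start fetching the data we will need ~64 iterations from now
            if (i + 64 < n)
                _mm_prefetch(reinterpret_cast<const char*>(data + i + 64), _MM_HINT_T0);
            sum += data[i];
        }
        return sum;
    }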

"Performance = Parallelism", "Efficiency = Locality" – William Dally

Who targets what and how – in the next part.

* Parallelism means running things in parallel – like having multiple cars on the highway, all driving alongside each other. Concurrency is the situation in which some resource may be needed by different actors at the same time – for example, if the cars have to cross a junction, they have to wait for one another before crossing it (or a crash will occur). Heavy parallelism presupposes having to deal with concurrency problems.

I admit I was wrong

Okay, I hate that I have to do this, but it seems that if I make a blog about C++, OpenCL and CUDA in a language in which I personally know all the native speakers who care about those technologies (the GPGPU ones most of all), I cannot have a really big audience (they are literally 5 people, plus Boris. Hey Boris!). So I will start blogging in English. My English is not great, but I will do my best.

About a year ago I decided to write a paper in which I wanted to tell people more about how I spent a couple of months at Chaos. Basically it was the story of how we solved a problem we had (it is not a general solution that works everywhere and for everybody, but it did work for us back then). And I learned a lesson – writing papers is boring and frustrating. You have to write a certain number of pages, you have to have references, an abstract and images all over the place. Then you have to convince somebody that what you have written is worth it ("it made a popular raytracer faster" is not enough of a proof, apparently). And dealing with the academic community was not hours of fun.

So, I am starting to blog about CUDA (and some C++ and OpenCL of course, but we will start with more CUDA, I think). In English.
I will start with the boring stuff (what CUDA is and why I don't write more about OpenCL) and I will move on to the more interesting stuff (fast). I promise that even in the boring posts I will do my best to include the spicy important details that I had a hard time finding.

Stay tuned.

no except

Here is an interesting talk by Meyers in which he explains very well what std::move, std::forward and noexcept are :)

Around the middle there is a part showing how much faster code can be when noexcept is used wherever possible. There were also examples of how code gets faster simply by switching to a C++11/14-enabled compiler. In connection with noexcept, at some point this chart showed up.

How nice, I thought … We will make a NOEXCEPT macro (which will be #define NOEXCEPT noexcept when compiling with a C++11/14 compiler), we will recompile everything we have with it (both at home and at the office), and just like that we will get a few percent of speedup out of nowhere …
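Roughly this – the vec3 below is just an illustration of mine, not the actual math library from the post:

    #if __cplusplus >= 201103L
        #define NOEXCEPT noexcept
    #else
        #define NOEXCEPT
    #endif

    struct vec3 {
        float x, y, z;

        // same code compiles as C++98 and, with a C++11/14 compiler, gets noexcept for free
        vec3 operator+(const vec3& rhs) const NOEXCEPT {
            vec3 r = { x + rhs.x, y + rhs.y, z + rhs.z };
            return r;
        }
    };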

So I dug up an old (C++98) micro math library of mine (vector, point, matrix, quaternion – roughly that) and decided to run a loop with them (it is actually so simple that I did not expect any difference in the results at all).
I am using clang-600.0.51 (LLVM 3.5svn), compiling with -Os (which is the default; with -O3 the results are almost the same) and with C++14 enabled.

The code for the test can be found here.

Interestingly, contrary to what Meyers, Chandler (from LLVM) and I-don't-know-who-else say, on my machine (i7-3615QM) the code runs between 2 and 5 percent slower when the functions are marked noexcept.
Either I am using it very much the wrong way, or somebody somewhere is fibbing a bit …

No country for old men

One more "benchmark" between ARM and Intel CPUs (the idea for it came after various JS benchmarks on an iPad Mini Retina ran only 2-3-4 times slower than on an i7 2600 … The logical question that someone (Teo) asked was whether it really is that much slower on an application like a raytracer).

We test smallpt (not that it is a representative raytracer, but it is a raytracer), naive quicksort, random memory access & floating point operations. The code for the tests is here.

We test an i7-3720QM (2.6GHz) (MacBook Pro Retina 2012) vs a 64-bit A7 (1.3GHz) (iPad Mini Retina).

We test single-threaded.

We test with clang-503.0.40 (based on LLVM 3.4svn) (with -O3, relaxed IEEE & strict aliasing).
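The real test code is in the linked repo; the snippet below is only my sketch of the general shape of such a single-threaded micro-benchmark – the floating-point loop is a made-up stand-in, not one of the actual tests:

    #include <chrono>
    #include <cmath>
    #include <cstdio>

    int main() {
        typedef std::chrono::high_resolution_clock Clock;
        const Clock::time_point start = Clock::now();

        // some floating-point-heavy work the compiler cannot trivially throw away
        double acc = 0.0;
        for (int i = 1; i < 100000000; ++i)
            acc += std::sqrt(static_cast<double>(i)) * 0.5;

        const double seconds = std::chrono::duration<double>(Clock::now() - start).count();
        std::printf("FPU test: %.3fs (checksum %f)\n", seconds, acc);
        return 0;
    }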

The main test is smallpt; alongside it there are a few trials with simpler computations.
The tests were run 5 times each (yes, only 5); the maximum spread in the results on the Intel is within 8% (which is quite a lot), and on the iPad within 2%.

No other applications were running during the tests (and the laptop was not in power-save mode, etc.). In other words, one could almost consider trusting them.

Shown are the averaged times each test took, in seconds (that is, lower bars mean better results).

If anyone has C/C++/Objective-C/Swift code they would like tested, PM me :).

Overall, the results are what they were expected to be.

There is no country for old men.

P.S. An iPhone 4 went through *almost* the same tests, with roughly these results: Raytrace: 240s, Sort: 72s, Mem: 56s, FPU: 21s; and an iPhone 6: Raytrace: 13s, Sort: 11s, Rand mem access: 9.5s, FPU: 1.92s.

A custom VTable

In this video from GoingNative 2013, A. Alexandrescu talks about some of the problems that virtual functions in C++ create. Namely: every object holds one (or more) pointer, 64-bit most often, to the virtual table, which pushes other things out of the cache; and that pointer always sits at the beginning of the object (and at the beginning it is good to have the things used most often – that may be the vtable pointer, but it may be something else :)). Rarely in life does one get to the point of wanting to speed up even that. But it happens.

The idea is to write our own implementation of virtual functions that uses not a 64-bit number but an 8-bit one (this limits the maximum number of virtual functions in an interface, or the maximum number of classes in a hierarchy, but 256 (2^8) is often more than enough).

Here is a sample implementation (with a few macros and hacks :)). In places it is a bit redundant; I think a nicer syntax can be achieved.
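Since the sample implementation itself lives behind the link, here is just a minimal sketch of the idea (my own, with hypothetical Shape/area names, not the code from the post): the object stores an 8-bit type id instead of a full vtable pointer, and dispatch goes through a small per-function table.

    #include <cstdint>
    #include <cstdio>

    struct Shape {
        std::uint8_t typeId; // our one-byte "vtable pointer"
        float a, b;
    };

    typedef float (*AreaFn)(const Shape&);

    static float rectArea(const Shape& s)     { return s.a * s.b; }
    static float triangleArea(const Shape& s) { return 0.5f * s.a * s.b; }

    // one table per virtual function, indexed by the 8-bit type id
    static const AreaFn areaTable[] = { rectArea, triangleArea };

    inline float area(const Shape& s) { return areaTable[s.typeId](s); }

    int main() {
        Shape rect = { 0, 2.f, 3.f };
        Shape tri  = { 1, 2.f, 3.f };
        std::printf("%f %f\n", area(rect), area(tri)); // 6.0 and 3.0
        return 0;
    }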

I could not gather the motivation to run tests to see whether there is a speedup, in which cases, and how big it is. According to Alexandrescu, Facebook runs 4% faster because of this magic :).
An advantage is that this way we can also choose where to place this "char" that we use as a vtable pointer -> if virtual methods are not expected to be called often, it is probably better for it not to be at the beginning of the object.

Raytrace Interop

Lately I have been trying to do various fast things with CUDA.
There is often a problem with getting the result back from a GPU app (it is not particularly fast, making it happen asynchronously is a big pain, and sometimes it has to happen very often), so one way to handle this is CUDA/OpenGL interop.

The idea (mine, at least) is to draw something into a texture with a CUDA kernel, and then just draw that texture on the screen with OpenGL. This way, since everything stays in one place (ze GPU), I avoid any reading/writing of video memory back and forth.

As a result of digging through forums, stackoverflow, tutorials && references, here is what came out:

1. YouTube (beware of the video compression) (out-of-core version).
2. GitHub.

I get around 40fps with an interior scene with a big light (Cornell box), on a MacBook Pro, i7 2.6GHz, nVidia 650M, 512×512 resolution, ray depth 8, 1 ray per pixel.

What happens is: the CPU initializes CUDA & OpenGL, launches a CUDA kernel that traces a few rays, the result is written into a texture, and finally it is drawn with OpenGL.
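Here is a rough sketch of that flow, not the code from the GitHub repo – names like registerPBO, renderFrame and raytraceKernel are placeholders, and the usual OpenGL setup (context, texture, pixel buffer object) is assumed to already exist. The PBO is registered with CUDA once; every frame it is mapped, filled by the kernel, unmapped and copied into the GL texture entirely on the GPU.

    #include <cuda_runtime.h>
    #include <cuda_gl_interop.h> // assumes the OpenGL headers and a GL context are already set up

    static cudaGraphicsResource* pboResource = 0;

    // called once, after the GL pixel buffer object (PBO) is created
    void registerPBO(GLuint pboId) {
        cudaGraphicsGLRegisterBuffer(&pboResource, pboId, cudaGraphicsRegisterFlagsWriteDiscard);
    }

    // called every frame
    void renderFrame(GLuint pboId, GLuint texId, int width, int height) {
        // 1. hand the PBO to CUDA and get a device pointer to its memory
        cudaGraphicsMapResources(1, &pboResource, 0);
        uchar4* devPixels = 0;
        size_t numBytes = 0;
        cudaGraphicsResourceGetMappedPointer((void**)&devPixels, &numBytes, pboResource);

        // 2. launch the raytracing kernel here, writing RGBA8 pixels straight into devPixels,
        //    e.g. raytraceKernel<<<grid, block>>>(devPixels, width, height);

        // 3. give the PBO back to OpenGL and refresh the texture from it – no trip through the CPU
        cudaGraphicsUnmapResources(1, &pboResource, 0);
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pboId);
        glBindTexture(GL_TEXTURE_2D, texId);
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0);
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
        // ... then draw a fullscreen quad with texId bound
    }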

The goal of the code above is to be used as a quick startup for some CUDA-driven apps (like raytracers :)).

Pros of my version: almost everything is computed on the GPU (sample generation, random numbers, tracing, accumulating the result), it uses Halton sequences (beware, nVidia holds a patent on them), and it uses CUDA/OpenGL interop (i.e., the image update happens as fast as possible).

Cons: this is not production code but rather a naive implementation of ray tracing; it can be made several times faster and noise-free (there are no shadow rays, no complex sampling), the GPUs are fed from 1 CPU thread, etc.

Singleton & C++11

Every now and then everyone uses the Singleton as a design (anti-)pattern.
Every now and then we write multithreaded programs.
Every now and then we use both of those in one program.

The tricky part then is how to make sure we have exactly one instance of the object we want.
The threads have to agree that one of them creates the instance, while the others stay out of its way.

The naive way to implement this is with a plain lock:
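The original code embed is not included in this text, so here is a minimal sketch of what such a naive version might look like – the class name Engine is made up; (1) and (2) mark the spots referenced in the next paragraph.

    #include <mutex>

    class Engine {
    public:
        static Engine& get() {
            std::lock_guard<std::mutex> guard(mutex); // (1) we pay for a lock on every call
            if (!instance)
                instance = new Engine;                // (2) the only place the lock is really needed
            return *instance;
        }

    private:
        Engine() {}
        static Engine* instance;
        static std::mutex mutex;
    };

    Engine* Engine::instance = 0;
    std::mutex Engine::mutex;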

That way things will work. The problem is that at (1) we lock every time get() is called. That can be very often. What we actually need is one lock – only when the object is created (that is, at (2)).

Back in the day I learned about the double-checked locking method from this book. The idea is basically simple.
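Again a minimal sketch rather than the original embed – the same hypothetical Engine class as above, only get() changes; (1) and (2) mark the spots referenced below.

    // inside the Engine class from the previous sketch
    static Engine& get() {
        if (!instance) {                                  // (1) check without taking the lock
            std::lock_guard<std::mutex> guard(mutex);
            if (!instance)                                // (2) check again, now under the lock
                instance = new Engine;
        }
        return *instance;
    }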

Instead of locking every time, we lock only if there is no instance (1). Then we lock and check again (2) – that way we both get synchronization and almost never lock.
As an approach this works *almost* always, and that is why it is a well-known anti-pattern.

If the pointer is not volatile, the compiler (or the interpreter) can play a nasty trick on us. Although it looks logically perfectly correct, it is not quite so. In C++98, in Java, and probably in other languages as well. Because of various semi-strange optimizations described here.

If the pointer is volatile, then everything happens as expected.
However, it may also be used in other places in the code – the fact that it is volatile can cause overhead exactly there, because those semi-strange optimizations will be blocked in places where that is not needed.

And things around this simple idea have started getting ugly.
Lock, volatile, double-checking, overheads.

Now, to the point. With C++11, besides unique_ptr and move constructors, we get a few more things. Here is an excerpt from the standard.

"… shall wait for completion of the initialization."
So, with the new standard, everything said so far can be replaced with the following.
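The original embed is missing here as well, so this is a minimal sketch of what is meant – the C++11 "magic static":

    class Engine {
    public:
        static Engine& get() {
            static Engine instance; // C++11 guarantees concurrent callers wait for this initialization
            return instance;
        }

    private:
        Engine() {}
    };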

The program itself will take care of making the threads wait for each other correctly.
And solving all the possible problems becomes the job of the people who write compilers :).

P.S.
The new standard also contains this, which can also solve the problem relatively elegantly, with an unknown amount of overhead.
Anyway, pretty cool stuff.