980, 1080, P100

GPU board close shot

This is a follow-up to the “Pascal and Polaris” post I did some time ago.

As you might remember, based on the pre-release benchmarks and some rough estimations, I came to the conclusion that the 1080/1070 (Pascal) seem to have exactly the same performance per clock as the Titan X (Maxwell). They seemed to be about 40% more energy efficient, which is the expected benefit from a new process (16nm) and FinFET transistors. Being cheaper is a bonus.

Well, today we have the Tesla P100 (GP100 core) white paper, as well as the GTX 1080/1070 (GP104 core) white paper. And here are some important things …

Without further ado, here is what a streaming multiprocessor looks like in the P100.

Pascal GP100 SMX design

We can see a few things: there are double precision (DP) units in addition to the 32-bit single precision ones (cores), there are 32K 32-bit registers for every 32 FP32 units (a 1 : 1 ratio), and there is 64 KB of shared memory for 64 FP32 units (a 1 : 1 ratio).

Let's move to the 1080 (GP104).
Pascal GP104 SMX design

We can easily spot a few differences. First, there are no double precision (64-bit) units. Second, there are 16K 32-bit registers for every 32 single precision units (a 1 : 2 ratio), and there is 96 KB of shared memory for 128 cores (a 1 : 1.3 ratio).
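The ratios above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, using the figures quoted in the text; it assumes the register figures are counts of 32-bit registers in thousands and the shared memory figures are in KB. The dictionary keys and the `per_core` helper are names made up for this illustration.

```python
from fractions import Fraction

# Figures as quoted in the text (assumption: registers counted in thousands
# of 32-bit entries, shared memory in KB).
designs = {
    "GP100": {"regs_k": 32, "cores_per_regfile": 32, "shared_kb": 64, "cores_per_sm": 64},
    "GP104": {"regs_k": 16, "cores_per_regfile": 32, "shared_kb": 96, "cores_per_sm": 128},
}

def per_core(design):
    """Return (K registers per core, KB of shared memory per core)."""
    regs = Fraction(design["regs_k"], design["cores_per_regfile"])
    shared = Fraction(design["shared_kb"], design["cores_per_sm"])
    return regs, shared

ratios = {name: per_core(d) for name, d in designs.items()}

for name, (regs, shared) in ratios.items():
    print(f"{name}: {float(regs):.2f}K registers/core, {float(shared):.2f} KB shared/core")
```

With these inputs GP100 comes out at 1K registers and 1 KB of shared memory per core, while GP104 gets 0.5K registers and 0.75 KB per core, which is where the 1 : 2 and 1 : 1.3 ratios come from.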

Now, let me take from the Maxwell white paper what the old Maxwell (Titan X/980/970/etc.) streaming multiprocessors looked like.
Maxwell SMX design

So, do you see the difference between the Maxwell SMX and the GP104 SMX? Well, no need to scroll up, because there is no difference. The GP104 is actually the good old Maxwell architecture built on 16nm transistors instead of 28nm ones, with very few (if any) adjustments. And that's why the performance per clock of GP104 is exactly the same as that of Maxwell GM204. Making an architecture work with two different memory types (GDDR vs HBM) is very hard from what I have heard, and I always asked myself how NVIDIA was going to achieve that. Now I know.

This of course means that all the GPGPU features of Pascal, like native half (16-bit) and double (64-bit) precision support, more registers, more shared memory, shared virtual memory with the CPU, HBM2 and NVLink, will be exclusive to the premium professional Tesla P100 (GP100). Compute preemption, however, seems to be present in the 1080 (GP104). And the Pascal Quadro/Tesla GPUs seem to be quite different from the gamer ones, at last.

My guess is that this will be the case until HBM2 becomes cheaper, which will take a while.

Pascal and Polaris

pascal vs polaris

So the early reports claim that the GTX 1080 is about 25% (1.25x) faster than the Titan X in games. This is achieved, however, with 2560 CUDA cores, compared to the 3072 CUDA cores of the Titan X, which has 20% (1.2x) more. Overall, this gives 50% (1.2 x 1.25 = 1.5) better game performance per core.

However, the 1080 runs at a 1.6 GHz base clock, which is more than 50% higher than the ~1 GHz of the Titan X.

In the best case scenario, we have 0 (zero) improvement in game performance per clock in Pascal compared to Maxwell.

But the 1080 (GP104) gets exactly the same game performance per clock as the Titan X (GM200) at 180W, compared to the 250W of the Titan X, which is a ~40% improvement. This means that the Pascal architecture is most likely exactly as efficient as Maxwell for games, and that this 40% improvement comes from the new 16nm FinFET process (the lower price comes as a bonus).
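The back-of-the-envelope estimate above can be written out as a small Python sketch. All inputs are the rough figures quoted in the text (early benchmark claims and approximate clocks), so the results are only as good as those inputs; the variable names are made up for this illustration.

```python
# Rough inputs quoted in the text (not measured values).
cores_1080, cores_titan_x = 2560, 3072
clock_1080_ghz, clock_titan_x_ghz = 1.6, 1.0   # approximate base clocks
power_1080_w, power_titan_x_w = 180, 250
game_speedup = 1.25                            # 1080 vs Titan X, early reports

# The Titan X has 1.2x as many cores as the 1080.
core_ratio = cores_titan_x / cores_1080

# Performance per core: 1.25x the speed with 1/1.2 of the cores.
perf_per_core = game_speedup * core_ratio

# The 1080 also runs at a higher clock, so divide that out.
clock_ratio = clock_1080_ghz / clock_titan_x_ghz
perf_per_core_per_clock = perf_per_core / clock_ratio

# Power draw of the Titan X relative to the 1080.
power_ratio = power_titan_x_w / power_1080_w

print(f"performance per core:           {perf_per_core:.2f}x")
print(f"clock ratio:                    {clock_ratio:.2f}x")
print(f"performance per core per clock: {perf_per_core_per_clock:.2f}x")
print(f"power ratio (Titan X / 1080):   {power_ratio:.2f}x")
```

With these inputs the per-core gain is about 1.5x, but the clock is 1.6x higher, so per core and per clock the result is about 0.94x, in other words no improvement. The power ratio of about 1.39x is the ~40% figure.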

I expect there to be some (or a lot of) improvement per clock for GPGPU apps, mainly because of the doubled amount of registers, 32k vs 16k (game shaders care about those registers far less than GPGPU apps do), and the native 16-bit floating point support. Pascal has a number of other GPGPU-specific features, like: compute preemption, which allows running GPGPU apps on a single GPU without the OS UI becoming sluggish; shared virtual memory with the CPU; NVLink, which should allow stacking of the GPU memory frame buffers (so 2 GPUs with 16GB of memory each will give 32GB of usable memory); more shared memory. All of these take a lot of transistors to make and don’t really contribute to gaming. Finally, after Kepler and Maxwell, it seems like NVIDIA is focusing not only on games …
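To see why the register count matters for GPGPU code, here is a simplified Python sketch of how many threads can keep their registers resident per core. It assumes "32k vs 16k" means 32-bit registers per group of 32 cores, and it deliberately ignores real hardware limits such as the maximum number of resident warps and register allocation granularity. The registers-per-thread values are hypothetical kernel sizes picked for illustration.

```python
# Registers available per core, assuming one register file per 32 cores.
REGS_PER_CORE = {
    "GP100": 32 * 1024 // 32,
    "GP104": 16 * 1024 // 32,
}

def resident_threads_per_core(regs_per_core, regs_per_thread):
    """Threads whose registers fit at once (simplified, no hardware caps)."""
    return regs_per_core // regs_per_thread

# Hypothetical kernels: light, medium and heavy register usage.
for regs_per_thread in (32, 64, 128):
    line = ", ".join(
        f"{name}: {resident_threads_per_core(regs, regs_per_thread)}"
        for name, regs in REGS_PER_CORE.items()
    )
    print(f"{regs_per_thread} registers/thread -> threads per core: {line}")
```

Under this simplified model a register-heavy kernel keeps twice as many threads in flight per core with the larger register file, which gives the hardware more work to switch to while other threads wait on memory.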

The good thing is that AMD Polaris also seems to be better suited for GPGPU than the current GCN architecture. It will have a new L2 cache (critical for GPGPU) and a new memory controller. On top of that, it seems to have a new dynamic instruction scheduler, which happens to help a lot with complex GPGPU apps (this was last seen in the NVIDIA Fermi GPUs, and was probably one of the reasons the Fermi cores were that powerful). AMD will most likely go for performance per watt, and not necessarily the highest performance (so the Polaris GPUs will be relatively small).

All we have to do is wait and see if all of this will actually work …

P.S. Check the follow-up to this post here.