Wednesday, October 23, 2013

AMD "Hawaii" compute performance extrapolation

Here is a graph of the theoretical peak performance of the current top AMD GPUs. These include the Tahiti GPU known from the HD-7970 and the soon-to-be-released Hawaii GPU at the heart of the AMD R9-290X and R9-290. In this extrapolation each compute element of the GPU is assumed to perform 2 floating-point operations per clock, i.e. 1 MAD (multiply-add) operation per clock.
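The extrapolation above boils down to a single formula: peak GFLOPS = shader count × clock (MHz) × 2 / 1000. A minimal sketch of that calculation follows; the shader counts and clocks used here are commonly quoted reference figures for these GPUs and actual boards may ship at different frequencies.

```python
def peak_gflops(shaders, clock_mhz, flops_per_clock=2):
    """Theoretical peak throughput in GFLOPS.

    flops_per_clock=2 reflects one MAD (multiply-add) per clock
    per compute element, i.e. 2 floating-point operations.
    """
    return shaders * clock_mhz * flops_per_clock / 1000.0

# Illustrative reference figures (assumed, not measured):
gpus = {
    "Tahiti (HD-7970)":  (2048, 925),
    "Hawaii (R9-290X)":  (2816, 1000),
}

for name, (shaders, mhz) in gpus.items():
    print(f"{name}: {peak_gflops(shaders, mhz):.1f} GFLOPS")
```

Plugging in different clocks for the vendor-overclocked variants gives the rest of the diagram's data points.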


Each vendor will probably provide different cards operating at different frequencies, so this diagram could be helpful for anybody who intends to buy a new card for compute.

Wednesday, October 2, 2013

A note on the GPU programming paradigm

GPU programming employs a highly aggressive parallelism model. Anyone who endeavors to program GPUs has to exploit two levels of parallelism: the coarse-grained and the fine-grained parallelism model.

The coarse-grained model is the one that allows scalability, as it relies on blocks of threads that have no way to synchronize with one another. Therefore, they can execute independently and in parallel, feeding all the available compute units of the hardware device. If the programmer specifies enough of them in a kernel invocation, they will keep the hardware highly utilized. It resembles programming the CPU at the thread level, although without any efficient option to synchronize multiple threads together.
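The independence of blocks can be sketched on the CPU with plain Python. This is only an emulation of the launch semantics, not real GPU code: the runtime may run the blocks in any order (here, a simple sequential loop), and each (block, thread) pair recovers its global element index the same way a CUDA or OpenCL kernel would from its block and thread IDs.

```python
def run_kernel(kernel, grid_size, block_size, *args):
    """Invoke `kernel` once per (block, thread) pair.

    Blocks never communicate, so the scheduler is free to run them
    in any order, serially or in parallel -- this is what makes the
    coarse-grained model scale across compute units.
    """
    for block_id in range(grid_size):        # blocks: independent
        for thread_id in range(block_size):  # threads within a block
            kernel(block_id, thread_id, block_size, *args)

def scale_kernel(block_id, thread_id, block_size, data, factor):
    i = block_id * block_size + thread_id    # global element index
    if i < len(data):                        # guard the ragged last block
        data[i] *= factor

data = list(range(10))
run_kernel(scale_kernel, 3, 4, data, 2)      # 3 blocks x 4 threads cover 10 elements
print(data)
```

The bounds check in the kernel is the standard idiom for when the problem size is not an exact multiple of the block size.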

Fine-grained parallelism is like programming down at the SIMD level, as with the SSE or AVX instructions available on modern Intel & AMD x86 CPUs. However, GPU programming is considerably more flexible: using the so-called SIMT (Single Instruction Multiple Thread) model of the CUDA or OpenCL programming environments, one can program without even being aware of the SIMD nature. NVidia GPUs' SIMD width is 32 elements, while AMD GPUs' width is 64 elements. In practice, the thread block size should be a multiple of this number, especially on NVidia GPUs, because memory-access and pipeline latencies can only be hidden when a large number of threads reside on each compute unit. At this level of parallelism, threads are able to synchronize and communicate by exchanging data through the fast shared memory.
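The "multiple of the SIMD width" advice turns into a small piece of launch arithmetic: round the desired block size up to the width of the warp (NVidia) or wavefront (AMD), then cover all elements with enough blocks. A sketch, assuming a hypothetical helper `launch_dims`:

```python
WARP = 32       # NVidia SIMD width (warp)
WAVEFRONT = 64  # AMD SIMD width (wavefront)

def round_up(x, multiple):
    """Smallest multiple of `multiple` that is >= x."""
    return ((x + multiple - 1) // multiple) * multiple

def launch_dims(n_elements, desired_block, simd_width):
    """Pick (grid, block) so that block is SIMD-aligned and
    grid * block covers all n_elements (ceiling division)."""
    block = round_up(desired_block, simd_width)
    grid = (n_elements + block - 1) // block
    return grid, block

# The same problem on the two architectures:
print(launch_dims(10000, 200, WARP))       # -> (45, 224)
print(launch_dims(10000, 200, WAVEFRONT))  # -> (40, 256)
```

A block size that is not SIMD-aligned leaves lanes of every warp or wavefront idle, which is why 224 is a better choice than 200 on NVidia hardware.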

In this sense, GPUs take parallelism to extreme levels. A modern GPU requires more than a thousand threads per compute unit to stay utilized, and it might consist of dozens of compute units.
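To put a rough number on that, multiply the two factors together; the figures below are illustrative assumptions for a Hawaii-class part, not exact occupancy numbers.

```python
compute_units = 44      # assumed CU count for a Hawaii-class GPU
threads_per_cu = 1024   # resident threads needed to hide latency (rough figure)

total_threads = compute_units * threads_per_cu
print(total_threads)    # tens of thousands of concurrent threads
```

Tens of thousands of in-flight threads for a single chip is the scale a kernel launch must supply to keep such hardware busy.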

GPU programming is both fascinating and dirty!