Saturday, February 16, 2013

AMD Sea Islands instruction set documentation is online

A fresh PDF about the Sea Islands GPU instruction set is now available online. Sea Islands is the architecture of the new AMD GPUs yet to be released. Here are some notes found inside:


Differences Between Southern Islands and Sea Islands Devices

Important differences between S.I. and C.I. GPUs
•Multi queue compute
Lets multiple user-level queues of compute workloads be bound to the device and processed simultaneously. Hardware supports up to eight compute pipelines with up to eight queues bound to each pipeline.
•System unified addressing
Allows GPU access to process coherent address space.
•Device unified addressing
Lets a kernel view LDS and video memory as a single addressable memory. It also adds shader instructions, which provide access to “flat” memory space.
•Memory address watch
Lets a shader determine if a region of memory has been accessed.
•Conditional debug
Adds the ability to execute or skip a section of code based on state bits under control of debugger software. This feature adds two bits of state to each wavefront; these bits are initialized by the state register values set by the debugger, and they can be used in conditional branch instructions to skip or execute debug-only code in the kernel.
•Support for unaligned memory accesses
•Detection and reporting of violations in memory accesses

It seems that the Sea Islands architecture will feature multiple hardware queues, similar to what NVidia promotes on Kepler as the "Hyper-Q" technology (see the GK110 whitepaper).
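For reference, here is a minimal host-side sketch, not taken from the AMD document, of the software pattern that such hardware is meant to serve: two OpenCL command queues bound to the same device, with independent kernels enqueued on each so the driver is free to overlap them. The kernel objects and the work size are hypothetical placeholders.

/* Sketch: two independent OpenCL command queues on one device.
 * On hardware with multiple compute pipelines (Hyper-Q on Kepler,
 * multi-queue compute on Sea Islands) independent queues may be
 * serviced concurrently.  kernel_a/kernel_b are placeholders. */
#include <CL/cl.h>

void launch_on_two_queues(cl_context ctx, cl_device_id dev,
                          cl_kernel kernel_a, cl_kernel kernel_b,
                          size_t global_size)
{
    cl_int err;
    cl_command_queue q0 = clCreateCommandQueue(ctx, dev, 0, &err);
    cl_command_queue q1 = clCreateCommandQueue(ctx, dev, 0, &err);

    /* Kernels enqueued on different queues have no implied ordering,
     * so the runtime is free to execute them concurrently. */
    clEnqueueNDRangeKernel(q0, kernel_a, 1, NULL, &global_size, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(q1, kernel_b, 1, NULL, &global_size, NULL, 0, NULL, NULL);

    clFinish(q0);
    clFinish(q1);

    clReleaseCommandQueue(q0);
    clReleaseCommandQueue(q1);
}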

Link: AMD_Sea_Islands_Instruction_Set_Architecture.pdf

UPDATE: For some reason the referenced file is no longer available.

Friday, February 8, 2013

Memory intensive microbenchmarks on GPUs via OpenCL

I have authored a set of micro-benchmarks implemented with the OpenCL API. When they are executed, the optimum values for three basic parameters are estimated: the workgroup size, the thread granularity (by which I mean the number of elements computed by each work-item), and the vector width, i.e. the width of the native vector type used in the kernel. The latter is closely related to the thread granularity, although not exactly the same.
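To make the two latter parameters concrete, below is a minimal OpenCL kernel sketch, not the actual benchmark code: the assumed build-time constant GRANULARITY sets how many elements each work-item processes, while the int4 type fixes the vector width to 4.

/* Sketch only: a vector addition kernel where each work-item processes
 * GRANULARITY elements of the native vector type int4 (vector width 4).
 * GRANULARITY is assumed to be passed as a build option, e.g.
 * -DGRANULARITY=4, and the total element count is assumed to be a
 * multiple of global_size * GRANULARITY * 4. */
__kernel void vecadd_int4(__global const int4 *a,
                          __global const int4 *b,
                          __global int4 *c)
{
    const size_t gid    = get_global_id(0);
    const size_t stride = get_global_size(0);

    /* Thread granularity: each work-item handles GRANULARITY int4
     * elements, strided so that neighbouring work-items touch
     * neighbouring memory locations (coalesced on most GPUs). */
    for (int i = 0; i < GRANULARITY; ++i)
        c[gid + i * stride] = a[gid + i * stride] + b[gid + i * stride];
}

Building variants of such a kernel with int, int2, int4, int8 and with different GRANULARITY values gives the kind of parameter space the micro-benchmarks search over.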

Here are the estimated bandwidths of several memory intensive operations as measured on some contemporary GPUs:

Bandwidth in GB/sec:

Device            reduction   memset   vector addition
NVidia GTX 480       147.37    157.6            149.95
NVidia S2050          95.07   117.71            106.35
NVidia K20X          196.85   200.36            191.71
AMD HD 7750           63.27    38.55             53.07
AMD HD 5870          136.23    106.8            111.33
AMD HD 7970          225.24   156.42            205.26



Each array involved in the computation holds 16M 32-bit integers (i.e. the vector addition adds 16M + 16M elements, and the reduction reduces 16M elements).
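For clarity, here is the bandwidth accounting these figures imply, as a sketch rather than the benchmark's exact code: it counts only the dominant array traffic (the reduction reads one array, the memset writes one array, the vector addition reads two arrays and writes one), assumes 1 GB = 1e9 bytes, and uses placeholder timings in place of host timers or OpenCL profiling events.

/* Sketch of the bandwidth accounting, not the benchmark's exact code.
 * N = 16M 32-bit integers per array; only dominant traffic is counted. */
#include <stdio.h>

#define N (16u * 1024u * 1024u)   /* elements per array */

static double gb_per_sec(double bytes, double seconds)
{
    return bytes / seconds / 1.0e9;
}

int main(void)
{
    /* Placeholder timings; in practice they would come from host timers
     * or OpenCL profiling events around each kernel launch. */
    double t_reduction = 1.0, t_memset = 1.0, t_vecadd = 1.0;

    double bytes_reduction = (double)N * 4.0;        /* read one array      */
    double bytes_memset    = (double)N * 4.0;        /* write one array     */
    double bytes_vecadd    = (double)N * 4.0 * 3.0;  /* read two, write one */

    printf("reduction: %6.2f GB/sec\n", gb_per_sec(bytes_reduction, t_reduction));
    printf("memset:    %6.2f GB/sec\n", gb_per_sec(bytes_memset,    t_memset));
    printf("vecadd:    %6.2f GB/sec\n", gb_per_sec(bytes_vecadd,    t_vecadd));
    return 0;
}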

The Southern Islands based GPUs (HD 7750 & HD 7970) seem to have reduced performance on memory writes. A possible explanation appears in the AMD OpenCL programming guide, which notes that memory writes cannot be coalesced on the SI architecture. I still don't know why AMD reduced the global memory write throughput on SI GPUs.

Another thing to note is that the S2050 had ECC memory enabled, whereas on the K20X it was disabled. According to the NVidia manuals, enabling ECC leads to roughly 20% lower memory bandwidth.