Friday, February 8, 2013

Memory intensive microbenchmarks on GPUs via OpenCL

I have authored a set of micro-benchmarks implemented using the OpenCL API. When they are executed, they estimate the optimal values of three basic tuning parameters: the workgroup size, the thread granularity (by which I mean the number of elements computed by each work-item), and the vector width (the width of the native vector type used in the kernel). The latter is closely related to thread granularity, though not identical.
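To illustrate how these parameters show up in a kernel, here is a hypothetical OpenCL C sketch (not the actual benchmark source): a vector addition that uses the int4 type (vector width = 4) and a strided loop so that each work-item processes several elements (thread granularity). The workgroup size is the third parameter and is chosen on the host, via the local_work_size argument of clEnqueueNDRangeKernel.

```c
/* Hypothetical kernel sketch, not the original benchmark code.
 * Vector width: int4 -> each load/store moves 4 ints at once.
 * Thread granularity: each work-item handles n4/get_global_size(0)
 * elements, strided for coalesced global memory access.          */
__kernel void vecadd_int4(__global const int4 *a,
                          __global const int4 *b,
                          __global int4 *c,
                          const uint n4)   /* number of int4 elements */
{
    const uint gid    = get_global_id(0);
    const uint stride = get_global_size(0);

    for (uint i = gid; i < n4; i += stride)
        c[i] = a[i] + b[i];
}
```

The strided access pattern keeps consecutive work-items touching consecutive int4 elements in each iteration, which is what allows memory accesses to be coalesced on most GPU architectures.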

Here I provide the estimated bandwidth (in GB/s) of some memory intensive operations as measured on some contemporary GPUs:

Device           reduction   memset   vector addition
NVidia GTX 480   147.37      157.60   149.95
NVidia S2050      95.07      117.71   106.35
NVidia K20X      196.85      200.36   191.71
AMD HD 7750       63.27       38.55    53.07
AMD HD 5870      136.23      106.80   111.33
AMD HD 7970      225.24      156.42   205.26

Each array involved in the computation holds 16M 32-bit integers (i.e. the addition adds 16M + 16M elements, and the reduction reduces 16M elements).

The Southern Islands based GPUs (HD 7750 & HD 7970) seem to have reduced performance on memory writes. This is consistent with the AMD OpenCL programming guide, which notes that memory writes cannot be coalesced on the SI architecture. I don't know, though, why AMD reduced the global memory write throughput on SI GPUs.

Another thing to note is that the S2050 had ECC memory enabled, whereas on the K20X it was disabled. According to the NVidia manuals, enabling ECC leads to roughly 20% lower memory bandwidth.


  1. Where are the benchmarks?

    1. The source code of the benchmarks is not polished yet, so I haven't made it public. I intend to do so in the future.