Parallel++: Memory intensive microbenchmarks on GPUs via OpenCL

Friday, February 8, 2013

I have written a set of micro-benchmarks using the OpenCL API. When executed, they estimate the optimum values of three basic parameters: the workgroup size, the thread granularity (by which I mean the number of elements computed by each workitem), and the vector width (the width of the native vector type used in the kernel). The last is closely related to the thread granularity, although not exactly the same.
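To make the three parameters concrete, here is a minimal Python sketch of how they could determine the launch geometry and how an exhaustive search over them might look. This is not the benchmark code itself. The function names, the candidate value lists, and the `time_kernel` callback are all hypothetical, and I am assuming that each workitem processes `granularity` vectors of `vector_width` scalars.

```python
from itertools import product


def launch_geometry(n_elements, workgroup_size, granularity, vector_width):
    """Return (global_size, num_workgroups), or None if the split is uneven.

    Assumption: each workitem handles `granularity` vectors of
    `vector_width` scalars each.
    """
    per_workitem = granularity * vector_width
    per_workgroup = per_workitem * workgroup_size
    if n_elements % per_workgroup != 0:
        return None
    global_size = n_elements // per_workitem
    return global_size, global_size // workgroup_size


def best_configuration(n_elements, time_kernel,
                       workgroup_sizes=(64, 128, 256, 512),
                       granularities=(1, 2, 4, 8),
                       vector_widths=(1, 2, 4, 8, 16)):
    """Exhaustively try every combination and keep the fastest one.

    `time_kernel(wg, gran, vw, global_size)` is a hypothetical callback
    that runs the kernel once and returns the elapsed time in seconds.
    """
    best = None
    for wg, gran, vw in product(workgroup_sizes, granularities, vector_widths):
        geometry = launch_geometry(n_elements, wg, gran, vw)
        if geometry is None:
            continue  # skip combinations that do not divide the data evenly
        elapsed = time_kernel(wg, gran, vw, geometry[0])
        if best is None or elapsed < best[0]:
            best = (elapsed, wg, gran, vw)
    return best
```

In a real run, `time_kernel` would build the kernel for the given vector width, enqueue it with the computed global and local sizes, and time it through OpenCL profiling events.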

Here are the estimated bandwidths for some memory-intensive operations on several contemporary GPUs:

Device	reduction (GB/sec)	memset (GB/sec)	vector addition (GB/sec)
NVidia GTX 480	147.37	157.6	149.95
NVidia S2050	95.07	117.71	106.35
NVidia K20X	196.85	200.36	191.71
AMD HD 7750	63.27	38.55	53.07
AMD HD 5870	136.23	106.8	111.33
AMD HD 7970	225.24	156.42	205.26

Each array involved in the computation holds 16M 32-bit integers (i.e. the addition of 16M + 16M elements, or the reduction of 16M elements).
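For reference, an effective bandwidth figure of this kind is usually derived from the bytes moved and the measured kernel time. The sketch below is my assumption of the bookkeeping, not the benchmark's actual code: it takes 16M to mean 16 × 2^20 elements, 1 GB to mean 10^9 bytes, and counts one array touched for reduction (read) and memset (write), and three for vector addition (two reads plus one write).

```python
N_ELEMENTS = 16 * 2**20   # assumption: "16M" means 16 * 2^20 elements
BYTES_PER_ELEMENT = 4     # 32-bit integers

# Assumed number of full arrays each operation reads or writes.
ARRAYS_TOUCHED = {
    "reduction": 1,        # one input array read (the scalar result is negligible)
    "memset": 1,           # one output array written
    "vector addition": 3,  # two input arrays read, one output array written
}


def effective_bandwidth_gb_s(operation, elapsed_seconds,
                             n_elements=N_ELEMENTS,
                             bytes_per_element=BYTES_PER_ELEMENT):
    """Bytes moved divided by elapsed time, in GB/sec (1 GB = 10^9 bytes)."""
    bytes_moved = n_elements * bytes_per_element * ARRAYS_TOUCHED[operation]
    return bytes_moved / elapsed_seconds / 1e9
```

Under these assumptions, a vector addition that takes exactly 1 ms moves 201,326,592 bytes, which corresponds to about 201.3 GB/sec.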

The Southern Islands based GPUs (7750 & 7970) seem to have reduced performance on memory writes. This may be explained by the AMD OpenCL programming guide, which notes that memory writes cannot be coalesced on the SI architecture. I don't know, though, why AMD had to reduce the global memory write throughput on SI GPUs.
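One rough way to see the write penalty in the table is to divide the memset bandwidth by the reduction bandwidth for each device. The sketch below does that with the figures copied from the table. Treating memset as a write-only proxy and reduction as a read-only proxy is my simplification, and the names are illustrative.

```python
# Bandwidth figures (GB/sec) copied from the results table.
results = {
    "NVidia GTX 480": {"reduction": 147.37, "memset": 157.6},
    "NVidia S2050":   {"reduction": 95.07,  "memset": 117.71},
    "NVidia K20X":    {"reduction": 196.85, "memset": 200.36},
    "AMD HD 7750":    {"reduction": 63.27,  "memset": 38.55},
    "AMD HD 5870":    {"reduction": 136.23, "memset": 106.8},
    "AMD HD 7970":    {"reduction": 225.24, "memset": 156.42},
}


def write_to_read_ratio(row):
    """memset (write-only) bandwidth relative to reduction (read-only)."""
    return row["memset"] / row["reduction"]


ratios = {device: write_to_read_ratio(row) for device, row in results.items()}

for device, ratio in ratios.items():
    print(f"{device:15s} memset/reduction = {ratio:.2f}")
```

With these numbers the ratio is above 1 on all three NVidia devices and below 1 on all three AMD devices, with the two SI parts (7750 and 7970) showing the lowest values.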

Another thing to note is that the S2050 had ECC memory enabled, whereas on the K20X it was disabled. According to the NVidia manuals, enabling ECC leads to ~20% lower memory bandwidth.

2 comments:

Anonymous, January 28, 2015 at 12:05 PM
Where are the benchmarks?