Wednesday, August 14, 2013

My new teraflopper accelerator (?)

Although the title above would have seemed eye-catching some years ago, modern GPUs offer multiple teraflops of compute power. Thus, my "slight" upgrade from a GTX465 (Fermi based) to a GTX660 (Kepler based) GPU is not such remarkable news.

Now, let me elaborate on what I mean by the word "slight". I mainly use the GPU as a development device for the CUDA and OpenCL environments (actually, I almost never use it for playing games). Therefore, I'm particularly interested in the compute performance of these cards. As various reviews have shown, Fermi based cards have proven more efficient in compute workloads than Kepler based cards. However, Kepler GPUs tend to be more power efficient.

[Image: The older GTX465]
[Image: GTX660]

Nevertheless, the GTX660 provides multiple benefits over the GTX465. First, the available device memory is doubled: 2GB instead of 1GB. This is important as I can run compute kernels on bigger workloads. Additionally, the memory bandwidth is significantly higher, 144GB/sec instead of 102GB/sec, which proves useful for memory bound kernels. Regarding compute throughput, the GTX465 peaks at 855 GFLOPS in single precision operations whilst the GTX660 peaks at a decent 1880 GFLOPS. All this comes in just 140W TDP instead of 200W TDP, which is why the GTX660 requires just one 6-pin power connector instead of two.
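
To put the power figures in perspective, here is a small sketch that turns the peak numbers quoted above into single precision GFLOPS per watt. It assumes that TDP is a fair stand-in for actual power draw under load, which is only an approximation; the dictionary layout and function name are mine, just for illustration.

```python
# Peak figures quoted in the text (single precision GFLOPS, TDP in watts).
cards = {
    "GTX465": {"sp_gflops": 855.0, "tdp_w": 200.0},
    "GTX660": {"sp_gflops": 1880.0, "tdp_w": 140.0},
}


def gflops_per_watt(card):
    """Peak single precision throughput divided by TDP (assumed power draw)."""
    return card["sp_gflops"] / card["tdp_w"]


for name, spec in cards.items():
    print(f"{name}: {gflops_per_watt(spec):.2f} GFLOPS/W")

# How much more peak throughput per watt the newer card offers.
ratio = gflops_per_watt(cards["GTX660"]) / gflops_per_watt(cards["GTX465"])
print(f"GTX660 vs GTX465: {ratio:.2f}x peak GFLOPS per watt")
```

On paper this works out to roughly three times the peak single precision throughput per watt, though measured kernels rarely reach peak.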

Now, the drawback. The GTX660 exhibits significantly lower peak performance in double precision operations, as NVIDIA tends to cripple the double precision potential of its desktop Kepler GPUs, limiting it to 1/24 of the single precision rate rather than the 1/8 it used in the first Fermi generation. This yields 78 GFLOPS for the GTX660 and 106 GFLOPS for the GTX465.
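
The double precision figures follow directly from the single precision peaks and the two ratios. A quick sketch of the arithmetic, using only the numbers quoted in this post (the variable names are mine):

```python
# Peak single precision throughput in GFLOPS, as quoted in the text.
sp_peak = {"GTX465": 855.0, "GTX660": 1880.0}

# Double precision rate as a fraction of the single precision rate.
dp_ratio = {"GTX465": 1 / 8, "GTX660": 1 / 24}

# Peak double precision throughput = SP peak * DP/SP ratio.
dp_peak = {name: sp_peak[name] * dp_ratio[name] for name in sp_peak}

for name, gflops in dp_peak.items():
    print(f"{name}: {gflops:.1f} GFLOPS peak in double precision")
```

The results land within a GFLOP of the rounded values quoted above, so the newer card has the lower double precision peak despite its much higher single precision peak.
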
I definitely wanted a GPU with CUDA support, as most of my work is based on it, so I couldn't consider getting a Radeon GPU. However, in the future I intend to focus on open standards, i.e. OpenCL.

OpenCL performance of GTX 660

After upgrading my GPU card I wanted to run some OpenCL benchmarks. I have developed a small tool that runs a few benchmarks, identifying for each kernel the best configuration in terms of workgroup size, vector size and workitem granularity. I also included results from a Radeon HD7750, which is an even less power hungry card as it does not require a power connector at all.
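
The idea of searching for the best configuration can be sketched as a plain exhaustive search. This is not the code of my tool, only a minimal illustration: `find_best_config`, the candidate values and the `fake_measure` cost model are all hypothetical, and in a real run `measure` would launch the OpenCL kernel with the given configuration and return the elapsed time.

```python
import itertools


def find_best_config(measure, search_space):
    """Try every combination in search_space and keep the fastest one.

    measure(config) must return the elapsed time (in seconds) for one
    run of the kernel with that configuration.
    """
    names = list(search_space)
    best_config, best_time = None, float("inf")
    for values in itertools.product(*(search_space[n] for n in names)):
        config = dict(zip(names, values))
        elapsed = measure(config)
        if elapsed < best_time:
            best_config, best_time = config, elapsed
    return best_config, best_time


# Hypothetical candidate values for the three tuning parameters.
search_space = {
    "workgroup_size": [64, 128, 256],
    "vector_size": [1, 2, 4],
    "items_per_workitem": [1, 4, 16],
}


def fake_measure(config):
    # Made-up cost model, only so the sketch runs without a GPU.
    return (1.0
            + abs(config["workgroup_size"] - 128) / 128
            + abs(config["vector_size"] - 2)
            + abs(config["items_per_workitem"] - 4) / 4)


best, seconds = find_best_config(fake_measure, search_space)
print(best, seconds)
```

An exhaustive search is affordable here because the search space is tiny (27 combinations in this sketch); the best configuration generally differs per kernel and per device.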

[Image: AMD HD7750]

Benchmark results:

| GPU | Reduction (GB/sec) | Leibniz formula for PI computation, SP (GFLOPS) | Leibniz formula for PI computation, DP (GFLOPS) |
|--------|--------|--------|-------|
| GTX465 | 89.56 | 304.26 | 26.56 |
| GTX660 | 108.12 | 656.08 | 21.77 |
| HD7750 | 63.15 | 214.45 | 13.81 |
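
For reference, the Leibniz formula computes PI as 4 * (1 - 1/3 + 1/5 - 1/7 + ...). The following is a plain CPU sketch of that computation, not the OpenCL kernel used for the numbers above. It splits the series into chunks and then reduces the partial sums, which is roughly how the work could be divided among workitems; the function names and the chunking scheme are assumptions for illustration.

```python
import math


def leibniz_partial(start, count):
    """Sum `count` terms of (-1)^k / (2k + 1), beginning at k = start."""
    return math.fsum(
        (-1.0 if k % 2 else 1.0) / (2 * k + 1)
        for k in range(start, start + count)
    )


def leibniz_pi(total_terms, chunks):
    """Approximate PI by summing the series in `chunks` equal pieces."""
    if total_terms % chunks != 0:
        raise ValueError("total_terms must be a multiple of chunks")
    per_chunk = total_terms // chunks
    # Each chunk plays the role of one workitem producing a partial sum.
    partials = [leibniz_partial(i * per_chunk, per_chunk) for i in range(chunks)]
    # Final reduction of the partial sums.
    return 4.0 * math.fsum(partials)


approx = leibniz_pi(1_000_000, 64)
print(approx, abs(approx - math.pi))
```

The series converges slowly (the error after N terms is on the order of 1/N), which makes it a poor way to compute PI but a convenient arithmetic-heavy workload with almost no memory traffic.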

Result chart:

It doesn't seem so bad. Memory bandwidth is improved and single precision throughput is more than doubled! As predicted, double precision throughput exhibits a noticeable drop, but it can be tolerated.
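
The relative changes can be checked with a few lines of arithmetic on the measured results. The sketch below also relates the measurements to the theoretical peaks quoted earlier; comparing the reduction benchmark against peak memory bandwidth is my own assumption about what makes a sensible reference, and the dictionary keys are just illustrative names.

```python
# Measured results from the benchmark table.
measured = {
    "GTX465": {"reduction_gbs": 89.56, "sp_gflops": 304.26, "dp_gflops": 26.56},
    "GTX660": {"reduction_gbs": 108.12, "sp_gflops": 656.08, "dp_gflops": 21.77},
}

# Theoretical peaks quoted earlier in the post.
peak = {
    "GTX465": {"reduction_gbs": 102.0, "sp_gflops": 855.0},
    "GTX660": {"reduction_gbs": 144.0, "sp_gflops": 1880.0},
}

# GTX660 result divided by GTX465 result, per benchmark.
speedup = {
    metric: measured["GTX660"][metric] / measured["GTX465"][metric]
    for metric in measured["GTX465"]
}

# Fraction of the theoretical peak each card actually reached.
fraction_of_peak = {
    card: {m: measured[card][m] / peak[card][m] for m in peak[card]}
    for card in peak
}

for metric, value in speedup.items():
    print(f"{metric}: {value:.2f}x")
for card, fractions in fraction_of_peak.items():
    for metric, value in fractions.items():
        print(f"{card} {metric}: {value:.0%} of peak")
```

This gives roughly 1.2x for the reduction, a bit over 2.1x for single precision and about 0.8x for double precision.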

Peak throughput numbers were taken from the respective Wikipedia pages:
http://en.wikipedia.org/wiki/GeForce_400_Series
http://en.wikipedia.org/wiki/GeForce_600_Series
