Wednesday, August 14, 2013

My new teraflopper accelerator (?)

Although the title above might have seemed eye-catching some years ago, modern GPUs offer multiple teraflops of compute power. Thus, my "slight" upgrade from a GTX465 (Fermi based) to a GTX660 (Kepler based) GPU appears as not so remarkable news.

Now, let me elaborate on what I mean by the word "slight". I mainly use the GPU as a development device for the CUDA and OpenCL environments (actually, I almost never use it for playing games). Therefore, I'm particularly interested in the compute performance of these cards. As shown in various reviews, Fermi based cards have proven quite capable in compute workloads, often more so than their Kepler based successors. However, Kepler GPUs tend to be more power efficient.

The older GTX465
GTX660
Nevertheless, the GTX660 provides multiple benefits over the GTX465. First, the available memory on the device is doubled, 2GB instead of 1GB. This is important as I can run compute kernels on bigger workloads. Additionally, the memory bandwidth is significantly higher, 144GB/sec instead of 102GB/sec, which proves useful for memory bound kernels. Regarding compute throughput, the GTX465 features 855GFlops in single precision operations whilst the GTX660 features a decent 1880GFlops. All this in just 140W TDP instead of 200W TDP, which is why the GTX660 requires just one 6-pin power connector instead of 2.

Now, the drawback. The GTX660 exhibits significantly lower peak performance in double precision operations, as NVidia tends to cripple the double precision potential of its desktop products: on Kepler GPUs the double precision rate is capped at 1/24 of the single precision rate, rather than the 1/8 ratio of the first Fermi generation. This yields about 78GFlops for the GTX660 (1880/24) versus 106GFlops for the GTX465 (855/8).
I definitely wanted a GPU with CUDA support, as most of my work is based on it, so I couldn't consider getting a Radeon GPU. However, in the future I intend to focus on open standards, i.e. OpenCL.

OpenCL performance of GTX 660

Having upgraded my GPU, I wanted to run some OpenCL benchmarks. I have developed a small tool that runs a set of benchmarks and identifies the best configuration in terms of workgroup size, vector size and workitem granularity for each kernel; a sketch of the tuning idea is shown below. I also included results from a Radeon HD7750, which is an even less power hungry card as it does not require a power connector at all.

AMD HD7750
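
For illustration, the tuning works in the spirit of the following sketch, which times a kernel under a few candidate workgroup sizes using OpenCL profiling events. This is not the actual tool's code; the function name and candidate sizes are assumptions, the command queue is assumed to have been created with CL_QUEUE_PROFILING_ENABLE, and the kernel arguments are assumed to be already set:

/* Sketch of a workgroup-size sweep timed via OpenCL profiling events
   (illustrative, not the benchmark tool itself). */
#include <CL/cl.h>
#include <stdio.h>

size_t best_workgroup_size(cl_command_queue queue, cl_kernel kernel,
                           size_t global_size)
{
    const size_t candidates[] = {64, 128, 256, 512};
    const int ncand = (int)(sizeof(candidates) / sizeof(candidates[0]));
    size_t best = 0;
    cl_ulong best_ns = (cl_ulong)-1;

    for (int i = 0; i < ncand; i++) {
        size_t local = candidates[i];
        if (global_size % local) continue;      /* global must divide evenly */

        cl_event ev;
        if (clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size,
                                   &local, 0, NULL, &ev) != CL_SUCCESS)
            continue;                           /* e.g. local size too large */
        clWaitForEvents(1, &ev);

        cl_ulong start, end;
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                                sizeof(start), &start, NULL);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                                sizeof(end), &end, NULL);
        clReleaseEvent(ev);

        cl_ulong ns = end - start;
        printf("local size %zu: %llu ns\n", local, (unsigned long long)ns);
        if (ns < best_ns) { best_ns = ns; best = local; }
    }
    return best;
}

The vector size and workitem granularity can be swept in a similar way, e.g. by rebuilding the kernel with different preprocessor definitions.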

Benchmark results:
Device     Reduction     Leibniz formula for PI (SP)     Leibniz formula for PI (DP)
           (GB/sec)      (GFLOPS)                        (GFLOPS)
GTX465     89.56         304.26                          26.56
GTX660     108.12        656.08                          21.77
HD7750     63.15         214.45                          13.81

Result chart:

It doesn't seem so bad. Memory bandwidth is improved and the single precision throughput is more than doubled! As expected, double precision shows a noticeable performance drop, but it can be tolerated.
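
For reference, the arithmetic behind the Leibniz benchmark can be expressed as a kernel as simple as the sketch below, where each workitem accumulates a strided slice of the series pi/4 = 1 - 1/3 + 1/5 - 1/7 + ... This is only an illustration; my actual kernels are additionally parameterized for vector size and granularity:

/* Illustrative single precision Leibniz kernel (not the exact benchmark code).
   Each work-item accumulates terms_per_item strided terms of the series;
   the partial sums are reduced afterwards and the total multiplied by 4. */
__kernel void leibniz_pi_sp(__global float *partial, const int terms_per_item)
{
    const int gid = (int)get_global_id(0);
    const int gsz = (int)get_global_size(0);
    float sum = 0.0f;

    for (int i = 0; i < terms_per_item; i++) {
        int k = gid + i * gsz;                  /* term index */
        float sign = (k & 1) ? -1.0f : 1.0f;    /* (-1)^k */
        sum += sign / (2.0f * k + 1.0f);
    }
    partial[gid] = sum;
}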

Peak throughput numbers were taken from the respective Wikipedia pages:
http://en.wikipedia.org/wiki/GeForce_400_Series
http://en.wikipedia.org/wiki/GeForce_600_Series

Friday, August 2, 2013

Petition for AMD to improve its GPU driver

I would like to encourage everyone who is serious about GPU computing to sign the following petition urging AMD to improve its driver. If this happens we would all benefit: AMD products would prove a viable GPU computing platform and NVidia would be forced to drop its prices or even provide more compute oriented GPUs at affordable prices. The "Titan" is not an option for me even if it is considered a consumer product.

Petition URL:
http://www.change.org/petitions/advanced-micro-devices-fix-bugs-in-opencl-compiler-drivers-and-eventually-opensource-them

Source:
http://streamcomputing.eu/blog/2013-08-01/amd-fix-bugs-in-opencl-compilerdrivers/

Saturday, May 11, 2013

Volcanic Islands to erupt at the end of the year?

Update:
Sadly, this post is probably based on old speculation posted in a forum rather than on reliable information. For more info click here:
http://semiaccurate.com/2013/05/20/amds-volcanic-islands-architecture/

The coming months seem set to bring fascinating GPU disclosures. AMD is rumored to unveil a new GPU architecture (named Volcanic Islands) in Q4 2013 which, according to the purportedly leaked diagram seen below, will contain a vast number of parallel compute elements (4096) plus 16 serial processors. Hopefully, these serial processors will alleviate the serial code execution bottlenecks evident on GPUs. Thus, GPU compute could be further adopted for algorithms containing interleaved serial parts.

Volcanic Islands (Serial Processing Modules and Parallel Compute Modules)
Volcanic Islands block diagram


It is not known whether this information is correct, but it highlights the trend of GPUs towards compute. Whatever the truth is, I hope it will push AMD's rival to strengthen the GPU computing capabilities of its desktop products, which proved weak in its latest generation in favour of gaming and efficiency.

Wednesday, April 17, 2013

About bitcoin mining performance on GPUs

Traditionally, when it comes to bitcoin mining performance the results favour AMD GPUs. There is an interesting article about the mining performance of contemporary GPUs (HD7970, GTX Titan) which aims to explain the extreme performance discrepancy between the AMD and NVidia architectures in this workload.

link: http://www.extremetech.com/computing/153467-amd-destroys-nvidia-bitcoin-mining

Monday, March 4, 2013

FUSE emulator on CRT TV screen

The FUSE emulator looks more realistic when run on a CRT TV screen. That's because the original ZX Spectrum featured an RF modulator so that an ordinary TV could be used as its monitor. Here I use a Raspberry Pi to run the emulator and it runs really well!


Saturday, February 16, 2013

AMD Sea Islands instruction set documentation is online

A fresh PDF about the Sea Islands GPU instruction set is now available online. Sea Islands is the architecture of the upcoming AMD GPUs yet to be released. Here are some notes found inside:


Differences Between Southern Islands and Sea Islands Devices

Important differences between S.I. and C.I. GPUs
•Multi queue compute
Lets multiple user-level queues of compute workloads be bound to the device and processed simultaneously. Hardware supports up to eight compute pipelines with up to eight queues bound to each pipeline.
•System unified addressing
Allows GPU access to process coherent address space.
•Device unified addressing
Lets a kernel view LDS and video memory as a single addressable memory. It also adds shader instructions, which provide access to “flat” memory space.
•Memory address watch
Lets a shader determine if a region of memory has been accessed.
•Conditional debug
Adds the ability to execute or skip a section of code based on state bits under control of debugger software. This feature adds two bits of state to each wavefront; these bits are initialized by the state register values set by the debugger, and they can be used in conditional branch instructions to skip or execute debug-only code in the kernel.
•Support for unaligned memory accesses
•Detection and reporting of violations in memory accesses

It seems that the Sea Islands architecture will feature multiple hardware queues, similar to what NVidia promotes on Kepler as the "Hyper-Q" technology (see the GK110 whitepaper).
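
From the OpenCL host side, exploiting such a feature would simply mean binding several independent command queues to the device, as in the sketch below. This is only an illustration under my own assumptions (function name, queue count, no error checking); whether the queues actually map to separate hardware pipelines is entirely up to the driver:

/* Illustrative sketch: several independent in-order command queues bound to
   one device, each fed an independent kernel, so a driver with multi-queue
   support could overlap them. Error checking omitted for brevity. */
#include <CL/cl.h>

#define NUM_QUEUES 4

void launch_on_queues(cl_context ctx, cl_device_id dev,
                      cl_kernel kernels[NUM_QUEUES], size_t global_size)
{
    cl_command_queue queues[NUM_QUEUES];
    int i;

    for (i = 0; i < NUM_QUEUES; i++)
        queues[i] = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* One independent kernel per queue; no cross-queue dependencies. */
    for (i = 0; i < NUM_QUEUES; i++)
        clEnqueueNDRangeKernel(queues[i], kernels[i], 1, NULL,
                               &global_size, NULL, 0, NULL, NULL);

    for (i = 0; i < NUM_QUEUES; i++) {
        clFinish(queues[i]);
        clReleaseCommandQueue(queues[i]);
    }
}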

Link: AMD_Sea_Islands_Instruction_Set_Architecture.pdf

UPDATE: For some reason the referenced file is not available anymore.

Friday, February 8, 2013

Memory intensive microbenchmarks on GPUs via OpenCL

I have authored a set of micro-benchmarks implemented with the OpenCL API. When they are executed, the optimum set of values for 3 basic parameters is estimated: the workgroup size, the thread granularity (by which I refer to the number of elements computed by each workitem) and the vector width (the width of the native vector type used in the kernel). The latter is closely related to the thread granularity, although not exactly the same.
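
To make the last two parameters concrete, here is an illustrative vector addition kernel, not the actual benchmark code, using a vector width of 4 (the int4 type) and a compile-time granularity of GRANULARITY vector elements per workitem:

/* Illustrative vector addition kernel (not the exact benchmark code).
   Vector width: 4 (int4). Granularity: GRANULARITY int4 elements per
   work-item, accessed with a stride of the global size so that neighbouring
   work-items touch neighbouring elements (friendly to coalescing).
   The host launches N / (4 * GRANULARITY) work-items for N ints per array. */
#define GRANULARITY 4

__kernel void vecadd_int4(__global const int4 *a,
                          __global const int4 *b,
                          __global int4 *c)
{
    const size_t gid = get_global_id(0);
    const size_t gsz = get_global_size(0);

    for (int i = 0; i < GRANULARITY; i++) {
        size_t idx = gid + (size_t)i * gsz;
        c[idx] = a[idx] + b[idx];
    }
}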

Here are some results of the estimated bandwidth of memory intensive operations on several contemporary GPUs:

Device            Reduction     Memset     Vector addition
                  (GB/sec)      (GB/sec)   (GB/sec)
NVidia GTX 480    147.37        157.6      149.95
NVidia S2050      95.07         117.71     106.35
NVidia K20X       196.85        200.36     191.71
AMD HD 7750       63.27         38.55      53.07
AMD HD 5870       136.23        106.8      111.33
AMD HD 7970       225.24        156.42     205.26



Each array involved in the computation holds 16M 32-bit integers, i.e. 64MB per array (the vector addition adds two 16M-element arrays, the reduction reduces 16M elements).

The Southern Islands based GPUs (7750 & 7970) seem to have reduced performance on memory writes. This is hinted at in the AMD OpenCL programming guide, where it is noted that memory writes cannot be coalesced on the SI architecture. I don't know, though, why AMD had to reduce the global memory write throughput on SI GPUs.
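
For context, the memset benchmark measures exactly this write-only pattern; in spirit it is as simple as the following illustrative kernel (not the exact benchmark code), where every workitem performs only a global store:

/* Illustrative write-only kernel in the spirit of the memset benchmark.
   Every work-item only stores, so the measured bandwidth reflects pure
   global memory write throughput. */
__kernel void memset_int(__global int *dst, const int value)
{
    dst[get_global_id(0)] = value;
}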

Another thing to note is that the S2050 had ECC memory enabled whereas on the K20X it was disabled. According to the NVidia manuals, enabling ECC leads to ~20% lower memory bandwidth.