Saturday, December 14, 2013

A silly(?) prediction of a future CPU


First, let me warn you that what follows is not based on any recent scientific discovery but purely on my imagination. Nowadays more and more computational units are being fused into the CPU package, and the CPU already incorporates GPU elements. So here is a (ridiculous?) projection of a future central processing unit. It could contain a variety of combined cores, for instance:
1) Serial compute cores (just like classic CPU cores)
2) Massively parallel cores (just like GPU compute units)
3) FPGA cores (for even more specialized tasks)
4) Quantum cores (???!! for NP-complete search problems?)

The last element is certainly not possible to produce with current technology, but in the future... who knows. It might turn out to be just another component of the CPU package. What would it be called? A QAPU?

Monday, November 4, 2013

More affordable scientific computing on GPUs

We have good news for affordable scientific computing on GPUs. Even though the AMD R9-290X made a step back by reducing double precision performance to 1/8 of single precision performance, this makes the R9-280X (with its 1/4 DP to SP ratio) a true bargain: a ~1 TFLOPS double precision performer for just $300. Furthermore, NVidia's upcoming GTX-780Ti GPU is rumored to have unlocked double precision units*, turning it into a true number cruncher (more than 1.5 TFLOPS in DP compute) for a respectable $699, which is still much less than GTX-Titan's price.


Update: Unfortunately, it seems that the DP compute capability of the GTX 780Ti is controversial. According to other information**, which seems valid, the DP potential is still limited to 1/24 of SP. This proves the first link below inaccurate. Too bad for low budget researchers!

Sources:
*: http://videocardz.com/47576/nvidia-geforce-gtx-780-ti-official-specifications
**: http://www.tomshardware.com/reviews/geforce-gtx-780-ti-review-benchmarks,3663.html#xtor=RSS-182

Wednesday, October 23, 2013

AMD "Hawaii" compute performance extrapolation

Here is a graph of the theoretical peak performance of the current top AMD GPUs. These include the Tahiti GPU known from the HD-7970 and the soon to be released Hawaii GPU at the heart of the AMD R9-290X and R9-290. In this extrapolation each compute element in the GPU is assumed to perform 2 floating point operations per clock, i.e. 1 MAD (multiply-add) operation per clock.


Each vendor will probably provide different cards operating at different frequencies, so this diagram could be helpful for anybody who intends to buy a new card for compute.
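The extrapolation boils down to a single formula: peak GFLOPS = compute elements × 2 FLOPs/clock × clock in GHz. A minimal sketch in plain C (the function name is mine; the Tahiti figures in the usage note are the publicly listed specs):

```c
/* Theoretical peak single precision throughput in GFLOPS.
 * Each compute element is assumed to perform 2 floating point
 * operations (1 multiply-add) per clock. */
double peak_gflops(int compute_elements, double clock_ghz) {
    return compute_elements * 2.0 * clock_ghz;
}
```

For the HD-7970 (Tahiti: 2048 stream processors at 925 MHz) this gives about 3789 GFLOPS, in line with the advertised 3.79 TFLOPS figure.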

Wednesday, October 2, 2013

A note on the GPU programming paradigm

GPU programming employs a highly aggressive parallelism model. Anyone who endeavors to program GPUs has to exploit two levels of parallelism: the coarse grained and the fine grained model.









The coarse grained model is the one that allows scalability, as it relies on blocks of threads which have no way to synchronize with each other. Therefore, they are able to execute independently in parallel, feeding all the available compute units of the hardware device. If the programmer has specified enough of them in a kernel invocation, they will keep the hardware highly utilized. It resembles programming the CPU at the thread level, although without any efficient option for synchronizing multiple threads together.

Fine grained parallelism is like programming down to the SIMD level, as provided by the SSE or AVX instructions available on modern Intel & AMD x86 CPUs. However, programming is considerably more flexible: using the so called SIMT (Single Instruction Multiple Thread) model of the CUDA or OpenCL programming environments, one can program without even being aware of the SIMD nature. NVidia GPUs' SIMD width is 32 elements while AMD GPUs' width is 64 elements. In practice, the thread block size should be a multiple of this number, especially on NVidia GPUs, because memory access and pipeline latencies require a large number of threads on the compute unit in order to be hidden. At this parallelism level, threads are able to synchronize and communicate by exchanging data through the fast shared memory.
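The two levels can be illustrated with the index arithmetic that every CUDA/OpenCL kernel implicitly relies on. A plain C sketch (the function names are mine, not part of either API):

```c
/* Global index of a work-item: which element of the problem this
 * thread handles. In OpenCL terms this is
 *   get_group_id(0) * get_local_size(0) + get_local_id(0). */
int global_id(int group_id, int local_size, int local_id) {
    return group_id * local_size + local_id;
}

/* Number of workgroups needed to cover n elements, rounded up so
 * the grid covers the whole problem even when n is not a multiple
 * of the workgroup size. */
int num_groups(int n, int local_size) {
    return (n + local_size - 1) / local_size;
}
```

With a workgroup size of 64 (a multiple of both the 32-wide NVidia warp and the 64-wide AMD wavefront), covering 1000 elements needs num_groups(1000, 64) = 16 groups; the last group simply masks off its out-of-range threads.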

In this sense, GPUs take parallelism to extreme levels. A modern GPU requires more than a thousand threads per compute unit to stay utilized, and it might consist of dozens of compute units.

GPU programming is both fascinating and dirty!





Wednesday, August 14, 2013

My new teraflopper accelerator (?)

Although the title above might have seemed eye catching some years ago, modern GPUs now offer multiple teraflops of compute power. Thus, my "slight" upgrade from a GTX465 (Fermi based) to a GTX660 (Kepler based) GPU is not such remarkable news.

Now, let me elaborate on what I mean by the word "slight". I mainly use the GPU as a development device for the CUDA and OpenCL environments (actually, I almost never use it for playing games). Therefore, I'm particularly interested in the compute performance of these cards. As various reviews have shown, Fermi based cards have proven more efficient in compute than Kepler based cards. However, Kepler GPUs tend to be more power efficient.

The older GTX465
GTX660
Nevertheless, the GTX660 provides multiple benefits over the GTX465. First, the available on-device memory is doubled: 2GB instead of 1GB. This is important as I can run compute kernels on bigger workloads. Additionally, the memory bandwidth is significantly higher, 144GB/sec instead of 102GB/sec, which proves useful for memory bound kernels. Regarding compute throughput, the GTX465 features 855GFlops in single precision operations whilst the GTX660 features a decent 1880GFlops. All this in just a 140W TDP instead of 200W, which is why the GTX660 requires just one 6-pin power connector instead of two.

Now, the drawback. The GTX660 exhibits significantly lower peak performance in double precision operations, as NVidia tends to cripple the double precision potential of its Kepler desktop products by a factor of 1/24, rather than the 1/8 it applied in the first Fermi generation. This yields 78GFlops for the GTX660 versus 106GFlops for the GTX465.
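These figures follow directly from the SP peaks and the vendor-imposed DP:SP ratio; a quick check in C (using the 1880 and 855 GFlops SP peaks quoted above):

```c
/* Peak double precision throughput derived from the single precision
 * peak and the DP:SP divider imposed by the vendor
 * (24 for desktop Kepler, 8 for first generation Fermi). */
double dp_gflops(double sp_gflops, int divider) {
    return sp_gflops / divider;
}
```

Indeed, 1880/24 ≈ 78.3 GFlops for the GTX660 and 855/8 ≈ 106.9 GFlops for the GTX465.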
I definitely wanted a GPU with CUDA support as most of my work is based on it, so I couldn't consider getting a Radeon GPU. However, in the future I intend to focus on open standards, i.e. OpenCL.

OpenCL performance of GTX 660

As I upgraded my GPU card I wanted to run some OpenCL benchmarks. I have developed a small tool that performs some benchmarks by identifying the best configuration in terms of workgroup size, vector size and workitem granularity for each kernel. I also included results from a Radeon HD7750, an even less power hungry card as it does not require a power connector at all.

AMD HD7750

Benchmark results:

Device   Reduction (GB/sec)   Leibniz PI SP (GFLOPS)   Leibniz PI DP (GFLOPS)
GTX465   89.56                304.26                   26.56
GTX660   108.12               656.08                   21.77
HD7750   63.15                214.45                   13.81

Result chart:

It doesn't seem so bad. Memory bandwidth is improved and single precision performance is more than doubled! As predicted, double precision shows a noticeable performance drop, but it can be tolerated.
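For reference, the Leibniz benchmark evaluates the series π = 4·Σ(-1)^k/(2k+1). Here is a serial C version of what the OpenCL kernels compute (the actual kernels split the same sum across workitems):

```c
/* Serial Leibniz series: pi = 4 * sum_{k=0}^{n-1} (-1)^k / (2k+1).
 * The GPU benchmark computes the same sum split across workitems. */
double leibniz_pi(long n) {
    double sum = 0.0;
    for (long k = 0; k < n; k++)
        sum += (k % 2 == 0 ? 1.0 : -1.0) / (2.0 * k + 1.0);
    return 4.0 * sum;
}
```

Convergence is very slow (the error after n terms is on the order of 1/n), which is exactly what makes it a good raw-FLOPS benchmark: lots of arithmetic and almost no memory traffic.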

Peak throughput numbers were taken from the respective wikipedia pages:
http://en.wikipedia.org/wiki/GeForce_400_Series
http://en.wikipedia.org/wiki/GeForce_600_Series

Friday, August 2, 2013

A petition urging AMD to improve its GPU driver

I would like to encourage everyone who is serious about GPU computing to sign the following petition urging AMD to improve its driver. If this happens we would all benefit: AMD products would prove a viable GPU computing platform, and NVidia would be forced to drop its prices or even provide more compute oriented GPUs at more affordable prices. "Titan" is not an option for me even if it is considered a consumer product.

Petition URL:
http://www.change.org/petitions/advanced-micro-devices-fix-bugs-in-opencl-compiler-drivers-and-eventually-opensource-them

Source:
http://streamcomputing.eu/blog/2013-08-01/amd-fix-bugs-in-opencl-compilerdrivers/

Saturday, May 11, 2013

Volcanic Islands to erupt at the end of the year?

Update:
Sadly, this post is probably based on old speculation posted in a forum rather than on reliable information. For more info click here:
http://semiaccurate.com/2013/05/20/amds-volcanic-islands-architecture/

The months ahead seem set to unveil fascinating GPU disclosures. AMD is rumored to unveil a new GPU architecture (named Volcanic Islands) in Q4 2013 which, according to the purportedly leaked diagram seen below, will contain a vast number of parallel compute elements (4096) plus 16 serial processors. Hopefully, these serial processors will alleviate the serial code execution bottlenecks evident on GPUs. Thus, GPU compute could be further adopted for algorithms containing interleaved serial parts.

Volcanic Islands (Serial Processing Modules and Parallel Compute Modules)
Volcanic Islands block diagram


It is not known whether this information is correct, but it highlights the trend of GPUs towards compute. Whatever the truth is, I hope it will push AMD's rival to strengthen the GPU computing capabilities of its desktop products, which proved weak in the last generation in favor of gaming and efficiency.

Wednesday, April 17, 2013

About bitcoin mining performance on GPUs

Traditionally, when it comes to bitcoin mining performance the results favour AMD GPUs. There is an interesting article about the bitcoin mining performance of contemporary GPUs (HD7970, GTX Titan). It aims to explain the extreme performance discrepancy between the AMD and NVidia architectures in this benchmark.

link: http://www.extremetech.com/computing/153467-amd-destroys-nvidia-bitcoin-mining

Monday, March 4, 2013

FUSE emulator on CRT TV screen

The FUSE emulator looks more realistic when run on a CRT TV screen. That's because the original ZX-Spectrum featured an RF modulator in order to use an ordinary TV as a monitor. Here I use a Raspberry PI to run the emulator and it runs really well!


Saturday, February 16, 2013

AMD Sea Islands instruction set documentation is online

A fresh PDF about the Sea Islands GPU instruction set is now available online. Sea Islands is the architecture of the new AMD GPUs yet to be released. Here are some notes found inside:


Differences Between Southern Islands and Sea Islands Devices

Important differences between S.I. and C.I. GPUs
•Multi queue compute
Lets multiple user-level queues of compute workloads be bound to the device and processed simultaneously. Hardware supports up to eight compute pipelines with up to eight queues bound to each pipeline.
•System unified addressing
Allows GPU access to process coherent address space.
•Device unified addressing
Lets a kernel view LDS and video memory as a single addressable memory. It also adds shader instructions, which provide access to “flat” memory space.
•Memory address watch
Lets a shader determine if a region of memory has been accessed.
•Conditional debug
Adds the ability to execute or skip a section of code based on state bits under control of debugger software. This feature adds two bits of state to each wavefront; these bits are initialized by the state register values set by the debugger, and they can be used in conditional branch instructions to skip or execute debug-only code in the kernel.
•Support for unaligned memory accesses
•Detection and reporting of violations in memory accesses

It seems that the Sea Islands architecture will feature multiple queues, similar to what NVidia promotes as "Hyper-Q" technology in Kepler (see the GK110 whitepaper).

Link: AMD_Sea_Islands_Instruction_Set_Architecture.pdf

UPDATE: For some reason the referenced file is not available anymore.

Friday, February 8, 2013

Memory intensive microbenchmarks on GPUs via OpenCL

I have authored a set of micro-benchmarks implemented with the OpenCL API. When they are executed, the optimum set of values for 3 basic parameters is estimated. These parameters are the workgroup size, the thread granularity, by which I refer to the number of elements computed by each workitem, and the vector width, i.e. the width of the native vector type used in the kernel. The latter is closely related to the thread granularity, although not exactly the same.

Here I provide some results of the estimated bandwidth of some memory intensive operations performed by some contemporary GPUs:

Device           Reduction   Memset   Vector addition   (all in GB/sec)
NVidia GTX 480   147.37      157.60   149.95
NVidia S2050     95.07       117.71   106.35
NVidia K20X      196.85      200.36   191.71
AMD HD 7750      63.27       38.55    53.07
AMD HD 5870      136.23      106.80   111.33
AMD HD 7970      225.24      156.42   205.26



Each array involved in the computation holds 16M 32-bit integers (i.e. the addition adds 16M + 16M elements, and the reduction reduces 16M elements).
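The GB/sec numbers above are effective bandwidths: total bytes read plus written, divided by kernel time. A C helper showing the accounting (the function name and the example time below are mine, purely illustrative):

```c
/* Effective bandwidth in GB/sec: total bytes moved (all arrays read
 * plus all arrays written) divided by the elapsed kernel time. */
double effective_gbps(long elements, int bytes_per_element,
                      int arrays_read, int arrays_written,
                      double seconds) {
    double bytes = (double)elements * bytes_per_element
                 * (arrays_read + arrays_written);
    return bytes / seconds / 1.0e9;
}
```

A vector addition of 16M ints reads two arrays and writes one, i.e. 3 × 64 MB per run; sustaining the HD 7970's 205 GB/sec therefore implies the kernel completes in under a millisecond.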

The Southern Islands based GPUs (7750 & 7970) seem to have reduced performance on memory writes. This can be explained by the AMD OpenCL programming guide, which notes that memory writes cannot be coalesced on the SI architecture. I don't know, though, why AMD had to reduce the global memory write throughput on SI GPUs.

Another thing to note is that the S2050 had ECC memory enabled whereas on the K20X it was disabled. According to the NVidia manuals, enabling ECC leads to ~20% lower memory bandwidth.


Saturday, January 19, 2013

Problems with AMD Catalyst 13.1 on Ubuntu

AMD's Catalyst 13.1 driver was released two days ago, but I encountered problems when trying to generate Ubuntu packages for both the 10.04 & 12.04 releases.

In case anybody is interested, I'm providing some brief workarounds here:
First, one has to extract the driver files using the "--extract" option.
In both cases the problem was due to the "rules" file located under the "packages/Ubuntu/dists/{precise/lucid}" directory. Thus, the following changes had to be made in the "rules" file.

In case of Ubuntu 12.04 the following line:
dh_install -p$(PKG_driver) "arch/x86_64/usr/share/ati/lib" "$(datadir)/ati"
had to be replaced with:
dh_install -p$(PKG_driver) "arch/x86/usr/share/ati/lib" "$(datadir)/ati"


In case of Ubuntu 10.04 the following line had to be appended after line 69:
 SRC_other_arch := x86_64
and the following line had to be appended after line 151:
  -e "s|#SRCOTHERARCH#|$(SRC_other_arch)|g" \

All packages then should be created as usual by giving:
 sudo ./ati-installer.sh 9.012 --buildpkg Ubuntu/precise
or
 sudo ./ati-installer.sh 9.012 --buildpkg Ubuntu/lucid




Monday, January 14, 2013

nbench on small linux devices

One of the benchmark programs that I find most convenient to use is nbench, because it is applicable to almost every device that can execute plain C code. This means it can run on a desktop computer as well as on a smartphone (nbench is freely available on Google Play) or a router flashed with custom firmware (e.g. DD-WRT with Optware).

Here are three devices that I have tried it on:
Raspberry PI
ASUS RT-N16
Linksys NSLU2
The RaspPI and NSLU2 are ARM based whereas the RT-N16 is MIPS based.

Here you can see the results running it on a Raspberry PI (Raspbian OS):

BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          221.64  :       5.68  :       1.87
STRING SORT         :          31.709  :      14.17  :       2.19
BITFIELD            :      8.4099e+07  :      14.43  :       3.01
FP EMULATION        :          46.363  :      22.25  :       5.13
FOURIER             :          2372.8  :       2.70  :       1.52
ASSIGNMENT          :          2.4781  :       9.43  :       2.45
IDEA                :           696.1  :      10.65  :       3.16
HUFFMAN             :          424.38  :      11.77  :       3.76
NEURAL NET          :          3.0098  :       4.83  :       2.03
LU DECOMPOSITION    :           78.72  :       4.08  :       2.94
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 11.729
FLOATING-POINT INDEX: 3.761
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 :
L2 Cache            :
OS                  : Linux 3.2.27+
C compiler          : gcc-4.7
libc                : /lib/arm-linux-gnueabihf/libgcc_s.so.1
MEMORY INDEX        : 2.528
INTEGER INDEX       : 3.266
FLOATING-POINT INDEX: 2.086
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
* Trademarks are property of their respective holder.


Here running it on Linksys nslu2 fileserver (flashed with SlugOS):

BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          74.271  :       1.90  :       0.63
STRING SORT         :          6.9679  :       3.11  :       0.48
BITFIELD            :      1.8159e+07  :       3.11  :       0.65
FP EMULATION        :          17.645  :       8.47  :       1.95
FOURIER             :          75.723  :       0.09  :       0.05
ASSIGNMENT          :         0.96228  :       3.66  :       0.95
IDEA                :          176.19  :       2.69  :       0.80
HUFFMAN             :          104.82  :       2.91  :       0.93
NEURAL NET          :         0.10509  :       0.17  :       0.07
LU DECOMPOSITION    :          3.3757  :       0.17  :       0.13
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 3.324
FLOATING-POINT INDEX: 0.136
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 :
L2 Cache            :
OS                  : Linux 2.6.27.8
C compiler          : gcc version 4.2.4
libc                :
MEMORY INDEX        : 0.668
INTEGER INDEX       : 0.976
FLOATING-POINT INDEX: 0.076
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
* Trademarks are property of their respective holder.


And finally here running it on a Asus RT-N16 router (flashed with DD-Wrt with optware):

BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :           160.6  :       4.12  :       1.35
STRING SORT         :          3.7864  :       1.69  :       0.26
BITFIELD            :      6.3597e+07  :      10.91  :       2.28
FP EMULATION        :            28.6  :      13.72  :       3.17
FOURIER             :          19.904  :       0.02  :       0.01
ASSIGNMENT          :           1.753  :       6.67  :       1.73
IDEA                :          670.35  :      10.25  :       3.04
HUFFMAN             :          40.453  :       1.12  :       0.36
NEURAL NET          :        0.015345  :       0.02  :       0.01
LU DECOMPOSITION    :         0.43656  :       0.02  :       0.02
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 5.017
FLOATING-POINT INDEX: 0.023
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 :
L2 Cache            :
OS                  : Linux 2.6.24.111
C compiler          : gcc version 4.1.1
libc                : ld-uClibc-0.9.28.so
MEMORY INDEX        : 1.011
INTEGER INDEX       : 1.470
FLOATING-POINT INDEX: 0.013
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
* Trademarks are property of their respective holder.

It should be noted that the latter two devices do not feature a floating point unit and thus their performance on floating point intensive tests is extremely low.

One of the drawbacks of nbench is that it is written as a single threaded application, so it cannot exploit the extra cores of a multicore CPU. One of my future hobby projects could be porting nbench to OpenMP, or even OpenCL, in order to exploit the full capabilities of a contemporary CPU or even a GPU. It would be fun to compare a Raspberry PI with a GTX580 on nbench!
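As a taste of what such a port could look like, here is a sketch of parallelizing an nbench-style inner loop with OpenMP. Note that this is my own toy loop, not actual nbench code:

```c
/* Sum an array with the iterations split across all available cores.
 * When compiled without -fopenmp the pragma is simply ignored and the
 * loop runs serially, so the port stays backwards compatible. */
long parallel_sum(const int *data, long n) {
    long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += data[i];
    return sum;
}
```

The reduction clause is the key detail: each thread accumulates a private partial sum which OpenMP combines at the end, avoiding a data race on the shared accumulator.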

Saturday, January 5, 2013

A GPGPU comparison (K20, 7970, GTX680, M2050 & GTX580)

I found a nice GPGPU comparison on a blog. It's very interesting as it exposes practical benchmark results for all the latest GPUs on the market, across 4 problems of different nature (bandwidth limited or compute intensive).

The GPUs compared are:

  1. NVidia Tesla K20
  2. NVidia GTX 680
  3. NVidia Tesla M2050
  4. AMD HD 7970

The 4 problems are:
  1. Digital Hydraulics
  2. Ambient Occlusion
  3. Running Sum
  4. Geometry Sampling
The results are illustrated below:

As can be seen, the Kepler architecture is not as great as expected (at least for the compute-optimized K20 chip). The older Fermi architecture seems to sustain decent performance. In addition, the AMD GPU seems to be a good opponent, showing the benefits of the Southern Islands architecture in compute applications.

For the original full article click here:
http://wili.cc/blog/gpgpu-faceoff.html

Wednesday, January 2, 2013

Raspberry PI as a home server

Now, this is my very first post.
Here is my new low energy server, known as the Raspberry PI. Here it serves web pages via lighttpd and VOIP telephony via asterisk. It's definitely a low power server, so one doesn't have to bother turning it off to save energy.


It's also very cheap (~$35) and quite popular.