Monday, December 26, 2016

OpenCL/ROCm clinfo output on AMD Fiji

This month, with the release of AMD ROCm v1.4, we also got a taste of the preview version of the OpenCL runtime on ROCm. For anyone curious about it, here is the clinfo output on an AMD R9 Nano GPU (external URL on gist):

Number of platforms:                             1
  Platform Profile:                              FULL_PROFILE
  Platform Version:                              OpenCL 2.0 AMD-APP (2300.5)
  Platform Name:                                 AMD Accelerated Parallel Processing
  Platform Vendor:                               Advanced Micro Devices, Inc.
  Platform Extensions:                           cl_khr_icd cl_amd_event_callback cl_amd_offline_devices


  Platform Name:                                 AMD Accelerated Parallel Processing
Number of devices:                               1
  Device Type:                                   CL_DEVICE_TYPE_GPU
  Vendor ID:                                     1002h
  Board name:                                    Fiji [Radeon R9 FURY / NANO Series]
  Device Topology:                               PCI[ B#1, D#0, F#0 ]
  Max compute units:                             64
  Max work items dimensions:                     3
    Max work items[0]:                           1024
    Max work items[1]:                           1024
    Max work items[2]:                           1024
  Max work group size:                           256
  Preferred vector width char:                   4
  Preferred vector width short:                  2
  Preferred vector width int:                    1
  Preferred vector width long:                   1
  Preferred vector width float:                  1
  Preferred vector width double:                 1
  Native vector width char:                      4
  Native vector width short:                     2
  Native vector width int:                       1
  Native vector width long:                      1
  Native vector width float:                     1
  Native vector width double:                    1
  Max clock frequency:                           1000Mhz
  Address bits:                                  64
  Max memory allocation:                         3221225472
  Image support:                                 Yes
  Max number of images read arguments:           128
  Max number of images write arguments:          8
  Max image 2D width:                            16384
  Max image 2D height:                           16384
  Max image 3D width:                            2048
  Max image 3D height:                           2048
  Max image 3D depth:                            2048
  Max samplers within kernel:                    29440
  Max size of kernel argument:                   1024
  Alignment (bits) of base address:              1024
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                                     No
    Quiet NaNs:                                  Yes
    Round to nearest even:                       Yes
    Round to zero:                               Yes
    Round to +ve and infinity:                   Yes
    IEEE754-2008 fused multiply-add:             Yes
  Cache type:                                    Read/Write
  Cache line size:                               64
  Cache size:                                    16384
  Global memory size:                            4294967296
  Constant buffer size:                          3221225472
  Max number of constant args:                   8
  Local memory type:                             Scratchpad
  Local memory size:                             65536
  Max pipe arguments:                            0
  Max pipe active reservations:                  0
  Max pipe packet size:                          0
  Max global variable size:                      3221225472
  Max global variable preferred total size:      4294967296
  Max read/write image args:                     64
  Max on device events:                          0
  Queue on device max size:                      0
  Max on device queues:                          0
  Queue on device preferred size:                0
  SVM capabilities:
    Coarse grain buffer:                         Yes
    Fine grain buffer:                           Yes
    Fine grain system:                           No
    Atomics:                                     No
  Preferred platform atomic alignment:           0
  Preferred global atomic alignment:             0
  Preferred local atomic alignment:              0
  Kernel Preferred work group size multiple:     64
  Error correction support:                      0
  Unified memory for Host and Device:            0
  Profiling timer resolution:                    1
  Device endianess:                              Little
  Available:                                     Yes
  Compiler available:                            Yes
  Execution capabilities:
    Execute OpenCL kernels:                      Yes
    Execute native function:                     No
  Queue on Host properties:
    Out-of-Order:                                No
    Profiling :                                  Yes
  Queue on Device properties:
    Out-of-Order:                                No
    Profiling :                                  No
  Platform ID:                                   0x7f7273868198
  Name:                                          gfx803
  Vendor:                                        Advanced Micro Devices, Inc.
  Device OpenCL C version:                       OpenCL C 2.0
  Driver version:                                1.1 (HSA,LC)
  Profile:                                       FULL_PROFILE
  Version:                                       OpenCL 1.2
  Extensions:                                    cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_media_ops cl_amd_media_ops2 cl_khr_subgroups cl_khr_depth_images


Sunday, September 18, 2016

NVidia Pascal GPU architecture's most exciting feature

A few months ago NVidia announced the Pascal GPU architecture and, more specifically, the GP100 GPU. This is a monstrous GPU with more than 15 billion transistors, built on a 16nm FinFET fabrication process. Though the claimed performance numbers are impressive (10.6 TFlops SP, 5.3 TFlops DP), I personally think that raw throughput is not the most impressive feature of this GPU.

The most impressive feature, as advertised, is the unified memory support. The term was first introduced with CUDA 6 and CC 3.0 & CC 3.5 devices (Kepler architecture), but at the time it didn't actually provide any real benefit other than programming convenience. In particular, the runtime took care of moving the whole data set to/from GPU memory whenever it was used on either the host or the GPU. The GP100 memory unification seems far more complete: according to the specifications it takes memory unification to the next level by supporting data migration at the granularity of a memory page. This means that the programmer is able to "see" the whole system memory, while the runtime takes care of moving each memory page at the moment it is actually needed. This is a great feature! It allows porting CPU programs to CUDA without worrying about which data will actually be accessed.
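
As a minimal sketch of what this looks like from CUDA code (the buffer size and kernel below are purely illustrative; the comments reflect the per-page, on-demand migration advertised for Pascal versus the whole-allocation migration of Kepler-style unified memory):

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: increments every element of a managed buffer.
__global__ void increment(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *data = nullptr;

    // A single managed allocation is visible to both host and device.
    // On Kepler/Maxwell the runtime migrates the whole allocation around
    // a kernel launch; on Pascal pages are faulted in on demand.
    cudaMallocManaged(&data, n * sizeof(float));

    for (int i = 0; i < n; ++i) data[i] = 0.0f;   // host writes directly

    increment<<<(n + 255) / 256, 256>>>(data, n); // device reads/writes directly
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);            // host reads the result back
    cudaFree(data);
    return 0;
}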

For instance, imagine having a huge tree or graph structure and a GPU kernel that needs to access just a few of its nodes, without knowing which ones beforehand. Using the Kepler memory unification feature would require copying the whole structure from host to GPU memory, which could badly hurt performance. The Pascal memory unification would instead copy only the memory pages containing the accessed nodes. This relieves the programmer of a great deal of pain, and that's why I think it is the most exciting feature.
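
Here is a rough sketch of that scenario, assuming a Pascal-class device with demand paging; the Node structure and the traversal kernel are hypothetical, but they show why only the pages that are actually touched would need to migrate:

#include <cstdio>
#include <cuda_runtime.h>

struct Node {            // hypothetical graph node
    float value;
    int   next;          // index of a neighbouring node, -1 if none
};

// Visits a short chain of nodes starting from 'start'; only the pages
// holding the visited nodes need to be resident on the GPU.
__global__ void visit_chain(const Node *nodes, int start, float *sum) {
    float s = 0.0f;
    for (int i = start; i >= 0; i = nodes[i].next)
        s += nodes[i].value;
    *sum = s;
}

int main() {
    const size_t n = 64 * 1024 * 1024;       // a large structure (512 MB here)
    Node  *nodes;
    float *sum;
    cudaMallocManaged(&nodes, n * sizeof(Node));
    cudaMallocManaged(&sum, sizeof(float));

    // The host builds the structure in place; no explicit host-to-device copies.
    for (size_t i = 0; i < n; ++i) nodes[i] = { (float)i, -1 };
    nodes[42].next = 1000000;                 // a tiny chain: 42 -> 1000000

    // On Pascal, only the pages touched by this traversal migrate to the GPU;
    // with Kepler-style unified memory the whole 'nodes' allocation would move.
    visit_chain<<<1, 1>>>(nodes, 42, sum);
    cudaDeviceSynchronize();
    printf("sum = %f\n", *sum);

    cudaFree(nodes);
    cudaFree(sum);
    return 0;
}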

I really hope this feature eventually gets supported on consumer GPU variants and does not remain an HPC-only feature reserved for Tesla products. I also hope that AMD will support such a feature in its emerging ROCm platform.


Thursday, May 19, 2016

mixbench on an AMD Fiji GPU

Recently, I had the quite pleasant opportunity of being granted a Radeon R9 Nano GPU card. This card features the Fiji GPU and, as such, it is a compute beast: 4096 shader units and HBM memory with bandwidth reaching 512 GB/sec. If one also considers its remarkably small size and low power consumption, this card proves to be a great and efficient compute device for handling parallel compute tasks via OpenCL (or HIP, but more on this in a later post).

AMD R9 Nano GPU card

One of the first experiments I tried on it was, of course, the mixbench microbenchmark tool. The execution results, plotted via gnuplot on the memory bandwidth/compute throughput plane, are depicted here:

mixbench-ocl-ro as executed on the R9 Nano
GPU performance effectively approaches 8 TFlops of single precision compute on heavily compute-intensive kernels, whereas it exceeds 450 GB/sec of memory bandwidth on memory-oriented kernels.
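
The idea behind these two extremes can be sketched with a kernel whose arithmetic intensity is a compile-time parameter. This is not the actual mixbench code (mixbench ships OpenCL, CUDA and HIP implementations); it is just a CUDA-flavoured illustration of how sweeping the flops-per-byte ratio moves a kernel from the bandwidth-bound to the compute-bound regime:

#include <cuda_runtime.h>

// Sketch of a mixed kernel: each thread performs COMPUTE_ITERS fused
// multiply-adds per element it touches, so varying COMPUTE_ITERS sweeps
// the operational intensity (flops per byte) plotted above.
template <int COMPUTE_ITERS>
__global__ void mixed_kernel(float *data, int n, float alpha) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = data[i];                      // one global memory read
    #pragma unroll
    for (int k = 0; k < COMPUTE_ITERS; ++k)
        v = v * alpha + 0.5f;               // FMA-friendly arithmetic
    data[i] = v;                            // one global memory write
}

int main() {
    const int n = 1 << 24;
    float *data;
    cudaMalloc(&data, n * sizeof(float));
    cudaMemset(data, 0, n * sizeof(float));

    // Low intensity: bandwidth bound. High intensity: compute bound.
    mixed_kernel<2><<<(n + 255) / 256, 256>>>(data, n, 1.0001f);
    mixed_kernel<256><<<(n + 255) / 256, 256>>>(data, n, 1.0001f);
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}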

For anyone interested in trying mixbench on their CUDA/OpenCL/HIP GPU, please follow the link to GitHub:
https://github.com/ekondis/mixbench

Here is an example of execution on Ubuntu Linux:



Acknowledgement: I would like to greatly thank the Radeon Open Compute department of AMD for kindly supplying the Radeon R9 Nano GPU card in support of our research.

Saturday, March 19, 2016

Raspberry PI 3 is here!

A few days ago the Raspberry PI 3 arrived home, as I had ordered one as soon as I heard of its launch. It's certainly faster than the PI 2 thanks to its ARM Cortex-A53 cores. The quoted +50% performance gain holds more or less true, depending on the application of course. There are some other additions as well, like WiFi and Bluetooth.

The Raspberry PI 3

A closer look of the PI 3

As usual, I am providing some nbench execution results. These are consistent with the +50% performance claim. For those interested, I had published nbench results on the PI 2 in the past.

BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          654.04  :      16.77  :       5.51
STRING SORT         :          72.459  :      32.38  :       5.01
BITFIELD            :      1.9972e+08  :      34.26  :       7.16
FP EMULATION        :          134.28  :      64.44  :      14.87
FOURIER             :          6677.3  :       7.59  :       4.27
ASSIGNMENT          :          10.381  :      39.50  :      10.25
IDEA                :          2740.7  :      41.92  :      12.45
HUFFMAN             :          1008.9  :      27.98  :       8.93
NEURAL NET          :          9.8057  :      15.75  :       6.63
LU DECOMPOSITION    :          365.38  :      18.93  :      13.67
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 34.272
FLOATING-POINT INDEX: 13.131
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 : 4 CPU ARMv7 Processor rev 4 (v7l)
L2 Cache            :
OS                  : Linux 4.1.18-v7+
C compiler          : gcc-4.9
libc                : libc-2.19.so
MEMORY INDEX        : 7.162
INTEGER INDEX       : 9.769
FLOATING-POINT INDEX: 7.283
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
* Trademarks are property of their respective holder.

As I came across some reports on temperature issues with the PI 3, I wanted to run some experiments on its power consumption. I used a power meter into which I plugged the power supply unit feeding the PI. I ran a few experiments and got the following power consumption ratings:


PI running state             Power consumption
--------------------------   -----------------
Idle                         1.4 W
Single-threaded benchmark    2.2 W
Multithreaded benchmark      4.0 W
After running "poweroff"     0.5 W

So, in my case it doesn't seem to consume too much power. However, a comparison with the PI 2 should be performed in order to get a better picture.