Saturday, December 20, 2014

Workgroup reduction functions: how well do they perform?

The initial AMD driver for OpenCL 2.0 has already been released. The latest version of the OpenCL parallel programming API is quite interesting, as it supports shared virtual memory, dynamic parallelism, pipes and other features. Among them are the workgroup and sub-group functions: abstractions that on the one hand simplify parallel primitives such as broadcast, scan and reduction operations, and on the other hand give the compiler room for further optimization.
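
As a quick illustration of what these built-ins look like, here is a trivial kernel exercising the broadcast, scan and reduction operations. It is just a sketch of the API, not part of the benchmark, and the buffer names are mine:

// Minimal sketch of the OpenCL 2.0 workgroup built-ins (illustrative only)
__kernel void wg_builtins_demo(__global const int *in, __global int *out)
{
    const int v = in[get_global_id(0)];
    const int first  = work_group_broadcast(v, 0);        // value held by work-item 0, visible to all
    const int prefix = work_group_scan_inclusive_add(v);  // inclusive prefix sum over the workgroup
    const int total  = work_group_reduce_add(v);          // sum over the whole workgroup
    out[get_global_id(0)] = first + prefix + total;
}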

In order to evaluate workgroup function performance I developed a test case that computes the sum 1+2+3+...+N by reduction. The reduction is implemented in three different ways with three kernels. The first kernel works in the classical manner, using shared (local) memory only. The second is a hybrid: it uses the sub-group reduction operation for the intra-wavefront stage and shared memory for the inter-wavefront stage. The third performs the reduction entirely with the workgroup reduction function. Simplified sketches of the two extreme approaches are shown below.
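
To make the comparison concrete, the two extreme approaches roughly look like the following sketches, simplified to produce one partial sum per workgroup; the actual kernels in the repository differ in details such as the grid-wide accumulation:

// Sketch A: classical tree reduction in local (shared) memory
__kernel void reduce_local(__global const int *in, __global int *partial,
                           __local int *scratch)
{
    const size_t lid = get_local_id(0);
    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);
    for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
        partial[get_group_id(0)] = scratch[0];
}

// Sketch B: the same partial sum via the workgroup reduction function
__kernel void reduce_wg(__global const int *in, __global int *partial)
{
    const int sum = work_group_reduce_add(in[get_global_id(0)]);
    if (get_local_id(0) == 0)
        partial[get_group_id(0)] = sum;
}

The hybrid kernel sits between the two: the intra-wavefront part of the loop is replaced by the sub-group reduction (cl_khr_subgroups) and only the inter-wavefront combination goes through local memory.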

The results seem somewhat disappointing. The test configuration is a 64-bit Linux system with an R7 260X GPU. The output is as follows:

Workgroup and sub-workgroup OpenCL 2.0 function evaluation test case
Platform/Device selection
Total platforms: 1
AMD Accelerated Parallel Processing
 1. Bonaire/Advanced Micro Devices, Inc.
 2. Intel(R) Pentium(R) 4 CPU 3.06GHz/GenuineIntel
Select device index: 
Device info
Platform:       AMD Accelerated Parallel Processing
Device:         Bonaire
Driver version: 1642.5 (VM)
OpenCL version: OpenCL 2.0 AMD-APP (1642.5)
Great! OpenCL 2.0 is supported :)
Building kernel with options "-cl-std=CL2.0 -cl-uniform-work-group-size -DK3 -DK2 -DWAVEFRONT_SIZE=64"

1. Shared memory only kernel
Executing...Done!
Output: 2147450880 / Time: 0.089481 msecs (0.732401 billion elements/second)
PASSED!

2. Hybrid kernel via subgroup functions
Executing...Done!
Output: 2147450880 / Time: 0.215851 msecs (0.303617 billion elements/second)
Relative speed-up to kernel 1: 0.41455
PASSED!

3. Workgroup function kernel
Executing...Done!
Output: 2147450880 / Time: 0.475408 msecs (0.137852 billion elements/second)
Relative speed-up to kernel 1: 0.188219
PASSED!

The kernel based on the workgroup function performs more than 5 times slower than the plain shared memory kernel. This should definitely not be the case in a performance-oriented environment like OpenCL. Workgroup functions should perform at least as well as a hand-written shared memory reduction; otherwise there is little practical reason to use them.

Unfortunately, CodeXL version 1.6 does not support static analysis of OpenCL 2.0 kernels, so I cannot inspect the assembly code generated for the workgroup functions. In theory, cross-lane swizzle operations should be leveraged to optimize such reductions.

Test case source code on GitHub:
https://github.com/ekondis/cl2-reduce-bench

If you notice any different results, please let me know.

Tuesday, December 9, 2014

AMD OpenCL 2.0 SDK is available (BETA)

At last, the AMD SDK for OpenCL 2.0 has been released in beta form. It contains many samples exhibiting the new features. The accompanying documentation has been updated, though it is not rewritten from scratch; for instance, table 2.5 in the optimization guide refers only to HD 7xxx devices, and it would not be hard for AMD to add the corresponding tables for the Rx 2xx devices. Overall, this is a significant step forward for OpenCL 2.0 adoption.

The device driver supporting OpenCL 2.0 was also released today.

For more information and download: