Saturday, December 20, 2014

Evaluating workgroup reduction functions: how well do they perform?

AMD has released its initial driver for OpenCL 2.0. The latest version of the OpenCL parallel programming API is quite interesting, as it supports shared virtual memory, dynamic parallelism, pipes and other features. Among them are the workgroup and sub-workgroup functions. These abstractions simplify parallel primitives such as broadcast, scan and reduction, and they also give the compiler room for further optimization.

To evaluate the performance of the workgroup functions I developed a test case that reduces the sum 1+2+3+...+N. The reduction is implemented in three different ways, one per kernel. The first kernel performs it in the classical manner, using shared memory only. The second kernel is a hybrid: it uses the subgroup reduction operation for the intra-wavefront stage and shared memory for the inter-wavefront stages. The third kernel performs the whole reduction with the workgroup reduction function.
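The kernel sources are not reproduced in this post, so what follows is only a CPU-side Python sketch of the three strategies, not the actual OpenCL code. The work-group size of 256 is an assumption, and so is N = 65535, which I infer from the fact that 65535*65536/2 equals the reported output. All function names are made up for illustration.

```python
# CPU-side simulation of the three reduction strategies; this is not OpenCL code.
WAVEFRONT_SIZE = 64    # matches -DWAVEFRONT_SIZE=64 in the build options
WORKGROUP_SIZE = 256   # assumption: the post does not state the work-group size


def tree_reduce(values):
    """Classical local-memory reduction: halve the active stride each step.

    Requires a power-of-two number of values.
    """
    buf = list(values)               # stands in for the local (shared) buffer
    stride = len(buf) // 2
    while stride > 0:
        for lid in range(stride):    # each iteration models one work-item
            buf[lid] += buf[lid + stride]
        stride //= 2                 # on the GPU a barrier separates the steps
    return buf[0]


def hybrid_reduce(values):
    """Reduce inside each wavefront first, then tree-reduce the partial sums."""
    partials = [
        sum(values[i:i + WAVEFRONT_SIZE])   # models the subgroup reduction
        for i in range(0, len(values), WAVEFRONT_SIZE)
    ]
    return tree_reduce(partials)


def workgroup_reduce(values):
    """Models a single call to the built-in workgroup reduction."""
    return sum(values)


def reduce_all(data, per_group):
    """Split the input into work-groups and add up the per-group results."""
    return sum(
        per_group(data[i:i + WORKGROUP_SIZE])
        for i in range(0, len(data), WORKGROUP_SIZE)
    )


n = 65535                      # assumption inferred from the reported output
data = list(range(n + 1))      # 0, 1, ..., N
expected = n * (n + 1) // 2    # closed form of 1 + 2 + ... + N

for strategy in (tree_reduce, hybrid_reduce, workgroup_reduce):
    print(strategy.__name__, reduce_all(data, strategy), expected)
```

All three strategies must agree with the closed form N(N+1)/2; only the way the partial sums are combined differs, which is where any performance difference on the GPU has to come from.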

The results are somewhat disappointing. The test system is a 64-bit Linux machine with an R7-260X GPU. The program output follows:

Workgroup and sub-workgroup OpenCL 2.0 function evaluation test case
Platform/Device selection
Total platforms: 1
AMD Accelerated Parallel Processing
 1. Bonaire/Advanced Micro Devices, Inc.
 2. Intel(R) Pentium(R) 4 CPU 3.06GHz/GenuineIntel
Select device index: 
Device info
Platform:       AMD Accelerated Parallel Processing
Device:         Bonaire
Driver version: 1642.5 (VM)
OpenCL version: OpenCL 2.0 AMD-APP (1642.5)
Great! OpenCL 2.0 is supported :)
Building kernel with options "-cl-std=CL2.0 -cl-uniform-work-group-size -DK3 -DK2 -DWAVEFRONT_SIZE=64"

1. Shared memory only kernel
Output: 2147450880 / Time: 0.089481 msecs (0.732401 billion elements/second)

2. Hybrid kernel via subgroup functions
Output: 2147450880 / Time: 0.215851 msecs (0.303617 billion elements/second)
Relative speed-up to kernel 1: 0.41455

3. Workgroup function kernel
Output: 2147450880 / Time: 0.475408 msecs (0.137852 billion elements/second)
Relative speed-up to kernel 1: 0.188219

The kernel with the workgroup function is more than 5 times slower than the one that uses only shared memory (a relative speed-up of 0.188, i.e. about 5.3 times slower). This should definitely not be the case in a performance-oriented environment like OpenCL. Workgroup functions should perform at least as well as a hand-written shared memory reduction; otherwise they are of little practical use.
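For reference, the throughput and speed-up figures can be recomputed from the timings with a few lines of Python. This is only a sketch: the element count of 65536 is an assumption that I infer from the reported throughput, and the helper names are hypothetical.

```python
# Timings in milliseconds, copied from the program output above.
times_ms = {
    "shared memory": 0.089481,
    "hybrid (subgroup)": 0.215851,
    "workgroup function": 0.475408,
}
ELEMENTS = 65536   # assumption: inferred from the reported throughput


def throughput_gelems(time_ms, elements=ELEMENTS):
    """Billions of elements processed per second."""
    return elements / (time_ms * 1e-3) / 1e9


def speedup(time_ms, baseline_ms=times_ms["shared memory"]):
    """Relative speed-up over kernel 1 (values below 1 mean slower)."""
    return baseline_ms / time_ms


for name, t in times_ms.items():
    print(f"{name}: {throughput_gelems(t):.6f} billion elements/second, "
          f"speed-up {speedup(t):.6f}, slowdown {1 / speedup(t):.2f}x")
```

A speed-up of about 0.188 for the workgroup function kernel corresponds to a slowdown of roughly 5.3 times relative to the shared memory kernel.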

Unfortunately, CodeXL version 1.6 does not support static analysis of OpenCL 2.0 kernels, so I cannot inspect the assembly code generated for the workgroup functions. In theory, swizzle operations should be leveraged to optimize such operations.

Test case download link on GitHub:

If you get different results, please let me know.


  1. Question: is it possible to install two AMD GPU cards (no CrossFire) and run two independent command queues (one on each GPU device) in parallel under Linux? If yes, where can one get the info?

    1. Yes, at least theoretically, but I haven't tried it. The AMD Developer Central forum is a more appropriate place for such questions:

  2. Elias, I am sure you might be aware of this, but just in case:
    Though CodeXL does not support static analysis of 2.0 kernels, you can still get the assembly code (IL and ISA files) by using the -save-temps compiler switch while building or during execution of your program.

    1. Oops! Actually, I hadn't thought of that!
      Thanks! This will help solve the mystery.

    2. Ok, I've posted the generated disassembly in a thread I opened a while ago. The generated code is huge, which doesn't seem normal!
