Wednesday, April 16, 2014

Loop execution performance comparison in various programming languages

The main focus of a GPU programmer is performance. Therefore the execution time of various time consuming loops is of significant consideration. In this regard I performed some experiments in various programming languages of a small nested loop. The problem investigated is a trivial one though it needs significant number of operations to be performed in a nested loop form.

Problem definition

Search for a pair of integers in the [1..15000] range whose multiple is equal to 87654321.

Loop implementations

A trivial solution of this problem is provided in the following python code:
for i in range(1, 15001):
 for j in range(i+1, 15001):
  if i*j==87654321:
   print "Found! ",str(i)," ",str(j)

Converting the code above to C is straightforward. The code can be easily parallelized using OpenMP constructs by adding a single line:

#pragma omp parallel for private(j) schedule(dynamic,500)
        for(i=1; i<=15000; i++)
                for(j=i+1; j<=15000; j++)
                        if( i*j==87654321 )
                                printf("Found! %d %d\n", i, j);

The schedule parameter directs the compiler to apply dynamic scheduling in order to address the unbalanced nature of the iterations (first outer loop performs 14999 operations while the last one does none).

A naive implementation in OpenCL is also provided. A workitem is assigned to each iteration of the outer loop:

__kernel void factor8to1(unsigned int limit, global int *results){
 int i = get_global_id(0);

 if( i<=limit )
  for(int j=i+1; j<=limit; j++)
   if( i*j==87654321 ){
    results[0] = i;
    results[1] = j;

The OpenCL kernel requires to be launched with an NDRange of 15000 workitems. These are not adequate especially for large GPUs but they should be enough for a demo.

Of course this kernel is not well balanced neither optimized, in order to be clear to read and understand. Note that the goal of this project is not to provide an optimized factorization algorithm but to demonstrate the loop code efficiency in various scripting and compiled languages, as well as, to provide a glimpse to the gains of parallel processing.

Code is written in the following languages:
  1.  Python
  2.  JavaScript
  3.  Free pascal compiler
  4.  C
  5.  OpenMP/C
  6.  OpenCL

All sources are provided on github:

Execution results on A6-1450 APU

Here are provided the execution results of executions on an AMD A6-1450 APU which is a low power processing unit which combines a CPU and a GPU on the same die package. It features a quad core CPU (Jaguar cores) running at 1GHz and a GCN-GPU with 2 compute units (128 processing elements in total).

The benefits of parallel processing are apparent. The advancements of javascript JIT engines are also evident.

No comments:

Post a Comment