Dawn Project – Static Automatic Code Parallelization



Introduction

Motivation

In recent years, there has been increasing interest in general-purpose computing on graphics processing units (GPGPU). This practice consists of developing general-purpose programs, i.e., programs not necessarily related to graphics processing, to run on hardware that is specialized for graphics computing. Executing programs on such chips can be advantageous due to the parallel nature of their architecture: while a typical CPU is composed of a small number of cores capable of a wide variety of computations, a GPU usually contains hundreds of simpler processors, which perform operations on separate chunks of memory concurrently. Thus, a graphics chip can run sufficiently parallel programs much faster than a CPU, in some cases by several orders of magnitude. Experiments also show that GPU execution is often not only faster but also more energy-efficient.

This model, however, has its shortcomings. Historically, parallel programming has been a difficult paradigm to adopt, usually requiring that developers be familiar with the particular instruction sets of different graphics chips. More recently, standards such as OpenCL and CUDA have been designed to provide some level of abstraction over these platforms, which in turn has led to the development of several compiler directive-based models, e.g. OpenACC and offloading in OpenMP. While these have somewhat bridged the gap between programmers and parallel programming interfaces, they still rely on the manual insertion of compiler directives in production code, an error-prone process that also demands in-depth knowledge of the target program.

Amongst the hurdles involved in annotating code, two tasks are particularly challenging: identifying parallel loops and estimating memory bounds. Regarding the former, opportunities for parallelism are usually buried in complex code and depend on information that is hard to verify, such as the absence of pointer aliasing. As to the latter, languages such as C and C++ do not provide any information on the size of the memory being accessed during the execution of the program. However, when offloading code for parallel execution, it is necessary to specify which chunks of memory must be copied to other devices. Therefore, the onus of keeping track of these bounds falls on the programmer.

Implementation

We have developed DawnCC, a tool that automates these tasks. Through a static analysis that derives memory access bounds from source code, it infers the size of the memory regions accessed in C and C++ programs. With these bounds, it inserts data copy directives into the original source code. These directives provide a compatible compiler with information on which data must be moved between devices. Given the source code of a program as input, the tool provides the user with a modified version containing directives with the proper memory bounds specified, all without any further intervention, effectively freeing developers from the burdensome task of manual code modification.

We implemented DawnCC as a collection of compiler modules, or passes, for the LLVM compiler infrastructure, with this webpage functioning as a front-end through which users provide program source code as input. The program is then compiled by LLVM and transformed by our passes to produce the modified source code, which is presented back to the user. There are also a few options to customize the output, such as printing runtime statistics.


Functionality

Compiler Directive Standards

Compiler directive-oriented programming standards are some of the newest developments in parallel programming. These standards aim to simplify the creation of parallel programs by providing an interface through which programmers indicate specific regions of source code to be run in parallel. Parallel execution is useful in several different hardware settings, such as multiple processors in a multicore architecture, or offloading to a separate device in a heterogeneous system. Compilers that support these standards check for the presence of directives (also known as pragmas) in the source code and generate parallel code for the annotated regions, so that they can be run on a specified target device. DawnCC currently supports two standards, OpenACC and OpenMP, but it can easily be extended to support others. You can read more on the subject in the links below:

OpenACC FAQ
OpenMP FAQ
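
To give a rough idea of what these directives look like in practice, the snippet below shows a simple vector addition annotated by hand for each standard. It is an illustrative sketch, not DawnCC output, and the clauses shown are only one of several valid choices:

  /* OpenACC: copy the accessed array sections and offload the loop. */
  void vector_add_acc(float *x, float *y, float *z, int n) {
      #pragma acc data pcopy(x[0:n], y[0:n], z[0:n])
      #pragma acc kernels
      #pragma acc loop independent
      for (int i = 0; i < n; i++)
          z[i] = x[i] + y[i];
  }

  /* OpenMP 4.x: map the accessed array sections and offload the loop. */
  void vector_add_omp(float *x, float *y, float *z, int n) {
      #pragma omp target teams distribute parallel for \
              map(to: x[0:n], y[0:n]) map(from: z[0:n])
      for (int i = 0; i < n; i++)
          z[i] = x[i] + y[i];
  }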

In order to use these standards to offload execution to accelerators, it is necessary to compile the modified source code with a compiler that supports the given directive standard (OpenMP 4.0 or OpenACC). In our internal testing environment, we use Portland Group's C Compiler for OpenACC support. You can find out more about it in the following link:

Portland Group

Working Example

To demonstrate the functionality of our tool, we can use the C code snippet below as an input program:

  void example (int *a, int *b, int c) {
    int n[5000];
    int i, j, k;

    /*loop 1*/
    for (i = 0; i < 1000; i++) {
        /*loop 2*/
        for (j = 0; j < 1000; j++) {
            n[i+j] = i*j;
        }
    }

    /*loop 3*/
    for (k = 0; k < 5000; k++) {
        a[k] = b[k]+(k*k);
    }

    /*loop 4*/
    for (k = 0; k < c; k++) {
        a[k] = k*(k+1);
    }
  }

Compiling the program above with OpenACC as the chosen standard, and with the option to annotate parallel loops exclusively, yields the following code as output:

  void example (int *a, int *b, int c) {
   int n[5000];
   int i, j, k;

   /*loop 1*/
   #pragma acc data pcopy(n[0:1999])
   #pragma acc kernels
   #pragma acc loop independent
   for (i = 0; i < 1000; i++) {
       /*loop 2*/
       #pragma acc loop independent
       for (j = 0; j < 1000; j++) {
           n[i+j] = i*j;
       }
   }

   /*loop 3*/
   char RST_AI2 = 0;
   RST_AI2 |= !((a + 0 > b + 5000)
   || (b + 0 > a + 5000));
   #pragma acc data pcopy(a[0:5000],b[0:5000]) if(!RST_AI2)
   #pragma acc kernels if(!RST_AI2)
   #pragma acc loop independent
   for (k = 0; k < 5000; k++) {
       a[k] = b[k]+(k*k);
   }

   /*loop 4*/
   long long int AI3[6];  
   AI3[0] = c + -1;
   AI3[1] = 4 * AI3[0];
   AI3[2] = AI3[1] + 4;
   AI3[3] = AI3[2] / 4;
   AI3[4] = (AI3[3] > 0);
   AI3[5] = (AI3[4] ? AI3[3] : 0);
   #pragma acc data pcopy(a[0:AI3[5]])
   #pragma acc kernels
   #pragma acc loop independent
   for (k = 0; k < c; k++) {
       a[k] = k*(k+1);
   }
 }

Note that the original code remains unchanged, having been reconstructed from the compiler's intermediate representation back into source form. For each analyzable loop in the program, the tool inserts the appropriate data copy directives. It also writes "kernels" directives, which tell the compiler that the following code region should be run in parallel. Then, for each parallelizable loop, an "independent" pragma is included, indicating that the respective loop can be executed with multiple iterations running concurrently.
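
DawnCC can also emit OpenMP 4.x offloading directives instead of OpenACC ones. As a rough sketch only (the exact clauses below are our assumption, not actual tool output), the OpenMP counterpart of the directives placed on loop 1 could look like this:

  /* Hypothetical OpenMP 4.x version of loop 1: map(tofrom: ...) plays the
     role of the pcopy clause, and "target teams distribute parallel for"
     plays the role of the kernels and loop independent directives. */
  #pragma omp target teams distribute parallel for map(tofrom: n[0:1999])
  for (i = 0; i < 1000; i++) {
      for (j = 0; j < 1000; j++) {
          n[i+j] = i*j;
      }
  }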

The loops in the above program present a few particular characteristics that exemplify some of the features of our analysis, which we describe in greater depth below.

Memory Bound Accuracy

Balancing the overhead of offloading computation to a separate device against the gain in performance from parallel execution is a cornerstone of efficient GPGPU programming. In many cases, the effort spent on tasks not related to the effective computation, such as copying data or synchronizing execution, outweighs the advantages of exploiting parallelism. This can cause the parallel version of a program to show no significant gain in performance, or even a slowdown, even when the algorithm involved presents ripe opportunities for concurrent execution. Therefore, limiting the amount of overhead involved in parallelizing code is vital.

When it comes to measuring memory bounds for performing data copies, a naive analysis could simply measure the total size of the memory blocks referenced by the variables accessed inside the loops, and use each block's entire size as the upper bound in a data copy directive. While correct, such an approach can generate redundant data copies, since chunks of data that are never used within the loop may be copied back and forth between devices. Our analysis instead calculates the bounds of the memory region that is effectively accessed inside the loop, thus minimizing the amount of potentially redundant copying. This can be observed in the first and second loops in the example code above. The subscript i+j is at most the sum of the maximum values reached by the two induction variables, 999 + 999 = 1998, so only the first 1999 elements of n need to be copied rather than the entire 5000-element array. As a result, the generated data directive contains a more precise value that better reflects the effective limits. The directives inserted are highlighted below.

  /*loop 1*/
  #pragma acc data pcopy(n[0:1999])
  #pragma acc kernels
  #pragma acc loop independent
  for (i = 0; i < 1000; i++) {
      /*loop 2*/
      #pragma acc loop independent
      for (j = 0; j < 1000; j++) {
          n[i+j] = i*j;
      }
  }
  

Treating Pointer Aliasing

There are many reasons why a given piece of a program might not be parallelizable, most of them involving memory dependences of some form. In C and C++ code that manipulates dynamically allocated memory, one of the main culprits behind such dependences is the possibility of pointer aliasing: when a set of instructions accesses memory through multiple pointer variables, there is no guarantee that the regions they point to do not overlap. In such cases, an implicit memory dependence exists, and the possibility of parallelization is typically discarded. Treating these cases statically usually requires complex and costly interprocedural pointer analyses.
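
For instance, the hand-written loop below (not part of the example program) is only safe to parallelize if a and b refer to disjoint memory regions:

  /* If the regions a[0..n) and b[0..n) overlap, some iterations read
     memory that other iterations write, creating a dependence that
     forbids concurrent execution; if they are disjoint, every
     iteration is independent. */
  void scale(int *a, int *b, int n) {
      for (int k = 0; k < n; k++)
          a[k] = 2 * b[k];
  }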

However, as a by-product of its alias analysis, our tool is capable of disambiguating pointers. This means it can infer conditions that ensure the absence of aliasing, in which case the memory dependences do not exist and parallel execution may be possible. By combining this inference with conditional directives (the if clauses in the pragmas below), we can solve the problem in an elegant and concise manner. The conditional directives instruct the compiler to create two different versions of the loop, whose execution is selected by an aliasing check: at run time the condition is evaluated and, if the absence of aliasing is confirmed, the loop is executed in parallel; otherwise, a sequential version is executed instead. This aliasing check and its respective conditional directive can be seen in the third loop in the example above, and are also highlighted in the snippet below.

  /*loop 3*/
  char RST_AI2 = 0;
  RST_AI2 |= !((a + 0 > b + 5000)
  || (b + 0 > a + 5000));
  #pragma acc data pcopy(a[0:5000],b[0:5000]) if(!RST_AI2)
  #pragma acc kernels if(!RST_AI2)
  #pragma acc loop independent
  for (k = 0; k < 5000; k++) {
      a[k] = b[k]+(k*k);
  }
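
In essence, RST_AI2 is set to true when the two 5000-element regions starting at a and b may overlap. A stand-alone sketch of the same test, mirroring the generated condition (this helper is hypothetical and not part of DawnCC's output), would be:

  /* Returns non-zero when the n-element regions starting at a and b may
     overlap: they are considered disjoint only when one region begins
     strictly after the other one ends. */
  static int regions_overlap(const int *a, const int *b, long n) {
      return !((a > b + n) || (b > a + n));
  }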
  

Symbolic Inference

The last loop shows a case where the values of the variables used as subscripts in the loop's memory accesses cannot be predicted within the function's scope. In this case, our tool infers the proper limits by inserting a series of value checks that determine which expression effectively defines the upper bound of the memory accesses, and then uses that value in the copy directive. Note that in this case the memory bounds can vary at run time, yet the limits computed remain correct for every execution context.

  /*loop 4*/
  long long int AI3[6];  
  AI3[0] = c + -1;
  AI3[1] = 4 * AI3[0];
  AI3[2] = AI3[1] + 4;
  AI3[3] = AI3[2] / 4;
  AI3[4] = (AI3[3] > 0);
  AI3[5] = (AI3[4] ? AI3[3] : 0);
  #pragma acc data pcopy(a[0:AI3[5]])
  #pragma acc kernels
  #pragma acc loop independent
  for (k = 0; k < c; k++) {
      a[k] = k*(k+1);
  }
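
The chain of assignments above computes, through byte offsets, the number of elements of a that loop 4 can touch (the generated code assumes 4-byte integers). Read as a whole, it boils down to the following simplified equivalent, shown here only to clarify the arithmetic:

  /* Simplified reading of the AI3 computation: the copy length is the
     number of elements accessed by loop 4, namely c when c is positive
     and 0 otherwise. */
  long long upper_bound = (c > 0) ? (long long) c : 0;
  /* ...which yields the directive: #pragma acc data pcopy(a[0:upper_bound]) */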
  


Benchmarks

To evaluate the code generated by DawnCC, we have used it to annotate code found in a few widely available benchmark suites.
You can download the pack through this link: DawnCC Benchmarks