In recent times, there has been an increasing interest in general-purpose
computing on graphics processing units (GPGPU). This practice consists of
developing general purpose programs, i.e., not necessarily related to graphics
processing, to run on hardware that is specialized for graphics computing.
Executing programs on such chips can be advantageous due to the parallel
nature of their architecture: while a typical CPU is composed of a small
number of cores capable of a wide variety of computations, GPUs usually
contain hundreds of simpler processors, which perform operations in separate
chunks of memory concurrently. Thus, a graphics chip can run programs that
are sufficiently parallel much faster than a CPU. In some cases, this speedup
can reach several orders of magnitude.
Some experiments also show that GPU execution can be not only faster, but
in many cases also more energy-efficient.
This model, however, has its shortcomings. Historically, parallel
programming has been a difficult paradigm to adopt, usually requiring that
developers be familiar with the particular instruction sets of different
graphics chips. Recently, a few standards such as OpenCL and CUDA have been
designed to provide some level of abstraction to these platforms, which in turn
has led to the development of several compiler directive-based models, e.g.
OpenACC and offloading in OpenMP. While these have somewhat bridged the gap between
programmers and parallel programming interfaces, they still rely on manual
insertion of compiler directives in production code, an error-prone process
that also demands in-depth knowledge of the target program.
Amongst the hurdles involved in annotating code, two tasks are
particularly challenging: identifying parallel loops and estimating memory
bounds. Regarding the former, opportunities for parallelism are usually
buried in complex code, and depend on information that is hard to verify, such
as the absence of pointer aliasing. As to the latter, languages such as C and C++
do not provide any information on the size of memory being accessed during the
execution of the program. However, when offloading code for parallel
execution, it is necessary to specify which chunks of memory must be copied
to the target device. Therefore, the onus of keeping track of these bounds falls on
the programmer.
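As a small illustration (this function is ours, not part of DawnCC or its
benchmarks), the loop below is trivially parallel in principle, yet the code
states neither how many elements dst and src hold nor that the two regions do
not overlap:

/* Illustrative example only. The loop is parallel if dst and src do not
 * alias, but nothing in the code states their sizes or rules out overlap. */
void scale(float *dst, const float *src, int n) {
  int i;
  for (i = 0; i < n; i++)
    dst[i] = 2.0f * src[i];
}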
We have developed DawnCC, a tool that automates these tasks.
Through the implementation of a static analysis that derives memory
access bounds from source code, it infers the size of memory regions in
C and C++ programs. With these bounds, it is capable of inserting data copy
directives in the original source code. These directives provide a compatible
compiler with information on which data must be moved between devices. Given
the source code of a program as input, our tool provides the user with a
modified version containing directives with proper memory bounds specified,
without any further intervention, freeing developers from the burdensome task
of manual code modification.
We implemented DawnCC as a collection of compiler modules, or passes, for the
LLVM compiler infrastructure, with this webpage functioning as a front-end
for users to submit program source code as input. The program is then
compiled by LLVM and transformed by our passes to produce the modified source
code as output to the webpage user. There are also a few options to customize
the output, such as printing runtime statistics.
Compiler directive-oriented programming standards are some of the newest
developments in features for parallel programming. These standards aim to
simplify the creation of parallel programs by providing an interface for
programmers to indicate specific regions in source code to be executed in
parallel. Parallel execution has application in several different hardware
settings, such as multiple processors in a multicore architecture, or
offloading to a separate device in a heterogeneous system. Compilers that
support these standards can check for the presence of directives (also known
as pragmas) in the source code, and generate parallel code for the annotated
regions, so that they can be run on a specified target device. DawnCC
currently supports two standards, OpenACC and OpenMP, but it can easily be
extended to support others. You can read more on the subject in the links
below:
OpenACC FAQ

In order to use these standards to offload execution to accelerators, it is
necessary to compile the modified source code with a compiler that supports
the given directive standard (OpenMP 4.0 or OpenACC). In our internal testing
environment, we use Portland Group's C Compiler for OpenACC support. You can
find out more about it in the following link:

Portland Group

To demonstrate the functionality of our tool, we can use the C code snippet
below as an input program:
void example (int *a, int *b, int c) {
  int n[5000];
  int i, j, k;
  /*loop 1*/
  for (i = 0; i < 1000; i++) {
    /*loop 2*/
    for (j = 0; j < 1000; j++) {
      n[i+j] = i*j;
    }
  }
  /*loop 3*/
  for (k = 0; k < 5000; k++) {
    a[k] = b[k]+(k*k);
  }
  /*loop 4*/
  for (k = 0; k < c; k++) {
    a[k] = k*(k+1);
  }
}
Compiling the program above with OpenACC as the chosen standard and annotating parallel loops exclusively yields the following code as output:
void example (int *a, int *b, int c) {
  int n[5000];
  int i, j, k;

  /*loop 1*/
  #pragma acc data pcopy(n[0:1999])
  #pragma acc kernels
  #pragma acc loop independent
  for (i = 0; i < 1000; i++) {
    /*loop 2*/
    #pragma acc loop independent
    for (j = 0; j < 1000; j++) {
      n[i+j] = i*j;
    }
  }

  /*loop 3*/
  char RST_AI2 = 0;
  RST_AI2 |= !((a + 0 > b + 5000) || (b + 0 > a + 5000));
  #pragma acc data pcopy(a[0:5000],b[0:5000]) if(!RST_AI2)
  #pragma acc kernels if(!RST_AI2)
  #pragma acc loop independent
  for (k = 0; k < 5000; k++) {
    a[k] = b[k]+(k*k);
  }

  /*loop 4*/
  long long int AI3[6];
  AI3[0] = c + -1;
  AI3[1] = 4 * AI3[0];
  AI3[2] = AI3[1] + 4;
  AI3[3] = AI3[2] / 4;
  AI3[4] = (AI3[3] > 0);
  AI3[5] = (AI3[4] ? AI3[3] : 0);
  #pragma acc data pcopy(a[0:AI3[5]])
  #pragma acc kernels
  #pragma acc loop independent
  for (k = 0; k < c; k++) {
    a[k] = k*(k+1);
  }
}
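Had OpenMP been selected as the target standard instead, DawnCC would emit
equivalent OpenMP 4.0 accelerator directives. As a rough sketch of what that
might look like for loop 4 (our illustration of OpenMP 4.0 syntax, reusing the
inferred bound AI3[5], not DawnCC's literal output):

/* Sketch only: a possible OpenMP 4.0 rendering of loop 4, reusing the
 * element count AI3[5] computed above. */
#pragma omp target map(tofrom: a[0:AI3[5]])
#pragma omp parallel for
for (k = 0; k < c; k++) {
  a[k] = k*(k+1);
}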
Note that the original code remains unchanged, having been reconstructed
back to its source from the compiler's intermediate representation. For each
analyzable loop in the program, the tool inserts the appropriate data copy
directives. It also writes "kernels" directives, which instruct the compiler that
the following code region can be executed in parallel. Then, for each
parallelizable loop, an "independent" pragma is included, indicating that the
respective loop can be executed with multiple iterations running concurrently.
The loops in the above program present a few particular characteristics that
exemplify some of the features of our analysis, which we describe in greater
depth below.
Balancing the overhead of offloading computation to a separate device and
the gain in performance from parallel execution is a cornerstone of efficient
GPGPU programming. In many cases, the effort spent in performing tasks not
related to effective computation, such as copying data or synchronizing
execution, outweighs the benefits of exploiting parallelism. This can cause
the parallel version of the program to show no significant gain in performance,
or even a slowdown, even when the algorithm involved presents ripe opportunities
for concurrent execution. Therefore, keeping the overhead involved in
parallelizing code under control is vital.
When it comes to measuring memory bounds for performing data copies, a
naive analysis could simply measure the total size of memory blocks
referenced by the variables accessed inside the loops, and use the block's
entire size as the upper bound in a data copy directive. While correct, such
an approach could potentially generate redundant data copy instructions, since
it is possible for chunks of data which are not used within the loop to be
copied back and forth between devices. Our analysis instead calculates the
bounds of the memory region that is effectively accessed inside the loop,
thus minimizing the amount of potentially redundant data transfer.
This can be observed in the first and second loops in the example
code above. Since the array subscript is at most the sum of the maximum
values reachable by the induction variables (i and j each reach at most 999,
so the largest subscript is 1998), it is not necessary to copy the entire
array: only the first 1999 elements of n are needed, rather than all 5000.
As a result, the data directive generated contains a more precise value that
better reflects the effective limits.
The directives inserted are highlighted below.
/*loop 1*/
#pragma acc data pcopy(n[0:1999])
#pragma acc kernels
#pragma acc loop independent
for (i = 0; i < 1000; i++) {
  /*loop 2*/
  #pragma acc loop independent
  for (j = 0; j < 1000; j++) {
    n[i+j] = i*j;
  }
}
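For contrast, a naive analysis (our illustration, not tool output) would
simply copy the whole array:

/* Naive alternative: copies all 5000 elements of n, although indices
 * above 1998 are never touched by the loop nest. */
#pragma acc data pcopy(n[0:5000])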
There are many reasons that might prevent a given piece of a program from
being parallelizable. Most of them involve memory dependences of some form.
In C and C++ code involving dynamically allocated memory regions, one of the
main culprits behind such dependences is the possibility of pointer aliasing.
That is to say, when a set of instructions accesses memory referenced
by multiple pointer variables, there is no guarantee that the regions pointed
to do not overlap. In such cases, an implicit memory dependence exists, and
the possibility of parallelization is typically discarded. Treating these
cases statically usually requires complex and costly interprocedural pointer
analyses.
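As a concrete illustration (our own example, not DawnCC output), the loop
below is safe to parallelize only if a and b refer to disjoint memory;
proving that at compile time is precisely what such analyses attempt:

/* Parallel only if a and b do not overlap: with b == a + 1, iteration k
 * reads a[k+1], which iteration k+1 overwrites, so running iterations
 * concurrently can change the result. */
void shift(int *a, int *b, int n) {
  int k;
  for (k = 0; k < n; k++)
    a[k] = b[k] + 1;
}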
However, as a by-product of our alias analysis, our tool is capable of
disambiguating pointers. This means it can infer conditions that ensure
the absence of aliasing, in which case the memory dependences do not exist,
and parallel execution might be possible. By combining this with conditional
compilation directives, we can solve the problem in an elegant and concise
manner. The conditional directives instruct a compiler to create two different
versions of a loop, whose execution is controlled by an aliasing check. Then,
during execution, the conditional is evaluated. If the absence of aliasing is
confirmed, the loop is executed in parallel. Otherwise, a sequential version
is executed instead. This aliasing check and its respective conditional
directive can be seen in the third loop in the example above, and also
highlighted in the snippet below.
/*loop 3*/
char RST_AI2 = 0;
RST_AI2 |= !((a + 0 > b + 5000) || (b + 0 > a + 5000));
#pragma acc data pcopy(a[0:5000],b[0:5000]) if(!RST_AI2)
#pragma acc kernels if(!RST_AI2)
#pragma acc loop independent
for (k = 0; k < 5000; k++) {
  a[k] = b[k]+(k*k);
}
The last loop shows a case where the scalar values of the variables used as
subscripts in the loop's memory accesses are not predictable within the
function's scope. In this case, our tool is capable of inferring the proper
limits by inserting a series of value checks that determine which variable
effectively defines the upper bound for the memory accesses. It then inserts
the proper value in the copy directive. Note that in this case the memory
bounds can vary during execution, yet the limits defined remain correct for
every execution context.
/*loop 4*/
long long int AI3[6];
AI3[0] = c + -1;
AI3[1] = 4 * AI3[0];
AI3[2] = AI3[1] + 4;
AI3[3] = AI3[2] / 4;
AI3[4] = (AI3[3] > 0);
AI3[5] = (AI3[4] ? AI3[3] : 0);
#pragma acc data pcopy(a[0:AI3[5]])
#pragma acc kernels
#pragma acc loop independent
for (k = 0; k < c; k++) {
  a[k] = k*(k+1);
}
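For intuition (our reading of the expressions above, not additional tool
output), the AI3 sequence converts the last index accessed, c - 1, into a
byte offset, adds the size of one element, and divides back to an element
count, which reduces to max(c, 0):

/* Equivalent, illustrative form of the bound computed above:
 * ((4*(c-1) + 4) / 4) guarded to be non-negative, i.e. max(c, 0). */
long long int bound = (c > 0) ? (long long int) c : 0;  /* == AI3[5] */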
To evaluate the code generated by DawnCC, we have used it to annotate code
found in a few widely available benchmark suites.
You can download the pack
through this link:
DawnCC Benchmarks