CREV: Function Call Re-Vectorization


Function Call Re-Vectorization, or CREV, is an extension to SPMD languages that allows issuing kernels from within kernels. This facility is akin to CUDA's dynamic parallelism, except that it provides the same functionality without the burden of reserving extra resources or scheduling new kernels.
Understanding Dynamic Parallelism
Dynamic parallelism is desirable in several situations, e.g., graph traversals, string searches, and other irregular applications. Unfortunately, current abstractions do not support dynamic parallelism well, as they compromise either efficiency or programmability. In this work, we propose a new SPMD programming-language idiom that lets us take the best of both worlds: efficiency and programmability.
Consider the following scenario. You have a divergent kernel, and within its divergent region, you perform data copies. Each working thread owns a pointer to a section of memory from which data should be read, and another indicating where the data should be copied to. Figure 1 shows a sample memory layout:
Memory layout for 4 threads

Figure 1: Sample memory layout and thread activity. The left-hand side portrays a series of parallel accesses to different memory locations; the right-hand side shows the expected behavior after employing dynamic parallelism.
If we simply call a procedure to perform the copies, we observe the behavior depicted on the left-hand side of Figure 1: each active thread issues a memory read, waits until the data is received, and finally issues a write to the proper memory location. The obvious drawback of this solution is the suboptimal use of resources, as inactive threads sit waiting for the others to finish their processing.
Although this implementation is simple, poor resource management is not its only drawback: since different memory locations are accessed concurrently, we also harm locality of reference. An alternative is to have all threads become active and work together on the copy of a single contiguous data region at a time. This solution is, in fact, rather simple to implement: NVIDIA's SPMD programming language, CUDA, already provides the notion of nested kernel calls, formally known as dynamic parallelism.
// CUDA nested kernel: a kernel launched from device code must be __global__
// (renamed from "memcpy" to avoid clashing with the C standard library)
__global__ void memcpy_kernel(int *src, int *dest, int length) {
    for (int i = 0; i < length; i += blockDim.x) {
        int idx = i + threadIdx.x;
        if (idx < length)
            dest[idx] = src[idx];
    }
}

// CUDA kernel
__global__ void foo(int **srcs, int **dests, int *lengths) {
    int idx = threadIdx.x;
    if (some_property(idx)) {
        // nested (dynamic-parallelism) launch: one block of 32 threads
        memcpy_kernel<<<1, 32>>>(srcs[idx], dests[idx], lengths[idx]);
    }
}

int main(void) {
    // kernel launch: one block of 32 threads
    foo<<<1, 32>>>(srcs, dests, lengths);
}
The special syntax conveys the allocation of basic resources for executing the new kernel, as well as its scheduling. Launching nested kernels in CUDA is worthwhile whenever the workload for the upcoming threads is large enough. Otherwise, if only a few iterations will take place, the overhead of resource allocation and scheduling may exceed the performance gained by activating all threads to work on closer memory locations.
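To make the trade-off concrete, the two copy strategies can be modeled sequentially. The sketch below is our own illustration in plain C, not CUDA: a 4-lane "warp" is simulated with loops, and the function names (`copy_divergent`, `copy_cooperative`) are hypothetical.

```c
#define WARP 4  /* modeled warp width (illustrative only) */

/* Divergent copy: lanes execute in lockstep, but only active lanes do
   work, and each one touches its own, distant buffer. */
void copy_divergent(const int active[WARP], char *const dst[WARP],
                    const char *const src[WARP], const int len[WARP]) {
    int maxlen = 0;
    for (int lane = 0; lane < WARP; ++lane)
        if (active[lane] && len[lane] > maxlen) maxlen = len[lane];
    for (int i = 0; i < maxlen; ++i)           /* one lockstep step ... */
        for (int lane = 0; lane < WARP; ++lane)
            if (active[lane] && i < len[lane])
                dst[lane][i] = src[lane][i];   /* ... scattered accesses */
}

/* Nested-kernel style copy: the whole warp cooperates on one contiguous
   buffer at a time, so accesses within a step are consecutive. */
void copy_cooperative(const int active[WARP], char *const dst[WARP],
                      const char *const src[WARP], const int len[WARP]) {
    for (int owner = 0; owner < WARP; ++owner) {
        if (!active[owner]) continue;
        for (int i = 0; i < len[owner]; i += WARP)
            for (int lane = 0; lane < WARP; ++lane)
                if (i + lane < len[owner])
                    dst[owner][i + lane] = src[owner][i + lane];
    }
}
```

Both routines produce identical results; they differ only in the order in which memory is touched, which is exactly the locality gap that dynamic parallelism closes.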
Function Call Re-Vectorization
On a divergent branch, some threads within a warp will be inactive. If the work done by the active threads contains data parallelism, can we use the inactive threads in the same warp to help speed up that work? And how?
// SIMD function: should only be called under uniform control flow
void simd(argument arg) {
    parallel for (thread_id in warp):
        process data arg[thread_id]
}

// SPMD kernel: a regular SPMD function with potential control-flow divergence
void spmd(void) {
    if (thread_id should be active):
        crev simd(arguments)
}
We propose a new SPMD idiom, crev, that allows SIMD functions to be called from within divergent regions. The approach is similar to a nested kernel call in CUDA, except that it requires no extra allocation or scheduling: its cost is equivalent to that of a regular function call.
// Optimized version of function "spmd" with thread re-enabling
void optimized_spmd(void) {
    if (thread_id should be active):
        everywhere:  // re-enable all threads within this block
            for each thread_id formerly active:
                simd(arguments for thread_id)
}
The code above shows the optimized version, which is the actual outcome of applying our crev directive. The simplicity of our approach, compared to writing the optimized version by hand, encourages the use of irregular structures (e.g., linked lists) over regular ones (e.g., matrices): crev relieves the programmer of the complexity of changing the dimension of parallelism.
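The transformation can also be modeled sequentially. The sketch below is our own illustration in plain C (the names `simd_sum` and `spmd_kernel` are hypothetical): it records which lanes were active in the divergent region, then re-enables the whole modeled warp once per formerly-active lane, exactly as the everywhere block does.

```c
#define WARP 4  /* modeled warp width (illustrative only) */

/* A "SIMD function": called under uniform control flow, so every lane
   of the warp helps process one argument. Here: a lane-strided sum. */
int simd_sum(const int *arr, int n) {
    int total = 0;
    for (int lane = 0; lane < WARP; ++lane)    /* all lanes enabled */
        for (int i = lane; i < n; i += WARP)   /* lane-strided slice */
            total += arr[i];
    return total;
}

/* crev-style SPMD kernel: save the divergence mask, then run the SIMD
   function once for each formerly-active lane with that lane's argument. */
void spmd_kernel(const int *const args[WARP], const int len[WARP],
                 int out[WARP]) {
    int was_active[WARP];
    for (int lane = 0; lane < WARP; ++lane)    /* divergent region */
        was_active[lane] = (len[lane] > 0);
    for (int lane = 0; lane < WARP; ++lane) {  /* "everywhere" block */
        out[lane] = 0;
        if (was_active[lane])
            out[lane] = simd_sum(args[lane], len[lane]);
    }
}
```

Note that `simd_sum` runs with every lane enabled even though only some lanes requested it, which is the essence of the re-vectorization.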
Check out the performance of our ISPC/CREV benchmarks:
We currently have 7 handcrafted benchmarks that help us understand how CREV relieves the burden of writing efficient code in the presence of control-flow divergences. We exploit a broad spectrum of algorithms: from string matching through graph applications to sorting techniques.
Book Filter
Book Filter is an algorithm for processing the pages of a book, discarding those that do not match a given pattern. The idea is similar to grep "pattern", as pages are given as input lines to the program.
// SPMD memory copy
void mem_cpy_par(uniform int8 * varying dest, const uniform int8 * varying src,
    const varying int n) {
    for (varying int i = 0; i < n; ++i) dest[i] = src[i];
}

// SIMD memory copy
void mem_cpy_simd(uniform int8 * uniform dest, const uniform int8 * uniform src,
    const uniform int n) {
    foreach (i = 0 ... n) dest[i] = src[i];
}

// Copies only pages containing the input pattern
export void bookfilter_par(const String * uniform page,
    const uniform int num_pages, const uniform String& pattern,
    String * uniform output) {

    foreach (i = 0 ... num_pages) {
        bool match = false;
        str_match_par(page[i].data, page[i].length, pattern, match);
        if (match) {
            mem_cpy_par(output[i].data, page[i].data, page[i].length);
            output[i].length = page[i].length;
        }
    }
}

The code above copies up to warp-size pages in parallel through the call to mem_cpy_par. This means that only threads whose page matches the pattern will be active during that call. Now, by simply replacing the call to mem_cpy_par with a crev call to mem_cpy_simd:
    crev mem_cpy_simd(output[i].data, page[i].data, page[i].length);
We can temporarily change the dimension of parallelism: we re-enable all threads in the warp to work on copying the selected pages. Check out the performance of our approach in comparison to other possible implementations.
String matching
String matching is a simple set of pattern-matching algorithm implementations. We have a naive parallel version; a crev-based implementation; one that uses ISPC's dynamic parallelism (launch); and a sequential implementation of the Knuth-Morris-Pratt algorithm. Our crev implementation is presented below.
struct String {
    uniform int length;
    int8 * uniform data;
};

// Matches the pattern against the input text at the given offset
void pattern_match(uniform String& text, uniform String& pattern,
    uniform int offset) {

    varying bool match = true;
    foreach (i = 1 ... pattern.length) {
        if (pattern.data[i] != text.data[i + offset]) match &= false;
    }

    if (all(match)) print("match!\n");
}


// Finds the given pattern along the input text
export void String_match(uniform String& text, uniform String& pattern,
    int * uniform matches, uniform int& num_matches) {

    // For each character in the text
    for (uniform int i = 0; i < text.length; i += programCount) {

        // If such character matches the initial character of the pattern
        // and there is still room for the whole pattern in the input string
        varying int pos = i + programIndex;
        if (text.data[pos] == pattern.data[0]
            && (pos + pattern.length) <= text.length) {
            // Activate all threads to help decide whether there is a match
            crev pattern_match(text, pattern, pos);
        }
    }
}
Check out the performance of our approach in comparison to other possible implementations.
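The logic of this matcher can be modeled sequentially as a sanity check. The sketch below is our own illustration in plain C (the names `match_at` and `count_matches` are hypothetical): the outer loop scans the text in warp-sized strides, and whenever a lane sees a first-character hit, the whole modeled warp cooperates on the full comparison, mirroring the crev call.

```c
#include <string.h>

#define WARP 4  /* modeled warp width (illustrative only) */

/* All warp lanes cooperate to test the pattern at one text offset
   (the role of the crev'd pattern_match). */
int match_at(const char *text, const char *pat, int off) {
    int plen = (int)strlen(pat), ok = 1;
    for (int lane = 0; lane < WARP; ++lane)    /* all lanes enabled */
        for (int i = lane; i < plen; i += WARP)
            if (text[off + i] != pat[i]) ok = 0;
    return ok;
}

/* SPMD driver: lanes scan the text in warp-sized strides; a lane whose
   character equals pat[0] triggers a cooperative full check. */
int count_matches(const char *text, const char *pat) {
    int tlen = (int)strlen(text), plen = (int)strlen(pat), count = 0;
    for (int i = 0; i < tlen; i += WARP)       /* uniform outer loop */
        for (int lane = 0; lane < WARP; ++lane) {
            int pos = i + lane;                /* per-lane position */
            if (pos + plen <= tlen && text[pos] == pat[0])
                count += match_at(text, pat, pos);  /* "crev" call */
        }
    return count;
}
```

For example, `count_matches("abababa", "aba")` finds the overlapping occurrences at offsets 0, 2, and 4.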
Depth-first traversal
Check out the performance of our approach in comparison to other possible implementations.
Connected-component leader
Check out the performance of our approach in comparison to other possible implementations.
Bitonic Quicksort
Check out the performance of our approach in comparison to other possible implementations.
Bitonic Mergesort
Check out the performance of our approach in comparison to other possible implementations.
Download our CREV extension to Intel's SPMD compiler:
uSIMD: a Prolog abstract machine to simulate Everywhere blocks
CREV-extended ISPC SPMD compiler
A virtual machine with a running version of CREV-ISPC
Learn more about our work! :D
SBLP16 (Brazilian Symposium on Programming Languages 2016): Blocos Everywhere para Programação SIMD (Everywhere Blocks for SIMD Programming, in Portuguese)
PPoPP17: Function Call Re-Vectorization
MSc: Dissertation
MSc: Final Presentation
Share your thoughts!
Tell us what you think of this idea! Share any experience with our crev-extended ISPC. Send us an email!