CREV: Function Call Re-Vectorization

// CUDA nested kernel (must be __global__ to be launched with <<<...>>>)
__global__ void memcpy(int *src, int *dest, int length) {
    // The threads of the block copy the elements in strides of blockDim.x
    for (int i = 0; i < length; i += blockDim.x) {
        int idx = i + threadIdx.x;
        if (idx < length) dest[idx] = src[idx];
    }
}

// CUDA kernel
__global__ void foo(int **srcs, int **dests, int *lengths) {
    ...;
    int idx = threadIdx.x;
    if (some_property(idx)) {
        // Nested launch: one block of 32 threads per copy
        memcpy<<<1, 32>>>(srcs[idx], dests[idx], lengths[idx]);
    }
    ...;
}

int main(void) {
    ...;
    // kernel launch: one block of 32 threads
    foo<<<1, 32>>>(srcs, dests, lengths);
}

// SIMD function: should only be called under uniform control flow
void simd(argument arg) {
    parallel for (thread_id in warp):
        process data arg[thread_id]
}

// SPMD kernel: regular SPMD function with potential control flow divergences
void spmd(void) {
    if (thread_id should be active):
        crev simd(arguments)
}

// Optimized version of function "spmd" with thread re-enabling
void optimized_spmd(void) {
    if (thread_id should be active):
        everywhere: re-enable threads within this block {
            for each thread_id formerly active:
                simd(arguments for thread_id)
        }
}
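The thread re-enabling scheme above can be emulated on a CPU to make its semantics concrete. The following is a minimal C++ sketch, not the actual compiler transformation: the warp is modelled as a plain loop over lanes, and the names `kWarpSize`, `simd_copy` and `crev_copy` are hypothetical, chosen only for this illustration.

```cpp
#include <array>
#include <cstddef>
#include <vector>

constexpr int kWarpSize = 8;  // hypothetical warp width used by this emulation

// Emulated SIMD function: all lanes of the warp cooperate on ONE argument.
// The inner loop stands in for the "parallel for (thread_id in warp)".
void simd_copy(const std::vector<int>& src, std::vector<int>& dest) {
    dest.resize(src.size());
    for (std::size_t base = 0; base < src.size(); base += kWarpSize) {
        for (int lane = 0; lane < kWarpSize; ++lane) {
            std::size_t idx = base + lane;
            if (idx < src.size()) dest[idx] = src[idx];  // guard the last chunk
        }
    }
}

// Emulated "crev simd_copy(...)": `active` is the divergent mask at the call
// site. For each formerly active lane, the SIMD function runs once with the
// whole warp re-enabled, using that lane's arguments.
void crev_copy(const std::array<bool, kWarpSize>& active,
               const std::array<std::vector<int>, kWarpSize>& srcs,
               std::array<std::vector<int>, kWarpSize>& dests) {
    for (int lane = 0; lane < kWarpSize; ++lane) {
        if (active[lane]) simd_copy(srcs[lane], dests[lane]);
    }
}
```

Calling `crev_copy` with only lanes 1 and 5 marked active copies `srcs[1]` and `srcs[5]` and leaves every other destination untouched, which mirrors the "for each thread_id formerly active" loop in the pseudocode.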

Book Filter

Book Filter is an algorithm that processes pages from books and discards those that do not match a given pattern. The idea is similar to grep "pattern", with the pages given to the program as input lines.
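As a point of reference, the filtering behaviour described above can be sketched sequentially in C++. This is only an illustration under assumptions: the function name `book_filter` is hypothetical, pages are modelled as `std::string`, and `std::string::find` stands in for whatever matcher the real program uses. Like the ISPC version, it writes each matching page to the output slot with the same index and leaves the other slots empty.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Sequential reference: keep only the pages that contain `pattern`.
// output[i] holds a copy of pages[i] when it matches and stays empty otherwise.
std::vector<std::string> book_filter(const std::vector<std::string>& pages,
                                     const std::string& pattern) {
    std::vector<std::string> output(pages.size());
    for (std::size_t i = 0; i < pages.size(); ++i) {
        if (pages[i].find(pattern) != std::string::npos) {
            output[i] = pages[i];  // the copy that the parallel versions speed up
        }
    }
    return output;
}
```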

// SPMD memory copy                                                        
void mem_cpy_par(uniform int8 * varying dest, const uniform int8 * varying src,
    const varying int n) {                                                     
    for (varying int i = 0; i < n; ++i) dest[i] = src[i];                      
}                                                                              

// SIMD memory copy                                                             
void mem_cpy_simd(uniform int8 * uniform dest, const uniform int8 * uniform src,
    const uniform int n) {                                                      
    foreach (i = 0 ... n) dest[i] = src[i];                                     
}                                                                               

// Copies only pages containing the input pattern                     
export void bookfilter_par(const String * uniform page,               
    const uniform int num_pages, const uniform String& pattern,       
    String * uniform output) {                                        

    foreach (i = 0 ... num_pages) {                                   
        bool match = false;                                           
        str_match_par(page[i].data, page[i].length, pattern, match);  
        if (match != false) {                                         
            mem_cpy_par(output[i].data, page[i].data, page[i].length);
            output[i].length = page[i].length;                        
        }                                                             
    }                                                                 

}

The code above copies up to warp-size pages page[i] in parallel, starting at the call to mem_cpy_par. This means that only the threads whose page matches the pattern are active during that call. Now, by simply replacing the call to mem_cpy_par with a call to mem_cpy_simd using crev:

    crev mem_cpy_simd(output[i].data, page[i].data, page[i].length);

we temporarily change the dimension of parallelism: all threads of the warp are re-enabled to work together on the copy of each selected page. Check out the performance of our approach in comparison to other possible implementations.
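To see why re-enabling the warp can pay off, the two copy strategies can be compared with a toy step count. The sketch below is a deliberately simplistic model, not a measurement: it assumes one byte copied per lane per step, ignores memory effects and the overhead of crev itself, and uses hypothetical names (`kWarpSize`, `spmd_steps`, `crev_steps`).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr int kWarpSize = 8;  // hypothetical warp width for this model

// SPMD copy: each matching lane copies its own page, one byte per step, so the
// warp stays busy until the longest matching page is finished.
int spmd_steps(const std::vector<int>& lengths, const std::vector<bool>& match) {
    int steps = 0;
    for (std::size_t lane = 0; lane < lengths.size(); ++lane) {
        if (match[lane]) steps = std::max(steps, lengths[lane]);
    }
    return steps;
}

// crev copy: matching pages are handled one after another, each by the whole
// warp, so a page of n bytes takes ceil(n / kWarpSize) steps.
int crev_steps(const std::vector<int>& lengths, const std::vector<bool>& match) {
    int steps = 0;
    for (std::size_t lane = 0; lane < lengths.size(); ++lane) {
        if (match[lane]) steps += (lengths[lane] + kWarpSize - 1) / kWarpSize;
    }
    return steps;
}
```

With page lengths {64, 16, 40, 8, 8, 8, 8, 8} and only lanes 0 and 2 matching, this model gives 64 steps for the SPMD copy and 8 + 5 = 13 steps for the crev copy. When every lane matches and all pages have similar lengths, the two counts converge, so the benefit in this model comes from divergence.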


String matching

String matching is a simple set of pattern matching algorithm implementations. We have a naive parallel version; a crev-based implementation; one that uses ISPC's dynamic parallelism (launch); and a sequential implementation of the Knuth-Morris-Pratt algorithm. Our crev implementation is presented below.

struct String {
    uniform int length;
    int8 * uniform data;
};

// Matches the pattern with the input text from the given offset
void pattern_match(uniform String& text, uniform String& pattern,
    uniform int offset) {

    // pattern.data[0] was already compared by the caller, so start at index 1
    varying bool match = true;
    foreach (i = 1 ... pattern.length) {
        if (pattern.data[i] != text.data[i + offset]) match = false;
    }

    if (all(match)) print("match!\n");

}

// Finds the given pattern along the input text
export void String_match(uniform String& text, uniform String& pattern,
    int * uniform matches, uniform int& num_matches) {

    // For each character in the text
    for (uniform int i = 0; i < text.length; i += programCount) {

        // If we still have room for the whole pattern in the input string
        // and this character matches the initial character of the pattern
        // (the bounds check comes first to avoid reading past the text)
        varying int pos = i + programIndex;
        if ((pos + pattern.length) <= text.length
            && text.data[pos] == pattern.data[0]) {
            // Activate all threads to help decide whether there is a match
            crev pattern_match(text, pattern, pos);
        }

    }

}
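For comparison with the crev version above, the sequential Knuth-Morris-Pratt baseline mentioned earlier can be sketched as follows. This is a textbook formulation written for illustration, not the project's own code; the names `build_failure` and `kmp_search` are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// fail[i] = length of the longest proper prefix of pattern[0..i] that is also
// a suffix of pattern[0..i].
std::vector<int> build_failure(const std::string& pattern) {
    std::vector<int> fail(pattern.size(), 0);
    int k = 0;
    for (std::size_t i = 1; i < pattern.size(); ++i) {
        while (k > 0 && pattern[i] != pattern[k]) k = fail[k - 1];
        if (pattern[i] == pattern[k]) ++k;
        fail[i] = k;
    }
    return fail;
}

// Returns the start offsets of all (possibly overlapping) matches.
std::vector<int> kmp_search(const std::string& text, const std::string& pattern) {
    std::vector<int> matches;
    if (pattern.empty()) return matches;
    const std::vector<int> fail = build_failure(pattern);
    int k = 0;  // number of pattern characters matched so far
    for (std::size_t i = 0; i < text.size(); ++i) {
        while (k > 0 && text[i] != pattern[k]) k = fail[k - 1];
        if (text[i] == pattern[k]) ++k;
        if (k == static_cast<int>(pattern.size())) {
            matches.push_back(static_cast<int>(i) - k + 1);
            k = fail[k - 1];  // keep going to find overlapping matches
        }
    }
    return matches;
}
```

Unlike the crev version, which tests every candidate position with the whole warp, this baseline never re-reads a text character, so it is a useful sequential yardstick.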

Check out the performance of our approach in comparison to other possible implementations.


Bellman-Ford

Check out the performance of our approach in comparison to other possible implementations.


Depth-first traversal

Check out the performance of our approach in comparison to other possible implementations.


Connected-component leader

Check out the performance of our approach in comparison to other possible implementations.


Bitonic Quicksort

Check out the performance of our approach in comparison to other possible implementations.


Bitonic Mergesort

Check out the performance of our approach in comparison to other possible implementations.
