30) Parallel reductions with CUDA.jl#

Last time:

  • Memory management with CUDA.jl

  • Simple kernels using global memory

  • Simple kernels using shared/local memory

  • Instruction Level Parallelism

  • Bank Conflicts

Today:

  1. Outline

  2. Parallel reduction on the GPU

  3. Parallel reduction complexity

  4. Different strategies for optimization

1. Outline#

In this lecture, we will use an efficient parallel reduction on the GPU as an example to talk about:

  • communication across thread blocks

  • global synchronization problem across thread blocks

  • use of different memory pools for optimization strategies

Note

References:

  • We will loosely follow Optimizing Parallel Reduction in CUDA by Mark Harris. Note that some of our kernels will be different from those given in the talk, but the ideas are (mostly) the same.

2. Parallel reduction on the GPU#

We want to perform a Parallel Reduction operation on an array of size \(N\).

This is a common and important data-parallel primitive:

  • It is easy to implement in CUDA

  • But it is harder to get right, i.e., to make it run efficiently

  • We can use it as a great optimization example

Basic idea#

  • Tree-based approach used within each thread block

Figure: a tree-based approach for a parallel reduction
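To make the tree idea concrete, here is a minimal CPU sketch in plain Julia (not a GPU kernel, and assuming a power-of-two length for simplicity): each pass adds pairs in place, halving the number of active elements until a single value remains.

```julia
# Minimal CPU illustration of a tree reduction (not a GPU kernel).
# Assumes length(x) is a power of two.
function tree_sum!(x::Vector{Float64})
    n = length(x)
    while n > 1
        n ÷= 2
        for i in 1:n
            x[i] += x[i + n]   # one level of the tree; all n additions are independent
        end
    end
    return x[1]
end

tree_sum!([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])   # 36.0
```

On the GPU, the independent additions within one level are what the threads of a block perform in parallel.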

  • Need to be able to use multiple thread blocks

    • to process very large arrays

    • to keep all multiprocessors on the GPU busy

    • each thread block reduces a portion of the array

  • But how do we communicate partial results between thread blocks?

Problem: Global Synchronization#

  • If we could synchronize across all thread blocks, we could easily reduce very large arrays:

    • global sync after each block would produce its result

    • once all blocks reach sync, we would continue recursively

  • But CUDA has no global synchronization. Why?

    • It’s expensive to build in hardware for GPUs with high processor count

    • It would force programmers to run fewer blocks (no more than # multiprocessors \(\times\) # resident blocks / multiprocessor) to avoid deadlock, which may reduce overall efficiency

Solution: decompose into multiple kernels.

  • Kernel launch serves as a global synchronization point

  • Kernel launch has negligible hardware overhead and low software overhead

Solution: Kernel Decomposition#

  • We avoid global sync by decomposing the computation into multiple kernel invocations

Figure: a kernel decomposition for a parallel reduction on the GPU

  • In the case of reductions, the code for all levels is the same:

    • we have recursive kernel invocations
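As a sketch of what this looks like on the host side (assuming a hypothetical kernel `reduce_block!(out, x)` that writes one partial sum per block into `out`), each kernel launch acts as the global synchronization point between levels:

```julia
using CUDA

# Hypothetical driver: repeatedly launch `reduce_block!` until one value is left.
# Each launch is a global synchronization point across thread blocks.
function reduce_to_scalar(x::CuVector{Float32}; threads = 256)
    while length(x) > 1
        blocks = cld(length(x), threads)
        out = CUDA.zeros(Float32, blocks)
        @cuda threads=threads blocks=blocks reduce_block!(out, x)
        x = out                          # recurse on the vector of partial sums
    end
    return CUDA.@allowscalar x[1]
end
```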

Measuring success: what are we striving for?#

  • We should strive to reach GPU peak performance

  • But we need to choose the right metric to measure success:

    • GFLOP/s: for compute-bound kernels

    • Bandwidth: for memory-bound kernels

  • Reductions have very low arithmetic intensity

    • 1 flop per element loaded (bandwidth-optimal)

  • Therefore we are not compute-bound, hence we should strive for peak bandwidth
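As a rough sketch of how to estimate effective bandwidth (here using CUDA.jl's built-in `sum` as a stand-in for our own kernels), divide the bytes loaded by the measured GPU time:

```julia
using CUDA

N = 2^24
x = CUDA.rand(Float32, N)

sum(x)                              # warm-up (compilation)
t = CUDA.@elapsed sum(x)            # GPU time in seconds

bytes = N * sizeof(Float32)         # each element is loaded once
println("effective bandwidth ≈ ", bytes / t / 1e9, " GB/s")
```

The resulting number can then be compared against the device's theoretical peak memory bandwidth.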

3. Parallel reduction complexity#

  • \(\log(N)\) parallel steps, each step \(S\) does \(N/2^S\) independent operations

    • Step complexity is \(O(\log N)\)

  • For \(N=2^D\) (we are sticking to power-of-2 block sizes), the reduction performs \(\sum_{S=1}^{D} 2^{D-S} = N-1\) operations

    • Work complexity is \(O(N)\) - It is work-efficient, i.e., does not perform more operations than a sequential algorithm

  • With \(P\) threads physically in parallel (\(P\) processors), time complexity is \(O(\frac{N}{P} + \log N)\)

    • Compare to \(O(N)\) for sequential reduction

    • In a thread block, \(N=P\), so \(O(\log N)\)
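For example, with \(N = 2^3 = 8\) elements there are \(D = 3\) parallel steps performing

\[
2^{3-1} + 2^{3-2} + 2^{3-3} = 4 + 2 + 1 = 7 = N - 1
\]

additions, exactly the number a sequential sum would perform.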

4. Different strategies for optimization#

Basic implementation idea:#

Version 1#

Because there is no global synchronization, we need to launch the kernel multiple times, with each launch organized into work groups (thread blocks).

In each launch, the threads of a work group cooperate to reduce the group's portion of the numbers to a single summed value.

  • In the first iteration we launch, say, num_groups work groups to reduce a vector of length N to a vector of length num_groups.

  • We will then recurse and reduce the vector of length num_groups to a smaller vector, and continue until we only have 1 element in the partially reduced vector.
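A possible kernel for this first version is sketched below; it is an assumption, not necessarily the lecture's exact code, and it could serve as the `reduce_block!` placeholder in the earlier driver sketch. Each block tree-reduces its own chunk of `x` directly in global memory, with the partner offset doubling at every level:

```julia
using CUDA

# Version 1 (sketch): tree reduction in global memory, one partial sum per block.
# Note: this mutates `x` in place.
function reduce_v1!(out, x)
    tid = threadIdx().x                              # 1-based thread index in the block
    i   = (blockIdx().x - 1) * blockDim().x + tid    # global index into x
    stride = 1
    while stride < blockDim().x
        if (tid - 1) % (2stride) == 0 && i + stride <= length(x)
            @inbounds x[i] += x[i + stride]          # add the partner element
        end
        sync_threads()                               # finish this tree level
        stride *= 2
    end
    if tid == 1 && i <= length(x)
        @inbounds out[blockIdx().x] = x[i]           # block's partial sum
    end
    return nothing
end
```

The modulo test leaves only one thread in every \(2 \cdot \text{stride}\) active, causing divergent warps, and all traffic stays in global memory; these are the points the later versions improve.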

Version 2#

  • Load a work-group-sized chunk of the given array into shared memory

  • Use binary sum to reduce the values to a single number

  • Recurse until there is only one group
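A sketch of such a kernel follows (an assumption, not necessarily the lecture's exact kernel; the block size is hard-coded to 256 because `CuStaticSharedArray` needs a compile-time size):

```julia
using CUDA

# Version 2 (sketch): copy the block's chunk into shared memory, then tree-reduce there.
function reduce_v2!(out, x)
    tid = threadIdx().x
    i   = (blockIdx().x - 1) * blockDim().x + tid
    shmem = CuStaticSharedArray(Float32, 256)        # assumes blockDim().x == 256
    @inbounds shmem[tid] = i <= length(x) ? x[i] : 0.0f0
    sync_threads()
    stride = 1
    while stride < blockDim().x                      # same interleaved tree as before,
        if (tid - 1) % (2stride) == 0                # but now entirely in shared memory
            @inbounds shmem[tid] += shmem[tid + stride]
        end
        sync_threads()
        stride *= 2
    end
    if tid == 1
        @inbounds out[blockIdx().x] = shmem[1]       # block's partial sum
    end
    return nothing
end
```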

Version 3#

  • Use sequential addressing of shared memory

Figure: sequential-addressing reduction

  • This version with sequential addressing is conflict free (i.e., no bank conflicts)
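A sketch of the corresponding kernel (again an assumption, with the block size hard-coded to 256): the only change from Version 2 is the inner loop, which now starts with a large stride and halves it, so consecutive threads touch consecutive shared-memory addresses.

```julia
using CUDA

# Version 3 (sketch): same as Version 2, but with sequential addressing
# (stride halving) in shared memory, which avoids bank conflicts.
function reduce_v3!(out, x)
    tid = threadIdx().x
    i   = (blockIdx().x - 1) * blockDim().x + tid
    shmem = CuStaticSharedArray(Float32, 256)        # assumes blockDim().x == 256
    @inbounds shmem[tid] = i <= length(x) ? x[i] : 0.0f0
    sync_threads()
    stride = blockDim().x ÷ 2
    while stride > 0
        if tid <= stride
            @inbounds shmem[tid] += shmem[tid + stride]
        end
        sync_threads()
        stride ÷= 2
    end
    if tid == 1
        @inbounds out[blockIdx().x] = shmem[1]
    end
    return nothing
end
```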

Version 4#

  • Let each thread load OVERLAP numbers
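A sketch of what this could look like (an assumption; the `overlap` argument plays the role of the OVERLAP constant, and the block size is again hard-coded to 256): each thread first accumulates `overlap` elements from global memory into a register, and only then does the block perform the shared-memory tree reduction of Version 3. The launch correspondingly needs `overlap` times fewer blocks, e.g. `blocks = cld(length(x), 256 * overlap)`.

```julia
using CUDA

# Version 4 (sketch): each thread sums `overlap` elements from global memory
# before the block's shared-memory tree reduction (sequential addressing).
function reduce_v4!(out, x, overlap)
    tid      = threadIdx().x
    nthreads = blockDim().x
    shmem    = CuStaticSharedArray(Float32, 256)     # assumes blockDim().x == 256
    base = (blockIdx().x - 1) * nthreads * overlap   # start of this block's chunk
    acc = 0.0f0
    for k in 0:overlap-1
        i = base + k * nthreads + tid                # coalesced: consecutive threads read consecutive elements
        if i <= length(x)
            @inbounds acc += x[i]
        end
    end
    @inbounds shmem[tid] = acc
    sync_threads()
    stride = nthreads ÷ 2
    while stride > 0
        if tid <= stride
            @inbounds shmem[tid] += shmem[tid + stride]
        end
        sync_threads()
        stride ÷= 2
    end
    if tid == 1
        @inbounds out[blockIdx().x] = shmem[1]
    end
    return nothing
end
```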