30) Parallel reductions with CUDA.jl#
Last time:
Memory management with CUDA.jl
Simple kernels using global memory
Simple kernels using shared/local memory
Instruction Level Parallelism
Bank Conflicts
Today:
Outline
Parallel reduction on the GPU
Parallel reduction complexity
Different strategies for optimization
1. Outline#
In this lecture, we will use an efficient parallel reduction on the GPU as an example to talk about:
communication across thread blocks
global synchronization problem across thread blocks
use of different memory pools for optimization strategies
Note
References:
We will loosely follow Optimizing Parallel Reduction in CUDA by Mark Harris. Note that some of our kernels will differ from those given in the talk, but the ideas are (mostly) the same.
2. Parallel reduction on the GPU#
We want to perform a Parallel Reduction operation on an array of size \(N\).
This is a common and important data parallel primitive. It should be:
Easy to implement in CUDA
But often harder to get it right
We can use this as a great optimization example
Basic idea#
Tree-based approach used within each thread block
Need to be able to use multiple thread blocks
to process very large arrays
to keep all multiprocessors on the GPU busy
each thread block reduces a portion of the array
But how do we communicate partial results between thread blocks?
Problem: Global Synchronization#
If we could synchronize across all thread blocks, we could easily reduce very large arrays:
a global sync after each block produces its result
once all blocks reach the sync, we continue recursively
But CUDA has no global synchronization. Why?
It’s expensive to build in hardware for GPUs with high processor count
It would force programmers to run fewer blocks (no more than # multiprocessors \(\times\) # resident blocks / multiprocessor) to avoid deadlock, which may reduce overall efficiency
Solution: decompose into multiple kernels.
Kernel launch serves as a global synchronization point
Kernel launch has negligible hardware overhead and low software overhead
Solution: Kernel Decomposition#
We avoid global sync by decomposing the computation into multiple kernel invocations
In the case of reductions, the code for all levels is the same:
we have recursive kernel invocations
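To make this concrete, here is a minimal CUDA.jl sketch of the recursive-launch pattern. The kernel is deliberately naive (one thread per block does all the work); the versions in section 4 improve on it. Names like `block_sum!` and `reduce_gpu` are illustrative, not part of any library.

```julia
using CUDA

# Naive kernel: thread 1 of each block serially sums the block's chunk.
# (Deliberately simple; the versions in section 4 parallelize this.)
function block_sum!(out, a, n)
    if threadIdx().x == 1
        b = blockIdx().x
        lo = (b - 1) * blockDim().x + 1
        hi = min(b * blockDim().x, n)
        s = 0.0f0
        for i in lo:hi
            @inbounds s += a[i]
        end
        @inbounds out[b] = s
    end
    return nothing
end

# Each kernel launch acts as a global synchronization point: reduce N
# values to one partial sum per block, then recurse on the partials.
function reduce_gpu(a::CuVector{Float32})
    threads = 256
    while length(a) > 1
        blocks = cld(length(a), threads)
        out = CuArray{Float32}(undef, blocks)
        @cuda threads=threads blocks=blocks block_sum!(out, a, length(a))
        a = out
    end
    return Array(a)[1]
end
```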
Measuring success: what are we striving for?#
We should strive to reach GPU peak performance
But we need to choose the right metric to measure success:
GFLOP/s: for compute-bound kernels
Bandwidth: for memory-bound kernels
Reductions have very low arithmetic intensity
1 flop per element loaded (bandwidth-optimal)
Therefore we are not compute-bound, hence we should strive for peak bandwidth
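For example, one could estimate the effective bandwidth of a reduction in CUDA.jl along these lines (a sketch; here the built-in `sum` stands in for our own kernels):

```julia
using CUDA

N = 2^24
a = CUDA.rand(Float32, N)

# A reduction reads each of the N elements exactly once, so effective
# bandwidth ≈ bytes read / elapsed time.
t = CUDA.@elapsed sum(a)                 # GPU time in seconds
gbytes = N * sizeof(Float32) / 1e9
println("effective bandwidth ≈ ", round(gbytes / t; digits=1), " GB/s")
```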
3. Parallel reduction complexity#
\(\log(N)\) parallel steps, each step \(S\) does \(N/2^S\) independent operations
Step complexity is \(O(\log N)\)
For \(N=2^D\) (we are sticking to power-of-2 block sizes), the reduction performs \(\sum_{S=1}^{D} 2^{D-S} = N-1\) operations
Work complexity is \(O(N)\) - It is work-efficient, i.e., does not perform more operations than a sequential algorithm
With \(P\) threads physically in parallel (\(P\) processors), time complexity is \(O(\frac{N}{P} + \log N)\) (a short derivation follows this list)
Compare to \(O(N)\) for sequential reduction
In a thread block, \(N=P\), so \(O(\log N)\)
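A sketch of that derivation: with \(P\) processors, step \(S\) takes \(\lceil (N/2^S)/P \rceil\) time units, so the total over the \(\log_2 N\) steps is

\[
\sum_{S=1}^{\log_2 N} \left\lceil \frac{N}{2^S P} \right\rceil
\;\le\; \sum_{S=1}^{\log_2 N} \left( \frac{N}{2^S P} + 1 \right)
\;<\; \frac{N}{P} + \log_2 N .
\]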
4. Different strategies for optimization#
Basic implementation idea:#
Version 1#
Because there is no global synchronization, we need to launch the kernel multiple times.
In each launch, each work group will work together to reduce its portion of the numbers to a single summed value.
In the first iteration we launch, say, `num_groups` work groups to reduce a vector of length `N` to a vector of length `num_groups`. We will then recurse and reduce the vector of length `num_groups` to a smaller vector, and continue until we only have 1 element in the partially reduced vector.
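A sketch of what such a kernel could look like in CUDA.jl, using only global memory and an interleaved tree within each work group (an illustration under our assumptions, not necessarily the exact kernel from the talk):

```julia
using CUDA

# Version 1 sketch: a binary tree in global memory. Each block reduces
# its own chunk of `a` in place; thread 1 then writes the block's
# partial sum to `out`. sync_threads() is block-local, which is why the
# host must re-launch this kernel on `out` until one element remains.
function reduce_v1!(out, a, n)
    t = threadIdx().x
    i = (blockIdx().x - 1) * blockDim().x + t
    s = 1
    while s < blockDim().x
        if (t - 1) % (2s) == 0 && i + s <= n
            @inbounds a[i] += a[i + s]
        end
        sync_threads()
        s *= 2
    end
    if t == 1 && i <= n
        @inbounds out[blockIdx().x] = a[i]
    end
    return nothing
end
```

The host-side driver is the same recursive loop sketched in the kernel-decomposition section above.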
Version 2#
Load work-group-size elements of the given array into shared memory
Use a binary tree sum to reduce the values to a single number
Recurse until there is only one group
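A sketch of this version, assuming the same recursive driver. The interleaved tree now runs in shared memory, which is much faster than global memory but still suffers from divergence and bank conflicts, motivating Version 3:

```julia
using CUDA

# Version 2 sketch: stage the block's elements in shared memory, then
# run the interleaved tree there.
function reduce_v2!(out, a, n)
    t = threadIdx().x
    i = (blockIdx().x - 1) * blockDim().x + t
    shmem = CuDynamicSharedArray(Float32, blockDim().x)
    @inbounds shmem[t] = i <= n ? a[i] : 0.0f0   # pad the tail with zeros
    sync_threads()
    s = 1
    while s < blockDim().x
        if (t - 1) % (2s) == 0
            @inbounds shmem[t] += shmem[t + s]
        end
        sync_threads()
        s *= 2
    end
    if t == 1
        @inbounds out[blockIdx().x] = shmem[1]
    end
    return nothing
end

# The launch must size the dynamic shared memory, e.g.:
# @cuda threads=256 blocks=cld(n, 256) shmem=256*sizeof(Float32) reduce_v2!(out, a, n)
```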
Version 3#
Use sequential addressing of shared memory
This version with sequential addressing is conflict free (i.e., no bank conflicts)
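A sketch of the sequentially addressed tree (only the loop changes relative to Version 2): active threads stay contiguous and each reads `shmem[t + s]`, so within a warp the accesses hit distinct banks.

```julia
using CUDA

# Version 3 sketch: identical staging, but the tree halves a contiguous
# front of the shared array, avoiding shared-memory bank conflicts.
function reduce_v3!(out, a, n)
    t = threadIdx().x
    i = (blockIdx().x - 1) * blockDim().x + t
    shmem = CuDynamicSharedArray(Float32, blockDim().x)
    @inbounds shmem[t] = i <= n ? a[i] : 0.0f0
    sync_threads()
    s = blockDim().x ÷ 2
    while s > 0
        if t <= s
            @inbounds shmem[t] += shmem[t + s]
        end
        sync_threads()
        s ÷= 2
    end
    if t == 1
        @inbounds out[blockIdx().x] = shmem[1]
    end
    return nothing
end
```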
Version 4#
Let each thread load `OVERLAP` numbers
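A sketch of this idea: each thread sums `OVERLAP` strided elements from global memory while loading, so a factor of `OVERLAP` fewer blocks are launched and the shared-memory tree amortizes over more data. Passing `OVERLAP` as a compile-time constant via `Val` is an illustrative choice, not prescribed by the source.

```julia
using CUDA

# Version 4 sketch: accumulate OVERLAP elements per thread during the
# load, then finish with the sequential-addressing tree from Version 3.
function reduce_v4!(out, a, n, ::Val{OVERLAP}) where {OVERLAP}
    t = threadIdx().x
    base = (blockIdx().x - 1) * blockDim().x * OVERLAP
    acc = 0.0f0
    for k in 0:OVERLAP-1                  # strided, coalesced loads
        i = base + k * blockDim().x + t
        if i <= n
            @inbounds acc += a[i]
        end
    end
    shmem = CuDynamicSharedArray(Float32, blockDim().x)
    @inbounds shmem[t] = acc
    sync_threads()
    s = blockDim().x ÷ 2
    while s > 0
        if t <= s
            @inbounds shmem[t] += shmem[t + s]
        end
        sync_threads()
        s ÷= 2
    end
    if t == 1
        @inbounds out[blockIdx().x] = shmem[1]
    end
    return nothing
end

# e.g. @cuda threads=256 blocks=cld(n, 256*4) shmem=256*sizeof(Float32) reduce_v4!(out, a, n, Val(4))
```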