29) Memory management with CUDA.jl#
Last time:
Introduction to CUDA.jl
Today:
Outline
Problem statement
Simple kernels using global memory
Simple kernels using shared/local memory
Instruction Level Parallelism (both with and without local memory)
Bank Conflicts
1. Outline#
In this lecture, we will use efficient matrix transpose on the GPU as an example to illustrate:
global memory access (CuArray)
shared memory usage (@cuStaticSharedMem)
memory bank conflicts
instruction level parallelism
Note
References:
We will follow the outline of An Efficient Matrix Transpose in CUDA C/C++ by Mark Harris. Though the example is written in CUDA C, the strategy carries over directly to Julia.
We will also reference How to Access Global Memory Efficiently in CUDA C/C++ Kernels.
The code for the matrix transpose can be found in julia_codes/module8-2/transpose.jl
2. Problem statement:#
To do an efficient matrix transpose,
copy:
y[i, j] = x[i, j]
Coalesced memory access on both the read and the write.
transpose:
y[j, i] = x[i, j]
Coalesced memory access on the read but not the write, or coalesced access on the write but not the read, depending on which index follows the threads' fast dimension (see the short check below).
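Julia arrays are stored column-major, so consecutive values of the first index are adjacent in memory; a warp whose threadIdx().x runs along the first index therefore touches contiguous addresses. A quick CPU-side check (plain Julia, no GPU required):
x = reshape(1:12, 3, 4)                            # a 3 x 4 matrix
LinearIndices(x)[2, 1] - LinearIndices(x)[1, 1]    # 1: neighbors along a column are adjacent in memory
LinearIndices(x)[1, 2] - LinearIndices(x)[1, 1]    # 3: neighbors along a row are a whole column apart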
But before doing that, we need to talk about array indexing.
A typical GPU-porting workflow#
A typical approach for porting or developing an application for the GPU is as follows:
develop an application using generic array functionality, and test it on the CPU with the Array type
port your application to the GPU by switching to the CuArray type
disallow the CPU fallback (“scalar indexing”) to find operations that are not implemented for or incompatible with GPU execution
(optional) use lower-level, CUDA-specific interfaces to implement missing functionality or optimize performance
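A minimal sketch of this workflow (the helper saxpy! here is only an illustration, not part of the transpose code):
using CUDA

# Step 1: generic code, developed and tested on the CPU with Array
function saxpy!(y, a, x)
    y .= a .* x .+ y     # broadcasting works for Array and CuArray alike
    return y
end
x_cpu, y_cpu = rand(Float32, 1024), rand(Float32, 1024)
saxpy!(y_cpu, 2f0, x_cpu)

# Step 2: port to the GPU by switching the array type
x_gpu, y_gpu = CuArray(x_cpu), CuArray(y_cpu)
saxpy!(y_gpu, 2f0, x_gpu)

# Step 3: disallow the CPU fallback to catch unsupported operations
CUDA.allowscalar(false)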
Scalar indexing#
Many array operations in Julia are implemented using loops, processing one element at a time.
Doing so with GPU arrays is very inefficient, as the loop does not actually execute on the GPU; instead, each iteration transfers one element to the CPU and processes it there. As this wrecks performance, you will be warned when performing this kind of iteration with the following warning message:
using CUDA
a = CuArray([1])
a[1] += 1
┌ Warning: Performing scalar indexing on task Task (runnable) @0x0000794751ba07e0.
│ Invocation of getindex resulted in scalar indexing of a GPU array.
│ This is typically caused by calling an iterating implementation of a method.
│ Such implementations *do not* execute on the GPU, but very slowly on the CPU,
│ and therefore should be avoided.
│
│ If you want to allow scalar iteration, use `allowscalar` or `@allowscalar`
│ to enable scalar iteration globally or for the operations in question.
└ @ GPUArraysCore ~/.julia/packages/GPUArraysCore/aNaXo/src/GPUArraysCore.jl:145
2
Scalar indexing is only allowed in an interactive session, e.g. the REPL, because it is convenient when porting CPU code to the GPU and for debugging purposes.
If you want to disallow scalar indexing, e.g. to verify that your application executes correctly on the GPU, call the allowscalar
function:
CUDA.allowscalar(false)
a[1] .+ 1 # this will now throw an error
Scalar indexing is disallowed.
Invocation of getindex resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore should be avoided.
If you want to allow scalar iteration, use `allowscalar` or `@allowscalar`
to enable scalar iteration globally or for the operations in question.
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:35
[2] errorscalar(op::String)
@ GPUArraysCore ~/.julia/packages/GPUArraysCore/aNaXo/src/GPUArraysCore.jl:151
[3] _assertscalar(op::String, behavior::GPUArraysCore.ScalarIndexing)
@ GPUArraysCore ~/.julia/packages/GPUArraysCore/aNaXo/src/GPUArraysCore.jl:124
[4] assertscalar(op::String)
@ GPUArraysCore ~/.julia/packages/GPUArraysCore/aNaXo/src/GPUArraysCore.jl:112
[5] getindex(A::CuArray{Int64, 1, CUDA.DeviceMemory}, I::Int64)
@ GPUArrays ~/.julia/packages/GPUArrays/sBzM5/src/host/indexing.jl:50
[6] top-level scope
@ In[2]:3
# but this will work:
a .+ 1
In a non-interactive session, e.g. when running code from a script or application, scalar indexing is disallowed by default.
There is no global toggle to allow scalar indexing; if you really need it, you can mark expressions using allowscalar with do-block syntax or the @allowscalar macro:
a = CuArray([1])
CUDA.allowscalar() do
    a[1] += 1
end
a
CUDA.@allowscalar a[1] += 1
3. Simple kernels using global memory#
Recall that global memory resides in device memory and device memory is accessed via 32-, 64-, or 128-byte memory transactions. These memory transactions must be naturally aligned: Only the 32-, 64-, or 128-byte segments of device memory that are aligned to their size (i.e., whose first address is a multiple of their size) can be read or written by memory transactions.
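As a back-of-the-envelope check of why coalescing matters (assuming Float32 elements, 128-byte segments, and a 1024 x 1024 matrix; plain Julia arithmetic, no GPU required): a warp of 32 threads reading 32 consecutive elements touches a single aligned segment, while reading along a row of a column-major matrix touches 32 different segments, i.e. 32 separate memory transactions.
elsize, warp, N = sizeof(Float32), 32, 1024
coalesced = Set(div.((0:warp-1) .* elsize, 128))        # consecutive elements (down a column)
strided   = Set(div.((0:warp-1) .* N .* elsize, 128))   # stride-N elements (along a row)
length(coalesced), length(strided)                       # (1, 32) segments touched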
Examples of global memory accesses:
# Simple routines using global memory
function copy_naive!(b, a)
    N = size(a, 1)
    i = (blockIdx().x-1) * TILE_DIM + threadIdx().x
    j = (blockIdx().y-1) * TILE_DIM + threadIdx().y
    @inbounds if i <= N && j <= N
        # read and write both run down columns with the thread x-index: coalesced
        b[i, j] = a[i, j]
    end
    nothing
end

function transpose_naive!(b, a)
    N = size(a, 1)
    i = (blockIdx().x-1) * TILE_DIM + threadIdx().x
    j = (blockIdx().y-1) * TILE_DIM + threadIdx().y
    @inbounds if i <= N && j <= N
        # the read a[i, j] is coalesced, but the write b[j, i] is strided
        b[j, i] = a[i, j]
    end
    nothing
end
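These kernels assume a global constant TILE_DIM, which is set in transpose.jl. A minimal launch sketch, assuming for illustration TILE_DIM = 32 and a 1024 x 1024 matrix:
const TILE_DIM = 32

N = 1024
a = CUDA.rand(Float32, N, N)
b = similar(a)

threads = (TILE_DIM, TILE_DIM)                  # one thread per tile element
blocks  = (cld(N, TILE_DIM), cld(N, TILE_DIM))  # enough tiles to cover the matrix
@cuda threads=threads blocks=blocks transpose_naive!(b, a)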
4. Instruction Level Parallelism (both with and without local memory)#
Instruction level parallelism is the ability of the hardware to execute, in parallel, multiple independent instructions from the same thread; a floating point operation takes roughly 4 cycles to complete, so independent work can be issued while earlier results are still in flight.
Recall that NVIDIA architectures follow the “Single Instruction, Multiple Threads” (SIMT) paradigm to achieve instruction level parallelism.
Ideas:
# Non-parallelizable instructions
x = a * b * c
# stall: y depends on x, so it must wait for x to finish
y = x * x

# Instruction level parallelizable
x = a * b * c
y = a + b + c  # independent of x, can be issued while x is still in flight
# no stall until here, where both results are needed
z = x * y
This applies to memory access too! If we issue a memory load, the thread only stalls once the loaded value is actually needed:
loc = glo_x[n]
#=
lots of work not involving loc...
=#
# stall until the memory access is done
loc = loc * loc
To see how this plays out in practice on the GPU, see this forum thread as an example.
How do we use all of this for the transpose? Have one thread issue several reads and writes to global and shared memory. (In this case, I think the performance boost comes from:)
(1) fewer registers being used, and thus more threads being scheduled, and
(2) the scheduler on this card issuing a few extra loads at the same time.
Example:
#=
Tiled routines using global memory:
Idea here is that we break up the matrix into blocks of size
[TILE_DIM, TILE_DIM]
We then copy the data over in chunks of size
[TILE_DIM, BLOCK_ROWS]
so each thread block loops over TILE_DIM / BLOCK_ROWS chunks. For example, if TILE_DIM = 8 and BLOCK_ROWS = 2, then the data is copied in this order (where the number indicates the iteration of the for loop that handles that entry):
0 0 1 1 2 2 3 3
0 0 1 1 2 2 3 3
0 0 1 1 2 2 3 3
0 0 1 1 2 2 3 3
0 0 1 1 2 2 3 3
0 0 1 1 2 2 3 3
0 0 1 1 2 2 3 3
0 0 1 1 2 2 3 3
=#
function copy_tiled!(b, a)
    N = size(a, 1)
    i = (blockIdx().x-1) * TILE_DIM + threadIdx().x
    j = (blockIdx().y-1) * TILE_DIM + threadIdx().y
    # each thread copies TILE_DIM / BLOCK_ROWS elements: independent loads and
    # stores that the hardware can overlap (instruction level parallelism)
    @inbounds for k = 0:BLOCK_ROWS:TILE_DIM-1
        if i <= N && (j+k) <= N
            b[i, j+k] = a[i, j+k]
        end
    end
    nothing
end

function transpose_tiled!(b, a)
    N = size(a, 1)
    i = (blockIdx().x-1) * TILE_DIM + threadIdx().x
    j = (blockIdx().y-1) * TILE_DIM + threadIdx().y
    @inbounds for k = 0:BLOCK_ROWS:TILE_DIM-1
        if i <= N && (j+k) <= N
            # coalesced read, strided write (as in the naive transpose)
            b[j+k, i] = a[i, j+k]
        end
    end
    nothing
end
#=
Tiled routines using shared memory:
Combine the tiling idea with shared memory
=#
function copy_tiled_shared!(b, a)
    N = size(a, 1)
    tidx, tidy = threadIdx().x, threadIdx().y
    i = (blockIdx().x-1) * TILE_DIM + tidx
    j = (blockIdx().y-1) * TILE_DIM + tidy
    # stage the tile through shared memory
    tile = @cuStaticSharedMem(eltype(a), (TILE_DIM, TILE_DIM))
    @inbounds for k = 0:BLOCK_ROWS:TILE_DIM-1
        if i <= N && (j+k) <= N
            tile[tidx, tidy+k] = a[i, j+k]
        end
    end
    # make sure the whole tile is loaded before any thread reads it back
    sync_threads()
    @inbounds for k = 0:BLOCK_ROWS:TILE_DIM-1
        if i <= N && (j+k) <= N
            b[i, j+k] = tile[tidx, tidy+k]
        end
    end
    nothing
end
function transpose_tiled_shared!(b, a)
    N = size(a, 1)
    tidx, tidy = threadIdx().x, threadIdx().y
    i = (blockIdx().x-1) * TILE_DIM + tidx
    j = (blockIdx().y-1) * TILE_DIM + tidy
    tile = @cuStaticSharedMem(eltype(a), (TILE_DIM, TILE_DIM))
    # coalesced read of the tile from global memory
    @inbounds for k = 0:BLOCK_ROWS:TILE_DIM-1
        if i <= N && (j+k) <= N
            tile[tidx, tidy+k] = a[i, j+k]
        end
    end
    sync_threads()
    # swap the block indices so the write to b is also coalesced;
    # the transposition happens while reading the shared-memory tile
    i = (blockIdx().y-1) * TILE_DIM + tidx
    j = (blockIdx().x-1) * TILE_DIM + tidy
    @inbounds for k = 0:BLOCK_ROWS:TILE_DIM-1
        if i <= N && (j+k) <= N
            b[i, j+k] = tile[tidy+k, tidx]
        end
    end
    nothing
end
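Unlike the naive kernels, the tiled kernels use a (TILE_DIM, BLOCK_ROWS) thread block, so each thread handles TILE_DIM / BLOCK_ROWS elements. A minimal launch sketch, again with illustrative values (TILE_DIM = 32, BLOCK_ROWS = 8; transpose.jl sets its own):
const TILE_DIM   = 32
const BLOCK_ROWS = 8

N = 1024
a = CUDA.rand(Float32, N, N)
b = similar(a)

threads = (TILE_DIM, BLOCK_ROWS)
blocks  = (cld(N, TILE_DIM), cld(N, TILE_DIM))
@cuda threads=threads blocks=blocks transpose_tiled_shared!(b, a)
@assert Array(b) == transpose(Array(a))   # sanity check against the CPU transpose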
5. Bank Conflicts#
Shared/local memory on most (perhaps all?) GPUs is organized into memory banks (16 in the example below; most modern NVIDIA GPUs have 32), where each bank is 4 bytes wide (on NVIDIA Kepler cards this can be changed to 8 bytes inside of CUDA, but the default is 4 bytes).
Bank | 0               | 1               | 2               | ... | 15
Byte | 0   1   2   3   | 4   5   6   7   | 8   9   10  11  | ... | 60  61  62  63
     | 64  65  66  67  | 68  69  70  71  | 72  73  74  75  | ... | 124 125 126 127
     | 128 129 130 131 | 132 133 134 135 | 136 137 138 139 | ... | 188 189 190 191
Each bank can deliver 32 bits (4 bytes) per clock cycle.
The limitation is that if two or more threads in a warp (a group of 32 threads on NVIDIA cards) access different words in the same bank, a bank conflict occurs and the accesses are serialized (i.e., they take more than one shared-memory transaction), slowing the memory access down.
Since it is optimal to have our work groups be multiples of 32 threads, it is also optimal to make sure the stride of our shared memory accesses is not a multiple of 32; thus we pad the fastest dimension of the shared-memory tile by one element so that two threads in a warp do not access the same bank.
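A quick way to see the conflict (assuming 32 banks of 4-byte words, Float32 data, and TILE_DIM = 32 so that a warp spans tidx = 1:32; the helper bank is just for illustration): in a column-major tile with leading dimension ld, element (r, c) lives in bank mod((r-1) + (c-1)*ld, 32). The transposed read tile[tidy+k, tidx] walks along a row of the tile, so with ld = 32 every thread in the warp hits the same bank, while padding to ld = 33 spreads the warp over all 32 banks:
bank(r, c, ld) = mod((r - 1) + (c - 1) * ld, 32)   # bank of element (r, c)
length(Set(bank.(1, 1:32, 32)))   # ld = TILE_DIM:   1 bank  -> 32-way conflict
length(Set(bank.(1, 1:32, 33)))   # ld = TILE_DIM+1: 32 banks -> no conflict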
Example:
#=
Tiled routines using shared memory:
Combine the tiling idea with shared memory, with no bank conflicts
=#
function transpose_tiled_shared_noconflicts!(b, a)
    N = size(a, 1)
    tidx, tidy = threadIdx().x, threadIdx().y
    i = (blockIdx().x-1) * TILE_DIM + tidx
    j = (blockIdx().y-1) * TILE_DIM + tidy
    # note the padded first dimension: TILE_DIM+1 shifts each column by one bank
    tile = @cuStaticSharedMem(eltype(a), (TILE_DIM+1, TILE_DIM))
    @inbounds for k = 0:BLOCK_ROWS:TILE_DIM-1
        if i <= N && (j+k) <= N
            tile[tidx, tidy+k] = a[i, j+k]
        end
    end
    sync_threads()
    i = (blockIdx().y-1) * TILE_DIM + tidx
    j = (blockIdx().x-1) * TILE_DIM + tidy
    @inbounds for k = 0:BLOCK_ROWS:TILE_DIM-1
        if i <= N && (j+k) <= N
            # the row-wise tile read no longer maps to a single bank
            b[i, j+k] = tile[tidy+k, tidx]
        end
    end
    nothing
end
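A rough way to compare the variants, continuing with the arrays and constants from the launch sketches above (a sketch only: the effective bandwidth depends on the card, and the first call includes compilation, so each kernel is warmed up before timing; the helper bench is illustrative):
function bench(kernel!, b, a; threads, blocks)
    @cuda threads=threads blocks=blocks kernel!(b, a)   # warm-up / compilation
    t = @elapsed CUDA.@sync @cuda threads=threads blocks=blocks kernel!(b, a)
    2 * sizeof(a) / t / 1e9                             # effective GB/s (one read + one write)
end

blk = (cld(N, TILE_DIM), cld(N, TILE_DIM))
bench(transpose_naive!, b, a; threads=(TILE_DIM, TILE_DIM), blocks=blk)
bench(transpose_tiled_shared!, b, a; threads=(TILE_DIM, BLOCK_ROWS), blocks=blk)
bench(transpose_tiled_shared_noconflicts!, b, a; threads=(TILE_DIM, BLOCK_ROWS), blocks=blk)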