29) Memory management with CUDA.jl#

Last time:

  • Introduction to CUDA.jl

Today:

  1. Outline

  2. Problem statement

  3. Simple kernels using global memory

  4. Simple kernels using shared/local memory

  5. Instruction Level Parallelism (both with and without local memory)

  6. Bank Conflicts

1. Outline#

In this lecture, we will use efficient matrix transpose on the GPU as an example to illustrate:

  • global memory access (CuArrays)

  • shared memory usage (@cuStaticSharedMem)

  • memory bank conflicts

  • instruction level parallelism

Note

References:

The code for the matrix transpose can be found in julia_codes/module8-2/transpose.jl

2. Problem statement#

To do an efficient matrix transpose,

\[ Y = X^T, \]
on the GPU, we will be looking at two operations:

  • copy: y[i, j] = x[i, j]

    • Coalesced memory access on the read and write

  • transpose: y[j, i] = x[i, j]

    • Coalesced memory access on the read but not the write

    • Coalesced memory access on the write but not the read

But before doing that, we need to talk about array indexing.

A typical GPU-porting workflow#

A typical approach for porting or developing an application for the GPU is as follows:

  1. develop an application using generic array functionality, and test it on the CPU with the Array type

  2. port your application to the GPU by switching to the CuArray type

  3. disallow the CPU fallback (“scalar indexing”) to find operations that are not implemented for or incompatible with GPU execution

  4. (optional) use lower-level, CUDA-specific interfaces to implement missing functionality or optimize performance
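For instance, the first three steps might look like the following minimal sketch (the function scale_columns! is a hypothetical example for illustration, not part of transpose.jl):

using CUDA

# Step 1: generic code, developed and tested on the CPU with Array
scale_columns!(y, x, c) = (y .= x .* c'; y)    # hypothetical example function

x = rand(Float32, 4, 4)
y = similar(x)
scale_columns!(y, x, rand(Float32, 4))

# Step 2: port to the GPU by switching to CuArray; the broadcast now runs on the device
xd = CuArray(x)
yd = similar(xd)
scale_columns!(yd, xd, CuArray(rand(Float32, 4)))

# Step 3: disallow the CPU fallback ("scalar indexing") to catch unsupported operations
CUDA.allowscalar(false)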

Scalar indexing#

Many array operations in Julia are implemented using loops, processing one element at a time.

Doing so with GPU arrays is very inefficient, as the loop won't actually execute on the GPU; instead, each element is transferred to the CPU and processed there one at a time. Since this wrecks performance, you will be warned when performing this kind of iteration with the following message:

using CUDA

a = CuArray([1])

a[1] += 1
┌ Warning: Performing scalar indexing on task Task (runnable) @0x0000794751ba07e0.
Invocation of getindex resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore should be avoided.

If you want to allow scalar iteration, use `allowscalar` or `@allowscalar`
to enable scalar iteration globally or for the operations in question.
@ GPUArraysCore ~/.julia/packages/GPUArraysCore/aNaXo/src/GPUArraysCore.jl:145
2

Scalar indexing is only allowed in an interactive session, e.g. the REPL, because it is convenient when porting CPU code to the GPU and for debugging purposes.

If you want to disallow scalar indexing, e.g. to verify that your application executes correctly on the GPU, call the allowscalar function:

CUDA.allowscalar(false)

a[1] .+ 1 # this will error again
Scalar indexing is disallowed.
Invocation of getindex resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore should be avoided.

If you want to allow scalar iteration, use `allowscalar` or `@allowscalar`
to enable scalar iteration globally or for the operations in question.

Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] errorscalar(op::String)
   @ GPUArraysCore ~/.julia/packages/GPUArraysCore/aNaXo/src/GPUArraysCore.jl:151
 [3] _assertscalar(op::String, behavior::GPUArraysCore.ScalarIndexing)
   @ GPUArraysCore ~/.julia/packages/GPUArraysCore/aNaXo/src/GPUArraysCore.jl:124
 [4] assertscalar(op::String)
   @ GPUArraysCore ~/.julia/packages/GPUArraysCore/aNaXo/src/GPUArraysCore.jl:112
 [5] getindex(A::CuArray{Int64, 1, CUDA.DeviceMemory}, I::Int64)
   @ GPUArrays ~/.julia/packages/GPUArrays/sBzM5/src/host/indexing.jl:50
 [6] top-level scope
   @ In[2]:3
# but this will work:

a .+ 1

In a non-interactive session, e.g. when running code from a script or application, scalar indexing is disallowed by default.

There is no global toggle to allow scalar indexing; if you really need it, you can mark expressions using the allowscalar function with do-block syntax or the @allowscalar macro:

a = CuArray([1])

CUDA.allowscalar() do
    a[1] += 1
end

a
CUDA.@allowscalar a[1] += 1

3. Simple kernels using global memory#

Recall that global memory resides in device memory and device memory is accessed via 32-, 64-, or 128-byte memory transactions. These memory transactions must be naturally aligned: Only the 32-, 64-, or 128-byte segments of device memory that are aligned to their size (i.e., whose first address is a multiple of their size) can be read or written by memory transactions.

Examples of global memory accesses:

# Simple routines using global memory
function copy_naive!(b, a)
  N = size(a, 1)
  i = (blockIdx().x-1) * TILE_DIM + threadIdx().x
  j = (blockIdx().y-1) * TILE_DIM + threadIdx().y

  @inbounds if i <= N && j <= N
    b[i, j] = a[i, j]
  end
  nothing
end

function transpose_naive!(b, a)
  N = size(a, 1)
  i = (blockIdx().x-1) * TILE_DIM + threadIdx().x
  j = (blockIdx().y-1) * TILE_DIM + threadIdx().y

  @inbounds if i <= N && j <= N
    b[j, i] = a[i, j]
  end
  nothing
end
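These kernels can be launched like any other CUDA.jl kernel. A minimal launch sketch follows (assuming a square matrix and a global constant TILE_DIM = 32; the values below are assumptions, not taken from transpose.jl):

using CUDA

const TILE_DIM = 32    # assumed tile size; must be defined before the kernels are compiled

N = 1024
a = CUDA.rand(Float64, N, N)
b = similar(a)

# one thread per element of a TILE_DIM × TILE_DIM tile
threads = (TILE_DIM, TILE_DIM)
blocks  = (cld(N, TILE_DIM), cld(N, TILE_DIM))

@cuda threads=threads blocks=blocks transpose_naive!(b, a)
@assert Array(b) == Array(a)'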

4. Simple kernels using shared/local memory#

CUDA allows us to allocate shared memory which all threads in a thread block can access.

Shared memory is a fast, on-chip memory that is accessible by all threads within a single CUDA thread block. It’s designed to facilitate efficient data sharing between threads within a block.

By using shared memory we can make both the reads from and the writes to global memory coalesced (accesses to shared memory are instead subject to bank conflicts: see Bank Conflicts below).

The syntax for static shared memory is @cuStaticSharedMem, and for convenience it is indexed as a two-dimensional array:

tile = @cuStaticSharedMem(Float64, (SIZE_X, SIZE_Y))

where the first index is the contiguous/fastest index, e.g.,

tile[i, j] -> loc_x[i + (j-1) * SIZE_X]

  • Access to shared memory on the GPU is faster than global memory (though not as fast as registers).

Note: The array dimensions SIZE_X and SIZE_Y must be known at compile time, so they must be either global constants or passed into the kernel through a Val (recall that we saw this in lecture 8); see the documentation ?Val for how to pass constants to functions at compile time.
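As a minimal sketch of the Val approach (this variant is written here for illustration; the kernels in transpose.jl use a global constant TILE_DIM instead):

using CUDA

# The tile size arrives as a compile-time constant through Val, so the
# shared-memory dimensions below are known when the kernel is compiled.
function copy_shared_val!(b, a, ::Val{TDIM}) where {TDIM}
  N = size(a, 1)
  tidx, tidy = threadIdx().x, threadIdx().y
  i = (blockIdx().x-1) * TDIM + tidx
  j = (blockIdx().y-1) * TDIM + tidy
  tile = @cuStaticSharedMem(eltype(a), (TDIM, TDIM))

  @inbounds if i <= N && j <= N
    tile[tidx, tidy] = a[i, j]
  end
  sync_threads()
  @inbounds if i <= N && j <= N
    b[i, j] = tile[tidx, tidy]
  end
  nothing
end

# launch with the tile size baked into the compiled kernel:
# @cuda threads=(32, 32) blocks=(cld(N, 32), cld(N, 32)) copy_shared_val!(b, a, Val(32))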

Strategy: first load data from global memory into shared memory, then write out from shared memory to global memory (so that both the reads from and the writes to global memory are coalesced).

#=
Simple routines using shared memory
Idea is that we first load a patch of x into shared memory then write out from
shared memory
=#
function copy_shared!(b, a)
  N = size(a, 1)
  tidx, tidy = threadIdx().x, threadIdx().y
  i = (blockIdx().x-1) * TILE_DIM + tidx
  j = (blockIdx().y-1) * TILE_DIM + tidy
  tile = @cuStaticSharedMem(eltype(a), (TILE_DIM, TILE_DIM))

  @inbounds if i <= N && j <= N
    tile[tidx, tidy] = a[i, j]
  end

  sync_threads()

  @inbounds if i <= N && j <= N
    b[i, j] = tile[tidx, tidy]
  end

  nothing
end

function transpose_shared!(b, a)
  N = size(a, 1)
  tidx, tidy = threadIdx().x, threadIdx().y
  bidx, bidy = blockIdx().x, blockIdx().y
  i = (bidx-1) * TILE_DIM + tidx
  j = (bidy-1) * TILE_DIM + tidy
  tile = @cuStaticSharedMem(eltype(a), (TILE_DIM, TILE_DIM))

  @inbounds if i <= N && j <= N
    tile[tidx, tidy] = a[i, j]
  end

  sync_threads()

  i = (bidy-1) * TILE_DIM + tidx
  j = (bidx-1) * TILE_DIM + tidy

  @inbounds if i <= N && j <= N
    b[i, j] = tile[tidy, tidx]
  end

  nothing
end

5. Instruction Level Parallelism (both with and without local memory)#

Instruction level parallelism is the ability of the hardware to execute, in parallel, multiple independent instructions from the same thread; for example, a floating point operation takes ~4 cycles to complete, and independent instructions can be issued while it is in flight.

Ideas:

# Non-parallel instructions
x = a * b * c;
# stall: y depends on x
y = x * x;

# Instruction level parallelizable
x = a * b * c;
y = a + b + c; # independent of x
# no stall
z = x * y

This applies to memory accesses too! When a thread issues a load, it only stalls once the loaded value is actually needed:

loc = glo_x[n];

#=
lots of work not involving loc...
=#

# stall until memory access is done
loc = loc * loc;

To see how this plays out in practice on the GPU, see this forum thread as an example.

How do we use all of this for the transpose? Have each thread issue several reads from and writes to global and shared memory. (In this case I think the performance boost comes from:)

  • (1) fewer registers being used, and thus more threads being scheduled, and

  • (2) the scheduling on this card being such that a few extra loads are issued at the same time.

Example:

#=
Tiled routines using global memory:
Idea here is that we break up the matrix into blocks of size

   [TILE_DIM, TILE_DIM]

We then copy the data over in chunks of size

   [TILE_DIM, BLOCK_ROWS]

so we have TILE_DIM / BLOCK_ROWS chunks to loop over. For example, if TILE_DIM = 8 and BLOCK_ROWS = 2, then the data is copied in this order in the transposed matrix (where the number indicates the iteration of the for loop that fills the data)

  0 0 1 1 2 2 3 3
  0 0 1 1 2 2 3 3
  0 0 1 1 2 2 3 3
  0 0 1 1 2 2 3 3
  0 0 1 1 2 2 3 3
  0 0 1 1 2 2 3 3
  0 0 1 1 2 2 3 3
  0 0 1 1 2 2 3 3
=#

function copy_tiled!(b, a)
  N = size(a, 1)
  i = (blockIdx().x-1) * TILE_DIM + threadIdx().x
  j = (blockIdx().y-1) * TILE_DIM + threadIdx().y

  @inbounds for k = 0:BLOCK_ROWS:TILE_DIM-1
    if i <= N && (j+k) <= N
      b[i, j+k] = a[i, j+k]
    end
  end
end

function transpose_tiled!(b, a)
  N = size(a, 1)
  i = (blockIdx().x-1) * TILE_DIM + threadIdx().x
  j = (blockIdx().y-1) * TILE_DIM + threadIdx().y

  @inbounds for k = 0:BLOCK_ROWS:TILE_DIM-1
    if i <= N && (j+k) <= N
      b[j+k, i] = a[i, j+k]
    end
  end
end
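Note the launch geometry changes for the tiled kernels: each block now has TILE_DIM × BLOCK_ROWS threads and every thread handles TILE_DIM ÷ BLOCK_ROWS elements. A minimal launch sketch (the constant values are assumptions, not taken from transpose.jl):

using CUDA

const TILE_DIM   = 32   # assumed values; must be defined before the kernels are compiled
const BLOCK_ROWS = 8

N = 1024
a = CUDA.rand(Float64, N, N)
b = similar(a)

# each thread loops over TILE_DIM ÷ BLOCK_ROWS = 4 chunks
threads = (TILE_DIM, BLOCK_ROWS)
blocks  = (cld(N, TILE_DIM), cld(N, TILE_DIM))

@cuda threads=threads blocks=blocks transpose_tiled!(b, a)
@assert Array(b) == Array(a)'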

#=
Tiled routines using shared memory:
Combine the tiling idea with shared memory
=#
function copy_tiled_shared!(b, a)
  N = size(a, 1)
  tidx, tidy = threadIdx().x, threadIdx().y
  i = (blockIdx().x-1) * TILE_DIM + tidx
  j = (blockIdx().y-1) * TILE_DIM + tidy

  tile = @cuStaticSharedMem(eltype(a), (TILE_DIM, TILE_DIM))

  @inbounds for k = 0:BLOCK_ROWS:TILE_DIM-1
    if i <= N && (j+k) <= N
      tile[tidx, tidy+k] = a[i, j+k]
    end
  end

  sync_threads()

  @inbounds for k = 0:BLOCK_ROWS:TILE_DIM-1
    if i <= N && (j+k) <= N
      b[i, j+k] = tile[tidx, tidy+k]
    end
  end

  nothing
end

function transpose_tiled_shared!(b, a)
  N = size(a, 1)
  tidx, tidy = threadIdx().x, threadIdx().y
  i = (blockIdx().x-1) * TILE_DIM + tidx
  j = (blockIdx().y-1) * TILE_DIM + tidy

  tile = @cuStaticSharedMem(eltype(a), (TILE_DIM, TILE_DIM))

  @inbounds for k = 0:BLOCK_ROWS:TILE_DIM-1
    if i <= N && (j+k) <= N
      tile[tidx, tidy+k] = a[i, j+k]
    end
  end

  sync_threads()

  i = (blockIdx().y-1) * TILE_DIM + tidx
  j = (blockIdx().x-1) * TILE_DIM + tidy
  @inbounds for k = 0:BLOCK_ROWS:TILE_DIM-1
    if i <= N && (j+k) <= N
      b[i, j+k] = tile[tidy+k, tidx]
    end
  end

  nothing
end

6. Bank Conflicts#

Shared (local) memory on most (perhaps all?) GPUs is organized into memory banks (16 in the example below; on most modern GPUs there are 32), where each bank is 4 bytes wide (on NVIDIA Kepler cards this can be changed to 8 bytes inside CUDA, but the default is 4 bytes).

Bank |        0        |        1        |        2        | ... |       15
Byte |   0   1   2   3 |   4   5   6   7 |   8   9  10  11 | ... |  60  61  62  63
     |  64  65  66  67 |  68  69  70  71 |  72  73  74  75 | ... | 124 125 126 127
     | 128 129 130 131 | 132 133 134 135 | 136 137 138 139 | ... | 188 189 190 191

Each bank has a bandwidth of 32 bits (4 bytes) per clock cycle, and successive 4-byte words of shared memory map to successive banks.

The limitation is that if two threads in a warp (a group of 32 threads on NVIDIA cards) access different locations in the same bank, the accesses are serialized, i.e., they take more than one memory transaction. This is called a bank conflict, and it carries a performance penalty because the conflicting accesses must be performed one after another, slowing down the memory access.

Since it is optimal to have our work groups be multiples of 32 threads, it is also optimal to make sure the stride of our shared memory accesses is not a multiple of 32; thus we pad the fastest dimension of the shared memory tile to be a little larger (TILE_DIM+1 in the example below) so that two threads in a warp do not access the same bank.
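To see why padding helps, consider which bank each element of a column-major tile of 4-byte elements lands in (a plain-Julia sketch, assuming 32 banks of 4-byte words):

# bank of element (i, j) in a column-major tile with the given stride (in elements)
bank(i, j; stride) = mod(i - 1 + (j - 1) * stride, 32)

# unpadded 32-wide tile: reading across a row (fixed i, varying j) hits one bank 32 times
[bank(1, j; stride = 32) for j in 1:32]   # all zero -> a 32-way bank conflict

# padded to stride 33: the same accesses now hit 32 different banks
[bank(1, j; stride = 33) for j in 1:32]   # 0, 1, 2, ..., 31 -> no conflict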

Example:

#=
Tiled routines using shared memory:
Combine the tiling idea with shared memory, with no bank conflicts
=#
function transpose_tiled_shared_noconflicts!(b, a)
  N = size(a, 1)
  tidx, tidy = threadIdx().x, threadIdx().y
  i = (blockIdx().x-1) * TILE_DIM + tidx
  j = (blockIdx().y-1) * TILE_DIM + tidy

  tile = @cuStaticSharedMem(eltype(a), (TILE_DIM+1, TILE_DIM)) # note the padded first dimension, TILE_DIM+1

  @inbounds for k = 0:BLOCK_ROWS:TILE_DIM-1
    if i <= N && (j+k) <= N
      tile[tidx, tidy+k] = a[i, j+k]
    end
  end

  sync_threads()

  i = (blockIdx().y-1) * TILE_DIM + tidx
  j = (blockIdx().x-1) * TILE_DIM + tidy
  @inbounds for k = 0:BLOCK_ROWS:TILE_DIM-1
    if i <= N && (j+k) <= N
      b[i, j+k] = tile[tidy+k, tidx]
    end
  end

  nothing
end
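To compare the variants, here is a minimal timing sketch using CUDA.@elapsed (assuming the kernels and the constants TILE_DIM and BLOCK_ROWS above are already defined; the returned number is an effective bandwidth in GB/s):

using CUDA

function bench(kernel!, b, a; threads, blocks)
  CUDA.@sync @cuda threads=threads blocks=blocks kernel!(b, a)        # warm-up / compile
  t = CUDA.@elapsed @cuda threads=threads blocks=blocks kernel!(b, a)
  2 * sizeof(a) / t / 1e9                                             # one read + one write of the matrix
end

N = 4096
a = CUDA.rand(Float32, N, N)
b = similar(a)
blocks = (cld(N, TILE_DIM), cld(N, TILE_DIM))

bench(transpose_tiled_shared!,             b, a; threads = (TILE_DIM, BLOCK_ROWS), blocks)
bench(transpose_tiled_shared_noconflicts!, b, a; threads = (TILE_DIM, BLOCK_ROWS), blocks)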