28) Intro to GPU programming in Julia#

Last time:

  • Practical CUDA

  • Memory

  • tuckoo demo for CUDA codes

Today:

  1. Intro to GPU programming in Julia (CUDA.jl)

  2. CUDA Julia code on the tuckoo cluster
    2.1 Launching a CUDA.jl code with SLURM

  3. Parallel CUDA Julia

1. Intro to GPU programming in Julia (CUDA.jl)#

For GPU programming there are many great resources. One that I may refer to is the CUDA.jl documentation (https://cuda.juliagpu.org).

Add vectors#

The real “hello world” of the GPU is adding vectors:

\[ C = A + B \]

Big ideas:

  • threads

  • blocks

  • how to launch and time kernels

  • off-loading memory to the device

Example with CUDA.jl#

To run this and the following Julia CUDA examples, you first need to add CUDA.jl to your environment. That is, start Julia and run the following in the REPL:

using Pkg
Pkg.add("CUDA")

Then you can execute the following script:

using CUDA

a = CuArray(rand(100)) # allocate on the GPU and upload random host data
b = CuArray(rand(100))
c = a + b              # the addition runs on the GPU
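
To inspect the result on the host, convert it back with Array(c); the next subsection looks at this host-device data movement in more detail.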

Memory management#

As we have seen so far, a crucial aspect of working with a GPU is managing the data on it.

The CuArray type is the primary interface for doing so: Creating a CuArray will allocate data on the GPU, copying elements to it will upload, and converting back to an Array will download values to the CPU. Let’s see it in an example test.

Note: to run the following test, you first need to also add the Test.jl package to your environment:

using Pkg
Pkg.add("Test")

Then you can run the following test example:

using CUDA
using Test # to be able to use the @test macro

@testset "CopyCuArray" begin
    print("Begin test. \n")

    # generate some data on the CPU (host array, h_array)
    h_array = rand(Float32, 1024)

    # allocate on the GPU (device array, d_array)
    d_array = CuArray{Float32}(undef, 1024)

    # copy from the CPU to the GPU
    copyto!(d_array, h_array)

    # download/transfer back to the CPU and verify using the Array casting of the device array
    @test h_array == Array(d_array)
    print("Test ended. \n")
end

Observation on garbage collection:

  • One striking difference between a native C CUDA implementation and the Julia CUDA.jl interface is that instances of the CuArray type are managed by the Julia garbage collector. This means that they will be collected once they become unreachable, and the memory held by them will be repurposed or freed. There is no need for manual memory management (like cudaFree in CUDA C); just make sure your objects are no longer reachable (i.e., no references to them remain).
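
For example, here is a minimal sketch using CUDA.jl's memory-management utilities (GC.gc from Base, plus CUDA.reclaim and CUDA.unsafe_free!):

using CUDA

a = CuArray{Float32}(undef, 1024)

# dropping all references makes the array eligible for collection;
# its GPU memory is then repurposed or freed by CUDA.jl's memory pool
a = nothing
GC.gc()         # encourage the Julia garbage collector to run
CUDA.reclaim()  # optionally return cached pool memory to the driver

# memory can also be released eagerly, as long as the array
# is guaranteed not to be used afterwards
b = CUDA.rand(1024)
CUDA.unsafe_free!(b)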

Reverse vectors?#

The “hello world 2.0” of the GPU is the (in-place) reversal of a vector:

\[ A_i := A_{N - i + 1} \]

Big ideas: in-place updates, where each pair of elements must be swapped exactly once (see the sketch below).
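
Using the kernel-programming primitives introduced in the next subsections, a minimal sketch of such a kernel (the name knl_reverse! is mine) could look as follows: each thread swaps one pair of elements from opposite ends of the vector, and only the first half of the indices perform a swap so that no pair is swapped back:

using CUDA

function knl_reverse!(a)
    N = length(a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    # only the first half of the indices swap, so no pair is swapped twice
    if i <= div(N, 2)
        @inbounds a[i], a[N - i + 1] = a[N - i + 1], a[i]
    end
    return
end

a = CuArray(collect(1.0:16.0))
@cuda threads=256 knl_reverse!(a)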

CUDA.jl API Overview#

The CUDA.jl package provides three distinct, but related, interfaces for CUDA programming:

  • the CuArray type: for programming with arrays;

  • native kernel programming capabilities: for writing CUDA kernels in Julia;

  • CUDA API wrappers: for low-level interactions with the CUDA libraries.

Much of the Julia CUDA programming stack can be used by relying solely on the CuArray type and platform-agnostic programming patterns like broadcast and other array abstractions. Only once you hit a performance bottleneck or missing functionality might you need to write a custom kernel or use the underlying CUDA APIs.
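
For instance, here is a minimal sketch of that array-abstraction style, relying only on the CuArray type and broadcasting (no custom kernel involved):

using CUDA

x = CUDA.rand(1024) # allocate and fill directly on the GPU
y = 2 .* x .+ 1     # the fused broadcast executes as a single GPU kernel
s = sum(y)          # reductions on CuArray also run on the GPU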

CUDA.jl kernel programming#

You can write your own GPU kernels in Julia.

  • CUDA.jl aims to expose the full power of the CUDA programming model, i.e., at the same level of abstraction as CUDA C/C++, albeit with some Julia-specific improvements.

Defining and launching kernels#

Kernels are written as ordinary Julia functions, but they have to return nothing:

    function my_kernel()
        return
    end

To launch this kernel, use the @cuda macro:

    julia> @cuda my_kernel()

Kernel inputs and outputs#

  • GPU kernels cannot return values; they should always end in a bare return (or return nothing).

  • To communicate values from a kernel, you can use a CuArray:

Example:

julia> using CUDA

julia> function my_kernel(a)
           a[1] = 42
           return
       end
my_kernel (generic function with 1 method)

julia> a = CuArray{Int}(undef, 1);

julia> @cuda my_kernel(a);

julia> a
1-element CuArray{Int64, 1, CUDA.DeviceMemory}:
 42

Launch configuration and indexing#

  • Simply using @cuda only launches a single thread, which is not very useful. To launch more threads, use the threads and blocks keyword arguments to @cuda (a multi-block sketch follows after this example). Example:

julia> function my_kernel(a)
           i = threadIdx().x
           a[i] = 42
           return
       end
my_kernel (generic function with 1 method)

julia> a = CuArray{Int}(undef, 5);

julia> @cuda threads=length(a) my_kernel(a);

julia> a
5-element CuArray{Int64, 1, CUDA.DeviceMemory}:
 42
 42
 42
 42
 42
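
Combining the threads and blocks keywords, each thread computes a global index from the built-in intrinsics and guards against running past the end of the array; a minimal sketch (Base's cld computes the number of blocks needed to cover the array):

using CUDA

function my_kernel(a)
    # global index across all blocks (1-based)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(a)  # guard: the last block may be only partially used
        a[i] = 42
    end
    return
end

a = CuArray{Int}(undef, 1000);
@cuda threads=256 blocks=cld(length(a), 256) my_kernel(a);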

2. CUDA Julia code on the tuckoo cluster#

As we have seen in the last lecture, we can run CUDA code on the tuckoo cluster.

Now let’s learn how to configure the CUDA.jl package to run with the proper CUDA version available on the cluster.

Recall that for tuckoo, node8 is the node dedicated to interactive usage, where you can find the nvcc compiler to compile CUDA codes.

Once you are logged in on tuckoo using

ssh your_user_name@tuckoo.sdsu.edu

you will see that you are on the login node because the prompt will show

[your_user_name@tuckoo ~]$

Now you can use rsh to connect to node8 by simply running:

rsh node8

and you will see that you are indeed on node8 because the prompt has changed to:

[your_user_name@node8 ~]$

  • Now you can start a Julia session with a local environment in a directory of your choice, by running

julia --project=.

Recall that this activates a local environment in the directory you specified with the --project= path; Project.toml and Manifest.toml files will be created there as soon as you add packages.

  • Once your Julia session has started, you can add the CUDA.jl and Test.jl packages needed to test your code.

Pkg.add("CUDA")
Pkg.add("Test")

The first time you try to use the CUDA.jl package by running using CUDA, you'll hit the following error:

julia> using CUDA
 Error: CUDA.jl was precompiled without knowing the CUDA toolkit version. This is unsupported.
 You should either precompile CUDA.jl in an environment where the CUDA toolkit is available,
 or call `CUDA.set_runtime_version!` to specify which CUDA version to use.
 @ CUDA ~/.julia/packages/CUDA/TW8fL/src/initialization.jl:148

  • Run the CUDA.versioninfo() command to see which CUDA version is available on the cluster:

julia> CUDA.versioninfo()

You will see the following output:

julia> CUDA.versioninfo()
CUDA runtime 12.0, local installation
CUDA driver 12.0
NVIDIA driver 525.60.13

CUDA libraries: 
- CUBLAS: 12.0.1
- CURAND: 10.3.1
- CUFFT: 11.0.0
- CUSOLVER: 11.4.2
- CUSPARSE: 12.0.0
- CUPTI: 2022.4.0 (API 18.0.0)
- NVML: 12.0.0+525.60.13

Julia packages: 
- CUDA: 5.7.2
- CUDA_Driver_jll: 0.12.1+1
- CUDA_Runtime_jll: 0.16.1+0
- CUDA_Runtime_Discovery: 0.3.5

Toolchain:
- Julia: 1.11.4
- LLVM: 16.0.6

Preferences:
- CUDA_Runtime_jll.local: true

2 devices:
  0: Tesla P100-PCIE-16GB (sm_60, 15.892 GiB / 16.000 GiB available)
  1: Tesla P100-PCIE-16GB (sm_60, 15.892 GiB / 16.000 GiB available)

  • Therefore, we want to set the runtime version to v12.0 for CUDA.jl. Let's do it by running the following command:

julia> CUDA.set_runtime_version!(v"12.0")

It will print the following output:

julia> CUDA.set_runtime_version!(v"12.0")
[ Info: Configure the active project to use CUDA 12.0; please re-start Julia for this to take effect.

  • Now your Julia environment on the cluster is properly set up to use the right CUDA version.

  • You will notice that your local Project.toml and Manifest.toml files have changed and now include the CUDA.jl dependency.

2.1 Launching a CUDA.jl code with SLURM#

Once you have verified that your local Project.toml and Manifest.toml files have been populated correctly, you can copy the julia_codes/module8-1/copy_cu_array.jl and batch_scripts/batch.cuda-julia files in the same directory where you have the Project.toml and Manifest.toml files.

This is the SLURM batch script to set up and precompile your Julia environment and launch a Julia code on a P100 GPU on tuckoo:

#!/bin/sh

#note: run the Julia CUDA copy-array example on a single node

#SBATCH --job-name=add_cu
#SBATCH --output=%A-add_cu.out
#SBATCH --ntasks=16
#SBATCH --constraint=P100

export OMPI_MCA_pml=ob1
export OMPI_MCA_btl=tcp,self

#require tcp over infiniband
export OMPI_MCA_btl_tcp_if_include=ib0

julia --project=. -e 'using Pkg; Pkg.instantiate()'
julia --project=. -e 'using Pkg; Pkg.precompile()'
julia --project=. -e 'using Pkg; Pkg.status()'

julia --project=. copy_cu_array.jl

#-------------------------------------------------------------

You can simply launch your SLURM batch script with

sbatch batch.cuda-julia

and it should produce the following output:

Status `~/comp605/spring25/julia_codes/module8-1/Project.toml`
  [052768ef] CUDA v5.7.3
  [8dfed614] Test v1.11.0
Begin test. 
Test ended. 
Test Summary: | Pass  Total  Time
CopyCuArray   |    1      1  1.0s
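
While the job is queued or running, you can check its status with SLURM's squeue command, filtering by your user name:

squeue -u your_user_name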

3. Parallel CUDA Julia#

Tasks and threads#

CUDA.jl can be used with Julia tasks and threads, offering a convenient way to work with multiple devices, or to perform independent computations that may execute concurrently on the GPU.

Task-based programming#

Each Julia task gets its own local CUDA execution environment, with its own stream, library handles, and active device selection. That makes it easy to use one task per device, or to use tasks for independent operations that can be overlapped. At the same time, it’s important to take care when sharing data between tasks.

For example, let’s take some dummy expensive computation and execute it from two tasks:

using CUDA

# an expensive computation
function compute(a, b)
    c = a * b             # library call
    broadcast!(sin, c, c) # Julia kernel
    c
end

function run(a, b)
    results = Vector{Any}(undef, 2)

    # computation
    @sync begin
        @async begin
            results[1] = Array(compute(a,b))
            nothing # JuliaLang/julia#40626
        end
        @async begin
            results[2] = Array(compute(a,b))
            nothing # JuliaLang/julia#40626
        end
    end

    # comparison
    results[1] == results[2]
end

# main function
function main(N=1024)
    a = CUDA.rand(N,N)
    b = CUDA.rand(N,N)

    # make sure this data can be used by other tasks!
    synchronize()

    run(a, b)
end
  • In the above example, we create two tasks and re-synchronize afterwards (@async and @sync), while the dummy compute function demonstrates both the use of a library (matrix multiplication uses CUBLAS) and a native Julia kernel.

  • The main function illustrates how we need to take care when sharing data between tasks:

    • GPU operations typically execute asynchronously, queued on an execution stream*, so if we switch tasks, and thus switch execution streams, we need to synchronize() to ensure the data is actually available.

    *: A stream is simply a sequence of operations that are performed in order on the device.
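
As a minimal illustration of this (using CUDA.stream(), which queries the execution stream of the current task), two tasks indeed see different streams:

using CUDA

# each Julia task gets its own task-local execution stream
t1 = @async CUDA.stream()
t2 = @async CUDA.stream()
@assert fetch(t1) != fetch(t2)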

Add vectors#

Now, let’s go back to adding two vectors, but this time we will use threads and thread blocks to do it in parallel:

using CUDA

# Loop based function
function addvecs!(c, a, b)
  N = length(a)
  @inbounds for i = 1:N
    c[i] = a[i] + b[i]
  end
end

# Loop based fake "GPU" kernel
function fake_knl_addvecs!(c, a, b, numblocks, bdim)
  N = length(a)
  # loop over the "blocks"
  @inbounds for bidx = 1:numblocks
    # loop over the "threads"
    for tidx = 1:bdim
      i = (bidx - 1) * bdim + tidx
      if i <= N
        c[i] = a[i] + b[i]
      end
    end
  end
end

# Real GPU kernel
function knl_addvecs!(c, a, b)

  N = length(a)

  tidx = threadIdx().x # get the thread ID
  bidx = blockIdx().x  # get the block ID
  bdim = blockDim().x  # how many threads in each block

  # figure out which index we should handle
  i = (bidx - 1) * bdim + tidx
  @inbounds if i <= N
    c[i] = a[i] + b[i]
  end

  # kernel must return nothing
  nothing
end

let
  N = 100000000
  a = rand(N)
  b = rand(N)

  #
  # Reference Implementation
  #
  c0 = similar(a)
  c0 .= a .+ b
  @time c0 .= a .+ b

  #
  # Simple for loop implementation
  #
  c1 = similar(c0)
  addvecs!(c1, a, b)
  @time addvecs!(c1, a, b)
  @assert c0 ≈ c1

  #
  # fake GPU kernel
  #
  c2 = similar(c0)
  numthreads = 256
  numblocks = div(N + numthreads - 1, numthreads)
  fake_knl_addvecs!(c2, a, b, numblocks, numthreads)
  @time fake_knl_addvecs!(c2, a, b, numblocks, numthreads)
  @assert c0 ≈ c2

  #
  # Real GPU call
  #
  d_a = CuArray(a) # Copy data to GPU
  d_b = CuArray(b) # Copy data to GPU
  d_c3 = CuArray{Float64}(undef, N) # Allocate array on GPU
  # launch GPU kernel
  @cuda threads=numthreads blocks=numblocks knl_addvecs!(d_c3, d_a, d_b)
  # Barrier before timing
  synchronize()
  @time begin
    @cuda threads=numthreads blocks=numblocks knl_addvecs!(d_c3, d_a, d_b)
    # Barrier before completing
    synchronize()
  end

  # Copy back to the CPU
  c3 = Array(d_c3)
  @assert c0 ≈ c3

  #
  # Try Julia's native CUDA version
  #
  d_c3 .= d_a .+ d_b
  synchronize()
  @time begin
    d_c3 .= d_a .+ d_b
    synchronize()
  end
end
nothing

In the above example, which you can find in julia_codes/module8-1/add_vectors.jl, we illustrate the addition of two vectors in different ways:

  1. A reference (CPU) implementation as a baseline for correctness

  2. A simple loop-based implementation

  3. A “fake” GPU kernel, where we manually loop over blocks of indices

  4. A real GPU kernel that uses the built-in CUDA variables for thread index, block index and block dimension

We also synchronize between the GPU and CPU, copy the results back, and time each variant's execution.

  • As the CUDA.jl profiling documentation states, if we used the CUDA.@time macro instead of Julia's Base.@time, we would not have to worry about synchronizing before time measurements.
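
For example, here is a minimal sketch of timing the broadcast variant with CUDA.@time, which takes care of the device synchronization for us:

using CUDA

d_a = CUDA.rand(Float64, 100000000)
d_b = CUDA.rand(Float64, 100000000)
d_c = similar(d_a)

# CUDA.@time synchronizes the device around the timed expression,
# so no explicit synchronize() calls are needed
CUDA.@time d_c .= d_a .+ d_b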