28) Intro to GPU programming in Julia#

Last time:

  • Practical CUDA

  • Memory

  • tuckoo demo for CUDA codes

Today:

  1. Intro to GPU programming in Julia (CUDA.jl)

  2. CUDA Julia code on the tuckoo cluster
    2.1 Launching a CUDA.jl code with SLURM

  3. Parallel CUDA Julia

1. Intro to GPU programming in Julia (CUDA.jl)#

For GPU programming there are many great resources. One that I may refer to is the CUDA.jl documentation (https://cuda.juliagpu.org).

Add vectors#

The real “hello world” of the GPU is adding vectors:

\[ C = A + B \]

Big ideas:

  • threads

  • blocks

  • how to launch and time kernels

  • off-loading memory to the device

Example with CUDA.jl#

To run this and the following Julia CUDA examples, you first need to add CUDA.jl to your environment. That is, start Julia and run the following in the REPL:

using Pkg
Pkg.add("CUDA")

Then you can execute the following script:

using CUDA

a = CuArray(rand(100)) # allocate on the GPU and upload random host data
b = CuArray(rand(100))
c = a + b              # the addition runs on the GPU
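
To inspect the result on the host, convert it back with Array(c); the next subsection looks at this host-device data movement in more detail.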

Memory management#

As we have seen so far, a crucial aspect of working with a GPU is managing the data on it.

The CuArray type is the primary interface for doing so: Creating a CuArray will allocate data on the GPU, copying elements to it will upload, and converting back to an Array will download values to the CPU. Let’s see it in an example test.

Note: to run the following test, you first need to also add the Test.jl package to your environment:

using Pkg
Pkg.add("Test")

Then you can run the following test example:

using CUDA
using Test # to be able to use the @test macro

@testset "CopyCuArray" begin
    print("Begin test. \n")

    # generate some data on the CPU (host array, h_array)
    h_array = rand(Float32, 1024)

    # allocate on the GPU (device array, d_array)
    d_array = CuArray{Float32}(undef, 1024)

    # copy from the CPU to the GPU
    copyto!(d_array, h_array)

    # download/transfer back to the CPU and verify using the Array casting of the device array
    @test h_array == Array(d_array)
    print("Test ended. \n")
end

Observation on garbage collection:

  • One striking difference between a native C CUDA implementation and the Julia CUDA.jl interface is that instances of the CuArray type are managed by the Julia garbage collector. This means that they will be collected once they become unreachable, and the memory held by them will be repurposed or freed. There is no need for manual memory management (like cudaFree in CUDA C); just make sure your objects are no longer reachable (i.e., no references to them remain).
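
For example, here is a minimal sketch using CUDA.jl's memory-management utilities (GC.gc from Base, plus CUDA.reclaim and CUDA.unsafe_free!):

using CUDA

a = CuArray{Float32}(undef, 1024)

# dropping all references makes the array eligible for collection;
# its GPU memory is then repurposed or freed by CUDA.jl's memory pool
a = nothing
GC.gc()         # encourage the Julia garbage collector to run
CUDA.reclaim()  # optionally return cached pool memory to the driver

# memory can also be released eagerly, as long as the array
# is guaranteed not to be used afterwards
b = CUDA.rand(1024)
CUDA.unsafe_free!(b)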

Reverse vectors?#

The “hello world 2.0” of the GPU is the (in-place) reversal of a vector:

\[ A_i := A_{N - i + 1} \]

Big ideas: in-place updates, where each pair of elements must be swapped exactly once (see the sketch below).
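
Using the kernel-programming primitives introduced in the next subsections, a minimal sketch of such a kernel (the name knl_reverse! is mine) could look as follows: each thread swaps one pair of elements from opposite ends of the vector, and only the first half of the indices perform a swap so that no pair is swapped back:

using CUDA

function knl_reverse!(a)
    N = length(a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    # only the first half of the indices swap, so no pair is swapped twice
    if i <= div(N, 2)
        @inbounds a[i], a[N - i + 1] = a[N - i + 1], a[i]
    end
    return
end

a = CuArray(collect(1.0:16.0))
@cuda threads=256 knl_reverse!(a)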

CUDA.jl API Overview#

The CUDA.jl package provides three distinct, but related, interfaces for CUDA programming:

  • the CuArray type: for programming with arrays;

  • native kernel programming capabilities: for writing CUDA kernels in Julia;

  • CUDA API wrappers: for low-level interactions with the CUDA libraries.

Much of the Julia CUDA programming stack can be used by relying solely on the CuArray type and platform-agnostic programming patterns like broadcast and other array abstractions. Only once you hit a performance bottleneck or missing functionality might you need to write a custom kernel or use the underlying CUDA APIs.
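
For instance, here is a minimal sketch of that array-abstraction style, relying only on the CuArray type and broadcasting (no custom kernel involved):

using CUDA

x = CUDA.rand(1024) # allocate and fill directly on the GPU
y = 2 .* x .+ 1     # the fused broadcast executes as a single GPU kernel
s = sum(y)          # reductions on CuArray also run on the GPU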

CUDA.jl kernel programming#

You can write your own GPU kernels in Julia.

  • CUDA.jl aims to expose the full power of the CUDA programming model, i.e., at the same level of abstraction as CUDA C/C++, albeit with some Julia-specific improvements.

Defining and launching kernels#

Kernels are written as ordinary Julia functions, but they have to return nothing:

    function my_kernel()
        return
    end

To launch this kernel, use the @cuda macro:

    julia> @cuda my_kernel()

Kernel inputs and outputs#

  • GPU kernels cannot return values; they should always end in a bare return (or return nothing).

  • To communicate values from a kernel, you can use a CuArray:

Example:

julia> using CUDA

julia> function my_kernel(a)
           a[1] = 42
           return
       end
my_kernel (generic function with 1 method)

julia> a = CuArray{Int}(undef, 1);

julia> @cuda my_kernel(a);

julia> a
1-element CuArray{Int64, 1, CUDA.DeviceMemory}:
 42

Launch configuration and indexing#

  • Simply using @cuda only launches a single thread, which is not very useful. To launch more threads, use the threads and blocks keyword arguments to @cuda (a multi-block sketch follows after this example). Example:

julia> function my_kernel(a)
           i = threadIdx().x
           a[i] = 42
           return
       end
my_kernel (generic function with 1 method)

julia> a = CuArray{Int}(undef, 5);

julia> @cuda threads=length(a) my_kernel(a);

julia> a
5-element CuArray{Int64, 1, CUDA.DeviceMemory}:
 42
 42
 42
 42
 42
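
Combining the threads and blocks keywords, each thread computes a global index from the built-in intrinsics and guards against running past the end of the array; a minimal sketch (Base's cld computes the number of blocks needed to cover the array):

using CUDA

function my_kernel(a)
    # global index across all blocks (1-based)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(a)  # guard: the last block may be only partially used
        a[i] = 42
    end
    return
end

a = CuArray{Int}(undef, 1000);
@cuda threads=256 blocks=cld(length(a), 256) my_kernel(a);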

2. CUDA Julia code on the tuckoo cluster#

As we have seen in the last lecture, we can run CUDA code on the tuckoo cluster.

Now let’s learn how to configure the CUDA.jl package to run with the proper CUDA version available on the cluster.

Recall that for tuckoo, node8 is the node dedicated to interactive usage, where you can find the nvcc compiler to compile CUDA codes.

Once you are logged in on tuckoo using

ssh your_user_name@tuckoo.sdsu.edu

you will see that you are on the login node because the prompt will show

[your_user_name@tuckoo ~]$

Now you can use rsh to connect to node8 by simply running:

rsh node8

and you will see that you are indeed on node8 because the prompt has changed to:

[your_user_name@node8 ~]$

  • Now you can start a Julia session with a local environment in a directory of your choice, by running

julia --project=.

Recall that this activates a local environment in the directory you specified with the --project= path; Project.toml and Manifest.toml files will be created there as soon as you add packages.

  • Once your Julia session has started, you can add the CUDA.jl and Test.jl packages needed to test your code.

Pkg.add("CUDA")
Pkg.add("Test")

The first time you try to use the CUDA.jl package by running using CUDA, you'll hit the following error:

julia> using CUDA
 Error: CUDA.jl was precompiled without knowing the CUDA toolkit version. This is unsupported.
 You should either precompile CUDA.jl in an environment where the CUDA toolkit is available,
 or call `CUDA.set_runtime_version!` to specify which CUDA version to use.
 @ CUDA ~/.julia/packages/CUDA/TW8fL/src/initialization.jl:148

  • Run the CUDA.versioninfo() command to see which CUDA version is available on the cluster:

julia> CUDA.versioninfo()

You will see the following output:

julia> CUDA.versioninfo()
CUDA runtime 12.0, local installation
CUDA driver 12.0
NVIDIA driver 525.60.13

CUDA libraries: 
- CUBLAS: 12.0.1
- CURAND: 10.3.1
- CUFFT: 11.0.0
- CUSOLVER: 11.4.2
- CUSPARSE: 12.0.0
- CUPTI: 2022.4.0 (API 18.0.0)
- NVML: 12.0.0+525.60.13

Julia packages: 
- CUDA: 5.7.2
- CUDA_Driver_jll: 0.12.1+1
- CUDA_Runtime_jll: 0.16.1+0
- CUDA_Runtime_Discovery: 0.3.5

Toolchain:
- Julia: 1.11.4
- LLVM: 16.0.6

Preferences:
- CUDA_Runtime_jll.local: true

2 devices:
  0: Tesla P100-PCIE-16GB (sm_60, 15.892 GiB / 16.000 GiB available)
  1: Tesla P100-PCIE-16GB (sm_60, 15.892 GiB / 16.000 GiB available)

  • Therefore, we want to set the runtime version to v12.0 for CUDA.jl. Let's do it by running the following command:

julia> CUDA.set_runtime_version!(v"12.0")

It will print the following output:

julia> CUDA.set_runtime_version!(v"12.0")
[ Info: Configure the active project to use CUDA 12.0; please re-start Julia for this to take effect.

  • Now your Julia environment on the cluster is properly set up to use the right CUDA version.

  • You will notice that your local Project.toml and Manifest.toml files have changed and now include the CUDA.jl dependency.

2.1 Launching a CUDA.jl code with SLURM#

Once you have verified that your local Project.toml and Manifest.toml files have been populated correctly, you can copy the julia_codes/module8-1/copy_cu_array.jl and batch_scripts/batch.cuda-julia files in the same directory where you have the Project.toml and Manifest.toml files.

This is the SLURM batch script to set up and precompile your Julia environment and launch a Julia code on a P100 GPU on tuckoo:

#!/bin/sh

#note: run the Julia CUDA copy-array example on a single node

#SBATCH --job-name=add_cu
#SBATCH --output=%A-add_cu.out
#SBATCH --ntasks=16
#SBATCH --constraint=P100

export OMPI_MCA_pml=ob1
export OMPI_MCA_btl=tcp,self

#require tcp over infiniband
export OMPI_MCA_btl_tcp_if_include=ib0

julia --project=. -e 'using Pkg; Pkg.instantiate()'
julia --project=. -e 'using Pkg; Pkg.precompile()'
julia --project=. -e 'using Pkg; Pkg.status()'

julia --project=. copy_cu_array.jl

#-------------------------------------------------------------

You can simply launch your SLURM batch script with

sbatch batch.cuda-julia

and it should produce the following output:

Status `~/comp605/spring25/julia_codes/module8-1/Project.toml`
  [052768ef] CUDA v5.7.3
  [8dfed614] Test v1.11.0
Begin test. 
Test ended. 
Test Summary: | Pass  Total  Time
CopyCuArray   |    1      1  1.0s
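
While the job is queued or running, you can check its status with SLURM's squeue command, filtering by your user name:

squeue -u your_user_name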

3. Parallel CUDA Julia#

Tasks and threads#

CUDA.jl can be used with Julia tasks and threads, offering a convenient way to work with multiple devices, or to perform independent computations that may execute concurrently on the GPU.

Task-based programming#

Each Julia task gets its own local CUDA execution environment, with its own stream, library handles, and active device selection. That makes it easy to use one task per device, or to use tasks for independent operations that can be overlapped. At the same time, it’s important to take care when sharing data between tasks.

For example, let’s take some dummy expensive computation and execute it from two tasks:

using CUDA

# an expensive computation
function compute(a, b)
    c = a * b             # library call
    broadcast!(sin, c, c) # Julia kernel
    c
end

function run(a, b)
    results = Vector{Any}(undef, 2)

    # computation
    @sync begin
        @async begin
            results[1] = Array(compute(a,b))
            nothing # JuliaLang/julia#40626
        end
        @async begin
            results[2] = Array(compute(a,b))
            nothing # JuliaLang/julia#40626
        end
    end

    # comparison
    results[1] == results[2]
end

# main function
function main(N=1024)
    a = CUDA.rand(N,N)
    b = CUDA.rand(N,N)

    # make sure this data can be used by other tasks!
    synchronize()

    run(a, b)
end
  • In the above example, we create two tasks and re-synchronize afterwards (@async and @sync), while the dummy compute function demonstrates both the use of a library (matrix multiplication uses CUBLAS) and a native Julia kernel.

  • The main function illustrates how we need to take care when sharing data between tasks:

    • GPU operations typically execute asynchronously, queued on an execution stream*, so if we switch tasks, and thus switch execution streams, we need to synchronize() to ensure the data is actually available.

    *: A stream is simply a sequence of operations that are performed in order on the device.
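
As a minimal illustration of this (using CUDA.stream(), which queries the execution stream of the current task), two tasks indeed see different streams:

using CUDA

# each Julia task gets its own task-local execution stream
t1 = @async CUDA.stream()
t2 = @async CUDA.stream()
@assert fetch(t1) != fetch(t2)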

Add vectors#

Now, let’s go back to adding two vectors, but this time we will use threads and thread blocks to do it in parallel:

using CUDA

# Loop based function
function addvecs!(c, a, b)
  N = length(a)
  @inbounds for i = 1:N
    c[i] = a[i] + b[i]
  end
end

# Loop based fake "GPU" kernel
function fake_knl_addvecs!(c, a, b, numblocks, bdim)
  N = length(a)
  # loop over the "blocks"
  @inbounds for bidx = 1:numblocks
    # loop over the "threads"
    for tidx = 1:bdim
      i = (bidx - 1) * bdim + tidx
      if i <= N
        c[i] = a[i] + b[i]
      end
    end
  end
end

# Real GPU kernel
function knl_addvecs!(c, a, b)

  N = length(a)

  tidx = threadIdx().x # get the thread ID
  bidx = blockIdx().x  # get the block ID
  bdim = blockDim().x  # how many threads in each block

  # figure out which index we should handle
  i = (bidx - 1) * bdim + tidx
  @inbounds if i <= N
    c[i] = a[i] + b[i]
  end

  # kernel must return nothing
  nothing
end

let
  N = 100000000
  a = rand(N)
  b = rand(N)

  #
  # Reference Implementation
  #
  c0 = similar(a)
  c0 .= a .+ b
  @time c0 .= a .+ b

  #
  # Simple for loop implementation
  #
  c1 = similar(c0)
  addvecs!(c1, a, b)
  @time addvecs!(c1, a, b)
  @assert c0 ≈ c1

  #
  # fake GPU kernel
  #
  c2 = similar(c0)
  numthreads = 256
  numblocks = div(N + numthreads - 1, numthreads)
  fake_knl_addvecs!(c2, a, b, numblocks, numthreads)
  @time fake_knl_addvecs!(c2, a, b, numblocks, numthreads)
  @assert c0 ≈ c2

  #
  # Real GPU call
  #
  d_a = CuArray(a) # Copy data to GPU
  d_b = CuArray(b) # Copy data to GPU
  d_c3 = CuArray{Float64}(undef, N) # Allocate array on GPU
  # launch GPU kernel
  @cuda threads=numthreads blocks=numblocks knl_addvecs!(d_c3, d_a, d_b)
  # Barrier before timing
  synchronize()
  @time begin
    @cuda threads=numthreads blocks=numblocks knl_addvecs!(d_c3, d_a, d_b)
    # Barrier before completing
    synchronize()
  end

  # Copy back to the CPU
  c3 = Array(d_c3)
  @assert c0 ≈ c3

  #
  # Try Julia's native CUDA version
  #
  d_c3 .= d_a .+ d_b
  synchronize()
  @time begin
    d_c3 .= d_a .+ d_b
    synchronize()
  end
end
nothing

In the above example, which you can find in julia_codes/module8-1/add_vectors.jl, we illustrate the addition of two vectors in different ways:

  1. A reference (CPU) implementation as a baseline for correctness

  2. A simple loop-based implementation

  3. A “fake” GPU kernel, where we manually loop over blocks of indices

  4. A real GPU kernel that uses the built-in CUDA variables for thread index, block index and block dimension

We also synchronize between the GPU and CPU, copy the results back, and time each variant's execution.

  • As the CUDA.jl profiling documentation states, if we used the CUDA.@time macro instead of Julia's Base.@time, we would not have to worry about synchronizing before time measurements.
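
For example, here is a minimal sketch of timing the broadcast variant with CUDA.@time, which takes care of the device synchronization for us:

using CUDA

d_a = CUDA.rand(Float64, 100000000)
d_b = CUDA.rand(Float64, 100000000)
d_c = similar(d_a)

# CUDA.@time synchronizes the device around the timed expression,
# so no explicit synchronize() calls are needed
CUDA.@time d_c .= d_a .+ d_b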