28) Intro to GPU programming in Julia#
Last time:
Practical CUDA
Memory
Tuckoo demo for CUDA codes
Today:
Intro to GPU programming in Julia (CUDA.jl)
CUDA Julia code on the tuckoo cluster
Launching a CUDA.jl code with SLURM
Parallel Julia CUDA
1. Intro to GPU programming in Julia (CUDA.jl)#
For GPU programming there are many great resources. Some that I may refer to are:
Note
References:
Tim Warburton | YouTube video (In the first 47 minutes of the video, Tim gives an excellent introduction to the GPU.)
Add vectors#
The real “hello world” of the GPU is adding vectors:
Big ideas:
threads
blocks
how to launch and time kernels
off-loaded memory on the device
Example with CUDA.jl#
To run this and following Julia CUDA examples, you need to first add CUDA.jl to your environment. That is, start julia
and in the REPL run:
using Pkg
Pkg.add("CUDA")
Then you can execute the following script:
using CUDA

a = CuArray(rand(100))
b = CuArray(rand(100))
c = a + b
Memory management#
As we have seen so far, a crucial aspect of working with a GPU is managing the data on it.
The CuArray
type is the primary interface for doing so: Creating a CuArray
will allocate data on the GPU, copying elements to it will upload, and converting back to an Array
will download values to the CPU. Let’s see it in an example test.
Note: to run the following test, you first need to also add the Test.jl
package to your environment:
using Pkg
Pkg.add("Test")
Then you can run the following test example:
using CUDA
using Test # to be able to use the @test macro

@testset "CopyCuArray" begin
    print("Begin test. \n")

    # generate some data on the CPU (host array, h_array)
    h_array = rand(Float32, 1024)

    # allocate on the GPU (device array, d_array)
    d_array = CuArray{Float32}(undef, 1024)

    # copy from the CPU to the GPU
    copyto!(d_array, h_array)

    # download/transfer back to the CPU and verify using the Array casting of the device array
    @test h_array == Array(d_array)
    print("Test ended. \n")
end
Observation on garbage collection:
One striking difference between the native C CUDA implementation and the Julia CUDA.jl interface is that instances of the CuArray type are managed by the Julia garbage collector. This means that they will be collected once they become unreachable, and the memory held by them will be repurposed or freed. There is no need for manual memory management (like cudaFree); just make sure there are no remaining references to your objects.
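As a minimal sketch of this idea (the helper function name below is made up for illustration), GPU memory allocated inside a function is reclaimed once the array becomes unreachable; GC.gc() and CUDA.reclaim() can optionally be called to force the process:

using CUDA

function temp_workspace()
    tmp = CuArray{Float32}(undef, 10_000) # allocated on the GPU
    # ... use tmp for some intermediate computation ...
    return nothing
end

temp_workspace()
# once tmp is unreachable, the Julia GC (together with CUDA.jl's memory pool)
# reclaims the GPU memory; no explicit cudaFree-style call is required
GC.gc()        # optionally force a garbage collection
CUDA.reclaim() # optionally return cached GPU memory to the driver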
Reverse vectors?#
The “hello world 2.0” of the GPU is (in-place) vector reversal (a kernel sketch follows the list below).
Big ideas:
thread independence
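A minimal kernel sketch of an in-place reverse (the kernel name and launch configuration below are illustrative, not from the lecture code): each thread swaps one pair of elements from opposite ends of the vector, independently of every other thread.

using CUDA

function knl_reverse!(a)
    N = length(a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    # each thread handles one independent swap
    if i <= div(N, 2)
        j = N - i + 1
        a[i], a[j] = a[j], a[i]
    end
    return
end

a = CuArray(collect(1.0:10.0))
numthreads = 256
numblocks = cld(div(length(a), 2), numthreads)
@cuda threads=numthreads blocks=numblocks knl_reverse!(a)
synchronize()
Array(a) # 10.0, 9.0, ..., 1.0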
CUDA.jl API Overview#
The CUDA.jl package provides three distinct, but related, interfaces for CUDA programming:
the CuArray type: for programming with arrays;
native kernel programming capabilities: for writing CUDA kernels in Julia;
CUDA API wrappers: for low-level interactions with the CUDA libraries.
Much of the Julia CUDA programming stack can be used by just relying on the CuArray type and platform-agnostic programming patterns like broadcast and other array abstractions. Only when you hit a performance bottleneck, or some missing functionality, might you need to write a custom kernel or use the underlying CUDA APIs.
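For instance, here is a short sketch of this array-programming style (the array sizes are arbitrary): a fused broadcast and a reduction both run on the GPU without writing any kernel.

using CUDA

a = CUDA.rand(1024)
b = CUDA.rand(1024)

c = sin.(a) .+ cos.(b) # fused broadcast, compiled into a single GPU kernel
s = sum(c)             # reduction executed on the GPU, returned as a scalar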
CUDA.jl kernel programming#
You can write your own GPU kernels in Julia.
CUDA.jl aims to expose the full power of the CUDA programming model, i.e., at the same level of abstraction as CUDA C/C++, albeit with some Julia-specific improvements.
Defining and launching kernels#
Kernels are written as ordinary Julia functions, but they have to return nothing
:
function my_kernel()
return
end
To launch this kernel, use the @cuda macro:
julia> @cuda my_kernel()
Kernel inputs and outputs#
GPU kernels cannot return values, and should always end in return or return nothing.
To communicate values from a kernel, you can use a CuArray.
Example:
using CUDA
function my_kernel(a)
a[1] = 42
return
end
my_kernel (generic function with 1 method)
a = CuArray{Int}(undef, 1);
@cuda my_kernel(a);
a
1-element CuArray{Int64, 1, CUDA.DeviceMemory}:
42
Launch configuration and indexing#
Simply using @cuda only launches a single thread, which is not very useful. To launch more threads, use the threads and blocks keyword arguments to @cuda. Example:
function my_kernel(a)
i = threadIdx().x
a[i] = 42
return
end
a = CuArray{Int}(undef, 5);
@cuda threads=length(a) my_kernel(a);
a
5-element CuArray{Int64, 1, CUDA.DeviceMemory}:
42
42
42
42
42
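Since a single block supports only a limited number of threads (typically 1024), larger arrays need multiple blocks. A minimal sketch (the array size and launch configuration below are illustrative) combining blockIdx, blockDim, and threadIdx to compute a global index:

using CUDA

function my_kernel(a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(a) # guard against the extra threads of the last block
        a[i] = 42
    end
    return
end

a = CuArray{Int}(undef, 2000);
@cuda threads=256 blocks=cld(length(a), 256) my_kernel(a);
all(Array(a) .== 42) # true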
2. CUDA Julia code on the tuckoo cluster#
As we have seen in the last lecture, we can run CUDA code on the tuckoo cluster.
Now let’s learn how to configure the CUDA.jl
package to run with the proper CUDA version available on the cluster.
Recall that for tuckoo, node8
is the node dedicated to interactive usage, where you can find the nvcc
compiler to compile CUDA codes.
Once you are logged in on tuckoo using
ssh your_user_name@tuckoo.sdsu.edu
You will see that you are logged in on the login node because the prompt will show
[your_user_name@tuckoo ~]$
Now you can use rsh
to connect to node8
by simply running:
rsh node8
and you will see that you are indeed on node8
because the prompt has changed to:
[your_user_name@node8 ~]$
Now you can start a Julia session with a local environment in a directory of your choice, by running
julia --project=.
Recall that this will generate Project.toml and Manifest.toml files in the directory you specified via the --project= path (here, the current directory).
Once your Julia session has started, you can add the CUDA.jl and Test.jl packages needed to test your code.
Pkg.add("CUDA")
Pkg.add("Test")
Now if you try to use the CUDA.jl
package by running using CUDA
, the first time you’ll hit the following error:
julia> using CUDA
┌ Error: CUDA.jl was precompiled without knowing the CUDA toolkit version. This is unsupported.
│ You should either precompile CUDA.jl in an environment where the CUDA toolkit is available,
│ or call `CUDA.set_runtime_version!` to specify which CUDA version to use.
└ @ CUDA ~/.julia/packages/CUDA/TW8fL/src/initialization.jl:148
Run the
CUDA.versioninfo()
command to see which CUDA version is available on the cluster:
julia> CUDA.versioninfo()
You will see the following output:
julia> CUDA.versioninfo()
CUDA runtime 12.0, local installation
CUDA driver 12.0
NVIDIA driver 525.60.13
CUDA libraries:
- CUBLAS: 12.0.1
- CURAND: 10.3.1
- CUFFT: 11.0.0
- CUSOLVER: 11.4.2
- CUSPARSE: 12.0.0
- CUPTI: 2022.4.0 (API 18.0.0)
- NVML: 12.0.0+525.60.13
Julia packages:
- CUDA: 5.7.2
- CUDA_Driver_jll: 0.12.1+1
- CUDA_Runtime_jll: 0.16.1+0
- CUDA_Runtime_Discovery: 0.3.5
Toolchain:
- Julia: 1.11.4
- LLVM: 16.0.6
Preferences:
- CUDA_Runtime_jll.local: true
2 devices:
0: Tesla P100-PCIE-16GB (sm_60, 15.892 GiB / 16.000 GiB available)
1: Tesla P100-PCIE-16GB (sm_60, 15.892 GiB / 16.000 GiB available)
Therefore, we want to set the v12.0 version for our CUDA.jl library. Let’s do it by running the following command:
julia> CUDA.set_runtime_version!(v"12.0")
It will print the following output:
julia> CUDA.set_runtime_version!(v"12.0")
[ Info: Configure the active project to use CUDA 12.0; please re-start Julia for this to take effect.
Now your Julia environment on the cluster is properly set up to use the right CUDA version.
You will notice that your local Project.toml and Manifest.toml files have changed and now include the CUDA.jl dependency.
2.1 Launching a CUDA.jl code with SLURM#
Once you have verified that your local Project.toml and Manifest.toml files have been populated correctly, you can copy the julia_codes/module8-1/copy_cu_array.jl and batch_scripts/batch.cuda-julia files into the same directory that contains your Project.toml and Manifest.toml files.
This is the SLURM batch script to set up and precompile your Julia environment and launch the Julia code on a P100 on tuckoo:
#!/bin/sh

#note: run Julia CUDA copy array on a single node

#SBATCH --job-name=add_cu
#SBATCH --output=%A-add_cu.out
#SBATCH --ntasks=16
#SBATCH --constraint=P100

export OMPI_MCA_pml=ob1
export OMPI_MCA_btl=tcp,self

#require tcp over infiniband
export OMPI_MCA_btl_tcp_if_include=ib0

julia --project=. -e 'using Pkg; Pkg.instantiate()'
julia --project=. -e 'using Pkg; Pkg.precompile()'
julia --project=. -e 'using Pkg; Pkg.status()'

julia --project=. copy_cu_array.jl

#-------------------------------------------------------------
You can simply launch your SLURM batch script with
sbatch batch.cuda-julia
and it should produce the following output:
Status `~/comp605/spring25/julia_codes/module8-1/Project.toml`
[052768ef] CUDA v5.7.3
[8dfed614] Test v1.11.0
Begin test.
Test ended.
Test Summary: | Pass Total Time
CopyCuArray | 1 1 1.0s
3. Parallel Julia CUDA#
Tip
Reference: CUDA.jl tutorials
Tasks and threads#
CUDA.jl can be used with Julia tasks and threads, offering a convenient way to work with multiple devices, or to perform independent computations that may execute concurrently on the GPU.
Task-based programming#
Each Julia task gets its own local CUDA execution environment, with its own stream, library handles, and active device selection. That makes it easy to use one task per device, or to use tasks for independent operations that can be overlapped. At the same time, it’s important to take care when sharing data between tasks.
For example, let’s take some dummy expensive computation and execute it from two tasks:
using CUDA

# an expensive computation
function compute(a, b)
    c = a * b             # library call
    broadcast!(sin, c, c) # Julia kernel
    c
end

function run(a, b)
    results = Vector{Any}(undef, 2)

    # computation
    @sync begin
        @async begin
            results[1] = Array(compute(a,b))
            nothing # JuliaLang/julia#40626
        end
        @async begin
            results[2] = Array(compute(a,b))
            nothing # JuliaLang/julia#40626
        end
    end

    # comparison
    results[1] == results[2]
end

# main function
function main(N=1024)
    a = CUDA.rand(N,N)
    b = CUDA.rand(N,N)

    # make sure this data can be used by other tasks!
    synchronize()

    run(a, b)
end
In the above example, we create two tasks and re-synchronize afterwards (@async and @sync), while the dummy compute function demonstrates both the use of a library (matrix multiplication uses CUBLAS) and a native Julia kernel.
The main function illustrates how we need to take care when sharing data between tasks: GPU operations typically execute asynchronously, queued on an execution stream*, so if we switch tasks, and thus switch execution streams, we need to synchronize() to ensure the data is actually available.
*: A stream is simply a sequence of operations that are performed in order on the device.
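As a minimal sketch of the “one task per device” pattern mentioned above (not part of the lecture code; it simply iterates over whatever GPUs devices() reports), each task selects its own device and performs an independent computation:

using CUDA

@sync begin
    for (i, dev) in enumerate(devices())
        @async begin
            device!(dev)             # this task uses its own device (and stream)
            a = CUDA.rand(1024, 1024)
            c = Array(a * a)         # independent computation, copied back to the CPU
            @info "task $i finished on device" dev
        end
    end
end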
Add vectors#
Now, let’s go back to adding two vectors, but this time we will be using threads and thread blocks to do it in parallel:
using CUDA

# Loop based function
function addvecs!(c, a, b)
    N = length(a)
    @inbounds for i = 1:N
        c[i] = a[i] + b[i]
    end
end

# Loop based fake "GPU" kernel
function fake_knl_addvecs!(c, a, b, numblocks, bdim)
    N = length(a)
    # loop over the "blocks"
    @inbounds for bidx = 1:numblocks
        # loop over the "threads"
        for tidx = 1:bdim
            i = (bidx - 1) * bdim + tidx
            if i <= N
                c[i] = a[i] + b[i]
            end
        end
    end
end

# Real GPU kernel
function knl_addvecs!(c, a, b)

    N = length(a)

    tidx = threadIdx().x # get the thread ID
    bidx = blockIdx().x  # get the block ID
    bdim = blockDim().x  # how many threads in each block

    # figure out which index we should handle
    i = (bidx - 1) * bdim + tidx
    @inbounds if i <= N
        c[i] = a[i] + b[i]
    end

    # Kernel must return nothing
    nothing
end

let
    N = 100000000
    a = rand(N)
    b = rand(N)

    #
    # Reference Implementation
    #
    c0 = similar(a)
    c0 .= a .+ b
    @time c0 .= a .+ b

    #
    # Simple for loop implementation
    #
    c1 = similar(c0)
    addvecs!(c1, a, b)
    @time addvecs!(c1, a, b)
    @assert c0 ≈ c1

    #
    # fake GPU kernel
    #
    c2 = similar(c0)
    numthreads = 256
    numblocks = div(N + numthreads - 1, numthreads)
    fake_knl_addvecs!(c2, a, b, numblocks, numthreads)
    @time fake_knl_addvecs!(c2, a, b, numblocks, numthreads)
    @assert c0 ≈ c2

    #
    # Real GPU call
    #
    d_a = CuArray(a) # Copy data to GPU
    d_b = CuArray(b) # Copy data to GPU
    d_c3 = CuArray{Float64}(undef, N) # Allocate array on GPU
    # launch GPU kernel
    @cuda threads=numthreads blocks=numblocks knl_addvecs!(d_c3, d_a, d_b)
    # Barrier before timing
    synchronize()
    @time begin
        @cuda threads=numthreads blocks=numblocks knl_addvecs!(d_c3, d_a, d_b)
        # Barrier before completing
        synchronize()
    end

    # Copy back to the CPU
    c3 = Array(d_c3)
    @assert c0 ≈ c3

    #
    # Try Julia's native CUDA version
    #
    d_c3 .= d_a .+ d_b
    synchronize()
    @time begin
        d_c3 .= d_a .+ d_b
        synchronize()
    end
end
nothing
In the above example, which you can find in julia_codes/module8-1/add_vectors.jl, we illustrate the addition of two vectors in different ways:
A reference (CPU) implementation as a baseline for correctness
A simple loop-based implementation
A “fake” GPU kernel, where we manually loop over blocks of indices
A real GPU kernel that uses the built-in CUDA variables for thread index, block index and block dimension
We also synchronize the results between GPU and CPU and time the execution.
As the CUDA.jl profiling documentation states, if we used the CUDA.@time macro instead of the Julia Base @time macro, we would not have to worry about synchronization before time measurements.
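A minimal sketch of this (the array sizes are arbitrary): CUDA.@time synchronizes the device around the measured expression, so no explicit synchronize() calls are needed.

using CUDA

d_a = CUDA.rand(10^7)
d_b = CUDA.rand(10^7)
d_c = similar(d_a)

CUDA.@time d_c .= d_a .+ d_b # reports GPU time and allocations; synchronization is handled for us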