10) Introduction to Multithreading#
Last time:
Packing for L1/L2 cache
Packing for L3 cache
Today:
In this lecture, we are going to discuss running a multithreaded version of the code. We will use multiple threads on our example problem, the matrix-matrix multiply.
1. Introduction to threads#
In computer science, a thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, which is typically a part of the operating system. In many cases, a thread is a component of a process (i.e., the instance of a computer program that is being executed).
The multiple threads of a given process may be executed concurrently (via multithreading capabilities), sharing resources such as memory, while different processes do not share these resources.
In particular, the threads of a process share its executable code and the values of its dynamically allocated variables and non-thread-local global variables at any given time.
The implementation of threads and processes differs between operating systems. In fact, depending on the OS, a process may be made up of multiple threads of execution that execute instructions concurrently.
Advantages of multithreading#
If a thread gets a lot of cache misses, the other threads can continue taking advantage of the unused computing resources, which may lead to faster overall execution, as these resources would have been idle if only a single thread were executed. Also, if a thread cannot use all the computing resources of the CPU (because instructions depend on each other’s result), running another thread may prevent those resources from becoming idle.
Disadvantages of multithreading#
Multiple threads can interfere with each other when sharing hardware resources such as caches or translation lookaside buffers (TLBs). As a result, execution times of a single thread are not improved and can be degraded, even when only one thread is executing, due to lower frequencies or additional pipeline stages that are necessary to accommodate thread-switching hardware.
Merging data from two processes (in the case of multiprocessing) or threads can often incur higher costs compared to processing the same data on a single thread, due to overheads such as inter-process communication and synchronization.
Overall, efficiency of multi-threading varies across different vendors and programming implementations.
2. Julia and multithreading#
Tip
For reference see the official Julia documentation on multithreading.
In the Julia REPL, you can see how many threads you have access to with:
Threads.nthreads()
1
By default it will be 1
. To use more than 1
thread we need to either set the environment variable JULIA_NUM_THREADS
or use the command line argument
--threads
(or -t
) when you start Julia from a terminal. Example:
julia --threads=5
3. Julia VSCode extension#
If you use VSCode as an IDE, to install the Julia for VSCode extension, open VSCode, go to Extensions, search for Julia and install.
Once this is done, you should be able to run Julia within VSCode without any further configuration (sometimes you might have to set the Julia Executable path setting).
Examples of what is possible:
A console for Julia which you can get by typing Alt+J, Alt+O in VSCode.
Using Julia in a scripting environment in VSCode. When working on a Julia script (a .jl text file) you may evaluate any amount of code in the script interactively. E.g., Ctrl+Enter evaluates current line being edited. The Julia process that evaluates the script is the same as the one in the console you get in point 1. The Julia console and the text editor window are linked!
You can open Jupyter notebooks without installing anything else. Simply drag+drop the notebook into VSCode and it will request a “kernel” (choose Julia). Each Jupyter notebook launches a dedicated Julia process, so the Jupyter notebooks are not linked with the standard text editor and Julia console of VSCode!
Tip
The instructions above are referenced from a very good resource: the Zero-to-Hero Julia workshop that you should check out.
Going back to multi-threading, in VSCode you can set the number of threads for the REPL by going to the settings for the Julia plugin:
click on extension
click the little gear
search for
threads
set the number you want
Every thread knows its unique threadid
.
Threads.threadid()
1
To access multiple threads we use the Threads.@threads
macro which will split the iterations of a loop over the available threads. For example:
Threads.@threads for i = 1:13
@show (Threads.threadid(), i)
end
(Threads.threadid(), i) = (1, 1)
(Threads.threadid(), i) = (1, 2)
(Threads.threadid(), i) = (1, 3)
(Threads.threadid(), i) = (1, 4)
(Threads.threadid(), i) = (1, 5)
(Threads.threadid(), i) = (1, 6)
(Threads.threadid(), i) = (1, 7)
(Threads.threadid(), i) = (1, 8)
(Threads.threadid(), i) = (1, 9)
(Threads.threadid(), i) = (1, 10)
(Threads.threadid(), i) = (1, 11)
(Threads.threadid(), i) = (1, 12)
(Threads.threadid(), i) = (1, 13)
Note that the threads execute asynchronously (meaning thread 1
executes at the same time as thread 2
, etc.), but the partitioning is deterministic/static.
If I started Julia with, say, \(4\) threads, with
julia --threads=4
then the above for loop would show:
julia> Threads.@threads for i = 1:13
@show (Threads.threadid(), i)
end
(Threads.threadid(), i) = (1, 1)
(Threads.threadid(), i) = (4, 8)
(Threads.threadid(), i) = (2, 5)
(Threads.threadid(), i) = (3, 11)
(Threads.threadid(), i) = (1, 2)
(Threads.threadid(), i) = (2, 6)
(Threads.threadid(), i) = (3, 12)
(Threads.threadid(), i) = (3, 13)
(Threads.threadid(), i) = (1, 3)
(Threads.threadid(), i) = (2, 7)
(Threads.threadid(), i) = (4, 9)
(Threads.threadid(), i) = (1, 4)
(Threads.threadid(), i) = (4, 10)
Thread
1
handles loop index1, 2, 3, 4
Thread
2
handles loop index5, 6, 7
Thread
3
handles loop index11, 12, 13
Thread
4
handles loop index8, 9, 10
4. Multithreading the matrix-matrix multiply#
In principle we can use multiple threads to split any of the loops into multiple threads.
We need to be careful about avoiding race conditions, which may happen if multiple threads will try to update the same micropanel.
Loop 1: mygemm_I_packed
#
Poor performance:
Seeing cost of launching threads
Poor cache reuse (limited amortizing of data movement)
Loop 2: mygemm_JI_packed
#
Better performance
Loop 3: mygemm_I_JI_packed
#
Need to be careful, since this needs multiple copies of
A_pack
(one for each thread)Could do a little better by smoothing out
mc
so that
Loop 4: mygemm_PI_JI_packed
#
Not good, due to race condition with micropanel update
Loop 5: mygemm_JPI_JI_packed
#
Need to be careful, since this needs multiple copies of
A_pack
andB_pack
(one for each thread)Since
nc
is sized for the L3 cache, we need to reduce it so that all the thread data fit in the L3 cache.
Examples in Julia#
You can find examples of multi-threaded, packed implementations in Julia in the julia_codes/module2-5/
directory in the class repository.