10) Introduction to Multithreading#

Last time:

  • Packing for L1/L2 cache

  • Packing for L3 cache

Today:

  1. Introduction to threads

  2. Julia and multithreading

  3. Julia VSCode extension

  4. Multithreading the matrix-matrix multiply

In this lecture, we are going to discuss running a multithreaded version of the code. We will use multiple threads on our example problem, the matrix-matrix multiply.

1. Introduction to threads#

In computer science, a thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, which is typically a part of the operating system. In many cases, a thread is a component of a process (i.e., the instance of a computer program that is being executed).

The multiple threads of a given process may be executed concurrently (via multithreading capabilities), sharing resources such as memory, while different processes do not share these resources.

In particular, the threads of a process share its executable code and the values of its dynamically allocated variables and non-thread-local global variables at any given time.

The implementation of threads and processes differs between operating systems. In fact, depending on the OS, a process may be made up of multiple threads of execution that execute instructions concurrently.

Advantages of multithreading#

If a thread gets a lot of cache misses, the other threads can continue taking advantage of the unused computing resources, which may lead to faster overall execution, as these resources would have been idle if only a single thread were executed. Also, if a thread cannot use all the computing resources of the CPU (because instructions depend on each other’s result), running another thread may prevent those resources from becoming idle.

Disadvantages of multithreading#

Multiple threads can interfere with each other when sharing hardware resources such as caches or translation lookaside buffers (TLBs). As a result, execution times of a single thread are not improved and can be degraded, even when only one thread is executing, due to lower frequencies or additional pipeline stages that are necessary to accommodate thread-switching hardware.

Merging data from two processes (in the case of multiprocessing) or threads can often incur higher costs compared to processing the same data on a single thread, due to overheads such as inter-process communication and synchronization.

Overall, efficiency of multi-threading varies across different vendors and programming implementations.

2. Julia and multithreading#

Tip

For reference see the official Julia documentation on multithreading.

In the Julia REPL, you can see how many threads you have access to with:

Threads.nthreads()
1

By default it will be 1. To use more than 1 thread we need to either set the environment variable JULIA_NUM_THREADS or use the command line argument --threads (or -t) when you start Julia from a terminal. Example:

julia --threads=5

3. Julia VSCode extension#

If you use VSCode as an IDE, to install the Julia for VSCode extension, open VSCode, go to Extensions, search for Julia and install.

Once this is done, you should be able to run Julia within VSCode without any further configuration (sometimes you might have to set the Julia Executable path setting).

Examples of what is possible:

  • A console for Julia which you can get by typing Alt+J, Alt+O in VSCode.

  • Using Julia in a scripting environment in VSCode. When working on a Julia script (a .jl text file) you may evaluate any amount of code in the script interactively. E.g., Ctrl+Enter evaluates current line being edited. The Julia process that evaluates the script is the same as the one in the console you get in point 1. The Julia console and the text editor window are linked!

  • You can open Jupyter notebooks without installing anything else. Simply drag+drop the notebook into VSCode and it will request a “kernel” (choose Julia). Each Jupyter notebook launches a dedicated Julia process, so the Jupyter notebooks are not linked with the standard text editor and Julia console of VSCode!

Tip

The instructions above are referenced from a very good resource: the Zero-to-Hero Julia workshop that you should check out.

Going back to multi-threading, in VSCode you can set the number of threads for the REPL by going to the settings for the Julia plugin:

  • click on extension

  • click the little gear

  • search for threads

  • set the number you want

Every thread knows its unique threadid.

Threads.threadid()
1

To access multiple threads we use the Threads.@threads macro which will split the iterations of a loop over the available threads. For example:

Threads.@threads for i = 1:13
    @show (Threads.threadid(), i)
end
(Threads.threadid(), i) = (1, 1)
(Threads.threadid(), i) = (1, 2)
(Threads.threadid(), i) = (1, 3)
(Threads.threadid(), i) = (1, 4)
(Threads.threadid(), i) = (1, 5)
(Threads.threadid(), i) = (1, 6)
(Threads.threadid(), i) = (1, 7)
(Threads.threadid(), i) = (1, 8)
(Threads.threadid(), i) = (1, 9)
(Threads.threadid(), i) = (1, 10)
(Threads.threadid(), i) = (1, 11)
(Threads.threadid(), i) = (1, 12)
(Threads.threadid(), i) = (1, 13)

Note that the threads execute asynchronously (meaning thread 1 executes at the same time as thread 2, etc.), but the partitioning is deterministic/static.

If I started Julia with, say, \(4\) threads, with

julia --threads=4

then the above for loop would show:

julia> Threads.@threads for i = 1:13
           @show (Threads.threadid(), i)
       end
(Threads.threadid(), i) = (1, 1)
(Threads.threadid(), i) = (4, 8)
(Threads.threadid(), i) = (2, 5)
(Threads.threadid(), i) = (3, 11)
(Threads.threadid(), i) = (1, 2)
(Threads.threadid(), i) = (2, 6)
(Threads.threadid(), i) = (3, 12)
(Threads.threadid(), i) = (3, 13)
(Threads.threadid(), i) = (1, 3)
(Threads.threadid(), i) = (2, 7)
(Threads.threadid(), i) = (4, 9)
(Threads.threadid(), i) = (1, 4)
(Threads.threadid(), i) = (4, 10)
  • Thread 1 handles loop index 1, 2, 3, 4

  • Thread 2 handles loop index 5, 6, 7

  • Thread 3 handles loop index 11, 12, 13

  • Thread 4 handles loop index 8, 9, 10

4. Multithreading the matrix-matrix multiply#

In principle we can use multiple threads to split any of the loops into multiple threads.

We need to be careful about avoiding race conditions, which may happen if multiple threads will try to update the same micropanel.

Loop 1: mygemm_I_packed#

Poor performance:

  • Seeing cost of launching threads

  • Poor cache reuse (limited amortizing of data movement)

Loop 2: mygemm_JI_packed#

  • Better performance

Loop 3: mygemm_I_JI_packed#

  • Need to be careful, since this needs multiple copies of A_pack (one for each thread)

  • Could do a little better by smoothing out mc so that

Loop 4: mygemm_PI_JI_packed#

  • Not good, due to race condition with micropanel update

Loop 5: mygemm_JPI_JI_packed#

  • Need to be careful, since this needs multiple copies of A_pack and B_pack (one for each thread)

  • Since nc is sized for the L3 cache, we need to reduce it so that all the thread data fit in the L3 cache.

Examples in Julia#

You can find examples of multi-threaded, packed implementations in Julia in the julia_codes/module2-5/ directory in the class repository.