12) Introduction to OpenMP#
Last time:
Introduction to parallel scaling
Strong Scaling
Weak Scaling
Today:
OpenMP Basics
1.1 #pragma omp parallel
1.2 #pragma omp simd
1. OpenMP Basics#
What is OpenMP?
A community-developed standard Application Programming Interface (API) (with directives) for:
multithreaded programming
vectorization
offload to coprocessors (such as GPUs)
We have already introduced multithreading:
OpenMP is available for C, C++, and Fortran.
At the time of writing, the latest version is OpenMP 6.0 (November 2024).
OpenMP Resources:#
1.1. #pragma omp parallel#
The OpenMP standard is big, but most applications only use a few constructs/directives:
#include <omp.h>
#include <stdio.h>

int main() {
  #pragma omp parallel
  {
    int num_threads = omp_get_num_threads();
    int my_thread_num = omp_get_thread_num();
    printf("I am thread %d of %d\n", my_thread_num, num_threads);
  }
  return 0;
}
To compile this C program using the GNU compiler gcc, you need to tell the compiler to enable OpenMP capabilities by passing the -fopenmp
flag. You can also optionally pass the -Wall
flag to show all warnings (unused variables, etc.); this is a good way to find potential bugs in your code.
gcc -fopenmp -Wall omp-hello.c -o omp-hello
To execute the generated binary executable file, called omp-hello
in this case, run:
./omp-hello
The output on my own laptop looks like this:
I am thread 4 of 12
I am thread 11 of 12
I am thread 8 of 12
I am thread 2 of 12
I am thread 1 of 12
I am thread 5 of 12
I am thread 10 of 12
I am thread 9 of 12
I am thread 6 of 12
I am thread 3 of 12
I am thread 7 of 12
I am thread 0 of 12
Let’s specify the number of threads to use. Let’s say we want to use 8 threads:
OMP_NUM_THREADS=8 ./omp-hello
Notice that we did not have to recompile the program. This is a runtime environment variable. The output now looks like:
I am thread 4 of 8
I am thread 7 of 8
I am thread 1 of 8
I am thread 3 of 8
I am thread 0 of 8
I am thread 2 of 8
I am thread 5 of 8
I am thread 6 of 8
Parallelizing triad#
What does the code below do?
void triad(int N, double *a, const double *b, double scalar, const double *c) {
  #pragma omp parallel
  {
    for (int i=0; i<N; i++)
      a[i] = b[i] + scalar * c[i];
  }
}
And instead what does the code below do?
void triad(int N, double *a, const double *b, double scalar, const double *c) {
  #pragma omp parallel
  {
    int id = omp_get_thread_num();
    int num_threads = omp_get_num_threads();
    for (int i=id; i<N; i+=num_threads)
      a[i] = b[i] + scalar * c[i];
  }
}
Let’s execute both of them. You can find these codes in c_codes/module3-2/triad_omp.c.
In the first version, every thread redundantly executes the entire loop, so no work is actually divided. In the second version, the iterations are distributed cyclically among the threads, so each element is written by exactly one thread and the work is shared.
Vectorization#
OpenMP-4.0 added the omp simd
construct, which is a portable way to request that the compiler vectorize code. An example of a reason why a compiler might fail to vectorize code is pointer aliasing, which we investigate below.
void triad(int N, double *a, const double *b, double scalar, const double *c) {
  for (int i=0; i<N; i++)
    a[i] = b[i] + scalar * c[i];
}
We can compile this code with the following flags:
gcc -O2 -ftree-vectorize -fopt-info-all -c triad.c -o triad
Here we have used the -O2
optimization flag, together with the -ftree-vectorize
vectorization flag and the -fopt-info-all
flag, which logs optimization information.
For a list of all optimization flags see this reference.
gcc autovectorization starts at -O3, or when you pass -ftree-vectorize
options such as -fopt-info give useful diagnostics, but they are compiler-dependent, and sometimes referring to the assembly is useful
you can also use man gcc for the manual page of gcc and search within it using /
Pointer aliasing#
Is this valid code? What is x
after this call?
double x[5] = {1, 2, 3, 4, 5};
triad(2, x+1, x, 10., x);
Aliasing can occur in any language that can refer to one location in memory with more than one name (for example, with pointers). This is a common problem with functions that accept pointer arguments.
C allows memory to overlap arbitrarily. You can inform the compiler that pointers do not alias using the restrict
qualifier (C99/C11; __restrict
or __restrict__
work with most C++ and CUDA compilers).
With the restrict
qualifier, a programmer promises the compiler that, for the lifetime of the pointer, no other pointer will be used to access the object to which it points. This allows the compiler to make optimizations (for example, vectorization) that would not otherwise have been possible: the compiler is allowed to assume that the qualified pointers point to different locations, so that updating the memory location referenced by one pointer does not affect the memory locations referenced by the other pointers. The programmer, not the compiler, is responsible for ensuring that the pointers do not point to identical locations. The compiler can then, for example, rearrange the code: first load all memory locations, then perform the operations, and only then commit the results back to memory.
Example:
void triad(int N, double *restrict a, const double *restrict b, double scalar, const double *restrict c) {
  for (int i=0; i<N; i++)
    a[i] = b[i] + scalar * c[i];
}
We can compile this code with:
gcc -O2 -march=native -ftree-vectorize -fopt-info-all -c triad-restrict.c -o triad-restrict
Notice that we’ve added the -march=native
compilation flag, which generates instructions specific to the machine the code is being compiled on.
Notice how the optimization report no longer contains loop versioned for vectorization because of possible aliasing
.
The complexity of checking for aliasing can grow combinatorially in the number of arrays being processed, leading to many loop variants and/or preventing vectorization.
Warnings#
The -Wrestrict
flag (included in -Wall
) can catch some programming errors. Consider the following:
void triad(int N, double *restrict a, const double *restrict b, double scalar, const double *restrict c) {
  for (int i=0; i<N; i++)
    a[i] = b[i] + scalar * c[i];
}

void foo(double *x) {
  triad(2, x, x, 10, x);
}
Let’s compile it with:
gcc -O2 -Wall -c triad-foo.c -o triad-foo
The powers of -Wrestrict
are limited, however: as of gcc 13.3.0
(at the time of writing), it does not even catch
void foo(double *x) {
  triad(2, x+1, x, 10, x);
}
Check the assembly#
To make sure that your code is vectorized properly or not affected by aliasing, you can check the assembly code with objdump
:
objdump -d --prefix-addresses -M intel triad-restrict
Is the assembly qualitatively different without restrict
(in which case the compiler “versions” the loop)? Check it yourself by inspecting the output of objdump
with triad
.
1.2 #pragma omp simd#
An alternative (or supplement) to restrict
is #pragma omp simd
.
void triad(int N, double *a, const double *b, double scalar, const double *c) {
#pragma omp simd
  for (int i=0; i<N; i++)
    a[i] = b[i] + scalar * c[i];
}
We can compile this with:
gcc -O2 -march=native -ftree-vectorize -fopenmp -fopt-info-all -c triad-omp-simd.c -o triad-omp-simd