12) Introduction to OpenMP#

Last time:

  • Introduction to parallel scaling

  • Strong Scaling

  • Weak Scaling

Today:

  1. OpenMP Basics
    1.1 #pragma omp parallel
    1.2 #pragma omp simd

1. OpenMP Basics#

What is OpenMP?

A community-developed standard Application Programming Interface (API) (with directives) for:

  • multithreaded programming

  • vectorization

  • offload to coprocessors (such as GPUs)

We have already introduced multithreading; OpenMP gives us a portable, directive-based way to use it.

  • OpenMP is available for C, C++, and Fortran.

At the time of writing, the latest version of the standard is OpenMP 6.0 (November 2024).
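As an aside (this example is mine, not from the notes), you can check at compile time which version of the specification the compiler implements: when OpenMP is enabled, the standard macro _OPENMP expands to the yyyymm date of the supported spec.

#include <stdio.h>

int main() {
#ifdef _OPENMP
  // _OPENMP is the yyyymm date of the OpenMP specification the compiler supports
  printf("OpenMP version macro: %d\n", _OPENMP);
#else
  printf("Compiled without OpenMP support\n");
#endif
  return 0;
}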


1.1. #pragma omp parallel#

The OpenMP standard is big, but most applications only use a few constructs/directives:

#include <omp.h>
#include <stdio.h>

int main() {
  #pragma omp parallel
  {
    // every thread in the parallel region executes this block
    int num_threads = omp_get_num_threads();
    int my_thread_num = omp_get_thread_num();
    printf("I am thread %d of %d\n", my_thread_num, num_threads);
  }
  return 0;
}

To compile this C program with the GNU compiler gcc, you need to tell the compiler to enable OpenMP support by passing the -fopenmp flag. You can also optionally pass the -Wall flag to show all warnings (unused variables, etc.), which is a good way to find potential bugs in your code.

gcc -fopenmp -Wall omp-hello.c -o omp-hello

To run the generated executable, called omp-hello in this case:

./omp-hello

The output on my own laptop looks like this:

I am thread 4 of 12
I am thread 11 of 12
I am thread 8 of 12
I am thread 2 of 12
I am thread 1 of 12
I am thread 5 of 12
I am thread 10 of 12
I am thread 9 of 12
I am thread 6 of 12
I am thread 3 of 12
I am thread 7 of 12
I am thread 0 of 12

Note that the threads print in no particular order: the operating system schedules them independently, so the interleaving varies from run to run. Now let’s specify the number of threads to use, say 8:

OMP_NUM_THREADS=8 ./omp-hello

Notice that we did not have to recompile the program: OMP_NUM_THREADS is an environment variable read by the OpenMP runtime at startup. The output now looks like:

I am thread 4 of 8
I am thread 7 of 8
I am thread 1 of 8
I am thread 3 of 8
I am thread 0 of 8
I am thread 2 of 8
I am thread 5 of 8
I am thread 6 of 8
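The environment variable is not the only way to control the thread count. As a sketch (the program below is mine, but both mechanisms are part of the standard OpenMP API), you can request a thread count from within the code, either globally with omp_set_num_threads() or per region with the num_threads clause; the clause takes precedence over the call, which in turn takes precedence over OMP_NUM_THREADS, and the runtime treats all of them as requests:

#include <omp.h>
#include <stdio.h>

int main() {
  omp_set_num_threads(4);  // request 4 threads for subsequent parallel regions
  #pragma omp parallel
  printf("region 1: thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());

  #pragma omp parallel num_threads(2)  // overrides the setting for this region only
  printf("region 2: thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
  return 0;
}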

Parallelizing triad#

What does the code below do?

void triad(int N, double *a, const double *b, double scalar, const double *c) {
#pragma omp parallel
    {
        for (int i=0; i<N; i++)
            a[i] = b[i] + scalar * c[i];
    }
}

And instead what does the code below do?

void triad(int N, double *a, const double *b, double scalar, const double *c) {

#pragma omp parallel
    {
        int id = omp_get_thread_num();
        int num_threads = omp_get_num_threads();
        for (int i=id; i<N; i+=num_threads)
            a[i] = b[i] + scalar * c[i];
    }
}

Let’s execute both of them. You can find this code in c_codes/module3-2/triad_omp.c.

In the first version, every thread executes the entire loop: the work is replicated on all threads rather than divided among them, so there is no speedup. In the second version, each thread starts at its own index and strides by the number of threads, so the iterations are distributed round-robin, each element is computed exactly once, and the load is evenly balanced across threads.
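Writing the index arithmetic by hand works, but OpenMP can divide the iterations for you with a worksharing construct. A minimal sketch using the standard #pragma omp parallel for directive (this variant is mine, not in the course file):

void triad(int N, double *a, const double *b, double scalar, const double *c) {
#pragma omp parallel for
    for (int i=0; i<N; i++)
        a[i] = b[i] + scalar * c[i];
}

By default each thread typically receives one contiguous chunk of iterations; adding schedule(static,1) to the directive would recover the round-robin distribution of the hand-written version.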

Vectorization#

OpenMP-4.0 added the omp simd construct, which is a portable way to request that the compiler vectorize code. An example of a reason why a compiler might fail to vectorize code is pointer aliasing, which we investigate below.

void triad(int N, double *a, const double *b, double scalar, const double *c) {
    for (int i=0; i<N; i++)
        a[i] = b[i] + scalar * c[i];
}

We can compile this code with the following flags:

gcc -O2 -ftree-vectorize -fopt-info-all -c triad.c -o triad

Here we use the -O2 optimization flag together with the -ftree-vectorize flag to enable vectorization and the -fopt-info-all flag to log what the optimizer did (and failed to do).

For a list of all optimization flags, see the GCC documentation on optimization options.

  • gcc enables autovectorization at -O3, or at lower optimization levels if you pass -ftree-vectorize

  • options such as -fopt-info give useful diagnostics, but they are compiler-dependent, and sometimes inspecting the assembly is more reliable

  • you can also open the gcc manual page with man gcc and search it with /

Pointer aliasing#

Is this valid code? What is x after this call?

double x[5] = {1, 2, 3, 4, 5};
triad(2, x+1, x, 10., x);
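Working it through (the arithmetic is mine): the call makes a = x+1 alias b = c = x, and it is valid C. Executed sequentially, i = 0 sets x[1] = x[0] + 10*x[0] = 11, and i = 1 then reads that freshly written value, setting x[2] = x[1] + 10*x[1] = 121, so x becomes {1, 11, 121, 4, 5}. A vectorized version that loaded b and c before storing to a would instead compute x[2] = 2 + 20 = 22. The compiler must preserve the sequential result, which is exactly why possible aliasing inhibits vectorization.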

Aliasing can occur in any language that can refer to one location in memory with more than one name (for example, with pointers). This is a common problem with functions that accept pointer arguments.

C allows pointer arguments to overlap (alias) arbitrarily. You can promise the compiler that they do not by using the restrict qualifier (C99/C11; __restrict or __restrict__ work with most C++ and CUDA compilers).

With the restrict qualifier, the programmer asserts that, for the lifetime of the pointer, no other pointer will be used to access the object it points to. The compiler may then assume that the qualified pointers refer to distinct locations, so a store through one does not change the values read through the others, and it can perform optimizations (for example, vectorization) that would not otherwise be legal: it can, e.g., rearrange the code to first load all the inputs, then perform the arithmetic, and only then commit the results back to memory. The programmer, not the compiler, is responsible for ensuring that the pointers really do not point to overlapping memory.

Example:

void triad(int N, double *restrict a, const double *restrict b, double scalar, const double *restrict c) {
    for (int i=0; i<N; i++)
        a[i] = b[i] + scalar * c[i];
}

We can compile this code with:

gcc -O2 -march=native -ftree-vectorize -fopt-info-all -c triad-restrict.c -o triad-restrict

Notice that we’ve added the -march=native compilation flag, which generates instructions specific to the machine the code is being compiled on.

Notice that the optimization report no longer contains the message “loop versioned for vectorization because of possible aliasing”: the compiler no longer has to emit both a vectorized variant and an aliasing-safe scalar fallback.

The complexity of checking for aliasing can grow combinatorially in the number of arrays being processed, leading to many loop variants and/or preventing vectorization.
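To make this concrete (the count is my illustration): for triad the generated runtime checks must verify that a overlaps neither b nor c; with k arrays the number of pairwise overlap tests grows roughly as k², and each combination of outcomes may require its own version of the loop.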

Warnings#

The -Wrestrict flag (included in -Wall) can catch some such programming errors. Consider the following:

void triad(int N, double *restrict a, const double *restrict b, double scalar, const double *restrict c) {
    for (int i=0; i<N; i++)
        a[i] = b[i] + scalar * c[i];
}

void foo(double *x) {
    triad(2, x, x, 10, x);
}

Let’s compile it with:

gcc -O2 -Wall -c triad-foo.c -o triad-foo

-Wrestrict is limited, however, and (as of gcc 13.3.0 at the time of writing) does not even catch

void foo(double *x) {
  triad(2, x+1, x, 10, x);
}

Check the assembly#

To make sure that your code is vectorized properly or not affected by aliasing, you can check the assembly code with objdump:

objdump -d --prefix-addresses -M intel triad-restrict

  • Is the assembly qualitatively different without restrict (in which case the compiler “versions” the loop)? Check it yourself by inspecting the output of objdump on triad.
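As a rough guide for reading the disassembly on x86-64 (these instruction names are my addition, not from the notes): vectorized code works on packed doubles in xmm/ymm/zmm registers with instructions such as vmulpd, vaddpd, or fused vfmadd variants, while scalar code uses the single-double forms vmulsd and vaddsd. A function containing both a vectorized body and a scalar copy guarded by pointer-overlap checks is the “versioned” loop mentioned above.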

1.2 #pragma omp simd#

An alternative (or supplement) to restrict is #pragma omp simd.

void triad(int N, double *a, const double *b, double scalar, const double *c) {
#pragma omp simd
    for (int i=0; i<N; i++)
        a[i] = b[i] + scalar * c[i];
}

We can compile this with:

gcc -O2 -march=native -ftree-vectorize -fopenmp -fopt-info-all -c triad-omp-simd.c -o triad-omp-simd
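The simd construct also composes with threading. As a sketch (the combined construct is standard OpenMP, though this variant is not in the course code), #pragma omp parallel for simd distributes the loop iterations across threads and asks the compiler to vectorize each thread’s chunk:

void triad(int N, double *a, const double *b, double scalar, const double *c) {
#pragma omp parallel for simd
    for (int i=0; i<N; i++)
        a[i] = b[i] + scalar * c[i];
}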