13) More on OpenMP and OpenMP Tasks#

Last time:

  • OpenMP Basics

  • #pragma omp parallel

  • #pragma omp simd

Today:

  1. More on OpenMP

  2. Memory semantics

  3. A quick demo on perf

  4. OpenMP Tasks

1. More on OpenMP#

What does the compiler do when we add the #pragma omp parallel directive?

static double dot_opt3(size_t n, const double *a, const double *b) {
  double sum = 0;
  omp_set_num_threads(4);
  #pragma omp parallel
  {
    // Worksharing loop: the iterations are split across the 4 threads and
    // the partial sums are combined via the reduction clause.
    #pragma omp for reduction(+:sum)
    for (size_t i=0; i<n; i++)
      sum += a[i] * b[i];
  }
  return sum;
}
! gcc -Os -march=native -fopenmp ../c_codes/module3-3/dot.c -o dot
! objdump -d --prefix-addresses -M intel dot | grep dot_opt3
0000000000001379 <main+0x1b9> call   0000000000001849 <dot_opt3>
0000000000001849 <dot_opt3> push   r12
000000000000184b <dot_opt3+0x2> mov    r12,rdx
000000000000184e <dot_opt3+0x5> push   rbp
000000000000184f <dot_opt3+0x6> mov    rbp,rsi
0000000000001852 <dot_opt3+0x9> push   rbx
0000000000001853 <dot_opt3+0xa> mov    rbx,rdi
0000000000001856 <dot_opt3+0xd> mov    edi,0x4
000000000000185b <dot_opt3+0x12> sub    rsp,0x30
000000000000185f <dot_opt3+0x16> mov    rax,QWORD PTR fs:0x28
0000000000001868 <dot_opt3+0x1f> mov    QWORD PTR [rsp+0x28],rax
000000000000186d <dot_opt3+0x24> xor    eax,eax
000000000000186f <dot_opt3+0x26> call   0000000000001140 <omp_set_num_threads@plt>
0000000000001874 <dot_opt3+0x2b> lea    rsi,[rsp+0x8]
0000000000001879 <dot_opt3+0x30> xor    ecx,ecx
000000000000187b <dot_opt3+0x32> xor    edx,edx
000000000000187d <dot_opt3+0x34> lea    rdi,[rip+0xc4]        # 0000000000001948 <dot_opt3._omp_fn.0>
0000000000001884 <dot_opt3+0x3b> mov    QWORD PTR [rsp+0x18],r12
0000000000001889 <dot_opt3+0x40> mov    QWORD PTR [rsp+0x10],rbp
000000000000188e <dot_opt3+0x45> mov    QWORD PTR [rsp+0x8],rbx
0000000000001893 <dot_opt3+0x4a> mov    QWORD PTR [rsp+0x20],0x0
000000000000189c <dot_opt3+0x53> call   00000000000011b0 <GOMP_parallel@plt>
00000000000018a1 <dot_opt3+0x58> vmovsd xmm0,QWORD PTR [rsp+0x20]
00000000000018a7 <dot_opt3+0x5e> mov    rax,QWORD PTR [rsp+0x28]
00000000000018ac <dot_opt3+0x63> sub    rax,QWORD PTR fs:0x28
00000000000018b5 <dot_opt3+0x6c> je     00000000000018bc <dot_opt3+0x73>
00000000000018b7 <dot_opt3+0x6e> call   0000000000001150 <__stack_chk_fail@plt>
00000000000018bc <dot_opt3+0x73> add    rsp,0x30
00000000000018c0 <dot_opt3+0x77> pop    rbx
00000000000018c1 <dot_opt3+0x78> pop    rbp
00000000000018c2 <dot_opt3+0x79> pop    r12
00000000000018c4 <dot_opt3+0x7b> ret
0000000000001948 <dot_opt3._omp_fn.0> endbr64
000000000000194c <dot_opt3._omp_fn.0+0x4> push   r14
000000000000194e <dot_opt3._omp_fn.0+0x6> mov    r14,QWORD PTR [rdi+0x8]
0000000000001952 <dot_opt3._omp_fn.0+0xa> push   r13
0000000000001954 <dot_opt3._omp_fn.0+0xc> mov    r13,QWORD PTR [rdi+0x10]
0000000000001958 <dot_opt3._omp_fn.0+0x10> push   r12
000000000000195a <dot_opt3._omp_fn.0+0x12> push   rbp
000000000000195b <dot_opt3._omp_fn.0+0x13> mov    rbp,rdi
000000000000195e <dot_opt3._omp_fn.0+0x16> push   rbx
000000000000195f <dot_opt3._omp_fn.0+0x17> mov    rbx,QWORD PTR [rdi]
0000000000001962 <dot_opt3._omp_fn.0+0x1a> test   rbx,rbx
0000000000001965 <dot_opt3._omp_fn.0+0x1d> jne    0000000000001996 <dot_opt3._omp_fn.0+0x4e>
0000000000001967 <dot_opt3._omp_fn.0+0x1f> vxorpd xmm0,xmm0,xmm0
000000000000196b <dot_opt3._omp_fn.0+0x23> mov    rdx,QWORD PTR [rbp+0x18]
000000000000196f <dot_opt3._omp_fn.0+0x27> lea    rcx,[rbp+0x18]
0000000000001973 <dot_opt3._omp_fn.0+0x2b> vmovq  xmm2,rdx
0000000000001978 <dot_opt3._omp_fn.0+0x30> mov    rax,rdx
000000000000197b <dot_opt3._omp_fn.0+0x33> vaddsd xmm1,xmm0,xmm2
000000000000197f <dot_opt3._omp_fn.0+0x37> vmovq  rsi,xmm1
0000000000001984 <dot_opt3._omp_fn.0+0x3c> lock cmpxchg QWORD PTR [rcx],rsi
0000000000001989 <dot_opt3._omp_fn.0+0x41> mov    rsi,rdx
000000000000198c <dot_opt3._omp_fn.0+0x44> mov    rdx,rax
000000000000198f <dot_opt3._omp_fn.0+0x47> cmp    rsi,rax
0000000000001992 <dot_opt3._omp_fn.0+0x4a> je     00000000000019e7 <dot_opt3._omp_fn.0+0x9f>
0000000000001994 <dot_opt3._omp_fn.0+0x4c> jmp    0000000000001973 <dot_opt3._omp_fn.0+0x2b>
0000000000001996 <dot_opt3._omp_fn.0+0x4e> call   0000000000001170 <omp_get_num_threads@plt>
000000000000199b <dot_opt3._omp_fn.0+0x53> mov    r12d,eax
000000000000199e <dot_opt3._omp_fn.0+0x56> call   0000000000001130 <omp_get_thread_num@plt>
00000000000019a3 <dot_opt3._omp_fn.0+0x5b> movsxd rsi,r12d
00000000000019a6 <dot_opt3._omp_fn.0+0x5e> xor    edx,edx
00000000000019a8 <dot_opt3._omp_fn.0+0x60> movsxd rcx,eax
00000000000019ab <dot_opt3._omp_fn.0+0x63> mov    rax,rbx
00000000000019ae <dot_opt3._omp_fn.0+0x66> div    rsi
00000000000019b1 <dot_opt3._omp_fn.0+0x69> cmp    rcx,rdx
00000000000019b4 <dot_opt3._omp_fn.0+0x6c> jb     00000000000019e0 <dot_opt3._omp_fn.0+0x98>
00000000000019b6 <dot_opt3._omp_fn.0+0x6e> imul   rcx,rax
00000000000019ba <dot_opt3._omp_fn.0+0x72> vxorpd xmm0,xmm0,xmm0
00000000000019be <dot_opt3._omp_fn.0+0x76> add    rdx,rcx
00000000000019c1 <dot_opt3._omp_fn.0+0x79> add    rax,rdx
00000000000019c4 <dot_opt3._omp_fn.0+0x7c> cmp    rdx,rax
00000000000019c7 <dot_opt3._omp_fn.0+0x7f> jae    000000000000196b <dot_opt3._omp_fn.0+0x23>
00000000000019c9 <dot_opt3._omp_fn.0+0x81> vmovsd xmm3,QWORD PTR [r14+rdx*8]
00000000000019cf <dot_opt3._omp_fn.0+0x87> vfmadd231sd xmm0,xmm3,QWORD PTR [r13+rdx*8+0x0]
00000000000019d6 <dot_opt3._omp_fn.0+0x8e> inc    rdx
00000000000019d9 <dot_opt3._omp_fn.0+0x91> cmp    rax,rdx
00000000000019dc <dot_opt3._omp_fn.0+0x94> jne    00000000000019c9 <dot_opt3._omp_fn.0+0x81>
00000000000019de <dot_opt3._omp_fn.0+0x96> jmp    000000000000196b <dot_opt3._omp_fn.0+0x23>
00000000000019e0 <dot_opt3._omp_fn.0+0x98> inc    rax
00000000000019e3 <dot_opt3._omp_fn.0+0x9b> xor    edx,edx
00000000000019e5 <dot_opt3._omp_fn.0+0x9d> jmp    00000000000019b6 <dot_opt3._omp_fn.0+0x6e>
00000000000019e7 <dot_opt3._omp_fn.0+0x9f> pop    rbx
00000000000019e8 <dot_opt3._omp_fn.0+0xa0> pop    rbp
00000000000019e9 <dot_opt3._omp_fn.0+0xa1> pop    r12
00000000000019eb <dot_opt3._omp_fn.0+0xa3> pop    r13
00000000000019ed <dot_opt3._omp_fn.0+0xa5> pop    r14
00000000000019ef <dot_opt3._omp_fn.0+0xa7> ret

Anatomy of a parallel region#

"Anatomy of a parallel region"

GOMP stands for the GNU Offloading and Multi-Processing Project, the GNU compilers' implementation of OpenMP and OpenACC.
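
Conceptually, GCC outlines the body of the parallel region into the function dot_opt3._omp_fn.0 seen in the disassembly, packs the pointers and the reduction variable into a structure on the stack, and passes both to libgomp's GOMP_parallel, which forks the team, runs the outlined function on every thread, and joins. A rough hand-written rendering of that transformation (a sketch only: the struct layout, helper names, and the use of GCC's __atomic builtins are illustrative, not what GCC literally emits):

#include <omp.h>
#include <stddef.h>

/* Shared state of the parallel region, gathered into one block. */
struct omp_data { size_t n; const double *a, *b; double sum; };

/* libgomp's fork/join entry point (prototype shown for illustration). */
void GOMP_parallel(void (*fn)(void *), void *data,
                   unsigned num_threads, unsigned flags);

/* Outlined region body: each thread computes a contiguous slice of the
   loop, then merges its partial sum with a compare-and-swap loop, which
   is what the `lock cmpxchg` in dot_opt3._omp_fn.0 implements. */
static void dot_outlined(void *arg) {
  struct omp_data *d = arg;
  size_t nthr = (size_t)omp_get_num_threads();
  size_t id   = (size_t)omp_get_thread_num();
  size_t chunk = d->n / nthr, rem = d->n % nthr;
  size_t lo = id * chunk + (id < rem ? id : rem);
  size_t hi = lo + chunk + (id < rem ? 1 : 0);
  double partial = 0;
  for (size_t i = lo; i < hi; i++)
    partial += d->a[i] * d->b[i];
  double old = d->sum, desired;
  do {
    desired = old + partial;  /* retry until no other thread interfered */
  } while (!__atomic_compare_exchange(&d->sum, &old, &desired, 0,
                                      __ATOMIC_RELAXED, __ATOMIC_RELAXED));
}

static double dot_opt3_sketch(size_t n, const double *a, const double *b) {
  struct omp_data d = {n, a, b, 0};
  omp_set_num_threads(4);
  GOMP_parallel(dot_outlined, &d, 0, 0);  /* 0 = use the current thread count */
  return d.sum;
}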

2. Memory semantics#

For each variable accessed within the parallel region, we can specify the following data-sharing policies:

  • private: each thread in the OpenMP parallel region gets its own copy of the listed variables. These copies are not initialised upon entering the parallel region.

  • firstprivate: like private, except that each thread's copy is initialised with the value the original variable had upon entering the parallel region.

  • lastprivate: usable on a parallelised loop or sections construct. It shares the semantics of private (each thread gets an uninitialised copy), but at the end of the loop or sections the original variable takes the value of the copy held by the thread that executed the last iteration or section, where "last" means the iteration or section that would be executed last if they were executed sequentially.

  • shared: the listed variables are shared across the threads of the OpenMP parallel region. They are therefore accessed concurrently, giving rise to potential data races.

int a=0, b=1, c=2;

#pragma omp parallel private(a) firstprivate(b) shared(c)
{
    int id = omp_get_thread_num();
    a++;
    b++;
    c++;
    printf("[%d] %d %d %d\n", id, a, b, c);
}
printf("END: %d %d %d\n", a, b, c);
! gcc -fopenmp -Wall ../c_codes/module3-3/omp-mem.c -o omp-mem
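
The example above does not exercise lastprivate; a minimal sketch of its semantics (a hypothetical file, not part of omp-mem.c) on a parallel loop:

#include <stdio.h>

int main(void) {
  int last = -1;
  /* Each thread gets its own uninitialised copy of `last`; after the loop,
     the original `last` takes the value from the sequentially-last
     iteration, i.e. i == 99. */
  #pragma omp parallel for lastprivate(last)
  for (int i = 0; i < 100; i++)
    last = i;
  printf("last = %d\n", last);  /* prints 99 */
  return 0;
}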

Programming styles#

The private clause is actually unnecessary and error-prone: we can simply declare those variables at the innermost scope.

int b=1, c=2;

#pragma omp parallel firstprivate(b) shared(c)
{
    int a = 0;
    int id = omp_get_thread_num();
    a++;
    b++;
    c++;
    printf("[%d] %d %d %d\n", id, a, b, c);
}
printf("END: %d %d %d\n", a, b, c); // Error: a not in scope here

Updating shared variables#

We see that the shared variable c has lots of opportunities for conflict.

Figure: updating a shared variable.

If we run the above many times, we may sometimes find that several threads print the same value of c, that each thread observes a different subset of the other threads' increments, and that the total number of increments applied to c varies from run to run.

We can define ordering semantics using:

  • atomic: The atomic construct ensures that a specific storage location is accessed atomically, rather than exposing it to the possibility of multiple, simultaneous reading and writing threads that may result in indeterminate values.

  • critical: The critical construct restricts execution of the associated structured block to a single thread at a time.

  • barrier: The barrier construct specifies an explicit barrier at the point at which the construct appears. The barrier construct is a stand-alone directive.

int b=1, c=2;
  
#pragma omp parallel firstprivate(b) shared(c)
{
    int a = 1;
    int id = omp_get_thread_num();
    b++;
    #pragma omp critical
    c++;
    #pragma omp barrier
    printf("[%d] %d %d %d\n", id, a, b, c);
}
printf("END: _ %d %d\n", b, c);

3. A quick demo on perf#

Linux perf is a kernel interrupt-based profiling tool. It uses performance counters and interrupts to diagnose all sorts of bottlenecks.

! perf stat ../c_codes/module3-3/dot -n 10000 > /dev/null
 Performance counter stats for '../c_codes/module3-3/dot -n 10000':

             11.21 msec task-clock                       #    2.441 CPUs utilized             
                 0      context-switches                 #    0.000 /sec                      
                 0      cpu-migrations                   #    0.000 /sec                      
               119      page-faults                      #   10.614 K/sec                     
        19,901,621      cycles                           #    1.775 GHz                       
        11,499,889      instructions                     #    0.58  insn per cycle            
         1,015,154      branches                         #   90.548 M/sec                     
            14,598      branch-misses                    #    1.44% of all branches           

       0.004592519 seconds time elapsed

       0.007619000 seconds user
       0.004232000 seconds sys
! perf record -g ../c_codes/module3-3/dot -n 10000 -r 1000 > /dev/null
WARNING: Kernel address maps (/proc/{kallsyms,modules}) are restricted,
check /proc/sys/kernel/kptr_restrict and /proc/sys/kernel/perf_event_paranoid.

Samples in kernel functions may not be resolved if a suitable vmlinux
file is not found in the buildid cache or in the vmlinux path.

Samples in kernel modules won't be resolved at all.

If some relocation was applied (e.g. kexec) symbols may be misresolved
even with a suitable vmlinux or kallsyms file.

Couldn't record kernel reference relocation symbol
Symbol resolution may be skewed if relocation was used (e.g. kexec).
Check /proc/kallsyms permission or run as root.
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.195 MB perf.data (2115 samples) ]
! perf report -M intel
(perf report opens an interactive terminal UI; its output is omitted here. It repeats the warning that kernel samples cannot be resolved while /proc/sys/kernel/kptr_restrict is set.)

Note how GOMP overhead dominates the cost in this experiment. We need more work (longer arrays, etc.) to justify the overhead of distributing and collecting the parallel work.

We can drill down into particular functions (especially those that we’ve written, which we have hopefully compiled with -g to include debugging information).

Figure: annotated perf report.

From this, we see specific instructions, and their corresponding lines of code, that are most frequently being processed when the kernel interrupts to check. In this experiment, we see *sd “scalar double” instructions, indicating lack of vectorization.

In contrast, the following annotation shows use of *pd “packed double” instructions, indicating that the “hot” loop has been vectorized.

Figure: annotated perf report for the vectorized version.

Whether a loop was vectorized, and why not, can often be determined with -fopt-info and -fopt-info-missed. Vectorization can be encouraged by techniques like manually splitting accumulators, preventing aliasing by using restrict, directives like #pragma omp simd, and global compiler flags like -ffast-math (which is dangerous, since it relaxes floating-point semantics).
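
As an illustration of those techniques (a sketch, not the dot.c used above): restrict rules out aliasing, #pragma omp simd requests vectorization of the loop, and manually splitting the accumulator breaks the single serial dependence chain without resorting to -ffast-math.

#include <stddef.h>

/* restrict promises a and b don't alias; the simd reduction lets the
   compiler keep several partial sums in vector lanes (*pd instructions). */
static double dot_simd(size_t n, const double *restrict a,
                       const double *restrict b) {
  double sum = 0;
  #pragma omp simd reduction(+:sum)
  for (size_t i = 0; i < n; i++)
    sum += a[i] * b[i];
  return sum;
}

/* Manually split accumulators: two independent chains that the compiler
   (or the CPU) can overlap, combined once at the end. */
static double dot_split(size_t n, const double *restrict a,
                        const double *restrict b) {
  double s0 = 0, s1 = 0;
  size_t i;
  for (i = 0; i + 1 < n; i += 2) {
    s0 += a[i]     * b[i];
    s1 += a[i + 1] * b[i + 1];
  }
  if (i < n)                /* leftover element when n is odd */
    s0 += a[i] * b[i];
  return s0 + s1;
}

Compiling such a sketch with gcc -O2 -march=native -fopenmp-simd -fopt-info-vec should report whether the first loop was vectorized.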

Tip

For more on perf, see Brendan Gregg’s Linux Performance site.

4. OpenMP Tasks#

Tip

See this resource for OpenMP Task Parallelism.

Using #pragma omp task#

Up to now, we’ve been expressing parallelism for iterating over an array.

  • The application programmer specifies regions of code to be executed in a task with the #pragma omp task construct

  • All tasks can be executed independently

  • When any thread encounters a task construct, a task is generated

  • Tasks are executed asynchronously by any thread of the parallel region

  • Completion of the tasks can be guaranteed using the taskwait synchronization construct
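
A minimal sketch of task creation and taskwait (a hypothetical example, not one of the files used below): two tasks run asynchronously on whichever threads are free, and taskwait blocks the generating task until both have completed.

#include <omp.h>
#include <stdio.h>

int main(void) {
  #pragma omp parallel
  #pragma omp single
  {
    #pragma omp task          /* may run on any thread of the team */
    printf("task A on thread %d\n", omp_get_thread_num());
    #pragma omp task
    printf("task B on thread %d\n", omp_get_thread_num());
    #pragma omp taskwait      /* wait for both child tasks */
    printf("both tasks done\n");
  }
  return 0;
}

The examples that follow (task_dep.4.c and its inout variant) instead order tasks through depend clauses.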

#include <stdio.h>
int main() {
  int x = 1;
  #pragma omp parallel
  #pragma omp single
  {
    #pragma omp task shared(x) depend(out: x)
    x = 2;
    #pragma omp task shared(x) depend(in: x)
    printf("x + 1 = %d. ", x+1);
    #pragma omp task shared(x) depend(in: x)
    printf("x + 2 = %d. ", x+2);
  }
  puts("");
  return 0;
}
! gcc -fopenmp ../c_codes/module3-3/task_dep.4.c -o task_dep.4
! for i in {1..10}; do ./task_dep.4; done
  • The single construct specifies that the associated structured block is executed by only one of the threads in the team (not necessarily the master thread), in the context of its implicit task. The other threads in the team, which do not execute the block, wait at an implicit barrier at the end of the single construct unless a nowait clause is specified.

  • The depend clause allows you to provide information on the way a task will access data

    • It is followed by an access mode that can be in, out or inout. Examples:

    • depend(in: x, y, z): the task will read variables x, y and z

    • depend(out: res): the task will write variable res; Any previous value of res will be ignored and overwritten

    • depend(inout: k, buffer[0:n]): the task will both read and write the variables k and buffer; the content of n elements of buffer starting from index 0 will be used in the read-and-write

  • The OpenMP runtime system dynamically decides whether a task is ready for execution or not considering its dependencies (there is no need for further user intervention here).

#include <stdio.h>
int main() {
  int x = 1;
  #pragma omp parallel
  #pragma omp single
  {
    #pragma omp task shared(x) depend(out: x)
    x = 2;
    #pragma omp task shared(x) depend(inout: x)
    printf("x + 1 = %d. ", x+1);
    #pragma omp task shared(x) depend(in: x)
    printf("x + 2 = %d. ", x+2);
  }
  puts("");
  return 0;
}
! gcc -fopenmp ../c_codes/module3-3/task_dep.4inout.c -o task_dep.4inout
! for i in {1..10}; do ./task_dep.4inout; done

In general, creating tasks incurs significant overhead, even when only one thread executes them.

The OpenMP loop scheduler#

  • When we combine the #pragma omp parallel construct (which spawns a team of threads) and the #pragma omp for construct (which divides loop iterations among the spawned threads), as in #pragma omp parallel for, we do both things at once.

    • To this, you can optionally add schedule(static,n), where n is the chunk size into which the iterations are divided before being handed to the threads. (Note: when no schedule clause is given, the choice is implementation-defined; most implementations default to schedule(static) with no chunk size, which gives each thread one large contiguous block of iterations, whereas schedule(static,1) deals iterations out round-robin, one at a time.)

    • schedule(dynamic,n) still splits the iterations into chunks of size n, but distributes them to threads dynamically, in no particular order, as each thread becomes free (see the sketch after this list).

    • Check other options in this resource.
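
A small sketch contrasting the two schedules (hypothetical file; chunk size 4 chosen arbitrarily): with static the mapping of chunks to threads is fixed up front, with dynamic each thread grabs the next chunk as it becomes free.

#include <omp.h>
#include <stdio.h>

int main(void) {
  #pragma omp parallel for schedule(static, 4)   /* fixed round-robin chunks */
  for (int i = 0; i < 32; i++)
    printf("static  i=%2d -> thread %d\n", i, omp_get_thread_num());

  #pragma omp parallel for schedule(dynamic, 4)  /* chunks handed out on demand */
  for (int i = 0; i < 32; i++)
    printf("dynamic i=%2d -> thread %d\n", i, omp_get_thread_num());
  return 0;
}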

To fork/join or to task?#

One of the main issues in High-Performance Computing (HPC) systems is the underutilization of resources. Parallel applications partition and distribute compute and data across processors that work together to solve a given problem. In doing so, processors synchronize and communicate, which may leave some of them idle, waiting for other processors to complete their part. Idle processors mean wasted time and power. Idle time can arise from serial sections of the code, from load imbalance, or from waiting on synchronization.

These issues are common in bulk synchronous parallel applications, especially those that statically assign work to processors.

Many codes rely on bulk synchronous parallelization constructs to distribute and synchronize work across multiple threads in a system. In this model, multiple threads operate in parallel on different parts of a problem, and perform a global synchronization when the parallel work is completed.

Fork-join is a similar model where a single thread, sometimes called a master thread, is the application entry point. This forks into multiple threads that concurrently work on different parts of a problem, and then synchronize to join into a single thread when the work in the parallel section is complete (similar to the worksharing-parallel constructs in OpenMP that distribute the iterations in a loop across multiple threads).

Figure: bulk synchronous parallelization.

Load imbalance appears when different threads receive uneven amounts of work, or perform the work at different speeds, leading to different compute times. The faster threads must then wait for the lagging threads at global synchronizations, sitting idle and wasting resources. In the fork-join model, the serial sections between parallel regions become an increasing bottleneck, since the parallel regions shrink as the number of threads grows.

A task is a piece of compute that operates on a piece of data and that may be executed concurrently with other tasks. This parallel programming abstraction is intuitive and lets you specify data-flow dependencies between tasks instead of costly global synchronizations. This mitigates the idle time caused by load imbalance, since threads pick up new work as they finish, and less time is spent in serial sections because fewer global synchronizations are needed.

Figure: tasking.

For tasking to be efficient, it relies on overdecomposition, i.e., creating more work units than there are processing units. For many numerical algorithms, overdecomposition has a cost. For example, when a 3-D array is processed in tiles of size \(n\) per dimension, a halo/fringe/ghost/overlap region of one cell on each side might need to be computed as part of each tile, leading to time models along the lines of

\[ t_{\textrm{tile}} = t_{\textrm{latency}} + \frac{(n+2)^3}{R} \]

where \(R\) is the processing rate.
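
For instance, with \(n = 32\) the halo adds \((34^3 - 32^3)/32^3 \approx 20\%\) extra work per tile, and the latency term is negligible only when \(R \, t_{\textrm{latency}} \ll (n+2)^3\); smaller tiles give better load balance but a larger fraction of overhead.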

Tip

Recommended reading: Tasking Lives Up to its Promises