29) ISPC, OpenMP target, OpenACC, and all that#

Last time:

  • Parallel reductions with CUDA.jl

  • Different optimization strategies on the GPU

Today:

  1. ISPC

  2. OpenMP target offload
    2.1 Terminology

  3. OpenACC

| Architecture | Directives | SIMD | SPMD |
|---|---|---|---|
| Intel AVX+ (SIMD) | `#pragma omp simd` | intrinsics | ISPC |
| CUDA (SIMT) | `#pragma omp target` | C++ templates and other high-level APIs | CUDA |

1. ISPC: Intel SPMD Program Compiler#

The Intel Implicit SPMD Program Compiler (ISPC) is a compiler for writing single program multiple data (SPMD) programs to run on the CPU and GPU.

The SPMD programming approach is similar to approaches used in computer graphics and general-purpose GPU programming; it is used, for example, for GPU shaders and for CUDA and OpenCL kernels (OpenCL being the standard for cross-platform parallel programming of heterogeneous systems).

  • The main idea behind SPMD is that one writes programs as if they were operating on a single data element (a pixel for a pixel shader, for example), but then the underlying hardware and runtime system executes multiple invocations of the program in parallel with different inputs (the values for different pixels, for example).

  • In summary, we can program SIMT devices (e.g., CUDA) using directives, but we can also program SIMD hardware (e.g., Intel CPUs) using an SPMD programming model (recall the CUDA-like acronym: “single program” rather than “single instruction”).

export void simple_ispc(uniform double vin[], uniform double vout[],
                        uniform int count) {
  foreach (index = 0 ... count) {
    double v = vin[index];
    if (v < 3.)
      v = v * v;
    else
      v = sqrt(v);
    vout[index] = v;
  }
}

This function is callable from native C code. Example:

#include <stdio.h>
#include <math.h>

void simple_ispc(double vin[], double vout[], int count);

void simple_c(double vin[], double vout[], int count) {
  for (int index=0; index<count; index++) {
    double v = vin[index];
    if (v < 3.)
      v = v * v;
    else
      v = sqrt(v);
    vout[index] = v;
  }
}

int main() {
  double vin[16], vout[16];
  for (int i = 0; i < 16; ++i)
    vin[i] = i;

  simple_ispc(vin, vout, 16);

  for (int i = 0; i < 16; ++i)
    printf("%d: simple_ispc(%f) = %f\n", i, vin[i], vout[i]);

  simple_c(vin, vout, 16);

  for (int i = 0; i < 16; ++i)
    printf("%d: simple_c(%f) = %f\n", i, vin[i], vout[i]);
  return 0;
}
! gcc -O3 -march=native -o simple.o -c ../c_codes/module9-1/simple.c && ispc -O3 --target=avx2-i32x8 ../c_codes/module9-1/simple-ispc.ispc -o simple-ispc.o && gcc simple.o simple-ispc.o  -lm -o simple  && ./simple
! objdump -d --prefix-addresses -M intel simple | grep sqrt

(The grep for sqrt shows whether the compiler emitted vectorized square-root instructions such as vsqrtpd, rather than scalar calls into libm.)
  • ISPC is a good option for code with cross-lane dependencies or vector lane divergence (branches that affect some lanes differently than others).

  • Writing such code with intrinsics is laborious, and compilers often do a poor job of inferring good vectorization strategies (despite #pragma omp simd and the like); a small intrinsics sketch appears after the figure below.

  • An example of successful use of ISPC is Intel’s Embree ray tracing engine.

(As with most vendor-reported performance numbers, we can probably take this with a grain of salt. But it indicates that CPUs remain highly competitive for ray tracing.)

(Figure: Intel Embree performance)
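For a taste of why hand-written intrinsics are laborious, here is a minimal sketch of the simple_c loop using AVX2 intrinsics (our illustration, not from the original example; simple_avx2 is a hypothetical name):

#include <immintrin.h>
#include <math.h>

// Hypothetical AVX2 version of simple_c: four double lanes per iteration.
// Both branches are evaluated for every lane and blended with a mask,
// which is essentially what ISPC generates from the scalar-looking source above.
void simple_avx2(const double vin[], double vout[], int count) {
  const __m256d three = _mm256_set1_pd(3.);
  int index = 0;
  for (; index + 4 <= count; index += 4) {
    __m256d v = _mm256_loadu_pd(&vin[index]);
    __m256d lt = _mm256_cmp_pd(v, three, _CMP_LT_OQ);  // lanes where v < 3
    __m256d squared = _mm256_mul_pd(v, v);             // used where v < 3
    __m256d rooted = _mm256_sqrt_pd(v);                // used elsewhere
    _mm256_storeu_pd(&vout[index], _mm256_blendv_pd(rooted, squared, lt));
  }
  for (; index < count; index++) {                     // scalar remainder loop
    double v = vin[index];
    vout[index] = (v < 3.) ? v * v : sqrt(v);
  }
}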

2. OpenMP target offload#

  • CUDA is relatively hard to maintain and logic/tuning is spread out (between the kernel launch and the device code).

  • OpenMP target offload and OpenACC attempt to provide a friendlier story for maintenance and incremental migration of legacy code; a minimal offload sketch follows this list.
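As promised, a minimal offload sketch of the earlier loop (our illustration; simple_omp_target is a hypothetical name, and the compiler must be built with offload support):

#include <math.h>

// Hypothetical OpenMP target version of simple_c. The map clauses move the
// arrays to and from the device; teams / parallel for / simd expose the same
// hierarchy as CUDA blocks / threads / lanes (see the table below).
void simple_omp_target(const double vin[], double vout[], int count) {
  #pragma omp target teams distribute parallel for simd \
          map(to: vin[0:count]) map(from: vout[0:count])
  for (int index = 0; index < count; index++) {
    double v = vin[index];
    vout[index] = (v < 3.) ? v * v : sqrt(v);
  }
}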

Terminology#

| CUDA Concept | CUDA keyword | OpenACC | OpenMP target |
|---|---|---|---|
| Thread block | `blockIdx` | gang | teams |
| Warp | (implicit) | worker | thread |
| Thread | `threadIdx` | vector | simd |
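To make the correspondence concrete, here is an illustrative loop nest annotated both ways (our sketch, not from the original notes):

// Hypothetical example following the table above:
// gang ~ teams (thread blocks), vector ~ simd (lanes).
void scale_rows_acc(int m, int n, double *a, double s) {
  #pragma acc parallel loop gang copy(a[0:m*n])  // rows across gangs
  for (int i = 0; i < m; i++) {
    #pragma acc loop vector                      // columns across vector lanes
    for (int j = 0; j < n; j++)
      a[i*n + j] *= s;
  }
}

void scale_rows_omp(int m, int n, double *a, double s) {
  #pragma omp target teams distribute map(tofrom: a[0:m*n])  // rows across teams
  for (int i = 0; i < m; i++) {
    #pragma omp parallel for simd                // columns across threads/lanes
    for (int j = 0; j < n; j++)
      a[i*n + j] *= s;
  }
}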

3. OpenACC#

Incremental porting with unified memory: OpenACC steps#

(Figure: OpenACC incremental porting steps)
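Since the figure is schematic, the first step looks roughly like this (our sketch; with unified/managed memory the loop directive alone suffices, and explicit data clauses arrive later as tuning):

// Hypothetical first porting step: annotate the hot loop and rely on
// unified (managed) memory for transfers; data regions and gang/vector
// tuning are added incrementally in later steps.
void saxpy(int n, double a, const double *x, double *y) {
  #pragma acc parallel loop
  for (int i = 0; i < n; i++)
    y[i] += a * x[i];
}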

Example#

OpenACC example from a Lattice-Boltzmann miniapp (an excerpt; INDEX, SPACEDIM, and NUMV are defined elsewhere in the app):

void LBM::stream(Real* const __restrict a_f,
                 const Real* const __restrict a_f_post,
                 const int* a_loStr,
                 const int* a_hiStr,
                 const int* a_loAll,
                 const int* a_hiAll,
                 const int a_numPts) const
{

  const int* const __restrict latI = &m_lattice[0][0];
  const int* const __restrict latJ = &m_lattice[1][0];
  const int* const __restrict latK = &m_lattice[2][0];

  const int
    klo = a_loStr[2], khi = a_hiStr[2],
    jlo = a_loStr[1], jhi = a_hiStr[1],
    ilo = a_loStr[0], ihi = a_hiStr[0];

#pragma acc parallel loop independent collapse(3) \
        copyin(a_loAll[0:SPACEDIM], a_hiAll[0:SPACEDIM], a_f_post[0:a_numPts*m_numVels]) \
        copyout(a_f[0:a_numPts*m_numVels]) vector_length(256)
  for (int k = klo; k <= khi; ++k) {
    for (int j = jlo; j <= jhi; ++j) {
      for (int i = ilo; i <= ihi; ++i) {
#pragma acc loop seq independent
        for (int m = 0; m < NUMV; ++m) {
          const long int offset = m * a_numPts;
          const long int index0 = INDEX(i          ,           j,           k, a_loAll, a_hiAll);
          const long int index2 = INDEX(i - latI[m], j - latJ[m], k - latK[m], a_loAll, a_hiAll);
          a_f[index0 + offset]    = a_f_post[index2 + offset];  // new f comes from upwind
        }
      }
    }
  }
}
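A plausible build line for a kernel like this (our sketch; nvc++ ships with the NVIDIA HPC SDK, -gpu=managed selects unified memory, -Minfo=accel reports the generated parallelism, and the file name is hypothetical):

! nvc++ -acc -gpu=managed -Minfo=accel -c lbm.cpp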

Resources#