29) ISPC, OpenMP target, OpenACC, and all that#

Last time:

  • Parallel reductions with CUDA.jl

  • Different optimization strategies on the GPU

Today:

  1. ISPC

  2. OpenMP target offload
    2.1 Terminology

  3. OpenACC

| Architecture | Directives | SIMD | SPMD |
|---|---|---|---|
| Intel AVX+ (SIMD) | `#pragma omp simd` | intrinsics | ISPC |
| CUDA (SIMT) | `#pragma omp target` | C++ templates and other high-level APIs | CUDA |

1. ISPC: Intel SPMD Program Compiler#

The Intel Implicit SPMD Program Compiler (ISPC) is a compiler for writing single program multiple data (SPMD) programs to run on the CPU and GPU.

The SPMD programming approach is similar to approaches used in computer graphics and general-purpose GPU programming; it is used, for example, for GPU shaders and for CUDA and OpenCL kernels (OpenCL being the standard for cross-platform parallel programming of heterogeneous systems).

  • The main idea behind SPMD is that one writes programs as if they were operating on a single data element (a pixel for a pixel shader, for example), but then the underlying hardware and runtime system executes multiple invocations of the program in parallel with different inputs (the values for different pixels, for example).

  • In summary, we can program SIMT devices (e.g., CUDA) using directives, but we can also program SIMD hardware (e.g., Intel CPUs) using an SPMD programming model (recall the CUDA-like acronym: “single program” rather than “single instruction”).

export void simple_ispc(uniform double vin[], uniform double vout[],
                        uniform int count) {
  foreach (index = 0 ... count) {
    double v = vin[index];
    if (v < 3.)
      v = v * v;
    else
      v = sqrt(v);
    vout[index] = v;
  }
}

This function is callable from native C code. Example:

#include <stdio.h>
#include <math.h>

void simple_ispc(double vin[], double vout[], int count);

void simple_c(double vin[], double vout[], int count) {
  for (int index=0; index<count; index++) {
    double v = vin[index];
    if (v < 3.)
      v = v * v;
    else
      v = sqrt(v);
    vout[index] = v;
  }
}

int main() {
  double vin[16], vout[16];
  for (int i = 0; i < 16; ++i)
    vin[i] = i;

  simple_ispc(vin, vout, 16);

  for (int i = 0; i < 16; ++i)
    printf("%d: simple_ispc(%f) = %f\n", i, vin[i], vout[i]);

  simple_c(vin, vout, 16);

  for (int i = 0; i < 16; ++i)
    printf("%d: simple_c(%f) = %f\n", i, vin[i], vout[i]);
  return 0;
}
! gcc -O3 -march=native -o simple.o -c ../c_codes/module9-1/simple.c && ispc -O3 --target=avx2-i32x8 ../c_codes/module9-1/simple-ispc.ispc -o simple-ispc.o && gcc simple.o simple-ispc.o  -lm -o simple  && ./simple
! objdump -d --prefix-addresses -M intel simple | grep sqrt

(The grep for sqrt shows whether the compiler emitted vectorized square-root instructions such as vsqrtpd, rather than scalar calls into libm.)
  • ISPC is a good option for code with cross-lane dependencies or vector lane divergence (branches that affect some lanes differently than others).

  • Writing such code with intrinsics is laborious, and compilers often do a poor job of inferring good vectorization strategies (despite #pragma omp simd and the like); a small intrinsics sketch appears after the figure below.

  • An example of successful use of ISPC is Intel’s Embree ray tracing engine.

(As with most vendor-reported performance numbers, we can probably take this with a grain of salt. But it indicates that CPUs remain highly competitive for ray tracing.)

(Figure: Intel Embree performance)
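For a taste of why hand-written intrinsics are laborious, here is a minimal sketch of the simple_c loop using AVX2 intrinsics (our illustration, not from the original example; simple_avx2 is a hypothetical name):

#include <immintrin.h>
#include <math.h>

// Hypothetical AVX2 version of simple_c: four double lanes per iteration.
// Both branches are evaluated for every lane and blended with a mask,
// which is essentially what ISPC generates from the scalar-looking source above.
void simple_avx2(const double vin[], double vout[], int count) {
  const __m256d three = _mm256_set1_pd(3.);
  int index = 0;
  for (; index + 4 <= count; index += 4) {
    __m256d v = _mm256_loadu_pd(&vin[index]);
    __m256d lt = _mm256_cmp_pd(v, three, _CMP_LT_OQ);  // lanes where v < 3
    __m256d squared = _mm256_mul_pd(v, v);             // used where v < 3
    __m256d rooted = _mm256_sqrt_pd(v);                // used elsewhere
    _mm256_storeu_pd(&vout[index], _mm256_blendv_pd(rooted, squared, lt));
  }
  for (; index < count; index++) {                     // scalar remainder loop
    double v = vin[index];
    vout[index] = (v < 3.) ? v * v : sqrt(v);
  }
}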

2. OpenMP target offload#

  • CUDA is relatively hard to maintain and logic/tuning is spread out (between the kernel launch and the device code).

  • OpenMP target offload and OpenACC attempt to provide a friendlier story for maintenance and incremental migration of legacy code; a minimal offload sketch follows this list.
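As promised, a minimal offload sketch of the earlier loop (our illustration; simple_omp_target is a hypothetical name, and the compiler must be built with offload support):

#include <math.h>

// Hypothetical OpenMP target version of simple_c. The map clauses move the
// arrays to and from the device; teams / parallel for / simd expose the same
// hierarchy as CUDA blocks / threads / lanes (see the table below).
void simple_omp_target(const double vin[], double vout[], int count) {
  #pragma omp target teams distribute parallel for simd \
          map(to: vin[0:count]) map(from: vout[0:count])
  for (int index = 0; index < count; index++) {
    double v = vin[index];
    vout[index] = (v < 3.) ? v * v : sqrt(v);
  }
}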

Terminology#

| CUDA Concept | CUDA keyword | OpenACC | OpenMP target |
|---|---|---|---|
| Thread block | `blockIdx` | gang | teams |
| Warp | (implicit) | worker | thread |
| Thread | `threadIdx` | vector | simd |
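To make the correspondence concrete, here is an illustrative loop nest annotated both ways (our sketch, not from the original notes):

// Hypothetical example following the table above:
// gang ~ teams (thread blocks), vector ~ simd (lanes).
void scale_rows_acc(int m, int n, double *a, double s) {
  #pragma acc parallel loop gang copy(a[0:m*n])  // rows across gangs
  for (int i = 0; i < m; i++) {
    #pragma acc loop vector                      // columns across vector lanes
    for (int j = 0; j < n; j++)
      a[i*n + j] *= s;
  }
}

void scale_rows_omp(int m, int n, double *a, double s) {
  #pragma omp target teams distribute map(tofrom: a[0:m*n])  // rows across teams
  for (int i = 0; i < m; i++) {
    #pragma omp parallel for simd                // columns across threads/lanes
    for (int j = 0; j < n; j++)
      a[i*n + j] *= s;
  }
}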

3. OpenACC#

Incremental porting with unified memory: OpenACC steps#

(Figure: OpenACC incremental porting steps)
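Since the figure is schematic, the first step looks roughly like this (our sketch; with unified/managed memory the loop directive alone suffices, and explicit data clauses arrive later as tuning):

// Hypothetical first porting step: annotate the hot loop and rely on
// unified (managed) memory for transfers; data regions and gang/vector
// tuning are added incrementally in later steps.
void saxpy(int n, double a, const double *x, double *y) {
  #pragma acc parallel loop
  for (int i = 0; i < n; i++)
    y[i] += a * x[i];
}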

Example#

OpenACC example from a Lattice-Boltzmann miniapp (an excerpt; INDEX, SPACEDIM, and NUMV are defined elsewhere in the app):

void LBM::stream(Real* const __restrict a_f,
                 const Real* const __restrict a_f_post,
                 const int* a_loStr,
                 const int* a_hiStr,
                 const int* a_loAll,
                 const int* a_hiAll,
                 const int a_numPts) const
{

  const int* const __restrict latI = &m_lattice[0][0];
  const int* const __restrict latJ = &m_lattice[1][0];
  const int* const __restrict latK = &m_lattice[2][0];

  const int
    klo = a_loStr[2], khi = a_hiStr[2],
    jlo = a_loStr[1], jhi = a_hiStr[1],
    ilo = a_loStr[0], ihi = a_hiStr[0];

#pragma acc parallel loop independent collapse(3) \
        copyin(a_loAll[0:SPACEDIM], a_hiAll[0:SPACEDIM], a_f_post[0:a_numPts*m_numVels]) \
        copyout(a_f[0:a_numPts*m_numVels]) vector_length(256)
  for (int k = klo; k <= khi; ++k) {
    for (int j = jlo; j <= jhi; ++j) {
      for (int i = ilo; i <= ihi; ++i) {
#pragma acc loop seq independent
        for (int m = 0; m < NUMV; ++m) {
          const long int offset = m * a_numPts;
          const long int index0 = INDEX(i          ,           j,           k, a_loAll, a_hiAll);
          const long int index2 = INDEX(i - latI[m], j - latJ[m], k - latK[m], a_loAll, a_hiAll);
          a_f[index0 + offset]    = a_f_post[index2 + offset];  // new f comes from upwind
        }
      }
    }
  }
}
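A plausible build line for a kernel like this (our sketch; nvc++ ships with the NVIDIA HPC SDK, -gpu=managed selects unified memory, -Minfo=accel reports the generated parallelism, and the file name is hypothetical):

! nvc++ -acc -gpu=managed -Minfo=accel -c lbm.cpp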

Resources#