# 29) ISPC, OpenMP target, OpenACC, and all that

Last time:

- Parallel reductions with CUDA.jl
- Different optimization strategies on the GPU

Today: 

1. ISPC  
2. OpenMP target offload  
  2.1 Terminology
3. OpenACC  

| Architecture | Directives | SIMD | SPMD |
|---------|-----------|------|-----|
| Intel AVX+ (SIMD) | `#pragma omp simd` | [intrinsics](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#) | [ISPC](https://ispc.github.io/ispc.html) |
| CUDA (SIMT) | `#pragma omp target` | C++ templates and other high-level APIs | CUDA |

## 1. [ISPC: Intel Implicit SPMD Program Compiler](https://ispc.github.io/ispc.html)

The Intel **Implicit SPMD Program Compiler (ISPC)** is a compiler for **single program, multiple data (SPMD)** programs that run on CPUs and GPUs.

The SPMD approach is familiar from computer graphics and general-purpose GPU programming: it is used for GPU shaders and for CUDA and OpenCL kernels, for example. (OpenCL is a standard for cross-platform, parallel programming of heterogeneous systems.)

- The main idea behind SPMD is that one writes programs as if they were operating on a single data element (a pixel for a pixel shader, for example), but then the underlying hardware and runtime system executes multiple invocations of the program in parallel with different inputs (the values for different pixels, for example).

- In summary, we can program **SIMT** devices (e.g., CUDA GPUs) using directives, but we can also program **SIMD** hardware (e.g., Intel CPUs) using an **SPMD** programming model. Recall that SPMD is the CUDA-like model: the acronym says "single program" rather than "single instruction".
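As a rough illustration of the SPMD idea in plain C (this is not ISPC code), the sketch below writes the "program" for a single element and then asks the compiler to run it across SIMD lanes with `#pragma omp simd`. The names `shade_one` and `shade_all` are hypothetical, and the pragma is only a hint: it is ignored unless OpenMP SIMD support is enabled at compile time.

```c
#include <assert.h>
#include <stddef.h>

/* The "single program": written as if it handled one data element
   (think of one pixel in a pixel shader). Hypothetical example kernel. */
static inline float shade_one(float x)
{
    return x * x + 1.0f;
}

/* The "multiple data" part: the same program is applied to every element.
   With OpenMP SIMD enabled, the compiler may map iterations to vector lanes;
   without it, this is an ordinary scalar loop with the same result. */
void shade_all(const float *in, float *out, size_t n)
{
    assert(in != NULL && out != NULL);
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        out[i] = shade_one(in[i]);
}
```

Calling `shade_all(in, out, n)` fills `out[i]` with `in[i]*in[i] + 1` for each element. In ISPC the outer loop would be implicit in the language rather than spelled out by hand.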


```{literalinclude} ../c_codes/module9-1/simple-ispc.ispc
:language: c
:linenos: true
```

This function is callable from native C code. Example:

```{literalinclude} ../c_codes/module9-1/simple.c
:language: c
:linenos: true
```


! gcc -O3 -march=native -o simple.o -c ../c_codes/module9-1/simple.c && ispc -O3 --target=avx2-i32x8 ../c_codes/module9-1/simple-ispc.ispc -o simple-ispc.o && gcc simple.o simple-ispc.o  -lm -o simple  && ./simple

! objdump -d --prefix-addresses -M intel simple | grep sqrt

- ISPC is a good option for code with cross-lane dependencies or vector lane divergence (branches that affect some lanes differently than others). 

- Writing such code with intrinsics is laborious and compilers often do a poor job of inferring good vectorization strategies (despite `#pragma omp simd` and the like). 

- An example of successful use of ISPC is Intel's [Embree](https://www.embree.org/) ray tracing engine.

(As with most vendor-reported performance numbers, the Embree results below deserve a grain of salt, but they indicate that CPUs remain highly competitive for ray tracing.)

![Intel Embree performance](../img/embree-performance.png "Intel Embree perf")
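To make the lane-divergence point above concrete, here is a minimal sketch in plain C (not ISPC) of a loop whose iterations do different amounts of work: a Mandelbrot-style escape-time count. The function names are hypothetical. Elements that escape early finish after a few iterations while their neighbors keep going, which is exactly the pattern that is awkward to express with intrinsics and that compilers often struggle to vectorize from a `#pragma omp simd` hint alone. ISPC handles such cases with per-lane execution masks.

```c
#include <assert.h>
#include <stddef.h>

/* Per-element "program": the trip count of the while loop depends on the
   input, so neighboring elements (lanes) diverge. */
static inline int mandel_iters(float cr, float ci, int max_iters)
{
    float zr = 0.0f, zi = 0.0f;
    int k = 0;
    while (k < max_iters && zr * zr + zi * zi <= 4.0f) {
        float t = zr * zr - zi * zi + cr;
        zi = 2.0f * zr * zi + ci;
        zr = t;
        ++k;
    }
    return k;
}

/* Apply the program to a row of points sharing the same imaginary part.
   The pragma is a hint only; correctness does not depend on it. */
void mandel_row(const float *cr, float ci, int *out, size_t n, int max_iters)
{
    assert(cr != NULL && out != NULL);
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        out[i] = mandel_iters(cr[i], ci, max_iters);
}
```

A point inside the set (such as `cr = 0`) runs for all `max_iters` iterations, while a point far outside (such as `cr = 3`) exits after one, so a vectorized version has to keep some lanes masked off while others continue.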

## 2. OpenMP target offload

- CUDA is relatively hard to maintain, and its logic and tuning are spread out between the kernel launch and the device code.
- OpenMP target offload and OpenACC attempt to provide a more friendly story for maintenance and incremental migration of legacy code.
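As a minimal sketch of what this incremental style can look like, the hypothetical `saxpy` routine below is an ordinary C loop with a single OpenMP `target` directive added. The `map` clauses describe the data movement (copy `x` to the device, copy `y` both ways). When compiled without OpenMP, or on a system with no device available, the loop still runs on the host with the same result.

```c
#include <assert.h>
#include <stddef.h>

/* y <- a*x + y. The directive is the only change from the serial code. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    assert(x != NULL && y != NULL);
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Compared with CUDA, the launch configuration and the kernel body live in one place, which is the maintenance argument made above. The cost is less direct control over how the loop is mapped onto the hardware.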

### 2.1 Terminology

| CUDA Concept | CUDA keyword | OpenACC | OpenMP `target` |
|----|------|---------|--------|
| Thread block | `blockIdx` | `gang` | `teams` |
| Warp | (implicit) | `worker` | thread |
| Thread | `threadIdx` | `vector` | `simd` |

## 3. OpenACC 

### Incremental porting with unified memory: OpenACC steps

![OpenACC steps](../img/openacc-steps.png "OpenACC steps")


### Example

OpenACC example from a [Lattice-Boltzmann](https://en.wikipedia.org/wiki/Lattice_Boltzmann_methods) miniapp

```cpp
void LBM::stream(Real* const __restrict a_f,
                 const Real* const __restrict a_f_post,
                 const int* a_loStr,
                 const int* a_hiStr,
                 const int* a_loAll,
                 const int* a_hiAll,
                 const int a_numPts) const
{

  const int* const __restrict latI = &m_lattice[0][0];
  const int* const __restrict latJ = &m_lattice[1][0];
  const int* const __restrict latK = &m_lattice[2][0];

  const int
    klo = a_loStr[2], khi = a_hiStr[2],
    jlo = a_loStr[1], jhi = a_hiStr[1],
    ilo = a_loStr[0], ihi = a_hiStr[0];

#pragma acc parallel loop independent collapse(3) \
        copyin(a_loAll[SPACEDIM],a_hiAll[SPACEDIM],a_f_post[a_numPts*m_numVels]) \
        copyout(a_f[a_numPts*m_numVels]) vector_length(256)
  for (int k = klo; k <= khi; ++k) {
    for (int j = jlo; j <= jhi; ++j) {
      for (int i = ilo; i <= ihi; ++i) {
#pragma acc loop seq
        for (int m = 0; m < NUMV; ++m) {
          const long int offset = m * a_numPts;
          const long int index0 = INDEX(i          ,           j,           k, a_loAll, a_hiAll);
          const long int index2 = INDEX(i - latI[m], j - latJ[m], k - latK[m], a_loAll, a_hiAll);
          a_f[index0 + offset]    = a_f_post[index2 + offset];  // new f comes from upwind
        }
      }
    }
  }
}
```


### Resources
* [Getting started with OpenACC](https://devblogs.nvidia.com/getting-started-openacc/)
* [Advanced OpenACC from a UTK guest lecture](https://icl.utk.edu/~bosilca/classes/cosc462/2016/pdf/OpenACC_Fundamentals.pdf)
* [SC18 OpenMP Presentations (with videos)](https://www.openmp.org/resources/openmp-presentations/resources-openmp-presentations-sc18-booth-talks/)
* [OpenMP 6.0 Progress and Directions](https://www.openmp.org/wp-content/uploads/OpenMP-api-status-2022.pdf)
* [**OpenACC Hackathon series**](https://www.openacc.org/hackathons)
* [**Bootcamps**](https://www.openhackathons.org/s/upcoming-events?eventType=bootcamps)