6) Measuring Performance#
Last time:

- Intro to Vectorization

Today:
1. Measuring Performance#
Two of the standard approaches to assessing performance:
Floating Point Operations per Second (FLOP/S)
FLOPs = Floating Point Operations (careful with the small “s” vs. the big “S”) (see video: 1.1.1 GFLOPS).
How many floating point operations (add, subtract, multiply, and maybe divide) a code does per second.
How to estimate: for matrix-matrix multiply we have 2 FLOPs per inner-loop iteration and a triply nested loop, thus total \(\text{FLOPs} = 2nmk\). The FLOPS are then \(\text{FLOPS} = 2nmk / \text{time}\). Note that usually we use GigaFLOPS, or GFLOPS: \(\text{GFLOPS} = 2nmk / \text{time} / 10^9\).
Calculate for one core of your machine:
\[ \text{FLOP/S} = \frac{\text{cycles}}{\text{second}} \cdot \frac{\text{FLOPs}}{\text{cycle}} \]

Cycles-per-second can be looked up for the CPU in your machine.
FLOPs-per-cycle can be tricky to find; it depends on your processor's microarchitecture and is most easily found on the Wikipedia FLOPS page. For instance, consider a laptop with an Intel Core i5-8210Y CPU at 1.6 GHz. This is an Amber Lake processor with 16 FLOPs-per-cycle, and thus has a calculated FLOP rate of
\[ \text{FLOPS} = 1.6\ \frac{\text{Gigacycles}}{\text{second}} \cdot 16\ \frac{\text{FLOPs}}{\text{cycle}} = 25.6\ \text{GFLOPS} \]

(Note that some processors support turbo boost, so you may see even higher performance unless turbo boost is disabled.)
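Putting these together, here is a minimal Julia sketch that measures an achieved GFLOPS rate for matrix-matrix multiply using the \(2nmk\) operation count from above (the sizes `n = m = k = 1000` are arbitrary assumptions for illustration):

```julia
using LinearAlgebra

n = m = k = 1000                  # assumed sizes for illustration
A = rand(n, k); B = rand(k, m); C = zeros(n, m)
mul!(C, A, B)                     # warm-up run (triggers JIT compilation)
t = @elapsed mul!(C, A, B)        # wall-clock seconds for one multiply
gflops = 2 * n * m * k / t / 1e9  # GFLOPS = 2nmk / time / 1e9
```

Comparing this measured rate against the calculated peak above tells you roughly what fraction of peak the BLAS implementation reaches on your machine.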
Memory Bandwidth:
Rate at which data can be moved to the processor (typically from main memory).
Can be looked up on the manufacturer's page.
Measured in gigabytes-per-second (GB/s) or gibibytes-per-second (GiB/s).
Gigabyte (GB): \(1 \text{ GB} = 10^9 \text{ bytes} = 8 \times 10^9 \text{ bits}\)
Gibibyte (GiB): \(1 \text{ GiB} = 2^{30} \text{ bytes} \approx 1.074 \times 10^9 \text{ bytes}\)
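A quick unit check in Julia (the variable names are just for illustration):

```julia
GB  = 10^9   # gigabyte: decimal unit, 10^9 bytes
GiB = 2^30   # gibibyte: binary unit, 1_073_741_824 bytes
GiB / GB     # ≈ 1.074, so a bandwidth quoted in GiB/s uses a slightly larger unit than GB/s
```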
2. Introduction to Performance Modeling#
Models give us a conceptual and roughly quantitative framework with which to answer the following types of questions.
Why is an implementation exhibiting its observed performance?
How will performance change if we:
optimize this component?
buy new hardware? (Which new hardware?)
run a different configuration?
While conceptualizing a new algorithm, what performance can we expect and what will be bottlenecks?
Models are a guide for performance, but not an absolute.
Terms#
| Symbol | Meaning |
|---|---|
| \(n\) | Input parameter related to problem size |
| \(W\) | Amount of work to solve problem \(n\) |
| \(T\) | Execution time |
| \(R\) | Rate at which work is done |
2.1 The vector triad benchmark#
Fathoming the chief performance characteristics of a processor or system is one of the purposes of low-level benchmarking.
A low-level benchmark is a program that tries to test some specific feature of the architecture, e.g., peak performance or memory bandwidth.
One of the prominent examples is the vector triad, introduced by Schönauer (self-edition, 2000). It comprises a nested loop, the inner level executing a multiply-add operation on the elements of three vectors and storing the result in a fourth.
Example in Fortran:#
See the following implementation as a Fortran code snippet:
```fortran
double precision, dimension(N) :: A,B,C,D
double precision :: S,E,MFLOPS

do i=1,N                        ! initialize arrays
   A(i) = 0.d0; B(i) = 1.d0
   C(i) = 2.d0; D(i) = 3.d0
enddo

call get_walltime(S)            ! get time stamp
do j=1,R
   do i=1,N
      A(i) = B(i) + C(i) * D(i) ! 3 loads, 1 store
   enddo
   if(A(2).lt.0) call dummy(A,B,C,D)  ! prevent loop interchange
enddo
call get_walltime(E)            ! get time stamp again

MFLOPS = R*N*2.d0/((E-S)*1.d6)  ! compute MFlop/sec rate
```
The purpose of this benchmark is to measure the performance of data transfers between memory and arithmetic units of a processor.
On the inner level, three load streams for arrays `B`, `C`, and `D` and one store stream for `A` are active. Depending on `N`, this loop might execute in a very short time, which would be hard to measure. The outer loop thus repeats the triad `R` times so that execution time becomes large enough to be accurately measurable. In practice one would choose `R` according to `N` so that the overall execution time stays roughly constant for different `N`.
Observations:
- The aim of the masked-out call to the `dummy()` subroutine is to prevent the compiler from doing an obvious optimization: without the call, the compiler might discover that the inner loop does not depend at all on the outer loop index `j` and drop the outer loop right away.
- The possible call to `dummy()` fools the compiler into believing that the arrays may change between outer loop iterations. This effectively prevents the optimization described, and the additional cost is negligible because the condition is always false (which the compiler does not know in advance).
- Please note that the most sensible time measure in benchmarking is wallclock time, also called elapsed time. Any other “time” that the system may provide, first and foremost the much-stressed CPU time, is prone to misinterpretation because there might be contributions from I/O, context switches, other processes, etc., which CPU time cannot encompass (this is especially true for parallel programs).
In the above graph, we see the serial vector triad performance versus loop length for several generations of Intel processor architectures (clock speed and year of introduction are indicated in the legend), and the NEC SX-8 vector processor. Note the entirely different performance characteristics of the latter.
Tip
If you are not familiar, check these references on How to Compile a C program and How to Compile a Fortran program.
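For readers following along in Julia rather than Fortran, here is a rough sketch of the same triad. `BenchmarkTools` (my choice here, not part of the original Fortran setup) plays the role of the outer repetition loop `R`, and `N` is an assumed loop length:

```julia
using BenchmarkTools

# 2 flops per element: one add and one multiply
function triad!(A, B, C, D)
    @inbounds @simd for i in eachindex(A, B, C, D)
        A[i] = B[i] + C[i] * D[i]
    end
end

N = 10^6                              # assumed loop length
A = zeros(N); B = ones(N); C = fill(2.0, N); D = fill(3.0, N)
t = @belapsed triad!($A, $B, $C, $D)  # repeated, minimum-time measurement
mflops = 2N / t / 1e6                 # same 2*N/time formula as the Fortran code
```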
2.2 The STREAM benchmark#
A C snippet:
```c
for (i=0; i<n; i++)
    a[i] = b[i] + scalar*c[i];
```
Following the notation above, \(n\) is the array size and

\[ W = 3 \times 8n = 24n \ \text{bytes} \]

is the number of bytes transferred: two loads (`b[i]` and `c[i]`) and one store (`a[i]`) per iteration, at 8 bytes per double-precision value.
The rate \(R = W/T\) is measured in bytes per second (or MB/s, etc.).
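As a rough sketch of how one might measure this rate in Julia (the array size `n` is an assumption; it should be large enough that the arrays exceed cache):

```julia
function stream_triad!(a, b, c, scalar)
    @inbounds @simd for i in eachindex(a, b, c)
        a[i] = b[i] + scalar * c[i]
    end
end

n = 10^7                        # assumed size: ~240 MB across three arrays
a = zeros(n); b = rand(n); c = rand(n); scalar = 3.0
stream_triad!(a, b, c, scalar)  # warm up
t = @elapsed stream_triad!(a, b, c, scalar)
W = 24n                         # 3 arrays × 8 bytes per element
R = W / t / 1e9                 # estimated bandwidth in GB/s
```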
Dense matrix multiplication#
Consider the operation \(C \gets C + A B\), where \(A,B,C\) are \(n\times n\) matrices. Again, a C snippet would look like:
```c
for (i=0; i<n; i++)
    for (j=0; j<n; j++)
        for (k=0; k<n; k++)
            c[i*n+j] += a[i*n+k] * b[k*n+j];
```
Can you identify two expressions for the total amount of work \(W(n)\) and the associated units?
Can you think of a context in which one is better than the other and vice-versa?
Note
C is said to follow the row-major order method.
Fortran, MATLAB, and Julia follow the column-major order method.
If you are not familiar with row/column-major order methods, please see this Wiki page.
Support for multi-dimensional arrays may also be provided by external libraries, which may even support arbitrary orderings, where each dimension has a stride value, and row-major or column-major are just two possible resulting interpretations. For example, row-major order is the default in NumPy (for Python) even though Python itself is neither row- nor column-major (it uses lists of lists); similarly, column-major order is the default in Eigen and Armadillo (libraries for C++).
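A small Julia illustration of why storage order matters (timings will vary by machine): because Julia is column-major, the first index should vary fastest in the inner loop.

```julia
function sum_colmajor(A)   # inner index i runs down a column: stride-1 access
    s = 0.0
    @inbounds for j in axes(A, 2), i in axes(A, 1)
        s += A[i, j]
    end
    s
end

function sum_rowmajor(A)   # inner index j jumps between columns: strided access
    s = 0.0
    @inbounds for i in axes(A, 1), j in axes(A, 2)
        s += A[i, j]
    end
    s
end

A = rand(2000, 2000)
sum_colmajor(A); sum_rowmajor(A)                       # warm up
@elapsed(sum_colmajor(A)), @elapsed(sum_rowmajor(A))   # the first is typically faster
```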
Estimating time#
To estimate time, we need to know how fast hardware executes flops and moves bytes.
```julia
using CSV
using DataFrames
using Plots
default(linewidth=4, legendfontsize=12)

hardware = CSV.read("../assets/data/data-intel.csv", DataFrame)
```
| Row | Year | GFLOPs-SP | GFLOPs-DP | Cores | Mem-GBps | TDP | Freq(MHz) | Name |
|---|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | String31 |
| 1 | 2007 | 102 | 51 | 4 | 26 | 150 | 3200 | Xeon X5482 |
| 2 | 2008 | 108 | 54 | 4 | 26 | 150 | 3400 | Xeon X5492 |
| 3 | 2009 | 106 | 53 | 4 | 32 | 130 | 3300 | Xeon W5590 |
| 4 | 2010 | 160 | 80 | 6 | 32 | 130 | 3300 | Xeon X5680 |
| 5 | 2011 | 166 | 83 | 6 | 32 | 130 | 3470 | Xeon X5690 |
| 6 | 2012 | 372 | 186 | 8 | 51 | 135 | 2900 | Xeon E5-2690 |
| 7 | 2013 | 518 | 259 | 12 | 60 | 130 | 2700 | Xeon E5-2697 v2 |
| 8 | 2014 | 1324 | 662 | 18 | 68 | 145 | 2300 | Xeon E5-2699 v3 |
| 9 | 2015 | 1324 | 662 | 18 | 68 | 145 | 2300 | Xeon E5-2699 v3 |
| 10 | 2016 | 1548 | 774 | 22 | 77 | 145 | 2200 | Xeon E5-2699 v4 |
| 11 | 2017 | 4480 | 2240 | 28 | 120 | 205 | 2500 | Xeon Platinum 8180 |
| 12 | 2018 | 9320 | 4660 | 56 | 175 | 400 | 2600 | Xeon Platinum 9282 |
```julia
plot(hardware[:,3], hardware[:,5], xlabel="GFLOPs Double Precision", ylabel="Memory GBps", primary=false)
scatter!(hardware[:,3], hardware[:,5], label="Mem-GBps")
```
Here we have rates \(R_f = 4660 \cdot 10^9\) flops/second and \(R_m = 175 \cdot 10^9\) bytes/second. Now we need to characterize some algorithms.
```julia
algs = CSV.read("../assets/data/algs.csv", DataFrame)
```
| Row | Name | bytes | flops |
|---|---|---|---|
| | String15 | Int64 | Int64 |
| 1 | Triad | 24 | 2 |
| 2 | SpMV | 12 | 2 |
| 3 | Stencil27-ideal | 8 | 54 |
| 4 | Stencil27-cache | 24 | 54 |
| 5 | MatFree-FEM | 2376 | 15228 |
Let’s check the operations per byte, or operational intensity:
```julia
algs[!, :intensity] .= algs[:,3] ./ algs[:,2]
algs
```
| Row | Name | bytes | flops | intensity |
|---|---|---|---|---|
| | String15 | Int64 | Int64 | Float64 |
| 1 | Triad | 24 | 2 | 0.0833333 |
| 2 | SpMV | 12 | 2 | 0.166667 |
| 3 | Stencil27-ideal | 8 | 54 | 6.75 |
| 4 | Stencil27-cache | 24 | 54 | 2.25 |
| 5 | MatFree-FEM | 2376 | 15228 | 6.40909 |
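For example, reading off the Triad row above:

\[ I_{\text{Triad}} = \frac{2\ \text{flops}}{24\ \text{bytes}} \approx 0.083\ \text{flops/byte}, \]

the lowest operational intensity in the table.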
```julia
sort!(algs, :intensity)
```
| Row | Name | bytes | flops | intensity |
|---|---|---|---|---|
| | String15 | Int64 | Int64 | Float64 |
| 1 | Triad | 24 | 2 | 0.0833333 |
| 2 | SpMV | 12 | 2 | 0.166667 |
| 3 | Stencil27-cache | 24 | 54 | 2.25 |
| 4 | MatFree-FEM | 2376 | 15228 | 6.40909 |
| 5 | Stencil27-ideal | 8 | 54 | 6.75 |
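The `exec_time` function below implements about the simplest machine model possible: memory traffic and flops overlap perfectly, so execution time is set by whichever resource is the bottleneck,

\[ T(n) = \max\left( \frac{\text{bytes}(n)}{R_m},\ \frac{\text{flops}(n)}{R_f} \right). \]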
```julia
function exec_time(machine, alg, n)
    bytes = n * alg.bytes
    flops = n * alg.flops
    machine = DataFrame(machine)
    T_mem = bytes ./ (machine[:, "Mem-GBps"] * 1e9)     # time if memory-bound
    T_flops = flops ./ (machine[:, "GFLOPs-DP"] * 1e9)  # time if compute-bound
    return maximum([T_mem[1], T_flops[1]])              # the slower resource limits
end

Xeon_Platinum_9282 = filter(:Name => ==("Xeon Platinum 9282"), hardware)
SpMV = filter(:Name => ==("SpMV"), algs)
exec_time(Xeon_Platinum_9282, SpMV, 1e8)
```
0.006857142857142857
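As a sanity check against the table values: at \(n = 10^8\), SpMV moves \(12 \times 10^8\) bytes and performs \(2 \times 10^8\) flops, so

\[ T_{\text{mem}} = \frac{12 \times 10^8}{175 \times 10^9} \approx 6.9\ \text{ms}, \qquad T_{\text{flops}} = \frac{2 \times 10^8}{4660 \times 10^9} \approx 0.043\ \text{ms}, \]

and the memory term dominates, matching the value printed above.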
```julia
pl = plot()
for machine in eachrow(hardware)
    for alg in eachrow(algs)
        ns = exp10.(range(4, 9, length=6))
        times = [exec_time(machine, alg, n) for n in ns]
        flops = [alg.flops .* n for n in ns]
        rates = flops ./ times
        plot!(ns, rates, linestyle=:solid, marker=:circle, label="", xscale=:log10, yscale=:log10)
    end
end
xlabel!("n")
ylabel!("rate")
display(pl)
```
It looks like performance does not depend on problem size.
Well, yeah: we chose a model in which flops and bytes are both proportional to \(n\), and our machine model has no sense of cache hierarchy or latency, so time is also proportional to \(n\). We can divide through by \(n\) to yield a more illuminating plot.
```julia
pl = plot()   # start a fresh figure (assumed: this plot uses new axes, rate vs. intensity)
for machine in eachrow(hardware)
    times = [exec_time(machine, alg, 1) for alg in eachrow(algs)]
    rates = algs.flops ./ times
    intensities = algs.intensity
    plot!(intensities, rates, xscale=:log10, yscale=:log10, marker=:circle, label=machine.Name, legend=:outertopright)
end
xlabel!("intensity")
ylabel!("rate")
ylims!(1e9, 5e12)
xlims!(5e-2, 1e1)
```
We’re seeing the roofline for the older processors, while the newer models are memory-bandwidth limited for all of these algorithms.
The flat part of the roof means that performance is compute-bound; the slanted part means performance is memory-bound.
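In symbols, this roofline model says the attainable rate for a kernel of operational intensity \(I\) (flops per byte) is

\[ R(I) = \min\left( R_f,\ I \cdot R_m \right), \]

where \(R_f\) is the peak flop rate and \(R_m\) the memory bandwidth.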
Note that the ridge point (where the diagonal and horizontal roofs meet) offers insight into the computer’s overall performance.
The x-coordinate of the ridge point is the minimum operational intensity required to achieve maximum performance.
If the ridge point is far to the right, then only kernels with very high operational intensity can achieve the maximum performance of that computer.
If it is far to the left, then almost any kernel can potentially hit maximum performance.
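For example, with the Xeon Platinum 9282 numbers above, the ridge point sits at \(R_f / R_m = 4660/175 \approx 26.6\) flops per byte; every algorithm in our table has far lower intensity, which is why they are all memory-bound on that machine.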