8) Blocked Matrix-Matrix Multiplication#
Last time:
CPU optimization
Today:
1. Blocked matrix-matrix multiply#
In this lecture we are primarily concerned with optimizing the code for small matrix-matrix multiplies, that is, problems that fit in the cache; efficient use of the cache for large matrices is another topic.
Tip
For this section, watch the video 2.2.1 on the LAFF course to have a basic idea.
The key concept for this unit is blocked matrix-matrix multiply. Namely, we will think of the matrices as partitioned into a set of blocks,

\[
C = \begin{pmatrix}
C_{11} & \cdots & C_{1N} \\
\vdots &        & \vdots \\
C_{M1} & \cdots & C_{MN}
\end{pmatrix},
\]

where each block of \(C\) is of size \(m_{b} \times n_{b}\). The matrices \(A\) and \(B\) are partitioned in a conformal manner into \(M \times K\) and \(K \times N\) blocks, respectively; the block size associated with \(K\) is \(k_{b}\).

With this partitioning of the matrices, the update for block \(IJ\) of \(C\) is then

\[
C_{IJ} := C_{IJ} + \sum_{P=1}^{K} A_{IP} B_{PJ} .
\]
Three questions:

- What sizes should we pick for \(m_{b}\), \(n_{b}\), and \(k_{b}\) so that the blocked matrix-matrix multiplication makes sense dimension-wise?
- What sizes should we pick for \(m_{b}\), \(n_{b}\), and \(k_{b}\) in order to use the available resources most efficiently?
- Which of the previous forms of matrix-matrix multiply should be used for the inner block multiply \(A_{IP} B_{PJ}\)?
Example:#
If we partition the matrix \(C\) into \(3 \times 4\) blocks (that is, \(3\) blocks in the row direction and \(4\) blocks in the column direction),

\[
C = \begin{pmatrix}
C_{11} & C_{12} & C_{13} & C_{14} \\
C_{21} & C_{22} & C_{23} & C_{24} \\
C_{31} & C_{32} & C_{33} & C_{34}
\end{pmatrix},
\]

then we have to partition the matrix \(A\) into \(3\) blocks in the row direction and the matrix \(B\) into \(4\) blocks in the column direction (with a common number \(K\) of blocks in the remaining direction):

\[
A = \begin{pmatrix}
A_{11} & \cdots & A_{1K} \\
A_{21} & \cdots & A_{2K} \\
A_{31} & \cdots & A_{3K}
\end{pmatrix},
\qquad
B = \begin{pmatrix}
B_{11} & \cdots & B_{14} \\
\vdots &        & \vdots \\
B_{K1} & \cdots & B_{K4}
\end{pmatrix}.
\]
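To make the partitioned update concrete, here is a minimal sketch of a blocked multiply in Julia (the function name `blocked_gemm!` and the assumption that the block sizes divide the matrix dimensions evenly are ours, for illustration only):

```julia
function blocked_gemm!(C, A, B, mb, nb, kb)
    m, n = size(C)
    k = size(A, 2)
    for J = 1:div(n, nb), I = 1:div(m, mb)      # loop over the blocks C_IJ
        rows = (I - 1) * mb + 1 : I * mb
        cols = (J - 1) * nb + 1 : J * nb
        for P = 1:div(k, kb)                    # C_IJ := C_IJ + A_IP * B_PJ
            inner = (P - 1) * kb + 1 : P * kb
            @views C[rows, cols] .+= A[rows, inner] * B[inner, cols]
        end
    end
    return C
end
```

For instance, `blocked_gemm!(zeros(6, 8), rand(6, 4), rand(4, 8), 2, 2, 2)` reproduces the corresponding \(6 \times 4\) times \(4 \times 8\) product, with \(C\) split into the \(3 \times 4\) grid of blocks from the example above.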
2. Blocking for registers#
Tip
See the references on the LAFF course to get a basic idea.
Initially we are just going to be concerned with blocking for registers, which means that we want to choose block sizes \(m_b = m_r\) and \(n_b = n_r\) so that \(C_{IJ}\) fits in registers; see the discussion in 2.3.1 A simple model of memory and registers to keep the memory hierarchy triangle in mind.
Usually, a typical modern cache hierarchy is composed of:

- registers (small, fast, ~O(10) `Float64`s)
- L1 cache
- L2 cache
- L3 cache
- main memory (big, slow)
- disk (huge, very slow)
But to simplify things drastically, we can consider just a hierarchy composed of two levels:

- registers (small, fast, ~O(10) `Float64`s)
- main memory (big, slow)
Other assumptions/simplifications:

- Our processor has only one core.
- Moving data between main memory and registers takes time \(\beta_{R \leftrightarrow M}\) per double. The \(R \leftrightarrow M\) is meant to capture movement between registers (\(R\)) and memory (\(M\)).
- The registers can hold 64 doubles.
- Performing a flop with data in registers takes time \(\gamma_R\).
- Data movement and computation cannot overlap.
Since we can only keep tens of numbers in registers at a time, \(m_r\) and \(n_r\) are going to be small; typically multiples of \(4\), due to vectorization.
The first, naive implementation we could think of translates into the following pseudo-Julia snippet:
for J = 1:N
    for I = 1:M
        for P = 1:K
            Load C_{IJ}, A_{IP}, and B_{PJ} into registers
            C_{IJ} := C_{IJ} + A_{IP} B_{PJ}
            Store C_{IJ} back to memory
        end
    end
end
This leads to the following cost:

\[
2 m n k \, \gamma_R + \left( \frac{2 m n k}{k_r} + \frac{m n k}{n_r} + \frac{m n k}{m_r} \right) \beta_{R \leftrightarrow M},
\]

where we have used that \(m = M m_r\), \(n = N n_r\), and \(k = K k_r\): each of the \(M N K\) iterations performs \(2 m_r n_r k_r\) flops and moves \(2 m_r n_r + m_r k_r + k_r n_r\) doubles.
But we do not need to load/store \(C\) in every iteration of the innermost loop: we can move the load/store of \(C\) outside of it. This optimizes things a bit (again in pseudo-Julia):
for J = 1:N
    for I = 1:M
        Load C_{IJ} into registers
        for P = 1:K
            Load A_{IP} and B_{PJ} into registers
            C_{IJ} := C_{IJ} + A_{IP} B_{PJ}
        end
        Store C_{IJ} to memory
    end
end
This leads to the following cost:

\[
2 m n k \, \gamma_R
\]

for computation, and the following only for the loads and stores:

\[
\left( 2 m n + \frac{m n k}{n_r} + \frac{m n k}{m_r} \right) \beta_{R \leftrightarrow M},
\]

where, again, we have used that \(m = M m_r\), \(n = N n_r\), and \(k = K k_r\). Compared to the previous version, the traffic due to \(C\) has dropped from \(2 m n k / k_r\) to \(2 m n\), i.e., by a factor of \(K\), while the traffic due to \(A\) and \(B\) is unchanged.
What about \(k_r\)? Does this need to be small? The answer is no, because we will perform \(A_{IP} B_{PJ}\) using rank-1 updates: since each column of \(A_{IP}\) and each row of \(B_{PJ}\) is used only once in the update of \(C_{IJ}\), we can let \(k_{r} = k\).
Note
Note that if we moved/permuted the `P` loop to be an outer loop we would save on loading the blocks of \(A\) or \(B\), but we would pay for extra loads/stores of \(C\); that would be a losing strategy, because moving data is much more expensive than computing with data.
With this the algorithm would become:
for J = 1:N
    for I = 1:M
        Load C_{IJ} into registers
        Load A_{I} and B_{J} into registers
        C_{IJ} := C_{IJ} + A_{I} B_{J} with the micro-kernel
        Store C_{IJ} to memory
    end
end
We call \(C_{IJ}\) a micro-tile of \(C\), and \(A_{I}\) and \(B_{J}\) micro-panels of \(A\) and \(B\), respectively.
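To see how these pieces fit together, here is a rough Julia sketch of the register-blocking driver (the names `gemm_register_blocked!` and `microkernel!` are illustrative, `mr` and `nr` are assumed to divide the matrix dimensions, and the naive inner kernel below is exactly what the rest of the lecture optimizes):

```julia
# naive micro-kernel: k rank-1 updates on the micro-tile, C_IJ += A_I * B_J
function microkernel!(Cij, Ai, Bj)
    for p = 1:size(Ai, 2), j = 1:size(Cij, 2), i = 1:size(Cij, 1)
        Cij[i, j] = muladd(Ai[i, p], Bj[p, j], Cij[i, j])
    end
    return Cij
end

# walk over the mr×nr micro-tiles of C and hand each one, together with the
# matching micro-panels of A and B, to the micro-kernel
function gemm_register_blocked!(C, A, B, mr, nr)
    m, n = size(C)
    for J = 1:div(n, nr), I = 1:div(m, mr)
        rows = (I - 1) * mr + 1 : I * mr    # rows of the micro-tile C_IJ
        cols = (J - 1) * nr + 1 : J * nr    # columns of the micro-tile C_IJ
        @views microkernel!(C[rows, cols], A[rows, :], B[:, cols])
    end
    return C
end
```

With \(m_r = n_r = 4\) and matrix sizes that are multiples of \(4\), this agrees with `C + A * B`.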
3. Optimizing the micro kernel#
Let’s look at a single rank-1 update: for a fixed \(p\), every entry of the micro-tile is updated as \(C[i,j] := C[i,j] + A[i,p] \, B[p,j]\). In Julia code, the sequence of \(k\) such updates becomes
for p = 1:k          # loop over the k rank-1 updates
    for j = 1:nr     # select the element of the p-th row of B
        for i = 1:mr # select the element of the p-th column of A
            C[i, j] = C[i, j] + A[i, p] * B[p, j]
        end
    end
end
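As a quick sanity check (the sizes below are purely illustrative), this triple loop reproduces the matrix product added to \(C\):

```julia
mr, nr, k = 4, 4, 8                     # illustrative micro-tile sizes
A = rand(mr, k); B = rand(k, nr); C = zeros(mr, nr)
Cref = C + A * B                        # reference result

for p = 1:k, j = 1:nr, i = 1:mr
    C[i, j] = muladd(A[i, p], B[p, j], C[i, j])
end

@assert C ≈ Cref
```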
Let’s imagine that the vector register size is 4 (so that \(m_r = 4\)); then, unrolling the inner for-loop, we have
for p = 1:k      # loop over the k rank-1 updates
    for j = 1:nr # select the element of the p-th row of B
        C[1, j] = C[1, j] + A[1, p] * B[p, j] # multadd
        C[2, j] = C[2, j] + A[2, p] * B[p, j] # multadd
        C[3, j] = C[3, j] + A[3, p] * B[p, j] # multadd
        C[4, j] = C[4, j] + A[4, p] * B[p, j] # multadd
    end
end
Notice that each of these statements performs the same operation (\(\gamma = \gamma + \alpha * \beta\)), just with different data; note that this is a fused multiply-add (FMA) type operation.
Modern CPUs have vector registers, which allow us to store small vectors (of length 4 here); in the same time it takes to apply an operation to a single number, we can apply it to all the elements in the vector register.
Thus, we can imagine our code becoming:
for p = 1:k      # loop over the k rank-1 updates
    for j = 1:nr # select the element of the p-th row of B
        C[:, j] = C[:, j] + A[:, p] * B[p, j]
    end
end
where all the elements in the column vectors \(C[:, j]\) and \(A[:, p]\) are handled at once (here we assume that \(B[p, j]\) has been put in a vector register that gets reused for each \(j\)). Thus we have sped up our code by a factor of \(4\)! (Also, here we have assumed that all the elements in \(A[:, p]\) can be loaded at once, which is true if memory is properly aligned.)
Similar vectorization optimization strategies in other languages#
Let’s look at how different this would look in C code, with explicitly declared, vendor-specific FMA intrinsics (Intel’s AVX2 vector instruction set in this case):
#include <immintrin.h> /* to use intrinsic functions */

/* map matrix entries onto column-major storage with leading dimensions ldA, ldB, ldC */
#define alpha( i,j ) A[ (j)*ldA + (i) ]
#define beta( i,j )  B[ (j)*ldB + (i) ]
#define gamma( i,j ) C[ (j)*ldC + (i) ]

void Gemm_MRxNRKernel( int k, double *A, int ldA, double *B, int ldB,
                       double *C, int ldC )
{
  /* Declare vector registers to hold the 4x4 micro-tile of C and load them */
  __m256d gamma_0123_0 = _mm256_loadu_pd( &gamma( 0,0 ) );
  __m256d gamma_0123_1 = _mm256_loadu_pd( &gamma( 0,1 ) );
  __m256d gamma_0123_2 = _mm256_loadu_pd( &gamma( 0,2 ) );
  __m256d gamma_0123_3 = _mm256_loadu_pd( &gamma( 0,3 ) );

  for ( int p=0; p<k; p++ ){
    /* Declare vector register for loading/broadcasting beta( p,j ) */
    __m256d beta_p_j;

    /* Declare a vector register to hold the current column of A and load it
       with the four elements of that column. */
    __m256d alpha_0123_p = _mm256_loadu_pd( &alpha( 0,p ) );

    /* Load/broadcast beta( p,0 ). */
    beta_p_j = _mm256_broadcast_sd( &beta( p,0 ) );

    /* Update the first column of C with the current column of A times beta( p,0 ) */
    gamma_0123_0 = _mm256_fmadd_pd( alpha_0123_p, beta_p_j, gamma_0123_0 );

    /* REPEAT for the second, third, and fourth columns of C. Notice that the
       current column of A does not need to be reloaded. */
  }

  /* Store the updated results */
  _mm256_storeu_pd( &gamma(0,0), gamma_0123_0 );
  _mm256_storeu_pd( &gamma(0,1), gamma_0123_1 );
  _mm256_storeu_pd( &gamma(0,2), gamma_0123_2 );
  _mm256_storeu_pd( &gamma(0,3), gamma_0123_3 );
}
In addition to the above vectorization strategy in C, which uses explicit FMA intrinsics, another popular way to tell the compiler to use single-instruction multiple-data (SIMD) operations is through the `#pragma omp simd` directive. It is provided by the OpenMP (Open Multi-Processing) API, which supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran, and is enabled with the `-fopenmp-simd` C/C++ compiler optimization flag.
4. SIMD.jl#
If we were using C or C++ we would need to use the vector intrinsics for the hardware we were targeting. For example, for Intel vector intrinsics see Intel Intrinsics Guide.
In Julia we can use a more general interface: we only need the `SIMD.jl` package. To use it, you need to add the package to your environment, which can be done from the REPL with:
]add SIMD
Basic operations we will need:

- `vload`, to load data from memory into a vector register;
- `vstore`, to store data from a vector register to memory.
If we were using C or C++ we would need to use a special FMA call (`_mm256_fmadd_pd`), but in Julia this is done behind the scenes for us. When `muladd` is called with vector data, Julia does the right thing; remember that one of the most powerful features of Julia is dynamic (multiple) dispatch, which means:
Functions can have multiple definitions, as long as each definition restricts the types of the parameters differently. It is the types of the arguments that determine which “definition” (or “method” in Julia terminology) will be called.
This is the way Julia mimics polymorphism that some compiled languages (e.g., C++) have.
This means that the Julia compiler dispatches to the right `muladd` method in the code below, depending on the argument types.
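For instance, here is a tiny sketch of that dispatch in action with `SIMD.jl` (the values are purely illustrative):

```julia
using SIMD

a = Vec{4, Float64}((1.0, 2.0, 3.0, 4.0))  # a vector register holding four doubles
c = Vec{4, Float64}((0.5, 0.5, 0.5, 0.5))
β = 2.0                                    # a scalar, as B[p, j] will be below

c = muladd(β, a, c)  # dispatches to the Vec method: one fused multiply-add per lane
```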
So in pseudo-lowered code we now have:
c1 = vload column 1 of C
c2 = vload column 2 of C
c3 = vload column 3 of C
c4 = vload column 4 of C
for p = 1:k # loop over the k rank-1 updates
    ap = vload column p of A
    β = load B[p, 1]
    c1 += β * ap
    β = load B[p, 2]
    c2 += β * ap
    β = load B[p, 3]
    c3 += β * ap
    β = load B[p, 4]
    c4 += β * ap
end
vstore c1 into column 1 of C
vstore c2 into column 2 of C
vstore c3 into column 3 of C
vstore c4 into column 4 of C
Since we are loading data from a matrix, we need to pass a reference / pointer to the first element of the data we want to load. In Julia, `pointer(C)` will be a pointer to the first element of \(C\) (like in the C programming language). To get to the next column we need to stride by the number of elements in a column of \(C\); call it `m`.
So our code becomes:
using SIMD # provides Vec, vload, vstore

# data we are storing
T = eltype(C)
# size and type of the vector register
VecT = Vec{4, T}
# vector loads of the data
c1 = vload(VecT, pointer(C) + 0m * sizeof(T)) # no stride
c2 = vload(VecT, pointer(C) + 1m * sizeof(T)) # stride by m (skip one column)
c3 = vload(VecT, pointer(C) + 2m * sizeof(T)) # stride by 2m (skip two columns)
c4 = vload(VecT, pointer(C) + 3m * sizeof(T)) # stride by 3m (skip three columns)
With pointers you are not striding by the number of elements, but by the number of bytes; `sizeof(T)` here returns the number of bytes required to store a value of type `T`:
julia> sizeof(Float64)
8
julia> sizeof(Float32)
4
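For completeness, the corresponding stores mirror the loads: after the update loop the four micro-columns are written back into \(C\) with `vstore` (a sketch reusing `VecT`, `T`, the column length `m`, and the registers `c1`–`c4` from above):

```julia
# write the updated micro-columns back into C, striding by one column (m elements) each time
vstore(c1, pointer(C) + 0m * sizeof(T))
vstore(c2, pointer(C) + 1m * sizeof(T))
vstore(c3, pointer(C) + 2m * sizeof(T))
vstore(c4, pointer(C) + 3m * sizeof(T))
```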
Generalizing using Julia magic#
Above we assumed that `mr` and `nr` were both 4; ideally we would like to generalize this with loops. We confront two issues with this:

- where to put each of the column vectors of `C`;
- how to make the numbers represented by `mr` and `nr` known at compile time; the more data we know at compile time, the more optimization can occur in the compiler.
Compile time constants: Val#
To get constants to be known at compile time we use the `Val` construct in Julia.
function foo(N)
    v = 0
    for i = 1:N
        v = v + i
    end
    return v
end
foo (generic function with 1 method)
function bar(::Val{N}) where N
    v = 0
    for i = 1:N
        v = v + i
    end
    return v
end
bar (generic function with 1 method)
@code_llvm foo(10)
; @ In[1]:1 within `foo`
define i64 @julia_foo_752(i64 signext %0) #0 {
top:
; @ In[1]:3 within `foo`
; ┌ @ range.jl:897 within `iterate`
; │┌ @ range.jl:672 within `isempty`
; ││┌ @ operators.jl:378 within `>`
; │││┌ @ int.jl:83 within `<`
%1 = icmp slt i64 %0, 1
; └└└└
br i1 %1, label %L32, label %L17.preheader
L17.preheader: ; preds = %top
; @ In[1]:5 within `foo`
%2 = shl nuw i64 %0, 1
%3 = add nsw i64 %0, -1
%4 = zext i64 %3 to i65
%5 = add nsw i64 %0, -2
%6 = zext i64 %5 to i65
%7 = mul i65 %4, %6
%8 = lshr i65 %7, 1
%9 = trunc i65 %8 to i64
%10 = add i64 %2, %9
%11 = add i64 %10, -1
; @ In[1]:6 within `foo`
br label %L32
L32: ; preds = %L17.preheader, %top
%value_phi10 = phi i64 [ 0, %top ], [ %11, %L17.preheader ]
ret i64 %value_phi10
}
@code_llvm bar(Val(10))
; @ In[2]:1 within `bar`
define i64 @julia_bar_777() #0 {
top:
; @ In[2]:6 within `bar`
ret i64 55
}
- Notice that the code generated for `bar(Val(10))` just returns `55`, whereas the code for `foo(10)` is much more complicated.
- In the call to `bar` the for-loop is precomputed, since the number `10` is known when the code is compiled.
- What `Val(10)` does is make the number `10` essentially a type.
a = Val(10)
Val{10}()
typeof(a)
Val{10}
- The critical thing for us is that this number is now known when the code is compiled (as opposed to at runtime).
- The downside of this is that a new version of the code must be compiled for each unique input, which adds overhead to the first time a function is called; `bar(Val(10))` and `bar(Val(20))` result in different compiled code, whereas `foo(10)` and `foo(20)` use the same compiled code.
- A similar optimization occurs when `foo` is used with a constant value inside a function:
function baz()
    foo(10)
end
baz (generic function with 1 method)
@code_llvm baz()
; @ In[7]:1 within `baz`
define i64 @julia_baz_914() #0 {
top:
; @ In[7]:2 within `baz`
; ┌ @ In[1]:6 within `foo`
ret i64 55
; └
}
This is known as constant propagation in Julia, where the return value of `foo(10)` can be figured out at compile time.
- In the cells above, we have used the `@code_llvm` macro, which shows the LLVM (Julia compiler) code that is generated.
- There are different levels of “lowering” the code, from the high-level (human readable) Julia syntax down to the machine/assembly level. Refer to the documentation of the corresponding macros (`@code_lowered`, `@code_typed`, `@code_llvm`, `@code_native`) for each of them.
Recommended Reading: How to optimise Julia code: A practical guide
5. Constant sized arrays: StaticArrays.jl#
It is natural to define the vector unit type as `VecT = Vec{vl, T}`, where `vl` is the length of the vector unit (we used `vl = 4` above). Thus the microcolumns of \(C\) and \(A\) get broken up into \(m_r / v_l\) pieces.
Now the question becomes: what should we use to hold the \(m_r/v_l \times n_r\) vectors of the microtile?
We could do:
c = Array{VecT}(undef, mr_vl, nr)
which would give us an array of size `mr_vl * nr` that can hold vectors, with unspecified entries (`undef`). Unfortunately, even though this array is only used in the microkernel, the array `c` will be allocated on the heap and not on the stack; for our purposes, you can think of the stack as memory that is local to the function and the heap as memory that can hold data needed across functions. If small enough, stack data should just live in registers.
How can we get stack (register) allocated arrays in Julia?
We need a package called `StaticArrays.jl`:
using Pkg
Pkg.add("StaticArrays")
Updating registry at `~/.julia/registries/General.toml`
Resolving package versions...
No Changes to `~/.julia/environments/v1.10/Project.toml`
No Changes to `~/.julia/environments/v1.10/Manifest.toml`
There are two types of arrays in `StaticArrays`:

- `SArray`: arrays whose size is known at compile time and whose data does not change; if `a` is an `SArray` you CANNOT do `a[1] = 0`.
- `MArray` (mutable arrays): arrays whose size is known at compile time but whose data can change; if `a` is an `MArray` you CAN do `a[1] = 0`.
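A minimal sketch of the difference (the values are illustrative):

```julia
using StaticArrays

s  = SVector{4, Float64}(1.0, 2.0, 3.0, 4.0)  # static and immutable
ma = MVector{4, Float64}(1.0, 2.0, 3.0, 4.0)  # static but mutable

ma[1] = 0.0    # allowed: MArray data can change
# s[1] = 0.0   # would throw an error: SArray data cannot be modified
```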
Since our data will be changing, we want to store the column vectors as `MArray`s. Our algorithm will become:
# Load the columns of the microtile of C
mr_vl = div(mr, vl)
c = MArray{Tuple{mr_vl, nr}, VecT}(undef)
@inbounds for j = 1:nr
    for i = 1:mr_vl
        offset = ((i - 1) * vl + (j - 1) * m) * sizeof(T)
        c[i, j] = vload(VecT, pointer(C) + offset)
    end
end
and similarly for \(a\) we want:
a = MVector{mr_vl, VecT}(undef)
@inbounds for p = 1:pend
    # load the p-th microcolumn of A, in pieces of length vl
    for i = 1:mr_vl
        offset = ((p - 1) * m + (i - 1) * vl) * sizeof(T)
        a[i] = vload(VecT, pointer(A) + offset)
    end
    # rank-1 update of the microtile held in c
    for j = 1:nr
        β = B[p, j]
        for i = 1:mr_vl
            c[i, j] = muladd(β, a[i], c[i, j])
        end
    end
end
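Finally, the updated micro-tile has to be written back to memory; here is a sketch consistent with the loads above, using the same `offset` formula with `vstore`:

```julia
# store the micro-tile held in c back into C
@inbounds for j = 1:nr
    for i = 1:mr_vl
        offset = ((i - 1) * vl + (j - 1) * m) * sizeof(T)
        vstore(c[i, j], pointer(C) + offset)
    end
end
```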
6. Loop unrolling#
Another (and possibly unneeded) optimization is loop unrolling. The basic idea of loop unrolling is to change:
@inbounds for j = 1:nr
    for i = 1:mr_vl
        offset = ((i - 1) * vl + (j - 1) * m) * sizeof(T)
        c[i, j] = vload(VecT, pointer(C) + offset)
    end
end
to
c[1,1] = vload(VecT, pointer(C))
c[2,1] = vload(VecT, pointer(C) + vl * sizeof(T))
c[1,2] = vload(VecT, pointer(C) + m * sizeof(T))
c[2,2] = vload(VecT, pointer(C) + (vl + m) * sizeof(T))
c[1,3] = vload(VecT, pointer(C) + 2m * sizeof(T))
c[2,3] = vload(VecT, pointer(C) + (vl + 2m) * sizeof(T))
c[1,4] = vload(VecT, pointer(C) + 3m * sizeof(T))
c[2,4] = vload(VecT, pointer(C) + (vl + 3m) * sizeof(T))
...
which achieves two things:

- the instructions needed to control the flow of the loop are removed;
- the compiler has more optimizations at its disposal.
The downside of loop unrolling is that the binary size gets larger and register pressure can increase.
The reason I say that this might be unneeded is that for small loops Julia (LLVM) is likely already unrolling the loops; in some little tests I’ve seen Julia (LLVM) unroll loops of length 36, but I don’t know the details of what heuristics are used for this…
That said, we can force loop unrolling using the `@unroll` macro from `KernelAbstractions.jl`:
using Pkg
Pkg.add("KernelAbstractions")
Resolving package versions...
No Changes to `~/.julia/environments/v1.10/Project.toml`
No Changes to `~/.julia/environments/v1.10/Manifest.toml`
so that our loop code becomes
@inbounds @unroll for j = 1:nr
    @unroll for i = 1:mr_vl
        offset = ((i - 1) * vl + (j - 1) * m) * sizeof(T)
        c[i, j] = vload(VecT, pointer(C) + offset)
    end
end
We can also ask for a fixed integer unroll factor:
@inbounds @unroll 4 for p = 1:pend
...
end
which will unroll the loop in factors of 4, then handle the remainder safely; see the LLVM Loop Unrolling documentation for an example.
7. Examples in Julia#
You can find examples of a \(4 \times 4\) micro-kernel implementation in Julia in the `julia_codes/module2-3/` directory of the class repository.
8. Summary#
In Section 2. Blocking for registers we found the following cost:

\[
2 m n k \, \gamma_R
\]

for computation, and the following only for the loads and stores:

\[
\left( 2 m n + \frac{m n k}{n_r} + \frac{m n k}{m_r} \right) \beta_{R \leftrightarrow M} .
\]

The ratio between flops and memory operations between the registers and memory is then

\[
\frac{2 m n k}{\, 2 m n + \dfrac{m n k}{n_r} + \dfrac{m n k}{m_r} \,} .
\]

We can see that if \(k\) is large, then \(2mn\) (the cost of loading and storing the \(m_r \times n_r\) submatrices of \(C\)) can be ignored in the denominator, yielding, approximately,

\[
\frac{2 m n k}{\, \dfrac{m n k}{n_r} + \dfrac{m n k}{m_r} \,}
= \frac{2}{\dfrac{1}{m_r} + \dfrac{1}{n_r}}
= \frac{2 m_r n_r}{m_r + n_r} .
\]
This is the ratio of floating point operations to memory operations that we want to be high.
If \(m_r = n_r = 4\), then this ratio is 4.
This means that, for every memory operation (a read of an element of \(A\) or \(B\)), approximately \(4\) floating point operations are performed with data that resides in registers.
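Explicitly, plugging \(m_r = n_r = 4\) into the approximate ratio above:

\[
\frac{2 m_r n_r}{m_r + n_r} = \frac{2 \cdot 4 \cdot 4}{4 + 4} = \frac{32}{8} = 4 .
\]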