8) Blocked Matrix-Matrix Multiplication#
Last time:
CPU optimization
Today:
1. Blocked matrix-matrix multiply#
In this lecture we are primarily concerned with optimizing the code for small matrix-matrix multiplies, that is, problems that fit in the cache; efficient use of the cache for large matrices is another topic.
Tip
For this section, watch the video 2.2.1 on the LAFF course to have a basic idea.
The key concept for this unit is blocked matrix-matrix multiply. Namely, we will think of the matrices as partitioned into a set of blocks,

\[
C = \begin{pmatrix}
C_{11} & \cdots & C_{1N} \\
\vdots &        & \vdots \\
C_{M1} & \cdots & C_{MN}
\end{pmatrix},
\]

where each block of \(C\) is of size \(m_{b} \times n_{b}\). The matrices \(A\) and \(B\) are partitioned in a conformal manner into \(M \times K\) and \(K \times N\) blocks, respectively; the block size associated with \(K\) is \(k_{b}\).

With this partitioning of the matrices, the update for block \(IJ\) of \(C\) is then

\[
C_{IJ} := C_{IJ} + \sum_{P=1}^{K} A_{IP} B_{PJ} .
\]
Three questions:

- What sizes should we pick for \(m_{b}\), \(n_{b}\), and \(k_{b}\) so that the blocked matrix-matrix multiplication makes sense dimension-wise?
- What sizes should we pick for \(m_{b}\), \(n_{b}\), and \(k_{b}\) in order to use the available resources most efficiently?
- Which of the previous forms of matrix-matrix multiply should be used for the inner block multiply \(A_{IP} B_{PJ}\)?
Example:#
If we partition the matrix \(C\) into \(3 \times 4\) blocks (that is, \(3\) blocks in the row direction and \(4\) blocks in the column direction),

\[
C = \begin{pmatrix}
C_{11} & C_{12} & C_{13} & C_{14} \\
C_{21} & C_{22} & C_{23} & C_{24} \\
C_{31} & C_{32} & C_{33} & C_{34}
\end{pmatrix},
\]

then we have to partition the matrix \(A\) into \(3\) blocks in the row direction and the matrix \(B\) into \(4\) blocks in the column direction (with a common number \(K\) of blocks in the remaining direction):

\[
A = \begin{pmatrix}
A_{11} & \cdots & A_{1K} \\
A_{21} & \cdots & A_{2K} \\
A_{31} & \cdots & A_{3K}
\end{pmatrix},
\qquad
B = \begin{pmatrix}
B_{11} & \cdots & B_{14} \\
\vdots &        & \vdots \\
B_{K1} & \cdots & B_{K4}
\end{pmatrix}.
\]
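To make the partitioned update concrete, here is a minimal sketch of a blocked multiply in Julia (the function name `blocked_gemm!` and the assumption that the block sizes divide the matrix dimensions evenly are ours, for illustration only):

```julia
function blocked_gemm!(C, A, B, mb, nb, kb)
    m, n = size(C)
    k = size(A, 2)
    for J = 1:div(n, nb), I = 1:div(m, mb)      # loop over the blocks C_IJ
        rows = (I - 1) * mb + 1 : I * mb
        cols = (J - 1) * nb + 1 : J * nb
        for P = 1:div(k, kb)                    # C_IJ := C_IJ + A_IP * B_PJ
            inner = (P - 1) * kb + 1 : P * kb
            @views C[rows, cols] .+= A[rows, inner] * B[inner, cols]
        end
    end
    return C
end
```

For instance, `blocked_gemm!(zeros(6, 8), rand(6, 4), rand(4, 8), 2, 2, 2)` reproduces the corresponding \(6 \times 4\) times \(4 \times 8\) product, with \(C\) split into the \(3 \times 4\) grid of blocks from the example above.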
2. Blocking for registers#
Tip
See the references on the LAFF course to get a basic idea.
Initially we are just going to be concerned with blocking for registers, which means that we want to choose block sizes \(m_b = m_r\) and \(n_b = n_r\) so that \(C_{IJ}\) fits in registers; see the discussion in 2.3.1 A simple model of memory and registers to keep the memory hierarchy triangle in mind.
Usually, a typical modern cache hierarchy is composed of:

- registers (small, fast, ~O(10) `Float64`s)
- L1 cache
- L2 cache
- L3 cache
- main memory (big, slow)
- disk (huge, very slow)
But to simplify things drastically, we can consider just a hierarchy composed of two levels:

- registers (small, fast, ~O(10) `Float64`s)
- main memory (big, slow)
Other assumptions/simplifications:

- Our processor has only one core.
- Moving data between main memory and registers takes time \(\beta_{R \leftrightarrow M}\) per double. The \(R \leftrightarrow M\) is meant to capture movement between registers (\(R\)) and memory (\(M\)).
- The registers can hold 64 doubles.
- Performing a flop with data in registers takes time \(\gamma_R\).
- Data movement and computation cannot overlap.
Since we can only keep tens of numbers in registers at a time, \(m_r\) and \(n_r\) are going to be small; typically multiples of \(4\), due to vectorization.
The first, naive implementation we could think of translates into the following pseudo-Julia snippet:
for J = 1:N
    for I = 1:M
        for P = 1:K
            Load C_{IJ}, A_{IP}, and B_{PJ} into registers
            C_{IJ} := C_{IJ} + A_{IP} B_{PJ}
            Store C_{IJ} back to memory
        end
    end
end
This leads to the following cost:

\[
2 m n k \, \gamma_R + \left( \frac{2 m n k}{k_r} + \frac{m n k}{n_r} + \frac{m n k}{m_r} \right) \beta_{R \leftrightarrow M},
\]

where we have used that \(m = M m_r\), \(n = N n_r\), and \(k = K k_r\): each of the \(M N K\) iterations performs \(2 m_r n_r k_r\) flops and moves \(2 m_r n_r + m_r k_r + k_r n_r\) doubles.
But we do not need to load/store \(C\) in every iteration of the innermost loop: we can move the load/store of \(C\) outside of it. This optimizes things a bit (again in pseudo-Julia):
for J = 1:N
    for I = 1:M
        Load C_{IJ} into registers
        for P = 1:K
            Load A_{IP} and B_{PJ} into registers
            C_{IJ} := C_{IJ} + A_{IP} B_{PJ}
        end
        Store C_{IJ} to memory
    end
end
This leads to the following cost:

\[
2 m n k \, \gamma_R
\]

for computation, and the following only for the loads and stores:

\[
\left( 2 m n + \frac{m n k}{n_r} + \frac{m n k}{m_r} \right) \beta_{R \leftrightarrow M},
\]

where, again, we have used that \(m = M m_r\), \(n = N n_r\), and \(k = K k_r\). Compared to the previous version, the traffic due to \(C\) has dropped from \(2 m n k / k_r\) to \(2 m n\), i.e., by a factor of \(K\), while the traffic due to \(A\) and \(B\) is unchanged.
What about \(k_r\)? Does this need to be small? The answer is no, because we will perform \(A_{IP} B_{PJ}\) using rank-1 updates: since each column of \(A_{IP}\) and each row of \(B_{PJ}\) is used only once in the update of \(C_{IJ}\), we can let \(k_{r} = k\).
Note
Note that if we moved/permuted the `P` loop to be an outer loop we would save on loading the blocks of \(A\) or \(B\), but we would pay for extra loads/stores of \(C\); that would be a losing strategy, because moving data is much more expensive than computing with data.
With this the algorithm would become:
for J = 1:N
    for I = 1:M
        Load C_{IJ} into registers
        Load A_{I} and B_{J} into registers
        C_{IJ} := C_{IJ} + A_{I} B_{J} with the micro-kernel
        Store C_{IJ} to memory
    end
end
We call \(C_{IJ}\) a micro-tile of \(C\), and \(A_{I}\) and \(B_{J}\) micro-panels of \(A\) and \(B\), respectively.
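To see how these pieces fit together, here is a rough Julia sketch of the register-blocking driver (the names `gemm_register_blocked!` and `microkernel!` are illustrative, `mr` and `nr` are assumed to divide the matrix dimensions, and the naive inner kernel below is exactly what the rest of the lecture optimizes):

```julia
# naive micro-kernel: k rank-1 updates on the micro-tile, C_IJ += A_I * B_J
function microkernel!(Cij, Ai, Bj)
    for p = 1:size(Ai, 2), j = 1:size(Cij, 2), i = 1:size(Cij, 1)
        Cij[i, j] = muladd(Ai[i, p], Bj[p, j], Cij[i, j])
    end
    return Cij
end

# walk over the mr×nr micro-tiles of C and hand each one, together with the
# matching micro-panels of A and B, to the micro-kernel
function gemm_register_blocked!(C, A, B, mr, nr)
    m, n = size(C)
    for J = 1:div(n, nr), I = 1:div(m, mr)
        rows = (I - 1) * mr + 1 : I * mr    # rows of the micro-tile C_IJ
        cols = (J - 1) * nr + 1 : J * nr    # columns of the micro-tile C_IJ
        @views microkernel!(C[rows, cols], A[rows, :], B[:, cols])
    end
    return C
end
```

With \(m_r = n_r = 4\) and matrix sizes that are multiples of \(4\), this agrees with `C + A * B`.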
3. Optimizing the micro kernel#
Let’s look at a single rank-1 update: for a fixed \(p\), every entry of the micro-tile is updated as \(C[i,j] := C[i,j] + A[i,p] \, B[p,j]\). In Julia code, the sequence of \(k\) such updates becomes
for p = 1:k          # loop over the k rank-1 updates
    for j = 1:nr     # select the element of the p-th row of B
        for i = 1:mr # select the element of the p-th column of A
            C[i, j] = C[i, j] + A[i, p] * B[p, j]
        end
    end
end
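As a quick sanity check (the sizes below are purely illustrative), this triple loop reproduces the matrix product added to \(C\):

```julia
mr, nr, k = 4, 4, 8                     # illustrative micro-tile sizes
A = rand(mr, k); B = rand(k, nr); C = zeros(mr, nr)
Cref = C + A * B                        # reference result

for p = 1:k, j = 1:nr, i = 1:mr
    C[i, j] = muladd(A[i, p], B[p, j], C[i, j])
end

@assert C ≈ Cref
```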
Let’s imagine that the vector register size is 4 (so that \(m_r = 4\)); then, unrolling the inner for-loop, we have
for p = 1:k      # loop over the k rank-1 updates
    for j = 1:nr # select the element of the p-th row of B
        C[1, j] = C[1, j] + A[1, p] * B[p, j] # multadd
        C[2, j] = C[2, j] + A[2, p] * B[p, j] # multadd
        C[3, j] = C[3, j] + A[3, p] * B[p, j] # multadd
        C[4, j] = C[4, j] + A[4, p] * B[p, j] # multadd
    end
end
Notice that each of these statements performs the same operation (\(\gamma = \gamma + \alpha * \beta\)), just with different data; note that this is a fused multiply-add (FMA) type operation.
Modern CPUs have vector registers, which allow us to store small vectors (of length 4 here); in the same time it takes to apply an operation to a single number, we can apply it to all the elements in the vector register.
Thus, we can imagine our code becoming:
for p = 1:k      # loop over the k rank-1 updates
    for j = 1:nr # select the element of the p-th row of B
        C[:, j] = C[:, j] + A[:, p] * B[p, j]
    end
end
where all the elements in the column vectors \(C[:, j]\) and \(A[:, p]\) are handled at once (here we assume that \(B[p, j]\) has been put in a vector register that gets reused for each \(j\)). Thus we have sped up our code by a factor of \(4\)! (Also, here we have assumed that all the elements in \(A[:, p]\) can be loaded at once, which is true if memory is properly aligned.)
Similar vectorization optimization strategies in other languages#
Let’s look at how different this would look in C code, with explicitly declared, vendor-specific FMA intrinsics (Intel’s AVX2 vector instruction set in this case):
#include <immintrin.h> /* to use intrinsic functions */

/* map matrix entries onto column-major storage with leading dimensions ldA, ldB, ldC */
#define alpha( i,j ) A[ (j)*ldA + (i) ]
#define beta( i,j )  B[ (j)*ldB + (i) ]
#define gamma( i,j ) C[ (j)*ldC + (i) ]

void Gemm_MRxNRKernel( int k, double *A, int ldA, double *B, int ldB,
                       double *C, int ldC )
{
  /* Declare vector registers to hold the 4x4 micro-tile of C and load them */
  __m256d gamma_0123_0 = _mm256_loadu_pd( &gamma( 0,0 ) );
  __m256d gamma_0123_1 = _mm256_loadu_pd( &gamma( 0,1 ) );
  __m256d gamma_0123_2 = _mm256_loadu_pd( &gamma( 0,2 ) );
  __m256d gamma_0123_3 = _mm256_loadu_pd( &gamma( 0,3 ) );

  for ( int p=0; p<k; p++ ){
    /* Declare vector register for loading/broadcasting beta( p,j ) */
    __m256d beta_p_j;

    /* Declare a vector register to hold the current column of A and load it
       with the four elements of that column. */
    __m256d alpha_0123_p = _mm256_loadu_pd( &alpha( 0,p ) );

    /* Load/broadcast beta( p,0 ). */
    beta_p_j = _mm256_broadcast_sd( &beta( p,0 ) );

    /* Update the first column of C with the current column of A times beta( p,0 ) */
    gamma_0123_0 = _mm256_fmadd_pd( alpha_0123_p, beta_p_j, gamma_0123_0 );

    /* REPEAT for the second, third, and fourth columns of C. Notice that the
       current column of A does not need to be reloaded. */
  }

  /* Store the updated results */
  _mm256_storeu_pd( &gamma(0,0), gamma_0123_0 );
  _mm256_storeu_pd( &gamma(0,1), gamma_0123_1 );
  _mm256_storeu_pd( &gamma(0,2), gamma_0123_2 );
  _mm256_storeu_pd( &gamma(0,3), gamma_0123_3 );
}
In addition to the above vectorization strategy in C, which uses explicit FMA intrinsics, another popular way to tell the compiler to use single-instruction multiple-data (SIMD) operations is through the `#pragma omp simd` directive. It is provided by the OpenMP (Open Multi-Processing) API, which supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran, and is enabled with the `-fopenmp-simd` C/C++ compiler optimization flag.
4. SIMD.jl#
If we were using C or C++ we would need to use the vector intrinsics for the hardware we were targeting. For example, for Intel vector intrinsics see Intel Intrinsics Guide.
In Julia we can use a more general interface: we only need the `SIMD.jl` package. To use it, you need to add the package to your environment, which can be done from the REPL with:
]add SIMD
Basic operations we will need:

- `vload`, to load data from memory into a vector register;
- `vstore`, to store data from a vector register to memory.
If we were using C or C++ we would need to use a special FMA call (`_mm256_fmadd_pd`), but in Julia this is done behind the scenes for us. When `muladd` is called with vector data, Julia does the right thing; remember that one of the most powerful features of Julia is dynamic (multiple) dispatch, which means:
Functions can have multiple definitions, as long as each definition restricts the types of the parameters differently. It is the types of the arguments that determine which “definition” (or “method” in Julia terminology) will be called.
This is the way Julia mimics polymorphism that some compiled languages (e.g., C++) have.
This means that the Julia compiler dispatches to the right `muladd` method in the code below, depending on the argument types.
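For instance, here is a tiny sketch of that dispatch in action with `SIMD.jl` (the values are purely illustrative):

```julia
using SIMD

a = Vec{4, Float64}((1.0, 2.0, 3.0, 4.0))  # a vector register holding four doubles
c = Vec{4, Float64}((0.5, 0.5, 0.5, 0.5))
β = 2.0                                    # a scalar, as B[p, j] will be below

c = muladd(β, a, c)  # dispatches to the Vec method: one fused multiply-add per lane
```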
So in pseudo-lowered code we now have:
c1 = vload column 1 of C
c2 = vload column 2 of C
c3 = vload column 3 of C
c4 = vload column 4 of C
for p = 1:k # loop over the k rank-1 updates
    ap = vload column p of A
    β = load B[p, 1]
    c1 += β * ap
    β = load B[p, 2]
    c2 += β * ap
    β = load B[p, 3]
    c3 += β * ap
    β = load B[p, 4]
    c4 += β * ap
end
vstore c1 into column 1 of C
vstore c2 into column 2 of C
vstore c3 into column 3 of C
vstore c4 into column 4 of C
Since we are loading data from a matrix, we need to pass a reference / pointer to the first element of the data we want to load. In Julia, `pointer(C)` will be a pointer to the first element of \(C\) (like in the C programming language). To get to the next column we need to stride by the number of elements in a column of \(C\); call it `m`.
So our code becomes:
using SIMD # provides Vec, vload, vstore

# data we are storing
T = eltype(C)
# size and type of the vector register
VecT = Vec{4, T}
# vector loads of the data
c1 = vload(VecT, pointer(C) + 0m * sizeof(T)) # no stride
c2 = vload(VecT, pointer(C) + 1m * sizeof(T)) # stride by m (skip one column)
c3 = vload(VecT, pointer(C) + 2m * sizeof(T)) # stride by 2m (skip two columns)
c4 = vload(VecT, pointer(C) + 3m * sizeof(T)) # stride by 3m (skip three columns)
With pointers you are not striding by the number of elements, but by the number of bytes; `sizeof(T)` here returns the number of bytes required to store a value of type `T`:
julia> sizeof(Float64)
8
julia> sizeof(Float32)
4
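For completeness, the corresponding stores mirror the loads: after the update loop the four micro-columns are written back into \(C\) with `vstore` (a sketch reusing `VecT`, `T`, the column length `m`, and the registers `c1`–`c4` from above):

```julia
# write the updated micro-columns back into C, striding by one column (m elements) each time
vstore(c1, pointer(C) + 0m * sizeof(T))
vstore(c2, pointer(C) + 1m * sizeof(T))
vstore(c3, pointer(C) + 2m * sizeof(T))
vstore(c4, pointer(C) + 3m * sizeof(T))
```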
Generalizing using Julia magic#
Above we assumed that `mr` and `nr` were both 4; ideally we would like to generalize this with loops. We confront two issues with this:

- where to put each of the column vectors of `C`;
- how to make the numbers represented by `mr` and `nr` known at compile time; the more data we know at compile time, the more optimization can occur in the compiler.
Compile time constants: Val#
To get constants to be known at compile time we use the `Val` construct in Julia.
function foo(N)
    v = 0
    for i = 1:N
        v = v + i
    end
    return v
end
foo (generic function with 1 method)
function bar(::Val{N}) where N
    v = 0
    for i = 1:N
        v = v + i
    end
    return v
end
bar (generic function with 1 method)
@code_llvm foo(10)
; @ In[1]:1 within `foo`
define i64 @julia_foo_752(i64 signext %0) #0 {
top:
; @ In[1]:3 within `foo`
; ┌ @ range.jl:897 within `iterate`
; │┌ @ range.jl:672 within `isempty`
; ││┌ @ operators.jl:378 within `>`
; │││┌ @ int.jl:83 within `<`
%1 = icmp slt i64 %0, 1
; └└└└
br i1 %1, label %L32, label %L17.preheader
L17.preheader: ; preds = %top
; @ In[1]:5 within `foo`
%2 = shl nuw i64 %0, 1
%3 = add nsw i64 %0, -1
%4 = zext i64 %3 to i65
%5 = add nsw i64 %0, -2
%6 = zext i64 %5 to i65
%7 = mul i65 %4, %6
%8 = lshr i65 %7, 1
%9 = trunc i65 %8 to i64
%10 = add i64 %2, %9
%11 = add i64 %10, -1
; @ In[1]:6 within `foo`
br label %L32
L32: ; preds = %L17.preheader, %top
%value_phi10 = phi i64 [ 0, %top ], [ %11, %L17.preheader ]
ret i64 %value_phi10
}
@code_llvm bar(Val(10))
; @ In[2]:1 within `bar`
define i64 @julia_bar_777() #0 {
top:
; @ In[2]:6 within `bar`
ret i64 55
}
- Notice that the code generated for `bar(Val(10))` just returns `55`, whereas the code for `foo(10)` is much more complicated.
- In the call to `bar` the for-loop is precomputed, since the number `10` is known when the code is compiled.
- What `Val(10)` does is make the number `10` essentially a type.
a = Val(10)
Val{10}()
typeof(a)
Val{10}
- The critical thing for us is that this number is now known when the code is compiled (as opposed to at runtime).
- The downside of this is that a new version of the code must be compiled for each unique input, which adds overhead to the first time a function is called; `bar(Val(10))` and `bar(Val(20))` result in different compiled code, whereas `foo(10)` and `foo(20)` use the same compiled code.
- A similar optimization occurs when `foo` is used with a constant value inside a function:
function baz()
    foo(10)
end
baz (generic function with 1 method)
@code_llvm baz()
; @ In[7]:1 within `baz`
define i64 @julia_baz_914() #0 {
top:
; @ In[7]:2 within `baz`
; ┌ @ In[1]:6 within `foo`
ret i64 55
; └
}
This is known as constant propagation in Julia, where the return value of `foo(10)` can be figured out at compile time.
- In the cells above, we have used the `@code_llvm` macro, which shows the LLVM (Julia compiler) code that is generated.
- There are different levels of “lowering” the code, from the high-level (human readable) Julia syntax down to the machine/assembly level. Refer to the documentation of the corresponding macros (`@code_lowered`, `@code_typed`, `@code_llvm`, `@code_native`) for each of them.
Recommended Reading: How to optimise Julia code: A practical guide
5. Constant sized arrays: StaticArrays.jl#
It is natural to define the vector unit type as `VecT = Vec{vl, T}`, where `vl` is the length of the vector unit (we used `vl = 4` above). Thus the microcolumns of \(C\) and \(A\) get broken up into \(m_r / v_l\) pieces.
Now the question becomes: what should we use to hold the \(m_r/v_l \times n_r\) vectors of the microtile?
We could do:
c = Array{VecT}(undef, mr_vl, nr)
which would give us an array of size `mr_vl * nr` that can hold vectors, with unspecified entries (`undef`). Unfortunately, even though this array is only used in the microkernel, the array `c` will be allocated on the heap and not on the stack; for our purposes, you can think of the stack as memory that is local to the function and the heap as memory that can hold data needed across functions. If small enough, stack data should just live in registers.
How can we get stack (register) allocated arrays in Julia?
We need a package called `StaticArrays.jl`:
using Pkg
Pkg.add("StaticArrays")
Updating registry at `~/.julia/registries/General.toml`
Resolving package versions...
No Changes to `~/.julia/environments/v1.10/Project.toml`
No Changes to `~/.julia/environments/v1.10/Manifest.toml`
There are two types of arrays in `StaticArrays`:

- `SArray`: arrays whose size is known at compile time and whose data does not change; if `a` is an `SArray` you CANNOT do `a[1] = 0`.
- `MArray` (mutable arrays): arrays whose size is known at compile time but whose data can change; if `a` is an `MArray` you CAN do `a[1] = 0`.
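A minimal sketch of the difference (the values are illustrative):

```julia
using StaticArrays

s  = SVector{4, Float64}(1.0, 2.0, 3.0, 4.0)  # static and immutable
ma = MVector{4, Float64}(1.0, 2.0, 3.0, 4.0)  # static but mutable

ma[1] = 0.0    # allowed: MArray data can change
# s[1] = 0.0   # would throw an error: SArray data cannot be modified
```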
Since our data will be changing, we want to store the column vectors as `MArray`s. Our algorithm will become:
# Load the columns of the microtile of C
mr_vl = div(mr, vl)
c = MArray{Tuple{mr_vl, nr}, VecT}(undef)
@inbounds for j = 1:nr
    for i = 1:mr_vl
        offset = ((i - 1) * vl + (j - 1) * m) * sizeof(T)
        c[i, j] = vload(VecT, pointer(C) + offset)
    end
end
and similarly for \(a\) we want:
a = MVector{mr_vl, VecT}(undef)
@inbounds for p = 1:pend
    # load the p-th microcolumn of A, in pieces of length vl
    for i = 1:mr_vl
        offset = ((p - 1) * m + (i - 1) * vl) * sizeof(T)
        a[i] = vload(VecT, pointer(A) + offset)
    end
    # rank-1 update of the microtile held in c
    for j = 1:nr
        β = B[p, j]
        for i = 1:mr_vl
            c[i, j] = muladd(β, a[i], c[i, j])
        end
    end
end
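Finally, the updated micro-tile has to be written back to memory; here is a sketch consistent with the loads above, using the same `offset` formula with `vstore`:

```julia
# store the micro-tile held in c back into C
@inbounds for j = 1:nr
    for i = 1:mr_vl
        offset = ((i - 1) * vl + (j - 1) * m) * sizeof(T)
        vstore(c[i, j], pointer(C) + offset)
    end
end
```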
6. Loop unrolling#
Another (and possibly unneeded) optimization is loop unrolling. The basic idea of loop unrolling is to change:
@inbounds for j = 1:nr
    for i = 1:mr_vl
        offset = ((i - 1) * vl + (j - 1) * m) * sizeof(T)
        c[i, j] = vload(VecT, pointer(C) + offset)
    end
end
to
c[1,1] = vload(VecT, pointer(C))
c[2,1] = vload(VecT, pointer(C) + vl * sizeof(T))
c[1,2] = vload(VecT, pointer(C) + m * sizeof(T))
c[2,2] = vload(VecT, pointer(C) + (vl + m) * sizeof(T))
c[1,3] = vload(VecT, pointer(C) + 2m * sizeof(T))
c[2,3] = vload(VecT, pointer(C) + (vl + 2m) * sizeof(T))
c[1,4] = vload(VecT, pointer(C) + 3m * sizeof(T))
c[2,4] = vload(VecT, pointer(C) + (vl + 3m) * sizeof(T))
...
which achieves two things:

- the instructions needed to control the flow of the loop are removed;
- the compiler has more optimizations at its disposal.
The downside of loop unrolling is that the binary size gets larger and register pressure can increase.
The reason I say that this might be unneeded is that for small loops Julia (LLVM) is likely already unrolling the loops; in some little tests I’ve seen Julia (LLVM) unroll loops of length 36, but I don’t know the details of what heuristics are used for this…
That said, we can force loop unrolling using the `@unroll` macro from `KernelAbstractions.jl`:
using Pkg
Pkg.add("KernelAbstractions")
Resolving package versions...
No Changes to `~/.julia/environments/v1.10/Project.toml`
No Changes to `~/.julia/environments/v1.10/Manifest.toml`
so that our loop code becomes
@inbounds @unroll for j = 1:nr
    @unroll for i = 1:mr_vl
        offset = ((i - 1) * vl + (j - 1) * m) * sizeof(T)
        c[i, j] = vload(VecT, pointer(C) + offset)
    end
end
We can also ask for a fixed integer unroll factor:
@inbounds @unroll 4 for p = 1:pend
...
end
which will unroll the loop in factors of 4, then handle the remainder safely; see the LLVM Loop Unrolling documentation for an example.
7. Examples in Julia#
You can find examples of a \(4 \times 4\) micro-kernel implementation in Julia in the `julia_codes/module2-3/` directory of the class repository.
8. Summary#
In Section 2. Blocking for registers we found the following cost:

\[
2 m n k \, \gamma_R
\]

for computation, and the following only for the loads and stores:

\[
\left( 2 m n + \frac{m n k}{n_r} + \frac{m n k}{m_r} \right) \beta_{R \leftrightarrow M} .
\]

The ratio between flops and memory operations between the registers and memory is then

\[
\frac{2 m n k}{\, 2 m n + \dfrac{m n k}{n_r} + \dfrac{m n k}{m_r} \,} .
\]

We can see that if \(k\) is large, then \(2mn\) (the cost of loading and storing the \(m_r \times n_r\) submatrices of \(C\)) can be ignored in the denominator, yielding, approximately,

\[
\frac{2 m n k}{\, \dfrac{m n k}{n_r} + \dfrac{m n k}{m_r} \,}
= \frac{2}{\dfrac{1}{m_r} + \dfrac{1}{n_r}}
= \frac{2 m_r n_r}{m_r + n_r} .
\]
This is the ratio of floating point operations to memory operations that we want to be high.
If \(m_r = n_r = 4\), then this ratio is 4.
This means that, for every memory operation (a read of an element of \(A\) or \(B\)), approximately \(4\) floating point operations are performed with data that resides in registers.
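Explicitly, plugging \(m_r = n_r = 4\) into the approximate ratio above:

\[
\frac{2 m_r n_r}{m_r + n_r} = \frac{2 \cdot 4 \cdot 4}{4 + 4} = \frac{32}{8} = 4 .
\]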