4) Introduction to Computer Architectures#

Last time:

  • Introduction to Linux File System and commands

Today:

  1. Architectures

  2. Memory

1. Architectures#

A von Neumann architecture:#

[Figure: von Neumann architecture]

Von Neumann's original 1945 report (the First Draft of a Report on the EDVAC) describes a design architecture for an electronic digital computer with these components:

  • A processing unit with both an arithmetic logic unit and processor registers

  • A control unit that includes an instruction register and a program counter

  • Memory that stores data and instructions

  • External mass storage

  • Input and output mechanisms

The term von Neumann architecture has evolved to refer to any stored-program computer in which an instruction fetch and a data operation cannot occur at the same time (since they share a common bus). This is referred to as the von Neumann bottleneck, which often limits the performance of the corresponding system.

A stored-program computer uses the same underlying mechanism to encode both program instructions and data as opposed to designs which use a mechanism such as discrete plugboard wiring or fixed control circuitry for instruction implementation.

The vast majority of modern computers use the same hardware mechanism to encode and store both data and program instructions. However, they place caches between the CPU and memory, and the caches closest to the CPU are split into separate instruction and data caches, so that most instruction and data fetches use separate buses (a split-cache architecture).

A contemporary architecture:#

[Figure: Intel Core 2]

My laptop:#

We can get this kind of information for our machine using hwloc, which provides a library as well as the command-line tool lstopo.

lstopo --output-format svg > lstopo-my_laptop_architecture.svg

[Figure: lstopo output for my laptop's architecture]
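Since hwloc is also a C library, the same information is available programmatically. Below is a minimal sketch (assuming hwloc 2.x; compile with -lhwloc) that prints the object counts lstopo draws:

#include <stdio.h>
#include <hwloc.h>

int main(void) {
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);   /* create an empty topology context */
    hwloc_topology_load(topo);    /* probe the machine we are running on */
    printf("packages (sockets): %d\n",
           hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PACKAGE));
    printf("physical cores:     %d\n",
           hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE));
    printf("processing units:   %d\n",
           hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU));
    hwloc_topology_destroy(topo); /* release the context */
    return 0;
}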

A double-socket compute node with two GPUs:#

2x Xeon Ivy-Bridge-EP E5-2680v2 + 2x NVIDIA GPUs (from 2013, with hwloc v1.11). GPUs are reported as CUDA devices and X11 display :1.0 (from the hwloc gallery).

Block diagrams:#

A block diagram from a vendor can include additional information about how cores are physically connected.

Ring bus (Xeon E5-2600 family):#

[Figure: Intel Xeon E5-2600]

Mesh bus (Xeon Scalable family):#

[Figure: Intel Xeon Scalable]

Multi-socket configurations:#

A 4-socket ring configuration:

[Figure: 4-socket ring configuration]

Multiple nodes go into racks or cabinets#

[Figure: IBM Blue Gene/P racks]

[Figure: OLCF Summit]

[Figure: Oak Ridge National Lab Frontier]

Terminology:#

  • Core (virtual and physical): has a single program counter (logically sequential processing of instructions)

  • Memory channel: e.g., DDR4-2400 transfers 64 bits (8 bytes) per transfer at 2400 MT/s (million transfers per second), for a peak of 19.2 GB/s per channel (see the worked example after this list)

  • Socket or CPU: contains multiple cores in a single piece of silicon

  • Non-Uniform Memory Access (NUMA): different channels may be different distances from a core

  • Compute node: one or more sockets, plus memory, network card, etc.
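As a worked example of the memory-channel arithmetic above (the four-channel socket is an illustrative assumption, typical of Xeon E5-era parts):

\[ 8~\text{bytes/transfer} \times 2400~\text{MT/s} = 19.2~\text{GB/s per channel}, \qquad 4~\text{channels} \times 19.2~\text{GB/s} = 76.8~\text{GB/s peak per socket}. \]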

2. Memory#

2.1 Cache#

  • Caches are low-capacity, high-speed memories that are commonly integrated on the CPU die. The need for caches can be easily understood by realizing that data transfer rates to main memory are painfully slow compared to the CPU’s arithmetic performance.

  • While peak performance soars at several GFlops/sec per core, memory bandwidth, i.e., the rate at which data can be transferred from memory to the CPU, is still stuck at a couple of GBytes/sec, which is entirely insufficient to feed all arithmetic units and keep them busy continuously.

  • This tells us that Moore’s Law, which for more than thirty years promised that chip “complexity” (roughly, transistor count) would double about every 24 months regardless of the underlying technology, no longer holds in its original form; and even where transistor counts do keep growing, that growth alone no longer translates into higher performance on modern, complex architectures.

  • To make matters worse, in order to transfer a single data item (usually one or two Double Precision (DP) words) from memory, an initial waiting time called latency passes before data can actually flow. Thus, latency is often defined as the time it takes to “transfer a zero-byte message”.

  • Memory latency is usually of the order of several hundred CPU cycles and is composed of different contributions from memory chips, the chipset and the processor.

  • Advances in memory performance show up at a much slower rate than the rate of improvement in chip complexity. The term DRAM gap has been coined for the increasing “distance” between CPU and memory in terms of latency and bandwidth.

  • Caches can alleviate the effects of the DRAM gap in many cases. Usually there are at least two levels of cache, called L1 and L2, respectively.

  • L1 is normally split into two parts, one for instructions (I-cache, L1I) and one for data (L1D).

  • Outer cache levels are normally unified, storing data as well as instructions.

  • In general, the “closer” a cache is to the CPU’s registers, i.e., the higher its bandwidth and the lower its latency, the smaller it must be to keep administration overhead low.
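On Linux with glibc, the sizes of these cache levels can be queried at run time. The following sketch uses glibc-specific sysconf names (an assumption worth flagging: these are not portable POSIX, and may return 0 or -1 on other systems):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* glibc extensions; returned values are in bytes */
    printf("L1D size:  %ld\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L1D line:  %ld\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    printf("L1I size:  %ld\n", sysconf(_SC_LEVEL1_ICACHE_SIZE));
    printf("L2 size:   %ld\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3 size:   %ld\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    return 0;
}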

How is memory accessed?#

  • Whenever the CPU issues a read request (“load”) for transferring a data item to a register, first-level cache logic checks whether this item already resides in cache.

    • If it does, this is called a cache hit and the request can be satisfied immediately, with low latency.

    • In case of a cache miss, however, data must be fetched from outer cache levels or, in the worst case, from main memory.

    • If all cache entries are occupied, a hardware-implemented algorithm evicts old items from cache and replaces them with new data.

  • The sequence of events for a cache miss on a “write” is generally more involved.

  • Instruction caches are usually of minor importance since scientific codes tend to be largely loop-based; I-cache misses are rare events compared to D-cache misses.

  • Caches can only have a positive effect on performance if the data access pattern of an application shows some locality of reference. More specifically, data items that have been loaded into a cache are to be used again “soon enough” to not have been evicted in the meantime. This is also called temporal locality.

  • Unfortunately, supporting temporal locality is not sufficient. Many applications show streaming patterns where large amounts of data are loaded into the CPU, modified, and written back without the potential of reuse “in time”.

  • In order to reduce the latency penalty for streaming, caches feature a peculiar organization into cache lines.

  • The advantage of cache lines is that the latency penalty of a cache miss occurs only on the first miss on an item belonging to a line. The line is fetched from memory as a whole; neighboring items can then be loaded from cache with much lower latency, increasing the cache hit ratio.

  • So if the application shows some spatial locality, i.e., if the probability of successive accesses to neighboring items is high (as often happens when discretizing Partial Differential Equations (PDEs)), the latency problem can be significantly reduced (see the traversal-order sketch after this list).

  • The downside of cache lines is that erratic data access patterns are not supported.

    • On the contrary, not only does each load incur a miss and subsequent latency penalty, it also leads to the transfer of a whole cache line, polluting the memory bus with data that will probably never be used.

    • The effective bandwidth available to the application will thus be low.

  • On the whole, however, the advantages of using cache lines prevail, and very few processor manufacturers have provided means of bypassing the mechanism.
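The traversal-order sketch referenced above makes spatial locality concrete. In C, a two-dimensional array is stored row by row, so the loop order determines whether successive accesses are neighbors in memory (the array size below is an arbitrary choice, just large enough to exceed typical caches):

#include <stdio.h>

#define N 2048                  /* 2048 x 2048 doubles = 32 MiB, beyond typical L3 */

static double A[N][N];

int main(void) {
    double sum = 0.0;

    /* Row-major traversal (cache-friendly): consecutive j accesses are
       8 bytes apart, so all 8 doubles of each 64-byte line are used
       after a single miss. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += A[i][j];

    /* Column-major traversal (cache-hostile): consecutive i accesses are
       N*8 = 16 KiB apart, so nearly every load misses and drags in a full
       line that is evicted before its neighbors are ever touched. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += A[i][j];

    printf("sum = %f\n", sum);  /* keep the loops from being optimized away */
    return 0;
}

Timing the two loop nests separately (with any wall-clock timer) typically shows the second one running several times slower, even though both execute exactly the same number of additions.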

How we define what limits performance:#

  • When performance is governed by main memory bandwidth and latency, the code is said to be memory-bound.

  • In order for an application to be truly cache-bound, i.e., decouple from main memory so that performance is not governed by memory bandwidth or latency any more, the cache hit ratio (sometimes denoted by \(\gamma\)) must be large enough so the time it takes to process in-cache data becomes larger than the time for reloading it. If and when this happens depends of course on the details of the operations performed.
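A simple average-cost model (our addition, not from the notes above) shows why \(\gamma\) must be very close to 1. If an access costs \(T_c\) cycles on a cache hit and \(T_m\) cycles on a miss, then

\[ T_\mathrm{avg} = \gamma\, T_c + (1 - \gamma)\, T_m . \]

With, say, \(T_c = 4\) and \(T_m = 200\) cycles (hypothetical but plausible values), even \(\gamma = 0.95\) gives \(T_\mathrm{avg} = 13.8\) cycles, more than three times the in-cache cost.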

How your program accesses memory:#

double a[1000];

void foo() {
    for (int i=0; i<1000; i++)
        a[i] = 1.234 * i;
}

The compiler turns the loop body into instructions, which we can examine using the Compiler Explorer (godbolt.org).

pxor xmm0, xmm0                  ; zero the xmm0 register
cvtsi2sd xmm0, eax               ; convert the integer i to double
mulsd xmm0, xmm1                 ; multiply by 1.234 (held in xmm1)
movsd QWORD PTR a[0+rax*8], xmm0 ; store to memory address a[i]

Only one instruction here accesses memory, and the performance will be affected greatly by where that memory resides (which level of cache, where in DRAM).

Most architectures today have 64-byte cache lines: all transfers between main memory (DRAM) and cache operate in units of 64 bytes.

Let’s compare three code samples#

for (int i=0; i<N; i++)
    a[i] = b[i];            // contiguous: all 8 doubles of each 64-byte line are used

for (int i=0; i<N; i++)
    a[i] = b[(i*8) % N];    // stride of 8 doubles = 64 bytes: a new cache line on every load

for (int i=0; i<N; i++)
    a[i] = b[random() % N]; // random: no spatial locality at all
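A minimal timing harness for these three patterns might look as follows (a sketch: POSIX clock_gettime is assumed to be available, N is an arbitrary size chosen to dwarf the caches, and the random() call itself adds overhead to the third loop):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)                 /* 16M doubles = 128 MiB per array */

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    for (int i = 0; i < N; i++) b[i] = i;

    double t = now();
    for (int i = 0; i < N; i++) a[i] = b[i];            /* contiguous */
    printf("contiguous: %.3f s\n", now() - t);

    t = now();
    for (int i = 0; i < N; i++) a[i] = b[(i * 8) % N];  /* one line per load */
    printf("stride-8:   %.3f s\n", now() - t);

    t = now();
    for (int i = 0; i < N; i++) a[i] = b[random() % N]; /* no locality */
    printf("random:     %.3f s\n", now() - t);

    free(a); free(b);
    return 0;
}

On most machines the contiguous copy is fastest; the stride-8 version misses on essentially every load because each access lands on a new 64-byte line; the random version adds TLB misses on top of the cache misses.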

What happens when you request a cache line?#

Operating system effects#

Most systems today use virtual addressing, so every address your program uses must be translated to a physical address before the data can be looked up (in cache or memory). Fortunately, there is hardware to assist with this translation: the Translation Lookaside Buffer (TLB).

[Figure: virtual memory and the page table]
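The translation granularity is the page size, which can be queried portably (unlike the cache sizes above, this sysconf name is standard POSIX):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* typically 4096 bytes on x86-64 Linux; huge pages are larger */
    printf("page size: %ld bytes\n", sysconf(_SC_PAGESIZE));
    return 0;
}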

Further resources:#

List of frequently found acronyms and abbreviations:#

ASCII: American standard code for information interchange
ASIC: Application-specific integrated circuit
BIOS: Basic input/output system
BLAS: Basic linear algebra subroutines
CAF: Co-array Fortran
ccNUMA: Cache-coherent nonuniform memory access
CFD: Computational fluid dynamics
CISC: Complex instruction set computer
CL: Cache line
CPI: Cycles per instruction
CPU: Central processing unit
CRS: Compressed row storage
DDR: Double data rate
DMA: Direct memory access
DP: Double precision
DRAM: Dynamic random access memory
ED: Exact diagonalization
EPIC: Explicitly parallel instruction computing
Flop: Floating-point operation
FMA: Fused multiply-add
FP: Floating point
FPGA: Field-programmable gate array
FS: File system
FSB: Frontside bus
GCC: GNU compiler collection
GE: Gigabit Ethernet
GigE: Gigabit Ethernet
GNU: GNU's not Unix (recursive acronym)
GPU: Graphics processing unit
GUI: Graphical user interface
HPC: High performance computing
HPF: High performance Fortran
HT: HyperTransport
IB: InfiniBand
ILP: Instruction-level parallelism
IMB: Intel MPI benchmarks
I/O: Input/output
IP: Internet protocol
JDS: Jagged diagonals storage
L1D: Level 1 data cache
L1I: Level 1 instruction cache
L2: Level 2 cache
L3: Level 3 cache
LD: Locality domain
LD: Load
LIKWID: "Like I knew what I'm doing"
LRU: Least recently used
LUP: Lattice site update
MESI: Modified/Exclusive/Shared/Invalid
MI: Memory interface
MIMD: Multiple instruction multiple data
MIPS: Million instructions per second
MMM: Matrix–matrix multiplication
MPI: Message passing interface
MPMD: Multiple program multiple data
MPP: Massively parallel processing
MVM: Matrix–vector multiplication
NORMA: No remote memory access
NRU: Not recently used
NUMA: Nonuniform memory access
OLC: Outer-level cache
OS: Operating system
PAPI: Performance application programming interface
PCI: Peripheral component interconnect
PGAS: Partitioned global address space
PLPA: Portable Linux processor affinity
POSIX: Portable operating system interface for Unix
PPP: Pipeline parallel processing
PVM: Parallel virtual machine
QDR: Quad data rate
QPI: QuickPath interconnect
RAM: Random access memory
RISC: Reduced instruction set computer
RFO: Read for ownership
SDR: Single data rate
SIMD: Single instruction multiple data
SISD: Single instruction single data
SMP: Symmetric multiprocessing
SMT: Simultaneous multithreading
SP: Single precision
SPMD: Single program multiple data
SSE: Streaming SIMD extensions
ST: Store
STL: Standard template library
SYSV: Unix System V
TBB: Threading building blocks
TCP: Transmission control protocol
TLB: Translation lookaside buffer
UMA: Uniform memory access
UPC: Unified parallel C