Parallel Computer Architecture and Programming
[[S: Content]]
- S = C: Check, random or related insights
- S = N: Note, some notes
- S = T: Todo, literally
- S = R: Refer, cross reference
- S = Q: Question, some questions, might be collected at the end
Why Parallelism? Why Efficiency?
Common definition: A parallel computer is a collection of processing elements that cooperate to solve problems quickly
speedup ($P$ processors) = execution time (1 processor) / execution time ($P$ processors)
- limited by communication -> improve locality
- limited by imbalance in work assignment -> improve distribution of work
Design & write scalable parallel programs
- decomposing work into pieces that can be performed in parallel
- assigning work to processors
- managing communication/synchronization
Parallel computer hardware implementation
- performance characteristics
- trade-off
Efficiency
Instruction Level Parallelism (ILP)
superscalar execution
Most available ILP is exploited by a processor capable of issuing four instructions per clock
Power wall
- dynamic power $\propto$ capacitive load $\times$ voltage$^2$ $\times$ frequency
- static power: transistors burn power even when inactive due to leakage
Single-core performance scaling
- frequency limited by power
- ILP scaling tapped out
- -> adding more execution units that run in parallel
A Modern Multi-Core Processor
- Multi-core
- use the growing transistor count to add more cores instead of more logic for a single instruction stream
- software creates the additional instruction streams (e.g., via the pthreads API)
- SIMD (Single Instruction, Multiple Data) processing
- amortize cost/complexity of managing an instruction stream across many ALUs
- SSE instructions: 128-bit operations: 4x32 bits or 2x64 bits (4-wide float vectors)
- AVX instructions: 256-bit operations: 8x32 bits or 4x64 bits (8-wide float vectors)
- AVX-512 instructions: 512-bit operations: 16x32 bits (16-wide float vectors)
- SIMD instructions generated by the compiler
- parallelism explicitly requested by the programmer using intrinsics (see the intrinsics sketch after this list)
- parallelism conveyed using parallel language semantics (e.g., OpenMP)
- parallelism inferred by dependency analysis of loops (auto-vectorization, e.g., with `-march=native`)
- explicit SIMD: SIMD parallelization performed at compile time; vector instructions (`vstoreps`, `vmulps`, ...) appear in the binary
- implicit SIMD: compiler generates a scalar binary; hardware runs N instances of the program together on the processor
- interface to the hardware itself is data-parallel
- hardware simultaneously executing the same instruction from multiple instances on different data on SIMD ALUs
- conditional code
- instruction stream coherence (coherent execution): Same instruction sequence applies to all elements operated upon simultaneously
- divergent execution: can’t make efficient use of SIMD processing
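A minimal sketch of explicit SIMD via intrinsics, which also shows how conditional code turns into per-lane masking (assumed standalone example, not from the lecture; `abs4` is a made-up helper; requires SSE4.1):

```
#include <immintrin.h>

// 4-wide SIMD: out[i] = (a[i] < 0) ? -a[i] : a[i] for i in [0, 4)
void abs4(const float* a, float* out) {
    __m128 v    = _mm_loadu_ps(a);                    // load 4 floats
    __m128 neg  = _mm_sub_ps(_mm_setzero_ps(), v);    // compute -a for all lanes
    __m128 mask = _mm_cmplt_ps(v, _mm_setzero_ps());  // per-lane mask: a[i] < 0
    __m128 r    = _mm_blendv_ps(v, neg, mask);        // select per lane (masking)
    _mm_storeu_ps(out, r);                            // store 4 results
}
```

An ISPC-style or implicit-SIMD compiler emits essentially this blend/mask pattern automatically when it vectorizes a divergent `if`.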
- Parallel execution forms
- Multi-core: use multiple processing cores
- Provides thread-level parallelism (TLP): simultaneously execute a completely different instruction stream on each core
- Software decides when to create threads (e.g., via pthreads API)
- SIMD: use multiple ALUs controlled by same instruction stream (within a core)
- Efficient design for data-parallel workloads: control amortized over many ALUs
- Vectorization can be done by compiler (explicit SIMD) or at runtime by hardware
- [Lack of] dependencies is known prior to execution (usually declared by programmer, but can be inferred by loop analysis by advanced compiler)
- Superscalar: exploit ILP within an instruction stream. Process different instructions from the same instruction stream in parallel (within a core)
- Parallelism automatically and dynamically discovered by the hardware during execution (not programmer visible)
- Memory latency: The amount of time for a memory request (e.g., load, store) from a processor to be serviced by the memory system
- Memory bandwidth: The rate at which the memory system can provide data to a processor
- Stall: A processor “stalls” when it cannot run the next instruction in an instruction stream because of a dependency on a previous instruction
- Caches reduce the length of stalls (reduce memory access latency)
- Prefetching reduces stalls (hide latency) [[Check stride prefetcher]]
- Multi-threading reduces stalls (hide latency) (Interleave processing of multiple threads)
- Storing execution contexts
- Many small contexts (high latency hiding ability)
- Few large contexts (low latency hiding ability)
- Hardware-supported multi-threading
- Core manages execution contexts for multiple threads
- Runs instructions from runnable threads (processor makes decision about which thread to run each clock, not the operating system)
- Core still has the same number of ALU resources: multi-threading only helps use them more efficiently in the face of high-latency operations like memory access
- Interleaved multi-threading (a.k.a. temporal multi-threading) [[Check: fine/coarse-grained HT]]
- each clock, the core chooses a thread, and runs an instruction from the thread on the ALUs
- Simultaneous multi-threading (SMT)
- Each clock, core chooses instructions from multiple threads to run on ALUs
- Extension of superscalar (OoO?) CPU
- Use core’s FU resources more efficiently
- Cost
- Requires additional storage for thread contexts (duplicated register, RAT)
- Increases run time of any single thread
- Requires additional independent work in a program
- Relies heavily on memory bandwidth
- larger working set -> less cache space per thread (often shared)
- [[Check: SMT helper thread prefetching]]
- GPU: extreme throughput-oriented processors
- warp: an instruction stream operating on 32 pieces of data at a time (like a thread issuing 32-wide vector instructions)
- different instructions from up to 4 warps executed simultaneously (SMT)
- 64 warps interleaved on the SM (interleaved MT with thread context)
- Bandwidth bound
- Organize computation to fetch data from memory less often
- Reuse data previously loaded by the same thread (intra-thread temporal locality optimization; exploit the cache)
- Share data across threads (inter-thread cooperation, e.g., CUDA `__shared__` memory)
- Request data less often
- arithmetic intensity: ratio of math operations to data access operations in an instruction stream (see the saxpy sketch below)
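A small sketch of how to reason about arithmetic intensity, using saxpy as an assumed example (not from the notes):

```
// saxpy: per element, 2 math ops (mul + add) vs. 12 bytes moved (load x[i], load y[i], store y[i])
// arithmetic intensity ~ 2 ops / 12 bytes ~ 0.17 ops/byte -> bandwidth bound on most machines
void saxpy(int n, float a, const float* x, float* y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```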
Parallel Programming Abstraction
ISPC: Intel SPMD Program Compiler
- `programCount`: number of simultaneously executing instances in the gang (a uniform value)
- `programIndex`: id of the current instance in the gang (a non-uniform value: “varying”)
- `uniform`: a type modifier. All instances have the same value for this variable. Its use is purely an optimization; not needed for correctness.

Interleaved assignment of program instances to loop iterations
```
export void sinx(
    uniform int N,
    uniform int terms,
    uniform float* x,
    uniform float* result)
{
    // assumes N % programCount = 0
    for (uniform int i=0; i<N; i+=programCount) {
        int idx = i + programIndex;
        float value = x[idx];
        float numer = x[idx] * x[idx] * x[idx];
        uniform int denom = 6;   // 3!
        uniform int sign = -1;
        for (uniform int j=1; j<=terms; j++) {
            value += sign * numer / denom;
            numer *= x[idx] * x[idx];
            denom *= (2*j+2) * (2*j+3);
            sign *= -1;
        }
        result[idx] = value;
    }
}
```
Blocked assignment of program instances to loop iterations
```
export void sinx(
    uniform int N,
    uniform int terms,
    uniform float* x,
    uniform float* result)
{
    // assume N % programCount = 0
    uniform int count = N / programCount;
    int start = programIndex * count;
    for (uniform int i=0; i<count; i++) {
        int idx = start + i;
        float value = x[idx];
        float numer = x[idx] * x[idx] * x[idx];
        uniform int denom = 6;   // 3!
        uniform int sign = -1;
        for (uniform int j=1; j<=terms; j++) {
            value += sign * numer / denom;
            numer *= x[idx] * x[idx];
            denom *= (2*j+2) * (2*j+3);
            sign *= -1;
        }
        result[idx] = value;
    }
}
```
A single packed SSE load (`_mm_load_ps1`) can efficiently implement `float value = x[idx]`
Raising level of abstraction
- `foreach`: declares parallel loop iterations; the gang cooperatively performs the loop, and ISPC decides the assignment of iterations to program instances
```
export void sinx(
    uniform int N,
    uniform int terms,
    uniform float* x,
    uniform float* result)
{
    foreach (i = 0 ... N) {
        float value = x[i];
        float numer = x[i] * x[i] * x[i];
        uniform int denom = 6;   // 3!
        uniform int sign = -1;
        for (uniform int j=1; j<=terms; j++) {
            value += sign * numer / denom;
            numer *= x[i] * x[i];
            denom *= (2*j+2) * (2*j+3);
            sign *= -1;
        }
        result[i] = value;
    }
}
```
abstraction vs implementation
- SPMD (Single Program, Multiple Data) abstraction: running a gang spawns `programCount` logical instruction streams (each with a different value of `programIndex`)
- SIMD implementation: the ISPC compiler emits vector instructions (SSE4 or AVX) that carry out the logic performed by an ISPC gang
- handles mapping of conditional control flow to vector instructions (by masking vector lanes)
Assigning a `varying` value to a `uniform` variable causes a compile-time error
Thread Programming Model
ISPC Programming Model
Shared address space communication Model
- Threads communicate by
- reading/writing to shared variables
- manipulating synchronization primitives
- natural extension of sequential programming
- HW implementation: any processor can directly reference any memory location
- Intel/AMD: on-chip network
- AMD hyper-transport / Intel QuickPath (QPI)
- UltraSPARC: crossbar (CCX)
- costly for NUMA & scale
- SGI Altix: fat tree
- can be implemented on machines without HW support via a SW solution (mark all shared pages as invalid; a page fault handler issues network requests)
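A minimal sketch of the shared address space model (assumed example using C++ `std::thread`/`std::mutex` instead of pthreads): threads communicate through an ordinary shared variable plus a synchronization primitive.

```
#include <cstdio>
#include <mutex>
#include <thread>

int shared_counter = 0;   // lives in the single shared address space
std::mutex m;             // synchronization primitive

void worker() {
    for (int i = 0; i < 1000; i++) {
        std::lock_guard<std::mutex> lock(m);  // mutual exclusion
        shared_counter++;                     // plain load/store to shared data
    }
}

int main() {
    std::thread t0(worker), t1(worker);
    t0.join();
    t1.join();
    std::printf("%d\n", shared_counter);      // prints 2000
}
```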
Message passing communication Model
- Threads operate within their own private address spaces
- Threads communicate by sending/receiving messages
- `send`: specifies recipient, buffer to be transmitted, and optional message identifier (“tag”)
- `recv`: specifies sender, buffer to store data, and optional message identifier
- MPI (Message Passing Interface)
- HW implementation: may accelerate message communication
- connect commodity systems together to form large parallel machine (clusters)
- commonly implemented on a shared address space HW (by copying)
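A minimal message-passing sketch using MPI (assumed example; the buffer contents and tag are arbitrary): each rank has a private address space and communicates only by sending/receiving messages.

```
#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int TAG = 0;
    float buf[4] = {0.f, 1.f, 2.f, 3.f};
    if (rank == 0) {
        // send: recipient, buffer to be transmitted, tag
        MPI_Send(buf, 4, MPI_FLOAT, /*dest=*/1, TAG, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // receive: sender, buffer to store data, tag
        MPI_Recv(buf, 4, MPI_FLOAT, /*source=*/0, TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank 1 received %f\n", buf[3]);
    }
    MPI_Finalize();
    return 0;
}
```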
Data parallel communication Model
map(function, collection)
Synchronization is implicit at the end of the `map` (`map` returns when the function has been applied to all elements of the collection), like `foreach` in ISPC
Stream: sequence of elements (can be processed independently)
Kernel: side-effect-free functions
benefits
- global-scale program dependencies known by compiler (enables compiler to perform aggressive optimizations that require global program analysis)
- Independent processing on elements, kernel functions are side-effect free
- parallelize kernel execution
- can’t be non-deterministic
- I/O of each invocation known in advance: prefetching hiding latency
- Producer-consumer dependencies are known in advance: pipelined execution (save bandwidth)
drawbacks
- need library of operators to describe complex data flows
gather/scatter
```
const int N = 1024;
stream<float> input(N);
stream<int> indices;
stream<float> tmp_input(N);
stream<float> output(N);

stream_gather(input, indices, tmp_input);
absolute_value(tmp_input, output);

export void absolute_value(
    uniform float N,
    uniform float* input,
    uniform float* output,
    uniform int* indices)
{
    foreach (i = 0 ... n) {
        float tmp = input[indices[i]];
        if (tmp < 0)
            output[i] = -tmp;
        else
            output[i] = tmp;
    }
}
```
```
const int N = 1024;
stream<float> input(N);
stream<int> indices;
stream<float> tmp_output(N);
stream<float> output(N);

absolute_value(input, tmp_output);
stream_scatter(tmp_output, indices, output);

export void absolute_value(
    uniform float N,
    uniform float* input,
    uniform float* output,
    uniform int* indices)
{
    foreach (i = 0 ... n) {
        if (input[i] < 0)
            output[indices[i]] = -input[i];
        else
            output[indices[i]] = input[i];
    }
}
```
impose rigid program structure to facilitate simple programming and advanced optimization
ISPC, OpenCL, CUDA
- flexibility/familiarity of imperative C-style syntax over the syntax of a more functional form: for adoption
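As an illustration of the `map(function, collection)` abstraction, a sequential C++ stand-in using `std::transform` (assumed example; a real data-parallel system would apply the side-effect-free kernel to elements in parallel and synchronize before returning):

```
#include <algorithm>
#include <vector>

float foo(float x) { return 2.0f * x + 1.0f; }   // side-effect-free "kernel"

std::vector<float> apply_map(const std::vector<float>& A) {
    std::vector<float> B(A.size());
    std::transform(A.begin(), A.end(), B.begin(), foo);  // map foo over A
    return B;   // returns only after foo has been applied to all elements
}
```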
Mixed programming Model
- shared address space model within a multi-core node of a cluster + message passing between nodes
- data-parallel model support shared memory style synchronization primitives in kernels
- permit limited forms of inter-iteration communications
Parallel Programming Basics
- Decomposition: Break up problem into tasks that can be carried out in parallel
- identify dependencies
- static & dynamic
- E.g. loop iteration
- Assignment: Assigning tasks to workers
- balance workload
- reduce communication costs
- static & dynamic
- worker thread pool
- E.g. interleaved assignment, ISPC `foreach`, `pthread`
- Orchestration
- structuring communication
- adding synchronization to preserve dependencies
- organizing data structure in memory
- scheduling tasks
- reduce cost of communication/sync
- preserve locality of data reference
- reduce overhead
- Mapping
- `pthread` -> HW execution context on a CPU core (by the OS)
- ISPC program instance -> vector instruction lane (by the compiler)
- CUDA thread block -> GPU cores (by HW)
- place related threads (cooperating threads) on same processor
- maximize locality, data sharing
- minimize cost of comm/sync
- place unrelated threads on same processor
- hide latency
- Grid solver
| | Data-parallel | Shared address space | Message passing |
|---|---|---|---|
| Synchronization | Single logical thread of control; parallelized forall loop | Mutual exclusion (via locks/atomics); express dependencies (via barriers) | Sending & receiving messages |
| Communication | Implicit loads/stores; special built-in primitives for complex communication | Implicit loads/stores to shared variables | Sending & receiving messages (bulk transfer) |
GPU Architecture & CUDA Programming
Real-time graphics pipeline
shaders
vertex processing
fragment processing
```
uniform sampler2D myTexture;
uniform float3 lightDir;
varying vec3 norm;
varying vec2 uv;

void myFragmentShader() {
    vec3 kd = texture2D(myTexture, uv);
    kd *= clamp(dot(lightDir, norm), 0.0, 1.0);
    return vec4(kd, 1.0);
}
```
CUDA: “C-like” language to express GPU interface
Syntax
```
const int Nx = 12;
const int Ny = 6;

dim3 threadsPerBlock(4, 3, 1);
dim3 numBlocks(Nx/threadsPerBlock.x, Ny/threadsPerBlock.y, 1);

// assume A, B, C are allocated Nx x Ny float arrays

// this call will trigger execution of 72 CUDA threads:
// 6 thread blocks of 12 threads each
matrixAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);

__device__ float doubleValue(float x) {
    return 2 * x;
}

// kernel definition
__global__ void matrixAddDoubleB(float A[Ny][Nx], float B[Ny][Nx], float C[Ny][Nx]) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    C[j][i] = A[j][i] + doubleValue(B[j][i]);
}
```
host: serial execution
CUDA device: SPMD execution
- `__global__`: marks a CUDA kernel function that runs on the GPU
- `threadIdx`: position of the thread within its block
- `blockIdx`: position of the block within the grid

```
float* A = new float[N];       // allocate buffer in host mem

// populate host address space pointer A
for (int i=0; i<N; i++)
    A[i] = (float)i;

int bytes = sizeof(float) * N;
float* deviceA;                // allocate buffer in
cudaMalloc(&deviceA, bytes);   // device address space

// populate deviceA
cudaMemcpy(deviceA, A, bytes, cudaMemcpyHostToDevice);

// note: directly accessing deviceA[i] is an invalid operation here
// (cannot manipulate contents of deviceA directly from host, only from
// device code, since deviceA is not a pointer into the host's address space)
```
- `cudaMemcpy`: copy memory between the host and device address spaces
1D Convolution
```
#define THREADS_PER_BLK 128

int N = 1024 * 1024;
cudaMalloc(&devInput, sizeof(float) * (N+2));   // allocate input array in device memory
cudaMalloc(&devOutput, sizeof(float) * N);      // allocate output array in device memory

// properly initialize contents of devInput here ...

convolve<<<N/THREADS_PER_BLK, THREADS_PER_BLK>>>(N, devInput, devOutput);

__global__ void convolve(int N, float* input, float* output) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;  // thread-local variable
    float result = 0.0f;                                // thread-local variable
    for (int i=0; i<3; i++)
        result += input[index + i];
    output[index] = result / 3.f;
}
```
```
__global__ void convolve(int N, float* input, float* output) {
    __shared__ float support[THREADS_PER_BLK+2];        // per-block allocation
    int index = blockIdx.x * blockDim.x + threadIdx.x;  // thread-local variable

    support[threadIdx.x] = input[index];
    if (threadIdx.x < 2) {
        support[THREADS_PER_BLK + threadIdx.x] = input[index+THREADS_PER_BLK];
    }

    __syncthreads();

    float result = 0.0f;   // thread-local variable
    for (int i=0; i<3; i++)
        result += support[threadIdx.x + i];
    output[index] = result / 3.f;
}
```
- `__shared__`: per-block shared memory
- `__syncthreads()`: barrier; wait for all threads in the block to arrive
- Each thread block must execute 128 CUDA threads
- Each thread block requires `130 x sizeof(float) = 520` bytes of shared memory
Execution: thread hierarchy
- thread - warp - thread block - streaming multiprocessor (SM, core)
- thread block execution can be carried out in any order
- dynamic scheduling
Distributed address space
- built-in memcpy primitives between host & device
- per-thread / per-block (`__shared__`) / per-program (`__global__`) / constant (`__constant__`) memory
SM core operation each clock
- Select up to four runnable warps from 64 resident on SM core (thread-level parallelism)
- Select up to two runnable instructions per warp (instruction-level parallelism)
Warp: CUDA implementation detail
- On modern NVIDIA hardware, groups of 32 CUDA threads in a thread block are executed simultaneously using 32-wide SIMD execution
- share same instruction stream
- The group of 32 threads sharing an instruction stream is called a warp
Running the CUDA kernel
- Step 1: host sends CUDA device (GPU) a command (“execute this kernel”)
- Step 2: scheduler maps block 0 to core 0 (reserves execution contexts for 128 threads and 520 bytes of shared storage)
- Step 3: scheduler continues to map blocks to available execution contexts
- Step 4: core completes block, reschedule
Question: Explain the statement: The GTX 1080 SM combines the ideas of interleaved multi-threading, simultaneous multi-threading, SIMD execution, and super-scalar execution.
The GTX 1080 SM combines the ideas of interleaved multi-threading, simultaneous multi-threading, SIMD execution, and super-scalar execution in the following manner:
- Interleaved multi-threading: since up to 64 warps can be active at any time, multiple threads are interleaved depending on which 4 warps are chosen each cycle.
- Simultaneous multi-threading: each SM has four warp selectors, and each warp selector can choose to issue instructions from a different thread.
- SIMD execution: the SIMD functional units in each SM perform SIMD operations.
- Super-scalar execution: each cycle, a warp selector can issue up to 2 instructions from a warp, supporting super-scalar execution.
- Up to 64*32=2048 CUDA threads can be active on the SM at a time
- But only 4*32=128 CUDA threads are executed simultaneously.
- Groups of 32 CUDA threads share an instruction stream. (and are executed in a SIMD manner)
- The SM may choose to execute 2 instructions from a CUDA thread at once when an appropriate mixture is available. (And of course each of those two instructions are executed in a SIMD manner as described above.)
Modern GPUs, to support features like virtualization and CUDA dynamic parallelism, are beginning to support limited forms of software-managed thread pre-emption, but we will not talk about it in this class.
pinned memory
atomic op on different memory (global, L1, shared)
pipelined device overlap (stream)
memory mapping (DMA)
data representation (im2col, CSR, ELL, JDS…)
reduction algorithm (winograd, combination tree, Kogge-stone, Brent Kung, …)
unrolling (cache coherence, memory coalescing)
shared/constant/restrict/global memory
synchronization (host, device, thread, block, memory barrier)
warp SIMD (coalescing, adjacent thread adjacent memory)
- branch divergence (granularity % warp)
cache (hierarchy, size, scratchpad, TSM)
SM thread num / memory limit
memory bank conflict
non-cachable load
DRAM burst (detail of memory coalescing)
multiple-gpu
Performance Optimization I: Work Distribution & Scheduling
Balancing the workload
- low computational & synchronization overhead
- Static assignment: pre-determined (usually at compile time, but may also be decided at runtime)
- when cost & amount of work is predictable
- Semi-static assignment: periodically profile & re-adjust
- cost of work is predictable for near-term future
- Dynamic assignment: runtime
- work queue: tasks - shared work queue - worker threads
- choosing task granularity
- small granularity: useful to have many more tasks than processors (many small tasks enable good workload balance via dynamic assignment)
- large granularity: want as few tasks as possible to minimize the overhead of managing the assignment
- task scheduling
- multiple queues with work stealing
- task dependencies managed by the scheduler
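A minimal sketch of dynamic assignment via a shared work queue (assumed example): worker threads repeatedly pull the next task, so faster workers automatically take on more tasks and the load balances at run time.

```
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

std::queue<std::function<void()>> work_queue;   // tasks are pushed here before run_tasks()
std::mutex queue_mutex;

void worker_loop() {
    while (true) {
        std::function<void()> task;
        {
            std::lock_guard<std::mutex> lock(queue_mutex);
            if (work_queue.empty()) return;     // no work left for this worker
            task = std::move(work_queue.front());
            work_queue.pop();
        }
        task();                                 // run the task outside the lock
    }
}

void run_tasks(int num_workers) {
    std::vector<std::thread> workers;
    for (int i = 0; i < num_workers; i++) workers.emplace_back(worker_loop);
    for (auto& t : workers) t.join();
}
```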
Common parallel programming patterns
data parallelism
```
// ISPC foreach
foreach (i=0 ... N) {
    B[i] = foo(A[i]);
}

// ISPC bulk task launch
launch[numTasks] myFooTask(A, B);

// bulk CUDA thread launch
foo<<<numBlocks, threadsPerBlock>>>(A, B);

// using higher-order function 'map'
map(foo, A, B);

// openMP parallel for
#pragma omp parallel for
for (int i=0; i<N; i++) {
    B[i] = foo(A[i]);
}
```
explicit management via threads
fork/join pattern - divide & conquer
Cilk Plus
- `cilk_spawn foo(args)`: invoke foo, but unlike a standard function call, the caller may continue executing asynchronously with the execution of foo
- hides the implementation, like `std::async`
- run continuation first (child stealing): caller thread spawns work for all iterations before executing any of it (BFS)
- run child first (continuation stealing): caller thread only creates one item to steal (a continuation that represents all remaining iterations) (DFS)
- `cilk_sync`: returns when all calls spawned by the current function have completed (“sync up” with the spawned calls)
- stall join
- greedy policy: when the thread that initiates the fork goes idle, it looks to steal new work; the last thread to reach the join point continues execution after the sync
- all threads always attempt to steal if there is nothing to do (a thread only goes idle if no work to steal is present in the system)
- the worker thread that initiated the spawn may not be the thread that executes the logic after `cilk_sync`
```
void quick_sort(int* begin, int* end) {
    if (begin >= end - PARALLEL_CUTOFF)
        std::sort(begin, end);
    else {
        int* middle = partition(begin, end);
        cilk_spawn quick_sort(begin, middle);
        quick_sort(middle+1, end);
    }
}
```
heavyweight spawn?
context switching overhead
larger working set, less cache locality
-> using worker thread pool + work queue per thread
work stealing: deque per worker
- thieves steal work from the top of the deque
- steals the largest amount of work
- maximum locality (combined with run-child-first)
- thieves don’t contend with the owner for the same elements of the deque (enables an efficient lock-free implementation)
Performance Optimization II: Locality, Communication & Contention
Message passing grid solver
```
int N;
int tid = get_thread_id();
int rows_per_thread = N / get_num_threads();

float* localA = allocate(rows_per_thread+2, N+2);

// assume localA is initialized with starting values
// assume MSG_ID_ROW, MSG_ID_DONE, MSG_ID_DIFF are constants used as msg ids

//////////////////////////////////////

void solve() {
    bool done = false;
    while (!done) {
        float my_diff = 0.0f;

        if (tid != 0)
            send(&localA[1,0], sizeof(float)*(N+2), tid-1, MSG_ID_ROW);
        if (tid != get_num_threads()-1)
            send(&localA[rows_per_thread,0], sizeof(float)*(N+2), tid+1, MSG_ID_ROW);
        if (tid != 0)
            recv(&localA[0,0], sizeof(float)*(N+2), tid-1, MSG_ID_ROW);
        if (tid != get_num_threads()-1)
            recv(&localA[rows_per_thread+1,0], sizeof(float)*(N+2), tid+1, MSG_ID_ROW);

        for (int i=1; i<rows_per_thread+1; i++) {
            for (int j=1; j<N+1; j++) {
                float prev = localA[i,j];
                localA[i,j] = 0.2 * (localA[i-1,j] + localA[i,j] + localA[i+1,j] +
                                     localA[i,j-1] + localA[i,j+1]);
                my_diff += fabs(localA[i,j] - prev);
            }
        }

        if (tid != 0) {
            send(&my_diff, sizeof(float), 0, MSG_ID_DIFF);
            recv(&done, sizeof(bool), 0, MSG_ID_DONE);
        } else {
            float remote_diff;
            for (int i=1; i<get_num_threads()-1; i++) {
                recv(&remote_diff, sizeof(float), i, MSG_ID_DIFF);
                my_diff += remote_diff;
            }
            if (my_diff/(N*N) < TOLERANCE)
                done = true;
            for (int i=1; i<get_num_threads()-1; i++)
                send(&done, sizeof(bool), i, MSG_ID_DONE);
        }
    }
}
```
synchronous & blocking -> deadlock!
- `send()`: call returns when the sender receives acknowledgement that the message data resides in the address space of the receiver
- `recv()`: call returns when data from the received message is copied into the address space of the receiver and an acknowledgement is sent back to the sender
asynchronous & non-blocking
- `send()`: call returns immediately; the buffer provided to `send()` cannot be modified by the calling thread since message processing occurs concurrently with thread execution
- `recv()`: posts intent to receive in the future, returns immediately
- use `checksend()`, `checkrecv()` to determine the actual status of the send/receipt
Pipelining
Communication as extended memory hierarchy
- inherent communication: must occur in a parallel algorithm
- good assignment decision
- artifactual communication: all other communication (from practical details of system)
- minimum data transfer granularity
- unnecessary communication due to system transfer granularity (e.g., a whole cache line or OS page is moved for a few bytes)
- poor placement of data in distributed memory (reside far away)
- finite replication capacity
comm-to-comp ratio = communication / computation = 1 / arithmetic intensity
- Cache miss
- Cold miss: First time data touched. Unavoidable in a sequential program
- Capacity miss: Working set is larger than cache. Can be avoided/reduced by increasing cache size
- Conflict miss: Miss induced by cache management policy. Can be avoided/reduced by changing cache associativity, or data access pattern in application
- Communication miss: Due to inherent or artifactual communication in parallel system
- Reduce communication
- temporal locality
- blocked iteration order (avoid cache ping-pong)
- fusing loops
- combine multiple loops to reduce I/O
- may cause conflict miss
- may use FMA operations
- sharing data
- co-locate tasks operating same data
- E.g. CUDA thread block
- spatial locality
- granularity of data transfer
- blocked data layout
- E.g. im2col, CSR/CSC (compressed sparse row/column)
- blocked data layout
- granularity of cache coherence
- Reduce overhead of communication to sender/receiver
- coalescing small messages
- Reduce latency of communication
- restructure code (SW), improve arch (HW)
- Reduce contention
- Replicate contended resources (e.g., local copies, fine-grained locks)
- Stagger access to contended resources
- Increase communication/computation overlap
- Application writer: use asynchronous communication (e.g., async messages)
- HW implementer: pipelining, multi-threading, pre-fetching, out-of-order exec
Contention
memory contention in CUDA
```
__shared__ float A[512];
int index = threadIdx.x;

float x2 = A[index];        // single cycle
float x3 = A[3 * index];    // single cycle
float x4 = A[16 * index];   // 16 cycles
```
The first two loads take only one clock cycle because each index falls on a different bank of shared memory, so all 32 values can be read at once. The last one, however, maps half of the threads to one bank and half to another, so each of the two banks must be read 16 times to get the 32 values, since a bank can only deliver one value per clock.
- why a power-of-two stride is sometimes bad: 1. cache associativity (conflict misses) 2. memory bank contention
N-body: particles within cells in a grid
- parallelize by cell (insufficient parallelism)
- parallelize over particles (massive contention for single shared data)
- finer-granularity locks (lock per cell)
- partial results + merge (extra work/memory footprint)
- data-parallel
- Step 1: compute cell containing each particle (parallel over input particles)
- Step 2: sort results by cell (particle index array permuted based on sort)
- Step 3: find start/end of each cell (parallel over particle_index elements)
Parallel Application Case Studies
Ocean Currents Simulation (grid-based solver)
- Decomposition: Spatial partitioning of grid: each processor receives 2D tile of grid
- Assignment: Static assignment of tiles to processors
- Synchronization: Barriers (each pass over the grid is a separate phase of computation) + locks for mutual exclusion when updating shared variables (atomic update of ‘diff’)
- Working Set
- Local neighborhood for cell
- Three rows of a processor’s local partition of grid
- Processor’s local partition of grid
- Observation
- Static assignment is sufficient (approximately equal busy time per thread)
- 4D blocking (block-major) of grid reduces time spent on communication
- Synchronization cost is largely due to waiting at barriers
Galaxy evolution
Barnes-Hut Algorithm
```
for each time step in simulation:
    build tree structure
    compute (aggregate mass, center-of-mass) for interior nodes
    for each particle:
        traverse tree to accumulate gravitational forces
        update particle position based on gravitational forces
```
amount of work is non-uniform
comm pattern is non-uniform
Assignment: semi-static assignment
- Each time step, for each body, record the number of interactions with other bodies (the application profiles itself)
- cost zones
- Each processor performs a depth-first (post-order) traversal of the tree
- Processor $P_i$ is responsible for processing the bodies corresponding to the work range $[iW/P, (i+1)W/P)$
Working sets
- data needed to compute forces between body-body (or body-node) pairs
- data encountered in an entire tree traversal
Data Distribution: static distribution
- Work assignment changes with time
- Data accessed at fine granularity (a single tree node)
- high temporal locality
- Unlike OCEAN, data distribution in Barnes-Hut does not significantly impact performance
Data-parallel scan
scan_inclusive(⊕, A) = [a0, a0⊕a1, a0⊕a1⊕a2, ...]
scan_exclusive(⊕, A) = [I, a0, a0⊕a1, ...]
- [[Check: doubling, `par_seq`]]
```
// Up-sweep
for d = 0 to (log2 n - 1) do
    forall k = 0 to n-1 by 2^(d+1) do
        a[k + 2^(d+1) - 1] = a[k + 2^d - 1] + a[k + 2^(d+1) - 1]

// Down-sweep
a[n-1] = 0
for d = (log2 n - 1) down to 0 do
    forall k = 0 to n-1 by 2^(d+1) do
        tmp = a[k + 2^d - 1]
        a[k + 2^d - 1] = a[k + 2^(d+1) - 1]
        a[k + 2^(d+1) - 1] = tmp + a[k + 2^(d+1) - 1]
```
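A sequential C++ sketch of the work-efficient up-sweep/down-sweep exclusive scan above (assumes the input length is a power of two; each inner loop corresponds to a `forall` that a parallel implementation would execute concurrently):

```
#include <cstddef>
#include <vector>

void exclusive_scan(std::vector<int>& a) {       // in-place, '+' as the operator
    const size_t n = a.size();
    // up-sweep (reduce): d plays the role of 2^d in the pseudocode
    for (size_t d = 1; d < n; d *= 2)
        for (size_t k = 0; k + 2*d - 1 < n; k += 2*d)
            a[k + 2*d - 1] += a[k + d - 1];
    // down-sweep
    a[n - 1] = 0;
    for (size_t d = n / 2; d >= 1; d /= 2) {
        for (size_t k = 0; k + 2*d - 1 < n; k += 2*d) {
            int tmp = a[k + d - 1];
            a[k + d - 1] = a[k + 2*d - 1];
            a[k + 2*d - 1] += tmp;
        }
    }
}
// exclusive_scan on {1, 2, 3, 4} yields {0, 1, 3, 6}
```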
```
__device__ void scan_block(volatile int *ptr, const unsigned int idx) {
    const unsigned int lane = idx & 31;        // index of thread in warp (0..31)
    const unsigned int warp_id = idx >> 5;     // warp index in block

    int val = scan_warp(ptr, idx);             // Step 1. per-warp partial scan
                                               // (performed by all threads in block,
                                               // with threads in same warp communicating
                                               // through shared memory buffer 'ptr')

    if (lane == 31) ptr[warp_id] = ptr[idx];   // Step 2. thread 31 in each warp copies
    __syncthreads();                           // partial-scan bases into per-block shared mem

    if (warp_id == 0) scan_warp(ptr, idx);     // Step 3. scan to accumulate bases
    __syncthreads();                           // (only performed by warp 0)

    if (warp_id > 0)                           // Step 4. apply bases to all elements
        val = val + ptr[warp_id-1];            // (performed by all threads in block)
    __syncthreads();

    ptr[idx] = val;
}
```
Segmented scan
Simultaneously perform scans on arbitrary contiguous partitions of input collection
```
// Up-sweep
for d = 0 to (log2 n - 1) do:
    forall k = 0 to n-1 by 2^(d+1) do:
        if flag[k + 2^(d+1) - 1] == 0:
            data[k + 2^(d+1) - 1] = data[k + 2^d - 1] + data[k + 2^(d+1) - 1]
        flag[k + 2^(d+1) - 1] = flag[k + 2^d - 1] || flag[k + 2^(d+1) - 1]

// Down-sweep
data[n-1] = 0
for d = (log2 n - 1) down to 0 do:
    forall k = 0 to n-1 by 2^(d+1) do:
        tmp = data[k + 2^d - 1]
        data[k + 2^d - 1] = data[k + 2^(d+1) - 1]
        if flag_original[k + 2^d] == 1:        # must maintain copy of original flags
            data[k + 2^(d+1) - 1] = 0          # start of segment
        else if flag[k + 2^d - 1] == 1:
            data[k + 2^(d+1) - 1] = tmp
        else:
            data[k + 2^(d+1) - 1] = tmp + data[k + 2^(d+1) - 1]
        flag[k + 2^d - 1] = 0
```
Parallel Ray Tracing
- Given a “ray”, find the closest intersection with scene geometry
- use hierarchical grouping adapts to non-uniform density of scene objects
- use packets: amortize data fetching & work, unsuitable for incoherent rays
- ray reordering: when packet utilization drops below threshold, resort rays and continue with smaller packet
- Use large packets for eye/reflection/point-light shadow rays or higher levels of the BVH
- ray coherence always high at the top of the tree
- Switch to single ray (intra-ray SIMD) when packet utilization drops below threshold
- For wide SIMD machine, a branching-factor-4 BVH works well for both packet traversal and single ray traversal
- Can use packet reordering to postpone time of switch
- Reordering allows packets to provide benefit deeper into the tree
- Not often used in practice due to high implementation complexity
Workload-Driven Performance Evaluation
- [[Check: AutoFDO, PGO]]
- Application-oriented scaling (last lecture)
- Resource-oriented scaling (this lecture)
- Problem-constrained scaling
- use a parallel computer to solve the same problem faster
- Time-constrained scaling
- completing more work in a fixed amount of time
- E.g. rendering, telescope, finance, websites, robotics
- Memory-constrained scaling
- run the largest problem possible without overflowing main memory
- E.g. N-body, machine learning, memcached
| | PC | TC | MC |
|---|---|---|---|
| Scaled grid size | $\sqrt{N^2 / P}$ | $K$: assume linear speedup, $K^3/P = N^3$, so $K = NP^{1/3}$ | $NP^{1/2}$ |
| Exec time | $1/P$ | fixed at $O(N^3)$ (by definition) | $O((NP^{1/2})^3 / P) = O(P^{1/2})$ |
| Elements per processor | $N^2/P$ | $K^2/P = N^2 / P^{1/3}$ | fixed at $N^2$ (by definition) |
| Available concurrency | fixed at $P$ | | |
| Comm-to-comp ratio | $O(P^{1/2})$ (2D-blocked) | comm per processor $K / P^{1/2} = O(1/P^{1/6})$; ratio $O(P^{1/6})$ | fixed at $1/N$ |
- Simulator
- Execution-driven
- Trace-driven
- Instrument real code running on real machine to record a trace of all memory accesses
- Intel PIN
- Execution-driven
- Problem size = $(n, \Theta, \Delta t, T)$ (grid size, accuracy, time step, total time to simulate)
- Scaling down
- preserve ratio of time spent in different program phases
- Ray-trace and Barnes-Hut: both have tree build and tree traverse phases
- preserve important behavioral characteristics
- arithmetic intensity, load balance, locality, working set sizes
- preserve contention & communication patterns
- contention is a function of timing and ratios
- preserve scaling relationship between problem parameters
- Barnes-Hut: scaling up particle count requires scaling down time step for physics reasons
- preserve ratio of time spent in different program phases
- Roofline model
- Establish high watermarks
- Add “math” (non-memory instructions)
- Does execution time increase linearly with operation count as math is added? (If so, this is evidence that code is instruction-rate limited)
- Remove almost all math, but load same data
- How much does execution-time decrease? If not much, suspect memory bottleneck
- Change all array accesses to A[0]
- How much faster does your code get? (This establishes an upper bound on the benefit of improving locality of data access)
- Remove all atomic operations or locks
- How much faster does your code get? (provided it still does approximately the same amount of work) (This establishes an upper bound on the benefit of reducing sync overhead.)
- Add “math” (non-memory instructions)
- Profiler
- Intel’s Performance Counter Monitor Tool
- Intel VTune, PAPI, oprofile
- performance counter: instructions completed, clock ticks, L2/L3 cache hits/misses, bytes read from memory controller, etc.
Snooping-Based Cache Coherence
cache coherence: Reading a value at address X should return the last value written to address X by any processor
- device I/O via direct memory access (DMA) -> mark page table uncachable [[Check: what every programmer should know about memory]]
memory coherence: both global storage (main memory) and per-processor local storage (processor caches) implementing the abstraction of a single shared address space
- for each memory location, there is a hypothetical serial order of all program operations (executed by all processors) to the location that is consistent with the results of execution
- Memory operations issued by any one processor occur in the order issued by the processor
- The value returned by a read is the value written by the last write to the location as given by the serial order
- A memory system is coherent if
- A read by processor P to address X that follows a write by P to address X, should return the value of the write by P (assuming no other processor wrote to X in between)
- obey program order
- A read by processor P1 to address X that follows a write by processor P2 to X returns the written value… if the read and write are “sufficiently separated” in time (assuming no other write to X occurs in between)
- “write propagation”: the write eventually becomes visible to other processors (cf. eventual consistency)
- “Write serialization”: writes to the same address are serialized: two writes to address X by any two processors are observed in the same order by all processors. (Example: if values 1 and then 2 are written to address X, no processor observes X having value 2 before value 1)
- Implementation
- Software: use page faults to propagate coherence information -> used on clusters of workstations
- Hardware: snooping/directory-based
Snooping cache coherence schemes: broadcast a notification to all other cache controllers (decentralized)
shared cache
limited scaling
write-through invalidation
write-back invalidation
- A line in the “exclusive” state can be modified without notifying the other caches
- Processor can only write to lines in the exclusive state
- MSI (see the toy state-transition sketch after this list)
- Write propagation: achieved via a combination of invalidation on BusRdX and flush from the M state on a subsequent BusRd/BusRdX from another processor
- Write serialization
- Writes that appear on interconnect are ordered by the order they appear on interconnect (BusRdX)
- Reads that appear on interconnect are ordered by order they appear on interconnect (BusRd)
- Writes that don’t appear on the interconnect (PrWr to line already in M state)
- Sequence of writes to line comes between two interconnect transactions for the line
- All writes in sequence performed by same processor, P (that processor certainly observes them in correct sequential order)
- All other processors observe notification of these writes only after an interconnect transaction for the line, so all the writes come before the transaction
- -> all processors see writes in the same order
- MESI: optimize reading-then-writing (I ->(BusRd) S ->(BusRdX) M)
- add “exclusive clean” E (not-modified, but exclusive)
- MESIF: Intel 5-stage
- MESI + F(Forward) by one cache
- Cache with line in F state services miss
- Simplifies the decision of which cache should service the miss (basic MESI: all caches respond)
- complex, for clusters
- MOESI: AMD Opteron 5-stage
- In the MESI protocol, a transition from M to S requires a flush to memory
- Instead, transition from M to O (O = ”owned, but not exclusive”) and do not flush to memory
- Other processors maintain shared line in S state, one processor maintains line in O state
- Data in memory is stale, so cache with line in O state must service cache misses
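The toy MSI state-transition sketch referenced above (assumed illustration; it tracks a single line in a single cache and omits the data-movement/flush side of the protocol):

```
enum class LineState { Invalid, Shared, Modified };

// Transition on a local processor read or write (PrRd / PrWr)
LineState on_processor_access(LineState s, bool is_write) {
    if (is_write) return LineState::Modified;                  // PrWr: issue BusRdX unless already M
    return (s == LineState::Invalid) ? LineState::Shared : s;  // PrRd: issue BusRd if Invalid
}

// Transition on a snooped bus transaction from another processor
LineState on_snoop(LineState s, bool other_wants_exclusive) {
    if (other_wants_exclusive) return LineState::Invalid;      // BusRdX: invalidate (flush if Modified)
    if (s == LineState::Modified) return LineState::Shared;    // BusRd: flush data, downgrade M -> S
    return s;
}
```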
write-back update
Dragon
Exclusive-clean (E): only one cache has line, memory up-to-date
Shared-clean (SC): multiple caches may have the line, and memory may or may not be up to date
Shared-modified (SM): multiple caches may have the line, memory not up to date
- Only one cache can be in this state for a given line (but others can be in SC)
- Cache with line in SM state is “owner” of data. Must update memory upon eviction
Modified (M): only one cache has line, it is dirty, memory is not up to date
- Cache is owner of data. Must update memory upon replacement
no point in time any processor will have an invalid value
low miss rate
suffer from high traffic due to multiple writes before the next read by another processor
Inclusion
- All lines in closer [to processor] cache are also in farther [from processor] cache
- If line is in owned state (M in MSI/MESI) in L1, it must also be in owned state in L2
- When line X is invalidated in L2 cache due to BusRdX from another cache, must also invalidate line X in L1
- each L2 line contains an additional state bit indicating if line data in L1
- propagation
- Assume L1 is a write-back cache. The processor writes to line X (L1 write hit). Line X in the L2 cache is in the modified state in the coherence protocol, but it holds stale data. When the coherence protocol requires X to be flushed from L2 (e.g., another processor loads X), the L2 cache must request the up-to-date data from L1.
- each L2 line contains an additional state bit (“modified-but-stale”)
- flushing requires getting the up-to-date data from L1 first
- GPU: write-through, no coherence
False sharing: Condition where two processors write to different addresses, but addresses map to the same cache line
- Cache line “ping-pongs” between the caches of the writing processors, generating significant amounts of communication due to the coherence protocol [[Note: true sharing could also lead to this]]
- Are cache-line-ping-pong and false sharing the same?
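A minimal sketch of false sharing and the usual padding fix (assumed example; a 64-byte cache line is assumed):

```
#include <thread>

struct Counters {          // 'a' and 'b' share one cache line:
    long a;                // written by thread 0
    long b;                // written by thread 1 -> line ping-pongs between caches
};

struct PaddedCounters {    // fix: give each hot variable its own cache line
    alignas(64) long a;
    alignas(64) long b;
};

void demo() {
    PaddedCounters c{};
    std::thread t0([&c] { for (int i = 0; i < 1000000; i++) c.a++; });
    std::thread t1([&c] { for (int i = 0; i < 1000000; i++) c.b++; });
    t0.join();
    t1.join();
}
```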
the abstraction of a single shared address space does not imply implementation by a single storage unit
Directory-Based Cache Coherence
cc-NUMA: cache-coherent, non-uniform memory access
Distributed Shared Memory system (DSM): cache coherent, shared address space, physically distributed memories
Scalable cache coherence
hierarchical snooping
- simple to build
- root can be a bottleneck
- longer latency
- can’t apply to more general network topologies
directories: storing information about the status of the line in one place: a “directory” (centralized)
point-2-point messages instead of broadcast
- storage overhead: $O(PM)$
- increase cache line size (decrease $M$)
- group multiple processors into a single directory “node” (reduce $P$)
- hierarchical: use snooping among processors in a node, directory across nodes
- limited pointer schemes: storage for a limited number of pointers per directory entry should be sufficient (only need a list of the nodes holding a valid copy of the line!)
- overflow
- fallback to broadcast
- just don’t allow
- coarse vector fallback
- overflow
- sparse directories
- Directory at home node maintains pointer to only one node caching line (not a list of sharers)
- Pointer to next node in list is stored as extra information in the cache line (just like the line’s tag, dirty bits, etc)
- On read miss: add requesting node to head of list
- On write miss: propagate invalidations along list
- On evict: need to patch up list (linked list removal)
- Low memory storage overhead, additional directory storage is proportional to cache size, traffic on write is proportional to # of sharers
- Latency of a write is proportional to # of sharers, hard to implement
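A sketch of a full bit-vector directory entry for a hypothetical P = 64 node system (assumed illustration), which makes the $O(PM)$ storage overhead concrete: one such entry is needed per memory line.

```
#include <bitset>
#include <cstdint>

struct DirectoryEntry {
    std::bitset<64> presence;   // presence[i] == true: node i caches the line (P bits)
    bool dirty = false;         // some node holds the line in the modified state
    uint8_t owner = 0;          // meaningful only when dirty: which node owns the line
};
```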
distributed directory
home node: node with memory holding the corresponding data for the line
read miss
- read miss message sent to home node of the requested line
- home directory checks entry for line
- dirty bit OFF: respond with contents from memory, set presence[home] (line is cached by processor home)
- dirty bit ON: data must be sourced by another processor
- home node responds with message providing identity of line owner
- requesting node requests data from owner
- owner changes state to SHARED, responds to the requesting node
- owner also responds to the home node; home clears the dirty bit, updates presence bits, updates memory
write miss
- request -> home: write miss message
- home -> request: share ids + data
- request -> ids: invalidate
- ids -> request: acks
Intervention forwarding: read miss to dirty line (forwarded by home) [[Check: EECS470 hop]]
- request -> home: read miss message
- home -> owner: intervention read
- owner -> home: respond with data + dir revision
- home updates directory
- home -> request: data
Request forwarding: read miss to dirty line
request -> home: read miss message
home -> owner: request send data to requestor
owner -> home + request: data
iterative DNS -> recursive DNS
Access pattern
- “Mostly-read” objects: lots of sharers but writes are infrequent, so communicating with all of them on a write has minimal impact on performance (e.g., root node in Barnes-Hut)
- Migratory objects (one processor reads/writes for while, then another, etc.): very few sharers, count does not scale with number of processors
- Frequently read/written objects: frequent invalidations, but sharer count is low because count cannot build up in short time between invalidations (e.g, shared task queue)
- Low-contention locks: infrequent invalidations, so no performance problem
- High-contention locks: can be a challenge, because many readers present when lock released
Multi-socket Intel System
Xeon Phi (Knights Landing, KNL)
A Basic Snooping-Based Multi-Processor Implementation
- Deadlock: a state where a system has outstanding operations to complete, but no operation can make progress
- Mutual exclusion: only one processor can hold a given resource at once
- Hold and wait: processor must hold the resource while waiting for other resources needed to complete an operation
- No preemption: processors don’t give up resources until operation they wish to perform is complete
- Circular wait: waiting processors have mutual dependencies (a cycle exists in the resource dependency graph)
- Livelock: a state where a system is executing many operations, but no thread is making meaningful progress
- Starvation: State where a system is making overall progress, but some processes make no progress
- Transaction on an atomic bus -> abstraction, actually non-atomic
- Client is granted bus access (result of arbitration)
- Client places command on bus (may also place data on bus)
- Response to command by another bus client placed on bus
- Next client obtains bus access (arbitration)
- Multi-processor cache controller -> implementation
- contention: requests from both processor & bus
- cache duplicate tags
- multi-ported tag memory
- tags must sync -> update blocking
- Reporting snoop results
- Address
- Data
- Shared: OR of the results from all caches
- Dirty: OR of the results from all caches
- Snoop-pending: OR of result from all processors (0 indicates all processors have responded)
- Memory controller could immediately start accessing DRAM, but not respond (squelch response) if a snoop result from another cache indicates it has copy of most recent data
- Memory could assume one of the caches will service request until snoop results are valid (if snoop indicates no cache has data, then memory must respond)
- Handling cache line write-back
- evict -> write-back buffer
- load requested line
- later flush back
- A write commits when a read-exclusive transaction appears on bus and is acknowledged by all other caches
store x <- R0
- lookup line in cache
- arbitrate for bus
- place `BusRdX` on the bus / other processors snoop the request
- memory responds with data for the cache line containing `X`
- contents of `R0` written to the appropriate bytes of the cache line
- the order of transactions on the bus defines the global order of writes in the parallel program (write serialization requirement of coherence)
- Split-transaction bus
- request + response
- How to match requests with responses?
- How to handle conflicting requests on the bus?
- Flow control: how many requests can be outstanding at a time, and what should be done when buffers fill up?
- When are snoop results reported? During the request? or during the response?
- basic design
- up to 8 outstanding requests (system wide)
- responses may be reordered relative to requests
- flow control -> NACKs
- initialize a request
- request bus: cmd + address
- response bus: data + tag
- Step 1: Requestor asks for request bus access
- Step 2: Bus arbiter grants access, assigns transaction a tag
- Step 3: Requestor places command + address on the request bus
- [[Check: MSHR]]
- read miss
- phase 1
- Request arbitration: cache controllers present request for address to bus (many caches may be doing so in the same cycle)
- Request resolution: address bus arbiter grants access to one of the requestors. Request table entry allocated for request
- Caches perform snoop: look up tags, update cache state, etc. (memory operation commits)
- Caches acknowledge this snoop result is ready (or signal they could not complete snoop in time here (e.g., raise inhibit wire)
- phase 2
- Data response arbitration: responder presents intent to respond to request with tag T
- Data bus arbiter grants one responder bus access
- Original requestor signals readiness to receive response
- phase 3
- Responder places response data on data bus
- Caches present snoop result for request with the data
- Request table entry is freed
- phase 1
- pipelined transaction
- write backs and BusUpg transactions do not have a response component
- conflicting requests
- disallowing conflicting requests by maintaining a copy of the request table in each cache
- flow control
- Caches/memory have buffers for receiving data off the bus (write-back buffer, data buffer)
- If the buffer fills, client NACKs relevant requests or responses (NACK = negative acknowledgement)
- Triggers a later retry (wound-wait)
- queue in distributed system: to accommodate variable (unpredictable) rates of production and consumption
- Multi-level cache hierarchies
- fetch deadlock problem: cache must be able to service requests while waiting on response to its own request (hierarchies increase response delay)
- Sizing all buffers to accommodate the maximum number of outstanding requests on bus is one solution to avoiding deadlock
- separate request/response queues
- responses can be completed without generating further transactions
- While stalled attempting to send a request, cache must be able to service responses
- Responses will make progress
store x <- 10
- Virtual address to physical address conversion (TLB lookup)
- TLB miss
- TLB update (might involve OS [[Check: Software-handled TLB]])
- OS traverse page table to see physical memory mapping
- OS may need to swap in page to get the appropriate page table (load from disk to physical address)
- Cache lookup (tag check)
- Determine line not in cache (need to generate BusRdX)
- Arbitrate for bus
- Win bus, place address, command on bus
- All caches perform snoop (e.g., invalidate their local copies of the relevant line)
- Another cache or memory decides it must respond (let’s assume it’s memory)
- Memory request sent to memory controller
- Memory controller is itself a scheduler
- Memory controller checks active row in DRAM row buffer. (May need to activate new DRAM row. Let’s assume it does.)
- DRAM reads values into row buffer
- Memory arbitrates for data bus
- Memory wins bus
- Memory puts data on bus
- Requesting cache grabs data, updates cache line and tags, moves line into exclusive state
- Processor is notified data exists
- Instruction proceeds
Memory Consistency
Memory coherence defines requirements for the observed behavior of reads and writes to the same memory location
- The goal of cache coherence is to ensure that the memory system in a parallel computer behaves as if the caches were not there
Memory consistency defines the behavior of reads and writes to different locations (as observed by other processors)
- Coherence only guarantees that writes to address X will eventually propagate to other processors
- Consistency deals with when writes to X propagate to other processors, relative to reads and writes to other addresses
- defines the allowed behavior of loads and stores to different addresses in a parallel system
Memory operation ordering: w.r.t. program order
- W→R: write to X must commit before subsequent read from Y
- R→R: read from X must commit before subsequent read from Y
- R→W: read to X must commit before subsequent write to Y
- W→W: write to X must commit before subsequent write to Y
Sequentially Consistent (SC): all 4 memory operation orderings, serializable R/W
- A parallel system is sequentially consistent if the result of any parallel execution is the same as if all the memory operations were executed in some sequential order, and the memory operations of any one processor are executed in program order.
- program order
- atomicity: all processors see operations equally
Relaxing order
hiding latency: overlap memory access operations with other operations when they are independent.
sufficient logical order != program order
write buffering (store queue with store-to-load forwarding)
- relax W→R
Total store ordering (TSO): Intel
- relax W→R
- Processor P can read B before its write to A is seen by all processors (processor can move its own reads in front of its own writes) (relax program order)
- Reads by other processors cannot return new value of A until the write to A is observed by all processors (relax atomicity)
- once in L1, observed by anyone
- Safety net: RMW (`lock cmpxchg`), memory barriers
- `_mm_lfence` (“load fence”: wait for all loads to complete)
- prevents R→R, R→W reordering
- effectively a no-op on x86
- Does it make any sense to use the LFENCE instruction on x86/x86_64 processors?
- `_mm_sfence` (“store fence”: wait for all stores to complete)
- prevents W→W reordering
- Does Intel SFENCE have release semantics?
- `_mm_mfence` (“memory fence”: wait for all memory operations to complete) [[Check: seq-cst semantics]]
- equivalent to `lfence` + `sfence` + W→R ordering
- ordering is implicitly provided by `lock`-prefixed instructions
Processor consistency (PC)
relax W→R
Any processor can read new value of A before the write is observed by all processors
other processors can read A even if new value of A haven’t been propagated to all processors
Partial Store Ordering (PSO)
- relax W→R, W→W
- reorder write operations in write buffer (E.g. cache miss/hit, ready ops)
Weak Ordering (WO): ARM
Release Consistency (RC)
- relax all
- out-of-order execution of all R/W instructions [[Check: Load-Store Queue with Forwarding & Random Commit]]
- Processors support special synchronization operations
- Memory accesses before memory fence instruction must complete before the fence issues
- Memory accesses after fence cannot begin until fence instruction is complete
- `clflush` (strongly ordered), `clflushopt` (weakly ordered, can overlap)
Acquire/Release semantics
- Operation X with acquire semantics: prevent reordering of X with any load/store after X in program order
- Other processors see X’s effect before the effect of all subsequent operations
- e.g., `lock()`
- Operation X with release semantics: prevent reordering of X with any load/store before X in program order
- Other processors see effects of all prior operations before seeing effect of X
- e.g., `unlock()`
- acquire-and-release-semantics
- What does each memory_order mean?
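A message-passing sketch of acquire/release semantics with C++ `std::atomic` (assumed example): the release store cannot be reordered before the write of `payload`, and the acquire load cannot be reordered after the read of `payload`, so a consumer that observes `ready == true` is guaranteed to see `payload == 42`.

```
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                      // ordinary write
    ready.store(true, std::memory_order_release);      // "unlock"-like: publishes prior writes
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {}  // "lock"-like: later reads see prior writes
    assert(payload == 42);
}

int main() {
    std::thread p(producer), c(consumer);
    p.join();
    c.join();
}
```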
Data race
- synchronized program: SC result on non-SC system
- unsynchronized program
- Conflicting accesses not ordered by synchronization (e.g., a fence, operation with release/acquire semantics, barrier, etc.)
- data races: the output of the program depends on relative speed of processors (non-deterministic program results)
Eventual consistency
- Eventual consistency guarantees that if there are no other updates to X, A’s update will eventually be observed by all other nodes in the system (note: no guarantees on when, so updates to objects X and Y might propagate to different clients differently)
Scaling a Web Site
- simple parallel web server
- Parallelism: use all the server’s cores
- Latency hiding: hide long-latency disk read operations (by context switching between worker processes)
- Concurrency: many outstanding requests, want to service quick requests while long requests are in progress
- Footprint: don’t want too many threads so that aggregate working set of all threads causes thrashing
- Apache: dynamically managed size of worker pool (limit maximum to avoid thrashing)
- process instead of thread
- protection: crash-unrelated, non-thread safe libraries
- parent process periodically recycle workers, robustness to memory leaks
- dynamic web content
- “Scale out” to increase throughput
- Load balancing with persistence
- All requests associated with a session are directed to the same server (aka. session affinity, “sticky sessions”)
- Database contention
- Option 1: “scale up”: buy better hardware for database server, buy professional-grade DB that scales
- Option 2: Replicate data and parallelize reads. (extra storage, consistency)
- Option 3: partition data into shards across multiple database servers
- Inter-request parallelism
- Web traffic
- bursty
- elasticity instead of provisioning
- Site performance monitor detects high load
- Instantiates new web server instances
- Informs load balancer about presence of new servers
- Cache
  - Cache commonly accessed objects (`memcached`): reduce DB/web server load
  - Cache web server responses (pages, pieces): front-end cache
    - Varnish-Cache application accelerator
  - Caching using CDN
    - physical locality
Interconnection Networks
Network node: a network endpoint connected to a router/switch
Network interface: Connects nodes to the network
Switch/router: Connects a fixed number of input links to a fixed number of output links
Link: A bundle of wires carrying a signal
Topology: how switches are connected via links
Routing distance: Number of links (“hops”) along a route between two nodes
Diameter: the maximum routing distance
Average distance: average routing distance over all valid routes
Direct / Indirect: endpoints sit “inside/outside” the network
Bisection bandwidth: Cut network in half, sum bandwidth of all severed links
Blocking / non-Blocking: If connecting any pairing of idle nodes is possible, network is non-blocking
Bus interconnect
Crossbar interconnect (CCX)
- every node is connected to every other node
- Oracle multi-core processing (SPARC T2/T5)
Ring interconnect
- Core i7, IBM CELL Broadband Engine
Mesh interconnect
- Direct network
- Echoes locality in grid-based applications
- Tilera processor
- prototype Intel chips
Torus interconnect
- Characteristics of mesh topology are different based on whether node is near edge or middle of network (torus topology introduces new links to avoid this problem)
Trees interconnect
- planar, hierarchical
- good when traffic has locality (like mesh/torus)
Hypercube
SGI Origin
Multi-stage logarithmic
- Indirect network with multiple switches between terminals
- variations: Omega, Butterfly, Clos Network
| Topology | Advantages | Disadvantages |
|---|---|---|
| Bus | Simple design; cost-effective for a small # of nodes; easy to implement coherence (via snooping) | Contention: all nodes contend for the shared bus; limited bandwidth: all nodes communicate over the same wires (one communication at a time); high electrical load = low frequency, high power |
| Crossbar | $O(1)$ latency & high bandwidth; non-blocking | Not scalable: $O(N^2)$ switches; high cost; difficult to arbitrate at scale |
| Ring | Simple; cheap: $O(N)$ cost | High latency: $O(N)$; not scalable: constant bisection bandwidth |
| Mesh | $O(N)$ cost; average latency $O(\sqrt{N})$; easy to lay out on chip (fixed-length links); path diversity: many ways for a message to travel from one node to another | Non-uniform: performance depends on whether a node is near the edge or the middle; blocking |
| Torus | Still $O(N)$ cost; higher path diversity & bisection BW than mesh | Higher cost than 2D grid; higher complexity: hard to lay out on chip, unequal link lengths |
| Tree | Latency: $O(\log{N})$ | Needs higher-BW links near the root |
| Hypercube | Low latency: $O(\log{N})$ | Radix: $O(\log{N})$; number of links: $O(N \log{N})$ |
| Multi-stage logarithmic | Cost: $O(N\log{N})$; latency: $O(\log{N})$ | Indirect; blocking |
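A quick worked comparison (my own numbers) to make the table concrete, for $N = 64$ nodes: a 2D mesh has diameter $2(\sqrt{N}-1) = 14$ hops and a bisection of $\sqrt{N} = 8$ links, while a hypercube has diameter $\log_2 N = 6$ hops and a bisection of $N/2 = 32$ links; the hypercube pays for this with $O(\log N)$ links per node.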
- Routing: how a message gets from its source to its destination in the network
- static
- adaptive
- Buffering and flow control
- Circuit switching sets up a full path (acquires all resources) between sender and receiver prior to sending a message
- Establish route (reserve links) then send all data for message
- Higher bandwidth transmission (no per-packet link mgmt overhead)
- Does incur overhead to set up/tear down path
- Reserving links can result in low utilization
- High-granularity resource allocation: pre-allocate all resources (links across multiple switches) along the entire network path for a message (“set up a flow”)
- Cost
- Needs setup phase (“probe”) to set up the path (and to tear it down and release the resources when message complete)
- Lower link utilization. Transmission of two messages cannot share same link (even if some resources on a preallocated path are no longer utilized during a transmission)
- Benefits
- No contention during transmission due to preallocation, so no need for buffering
- Arbitrary message sizes (once path is set up, send data until done)
- Packet switching makes routing decisions per packet
- Route each packet individually (possibly over different network links)
- Opportunity to use link for a packet whenever link is idle
- Overhead due to dynamic switching logic during transmission
- No setup/tear down overhead
- Granularity of communication
- Message: Unit of transfer between network clients
- Packet: Unit of transfer for network
- Header + Payload/Body +Tail
- Flit (flow control digit): Unit of flow control in the network
- Contention
- buffering
- dropping
- reroute (deflection)
- Store-&-Forward (packet-based)
- Packet copied entirely into network switch before moving to next node
- Requires buffering for entire packet in each router
- High per-packet latency (latency = packet transmission time on link x network distance)
- Cut-through flow control (packet-based)
- Switch starts forwarding data on next link as soon as packet header is received (header determines how much link bandwidth packet requires + where to route)
- Reduce transmission latency
- If the output link is blocked (the head cannot be transmitted), transmission of the tail can continue anyway
  - the entire packet is then absorbed into a buffer in the switch, degenerating to store-and-forward behavior
- Requires buffering for the entire packet in each router
- Wormhole flow control (flit-based)
- head-of-line blocking
- virtual channel flow control
- Multiplex multiple operations over single physical channel
- Divide switch’s input buffer into multiple buffers sharing a single physical channel
- Reduces head-of-line blocking
- [[Check: Medium Access Control/Multiplexing]]
- Deadlock avoidance
- Prevent cycles by ensuring requests and responses use different virtual channels
- Prioritization of traffic classes
- Provide quality-of-service guarantees
- Current Research
- Energy efficiency of interconnections
- Interconnect can be energy intensive (~35% of total chip power in MIT RAW research processor)
- Bufferless networks
- turn on/off regions of network, use fast and slow networks
- Prioritization and quality-of-service guarantees
- Prioritize packets to improve multi-processor performance (e.g., some applications may be more sensitive to network performance than others)
- Throttle endpoints (e.g., cores) based on network feedback
- New/emerging technologies
- Die stacking (3D chips)
- Photonic networks-on-chip (use optical waveguides instead of wires)
- Reconfigurable devices (FPGAs): create custom interconnects tailored to the application (see CMU projects: CONNECT, CoRAM, Shrinkwrap)
Implementing Synchronization
Acquire method: How a thread attempts to gain access to protected resource
Waiting algorithm: How a thread waits for access to be granted to shared resource
- busy waiting (spinning)
- blocking
- When should one use a spinlock instead of mutex?
Release method: How thread enables other threads to gain resource when its work in the synchronized region is complete
Test-and-set based lock
`ts R0, mem[addr]`: load `mem[addr]` into `R0`; if `mem[addr]` is 0, set `mem[addr]` to 1 [[Check: consensus number]]

```
lock:   ts R0, mem[addr]    // load lock word into R0 and test-and-set
        bnz R0, lock        // if R0 is nonzero, lock was held: retry
unlock: st mem[addr], #0    // release by storing 0
```
Bus contention increases the amount of time needed to transfer the lock
Bus contention slows down execution of critical sections
low latency, high traffic, poor scaling, low storage cost, no provision for fairness
- If every process spins on an exchange, every exchange instruction attempts a write -> many invalidations, and the lock value keeps changing ownership
- Hence, each process keeps reading the lock value: a read does not generate coherence traffic, and every process spins on its locally cached copy
- When the lock owner releases the lock by writing a 0:
  - the other copies are invalidated
  - each spinning process takes a read miss, acquires a new copy, sees the 0, and attempts an exchange (which requires acquiring the block in exclusive state so the write can happen)
  - the first process to acquire the block in exclusive state acquires the lock; the others keep spinning
CAS lock
`lock cmpxchg dst, src`

```
if (dst == EAX) {
    ZF = 1
    dst = src
} else {
    ZF = 0
    EAX = dst
}
```
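For reference, a minimal C++ sketch (my own code, not the course’s) of a spinlock built directly on compare-and-swap, mirroring the `lock cmpxchg` pseudocode above:

```cpp
#include <atomic>

struct CASLock {
    std::atomic<int> held{0};

    void lock() {
        int expected = 0;
        // try to CAS 0 -> 1; on failure `expected` is overwritten with the observed value
        while (!held.compare_exchange_weak(expected, 1, std::memory_order_acquire))
            expected = 0;   // reset and retry
    }

    void unlock() {
        held.store(0, std::memory_order_release);
    }
};
```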
Load-Linked & Store-Conditional Lock
ARM: `LDREX` & `STREX`

```
lockit: LL     R2, 0(R1)   ; load linked, generates no coherence traffic
        BNEZ   R2, lockit   ; not available, keep spinning
        DADDUI R2, R0, #1   ; put value 1 in R2
        SC     R2, 0(R1)    ; store-conditional succeeds if no one
                            ; updated the lock since the last LL
        BEQZ   R2, lockit   ; confirm that SC succeeded, else keep trying
```
Test-and-test-and-set lock [[Check: double lock in singleton]]
```c
void Lock(int* lock) {
    while (1) {
        while (*lock != 0);
        if (test_and_set(*lock) == 0)
            return;
    }
}

void Unlock(int* lock) {
    *lock = 0;
}
```
Slightly higher latency than test-and-set in uncontended case
Generates much less interconnect traffic
- One invalidation, per waiting processor, per lock release ($O(P)$ invalidations)
- This is $O(P^2)$ interconnect traffic if all processors have the lock cached
- Recall: test-and-set lock generated one invalidation per waiting processor per test
More scalable (due to less traffic)
Low storage cost
Still no provision for fairness
Test-and-set lock with backoff [[Check: ALOHA with backoff]]
```c
void Lock(volatile int* l) {
    int amount = 1;
    while (1) {
        if (test_and_set(*l) == 0)
            return;
        delay(amount);
        amount *= 2;
    }
}
```
Generates less traffic than test-and-set
Same uncontended latency as test-and-set, but potentially higher latency under contention
Exponential back-off can cause severe unfairness
Ticket lock
```c
struct lock {
    int next_ticket;
    int now_serving;
};

void Lock(lock* l) {
    int my_ticket = atomic_increment(&l->next_ticket);  // take a “ticket”
    while (my_ticket != l->now_serving);                // wait for number to be called
}

void unlock(lock* l) {
    l->now_serving++;
}
```
After taking a ticket (one atomic increment), acquiring the lock requires no further atomic operations: waiting threads only read `now_serving`
only one invalidation per lock release ($O(P)$ interconnect traffic)
Array-based lock
```c
struct lock {
    padded_int status[P];   // padded to keep off same cache line
    int head;
};

int my_element;

void Lock(lock* l) {
    my_element = atomic_circ_increment(&l->head);   // assume circular increment
    while (l->status[my_element] == 1);
}

void unlock(lock* l) {
    l->status[my_element] = 1;
    l->status[circ_next(my_element)] = 0;   // next() gives next index
}
```
$O(1)$ interconnect traffic per release, but lock requires space linear in $O(P)$
the atomic circular increment is a more complex operation (higher overhead)
Queueing locks
- the directory controller keeps track of the order in which requests arrived – when the lock is available, it is passed to the next in line (only one process sees the invalidate and update)
Desired lock
- Low latency: If lock is free and no other processors are trying to acquire it, a processor should be able to acquire the lock quickly
- Low interconnect traffic: If all processors are trying to acquire lock at once, they should acquire the lock in succession with as little traffic as possible
- Scalability: Latency / traffic should scale reasonably with number of processors
- Low storage cost
- Fairness: Avoid starvation or substantial unfairness
Min by CAS
```c
// atomicCAS:
// atomic compare and swap performs the following logic atomically
int atomicCAS(int* addr, int compare, int val) {
    int old = *addr;
    *addr = (old == compare) ? val : old;
    return old;
}

int atomic_min(int* addr, int x) {
    int old = *addr;
    int new = min(old, x);
    while (atomicCAS(addr, old, new) != old) {
        old = *addr;
        new = min(old, x);
    }
    return old;
}

void atomic_incr(int* addr, int x) {
    while (1) {
        int old = *addr;
        if (atomicCAS(addr, old, old + x) == old)
            break;
    }
}
```
Barrier
centralized barrier
wait for all processes to leave first barrier, before clearing flag for entry into the second
$O(P)$ traffic on interconnect per barrier
- All threads: $2P$ write transactions to obtain barrier lock and update counter ($O(P)$ traffic assuming lock acquisition is implemented in $O(1)$ manner)
- Last thread: 2 write transactions to write to the flag and reset the counter ($O(P)$ traffic since there are many sharers of the flag)
- $P-1$ transactions to read updated flag
- span (latency) of entire operation is $O(P)$
```c
struct Barrier_t {
    LOCK lock;
    int arrive_counter;   // initialize to 0 (number of threads that have arrived)
    int leave_counter;    // initialize to P (number of threads that have left barrier)
    int flag;
};

void Barrier(Barrier_t* b, int p) {
    lock(b->lock);
    if (b->arrive_counter == 0) {        // if first to arrive...
        if (b->leave_counter == p) {     // check to make sure no other threads "still in barrier"
            b->flag = 0;                 // first arriving thread clears flag
        } else {
            unlock(b->lock);
            while (b->leave_counter != p);   // wait for all threads to leave before clearing
            lock(b->lock);
            b->flag = 0;                 // first arriving thread clears flag
        }
    }
    int num_arrived = ++(b->arrive_counter);
    unlock(b->lock);

    if (num_arrived == p) {   // last arriver sets flag
        b->arrive_counter = 0;
        b->leave_counter = 1;
        b->flag = 1;
    } else {
        while (b->flag == 0);   // wait for flag
        lock(b->lock);
        b->leave_counter++;
        unlock(b->lock);
    }
}
```
centralized barrier with sense reversal
Sense reversal optimization results in one spin instead of two
```c
struct Barrier_t {
    LOCK lock;
    int counter;   // initialize to 0
    int flag;      // initialize to 0
};

thread_local int private_sense = 0;   // private per processor. Main idea: processors wait
                                      // for flag to be equal to private_sense

void Barrier(Barrier_t* b, int p) {
    private_sense = (private_sense == 0) ? 1 : 0;
    lock(b->lock);
    int num_arrived = ++(b->counter);
    if (b->counter == p) {   // last arriver sets flag
        unlock(b->lock);
        b->counter = 0;
        b->flag = private_sense;
    } else {
        unlock(b->lock);
        while (b->flag != private_sense);   // wait for flag
    }
}
```
Combining tree implementation of barrier
- Combining trees make better use of parallelism in more complex interconnect topologies. $\log{P}$ latency
- Barrier acquire: when processor arrives at barrier, performs increment of parent counter
- Barrier release: beginning from root, notify children of release
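A rough sketch of the combining-tree idea (my own code, not the course’s; fan-in is fixed at 2, and tree construction plus thread-to-leaf assignment are omitted). The point is only that no single counter receives all $P$ increments:

```cpp
#include <atomic>

struct TreeNode {
    std::atomic<int> count{0};     // arrivals at this node (fan-in = 2)
    TreeNode* parent = nullptr;    // nullptr at the root
};

std::atomic<int> release_sense{0};
thread_local int private_sense = 1;

void tree_barrier(TreeNode* leaf) {
    TreeNode* n = leaf;
    // fetch_add returns the old count: 0 -> first at this node (stop and wait),
    // 1 -> second (reset the node and combine upward)
    while (n != nullptr && n->count.fetch_add(1) == 1) {
        n->count.store(0);          // reset this node for the next barrier episode
        n = n->parent;
    }
    if (n == nullptr) {
        // climbed past the root: all threads have arrived; release everyone
        release_sense.store(private_sense);
    } else {
        while (release_sense.load() != private_sense) { /* spin */ }
    }
    private_sense = 1 - private_sense;   // sense reversal for the next episode
}
```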
Fine-grained Synchronization & Lock-free Programming
Fine-grained linked list
```c
struct Node {
    int value;
    Node* next;
    Lock* lock;
};

struct List {
    Node* head;
    Lock* lock;
};

void insert(List* list, int value) {
    Node* n = new Node;
    n->value = value;

    // assume case of insert before head handled
    // here (to keep slide simple)
    Node* prev, *cur;

    lock(list->lock);
    prev = list->head;
    cur = list->head->next;

    lock(prev->lock);
    unlock(list->lock);
    if (cur) lock(cur->lock);

    while (cur) {
        if (cur->value > value)
            break;
        Node* old_prev = prev;
        prev = cur;
        cur = cur->next;
        unlock(old_prev->lock);
        if (cur) lock(cur->lock);
    }

    n->next = cur;
    prev->next = n;

    unlock(prev->lock);
    if (cur) unlock(cur->lock);
}

void delete(List* list, int value) {
    // assume case of delete head handled here
    // (to keep slide simple)
    Node* prev, *cur;

    lock(list->lock);
    prev = list->head;
    cur = list->head->next;

    lock(prev->lock);
    unlock(list->lock);
    if (cur) lock(cur->lock);

    while (cur) {
        if (cur->value == value) {
            prev->next = cur->next;
            unlock(prev->lock);
            unlock(cur->lock);
            delete cur;
            return;
        }
        Node* old_prev = prev;
        prev = cur;
        cur = cur->next;
        unlock(old_prev->lock);
        if (cur) lock(cur->lock);
    }
    unlock(prev->lock);
}
```
reduce contention for global data structure lock
tricky to ensure correctness
extra storage
extra locking each traversal step
Lock-free: Non-blocking algorithms are lock-free if some thread is guaranteed to make progress (“systemwide progress”)
this definition does not prevent starvation of any one thread
wait-free: every operation has a bound on the number of steps the algorithm will take before the operation completes (starvation-free)
Single reader, single writer queue
- bounded: push/pop are not total functions (they can fail when the queue is full/empty); see the SPSC sketch below
- unbounded: keep a `reclaim` pointer as the physical head and `head` as the logical head; the producer advances `reclaim` up to `head` and deallocates the nodes in between
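A minimal sketch (my own code) of the bounded single-reader/single-writer case: because only the producer writes `tail` and only the consumer writes `head`, plain loads and stores with release/acquire ordering suffice and no RMW operations are needed.

```cpp
#include <atomic>
#include <cstddef>

template <typename T, size_t N>
struct SPSCQueue {
    T buf[N];
    std::atomic<size_t> head{0};   // next slot to read  (written only by the consumer)
    std::atomic<size_t> tail{0};   // next slot to write (written only by the producer)

    bool push(const T& v) {                       // producer only
        size_t t = tail.load(std::memory_order_relaxed);
        size_t next = (t + 1) % N;
        if (next == head.load(std::memory_order_acquire))
            return false;                         // full (one slot intentionally left empty)
        buf[t] = v;
        tail.store(next, std::memory_order_release);
        return true;
    }

    bool pop(T& out) {                            // consumer only
        size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire))
            return false;                         // empty
        out = buf[h];
        head.store((h + 1) % N, std::memory_order_release);
        return true;
    }
};
```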
Lock-free stack
ABA problem
maintain a counter of pop operations
requires double compare-and-swap, or a double-width CAS (`cmpxchg8b`, `cmpxchg16b`), or careful node allocation / element reuse policies
In summary, the ABA problem is caused by the value used to check the state of a shared structure not actually being unique (i.e., the structure may have changed even though the compared value still looks the same).
The ABA problem can be avoided by adding a mechanism that ensures memory is not made available for reuse until no thread still holds a reference to it.
```c
struct Node {
    Node* next;
    int value;
};

struct Stack {
    Node* top;
    int pop_count;
};

void init(Stack* s) {
    s->top = NULL;
}

void push(Stack* s, Node* n) {
    while (1) {
        Node* old_top = s->top;
        n->next = old_top;
        if (compare_and_swap(&s->top, old_top, n) == old_top)
            return;
    }
}

Node* pop(Stack* s) {
    while (1) {
        int pop_count = s->pop_count;
        Node* top = s->top;
        if (top == NULL)
            return NULL;
        Node* new_top = top->next;
        if (double_compare_and_swap(&s->top, top, new_top,
                                    &s->pop_count, pop_count, pop_count + 1))
            return top;
    }
}
```
referencing freed memory: the pointer read by `Node* top = s->top;` may refer to a node that another thread has already popped and deleted
hazard pointer: avoid freeing a node until it’s known that all other threads do not hold reference to it
“Each reader thread owns a single-writer/multi-reader shared pointer called “hazard pointer.” When a reader thread assigns the address of a map to its hazard pointer, it is basically announcing to other threads (writers), “I am reading this map. You can replace it if you want, but don’t change its contents and certainly keep your deleteing hands off it.”
-Andrei Alexandrescu and Maged Michael, Lock-Free Data Structures with Hazard Pointers
```c
struct Node {
    Node* next;
    int value;
};

struct Stack {
    Node* top;
    int pop_count;
};

// per-thread ptr (node that cannot
// be deleted since the thread is
// accessing it)
Node* hazard;

// per-thread list of nodes this
// thread must delete
Node* retireList;
int retireListSize;

// delete nodes if possible
void retire(Node* ptr) {
    push(retireList, ptr);
    retireListSize++;
    if (retireListSize > THRESHOLD)
        for (each node n in retireList) {
            if (n not pointed to by any thread's hazard pointer) {
                remove n from list
                delete n;
            }
        }
}

void init(Stack* s) {
    s->top = NULL;
}

void push(Stack* s, int value) {
    Node* n = new Node;
    n->value = value;
    while (1) {
        Node* old_top = s->top;
        n->next = old_top;
        if (compare_and_swap(&s->top, old_top, n) == old_top)
            return;
    }
}

int pop(Stack* s) {
    while (1) {
        Stack old;
        old.pop_count = s->pop_count;
        old.top = hazard = s->top;
        if (old.top == NULL) {
            return NULL;
        }
        Stack new_stack;
        new_stack.top = old.top->next;
        new_stack.pop_count = old.pop_count + 1;
        if (doubleword_compare_and_swap(s, old, new_stack)) {
            int value = old.top->value;
            retire(old.top);
            return value;
        }
        hazard = NULL;
    }
}
```
Lock-free linked list
insertion
```c
// insert new node after specified node
void insert_after(List* list, Node* after, int value) {
    Node* n = new Node;
    n->value = value;

    // assume case of insert into empty list handled
    // here (keep code on slide simple for class discussion)
    Node* prev = list->head;
    while (prev->next) {
        if (prev == after) {
            while (1) {
                Node* old_next = prev->next;
                n->next = old_next;
                if (compare_and_swap(&prev->next, old_next, n) == old_next)
                    return;
            }
        }
        prev = prev->next;
    }
}
```
deletion
- Consider case where B is deleted simultaneously with successful insertion of E after B
CAS can fail under heavy contention, requiring spins
Transactional Memory
- Transactional Memory (TM)
- Memory transaction: An atomic and isolated sequence of memory accesses
- Atomicity
- Upon transaction commit, all memory writes in transaction take effect at once
- On transaction abort, none of the writes appear to take effect (as if transaction never happened)
- Isolation
- No other processor can observe writes before transaction commits
- Serializability
- Transactions appear to commit in a single serial order
- motivations
- locks prevent concurrency
- failure atomicity: no manually written exception/cleanup code, just abort; no locks are lost when a thread fails
- composability: no manual lock-ordering conventions; the programmer states global intent. Safe & scalable composition
- easy-to-use
- performs as well as fine-grained locks (allows concurrent readers, R->R)
- TM Implementation
- Data versioning policy: How does the system manage uncommitted (new) and previously committed (old) versions of data for concurrent transactions? [[Check: Undo/Redo Log in DB]]
- eager versioning (undo-log based): Update memory immediately, maintain “undo log” in case of abort [[Check: 2PL]]
- fast commit
- slower aborts, fault tolerance issues [[Check: STEAL policy]]
- lazy versioning (write-buffer based): Log memory updates in transaction write buffer, flush buffer on commit [[Check: MVCC]]
- faster abort
- slower commits
- Conflict detection policy: how/when does the system determine that two concurrent transactions conflict? [[Check: OCC in DB]]
Read-set: addresses read during the transaction
Write-set: addresses written during the transaction
Read-write conflict: transaction A reads address X, which was written to by pending (but not yet committed) transaction B (W->R dirty read, R->W unrepeatable read)
Write-write conflict: transactions A and B are both pending, and both write to address X (W->W)
Pessimistic detection [[Check: 2PL]]
- write priority
- needs a contention manager to avoid livelock
- detect conflicts early (undo less work, turn some aborts to stalls)
- no forward progress guarantees, more aborts in some cases
- fine-grained communication (check on each load/store)
- detection on critical path
Optimistic detection: Detect conflicts when a transaction attempts to commit
- forward progress guarantees
- bulk communication and conflict detection
- detects conflicts late, can still have fairness problems
- [[N: here Case 3 success, I think it’s because MVCC control]]
Granularity
- Object granularity (SW-based)
- reduce overhead
- close to programmer’s reasoning
- false sharing
- Machine word granularity
- minimize false sharing
- increase overhead
- Cache-line granularity
- Mixed
Examples
Hardware transactional memory (HTM)
- data versioning in caches (cache write buffer or undo log)
- add new cache line metadata to track read/write set
- R bit: indicates data read by transaction (set on loads)
- W bit: indicates data written by transaction (set on stores)
- R/W bits can be at word or cache-line granularity
- R/W bits gang-cleared on transaction commit or abort
- For eager versioning, need a 2nd cache write for undo log
- conflict detection via cache coherence protocol
- Observing a shared request to a W-word is a read-write conflict
- Observing an exclusive (intent to write) request to an R-word is a write-read conflict
- Observing an exclusive (intent to write) request to a W-word is a write-write conflict
- register checkpoint at transaction begin (HW abort)
- Commit
- Validate: request RdX access to write set lines (if needed)
- Commit: gang-reset R and W bits, turns write set data to valid (dirty) data
- Abort
- Check: lookup exclusive requests in the read set and write set
- Abort: invalidate write set, gang-reset R and W bits, restore to register checkpoint
- Restricted Transactional Memory (RTM): Intel Haswell
  - `xbegin`, `xend`, `xabort`: tracks the read/write set in the L1 cache
  - eviction of a line in the read/write set will cause an abort (no guarantee of progress); see the intrinsics sketch below
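A hedged sketch of the Haswell RTM intrinsics: `_xbegin`, `_xend`, and `_XBEGIN_STARTED` are the real names from `<immintrin.h>` (compile with `-mrtm`), while the fallback lock and the update itself are my own placeholders. Note that a production lock-elision scheme must also read the fallback lock inside the transaction so a concurrent fallback holder causes an abort; that detail is omitted here.

```cpp
#include <immintrin.h>   // _xbegin, _xend, _XBEGIN_STARTED
#include <mutex>

std::mutex fallback;     // hypothetical fallback lock used when the transaction aborts

void update_shared(int* counter) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        *counter += 1;   // reads/writes tracked in the L1 read/write sets
        _xend();         // attempt to commit
    } else {
        // aborted (conflict, capacity eviction, interrupt, ...): no progress guarantee,
        // so fall back to an ordinary lock
        std::lock_guard<std::mutex> g(fallback);
        *counter += 1;
    }
}
```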
Heterogeneous Parallelism & Hardware Specialization
- Amdahl’s law: $\text{speedup}(f, n, r) = \frac{1}{\frac{1-f}{\text{perf}(r_{seq})} + \frac{f}{\sum\text{perf}(r_i)}}$
- $f$: fraction of program that is parallelizable
- $n$: total processing resources
- $r$: resources dedicated to each processing core
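- Worked example (my own numbers, assuming the common $\text{perf}(r) = \sqrt{r}$ model): with $f = 0.9$ and $n = 16$ resource units spent as one big core of $r = 4$ plus 12 single-unit cores, $\text{perf}(r_{seq}) = 2$ and $\sum\text{perf}(r_i) = 2 + 12 = 14$, so $\text{speedup} = \frac{1}{\frac{0.1}{2} + \frac{0.9}{14}} \approx 8.75$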
- Intel “Skylake”
- CPU cores & graphics cores share same memory system & LLC (L3 cache)
- enables low-latency, high-bw communication between CPU & integrated GPU
- Graphics cores are cache coherent with the CPU cores
- Add discrete GPU
- Keep the discrete (power-hungry) GPU powered off unless needed for graphics-intensive applications
- Use integrated, low power graphics for basic graphics/window manager/UI
- ARM-based supercomputer
- power-efficient building blocks: ARM CPUs (for control/scheduling) + GPU cores or wide SIMD engines (serving as the primary compute engine)
- Energy-constrained computing
- supercomputer: sheer scale
- datacenter: cost of cooling, physical space requirement
- mobile device: limited battery life, heat dissipation
- chip power consumption limit: the longer a task runs the less power it can use
- Processor’s power consumption is limited by heat generated (efficiency is required for more than just maximizing battery life)
- Electrical limit: max power that can be supplied to chip
- Die temp (junction temp - $T_j$): chip becomes unreliable above this temp (chip can run at high power for short period of time until chip heats to $T_j$)
- Case temp: mobile device gets too hot for user to comfortably hold (chip is at suitable operating temp, but heat is dissipating into case)
- Battery life: chip and case are cool, but want to reduce power consumption to sustain long battery life for given task
- Mobile: increasing efficiency
- Run faster for a fixed period of time: Run at higher clock, use more cores (reduce latency of critical task)
- Run at a fixed level of performance for longer
- Computing specialization
- Throughput-maximized processor architectures: e.g., GPU cores (10x)
- Fixed-function ASIC (“application-specific integrated circuit”) (100-1000x)
- FPGA (Field Programmable Gate Arrays)
- Middle ground between an ASIC and a processor
- FPGA chip provides array of logic blocks, connected by interconnect
- Programmer-defined logic implemented directly by the FPGA
  - combinational logic via LUTs (lookup tables)
- Project Catapult
- Heterogeneous system
- SW dev: how to map application onto a heterogeneous collection of resources?
- decomposing, scheduling
- HW dev: what is the right mixture of resources?
- Too few throughput oriented resources (lower peak throughput for parallel workloads)
- Too few sequential processing resources (limited by sequential part of workload)
- How much chip area should be dedicated to a specific function, like video?
- Reducing energy consumption
- specialized processing
- move less data
- Compute less: parallel algorithms that do more work than sequential counterparts may not be desirable even if they run faster
- Specialize compute units
- Heterogeneous processors: CPU-like cores + throughput-optimized cores (GPU-like cores)
- Fixed-function units: audio processing, “movement sensor processing” video decode/encode, image processing/computer vision?
- Specialized instructions: expanding set of AVX vector instructions, new instructions for accelerating AES encryption (AES-NI)
- Programmable soft logic: FPGAs
- Reduce bandwidth requirements
- Exploit locality (restructure algorithms to reuse on-chip data as much as possible)
- Aggressive use of compression: perform extra computation to compress application data before transferring to memory (likely to see fixed-function HW to reduce the overhead of general data compression/decompression)
- Heterogeneous parallel processing: use a mixture of computing resources that fits the mixture of needs of the target applications
- Latency-optimized sequential cores, throughput-optimized parallel cores, domain-specialized fixed-function processors
- Current CS research challenge: how to write efficient, portable programs for emerging heterogeneous architectures?
Domain-Specific Programming Systems
Heterogeneous execution capability
- Programmable, latency-centric (e.g., “CPU-like” cores)
- Programmable, throughput-optimized (e.g., “GPU-like” cores)
- Fixed-function, application-specific (e.g., image/video/audio processing)
Different technologies and performance characteristics within the same machine at different scales
- Within a core: SIMD, multi-threading: fine-granularity sync and communication
- Abstractions: SPMD programming (ISPC, CUDA, OpenCL, Metal)
- Across cores: coherent shared memory via fast on-chip network
- Abstractions: OpenMP pragmas, Cilk, TBB
- Hybrid CPU+GPU multi-core: incoherent (potentially) shared memory
- Abstractions: OpenCL
- Across racks: distributed memory, multi-stage network
- Abstractions: message passing (MPI, Go, Spark, Legion, Charm++)
Liszt: for scientific computing on meshes
```scala
val Position = FieldWithConst[Vertex,Float3](0.f, 0.f, 0.f)
val Temperature = FieldWithConst[Vertex,Float](0.f)
val Flux = FieldWithConst[Vertex,Float](0.f)
val JacobiStep = FieldWithConst[Vertex,Float](0.f)

var i = 0;
while ( i < 1000 ) {
  Flux(vertices(mesh)) = 0.f;
  JacobiStep(vertices(mesh)) = 0.f;
  for (e <- edges(mesh)) {
    val v1 = head(e)
    val v2 = tail(e)
    val dP = Position(v1) - Position(v2)
    val dT = Temperature(v1) - Temperature(v2)
    val step = 1.0f/(length(dP))
    Flux(v1) += dT*step
    Flux(v2) -= dT*step
    JacobiStep(v1) += step
    JacobiStep(v2) += step
  }
  i += 1
}
```
fields, mesh, topology functions, iteration over set
Liszt infers “stencils”: “stencil” = mesh elements accessed in an iteration of a loop = dependencies for the iteration
Statically analyze code to find stencil of each top-level for loop
- Extract nested mesh element reads
- Extract field operations
Language restrictions:
- Mesh elements are only accessed through built-in topological functions: `cells(mesh)`, …
- Single static assignment (immutable values): `val v1 = head(e)`
- Data in fields can only be accessed using mesh elements: `Pressure(v)`
- No recursive functions
Portable parallelism
- mesh partitioning
- mesh coloring
Distributed memory: Mesh + Stencil -> Graph -> Partition
GPU
parallel reduction
conflict graph
To avoid conflicts in vertex-value updates when parallelizing over edges, there are generally two schemes:
- Maintain the update value on the edge, and finally do a parallel reduction over vertices.
- Construct the conflict graph over edges and color the edges, yielding edge sets within which the edges are independent.
Both methods carry some additional cost: the former requires extra storage per edge; the latter requires extra time and storage for the conflict graph and the coloring.
Ideally, we would like to color the graph with the minimum number of colors to maximize parallelism and minimize overhead. Unfortunately, minimum vertex coloring is NP-hard, so constructing a minimum coloring for any significantly sized graph is usually out of the question. Instead you probably want to use some kind of greedy/approximation method.
According to the Halide paper, they actually use an algorithm from the paper “Register Allocation via Coloring” by Chaitin.
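Since the notes mention greedy/approximate coloring, here is a minimal greedy-coloring sketch (my own code, not Liszt’s or Chaitin’s algorithm): give each vertex the smallest color not used by an already-colored neighbor.

```cpp
#include <set>
#include <vector>

// adj[v] lists the neighbors of vertex v; returns a color per vertex
std::vector<int> greedy_color(const std::vector<std::vector<int>>& adj) {
    int n = (int)adj.size();
    std::vector<int> color(n, -1);
    for (int v = 0; v < n; v++) {
        std::set<int> used;
        for (int u : adj[v])
            if (color[u] != -1) used.insert(color[u]);   // colors taken by neighbors
        int c = 0;
        while (used.count(c)) c++;                       // smallest free color
        color[v] = c;
    }
    return color;
}
```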
Productivity
- Abstract representation of mesh: vertices, edges, faces, fields
- Intuitive topological operators
Portability
- Same code runs on large cluster of CPUs and GPUs (and combinations)
High performance
- Language is constrained to allow compiler to track dependencies
- Used for locality-aware partitioning (distributed memory implementation)
- Used for graph coloring to avoid sync (GPU implementation)
- Compiler chooses different parallelization strategies for different platforms
- System can customize mesh representation based on application and platform
Halide: image processing
two-pass blur: A 2D separable filter (such as a box filter) can be evaluated via two 1D filtering operations
Halide language
```cpp
Var x, y;
Func blurx, blury, out;
Image<uint8_t> in = load_image("myimage.jpg");

// perform 3x3 box blur in two passes
blurx(x,y) = (in(x-1,y) + in(x,y) + in(x+1,y)) / 3.f;
blury(x,y) = (blurx(x,y-1) + blurx(x,y) + blurx(x,y+1)) / 3.f;

// brighten blurred result by 25%, then clamp
out(x,y) = min(blury(x,y) * 1.25f, 255);

// execute pipeline on domain of size 800x600
Image<uint8_t> result = out.realize(800, 600);
```
Halide function: an infinite (but discrete) set of values
Halide expression: a side-effect-free expression that describes how to compute a function’s value at a point in its domain in terms of the values of other functions
scheduling: how to schedule the algorithm onto a parallel machine
```cpp
// Vectorize the xi loop (8-wide)
// Use threads to parallelize the y loop
out.tile(x, y, xi, yi, 256, 32).vectorize(xi, 8).parallel(y);

// Produce elements of blurx on demand for each tile of output.
// Vectorize the x (innermost) loop
blurx.compute_at(out, x).vectorize(x, 8);
```
- `blurx.compute_root()`: don’t compute within out’s loops; compute all of blurx up front
- `blurx.compute_at(out, xi)`: compute the necessary elements within out’s $x_i$ loop
- `blurx.compute_at(out, x)`: compute the necessary elements within out’s $x$ loop

Functional primitives for describing image processing operations
Additional primitives for describing schedules
separate “algorithm specification” from schedule
- [[C: Auto Schedule in TVM]]
Domain-Specific Programming on Graphs
GraphLab: A system for describing iterative computations on graphs
$G = (V, E)$, $D_v$ data associated with vertex $v$, $D_{u \to v}$ data associated with direct edge $u \to v$
read-only global data
operation: Defines per-vertex operations on the vertex’s local neighborhood (& global reduction)
neighborhood: current vertex, adjacent edges, adjacent vertices
```cpp
graphlab::omni_engine<pagerank_program> engine(dc, graph, "sync");
engine.signal_all();
engine.start();
```
Vertex signaling: primitive for scheduling work (Asynchronous execution model)
Synchronization: Programs specify what granularity of atomicity
- “Full consistency”: implementation ensures no other execution reads or writes to data in scope of $v$ when vertex program for $v$ is running.
- “Edge consistency”: no other execution reads or writes any data in $v$ or in edges adjacent to $v$
- “Vertex consistency”: no other execution reads or writes to data in $v$ …
Job scheduling order
- Synchronous: update all scheduled vertices “simultaneously”
- Round-robin: vertex programs observe most recent updates
- Graph coloring
- Dynamic: based on new work created by signal
Ligra: A simple framework for parallel graph operations
```
parents = {-1, ..., -1}

// d = dst: vertex to "update" (just encountered)
// s = src: vertex on frontier with edge to d
procedure UPDATE(s, d)
    return compare-and-swap(parents[d], -1, s);

procedure COND(i)
    return parents[i] == -1;

procedure BFS(G, r)
    parents[r] = r;
    frontier = {r};
    while (size(frontier) != 0) do:
        frontier = EDGEMAP(G, frontier, UPDATE, COND);

procedure EDGEMAP_SPARSE(G, U, F, C):
    result = {}
    parallel foreach v in U do:
        parallel foreach v2 in out_neighbors(v) do:
            if (C(v2) == 1 and F(v,v2) == 1) then
                add v2 to result
    remove duplicates from result
    return result;

procedure EDGEMAP_DENSE(G, U, F, C):
    result = {}
    parallel for i in {0,...,|V|-1} do:
        if (C(i) == 1) then:
            foreach v in in_neighbors(i) do:
                if v ∈ U and F(v, i) == 1 then:
                    add i to result;
                if (C(i) == 0)
                    break;
    return result;

procedure EDGEMAP(G, U, F, C):
    if (|U| + sum of out degrees > threshold)
        return EDGEMAP_DENSE(G, U, F, C);
    else
        return EDGEMAP_SPARSE(G, U, F, C);

procedure VERTEXMAP(U, F):
    result = {}
    parallel for u ∈ U do:
        if (F(u) == 1) then:
            add u to result;
    return result;
```
visiting frontier edge might be wasteful
- claimed child: edge points to unvisited node (useful work)
- failed child: edge points to node found in this step via another edge
- peer: edge points to a vertex that was added to frontier in same step as current vertex
- valid parent: edge points to vertex found in previous step
Entities: Graph, vertex subsets, `EDGEMAP`, `VERTEXMAP`
```
r_cur = {1/|V|, ..., 1/|V|};
r_next = {0, ..., 0};
diff = {}

procedure PRUPDATE(s, d):
    atomicIncrement(&r_next[d], r_cur[s] / vertex_degree(s));

procedure PRLOCALCOMPUTE(i):
    r_next[i] = alpha * r_next[i] + (1 - alpha) / |V|;
    diff[i] = |r_next[i] - r_cur[i]|;
    r_cur[i] = 0;
    return 1;

procedure COND(i):
    return 1;

procedure PAGERANK(G, alpha, eps):
    frontier = {0, ..., |V|-1}
    error = HUGE;
    while (error > eps) do:
        frontier = EDGEMAP(G, frontier, PRUPDATE, COND);
        frontier = VERTEXMAP(frontier, PRLOCALCOMPUTE);
        error = sum of per-vertex diffs   // this is a parallel reduce
        swap(r_cur, r_next);
    return error
```
System abstracts graph operations as data-parallel operations over vertices and edges
Good DSL
- #1: good systems identify the most important cases, and provide most benefit in these situations [[C: Lamport paper!]]
  - Structure of code mimics the natural structure of problems in the domain
    - Halide: pixel-wise view of filters: pixel(x,y) computed as an expression of input pixel values
    - Graph processing algorithms are designed in terms of per-vertex operations
- Efficient expression: common operations are easy and intuitive to express
- Efficient implementation: the most important optimizations in the domain are performed by the system for the programmer
- #2: good systems are usually simple systems [[C: KISS]]
- A small number of key primitives and operations
- Ligra: only two operations! (vertexmap and edgemap)
- GraphLab: run computation per vertex, trigger new work by signaling (But GraphLab gets messy with all the scheduling options)
- Halide: only a few scheduling primitives
- Hadoop: map + reduce
- Allows compiler/runtime to focus on optimizing these primitives
- #3: good primitives compose
- Composition of primitives allows for wide application scope, even if scope remains limited to a domain
- Halide’s loop ordering + loop interleaving schedule primitives allow for expression of wide range of schedules
- Composition often allows optimizations to generalize
Implementation on graph operations
- Reorganize graph structure to increase locality: efficient I/O
- -> Streaming graph computation
- Sharded graph representation
- Partition graph vertices into intervals (sized so that the subgraph for an interval fits in memory)
- Vertices and only incoming edges to these vertices are stored together in a shard
- Sort edges in a shard by source vertex id
- loop over all graph edges
- For each partition i of vertices:
- Load shard $i$ (contains all incoming edges)
- For each other shard $s$
- Load section of $s$ containing data for edges leaving $i$ and entering $s$
- Construct subgraph in memory
- Do processing on subgraph
- Compress the graph: reduce memory BW or disk I/O
- graph operations are often BW-bound
- [[C: *-order unary degree sequence (*OUDS) tree]]
- [[C: succinct data structure]]
In-Memory Distributed Computing using Spark
MapReduce
- mapper function -> prepare intermediate data -> reducer function
- checkpoints after map/reduce by writing results to FS
- job scheduler
- exploit data locality
- handling node failures
- handling slow machines (duplicate jobs)
- simple program structure (map, reduce)
- [[C: DryadLINQ for generalization to DAGs]]
- iterative algorithm must load from disk each iteration
Spark: in-memory, fault-tolerant distributed computing
Programming model for cluster-scale computations where there is significant reuse of intermediate datasets
- iterative ML, graph algorithm, iterative data mining
Keep intermediates in memory instead of persistent distributed file system
Fault tolerance for in-memory calculation
- Replicate all computations
- Checkpoint and rollback
- Periodically save state of program to persistent storage
- Restart from last checkpoint on node failure
- Maintain log of updates (commands and data)
- High overhead for maintaining logs
Resilient distributed dataset (RDD): read-only ordered collection of records (immutable)
can only be created by deterministic transformations on data in persistent storage or on existing RDDs
Actions on RDDs return data to application
```scala
// create RDD from file system data
var lines = spark.textFile("hdfs://15418log.txt");

// create RDD using filter() transformation on lines
var mobileViews = lines.filter((x: String) => isMobileClient(x));

// another filter() transformation
var safariViews = mobileViews.filter((x: String) => x.contains("Safari"));

// then count number of elements in RDD via count() action
var numViews = safariViews.count();

// 1. create RDD from file system data
// 2. create RDD with only lines from mobile clients
// 3. create RDD with elements of type (String,Int) from line string
// 4. group elements by key
// 5. call provided reduction function on all keys to count views
var perAgentCounts = spark.textFile("hdfs://15418log.txt")
                          .filter(x => isMobileClient(x))
                          .map(x => (parseUserAgent(x), 1))
                          .reduceByKey((x, y) => x + y)
                          .collect();

// create RDD from file system data
var lines = spark.textFile("hdfs://15418log.txt");

// create RDD using filter() transformation on lines
var mobileViews = lines.filter((x: String) => isMobileClient(x));

// instruct Spark runtime to try to keep mobileViews in memory
mobileViews.persist();

// create a new RDD by filtering mobileViews
// then count number of elements in new RDD via count() action
var numViews = mobileViews.filter(_.contains("Safari")).count();

// 1. create new RDD by filtering only Chrome views
// 2. for each element, split string and take timestamp of page view
// 3. convert RDD to a scalar sequence (collect() action)
var timestamps = mobileViews.filter(_.contains("Chrome"))
                            .map(_.split(" ")(0))
                            .collect();
```
[[C: ranges-v3, rust iterator, lazy & consumer]]
Transformation: RDD -> RDD
Action: RDD -> Data
Narrow dependency: each partition of parent RDD referenced by at most one child RDD partition
- Allows for fusing of operations
Wide dependency: each partition of parent RDD referenced by multiple child RDD partitions
- `groupByKey`: must compute all dependencies; triggers significant recomputation of ancestor lineage upon node failure
- `join`: depends on partitioning [[C: cluster, uncluster, index in DB]]
- `partitionBy(func)`: inform Spark how to partition an RDD
- `persist()`: store contents in memory; `persist(RELIABLE)`: store contents in durable storage (like a checkpoint)
Scheduling
- Stage 1 Computation: do nothing since input already materialized in memory
- Stage 2 Computation: evaluate map in fused manner, only actually materialize RDD F
- Stage 3 Computation: execute join (could stream the operation to disk, do not need to materialize)
Resilience via lineage
- RDD transformations are bulk, deterministic, and functional
- Implication: runtime can always reconstruct contents of RDD from its lineage (the sequence of transformations used to create it)
- Lineage is a log of transformations
- Efficient: since the log records bulk data-parallel operations, overhead of logging is low (compared to logging fine-grained operations, like in a database)
- Node failure: recompute RDD from lineage
Performance improvement
- With increasing DRAM & SSD, improving CPU utilization
- Efficient code generation (SIMD, target accelerators)
- [[C: IBM’s SparkGPU]]
Files are a poor abstraction for intermediate variables in large-scale data-parallel programs
With DSL: SQL, MLlib, GraphX
Efficiently Evaluating Deep Networks
Blocked dense matrix multiplication
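Since the notes only name the technique, here is a minimal blocked (tiled) matmul sketch, my own code with an assumed block size and row-major layout: each block of A, B, and C is reused from cache many times before being evicted, which is the whole point of blocking.

```cpp
#include <algorithm>
#include <vector>

constexpr int BLK = 64;   // block size: tune so three BLK x BLK tiles fit in cache (assumption)

// C += A * B for n x n row-major matrices; C is assumed zero-initialized by the caller.
void matmul_blocked(const std::vector<float>& A, const std::vector<float>& B,
                    std::vector<float>& C, int n) {
    for (int ii = 0; ii < n; ii += BLK)
        for (int kk = 0; kk < n; kk += BLK)
            for (int jj = 0; jj < n; jj += BLK)
                // multiply the (ii,kk) tile of A by the (kk,jj) tile of B into the (ii,jj) tile of C
                for (int i = ii; i < std::min(ii + BLK, n); i++)
                    for (int k = kk; k < std::min(kk + BLK, n); k++) {
                        float a = A[i * n + k];
                        for (int j = jj; j < std::min(jj + BLK, n); j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```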
Convolution
- FFT? forward transform and inverse transform each cost $O(N\log{N})$
- Winograd
- [[C: qnnpack]]
Reduce memory footprint
pruning
compact weight representation (quantization / weight sharing)
Step 1: prune low-weight links (iteratively retrain network, then prune)
Step 2: Weight sharing: make surviving connections share a small set of weights (discretization)
Step 3: Huffman encode quantized weights and CSR indices (lossless compression)
[[C: quantization of ML]]
My guess is that sometimes the low weight connections may provide not-so-helpful information about non-dominant features, and the weights assigned to those connections are significant enough that, after propagating through many layers of the DNN, the result might be slightly changed
pruned version can have a lower error rate than the original one. This reminds me of regularizations in the loss functions, which prevent overfitting by preferring simpler models. Also, it looks similar to “dropout”, an interesting regularization technique for reducing overfitting by dropping out units in a neural network. Regularization during training is definitely different from dropping small weights in the evaluation, but I believe they may have some connections
[[C: int8 in ARM-v7]]
Parallel Deep Network Training
Huge computational expense
- Must evaluate the network (forward and backward) for millions of training images
- Must iterate for many iterations of gradient descent
- Training modern networks on big datasets takes days
Large memory footprint
- Must maintain network layer outputs from forward pass
- Additional memory to store gradients/gradient velocity for each parameter
- Recall parameters for popular VGG-16 network require ~500 MB of memory (training requires GBs of memory for academic networks)
- Scaling to larger networks requires partitioning DNN across nodes to keep DNN + intermediates in memory
Dependencies / Synchronization (not embarrassingly parallel)
- Each parameter update step depends on previous
- Many units contribute to same parameter gradients (fine-scale reduction)
- Different images in mini batch contribute to same parameter gradients
Parameter Server
- Training data partitioned among workers
- Copy of parameters sent to workers
- Workers independently compute local “subgradients”
- Worker sends subgradient to parameter server
- Server updates global parameter values based on subgradient
- Updated parameters sent to worker
- Asynchronous parameter update
- avoid global synchronization on all parameter updates between each SGD iteration
- partial subgradient updates
- impact convergence
- fast interconnect? -> alternative in supercomputer
- Shard the PS: reduce contention
Model parallelism
- Partition network parameters across nodes
- Reduce internode communication through network design
- small spatial convolutions (1x1 convolutions)
- reduce/shrink fully connected layers
Addressing the Memory Wall
[[C: what every programmer should know about memory]]
DRAM access latency
best case: read from active row: Column Access Time (CAS)
worst case: bit line not ready, read from new row: Precharge (PRE) + Row Active (RAS) + Column Access (CAS)
Something else that was not mentioned is that DRAM needs to be refreshed, because capacitors lose charge over time. In particular, data for each row is periodically latched and rewritten so that its value is maintained (for instance, a capacitor holding a bit value of 1 will eventually read as 0 if its charge is not refreshed). RAM that doesn’t need to be refreshed is called SRAM (static RAM), which is much more expensive (it stores bits in latches built from more transistors than DRAM’s one transistor and one capacitor per bit) and therefore less common (often used in L1 caches, but not main memory).
Whenever we read a row from the DRAM into the row buffer, we actually discharged the capacitors corresponding to that row in the DRAM, so when we replace the row that is in the row buffer, we need to charge back the row in the DRAM ( to not lose the information we had in that row). That is why the Precharge (PRE) step is needed.
DRAM burst mode: amortize latency over larger transfers
DRAM banks
all banks share same pins
banks allow for pipelining of memory requests
- Precharge/activate rows/send column address to one bank while transferring data from another
- Achieves high data pin utilization
both bank interleaving and channel interleaving (as discussed on the next slide) are forms of interleaved memory design
each bank holds different memory
roughly (PRE+RAS)/CAS banks are needed to hide the latency of the row-activation process and fully utilize the data pins; therefore, increasing the bank count or having a longer CAS (burst) transfer both work here
Each DRAM bank has one row-buffer
DIMM (Dual in-line memory module): multiple chips
DIMM appears as a single, higher capacity, wider interface DRAM module to the memory controller. Higher aggregate bandwidth, but minimum transfer granularity is now 64 bits
Memory controller converts physical address to DRAM bank, row, column
Physical addresses are interleaved across DRAM chips at byte granularity
DRAM chips transmit first 64 bits in parallel
On the DIMM pictured, Kayvon noted that there are 8 DRAM chips, and asked what the 9th would be if there were one more. To this, someone responded “ECC”.
ECC, or “error-correcting code” memory, can be used to detect and possibly correct bit errors in the other 8 chips. Think parity bit checks, or Hamming codes.
For those familiar with hardware, bit errors are not to be ignored. They can occur because of hardware failure, or if your computer is being used in a non-ideal environment (e.g. in a nuclear power plant, the ISS).
the physical addresses could be interleaved across DRAM chips with granularity larger than one byte. For example, we could put bytes 0-3 on DRAM 0, 4-7 on DRAM 1, etc. In that case, we would transmit a strided pattern of bytes over the bus each cycle (e.g. byte 0 from DRAM 0, byte 4 from DRAM 1, etc.), but that’s okay, as long as we eventually obtain the whole cache line we want to load
one possible optimization is to load first the specific word that the memory request asked for and return it so that the processor can begin working with it, while the rest of the cache line finishes loading in. Of course, this would require changes to the semantics of how cache lines are loaded and processors’ expectations about them
[[C: critical word first]]
Memory controller: memory request scheduler
- Receives load/store requests from LLC
- Conflicting scheduling goals
- Maximize throughput, minimize latency, minimize energy consumption
- Common scheduling policy: FR-FCFS (first-ready, first-come-first-serve)
- Service requests to currently open row first (maximize row locality)
- Service requests to other rows in FIFO order
- Controller may coalesce multiple small requests into large contiguous requests (take advantage of DRAM “burst modes”)
Multiple-channel memory bus
DDR: double data rate (half cycle)
Embedded DRAM (eDRAM)
- IBM Power 7 server
- Intel Broadwell/Skylake
- XBox GPU
3D stacking DRAM: increase bandwidth, reduce power
- DRAMs connected via through-silicon-vias (TSVs) that run through the chips
- TSVs provide highly parallel connection between logic layer and DRAMs
- Base layer of stack “logic layer” is memory controller, manages requests from processor
- Silicon “interposer” serves as high-bandwidth interconnect between DRAM stack and processor
- Micron/Intel Hybrid Memory Cube (HMC)
- High-bandwidth memory (HBM)
- GPU
- MCDRAM
- Xeon Phi: 16GB in package stacked DRAM, as 16 GB last level cache, or as 16GB separate address space (“flat mode”)
- same latency as DDR4, 5x bandwidth of DDR4, 5x less energy cost per bit transferred
Moving computation to the data
- bulk copy
HW-accelerated data compression
- `_mm512_extload_ps`: load 8-bit values from memory and convert them to 32-bit float representation for storage in a vector register
- Cache compression: increase cache’s effective capacity by compressing data resident in cache
- B$\Delta$I compression
- GPU frame buffer compression
- All modern GPUs have hardware support for losslessly compressing frame buffer contents before transfer to/from memory
- On cache line load: transfer compressed data from memory and decompress into cache
- On evict: compress cache line and only transfer compressed bits to memory
- Anchor encoding
- ARM GPU “Memory transaction elimination”
- skip output image write if it is unnecessary
- Frame 1:
- Render frame tile at a time
- Compute hash of pixels in each tile on screen
- Frame 2:
- Render frame tile at a time
- Before storing pixel values for tile to memory, compute hash and see if tile is the same as last frame
- If yes, skip memory write
Addressing memory bottleneck
- Application programmer: schedule computation to maximize locality
- HW Architecture
- Intelligent DRAM request scheduling
- Bringing data closer to processor (deep cache hierarchies, eDRAM)
- Increase bandwidth (wider memory systems, 3D memory stacking)
- Ongoing research in locating limited forms of computation “in” or near memory
- Ongoing research in hardware accelerated compression
General principles
- Locate data near processor
- move computations to data storage
- data compression
- [[C: on-chip computation]]
The Future of High-Performance Computing
Titan Programming
- Solving problem over grid
- Bulk Synchronous model: partition into regions, map region per processor
- strive to keep perfectly balanced
- System Level: MPI
- Node Level: OpenMP (cpu) + CUDA (gpu)
MPI Fault Tolerance
- Checkpoint
- Restore
- Performance Scaling
MapReduce
- flexibility in placement, scheduling, load balancing
- high overhead, low raw performance
- data integrity: replicate
- recovery: recompute
Moore’s Law
- Lithography at such small dimensions
- Statistical variations among devices
Dennard Scaling
From what I’ve read, supercomputers run lightweight Linux derivatives. The CNK kernel used by IBM Blue Gene (at Argonne National Laboratory) has:
- statically mapped physical memory
- no context switching or scheduling at the compute kernel
- no file I/O implementation on compute nodes
15712 talks about microkernels like Mach
There’s many things going on in a traditional OS that aren’t required in a supercomputer. If you know your machine is dedicated to performing one large compute job, you can actually remove kernel functionality that supports anything else.
For example, general-purpose OSs have a timer “tick” that goes off every, say 10ms, that the kernel uses to schedule different processes to run on the CPU. When running a long compute job for which the programmer has partitioned work appropriately, this can cause significant slowdown.
Other examples include the networking stack (and associated daemon processes for that), desktop environments, and others. Technically, HW interrupts shouldn’t detract from performance, but since these machines don’t have disks, extra code in the kernel is just wasted space in RAM that could be used for computation.
Parallel 3D graphics
The visibility problem:
- what pixels does the triangle overlap? (“coverage”)
- what triangle is closest to the camera in each pixel? (“occlusion”)
- what scene geometry is visible within each screen pixel?
- What scene geometry projects into a screen pixel? (coverage)
- Which geometry is visible from the camera at that pixel? (occlusion)
- rays
- What scene geometry is hit by a ray from a pixel through the pinhole? (coverage)
- What object is the frst hit along that ray? (occlusion)
The graphics pipeline
- Geometry: compute vertex positions on screen
- Rasterization: compute covered samples
- Shading: compute color of colored pixels
- Pixel Ops: depth test & depth/color write
Coverage
- $E_i(x, y) = (x - X_i)\,dY_i - (y - Y_i)\,dX_i = A_i x + B_i y + C_i$ (inside < 0, on edge = 0, outside > 0)
- $inside(sx, sy)$: all three edge equations < 0
- incremental updates: $E_i(x + 1, y) = E_i(x, y) + A_i$
- tiled triangle traversal: all modern GPUs have fixed-function hardware for efficiently performing data-parallel point-in-triangle tests
  - Simplicity of wide parallel execution overcomes the cost of extra point-in-triangle tests (most triangles cover many samples, especially when super-sampling coverage)
  - Can skip sample-testing work: use a triangle-box test to classify an entire block as not in the triangle (“early out”), or entirely within the triangle (“early in”)
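A small sketch (my own code) of the edge-function test described above; the sign convention (inside when all three edge values are negative) assumes the triangle’s vertices are wound consistently so that interior points evaluate negative on every edge, matching the notes’ convention.

```cpp
struct Vec2 { float x, y; };

// E(x,y) = (x - a.x) * dY - (y - a.y) * dX, with (dX, dY) = b - a
float edge(const Vec2& a, const Vec2& b, float sx, float sy) {
    return (sx - a.x) * (b.y - a.y) - (sy - a.y) * (b.x - a.x);
}

// sample (sx, sy) is covered if it lies on the negative side of all three edges
bool inside(const Vec2& v0, const Vec2& v1, const Vec2& v2, float sx, float sy) {
    return edge(v0, v1, sx, sy) < 0 &&
           edge(v1, v2, sx, sy) < 0 &&
           edge(v2, v0, sx, sy) < 0;
}
```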
Occlusion
depth-buffer (Z-buffer): For each coverage sample point, depth-buffer stores depth of closest triangle at this sample point that has been processed by the renderer so far
```
bool pass_depth_test(d1, d2) {
    return d1 < d2;
}

depth_test(tri_d, tri_color, x, y) {
    if (pass_depth_test(tri_d, zbuffer[x][y])) {
        // Triangle is the closest object seen so far at this
        // sample point: update depth and color buffers.
        zbuffer[x][y] = tri_d;    // update zbuffer
        color[x][y] = tri_color;  // update color buffer
    }
}
```
Z-buffer algorithm has high bandwidth requirements
- Number of Z-buffer reads/writes for a frame depends on:
- Depth complexity of the scene: how many triangles cover each pixel on average
- The order triangles are provided to the graphics pipeline (if the depth test fails, neither the depth buffer nor the color buffer is written)
- Depth-buffer BW $\approx$ frame rate $\times$ pixels per frame $\times$ avg. scene depth complexity $\times$ bits per depth sample (worked estimate below)
- GPUs often sample coverage multiple times per pixel for quality: multiply by 4
- Scene is often rendered multiple times per frame: multiply by 2-3
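A rough worked estimate with assumed numbers (none of these figures are from the notes): at 60 frames/s, 2 Mpixels per frame, average depth complexity 4, and 32-bit depth samples, reads plus writes already give about

$60 \times 2\times10^{6} \times 4 \times 4\,\text{B} \times 2 \approx 3.8\ \text{GB/s}$

for depth traffic alone; the 4x multisample and 2-3x multi-pass factors above push that to tens of GB/s, which is why the compression hardware below matters.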
Modern GPUs have fixed-function hardware to implement caching and lossless compression of both color and depth buffers to reduce bandwidth
Frame buffer compression
- depth-buffer compression: reduce BW, lossless, for fixed-point numbers
- screen tile based
- On tile evict:
- Compute zmin/zmax (needed for hierarchical culling and/or compression)
- Attempt to compress
- Update tile table
- Store tile to memory
- On tile load:
- Check tile table for compression scheme
- Load required bits from memory
- Decompress into tile cache
- Anchor encoding
- Choose anchor value and compute DX, DY from adjacent pixels (fits a plane to the data)
- Use plane to predict depths at other pixels, store offset d from prediction at each pixel
- Scheme (for 24-bit depth buffer)
- Anchor: 24 bits (full resolution)
- DX, DY: 15 bits
- Per-sample offsets: 5 bits
- Depth-offset compression
- Assume depth values have low dynamic range relative to the tile’s zmin and zmax (assume two surfaces)
- Store zmin/zmax
- Store a low-precision (8-12 bit) offset value for each sample
- Explicit plane encoding: Do not attempt to infer prediction plane, just get the plane equation of the triangle directly from the rasterizer
- Store plane equation in tile (high precision values)
- Store bit per sample indicating coverage
- Hierarchical occlusion culling “Hi-Z” (see the sketch after this list)
  - Z-max culling: for each screen tile, compute the farthest value in the depth buffer, `z_max` (often needed for compression anyway)
  - During traversal, for each tile:
    - Compute the closest point on the triangle within the tile: `tri_min`
    - If `tri_min > z_max`, the triangle is completely occluded in this tile (the depth test would fail for every sample), so proceed to the next tile without performing coverage tests for individual samples
  - Z-min optimization: the depth buffer also stores `z_min` for each tile; if `tri_max < z_min`, all depth tests for fragments in the tile will pass (no need to perform the depth test on individual fragments)
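A minimal sketch of the per-tile Hi-Z classification described above, assuming the smaller-is-closer depth convention used in the depth-test pseudocode earlier; the struct and function names are illustrative:

```c
// Per-tile hierarchical Z culling sketch (smaller depth = closer).

typedef struct {
    float z_min;   // closest depth stored anywhere in this tile
    float z_max;   // farthest depth stored anywhere in this tile
} TileZ;

// tri_min / tri_max: conservative closest / farthest depth of the
// triangle over this tile (e.g., from its plane equation at the tile corners).
typedef enum { TILE_CULLED, TILE_ALL_PASS, TILE_NEEDS_PER_SAMPLE } TileResult;

static TileResult hi_z_classify(TileZ tile, float tri_min, float tri_max) {
    if (tri_min > tile.z_max)
        return TILE_CULLED;        // every sample would fail the depth test: skip tile
    if (tri_max < tile.z_min)
        return TILE_ALL_PASS;      // every sample would pass: skip per-sample depth tests
    return TILE_NEEDS_PER_SAMPLE;  // ambiguous: fall back to per-sample testing
}
```

The first comparison is the Z-max cull and the second is the Z-min optimization; conservative `tri_min`/`tri_max` bounds for the tile are cheap to obtain because the rasterizer already has the triangle’s plane equation.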
Sort first: Assign each replicated pipeline responsibility for a region of the output image
- Bandwidth scaling (small amount of sync/communication, simple point-to-point)
- Computation scaling (more parallelism = more performance)
- Simple
- Potential for workload imbalance
- “Tile spread”: as screen tiles get smaller, primitives cover more tiles (duplicate geometry processing across the parallel pipelines)
Sort middle: Assign each rasterizer a region of the render target
Interleaved mapping of screen
- [[C: RGB-Delta arrangement]]
Sort middle interleaved
- Good workload balance, both for geometry work and across rasterizers (due to interleaving)
- Does not duplicate geometry processing for each overlapped screen region
- Bandwidth scaling is poor: the sort is implemented as a broadcast (with interleaved pixel ownership, any primitive may need to reach any rasterizer)
- If tessellation is enabled, must communicate many more primitives than sort first
- Duplicated per triangle setup work across rasterizers
Sort middle tiled
- Partition screen into many small tiles (many more tiles than physical rasterizers)
- Sort geometry by tile into buckets (one bucket per tile of the screen); see the bucketing sketch below
- After all geometry is bucketed, rasterizers process buckets in parallel
- Sort requires point-to-point traffic (assuming each triangle only touches a few buckets)
- Good load balance
- Potentially low bandwidth requirements
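A minimal sketch of the bucketing step in sort-middle tiled rendering, assuming axis-aligned screen tiles and conservative binning by each triangle’s screen-space bounding box (tile size, screen size, and all struct/function names are illustrative):

```c
// Bin triangles into per-tile buckets by screen-space bounding box.
#include <stdlib.h>

#define TILE_SIZE 64          // pixels per tile side (illustrative)
#define SCREEN_W 1920
#define SCREEN_H 1080
#define TILES_X ((SCREEN_W + TILE_SIZE - 1) / TILE_SIZE)
#define TILES_Y ((SCREEN_H + TILE_SIZE - 1) / TILE_SIZE)

typedef struct { float x[3], y[3]; } Tri;   // post-projection screen coordinates

typedef struct {
    int *tri_ids;   // indices of triangles touching this tile
    int count, cap;
} Bucket;           // buckets must be zero-initialized by the caller

static void bucket_push(Bucket *b, int id) {
    if (b->count == b->cap) {
        b->cap = b->cap ? b->cap * 2 : 16;
        b->tri_ids = realloc(b->tri_ids, b->cap * sizeof(int));
    }
    b->tri_ids[b->count++] = id;
}

static float min3(float a, float b, float c) { return a < b ? (a < c ? a : c) : (b < c ? b : c); }
static float max3(float a, float b, float c) { return a > b ? (a > c ? a : c) : (b > c ? b : c); }

// After all geometry is bucketed, rasterizers process buckets (tiles) in parallel.
static void bucket_triangles(const Tri *tris, int n, Bucket buckets[TILES_Y][TILES_X]) {
    for (int i = 0; i < n; i++) {
        int tx0 = (int)min3(tris[i].x[0], tris[i].x[1], tris[i].x[2]) / TILE_SIZE;
        int tx1 = (int)max3(tris[i].x[0], tris[i].x[1], tris[i].x[2]) / TILE_SIZE;
        int ty0 = (int)min3(tris[i].y[0], tris[i].y[1], tris[i].y[2]) / TILE_SIZE;
        int ty1 = (int)max3(tris[i].y[0], tris[i].y[1], tris[i].y[2]) / TILE_SIZE;
        // Clamp to the screen, then push the triangle into every tile its bbox overlaps.
        if (tx0 < 0) tx0 = 0;
        if (ty0 < 0) ty0 = 0;
        if (tx1 >= TILES_X) tx1 = TILES_X - 1;
        if (ty1 >= TILES_Y) ty1 = TILES_Y - 1;
        for (int ty = ty0; ty <= ty1; ty++)
            for (int tx = tx0; tx <= tx1; tx++)
                bucket_push(&buckets[ty][tx], i);
    }
}
```

Because a small triangle lands in only one or a few buckets, the sort traffic stays point-to-point, and having many more tiles than rasterizers is what gives the good load balance noted above.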
Sort last fragment
- Sort after fragment processing based on (x,y) position of fragment
- No redundant work in geometry processing or rasterization (but z-cull is a problem)
- Point-to-point communication during sort
- Interleaved pixel mapping results in good workload balance for frame-buffer ops
- Workload imbalance due to primitives of varying size
- Bandwidth-limited: all shaded fragments must be communicated, so traffic grows with overdraw
- Early Z cull is difficult
Sort everywhere
- detail
- [[TODO: implementation]]