Parallel Computer Architecture and Programming
[[S: Content]]
- S = C: Check, random or related insights
- S = N: Note, some notes
- S = T: Todo, literally
- S = R: Refer, cross reference
- S = Q: Question, some questions, might be collected at the end
Why Parallelism? Why Efficiency?
Common definition: A parallel computer is a collection of processing elements that cooperate to solve problems quickly
speedup ($P$ processors) = execution time (1 processor) / execution time ($P$ processors)
- limited by communication -> improve locality
- limited by imbalance in work assignment -> improve distribution of work
Design & write scalable parallel programs
- decomposing work into pieces that can be performed in parallel
- assigning work to processors
- managing communication/synchronization
Parallel computer hardware implementation
- performance characteristics
- trade-off
Efficiency
Instruction Level Parallelism (ILP)
superscalar execution
Most available ILP is exploited by a processor capable of issuing four instructions per clock
Power wall
- dynamic power $\propto$ capacitive load $\times$ voltage$^2$ $\times$ frequency
- static power: transistors burn power even when inactive due to leakage
Single-core performance scaling
- frequency limited by power
- ILP scaling tapped out
- -> adding more execution units that run in parallel
A Modern Multi-Core Processor
- Multi-core
- use the growing transistor count to add more cores instead of more logic for a single instruction stream
- software creates the additional instruction streams (e.g., via the pthreads API)
- SIMD (Single Instruction, Multiple Data) processing
- amortize cost/complexity of managing an instruction stream across many ALUs
- SSE instructions: 128-bit operations: 4x32 bits or 2x64 bits (4-wide float vectors)
- AVX instructions: 256-bit operations: 8x32 bits or 4x64 bits (8-wide float vectors)
- AVX-512 instructions: 512-bit operations: 16x32 bits (16-wide float vectors)
- SIMD instructions generated by the compiler
- parallelism explicitly requested by the programmer using intrinsics (see the intrinsics sketch after this list)
- parallelism conveyed using parallel language semantics (e.g., OpenMP)
- parallelism inferred by dependency analysis of loops (auto-vectorization, e.g., with `-march=native`)
- explicit SIMD: SIMD parallelization performed at compile time; vector instructions (`vstoreps`, `vmulps`, ...) appear in the binary
- implicit SIMD: compiler generates a scalar binary; hardware runs N instances of the program together on the processor
- interface to the hardware itself is data-parallel
- hardware simultaneously executing the same instruction from multiple instances on different data on SIMD ALUs
- conditional code
- instruction stream coherence (coherent execution): Same instruction sequence applies to all elements operated upon simultaneously
- divergent execution: can’t make efficient use of SIMD processing
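A minimal sketch of explicit SIMD via intrinsics, which also shows how conditional code turns into per-lane masking (assumed standalone example, not from the lecture; `abs4` is a made-up helper; requires SSE4.1):

```
#include <immintrin.h>

// 4-wide SIMD: out[i] = (a[i] < 0) ? -a[i] : a[i] for i in [0, 4)
void abs4(const float* a, float* out) {
    __m128 v    = _mm_loadu_ps(a);                    // load 4 floats
    __m128 neg  = _mm_sub_ps(_mm_setzero_ps(), v);    // compute -a for all lanes
    __m128 mask = _mm_cmplt_ps(v, _mm_setzero_ps());  // per-lane mask: a[i] < 0
    __m128 r    = _mm_blendv_ps(v, neg, mask);        // select per lane (masking)
    _mm_storeu_ps(out, r);                            // store 4 results
}
```

An ISPC-style or implicit-SIMD compiler emits essentially this blend/mask pattern automatically when it vectorizes a divergent `if`.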
- Parallel execution forms
- Multi-core: use multiple processing cores
- Provides thread-level parallelism (TLP): simultaneously execute a completely different instruction stream on each core
- Software decides when to create threads (e.g., via pthreads API)
- SIMD: use multiple ALUs controlled by same instruction stream (within a core)
- Efficient design for data-parallel workloads: control amortized over many ALUs
- Vectorization can be done by compiler (explicit SIMD) or at runtime by hardware
- [Lack of] dependencies is known prior to execution (usually declared by programmer, but can be inferred by loop analysis by advanced compiler)
- Superscalar: exploit ILP within an instruction stream. Process different instructions from the same instruction stream in parallel (within a core)
- Parallelism automatically and dynamically discovered by the hardware during execution (not programmer visible)
- Memory latency: The amount of time for a memory request (e.g., load, store) from a processor to be serviced by the memory system
- Memory bandwidth: The rate at which the memory system can provide data to a processor
- Stall: A processor “stalls” when it cannot run the next instruction in an instruction stream because of a dependency on a previous instruction
- Caches reduce the length of stalls (reduce memory access latency)
- Prefetching reduces stalls (hide latency) [[Check stride prefetcher]]
- Multi-threading reduces stalls (hide latency) (Interleave processing of multiple threads)
- Storing execution contexts
- Many small contexts (high latency hiding ability)
- Few large contexts (low latency hiding ability)
- Hardware-supported multi-threading
- Core manages execution contexts for multiple threads
- Runs instructions from runnable threads (processor makes decision about which thread to run each clock, not the operating system)
- Core still has the same number of ALU resources: multi-threading only helps use them more efficiently in the face of high-latency operations like memory access
- Interleaved multi-threading (a.k.a. temporal multi-threading) [[Check: fine/coarse-grained HT]]
- each clock, the core chooses a thread, and runs an instruction from the thread on the ALUs
- Simultaneous multi-threading (SMT)
- Each clock, core chooses instructions from multiple threads to run on ALUs
- Extension of superscalar (OoO?) CPU
- Use core’s FU resources more efficiently
- Cost
- Requires additional storage for thread contexts (duplicated register, RAT)
- Increases run time of any single thread
- Requires additional independent work in a program
- Relies heavily on memory bandwidth
- larger working set -> less cache space per thread (often shared)
- [[Check: SMT helper thread prefetching]]
- GPU: extreme throughput-oriented processors
- warp: an instruction stream operating on 32 pieces of data at a time (like a thread issuing 32-wide vector instructions)
- different instructions from up to 4 warps executed simultaneously (SMT)
- 64 warps interleaved on the SM (interleaved MT with thread context)
- Bandwidth bound
- Organize computation to fetch data from memory less often
- Reuse data previously loaded by the same thread (intra-thread temporal locality optimization; exploit the cache)
- Share data across threads (inter-thread cooperation, e.g., CUDA `__shared__` memory)
- Request data less often
- arithmetic intensity: ratio of math operations to data access operations in an instruction stream (see the saxpy sketch below)
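A small sketch of how to reason about arithmetic intensity, using saxpy as an assumed example (not from the notes):

```
// saxpy: per element, 2 math ops (mul + add) vs. 12 bytes moved (load x[i], load y[i], store y[i])
// arithmetic intensity ~ 2 ops / 12 bytes ~ 0.17 ops/byte -> bandwidth bound on most machines
void saxpy(int n, float a, const float* x, float* y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```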
Parallel Programming Abstraction
ISPC: Intel SPMD Program Compiler
- `programCount`: number of simultaneously executing instances in the gang (a uniform value)
- `programIndex`: id of the current instance in the gang (a non-uniform value: “varying”)
- `uniform`: a type modifier. All instances have the same value for this variable. Its use is purely an optimization; not needed for correctness.

Interleaved assignment of program instances to loop iterations
```
export void sinx(
    uniform int N,
    uniform int terms,
    uniform float* x,
    uniform float* result)
{
    // assumes N % programCount = 0
    for (uniform int i=0; i<N; i+=programCount) {
        int idx = i + programIndex;
        float value = x[idx];
        float numer = x[idx] * x[idx] * x[idx];
        uniform int denom = 6;   // 3!
        uniform int sign = -1;
        for (uniform int j=1; j<=terms; j++) {
            value += sign * numer / denom;
            numer *= x[idx] * x[idx];
            denom *= (2*j+2) * (2*j+3);
            sign *= -1;
        }
        result[idx] = value;
    }
}
```
Blocked assignment of program instances to loop iterations
```
export void sinx(
    uniform int N,
    uniform int terms,
    uniform float* x,
    uniform float* result)
{
    // assume N % programCount = 0
    uniform int count = N / programCount;
    int start = programIndex * count;
    for (uniform int i=0; i<count; i++) {
        int idx = start + i;
        float value = x[idx];
        float numer = x[idx] * x[idx] * x[idx];
        uniform int denom = 6;   // 3!
        uniform int sign = -1;
        for (uniform int j=1; j<=terms; j++) {
            value += sign * numer / denom;
            numer *= x[idx] * x[idx];
            denom *= (2*j+2) * (2*j+3);
            sign *= -1;
        }
        result[idx] = value;
    }
}
```
A single packed SSE load (`_mm_load_ps1`) can efficiently implement `float value = x[idx]`
Raising level of abstraction
- `foreach`: declares parallel loop iterations; the gang cooperatively performs the loop, and ISPC decides the assignment of iterations to program instances
```
export void sinx(
    uniform int N,
    uniform int terms,
    uniform float* x,
    uniform float* result)
{
    foreach (i = 0 ... N) {
        float value = x[i];
        float numer = x[i] * x[i] * x[i];
        uniform int denom = 6;   // 3!
        uniform int sign = -1;
        for (uniform int j=1; j<=terms; j++) {
            value += sign * numer / denom;
            numer *= x[i] * x[i];
            denom *= (2*j+2) * (2*j+3);
            sign *= -1;
        }
        result[i] = value;
    }
}
```
abstraction vs implementation
- SPMD (Single Program, Multiple Data) abstraction: running a gang spawns `programCount` logical instruction streams (each with a different value of `programIndex`)
- SIMD implementation: the ISPC compiler emits vector instructions (SSE4 or AVX) that carry out the logic performed by an ISPC gang
- handles mapping of conditional control flow to vector instructions (by masking vector lanes)
Assigning a `varying` value to a `uniform` variable causes a compile-time error
Thread Programming Model
ISPC Programming Model
Shared address space communication Model
- Threads communicate by
- reading/writing to shared variables
- manipulating synchronization primitives
- natural extension of sequential programming
- HW implementation: any processor can directly reference any memory location
- Intel/AMD: on-chip network
- AMD hyper-transport / Intel QuickPath (QPI)
- UltraSPARC: crossbar (CCX)
- costly for NUMA & scale
- SGI Altix: fat tree
- can be implemented on machines without HW support via a SW solution (mark all shared pages as invalid; a page fault handler issues network requests)
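A minimal sketch of the shared address space model (assumed example using C++ `std::thread`/`std::mutex` instead of pthreads): threads communicate through an ordinary shared variable plus a synchronization primitive.

```
#include <cstdio>
#include <mutex>
#include <thread>

int shared_counter = 0;   // lives in the single shared address space
std::mutex m;             // synchronization primitive

void worker() {
    for (int i = 0; i < 1000; i++) {
        std::lock_guard<std::mutex> lock(m);  // mutual exclusion
        shared_counter++;                     // plain load/store to shared data
    }
}

int main() {
    std::thread t0(worker), t1(worker);
    t0.join();
    t1.join();
    std::printf("%d\n", shared_counter);      // prints 2000
}
```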
Message passing communication Model
- Threads operate within their own private address spaces
- Threads communicate by sending/receiving messages
- `send`: specifies recipient, buffer to be transmitted, and optional message identifier (“tag”)
- `recv`: specifies sender, buffer to store data, and optional message identifier
- MPI (Message Passing Interface)
- HW implementation: may accelerate message communication
- connect commodity systems together to form large parallel machine (clusters)
- commonly implemented on a shared address space HW (by copying)
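A minimal message-passing sketch using MPI (assumed example; the buffer contents and tag are arbitrary): each rank has a private address space and communicates only by sending/receiving messages.

```
#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int TAG = 0;
    float buf[4] = {0.f, 1.f, 2.f, 3.f};
    if (rank == 0) {
        // send: recipient, buffer to be transmitted, tag
        MPI_Send(buf, 4, MPI_FLOAT, /*dest=*/1, TAG, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // receive: sender, buffer to store data, tag
        MPI_Recv(buf, 4, MPI_FLOAT, /*source=*/0, TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank 1 received %f\n", buf[3]);
    }
    MPI_Finalize();
    return 0;
}
```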
Data parallel communication Model
map(function, collection)
Synchronization is implicit at the end of the `map` (`map` returns when the function has been applied to all elements of the collection), like `foreach` in ISPC
Stream: sequence of elements (can be processed independently)
Kernel: side-effect-free functions
benefits
- global-scale program dependencies known by compiler (enables compiler to perform aggressive optimizations that require global program analysis)
- Independent processing on elements, kernel functions are side-effect free
- parallelize kernel execution
- can’t be non-deterministic
- I/O of each invocation known in advance: prefetching hiding latency
- Producer-consumer dependencies are known in advance: pipelined execution (save bandwidth)
drawbacks
- need library of operators to describe complex data flows
gather/scatter
```
const int N = 1024;
stream<float> input(N);
stream<int> indices;
stream<float> tmp_input(N);
stream<float> output(N);

stream_gather(input, indices, tmp_input);
absolute_value(tmp_input, output);

export void absolute_value(
    uniform float N,
    uniform float* input,
    uniform float* output,
    uniform int* indices)
{
    foreach (i = 0 ... n) {
        float tmp = input[indices[i]];
        if (tmp < 0)
            output[i] = -tmp;
        else
            output[i] = tmp;
    }
}
```
```
const int N = 1024;
stream<float> input(N);
stream<int> indices;
stream<float> tmp_output(N);
stream<float> output(N);

absolute_value(input, tmp_output);
stream_scatter(tmp_output, indices, output);

export void absolute_value(
    uniform float N,
    uniform float* input,
    uniform float* output,
    uniform int* indices)
{
    foreach (i = 0 ... n) {
        if (input[i] < 0)
            output[indices[i]] = -input[i];
        else
            output[indices[i]] = input[i];
    }
}
```
impose rigid program structure to facilitate simple programming and advanced optimization
ISPC, OpenCL, CUDA
- flexibility/familiarity of imperative C-style syntax over the syntax of a more functional form: for adoption
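As an illustration of the `map(function, collection)` abstraction, a sequential C++ stand-in using `std::transform` (assumed example; a real data-parallel system would apply the side-effect-free kernel to elements in parallel and synchronize before returning):

```
#include <algorithm>
#include <vector>

float foo(float x) { return 2.0f * x + 1.0f; }   // side-effect-free "kernel"

std::vector<float> apply_map(const std::vector<float>& A) {
    std::vector<float> B(A.size());
    std::transform(A.begin(), A.end(), B.begin(), foo);  // map foo over A
    return B;   // returns only after foo has been applied to all elements
}
```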
Mixed programming Model
- shared address space model within a multi-core node of a cluster + message passing between nodes
- data-parallel model support shared memory style synchronization primitives in kernels
- permit limited forms of inter-iteration communications
Parallel Programming Basics
- Decomposition: Break up problem into tasks that can be carried out in parallel
- identify dependencies
- static & dynamic
- E.g. loop iteration
- Assignment: Assigning tasks to workers
- balance workload
- reduce communication costs
- static & dynamic
- worker thread pool
- E.g. interleaved assignment, ISPC `foreach`, `pthread`
- Orchestration
- structuring communication
- adding synchronization to preserve dependencies
- organizing data structure in memory
- scheduling tasks
- reduce cost of communication/sync
- preserve locality of data reference
- reduce overhead
- Mapping
- `pthread` -> HW execution context on a CPU core (by the OS)
- ISPC program instance -> vector instruction lane (by the compiler)
- CUDA thread block -> GPU cores (by HW)
- place related threads (cooperating threads) on same processor
- maximize locality, data sharing
- minimize cost of comm/sync
- place unrelated threads on same processor
- hide latency
- Grid solver
| | Data-parallel | Shared address space | Message passing |
|---|---|---|---|
| Synchronization | Single logical thread of control; parallelized forall loop | Mutual exclusion (via locks/atomics); express dependencies (via barriers) | Sending & receiving messages |
| Communication | Implicit loads/stores; special built-in primitives for complex communication | Implicit loads/stores to shared variables | Sending & receiving messages (bulk transfer) |
GPU Architecture & CUDA Programming
Real-time graphics pipeline
shaders
vertex processing
fragment processing
```
uniform sampler2D myTexture;
uniform float3 lightDir;
varying vec3 norm;
varying vec2 uv;

void myFragmentShader() {
    vec3 kd = texture2D(myTexture, uv);
    kd *= clamp(dot(lightDir, norm), 0.0, 1.0);
    return vec4(kd, 1.0);
}
```
CUDA: “C-like” language to express GPU interface
Syntax
```
const int Nx = 12;
const int Ny = 6;

dim3 threadsPerBlock(4, 3, 1);
dim3 numBlocks(Nx/threadsPerBlock.x, Ny/threadsPerBlock.y, 1);

// assume A, B, C are allocated Nx x Ny float arrays

// this call will trigger execution of 72 CUDA threads:
// 6 thread blocks of 12 threads each
matrixAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);

__device__ float doubleValue(float x) {
    return 2 * x;
}

// kernel definition
__global__ void matrixAddDoubleB(float A[Ny][Nx], float B[Ny][Nx], float C[Ny][Nx]) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    C[j][i] = A[j][i] + doubleValue(B[j][i]);
}
```
host: serial execution
CUDA device: SPMD execution
- `__global__`: marks a CUDA kernel function that runs on the GPU
- `threadIdx`: position of the thread within its block
- `blockIdx`: position of the block within the grid

```
float* A = new float[N];       // allocate buffer in host mem

// populate host address space pointer A
for (int i=0; i<N; i++)
    A[i] = (float)i;

int bytes = sizeof(float) * N;
float* deviceA;                // allocate buffer in
cudaMalloc(&deviceA, bytes);   // device address space

// populate deviceA
cudaMemcpy(deviceA, A, bytes, cudaMemcpyHostToDevice);

// note: directly accessing deviceA[i] is an invalid operation here
// (cannot manipulate contents of deviceA directly from host, only from
// device code, since deviceA is not a pointer into the host's address space)
```
- `cudaMemcpy`: copy memory between the host and device address spaces
1D Convolution
```
#define THREADS_PER_BLK 128

int N = 1024 * 1024;
cudaMalloc(&devInput, sizeof(float) * (N+2));   // allocate input array in device memory
cudaMalloc(&devOutput, sizeof(float) * N);      // allocate output array in device memory

// properly initialize contents of devInput here ...

convolve<<<N/THREADS_PER_BLK, THREADS_PER_BLK>>>(N, devInput, devOutput);

__global__ void convolve(int N, float* input, float* output) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;  // thread-local variable
    float result = 0.0f;                                // thread-local variable
    for (int i=0; i<3; i++)
        result += input[index + i];
    output[index] = result / 3.f;
}
```
```
__global__ void convolve(int N, float* input, float* output) {
    __shared__ float support[THREADS_PER_BLK+2];        // per-block allocation
    int index = blockIdx.x * blockDim.x + threadIdx.x;  // thread-local variable

    support[threadIdx.x] = input[index];
    if (threadIdx.x < 2) {
        support[THREADS_PER_BLK + threadIdx.x] = input[index+THREADS_PER_BLK];
    }

    __syncthreads();

    float result = 0.0f;   // thread-local variable
    for (int i=0; i<3; i++)
        result += support[threadIdx.x + i];
    output[index] = result / 3.f;
}
```
- `__shared__`: per-block shared memory
- `__syncthreads()`: barrier; wait for all threads in the block to arrive
- Each thread block must execute 128 CUDA threads
- Each thread block requires `130 x sizeof(float) = 520` bytes of shared memory
Execution: thread hierarchy
- thread - warp - thread block - streaming multiprocessor (SM, core)
- thread block execution can be carried out in any order
- dynamic scheduling
Distributed address space
- built-in memcpy primitives between host & device
- per-thread / per-block (`__shared__`) / per-program (`__global__`) / constant (`__constant__`) memory
SM core operation each clock
- Select up to four runnable warps from 64 resident on SM core (thread-level parallelism)
- Select up to two runnable instructions per warp (instruction-level parallelism)
Warp: CUDA implementation detail
- On modern NVIDIA hardware, groups of 32 CUDA threads in a thread block are executed simultaneously using 32-wide SIMD execution
- share same instruction stream
- The group of 32 threads sharing an instruction stream is called a warp
Running the CUDA kernel
- Step 1: host sends CUDA device (GPU) a command (“execute this kernel”)
- Step 2: scheduler maps block 0 to core 0 (reserves execution contexts for 128 threads and 520 bytes of shared storage)
- Step 3: scheduler continues to map blocks to available execution contexts
- Step 4: core completes block, reschedule
Question: Explain the statement: The GTX 1080 SM combines the ideas of interleaved multi-threading, simultaneous multi-threading, SIMD execution, and super-scalar execution.
The GTX 1080 SM combines the ideas of interleaved multi-threading, simultaneous multi-threading, SIMD execution, and super-scalar execution in the following manner:
- Interleaved multi-threading: since up to 64 warps can be active at any time, multiple threads are interleaved depending on which 4 warps are chosen each cycle.
- Simultaneous multi-threading: each SM has four warp selectors, and each warp selector can choose to issue instructions from a different thread.
- SIMD execution: the SIMD functional units in each SM perform SIMD operations.
- Super-scalar execution: each cycle, a warp selector can issue up to 2 instructions from a warp, supporting super-scalar execution.
- Up to 64*32=2048 CUDA threads can be active on the SM at a time
- But only 4*32=128 CUDA threads are executed simultaneously.
- Groups of 32 CUDA threads share an instruction stream. (and are executed in a SIMD manner)
- The SM may choose to execute 2 instructions from a CUDA thread at once when an appropriate mixture is available. (And of course each of those two instructions are executed in a SIMD manner as described above.)
Modern GPUs, to support features like virtualization and CUDA dynamic parallelism, are beginning to support limited forms of software-managed thread pre-emption, but we will not talk about it in this class.
pinned memory
atomic op on different memory (global, L1, shared)
pipelined device overlap (stream)
memory mapping (DMA)
data representation (im2col, CSR, ELL, JDS…)
reduction algorithm (winograd, combination tree, Kogge-stone, Brent Kung, …)
unrolling (cache coherence, memory coalescing)
shared/constant/restrict/global memory
synchronization (host, device, thread, block, memory barrier)
warp SIMD (coalescing, adjacent thread adjacent memory)
- branch divergence (granularity % warp)
cache (hierarchy, size, scratchpad, TSM)
SM thread num / memory limit
memory bank conflict
non-cachable load
DRAM burst (detail of memory coalescing)
multiple-gpu
Performance Optimization I: Work Distribution & Scheduling
Balancing the workload
- low computational & synchronization overhead
- Static assignment: pre-determined (usually at compile time, but may also be decided at runtime)
- when cost & amount of work is predictable
- Semi-static assignment: periodically profile & re-adjust
- cost of work is predictable for near-term future
- Dynamic assignment: runtime
- work queue: tasks - shared work queue - worker threads
- choosing task granularity
- small granularity: useful to have many more tasks than processors (many small tasks enable good workload balance via dynamic assignment)
- large granularity: want as few tasks as possible to minimize the overhead of managing the assignment
- task scheduling
- multiple queues with work stealing
- task dependencies managed by the scheduler
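A minimal sketch of dynamic assignment via a shared work queue (assumed example): worker threads repeatedly pull the next task, so faster workers automatically take on more tasks and the load balances at run time.

```
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

std::queue<std::function<void()>> work_queue;   // tasks are pushed here before run_tasks()
std::mutex queue_mutex;

void worker_loop() {
    while (true) {
        std::function<void()> task;
        {
            std::lock_guard<std::mutex> lock(queue_mutex);
            if (work_queue.empty()) return;     // no work left for this worker
            task = std::move(work_queue.front());
            work_queue.pop();
        }
        task();                                 // run the task outside the lock
    }
}

void run_tasks(int num_workers) {
    std::vector<std::thread> workers;
    for (int i = 0; i < num_workers; i++) workers.emplace_back(worker_loop);
    for (auto& t : workers) t.join();
}
```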
Common parallel programming patterns
data parallelism
```
// ISPC foreach
foreach (i=0 ... N) {
    B[i] = foo(A[i]);
}

// ISPC bulk task launch
launch[numTasks] myFooTask(A, B);

// bulk CUDA thread launch
foo<<<numBlocks, threadsPerBlock>>>(A, B);

// using higher-order function 'map'
map(foo, A, B);

// openMP parallel for
#pragma omp parallel for
for (int i=0; i<N; i++) {
    B[i] = foo(A[i]);
}
```
explicit management via threads
fork/join pattern - divide & conquer
Cilk Plus
- `cilk_spawn foo(args)`: invoke foo, but unlike a standard function call, the caller may continue executing asynchronously with the execution of foo
- hides the implementation, like `std::async`
- run continuation first (child stealing): caller thread spawns work for all iterations before executing any of it (BFS)
- run child first (continuation stealing): caller thread only creates one item to steal (a continuation that represents all remaining iterations) (DFS)
- `cilk_sync`: returns when all calls spawned by the current function have completed (“sync up” with the spawned calls)
- stall join
- greedy policy: when the thread that initiates the fork goes idle, it looks to steal new work; the last thread to reach the join point continues execution after the sync
- all threads always attempt to steal if there is nothing to do (a thread only goes idle if no work to steal is present in the system)
- the worker thread that initiated the spawn may not be the thread that executes the logic after `cilk_sync`
```
void quick_sort(int* begin, int* end) {
    if (begin >= end - PARALLEL_CUTOFF)
        std::sort(begin, end);
    else {
        int* middle = partition(begin, end);
        cilk_spawn quick_sort(begin, middle);
        quick_sort(middle+1, end);
    }
}
```
heavyweight spawn?
context switching overhead
larger working set, less cache locality
-> using worker thread pool + work queue per thread
work stealing: deque per worker
- thieves steal work from the top of the deque
- steals the largest amount of work
- maximum locality (combined with run-child-first)
- thieves don’t contend with the owner for the same elements of the deque (enables an efficient lock-free implementation)
Performance Optimization II: Locality, Communication & Contention
Message passing grid solver
```
int N;
int tid = get_thread_id();
int rows_per_thread = N / get_num_threads();

float* localA = allocate(rows_per_thread+2, N+2);

// assume localA is initialized with starting values
// assume MSG_ID_ROW, MSG_ID_DONE, MSG_ID_DIFF are constants used as msg ids

//////////////////////////////////////

void solve() {
    bool done = false;
    while (!done) {
        float my_diff = 0.0f;

        if (tid != 0)
            send(&localA[1,0], sizeof(float)*(N+2), tid-1, MSG_ID_ROW);
        if (tid != get_num_threads()-1)
            send(&localA[rows_per_thread,0], sizeof(float)*(N+2), tid+1, MSG_ID_ROW);
        if (tid != 0)
            recv(&localA[0,0], sizeof(float)*(N+2), tid-1, MSG_ID_ROW);
        if (tid != get_num_threads()-1)
            recv(&localA[rows_per_thread+1,0], sizeof(float)*(N+2), tid+1, MSG_ID_ROW);

        for (int i=1; i<rows_per_thread+1; i++) {
            for (int j=1; j<N+1; j++) {
                float prev = localA[i,j];
                localA[i,j] = 0.2 * (localA[i-1,j] + localA[i,j] + localA[i+1,j] +
                                     localA[i,j-1] + localA[i,j+1]);
                my_diff += fabs(localA[i,j] - prev);
            }
        }

        if (tid != 0) {
            send(&my_diff, sizeof(float), 0, MSG_ID_DIFF);
            recv(&done, sizeof(bool), 0, MSG_ID_DONE);
        } else {
            float remote_diff;
            for (int i=1; i<get_num_threads()-1; i++) {
                recv(&remote_diff, sizeof(float), i, MSG_ID_DIFF);
                my_diff += remote_diff;
            }
            if (my_diff/(N*N) < TOLERANCE)
                done = true;
            for (int i=1; i<get_num_threads()-1; i++)
                send(&done, sizeof(bool), i, MSG_ID_DONE);
        }
    }
}
```
synchronous & blocking -> deadlock!
- `send()`: call returns when the sender receives acknowledgement that the message data resides in the address space of the receiver
- `recv()`: call returns when data from the received message is copied into the address space of the receiver and an acknowledgement is sent back to the sender
asynchronous & non-blocking
- `send()`: call returns immediately; the buffer provided to `send()` cannot be modified by the calling thread since message processing occurs concurrently with thread execution
- `recv()`: posts intent to receive in the future, returns immediately
- use `checksend()`, `checkrecv()` to determine the actual status of the send/receipt
Pipelining
Communication as extended memory hierarchy
- inherent communication: must occur in a parallel algorithm
- good assignment decision
- artifactual communication: all other communication (from practical details of system)
- minimum data transfer granularity
- unnecessary communication due to system transfer granularity (e.g., a whole cache line or OS page is moved for a few bytes)
- poor placement of data in distributed memory (reside far away)
- finite replication capacity
comm-to-comp ratio = communication / computation = 1 / arithmetic intensity
- Cache miss
- Cold miss: First time data touched. Unavoidable in a sequential program
- Capacity miss: Working set is larger than cache. Can be avoided/reduced by increasing cache size
- Conflict miss: Miss induced by cache management policy. Can be avoided/reduced by changing cache associativity, or data access pattern in application
- Communication miss: Due to inherent or artifactual communication in parallel system
- Reduce communication
- temporal locality
- blocked iteration order (avoid cache ping-pong)
- fusing loops
- combine multiple loops to reduce I/O
- may cause conflict miss
- may use FMA operations
- sharing data
- co-locate tasks operating same data
- E.g. CUDA thread block
- spatial locality
- granularity of data transfer
- blocked data layout
- E.g. im2col, CSR/CSC (compressed sparse row/column)
- blocked data layout
- granularity of cache coherence
- Reduce overhead of communication to sender/receiver
- coalescing small messages
- Reduce latency of communication
- restructure code (SW), improve arch (HW)
- Reduce contention
- Replicate contended resources (e.g., local copies, fine-grained locks)
- Stagger access to contended resources
- Increase communication/computation overlap
- Application writer: use asynchronous communication (e.g., async messages)
- HW implementer: pipelining, multi-threading, pre-fetching, out-of-order exec
Contention
memory contention in CUDA
```
__shared__ float A[512];
int index = threadIdx.x;

float x2 = A[index];        // single cycle
float x3 = A[3 * index];    // single cycle
float x4 = A[16 * index];   // 16 cycles
```
The first two loads take only one clock cycle because each index falls on a different bank of shared memory, so all 32 values can be read at once. The last one, however, maps half of the threads to one bank and half to another, so each of the two banks must be read 16 times to get the 32 values, since a bank can only deliver one value per clock.
- why a power-of-two stride is sometimes bad: 1. cache associativity (conflict misses) 2. memory bank contention
N-body: particles within cells in a grid
- parallelize by cell (insufficient parallelism)
- parallelize over particles (massive contention for single shared data)
- finer-granularity locks (lock per cell)
- partial results + merge (extra work/memory footprint)
- data-parallel
- Step 1: compute cell containing each particle (parallel over input particles)
- Step 2: sort results by cell (particle index array permuted based on sort)
- Step 3: find start/end of each cell (parallel over particle_index elements)
Parallel Application Case Studies
Ocean Currents Simulation (grid-based solver)
- Decomposition: Spatial partitioning of grid: each processor receives 2D tile of grid
- Assignment: Static assignment of tiles to processors
- Synchronization: Barriers (each pass over the grid is a separate phase of computation) + locks for mutual exclusion when updating shared variables (atomic update of ‘diff’)
- Working Set
- Local neighborhood for cell
- Three rows of a processor’s local partition of grid
- Processor’s local partition of grid
- Observation
- Static assignment is sufficient (approximately equal busy time per thread)
- 4D blocking (block-major) of grid reduces time spent on communication
- Synchronization cost is largely due to waiting at barriers
Galaxy evolution
Barnes-Hut Algorithm
```
for each time step in simulation:
    build tree structure
    compute (aggregate mass, center-of-mass) for interior nodes
    for each particle:
        traverse tree to accumulate gravitational forces
        update particle position based on gravitational forces
```
amount of work is non-uniform
comm pattern is non-uniform
Assignment: semi-static assignment
- Each time step, for each body, record the number of interactions with other bodies (the application profiles itself)
- cost zones
- Each processor performs a depth-first (post-order) traversal of the tree
- Processor $P_i$ is responsible for processing the bodies corresponding to the work range $[iW/P, (i+1)W/P)$
Working sets
- data needed to compute forces between body-body (or body-node) pairs
- data encountered in an entire tree traversal
Data Distribution: static distribution
- Work assignment changes with time
- Data accessed at fine granularity (a single tree node)
- high temporal locality
- Unlike OCEAN, data distribution in Barnes-Hut does not significantly impact performance
Data-parallel scan
scan_inclusive(⊕, A) = [a0, a0⊕a1, a0⊕a1⊕a2, ...]
scan_exclusive(⊕, A) = [I, a0, a0⊕a1, ...]
- [[Check: doubling, `par_seq`]]
```
// Up-sweep
for d = 0 to (log2 n - 1) do
    forall k = 0 to n-1 by 2^(d+1) do
        a[k + 2^(d+1) - 1] = a[k + 2^d - 1] + a[k + 2^(d+1) - 1]

// Down-sweep
a[n-1] = 0
for d = (log2 n - 1) down to 0 do
    forall k = 0 to n-1 by 2^(d+1) do
        tmp = a[k + 2^d - 1]
        a[k + 2^d - 1] = a[k + 2^(d+1) - 1]
        a[k + 2^(d+1) - 1] = tmp + a[k + 2^(d+1) - 1]
```
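A sequential C++ sketch of the work-efficient up-sweep/down-sweep exclusive scan above (assumes the input length is a power of two; each inner loop corresponds to a `forall` that a parallel implementation would execute concurrently):

```
#include <cstddef>
#include <vector>

void exclusive_scan(std::vector<int>& a) {       // in-place, '+' as the operator
    const size_t n = a.size();
    // up-sweep (reduce): d plays the role of 2^d in the pseudocode
    for (size_t d = 1; d < n; d *= 2)
        for (size_t k = 0; k + 2*d - 1 < n; k += 2*d)
            a[k + 2*d - 1] += a[k + d - 1];
    // down-sweep
    a[n - 1] = 0;
    for (size_t d = n / 2; d >= 1; d /= 2) {
        for (size_t k = 0; k + 2*d - 1 < n; k += 2*d) {
            int tmp = a[k + d - 1];
            a[k + d - 1] = a[k + 2*d - 1];
            a[k + 2*d - 1] += tmp;
        }
    }
}
// exclusive_scan on {1, 2, 3, 4} yields {0, 1, 3, 6}
```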
```
__device__ void scan_block(volatile int *ptr, const unsigned int idx) {
    const unsigned int lane = idx & 31;        // index of thread in warp (0..31)
    const unsigned int warp_id = idx >> 5;     // warp index in block

    int val = scan_warp(ptr, idx);             // Step 1. per-warp partial scan
                                               // (performed by all threads in block,
                                               // with threads in same warp communicating
                                               // through shared memory buffer 'ptr')

    if (lane == 31) ptr[warp_id] = ptr[idx];   // Step 2. thread 31 in each warp copies
    __syncthreads();                           // partial-scan bases into per-block shared mem

    if (warp_id == 0) scan_warp(ptr, idx);     // Step 3. scan to accumulate bases
    __syncthreads();                           // (only performed by warp 0)

    if (warp_id > 0)                           // Step 4. apply bases to all elements
        val = val + ptr[warp_id-1];            // (performed by all threads in block)
    __syncthreads();

    ptr[idx] = val;
}
```
Segmented scan
Simultaneously perform scans on arbitrary contiguous partitions of input collection
```
// Up-sweep
for d = 0 to (log2 n - 1) do:
    forall k = 0 to n-1 by 2^(d+1) do:
        if flag[k + 2^(d+1) - 1] == 0:
            data[k + 2^(d+1) - 1] = data[k + 2^d - 1] + data[k + 2^(d+1) - 1]
        flag[k + 2^(d+1) - 1] = flag[k + 2^d - 1] || flag[k + 2^(d+1) - 1]

// Down-sweep
data[n-1] = 0
for d = (log2 n - 1) down to 0 do:
    forall k = 0 to n-1 by 2^(d+1) do:
        tmp = data[k + 2^d - 1]
        data[k + 2^d - 1] = data[k + 2^(d+1) - 1]
        if flag_original[k + 2^d] == 1:        # must maintain copy of original flags
            data[k + 2^(d+1) - 1] = 0          # start of segment
        else if flag[k + 2^d - 1] == 1:
            data[k + 2^(d+1) - 1] = tmp
        else:
            data[k + 2^(d+1) - 1] = tmp + data[k + 2^(d+1) - 1]
        flag[k + 2^d - 1] = 0
```
Parallel Ray Tracing
- Given a “ray”, find the closest intersection with scene geometry
- use hierarchical grouping adapts to non-uniform density of scene objects
- use packets: amortize data fetching & work, unsuitable for incoherent rays
- ray reordering: when packet utilization drops below threshold, resort rays and continue with smaller packet
- Use large packets for eye/reflection/point-light shadow rays or higher levels of the BVH
- ray coherence always high at the top of the tree
- Switch to single ray (intra-ray SIMD) when packet utilization drops below threshold
- For wide SIMD machine, a branching-factor-4 BVH works well for both packet traversal and single ray traversal
- Can use packet reordering to postpone time of switch
- Reordering allows packets to provide benefit deeper into the tree
- Not often used in practice due to high implementation complexity
Workload-Driven Performance Evaluation
- [[Check: AutoFDO, PGO]]
- Application-oriented scaling (last lecture)
- Resource-oriented scaling (this lecture)
- Problem-constrained scaling
- use a parallel computer to solve the same problem faster
- Time-constrained scaling
- completing more work in a fixed amount of time
- E.g. rendering, telescope, finance, websites, robotics
- Memory-constrained scaling
- run the largest problem possible without overflowing main memory
- E.g. N-body, machine learning, memcached
| | PC | TC | MC |
|---|---|---|---|
| Scaled grid size | $\sqrt{N^2 / P}$ | $K$: assume linear speedup, $K^3/P = N^3$, so $K = NP^{1/3}$ | $NP^{1/2}$ |
| Exec time | $1/P$ | fixed at $O(N^3)$ (by definition) | $O((NP^{1/2})^3 / P) = O(P^{1/2})$ |
| Elements per processor | $N^2/P$ | $K^2/P = N^2 / P^{1/3}$ | fixed at $N^2$ (by definition) |
| Available concurrency | fixed at $P$ | | |
| Comm-to-comp ratio | $O(P^{1/2})$ (2D-blocked) | comm per processor $K / P^{1/2} = O(1/P^{1/6})$; ratio $O(P^{1/6})$ | fixed at $1/N$ |
- Simulator
- Execution-driven
- Trace-driven
- Instrument real code running on real machine to record a trace of all memory accesses
- Intel PIN
- Execution-driven
- Problem size = $(n, \Theta, \Delta t, T)$ (grid size, accuracy, time step, total time to simulate)
- Scaling down
- preserve ratio of time spent in different program phases
- Ray-trace and Barnes-Hut: both have tree build and tree traverse phases
- preserve important behavioral characteristics
- arithmetic intensity, load balance, locality, working set sizes
- preserve contention & communication patterns
- contention is a function of timing and ratios
- preserve scaling relationship between problem parameters
- Barnes-Hut: scaling up particle count requires scaling down time step for physics reasons
- preserve ratio of time spent in different program phases
- Roofline model
- Establish high watermarks
- Add “math” (non-memory instructions)
- Does execution time increase linearly with operation count as math is added? (If so, this is evidence that code is instruction-rate limited)
- Remove almost all math, but load same data
- How much does execution-time decrease? If not much, suspect memory bottleneck
- Change all array accesses to A[0]
- How much faster does your code get? (This establishes an upper bound on the benefit of improving locality of data access)
- Remove all atomic operations or locks
- How much faster does your code get? (provided it still does approximately the same amount of work) (This establishes an upper bound on the benefit of reducing sync overhead.)
- Add “math” (non-memory instructions)
- Profiler
- Intel’s Performance Counter Monitor Tool
- Intel VTune, PAPI, oprofile
- performance counter: instructions completed, clock ticks, L2/L3 cache hits/misses, bytes read from memory controller, etc.
Snooping-Based Cache Coherence
cache coherence: Reading a value at address X should return the last value written to address X by any processor
- device I/O via direct memory access (DMA) -> mark page table uncachable [[Check: what every programmer should know about memory]]
memory coherence: both global storage (main memory) and per-processor local storage (processor caches) implementing the abstraction of a single shared address space
- for each memory location, there is a hypothetical serial order of all program operations (executed by all processors) to the location that is consistent with the results of execution
- Memory operations issued by any one processor occur in the order issued by the processor
- The value returned by a read is the value written by the last write to the location as given by the serial order
- A memory system is coherent if
- A read by processor P to address X that follows a write by P to address X, should return the value of the write by P (assuming no other processor wrote to X in between)
- obey program order
- A read by processor P1 to address X that follows a write by processor P2 to X returns the written value… if the read and write are “sufficiently separated” in time (assuming no other write to X occurs in between)
- “write propagation”: the write eventually becomes visible to other processors (cf. eventual consistency)
- “Write serialization”: writes to the same address are serialized: two writes to address X by any two processors are observed in the same order by all processors. (Example: if values 1 and then 2 are written to address X, no processor observes X having value 2 before value 1)
- Implementation
- Software: use page faults to propagate coherence information -> used on clusters of workstations
- Hardware: snooping/directory-based
Snooping cache coherence schemes: broadcast a notification to all other cache controllers (decentralized)
shared cache
limited scaling
write-through invalidation
write-back invalidation
- A line in the “exclusive” state can be modified without notifying the other caches
- Processor can only write to lines in the exclusive state
- MSI (see the toy state-transition sketch after this list)
- Write propagation: achieved via a combination of invalidation on BusRdX and flush from the M state on a subsequent BusRd/BusRdX from another processor
- Write serialization
- Writes that appear on interconnect are ordered by the order they appear on interconnect (BusRdX)
- Reads that appear on interconnect are ordered by order they appear on interconnect (BusRd)
- Writes that don’t appear on the interconnect (PrWr to line already in M state)
- Sequence of writes to line comes between two interconnect transactions for the line
- All writes in sequence performed by same processor, P (that processor certainly observes them in correct sequential order)
- All other processors observe notification of these writes only after an interconnect transaction for the line, so all the writes come before the transaction
- -> all processors see writes in the same order
- MESI: optimize reading-then-writing (I ->(BusRd) S ->(BusRdX) M)
- add “exclusive clean” E (not-modified, but exclusive)
- MESIF: Intel 5-stage
- MESI + F(Forward) by one cache
- Cache with line in F state services miss
- Simplifies the decision of which cache should service the miss (basic MESI: all caches respond)
- complex, for clusters
- MOESI: AMD Opteron 5-stage
- In the MESI protocol, a transition from M to S requires a flush to memory
- Instead, transition from M to O (O = ”owned, but not exclusive”) and do not flush to memory
- Other processors maintain shared line in S state, one processor maintains line in O state
- Data in memory is stale, so cache with line in O state must service cache misses
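The toy MSI state-transition sketch referenced above (assumed illustration; it tracks a single line in a single cache and omits the data-movement/flush side of the protocol):

```
enum class LineState { Invalid, Shared, Modified };

// Transition on a local processor read or write (PrRd / PrWr)
LineState on_processor_access(LineState s, bool is_write) {
    if (is_write) return LineState::Modified;                  // PrWr: issue BusRdX unless already M
    return (s == LineState::Invalid) ? LineState::Shared : s;  // PrRd: issue BusRd if Invalid
}

// Transition on a snooped bus transaction from another processor
LineState on_snoop(LineState s, bool other_wants_exclusive) {
    if (other_wants_exclusive) return LineState::Invalid;      // BusRdX: invalidate (flush if Modified)
    if (s == LineState::Modified) return LineState::Shared;    // BusRd: flush data, downgrade M -> S
    return s;
}
```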
write-back update
Dragon
Exclusive-clean (E): only one cache has line, memory up-to-date
Shared-clean (SC): multiple caches may have the line, and memory may or may not be up to date
Shared-modified (SM): multiple caches may have the line, memory not up to date
- Only one cache can be in this state for a given line (but others can be in SC)
- Cache with line in SM state is “owner” of data. Must update memory upon eviction
Modified (M): only one cache has line, it is dirty, memory is not up to date
- Cache is owner of data. Must update memory upon replacement
no point in time any processor will have an invalid value
low miss rate
suffer from high traffic due to multiple writes before the next read by another processor
Inclusion
- All lines in closer [to processor] cache are also in farther [from processor] cache
- If line is in owned state (M in MSI/MESI) in L1, it must also be in owned state in L2
- When line X is invalidated in L2 cache due to BusRdX from another cache, must also invalidate line X in L1
- each L2 line contains an additional state bit indicating if line data in L1
- propagation
- Assume L1 is a write-back cache. The processor writes to line X (L1 write hit). Line X in the L2 cache is in the modified state in the coherence protocol, but it holds stale data. When the coherence protocol requires X to be flushed from L2 (e.g., another processor loads X), the L2 cache must request the up-to-date data from L1.
- each L2 line contains an additional state bit (“modified-but-stale”)
- flushing requires getting the up-to-date data from L1 first
- GPU: write-through, no coherence
False sharing: Condition where two processors write to different addresses, but addresses map to the same cache line
- Cache line “ping-pongs” between the caches of the writing processors, generating significant amounts of communication due to the coherence protocol [[Note: true sharing could also lead to this]]
- Are cache-line-ping-pong and false sharing the same?
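A minimal sketch of false sharing and the usual padding fix (assumed example; a 64-byte cache line is assumed):

```
#include <thread>

struct Counters {          // 'a' and 'b' share one cache line:
    long a;                // written by thread 0
    long b;                // written by thread 1 -> line ping-pongs between caches
};

struct PaddedCounters {    // fix: give each hot variable its own cache line
    alignas(64) long a;
    alignas(64) long b;
};

void demo() {
    PaddedCounters c{};
    std::thread t0([&c] { for (int i = 0; i < 1000000; i++) c.a++; });
    std::thread t1([&c] { for (int i = 0; i < 1000000; i++) c.b++; });
    t0.join();
    t1.join();
}
```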
the abstraction of a single shared address space does not imply implementation by a single storage unit
Directory-Based Cache Coherence
cc-NUMA: cache-coherent, non-uniform memory access
Distributed Shared Memory system (DSM): cache coherent, shared address space, physically distributed memories
Scalable cache coherence
hierarchical snooping
- simple to build
- root can be a bottleneck
- longer latency
- can’t apply to more general network topologies
directories: storing information about the status of the line in one place: a “directory” (centralized)
point-2-point messages instead of broadcast
- storage overhead: $O(PM)$
- increase cache line size (decrease $M$)
- group multiple processors into a single directory “node” (reduce $P$)
- hierarchical: use snooping among processors in a node, directory across nodes
- limited pointer schemes: storage for a limited number of pointers per directory entry should be sufficient (only need a list of the nodes holding a valid copy of the line!)
- overflow
- fallback to broadcast
- just don’t allow
- coarse vector fallback
- overflow
- sparse directories
- Directory at home node maintains pointer to only one node caching line (not a list of sharers)
- Pointer to next node in list is stored as extra information in the cache line (just like the line’s tag, dirty bits, etc)
- On read miss: add requesting node to head of list
- On write miss: propagate invalidations along list
- On evict: need to patch up list (linked list removal)
- Low memory storage overhead, additional directory storage is proportional to cache size, traffic on write is proportional to # of sharers
- Latency of a write is proportional to # of sharers, hard to implement
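A sketch of a full bit-vector directory entry for a hypothetical P = 64 node system (assumed illustration), which makes the $O(PM)$ storage overhead concrete: one such entry is needed per memory line.

```
#include <bitset>
#include <cstdint>

struct DirectoryEntry {
    std::bitset<64> presence;   // presence[i] == true: node i caches the line (P bits)
    bool dirty = false;         // some node holds the line in the modified state
    uint8_t owner = 0;          // meaningful only when dirty: which node owns the line
};
```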
distributed directory
home node: node with memory holding the corresponding data for the line
read miss
- read miss message sent to home node of the requested line
- home directory checks entry for line
- dirty bit OFF: respond with contents from memory, set presence[home] (line is cached by processor home)
- dirty bit ON: data must be sourced by another processor
- home node responds with message providing identity of line owner
- requesting node requests data from owner
- owner changes state to SHARED, responds to the requesting node
- owner also responds to the home node; home clears the dirty bit, updates presence bits, updates memory
write miss
- request -> home: write miss message
- home -> request: share ids + data
- request -> ids: invalidate
- ids -> request: acks
Intervention forwarding: read miss to dirty line (forwarded by home) [[Check: EECS470 hop]]
- request -> home: read miss message
- home -> owner: intervention read
- owner -> home: respond with data + dir revision
- home updates directory
- home -> request: data
Request forwarding: read miss to dirty line
request -> home: read miss message
home -> owner: request send data to requestor
owner -> home + request: data
iterative DNS -> recursive DNS
Access pattern
- “Mostly-read” objects: lots of sharers but writes are infrequent, so communicating with all of them on a write has minimal impact on performance (e.g., root node in Barnes-Hut)
- Migratory objects (one processor reads/writes for while, then another, etc.): very few sharers, count does not scale with number of processors
- Frequently read/written objects: frequent invalidations, but sharer count is low because count cannot build up in short time between invalidations (e.g, shared task queue)
- Low-contention locks: infrequent invalidations, so no performance problem
- High-contention locks: can be a challenge, because many readers present when lock released
Multi-socket Intel System
Xeon Phi (Knights Landing, KNL)
A Basic Snooping-Based Multi-Processor Implementation
- Deadlock: a state where a system has outstanding operations to complete, but no operation can make progress
- Mutual exclusion: only one processor can hold a given resource at once
- Hold and wait: processor must hold the resource while waiting for other resources needed to complete an operation
- No preemption: processors don’t give up resources until operation they wish to perform is complete
- Circular wait: waiting processors have mutual dependencies (a cycle exists in the resource dependency graph)
- Livelock: a state where a system is executing many operations, but no thread is making meaningful progress
- Starvation: State where a system is making overall progress, but some processes make no progress
- Transaction on an atomic bus -> abstraction, actually non-atomic
- Client is granted bus access (result of arbitration)
- Client places command on bus (may also place data on bus)
- Response to command by another bus client placed on bus
- Next client obtains bus access (arbitration)
- Multi-processor cache controller -> implementation
- contention: requests from both processor & bus
- cache duplicate tags
- multi-ported tag memory
- tags must sync -> update blocking
- Reporting snoop results
- Address
- Data
- Shared: OR of the results from all caches
- Dirty: OR of the results from all caches
- Snoop-pending: OR of result from all processors (0 indicates all processors have responded)
- Memory controller could immediately start accessing DRAM, but not respond (squelch response) if a snoop result from another cache indicates it has copy of most recent data
- Memory could assume one of the caches will service request until snoop results are valid (if snoop indicates no cache has data, then memory must respond)
- Handling cache line write-back
- evict -> write-back buffer
- load requested line
- later flush back
- A write commits when a read-exclusive transaction appears on bus and is acknowledged by all other caches
store x <- R0
- lookup line in cache
- arbitrate for bus
- place `BusRdX` on the bus / other processors snoop the request
- memory responds with data for the cache line containing `X`
- contents of `R0` written to the appropriate bytes of the cache line
- the order of transactions on the bus defines the global order of writes in the parallel program (write serialization requirement of coherence)
- Split-transaction bus
- request + response
- How to match requests with responses?
- How to handle conflicting requests on the bus?
- Flow control: how many requests can be outstanding at a time, and what should be done when buffers fill up?
- When are snoop results reported? During the request? or during the response?
- basic design
- up to 8 outstanding requests (system wide)
- responses may be reordered relative to requests
- flow control -> NACKs
- initialize a request
- request bus: cmd + address
- response bus: data + tag
- Step 1: Requestor asks for request bus access
- Step 2: Bus arbiter grants access, assigns transaction a tag
- Step 3: Requestor places command + address on the request bus
- [[Check: MSHR]]
- read miss
- phase 1
- Request arbitration: cache controllers present request for address to bus (many caches may be doing so in the same cycle)
- Request resolution: address bus arbiter grants access to one of the requestors. Request table entry allocated for request
- Caches perform snoop: look up tags, update cache state, etc. (memory operation commits)
- Caches acknowledge this snoop result is ready (or signal they could not complete snoop in time here (e.g., raise inhibit wire)
- phase 2
- Data response arbitration: responder presents intent to respond to request with tag T
- Data bus arbiter grants one responder bus access
- Original requestor signals readiness to receive response
- phase 3
- Responder places response data on data bus
- Caches present snoop result for request with the data
- Request table entry is freed
- phase 1
- pipelined transaction
- write backs and BusUpg transactions do not have a response component
- conflicting requests
- disallowing conflicting requests by maintaining a copy of the request table in each cache
- flow control
- Caches/memory have buffers for receiving data off the bus (write-back buffer, data buffer)
- If the buffer fills, client NACKs relevant requests or responses (NACK = negative acknowledgement)
- Triggers a later retry (wound-wait)
- queue in distributed system: to accommodate variable (unpredictable) rates of production and consumption
- Multi-level cache hierarchies
- fetch deadlock problem: cache must be able to service requests while waiting on response to its own request (hierarchies increase response delay)
- Sizing all buffers to accommodate the maximum number of outstanding requests on bus is one solution to avoiding deadlock
- separate request/response queues
- responses can be completed without generating further transactions
- While stalled attempting to send a request, cache must be able to service responses
- Responses will make progress
store x <- 10
- Virtual address to physical address conversion (TLB lookup)
- TLB miss
- TLB update (might involve OS [[Check: Software-handled TLB]])
- OS traverse page table to see physical memory mapping
- OS may need to swap in page to get the appropriate page table (load from disk to physical address)
- Cache lookup (tag check)
- Determine line not in cache (need to generate BusRdX)
- Arbitrate for bus
- Win bus, place address, command on bus
- All caches perform snoop (e.g., invalidate their local copies of the relevant line)
- Another cache or memory decides it must respond (let’s assume it’s memory)
- Memory request sent to memory controller
- Memory controller is itself a scheduler
- Memory controller checks active row in DRAM row buffer. (May need to activate new DRAM row. Let’s assume it does.)
- DRAM reads values into row buffer
- Memory arbitrates for data bus
- Memory wins bus
- Memory puts data on bus
- Requesting cache grabs data, updates cache line and tags, moves line into exclusive state
- Processor is notified data exists
- Instruction proceeds
Memory Consistency
Memory coherence defines requirements for the observed behavior of reads and writes to the same memory location
- The goal of cache coherence is to ensure that the memory system in a parallel computer behaves as if the caches were not there
Memory consistency defines the behavior of reads and writes to different locations (as observed by other processors)
- Coherence only guarantees that writes to address X will eventually propagate to other processors
- Consistency deals with when writes to X propagate to other processors, relative to reads and writes to other addresses
- defines the allowed behavior of loads and stores to different addresses in a parallel system
Memory operation ordering: w.r.t. program order
- W→R: write to X must commit before subsequent read from Y
- R→R: read from X must commit before subsequent read from Y
- R→W: read to X must commit before subsequent write to Y
- W→W: write to X must commit before subsequent write to Y
Sequentially Consistent (SC): all 4 memory operation orderings, serializable R/W
- A parallel system is sequentially consistent if the result of any parallel execution is the same as if all the memory operations were executed in some sequential order, and the memory operations of any one processor are executed in program order.
- program order
- atomicity: all processors see operations equally
Relaxing order
hiding latency: overlap memory access operations with other operations when they are independent.
sufficient logical order != program order
write buffering (store queue with store-to-load forwarding)
- relax W→R
Total store ordering (TSO): Intel
- relax W→R
- Processor P can read B before its write to A is seen by all processors (processor can move its own reads in front of its own writes) (relax program order)
- Reads by other processors cannot return new value of A until the write to A is observed by all processors (relax atomicity)
- once in L1, observed by anyone
- Safety net: RMW (`lock cmpxchg`), memory barriers
- `_mm_lfence` (“load fence”: wait for all loads to complete)
- prevents R→R, R→W reordering
- effectively a no-op on x86
- Does it make any sense to use the LFENCE instruction on x86/x86_64 processors?
- `_mm_sfence` (“store fence”: wait for all stores to complete)
- prevents W→W reordering
- Does Intel SFENCE have release semantics?
- `_mm_mfence` (“memory fence”: wait for all memory operations to complete) [[Check: seq-cst semantics]]
- equivalent to `lfence` + `sfence` + W→R ordering
- ordering is implicitly provided by `lock`-prefixed instructions
Processor consistency (PC)
relax W→R
Any processor can read new value of A before the write is observed by all processors
other processors can read A even if new value of A haven’t been propagated to all processors
Partial Store Ordering (PSO)
- relax W→R, W→W
- reorder write operations in write buffer (E.g. cache miss/hit, ready ops)
Weak Ordering (WO): ARM
Release Consistency (RC)
- relax all
- out-of-order execution of all R/W instructions [[Check: Load-Store Queue with Forwarding & Random Commit]]
- Processors support special synchronization operations
- Memory accesses before memory fence instruction must complete before the fence issues
- Memory accesses after fence cannot begin until fence instruction is complete
- `clflush` (strongly ordered), `clflushopt` (weakly ordered, can overlap)
Acquire/Release semantics
- Operation X with acquire semantics: prevent reordering of X with any load/store after X in program order
- Other processors see X’s effect before the effect of all subsequent operations
- e.g., `lock()`
- Operation X with release semantics: prevent reordering of X with any load/store before X in program order
- Other processors see effects of all prior operations before seeing effect of X
- e.g., `unlock()`
- acquire-and-release-semantics
- What does each memory_order mean?
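A message-passing sketch of acquire/release semantics with C++ `std::atomic` (assumed example): the release store cannot be reordered before the write of `payload`, and the acquire load cannot be reordered after the read of `payload`, so a consumer that observes `ready == true` is guaranteed to see `payload == 42`.

```
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                      // ordinary write
    ready.store(true, std::memory_order_release);      // "unlock"-like: publishes prior writes
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {}  // "lock"-like: later reads see prior writes
    assert(payload == 42);
}

int main() {
    std::thread p(producer), c(consumer);
    p.join();
    c.join();
}
```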
Data race
- synchronized program: SC result on non-SC system
- unsynchronized program
- Conflicting accesses not ordered by synchronization (e.g., a fence, operation with release/acquire semantics, barrier, etc.)
- data races: the output of the program depends on relative speed of processors (non-deterministic program results)
Eventual consistency
- Eventual consistency guarantees that if there are no other updates to X, A’s update will eventually be observed by all other nodes in the system (note: no guarantees on when, so updates to objects X and Y might propagate to different clients differently)
Scaling a Web Site
- simple parallel web server
- Parallelism: use all the server’s cores
- Latency hiding: hide long-latency disk read operations (by context switching between worker processes)
- Concurrency: many outstanding requests, want to service quick requests while long requests are in progress
- Footprint: don’t want too many threads so that aggregate working set of all threads causes thrashing
- Apache: dynamically managed size of worker pool (limit maximum to avoid thrashing)
- process instead of thread
- protection: crash-unrelated, non-thread safe libraries
- parent process periodically recycle workers, robustness to memory leaks
- dynamic web content
- “Scale out” to increase throughput
- Load balancing with persistence
- All requests associated with a session are directed to the same server (aka. session affinity, “sticky sessions”)
- Database contention
- Option 1: “scale up”: buy better hardware for database server, buy professional-grade DB that scales
- Option 2: Replicate data and parallelize reads. (extra storage, consistency)
- Option 3: partition data into shards across multiple database servers
- Inter-request parallelism
- Web traffic
- bursty
- elasticity instead of provisioning
- Site performance monitor detects high load
- Instantiates new web server instances
- Informs load balancer about presence of new servers
- Cache
  - Cache commonly accessed objects (`memcached`): reduce DB/web server load
  - Cache web server responses (pages, pieces): front-end cache
    - Varnish-Cache application accelerator
  - Caching using CDN
    - physical locality
Interconnection Networks
Network node: a network endpoint connected to a router/switch
Network interface: Connects nodes to the network
Switch/router: Connects a fixed number of input links to a fixed number of output links
Link: A bundle of wires carrying a signal
Topology: how switches are connected via links
Routing distance: Number of links (“hops”) along a route between two nodes
Diameter: the maximum routing distance
Average distance: average routing distance over all valid routes
Direct / Indirect: endpoints sit “inside/outside” the network
Bisection bandwidth: Cut network in half, sum bandwidth of all severed links
Blocking / non-Blocking: If connecting any pairing of idle nodes is possible, network is non-blocking
Bus interconnect
Crossbar interconnect (CCX)
- every node is connected to every other node
- Oracle multi-core processing (SPARC T2/T5)
Ring interconnect
- Core i7, IBM CELL Broadband Engine
Mesh interconnect
- Direct network
- Echoes locality in grid-based applications
- Tilera processor
- prototype Intel chips
Torus interconnect
- Characteristics of mesh topology are different based on whether node is near edge or middle of network (torus topology introduces new links to avoid this problem)
Trees interconnect
- planar, hierarchical
- good when traffic has locality (like mesh/torus)
Hypercube
SGI Origin
Multi-stage logarithmic
- Indirect network with multiple switches between terminals
- variations: Omega, Butterfly, Clos Network
| Topology | Advantages | Disadvantages |
|---|---|---|
| Bus | Simple design; cost-effective for a small # of nodes; easy to implement coherence (via snooping) | Contention: all nodes contend for the shared bus; limited bandwidth: all nodes communicate over the same wires (one communication at a time); high electrical load = low frequency, high power |
| Crossbar | $O(1)$ latency & high bandwidth; non-blocking | Not scalable: $O(N^2)$ switches; high cost; difficult to arbitrate at scale |
| Ring | Simple; cheap: $O(N)$ cost | High latency: $O(N)$; not scalable: constant bisection bandwidth |
| Mesh | $O(N)$ cost; average latency $O(\sqrt{N})$; easy to lay out on chip (fixed-length links); path diversity: many ways for a message to travel from one node to another | Non-uniform: performance depends on whether a node is near the edge or the middle; blocking |
| Torus | Still $O(N)$ cost; higher path diversity & bisection BW than mesh | Higher cost than 2D grid; higher complexity: hard to lay out on chip, unequal link lengths |
| Tree | Latency: $O(\log{N})$ | Needs higher-BW links near the root |
| Hypercube | Low latency: $O(\log{N})$ | Radix: $O(\log{N})$; number of links: $O(N \log{N})$ |
| Multi-stage logarithmic | Cost: $O(N\log{N})$; latency: $O(\log{N})$ | Indirect; blocking |
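A quick worked comparison (my own numbers) to make the table concrete, for $N = 64$ nodes: a 2D mesh has diameter $2(\sqrt{N}-1) = 14$ hops and a bisection of $\sqrt{N} = 8$ links, while a hypercube has diameter $\log_2 N = 6$ hops and a bisection of $N/2 = 32$ links; the hypercube pays for this with $O(\log N)$ links per node.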
- Routing: how a message gets from its source to its destination in the network
- static
- adaptive
- Buffering and flow control
- Circuit switching sets up a full path (acquires all resources) between sender and receiver prior to sending a message
- Establish route (reserve links) then send all data for message
- Higher bandwidth transmission (no per-packet link mgmt overhead)
- Does incur overhead to set up/tear down path
- Reserving links can result in low utilization
- High-granularity resource allocation: pre-allocate all resources (links across multiple switches) along the entire network path for a message (“set up a flow”)
- Cost
- Needs setup phase (“probe”) to set up the path (and to tear it down and release the resources when message complete)
- Lower link utilization. Transmission of two messages cannot share same link (even if some resources on a preallocated path are no longer utilized during a transmission)
- Benefits
- No contention during transmission due to preallocation, so no need for buffering
- Arbitrary message sizes (once path is set up, send data until done)
- Packet switching makes routing decisions per packet
- Route each packet individually (possibly over different network links)
- Opportunity to use link for a packet whenever link is idle
- Overhead due to dynamic switching logic during transmission
- No setup/tear down overhead
- Granularity of communication
- Message: Unit of transfer between network clients
- Packet: Unit of transfer for network
- Header + Payload/Body +Tail
- Flit (flow control digit): Unit of flow control in the network
- Contention
- buffering
- dropping
- reroute (deflection)
- Store-&-Forward (packet-based)
- Packet copied entirely into network switch before moving to next node
- Requires buffering for entire packet in each router
- High per-packet latency (latency = packet transmission time on link x network distance)
- Cut-through flow control (packet-based)
- Switch starts forwarding data on next link as soon as packet header is received (header determines how much link bandwidth packet requires + where to route)
- Reduce transmission latency
- If the output link is blocked (the head cannot be transmitted), transmission of the tail can continue anyway
  - the entire packet is then absorbed into a buffer in the switch, degenerating to store-and-forward behavior
- Requires buffering for the entire packet in each router
- Wormhole flow control (flit-based)
- head-of-line blocking
- virtual channel flow control
- Multiplex multiple operations over single physical channel
- Divide switch’s input buffer into multiple buffers sharing a single physical channel
- Reduces head-of-line blocking
- [[Check: Medium Access Control/Multiplexing]]
- Deadlock avoidance
- Prevent cycles by ensuring requests and responses use different virtual channels
- Prioritization of traffic classes
- Provide quality-of-service guarantees
- Current Research
- Energy efficiency of interconnections
- Interconnect can be energy intensive (~35% of total chip power in MIT RAW research processor)
- Bufferless networks
- turn on/off regions of network, use fast and slow networks
- Prioritization and quality-of-service guarantees
- Prioritize packets to improve multi-processor performance (e.g., some applications may be more sensitive to network performance than others)
- Throttle endpoints (e.g., cores) based on network feedback
- New/emerging technologies
- Die stacking (3D chips)
- Photonic networks-on-chip (use optical waveguides instead of wires)
- Reconfigurable devices (FPGAs): create custom interconnects tailored to the application (see CMU projects: CONNECT, CoRAM, Shrinkwrap)
Implementing Synchronization
Acquire method: How a thread attempts to gain access to protected resource
Waiting algorithm: How a thread waits for access to be granted to shared resource
- busy waiting (spinning)
- blocking
- When should one use a spinlock instead of mutex?
Release method: How thread enables other threads to gain resource when its work in the synchronized region is complete
Test-and-set based lock
`ts R0, mem[addr]`: load `mem[addr]` into `R0`; if `mem[addr]` is 0, set `mem[addr]` to 1 [[Check: consensus number]]

```
lock:   ts R0, mem[addr]    // load lock word into R0 and test-and-set
        bnz R0, lock        // if R0 is nonzero, lock was held: retry
unlock: st mem[addr], #0    // release by storing 0
```
Bus contention increases the amount of time needed to transfer the lock
Bus contention slows down execution of critical sections
low latency, high traffic, poor scaling, low storage cost, no provision for fairness
- If every process spins on an exchange, every exchange instruction attempts a write -> many invalidations, and the lock value keeps changing ownership
- Hence, each process keeps reading the lock value: a read does not generate coherence traffic, and every process spins on its locally cached copy
- When the lock owner releases the lock by writing a 0:
  - the other copies are invalidated
  - each spinning process takes a read miss, acquires a new copy, sees the 0, and attempts an exchange (which requires acquiring the block in exclusive state so the write can happen)
  - the first process to acquire the block in exclusive state acquires the lock; the others keep spinning
CAS lock
`lock cmpxchg dst, src`

```
if (dst == EAX) {
    ZF = 1
    dst = src
} else {
    ZF = 0
    EAX = dst
}
```
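For reference, a minimal C++ sketch (my own code, not the course’s) of a spinlock built directly on compare-and-swap, mirroring the `lock cmpxchg` pseudocode above:

```cpp
#include <atomic>

struct CASLock {
    std::atomic<int> held{0};

    void lock() {
        int expected = 0;
        // try to CAS 0 -> 1; on failure `expected` is overwritten with the observed value
        while (!held.compare_exchange_weak(expected, 1, std::memory_order_acquire))
            expected = 0;   // reset and retry
    }

    void unlock() {
        held.store(0, std::memory_order_release);
    }
};
```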
Load-Linked & Store-Conditional Lock
ARM: `LDREX` & `STREX`

```
lockit: LL     R2, 0(R1)   ; load linked, generates no coherence traffic
        BNEZ   R2, lockit   ; not available, keep spinning
        DADDUI R2, R0, #1   ; put value 1 in R2
        SC     R2, 0(R1)    ; store-conditional succeeds if no one
                            ; updated the lock since the last LL
        BEQZ   R2, lockit   ; confirm that SC succeeded, else keep trying
```
Test-and-test-and-set lock [[Check: double lock in singleton]]
```c
void Lock(int* lock) {
    while (1) {
        while (*lock != 0);
        if (test_and_set(*lock) == 0)
            return;
    }
}

void Unlock(int* lock) {
    *lock = 0;
}
```
Slightly higher latency than test-and-set in uncontended case
Generates much less interconnect traffic
- One invalidation, per waiting processor, per lock release ($O(P)$ invalidations)
- This is $O(P^2)$ interconnect traffic if all processors have the lock cached
- Recall: test-and-set lock generated one invalidation per waiting processor per test
More scalable (due to less traffic)
Low storage cost
Still no provision for fairness
Test-and-set lock with backoff [[Check: ALOHA with backoff]]
```c
void Lock(volatile int* l) {
    int amount = 1;
    while (1) {
        if (test_and_set(*l) == 0)
            return;
        delay(amount);
        amount *= 2;
    }
}
```
Generates less traffic than test-and-set
Same uncontended latency as test-and-set, but potentially higher latency under contention
Exponential back-off can cause severe unfairness
Ticket lock
```c
struct lock {
    int next_ticket;
    int now_serving;
};

void Lock(lock* l) {
    int my_ticket = atomic_increment(&l->next_ticket);  // take a “ticket”
    while (my_ticket != l->now_serving);                // wait for number to be called
}

void unlock(lock* l) {
    l->now_serving++;
}
```
After taking a ticket (one atomic increment), acquiring the lock requires no further atomic operations: waiting threads only read `now_serving`
only one invalidation per lock release ($O(P)$ interconnect traffic)
Array-based lock
```c
struct lock {
    padded_int status[P];   // padded to keep off same cache line
    int head;
};

int my_element;

void Lock(lock* l) {
    my_element = atomic_circ_increment(&l->head);   // assume circular increment
    while (l->status[my_element] == 1);
}

void unlock(lock* l) {
    l->status[my_element] = 1;
    l->status[circ_next(my_element)] = 0;   // next() gives next index
}
```
$O(1)$ interconnect traffic per release, but lock requires space linear in $O(P)$
the atomic circular increment is a more complex operation (higher overhead)
Queueing locks
- the directory controller keeps track of the order in which requests arrived – when the lock is available, it is passed to the next in line (only one process sees the invalidate and update)
Desired lock
- Low latency: If lock is free and no other processors are trying to acquire it, a processor should be able to acquire the lock quickly
- Low interconnect traffic: If all processors are trying to acquire lock at once, they should acquire the lock in succession with as little traffic as possible
- Scalability: Latency / traffic should scale reasonably with number of processors
- Low storage cost
- Fairness: Avoid starvation or substantial unfairness
Min by CAS
```c
// atomicCAS:
// atomic compare and swap performs the following logic atomically
int atomicCAS(int* addr, int compare, int val) {
    int old = *addr;
    *addr = (old == compare) ? val : old;
    return old;
}

int atomic_min(int* addr, int x) {
    int old = *addr;
    int new = min(old, x);
    while (atomicCAS(addr, old, new) != old) {
        old = *addr;
        new = min(old, x);
    }
    return old;
}

void atomic_incr(int* addr, int x) {
    while (1) {
        int old = *addr;
        if (atomicCAS(addr, old, old + x) == old)
            break;
    }
}
```
Barrier
centralized barrier
wait for all processes to leave first barrier, before clearing flag for entry into the second
$O(P)$ traffic on interconnect per barrier
- All threads: $2P$ write transactions to obtain barrier lock and update counter ($O(P)$ traffic assuming lock acquisition is implemented in $O(1)$ manner)
- Last thread: 2 write transactions to write to the flag and reset the counter ($O(P)$ traffic since there are many sharers of the flag)
- $P-1$ transactions to read updated flag
- span (latency) of entire operation is $O(P)$
```c
struct Barrier_t {
    LOCK lock;
    int arrive_counter;   // initialize to 0 (number of threads that have arrived)
    int leave_counter;    // initialize to P (number of threads that have left barrier)
    int flag;
};

void Barrier(Barrier_t* b, int p) {
    lock(b->lock);
    if (b->arrive_counter == 0) {        // if first to arrive...
        if (b->leave_counter == p) {     // check to make sure no other threads "still in barrier"
            b->flag = 0;                 // first arriving thread clears flag
        } else {
            unlock(b->lock);
            while (b->leave_counter != p);   // wait for all threads to leave before clearing
            lock(b->lock);
            b->flag = 0;                 // first arriving thread clears flag
        }
    }
    int num_arrived = ++(b->arrive_counter);
    unlock(b->lock);

    if (num_arrived == p) {   // last arriver sets flag
        b->arrive_counter = 0;
        b->leave_counter = 1;
        b->flag = 1;
    } else {
        while (b->flag == 0);   // wait for flag
        lock(b->lock);
        b->leave_counter++;
        unlock(b->lock);
    }
}
```
centralized barrier with sense reversal
Sense reversal optimization results in one spin instead of two
```c
struct Barrier_t {
    LOCK lock;
    int counter;   // initialize to 0
    int flag;      // initialize to 0
};

thread_local int private_sense = 0;   // private per processor. Main idea: processors wait
                                      // for flag to be equal to private_sense

void Barrier(Barrier_t* b, int p) {
    private_sense = (private_sense == 0) ? 1 : 0;
    lock(b->lock);
    int num_arrived = ++(b->counter);
    if (b->counter == p) {   // last arriver sets flag
        unlock(b->lock);
        b->counter = 0;
        b->flag = private_sense;
    } else {
        unlock(b->lock);
        while (b->flag != private_sense);   // wait for flag
    }
}
```
Combining tree implementation of barrier
- Combining trees make better use of parallelism in more complex interconnect topologies. $\log{P}$ latency
- Barrier acquire: when processor arrives at barrier, performs increment of parent counter
- Barrier release: beginning from root, notify children of release
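A rough sketch of the combining-tree idea (my own code, not the course’s; fan-in is fixed at 2, and tree construction plus thread-to-leaf assignment are omitted). The point is only that no single counter receives all $P$ increments:

```cpp
#include <atomic>

struct TreeNode {
    std::atomic<int> count{0};     // arrivals at this node (fan-in = 2)
    TreeNode* parent = nullptr;    // nullptr at the root
};

std::atomic<int> release_sense{0};
thread_local int private_sense = 1;

void tree_barrier(TreeNode* leaf) {
    TreeNode* n = leaf;
    // fetch_add returns the old count: 0 -> first at this node (stop and wait),
    // 1 -> second (reset the node and combine upward)
    while (n != nullptr && n->count.fetch_add(1) == 1) {
        n->count.store(0);          // reset this node for the next barrier episode
        n = n->parent;
    }
    if (n == nullptr) {
        // climbed past the root: all threads have arrived; release everyone
        release_sense.store(private_sense);
    } else {
        while (release_sense.load() != private_sense) { /* spin */ }
    }
    private_sense = 1 - private_sense;   // sense reversal for the next episode
}
```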
Fine-grained Synchronization & Lock-free Programming
Fine-grained linked list
```c
struct Node {
    int value;
    Node* next;
    Lock* lock;
};

struct List {
    Node* head;
    Lock* lock;
};

void insert(List* list, int value) {
    Node* n = new Node;
    n->value = value;

    // assume case of insert before head handled
    // here (to keep slide simple)
    Node* prev, *cur;

    lock(list->lock);
    prev = list->head;
    cur = list->head->next;

    lock(prev->lock);
    unlock(list->lock);
    if (cur) lock(cur->lock);

    while (cur) {
        if (cur->value > value)
            break;
        Node* old_prev = prev;
        prev = cur;
        cur = cur->next;
        unlock(old_prev->lock);
        if (cur) lock(cur->lock);
    }

    n->next = cur;
    prev->next = n;

    unlock(prev->lock);
    if (cur) unlock(cur->lock);
}

void delete(List* list, int value) {
    // assume case of delete head handled here
    // (to keep slide simple)
    Node* prev, *cur;

    lock(list->lock);
    prev = list->head;
    cur = list->head->next;

    lock(prev->lock);
    unlock(list->lock);
    if (cur) lock(cur->lock);

    while (cur) {
        if (cur->value == value) {
            prev->next = cur->next;
            unlock(prev->lock);
            unlock(cur->lock);
            delete cur;
            return;
        }
        Node* old_prev = prev;
        prev = cur;
        cur = cur->next;
        unlock(old_prev->lock);
        if (cur) lock(cur->lock);
    }
    unlock(prev->lock);
}
```
reduce contention for global data structure lock
tricky to ensure correctness
extra storage
extra locking each traversal step
Lock-free: Non-blocking algorithms are lock-free if some thread is guaranteed to make progress (“systemwide progress”)
this definition does not prevent starvation of any one thread
wait-free: every operation has a bound on the number of steps the algorithm will take before the operation completes (starvation-free)
Single reader, single writer queue
- bounded: push/pop are not total functions (they can fail when the queue is full/empty); see the SPSC sketch below
- unbounded: keep a `reclaim` pointer as the physical head and `head` as the logical head; the producer advances `reclaim` up to `head` and deallocates the nodes in between
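A minimal sketch (my own code) of the bounded single-reader/single-writer case: because only the producer writes `tail` and only the consumer writes `head`, plain loads and stores with release/acquire ordering suffice and no RMW operations are needed.

```cpp
#include <atomic>
#include <cstddef>

template <typename T, size_t N>
struct SPSCQueue {
    T buf[N];
    std::atomic<size_t> head{0};   // next slot to read  (written only by the consumer)
    std::atomic<size_t> tail{0};   // next slot to write (written only by the producer)

    bool push(const T& v) {                       // producer only
        size_t t = tail.load(std::memory_order_relaxed);
        size_t next = (t + 1) % N;
        if (next == head.load(std::memory_order_acquire))
            return false;                         // full (one slot intentionally left empty)
        buf[t] = v;
        tail.store(next, std::memory_order_release);
        return true;
    }

    bool pop(T& out) {                            // consumer only
        size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire))
            return false;                         // empty
        out = buf[h];
        head.store((h + 1) % N, std::memory_order_release);
        return true;
    }
};
```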
Lock-free stack
ABA problem
maintain a counter of pop operations
requires double compare-and-swap, or a double-width CAS (`cmpxchg8b`, `cmpxchg16b`), or careful node allocation / element reuse policies
In summary, the ABA problem is caused by the value used to check the state of a shared structure not actually being unique (i.e., the structure may have changed even though the compared value still looks the same).
The ABA problem can be avoided by adding a mechanism that ensures memory is not made available for reuse until no thread still holds a reference to it.
```c
struct Node {
    Node* next;
    int value;
};

struct Stack {
    Node* top;
    int pop_count;
};

void init(Stack* s) {
    s->top = NULL;
}

void push(Stack* s, Node* n) {
    while (1) {
        Node* old_top = s->top;
        n->next = old_top;
        if (compare_and_swap(&s->top, old_top, n) == old_top)
            return;
    }
}

Node* pop(Stack* s) {
    while (1) {
        int pop_count = s->pop_count;
        Node* top = s->top;
        if (top == NULL)
            return NULL;
        Node* new_top = top->next;
        if (double_compare_and_swap(&s->top, top, new_top,
                                    &s->pop_count, pop_count, pop_count + 1))
            return top;
    }
}
```
referencing freed memory: the pointer read by `Node* top = s->top;` may refer to a node that another thread has already popped and deleted
hazard pointer: avoid freeing a node until it’s known that all other threads do not hold reference to it
“Each reader thread owns a single-writer/multi-reader shared pointer called “hazard pointer.” When a reader thread assigns the address of a map to its hazard pointer, it is basically announcing to other threads (writers), “I am reading this map. You can replace it if you want, but don’t change its contents and certainly keep your deleteing hands off it.”
-Andrei Alexandrescu and Maged Michael, Lock-Free Data Structures with Hazard Pointers
```c
struct Node {
    Node* next;
    int value;
};

struct Stack {
    Node* top;
    int pop_count;
};

// per-thread ptr (node that cannot
// be deleted since the thread is
// accessing it)
Node* hazard;

// per-thread list of nodes this
// thread must delete
Node* retireList;
int retireListSize;

// delete nodes if possible
void retire(Node* ptr) {
    push(retireList, ptr);
    retireListSize++;
    if (retireListSize > THRESHOLD)
        for (each node n in retireList) {
            if (n not pointed to by any thread's hazard pointer) {
                remove n from list
                delete n;
            }
        }
}

void init(Stack* s) {
    s->top = NULL;
}

void push(Stack* s, int value) {
    Node* n = new Node;
    n->value = value;
    while (1) {
        Node* old_top = s->top;
        n->next = old_top;
        if (compare_and_swap(&s->top, old_top, n) == old_top)
            return;
    }
}

int pop(Stack* s) {
    while (1) {
        Stack old;
        old.pop_count = s->pop_count;
        old.top = hazard = s->top;
        if (old.top == NULL) {
            return NULL;
        }
        Stack new_stack;
        new_stack.top = old.top->next;
        new_stack.pop_count = old.pop_count + 1;
        if (doubleword_compare_and_swap(s, old, new_stack)) {
            int value = old.top->value;
            retire(old.top);
            return value;
        }
        hazard = NULL;
    }
}
```
Lock-free linked list
insertion
```c
// insert new node after specified node
void insert_after(List* list, Node* after, int value) {
    Node* n = new Node;
    n->value = value;

    // assume case of insert into empty list handled
    // here (keep code on slide simple for class discussion)
    Node* prev = list->head;
    while (prev->next) {
        if (prev == after) {
            while (1) {
                Node* old_next = prev->next;
                n->next = old_next;
                if (compare_and_swap(&prev->next, old_next, n) == old_next)
                    return;
            }
        }
        prev = prev->next;
    }
}
```
deletion
- Consider case where B is deleted simultaneously with successful insertion of E after B
CAS can fail under heavy contention, requiring spins
Transactional Memory
- Transactional Memory (TM)
- Memory transaction: An atomic and isolated sequence of memory accesses
- Atomicity
- Upon transaction commit, all memory writes in transaction take effect at once
- On transaction abort, none of the writes appear to take effect (as if transaction never happened)
- Isolation
- No other processor can observe writes before transaction commits
- Serializability
- Transactions appear to commit in a single serial order
- motivations
- locks prevent concurrency
- failure atomicity: no manually written exception/cleanup code, just abort; no locks are lost when a thread fails
- composability: no manual lock-ordering conventions; the programmer states global intent. Safe & scalable composition
- easy-to-use
- performs as well as fine-grained locks (allows concurrent readers, R->R)
- TM Implementation
- Data versioning policy: How does the system manage uncommitted (new) and previously committed (old) versions of data for concurrent transactions? [[Check: Undo/Redo Log in DB]]
- eager versioning (undo-log based): Update memory immediately, maintain “undo log” in case of abort [[Check: 2PL]]
- fast commit
- slower aborts, fault tolerance issues [[Check: STEAL policy]]
- lazy versioning (write-buffer based): Log memory updates in transaction write buffer, flush buffer on commit [[Check: MVCC]]
- faster abort
- slower commits
- Conflict detection policy: how/when does the system determine that two concurrent transactions conflict? [[Check: OCC in DB]]
Read-set: addresses read during the transaction
Write-set: addresses written during the transaction
Read-write conflict: transaction A reads address X, which was written to by pending (but not yet committed) transaction B (W->R dirty read, R->W unrepeatable read)
Write-write conflict: transactions A and B are both pending, and both write to address X (W->W)
Pessimistic detection [[Check: 2PL]]
- write priority
- needs a contention manager to avoid livelock
- detect conflicts early (undo less work, turn some aborts to stalls)
- no forward progress guarantees, more aborts in some cases
- fine-grained communication (check on each load/store)
- detection on critical path
Optimistic detection: Detect conflicts when a transaction attempts to commit
- forward progress guarantees
- bulk communication and conflict detection
- detects conflicts late, can still have fairness problems
- [[N: here Case 3 success, I think it’s because MVCC control]]
Granularity
- Object granularity (SW-based)
- reduce overhead
- close to programmer’s reasoning
- false sharing
- Machine word granularity
- minimize false sharing
- increase overhead
- Cache-line granularity
- Mixed
Examples
Hardware transactional memory (HTM)
- data versioning in caches (cache write buffer or undo log)
- add new cache line metadata to track read/write set
- R bit: indicates data read by transaction (set on loads)
- W bit: indicates data written by transaction (set on stores)
- R/W bits can be at word or cache-line granularity
- R/W bits gang-cleared on transaction commit or abort
- For eager versioning, need a 2nd cache write for undo log
- conflict detection via cache coherence protocol
- Observing a shared request to a W-word is a read-write conflict
- Observing an exclusive (intent to write) request to an R-word is a write-read conflict
- Observing an exclusive (intent to write) request to a W-word is a write-write conflict
- register checkpoint at transaction begin (HW abort)
- Commit
- Validate: request RdX access to write set lines (if needed)
- Commit: gang-reset R and W bits, turns write set data to valid (dirty) data
- Abort
- Check: lookup exclusive requests in the read set and write set
- Abort: invalidate write set, gang-reset R and W bits, restore to register checkpoint
- Restricted Transactional Memory (RTM): Intel Haswell
  - `xbegin`, `xend`, `xabort`: tracks the read/write set in the L1 cache
  - eviction of a line in the read/write set will cause an abort (no guarantee of progress); see the intrinsics sketch below
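A hedged sketch of the Haswell RTM intrinsics: `_xbegin`, `_xend`, and `_XBEGIN_STARTED` are the real names from `<immintrin.h>` (compile with `-mrtm`), while the fallback lock and the update itself are my own placeholders. Note that a production lock-elision scheme must also read the fallback lock inside the transaction so a concurrent fallback holder causes an abort; that detail is omitted here.

```cpp
#include <immintrin.h>   // _xbegin, _xend, _XBEGIN_STARTED
#include <mutex>

std::mutex fallback;     // hypothetical fallback lock used when the transaction aborts

void update_shared(int* counter) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        *counter += 1;   // reads/writes tracked in the L1 read/write sets
        _xend();         // attempt to commit
    } else {
        // aborted (conflict, capacity eviction, interrupt, ...): no progress guarantee,
        // so fall back to an ordinary lock
        std::lock_guard<std::mutex> g(fallback);
        *counter += 1;
    }
}
```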
Heterogeneous Parallelism & Hardware Specialization
- Amdahl’s law: $\text{speedup}(f, n, r) = \frac{1}{\frac{1-f}{\text{perf}(r_{seq})} + \frac{f}{\sum\text{perf}(r_i)}}$
- $f$: fraction of program that is parallelizable
- $n$: total processing resources
- $r$: resources dedicated to each processing core
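- Worked example (my own numbers, assuming the common $\text{perf}(r) = \sqrt{r}$ model): with $f = 0.9$ and $n = 16$ resource units spent as one big core of $r = 4$ plus 12 single-unit cores, $\text{perf}(r_{seq}) = 2$ and $\sum\text{perf}(r_i) = 2 + 12 = 14$, so $\text{speedup} = \frac{1}{\frac{0.1}{2} + \frac{0.9}{14}} \approx 8.75$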
- Intel “Skylake”
- CPU cores & graphics cores share same memory system & LLC (L3 cache)
- enables low-latency, high-bw communication between CPU & integrated GPU
- Graphics cores are cache coherent with the CPU cores
- Add discrete GPU
- Keep the discrete (power-hungry) GPU powered off unless needed for graphics-intensive applications
- Use integrated, low power graphics for basic graphics/window manager/UI
- ARM-based supercomputer
- power-efficient building blocks: ARM CPUs (for control/scheduling) + GPU cores or wide SIMD engines (serving as the primary compute engine)
- Energy-constrained computing
- supercomputer: sheer scale
- datacenter: cost of cooling, physical space requirement
- mobile device: limited battery life, heat dissipation
- chip power consumption limit: the longer a task runs the less power it can use
- Processor’s power consumption is limited by heat generated (efficiency is required for more than just maximizing battery life)
- Electrical limit: max power that can be supplied to chip
- Die temp (junction temp - $T_j$): chip becomes unreliable above this temp (chip can run at high power for short period of time until chip heats to $T_j$)
- Case temp: mobile device gets too hot for user to comfortably hold (chip is at suitable operating temp, but heat is dissipating into case)
- Battery life: chip and case are cool, but want to reduce power consumption to sustain long battery life for given task
- Mobile: increasing efficiency
- Run faster for a fixed period of time: Run at higher clock, use more cores (reduce latency of critical task)
- Run at a fixed level of performance for longer
- Computing specialization
- Throughput-maximized processor architectures: e.g., GPU cores (10x)
- Fixed-function ASIC (“application-specific integrated circuit”) (100-1000x)
- FPGA (Field Programmable Gate Arrays)
- Middle ground between an ASIC and a processor
- FPGA chip provides array of logic blocks, connected by interconnect
- Programmer-defined logic implemented directly by the FPGA
  - combinational logic via LUTs (lookup tables)
- Project Catapult
- Heterogeneous system
- SW dev: how to map application onto a heterogeneous collection of resources?
- decomposing, scheduling
- HW dev: what is the right mixture of resources?
- Too few throughput oriented resources (lower peak throughput for parallel workloads)
- Too few sequential processing resources (limited by sequential part of workload)
- How much chip area should be dedicated to a specific function, like video?
- Reducing energy consumption
- specialized processing
- move less data
- Compute less: parallel algorithms that do more work than sequential counterparts may not be desirable even if they run faster
- Specialize compute units
- Heterogeneous processors: CPU-like cores + throughput-optimized cores (GPU-like cores)
- Fixed-function units: audio processing, “movement sensor processing” video decode/encode, image processing/computer vision?
- Specialized instructions: expanding set of AVX vector instructions, new instructions for accelerating AES encryption (AES-NI)
- Programmable soft logic: FPGAs
- Reduce bandwidth requirements
- Exploit locality (restructure algorithms to reuse on-chip data as much as possible)
- Aggressive use of compression: perform extra computation to compress application data before transferring to memory (likely to see fixed-function HW to reduce the overhead of general data compression/decompression)
- Heterogeneous parallel processing: use a mixture of computing resources that fits the mixture of needs of the target applications
- Latency-optimized sequential cores, throughput-optimized parallel cores, domain-specialized fixed-function processors
- Current CS research challenge: how to write efficient, portable programs for emerging heterogeneous architectures?
Domain-Specific Programming Systems
Heterogeneous execution capability
- Programmable, latency-centric (e.g., “CPU-like” cores)
- Programmable, throughput-optimized (e.g., “GPU-like” cores)
- Fixed-function, application-specific (e.g., image/video/audio processing)
Different technologies and performance characteristics within the same machine at different scales
- Within a core: SIMD, multi-threading: fine-granularity sync and communication
- Abstractions: SPMD programming (ISPC, CUDA, OpenCL, Metal)
- Across cores: coherent shared memory via fast on-chip network
- Abstractions: OpenMP pragmas, Cilk, TBB
- Hybrid CPU+GPU multi-core: incoherent (potentially) shared memory
- Abstractions: OpenCL
- Across racks: distributed memory, multi-stage network
- Abstractions: message passing (MPI, Go, Spark, Legion, Charm++)
Liszt: for scientific computing on meshes
```scala
val Position = FieldWithConst[Vertex,Float3](0.f, 0.f, 0.f)
val Temperature = FieldWithConst[Vertex,Float](0.f)
val Flux = FieldWithConst[Vertex,Float](0.f)
val JacobiStep = FieldWithConst[Vertex,Float](0.f)

var i = 0;
while ( i < 1000 ) {
  Flux(vertices(mesh)) = 0.f;
  JacobiStep(vertices(mesh)) = 0.f;
  for (e <- edges(mesh)) {
    val v1 = head(e)
    val v2 = tail(e)
    val dP = Position(v1) - Position(v2)
    val dT = Temperature(v1) - Temperature(v2)
    val step = 1.0f/(length(dP))
    Flux(v1) += dT*step
    Flux(v2) -= dT*step
    JacobiStep(v1) += step
    JacobiStep(v2) += step
  }
  i += 1
}
```
fields, mesh, topology functions, iteration over set
Liszt infers “stencils”: “stencil” = mesh elements accessed in an iteration of a loop = dependencies for the iteration
Statically analyze code to find stencil of each top-level for loop
- Extract nested mesh element reads
- Extract field operations
Language restrictions:
- Mesh elements are only accessed through built-in topological functions: `cells(mesh)`, …
- Single static assignment (immutable values): `val v1 = head(e)`
- Data in fields can only be accessed using mesh elements: `Pressure(v)`
- No recursive functions
Portable parallelism
- mesh partitioning
- mesh coloring
Distributed memory: Mesh + Stencil -> Graph -> Partition
GPU
parallel reduction
conflict graph
To avoid conflicts in vertex-value updates when parallelizing over edges, there are generally two schemes:
- Maintain the update value on the edge, and finally do a parallel reduction over vertices.
- Construct the conflict graph over edges and color the edges, yielding edge sets within which the edges are independent.
Both methods carry some additional cost: the former requires extra storage per edge; the latter requires extra time and storage for the conflict graph and the coloring.
Ideally, we would like to color the graph with the minimum number of colors to maximize parallelism and minimize overhead. Unfortunately, minimum vertex coloring is NP-hard, so constructing a minimum coloring for any significantly sized graph is usually out of the question. Instead you probably want to use some kind of greedy/approximation method.
According to the Halide paper, they actually use an algorithm from the paper “Register Allocation via Coloring” by Chaitin.
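Since the notes mention greedy/approximate coloring, here is a minimal greedy-coloring sketch (my own code, not Liszt’s or Chaitin’s algorithm): give each vertex the smallest color not used by an already-colored neighbor.

```cpp
#include <set>
#include <vector>

// adj[v] lists the neighbors of vertex v; returns a color per vertex
std::vector<int> greedy_color(const std::vector<std::vector<int>>& adj) {
    int n = (int)adj.size();
    std::vector<int> color(n, -1);
    for (int v = 0; v < n; v++) {
        std::set<int> used;
        for (int u : adj[v])
            if (color[u] != -1) used.insert(color[u]);   // colors taken by neighbors
        int c = 0;
        while (used.count(c)) c++;                       // smallest free color
        color[v] = c;
    }
    return color;
}
```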
Productivity
- Abstract representation of mesh: vertices, edges, faces, fields
- Intuitive topological operators
Portability
- Same code runs on large cluster of CPUs and GPUs (and combinations)
High performance
- Language is constrained to allow compiler to track dependencies
- Used for locality-aware partitioning (distributed memory implementation)
- Used for graph coloring to avoid sync (GPU implementation)
- Compiler chooses different parallelization strategies for different platforms
- System can customize mesh representation based on application and platform
Halide: image processing
two-pass blur: A 2D separable filter (such as a box filter) can be evaluated via two 1D filtering operations
Halide language
```cpp
Var x, y;
Func blurx, blury, out;
Image<uint8_t> in = load_image("myimage.jpg");

// perform 3x3 box blur in two passes
blurx(x,y) = (in(x-1,y) + in(x,y) + in(x+1,y)) / 3.f;
blury(x,y) = (blurx(x,y-1) + blurx(x,y) + blurx(x,y+1)) / 3.f;

// brighten blurred result by 25%, then clamp
out(x,y) = min(blury(x,y) * 1.25f, 255);

// execute pipeline on domain of size 800x600
Image<uint8_t> result = out.realize(800, 600);
```
Halide function: an infinite (but discrete) set of values
Halide expression: a side-effect-free expression that describes how to compute a function’s value at a point in its domain in terms of the values of other functions
scheduling: how to schedule the algorithm onto a parallel machine
```cpp
// Vectorize the xi loop (8-wide)
// Use threads to parallelize the y loop
out.tile(x, y, xi, yi, 256, 32).vectorize(xi, 8).parallel(y);

// Produce elements of blurx on demand for each tile of output.
// Vectorize the x (innermost) loop
blurx.compute_at(out, x).vectorize(x, 8);
```
- `blurx.compute_root()`: don’t compute within out’s loops; compute all of blurx up front
- `blurx.compute_at(out, xi)`: compute the necessary elements within out’s $x_i$ loop
- `blurx.compute_at(out, x)`: compute the necessary elements within out’s $x$ loop

Functional primitives for describing image processing operations
Additional primitives for describing schedules
separate “algorithm specification” from schedule
- [[C: Auto Schedule in TVM]]
Domain-Specific Programming on Graphs
GraphLab: A system for describing iterative computations on graphs
$G = (V, E)$, $D_v$ data associated with vertex $v$, $D_{u \to v}$ data associated with direct edge $u \to v$
read-only global data
operation: Defines per-vertex operations on the vertex’s local neighborhood (& global reduction)
neighborhood: current vertex, adjacent edges, adjacent vertices
```cpp
graphlab::omni_engine<pagerank_program> engine(dc, graph, "sync");
engine.signal_all();
engine.start();
```
Vertex signaling: primitive for scheduling work (Asynchronous execution model)
Synchronization: Programs specify what granularity of atomicity
- “Full consistency”: implementation ensures no other execution reads or writes to data in scope of $v$ when vertex program for $v$ is running.
- “Edge consistency”: no other execution reads or writes any data in $v$ or in edges adjacent to $v$
- “Vertex consistency”: no other execution reads or writes to data in $v$ …
Job scheduling order
- Synchronous: update all scheduled vertices “simultaneously”
- Round-robin: vertex programs observe most recent updates
- Graph coloring
- Dynamic: based on new work created by signal
Ligra: A simple framework for parallel graph operations
```
parents = {-1, ..., -1}

// d = dst: vertex to "update" (just encountered)
// s = src: vertex on frontier with edge to d
procedure UPDATE(s, d)
    return compare-and-swap(parents[d], -1, s);

procedure COND(i)
    return parents[i] == -1;

procedure BFS(G, r)
    parents[r] = r;
    frontier = {r};
    while (size(frontier) != 0) do:
        frontier = EDGEMAP(G, frontier, UPDATE, COND);

procedure EDGEMAP_SPARSE(G, U, F, C):
    result = {}
    parallel foreach v in U do:
        parallel foreach v2 in out_neighbors(v) do:
            if (C(v2) == 1 and F(v,v2) == 1) then
                add v2 to result
    remove duplicates from result
    return result;

procedure EDGEMAP_DENSE(G, U, F, C):
    result = {}
    parallel for i in {0,...,|V|-1} do:
        if (C(i) == 1) then:
            foreach v in in_neighbors(i) do:
                if v ∈ U and F(v, i) == 1 then:
                    add i to result;
                if (C(i) == 0)
                    break;
    return result;

procedure EDGEMAP(G, U, F, C):
    if (|U| + sum of out degrees > threshold)
        return EDGEMAP_DENSE(G, U, F, C);
    else
        return EDGEMAP_SPARSE(G, U, F, C);

procedure VERTEXMAP(U, F):
    result = {}
    parallel for u ∈ U do:
        if (F(u) == 1) then:
            add u to result;
    return result;
```
visiting frontier edge might be wasteful
- claimed child: edge points to unvisited node (useful work)
- failed child: edge points to node found in this step via another edge
- peer: edge points to a vertex that was added to frontier in same step as current vertex
- valid parent: edge points to vertex found in previous step
Entities: Graph, vertex subsets, `EDGEMAP`, `VERTEXMAP`
```
r_cur = {1/|V|, ..., 1/|V|};
r_next = {0, ..., 0};
diff = {}

procedure PRUPDATE(s, d):
    atomicIncrement(&r_next[d], r_cur[s] / vertex_degree(s));

procedure PRLOCALCOMPUTE(i):
    r_next[i] = alpha * r_next[i] + (1 - alpha) / |V|;
    diff[i] = |r_next[i] - r_cur[i]|;
    r_cur[i] = 0;
    return 1;

procedure COND(i):
    return 1;

procedure PAGERANK(G, alpha, eps):
    frontier = {0, ..., |V|-1}
    error = HUGE;
    while (error > eps) do:
        frontier = EDGEMAP(G, frontier, PRUPDATE, COND);
        frontier = VERTEXMAP(frontier, PRLOCALCOMPUTE);
        error = sum of per-vertex diffs   // this is a parallel reduce
        swap(r_cur, r_next);
    return error
```
System abstracts graph operations as data-parallel operations over vertices and edges
Good DSL
- #1: good systems identify the most important cases, and provide most benefit in these situations [[C: Lamport paper!]]
  - Structure of code mimics the natural structure of problems in the domain
    - Halide: pixel-wise view of filters: pixel(x,y) computed as an expression of input pixel values
    - Graph processing algorithms are designed in terms of per-vertex operations
- Efficient expression: common operations are easy and intuitive to express
- Efficient implementation: the most important optimizations in the domain are performed by the system for the programmer
- #2: good systems are usually simple systems [[C: KISS]]
- A small number of key primitives and operations
- Ligra: only two operations! (vertexmap and edgemap)
- GraphLab: run computation per vertex, trigger new work by signaling (But GraphLab gets messy with all the scheduling options)
- Halide: only a few scheduling primitives
- Hadoop: map + reduce
- Allows compiler/runtime to focus on optimizing these primitives
- #3: good primitives compose
- Composition of primitives allows for wide application scope, even if scope remains limited to a domain
- Halide’s loop ordering + loop interleaving schedule primitives allow for expression of wide range of schedules
- Composition often allows optimizations to generalize
Implementation on graph operations
- Reorganize graph structure to increase locality: efficient I/O
- -> Streaming graph computation
- Sharded graph representation
- Partition graph vertices into intervals (sized so that the subgraph for an interval fits in memory)
- Vertices and only incoming edges to these vertices are stored together in a shard
- Sort edges in a shard by source vertex id
- loop over all graph edges
- For each partition i of vertices:
- Load shard $i$ (contains all incoming edges)
- For each other shard $s$
- Load section of $s$ containing data for edges leaving $i$ and entering $s$
- Construct subgraph in memory
- Do processing on subgraph
- Compress the graph: reduce memory BW or disk I/O
- graph operations are often BW-bound
- [[C: *-order unary degree sequence (*OUDS) tree]]
- [[C: succinct data structure]]
In-Memory Distributed Computing using Spark
MapReduce
- mapper function -> prepare intermediate data -> reducer function
- checkpoints after map/reduce by writing results to FS
- job scheduler
- exploit data locality
- handling node failures
- handling slow machines (duplicate jobs)
- simple program structure (map, reduce)
- [[C: DryadLINQ for generalization to DAGs]]
- iterative algorithm must load from disk each iteration
Spark: in-memory, fault-tolerant distributed computing
Programming model for cluster-scale computations where there is significant reuse of intermediate datasets
- iterative ML, graph algorithm, iterative data mining
Keep intermediates in memory instead of persistent distributed file system
Fault tolerance for in-memory calculation
- Replicate all computations
- Checkpoint and rollback
- Periodically save state of program to persistent storage
- Restart from last checkpoint on node failure
- Maintain log of updates (commands and data)
- High overhead for maintaining logs
Resilient distributed dataset (RDD): read-only ordered collection of records (immutable)
can only be created by deterministic transformations on data in persistent storage or on existing RDDs
Actions on RDDs return data to application
```scala
// create RDD from file system data
var lines = spark.textFile("hdfs://15418log.txt");

// create RDD using filter() transformation on lines
var mobileViews = lines.filter((x: String) => isMobileClient(x));

// another filter() transformation
var safariViews = mobileViews.filter((x: String) => x.contains("Safari"));

// then count number of elements in RDD via count() action
var numViews = safariViews.count();

// 1. create RDD from file system data
// 2. create RDD with only lines from mobile clients
// 3. create RDD with elements of type (String,Int) from line string
// 4. group elements by key
// 5. call provided reduction function on all keys to count views
var perAgentCounts = spark.textFile("hdfs://15418log.txt")
                          .filter(x => isMobileClient(x))
                          .map(x => (parseUserAgent(x), 1))
                          .reduceByKey((x, y) => x + y)
                          .collect();

// create RDD from file system data
var lines = spark.textFile("hdfs://15418log.txt");

// create RDD using filter() transformation on lines
var mobileViews = lines.filter((x: String) => isMobileClient(x));

// instruct Spark runtime to try to keep mobileViews in memory
mobileViews.persist();

// create a new RDD by filtering mobileViews
// then count number of elements in new RDD via count() action
var numViews = mobileViews.filter(_.contains("Safari")).count();

// 1. create new RDD by filtering only Chrome views
// 2. for each element, split string and take timestamp of page view
// 3. convert RDD to a scalar sequence (collect() action)
var timestamps = mobileViews.filter(_.contains("Chrome"))
                            .map(_.split(" ")(0))
                            .collect();
```
[[C: ranges-v3, rust iterator, lazy & consumer]]
Transformation: RDD -> RDD
Action: RDD -> Data
Narrow dependency: each partition of parent RDD referenced by at most one child RDD partition
- Allows for fusing of operations
Wide dependency: each partition of parent RDD referenced by multiple child RDD partitions
- `groupByKey`: must compute all dependencies; triggers significant recomputation of ancestor lineage upon node failure
- `join`: depends on partitioning [[C: cluster, uncluster, index in DB]]
- `partitionBy(func)`: inform Spark how to partition an RDD
- `persist()`: store contents in memory; `persist(RELIABLE)`: store contents in durable storage (like a checkpoint)
Scheduling
- Stage 1 Computation: do nothing since input already materialized in memory
- Stage 2 Computation: evaluate map in fused manner, only actually materialize RDD F
- Stage 3 Computation: execute join (could stream the operation to disk, do not need to materialize)
Resilience via lineage
- RDD transformations are bulk, deterministic, and functional
- Implication: runtime can always reconstruct contents of RDD from its lineage (the sequence of transformations used to create it)
- Lineage is a log of transformations
- Efficient: since the log records bulk data-parallel operations, overhead of logging is low (compared to logging fine-grained operations, like in a database)
- Node failure: recompute RDD from lineage
Performance improvement
- With increasing DRAM & SSD, improving CPU utilization
- Efficient code generation (SIMD, target accelerators)
- [[C: IBM’s SparkGPU]]
Files are a poor abstraction for intermediate variables in large-scale data-parallel programs
With DSL: SQL, MLlib, GraphX
Efficiently Evaluating Deep Networks
Blocked dense matrix multiplication
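Since the notes only name the technique, here is a minimal blocked (tiled) matmul sketch, my own code with an assumed block size and row-major layout: each block of A, B, and C is reused from cache many times before being evicted, which is the whole point of blocking.

```cpp
#include <algorithm>
#include <vector>

constexpr int BLK = 64;   // block size: tune so three BLK x BLK tiles fit in cache (assumption)

// C += A * B for n x n row-major matrices; C is assumed zero-initialized by the caller.
void matmul_blocked(const std::vector<float>& A, const std::vector<float>& B,
                    std::vector<float>& C, int n) {
    for (int ii = 0; ii < n; ii += BLK)
        for (int kk = 0; kk < n; kk += BLK)
            for (int jj = 0; jj < n; jj += BLK)
                // multiply the (ii,kk) tile of A by the (kk,jj) tile of B into the (ii,jj) tile of C
                for (int i = ii; i < std::min(ii + BLK, n); i++)
                    for (int k = kk; k < std::min(kk + BLK, n); k++) {
                        float a = A[i * n + k];
                        for (int j = jj; j < std::min(jj + BLK, n); j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```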
Convolution
- FFT? forward transform and inverse transform each cost $O(N\log{N})$
- Winograd
- [[C: qnnpack]]
Reduce memory footprint
pruning
compact weight representation (quantization / weight sharing)
Step 1: prune low-weight links (iteratively retrain network, then prune)
Step 2: Weight sharing: make surviving connections share a small set of weights (discretization)
Step 3: Huffman encode quantized weights and CSR indices (lossless compression)
[[C: quantization of ML]]
My guess is that sometimes the low weight connections may provide not-so-helpful information about non-dominant features, and the weights assigned to those connections are significant enough that, after propagating through many layers of the DNN, the result might be slightly changed
pruned version can have a lower error rate than the original one. This reminds me of regularizations in the loss functions, which prevent overfitting by preferring simpler models. Also, it looks similar to “dropout”, an interesting regularization technique for reducing overfitting by dropping out units in a neural network. Regularization during training is definitely different from dropping small weights in the evaluation, but I believe they may have some connections
[[C: int8 in ARM-v7]]
Parallel Deep Network Training
Huge computational expense
- Must evaluate the network (forward and backward) for millions of training images
- Must iterate for many iterations of gradient descent
- Training modern networks on big datasets takes days
Large memory footprint
- Must maintain network layer outputs from forward pass
- Additional memory to store gradients/gradient velocity for each parameter
- Recall parameters for popular VGG-16 network require ~500 MB of memory (training requires GBs of memory for academic networks)
- Scaling to larger networks requires partitioning DNN across nodes to keep DNN + intermediates in memory
Dependencies / Synchronization (not embarrassingly parallel)
- Each parameter update step depends on previous
- Many units contribute to same parameter gradients (fine-scale reduction)
- Different images in mini batch contribute to same parameter gradients
Parameter Server
- Training data partitioned among workers
- Copy of parameters sent to workers
- Workers independently compute local “subgradients”
- Worker sends subgradient to parameter server
- Server updates global parameter values based on subgradient
- Updated parameters sent to worker
- Asynchronous parameter update
- avoid global synchronization on all parameter updates between each SGD iteration
- partial subgradient updates
- impact convergence
- fast interconnect? -> alternative in supercomputer
- Shard the PS: reduce contention
Model parallelism
- Partition network parameters across nodes
- Reduce internode communication through network design
- small spatial convolutions (1x1 convolutions)
- reduce/shrink fully connected layers
Addressing the Memory Wall
[[C: what every programmer should know about memory]]
DRAM access latency
best case: read from active row: Column Access Time (CAS)
worst case: bit line not ready, read from new row: Precharge (PRE) + Row Active (RAS) + Column Access (CAS)
Something else that was not mentioned is that DRAM needs to be refreshed, because capacitors lose charge over time. In particular, data for each row is periodically latched and rewritten so that its value is maintained (for instance, a capacitor holding a bit value of 1 will eventually read as 0 if its charge is not refreshed). RAM that doesn’t need to be refreshed is called SRAM (static RAM), which is much more expensive (it stores bits in latches built from more transistors than DRAM’s one transistor and one capacitor per bit) and therefore less common (often used in L1 caches, but not main memory).
Whenever we read a row from the DRAM into the row buffer, we actually discharged the capacitors corresponding to that row in the DRAM, so when we replace the row that is in the row buffer, we need to charge back the row in the DRAM ( to not lose the information we had in that row). That is why the Precharge (PRE) step is needed.
DRAM burst mode: amortize latency over larger transfers
DRAM banks
all banks share same pins
banks allow for pipelining of memory requests
- Precharge/activate rows/send column address to one bank while transferring data from another
- Achieves high data pin utilization
both bank interleaving and channel interleaving (as discussed on the next slide) are forms of interleaved memory design
each bank holds different memory
roughly (PRE+RAS)/CAS banks are needed to hide the latency of the row-activation process and fully utilize the data pins; therefore, increasing the bank count or having a longer CAS (burst) transfer both work here
Each DRAM bank has one row-buffer
DIMM (Dual in-line memory module): multiple chips
DIMM appears as a single, higher capacity, wider interface DRAM module to the memory controller. Higher aggregate bandwidth, but minimum transfer granularity is now 64 bits
Memory controller converts physical address to DRAM bank, row, column
Physical addresses are interleaved across DRAM chips at byte granularity
DRAM chips transmit first 64 bits in parallel
On the DIMM pictured, Kayvon noted that there are 8 DRAM chips, and asked what the 9th would be if there were one more. To this, someone responded “ECC”.
ECC, or “error-correcting code” memory, can be used to detect and possibly correct bit errors in the other 8 chips. Think parity bit checks, or Hamming codes.
For those familiar with hardware, bit errors are not to be ignored. They can occur because of hardware failure, or if your computer is being used in a non-ideal environment (e.g. in a nuclear power plant, the ISS).
the physical addresses could be interleaved across DRAM chips with granularity larger than one byte. For example, we could put bytes 0-3 on DRAM 0, 4-7 on DRAM 1, etc. In that case, we would transmit a strided pattern of bytes over the bus each cycle (e.g. byte 0 from DRAM 0, byte 4 from DRAM 1, etc.), but that’s okay, as long as we eventually obtain the whole cache line we want to load
one possible optimization is to load first the specific word that the memory request asked for and return it so that the processor can begin working with it, while the rest of the cache line finishes loading in. Of course, this would require changes to the semantics of how cache lines are loaded and processors’ expectations about them
[[C: critical word first]]
Memory controller: memory request scheduler
- Receives load/store requests from LLC
- Conflicting scheduling goals
- Maximize throughput, minimize latency, minimize energy consumption
- Common scheduling policy: FR-FCFS (first-ready, first-come-first-serve)
- Service requests to currently open row first (maximize row locality)
- Service requests to other rows in FIFO order
- Controller may coalesce multiple small requests into large contiguous requests (take advantage of DRAM “burst modes”)
Multiple-channel memory bus
DDR: double data rate (half cycle)
Embedded DRAM (eDRAM)
- IBM Power 7 server
- Intel Broadwell/Skylake
- XBox GPU
3D stacking DRAM: increase bandwidth, reduce power
- DRAMs connected via through-silicon-vias (TSVs) that run through the chips
- TSVs provide highly parallel connection between logic layer and DRAMs
- Base layer of stack “logic layer” is memory controller, manages requests from processor
- Silicon “interposer” serves as high-bandwidth interconnect between DRAM stack and processor
- Micron/Intel Hybrid Memory Cube (HMC)
- High-bandwidth memory (HBM)
- GPU
- MCDRAM
- Xeon Phi: 16GB in package stacked DRAM, as 16 GB last level cache, or as 16GB separate address space (“flat mode”)
- same latency as DDR4, 5x bandwidth of DDR4, 5x less energy cost per bit transferred
Moving computation to the data
- bulk copy
HW-accelerated data compression
- `_mm512_extload_ps`: load 8-bit values from memory and convert them to 32-bit float representation for storage in a vector register
- Cache compression: increase cache’s effective capacity by compressing data resident in cache
- B$\Delta$I compression
- GPU frame buffer compression
- All modern GPUs have hardware support for losslessly compressing frame buffer contents before transfer to/from memory
- On cache line load: transfer compressed data from memory and decompress into cache
- On evict: compress cache line and only transfer compressed bits to memory
- Anchor encoding
- ARM GPU “Memory transaction elimination”
- skip output image write if it is unnecessary
- Frame 1:
- Render frame tile at a time
- Compute hash of pixels in each tile on screen
- Frame 2:
- Render frame tile at a time
- Before storing pixel values for tile to memory, compute hash and see if tile is the same as last frame
- If yes, skip memory write
Addressing memory bottleneck
- Application programmer: schedule computation to maximize locality
- HW Architecture
- Intelligent DRAM request scheduling
- Bringing data closer to processor (deep cache hierarchies, eDRAM)
- Increase bandwidth (wider memory systems, 3D memory stacking)
- Ongoing research in locating limited forms of computation “in” or near memory
- Ongoing research in hardware accelerated compression
General principles
- Locate data near processor
- move computations to data storage
- data compression
- [[C: on-chip computation]]
The Future of High-Performance Computing
Titan Programming
- Solving problem over grid
- Bulk Synchronous model: partition into regions, map region per processor
- strive to keep perfectly balanced
- System Level: MPI
- Node Level: OpenMP (cpu) + CUDA (gpu)
MPI Fault Tolerance
- Checkpoint
- Restore
- Performance Scaling
MapReduce
- flexibility in placement, scheduling, load balancing
- high overhead, low raw performance
- data integrity: replicate
- recovery: recompute
Moore’s Law
- Lithography at such small dimensions
- Statistical variations among devices
Dennard Scaling
From what I’ve read, supercomputers run lightweight Linux derivatives. The CNK kernel used by IBM Blue Gene (at Argonne National Laboratory) has:
- statically mapped physical memory
- no context switching or scheduling at the compute kernel
- no file I/O implementation on compute nodes
15712 talks about microkernels like Mach
There’s many things going on in a traditional OS that aren’t required in a supercomputer. If you know your machine is dedicated to performing one large compute job, you can actually remove kernel functionality that supports anything else.
For example, general-purpose OSs have a timer “tick” that goes off every, say 10ms, that the kernel uses to schedule different processes to run on the CPU. When running a long compute job for which the programmer has partitioned work appropriately, this can cause significant slowdown.
Other examples include the networking stack (and associated daemon processes for that), desktop environments, and others. Technically, HW interrupts shouldn’t detract from performance, but since these machines don’t have disks, extra code in the kernel is just wasted space in RAM that could be used for computation.
Parallel 3D graphics
The visibility problem:
- what pixels does the triangle overlap? (“coverage”)
- what triangle is closest to the camera in each pixel? (“occlusion”)
- what scene geometry is visible within each screen pixel?
- What scene geometry projects into a screen pixel? (coverage)
- Which geometry is visible from the camera at that pixel? (occlusion)
- rays
- What scene geometry is hit by a ray from a pixel through the pinhole? (coverage)
- What object is the frst hit along that ray? (occlusion)
The graphics pipeline
- Geometry: compute vertex positions on screen
- Rasterization: compute covered samples
- Shading: compute color of colored pixels
- Pixel Ops: depth test & depth/color write
Coverage
- $E_i(x, y) = (x - X_i)\,dY_i - (y - Y_i)\,dX_i = A_i x + B_i y + C_i$ (inside < 0, on edge = 0, outside > 0)
- $inside(sx, sy)$: all three edge equations < 0
- incremental updates: $E_i(x + 1, y) = E_i(x, y) + A_i$
- tiled triangle traversal: all modern GPUs have fixed-function hardware for efficiently performing data-parallel point-in-triangle tests
  - Simplicity of wide parallel execution overcomes the cost of extra point-in-triangle tests (most triangles cover many samples, especially when super-sampling coverage)
  - Can skip sample-testing work: use a triangle-box test to classify an entire block as not in the triangle (“early out”), or entirely within the triangle (“early in”)
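A small sketch (my own code) of the edge-function test described above; the sign convention (inside when all three edge values are negative) assumes the triangle’s vertices are wound consistently so that interior points evaluate negative on every edge, matching the notes’ convention.

```cpp
struct Vec2 { float x, y; };

// E(x,y) = (x - a.x) * dY - (y - a.y) * dX, with (dX, dY) = b - a
float edge(const Vec2& a, const Vec2& b, float sx, float sy) {
    return (sx - a.x) * (b.y - a.y) - (sy - a.y) * (b.x - a.x);
}

// sample (sx, sy) is covered if it lies on the negative side of all three edges
bool inside(const Vec2& v0, const Vec2& v1, const Vec2& v2, float sx, float sy) {
    return edge(v0, v1, sx, sy) < 0 &&
           edge(v1, v2, sx, sy) < 0 &&
           edge(v2, v0, sx, sy) < 0;
}
```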
Occlusion
depth-buffer (Z-buffer): For each coverage sample point, depth-buffer stores depth of closest triangle at this sample point that has been processed by the renderer so far
```
bool pass_depth_test(d1, d2) {
    return d1 < d2;
}

depth_test(tri_d, tri_color, x, y) {
    if (pass_depth_test(tri_d, zbuffer[x][y])) {
        // Triangle is the closest object seen so far at this
        // sample point: update depth and color buffers.
        zbuffer[x][y] = tri_d;    // update zbuffer
        color[x][y] = tri_color;  // update color buffer
    }
}
```
Z-buffer algorithm has high bandwidth requirements
- Number of Z-buffer reads/writes for a frame depends on:
- Depth complexity of the scene: how many triangles cover each pixel on average
- The order triangles are provided to the graphics pipeline (if the depth test fails, neither the depth buffer nor the color buffer is written)
- Depth-buffer BW $\approx$ frame rate $\times$ pixels per frame $\times$ avg. scene depth complexity $\times$ bits per depth sample (worked estimate below)
- GPUs often sample coverage multiple times per pixel for quality: multiply by 4
- Scene is often rendered multiple times per frame: multiply by 2-3
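A rough worked estimate with assumed numbers (none of these figures are from the notes): at 60 frames/s, 2 Mpixels per frame, average depth complexity 4, and 32-bit depth samples, reads plus writes already give about

$60 \times 2\times10^{6} \times 4 \times 4\,\text{B} \times 2 \approx 3.8\ \text{GB/s}$

for depth traffic alone; the 4x multisample and 2-3x multi-pass factors above push that to tens of GB/s, which is why the compression hardware below matters.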
Modern GPUs have fixed-function hardware to implement caching and lossless compression of both color and depth buffers to reduce bandwidth
Frame buffer compression
- depth-buffer compression: reduce BW, lossless, for fixed-point numbers
- screen tile based
- On tile evict:
- Compute zmin/zmax (needed for hierarchical culling and/or compression)
- Attempt to compress
- Update tile table
- Store tile to memory
- On tile load:
- Check tile table for compression scheme
- Load required bits from memory
- Decompress into tile cache
- Anchor encoding
- Choose anchor value and compute DX, DY from adjacent pixels (fits a plane to the data)
- Use plane to predict depths at other pixels, store offset d from prediction at each pixel
- Scheme (for 24-bit depth buffer)
- Anchor: 24 bits (full resolution)
- DX, DY: 15 bits
- Per-sample offsets: 5 bits
- Depth-offset compression
- Assume depth values have low dynamic range relative to the tile’s zmin and zmax (assume two surfaces)
- Store zmin/zmax
- Store a low-precision (8-12 bit) offset value for each sample
- Explicit plane encoding: Do not attempt to infer prediction plane, just get the plane equation of the triangle directly from the rasterizer
- Store plane equation in tile (high precision values)
- Store bit per sample indicating coverage
- Hierarchical occlusion culling “Hi-Z” (see the sketch after this list)
  - Z-max culling: for each screen tile, compute the farthest value in the depth buffer, `z_max` (often needed for compression anyway)
  - During traversal, for each tile:
    - Compute the closest point on the triangle within the tile: `tri_min`
    - If `tri_min > z_max`, the triangle is completely occluded in this tile (the depth test would fail for every sample), so proceed to the next tile without performing coverage tests for individual samples
  - Z-min optimization: the depth buffer also stores `z_min` for each tile; if `tri_max < z_min`, all depth tests for fragments in the tile will pass (no need to perform the depth test on individual fragments)
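A minimal sketch of the per-tile Hi-Z classification described above, assuming the smaller-is-closer depth convention used in the depth-test pseudocode earlier; the struct and function names are illustrative:

```c
// Per-tile hierarchical Z culling sketch (smaller depth = closer).

typedef struct {
    float z_min;   // closest depth stored anywhere in this tile
    float z_max;   // farthest depth stored anywhere in this tile
} TileZ;

// tri_min / tri_max: conservative closest / farthest depth of the
// triangle over this tile (e.g., from its plane equation at the tile corners).
typedef enum { TILE_CULLED, TILE_ALL_PASS, TILE_NEEDS_PER_SAMPLE } TileResult;

static TileResult hi_z_classify(TileZ tile, float tri_min, float tri_max) {
    if (tri_min > tile.z_max)
        return TILE_CULLED;        // every sample would fail the depth test: skip tile
    if (tri_max < tile.z_min)
        return TILE_ALL_PASS;      // every sample would pass: skip per-sample depth tests
    return TILE_NEEDS_PER_SAMPLE;  // ambiguous: fall back to per-sample testing
}
```

The first comparison is the Z-max cull and the second is the Z-min optimization; conservative `tri_min`/`tri_max` bounds for the tile are cheap to obtain because the rasterizer already has the triangle’s plane equation.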
Sort first: Assign each replicated pipeline responsibility for a region of the output image
- Bandwidth scaling (small amount of sync/communication, simple point-to-point)
- Computation scaling (more parallelism = more performance)
- Simple
- Potential for workload imbalance
- “Tile spread”: as screen tiles get smaller, primitives cover more tiles (duplicate geometry processing across the parallel pipelines)
Sort middle: Assign each rasterizer a region of the render target
Interleaved mapping of screen
- [[C: RGB-Delta arrangement]]
Sort middle interleaved
- Good workload balance, both for geometry work and across rasterizers (due to interleaving)
- Does not duplicate geometry processing for each overlapped screen region
- Bandwidth scaling is poor: the sort is implemented as a broadcast (with interleaved pixel ownership, any primitive may need to reach any rasterizer)
- If tessellation is enabled, must communicate many more primitives than sort first
- Duplicated per triangle setup work across rasterizers
Sort middle tiled
- Partition screen into many small tiles (many more tiles than physical rasterizers)
- Sort geometry by tile into buckets (one bucket per tile of the screen); see the bucketing sketch below
- After all geometry is bucketed, rasterizers process buckets in parallel
- Sort requires point-to-point traffic (assuming each triangle only touches a few buckets)
- Good load balance
- Potentially low bandwidth requirements
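A minimal sketch of the bucketing step in sort-middle tiled rendering, assuming axis-aligned screen tiles and conservative binning by each triangle’s screen-space bounding box (tile size, screen size, and all struct/function names are illustrative):

```c
// Bin triangles into per-tile buckets by screen-space bounding box.
#include <stdlib.h>

#define TILE_SIZE 64          // pixels per tile side (illustrative)
#define SCREEN_W 1920
#define SCREEN_H 1080
#define TILES_X ((SCREEN_W + TILE_SIZE - 1) / TILE_SIZE)
#define TILES_Y ((SCREEN_H + TILE_SIZE - 1) / TILE_SIZE)

typedef struct { float x[3], y[3]; } Tri;   // post-projection screen coordinates

typedef struct {
    int *tri_ids;   // indices of triangles touching this tile
    int count, cap;
} Bucket;           // buckets must be zero-initialized by the caller

static void bucket_push(Bucket *b, int id) {
    if (b->count == b->cap) {
        b->cap = b->cap ? b->cap * 2 : 16;
        b->tri_ids = realloc(b->tri_ids, b->cap * sizeof(int));
    }
    b->tri_ids[b->count++] = id;
}

static float min3(float a, float b, float c) { return a < b ? (a < c ? a : c) : (b < c ? b : c); }
static float max3(float a, float b, float c) { return a > b ? (a > c ? a : c) : (b > c ? b : c); }

// After all geometry is bucketed, rasterizers process buckets (tiles) in parallel.
static void bucket_triangles(const Tri *tris, int n, Bucket buckets[TILES_Y][TILES_X]) {
    for (int i = 0; i < n; i++) {
        int tx0 = (int)min3(tris[i].x[0], tris[i].x[1], tris[i].x[2]) / TILE_SIZE;
        int tx1 = (int)max3(tris[i].x[0], tris[i].x[1], tris[i].x[2]) / TILE_SIZE;
        int ty0 = (int)min3(tris[i].y[0], tris[i].y[1], tris[i].y[2]) / TILE_SIZE;
        int ty1 = (int)max3(tris[i].y[0], tris[i].y[1], tris[i].y[2]) / TILE_SIZE;
        // Clamp to the screen, then push the triangle into every tile its bbox overlaps.
        if (tx0 < 0) tx0 = 0;
        if (ty0 < 0) ty0 = 0;
        if (tx1 >= TILES_X) tx1 = TILES_X - 1;
        if (ty1 >= TILES_Y) ty1 = TILES_Y - 1;
        for (int ty = ty0; ty <= ty1; ty++)
            for (int tx = tx0; tx <= tx1; tx++)
                bucket_push(&buckets[ty][tx], i);
    }
}
```

Because a small triangle lands in only one or a few buckets, the sort traffic stays point-to-point, and having many more tiles than rasterizers is what gives the good load balance noted above.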
Sort last fragment
- Sort after fragment processing based on (x,y) position of fragment
- No redundant work in geometry processing or rasterization (but z-cull is a problem)
- Point-to-point communication during sort
- Interleaved pixel mapping results in good workload balance for frame-buffer ops
- Workload imbalance due to primitives of varying size
- Bandwidth-limited: all shaded fragments must be communicated, so traffic grows with overdraw
- Early Z cull is difficult
Sort everywhere
- detail
- [[TODO: implementation]]