Concurrency, Memory Models & Mechanical Sympathy: Complete Reference

Concurrency & Memory Models — System Design Handbook Part 6 featured image

A self-contained reference for the hardest layer under every system: getting multiple threads to cooperate correctly and fast. The deep skill is thinking at two levels at once — the abstract correctness of synchronization and the physical reality of caches, cores, and memory ordering. Most “concurrency bugs” are really memory-model misunderstandings.

How to use this: Part 1 is the reference card. Part 2 maps the territory. Part 3 is the full depth with pros/cons per primitive. Part 4 is exhaustive interview prep with counter-question ladders.

Key takeaways

  • Correctness needs two things: mutual exclusion and memory visibility/ordering — CPUs and compilers reorder memory operations, so visibility is not automatic.
  • A data race is undefined behavior and must be eliminated; a race condition is a logic bug that can exist even without a data race.
  • False sharing — unrelated variables on one 64-byte cache line — silently destroys parallel scaling; pad hot data.
  • Amdahl’s law: a serial fraction caps speedup (5% serial caps near 20x), so reducing the serial part beats adding cores.

PART 1 — CHEATSHEET (Reference Card)

Every concept in this document, condensed.

The one idea

Correctness under concurrency requires controlling two things: mutual exclusion (only one thread in a critical section) and memory visibility/ordering (when one thread’s writes become visible to another). Modern CPUs and compilers reorder memory operations for speed, so visibility is not automatic — you buy it with synchronization or atomics.

Core vocabulary (one line each)

  • Concurrency vs parallelism — dealing with many things (structure) vs doing many things at once (execution).
  • Process vs thread — isolated address space vs shared address space within a process.
  • Race condition — outcome depends on timing/interleaving (a correctness bug).
  • Data race — concurrent access to the same location, ≥1 a write, no synchronization (UB in C/C++/Java-ish).
  • Critical section — code that must run with mutual exclusion.
  • Mutex / semaphore / monitor / condition variable — exclusion / counting permit / object+lock+condition / wait-for-predicate.
  • Deadlock — cycle of threads each waiting on a resource another holds (4 Coffman conditions).
  • Livelock / starvation / priority inversion — progress without advancing / never scheduled / low-prio holds resource a high-prio needs.
  • Sequential consistency (SC) — operations appear in some interleaving respecting each thread’s program order.
  • Memory model — the rules for which reorderings/visibilities are allowed (hardware and language).
  • Memory barrier / fence — instruction forbidding certain reorderings.
  • Cache coherence (MESI) — protocol keeping per-core caches consistent.
  • False sharing — two unrelated variables on one cache line ping-pong between cores.
  • Atomic / CAS — indivisible operation / compare-and-swap, the universal primitive.
  • ABA problem — value changes A→B→A; CAS can’t tell, corrupting lock-free code.
  • Lock-free vs wait-free — system-wide progress guaranteed / every thread progresses in bounded steps.
  • Consensus number — max threads a primitive can solve consensus for (CAS = ∞).
  • RCU — read-copy-update: lock-free reads, deferred reclamation.

Memory model strength (strong → weak)

Model Reordering allowed Notes
Sequential consistency None (beyond interleaving) The intuitive ideal; not what hardware gives
x86-TSO Store→Load reorder (store buffer) Intel/AMD; relatively strong
ARM / POWER Most reorderings Weak; needs explicit barriers
C++11 memory_order Programmer-chosen seq_cst > acquire/release > relaxed
Java Memory Model volatile/final/synchronized define happens-before Language-level guarantees

Lock-free progress guarantees

Wait-free (every thread finishes in bounded steps) ⊃ Lock-free (some thread always progresses) ⊃ Obstruction-free (a thread progresses if it runs alone). Blocking (locks) guarantees none under contention/failure.

Numbers to memorize

  • Cache line = 64 bytes (the unit of coherence; align hot data to it to avoid false sharing).
  • Uncontended atomic/CAS ~ a few–tens of ns; cache-line bounce across cores ~tens–hundreds of ns.
  • Context switch ~ 1–10 µs; a mutex under contention can cost a syscall + context switch.
  • Amdahl: speedup ≤ 1 / (s + (1−s)/N); a 5%-serial program caps near 20× no matter the core count.

Quick decision rules

  • Shared mutable state, simple → mutex (favor coarse, correct first).
  • Read-mostly → RW lock or RCU (lock-free reads).
  • Producer/consumer counting → semaphore.
  • Wait for a condition → monitor + condition variable (loop on the predicate).
  • Hot single counter/pointer, contention → atomics/CAS, beware ABA.
  • Avoid shared state entirely → actors / CSP / immutability.

Top gotchas (litmus tests)

  1. A data race is undefined behavior — not “occasionally wrong,” but anything can happen (compiler assumes none exist).
  2. Without barriers/atomics, threads may never see each other’s writes — caches + reordering, not just timing.
  3. Always wait on a condition variable in a while loop — spurious wakeups + lost-wakeup races.
  4. volatile in C/C++ is NOT for threading (it’s for memory-mapped I/O); use atomics. (Java volatile is different — it does establish happens-before.)
  5. Double-checked locking is broken without proper memory ordering.
  6. CAS suffers ABA — use tagged pointers / hazard pointers / epochs.
  7. False sharing silently destroys scaling — pad/align hot per-thread data.
  8. Deadlock needs all four Coffman conditions — break one (e.g. global lock ordering).
  9. Priority inversion sank Mars Pathfinder — use priority inheritance.
  10. Lock-free ≠ faster; it’s a progress guarantee, often more complex and not always quicker than a good lock.

PART 2 — OUTLINE (full map)

  1. Concurrency vs parallelism; processes vs threads
  2. Race conditions, data races, and critical sections
  3. Mutual exclusion algorithms (Dijkstra, Peterson, Lamport bakery)
  4. Synchronization primitives (mutex, semaphore, monitor, condition variable, RW lock)
  5. Deadlock, livelock, starvation, priority inversion
  6. Sequential consistency
  7. Hardware and language memory models; barriers
  8. Cache coherence (MESI) and false sharing
  9. Atomics, compare-and-swap, and the ABA problem
  10. Lock-free vs wait-free; consensus number; Michael–Scott queue; RCU
  11. Amdahl’s Law and mechanical sympathy (cache lines, NUMA, branch prediction)
  12. Higher-level models: actors and CSP
  13. Decision guide
  14. Make it stick — the teaching tutorial (memory-ordering & false-sharing pictures, mnemonics, flashcards)

PART 3 — DEEP DIVE

1. Concurrency vs parallelism; processes vs threads

Concurrency is a structuring concept — composing independently-progressing tasks; parallelism is executing them simultaneously on multiple cores. You can have concurrency on one core (interleaving) and parallelism only with multiple cores. Processes have isolated address spaces (safe, expensive to create/communicate — IPC); threads share an address space (cheap communication via shared memory — and therefore the entire minefield of this chapter). The fundamental difficulty: shared mutable state accessed concurrently.

2. Race conditions, data races, and critical sections

  • A race condition is a correctness bug where the result depends on timing (e.g. lost update from count++ interleaving its read-modify-write).
  • A data race is a specific technical condition: two threads access the same memory location concurrently, at least one writes, and there’s no synchronization ordering them. In C/C++ (and morally in Java) a data race is undefined behavior — the compiler is allowed to assume races don’t happen, so a racy program can do anything, not just produce a stale value.
  • A critical section is code that must execute with mutual exclusion to be correct.

The distinction matters: you can have a race condition without a data race (e.g. check-then-act across two atomics) and you must eliminate data races entirely (they’re UB), then reason about remaining races.

3. Mutual exclusion algorithms

Before hardware atomics, mutual exclusion was solved in software:

  • Dijkstra posed and solved the mutual-exclusion problem (“Cooperating Sequential Processes”, 1968) and introduced semaphores.
  • Peterson’s algorithm — elegant two-thread mutual exclusion using two flags + a turn variable; correct under sequential consistency, but requires memory barriers on real weakly-ordered hardware (a classic teaching trap).
  • Lamport’s bakery algorithm — N-thread mutual exclusion using “take a ticket” numbers; notable for not requiring atomic operations beyond single-word reads/writes.

Why they still matter: they teach that mutual exclusion is achievable from plain reads/writes only under SC, and that real hardware reordering breaks naive implementations — motivating hardware atomics and barriers.

4. Synchronization primitives

  • Mutex (lock): one holder at a time; the workhorse. Blocking; can deadlock; coarse vs fine-grained tradeoff.
  • Semaphore (Dijkstra): a counter with wait/signal; counts permits — good for bounded resources/producer-consumer. Binary semaphore ≈ mutex (but no ownership).
  • Monitor (Hoare, 1974): an object bundling data + a lock + condition variables, so methods run mutually excluded. The model behind Java synchronized.
  • Condition variable: lets a thread wait for a predicate and be signaled. Always re-check the predicate in a while loop (spurious wakeups; another thread may have invalidated it). Signal/broadcast semantics (Hoare vs Mesa) determine whether the waiter runs immediately or just becomes runnable (Mesa is standard → must loop).
  • Read-write lock: many readers or one writer; good for read-mostly data, but writers can starve and the bookkeeping can cost more than a plain mutex under low contention.

Pros: simple, composable, correct-by-construction if disciplined. Cons: blocking (a holder that stalls blocks everyone), deadlock risk, contention/serialization limiting scalability, and the latency of OS-mediated blocking.

5. Deadlock, livelock, starvation, priority inversion

Deadlock requires all four Coffman conditions (Coffman, Elphick, Shoshani, 1971) simultaneously:

  1. Mutual exclusion, 2. Hold and wait, 3. No preemption, 4. Circular wait.

Break any one to prevent deadlock — the practical fix is usually global lock ordering (kills circular wait) or lock timeouts/try-lock (introduces preemption/backoff).

  • Livelock: threads keep changing state in response to each other but make no progress (two people stepping aside in a hallway).
  • Starvation: a thread is perpetually denied a resource (e.g. writers starved by a stream of readers).
  • Priority inversion: a high-priority thread waits on a lock held by a low-priority thread that’s preempted by a medium-priority thread. Famously hit the Mars Pathfinder; fixed with priority inheritance (the holder temporarily inherits the waiter’s priority).

6. Sequential consistency

Sequential consistency (Lamport, 1979): the result of execution is as if all operations executed in some total order, and each thread’s operations appear in that order in program order. It’s the intuitive model programmers assume — but real hardware does not provide it by default, because store buffers and out-of-order execution reorder memory operations for performance. SC is the yardstick against which weaker models are measured.

7. Hardware and language memory models; barriers

Because SC is expensive, CPUs implement weaker memory models that allow reordering:

  • x86-TSO (Sewell et al., 2010, formalizing Intel/AMD): a per-core store buffer allows a thread’s store to be delayed past a later load (Store→Load reordering) — relatively strong otherwise.
  • ARM / POWER: weak models allowing most reorderings; correctness requires explicit barriers/fences.

Memory barriers/fences forbid specific reorderings (e.g. a release fence ensures prior writes are visible before a later store). Languages expose this so you don’t hand-write fences:

  • C++11 memory model (Boehm & Adve, 2008): memory_order_seq_cst (default, strongest), acquire/release (pairwise synchronization), relaxed (atomicity only, no ordering). Acquire/release is the common efficient idiom (e.g. a release-store publishing data, an acquire-load consuming it).
  • Java Memory Model: defines happens-before via synchronized, volatile (which does establish ordering, unlike C/C++ volatile), and final field guarantees.

The big trap: in C/C++, volatile is for memory-mapped hardware, not thread synchronization — use std::atomic. And double-checked locking is broken unless the flag uses proper acquire/release (the classic example of needing the memory model).

Pros of weak models: huge performance from reordering/buffering. Cons: you must reason about ordering explicitly; subtle, non-deterministic bugs that appear only on weak hardware (ARM) after “working” on x86.

8. Cache coherence (MESI) and false sharing

Each core has private caches; cache coherence keeps them consistent. MESI tracks each cache line as Modified / Exclusive / Shared / Invalid; when one core writes a line, others’ copies are invalidated, forcing them to re-fetch. Coherence operates at cache-line granularity (64 bytes).

False sharing is the performance killer that follows: two unrelated variables that happen to sit on the same 64-byte line cause the line to bounce between cores on every write, even though the threads touch different variables. Fix by padding/aligning hot per-thread data to separate cache lines (e.g. alignas(64), padding in struct layouts, per-thread counters).

Pros: coherence gives the illusion of one shared memory. Cons: coherence traffic is a real, often invisible cost; false sharing can turn a “parallel” program slower than serial.

9. Atomics, compare-and-swap, and the ABA problem

Atomic operations execute indivisibly. The universal building block is compare-and-swap (CAS): atomically, “if location == expected, set it to new, else fail.” With CAS you build lock-free counters, stacks, queues. Other primitives: fetch-and-add, exchange, load-link/store-conditional (ARM/POWER).

The ABA problem: a thread reads value A, another thread changes it A→B→A, then the first thread’s CAS succeeds (sees A) — but the world changed underneath it (e.g. a freed-and-reused node), corrupting a lock-free structure. Fixes: tagged pointers (version counter alongside the pointer, so A-with-tag-1 ≠ A-with-tag-2), hazard pointers, or epoch-based reclamation.

Pros: fine-grained, non-blocking coordination; no lock overhead. Cons: ABA and memory-reclamation hazards; correct lock-free code is hard and easy to get subtly wrong.

10. Lock-free vs wait-free; consensus number; Michael–Scott queue; RCU

Progress guarantees (strongest first): wait-free (every thread completes in a bounded number of steps — no starvation, no matter what others do) ⊃ lock-free (the system as a whole always makes progress — some thread completes) ⊃ obstruction-free (a thread completes if it eventually runs without interference). Lock-based code guarantees none of these under contention or if a holder is preempted/crashes.

Consensus number (Herlihy, 1991): the maximum number of threads for which a primitive can solve wait-free consensus. Atomic registers have consensus number 1; CAS has consensus number ∞ — which is why CAS (or LL/SC) is the foundation of universal lock-free constructions. This is a deep result: it ranks synchronization primitives by power.

Michael–Scott queue (1996): the canonical lock-free FIFO queue using CAS on head/tail pointers — the basis of java.util.concurrent.ConcurrentLinkedQueue. Illustrates real lock-free design (and the reclamation/ABA care it needs).

RCU (Read-Copy-Update): readers proceed with essentially no synchronization (lock-free, extremely cheap reads); updates make a copy, swap a pointer, and defer reclaiming the old version until all pre-existing readers have finished (a “grace period”). Ideal for read-mostly data (heavily used in the Linux kernel).

Pros: progress guarantees, no deadlock, great read scaling (RCU). Cons: complexity, memory-reclamation problems, and the fact that lock-free is not automatically faster than a well-designed lock — it’s a guarantee, not a speed promise.

11. Amdahl’s Law and mechanical sympathy

Amdahl’s Law (1967): if fraction s of work is inherently serial, max speedup with N processors is 1 / (s + (1−s)/N). A program 5% serial caps at ~20× regardless of cores — so reducing the serial/contended fraction matters more than adding cores. (Gustafson’s Law reframes this for problems that grow with capacity.)

Mechanical sympathy — writing code that works with the hardware:

  • Cache lines (64B): keep hot data together (spatial locality), separate contended data (avoid false sharing).
  • NUMA: memory is faster from the local socket; pin threads/memory to avoid remote-node latency.
  • Branch prediction & prefetch: predictable, sequential access patterns are far faster than pointer-chasing; data-oriented design beats naive OOP for hot loops.
  • Contention is the enemy: an uncontended lock is cheap; a contended one (cache-line bouncing + context switches) is catastrophic.

12. Higher-level models: actors and CSP

To sidestep shared-memory hazards, structure concurrency around message passing:

  • Actor model (Erlang/Akka): isolated actors with private state communicate only via asynchronous messages; no shared memory → no data races by construction. Trades shared-memory bugs for message-protocol/ordering reasoning and mailbox overflow concerns.
  • CSP (Hoare, “Communicating Sequential Processes”): processes communicate over channels (the model behind Go’s goroutines + channels). “Don’t communicate by sharing memory; share memory by communicating.”

Pros: eliminates whole bug classes; scales across machines naturally. Cons: message-passing overhead, backpressure/mailbox management, and harder to express tightly-shared state.

13. Decision guide

Can you avoid shared mutable state?
├─ Yes ─► IMMUTABILITY / ACTORS / CSP channels  (no data races by construction)
└─ No ─►
     Read-mostly data?            ► RW LOCK or RCU (lock-free reads)
     Simple shared state?         ► MUTEX (coarse first; correct > clever)
     Counting/bounded resource?   ► SEMAPHORE
     Wait for a condition?        ► MONITOR + CONDITION VARIABLE (while-loop the predicate)
     Single hot counter/pointer?  ► ATOMIC / CAS (beware ABA)
     Need progress guarantee?     ► LOCK-FREE (or WAIT-FREE if no starvation allowed)

Performance pass:
   Scaling poorly? ► check FALSE SHARING (pad to 64B), CONTENTION (reduce serial fraction — Amdahl), NUMA locality
   Weak hardware (ARM)? ► verify MEMORY ORDERING (acquire/release), not just x86 testing

Reach-for / avoid:

  • Mutexfor: most shared state. Avoid when: it’s the hot contended path limiting scaling (then reduce critical-section size or go lock-free).
  • RCU/RW lockfor: read-mostly. Avoid when: write-heavy (writer starvation/overhead).
  • CAS/lock-freefor: hot counters/queues, progress guarantees. Avoid when: the structure is complex (reclamation/ABA risk) and a lock suffices.
  • Actors/CSPfor: isolating state, distribution. Avoid when: you need tight, low-latency shared-memory coordination.

PART 4 — INTERVIEW ARSENAL

How to wield this. Senior signals: (1) you separate data race (UB, must eliminate) from race condition (logic bug); (2) you invoke the memory model — “without acquire/release the reader may never see the write” — rather than blaming “timing”; (3) you cite false sharing / contention / Amdahl when explaining why parallel code doesn’t scale. Each question has a model answer and counter-ladder.

A. Fundamentals

Q1. Data race vs race condition — are they the same? Answer: No. A data race is the technical condition of unsynchronized concurrent access with ≥1 write — undefined behavior in C/C++/Java terms, so it must be eliminated entirely. A race condition is a logic bug where the result depends on interleaving; you can have one without a data race (e.g. TOCTOU across two atomics). First remove data races, then reason about remaining races. Counter-ladder:

  • “Why is a data race UB and not just a stale read?” → The compiler optimizes assuming no races; with one, transformations make any behavior possible.
  • “Example race condition without a data race?” → Check-then-act on a thread-safe map (if !contains: put) — each op atomic, the pair isn’t.

Q2. Why might one thread never see another’s write even long after? Answer: Memory visibility isn’t automatic. Writes can sit in a store buffer / cache, and the compiler and CPU reorder operations under a weak memory model. Without a synchronizing action (lock, or atomic with acquire/release), there’s no happens-before edge forcing visibility — so the other thread may see a stale value indefinitely. You buy visibility with synchronization. Counter-ladder:

  • “How do you fix it minimally?” → Make the variable atomic with release (writer) / acquire (reader), or guard it with a lock.
  • “Is volatile enough in C++?” → No — that’s for MMIO; use std::atomic. (Java volatile does work, different semantics.)

B. Synchronization & deadlock

Q3. Why must you wait on a condition variable in a loop? Answer: Two reasons: spurious wakeups (the OS may wake a waiter with no signal), and Mesa semantics (signaling makes the waiter runnable, not immediately running, so another thread can change the predicate first). Re-checking the predicate in a while is the only correct pattern. Counter-ladder:

  • “Hoare vs Mesa monitors?” → Hoare hands control immediately to the signaled thread (predicate still holds); Mesa just marks it runnable (must re-check). Real systems are Mesa.
  • “Lost wakeup?” → Signaling before the waiter waits loses it; check the predicate under the lock before waiting.

Q4. Four conditions for deadlock and how to prevent it. Answer: Mutual exclusion, hold-and-wait, no preemption, circular wait (Coffman). Break any one — most practically impose a global lock acquisition order (kills circular wait) or use try-lock with backoff / timeouts (introduces preemption). Detection + victim abort is the alternative (databases do this). Counter-ladder:

  • “Two locks acquired in different orders by two threads — fix?” → Always acquire in a canonical order (e.g. by address/id).
  • “What’s priority inversion and the fix?” → High-prio waits on a lock held by preempted low-prio; fix with priority inheritance (Mars Pathfinder).

C. Memory model & hardware

Q5. Code works on your x86 laptop, fails on an ARM server. Why? Answer: x86 (TSO) is a relatively strong memory model; ARM is weak and allows more reorderings. Code that relied on incidental x86 ordering (missing barriers/acquire-release) breaks on ARM where those reorderings actually occur. The fix is correct memory ordering on atomics, not luck. Counter-ladder:

  • “What does x86 still reorder?” → Store→Load (store buffer) — so even x86 isn’t sequentially consistent.
  • “Double-checked locking — why broken?” → The published object can be seen partially constructed without acquire/release ordering on the flag.

Q6. A parallel program doesn’t scale past a few cores. Diagnose. Answer: Candidates: (1) Amdahl — a large serial/contended fraction caps speedup; (2) lock contention — threads serialize on a hot mutex; (3) false sharing — unrelated variables share a 64B cache line, bouncing between cores; (4) NUMA — remote-memory access. Profile to find which; fixes are reduce critical-section size, shard/pad data, pin to NUMA nodes. Counter-ladder:

  • “How confirm false sharing?” → Per-core cache-miss/HITM counters; fix by padding/aligning to 64B and verify scaling improves.
  • “5% serial — speedup ceiling on 64 cores?” → ~1/(0.05+0.95/64) ≈ ~15–16×; reducing the serial part matters more than more cores.

D. Lock-free

Q7. What’s the ABA problem and when does it bite? Answer: In CAS-based lock-free code, a location goes A→B→A between a thread’s read and its CAS; the CAS sees A and succeeds, but the underlying object changed (e.g. a node was freed and reallocated), corrupting the structure. Fixes: tagged/versioned pointers, hazard pointers, or epoch-based reclamation. Counter-ladder:

  • “Why is reclamation the hard part of lock-free?” → You can’t free memory another thread might still dereference; ABA + use-after-free; hence hazard pointers/RCU/epochs.
  • “What does CAS’s consensus number ∞ mean?” → CAS can solve wait-free consensus for any number of threads — the basis of universal lock-free constructions (Herlihy).

Q8. When would you choose RCU? Answer: Read-mostly data where read latency/scaling is critical: RCU readers run with almost no synchronization, while updaters copy, swap a pointer, and defer reclaiming the old version until a grace period passes (all prior readers done). Heavily used in the Linux kernel for routing tables, etc. Counter-ladder:

  • “RCU downside?” → Writes are heavier (copy + deferred reclamation + grace periods); bad for write-heavy data.
  • “Lock-free always faster than a lock?” → No — it’s a progress guarantee; a good lock can outperform a complex lock-free structure.

E. Worked drill — driving a design end-to-end

Watch correctness (data-race freedom + memory ordering) come first, then a performance pass for contention and false sharing.

Prompt: “Design an in-memory metrics aggregator: dozens of worker threads increment per-metric counters millions of times/sec; a reader thread periodically snapshots all counters for export. It must be correct and scale across cores.”

1 — Establish correctness first. “Concurrent increments to a shared counter are a read-modify-write — a data race if unsynchronized, and a lost-update race even if not. So every counter must be an atomic (fetch_add) or guarded. I’ll use atomics: relaxed ordering is sufficient for a pure counter increment (I only need atomicity, not ordering with other data) — that’s the cheap, correct choice.”

2 — Kill contention (the real problem at millions/sec). “A single shared atomic counter per metric will cache-line bounce across all writer cores — coherence traffic dominates and it won’t scale. So I shard: per-thread (or per-core) counters, each thread increments its own with no contention; the reader sums the shards at snapshot time. This trades a tiny read cost (sum N shards) for contention-free writes.”

3 — Avoid false sharing. “Per-thread counters packed in an array would share 64B cache lines → false sharing reintroduces the bounce. I pad/align each shard to 64 bytes (alignas(64)) so each lives on its own line. This is the difference between scaling linearly and scaling negatively.”

4 — The reader/snapshot. “The reader sums shards; an exact instantaneous total isn’t required (metrics tolerate slight skew), so I read each shard with a relaxed/acquire load and sum — no global lock, no stopping the writers. If I needed a consistent snapshot across metrics, I’d use a seqlock or RCU-style pointer swap of the whole counter array, keeping reads cheap.”

5 — Memory model check. “Increments use relaxed atomics (no cross-variable ordering needed). The reader needs no special ordering for approximate sums. If a counter’s value gated other state, I’d use acquire/release to publish that state — but here it doesn’t. I explicitly verify on a weak-memory (ARM) target, not just x86, so I’m not relying on TSO.”

6 — Tradeoffs stated. “I chose per-thread sharded atomics + cache-line padding: contention-free, linearly-scaling writes at the cost of O(threads) work per snapshot and a little memory per shard. I avoided a mutex (would serialize the hot path) and a single shared atomic (would cache-bounce). If snapshots needed strict cross-metric consistency, I’d add a seqlock/RCU swap and pay a bit more on the read side.”

Template: data-race freedom + correct memory ordering first → eliminate contention (shard) → eliminate false sharing (pad) → choose the weakest correct ordering → state the tradeoff.

F. Consolidated gotchas & traps (rapid fire)

  • Data race = UB, eliminate entirely; race condition = logic bug.
  • Visibility needs synchronization — caches + reordering, not timing.
  • Condition variables: always while-loop the predicate.
  • C/C++ volatile ≠ thread-safe; use atomics.
  • Double-checked locking needs acquire/release.
  • CAS → ABA; use tags/hazard pointers/epochs.
  • False sharing — pad hot data to 64B.
  • Deadlock — impose global lock ordering.
  • Priority inversion — priority inheritance.
  • Amdahl — reduce the serial fraction, not just add cores.
  • Test on weak hardware (ARM), not only x86.

G. Pros/cons master tables

Synchronization approaches

Approach Pros Cons
Mutex Simple, correct, composable Blocking; deadlock; contention serializes
RW lock Read concurrency Writer starvation; overhead at low contention
Semaphore Counts permits/resources No ownership; easy to misuse
Monitor + condvar Clean wait-for-condition Must loop predicate; Mesa semantics
Atomics/CAS Fine-grained, non-blocking ABA, reclamation, hard to get right
RCU Near-free reads Heavy writes; grace-period complexity
Actors/CSP No data races by design Message overhead; backpressure

Progress guarantees

Guarantee Meaning Cost
Wait-free Every thread bounded steps Hardest to build
Lock-free Some thread always progresses Complex; ABA/reclamation
Obstruction-free Progress if run alone Weakest non-blocking
Blocking (locks) None under contention/failure Simplest

Go deeper (primary sources)

  • Lamport, “How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs” (1979) — sequential consistency.
  • Herlihy & Shavit, The Art of Multiprocessor Programming (2008); Herlihy, “Wait-Free Synchronization” (1991) — consensus number.
  • Adve & Gharachorloo, “Shared Memory Consistency Models: A Tutorial” (1996).
  • Michael & Scott, “Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms” (1996).
  • Hoare, “Monitors: An Operating System Structuring Concept” (1974); Hoare, Communicating Sequential Processes (CSP).
  • Dijkstra, “Cooperating Sequential Processes” (1968).
  • Boehm & Adve, “Foundations of the C++ Concurrency Memory Model” (2008).
  • Drepper, “What Every Programmer Should Know About Memory” (2007).
  • Coffman, Elphick, Shoshani, “System Deadlocks” (1971) — the four conditions.
  • Sewell et al., “x86-TSO: A Rigorous and Usable Programmer’s Model for x86 Multiprocessors” (2010).

PART 5 — MAKE IT STICK (Teaching Tutorial)

The references are the map; this is the driving lesson. Concurrency bugs feel like magic because two invisible things bite you — memory reordering and cache-line bouncing. We make both visible. Once you can picture the store buffer and the cache line, the bugs stop being mysterious.

14.1 The one idea: correctness needs TWO things

   (1) MUTUAL EXCLUSION — only one thread in the critical section at a time
   (2) MEMORY VISIBILITY — when does thread B actually SEE thread A's write?

People remember (1) and forget (2). But CPUs and compilers reorder memory ops for speed, so a write by A can be invisible to B for a long time — even with no lock bug. Visibility is not free; you buy it with a lock or an atomic. Hold both ideas and most “impossible” bugs become obvious.

14.2 Data race vs race condition (different things!)

  DATA RACE = unsynchronized concurrent access, ≥1 write   → UNDEFINED BEHAVIOR
              (the compiler may do ANYTHING — must eliminate entirely)
  RACE CONDITION = result depends on timing/interleaving   → a logic bug
              (can exist even WITHOUT a data race, e.g. check-then-act on two atomics)

First make the program data-race-free (locks/atomics), then reason about remaining race conditions. They are not the same word.

14.3 Why B never sees A’s write (the store buffer picture)

  Core A                         Core B
  x = 1   → [store buffer]       reads x …
  flag=1  → [store buffer]       sees flag=1 but x still 0 ?!
              │ (drains later, maybe reordered)
              ▼
            memory

A’s writes sit in a store buffer and reach memory later, possibly reordered (x86 even allows store→load reorder). So B can see flag=1 before x=1. The fix is an ordering barrier: make flag an atomic with release (A) / acquire (B) — that creates a happens-before edge so “if you see flag, you see x.” This is why volatile (C/C++) is not enough and double-checked locking is broken without it.

Mnemonic: release = “publish everything I wrote,” acquire = “subscribe to everything they wrote.”

14.4 False sharing (the silent scaling killer)

   One 64-byte cache line:  [ counterA | counterB | … ]
   Core A writes counterA → invalidates the WHOLE line on Core B
   Core B writes counterB → invalidates the WHOLE line on Core A
   → the line ping-pongs between cores every write, even though the
     two threads touch DIFFERENT variables.  Parallel code → slower than serial.

Two unrelated variables on the same 64-byte line = coherence traffic on every write. Fix: pad/align hot per-thread data to its own cache line (alignas(64)). This single trick is the difference between linear scaling and negative scaling.

14.5 CAS and the ABA trap (lock-free in one picture)

  CAS(addr, expect, new) = atomically: if *addr==expect, set new, else fail.
  ABA:  read A → (other thread: A→B→A) → your CAS sees A, SUCCEEDS,
        but the world changed underneath (e.g. node freed & reused). 💥
  Fix: tag/version the pointer (A#1 ≠ A#2), or hazard pointers / epochs.

CAS is the universal primitive (its “consensus number” is infinite), but the hard part of lock-free isn’t the CAS — it’s safely reclaiming memory another thread might still touch.

14.6 Deadlock = 4 conditions; break ONE

  Mutual exclusion + Hold-and-wait + No preemption + Circular wait  → DEADLOCK
  Break circular wait → impose a GLOBAL LOCK ORDER (acquire by id/address). Done.

14.7 Analogies that stick

  • Memory reordering = sending postcards out of order — B might get “I arrived (flag)” before “here’s the address (x)” unless you number them (barrier).
  • False sharing = two roommates sharing one whiteboard — every time one writes their half, the other has to re-copy the whole board.
  • Mutex = a single bathroom key. RW lock = many readers, one writer. Lock-free = a revolving door (someone always gets through).
  • ABA = a parking spot that was vacated and re-taken while you blinked — same spot, different car.

14.8 Misconceptions → corrections

You might think… Actually…
“Data race = race condition.” Different: data race is UB (eliminate); race condition is a logic bug.
“It works, so visibility is fine.” It may rely on x86’s accidental ordering and break on ARM. Test weak hardware.
volatile makes it thread-safe (C/C++).” No — that’s for MMIO; use atomics with acquire/release.
“Lock-free is always faster.” It’s a progress guarantee, not a speed promise; a good lock can win.
“More cores → proportional speedup.” Amdahl: the serial fraction caps it (5% serial → ~20× max).

14.9 Explain it back (Feynman)

  1. The two things correctness needs — and which one people forget? [14.1]
  2. Data race vs race condition — give an example of one without the other. [14.2]
  3. Draw the store buffer; why might B see flag before x, and the fix? [14.3]
  4. Draw false sharing; why does it kill scaling, and the fix? [14.4]
  5. What’s ABA and why is reclamation the hard part of lock-free? [14.5]

14.10 Flashcards (cover the right column)

Prompt Answer
Correctness needs Mutual exclusion + memory visibility
Data race Unsynchronized access, ≥1 write → UB
Why B misses A’s write Store buffer + reordering; needs acquire/release
release / acquire Publish my writes / subscribe to theirs
False sharing Unrelated vars on one 64B line ping-pong
Fix false sharing Pad/align to 64 bytes
CAS If equals expected, swap; else fail
ABA fix Tagged pointers / hazard pointers / epochs
Break deadlock Global lock ordering
Amdahl 5% serial ~20× speedup ceiling

14.11 The 60-second recall

“Concurrency correctness needs two things: mutual exclusion and memory visibility — and people forget visibility. CPUs buffer and reorder writes, so one thread can miss another’s write indefinitely unless you add an ordering barrier (atomic release/acquire, or a lock); that’s why C/C++ volatile isn’t enough and double-checked locking breaks. A data race is undefined behavior you must eliminate; a race condition is a logic bug that can exist without one. False sharing — unrelated variables on the same 64-byte cache line — silently destroys scaling, so pad hot data to its own line. Lock-free uses CAS but the hard part is memory reclamation and the ABA problem; fix with tagged or hazard pointers. Break deadlock by imposing a global lock order, and remember Amdahl: the serial fraction, not the core count, caps your speedup.”

Frequently asked questions

What’s the difference between a data race and a race condition?

A data race is unsynchronized concurrent access to the same memory with at least one write — undefined behavior that must be eliminated entirely. A race condition is a logic bug where the result depends on timing; it can exist even without a data race, such as a check-then-act across two thread-safe operations.

Why might one thread never see another thread’s write?

Memory visibility is not automatic. Writes can sit in store buffers and caches, and the compiler and CPU reorder operations under a weak memory model. Without a synchronizing action — a lock or an atomic with acquire/release ordering — there is no happens-before edge forcing visibility, so a stale value can persist indefinitely.

What is false sharing?

False sharing happens when two unrelated variables occupy the same 64-byte cache line, so writes from different cores keep invalidating each other’s cached copy, causing the line to bounce between cores. It can make parallel code slower than serial; the fix is to pad or align hot per-thread data to separate cache lines.

What is the ABA problem?

In lock-free code using compare-and-swap, a value can change from A to B and back to A between a thread’s read and its CAS, so the CAS wrongly succeeds even though the underlying object changed — for example, a freed and reused node. Fixes include tagged or versioned pointers, hazard pointers, and epoch-based reclamation.

What’s the difference between lock-free and wait-free?

Lock-free guarantees that the system as a whole always makes progress — some thread completes. Wait-free is stronger: every thread completes in a bounded number of steps, with no starvation. Lock-based code guarantees neither under contention or if a lock holder is preempted.

Previous