A self-contained reference for the single hardest topic in distributed systems: how independent machines agree on a value (consensus) and what guarantees a client gets about the order and visibility of operations (consistency). Skim the cheatsheet to reload the whole topic in five minutes; drop into any deep-dive section for first-principles depth; rehearse the interview arsenal to walk in knowing every question and every follow-up they can drill into.
How to use this: Part 1 is the reference card (every concept, one line each). Part 2 is the map. Part 3 is the full depth with pros/cons baked in. Part 4 is exhaustive interview prep — questions, model answers, and the counter-question ladders interviewers use to find your floor.
Key takeaways
- Consistency models are the contract about what clients can observe; consensus is the mechanism nodes use to agree — they meet in state-machine replication.
- Linearizability is a single-object, real-time guarantee; serializability is a multi-object transaction guarantee with no real-time requirement. Strict serializability = both.
- CAP’s “C” means linearizability, not ACID consistency, and partitions are not optional — you choose consistency or availability only when one occurs.
- To tolerate f crash faults you need 2f+1 nodes; to tolerate f Byzantine faults you need 3f+1.
PART 1 — CHEATSHEET (Reference Card)
Every concept in this document, condensed. If it’s discussed below, it has a one-line entry here.
The two questions this topic answers
- Consistency model = the contract between a data store and its clients: which orderings and which stale/fresh reads are legal. It’s about what you’re allowed to observe.
- Consensus = the mechanism by which a set of nodes agree on one value despite failures. It’s about how nodes make a decision stick.
- They connect through state-machine replication: run consensus on a log of commands, apply commands in log order on every replica, and you get a linearizable replicated service.
Consistency models — strongest → weakest
| Model | One-line guarantee | Real-time order? | Available under partition? |
|---|---|---|---|
| Strict serializability | Transactions execute in a serial order that respects real wall-clock order | Yes | No |
| Linearizability | Each single-object op appears atomic at a point between call & return; respects real time | Yes | No |
| Sequential consistency | Some total order consistent with each client’s program order; no real-time guarantee | No | No |
| Serializability | Transactions equivalent to some serial order (multi-object); order need not match real time | No | No |
| Snapshot isolation (SI) | Txn reads a consistent snapshot; first-committer-wins on writes | No | Partial |
| Causal / Causal+ | Causally related ops seen in same order everywhere; concurrent ops may differ. Causal+ also converges | Causal only | Yes (strongest that is) |
| Session guarantees | Read-your-writes, monotonic reads, monotonic writes, writes-follow-reads | Per-session | Yes |
| Bounded staleness | Reads lag the latest write by at most Δ time or k versions | No | Yes |
| Eventual consistency | If writes stop, replicas eventually converge | No | Yes |
The classic trap: Linearizability is a single-object, real-time guarantee. Serializability is a multi-object transaction guarantee with no real-time requirement. They are orthogonal axes. Strict serializability = serializability + linearizability.
The three theorems
- FLP (1985): In a fully asynchronous system, no deterministic consensus protocol can guarantee termination if even one node may crash. (Safety is fine; liveness is impossible to guarantee.) Escaped via randomization, partial synchrony, or failure detectors.
- CAP: During a network partition (P) you must choose Consistency (= linearizability) or Availability. You cannot have both. P is not optional in a distributed system, so the real choice is *C-vs-A during a partition*.
- PACELC: If Partition → choose A or C; Else (normal operation) → choose Latency or Consistency. CAP ignores the everyday latency↔consistency tradeoff; PACELC names it. (e.g. Dynamo = PA/EL; Spanner = PC/EC.)
Failure & timing models
- Crash-stop < crash-recovery < omission < Byzantine (each strictly harder).
- Synchronous (bounded delay) → consensus easy. Asynchronous (no bound) → FLP bites. Partial synchrony (eventually bounded) → where real protocols live (Paxos, Raft).
Consensus algorithms at a glance
| Algorithm | Model | Leader | Steady-state latency | Notes |
|---|---|---|---|---|
| Basic Paxos | crash, async-safe | none | 2 RTT | Safe always; can livelock (dueling proposers) |
| Multi-Paxos | crash | stable leader | 1 RTT | Elect leader, skip Phase 1 thereafter |
| Raft | crash | strong leader | 1 RTT | Built for understandability; strong-leader log |
| Zab | crash | leader | 1 RTT | ZooKeeper; guarantees primary order |
| Viewstamped Replication | crash | leader (primary) | 1 RTT | Predates/parallels Paxos in spirit |
| EPaxos | crash | leaderless | 1 RTT (no conflict) | Great for WAN; tracks per-command deps |
| Flexible Paxos | crash | leader | 1 RTT | Phase-1 & Phase-2 quorums need only intersect |
| PBFT | Byzantine | primary | 2 RTT (3 phases) | Tolerates f malicious of 3f+1 |
Quorum & fault-tolerance math (memorize)
- Crash faults: to tolerate
fcrashes you needN = 2f + 1nodes; a quorum is a majority = ⌊N/2⌋ + 1. - Byzantine faults: to tolerate
fmalicious you needN = 3f + 1; quorum is2f + 1. - Quorum intersection is the whole trick: any two quorums share ≥1 node, so a read quorum always overlaps the latest write quorum.
- Dynamo-style tunable quorum:
W + R > Ngives read-your-writes per key;W > N/2prevents two conflicting writes from both succeeding. - Flexible Paxos: only need
Q1 + Q2 > N(Phase-1 + Phase-2 quorums intersect), not majority for both.
Numbers to keep in your head
- Same-datacenter RTT ≈ 0.5 ms; cross-region (e.g. US-EU) RTT ≈ 60–150 ms.
- One consensus round = 1 RTT to a majority in steady state ⇒ cross-region strongly-consistent writes cost ~one inter-region RTT, minimum.
- Majority of 3 = 2; of 5 = 3. 5 nodes tolerates 2 failures (common production sweet spot for metadata stores).
- Spanner commit-wait ≈ wait out TrueTime uncertainty ε (historically ~1–7 ms) before acking a write.
Related mechanisms
- State-machine replication (SMR): deterministic state machine + consensus-ordered command log ⇒ identical replicas.
- Atomic (total-order) broadcast ≡ consensus: each can implement the other. Ordering messages is the same problem as agreeing on a value.
- 2PC: atomic commit across shards. Blocking: if the coordinator dies after PREPARE, participants are stuck holding locks. Not fault-tolerant by itself.
- 3PC: adds a phase to be non-blocking under crash-stop + synchrony — but breaks under network partitions. Rarely used.
- Modern fix: run 2PC where each participant and the coordinator are themselves Paxos/Raft groups (Spanner) → fault-tolerant distributed transactions.
- Read paths in consensus systems: leader lease (time-bounded, read locally), ReadIndex (confirm leadership via heartbeat quorum, then read), follower/stale reads (bounded staleness).
- TrueTime / external consistency: GPS + atomic clocks give a bounded time interval
[earliest, latest]; commit-wait makes timestamp order match real order ⇒ strict serializability across the globe.
Top gotchas (the litmus tests)
- CAP’s “C” means linearizability, not ACID’s “C” (constraint preservation). Different words, same letter.
- “P” in CAP is not a choice — partitions happen. You choose A or C when one occurs.
- Snapshot isolation is not serializable — it permits write skew. (Postgres “Serializable” = SSI, which is; “Repeatable Read” = SI, which isn’t.)
- Eventual consistency without conflict resolution = silent lost updates (last-write-wins drops data).
- Reading from a follower can return stale data even in a “strongly consistent” system unless you use ReadIndex/lease reads.
- 2PC is not consensus and is not fault-tolerant alone — coordinator failure blocks.
- Basic Paxos guarantees safety, never liveness — two proposers can duel forever.
- Linearizability composes (object-local); sequential consistency does not.
- A majority quorum tolerates crashes, not lies — Byzantine needs 3f+1.
- “Strongly consistent” is ambiguous in interviews — always name the exact model (linearizable? strict serializable?).
PART 2 — OUTLINE (full map)
- Why this topic exists — the root problem
- System, timing, and failure models
- Consistency models — the full lattice – 3.1 Linearizability – 3.2 Sequential consistency – 3.3 Causal consistency & causal+ – 3.4 Session guarantees – 3.5 Eventual consistency – 3.6 Bounded staleness & PRAM – 3.7 The transactional axis: serializability, strict serializability, snapshot isolation – 3.8 The unified hierarchy & how to place any model
- The three theorems — FLP, CAP, PACELC
- The consensus problem, formally
- Quorums & state-machine replication
- Paxos (basic and Multi-Paxos)
- Raft
- Zab, Viewstamped Replication, and the leader-based family
- Leaderless & WAN-optimized: EPaxos and Flexible Paxos
- Byzantine fault tolerance — PBFT and the blockchain bridge
- Atomic broadcast ≡ consensus
- Atomic commit: 2PC, 3PC, and making them fault-tolerant
- Read paths: leases, ReadIndex, follower reads
- External consistency & TrueTime (Spanner)
- How real systems compose all of this
- Decision guide — choosing a model and protocol under pressure
- Make it stick — the teaching tutorial (diagrams, a full Paxos run, mnemonics, flashcards)
PART 3 — DEEP DIVE
1. Why this topic exists — the root problem
A single machine has a single, total order of events and one copy of the data; “consistency” is free. The moment you replicate data (for fault tolerance / read scaling) or partition it (for capacity), two truths collide:
- Replicas can disagree. Updates reach copies at different times. A reader hitting copy A and a reader hitting copy B may see different histories.
- There is no global clock. Independent nodes can’t agree on “now” without coordination, so “which write happened first” is not automatically defined.
So we need two things. A consistency model is the promise we make to clients about which interleavings and which stale reads are observable — it scopes what application developers must reason about. Consensus is the machinery that lets nodes agree on one history (one value, one log entry, one leader) so that the promise can be kept despite crashes and partitions.
The deep insight that ties the whole topic together: strong consistency is expensive precisely because it requires coordination, and coordination is exactly what FLP/CAP say is fragile under asynchrony and partitions. Every design decision in this space is a purchase of consistency with the currency of latency and availability.
2. System, timing, and failure models
You cannot reason about a protocol without first stating its assumptions. Interviewers probe this because vague answers (“the network is unreliable”) signal shallow understanding.
Timing models
- Synchronous: known upper bounds on message delay and processing time. Consensus is easy (you can detect failure reliably by timeout). Unrealistic on the open internet.
- Asynchronous: no bound whatsoever on delays. You cannot distinguish a slow node from a dead one. FLP impossibility lives here.
- Partial synchrony (Dwork–Lynch–Stockmeyer): the system is asynchronous for an unknown period, then becomes synchronous (or bounds hold after some unknown time). This is the realistic model: protocols stay safe always and become live once the network behaves. Paxos and Raft are designed for this.
Failure models (strictly increasing difficulty)
- Crash-stop (fail-stop): a node halts and never returns. Simplest. (Paxos/Raft target this.)
- Crash-recovery: a node crashes and later rejoins, possibly with stale state; needs stable storage to recover safely.
- Omission: node/link drops some messages.
- Byzantine: a node may do anything — lie, send conflicting messages to different peers, collude. Requires cryptographic techniques and 3f+1 nodes (PBFT, blockchains).
Pros/cons of choosing a model
- Stronger assumptions (synchrony, crash-only): simpler, faster protocols. Con: violated assumptions = silent incorrectness.
- Weaker assumptions (async, Byzantine): robust to real-world nastiness. Con: more rounds, more nodes, more latency.
3. Consistency models — the full lattice
Two related but distinct axes run through this topic:
- Single-object / single-operation models (linearizability → sequential → causal → eventual): about the visibility and ordering of individual reads/writes to one register.
- Transactional isolation models (strict serializable → serializable → snapshot isolation → …): about grouping multiple operations on multiple objects into atomic units.
The cleanest way to remember the join: strict serializability = serializability (the transaction axis) + linearizability (the real-time axis).
3.1 Linearizability
Definition. There exists a single total order of all operations such that (a) it is consistent with real-time order — if op A returns before op B is invoked, A precedes B in the order — and (b) each operation appears to take effect atomically at a single instant between its invocation and its response. Equivalent intuition: the system behaves as if there’s one copy of the data and every operation happens instantaneously at some point in its interval.
Why it matters. It’s the strongest single-object guarantee and the one most programmers assume by default. It makes a distributed register behave like a local variable. CAP’s “C” is exactly this.
How you get it. Route all operations through consensus/SMR (a leader applying a totally-ordered log), or use synchronized quorum reads/writes with read-repair plus a real-time anchor.
Key property — it composes (it’s “local”). If every individual object is linearizable, the composition of all of them is linearizable. This is why linearizability is the building block of choice. (Sequential consistency lacks this.)
Pros
- Simplest mental model for application developers — “there is one copy.”
- Composable across objects.
- Enables correct locks, leader election, unique-ID allocation, and “compare-and-set.”
Cons
- Requires coordination on every operation → highest latency.
- Unavailable during partitions (CAP).
- Cross-region linearizable writes pay a full inter-region RTT minimum.
3.2 Sequential consistency
Definition (Lamport). There is some total order of all operations that is consistent with each individual process’s program order — but it need not respect real-time order across processes. If process P1 finishes a write and then (in wall-clock terms) P2 reads, P2 is not guaranteed to see it, as long as some legal global order exists.
Difference from linearizability. Linearizability = sequential consistency + the real-time constraint. Under sequential consistency, the system may “reorder” operations from different clients as long as each client’s own operations keep their order.
Pros
- Cheaper than linearizability (no real-time anchor needed).
- Still gives a single global order — intuitive for many algorithms.
Cons
- Does not compose — two individually sequentially-consistent objects can combine into a non-sequentially-consistent system.
- Surprising staleness: a value you just wrote may not be visible to a “later” reader.
- Rare as an explicit production offering; mostly a teaching/theory waypoint and a memory-model concept.
3.3 Causal consistency & causal+
Definition. Operations that are causally related (connected by the happens-before relation — same process order, or a read that observed a write, transitively) are seen in the same order by all nodes. Concurrent operations (no causal link) may be seen in different orders by different nodes.
Happens-before primer (so this post stands alone): event a → b if they’re in the same process and a precedes b, or a is a send and b is its receive, or transitively. Vector clocks capture this exactly: each node keeps a vector of counters; comparing vectors tells you “before,” “after,” or “concurrent.”
Causal+ (convergent causal). Plain causal consistency permits replicas to diverge permanently on concurrent writes; causal+ adds convergence — all replicas that have seen the same set of writes hold the same value (via a deterministic conflict-resolution rule, e.g. CRDTs / last-writer-wins). This is what real “causal” systems (COPS, etc.) provide.
Why it’s special. Under the constraint of always being available (responding without cross-node coordination) and convergent, real-time causal consistency is the strongest model achievable (Mahajan, Alvisi, Dahlin). In other words: causal+ is the consistency ceiling for an always-on, partition-tolerant store. That single fact is worth memorizing — it’s the bridge between the consistency lattice and CAP.
Pros
- Available under partitions (no coordination on the write path).
- Preserves intuitive cause→effect (no “reply appears before the message it answers”).
- Low latency; local reads/writes.
Cons
- Concurrent updates need explicit conflict resolution (CRDTs / merge functions / LWW).
- Tracking causality metadata (dependencies/vector clocks) has overhead that can grow with the number of clients/keys.
- Weaker than linearizable — no global real-time order.
3.4 Session guarantees (the “client-centric” models)
These four (from the Bayou system) are weaker than causal but extremely practical — they make eventual consistency feel sane within a single client session:
- Read-your-writes: you always see your own prior writes.
- Monotonic reads: once you’ve seen a value, you never see an older one.
- Monotonic writes: your writes are applied in the order you issued them.
- Writes-follow-reads (session causality): a write you make after reading X is ordered after X everywhere.
Pros: cheap, dramatically reduce user-visible anomalies, implementable with sticky routing or version tokens. Cons: only per-session; two devices/sessions of the same user can still disagree unless you propagate version tokens.
3.5 Eventual consistency
Definition. If no new updates are made, eventually all replicas converge to the same value. Says nothing about when, or what you see in the meantime.
Pros
- Maximum availability and lowest latency (write locally, replicate in background).
- Survives partitions trivially.
- Scales horizontally without coordination.
Cons
- Anomalies galore: stale reads, reading older-then-newer-then-older without session guarantees, lost updates under naive last-writer-wins.
- Pushes conflict resolution onto the application unless CRDTs are used.
- “Eventually” can be a long time under load or partition.
3.6 Bounded staleness & PRAM
- Bounded staleness: reads are guaranteed to lag the latest committed write by at most Δ time or k versions. A pragmatic knob (Azure Cosmos DB exposes exactly this) — you trade a quantified amount of freshness for latency/availability.
- PRAM / processor consistency: writes from a single process are seen by all others in the order issued, but no ordering is guaranteed across processes. A theoretical waypoint between causal and eventual.
3.7 The transactional axis: serializability, strict serializability, snapshot isolation
Transactions group multiple operations on multiple objects into one atomic unit. Isolation levels say which concurrency anomalies are permitted.
- Serializability: the outcome of executing transactions concurrently equals some serial (one-at-a-time) order. The gold standard for correctness of concurrent transactions. No real-time requirement — the equivalent serial order can be any order.
- Strict serializability: serializability + linearizability — the serial order must also respect real-time order across transactions. This is what an application developer usually means by “strongly consistent transactions.” Spanner’s “external consistency” is strict serializability.
- Snapshot isolation (SI): each transaction reads from a consistent snapshot as of its start; writes succeed only if no concurrent committed transaction wrote the same item (first-committer-wins). Prevents dirty/non-repeatable reads and lost updates — but permits write skew.
Write-skew (the SI litmus test): two transactions each read an overlapping set, each checks an invariant (e.g. “at least one doctor on call”), each sees the invariant still holds, and each updates a different row. Both commit; the invariant is now violated. Serializability forbids this; SI allows it.
Isolation ladder (weakest→strongest): Read Uncommitted → Read Committed → Repeatable Read (≈ SI in many DBs) → Serializable. Naming trap: PostgreSQL “Repeatable Read” is actually Snapshot Isolation; PostgreSQL “Serializable” is Serializable Snapshot Isolation (SSI), which truly is serializable.
Pros/cons
- Serializable: maximal safety, no anomalies. Con: highest abort/lock cost, lowest concurrency.
- Snapshot isolation: great read concurrency (readers never block writers via MVCC). Con: write skew — silently wrong for certain invariants.
- Read committed: cheap, default in many DBs. Con: non-repeatable reads, lost updates.
3.8 The unified hierarchy & how to place any model
To place any model, ask three questions:
- Single-object or multi-object (transactional)? (Linearizability vs serializability axis.)
- Real-time order required? (Linearizable/strict-serializable = yes; sequential/serializable = no.)
- What ordering of concurrent ops? (Total order → strong; causal order → causal; none → eventual.)
+ real-time ─────────────────────────────► − real-time
single STRICT SERIALIZABLE SEQUENTIAL CONSISTENCY
object LINEARIZABILITY CAUSAL+ ► CAUSAL ► PRAM ► EVENTUAL
│
multi STRICT SERIALIZABLE ► SERIALIZABLE ► SNAPSHOT ISOLATION ► READ COMMITTED
object
The ceiling rule: if your store must stay available and partition-tolerant, causal+ is the strongest you can offer. If you require linearizability/strict-serializability, you must sacrifice availability during partitions (CAP).
4. The three theorems — FLP, CAP, PACELC
FLP impossibility (Fischer, Lynch, Paterson, 1985)
Statement. In an asynchronous system where even one process may crash, no deterministic consensus algorithm can guarantee it will always terminate (reach agreement). Agreement and validity can always be preserved; it is termination (liveness) that cannot be guaranteed.
Why. In pure asynchrony you can’t tell a crashed node from a slow one. An adversarial scheduler can always keep the system in a “bivalent” (undecided) state by delaying the one message that would tip the decision, forever.
How real systems escape it (you must know these three):
- Partial synchrony / timeouts: assume the network eventually behaves; use timeouts to make progress (Paxos, Raft). Safety always holds; liveness resumes when the network stabilizes.
- Randomization: flip coins to break symmetry; terminates with probability 1 (Ben-Or, and randomized BFT).
- Failure detectors (Chandra–Toueg): a
◊S(eventually-strong) failure detector is the weakest oracle sufficient to solve consensus.
Interview-critical nuance: FLP does not say consensus is impossible in practice. It says you can’t have guaranteed termination + determinism + full asynchrony + fault tolerance simultaneously. Real protocols trade away guaranteed termination (they’re live only under partial synchrony) and are correct in practice.
CAP theorem
Statement (Gilbert & Lynch’s formalization of Brewer). A distributed data store cannot simultaneously provide Consistency (linearizability), Availability (every request to a non-failing node gets a non-error response), and Partition tolerance (the system keeps working despite arbitrary message loss between nodes). When a partition occurs, you must give up C or A.
The two biggest misreadings (interviewers love these):
- It is not “pick 2 of 3.” Partitions are a fact of nature, not a design option — you don’t get to “not choose P.” The real statement is: when a partition happens, choose C or A. When there’s no partition, you can have both.
- CAP’s “C” = linearizability, not ACID’s “C.” A system can be perfectly ACID and still be an AP system in CAP terms if it gives up linearizability under partition.
CP vs AP examples: CP — ZooKeeper, etcd, Spanner, HBase (refuse/stall on the minority side of a partition to stay correct). AP — Dynamo, Cassandra, Riak (keep serving, reconcile later).
PACELC (Abadi, 2012)
CAP only describes behavior during partitions — but partitions are rare, and the more pervasive tradeoff is the everyday one. PACELC: if (P)artition, choose (A)vailability or (C)onsistency; (E)lse, choose (L)atency or (C)onsistency.
- Dynamo / Cassandra: PA/EL — sacrifice consistency for availability under partition, and for latency normally.
- Spanner: PC/EC — consistent under partition (minority unavailable) and consistent normally (paying latency via commit-wait).
- PNUTS (Yahoo): PC/EL — interesting hybrid.
Why PACELC is the better lens: it captures that even with a perfect network, every strongly-consistent read/write pays coordination latency. That “EL vs EC” choice dominates real architecture decisions far more often than partitions do.
5. The consensus problem, formally
A set of processes each propose a value; they must decide on one. A correct consensus protocol satisfies:
- Agreement: no two correct processes decide different values.
- Validity (non-triviality): the decided value was proposed by some process (you can’t decide a made-up constant).
- Termination: every correct process eventually decides. (This is the property FLP says you can’t always guarantee in asynchrony.)
- Integrity: each process decides at most once.
Why consensus is the master primitive. With it you can build: leader election, distributed locks, atomic broadcast (total-order delivery), group/membership management, distributed transactions (atomic commit), and linearizable replicated state machines. Almost every “hard” coordination problem reduces to consensus.
6. Quorums & state-machine replication
Quorums
A quorum is any set of nodes large enough that any two quorums intersect. With majority quorums in N = 2f+1 nodes, any two majorities share ≥1 node — so a read quorum always overlaps the most recent write quorum, guaranteeing the read sees the latest write. Quorum intersection is the single mathematical idea under Paxos, Raft, and Dynamo-style stores.
- Majority quorum:
⌊N/2⌋ + 1. Toleratesf = ⌊(N-1)/2⌋crashes. - Dynamo tunable quorum: choose read size
R, write sizeWoverNreplicas.W + R > N⇒ read overlaps latest write (read-your-writes).W > N/2⇒ no two conflicting writes both commit. Sloppy quorums + hinted handoff relax this for availability (and reintroduce conflicts). - Flexible Paxos: Phase-1 (leader-election) quorum
Q1and Phase-2 (replication) quorumQ2need only satisfyQ1 + Q2 > N. So you can shrink the common-case replication quorum (e.g.Q2 = 2of 3) at the cost of a larger recovery quorum.
State-machine replication (SMR)
The canonical way to turn consensus into a service: (1) model your service as a deterministic state machine (same input sequence ⇒ same output/state); (2) use consensus to agree on a totally ordered log of commands; (3) every replica applies the log in order. Result: all replicas are identical, and the service is linearizable. This is how etcd, ZooKeeper, CockroachDB ranges, and Spanner tablets work.
Pros: linearizable, fault-tolerant, conceptually clean. Cons: every write goes through the log (coordination latency); the state machine must be deterministic (no wall-clock reads, no unsynchronized randomness, no map-iteration-order dependence).
7. Paxos (basic and Multi-Paxos)
Roles: proposers, acceptors, learners (a node can play several). Decisions need a majority of acceptors.
Basic (single-decree) Paxos — two phases:
- Phase 1 (Prepare/Promise): a proposer picks a ballot number
n(globally unique, monotonically increasing) and sendsPrepare(n). An acceptor that hasn’t promised a higher ballot repliesPromise(n)and reports any value it has already accepted (with its ballot). - Phase 2 (Accept/Accepted): if the proposer got promises from a majority, it sends
Accept(n, v)wherevis the highest-ballot already-accepted value reported (or its own proposal if none). Acceptors that haven’t promised higher accept it. Once a majority accepts, the value is chosen.
Safety invariant (the heart of Paxos): if a value v was chosen at ballot n, then any proposer with a higher ballot will re-propose v (because Phase 1 forces it to adopt the highest already-accepted value it learns). This guarantees agreement even with concurrent proposers and crashes.
Liveness: not guaranteed. Two proposers can keep preempting each other’s ballots forever (“dueling proposers”). FLP in the flesh. Fixed in practice by electing a distinguished proposer (leader) with randomized backoff.
Multi-Paxos: to agree on a log of values (not one), run an instance per log slot. Elect a stable leader and skip Phase 1 for subsequent slots (Phase 1 is really “claim leadership”; once you hold it, just do Phase 2). Steady state = 1 RTT per command to a majority.
Pros: the original, provably-safe consensus; majority fault-tolerant; safe under asynchrony (only liveness needs synchrony). Cons: famously hard to understand and to implement correctly; the spec leaves out log management, membership changes, and snapshots — every real system reinvents those. This pain is exactly why Raft exists.
8. Raft
Designed for understandability while being equivalent in power to Multi-Paxos. It decomposes consensus into three subproblems:
- Leader election. Time is divided into terms. Each node is follower/candidate/leader. Followers expect heartbeats; on a randomized election timeout a follower becomes a candidate, increments the term, and requests votes. A candidate winning a majority becomes leader. Randomized timeouts make split votes rare.
- Log replication. Clients send commands to the leader; the leader appends to its log and sends
AppendEntriesto followers. An entry is committed once replicated on a majority; the leader then applies it and tells followers the commit index. All commands flow through the leader (strong-leader model). - Safety. Two rules guarantee correctness: (a) Election restriction — a candidate can only win if its log is at least as up-to-date as the voter’s, so a leader never lacks a committed entry. (b) Log Matching — if two logs have an entry with the same index and term, all prior entries are identical (enforced by a consistency check in
AppendEntries). A leader never overwrites or deletes its own entries; it only appends. - Membership changes via joint consensus (overlapping old+new configs) to avoid two disjoint majorities during reconfiguration. Log compaction via snapshots.
Pros: understandable, well-specified end-to-end (election + log + membership + snapshots), tons of correct open-source implementations (etcd, Consul, TiKV, CockroachDB). Cons: strong-leader funnels all writes through one node (the leader is a throughput bottleneck and a WAN-latency choke point); leader changes cause brief unavailability.
9. Zab, Viewstamped Replication, and the leader-based family
- Zab (ZooKeeper Atomic Broadcast): primary-backup atomic broadcast underlying ZooKeeper. Like Multi-Paxos but explicitly preserves primary order (a single primary’s writes are delivered in the order issued) and has a clean recovery/sync phase on leader change. Powers ZooKeeper’s linearizable writes and FIFO client order.
- Viewstamped Replication (VR): a leader/primary-based replication protocol (Oki & Liskov) that predates Paxos in publication spirit and is arguably easier to follow; uses views (like Raft terms) and view-change on primary failure.
- Common shape: elect a leader/primary → leader sequences operations → replicate to a majority → recover via a view/term change on failure. Raft, Zab, VR, and Multi-Paxos are variations on this one theme. Knowing they’re the same idea with different vocabularies is a strong-signal interview point.
10. Leaderless & WAN-optimized: EPaxos and Flexible Paxos
- EPaxos (Egalitarian Paxos): leaderless — any replica can commit any command. It tracks per-command dependencies; non-interfering commands commit in 1 RTT (fast path) without a leader bottleneck, only ordering commands that actually conflict. Excellent for geo-distributed deployments (clients talk to their nearest replica) and removes the single-leader hot spot. Con: dependency tracking and conflict handling are complex; conflicting-command throughput degrades.
- Flexible Paxos: the observation that Paxos only needs Phase-1 and Phase-2 quorums to intersect (
Q1 + Q2 > N), not each be a majority. Lets you tune for the common case (smallQ2for fast replication) while keeping safety via a correspondingly largerQ1at recovery. Pro: lower steady-state latency/quorum size. Con: larger/ slower leader-election quorum; more careful reasoning.
11. Byzantine fault tolerance — PBFT and the blockchain bridge
When nodes can be malicious (lie, send conflicting messages, collude), majority quorums are insufficient — a lying node can tell different things to different peers.
- Bound: tolerate
fByzantine faults needsN = 3f + 1nodes; quorums are2f + 1. (Intuition: two2f+1quorums intersect in ≥f+1nodes, guaranteeing ≥1 honest node in common.) - PBFT (Castro & Liskov, 1999): practical BFT under partial synchrony. A primary orders requests; three message phases — pre-prepare → prepare → commit — let honest replicas agree despite up to
fliars; a view change replaces a faulty primary. ~2 RTT in the common case but O(n²) message complexity. - Blockchain bridge: Nakamoto consensus (Bitcoin) solves BFT in an open, permissionless, Sybil-prone setting using Proof-of-Work + the longest-chain rule, trading instant finality for probabilistic finality (wait for confirmations). Proof-of-Stake and classical-BFT-derived protocols (Tendermint/BFT, HotStuff — note HotStuff achieves linear message complexity and underpins modern systems) provide faster/finalized agreement among a known validator set.
Pros of BFT: survives malicious/compromised nodes; required for trustless / multi-org systems. Cons: more nodes (3f+1), more rounds, higher message complexity, and usually unnecessary inside a single trusted operator’s datacenter (crash-tolerance suffices).
12. Atomic broadcast ≡ consensus
Atomic broadcast (total-order broadcast): deliver messages to all nodes (a) reliably and (b) in the same total order everywhere. This is equivalent to consensus — each can implement the other:
- Consensus → atomic broadcast: run one consensus instance per sequence number to agree on “what’s the i-th message.”
- Atomic broadcast → consensus: broadcast your proposal; everyone decides the first delivered value.
This equivalence is why a replicated log is the universal abstraction: agreeing on a value, ordering a stream of messages, and replicating a state machine are the same problem. Kafka’s ordered partition log, Raft’s command log, and a consensus sequence are three faces of one idea.
13. Atomic commit: 2PC, 3PC, and making them fault-tolerant
Atomic commit ≠ consensus. Consensus needs a proposed value to be chosen; atomic commit needs all participants to agree to commit, and any single “abort” must abort the whole transaction. That asymmetry makes it more fragile.
- Two-Phase Commit (2PC): a coordinator sends PREPARE; each participant votes yes (and durably prepares, holding locks) or no; if all vote yes the coordinator sends COMMIT, else ABORT. Blocking flaw: if the coordinator crashes after participants prepared but before they hear the decision, participants are stuck holding locks indefinitely — they can’t unilaterally commit (others may have voted no) or abort (others may have committed). 2PC is not partition/coordinator-fault tolerant on its own.
- Three-Phase Commit (3PC): inserts a pre-commit phase so a recovering node can infer the outcome, making it non-blocking under crash-stop + synchrony. But it fails under network partitions (two sides can reach different decisions) and assumes bounded delays, so it’s rarely used in practice.
- The modern fix (Spanner pattern): make 2PC fault-tolerant by having each participant and the coordinator be a Paxos/Raft group. Now “the coordinator” can’t simply die — its decision is itself replicated by consensus. This composes atomic commit (across shards) with consensus (within each shard) to get fault-tolerant distributed transactions. This layering — 2PC across shards, consensus within each shard — is one of the most important architectural patterns to be able to draw.
14. Read paths: leases, ReadIndex, follower reads
In a leader-based system, a naïve read from the leader’s local state can still be stale if the node was just deposed (a new leader committed writes it doesn’t know about). Three correct read strategies:
- Leader lease: the leader holds a time-bounded lease (granted by a quorum); within it, no one else can be leader, so it can serve linearizable reads locally with no round trip. Risk: depends on bounded clock drift — a slow clock can extend a stale lease. Mitigated by conservative lease lengths.
- ReadIndex: before serving a read, the leader records its current commit index and confirms it’s still leader by exchanging one heartbeat round with a majority; once confirmed, it serves the read after applying up to that index. Linearizable, costs one RTT, no clock dependency.
- Follower / stale reads: serve reads from followers for scale, accepting bounded staleness (often with a timestamp/”read as of” bound). Trades freshness for read throughput and lower latency.
Pros/cons summary: lease reads = fastest but clock-trusting; ReadIndex = safe + one RTT; follower reads = cheapest/scalable but stale.
15. External consistency & TrueTime (Spanner)
The problem: strict serializability across the globe needs timestamps whose order matches real order — but clocks drift. TrueTime gives each server an interval TT.now() = [earliest, latest] guaranteed to contain true absolute time, using GPS + atomic clocks in every datacenter to bound uncertainty ε (historically single-digit milliseconds).
Commit-wait: when a transaction commits at timestamp s, Spanner waits until TT.now().earliest > s before releasing/acking — i.e. it waits out the uncertainty ε so that no other transaction can be assigned an earlier-but-actually-later timestamp. This guarantees external consistency = strict serializability: if T1 commits before T2 starts (in real time), T1’s timestamp < T2’s.
Pros: globally strict-serializable transactions — the strongest model, at planetary scale. Cons: requires specialized hardware (GPS/atomic clocks); every commit pays the ε commit-wait latency; tighter ε needs better clocks. It’s a beautiful illustration of the core theme: Spanner spends money on hardware to shrink ε, because shrinking time-uncertainty is the same as cheapening consistency.
16. How real systems compose all of this
- etcd / Consul / ZooKeeper: Raft/Zab SMR → linearizable metadata store for config, locks, leader election, service discovery. CP under CAP.
- Kafka: partition = replicated ordered log; leader + ISR (in-sync replicas) acks;
acks=all+min.insync.replicas≈ quorum durability. Per-partition total order = atomic broadcast. - Cassandra / DynamoDB (classic Dynamo): leaderless, tunable
R/W/Nquorums, AP/EL, conflict resolution via LWW or app logic; optional stronger reads (QUORUM, or DynamoDB’s “strongly consistent reads”). - Spanner / CockroachDB / YugabyteDB: Raft/Paxos per shard (SMR) + 2PC across shards + MVCC; Spanner adds TrueTime for strict serializability; Cockroach uses hybrid-logical clocks + uncertainty intervals instead of special hardware.
- The recurring pattern: consensus within a shard for linearizable replication; 2PC across shards for atomicity; a clock/timestamp scheme for cross-shard ordering. If you can draw that, you can design most modern distributed databases.
17. Decision guide — choosing a model and protocol under pressure
Knowing every option isn’t the same as choosing fast. This is the two-step decision: pick the consistency contract the use case demands, then pick the mechanism that delivers it. Memorize the two trees.
Step 1 — which consistency model?
Does correctness break if clients see different orders or stale data?
├─ No — staleness is harmless ───────────────► EVENTUAL
│ └─ Users notice their OWN staleness? ──► + SESSION GUARANTEES (read-your-writes)
│ └─ Cause→effect must hold across users? ──► CAUSAL+
├─ Yes — single-object invariant
│ (lock, CAS, unique ID, "buy the last item") ─────► LINEARIZABILITY
└─ Yes — multi-object invariant inside a transaction ──► SERIALIZABLE
├─ Must also match real-time order for outside observers? ► STRICT SERIALIZABLE
└─ Read-heavy, invariants don't span disjoint rows? ─────► SNAPSHOT ISOLATION
Reach-for / avoid rules
- Linearizability — for: locks, leader election, unique-ID/sequence allocation, “last item in stock.” Avoid when: you must stay available under partition, or latency is critical and staleness is harmless.
- Strict serializability — for: financial ledgers, anything whose ordering an external user can observe. Avoid when: you can’t pay coordination + commit-wait latency.
- Snapshot isolation — for: read-heavy, MVCC stores where readers shouldn’t block writers. Avoid when: an invariant spans rows neither transaction writes (write skew).
- Causal+ — for: collaborative/social/multi-region apps that must stay available. Avoid when: you need a global real-time order.
- Eventual + session guarantees — for: feeds, carts, high-availability UX. Avoid when: money or uniqueness is at stake.
Step 2 — which mechanism & topology?
Are all nodes operated by one trusted party?
├─ No (mutually distrusting / open membership) ─► BFT: PBFT or HotStuff, N = 3f+1
└─ Yes
├─ Single region, want simplest proven path ─► RAFT (5 nodes → tolerate 2)
├─ Geo-distributed, single leader is the choke ─► EPAXOS / leaderless, or
│ per-region leader leases
├─ Need lowest steady-state quorum latency ─► FLEXIBLE PAXOS (shrink Q2)
├─ Atomicity across shards ─► 2PC across shards, each shard its own Raft group
└─ Global strict-serializable order ─► consensus per shard + TrueTime/HLC timestamps
Sizing & read-path quick rules
- Always odd N. 5 is the metadata-store default — survive one planned restart + one unplanned failure simultaneously.
- Read path: ReadIndex for safe linearizable reads (one heartbeat RTT), leader lease for the fastest (trusts bounded clock drift), follower reads for scale (bounded staleness).
- Write cost reality check: a cross-region strongly-consistent write ≈ one inter-region RTT (~70–150 ms) minimum — say this out loud whenever you propose linearizability across regions.
PART 4 — INTERVIEW ARSENAL
How to wield this in an interview. Three habits separate staff/principal answers from mid-level ones: (1) Name the exact model — never say “strongly consistent,” say “linearizable” or “strict serializable.” (2) Lead with the tradeoff — every answer ends in “…which costs you X.” (3) Quantify — “a cross-region linearizable write costs ~one inter-region RTT, ~70–150 ms.” Below, each question lists a model answer and the counter-question ladder — the follow-ups an interviewer uses to find your floor.
A. Fundamentals
Q1. What’s the difference between consistency and consensus? Answer: Consistency is the contract about what clients can observe (ordering/staleness of operations). Consensus is the mechanism by which nodes agree on a single value/log entry despite failures. You typically implement a strong consistency model using consensus (state-machine replication). Counter-ladder:
- “Can you have strong consistency without consensus?” → Yes, on a single node, or with synchronous primary-backup under strong assumptions; but for fault-tolerant linearizability you need consensus or an equivalent quorum protocol.
- “Can you have consensus without strong consistency?” → Consensus is the tool; you could run it and still expose weaker models to clients (e.g. serve stale follower reads).
- “Which one does CAP constrain?” → The consistency contract (linearizability) vs availability.
Q2. Define linearizability precisely. Answer: Every operation appears to take effect atomically at some instant between its invocation and response, and the resulting total order respects real-time order (if A completes before B starts, A is ordered before B). Equivalent to “behaves like a single copy with instantaneous operations.” Counter-ladder:
- “How is it different from serializability?” → Linearizability is single-object + real-time; serializability is multi-object transactions with no real-time requirement. (See Q5.)
- “Does linearizability compose?” → Yes — it’s a local property; per-object linearizability ⇒ system linearizability. Sequential consistency does not compose.
- “How would you test a system for linearizability?” → Record a concurrent history and check for a valid linearization (e.g. Jepsen/Knossos / the Wing-Gong algorithm).
B. Consistency models
Q3. Walk me from strongest to weakest consistency and what each buys/costs. Answer: Strict serializable → linearizable → sequential → causal+ → causal → session guarantees → bounded staleness → eventual. Each step down removes a coordination requirement, buying latency/availability and giving up some ordering/freshness guarantee. (Use the table from the cheatsheet.) Counter-ladder:
- “What’s the strongest model you can offer while staying available under partitions?” → Causal+ (Mahajan et al.). This is the key fact.
- “Where does snapshot isolation sit?” → On the transactional axis, below serializable; it permits write skew.
- “Give a real anomaly eventual consistency allows.” → Lost update via last-writer-wins; or reading new-then-old without monotonic-reads.
Q4. What are the session guarantees and why do they matter? Answer: Read-your-writes, monotonic reads, monotonic writes, writes-follow-reads. They make an eventually-consistent store feel consistent within one client session — cheaply, via sticky routing or version tokens — eliminating the most jarring user-visible anomalies. Counter-ladder:
- “User has two devices — do session guarantees help?” → Not across sessions unless you propagate version tokens between them.
- “How do you implement read-your-writes against replicas?” → Pin the session to a replica, or carry the last-write version and only read from replicas caught up to it.
Q5. Linearizability vs serializability — can you have one without the other? (the canonical question) Answer: They’re orthogonal. Linearizable but not serializable: a single linearizable register supports no multi-object transactions at all. Serializable but not linearizable: a system can produce some valid serial order that ignores real-time (a transaction that committed earlier in wall-clock could be ordered later). Both = strict serializability, e.g. Spanner. Counter-ladder:
- “Is snapshot isolation serializable?” → No — write skew. (Be ready to describe the on-call-doctors example.)
- “What does PostgreSQL ‘Serializable’ actually use?” → Serializable Snapshot Isolation (SSI), which is serializable; its “Repeatable Read” is plain SI.
- “What does Spanner provide and how?” → External consistency = strict serializability, via TrueTime + commit-wait.
C. The theorems
Q6. State the CAP theorem and the two most common misconceptions. Answer: During a network partition you must choose linearizable consistency or availability. Misconception 1: it’s “2 of 3” — wrong; partitions aren’t optional, so the real choice is C-vs-A during a partition, and you get both when there’s no partition. Misconception 2: CAP’s C is ACID’s C — wrong; it’s linearizability. Counter-ladder:
- “Then why does everyone say ‘pick 2’?” → Marketing shorthand; the formal statement is about partition-time behavior.
- “Is a CP system unavailable all the time?” → No — only the minority side during a partition; otherwise fully available.
- “Where does PACELC improve on CAP?” → It adds the else clause: even without partitions, you trade latency vs consistency. (Spanner = PC/EC; Dynamo = PA/EL.)
Q7. Explain FLP and how real systems get around it. Answer: In a fully asynchronous system with even one possible crash, no deterministic protocol can guarantee termination of consensus — safety holds, but an adversarial scheduler can keep it undecided forever. Real systems escape via partial synchrony + timeouts (Paxos/Raft), randomization, or failure detectors (◊S). They stay safe always, and become live once the network behaves. Counter-ladder:
- “So is Raft ‘wrong’ per FLP?” → No — Raft never violates safety; it can fail to make progress during pathological asynchrony (e.g. perpetual leader churn), exactly as FLP predicts. It just isn’t pathological in practice.
- “What’s the weakest failure detector that solves consensus?” →
◊S(Chandra–Toueg–Hadzilacos).
D. Consensus algorithms
Q8. Explain Paxos. Why is it safe but not live? Answer: Two phases — Prepare/Promise then Accept/Accepted — over majority quorums; the safety invariant is that any higher ballot re-proposes the highest already-accepted value, so once a value is chosen it can never change ⇒ agreement. Liveness isn’t guaranteed because two proposers can keep preempting each other’s ballots (dueling proposers) — exactly FLP. Fixed by electing a stable leader. Counter-ladder:
- “What does Multi-Paxos add?” → A stable leader that skips Phase 1 for subsequent log slots → 1 RTT steady state.
- “Why is Paxos considered hard?” → The paper omits log management, membership changes, snapshots — implementers must invent them, and the corner cases are subtle.
- “How does quorum intersection guarantee safety?” → Any two majorities overlap, so a new leader’s Phase-1 majority always includes a node that saw the latest chosen value.
Q9. How does Raft differ from Paxos, and what are its safety rules? Answer: Same power, designed for understandability; strong-leader model with terms. Safety rests on election restriction (only an up-to-date log can win election ⇒ leader never misses a committed entry) and log matching (same index+term ⇒ identical prefixes; leader only appends, never overwrites its own log). Membership changes via joint consensus. Counter-ladder:
- “What happens during a leader failure?” → Followers time out (randomized), a candidate with an up-to-date log wins a majority vote in a new term; brief write unavailability.
- “Can two leaders exist at once?” → Possibly briefly in different terms, but only one can get a majority to commit; the higher term wins, and the election restriction prevents committed-entry loss.
- “What’s the throughput weakness?” → All writes funnel through one leader; it’s a CPU/network and WAN-latency bottleneck.
Q10. When would you choose leaderless consensus (EPaxos) over Raft? Answer: Geo-distributed deployments where a single leader creates a WAN bottleneck and clients want to commit at their nearest replica; EPaxos commits non-conflicting commands in 1 RTT with no fixed leader. Trade-off: complex dependency tracking and worse behavior under high conflict. Counter-ladder:
- “What’s the cost when commands conflict?” → A slow path (extra round) and dependency resolution; throughput drops as conflict rate rises.
- “What is Flexible Paxos and how does it help?” → Phase-1 and Phase-2 quorums need only intersect (
Q1+Q2>N), so you can shrink the common-case replication quorum for lower latency at the cost of a larger recovery quorum.
E. Byzantine & advanced
Q11. When do you need BFT, and what’s the node/quorum math? Answer: When nodes can be malicious/compromised or you span mutually-distrusting organizations (no single trusted operator). Need N = 3f+1 to tolerate f Byzantine faults; quorums of 2f+1. PBFT does it in 3 phases under partial synchrony; modern protocols (HotStuff) cut message complexity to linear. Counter-ladder:
- “Why 3f+1, not 2f+1?” → A liar can equivocate; you need any two quorums to intersect in ≥
f+1nodes so they share at least one honest node — that forces2f+1quorums and3f+1total. - “How is blockchain consensus different from PBFT?” → Permissionless + Sybil-resistant via PoW/PoS; longest-chain gives probabilistic finality vs PBFT’s deterministic finality among a known set.
- “Do you need BFT inside one company’s datacenter?” → Usually no — crash-tolerant consensus (Raft) suffices; BFT’s overhead isn’t justified without an actual adversarial-node threat.
F. Production / design scenarios
Q12. Design a distributed lock / leader-election service. What consistency do you need and how do you build it? Answer: You need linearizability (two clients must never both believe they hold the lock). Build on an SMR consensus store (etcd/ZooKeeper/Raft): represent the lock as a key with a lease/session; acquisition is a linearizable compare-and-set; auto-release on session expiry. Use fencing tokens (monotonic numbers) so a paused-then-resumed old lock holder can’t corrupt state — the resource rejects stale tokens. Counter-ladder:
- “A client acquires the lock, GC-pauses for 30s, lease expires, another client acquires it — what stops corruption?” → Fencing tokens: each acquisition returns an increasing token; the protected resource only accepts the highest token seen. (This is the famous “lock + pause” failure; having the fencing-token answer ready is a strong signal.)
- “Why not just use a TTL in Redis?” → A single-node Redis lock isn’t linearizable under failover; Redlock is contested. For correctness use a consensus-backed store + fencing.
Q13. Your read replicas serve stale data and users complain. Walk through fixes by cost. Answer: (1) Session guarantees — read-your-writes via sticky routing or version tokens (cheap, fixes most complaints). (2) ReadIndex / lease reads from the leader for true linearizable reads (one RTT or clock-based). (3) Bounded staleness with an explicit Δ to cap how stale. (4) Full linearizable quorum reads if correctness demands it (most expensive). Counter-ladder:
- “User edits a profile then refreshes and sees the old value — minimum fix?” → Read-your-writes session guarantee.
- “Now they need a globally consistent dashboard total — what changes?” → That’s a linearizable/consistent-snapshot read across shards; needs ReadIndex or a consistent MVCC snapshot timestamp.
Q14. Design cross-shard money transfer (debit A, credit B on different shards). Answer: Need atomicity across shards (no lost/created money) + isolation. Use 2PC across the two shards, where each shard is itself a Raft group so the coordinator/participants are fault-tolerant; add MVCC for isolation and a timestamp scheme for ordering. This is the Spanner pattern: consensus within shards, 2PC across them. Counter-ladder:
- “Plain 2PC — what fails?” → Coordinator crash after PREPARE blocks participants holding locks; not fault-tolerant. Fix: replicate the coordinator via consensus.
- “How do you get global ordering of transfers?” → Commit timestamps from a clock scheme: TrueTime + commit-wait (Spanner) or HLC + uncertainty intervals (Cockroach).
- “Snapshot isolation enough here?” → For simple transfers often yes, but watch for write-skew-prone invariants; use serializable if the invariant spans rows neither transaction writes.
Q15. How many nodes for a metadata store, and why 5 rather than 3? Answer: Odd numbers (majority quorum). 3 nodes tolerate 1 failure; 5 tolerate 2 — important because during a planned restart/upgrade of one node you can still survive an unplanned failure of another. 5 is the common production sweet spot for critical metadata; beyond that, write latency (majority size) grows without proportional benefit. Counter-ladder:
- “Why odd numbers?” → Even N gives no fault-tolerance benefit over N−1 but raises quorum size (4 still tolerates only 1, like 3).
- “Cost of going to 7?” → Larger quorum ⇒ higher write latency and more replication traffic; only worth it for very high fault-tolerance needs.
G. Worked drill — driving a design end-to-end
This is the muscle the Q&A above doesn’t train: starting from a blank whiteboard and driving. Watch how each decision names a model, justifies it, and states its cost. Talk like this.
Prompt: “Design a coordination service like etcd / ZooKeeper — it stores cluster config and a service registry, and provides distributed locks and leader election. It must stay correct when nodes fail.”
1 — Clarify requirements (don’t skip this). “Correctness over availability — this is the brain of the cluster; a wrong answer is worse than no answer. Small data (config/registry), low write volume, very high read volume, reads must be fresh, single trusted operator, single region to start. So I’m building a CP system in CAP terms.”
2 — Pick the consistency contract. “Locks and leader election are single-object invariants — two clients must never both hold the lock — so I need linearizability. From the Step-1 tree, single-object invariant → linearizable. I’ll deliver it with state-machine replication over a consensus log.”
3 — Topology & quorum. “Consensus among a trusted set, single region, and I want a complete, proven implementation → Raft. I’ll run 5 nodes, majority quorum of 3, tolerating 2 failures — chosen so I can take one node down for upgrade and still survive an unplanned failure. Writes commit once replicated to a majority: ~1 RTT in-datacenter, sub-millisecond.”
4 — Write path. “Clients send writes to the Raft leader; it appends to the log, replicates via AppendEntries, commits on majority ack, then applies to the state machine and acks the client. All writes funnel through the leader — that’s the throughput ceiling, acceptable here because write volume is low.”
5 — Read path (the part people get wrong). “Naïve leader-local reads can be stale if the leader was just deposed. Since reads must be fresh and are the dominant load, I’ll use ReadIndex: the leader confirms it’s still leader with one heartbeat round to a majority, then serves the read — linearizable at ~1 RTT. If that read RTT becomes a bottleneck, I add leader leases to read locally with no round trip, accepting a bounded-clock-drift assumption. Followers can serve bounded-staleness reads for clients that opt in.”
6 — Locks & the trap. “A lock is a linearizable key acquired by compare-and-set, tied to a client session/lease; the lock auto-releases if the session’s heartbeats stop. But here’s the classic failure: a client acquires the lock, GC-pauses for 30s, its lease expires, another client acquires it — now two clients think they hold it. The fix is fencing tokens: each acquisition returns a monotonically increasing token, and the protected resource rejects any token lower than the highest it has seen. The paused client’s stale token bounces.”
7 — Failure handling. “Leader dies → followers time out on randomized intervals, a candidate with an up-to-date log wins a new term, brief (sub-second) write unavailability, reads from leases briefly pause. Network partition → only the majority side elects a leader and serves writes; the minority side refuses (CP). Membership changes use joint consensus so we never split into two disjoint majorities. Snapshots compact the log.”
8 — State the tradeoffs out loud (this is the senior signal). “I chose correctness over availability: during a partition the minority is unavailable by design. The single leader caps write throughput — fine for config-scale, wrong for a high-write store (there I’d shard with a Raft group per shard). Linearizable reads cost a heartbeat RTT unless I trade it for clock-trusting leases. If this ever spanned regions, every strongly-consistent write would cost an inter-region RTT (~70–150 ms), at which point I’d push more reads to bounded-staleness followers or reconsider per-region leadership.”
That progression — requirements → consistency contract → topology → write path → read path → failure modes → explicit tradeoffs — is the template for any design prompt in this space.
H. Consolidated gotchas & traps (rapid fire)
- “Strongly consistent” is ambiguous — always name linearizable vs strict-serializable.
- CAP’s C ≠ ACID’s C.
- P is not a choice.
- SI ≠ serializable (write skew).
- 2PC is not consensus and blocks on coordinator failure.
- Basic Paxos has no liveness guarantee.
- Leader-local reads can be stale without lease/ReadIndex.
- Majority quorum tolerates crashes, not lies (need 3f+1 for Byzantine).
- Linearizability composes; sequential consistency doesn’t.
- Eventual + LWW = silent data loss.
- Fencing tokens are the answer to the lock-plus-pause problem.
- Causal+ is the consistency ceiling for always-available stores.
I. Pros/cons master tables
Consistency models
| Model | Pros | Cons |
|---|---|---|
| Strict serializable | Strongest; no anomalies; intuitive | Highest latency; unavailable under partition; often needs special clocks |
| Linearizable | Single-copy illusion; composes; enables locks/CAS | Coordination per op; CAP-unavailable under partition |
| Sequential | Cheaper; single global order | No real-time; doesn’t compose; rare in practice |
| Serializable (txn) | Correct concurrency; no anomalies | High abort/lock cost; lower concurrency |
| Snapshot isolation | Readers don’t block writers (MVCC); fast | Write skew → silent invariant violations |
| Causal+ | Available under partition; preserves cause→effect; converges | Conflict resolution needed; metadata overhead |
| Session guarantees | Cheap; kills common anomalies per session | Only per-session; cross-device needs tokens |
| Bounded staleness | Quantified freshness knob | Still stale within bound |
| Eventual | Max availability/throughput; partition-proof | Many anomalies; app-side conflict handling |
Consensus / commit mechanisms
| Mechanism | Pros | Cons |
|---|---|---|
| Basic Paxos | Provably safe; majority-tolerant | No liveness guarantee; hard to implement; under-specified |
| Multi-Paxos | 1-RTT steady state | Same complexity; leader bottleneck |
| Raft | Understandable; complete spec; many impls | Strong-leader bottleneck; election blips |
| Zab/VR | Battle-tested; clear primary order | Leader-based limits; recovery complexity |
| EPaxos | Leaderless; WAN-friendly; 1 RTT no-conflict | Complex deps; degrades under conflict |
| Flexible Paxos | Tunable low-latency quorums | Larger recovery quorum; careful reasoning |
| PBFT/BFT | Tolerates malicious nodes | 3f+1 nodes; O(n²) msgs; usually overkill in-DC |
| 2PC | Simple atomic commit across shards | Blocking on coordinator failure; not fault-tolerant alone |
| 3PC | Non-blocking under crash-stop+synchrony | Fails under partitions; rarely used |
Go deeper (primary sources)
- Lamport, “Time, Clocks, and the Ordering of Events in a Distributed System” (1978) — happens-before, the foundation.
- Fischer, Lynch, Paterson, “Impossibility of Distributed Consensus with One Faulty Process” (1985) — FLP.
- Gilbert & Lynch, “Brewer’s Conjecture and the Feasibility of CAP” (2002); Abadi, “Consistency Tradeoffs in Modern Distributed Database System Design” (2012, PACELC).
- Lamport, “Paxos Made Simple” (2001); Ongaro & Ousterhout, “In Search of an Understandable Consensus Algorithm (Raft)” (2014).
- Castro & Liskov, “Practical Byzantine Fault Tolerance” (1999); Yin et al., “HotStuff” (2019).
- Corbett et al., “Spanner: Google’s Globally-Distributed Database” (2012) — TrueTime, external consistency.
- Mahajan, Alvisi, Dahlin, “Consistency, Availability, and Convergence” (2011) — causal+ as the ceiling.
- Herlihy & Wing, “Linearizability” (1990); Berenson et al., “A Critique of ANSI SQL Isolation Levels” (1995).
- Kleppmann, Designing Data-Intensive Applications (2017) — the modern synthesis (Ch. 5, 7, 9).
PART 5 — MAKE IT STICK (Teaching Tutorial)
Parts 1–4 are the map. This part is the driving lesson. The four ideas here never stick from definitions alone — consistency models, FLP/CAP, the serializability trio, and Paxos — so we build each from zero, with a picture, one concrete story, and a memory hook. Read it once, slowly. By the end you should be able to explain Paxos at a whiteboard and recite the serializability trio without thinking.
18.1 Start with the one picture: why this topic even exists
Everything in this chapter exists to solve one problem. Draw it once and you’ll never be lost:
┌─────────┐ write x=2 ┌──────────┐
Alice ─┤ CLIENT ├─────────────►│ Replica A│ x=2
└─────────┘ └────┬─────┘
│ (slow / lost / reordered)
┌─────────┐ read x ┌─────▼────┐
Bob ─┤ CLIENT ├────────────►│ Replica B│ x=1 ← Bob sees the OLD value
└─────────┘ └──────────┘
Two copies + delay = the copies disagree. A consistency model is the promise about what Bob is allowed to see. Consensus is the machinery that lets the replicas agree so the promise can be kept. That’s the whole chapter in one sentence. Keep this picture in your head; every concept below is an answer to “what does Bob see, and how do we control it?”
18.2 Build a consistency model from zero (don’t memorize the lattice — grow it)
Start from the easiest world and weaken it one step at a time. You’re not memorizing 8 models; you’re walking down a staircase, removing one guarantee per step.
ONE COPY (a single variable) ← trivially perfect. There's only one truth.
│ add replicas for fault-tolerance…
▼
LINEARIZABLE "pretend there's still one copy" ← every op looks instant, in real-time order
│ drop the real-time clock…
▼
SEQUENTIAL "one global order, but maybe reordered vs wall-clock"
│ stop ordering unrelated ops…
▼
CAUSAL+ "cause before effect; concurrent ops may differ; everyone converges"
│ drop cross-user causality, keep your own session sane…
▼
SESSION "read-your-writes, monotonic reads" (feels fine to one user)
│ drop even that…
▼
EVENTUAL "if writes stop, replicas eventually agree" (says nothing about *when*)
The mental trick: each step down buys availability/latency by selling a guarantee. Strong = lots of coordination = slow & fragile under partition. Weak = no coordination = fast & always-on. That single trade — guarantee ⇄ coordination cost — is the spine of the entire field. When someone says a model name, ask only: which guarantee did this step sell?
18.3 The serializability trio, finally — two axes, one grid
The reason “serializability / linearizability / strict serializability” never stick is that you’re trying to memorize three definitions when there are really just two yes/no questions:
Does real (wall-clock) time order matter?
NO YES
┌────────────────────┬────────────────────┐
ONE object → │ (sequential) │ LINEARIZABLE │
(single op) │ │ one item, on clock│
├────────────────────┼────────────────────┤
MANY objects → │ SERIALIZABLE │ STRICT │
(a transaction) │ many items, │ SERIALIZABLE │
│ no clock │ many items + clock│
└────────────────────┴────────────────────┘
- Linearizable = ONE item, ON the clock. (“one register that behaves like a single copy, instantly”)
- Serializable = MANY items (a transaction), NO clock — the result equals some one-at-a-time order, but that order can be reshuffled vs real time.
- Strict serializable = MANY items, ON the clock = Serializable and real-time-respecting = what Spanner gives.
Mnemonic: Linear = one Line, on a stopwatch. Serial = a Series, no stopwatch. Strict = the series, stopwatch back on.
Why “serializable ≠ linearizable” — one tiny story (write skew): Two doctors, rule = at least one on call. Both run, on a snapshot showing “2 on call”: “if ≥1 other is on call, I’ll go off.” Each takes a different doctor off. Both commit. Now zero on call.
Snapshot: {Alice=on, Bob=on}
Txn1 (Alice): sees Bob on → Alice OFF ✔commit
Txn2 (Bob): sees Alice on → Bob OFF ✔commit ← different rows, so no write-write conflict
Result: {Alice=off, Bob=off} invariant BROKEN
A serial order (one then the other) would have stopped the second. Snapshot isolation allowed it → snapshot isolation is NOT serializable. That one story is the difference; recall the doctors and you recall the whole concept.
18.4 FLP and CAP without the fog
FLP in one picture — you can’t tell “slow” from “dead”:
Node X sent a heartbeat… you hear nothing.
Is X ┌─ DEAD? → proceed without it
└─ just SLOW? → if you proceed, X may wake up and disagree
In a purely async network you CANNOT distinguish these.
That’s FLP: with even one possible crash and no timing assumptions, no protocol can guarantee it always reaches agreement (it can’t safely decide when to give up on a silent node). Real systems escape by adding timeouts (assume “eventually the network behaves”) — they stay correct always, and make progress once the network calms down.
CAP in one picture — the partition forces a choice:
╳ partition (the two sides can't talk)
┌───────┐ ╳ ┌───────┐
│ side A │ ╳╳╳ │ side B │
└───┬───┘ ╳ └───┬───┘
│ A write arrives only on side A:
│ • answer it (stay AVAILABLE) → side B is now stale → gave up CONSISTENCY
│ • refuse it (stay CONSISTENT) → that client is down → gave up AVAILABILITY
You can’t have both during a partition. (No partition? You get both — that’s why “2 of 3” is misleading.) PACELC just adds the everyday clause: Partition → A or C; Else → Latency or Consistency. Spanner = PC/EC (correct, pays latency). Dynamo = PA/EL (up & fast, gives up consistency).
18.5 Paxos, walked one message at a time (the part nobody can explain — until now)
Forget the paper. Paxos has 2 roles you care about (proposer, acceptor), 2 phases, and 1 sacred rule. We’ll run it live with 5 acceptors.
The 2 phases, in plain words:
- Prepare → Promise = “Can I lead at ballot n? Promise you’ll ignore anything older.” (And acceptors confess any value they already accepted.)
- Accept → Accepted = “Okay everyone, accept value v at ballot n.” Once a majority accepts, v is chosen — forever.
The 1 sacred rule (the entire magic): in Phase 2 a proposer may not pick its own favorite value if any acceptor confessed an already-accepted one — it must re-propose the highest-ballot value it heard. This is why a chosen value can never be un-chosen.
Live run — P1 proposes “X”:
P1 ballot=1 A1 A2 A3 A4 A5
Prepare(1) ───────────► ✔ ✔ ✔ (majority promises, none accepted yet)
Promise(1, none) ◄─────
Accept(1,"X") ────────► ✔X ✔X ✔X ← majority {A1,A2,A3} accept
"X" is now CHOSEN ✅
Now a competing proposer P2 shows up (the “dueling” case everyone fears):
P2 ballot=2 A1 A2 A3 A4 A5
Prepare(2) ──────────► ✔ ✔ ✔ (reaches {A3,A4,A5})
Promise(2, …) ◄─────── A3 confesses: "I accepted (1,'X')"
A4,A5: "nothing accepted"
┌─ SACRED RULE: P2 saw an accepted value ("X" at ballot 1).
│ It must NOT propose its own value. It MUST re-propose "X".
Accept(2,"X") ───────► ✔X ✔X ✔X ← value is STILL "X"
P2 wanted to push “Y” but the rule forced it to carry “X” forward. *That’s how Paxos guarantees agreement under concurrency and failure: any new leader’s Phase-1 majority necessarily overlaps the old chosen majority (quorum intersection), so it always hears the chosen value and is forced to keep it. Liveness isn’t guaranteed (P1 and P2 can keep leap-frogging ballots forever — that’s FLP) → in practice you elect one* stable leader (Multi-Paxos / Raft) and skip Phase 1.
Say this out loud until it’s automatic: “Promise to ignore the old, then accept the new — but you must carry forward anything a majority might already hold.”
18.6 Analogies that stick
- Consensus = a committee that must agree on one decision even if members step out (crash) — and once a majority writes it down, nobody can rewrite it.
- Linearizability = a single shared ledger everyone writes to in real time. Eventual consistency = everyone keeps a notebook and they reconcile later.
- Quorum intersection = two overlapping committees of a 5-person board: any two groups of 3 share at least one person, so the latest decision always has a witness in the next meeting.
- TrueTime/commit-wait = waiting for the slowest clock to catch up before announcing, so no two announcements can be mis-ordered.
18.7 Misconceptions → corrections
| You might think… | Actually… |
|---|---|
| “CAP means pick 2 of 3.” | Partitions aren’t optional. You only choose C or A, and only during a partition. |
| “Consistent = ACID consistent.” | CAP’s C = linearizability, not ACID’s invariant-preservation. Same letter, different idea. |
| “Serializable = linearizable.” | Different axes. Serializable can ignore real time; linearizable is single-object. Strict = both. |
| “Snapshot isolation is serializable.” | No — it allows write skew (the doctors). |
| “Paxos can deadlock and lose data.” | It never loses safety; it can only stall (liveness) — fixed by a stable leader. |
| “A majority quorum stops malicious nodes.” | No — majority stops crashes. Lies need 3f+1 (Byzantine). |
| “2PC is consensus.” | No — 2PC blocks on coordinator failure; it’s atomic commit, not fault-tolerant agreement. |
18.8 Explain it back (Feynman self-test)
If you can’t answer these out loud in your own words, reread the section in brackets:
- Why does adding a second replica create the whole problem? [18.1]
- What guarantee do you sell going from linearizable → causal? [18.2]
- Tell the doctors story and say which model it breaks. [18.3]
- Why can’t you tell a slow node from a dead one, and how do real systems cope? [18.4]
- Walk Paxos with a competing proposer — why can’t “X” become “Y”? [18.5]
- Why is 5 nodes the metadata sweet spot, not 4 or 6? [Part 3 §6 / cheatsheet]
18.9 Flashcards (cover the right column)
| Prompt | Answer |
|---|---|
| Linearizable in 4 words | One item, on clock |
| Serializable in 4 words | Many items, no clock |
| Strict serializable | Serializable + real-time (Spanner) |
| The write-skew story | Two doctors both go off-call |
| CAP real choice | C or A, only during a partition |
| PACELC else-clause | Else → Latency or Consistency |
| FLP in one line | Can’t tell slow from dead → can’t guarantee termination |
| Paxos sacred rule | Re-propose any value a majority might already hold |
| Crash vs Byzantine nodes | 2f+1 vs 3f+1 |
| Quorum intersection | Any two majorities share a node |
18.10 The 60-second recall (recite this)
“Replication + delay makes copies disagree. A consistency model is the promise about what a reader sees; consensus is how nodes agree to keep that promise. Strong models cost coordination (slow, partition-fragile); weak ones are fast and always-on — that’s the whole trade. Linearizable = one item on the clock; serializable = many items, any order; strict = both (Spanner). FLP says you can’t tell slow from dead, so async consensus can’t guarantee termination — timeouts fix it. CAP says during a partition you pick consistency or availability. Paxos: promise to ignore the old, accept the new, but always carry forward what a majority might already hold — which is why a chosen value is chosen forever. Majority quorums beat crashes; Byzantine needs 3f+1.”
If you can say that paragraph from memory, you own this chapter.
Frequently asked questions
What is the difference between consistency and consensus?
Consistency is the contract about which orderings and stale reads a client may observe. Consensus is the mechanism by which nodes agree on a single value or log entry despite failures. You typically implement strong consistency using consensus via state-machine replication.
Is linearizability the same as serializability?
No. Linearizability is a single-object guarantee that respects real-time order. Serializability is a multi-object transaction guarantee that transactions are equivalent to some serial order, with no real-time requirement. Strict serializability is both combined, which is what Google Spanner provides.
What does the CAP theorem actually say?
During a network partition you must choose between linearizable consistency and availability; you cannot have both. It is not a free choice of two of three, because partitions are a fact of nature. When there is no partition you can have both. CAP’s consistency means linearizability, not ACID consistency.
Why is basic Paxos safe but not live?
Paxos always preserves agreement because any higher ballot re-proposes the highest already-accepted value. It cannot guarantee termination because two proposers can keep preempting each other’s ballots forever — a direct consequence of the FLP impossibility result. A stable leader (Multi-Paxos) fixes liveness in practice.
How many nodes do crash and Byzantine consensus need?
To tolerate f crash failures you need 2f+1 nodes with majority quorums. To tolerate f Byzantine (arbitrary or malicious) failures you need 3f+1 nodes with quorums of 2f+1, because lying nodes can equivocate.
