A self-contained reference for keeping multiple copies of data correct, available, and fast. Replication is what buys you fault tolerance, read scaling, and locality — and the bill comes due as lag, conflicts, and lost updates. This chapter covers every replication topology, how quorums work, and the full menu of conflict-resolution strategies from last-writer-wins to CRDTs.
How to use this: Part 1 is the reference card. Part 2 maps the territory. Part 3 is the full depth with pros/cons per approach. Part 4 is exhaustive interview prep with counter-question ladders.
Key takeaways
- Replication buys availability, read scaling, and locality — and the bill is lag, conflicts, and potential lost updates.
- A quorum with W + R > N lets a read see the latest write, but it is not the same as linearizability.
- Last-writer-wins silently drops concurrent writes; use version vectors to detect conflicts and CRDTs or application merge to resolve them losslessly.
- Asynchronous replication has a nonzero data-loss window (RPO > 0) on failover; synchronous replication trades latency for durability.
PART 1 — CHEATSHEET (Reference Card)
Every concept in this document, condensed.
The core tension
Replication gives availability + read scale + locality, but the moment you have more than one copy you must answer: how do writes propagate, what does a reader see while they’re propagating, and what happens when two copies are written concurrently? Every design below is a different answer.
Core vocabulary (one line each)
- Replica — one copy of the data; leader/primary accepts writes, follower/replica copies them.
- Single-leader — one leader, many read followers; all writes serialize through the leader.
- Multi-leader — several leaders accept writes (often one per region); concurrent writes can conflict.
- Leaderless — any replica accepts any write; clients use quorums (Dynamo-style).
- Synchronous replication — leader waits for follower ack before confirming (durable, slower).
- Asynchronous replication — leader confirms immediately, replicates in background (fast, can lose recent writes on failover).
- Semi-synchronous — wait for one follower, async to the rest (common compromise).
- Replication lag — how far a follower trails the leader.
- Quorum —
W + R > Nensures a read overlaps the latest write. - Anti-entropy — background process that reconciles divergent replicas.
- Read repair — fix stale replicas detected during a quorum read.
- Merkle tree — hash tree to find which key ranges differ between replicas cheaply.
- Hinted handoff — a healthy node temporarily stores writes for a down node.
- Sloppy quorum — accept writes on any reachable N nodes (not the “home” nodes) to stay available.
- Version vector — per-replica counters to detect concurrent (conflicting) writes.
- LWW (last-writer-wins) — pick highest timestamp; simple but drops concurrent writes.
- CRDT — Conflict-free Replicated Data Type; merges concurrent updates deterministically with no coordination.
- Chain replication — replicas in a chain; writes head→tail, reads at tail; strongly consistent + high throughput.
- State-machine replication (SMR) — apply the same ordered command log on every replica.
Replication topologies
| Topology | Writes | Conflicts? | Consistency | Availability |
|---|---|---|---|---|
| Single-leader | leader only | No (serialized) | Strong (sync) / lagged (async) | Writes stop if leader down until failover |
| Multi-leader | many leaders | Yes | Eventual + conflict resolution | High (local writes) |
| Leaderless (quorum) | any replica | Yes | Tunable via R/W | High |
| Chain replication | head only | No | Strong (linearizable) | Reconfigure on failure |
Quorum math (memorize)
N= replicas,W= write quorum,R= read quorum.W + R > N⇒ every read quorum intersects the latest write quorum (read-your-writes per key).W > N/2⇒ two writes can’t both succeed on disjoint majorities (prevents some conflicts).- Common:
N=3, W=2, R=2(tolerate 1 node down, still consistent).N=3, W=1, R=1= fast, very weak. - Sloppy quorum breaks the intersection guarantee — trades consistency for availability.
Numbers / rules of thumb
- Replication factor 3 is the near-universal default (survive 1 failure with room to spare).
- Synchronous cross-region replication ≈ one inter-region RTT (~70–150 ms) added to every write.
- Async replication failover can lose up to the replication lag of writes (RPO > 0).
Quick decision rules
- One write region, need simplicity + strong reads → single-leader (sync or semi-sync).
- Writes in multiple regions, must stay local/available → multi-leader or leaderless + conflict resolution.
- Concurrent writes are unavoidable and must not lose data → CRDTs (or app merge), never plain LWW.
- Need linearizable replication with high throughput → chain replication or SMR/consensus.
Top gotchas (litmus tests)
- Async replication = possible data loss on failover (RPO > 0). Sync = durability at latency cost.
- LWW silently discards concurrent writes — including ones that arrived “first” if clocks are skewed.
- Replication lag breaks read-your-writes, monotonic reads, and consistent-prefix unless you add session guarantees.
- Quorum (
W+R>N) is not linearizability — concurrent reads during a write can still see old/new nondeterministically without extra work. - Sloppy quorums sacrifice the intersection guarantee — “quorum” no longer means what it does in a strict quorum.
- Split brain: two leaders after a partition in multi-leader/failover → conflicting writes.
- Multi-leader auto-increment IDs and triggers are conflict landmines.
- Merkle trees make anti-entropy cheap; full key scans don’t scale.
- A version vector that grows with client count is a real operational problem.
- Failover is where most “consistent” systems actually lose data — interview for the failover story.
PART 2 — OUTLINE (full map)
- Why replicate
- Single-leader replication
- Multi-leader replication
- Leaderless (Dynamo-style) replication
- Synchronous vs asynchronous vs semi-synchronous
- Replication lag and the session-guarantee fixes
- Quorums and the W+R>N rule
- Anti-entropy, read repair, Merkle trees
- Hinted handoff and sloppy quorums
- Conflict detection with version vectors
- Conflict resolution: LWW, application merge, CRDTs
- CRDTs in depth
- Chain replication
- State-machine replication as a replication strategy
- Geo-replication patterns
- Decision guide
- Make it stick — the teaching tutorial (quorum & CRDT traces, mnemonics, flashcards)
PART 3 — DEEP DIVE
1. Why replicate
Three independent motivations, often conflated:
- Availability / fault tolerance — survive node, disk, rack, or datacenter failure.
- Read scaling — spread read load across copies (great when reads ≫ writes).
- Locality / latency — keep data near users in multiple regions.
The catch (the whole reason this topic is hard): the instant there’s more than one copy, you inherit the consistency problem — copies diverge under concurrent writes and network delay. A quick framing you should be able to state: under a network partition you can keep all replicas available or keep them linearizable, not both (Gilbert & Lynch’s formalization of Brewer’s conjecture); replication strategy is largely a choice of where on that spectrum to sit.
2. Single-leader replication
One replica is the leader (primary); it accepts all writes, records them to a replication log, and followers apply that log to stay in sync. Reads can go to the leader (fresh) or followers (scalable, possibly stale).
How writes propagate (replication-log methods):
- Statement-based — ship the SQL/command. Fragile: nondeterminism (
NOW(),RAND(), triggers) diverges replicas. - Write-ahead-log (WAL) shipping — ship the storage-engine log. Tightly couples replica to engine version.
- Logical (row-based) replication — ship logical row changes. Decoupled from engine internals; supports heterogeneous versions and CDC.
Failover: if the leader dies, promote a follower. This is the dangerous part — choosing the most up-to-date follower, redirecting clients, fencing the old leader, and (with async replication) accepting loss of un-replicated writes.
Pros: no write conflicts (writes serialize at one leader); simple mental model; strong reads from the leader. Cons: leader is a write bottleneck and SPOF until failover; async followers are stale; failover can lose data and cause split brain if the old leader returns.
3. Multi-leader replication
Multiple leaders each accept writes (typically one leader per region, or per datacenter), replicating to each other asynchronously.
Use cases: multi-region write locality, offline-capable clients (each device is a leader), collaborative editing.
The defining problem — write conflicts. Two leaders can accept conflicting writes to the same record concurrently; there’s no single point that serializes them, so conflicts must be detected and resolved (see §10–12). Operational landmines: auto-increment keys, triggers/side effects, and uniqueness constraints all conflict across leaders.
Pros: local low-latency writes; survives region isolation; good for geo and offline. Cons: conflicts are inherent and must be handled; weakest default consistency; topology (all-to-all vs star) affects propagation and conflict rate.
4. Leaderless (Dynamo-style) replication
No designated leader — the client (or a coordinator) writes to several replicas and reads from several, using quorums. Pioneered by Amazon’s Dynamo (DeCandia et al., 2007) and seen in Cassandra/Riak.
A write goes to all N replicas; it succeeds when W ack. A read queries replicas and succeeds with R responses, taking the newest version (and repairing stale ones). Divergence is healed by read repair and anti-entropy (§8).
Pros: no failover (no leader to lose); naturally highly available; tunable consistency per request via R/W. Cons: application must handle concurrent versions/conflicts; “quorum” guarantees are weaker than they look (see §7); read repair and anti-entropy add complexity.
5. Synchronous vs asynchronous vs semi-synchronous
- Synchronous: the leader waits for follower acknowledgment(s) before confirming the write. Durable (no data loss on leader failure) but the write is only as fast as the slowest acked replica, and a slow/dead follower can stall writes.
- Asynchronous: the leader confirms immediately and ships changes in the background. Fast and available, but a leader crash loses any writes not yet replicated — a nonzero Recovery Point Objective (RPO).
- Semi-synchronous: wait for one follower synchronously, replicate to the rest async — the common production compromise (guarantees the write exists on ≥2 nodes without waiting for all).
The tradeoff to state out loud: sync buys durability with latency and availability risk (a stalled follower stalls writes); async buys latency/availability with a data-loss window. Semi-sync is the usual middle.
6. Replication lag and the session-guarantee fixes
With async followers, reads can be stale, producing three classic anomalies — and each has a targeted fix (these are the session guarantees, from the Bayou system, Terry et al. 1995):
- Read-your-own-writes violated (you post, then see the old version): fix by reading your own recent writes from the leader, or routing your session to a replica caught up past your write’s version.
- Monotonic reads violated (you see a value, then an older one from a laggier replica): fix by pinning a session to one replica (or version floor) so reads never go backward.
- Consistent-prefix violated (you see an answer before its question, across partitions): fix by ensuring causally related writes are read in order (causal consistency).
Pros of session guarantees: cheap, kill the most jarring user-visible anomalies. Cons: per-session only — two devices can still disagree without propagating version tokens.
7. Quorums and the W+R>N rule
With N replicas, choose write quorum W and read quorum R. If W + R > N, any read set and the latest write set share at least one replica, so a read can observe the most recent write — the same quorum-intersection principle that underlies consensus.
N=3, W=2, R=2: tolerate one node down and still read the latest. The common balanced choice.W=N, R=1: fast reads, writes need all replicas (fragile to one slow node).W=1, R=1: fastest, weakest — effectively eventual.
Important nuance (the trap): W+R>N gives you staleness bounds, not linearizability. Concurrent reads during an in-flight write can return old-then-new in different orders; edge cases (failed writes that partially applied, sloppy quorums) break even read-your-writes. For true linearizability you need read repair on every read plus careful handling, or a consensus-backed path.
8. Anti-entropy, read repair, Merkle trees
Leaderless/eventual systems heal divergence two ways:
- Read repair: during a quorum read, if some replicas return stale versions, the coordinator writes back the newest version to them. Repairs hot (frequently read) data for free.
- Anti-entropy: a background process compares replicas and reconciles differences, repairing cold data that reads never touch.
To compare replicas without shipping all data, use a Merkle tree: a tree of hashes over key ranges. Two replicas compare root hashes; if equal, done; if not, descend only into differing subtrees. This makes reconciliation cost proportional to the differences, not the dataset.
Pros: keeps replicas convergent cheaply; Merkle trees make it scalable. Cons: read-only-repair leaves rarely-read data stale indefinitely (hence anti-entropy too); building/maintaining Merkle trees has overhead.
9. Hinted handoff and sloppy quorums
When some of a key’s “home” replicas are unreachable, a system can stay available two ways:
- Hinted handoff: a reachable node accepts the write on behalf of the down node, storing a “hint,” and forwards it once the node recovers.
- Sloppy quorum: count acks from any W reachable nodes, not necessarily the key’s designated replicas.
Pros: maximizes write availability during failures. Cons: breaks the quorum-intersection guarantee — a subsequent strict-quorum read may not see the sloppily-written value until the hint is delivered. “Quorum” no longer means “overlaps the latest write.” Be explicit about this in interviews; it’s a favorite gotcha.
10. Conflict detection with version vectors
Before you can resolve a conflict you must detect it: did write B supersede write A (B saw A), or were they concurrent (neither saw the other)? Version vectors — a counter per replica attached to each value — answer this exactly: compare two versions element-wise; if one dominates, it’s a true successor; if neither dominates, they’re concurrent and conflict.
Dynamo attaches version vectors to objects; on a conflicting read it returns all concurrent versions (“siblings”) for the application (or a CRDT) to merge.
Pros: exact, principled conflict detection (no false “winners”). Cons: size grows with the number of writers; mitigate with server-side replica IDs (not client IDs), dotted version vectors, and pruning.
11. Conflict resolution: LWW, application merge, CRDTs
Once concurrency is detected, you must converge to one value:
- Last-Writer-Wins (LWW): keep the version with the highest timestamp, discard the rest. Simple but lossy — concurrent writes are silently dropped, and with skewed physical clocks the “winner” can be the genuinely older write. Acceptable only where occasional loss is fine (e.g. caches, ephemeral state).
- Application merge: return siblings to the app, which merges by domain logic (e.g. union a shopping cart — the canonical Dynamo example). Correct but pushes work onto every reader.
- CRDTs: data types whose merge is automatic, commutative, associative, idempotent — concurrent updates always converge with no coordination and no loss (§12). The principled answer.
Decision: use CRDTs/merge when losing data is unacceptable; reserve LWW for where it isn’t.
12. CRDTs in depth
Conflict-free Replicated Data Types (Shapiro, Preguiça, Baquero, Zawirski, 2011) guarantee Strong Eventual Consistency: replicas that have received the same set of updates are in the same state, with no conflicts and no coordination. Two flavors:
- State-based (CvRDT): replicas exchange full state and merge via a join that forms a semilattice (the merge is commutative, associative, idempotent — so order and duplication don’t matter). Robust to message loss/duplication; heavier to ship.
- Operation-based (CmRDT): replicas broadcast operations, which must be commutative and delivered (usually causally, exactly-once). Lighter messages; stronger delivery requirements.
Common types:
- G-Counter (grow-only): per-replica counts, value = sum; merge = element-wise max.
- PN-Counter: two G-Counters (increments, decrements) to support decrement.
- LWW-Register: single value with a timestamp; merge = keep latest (lossy on concurrency, like LWW).
- OR-Set (observed-remove set): add-wins set where each add carries a unique tag, so a concurrent add and remove don’t lose the add. The go-to for sets/carts.
- Sequence/RGA, and text CRDTs: ordered sequences for collaborative editing.
Pros: automatic convergence, no coordination, available under partition, no lost updates. Cons: metadata overhead (tags, tombstones for removes); not every data model maps cleanly to a CRDT; some (LWW-Register) are still semantically lossy.
13. Chain replication
Replicas are arranged in a chain: writes enter at the head, propagate node-by-node to the tail; the tail answers reads and acknowledges writes. Because the tail only acknowledges a write after it has passed through every replica, reads at the tail are linearizable, while the write work is spread along the chain for high throughput (van Renesse & Schneider, 2004).
Pros: strong (linearizable) consistency with better throughput than broadcasting every write to all replicas at once; simple read path (tail only). Cons: write latency = chain length; failures require chain reconfiguration (a separate coordination service); a slow middle node stalls the chain. Variants (CRAQ) allow reads from any node while preserving consistency.
14. State-machine replication as a replication strategy
Model the service as a deterministic state machine, agree on a totally ordered log of commands (via consensus), and have every replica apply the log in order. All replicas stay identical, and the service is linearizable (Schneider’s state-machine approach, 1990). This is the strongly-consistent end of the replication spectrum — what etcd, ZooKeeper, and the per-shard cores of Spanner/CockroachDB use.
Pros: linearizable, fault-tolerant, no conflicts (the log defines the order). Cons: every write pays consensus/coordination latency; the state machine must be strictly deterministic (no wall-clock reads, no unsynchronized randomness).
15. Geo-replication patterns
- Single-leader per region with async cross-region: simple; reads local, writes go to a home region (cross-region write latency for non-home users).
- Multi-leader (leader per region): local writes everywhere, async convergence, conflict resolution required.
- Leaderless across regions: tunable per-request consistency; watch cross-region quorum latency.
- Consensus per shard + cross-shard 2PC + bounded clocks (Spanner/Cockroach): strong global consistency at the cost of inter-region RTTs on the write path.
- Follower reads / bounded staleness: serve most reads locally from a regional replica, accepting a small staleness bound, to avoid cross-region round trips.
Rule of thumb: you cannot have local-fast writes and global strong consistency and partition availability all at once — pick the two the product needs.
16. Decision guide
Where do writes originate?
├─ One place ──────────────► SINGLE-LEADER
│ ├─ Must not lose writes on failover? ► SYNC / SEMI-SYNC replication
│ └─ Latency-first, small loss OK? ► ASYNC replication
├─ Multiple regions, writes must be local ─► MULTI-LEADER or LEADERLESS
│ └─ Concurrent writes must not lose data? ► CRDTs (or app merge); NEVER plain LWW
└─ Need linearizable replicated service ───► SMR / CONSENSUS (or CHAIN REPLICATION for throughput)
Tuning a leaderless store:
Strong-ish per key ► W+R>N (e.g. N=3,W=2,R=2)
Availability-first ► sloppy quorum + hinted handoff (accept weakened guarantees)
Heal divergence ► read repair (hot) + anti-entropy w/ Merkle trees (cold)
Reach-for / avoid:
- Single-leader — for: most OLTP, simplicity, strong reads. Avoid when: you need multi-region local writes.
- Multi-leader — for: geo write locality, offline clients. Avoid when: you can’t tolerate conflict-resolution complexity.
- Leaderless — for: always-on, tunable consistency. Avoid when: you need linearizability without extra machinery.
- CRDTs — for: counters, sets, collaborative state under concurrency. Avoid when: the data model doesn’t fit or metadata cost is prohibitive.
- Chain replication / SMR — for: strong consistency. Avoid when: write latency budget is tight and staleness is acceptable.
PART 4 — INTERVIEW ARSENAL
How to wield this. Senior signals: (1) you separate the three reasons to replicate and tie the strategy to the actual one; (2) you always have a failover/data-loss (RPO) story; (3) you never resolve conflicts with naive timestamp LWW without flagging the data-loss trap. Each question has a model answer and counter-ladder.
A. Fundamentals
Q1. Single-leader vs multi-leader vs leaderless — when each? Answer: Single-leader when writes can serialize in one place (most OLTP) — no conflicts, simple, strong reads from the leader. Multi-leader when you need local writes in multiple regions or offline clients — accept conflict resolution. Leaderless (quorum) when you want always-on writes with no failover and tunable per-request consistency — accept concurrent versions. Counter-ladder:
- “Biggest downside of single-leader?” → Leader is a write SPOF/bottleneck; failover risks data loss and split brain.
- “What makes multi-leader hard?” → Write conflicts with no single serialization point; auto-increment/uniqueness landmines.
- “Why is leaderless ‘highly available’?” → No leader to lose, so no failover gap; writes succeed as long as W replicas are reachable.
Q2. Synchronous vs asynchronous replication — the tradeoff? Answer: Sync waits for follower ack → no data loss on leader crash (RPO=0) but slower and a stalled follower stalls writes. Async confirms immediately → fast and available but loses un-replicated writes on failover (RPO>0). Semi-sync (wait for one follower) is the usual compromise. Counter-ladder:
- “How much data can async lose?” → Up to the replication lag at the moment of failure.
- “Why not always sync to all?” → One slow replica stalls every write; availability drops.
- “What’s semi-synchronous?” → Synchronous to one follower, async to the rest — guarantees the write on ≥2 nodes without waiting for all.
B. Lag & consistency
Q3. Users post a comment and don’t see it on refresh. Diagnose and fix, cheapest first. Answer: Classic read-your-writes violation from reading a lagging follower. Cheapest fix: route the user’s own recent reads to the leader, or to a replica known to be caught up past their write’s version (a version token). Broader: monotonic-reads (pin session to a replica) prevents reads going backward. Counter-ladder:
- “They see the comment, refresh, it vanishes, refreshes again, returns — what’s that?” → Monotonic-reads violation across replicas of differing lag; pin the session or enforce a version floor.
- “Reply shows before the original across two feeds — name it.” → Consistent-prefix / causal violation; need causal consistency.
Q4. Explain W+R>N. Does it give linearizability? Answer: With N replicas, if write quorum W and read quorum R satisfy W+R>N, any read set overlaps the latest write set, so a read can see the newest write. But it does not give linearizability — concurrent reads during an in-flight write can return old/new nondeterministically, and partial/failed writes or sloppy quorums break even staleness bounds. Counter-ladder:
- “Pick N,W,R to tolerate one node down with consistency.” → N=3, W=2, R=2.
- “Why isn’t quorum = linearizable?” → No agreement on a single order of concurrent ops; need read-repair-on-read or a consensus path.
- “What does a sloppy quorum change?” → Writes count on any reachable nodes, so the intersection guarantee no longer holds until hints are delivered.
C. Conflicts & resolution
Q5. Two regions write the same key concurrently. How do you detect and resolve? Answer: Detect with version vectors — compare element-wise; if neither dominates, they’re concurrent (a true conflict). Resolve by (a) returning siblings for application merge, (b) using a CRDT so the merge is automatic and lossless, or (c) LWW only if losing one write is acceptable. Never default to timestamp LWW for data you can’t afford to lose. Counter-ladder:
- “Why not just last-writer-wins by timestamp?” → Clock skew makes the winner arbitrary; the genuinely newer write can lose. Silent data loss.
- “Give a concrete merge.” → Shopping cart = union of items (Dynamo); counters = sum via G/PN-counters.
- “Version-vector size blows up with clients — fix?” → Key causality on server/replica IDs, use dotted version vectors, prune.
Q6. What’s a CRDT and when do you reach for one? Answer: A data type whose merge is commutative, associative, and idempotent, guaranteeing all replicas converge with no coordination and no lost updates (Shapiro et al.). State-based merge full states via a semilattice join; op-based broadcast commutative ops (usually needs causal, exactly-once delivery). Reach for them for counters (G/PN), sets (OR-Set), and collaborative text — anywhere concurrent writes are normal and loss is unacceptable. Counter-ladder:
- “State-based vs op-based tradeoff?” → State-based tolerates message loss/dup but ships more; op-based is lighter but needs reliable causal delivery.
- “Cost of CRDTs?” → Metadata: unique tags, tombstones for removes; some types (LWW-Register) are still semantically lossy.
- “Add vs remove concurrently in a set?” → OR-Set is add-wins by tagging each add; a concurrent remove only removes observed tags.
D. Architecture & failure
*Q7. How does chain replication get strong consistency and throughput? Answer: Writes flow head→tail; the tail acks only after the write traversed every replica, so tail reads are linearizable. Write work is distributed along the chain rather than fanned out from one leader, raising throughput. Counter-ladder:*
- “Failure handling?” → A separate coordinator reconfigures the chain (drops failed node, relinks); the protocol must handle in-flight writes.
- “Downside?” → Write latency scales with chain length; a slow middle node stalls the chain. CRAQ relaxes reads to any node.
Q8. Walk the failover story for a single-leader async system. Where’s the data loss? Answer: Leader crashes → detect → pick the most up-to-date follower → promote → redirect clients → fence the old leader. Data loss = any writes the old leader confirmed but hadn’t replicated (RPO = lag). Split-brain risk if the old leader returns thinking it’s still primary — must fence (e.g. via a consensus-backed lease/epoch). Counter-ladder:
- “How prevent split brain?” → Fencing: epoch/term numbers from a consensus store; the resource rejects stale-epoch leaders.
- “How avoid data loss entirely?” → Synchronous (or semi-sync) replication, at latency cost.
E. Worked drill — driving a design end-to-end
Watch how the replication strategy follows from the requirements, with a failover/RPO story made explicit.
Prompt: “Design the storage layer for a multi-region session/shopping-cart service: users may switch regions, carts are updated frequently and sometimes concurrently from multiple devices, and the service must stay writable even if a region is isolated. Losing a cart item is unacceptable.”
1 — Clarify and pick the topology. “Requirements: writes must succeed locally in every region (availability + locality) → I cannot use a single global leader. So multi-leader (leader per region) or leaderless. Frequent concurrent writes from multiple devices → conflicts are normal, not exceptional. And ‘losing an item is unacceptable’ rules out LWW.”
2 — Conflict strategy first (it drives everything). “Model the cart as a CRDT: an OR-Set of line items (add-wins, so a concurrent add+remove never drops the add) with PN-Counters for quantities. Concurrent updates from two devices merge deterministically with no coordination and no loss — exactly the requirement.”
3 — Replication mechanics. “Leaderless, N=3 replicas per region with W=2, R=2 for per-key freshness within a region, and asynchronous cross-region propagation of CRDT state/ops. Cross-region we don’t need quorum on the write path — the CRDT converges regardless of delivery order, which is why it’s safe under partition.”
4 — Availability under partition. “If a region is isolated, it keeps accepting cart writes locally (CRDT, no coordination). On heal, anti-entropy with Merkle trees reconciles divergent replicas cheaply, and read repair fixes hot keys during reads. No data lost; carts merge.”
5 — Session correctness when users switch regions. “User hops region and must see their own cart → read-your-writes: carry a version token (the CRDT’s causal context) and read from a replica caught up to it, or merge the inbound CRDT state on arrival. Because it’s a CRDT, even a stale read just merges forward — it never loses items.”
6 — Failover / RPO. “No leader to fail (leaderless), so no failover gap and no split-brain leader problem. The only ‘loss’ risk is a replica dying before W acks — mitigated by W=2 (write survives on ≥2 nodes) and hinted handoff for temporarily-down replicas.”
7 — Tradeoffs stated. “I chose availability + convergence over linearizability: a reader may briefly see a not-yet-merged cart, but never a wrong or lossy one. Cost: CRDT metadata (item tags, tombstones) and anti-entropy overhead. If the product instead needed a single authoritative cart with strict ordering (say, for inventory holds), I’d move that specific operation onto a single-leader/consensus path and pay the cross-region latency — but that’s the checkout/inventory concern, not the cart.”
Template: requirements → topology → conflict strategy → replication mechanics → partition/failover behavior → explicit tradeoff. It generalizes to any replication design.
F. Consolidated gotchas & traps (rapid fire)
- Async replication has RPO > 0 — failover loses un-replicated writes.
- LWW silently drops concurrent writes; skew can crown the older one.
- Replication lag breaks read-your-writes / monotonic reads / consistent prefix — add session guarantees.
- W+R>N ≠ linearizability.
- Sloppy quorum breaks the intersection guarantee.
- Split brain after partition/failover → conflicting leaders; fence with epochs.
- Multi-leader + auto-increment IDs/uniqueness = conflict landmines.
- Version vectors grow with #writers — key on replica IDs.
- Read repair leaves cold data stale — also run anti-entropy.
- Merkle trees make reconciliation cost ∝ differences, not data size.
G. Pros/cons master tables
Replication topologies
| Topology | Pros | Cons |
|---|---|---|
| Single-leader (sync) | Strong, no conflicts, durable | Latency; slow follower stalls writes; leader SPOF |
| Single-leader (async) | Fast, simple | Stale followers; failover data loss (RPO>0) |
| Multi-leader | Local writes, geo, offline | Inherent conflicts; ID/uniqueness landmines |
| Leaderless (quorum) | Always-on, tunable, no failover | App handles versions; weak quorum guarantees |
| Chain replication | Linearizable + high throughput | Latency ∝ chain length; reconfig on failure |
| SMR / consensus | Linearizable, no conflicts | Consensus latency; determinism required |
Conflict resolution
| Strategy | Pros | Cons |
|---|---|---|
| LWW | Trivial | Silent data loss; clock-skew “winner” |
| Application merge | Correct, domain-aware | Work on every read; easy to get wrong |
| CRDTs | Automatic, lossless convergence | Metadata/tombstone overhead; not all models fit |
Go deeper (primary sources)
- DeCandia et al., “Dynamo: Amazon’s Highly Available Key-value Store” (2007) — leaderless quorums, version vectors, hinted handoff, read repair.
- Shapiro, Preguiça, Baquero, Zawirski, “Conflict-free Replicated Data Types” (2011) — CRDTs and strong eventual consistency.
- Terry et al., “Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System” (1995) — session guarantees and conflict handling.
- van Renesse & Schneider, “Chain Replication for Supporting High Throughput and Availability” (2004).
- Schneider, “Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial” (1990).
- Helland, “Life Beyond Distributed Transactions: an Apostate’s Opinion” (2007).
- Gilbert & Lynch, “Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services” (2002).
- Kleppmann, Designing Data-Intensive Applications (2017), Ch. 5.
PART 5 — MAKE IT STICK (Teaching Tutorial)
The references are the map; this is the driving lesson. We build the three replication shapes, draw why quorums work, and trace a CRDT merge by hand so “convergence” stops being a buzzword. Pen and paper for the two traces — they’re the whole point.
17.1 The one picture: three places writes can land
SINGLE-LEADER MULTI-LEADER LEADERLESS (quorum)
┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐
│ L │ all writes │ L1 │↔↔↔│ L2 │ writes │ R1 │ │ R2 │ │ R3 │
└─┬──┘ └────┘ └────┘ both sides └────┘ └────┘ └────┘
┌─┴──┐┌────┐ conflicts possible! client writes W of them,
│ F1 ││ F2 │ (merge needed) reads R of them; W+R>N
└────┘└────┘
no conflicts local writes everywhere no leader to fail
One question decides your whole design: where do writes originate? One place → single-leader (no conflicts, simple). Many places → multi-leader/leaderless (local & available, but conflicts are now your problem). Everything else follows.
17.2 Why W + R > N works — draw the overlap
Don’t memorize the formula; see it. N=3 replicas, write quorum W=2, read quorum R=2:
replicas: R1 R2 R3
a WRITE hits any 2: [R1 R2] ← newest value lives here
a later READ hits any 2: [R2 R3]
▲
R2 is in BOTH → the read is guaranteed to SEE the write.
Because 2 + 2 = 4 > 3, the two sets CANNOT avoid each other.
That overlap is the entire trick (it’s the same quorum-intersection idea behind consensus). W + R > N ⇒ read set and latest-write set always share a node ⇒ the read can see the latest write. Bump N for fault tolerance; bump W for durable writes; bump R for fresh reads — but keep the sum > N. Sloppy quorum? You let writes land on any reachable nodes → the overlap guarantee breaks → reads can miss the write until hints are delivered. (Say that trade out loud.)
17.3 Trace a CRDT merge by hand (this is where “convergence” clicks)
A G-Counter (grow-only counter) = each replica keeps a slot per replica; value = sum of all slots; merge = element-wise MAX. Watch two replicas diverge during a partition and converge perfectly on heal — regardless of message order:
slots = [R1, R2, R3]
R1: +1 +1 → [2,0,0] (value 2) ╳ partition ╳
R2: +1 → [0,1,0] (value 1) ╳ partition ╳
R3: +1 → [0,0,1] (value 1)
…heal. Everyone exchanges state, merges by element-wise MAX:
merge([2,0,0],[0,1,0],[0,0,1]) = [2,1,0... wait: max each slot] = [2,1,1]
value = 2+1+1 = 4 ✔ every replica lands on [2,1,1] = 4
Now the magic property: do the merges in ANY order, twice, whatever — you still get [2,1,1]. Because element-wise max is commutative, associative, and idempotent, order and duplication don’t matter → all replicas converge with no coordination and no lost increments. Compare that to last-writer-wins, which would have kept one replica’s count and thrown the others away. That is why CRDTs exist: convergence by math, not by locking.
17.4 The data-loss trap, in one picture (why not just LWW?)
A: cart={milk} ts=10:00:05 (clock fast)
B: cart={milk,eggs} ts=10:00:02
LWW keeps higher ts → {milk}. The eggs (really newer) are GONE.
Fix: version-vector detect "concurrent" → MERGE (union) → {milk,eggs}. Keep the eggs.
17.5 Analogies that stick
- Quorum intersection = a 5-person board: any two meetings of 3 share a member, so the last decision always has a witness at the next meeting.
- CRDT = everyone keeps a tally and they take the max — reconcile in any order, always agree, never lose a tick.
- Async replication = mailing a copy — fast, but if the post office (leader) burns down before delivery, that copy is lost (RPO > 0).
- Hinted handoff = a neighbor signs for your package while you’re out and hands it over when you’re back.
17.6 Misconceptions → corrections
| You might think… | Actually… |
|---|---|
| “Quorum (W+R>N) = linearizable.” | No — it bounds staleness, but concurrent ops can still interleave; not full linearizability. |
| “Async replication is basically as safe.” | It has RPO > 0 — failover loses un-replicated writes. |
| “Last-writer-wins is a fine default.” | It silently drops concurrent writes (the eggs). |
| “CRDTs need coordination to converge.” | The opposite — their whole point is convergence without coordination. |
| “Sloppy quorum still guarantees freshness.” | No — it trades the overlap guarantee for availability. |
17.7 Explain it back (Feynman)
- Where do writes originate — and how does that pick the topology? [17.1]
- Draw the W+R>N overlap and say why it guarantees a fresh read. [17.2]
- Trace a G-Counter merge; why does order not matter? [17.3]
- Tell the “eggs” story and the version-vector fix. [17.4]
- What does async replication lose on failover, and what’s that called? [Part 3 §5]
17.8 Flashcards (cover the right column)
| Prompt | Answer |
|---|---|
| Pick topology by… | Where writes originate |
| W+R>N guarantees | Read set overlaps latest write set |
| Common safe quorum | N=3, W=2, R=2 |
| CRDT merge property | Commutative, associative, idempotent → converges |
| G-Counter value/merge | Sum of slots / element-wise max |
| LWW danger | Drops concurrent writes (clock skew) |
| Detect conflicts with | Version vectors |
| Async failover loss | Up to the replication lag (RPO > 0) |
| Sloppy quorum trades | Freshness guarantee for availability |
17.9 The 60-second recall
“Replicate for availability, read-scaling, and locality — and inherit conflicts. Pick the shape by where writes originate: one place = single-leader (no conflicts), many = multi-leader/leaderless (conflicts are yours). Quorums work because W+R>N forces the read set to overlap the latest write set — same overlap trick as consensus. Never resolve concurrent writes with timestamp last-writer-wins; it drops the genuinely newer write. Detect conflicts with version vectors, then merge — and CRDTs make the merge automatic because element-wise operations are commutative, associative, and idempotent, so replicas converge with no coordination and no lost updates. Async replication is fast but loses un-replicated writes on failover (RPO > 0).”
Frequently asked questions
What’s the difference between single-leader, multi-leader, and leaderless replication?
Single-leader serializes all writes through one node (no conflicts, simple). Multi-leader accepts writes at several nodes (local writes, but conflicts must be resolved). Leaderless lets any replica accept writes using quorums (highly available, application handles concurrent versions).
What does W + R > N guarantee?
With N replicas, if the write quorum W and read quorum R satisfy W + R > N, any read quorum overlaps the latest write quorum, so a read can observe the most recent write. It does not, however, guarantee linearizability for concurrent operations.
Why is last-writer-wins dangerous?
Last-writer-wins resolves concurrent writes by keeping the highest timestamp and discarding the rest, silently losing data. With skewed physical clocks the surviving write may even be the older one. Use it only where occasional loss is acceptable.
What is a CRDT and when should you use one?
A Conflict-free Replicated Data Type has a merge function that is commutative, associative, and idempotent, so concurrent updates always converge with no coordination and no lost updates. Use CRDTs for counters, sets, and collaborative state where concurrent writes are normal and loss is unacceptable.
What is the difference between synchronous and asynchronous replication?
Synchronous replication waits for follower acknowledgment before confirming a write — durable but slower, and a slow follower stalls writes. Asynchronous replication confirms immediately and replicates in the background — fast and available, but a leader crash can lose un-replicated writes.
