A self-contained reference for the most subtly deep idea in distributed systems: there is no global “now.” Once you accept that, every ordering primitive — logical clocks, vector clocks, hybrid clocks, TrueTime — becomes a different answer to the same question: given two events on two machines, which came first, and can we even ask? Master this and consistency, replication, and consensus stop being mysterious.
How to use this: Part 1 is the reference card. Part 2 maps the territory. Part 3 is the full depth with pros/cons per mechanism. Part 4 is exhaustive interview prep — questions, model answers, and the counter-question ladders interviewers use to find your floor.
Key takeaways
- There is no global “now” in a distributed system; the only ordering you get for free is causal (happens-before).
- Lamport clocks give a consistent total order but cannot detect concurrency; vector clocks detect concurrency exactly at O(N) size.
- Wall-clock time can jump backward — use a monotonic clock for durations, and never resolve conflicts with physical-timestamp last-writer-wins.
- Hybrid Logical Clocks give near-wall-clock, causally consistent timestamps without special hardware; TrueTime gives a bounded interval you can wait out.
PART 1 — CHEATSHEET (Reference Card)
Every concept in this document, condensed.
The one idea everything hangs on
There is no shared global clock across machines, and message delays are unbounded — so “simultaneous” and “before” are not free. We recover ordering either by tracking causality explicitly (logical/vector clocks) or by bounding physical-clock error (NTP/PTP/TrueTime). Every design picks one.
Core vocabulary (one line each)
- Physical clock — wall-clock time from a quartz oscillator, disciplined by NTP/PTP. Drifts; can jump backward.
- Monotonic clock — a counter that only ever increases; correct for measuring durations, meaningless across machines.
- Logical clock (Lamport) — a per-node integer that captures happens-before but not concurrency.
- Vector clock — a vector of per-node counters that captures happens-before and detects concurrency exactly.
- Version vector — vector clock applied to replica data versions (for conflict detection in replication).
- Hybrid Logical Clock (HLC) — physical time + a logical counter; close to wall-clock yet captures causality, monotonic.
- TrueTime — Google’s API returning an interval
[earliest, latest]guaranteed to contain true time, via GPS + atomic clocks. - happens-before (→) — Lamport’s partial order: same-process order, send→receive, and transitivity.
- Concurrent (a ‖ b) — neither
a → bnorb → a. Not “at the same time” — just causally unrelated. - Total / partial / causal order — every pair ordered / some pairs ordered / ordered iff causally related.
- Broadcast ordering — FIFO (per-sender), causal (respects →), total (everyone same order).
- Consistent snapshot — a global state that could have actually occurred (no effect-without-cause).
- Idempotency key / sequence number — make a retried or reordered message safe / orderable.
Clock-mechanism comparison
| Mechanism | Captures causality? | Detects concurrency? | Size | Wall-clock meaning? |
|---|---|---|---|---|
| Lamport clock | Yes (one direction) | No | O(1) | No |
| Vector clock | Yes | Yes (exact) | O(N) nodes | No |
| HLC | Yes | Partial | O(1) (+ counter) | ~Yes (close) |
| Physical + NTP | No | No | O(1) | Yes (±error) |
| TrueTime | Yes (via wait) | N/A | O(1) + interval | Yes (bounded ε) |
Ordering hierarchy
Total order ⊃ Causal order ⊃ FIFO order. Total-order broadcast ≡ consensus (the strongest, most expensive). Causal is the strongest you get while staying available under partition.
Numbers to memorize
- Quartz drift ≈ 30–50 ppm ⇒ ~2–4 seconds/day if undisciplined.
- NTP accuracy: ~1 ms on a LAN, 10–100 ms over the public internet.
- PTP (IEEE 1588) with hardware timestamping: sub-microsecond.
- TrueTime ε (uncertainty half-width): historically ~1–7 ms; commit-wait pays ~2ε.
- Leap seconds and NTP “step” adjustments can make wall-clock jump backward — never assume monotonic.
Quick decision rules
- Need to detect conflicting concurrent writes? → version vectors (not Lamport, not timestamps).
- Need a single agreed order of operations? → total-order broadcast / consensus.
- Need cause→effect preserved cheaply, available under partition? → causal order (vector clocks / dependency tracking).
- Need near-wall-clock timestamps that still respect causality, without special hardware? → HLC.
- Need globally consistent transaction timestamps? → TrueTime + commit-wait (or HLC + uncertainty windows).
Top gotchas (litmus tests)
- “Concurrent” ≠ “same time.” It means causally unrelated.
- Wall-clock time can go backward (NTP step, leap second). Never use it to measure elapsed time — use a monotonic clock.
- Last-Writer-Wins by physical timestamp silently drops data when clocks are skewed — the “winning” write may actually be older.
- Lamport clocks can’t detect concurrency —
a → b ⟹ L(a) < L(b), but not the converse. - Vector clocks grow O(N) with the number of writers; unbounded client counts are a real problem.
- A timestamp ordering being a total order doesn’t make it a causal order unless you enforce it.
- Total-order broadcast is as hard as consensus — don’t promise it cheaply.
- NTP gives you accuracy with no guarantee; TrueTime gives you a bounded interval you can wait out — that’s the whole difference.
- Causal delivery requires buffering out-of-order messages until their dependencies arrive.
- A “consistent snapshot” must never show a received message that was never sent — capture in-flight messages (channel state), not just node state.
PART 2 — OUTLINE (full map)
- Why time is hard in distributed systems
- Physical clocks — quartz, NTP, PTP, monotonic vs wall-clock, leap seconds
- The happens-before relation
- Lamport logical clocks
- Vector clocks & version vectors (and matrix clocks)
- Hybrid Logical Clocks (HLC)
- TrueTime & commit-wait
- Orderings — total, partial, causal
- Broadcast & delivery ordering — FIFO, causal, total
- Consistent global snapshots (Chandy–Lamport)
- Ordering in practice — sequence numbers, log offsets, idempotency, LWW
- Decision guide — choosing a time/ordering mechanism
- Make it stick — the teaching tutorial (vector-clock traces, mnemonics, flashcards)
PART 3 — DEEP DIVE
1. Why time is hard in distributed systems
On one machine, the CPU gives every event a single, totally ordered timeline; “which came first” is free. Across machines two facts break that:
- No shared clock. Each node has its own quartz oscillator drifting independently. Even synchronized, they differ by milliseconds — larger than the time it takes to do thousands of operations.
- Unbounded message delay. In an asynchronous network you cannot put an upper bound on how long a message takes, so “I sent this before you sent that” is not observable without extra machinery.
The consequence (Lamport’s 1978 insight): the only ordering we can know for free is causal — if event a could have influenced event b, then a precedes b. Everything else (a single global order, real wall-clock order) must be purchased with either coordination or synchronized-clock hardware. This is the root of why strong consistency is expensive.
2. Physical clocks — quartz, NTP, PTP, monotonic vs wall-clock
Quartz drift. A typical crystal drifts ~30–50 ppm; left alone, two machines diverge by seconds per day. So we discipline clocks against a reference.
NTP (Network Time Protocol, Mills). Hierarchical (stratum 0 reference clocks → stratum 1 servers → …). A client estimates offset and round-trip delay from timestamp exchanges and slews (gradually adjusts) or steps (jumps) its clock. Practical accuracy: ~1 ms LAN, 10–100 ms WAN. Crucially, NTP can step the clock backward, and the adjustment is best-effort with no error bound you can program against.
PTP (Precision Time Protocol, IEEE 1588). Uses hardware timestamping at the NIC to remove software jitter; achieves sub-microsecond sync on suitable LANs. Used in finance/telco and increasingly in data centers.
Monotonic vs wall-clock — the practical rule. Operating systems expose two clocks: a wall-clock (CLOCK_REALTIME) that can jump (NTP/leap second/admin change), and a monotonic clock (CLOCK_MONOTONIC) that only increases. Measure durations and timeouts with the monotonic clock; use wall-clock only for human-facing timestamps. Mixing them up causes the classic bug where a leap-second backward step makes a timeout compute a negative duration.
Leap seconds. Occasionally a second is inserted into UTC; naive systems repeat or skip a second. Google popularized leap smearing (spreading the adjustment over hours) to avoid a discontinuity.
Pros: cheap, ubiquitous, human-meaningful. Cons: no causality, no programmable error bound, can move backward — never the basis for correctness-critical ordering on its own.
3. The happens-before relation
Lamport defined the happens-before partial order →:
- If
aandbare in the same process andaoccurs beforeb, thena → b. - If
ais the sending of a message andbis its receipt, thena → b. - Transitivity: if
a → bandb → c, thena → c.
If neither a → b nor b → a, the events are concurrent (a ‖ b). The deep point: a → b means a could have causally influenced b. Concurrency means “causally independent,” not “at the same wall-clock instant.” This relation is the foundation for every logical-clock scheme below and for causal consistency.
4. Lamport logical clocks
A scalar counter L per process, with two rules:
- Increment
Lbefore each local event. - On send, attach
L; on receive of a message with timestampt, setL = max(L, t) + 1.
Guarantee: a → b ⟹ L(a) < L(b). Critical limitation: the converse is false — L(a) < L(b) does not imply a → b (they may be concurrent). So Lamport clocks give a consistent total order (break ties by process ID) but cannot detect concurrency.
Pros: O(1) space, trivial to implement, enough to build a total order for things like mutual exclusion or a tie-broken global ordering. Cons: can’t tell “happened-before” from “concurrent,” so useless for conflict detection.
Classic use: Lamport’s distributed mutual-exclusion algorithm and any place you need a deterministic total order without needing to know about concurrency.
5. Vector clocks & version vectors
A vector clock is an array V[1..N], one counter per process:
- Increment your own entry on each local event.
- On send, attach the whole vector; on receive, take the element-wise
max, then increment your own entry.
Comparison: V(a) < V(b) (every component ≤, at least one <) ⟺ a → b. If neither dominates, they are concurrent. So vector clocks detect concurrency exactly — their defining advantage over Lamport clocks.
Version vectors are the same idea applied to data versions across replicas (Dynamo uses them): they detect whether one write supersedes another or whether two writes conflict (concurrent) and need resolution.
Matrix clocks extend this to “what each node knows about what every other node knows” (O(N²)) — used to safely garbage-collect old state; rarely needed but worth naming.
Pros: exact causality + concurrency detection; the right tool for conflict detection. Cons: O(N) size, and N = number of writers — with millions of clients this explodes. Mitigations: dotted version vectors, server-side IDs instead of client IDs, pruning. This size problem is the #1 practical objection in interviews.
6. Hybrid Logical Clocks (HLC)
HLC (Kulkarni, Demirbas, et al., 2014) combines a physical-time component with a logical counter to get the best of both: timestamps that are close to wall-clock time (so they’re human-meaningful and sortable against NTP) and respect happens-before, while remaining monotonic even when the physical clock jumps.
Mechanically, each HLC timestamp is (l, c) where l tracks the max physical time seen and c is a counter that increments when physical time hasn’t advanced or to preserve causality on message receipt. It never goes backward and stays within the clock-sync bound of physical time.
Pros: O(1), no special hardware, monotonic, causally consistent, near-wall-clock — why CockroachDB and others use it. Cons: causality guarantee is only as good as the clock-sync bound; it provides an uncertainty window, not the hard interval TrueTime gives.
7. TrueTime & commit-wait
Google’s TrueTime (Spanner, 2012) exposes TT.now() returning an interval [earliest, latest] guaranteed to contain the true absolute time, with uncertainty half-width ε bounded by deploying GPS receivers and atomic clocks in every datacenter (so a bad clock is detected and evicted, keeping ε small — historically ~1–7 ms).
Commit-wait turns that bounded uncertainty into correctness: when a transaction commits at timestamp s, Spanner waits until TT.now().earliest > s before releasing locks/acking — i.e. it waits out ε so that any transaction that starts later in real time is guaranteed a larger timestamp. This delivers external consistency (strict serializability) globally.
Pros: globally ordered timestamps that match real time — the gold standard. Cons: needs special hardware; every commit pays ~2ε latency; the whole design is “spend money to shrink ε,” because shrinking time-uncertainty is the same as cheapening strong consistency.
8. Orderings — total, partial, causal
- Partial order: some pairs are ordered, some are concurrent (happens-before is a partial order).
- Total order: every pair is ordered — there is a single global sequence. Requires coordination to agree on the order of concurrent events.
- Causal order: a partial order that orders exactly the causally-related events and leaves concurrent ones unordered (or orders them arbitrarily but consistently everywhere).
The strength ladder: total ⊃ causal ⊃ FIFO. Total order is the most useful (everyone agrees) and the most expensive (it’s equivalent to consensus). Causal order is the strongest you can maintain while remaining available under network partitions — the bridge to causal consistency.
9. Broadcast & delivery ordering — FIFO, causal, total
When delivering messages to a group, you can guarantee progressively stronger orderings:
- FIFO delivery: messages from the same sender are delivered in send order (per-sender sequence numbers). Cheap.
- Causal delivery: if
send(m1) → send(m2), every node deliversm1beforem2. Implemented by attaching causal metadata (vector clocks) and buffering a message until its dependencies have been delivered. Prevents “reply shown before the message it answers.” - Total-order (atomic) broadcast: every node delivers all messages in the same order. Equivalent to consensus — you cannot implement it without paying consensus’s cost. This is exactly the primitive a replicated log provides.
Pros/cons: FIFO is nearly free but weak; causal needs buffering + metadata but stays available; total order is strongest but is consensus in disguise (latency, partition-sensitivity).
10. Consistent global snapshots (Chandy–Lamport)
To capture a meaningful global state of a running distributed system (for checkpointing, deadlock/termination detection, debugging), you can’t just ask every node “what’s your state now” — there’s no shared instant, and you’d miss messages in flight. The Chandy–Lamport snapshot algorithm (1985) records both node states and channel states (in-flight messages) using marker messages: a node records its state, sends markers on all outgoing channels, and records incoming messages until it sees a marker on each channel. The result is a consistent cut — a global state that could have occurred, never showing an effect without its cause.
Pros: gives a usable global state without freezing the system. Cons: the snapshot may be a state that no single real instant matched (it’s a possible state, not “the” state) — fine for its purposes, surprising if misunderstood.
11. Ordering in practice — sequence numbers, log offsets, idempotency, LWW
- Per-sender sequence numbers give FIFO and enable deduplication of retried messages (the receiver tracks the highest contiguous sequence seen).
- Append-only log offsets (e.g. a partitioned log like Kafka) provide a total order within a partition for free (single writer per partition) — which is why “shard by key, order within shard” is the dominant scalable ordering pattern. There is no total order across partitions without extra coordination.
- Idempotency keys make retries safe under at-least-once delivery: the server records processed keys and ignores duplicates, so reordering/redelivery can’t double-apply.
- Last-Writer-Wins (LWW) resolves concurrent writes by picking the highest timestamp. The trap: with physical timestamps and clock skew, the “latest” write may actually be the older one, silently losing data. Use logical/version timestamps, or accept LWW only where occasional loss is tolerable.
12. Decision guide — choosing a time/ordering mechanism
What do you actually need?
├─ Just measure elapsed time / timeouts ──────► MONOTONIC clock (never wall-clock)
├─ Human-readable timestamps, ordering not critical ──► WALL-CLOCK + NTP (accept skew)
├─ Detect concurrent/conflicting writes ──────► VERSION VECTORS (exact concurrency)
├─ A single agreed global order of events ────► TOTAL-ORDER BROADCAST / CONSENSUS
│ └─ Can shard by key? ──► per-partition LOG OFFSET (total order within shard, cheap)
├─ Preserve cause→effect, stay available under partition ──► CAUSAL ORDER (vector clocks / dep tracking)
├─ Near-wall-clock timestamps that respect causality, no special HW ──► HLC
└─ Globally strict-serializable transaction order ──► TRUETIME + commit-wait (or HLC + uncertainty window)
Reach-for / avoid:
- Lamport clock — for: a cheap deterministic total order where concurrency detection isn’t needed. Avoid when: you must detect conflicts (it can’t).
- Vector/version vectors — for: conflict detection in replication. Avoid when: writer count is huge and unbounded (size blows up) — mitigate or use server-side IDs.
- HLC — for: general-purpose causal-and-physical timestamps. Avoid when: you need a hard error bound for correctness (use TrueTime).
- TrueTime — for: global strict serializability. Avoid when: you can’t deploy GPS/atomic clocks or can’t pay commit-wait.
- Total-order broadcast — for: replicated state machines. Avoid when: you only need per-key order (shard + log offset is far cheaper).
PART 4 — INTERVIEW ARSENAL
How to wield this. Three tells of a strong answer: (1) you distinguish causal order from wall-clock order and never conflate “concurrent” with “simultaneous”; (2) you reach for the cheapest mechanism that satisfies the requirement (don’t propose consensus when per-partition ordering suffices); (3) you name the failure mode (clock skew → LWW data loss; vector-clock size blowup). Each question below has a model answer and the counter-ladder.
A. Fundamentals
Q1. Why can’t we just synchronize all clocks and use timestamps to order events? Answer: Because clock sync (NTP) has error far larger than the duration of operations (milliseconds vs microseconds), and clocks can step backward. Two events microseconds apart on different machines can get timestamps in the wrong order. Physical time gives accuracy without a programmable bound; correctness-critical ordering needs causality tracking or a bounded interval you can wait out (TrueTime). Counter-ladder:
- “What error does NTP actually have?” → ~1 ms LAN, 10–100 ms WAN.
- “When is timestamp ordering acceptable?” → When occasional misordering is tolerable, or when you have TrueTime-style bounded ε plus commit-wait.
- “Why is a monotonic clock not enough across machines?” → It has no shared zero point; it’s only meaningful for local durations.
Q2. Define happens-before and ‘concurrent.’ Answer: a → b if same-process order, or send→receive, or transitive composition. a ‖ b (concurrent) iff neither happens-before the other — meaning causally independent, not necessarily simultaneous. Counter-ladder:
- “Two events with the same wall-clock time — concurrent?” → Possibly causally related anyway; wall-clock equality is irrelevant to happens-before.
- “Does happens-before give a total order?” → No, it’s a partial order; concurrent events are unordered.
B. Logical & vector clocks
Q3. Lamport vs vector clocks — when do you need each? Answer: Lamport clocks (scalar) guarantee a → b ⟹ L(a) < L(b) and give a consistent total order, but can’t detect concurrency. Vector clocks (one counter per node) detect happens-before and concurrency exactly, at O(N) space. Use Lamport when you just need a total order; use vector/version vectors when you must detect conflicting concurrent writes. Counter-ladder:
- “Why can’t Lamport detect concurrency?” → The implication is one-directional:
L(a) < L(b)doesn’t implya → b. - “What’s the cost of vector clocks at scale?” → O(N) where N is the number of writers; with millions of clients it’s unbounded — mitigate with server-side node IDs, dotted version vectors, pruning.
- “How do you compare two vector clocks?” → Element-wise: if one dominates (all ≥, one >), it’s later; otherwise concurrent.
Q4. A shopping cart replicated across nodes gets concurrent adds. How do you avoid losing items? Answer: Don’t use LWW — it drops one add. Track versions with version vectors to detect the concurrent writes, then merge (union the cart, the Dynamo approach), or model the cart as a CRDT (OR-Set) so concurrent adds/removes converge deterministically without losing data. Counter-ladder:
- “Why not last-writer-wins by timestamp?” → Clock skew makes the ‘winner’ arbitrary and silently discards the loser’s add.
- “What if a remove and an add are concurrent?” → Define semantics explicitly (add-wins vs remove-wins); OR-Set is add-wins by construction.
C. Physical time & pitfalls
Q5. What’s the danger of using System.currentTimeMillis() (wall-clock) for timeouts or ordering? Answer: Wall-clock can jump — NTP steps, leap seconds, admin changes — possibly backward, so elapsed-time math can go negative or wrong, and timestamp ordering can invert. Use a monotonic clock for durations/timeouts; reserve wall-clock for human-facing display. Counter-ladder:
- “How do leap seconds break things?” → A repeated/skipped second; mitigated by leap smearing (spreading it over hours).
- “Give a real outage class this causes.” → Distributed locks/leases keyed on wall-clock expiring early/late; LWW dropping newer data.
Q6. Explain TrueTime and commit-wait. Why does Spanner need atomic clocks? Answer: TrueTime returns a bounded interval [earliest, latest] containing true time, with small ε guaranteed by GPS + atomic clocks (outliers detected and evicted). Commit-wait makes a committing transaction wait until TT.now().earliest passes its commit timestamp, so any later-in-real-time transaction gets a larger timestamp ⇒ external consistency (strict serializability). The hardware exists to keep ε small, since commit latency ≈ 2ε. Counter-ladder:
- “What if you don’t have atomic clocks?” → Use HLC + an uncertainty window (CockroachDB): same idea, looser bound, occasional read restarts.
- “What’s the latency cost?” → ~2ε commit-wait per transaction (single-digit ms historically).
- “How is this different from NTP?” → NTP gives accuracy with no usable bound; TrueTime gives a bound you can wait out.
D. Ordering & broadcast
Q7. What’s the difference between FIFO, causal, and total-order broadcast, and what does each cost? Answer: FIFO = per-sender order (cheap, sequence numbers). Causal = respects happens-before across senders (needs causal metadata + buffering). Total = everyone delivers in the same order (equivalent to consensus — most expensive, partition-sensitive). Choose the weakest that meets the requirement. Counter-ladder:
- “Why is total-order broadcast as hard as consensus?” → Agreeing on a single delivery order is reducible to agreeing on a value, and vice versa.
- “How do you get total order cheaply in practice?” → Shard by key and use a single-writer append log per shard (per-partition offsets); accept no cross-shard total order.
- “How does causal delivery actually work?” → Tag messages with vector clocks; hold a message until all causally-prior messages are delivered.
Q8. Design ordering for a distributed event log feeding analytics; you need high throughput and per-entity ordering. Answer: Partition by entity key; within a partition a single leader assigns monotonic offsets ⇒ total order per entity, cheaply and at high throughput. Consumers get per-key order. Accept that there’s no global order across partitions; if a global order is ever needed, merge by an HLC timestamp or route through consensus (expensive) — usually unnecessary. Counter-ladder:
- “A consumer needs a globally consistent ordering across keys — now what?” → Either centralize (single partition, kills throughput), use HLC timestamps to merge approximately, or define that you only need causal/per-key order (usually true).
- “Exactly-once?” → At-least-once delivery + idempotency keys / dedup on offset; true exactly-once needs transactional output coupling.
E. Worked drill — driving a design end-to-end
Watch how each step names the ordering requirement and buys the cheapest mechanism for it.
Prompt: “Design the data layer for a collaborative comment thread (like Google Docs comments) replicated across three regions. Users in different regions comment and reply concurrently; it must stay available even if a region is partitioned, and replies must never appear before the comment they answer.”
1 — Clarify the ordering requirements. “Two distinct needs: (a) availability under partition — so I cannot require global consensus on every write; (b) ‘reply never before its parent’ — that’s precisely a causal-order guarantee, not a total order. So my target is causal+ consistency: causal ordering plus convergence.”
2 — Pick the time/ordering mechanism. “From the decision guide: preserve cause→effect while staying available → causal order via vector clocks / dependency tracking, not total-order broadcast. Each comment carries the causal dependency (its parent and the writer’s vector clock). Replicas buffer a comment until its dependencies have been applied — that enforces ‘no reply before parent’ everywhere.”
3 — Conflict handling. “Concurrent edits/deletes are causally independent, so I need deterministic convergence: model the thread as a CRDT — an OR-Set of comments (add-wins) plus an append-only causal order per subtree. No last-writer-wins on physical timestamps — clock skew would drop comments.”
4 — Timestamps for display. “For human-facing ‘posted at’ ordering I’ll attach HLC timestamps: near-wall-clock, monotonic, and causally consistent, so the UI ordering never contradicts causality. Wall-clock alone would let a skewed region show a reply ‘before’ its parent by time even though we deliver it after.”
5 — Replication. “Asynchronous multi-leader replication between regions (each region accepts local writes for availability), propagating writes with their causal metadata; receivers apply in causal order. Vector-clock size is bounded by #regions (3), not #users — I key causality on region/replica IDs, not client IDs, sidestepping the O(N-clients) blowup.”
6 — Failure behavior. “Region partition → each region keeps serving local reads/writes (available); cross-region propagation queues and drains on heal; causal buffering guarantees correctness on convergence. No data loss because concurrent ops merge via the CRDT rather than overwrite.”
7 — Tradeoffs stated. “I traded a single global order (which would need consensus and break availability under partition) for causal+, the strongest model compatible with the availability requirement. Cost: causal metadata on each write and buffering of out-of-order messages; benefit: always-on, no lost comments, no effect-before-cause. If the product later demanded a strict global order of all comments, I’d have to introduce consensus and accept partition-time unavailability.”
The template — identify the exact ordering requirement → buy the cheapest mechanism that satisfies it → handle concurrency convergence → state the tradeoff — generalizes to any ordering question.
F. Consolidated gotchas & traps (rapid fire)
- “Concurrent” = causally unrelated, not simultaneous.
- Wall-clock can go backward — monotonic clock for durations.
- LWW + skew = silent data loss.
- Lamport can’t detect concurrency; vector clocks can.
- Vector clocks are O(#writers) — key on replica IDs, not clients.
- Total-order broadcast ≡ consensus — don’t promise it cheaply.
- Per-partition log offsets give total order within a shard for free; nothing across shards.
- Causal delivery requires buffering dependencies.
- TrueTime gives a bound you wait out; NTP gives accuracy you can’t trust.
- A consistent snapshot must capture in-flight messages, not just node states.
G. Pros/cons master tables
Clock/ordering mechanisms
| Mechanism | Pros | Cons |
|---|---|---|
| Wall-clock + NTP | Cheap, human-meaningful | No causality, no bound, can jump backward |
| Monotonic clock | Correct for durations | Local only; no cross-node meaning |
| Lamport clock | O(1); consistent total order | Can’t detect concurrency |
| Vector clock | Exact causality + concurrency | O(#writers) size |
| Version vector | Conflict detection in replication | Same size issue; needs merge logic |
| HLC | O(1), monotonic, causal + near-wall-clock | Bound only as good as clock sync |
| TrueTime | Globally ordered, bounded ε | Special hardware; commit-wait latency |
Broadcast orderings
| Ordering | Pros | Cons |
|---|---|---|
| FIFO | Cheap (sequence #s) | Weak; only per-sender |
| Causal | Available under partition; preserves cause→effect | Metadata + buffering |
| Total (atomic) | Single agreed order; SMR-ready | Equivalent to consensus; partition-sensitive |
Go deeper (primary sources)
- Lamport, “Time, Clocks, and the Ordering of Events in a Distributed System” (1978) — happens-before and logical clocks.
- Fidge, “Timestamps in Message-Passing Systems That Preserve the Partial Ordering” (1988); Mattern, “Virtual Time and Global States of Distributed Systems” (1989) — vector clocks.
- Chandy & Lamport, “Distributed Snapshots: Determining Global States of Distributed Systems” (1985).
- Kulkarni, Demirbas, et al., “Logical Physical Clocks” (2014) — HLC.
- Corbett et al., “Spanner: Google’s Globally-Distributed Database” (2012) — TrueTime and commit-wait.
- Liskov, “Practical Uses of Synchronized Clocks in Distributed Systems” (1991).
- Mills, Network Time Protocol (RFC 5905).
- Kleppmann, Designing Data-Intensive Applications (2017), Ch. 8–9.
PART 5 — MAKE IT STICK (Teaching Tutorial)
The references above are the map. Here’s the driving lesson. We build clocks from zero, trace a vector clock by hand (the thing that finally makes concurrency click), and lock it in with mnemonics and flashcards. Do the trace yourself with a pen once — it’s worth ten re-reads.
13.1 The one picture: there is no “now”
P1 ──●────────●──────────► time?
e1 e2
P2 ──────●───────●───────► time?
f1 f2
Which happened first: e2 or f1? Across machines, with clock drift and
network delay, THE QUESTION HAS NO FREE ANSWER. The only thing you can
know for sure: an event that CAUSED another came first.
Burn this in: *the only ordering you get for free is causality (one thing could have influenced another). Everything else — a single global order, real wall-clock order — must be bought* with coordination or special clocks. That’s why this whole chapter exists.
13.2 Build clocks from zero (three steps)
Step 1 — Lamport clock (a single counter). Rule: bump your counter on every event; on receiving a message, set counter = max(yours, theirs) + 1. Guarantee: if a → b then L(a) < L(b). The catch (say it aloud): the arrow only goes one way. L(a) < L(b) does NOT mean a → b. So Lamport can give you an order but can’t tell “before” from “concurrent.”
Step 2 — Vector clock (one counter per node). Now each event carries a vector. You can compare two vectors and get the exact answer: before, after, or concurrent.
Step 3 — HLC / TrueTime = bolt physical time on so the numbers also look like wall-clock (HLC) or come with a bounded error you can wait out (TrueTime). Same idea, fancier.
13.3 Trace a vector clock by hand (do this — it’s the whole topic)
Three processes. Each keeps [P1,P2,P3]. Rule: bump your own slot on any event; on receive, take element-wise max then bump your own slot.
P1: [1,0,0]──send m──────────────────────►
e1 \
P2: ▼ recv m → max([0,0,0],[1,0,0])=[1,0,0], bump → [1,1,0]
g1 ───────► [1,2,0]
g2
P3: [0,0,1]────────────────────────────►
h1 (totally on its own)
Now compare (one vector ≤ another iff every slot ≤, and it’s “before” if strictly less somewhere):
e1[1,0,0]vsg1[1,1,0]→ e1 ≤ g1, strictly less in slot 2 → e1 → g1 (causal: the message carried it). ✔ makes sense.e1[1,0,0]vsh1[0,0,1]→ slot1: 1>0, slot3: 0<1 → neither dominates → concurrent (e1 ‖ h1). ✔ they never touched.
The punchline: vector clocks detected the concurrency exactly. A Lamport scalar would have given e1=1 and h1=1 (or some order) and lied that one came first. This is the single most important thing in the chapter — and you only believe it once you’ve drawn the trace.
13.4 Why last-writer-wins eats your data (one picture)
Replica A clock: 10:00:05 (skewed FAST) writes cart={milk} ts=10:00:05
Replica B clock: 10:00:02 (correct) writes cart={milk,eggs} ts=10:00:02
LWW keeps the HIGHER timestamp → {milk}. The eggs (genuinely later!) are GONE.
Physical-timestamp LWW silently drops the real newer write when clocks are skewed. Use version vectors to detect the conflict, then merge (or a CRDT). Remember the eggs.
13.5 Analogies that stick
- Happens-before = “could have sent a letter.” If I could have mailed you info before your event, I’m “before” you. If neither of us could have reached the other, we’re concurrent — not “at the same time,” just unconnected.
- Lamport clock = numbering pages in one book — fine for a sequence, useless for telling which of two separate books was written first.
- Vector clock = each author tracking how many pages everyone has written — now you can tell who’d seen whose work.
- TrueTime = a watch that says “the time is between 10:00:03 and 10:00:07” and you wait out the gap so two events can’t be mis-ordered.
13.6 Misconceptions → corrections
| You might think… | Actually… |
|---|---|
| “Concurrent = same time.” | Concurrent = causally unrelated. Wall-clock is irrelevant. |
| “Synchronized clocks fix ordering.” | NTP error (ms) ≫ op duration; clocks even jump backward. Use causality or TrueTime’s bounded interval. |
| “Lamport clocks detect concurrency.” | No — L(a)<L(b) does not imply a→b. Only vector clocks detect it. |
| “Wall-clock time always moves forward.” | NTP steps + leap seconds can move it backward. Use a monotonic clock for durations. |
| “Last-writer-wins is safe.” | It silently drops concurrent writes (the eggs). |
13.7 Explain it back (Feynman)
- Why is causality the only “free” ordering? [13.1]
- What exactly can a Lamport clock not do, and why? [13.2]
- Redo the 3-process trace from memory; show e1 ‖ h1. [13.3]
- Tell the “eggs” story — why LWW lost data. [13.4]
- What does commit-wait actually wait for? [Part 3 §7]
13.8 Flashcards (cover the right column)
| Prompt | Answer |
|---|---|
| Only free ordering | Causality (happens-before) |
| Concurrent means | Causally unrelated (not “same time”) |
| Lamport’s blind spot | Can’t detect concurrency |
| Vector clock compare | ≤ all slots → before; neither dominates → concurrent |
| Vector clock size | O(number of writers) |
| HLC in one line | Physical time + counter; monotonic, causal, near-wall-clock |
| TrueTime + commit-wait | Wait out the ε interval → strict serializability |
| LWW danger | Drops the real newer write under clock skew |
13.9 The 60-second recall
“Across machines there’s no shared ‘now’, so the only ordering you get free is causality: if a could have influenced b, a is before b; otherwise they’re concurrent — meaning unrelated, not simultaneous. A Lamport clock gives you a counter that respects causality but can’t spot concurrency. A vector clock (one slot per node, take the max on receive) detects before/after/concurrent exactly — prove it by tracing three processes. Wall-clock time can jump backward, so never use it for durations, and never resolve conflicts with timestamp last-writer-wins (you’ll drop the genuinely newer write). When you need near-wall-clock causal timestamps use HLC; for a global strict order, use TrueTime and wait out its uncertainty.”
Frequently asked questions
Why can’t distributed systems just use synchronized clocks to order events?
Clock synchronization (NTP) has error of milliseconds to tens of milliseconds — far larger than the duration of operations — and clocks can step backward. So timestamps can order events incorrectly. Correct ordering needs causality tracking or a bounded time interval you can wait out, like TrueTime.
What is the happens-before relation?
Defined by Lamport, event a happens-before b if they are in the same process and a precedes b, or a is a message send and b is its receipt, or by transitivity. If neither happens-before the other, the events are concurrent — meaning causally unrelated, not necessarily simultaneous.
Lamport clock vs vector clock — when do you need each?
Use a Lamport clock when you just need a consistent total order; it cannot detect concurrency. Use a vector clock (or version vector) when you must detect concurrent, conflicting writes, which it does exactly, at the cost of O(N) size where N is the number of writers.
What is a Hybrid Logical Clock (HLC)?
An HLC combines physical time with a logical counter to produce timestamps that stay close to wall-clock time, respect happens-before, and never go backward — without special hardware. CockroachDB and others use HLCs for causally consistent ordering.
What is TrueTime and commit-wait?
TrueTime is Google Spanner’s clock API that returns a bounded interval guaranteed to contain the true time, using GPS and atomic clocks. Commit-wait makes a transaction wait out that uncertainty before acknowledging, guaranteeing external consistency (strict serializability) globally.
