Storage Engine Internals: B-Trees, LSM-Trees & WAL Complete Reference

Storage Engine Internals — System Design Handbook Part 5 featured image

A self-contained reference for where bytes meet physics. Every database’s personality — its write speed, read speed, space usage, and durability story — is decided by two choices: how it lays data out on disk (B-tree vs LSM-tree) and how it survives a crash (write-ahead logging). Master the storage engine and you can predict a database’s behavior from first principles instead of memorizing vendor docs.

How to use this: Part 1 is the reference card. Part 2 maps the territory. Part 3 is the full depth with pros/cons per structure. Part 4 is exhaustive interview prep with counter-question ladders.

Key takeaways

  • B-trees optimize reads with in-place updates; LSM-trees optimize writes with append-and-compact. Almost everything else follows from the storage hierarchy.
  • The RUM conjecture says you can minimize at most two of read, update, and memory (space) overheads — there is no free lunch.
  • Durability lives at fsync (and at a replica quorum in distributed systems), never in the OS page cache.
  • Compaction is a real tax: it consumes I/O and CPU and causes tail-latency spikes.

PART 1 — CHEATSHEET (Reference Card)

Every concept in this document, condensed.

The one idea

A storage engine is a set of tradeoffs forced by the memory/storage hierarchy: sequential I/O is vastly cheaper than random I/O, and durability requires forcing data to stable storage. B-trees optimize reads with in-place updates; LSM-trees optimize writes with append-and-compact. Almost everything else follows.

Core vocabulary (one line each)

  • Storage hierarchy — registers→L1/L2/L3 cache→RAM→SSD→HDD→network, each ~10–100× slower & bigger than the last.
  • B-tree / B+tree — balanced on-disk tree, fixed-size pages, in-place updates; the classic read-optimized index.
  • Write-ahead log (WAL) — append changes to a log before applying them, so a crash can be recovered.
  • ARIES — the canonical WAL recovery algorithm: Analysis → Redo → Undo.
  • LSM-tree — buffer writes in memory (memtable), flush to sorted files (SSTables), merge via compaction.
  • Memtable — in-RAM sorted structure absorbing writes before flush.
  • SSTable — immutable Sorted String Table file on disk.
  • Compaction — background merge of SSTables to reclaim space and bound read cost.
  • Bloom filter — probabilistic set membership; says “definitely not present” to skip SSTables.
  • Write / read / space amplification — bytes written / read / stored per logical operation.
  • RUM conjecture — you can optimize at most two of Read, Update, Memory(space) overheads.
  • Buffer pool / page cache — RAM cache of disk pages; the hit ratio dominates performance.
  • fsync — force OS buffers to durable storage; the boundary of real durability.

B-tree vs LSM-tree

Dimension B-tree LSM-tree
Writes In-place, random-ish Sequential append (fast)
Write amplification Lower per write, but page writes Higher (compaction rewrites)
Reads 1 path, predictable May check multiple SSTables (bloom filters help)
Space amplification Fragmentation/half-full pages Tombstones + overlapping versions until compaction
Read amplification Low Higher (mitigated by bloom + compaction)
Best for Read-heavy, range scans, OLTP Write-heavy, high ingest
Examples InnoDB, Postgres, BerkeleyDB RocksDB, LevelDB, Cassandra, Bigtable

Numbers to memorize (orders of magnitude)

  • L1 cache ~1 ns, main memory ~100 ns, SSD random read ~16–100 µs, HDD seek ~1–10 ms, intra-DC RTT ~0.5 ms, cross-region RTT ~70–150 ms.
  • Disk/SSD page (block) size: 4–16 KB; B-tree node ≈ one page.
  • Sequential disk throughput ≫ random IOPS — the reason LSM appends win on writes.
  • A B-tree of order ~hundreds is 3–4 levels deep for billions of keys → few page reads per lookup.

Quick decision rules

  • Write-heavy ingest, time-series, logs → LSM-tree.
  • Read-heavy OLTP, lots of range scans/point reads, predictable latency → B-tree.
  • Analytical scans over few columns → column store.
  • Need crash durability → WAL + fsync (and replication for node loss).

Top gotchas (litmus tests)

  1. Durability lives at fsync — data in the OS page cache is not durable; a power loss eats it.
  2. fsync can fail (and historically Linux could lose the dirty page after a failed fsync — “fsyncgate”); robust engines treat fsync failure as fatal.
  3. LSM reads may touch many SSTables — without bloom filters and compaction, read amplification kills you.
  4. Compaction is a tax: it consumes I/O and CPU and causes latency spikes; it’s not free background magic.
  5. Tombstones (LSM deletes) take space and slow scans until compacted; range deletes are especially nasty.
  6. B-tree updates are in-place → must WAL first (torn-page / partial-write protection).
  7. Write amplification on SSDs wears flash and burns throughput — measure it.
  8. The page cache hit ratio, not the engine, often decides real latency.
  9. Range scans love B-trees and column stores; random points with high write rate love LSM.
  10. RUM conjecture: you can’t be best at reads, writes, and space simultaneously — know which two you bought.

PART 2 — OUTLINE (full map)

  1. The memory/storage hierarchy and why it dictates design
  2. B-trees and B+trees
  3. Write-ahead logging and ARIES recovery
  4. LSM-trees: memtables, SSTables, compaction, bloom filters
  5. The amplification triangle and the RUM conjecture
  6. Buffer pool / page cache
  7. MVCC at the storage layer
  8. Row stores vs column stores; encoding and compression
  9. Durability and the fsync problem
  10. Index types
  11. Case studies: InnoDB, RocksDB/LevelDB, Bigtable/SSTable
  12. Decision guide
  13. Make it stick — the teaching tutorial (B-tree vs LSM pictures, mnemonics, flashcards)

PART 3 — DEEP DIVE

1. The memory/storage hierarchy and why it dictates design

Each level of the hierarchy is roughly 10–100× slower and larger than the one above: registers → L1/L2/L3 cache (ns) → RAM (~100 ns) → SSD (~tens of µs) → HDD (~ms, dominated by seek) → network. Two consequences drive all storage-engine design:

  1. Random I/O is far costlier than sequential I/O (catastrophically so on spinning disks; still meaningful on SSDs due to write amplification and flash translation). So engines that turn random writes into sequential appends (LSM) win on write throughput.
  2. Durability requires crossing into stable storage (SSD/HDD), and that crossing (fsync) is the expensive, correctness-critical boundary.

A storage engine is, fundamentally, a strategy for minimizing expensive I/O for the expected workload while guaranteeing durability. Naming which I/O pattern a design optimizes is the heart of every answer in this chapter.

2. B-trees and B+trees

The B-tree (Bayer & McCreight, 1972) is a balanced tree of fixed-size pages (≈ one disk block, 4–16 KB). Each node holds many keys (high fanout), so the tree is shallow — 3–4 levels indexes billions of keys, meaning a lookup is a handful of page reads. A B+tree keeps all values in leaf nodes and links leaves together, making range scans efficient (walk the linked leaves). This is the dominant index structure in relational databases.

Updates are in-place: to change a key you locate its page and rewrite it. This requires care for crash safety — a partial (“torn”) page write must not corrupt data — which is why B-trees pair with a WAL (and techniques like Postgres full-page writes).

Pros: predictable, low read amplification (one path); excellent for point reads and range scans; mature. Cons: writes can be random (page updates scattered across the file); pages are often half-full (fragmentation → space amplification); in-place updates need WAL + careful concurrency (latching).

3. Write-ahead logging and ARIES recovery

Write-ahead logging (WAL): before modifying a data page, append a log record describing the change to a sequential log and fsync it. The rule — log before data — guarantees that after a crash you can reconstruct or undo any change. WAL also turns many random data-page writes into one sequential log write (deferring/ batching the page updates), improving throughput.

ARIES (Mohan et al., 1992) is the canonical recovery algorithm, with three phases after a crash:

  1. Analysis — scan the log from the last checkpoint to find dirty pages and in-flight transactions.
  2. Redo — replay all logged changes (even uncommitted) to restore the exact pre-crash page state (“repeating history”).
  3. Undo — roll back transactions that hadn’t committed, using compensation log records (CLRs) so the undo itself is restartable.

Key ARIES ideas: LSN (log sequence number) stamped on each page to know what’s applied; CLRs to make undo idempotent/restartable; fuzzy checkpoints to bound recovery time without freezing.

Pros: correct crash recovery with fine-grained locking and partial rollback; the basis of essentially all relational durability. Cons: WAL fsync is on the critical path of commit (latency); log management/checkpointing complexity.

4. LSM-trees: memtables, SSTables, compaction, bloom filters

The Log-Structured Merge-tree (O’Neil, Cheng, Gawlick, O’Neil, 1996) optimizes for writes:

  1. Writes go to a memtable (in-RAM sorted structure) plus a WAL for durability — so a write is a sequential log append + an in-memory insert (fast).
  2. When the memtable fills, it’s flushed to an immutable SSTable (Sorted String Table) on disk — a sequential write.
  3. Reads check the memtable, then SSTables newest→oldest; bloom filters per SSTable let a read skip files that definitely don’t contain the key.
  4. Compaction merges SSTables in the background, discarding overwritten/deleted entries, reclaiming space and bounding the number of files a read must consult.

Compaction strategies:

  • Leveled (LevelDB/RocksDB default): non-overlapping SSTables per level, each level ~10× the previous. Lower read & space amplification, higher write amplification.
  • Size-tiered (Cassandra default): merge similarly-sized SSTables. Lower write amplification, higher read & space amplification.

Deletes write a tombstone marker (you can’t modify an immutable file); the key is truly removed only when compaction passes it.

Pros: very high write throughput (sequential appends), good compression (sorted, immutable files), tunable via compaction. Cons: read amplification (multiple SSTables; bloom filters mitigate point reads but not range scans); compaction consumes I/O/CPU and causes latency spikes; tombstones bloat space and slow scans until compacted.

5. The amplification triangle and the RUM conjecture

Three costs measure a storage engine:

  • Read amplification — data pages/SSTables read per logical read.
  • Write amplification — bytes physically written per logical write (compaction rewrites inflate this; matters for SSD wear).
  • Space amplification — bytes stored per byte of live data (fragmentation, tombstones, old versions).

The RUM conjecture (Athanassoulis et al., 2016) formalizes the intuition: you can minimize at most two of Read, Update, and Memory (space) overheads — improving one usually worsens another. B-trees pick low read amplification (pay in write/space); leveled LSM lowers read+space (pays in write amplification); size-tiered LSM lowers write (pays in read+space). There is no free lunch — pick the two your workload needs.

6. Buffer pool / page cache

Databases cache disk pages in RAM (a buffer pool, or rely on the OS page cache). Reads hit RAM when possible; dirty pages are written back later (coalescing writes). The cache hit ratio frequently dominates real-world latency more than the on-disk structure does — a B-tree lookup that hits the buffer pool is a memory access; a miss is a disk I/O 100–1000× slower.

Eviction (LRU/CLOCK/LRU-K), dirty-page writeback policy, and read-ahead/prefetch for sequential scans are the key knobs. Direct I/O vs OS page cache is a recurring design decision (control vs simplicity, avoiding double-caching).

Pros: turns most operations into RAM speed; absorbs write bursts. Cons: cold cache = cliff in latency; double-caching if both the engine and OS cache; writeback must coordinate with WAL for durability.

7. MVCC at the storage layer

To support snapshot reads without blocking writers, engines keep multiple physical versions of rows, each tagged with the transaction/timestamp that created it; a reader sees the version visible as of its snapshot. Two implementation families: append-new-version (Postgres writes a new row version, old ones cleaned by VACUUM) and undo-log/rollback-segment (MySQL InnoDB / Oracle keep the latest in place and reconstruct old versions from an undo log).

Pros: non-blocking consistent reads; the substrate for snapshot isolation. Cons: version storage and garbage collection overhead — long-running transactions pin old versions and cause bloat (Postgres) or undo-log growth (InnoDB); GC tuning (autovacuum) is a real operational concern.

8. Row stores vs column stores; encoding and compression

  • Row store (OLTP): all columns of a row stored together. Great for reading/writing whole rows (point lookups, transactional updates).
  • Column store (OLAP): each column stored contiguously. Great for analytical scans that touch few columns over many rows — you read only the needed columns, and contiguous same-type data compresses extremely well (run-length, dictionary, bit-packing, delta) and is vectorization/SIMD-friendly. Stonebraker & Çetintemel (“One Size Fits All”, 2005) argued these workloads are different enough to need different engines.

Encoding/compression matters everywhere: dictionary encoding for low-cardinality columns, delta/frame-of-reference for sorted numerics, RLE for runs, plus general block compression (LZ4/Zstd). Compression trades CPU for I/O — usually a win when I/O-bound.

Pros (column): huge scan/compression wins for analytics. Cons (column): poor for point lookups/row updates (a row is scattered across column files); typically batch-loaded. Hybrid (PAX) formats and HTAP systems try to bridge.

9. Durability and the fsync problem

A write is durable only once it’s forced to stable media. Data sitting in the OS page cache is lost on power failure. fsync() (or fdatasync, O_DSYNC) forces the OS to flush. The subtleties that bite real systems:

  • fsync is expensive and sits on the commit critical path → engines batch commits (group commit) to amortize it.
  • fsync can fail, and historically on Linux a failed fsync could mark dirty pages clean and lose the data on retry (“fsyncgate”, ~2018) — robust engines now treat fsync failure as fatal and crash/recover rather than continue.
  • Write barriers / ordering: the WAL record must hit stable storage before the data page; disk write caches and reordering must be controlled (FUA/flush).
  • Replication as durability: in distributed systems, “durable” means “acknowledged by a quorum of replicas,” not “fsynced on one node” — a single node/disk can be lost entirely.

Pros of doing it right: genuine crash and power-loss safety. Cons: fsync latency forces batching/group commit; correctness requires handling fsync failure and ordering — easy to get subtly wrong.

10. Index types

  • B+tree index — ordered; supports point and range queries; the relational default.
  • Hash index — O(1) point lookups, no range support; good for equality.
  • LSM-based index — write-optimized ordered index (RocksDB).
  • Inverted index — term → posting list of documents; the basis of full-text search (Lucene/Elasticsearch).
  • Secondary index — index on a non-primary column; points to primary key or row location (and must be maintained on writes — write amplification).
  • Covering index — includes all columns a query needs, so the query is answered from the index alone (no table fetch).
  • Composite index — multi-column; usable left-to-right (prefix rule).
  • Specialized: bitmap (low cardinality, analytics), GiST/R-tree (spatial), bloom (membership pre-filter).

Tradeoff: every index speeds reads but slows writes and costs space (it must be kept in sync). Over-indexing is a classic write-throughput killer.

11. Case studies

  • InnoDB (MySQL): B+tree (clustered on primary key), WAL (redo log) + undo log for MVCC, buffer pool, group commit. Read-optimized OLTP.
  • RocksDB / LevelDB: LSM-tree, memtable + WAL, leveled compaction, bloom filters, column families. Embedded write-optimized engine under many databases (used by CockroachDB, TiKV, etc.).
  • Bigtable / SSTable (Chang et al., 2006) + GFS (Ghemawat et al., 2003): LSM-style — in-memory memtable, immutable SSTables on a distributed file system, compaction; the lineage of HBase/Cassandra. Demonstrates the LSM pattern at planetary scale on top of a replicated file system.

12. Decision guide

What's the dominant workload?
├─ Write-heavy / high ingest (logs, metrics, events) ─────► LSM-TREE
│      ├─ Read & space matter most ► LEVELED compaction
│      └─ Write amp matters most   ► SIZE-TIERED compaction
├─ Read-heavy OLTP, point + range, predictable latency ──► B+TREE
├─ Analytical scans over few columns, big tables ────────► COLUMN STORE (+ compression)
└─ Full-text search ────────────────────────────────────► INVERTED INDEX

Durability:
   Crash safety ► WAL + fsync (group commit to amortize)
   Node/disk loss ► replicate to a quorum (fsync on one node is not enough)

Indexing:
   Equality only ► HASH;  range/sort ► B+TREE;  membership prefilter ► BLOOM;
   answer-from-index ► COVERING;  but every extra index taxes writes + space.

Reach-for / avoid:

  • B-treefor: reads, ranges, OLTP, predictable latency. Avoid when: write throughput is the bottleneck.
  • LSM-treefor: heavy writes/ingest. Avoid when: you need consistently low read/scan latency without compaction spikes.
  • Column storefor: analytics. Avoid when: you do point lookups or row-level updates.
  • More indexesfor: read speed. Avoid when: write throughput/space is tight.

PART 4 — INTERVIEW ARSENAL

How to wield this. Senior signals: (1) you reason from the storage hierarchy (sequential vs random, where durability happens) rather than reciting structures; (2) you tie the engine choice to the read/write/space amplification the workload can afford (RUM); (3) you locate durability precisely at fsync/quorum and know its failure modes. Each question has a model answer and counter-ladder.

A. Fundamentals

Q1. B-tree vs LSM-tree — when each, and why? Answer: B-trees update in place and have low read amplification → best for read-heavy OLTP and range scans, but writes can be random and pages fragment. LSM-trees turn writes into sequential appends + background compaction → best for write-heavy/ingest, at the cost of read amplification (mitigated by bloom filters) and compaction-driven latency spikes. Choose by which of read/write/space you can afford to spend (RUM conjecture). Counter-ladder:

  • “Why are LSM writes faster?” → Sequential append vs random page updates; sequential I/O is far cheaper.
  • “What’s the LSM read cost?” → Possibly checking multiple SSTables; bloom filters skip files for point reads, but range scans still merge across levels.
  • “What’s compaction’s downside?” → I/O/CPU consumption and tail-latency spikes; write amplification (rewriting data repeatedly).

Q2. Walk through what makes a committed write durable. Answer: Append a WAL record describing the change and fsync it to stable storage before (or atomically with) acking the commit; the data pages can be written back later. Durability = the WAL is on stable media (and, in a distributed system, replicated to a quorum). Data in the OS page cache is not durable. Counter-ladder:

  • “fsync is slow — how do you cope?” → Group commit: batch many transactions’ WAL flushes into one fsync.
  • “What if fsync fails?” → Treat as fatal (fsyncgate) — don’t assume the data is safe; crash and recover.
  • “Single-node fsync enough for a distributed DB?” → No — replicate to a quorum; a disk/node can be lost entirely.

B. LSM internals

Q3. Explain the LSM write and read paths and the role of bloom filters and compaction. Answer: Write: append to WAL + insert into the in-memory memtable; when full, flush to an immutable SSTable. Read: check memtable, then SSTables newest→oldest; a per-SSTable bloom filter says “definitely not here” so reads skip most files. Compaction merges SSTables, dropping overwritten/deleted keys, bounding read amplification and reclaiming space. Counter-ladder:

  • “How are deletes handled?” → Tombstone markers; the key is removed only when compaction processes it; tombstones bloat space/scans meanwhile.
  • “Leveled vs size-tiered compaction?” → Leveled: lower read/space, higher write amplification; size-tiered: lower write, higher read/space.
  • “Why can a bloom filter give a false ‘maybe’?” → Probabilistic; false positives (waste a lookup) but never false negatives.

Q4. Your LSM store has good write throughput but p99 read latency spikes periodically. Diagnose. Answer: Compaction. Background merges contend for disk/CPU, and read amplification rises when many SSTables accumulate before compaction catches up. Mitigations: tune compaction (rate-limit, schedule), increase bloom-filter bits, adjust level sizing, separate compaction I/O, or add cache. It’s the classic LSM tail-latency tax. Counter-ladder:

  • “How reduce read amplification?” → More aggressive (leveled) compaction, bigger bloom filters, better block cache.
  • “Tradeoff of more compaction?” → Higher write amplification (SSD wear, throughput) — the RUM tradeoff again.

C. B-tree & recovery

Q5. Why does a B-tree need a WAL, and what does ARIES do on crash? Answer: In-place page updates can be torn by a crash (partial write), corrupting data; the WAL lets you redo/undo to a consistent state. ARIES recovery: Analysis (find dirty pages and in-flight txns from the last checkpoint), Redo (replay all changes to restore exact pre-crash state — repeat history), Undo (roll back uncommitted txns using restartable compensation log records). Counter-ladder:

  • “Why redo even uncommitted changes?” → To restore the exact page state the log assumes, then undo cleanly (“repeating history” simplifies correctness).
  • “What are CLRs for?” → Make undo idempotent/restartable so a crash during recovery is safe.
  • “What bounds recovery time?” → Checkpoints (fuzzy checkpoints) limit how far back Analysis/Redo must scan.

D. Layout & indexing

Q6. When would you choose a column store, and why does it compress so well? Answer: For analytical workloads scanning a few columns over many rows. Storing each column contiguously means a scan reads only the needed columns, and contiguous same-type values compress extremely well (dictionary, RLE, delta, bit-packing) and vectorize (SIMD). Bad for point lookups/row updates because a row is scattered across column files. Counter-ladder:

  • “Why not use a column store for OLTP?” → Row reconstruction/updates touch every column file; terrible for point writes.
  • “Name compression schemes and when each fits.” → Dictionary (low cardinality), RLE (runs), delta/FOR (sorted numerics), plus block compression (LZ4/Zstd).

Q7. A table has 8 indexes and writes are slow. Explain and fix. Answer: Every index must be updated on each write → 8× index maintenance (write amplification) plus space. Fix: drop unused/redundant indexes, use composite/covering indexes to cover multiple queries with fewer structures, and reconsider whether some reads can tolerate a scan. Indexing is a read/write tradeoff. Counter-ladder:

  • “How do you find unused indexes?” → Index usage stats from the DB; drop those never chosen by the planner.
  • “Covering index benefit?” → Query answered from the index alone — no table fetch — at the cost of a wider index.

E. Worked drill — driving a design end-to-end

Watch the workload’s read/write/space profile dictate the engine, with durability pinned precisely.

Prompt: “Design the storage engine choice and durability story for a metrics/time-series ingestion system: millions of writes/sec of (series, timestamp, value), reads are mostly recent-range scans per series and rollups, and a node can crash or lose power. Storage cost matters.”

1 — Profile the workload. “Write-dominated (millions/sec), append-mostly (timestamps increase), reads are range scans by series over recent time, plus rollups; space matters. From the RUM lens I want low write amplification and decent range-scan reads; I can spend some read amplification.”

2 — Choose the engine.LSM-tree. Writes become sequential WAL appends + memtable inserts — exactly what millions of appends/sec need. Data sorted by (series, timestamp) means recent-range scans per series are contiguous within SSTables. B-tree would suffer random in-place writes under this ingest rate.”

3 — Tune compaction for the access pattern. “Time-series is nearly append-only with time-localized reads, so I’d use a time-windowed compaction strategy (group SSTables by time window) — it keeps a window’s data together (great for range scans and for dropping whole old windows on TTL with minimal write amplification). Pure leveled compaction would rewrite old, never-updated data needlessly.”

4 — Space. “Sorted, immutable SSTables of same-type numeric columns compress hard — delta-encode timestamps, delta/FOR the values, block-compress (Zstd). TTL via dropping old time-window SSTables wholesale (cheap) rather than per-key tombstones (which would bloat).”

5 — Durability. “Each write hits the WAL with group commit (amortize fsync across the flood of writes) before ack. fsync failure → treat as fatal. For node/power loss beyond one machine, replicate to a quorum (durability = acknowledged by ≥2 of 3 replicas), since one disk can vanish. Recent memtable data is protected by the WAL replay on restart.”

6 — Reads. “Per-series bloom filters + the block cache serve point/recent reads; range scans walk the sorted recent SSTables. Rollups run as background jobs writing pre-aggregated series (read-optimized) so dashboards don’t scan raw data.”

7 — Tradeoffs stated. “I bought write throughput and low space (compression + window drops) and spent some read amplification and compaction-driven tail latency — the right RUM corner for ingest-heavy time-series. If the workload were instead random point updates with strict low-latency reads, I’d flip to a B-tree and accept lower write throughput. Durability is pinned at WAL+fsync locally and quorum replication globally — not at the page cache.”

Template: profile read/write/space → pick the engine at the right RUM corner → tune compaction/layout to the access pattern → pin durability at fsync + quorum → state the tradeoff.

F. Consolidated gotchas & traps (rapid fire)

  • Durability = fsync (local) + quorum (distributed), never the page cache.
  • fsync can fail and lose data — treat as fatal.
  • Compaction is a real tax — tail-latency spikes, write amplification.
  • Tombstones bloat until compacted; range deletes are nasty.
  • B-trees need WAL for torn-page safety.
  • RUM: optimize at most two of read/update/memory.
  • Page-cache hit ratio often dominates latency.
  • Every index taxes writes and space.
  • Column stores are wrong for point lookups/updates.
  • Bloom filters never give false negatives (only false “maybe”).

G. Pros/cons master tables

Storage structures

Structure Pros Cons
B+tree Low read amp; range scans; predictable Random writes; fragmentation; needs WAL
LSM-tree High write throughput; compresses well Read amp; compaction spikes; tombstone bloat
Column store Massive scan/compression wins Bad for points/updates; batch-oriented
Hash index O(1) point lookups No ranges
Inverted index Full-text search Maintenance cost; not for ranges/points on values

Compaction (LSM)

Strategy Pros Cons
Leveled Low read & space amp High write amplification
Size-tiered Low write amplification High read & space amp
Time-windowed Great for time-series TTL/scan Niche to time-ordered data

Go deeper (primary sources)

  • Bayer & McCreight, “Organization and Maintenance of Large Ordered Indexes” (1972) — B-trees.
  • O’Neil, Cheng, Gawlick, O’Neil, “The Log-Structured Merge-Tree (LSM-Tree)” (1996).
  • Mohan et al., “ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging” (1992).
  • Chang et al., “Bigtable: A Distributed Storage System for Structured Data” (2006); Ghemawat, Gobioff, Leung, “The Google File System” (2003).
  • Graefe, “Modern B-Tree Techniques” (2011).
  • Athanassoulis et al., “Designing Access Methods: The RUM Conjecture” (2016).
  • Stonebraker & Çetintemel, “One Size Fits All: An Idea Whose Time Has Come and Gone” (2005).
  • Petrov, Database Internals (2019) — the modern synthesis.

PART 5 — MAKE IT STICK (Teaching Tutorial)

The references are the map; this is the driving lesson. A storage engine is just two pictures (B-tree, LSM-tree) and one boundary (fsync). Draw them once and you can predict any database’s behavior.

13.1 The one idea: random vs sequential I/O

  Random write (overwrite page in place)      Sequential write (append to a log)
   seek… write… seek… write…  ← slow          write write write write →  ← fast
   (B-tree updates do this)                    (LSM + WAL do this)

Every storage decision traces back to one fact: sequential I/O is much cheaper than random I/O, and durability means crossing onto disk (fsync). B-trees pay random writes to get cheap reads; LSM-trees pay later (compaction) to make writes sequential now. That’s the whole fork.

13.2 B-tree vs LSM in two pictures

  B-TREE (read-optimized)                LSM-TREE (write-optimized)
     [ . . root . . ]                     WRITE → memtable (RAM, sorted) + WAL
      /    |     \                                │ flush when full
   [..] [..] [..]  ← shallow (3–4 levels)         ▼
    leaves linked → range scans easy        SSTable SSTable SSTable  (immutable, on disk)
   update = find page, REWRITE in place             │ compaction merges them
                                            READ → check memtable, then SSTables
                                                   (bloom filter skips most)
  • B-tree: one path to any key (low read amplification), but updates rewrite pages (random-ish writes, fragmentation). Great for reads + range scans + OLTP.
  • LSM-tree: writes are append-only and fast; reads may check several SSTables (bloom filters help); compaction runs in the background (costing I/O + tail-latency spikes). Great for write-heavy ingest.

One-line picker: Reads & ranges → B-tree. Write floods → LSM.

13.3 The amplification triangle (RUM) — you can’t win all three

            READ amp
              /\
             /  \      Pick TWO to optimize; the third gets worse.
            /    \     B-tree:        low READ  (pay write + space)
           /______\    Leveled LSM:   low READ+SPACE (pay WRITE)
       WRITE     SPACE  Size-tiered:  low WRITE (pay READ+SPACE)
       amp        amp

RUM conjecture: minimize at most two of Read, Update(write), Memory(space). There is no free lunch — when someone proposes an engine, ask “which corner did they buy, and what did they pay?”

13.4 The fsync boundary (where durability actually lives)

  app write → OS PAGE CACHE (RAM)  ── power loss here = DATA GONE
                   │ fsync()
                   ▼
              DISK / SSD  ── now it's durable (locally)
                   │ replicate to a quorum
                   ▼
        durable even if this whole node dies

Memorize the boundary: data in the page cache is not durable; only after fsync (and, in a distributed system, after a quorum acknowledges) is it safe. fsync is slow → engines batch commits (group commit) to amortize it. And fsync can fail and lose the page — robust engines treat that as fatal.

13.5 Analogies that stick

  • B-tree = a filing cabinet — to change a record you find the folder and rewrite it in place.
  • LSM = an inbox + periodic filing — toss new notes on top fast (memtable/SSTable), and a clerk merges and dedups later (compaction).
  • Bloom filter = a bouncer with a guest list that only ever says “definitely not on the list” or “maybe” — never a false “not here.”
  • fsync = actually mailing the letter vs leaving it in your outbox (page cache).

13.6 Misconceptions → corrections

You might think… Actually…
“A committed write in RAM is safe.” Not until fsync (and a quorum, when distributed). The page cache is volatile.
“LSM is just faster than B-tree.” Faster writes; it pays read amplification and compaction spikes.
“Compaction is free background work.” It consumes real I/O/CPU and causes tail-latency spikes.
“More indexes = better.” Each index taxes every write and costs space.
“You can optimize read, write, and space together.” RUM: pick two.

13.7 Explain it back (Feynman)

  1. Why does sequential vs random I/O decide everything? [13.1]
  2. Draw B-tree and LSM; who wins writes, who wins reads, why? [13.2]
  3. State RUM and place B-tree, leveled LSM, size-tiered LSM on it. [13.3]
  4. Draw the fsync boundary; where is data not durable? [13.4]
  5. Why does an LSM store get periodic p99 read spikes? [Part 4 Q4]

13.8 Flashcards (cover the right column)

Prompt Answer
Root cause of all design Sequential I/O ≫ random I/O
B-tree strength Reads + range scans (in-place updates)
LSM strength Write throughput (append + compact)
Bloom filter promise Never a false negative
RUM conjecture Optimize ≤ 2 of read/write/space
Durability lives at fsync (+ quorum if distributed)
Group commit Batch fsyncs to amortize cost
Compaction cost I/O + CPU + tail-latency spikes

13.9 The 60-second recall

“A storage engine is a strategy for cheap I/O on the expected workload, given that sequential I/O beats random and durability means reaching disk. B-trees update pages in place — low read amplification, great for reads and range scans; LSM-trees append writes and compact later — high write throughput, at the cost of read amplification and compaction tail-latency spikes. The RUM conjecture says you can only minimize two of read, write, and space, so every engine buys a corner and pays elsewhere. Durability lives at fsync, never in the OS page cache, and in a distributed system it means a quorum acknowledged — a single fsync isn’t enough. Bloom filters skip SSTables (never a false negative), and every extra index taxes writes and space.”

Frequently asked questions

B-tree vs LSM-tree — when do you use each?

Use a B-tree for read-heavy workloads, range scans, and predictable latency; it updates in place with low read amplification. Use an LSM-tree for write-heavy or high-ingest workloads; it turns writes into sequential appends plus background compaction, at the cost of read amplification and compaction spikes.

What makes a write durable?

A write is durable once it is forced to stable storage via fsync of the write-ahead log, and in a distributed system once it is acknowledged by a quorum of replicas. Data sitting in the OS page cache is not durable and is lost on power failure.

What is write amplification?

Write amplification is the number of bytes physically written per logical write. LSM-tree compaction rewrites data repeatedly, inflating write amplification, which matters for SSD wear and throughput. B-trees, leveled compaction, and size-tiered compaction sit at different points on this tradeoff.

What is the RUM conjecture?

The RUM conjecture states that a storage engine can minimize at most two of three overheads: Read, Update, and Memory (space). B-trees favor low read overhead; leveled LSM-trees lower read and space at the cost of write amplification; size-tiered LSM-trees lower write overhead at the cost of read and space.

Why do LSM-trees have read amplification?

A read may need to check the in-memory memtable and several immutable SSTables across levels before finding a key. Bloom filters let point reads skip SSTables that definitely do not contain the key, and compaction bounds the number of files, but range scans still merge across levels.

Previous