Most databases use B-trees. Some use LSM trees. The choice determines whether writes or reads are fast.

You can’t optimize both.

B-Trees: Read-Optimized#

B-tree keeps data sorted on disk in a tree structure. Each node is a page (typically 4KB). Insert or update? Find the right page, modify it in place, write it back.

Reads are fast. Single lookup finds your data. Range scans are efficient because data is already sorted and contiguous.

Writes are slow. Every write does random disk I/O. Updating one value means reading a page, modifying it, writing it back. For high write throughput, B-trees struggle.

LSM Trees: Write-Optimized#

LSM tree (Log-Structured Merge-tree) takes a different approach. Writes go to an in-memory buffer (memtable). When memtable fills, flush it to disk as an immutable sorted file (SSTable).

Writes are fast. Sequential appends, no random I/O. Just keep creating new SSTables.

Reads are slow. Your data might be in any SSTable. Query has to check multiple files, merge results. The more SSTables, the slower the read.

%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#000000','primaryTextColor':'#00ff00','primaryBorderColor':'#00ff00','lineColor':'#00ff00','secondaryColor':'#000000','tertiaryColor':'#000000','noteBkgColor':'#000000','noteBorderColor':'#00ff00','noteTextColor':'#00ff00'}}}%% graph TD W[Write Request] W --> BT[B-Tree: Find page
Modify in place
Random I/O] W --> LSM[LSM: Append to memtable
Sequential write
Fast] R[Read Request] R --> BTR[B-Tree: Single lookup
Fast] R --> LSMR[LSM: Check multiple SSTables
Merge results
Slower] style W fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff style BT fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff style LSM fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff style R fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff style BTR fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff style LSMR fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff

B-tree optimizes reads. LSM tree optimizes writes. Pick based on workload.

Real-World Usage#

MySQL (InnoDB): B-tree. Most workloads are read-heavy.

Cassandra, RocksDB, LevelDB: LSM tree. Built for write-heavy workloads (logs, time-series, analytics).

The problem with LSM trees is they accumulate SSTables over time. Background compaction merges them, but that’s expensive. More on that next.

The Trade-off#

Write-heavy workload? LSM tree wins. Sequential writes, high throughput.

Read-heavy workload? B-tree wins. Direct lookups, no merging needed.

Balanced workload? Depends on your read/write ratio and whether you can tolerate read latency spikes during compaction.

I’ve worked with MySQL (B-tree) extensively. Haven’t built anything on LSM-based storage yet, but the write throughput advantage is tempting for certain use cases.

What storage engine does your database use?