Compaction Strategies: Cleaning Up After LSM Trees
LSM trees write fast by creating new SSTables. After a while, you have hundreds of them. Reads check every single one.
Compaction merges SSTables into fewer, larger files. But how you merge matters.
The Problem
Write 1000 records. That’s SSTable 1. Update 500 of them. That’s SSTable 2 with the new values. Delete 200. That’s SSTable 3 with tombstones.
Now a read has to check all three SSTables, merge the results, and apply tombstones. The more SSTables, the slower every read gets.
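The read path above can be sketched in a few lines. This is a minimal illustration, assuming each SSTable is just an in-memory dict written at a known point in time; the names `read` and `TOMBSTONE` are illustrative, not from any real engine.

```python
# Sentinel marking a deleted key; real engines store an explicit marker on disk.
TOMBSTONE = object()

def read(key, sstables):
    """Check SSTables newest-first; the first hit wins, tombstones hide the key."""
    for table in reversed(sstables):  # newest SSTable is last in the list
        if key in table:
            value = table[key]
            return None if value is TOMBSTONE else value
    return None

# The three SSTables from the example: 1000 inserts, 500 updates, 200 deletes.
sst1 = {f"k{i}": f"v{i}" for i in range(1000)}
sst2 = {f"k{i}": f"v{i}-new" for i in range(500)}      # updated values
sst3 = {f"k{i}": TOMBSTONE for i in range(300, 500)}   # tombstones
```

Every lookup walks the list until it finds the key, which is exactly why read cost grows with the number of SSTables.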
Size-Tiered Compaction
Merge SSTables of similar size. When you accumulate N small SSTables (say 4), merge them into one larger SSTable. When you have 4 of those larger ones, merge again.
Creates a pyramid: many small SSTables at the top, few large ones at the bottom.
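The trigger logic is simple: bucket SSTables by size, merge a bucket once it fills. A minimal sketch, assuming SSTables are represented only by their size in bytes; the fan-in of 4 and the bucket ratio are illustrative parameters, not defaults from any particular database.

```python
def find_merge_candidates(sizes, fan_in=4, bucket_ratio=2.0):
    """Group SSTable sizes into buckets of similar size; return the first
    bucket with at least fan_in members, or None if no merge is due."""
    buckets = []  # list of (representative_size, [member_sizes])
    for size in sorted(sizes):
        for rep, members in buckets:
            if rep / bucket_ratio <= size <= rep * bucket_ratio:
                members.append(size)
                break
        else:
            buckets.append((size, [size]))
    for _, members in buckets:
        if len(members) >= fan_in:
            return members  # merge these into one larger SSTable
    return None
```

Four similar-size SSTables trip the trigger; a mix of wildly different sizes does not, which is how the pyramid shape emerges.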
Write amplification: How many times the same data gets rewritten. You write 1MB once. During compaction, that 1MB gets merged into larger SSTables multiple times. With size-tiered compaction, a record is rewritten roughly once per tier, so about log(N) times for N SSTables' worth of data. Lower write amplification means less disk I/O for writes.
Read amplification: How many SSTables you check for a single query. Since SSTables at each tier can have overlapping key ranges, a read might need to check one SSTable per tier. More tiers = higher read amplification = slower reads.
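Both figures can be estimated back-of-the-envelope. This sketch assumes a fixed fan-in and a worst case of fan_in - 1 overlapping SSTables per tier; it is a rough model for intuition, not a formula from any specific engine.

```python
import math

def size_tiered_write_amp(total_sstables, fan_in=4):
    """Each merge moves a record up one tier and rewrites it once,
    so rewrites ~ number of tiers ~ log base fan_in of the data volume."""
    return math.ceil(math.log(total_sstables, fan_in))

def size_tiered_read_amp(total_sstables, fan_in=4):
    """Worst case: a read may probe up to fan_in - 1 overlapping
    SSTables in every tier."""
    tiers = size_tiered_write_amp(total_sstables, fan_in)
    return tiers * (fan_in - 1)
```

With 256 SSTables and a fan-in of 4, a record is rewritten about 4 times, while a read may touch around a dozen SSTables: cheap writes, expensive reads.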
Cassandra uses this by default. Good for write-heavy workloads where you don’t care about read latency spikes.
Leveled Compaction
Organize SSTables into levels. Level 0 has newest data. Level 1 has 10x more data than Level 0. Level 2 has 10x more than Level 1.
When Level N fills up, compact one SSTable from Level N with all overlapping SSTables in Level N+1.
Key property: within each level (except L0), SSTable key ranges don’t overlap. Reads only check one SSTable per level.
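Because key ranges within a level don't overlap, picking the compaction set is a range-intersection test. A minimal sketch, assuming each SSTable is described by its (min_key, max_key) pair; the function name is illustrative.

```python
def overlapping(victim, next_level):
    """Return the SSTables in level N+1 whose key ranges intersect
    the victim SSTable chosen from level N."""
    lo, hi = victim
    return [(a, b) for (a, b) in next_level if a <= hi and b >= lo]
```

The victim plus everything `overlapping` returns gets merged and written back into level N+1, preserving the non-overlapping invariant.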
Write amplification: Higher than size-tiered. When you compact from Level N to Level N+1, you might rewrite the same data multiple times as it moves through levels. A record written once might get rewritten 10+ times before reaching the bottom level. More disk I/O, but predictable.
Read amplification: Lower because you only check one SSTable per level. Instead of checking 20 SSTables with overlapping ranges, you check maybe 4-5 (one per level). Faster reads.
RocksDB uses this. Better for read-heavy workloads.
Pick your poison: write amplification or read amplification.
Time-Window Compaction
For time-series data. Create one SSTable per time window (say, per hour). Compact SSTables within the same window. Never compact across windows.
Deleting old data? Just drop entire SSTables. No compaction needed.
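Expiry then reduces to comparing window timestamps. A minimal sketch, assuming one SSTable per hourly bucket keyed by its window start in epoch seconds; the hour-long window and the names here are illustrative.

```python
WINDOW = 3600  # window size in seconds (one hour)

def window_of(timestamp):
    """Round a timestamp down to the start of its window."""
    return timestamp - (timestamp % WINDOW)

def expire(sstables_by_window, now, ttl):
    """Drop every SSTable whose window started before the TTL cutoff.
    Dropping is a whole-file delete: no merging, no tombstones."""
    cutoff = window_of(now - ttl)
    return {w: t for w, t in sstables_by_window.items() if w >= cutoff}
```

Contrast this with the other strategies: expired data never needs to be read or rewritten, it just vanishes one file at a time.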
Used in time-series databases where data expires after N days.
The Trade-off
No free lunch. Compaction uses CPU and disk I/O. During heavy compaction, read latency spikes. Write throughput can drop.
Tuning compaction is more art than science. Most of us using Cassandra or RocksDB just stick with defaults and monitor. The people building these databases are the ones tweaking compaction algorithms.
What compaction strategy does your database use?