A Kafka topic is an append-only log: every event ever published stays there until retention deletes it. Suppose user-42 changed their email 500 times. All 500 events are in the log. A new consumer starting from the beginning has to replay all 500 to figure out the current email. That's wasteful.

Delete old events? You’d break consumers who haven’t processed them yet. You need a way to keep the latest value for each key while discarding the history.

How Log Compaction Works

Instead of deleting records by age (retention period), log compaction deletes records by key. For each key, it keeps only the most recent entry. The result: the log shrinks dramatically, but every key’s latest value is preserved.

Before compaction:

offset 1: key=user-42, email=old@mail.com
offset 2: key=user-99, email=alice@mail.com
offset 3: key=user-42, email=newer@mail.com
offset 4: key=user-42, email=current@mail.com

After compaction:

offset 2: key=user-99, email=alice@mail.com
offset 4: key=user-42, email=current@mail.com

Offsets 1 and 3 are removed. The log is now a snapshot of current state.
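The before/after behavior above can be sketched in a few lines. This is a simplified model, not Kafka's actual cleaner (which works segment by segment on disk); `LogEntry` and `compact` are hypothetical names for illustration:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** One record in the log: (offset, key, value). */
record LogEntry(long offset, String key, String value) {}

public class CompactionSketch {
    /** Keep only the latest entry per key, in offset order. */
    static List<LogEntry> compact(List<LogEntry> log) {
        Map<String, LogEntry> latest = new LinkedHashMap<>();
        for (LogEntry e : log) {
            latest.remove(e.key());  // drop the older version, if any
            latest.put(e.key(), e);  // newest entry for this key wins
        }
        return List.copyOf(latest.values());
    }

    public static void main(String[] args) {
        List<LogEntry> log = List.of(
            new LogEntry(1, "user-42", "old@mail.com"),
            new LogEntry(2, "user-99", "alice@mail.com"),
            new LogEntry(3, "user-42", "newer@mail.com"),
            new LogEntry(4, "user-42", "current@mail.com"));
        for (LogEntry e : compact(log)) {
            System.out.println("offset " + e.offset() + ": key=" + e.key() + ", " + e.value());
        }
    }
}
```

Running this prints only offsets 2 and 4, matching the compacted log above. Note the remove-before-put: it makes surviving entries come out in offset order rather than first-insertion order.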

```mermaid
graph TD
    L["Full Log: 100M records"] --> C["Compaction Process"]
    C --> S["Scan log segments"]
    S --> K["Build key -> latest offset map"]
    K --> W["Write new segment: keep only latest per key"]
    W --> CL["Compacted Log: 5M records"]
    style L fill:#000000,stroke:#ff0000,stroke-width:2px,color:#fff
    style C fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style S fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style K fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style W fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style CL fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
```

Compaction vs Retention

They solve different problems. Time-based retention says “delete everything older than 7 days.” Log compaction says “keep only the latest value per key, regardless of age.”

You can use both. Set retention to 7 days AND enable compaction. Recent data has full history (for consumers who need it). Old data has only the latest value per key (for consumers starting fresh).
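Kafka expresses "both" as the topic config `cleanup.policy=compact,delete`. A sketch of creating such a topic (the topic name, partition and replication counts are arbitrary examples):

```shell
# Create a topic that is both compacted and time-retained:
# segments older than 7 days are deleted; within retention, only the
# latest value per key is guaranteed to survive compaction.
kafka-topics.sh --bootstrap-server localhost:9092 --create \
  --topic user-updates \
  --partitions 3 --replication-factor 3 \
  --config cleanup.policy=compact,delete \
  --config retention.ms=604800000   # 7 days in milliseconds
```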

This connects to compaction in LSM-trees. Same idea: merge overlapping entries to reclaim space while preserving the latest state. Different domain (database vs message log), identical concept.

Tombstones: Deleting a Key

To delete a key from a compacted log, publish a message with that key and a null value. This is called a tombstone. The compactor sees the tombstone, removes all previous entries for that key, and eventually removes the tombstone itself after a configurable grace period (`delete.retention.ms`, 24 hours by default) so that all consumers have a chance to see the delete.

// Publish a tombstone to delete user-42 from the compacted topic
producer.send(new ProducerRecord<>("user-updates", "user-42", null));

If you don’t publish a tombstone, the key lives in the compacted log forever. Its value never gets cleaned up.
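On the consumer side, materializing state from a compacted topic means folding records into a map and treating a null value as a delete. A minimal sketch with plain collections standing in for consumed records (`materialize` is a hypothetical name):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StateMaterializer {
    /** Fold (key, value) records into current state; a null value is a tombstone. */
    static Map<String, String> materialize(List<Map.Entry<String, String>> records) {
        Map<String, String> state = new HashMap<>();
        for (Map.Entry<String, String> r : records) {
            if (r.getValue() == null) {
                state.remove(r.getKey());          // tombstone: forget the key
            } else {
                state.put(r.getKey(), r.getValue()); // later records overwrite earlier ones
            }
        }
        return state;
    }

    public static void main(String[] args) {
        // SimpleEntry (not Map.entry) because tombstone values are null.
        Map<String, String> state = materialize(List.of(
            new SimpleEntry<>("user-42", "current@mail.com"),
            new SimpleEntry<>("user-99", "alice@mail.com"),
            new SimpleEntry<>("user-42", null)));   // tombstone deletes user-42
        System.out.println(state);                  // only user-99 remains
    }
}
```

The same fold works whether the consumer reads the compacted snapshot or the raw history: compaction changes how many records it must process, not the logic.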

When to Use Compacted Topics

Compacted topics are a natural fit for change data capture (CDC) and event sourcing snapshot topics. You capture every database change as an event, but a new consumer doesn't need to replay the entire history: it reads the compacted topic to get current state, then consumes new events going forward.

It’s also useful for configuration distribution. Publish config changes to a compacted topic. Each service reads the latest config values on startup without processing historical changes.

At Oracle, we had a similar problem with NSSF notification messages. Every config change generated a notification event. Downstream systems consumed these events. But when a downstream system restarted, it had to replay all historical notifications to rebuild its state. Some configs had changed hundreds of times. We added a “current state” topic alongside the event stream: a compacted topic where each key was a config identifier and the value was the current config snapshot. Restarting consumers read from the snapshot topic first (fast), then switched to the live event stream (for new changes). Startup time dropped from minutes to seconds because the consumer no longer replayed the full history.

What I’m Learning

Log compaction bridges the gap between event streams and state snapshots. Events give you history. Compaction gives you current state. Having both means consumers can choose: replay everything for audit purposes, or read the compacted view for a fast start. The concept is the same as database compaction: multiple versions of the same key get merged into the latest one.

Do you use log compaction in your event pipelines, or do you manage snapshots separately?