Read Repair and Anti-Entropy: Healing Stale Replicas
Your quorum read succeeds. Two out of three replicas responded with the latest data. But replica 3 is stale. How does it catch up?
Two ways: fix it when you notice, or fix everything constantly.
Read Repair
Client reads from multiple replicas, detects one has stale data (older version number or timestamp), writes the newest value back to the stale replica. Happens during normal reads.
Works great for hot data. Frequently accessed keys stay in sync. Problem is cold data. If nobody reads a key for weeks, that stale replica stays stale forever.
(Diagram: client reads from 3 replicas, notices R3 is behind, fixes it inline.)
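A minimal sketch of that path, assuming last-write-wins by timestamp (all names here are hypothetical, not any particular database's API):

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A toy replica: key -> (value, timestamp)."""
    store: dict = field(default_factory=dict)

    def read(self, key):
        return self.store.get(key)  # (value, timestamp) or None

    def write(self, key, value, ts):
        current = self.store.get(key)
        if current is None or ts > current[1]:  # last-write-wins
            self.store[key] = (value, ts)


def quorum_read_with_repair(replicas, key):
    """Read from every replica, return the newest value, and write
    that value back to any replica holding an older version."""
    responses = [(r, r.read(key)) for r in replicas]
    newest = max((v for _, v in responses if v is not None),
                 key=lambda v: v[1], default=None)
    if newest is None:
        return None
    value, ts = newest
    for replica, v in responses:
        if v is None or v[1] < ts:
            replica.write(key, value, ts)  # the repair happens inline
    return value
```

A real system would read from a quorum rather than all replicas and issue the repair writes asynchronously, but the shape is the same: compare versions, push the winner back to the losers.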
Anti-Entropy
Background process continuously compares replicas and syncs differences. Runs whether data is accessed or not.
Naive approach scans every key on every replica. Expensive. Better way: Merkle trees. Hash ranges of keys into a tree structure. Compare tree hashes between replicas. Only sync the ranges that differ.
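A toy sketch of the idea, with one hash per key range (names hypothetical; a real Merkle tree also recurses through interior nodes, so replicas exchange only the divergent subtrees instead of all leaf hashes):

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def range_hashes(store, ranges):
    """One leaf hash per key range: hash the sorted (key, value) pairs."""
    out = []
    for lo, hi in ranges:
        items = sorted((k, v) for k, v in store.items() if lo <= k < hi)
        out.append(h(repr(items).encode()))
    return out


def merkle_root(hashes):
    """Pairwise-hash leaves up to a single root."""
    level = list(hashes)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate last hash on odd levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def ranges_to_sync(store_a, store_b, ranges):
    """Compare roots first; if they differ, find the divergent ranges."""
    ha = range_hashes(store_a, ranges)
    hb = range_hashes(store_b, ranges)
    if merkle_root(ha) == merkle_root(hb):
        return []  # replicas agree; transfer nothing
    return [r for r, x, y in zip(ranges, ha, hb) if x != y]
```

The payoff: two in-sync replicas compare a single root hash, and a divergent pair transfers only the ranges whose hashes differ, not every key.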
Fixes all data eventually, but “eventually” might be hours or days. Also adds constant network and CPU overhead.
The Trade-off
Read repair is cheap (only runs during reads) but incomplete (misses cold data). Anti-entropy is complete but expensive (always running, scanning everything).
Real systems use both. Cassandra, for example, does read repair on the read path and Merkle-tree anti-entropy via nodetool repair, typically run on a schedule rather than continuously. Different tuning for different priorities.
I’m still fuzzy on Merkle tree implementation details. How deep should the tree be? How often do you rebuild it? Need to dig into Cassandra’s source code.
The bigger lesson: distributed systems don’t “just work.” You need explicit mechanisms to heal divergence, and each mechanism has costs.
What healing strategy does your database use?