Replication Lag: The Bug That Isn’t a Bug
In one of the companies I worked at, we had this intermittent bug. A service would process a request, immediately query its status, and see stale data. “System’s dropping requests!” the team said.
Logs showed the write succeeded. The database confirmed the data was there. It took me embarrassingly long to realize: not a bug. Replication lag.
The write went to the leader. The user’s next read hit a follower that hadn’t caught up yet. Simple timing issue, but it looked exactly like data loss to users.
There are three ways this shows up in production.
Read-after-write problems (a.k.a. read-your-writes consistency): You write data, read it back immediately, and see the old value. We fixed this by routing authenticated users to the leader for 5 seconds after any write. It adds latency, but users prefer slow and correct over fast and wrong.
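A minimal sketch of that routing rule, assuming an in-memory map of each user's last write time (in production this would live in a shared session store) and placeholder replica names:

```python
import time

LEADER = "leader"
FOLLOWERS = ["follower-a", "follower-b"]
PIN_SECONDS = 5  # the 5-second window from the fix described above

# Hypothetical in-memory record of each user's last successful write.
last_write_at: dict[str, float] = {}

def record_write(user_id: str) -> None:
    """Call after every successful write by this user."""
    last_write_at[user_id] = time.monotonic()

def pick_replica(user_id: str) -> str:
    """Route reads to the leader inside the pin window, else any follower."""
    written = last_write_at.get(user_id)
    if written is not None and time.monotonic() - written < PIN_SECONDS:
        return LEADER
    # Outside the window, any follower is acceptable.
    return FOLLOWERS[hash(user_id) % len(FOLLOWERS)]

record_write("alice")
print(pick_replica("alice"))  # within 5s of her write -> "leader"
```

The trade-off is exactly the one mentioned: for 5 seconds after a write, that user loses the read-scaling benefit of followers in exchange for seeing their own data.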
Time going backwards: User refreshes twice. First refresh shows new data (hit follower A). Second refresh shows old data (hit follower B that’s lagging). We used sticky sessions so same user always hits same follower.
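The sticky-session fix can be sketched as a deterministic user-to-follower mapping, so repeated reads never jump between replicas with different lag. Replica names here are placeholders:

```python
import hashlib

FOLLOWERS = ["follower-a", "follower-b", "follower-c"]

def follower_for(user_id: str) -> str:
    """Hash the user ID to pick one follower, the same one every time.

    This gives monotonic reads per user: refreshing twice hits the same
    replica, so data never appears to go backwards between refreshes.
    """
    digest = hashlib.sha256(user_id.encode()).digest()
    return FOLLOWERS[int.from_bytes(digest[:4], "big") % len(FOLLOWERS)]

# Same user -> same follower on every refresh.
print(follower_for("alice") == follower_for("alice"))  # True
```

In practice the load balancer usually does this (hashing on a session cookie or user ID), but the principle is the same. One caveat: if the chosen follower dies, the user gets remapped and can see one backwards jump.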
Causality violations: In a conversation thread, you see the reply before the original message. Fix is to write related data to same partition or track causality with version vectors.
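One common way to enforce this ordering, sketched below with hypothetical names: each write returns the leader's log position, the client attaches that token to subsequent reads, and a follower serves the read only once it has replayed at least that far. This is a simplification of full version vectors, but it captures the causality check:

```python
class Follower:
    """Toy follower that tracks how far it has replayed the leader's log."""

    def __init__(self) -> None:
        self.applied_position = 0  # highest log position replayed so far

    def replicate_up_to(self, position: int) -> None:
        self.applied_position = max(self.applied_position, position)

    def read(self, min_position: int):
        """Refuse the read if we haven't caught up to the position the
        client observed -- so a reply can never appear before its original."""
        if self.applied_position < min_position:
            return None  # caller retries, waits, or falls back to the leader
        return f"data as of position {min_position}"

f = Follower()
reply_token = 42            # log position the leader returned for the reply
print(f.read(reply_token))  # None: this follower is still lagging
f.replicate_up_to(reply_token)
print(f.read(reply_token))  # served once caught up
```

Writing the whole thread to one partition achieves the same end more cheaply, since a single partition applies its writes in order.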
The common thread in all three: two reads, seconds apart, and the load balancer sends them to different followers with different amounts of lag.
The problem wasn’t losing data. It was not handling the lag properly. Async replication gives you performance, but you have to code for the inconsistency window.
Have you debugged phantom “data loss” that was actually replication lag?