Read Replicas: Hidden Consistency Traps

User updates their profile name. Page refreshes. Old name is still showing. They click refresh again. New name appears.

Your code is fine. Your database is fine. The read hit a replica that was 200ms behind the primary. Welcome to replication lag.

The Setup

Primary handles writes. Replicas handle reads. Replication is asynchronous. There’s always a lag, usually milliseconds, sometimes seconds under load.

// Write goes to primary
@Transactional
public void updateProfile(String userId, String newName) {
    userRepo.save(new User(userId, newName));  // Primary
}

// Read goes to replica
public User getProfile(String userId) {
    return readOnlyUserRepo.findById(userId);  // Replica (stale!)
}

User writes to primary, reads from replica. Replica hasn’t caught up. User sees stale data. Files a bug report.

Read-Your-Own-Writes

The most common fix: after a write, route subsequent reads from that user to the primary for a short window.

public User getProfile(String userId, boolean recentlyUpdated) {
    if (recentlyUpdated) {
        return primaryUserRepo.findById(userId);  // Read from primary
    }
    return replicaUserRepo.findById(userId);  // Read from replica
}

How do you know “recently updated”? Options: set a cookie with a timestamp after writes, track it in a session store, or use a cache flag with a short TTL.

// After write, set a flag
cache.put("recent-write:" + userId, true, Duration.ofSeconds(5));

// Before read, check the flag
boolean recentWrite = cache.get("recent-write:" + userId) != null;
return getProfile(userId, recentWrite);

Simple. Five seconds covers replication lag under normal conditions. The primary takes extra reads only from users who just wrote something.

graph TD
    W[Write Request] --> P[(Primary DB)]
    P --> F[Set recent-write flag]
    R[Read Request] --> C{Recent Write?}
    C -->|Yes| P
    C -->|No| RP[(Read Replica)]
    style W fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style P fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style F fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style R fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style C fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style RP fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
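The snippets above assume a `cache` with TTL support. If you want to see the whole mechanism in one place, here is a minimal in-memory sketch. `RecentWriteTracker`, `markWrite`, and `shouldReadFromPrimary` are names I made up for illustration, and it assumes a single application instance; with several instances behind a load balancer you would need a shared store instead of a local map.

```java
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical helper: remembers which users wrote recently so their reads can go to the primary. */
class RecentWriteTracker {
    private final Map<String, Long> expiryByUser = new ConcurrentHashMap<>();
    private final long windowMillis;
    private final LongSupplier clockMillis; // injectable so the window can be tested without sleeping

    RecentWriteTracker(Duration window, LongSupplier clockMillis) {
        this.windowMillis = window.toMillis();
        this.clockMillis = clockMillis;
    }

    /** Call after a successful write to the primary. */
    void markWrite(String userId) {
        expiryByUser.put(userId, clockMillis.getAsLong() + windowMillis);
    }

    /** True while the user's read-your-writes window is still open. */
    boolean shouldReadFromPrimary(String userId) {
        Long expiry = expiryByUser.get(userId);
        if (expiry == null) {
            return false;
        }
        if (clockMillis.getAsLong() >= expiry) {
            expiryByUser.remove(userId, expiry); // lazy cleanup of the expired entry
            return false;
        }
        return true;
    }
}
```

In production code you would construct it with `System::currentTimeMillis`, call `markWrite(userId)` right after the save, and pass `shouldReadFromPrimary(userId)` as the `recentlyUpdated` argument to `getProfile`.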

The Traps I’ve Seen

Replica lag under load. Normal lag is 50ms. During batch jobs or spikes, it jumps to 10 seconds. Your 5-second read-your-writes window isn’t enough anymore.
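One possible mitigation, sketched under assumptions: size the window from the lag you currently observe instead of hard-coding five seconds. `AdaptiveWindow`, its `floor`, `ceiling`, and `safetyFactor` parameters are hypothetical, and the sketch assumes something else (your monitoring, a periodic query against the replica) supplies the observed lag.

```java
import java.time.Duration;

/** Hypothetical helper: derives the read-your-writes window from observed replica lag. */
class AdaptiveWindow {
    private final Duration floor;      // never shrink below this, e.g. the 5-second default
    private final Duration ceiling;    // cap so a stuck replica cannot pin users to the primary indefinitely
    private final double safetyFactor; // headroom multiplier applied to the observed lag

    AdaptiveWindow(Duration floor, Duration ceiling, double safetyFactor) {
        this.floor = floor;
        this.ceiling = ceiling;
        this.safetyFactor = safetyFactor;
    }

    /** Scales the observed lag by the safety factor, then clamps it to [floor, ceiling]. */
    Duration windowFor(Duration observedLag) {
        long scaled = (long) Math.ceil(observedLag.toMillis() * safetyFactor);
        long clamped = Math.max(floor.toMillis(), Math.min(ceiling.toMillis(), scaled));
        return Duration.ofMillis(clamped);
    }
}
```

With a 5-second floor, a 60-second ceiling, and a factor of 2, a 50ms lag keeps the 5-second window, while a 10-second lag stretches it to 20 seconds. The trade-off is that lag spikes now push more reads onto the primary, exactly when it may already be busy.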

Cross-user consistency. User A posts a comment. User B loads the page. User B’s read hits a lagging replica. Comment is missing. Not a read-your-writes issue since it’s a different user. This is harder to fix and often acceptable.

Monitoring blind spots. At Oracle, we monitored average replica lag. Average was 100ms. But p99 lag during our nightly batch was 8 seconds. Averages lie, just like with latency. We only caught it when users complained about “stale data” every morning.
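To make the average-versus-p99 gap concrete, here is a small sketch that computes both from a set of lag samples. `LagStats` is a hypothetical name, the percentile uses the simple nearest-rank method, and how you collect the samples is up to your monitoring setup.

```java
import java.util.Arrays;

/** Hypothetical helper: summarizes replica lag samples (in milliseconds). */
class LagStats {
    /** Arithmetic mean; returns 0.0 for an empty sample set. */
    static double mean(long[] samplesMillis) {
        return Arrays.stream(samplesMillis).average().orElse(0.0);
    }

    /** Nearest-rank percentile, with p in the range (0, 100]. */
    static long percentile(long[] samplesMillis, double p) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samplesMillis.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

With made-up numbers, 98 samples at 20ms and two at 8,000ms, the mean comes out around 180ms while the p99 is 8,000ms. Alerting on the mean would never have fired.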

What I’m Learning

Read replicas are the easiest way to scale reads and the easiest way to introduce subtle consistency bugs. The bugs don’t crash anything. They just confuse users. And confused users file vague bug reports that are impossible to reproduce.

The rule I follow: any endpoint that reads data the user just wrote should go to primary. Everything else can hit replicas. It’s a session guarantee your users expect even if they can’t name it.

Have you dealt with replica lag surprises? How do you route reads?