Write to primary. Read from replica. Assert. Fails intermittently. The classic flaky test in distributed systems. It’s not a bug in your code. It’s a bug in your test: you’re testing an eventually consistent system with strong-consistency assertions.

This confused me for longer than I’d like to admit.

The Polling Pattern#

The simplest fix: poll until the assertion passes or a timeout expires.

import static org.awaitility.Awaitility.await;
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.time.Duration;

@Test
void userUpdateEventuallyPropagates() {
    // Write to primary
    userService.updateEmail("user-42", "new@email.com");

    // Poll the replica until it catches up (Awaitility)
    await()
        .atMost(Duration.ofSeconds(5))
        .pollInterval(Duration.ofMillis(200))
        .untilAsserted(() -> {
            User user = replicaService.getUser("user-42");
            assertEquals("new@email.com", user.getEmail());
        });
}

Ugly? A bit. But it correctly tests what your system actually promises: the data will be consistent eventually, not immediately. The timeout represents your acceptable propagation window. If it takes longer than 5 seconds, something is actually broken.

graph TD
    W["Write to Primary"] --> P["Poll Replica"]
    P --> C{Data matches?}
    C -->|Yes| PASS["Test passes"]
    C -->|No| T{Timeout exceeded?}
    T -->|No| WAIT["Wait 200ms"] --> P
    T -->|Yes| FAIL["Test fails: consistency window exceeded"]
    style W fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style P fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style C fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style PASS fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style T fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style WAIT fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style FAIL fill:#000000,stroke:#ff0000,stroke-width:2px,color:#fff
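If you'd rather not pull in Awaitility, the same loop is easy to hand-roll. Here's a minimal sketch; the `Poller` class and its parameters are my own invention, not a library API:

```java
import java.time.Duration;
import java.util.function.Supplier;

class Poller {
    /**
     * Repeatedly evaluates the condition until it returns true or the
     * timeout expires. Returns whether the condition ever became true.
     */
    static boolean pollUntil(Supplier<Boolean> condition,
                             Duration timeout,
                             Duration interval) throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (System.nanoTime() < deadline) {
            if (condition.get()) {
                return true; // replica caught up within the consistency window
            }
            Thread.sleep(interval.toMillis()); // back off before re-checking
        }
        return condition.get(); // one final check at the deadline
    }
}
```

A test then asserts `pollUntil(...)` returned true, which keeps the "timeout means broken" semantics of the Awaitility version.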

Deterministic Testing with Controlled Clocks#

Polling is fine for integration tests, but it’s slow and non-deterministic. For unit tests of eventually consistent logic, control time itself.

import static org.junit.jupiter.api.Assertions.assertEquals;

import java.time.Duration;

@Test
void replicaSyncsWithinWindow() {
    TestClock clock = new TestClock();
    ReplicationManager replication = new ReplicationManager(clock);

    replication.write("key-1", "value-A");
    // Advance the clock past the replication interval
    clock.advance(Duration.ofMillis(500));
    replication.processPendingReplications();

    assertEquals("value-A", replication.readFromReplica("key-1"));
}

No waiting. No flakiness. You control when “time” passes and when replication happens. The test is fast and deterministic. The trade-off: you’re testing your replication logic, not the actual async propagation. Both types of tests are valuable.
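For reference, `TestClock` doesn't need to be anything fancy. Here's one possible shape for it and for a replication manager that applies pending writes after a fixed lag — both are illustrative sketches of mine, not code from a real system:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// A clock the test advances manually; production code would inject
// java.time.Clock.systemUTC() instead.
class TestClock {
    private Instant now = Instant.EPOCH;

    Instant now() { return now; }

    void advance(Duration d) { now = now.plus(d); }
}

// Applies each pending write to the replica once the (controlled) clock
// has moved past a fixed replication lag.
class ReplicationManager {
    private static final Duration REPLICATION_LAG = Duration.ofMillis(500);

    private final TestClock clock;
    private final Map<String, String> replica = new HashMap<>();
    private final Map<String, String> pending = new HashMap<>();
    private final Map<String, Instant> writtenAt = new HashMap<>();

    ReplicationManager(TestClock clock) { this.clock = clock; }

    void write(String key, String value) {
        pending.put(key, value);
        writtenAt.put(key, clock.now());
    }

    // Moves any write older than the replication lag onto the replica.
    void processPendingReplications() {
        pending.entrySet().removeIf(e -> {
            Instant readyAt = writtenAt.get(e.getKey()).plus(REPLICATION_LAG);
            if (!clock.now().isBefore(readyAt)) {
                replica.put(e.getKey(), e.getValue());
                return true;
            }
            return false;
        });
    }

    String readFromReplica(String key) { return replica.get(key); }
}
```

Because the lag only "elapses" when the test says so, the same test can also assert the negative case: read the replica *before* advancing the clock and confirm the write hasn't propagated yet.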

Testing Idempotency#

Eventually consistent systems often process the same event multiple times (due to retries, consumer group rebalancing, network duplicates). Testing idempotent consumers means deliberately replaying events and asserting no side effects.

import static org.junit.jupiter.api.Assertions.assertEquals;

@Test
void processingSameEventTwiceHasNoSideEffect() {
    OrderEvent event = new OrderEvent("order-123", "COMPLETED");

    eventHandler.handle(event);
    eventHandler.handle(event); // Replay

    // Should have exactly one order completion, not two
    assertEquals(1, orderRepo.countCompletedForOrder("order-123"));
}

This is the kind of test that seems unnecessary until a rebalance causes your system to process 1,000 duplicate events and double-charges a bunch of users.
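On the implementation side, the usual way to make that test pass is to deduplicate on a stable event ID before applying any side effect. A sketch of that idea, with illustrative names of my own — in production the processed-ID set lives in the same datastore as the side effect and is updated in the same transaction:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class OrderEvent {
    final String eventId; // stable unique ID, assigned by the producer
    final String orderId;
    final String status;

    OrderEvent(String eventId, String orderId, String status) {
        this.eventId = eventId;
        this.orderId = orderId;
        this.status = status;
    }
}

class IdempotentEventHandler {
    // In-memory for the sketch; real systems persist this alongside the
    // side effect so the check and the effect commit atomically.
    private final Set<String> processed = ConcurrentHashMap.newKeySet();
    private int completions = 0;

    void handle(OrderEvent event) {
        if (!processed.add(event.eventId)) {
            return; // duplicate delivery: skip the side effect entirely
        }
        if ("COMPLETED".equals(event.status)) {
            completions++; // stand-in for the real side effect (charge, email, ...)
        }
    }

    int completionCount() { return completions; }
}
```

Note that the dedupe key is the *event* ID, not the order ID — the same order can legitimately produce many distinct events.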

Event Ordering Challenges#

In eventually consistent systems, events can arrive out of order. Testing this means deliberately delivering events in the wrong order and verifying the system handles it. Can your consumer handle receiving “order shipped” before “order created”? If not, you need a test that proves it can.

This comes down to your ordering guarantees: if your system relies on strict ordering, write tests for what happens when that ordering breaks.
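One common way to survive the “shipped before created” case is to buffer events whose prerequisites haven't arrived yet and replay them once they do. A sketch under that assumption, with a two-event order lifecycle and names that are purely illustrative:

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Buffers "SHIPPED" events that arrive before their "CREATED" event,
// instead of dropping them or crashing.
class OrderEventConsumer {
    private final Set<String> created = new HashSet<>();
    private final Set<String> shipped = new HashSet<>();
    private final Map<String, Queue<String>> deferred = new HashMap<>();

    void onEvent(String orderId, String type) {
        if ("CREATED".equals(type)) {
            created.add(orderId);
            // Replay anything buffered while waiting for this order to exist.
            Queue<String> waiting = deferred.remove(orderId);
            if (waiting != null) {
                waiting.forEach(t -> onEvent(orderId, t));
            }
        } else if ("SHIPPED".equals(type)) {
            if (created.contains(orderId)) {
                shipped.add(orderId);
            } else {
                // Out-of-order delivery: park the event for later.
                deferred.computeIfAbsent(orderId, k -> new ArrayDeque<>()).add(type);
            }
        }
    }

    boolean isShipped(String orderId) { return shipped.contains(orderId); }
}
```

The test for this is exactly the inverted delivery: send “SHIPPED” first, assert nothing is marked shipped, then send “CREATED” and assert the buffered event took effect. (A real consumer would also bound and expire the buffer so a lost “CREATED” doesn't leak memory.)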

At Oracle, we had a persistent flaky test problem with NSSF config validation. Tests would write a config, immediately read it back from a read replica, and assert the value. Passed 90% of the time. Failed 10%. We blamed infrastructure, added retries to the test, added Thread.sleep(1000). Still flaky, just less often. The actual fix was accepting that the replica has lag and testing accordingly. We switched to polling assertions with a 3-second timeout for replica-dependent tests and deterministic tests for the config validation logic itself. Flaky test rate dropped from 10% to under 0.5%. Took me embarrassingly long to realize the test was wrong, not the system.

What I’m Learning#

Testing eventually consistent systems requires a mindset shift. Stop asserting immediate results. Start asserting eventual results within a bounded window. Use polling for integration tests, controlled clocks for unit tests, and always test idempotency explicitly. The hardest part isn’t writing the tests. It’s accepting that “write then immediately read” is a bug in your test, not a valid test pattern.

What’s your approach to testing systems with eventual consistency? Do you use polling, controlled clocks, or something else?