You send an event to an external system. Your database marks it as sent. The external system never received it. Now your internal state is wrong and nobody knows.

This happens in every system that integrates with something outside its boundary. Network blips, missed CDC (change data capture) events, serialization bugs. Data drifts apart silently.

The Reconciliation Pattern

On a schedule, fetch records from both sides and compare.

@Scheduled(cron = "0 0 * * * *") // every hour
public void reconcile() {
    Set<String> internal = internalRepo.findIdsSince(lastReconciliation);
    Set<String> external = externalClient.fetchIdsSince(lastReconciliation);

    Set<String> missingExternal = new HashSet<>(internal);
    missingExternal.removeAll(external); // we have it, they don't

    Set<String> missingInternal = new HashSet<>(external);
    missingInternal.removeAll(internal); // they have it, we don't

    if (!missingExternal.isEmpty()) {
        alertService.warn("Records missing from external: " + missingExternal.size());
    }
    if (!missingInternal.isEmpty()) {
        alertService.warn("Records missing internally: " + missingInternal.size());
    }
    lastReconciliation = Instant.now(); // advance the watermark for the next run
}

Three outcomes from comparison:

  1. Match. Both sides agree. Most records fall here.
  2. Missing externally. You sent it, they don’t have it. Lost event, failed delivery, serialization bug.
  3. Missing internally. They have it, you don’t. Phantom record, or your system dropped an inbound event.

graph TD
    F1[Fetch Internal Records] --> C{Compare}
    F2[Fetch External Records] --> C
    C --> M[Match: No Action]
    C --> ME[Missing External: Resend or Alert]
    C --> MI[Missing Internal: Investigate]
    ME --> R{Auto-fix safe?}
    R -->|Yes| AF[Retry Send]
    R -->|No| MR[Manual Review Queue]
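Pulled out of the scheduled job, the comparison step is a pure function and easy to unit-test. A minimal sketch, where `ReconciliationResult` is a name introduced here for illustration:

```java
import java.util.HashSet;
import java.util.Set;

// Pure comparison step, separated from repository/client I/O so it can be
// tested without a database or an external system.
public class Reconciler {

    public record ReconciliationResult(Set<String> missingExternal,
                                       Set<String> missingInternal) {
        public boolean isClean() {
            return missingExternal.isEmpty() && missingInternal.isEmpty();
        }
    }

    public static ReconciliationResult compare(Set<String> internal, Set<String> external) {
        Set<String> missingExternal = new HashSet<>(internal);
        missingExternal.removeAll(external); // we have it, they don't

        Set<String> missingInternal = new HashSet<>(external);
        missingInternal.removeAll(internal); // they have it, we don't

        return new ReconciliationResult(missingExternal, missingInternal);
    }

    public static void main(String[] args) {
        var result = compare(Set.of("a", "b", "c"), Set.of("b", "c", "d"));
        System.out.println("missing external: " + result.missingExternal()); // [a]
        System.out.println("missing internal: " + result.missingInternal()); // [d]
    }
}
```

Keeping the set logic free of I/O also means the scheduled job shrinks to fetch, compare, act.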

Auto-fix vs Alert

The temptation is to auto-fix everything. Missing externally? Resend. Missing internally? Re-ingest. But auto-fixing can make things worse if the root cause is a bug rather than a transient failure. You might resend a corrupted record, or ingest a duplicate.

Safe to auto-fix: idempotent operations where resending is harmless. Unsafe: anything that creates side effects (charging a user, triggering a notification). For unsafe cases, send to a manual review queue or a dead-letter queue (DLQ).
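One way to keep that decision explicit is a small policy object that gates auto-fix on record type. A sketch under the assumption that record types are known strings; the type names and the two actions are hypothetical placeholders:

```java
import java.util.Set;

// Auto-fix decision: only record types where duplicate delivery is harmless
// get retried automatically; everything else goes to manual review.
public class AutoFixPolicy {

    // Hypothetical record types where resending the same payload twice is safe.
    private static final Set<String> IDEMPOTENT_TYPES = Set.of("CONFIG_SYNC", "STATUS_UPDATE");

    public enum Action { RETRY_SEND, MANUAL_REVIEW }

    public static Action decide(String recordType) {
        return IDEMPOTENT_TYPES.contains(recordType)
                ? Action.RETRY_SEND     // no side effects on duplicate delivery
                : Action.MANUAL_REVIEW; // e.g. PAYMENT, NOTIFICATION
    }
}
```

An allowlist beats a denylist here: a new record type defaults to manual review until someone deliberately marks it idempotent.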

Reconciliation Frequency

How often depends on how much drift you can tolerate. Financial systems reconcile every few minutes. Config syncing might reconcile daily. The SLO for data freshness dictates the schedule.
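Since the right frequency varies per deployment, the interval belongs in configuration rather than code. A minimal sketch, assuming a hypothetical `RECONCILE_INTERVAL` env var holding an ISO-8601 duration:

```java
import java.time.Duration;

// The reconciliation interval comes from configuration, so a payments
// service can run every few minutes while config sync runs daily.
public class ReconciliationConfig {

    // Parse the configured interval, falling back to hourly when unset.
    static Duration interval(String configured) {
        return configured == null ? Duration.ofHours(1) : Duration.parse(configured);
    }

    public static void main(String[] args) {
        Duration interval = interval(System.getenv("RECONCILE_INTERVAL"));
        System.out.println("Reconciling every " + interval); // e.g. "Reconciling every PT1H"
    }
}
```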

At Oracle, we reconciled NSSF notification records against actual network function registrations hourly. Most hours, zero mismatches. But roughly once a week, we’d find 2-3 records where a CDC event was missed during a network function restart. The reconciliation job caught these within an hour instead of waiting for a user to report incorrect state. Structured logs with correlation IDs made it easy to trace exactly where the event was lost.

What I’m Learning

Reconciliation isn’t glamorous. It’s a scheduled job that compares two lists. But it’s the safety net for every distributed integration. You can build the most reliable event pipeline in the world and something will still drift eventually. The question isn’t whether your systems will disagree. It’s how fast you’ll notice.

How does your team handle data consistency across system boundaries?