Two events arrive out of order. You don’t know they’re out of order. You process them anyway. The system ends up in a state that never should have existed.

Sequence Numbers as the Foundation

A global sequence number assigned to every write event is the most direct solution to ordering problems. Event 1, event 2, event 3. If event 6 arrives and you still haven’t seen event 4, you know something is late or missing. You wait, or request a replay, rather than blindly processing forward.
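Here is a minimal sketch of the "wait rather than process forward" behavior, assuming events carry an integer sequence number starting at 1. `ReorderBuffer` and its method names are hypothetical, made up for illustration, and not taken from any library.

```python
from typing import Dict, List, Tuple


class ReorderBuffer:
    """Holds early arrivals and releases events strictly by sequence number."""

    def __init__(self, first_seq: int = 1) -> None:
        self.next_seq = first_seq          # next sequence number we may process
        self.pending: Dict[int, str] = {}  # early arrivals, keyed by sequence number

    def accept(self, seq: int, payload: str) -> List[Tuple[int, str]]:
        """Take one event; return every event that is now safe to process, in order."""
        if seq < self.next_seq or seq in self.pending:
            return []  # already processed or already buffered: treat as a duplicate
        self.pending[seq] = payload
        ready = []
        # Drain the buffer for as long as the next expected number is present.
        while self.next_seq in self.pending:
            ready.append((self.next_seq, self.pending.pop(self.next_seq)))
            self.next_seq += 1
        return ready


buf = ReorderBuffer()
released: List[Tuple[int, str]] = []
# Events 6 and 5 show up before 4; nothing past 3 is released until 4 arrives.
for seq, payload in [(1, "a"), (2, "b"), (3, "c"), (6, "f"), (5, "e"), (4, "d")]:
    released.extend(buf.accept(seq, payload))
```

A real consumer would also want a timeout on how long it waits for a missing number before asking for a replay. This sketch leaves that out.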

This sounds simple. It’s surprisingly hard in distributed systems, because assigning globally unique, monotonically increasing sequence numbers requires coordination. Something has to be the sequence number authority, and every write has to go through it.
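To make the "authority" idea concrete, here is a single-process sketch where a lock stands in for the coordination. `SequenceAssigner` is a hypothetical name, and a real distributed system would need something stronger than an in-memory counter (a database sequence or a single-writer log, for example). Treat this as an illustration of the bottleneck, not a design.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SequenceAssigner:
    """Single authority that stamps each event with the next sequence number."""

    def __init__(self, start: int = 1) -> None:
        self._lock = threading.Lock()
        self._next = start

    def assign(self, event: dict) -> dict:
        # The lock is the coordination point: only one writer gets a number at a time.
        with self._lock:
            seq = self._next
            self._next += 1
        return {**event, "seq": seq}


assigner = SequenceAssigner()
# Eight concurrent writers all go through the same assigner.
with ThreadPoolExecutor(max_workers=8) as pool:
    stamped = list(pool.map(assigner.assign, ({"id": i} for i in range(1000))))

seqs = sorted(e["seq"] for e in stamped)
```

Every writer serializes on that one lock, which is exactly the coordination cost described above.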

Gap Detection

The value of sequence numbers goes beyond ordering. A gap in the sequence (you’ve seen 1, 2, 3, 5 but not 4) is an explicit signal that something was lost. Without sequence numbers, you might never know event 4 existed. With them, you detect the gap and can take action: pause processing, request retransmission, alert on-call.
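A gap check can be very small. The sketch below assumes sequence numbers are integers starting at 1 and that you can look at the set of numbers you have seen so far; `find_gaps` is a hypothetical helper name.

```python
from typing import Iterable, List, Tuple


def find_gaps(seen: Iterable[int], first_seq: int = 1) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) ranges of sequence numbers missing from `seen`."""
    gaps: List[Tuple[int, int]] = []
    expected = first_seq
    for seq in sorted(set(seen)):
        if seq < expected:
            continue  # below the starting point; not part of the range we track
        if seq > expected:
            gaps.append((expected, seq - 1))  # everything in between is missing
        expected = seq + 1
    return gaps


gaps = find_gaps([1, 2, 3, 5])
# gaps == [(4, 4)]
```

One limit worth knowing: a check like this only sees a gap once a later event has arrived. If the last events in a stream are lost, there is no higher number to reveal them.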

graph TD
    A[Write Event] --> B[Sequence Assigner]
    B --> C[Event with Sequence Number]
    C --> D[Consumers]
    D --> E{Gap Detected?}
    E -->|Yes| F[Pause and Request Retransmit]
    E -->|No| G[Process in Order]
    F --> H[Fill Gap]
    H --> G
    style A fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style B fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style C fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style D fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style E fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style F fill:#000000,stroke:#ff0000,stroke-width:2px,color:#fff
    style G fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style H fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff

At Oracle

We had an audit log pipeline that consumed events from multiple services and wrote them to a central store. The pipeline had no sequence numbers. Events occasionally arrived late and got written in the order they arrived, not the order they happened. We didn’t know this was happening.

I discovered it during a compliance review when a reported sequence of actions for an account didn’t line up with what the user claimed to have done. Adding sequence numbers and a gap check revealed we were silently dropping about 0.2% of events during high-load periods. That sounds small until you realize it’s audit data and “not there” is not an acceptable state for a record that might go into a regulatory report.

What I’m Learning

“Ordering guarantees in event-driven systems” covers the Kafka side of this. Sequence numbers are the lower-level primitive that makes those guarantees inspectable rather than just assumed.

Have you ever discovered missing events because of a gap in sequence numbers?