There are two clocks in any stream processing system. Event time: when the click actually happened, recorded in the payload. Processing time: when your system received it. On a healthy network they’re close. In reality they’re not.

Mobile clients buffer events when offline. Retries add delay. A click at 10:00:05 might reach your processor at 10:00:47. The 10:00 window has long since closed.

The Problem With Never Waiting#

If you never close a window, you never produce output. If you close it immediately, you drop late events. Neither is acceptable.

Watermarks are the compromise. A watermark is a statement: “I believe all events up to time T have arrived.” Events after the watermark are late. Windows ending before T can be closed and results can be emitted.

How you set the watermark: max(event_time_seen) - allowed_lateness. If the latest event time seen is 10:00:50 and allowed lateness is 30 seconds, the watermark is 10:00:20. Any window ending before 10:00:20 is safe to close.

%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#000000','primaryTextColor':'#00ff00','primaryBorderColor':'#00ff00','lineColor':'#00ff00','secondaryColor':'#000000','tertiaryColor':'#000000','noteBkgColor':'#000000','noteBorderColor':'#00ff00','noteTextColor':'#00ff00'}}}%% sequenceDiagram autonumber participant C as Client participant P as Processor participant W as Watermark C->>P: Event (event_time=10:00:05, arrives 10:00:40) P->>W: Advance watermark to 10:00:10 C->>P: Late event (event_time=10:00:08, arrives 10:01:20) W-->>P: Watermark already at 10:00:50 P-->>C: Route to side output (dead letter)

The trade-off is direct: larger allowed lateness means more complete results but slower output. Smaller means faster output but more events routed to a dead letter queue or discarded.

At Oracle#

Our notification pipeline processed user activity events from mobile clients. Some users commuted underground and buffered events for up to 5 minutes. Early version used processing-time watermarks: the watermark advanced based on wall clock, ignoring event timestamps entirely.

Result: those mobile events looked “late” and were dropped. We were losing roughly 12% of events and patching the daily activity reports manually every morning.

The fix was switching to event-time watermarks with a 6-minute allowed lateness. Late event drop rate fell from about 12% to under 0.5%. The manual patching job went away. Embarrassingly simple fix once you understood what processing-time watermarks actually meant.

What I’m Learning#

Choosing the right allowed lateness feels like tuning a parameter you can’t really know in advance. How do teams figure out the right value without just guessing?