Sometimes the change you need to make breaks compatibility. You can’t add a default. The field type genuinely needs to change. You can’t keep the old schema. Here’s what you do instead.
The New Topic Strategy The cleanest approach: create a new topic with the new schema. Producers write to both old and new topics in parallel. Consumers migrate to the new topic one by one. When all consumers are on the new topic, stop writing to the old one.
Kafka retains messages for days or weeks. Your consumer code will be updated independently of your producer. That means old messages need to be readable by new consumer code, and new messages need to be readable by old consumer code. You can’t just change a field.
What Backward and Forward Mean Backward compatibility: a new consumer can read messages written with the old schema. If you add an optional field with a default, old messages (which don’t have that field) are still valid.
Service A writes a Kafka message with field user_id. Service B reads it. Service A’s team renames it to userId next sprint. Service B starts throwing deserialization errors at runtime. Neither team knew about the other.
The Problem In a microservices system passing messages through Kafka, producers and consumers evolve independently. There’s no enforced contract. A producer can change a field name, add a required field, or change a data type, and the consumer finds out when deserialization fails in production.
Real-time results are fast and approximate. Historical results are slow and accurate. The tension between them is where Lambda and Kappa architecture come from.
Lambda: Two Pipelines Lambda runs two parallel systems. The batch layer processes all historical data on a schedule (Spark on HDFS, every few hours) and produces ground truth. The speed layer processes the live stream (Kafka Streams or Flink) for low-latency results. The serving layer merges both: “latest batch result plus stream delta since the last batch.
There are two clocks in any stream processing system. Event time: when the click actually happened, recorded in the payload. Processing time: when your system received it. On a healthy network they’re close. In reality they’re not.
Mobile clients buffer events when offline. Retries add delay. A click at 10:00:05 might reach your processor at 10:00:47. The 10:00 window has long since closed.
The Problem With Never Waiting If you never close a window, you never produce output.
Aggregating over an infinite stream sounds easy until you realize you have no idea when it ends. You need to cut it into chunks. That’s what windows are.
Three Window Types Tumbling windows are fixed, non-overlapping buckets. “Clicks per minute” is a tumbling window: minute 1, minute 2, minute 3, no overlap. Simple to implement, but events that span the boundary get split across buckets.
Sliding windows overlap. “Average clicks in the last 5 minutes, recomputed every minute” means each event can appear in up to 5 windows.
You have 3 consumers reading from 6 Kafka partitions. One consumer crashes. The remaining 2 need to pick up its partitions. That handoff isn’t as smooth as you’d hope.
Your event log has 100 million records. Key ‘user-42’ has been updated 500 times. You only care about the latest value. But deleting old entries would break consumers who haven’t caught up yet.
Your consumer retried a bad message 10,000 times. It will never succeed. Dead letter queues catch the messages that can’t be processed so the rest of your system keeps moving.