Three consumers, six partitions. Each consumer handles two partitions. Consumer C crashes. Who takes over C’s partitions? Both A and B need to know C is gone, agree on the new assignment, and resume processing. This coordination is called rebalancing, and it’s one of the most disruptive events in a Kafka consumer group.

The Stop-the-World Problem#

In the eager rebalancing protocol (the default assignor behavior in older Kafka clients), when any consumer joins or leaves, ALL consumers stop processing. Every consumer revokes all its partitions. The group coordinator reassigns everything from scratch. Every consumer gets its new assignment and resumes.

During this window, nothing is being processed. For a group with 50 consumers and 200 partitions, this pause can last seconds. If a consumer’s session times out because it was slow to process a batch, it triggers a rebalance, which makes processing slower, which triggers more timeouts. A cascading mess.

%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#000000','primaryTextColor':'#00ff00','primaryBorderColor':'#00ff00','lineColor':'#00ff00','secondaryColor':'#000000','tertiaryColor':'#000000','noteBkgColor':'#000000','noteBorderColor':'#00ff00','noteTextColor':'#00ff00'}}}%%
sequenceDiagram
    autonumber
    participant A as Consumer A
    participant GC as Group Coordinator
    participant B as Consumer B
    Note over GC: Consumer C missed heartbeat
    GC->>A: Revoke all partitions
    GC->>B: Revoke all partitions
    Note over A,B: All processing stops
    GC->>GC: Reassign partitions
    GC->>A: Assign P1, P2, P3
    GC->>B: Assign P4, P5, P6
    Note over A,B: Processing resumes

Cooperative (Incremental) Rebalancing#

The fix: don’t revoke everything. Only revoke the partitions that are actually moving. Consumer C dies. Its partitions (P5, P6) need to move. Consumers A and B keep processing their existing partitions (P1-P4) while only the orphaned partitions get reassigned.

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;

// Enable cooperative rebalancing
Properties props = new Properties();
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
    CooperativeStickyAssignor.class.getName());

Two rounds instead of one. First round: coordinator identifies partitions to move. Second round: those partitions get assigned to new owners. Everything else continues uninterrupted.

The difference is dramatic. Instead of a full stop, only the affected partitions experience a brief pause. With 50 consumers and 200 partitions, if one consumer dies affecting 4 partitions, 196 partitions keep flowing.
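The "only move the orphans" idea can be sketched as a toy simulation. This is not Kafka's actual CooperativeStickyAssignor (the class and method names here are made up for illustration), but it shows the invariant: surviving consumers keep every partition they already own, and only the departed consumer's partitions get redistributed.

```java
import java.util.*;

public class StickyReassignDemo {
    // Reassign only the partitions owned by the departed consumer,
    // handing each orphan to the currently least-loaded survivor.
    static Map<String, List<Integer>> reassign(Map<String, List<Integer>> current,
                                               String departed) {
        Map<String, List<Integer>> next = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> e : current.entrySet()) {
            if (!e.getKey().equals(departed)) {
                // Existing ownership is never touched -- this is the "sticky" part.
                next.put(e.getKey(), new ArrayList<>(e.getValue()));
            }
        }
        for (int orphan : current.get(departed)) {
            String target = next.entrySet().stream()
                .min(Comparator.comparingInt(e -> e.getValue().size()))
                .get().getKey();
            next.get(target).add(orphan);
        }
        return next;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> current = new LinkedHashMap<>();
        current.put("A", new ArrayList<>(List.of(1, 2)));
        current.put("B", new ArrayList<>(List.of(3, 4)));
        current.put("C", new ArrayList<>(List.of(5, 6)));
        // C dies; only P5 and P6 move. A and B never stop processing P1-P4.
        System.out.println(reassign(current, "C")); // prints {A=[1, 2, 5], B=[3, 4, 6]}
    }
}
```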

Offset Management During Rebalancing#

Here’s where it gets tricky. Consumer C was processing partition P5, committed offset 1000, but had actually processed up to offset 1050 before crashing. The new owner of P5 starts at the committed offset: 1000. Events 1001 through 1050 get reprocessed.

This is why idempotent consumers are essential. Rebalancing guarantees at-least-once delivery, not exactly-once. If your consumer isn’t idempotent, rebalancing introduces duplicates. Every. Single. Time.
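The usual defense is to make the processing step itself idempotent by tracking which events have already been applied. A minimal sketch, assuming events carry a unique ID (the in-memory set here is illustrative; in production the seen-ID record typically lives in the same transactional store as the side effects, so a replay after rebalancing finds it):

```java
import java.util.HashSet;
import java.util.Set;

public class IdempotentHandler {
    // IDs of events already applied. In production this would be persisted
    // atomically with the side effect, not held in memory.
    private final Set<String> applied = new HashSet<>();

    // Returns true if the event was applied, false if it was a duplicate
    // (e.g. replayed from the committed offset after a rebalance).
    public boolean handle(String eventId, Runnable sideEffect) {
        if (!applied.add(eventId)) {
            return false; // already processed: skip, don't re-apply
        }
        sideEffect.run();
        return true;
    }
}
```

With this in place, the replay of events 1001-1050 after a rebalance becomes 50 cheap no-ops instead of 50 duplicated side effects.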

Session Timeouts and Heartbeats#

The coordinator detects a dead consumer through missed heartbeats. Two critical settings:

session.timeout.ms: how long the coordinator waits without a heartbeat before declaring the consumer dead. Too short and slow GC pauses trigger unnecessary rebalances. Too long and a crashed consumer’s partitions sit idle.

max.poll.interval.ms: maximum time between poll() calls. If processing a batch takes longer than this, the coordinator assumes the consumer is stuck and rebalances.
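As a starting point, the values below mirror the modern client defaults (45 s session timeout, 5 min poll interval); they are illustrative, not a recommendation -- tune max.poll.interval.ms against your actual worst-case batch time and session.timeout.ms against your longest GC pause:

```java
import java.util.Properties;

public class TimeoutConfig {
    public static Properties consumerTimeouts() {
        Properties props = new Properties();
        // Coordinator declares the consumer dead after this long without
        // a heartbeat (45s is the default in recent Kafka versions).
        props.setProperty("session.timeout.ms", "45000");
        // Maximum gap between poll() calls -- must exceed your worst-case
        // batch processing time with headroom, or rebalances cascade.
        props.setProperty("max.poll.interval.ms", "300000");
        // Heartbeats come from a background thread; keep this at roughly
        // one third of session.timeout.ms.
        props.setProperty("heartbeat.interval.ms", "15000");
        return props;
    }
}
```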

Tuning these wrong is the number one cause of cascading rebalances. I’ve seen teams set max.poll.interval.ms to 10 seconds while their batch processing regularly took 15. (With modern clients, heartbeats run on a background thread, so slow batches violate the poll interval, not the session timeout.) Constant rebalancing, constant duplicate processing.

At Oracle, we had a similar coordination problem with NSSF config processing workers. Multiple workers consumed config change events from a shared queue. When a worker went down for deployment, the remaining workers needed to pick up its work. Our initial approach was the “stop-the-world” style: pause all workers, redistribute, resume. Deploys caused visible processing delays. We switched to a sticky assignment: each worker tracked which config ranges it owned. When a worker left, only its ranges were redistributed. The other workers never paused. Config processing continuity during deployments improved significantly, and we stopped seeing the retry storm that the pause had been causing (the 50% reduction in retry messages was partly from eliminating unnecessary rebalance-triggered retries).

What I’m Learning#

Consumer group rebalancing is one of those things that works invisibly until it doesn’t. The eager protocol is simple to understand but brutal at scale. Cooperative rebalancing is the right default for any non-trivial consumer group. And regardless of protocol, idempotency on the consumer side isn’t optional. Rebalancing will replay events, and your system needs to handle that gracefully.

Have you been bitten by consumer group rebalancing? What timeout settings worked for your workload?