Your consumer retries a message. Fails. Retries again. Fails. Retries 10,000 more times. Still fails.

The message is malformed. It will never succeed. But your consumer doesn’t know that. It just keeps retrying, blocking every message behind it.

This is the poison pill problem. And dead letter queues (DLQs) are the fix.

The Poison Pill

Not all failures are transient. A database timeout might resolve on retry. A malformed JSON payload never will.
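One way to act on that distinction is to classify exceptions up front, so permanent failures skip the retry loop entirely. A minimal sketch in plain Java (the exception types and the `isRetryable` helper are illustrative, not from any framework):

```java
import java.util.Set;

public class FailureClassifier {
    // Exceptions that signal a permanent failure: retrying will never help.
    private static final Set<Class<? extends Exception>> FATAL = Set.of(
            IllegalArgumentException.class,      // e.g. malformed payload
            UnsupportedOperationException.class  // e.g. unknown event type
    );

    // Transient failures (timeouts, I/O hiccups) are worth retrying;
    // anything in the FATAL set should go straight to the DLQ.
    public static boolean isRetryable(Exception e) {
        return FATAL.stream().noneMatch(c -> c.isInstance(e));
    }
}
```

Spring Kafka expresses the same idea natively: `DefaultErrorHandler.addNotRetryableExceptions(...)` sends the listed exception types to the recoverer without burning retries on them.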

Without a DLQ, your consumer has two options: retry forever (blocking the queue) or skip the message (losing data). Both are bad.

A DLQ gives you a third option: move the bad message aside, keep processing the rest, and deal with the failure later.

@KafkaListener(topics = "orders")
@RetryableTopic(
    attempts = "3",
    backoff = @Backoff(delay = 1000, multiplier = 2),
    dltTopicSuffix = ".dlq"
)
public void processOrder(OrderEvent event) {
    orderService.create(event.getOrder());
}

@DltHandler
public void handleDlt(OrderEvent event) {
    log.error("Message failed after 3 attempts: {}", event);
    alertService.notify("DLQ: failed order " + event.getOrderId());
}

Three delivery attempts with exponential backoff (note that attempts counts total deliveries, not retries). Still failing? Route to the DLQ. Alert the team. Move on.

graph TD
    M[Message] --> C[Consumer]
    C --> S{Success?}
    S -->|Yes| K[Commit Offset]
    S -->|No| R{Retries Left?}
    R -->|Yes| C
    R -->|No| D[Dead Letter Queue]
    D --> A[Alert + Manual Review]

Replay and Recovery

The DLQ isn’t a graveyard. It’s a triage queue.

Once you fix the bug (or the downstream service recovers), replay the DLQ messages back into the original topic. At Oracle, we built a simple replay tool for NSSF notifications. Fix the root cause, replay, verify. That 50% reduction in retry noise I mentioned in Making Consumers Idempotent? Half of that came from stopping poison pills from clogging the retry loop.
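The replay tool itself isn't shown here, but the core loop is simple: drain the DLQ and re-produce each message to the original topic once the root cause is fixed. A dependency-free sketch that models topics as in-memory queues (all names are hypothetical; a real tool would use Kafka consumer/producer clients):

```java
import java.util.Deque;
import java.util.function.Predicate;

public class DlqReplayer {
    // Drain the DLQ back into the original topic, but only for messages the
    // fixed consumer can now handle; the rest stay parked in the DLQ.
    public static int replay(Deque<String> dlq, Deque<String> originalTopic,
                             Predicate<String> nowProcessable) {
        int replayed = 0;
        for (int i = dlq.size(); i > 0; i--) {
            String msg = dlq.poll();
            if (nowProcessable.test(msg)) {
                originalTopic.add(msg); // re-produce to the source topic
                replayed++;
            } else {
                dlq.add(msg);           // still broken: keep it parked
            }
        }
        return replayed;
    }
}
```

The "fix, replay, verify" loop maps onto this: the predicate encodes "did the fix actually cover this message," so a half-fixed batch doesn't poison the topic a second time.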

The key: make your consumers idempotent so replayed messages are safe.
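Idempotency can be as simple as tracking processed message IDs so a replayed message becomes a no-op. A minimal in-memory sketch (names are illustrative; a real consumer would persist the seen set, ideally in the same transaction as the write, so dedup survives restarts):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class IdempotentConsumer {
    // IDs of messages already applied. In production this belongs in
    // durable storage, keyed by a business ID or topic-partition-offset.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    // Returns true if the message was applied, false if it was a duplicate.
    public boolean process(String messageId, Runnable sideEffect) {
        if (!seen.add(messageId)) {
            return false; // already processed: replay is safe, do nothing
        }
        sideEffect.run();
        return true;
    }
}
```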

What I’m Learning

DLQs are an admission that not every message can be processed right now, and that’s fine. The worst thing you can do is block healthy messages behind a broken one. Separate the failures, keep the system moving, fix things later.

What’s your DLQ strategy? Do you alert on every message, or batch review?