Transactional Outbox: Solving the Dual Write Problem
You have a bug in your microservices. You just haven’t found it yet.
It usually looks like this:
@Transactional
public void completeOrder(Order order) {
    orderRepo.save(order);                // Step 1: Update DB
    kafka.send("order-completed", order); // Step 2: Tell the world
}
This works 99.9% of the time. But that 0.1%? That’s where your data dies. This is the Dual Write Problem.
Why Your Events Are “Ghosts”
You are writing to two independent systems: a database and a message broker. A local database transaction cannot span both of them.
- If the DB commit fails, but the message was already sent? You just told the Warehouse to ship an order that doesn’t exist in your DB.
- If the DB commit succeeds, but the message never reaches the broker because the network blips and the failed send goes unnoticed? The order is in your DB, but the Warehouse never hears about it. It’s a “ghost” order.
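To make the two failure modes concrete, here is a toy simulation in plain Java. It is only a sketch: DualWriteDemo, its two lists, and the commitFails/sendFails flags are made up for illustration and stand in for a real database and broker.

```java
import java.util.ArrayList;
import java.util.List;

// Toy simulation of the two dual-write failure modes; not real DB or Kafka code.
class DualWriteDemo {
    final List<String> database = new ArrayList<>(); // committed orders
    final List<String> broker = new ArrayList<>();   // delivered events

    // Mirrors the listing above: the send happens before the transaction commits.
    void completeOrder(String orderId, boolean commitFails, boolean sendFails) {
        List<String> pending = new ArrayList<>();
        pending.add(orderId);          // Step 1: staged, not yet committed
        if (!sendFails) {
            broker.add(orderId);       // Step 2: the event leaves immediately
        }
        if (commitFails) {
            return;                    // rollback: the staged write is discarded
        }
        database.addAll(pending);      // commit
    }

    public static void main(String[] args) {
        DualWriteDemo first = new DualWriteDemo();
        first.completeOrder("order-1", true, false);
        // prints: db=[] broker=[order-1]  (event without an order)
        System.out.println("db=" + first.database + " broker=" + first.broker);

        DualWriteDemo second = new DualWriteDemo();
        second.completeOrder("order-2", false, true);
        // prints: db=[order-2] broker=[]  (order without an event)
        System.out.println("db=" + second.database + " broker=" + second.broker);
    }
}
```

Neither run throws an error, which is exactly why this bug tends to stay hidden until someone compares the two systems.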
The Solution: The Outbox
Instead of trying to talk to Kafka during your business logic, you just talk to your database. You write the event into a special outbox table in the same transaction as your order.
The Atomic Write
@Transactional
public void completeOrder(Order order) {
    orderRepo.save(order);
    // Save the "intent" to publish an event
    OutboxEntry entry = new OutboxEntry("ORDER_COMPLETED", toJson(order));
    outboxRepo.save(entry);
}
Now, either both records are saved, or nothing is. No more ghosts.
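The listing never shows what OutboxEntry looks like. A minimal sketch might be the following. Only the two constructor arguments come from the snippet above; the id, createdAt, and processed fields are my assumptions about what a relay would need. In a JPA application you would also map this class as an entity, which I have left out to keep the example self-contained.

```java
import java.time.Instant;
import java.util.UUID;

// Hypothetical shape for one row of the outbox table; field names are assumptions.
class OutboxEntry {
    private final UUID id = UUID.randomUUID();       // unique id, lets consumers spot duplicates
    private final String eventType;                  // e.g. "ORDER_COMPLETED"
    private final String payload;                    // serialized event body (JSON here)
    private final Instant createdAt = Instant.now(); // lets a relay read oldest-first
    private boolean processed = false;               // flipped once the broker has ACKed

    OutboxEntry(String eventType, String payload) {
        this.eventType = eventType;
        this.payload = payload;
    }

    UUID getId() { return id; }
    String getEventType() { return eventType; }
    String getPayload() { return payload; }
    Instant getCreatedAt() { return createdAt; }
    boolean isProcessed() { return processed; }

    void markProcessed() { this.processed = true; }
}
```

Storing the payload as an already-serialized string means the relay can forward the row without knowing anything about the Order type.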
The Message Relay
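A separate process (the “Relay”) reads that outbox table and pushes the messages to Kafka. Once it gets an ACK from Kafka, it deletes the row or marks it as processed. Note that this gives you at-least-once delivery: if the Relay crashes after the ACK but before it updates the row, the message is sent again on the next run, so consumers need to tolerate duplicates.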
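One pass of a polling relay can be sketched like this. Everything here is a hypothetical stand-in: InMemoryOutbox plays the role of the outbox table, Publisher plays the role of the Kafka client, and pollOnce is a name I picked. A real relay would run this on a schedule against the actual database and broker.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of a polling relay; none of these types are a real Kafka or JDBC API.
class OutboxRelaySketch {

    static final class Row {
        final long id;
        final String type;
        final String payload;
        Row(long id, String type, String payload) {
            this.id = id;
            this.type = type;
            this.payload = payload;
        }
    }

    /** Stand-in for the broker client; returns true once the broker has ACKed. */
    interface Publisher {
        boolean send(Row row);
    }

    /** Stand-in for the outbox table, kept in insertion order. */
    static class InMemoryOutbox {
        private final Map<Long, Row> unprocessed = new LinkedHashMap<>();
        private long nextId = 1;

        void insert(String type, String payload) {
            long id = nextId++;
            unprocessed.put(id, new Row(id, type, payload));
        }

        /** Returns a copy of the oldest rows so callers can mark rows while iterating. */
        List<Row> fetchUnprocessed(int limit) {
            List<Row> batch = new ArrayList<>();
            for (Row row : unprocessed.values()) {
                if (batch.size() == limit) break;
                batch.add(row);
            }
            return batch;
        }

        void markProcessed(long id) { unprocessed.remove(id); }

        int pendingCount() { return unprocessed.size(); }
    }

    /** One relay pass: publish rows oldest-first, marking each only after an ACK. */
    static int pollOnce(InMemoryOutbox outbox, Publisher publisher, int batchSize) {
        int published = 0;
        for (Row row : outbox.fetchUnprocessed(batchSize)) {
            if (!publisher.send(row)) {
                break; // no ACK: stop here and retry this row on the next pass
            }
            outbox.markProcessed(row.id);
            published++;
        }
        return published;
    }

    public static void main(String[] args) {
        InMemoryOutbox outbox = new InMemoryOutbox();
        outbox.insert("ORDER_COMPLETED", "{\"orderId\":1}");
        outbox.insert("ORDER_COMPLETED", "{\"orderId\":2}");

        List<Row> topic = new ArrayList<>();
        boolean[] brokerUp = {false};
        Publisher publisher = row -> brokerUp[0] && topic.add(row);

        System.out.println(pollOnce(outbox, publisher, 10)); // broker down, prints 0
        brokerUp[0] = true;
        System.out.println(pollOnce(outbox, publisher, 10)); // broker back, prints 2
    }
}
```

Stopping at the first unacknowledged row keeps events in insertion order, at the cost of one stuck row holding up everything behind it.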
Outbox vs. CDC
Wait, didn’t I just write about CDC?
Yes. CDC (Change Data Capture, with a tool like Debezium) is arguably the best way to implement the “Relay” part of the Outbox pattern. Instead of a background thread polling your DB every 100 ms (which is heavy), you let Debezium watch the transaction log for new rows in the outbox table.
It’s the best of both worlds: Application-defined events (Outbox) with infrastructure-level reliability (CDC).
What I’m Thinking
I remember the first time I realized that publishEvent() inside a transaction was a lie. It felt like the ground shifted under me.
The Outbox pattern feels like “boilerplate” when you first see it. You think, “I really have to create a table just to send a message?”
But once you’ve had to manually reconcile a database with a Kafka topic because of a network timeout, you never go back. It’s the “glue” that makes event-driven systems actually trustworthy.
Have you ever lost an event because of a partial failure? How did you recover the data?