You have a bug in your microservices. You just haven’t found it yet.

It usually looks like this:

@Transactional
public void completeOrder(Order order) {
    orderRepo.save(order); // Step 1: Update DB
    kafka.send("order-completed", order); // Step 2: Tell the world
}

This works 99.9% of the time. But that 0.1%? That’s where your data dies. This is the Dual Write Problem.

Why Your Events Are “Ghosts”

You are writing to two different things: a Database and a Message Broker. The database transaction doesn’t extend to the broker, so there is no single commit that covers both writes.

  • If the DB commit fails, but the message was already sent? You just told the Warehouse to ship an order that doesn’t exist in your DB.
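  • If the DB commit succeeds, but the message never reaches the broker? A send() call usually just queues the message, so a network blip or an app crash can still lose it after the commit. The order is in your DB, but the Warehouse never hears about it. It’s a “ghost” order.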
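
To make the two failure modes concrete, here is a toy simulation in plain Java. Nothing in it talks to a real database or broker: DualWriteDemo, its two lists, and the brokerDown / commitFails flags are all made up for illustration, so treat it as a sketch of the failure modes rather than how any framework behaves internally.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the dual write; the lists stand in for a real DB and broker (assumption). */
class DualWriteDemo {
    final List<String> committedOrders = new ArrayList<>();
    final List<String> sentMessages = new ArrayList<>();

    /** Same order of operations as the buggy completeOrder: send first, commit at the end. */
    void completeOrder(String orderId, boolean brokerDown, boolean commitFails) {
        if (!brokerDown) {
            // The message leaves the process right away and cannot be rolled back.
            sentMessages.add("order-completed:" + orderId);
        }
        // If the broker is down, the send is lost without the caller noticing.

        if (commitFails) {
            throw new IllegalStateException("simulated commit failure");
        }
        committedOrders.add(orderId);
    }

    public static void main(String[] args) {
        DualWriteDemo demo = new DualWriteDemo();

        // Case 1: the message is sent, then the commit fails.
        try {
            demo.completeOrder("order-1", false, true);
        } catch (IllegalStateException e) {
            // The order was rolled back; the message was not.
        }

        // Case 2: the commit succeeds, but the message is lost.
        demo.completeOrder("order-2", true, false);

        System.out.println("DB:     " + demo.committedOrders); // [order-2]
        System.out.println("Broker: " + demo.sentMessages);    // [order-completed:order-1]
    }
}
```

After both calls the two stores disagree in both directions: the broker announced an order the DB never stored, and the DB holds an order nobody was told about.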

The Solution: The Outbox

Instead of trying to talk to Kafka during your business logic, you just talk to your database. You write the event into a special outbox table in the same transaction as your order.

The Atomic Write

@Transactional
public void completeOrder(Order order) {
    orderRepo.save(order); 
    
    // Save the "intent" to publish an event
    OutboxEntry entry = new OutboxEntry("ORDER_COMPLETED", toJson(order));
    outboxRepo.save(entry); 
}

Now, either both records are saved, or nothing is. No more ghosts.
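
If you want to see the all-or-nothing behavior without a real database, here is a minimal in-memory sketch. ToyDb, its Tx staging class, and the failMidway flag are hypothetical stand-ins I’m assuming for illustration; a real database gives you this guarantee through its own transaction machinery.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Toy in-memory stand-in for a database with an orders table and an outbox table. */
class ToyDb {
    final List<String> orders = new ArrayList<>();
    final List<String> outbox = new ArrayList<>();

    /** Writes staged inside one transaction; nothing is visible until commit. */
    static final class Tx {
        final List<String> stagedOrders = new ArrayList<>();
        final List<String> stagedOutbox = new ArrayList<>();
    }

    /** Applies every staged write, or none at all if the work throws. */
    void inTransaction(Consumer<Tx> work) {
        Tx tx = new Tx();
        work.accept(tx);                 // an exception here skips the commit below
        orders.addAll(tx.stagedOrders);  // "commit": both tables change together
        outbox.addAll(tx.stagedOutbox);
    }
}

class AtomicWriteDemo {
    /** Mirrors completeOrder: the order row and the outbox row share one transaction. */
    static void completeOrder(ToyDb db, String orderId, boolean failMidway) {
        db.inTransaction(tx -> {
            tx.stagedOrders.add(orderId);
            if (failMidway) {
                throw new IllegalStateException("simulated failure before commit");
            }
            tx.stagedOutbox.add("ORDER_COMPLETED:" + orderId);
        });
    }

    public static void main(String[] args) {
        ToyDb db = new ToyDb();
        completeOrder(db, "order-1", false);
        try {
            completeOrder(db, "order-2", true);
        } catch (IllegalStateException e) {
            // Rolled back: neither table saw order-2.
        }
        System.out.println(db.orders + " " + db.outbox); // [order-1] [ORDER_COMPLETED:order-1]
    }
}
```

The failed call leaves no trace in either table, which is exactly the property the outbox relies on.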

The Message Relay

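A separate process (the “Relay”) reads that outbox table and pushes the messages to Kafka. Once it gets an ACK from Kafka, it deletes the row or marks it as processed. If the Relay crashes between the ACK and that update, it will publish the same message again after a restart, so delivery is at-least-once and your consumers need to tolerate duplicates.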
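
Here is a rough sketch of what one polling pass of such a Relay could look like. It is an in-memory illustration, not a production relay: OutboxRow, OutboxRelay, pollOnce, and the Predicate-based publisher are all names I’m assuming, and the list stands in for the real outbox table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** One row of the outbox table; the fields are illustrative assumptions. */
class OutboxRow {
    final long id;
    final String eventType;
    final String payload;
    boolean processed = false;

    OutboxRow(long id, String eventType, String payload) {
        this.id = id;
        this.eventType = eventType;
        this.payload = payload;
    }
}

/** Polling relay sketch: publish pending rows in order, mark each one only after an ACK. */
class OutboxRelay {
    private final List<OutboxRow> outboxTable;    // stand-in for the real table
    private final Predicate<OutboxRow> publisher; // returns true when the broker ACKs

    OutboxRelay(List<OutboxRow> outboxTable, Predicate<OutboxRow> publisher) {
        this.outboxTable = outboxTable;
        this.publisher = publisher;
    }

    /** Runs one polling pass and returns how many rows were published and marked. */
    int pollOnce() {
        int published = 0;
        for (OutboxRow row : outboxTable) {
            if (row.processed) {
                continue;
            }
            if (!publisher.test(row)) {
                break; // no ACK: leave the row pending and retry on the next pass
            }
            row.processed = true;
            published++;
        }
        return published;
    }
}

class OutboxRelayDemo {
    public static void main(String[] args) {
        List<OutboxRow> table = new ArrayList<>();
        table.add(new OutboxRow(1, "ORDER_COMPLETED", "{\"orderId\":1}"));
        table.add(new OutboxRow(2, "ORDER_COMPLETED", "{\"orderId\":2}"));

        List<String> broker = new ArrayList<>();
        int[] attempts = {0};
        // Fake broker: the very first publish attempt fails, later ones succeed.
        OutboxRelay relay = new OutboxRelay(table, row -> {
            attempts[0]++;
            if (attempts[0] == 1) {
                return false;
            }
            broker.add(row.payload);
            return true;
        });

        System.out.println(relay.pollOnce()); // 0
        System.out.println(relay.pollOnce()); // 2
        System.out.println(broker.size());    // 2
    }
}
```

Stopping at the first failed publish keeps events in order: a row is never skipped just because the one before it could not be delivered yet.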

%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#000000','primaryTextColor':'#00ff00','primaryBorderColor':'#00ff00','lineColor':'#00ff00','secondaryColor':'#000000','tertiaryColor':'#000000','noteBkgColor':'#000000','noteBorderColor':'#00ff00','noteTextColor':'#00ff00'}}}%%
graph LR
    App[App Logic] --> DB[(Database)]
    DB --> OT[Outbox Table]
    Relay[Relay Process] -- Polls --> OT
    Relay -- Publishes --> K[Kafka/RabbitMQ]

Outbox vs. CDC

Wait, didn’t I just write about CDC?

Yes. CDC (with a tool like Debezium) is actually the best way to implement the “Relay” part of the Outbox pattern. Instead of a background thread polling your DB every 100ms (which adds constant query load even when there is nothing to send), you let Debezium watch the transaction log for new rows in the outbox table.

It’s the best of both worlds: Application-defined events (Outbox) with infrastructure-level reliability (CDC).

What I’m Thinking

I remember the first time I realized that publishEvent() inside a transaction was a lie. It felt like the ground shifted under me.

The Outbox pattern feels like “boilerplate” when you first see it. You think, “I really have to create a table just to send a message?”

But once you’ve had to manually reconcile a database with a Kafka topic because of a network timeout, you never go back. It’s the “glue” that makes event-driven systems actually trustworthy.

Have you ever lost an event because of a partial failure? How did you recover the data?