Change Data Capture: Streaming Database Changes
Database changes. You need other systems to react. How do they know what changed?
Option 1: Application publishes events. Fragile. Bugs skip events.
Option 2: Read database changes directly. Change Data Capture.
What Is CDC#
Change Data Capture streams database changes (inserts, updates, deletes) as events. Other systems subscribe to this stream.
Example: User updates email in database. CDC captures:
{
  "operation": "UPDATE",
  "table": "users",
  "before": {"id": 123, "email": "old@example.com"},
  "after": {"id": 123, "email": "new@example.com"},
  "timestamp": "2026-01-14T10:30:00Z"
}
Search index, cache, analytics warehouse all react to this event.
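A downstream consumer dispatches on the operation type. A minimal sketch of one such reaction (the `ChangeEvent` record and `SearchIndexer` are illustrative names, not from any library):

```java
import java.util.Map;

// Illustrative CDC event shape, mirroring the JSON above.
record ChangeEvent(String operation, String table,
                   Map<String, Object> before, Map<String, Object> after) {}

// One downstream reaction: keep a search index in sync.
class SearchIndexer {
    String apply(ChangeEvent e) {
        if ("DELETE".equals(e.operation())) {
            return "remove doc " + e.before().get("id"); // row is gone; use the before-image
        }
        return "index doc " + e.after().get("id");       // insert/update: reindex the after-image
    }
}
```

The cache invalidator and warehouse loader follow the same shape, each consuming the stream independently.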
Why CDC vs Application Events#
Application events (dual write problem):
@Transactional
public void updateUser(User user) {
    userRepository.save(user);                          // Write 1: Database
    eventPublisher.publish(new UserUpdatedEvent(user)); // Write 2: Event
}
Problems:
- Event publish fails? DB updated, event lost. Inconsistency.
- DB transaction rollback after event sent? Event published for change that didn’t happen.
- Code path bypasses event? (admin tool, SQL script) No event fired.
This is the dual write problem: writing to two systems (database + event bus) can't be atomic. One might succeed while the other fails.
CDC solves this: Single write to database. CDC reads from database log. Database is source of truth. If it’s in DB, event fires. No dual write, no inconsistency.
Alternative: Outbox Pattern
Another solution to dual write: write events to database table in same transaction.
@Transactional
public void updateUser(User user) {
    userRepository.save(user);
    // Write event to outbox table (same transaction)
    OutboxEvent event = new OutboxEvent(
        "UserUpdated",
        user.getId(),
        toJson(user)
    );
    outboxRepository.save(event);
    // Both succeed or both fail atomically
}
Separate process reads outbox table, publishes to Kafka, marks as sent.
Outbox pattern requires application code changes. CDC works with existing code. Pick based on your constraints.
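The relay process can be sketched as a loop over unsent rows. In production it would read via JDBC and publish to Kafka; here the publish step is a stand-in list so the control flow is visible (all names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

class OutboxRelay {
    record OutboxEvent(long id, String type, String payload, boolean sent) {}

    // Stand-in for a Kafka producer: records what was published.
    final List<String> published = new ArrayList<>();

    // Drain unsent rows in insertion order. Delivery is at-least-once:
    // if marking fails after publish, the row is re-published on retry.
    List<OutboxEvent> drain(List<OutboxEvent> rows) {
        List<OutboxEvent> updated = new ArrayList<>();
        for (OutboxEvent row : rows) {
            if (!row.sent()) {
                published.add(row.type() + ":" + row.payload());                         // publish step
                updated.add(new OutboxEvent(row.id(), row.type(), row.payload(), true)); // mark sent
            } else {
                updated.add(row);
            }
        }
        return updated;
    }
}
```

Because publish and mark-sent are two steps, consumers must tolerate duplicates (at-least-once, not exactly-once).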
How CDC Works#
1. Log-Based CDC (Best)#
Read database transaction log (MySQL binlog, PostgreSQL WAL, MongoDB oplog).
MySQL binlog example:
# Binlog entry
BEGIN
UPDATE users SET email='new@example.com' WHERE id=123
COMMIT
CDC tool (Debezium, Maxwell, custom) parses binlog, converts to events.
Pros:
- Low overhead (log already exists)
- Captures all changes (even from SQL scripts)
- Ordered, transactional
Cons:
- Requires log access (permissions, networking)
- Log format changes between DB versions
Log-based CDC reads the transaction log and converts entries to events for downstream consumers (search, cache, analytics).
2. Query-Based CDC (Polling)#
Periodically query for changes. Track last update timestamp.
-- Poll every 5 seconds
SELECT * FROM users
WHERE updated_at > '2026-01-14 10:29:55'
ORDER BY updated_at;
Pros:
- Simple, no special permissions
- Works with any database
Cons:
- Polling overhead
- Misses deletes (row gone, can’t query it)
- Not real-time (5 sec delay)
- Misses intermediate states (row updated twice in one poll interval? You see only the final state)
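The watermark bookkeeping behind that query can be sketched as follows, with `pollOnce` standing in for running the SELECT (names are illustrative):

```java
import java.util.Comparator;
import java.util.List;

class PollingCdc {
    record Row(int id, String email, long updatedAt) {}

    long watermark = 0; // highest updated_at seen so far

    // Stand-in for: SELECT * FROM users WHERE updated_at > :watermark ORDER BY updated_at
    List<Row> pollOnce(List<Row> table) {
        List<Row> changed = table.stream()
            .filter(r -> r.updatedAt() > watermark)
            .sorted(Comparator.comparingLong(Row::updatedAt))
            .toList();
        if (!changed.isEmpty()) {
            watermark = changed.get(changed.size() - 1).updatedAt();
        }
        return changed; // states between polls are never observed
    }
}
```

The sketch also makes the delete gap visible: a deleted row simply never matches the filter, so the watermark approach cannot report it.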
3. Trigger-Based CDC#
Database triggers write changes to separate table.
CREATE TRIGGER user_changes
AFTER UPDATE ON users
FOR EACH ROW
BEGIN
    INSERT INTO user_changelog (user_id, old_email, new_email, changed_at)
    VALUES (OLD.id, OLD.email, NEW.email, NOW());
END;
CDC reads from changelog table.
Pros:
- Captures before/after state
- Works without log access
Cons:
- Trigger overhead on every write
- Changelog table grows unbounded
- Adds complexity to DB schema
Real-World Pattern#
Debezium + Kafka is standard.
Debezium connector reads MySQL binlog, publishes to Kafka. Consumers subscribe.
# Debezium MySQL connector config
{
  "name": "mysql-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql.example.com",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "password",
    "database.server.id": "1",
    "database.server.name": "mysql-server",
    "table.include.list": "mydb.users,mydb.orders"
  }
}
Events flow: MySQL → Debezium → Kafka → Consumers (Elasticsearch, Redis, Data Warehouse).
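Debezium wraps each change in an envelope whose `op` field encodes the operation (`c` = create, `u` = update, `d` = delete, `r` = snapshot read). A consumer can route on that field; this sketch uses a flattened map rather than the full envelope (which also nests `source` and `ts_ms` metadata):

```java
import java.util.Map;

class DebeziumRouter {
    // Route a (simplified) Debezium change envelope by its "op" code.
    String route(Map<String, Object> envelope) {
        String op = (String) envelope.get("op");
        Map<?, ?> before = (Map<?, ?>) envelope.get("before");
        Map<?, ?> after = (Map<?, ?>) envelope.get("after");
        return switch (op) {
            case "c", "u", "r" -> "upsert " + after.get("id");  // create/update/snapshot: use after-image
            case "d"           -> "delete " + before.get("id"); // delete: after is null, use before-image
            default            -> "skip";
        };
    }
}
```

Treating create, update, and snapshot reads uniformly as upserts keeps consumers idempotent, which matters because Kafka delivery is at-least-once.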
Schema Evolution#
Table schema changes? CDC events change.
Add column:
// Before
{"id": 123, "email": "user@example.com"}
// After adding phone column
{"id": 123, "email": "user@example.com", "phone": null}
Consumers must handle schema evolution. Use a schema registry (e.g., Confluent Schema Registry with Avro) so producers and consumers agree on versioned schemas.
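A consumer stays forward-compatible by ignoring unknown fields and defaulting missing or null ones, rather than assuming the old row shape. A sketch (names are illustrative):

```java
import java.util.Map;

class TolerantConsumer {
    record UserView(long id, String email, String phone) {}

    // Decode the "after" image: unknown fields are ignored,
    // missing/null fields get defaults, so old consumers survive new columns.
    UserView decode(Map<String, Object> after) {
        Object phone = after.get("phone");
        return new UserView(
            ((Number) after.get("id")).longValue(),
            (String) after.get("email"),
            phone == null ? "" : (String) phone);
    }
}
```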
When to Use CDC#
Good use cases:
- Sync to search index (Elasticsearch)
- Invalidate cache on changes
- Replicate to data warehouse (analytics)
- Audit log (who changed what when)
- Materialized views in different databases
Don’t use CDC for:
- Business logic triggers (use application events)
- Critical workflows (CDC has lag, eventual consistency)
- When you need validation before action (CDC sees committed changes only)
What I’m Curious About#
I've worked with systems using application events but have never implemented CDC at scale. Reading about it, the guarantees are attractive (never miss a change), but the operational complexity (managing Debezium, Kafka, a schema registry) seems significant.
For systems where database is already source of truth and you need multiple downstream consumers to stay in sync, CDC makes sense. For new features where you control the flow, application events might be simpler.
The trade-off: CDC guarantees consistency but adds infrastructure. Application events are simpler but require discipline to maintain consistency.
Have you used CDC in production? What challenges did you face?