Change Data Capture: Streaming Database Changes
Database changes. You need other systems to react. How do they know what changed?
Option 1: Application publishes events. Fragile. Bugs skip events.
Option 2: Read database changes directly. Change Data Capture.
What Is CDC#
Change Data Capture streams database changes (inserts, updates, deletes) as events. Other systems subscribe to this stream.
Example: User updates email in database. CDC captures:
{
  "operation": "UPDATE",
  "table": "users",
  "before": {"id": 123, "email": "old@example.com"},
  "after": {"id": 123, "email": "new@example.com"},
  "timestamp": "2026-01-14T10:30:00Z"
}
Search index, cache, analytics warehouse all react to this event.
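A downstream consumer dispatches on the operation type. A minimal sketch of one such reaction (the `ChangeEvent` record and `SearchIndexer` are illustrative names, not from any library):

```java
import java.util.Map;

// Illustrative CDC event shape, mirroring the JSON above.
record ChangeEvent(String operation, String table,
                   Map<String, Object> before, Map<String, Object> after) {}

// One downstream reaction: keep a search index in sync.
class SearchIndexer {
    String apply(ChangeEvent e) {
        if ("DELETE".equals(e.operation())) {
            return "remove doc " + e.before().get("id"); // row is gone; use the before-image
        }
        return "index doc " + e.after().get("id");       // insert/update: reindex the after-image
    }
}
```

The cache invalidator and warehouse loader follow the same shape, each consuming the stream independently.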
Why CDC vs Application Events#
Application events (dual write problem):
@Transactional
public void updateUser(User user) {
    userRepository.save(user);                          // Write 1: Database
    eventPublisher.publish(new UserUpdatedEvent(user)); // Write 2: Event
}
Problems:
- Event publish fails? DB updated, event lost. Inconsistency.
- DB transaction rollback after event sent? Event published for change that didn’t happen.
- Code path bypasses event? (admin tool, SQL script) No event fired.
This is the dual write problem: writing to two systems (database + event bus) can't be atomic. One might succeed while the other fails.
CDC solves this: Single write to database. CDC reads from database log. Database is source of truth. If it’s in DB, event fires. No dual write, no inconsistency.
Alternative: Outbox Pattern
Another solution to dual write: write events to database table in same transaction.
@Transactional
public void updateUser(User user) {
    userRepository.save(user);
    // Write event to outbox table (same transaction)
    OutboxEvent event = new OutboxEvent(
        "UserUpdated",
        user.getId(),
        toJson(user)
    );
    outboxRepository.save(event);
    // Both succeed or both fail atomically
}
Separate process reads outbox table, publishes to Kafka, marks as sent.
Outbox pattern requires application code changes. CDC works with existing code. Pick based on your constraints.
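The relay process can be sketched as a loop over unsent rows. In production it would read via JDBC and publish to Kafka; here the publish step is a stand-in list so the control flow is visible (all names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

class OutboxRelay {
    record OutboxEvent(long id, String type, String payload, boolean sent) {}

    // Stand-in for a Kafka producer: records what was published.
    final List<String> published = new ArrayList<>();

    // Drain unsent rows in insertion order. Delivery is at-least-once:
    // if marking fails after publish, the row is re-published on retry.
    List<OutboxEvent> drain(List<OutboxEvent> rows) {
        List<OutboxEvent> updated = new ArrayList<>();
        for (OutboxEvent row : rows) {
            if (!row.sent()) {
                published.add(row.type() + ":" + row.payload());                         // publish step
                updated.add(new OutboxEvent(row.id(), row.type(), row.payload(), true)); // mark sent
            } else {
                updated.add(row);
            }
        }
        return updated;
    }
}
```

Because publish and mark-sent are two steps, consumers must tolerate duplicates (at-least-once, not exactly-once).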
How CDC Works#
1. Log-Based CDC (Best)#
Read database transaction log (MySQL binlog, PostgreSQL WAL, MongoDB oplog).
MySQL binlog example:
# Binlog entry
BEGIN
UPDATE users SET email='new@example.com' WHERE id=123
COMMIT
CDC tool (Debezium, Maxwell, custom) parses binlog, converts to events.
Pros:
- Low overhead (log already exists)
- Captures all changes (even from SQL scripts)
- Ordered, transactional
Cons:
- Requires log access (permissions, networking)
- Log format changes between DB versions
Log-based CDC reads the transaction log and converts entries to events for downstream consumers (search, cache, analytics).
2. Query-Based CDC (Polling)#
Periodically query for changes. Track last update timestamp.
-- Poll every 5 seconds
SELECT * FROM users
WHERE updated_at > '2026-01-14 10:29:55'
ORDER BY updated_at;
Pros:
- Simple, no special permissions
- Works with any database
Cons:
- Polling overhead
- Misses deletes (row gone, can’t query it)
- Not real-time (5 sec delay)
- Misses intermediate states (row updated twice in one poll interval? You see only the final state)
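The watermark bookkeeping behind that query can be sketched as follows, with `pollOnce` standing in for running the SELECT (names are illustrative):

```java
import java.util.Comparator;
import java.util.List;

class PollingCdc {
    record Row(int id, String email, long updatedAt) {}

    long watermark = 0; // highest updated_at seen so far

    // Stand-in for: SELECT * FROM users WHERE updated_at > :watermark ORDER BY updated_at
    List<Row> pollOnce(List<Row> table) {
        List<Row> changed = table.stream()
            .filter(r -> r.updatedAt() > watermark)
            .sorted(Comparator.comparingLong(Row::updatedAt))
            .toList();
        if (!changed.isEmpty()) {
            watermark = changed.get(changed.size() - 1).updatedAt();
        }
        return changed; // states between polls are never observed
    }
}
```

The sketch also makes the delete gap visible: a deleted row simply never matches the filter, so the watermark approach cannot report it.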
3. Trigger-Based CDC#
Database triggers write changes to separate table.
CREATE TRIGGER user_changes
AFTER UPDATE ON users
FOR EACH ROW
BEGIN
    INSERT INTO user_changelog (user_id, old_email, new_email, changed_at)
    VALUES (OLD.id, OLD.email, NEW.email, NOW());
END;
CDC reads from changelog table.
Pros:
- Captures before/after state
- Works without log access
Cons:
- Trigger overhead on every write
- Changelog table grows unbounded
- Adds complexity to DB schema
Real-World Pattern#
Debezium + Kafka is standard.
Debezium connector reads MySQL binlog, publishes to Kafka. Consumers subscribe.
# Debezium MySQL connector config
{
  "name": "mysql-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql.example.com",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "password",
    "database.server.id": "1",
    "database.server.name": "mysql-server",
    "table.include.list": "mydb.users,mydb.orders"
  }
}
Events flow: MySQL → Debezium → Kafka → Consumers (Elasticsearch, Redis, Data Warehouse).
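Debezium wraps each change in an envelope whose `op` field encodes the operation (`c` = create, `u` = update, `d` = delete, `r` = snapshot read). A consumer can route on that field; this sketch uses a flattened map rather than the full envelope (which also nests `source` and `ts_ms` metadata):

```java
import java.util.Map;

class DebeziumRouter {
    // Route a (simplified) Debezium change envelope by its "op" code.
    String route(Map<String, Object> envelope) {
        String op = (String) envelope.get("op");
        Map<?, ?> before = (Map<?, ?>) envelope.get("before");
        Map<?, ?> after = (Map<?, ?>) envelope.get("after");
        return switch (op) {
            case "c", "u", "r" -> "upsert " + after.get("id");  // create/update/snapshot: use after-image
            case "d"           -> "delete " + before.get("id"); // delete: after is null, use before-image
            default            -> "skip";
        };
    }
}
```

Treating create, update, and snapshot reads uniformly as upserts keeps consumers idempotent, which matters because Kafka delivery is at-least-once.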
Schema Evolution#
Table schema changes? CDC events change.
Add column:
// Before
{"id": 123, "email": "user@example.com"}
// After adding phone column
{"id": 123, "email": "user@example.com", "phone": null}
Consumers must handle schema evolution. Use a schema registry (e.g., Confluent Schema Registry with Avro) so producers and consumers agree on versioned schemas.
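A consumer stays forward-compatible by ignoring unknown fields and defaulting missing or null ones, rather than assuming the old row shape. A sketch (names are illustrative):

```java
import java.util.Map;

class TolerantConsumer {
    record UserView(long id, String email, String phone) {}

    // Decode the "after" image: unknown fields are ignored,
    // missing/null fields get defaults, so old consumers survive new columns.
    UserView decode(Map<String, Object> after) {
        Object phone = after.get("phone");
        return new UserView(
            ((Number) after.get("id")).longValue(),
            (String) after.get("email"),
            phone == null ? "" : (String) phone);
    }
}
```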
When to Use CDC#
Good use cases:
- Sync to search index (Elasticsearch)
- Invalidate cache on changes
- Replicate to data warehouse (analytics)
- Audit log (who changed what when)
- Materialized views in different databases
Don’t use CDC for:
- Business logic triggers (use application events)
- Critical workflows (CDC has lag, eventual consistency)
- When you need validation before action (CDC sees committed changes only)
What I’m Curious About#
I've worked with systems using application events but have never implemented CDC at scale. Reading about it, the guarantees are attractive (never miss a change), but the operational complexity (managing Debezium, Kafka, a schema registry) seems significant.
For systems where database is already source of truth and you need multiple downstream consumers to stay in sync, CDC makes sense. For new features where you control the flow, application events might be simpler.
The trade-off: CDC guarantees consistency but adds infrastructure. Application events are simpler but require discipline to maintain consistency.
Have you used CDC in production? What challenges did you face?