Database crashes mid-write. Power fails. Server dies. When it restarts, how does it know what happened?

Write-Ahead Logging.

The Problem#

Random writes are slow. On a spinning disk, each seek costs around 10ms. If you write every change directly to disk at random locations, throughput tanks.

In-memory updates are fast but volatile. Crash before flushing to disk? Data gone.

You need durability without killing performance.

How WAL Works#

Every change gets appended to a log file first. Sequential write, no seeking, fast. Then update in-memory structures. Background process flushes to disk later.

Crash happens? Replay the WAL on restart. The log tells you exactly what was committed.

```mermaid
%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#000000','primaryTextColor':'#00ff00','primaryBorderColor':'#00ff00','lineColor':'#00ff00','secondaryColor':'#000000','tertiaryColor':'#000000','noteBkgColor':'#000000','noteBorderColor':'#00ff00','noteTextColor':'#00ff00'}}}%%
sequenceDiagram
    autonumber
    participant C as Client
    participant W as WAL
    participant M as Memory
    participant D as Disk
    C->>W: Write: UPDATE user SET status='active'
    W->>W: Append to log (sequential)
    W-->>C: ACK (durable now)
    W->>M: Update in-memory
    Note over M,D: Background flush
    M-->>D: Write to actual location
```

WAL guarantees durability through sequential append. Actual disk write happens later.
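The core loop is tiny. Here's a minimal sketch in Python; the JSON-lines record format and the `WAL` class are made up for illustration, not how any real database encodes its log:

```python
import json
import os
import tempfile

class WAL:
    """Minimal write-ahead log sketch: one JSON record per line, fsync'd."""

    def __init__(self, path):
        self.path = path
        self.file = open(path, "a")

    def append(self, record):
        # Sequential append, then fsync: the record is durable before we ACK.
        self.file.write(json.dumps(record) + "\n")
        self.file.flush()
        os.fsync(self.file.fileno())

    def replay(self):
        # On restart, re-apply every logged record in commit order.
        with open(self.path) as f:
            for line in f:
                yield json.loads(line)

path = os.path.join(tempfile.mkdtemp(), "demo.wal")

wal = WAL(path)
wal.append({"op": "set", "key": "user:1", "value": "active"})
wal.append({"op": "set", "key": "user:2", "value": "inactive"})
# Pretend we crash here: the in-memory table is gone, the log is not.

table = {}
for rec in WAL(path).replay():
    if rec["op"] == "set":
        table[rec["key"]] = rec["value"]
```

Replay rebuilds the exact in-memory state the acknowledged writes produced. Real systems add framing, checksums, and LSNs on top, but the shape is the same.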

Why Sequential Writes Matter#

Disk seeks are expensive. Moving the read/write head around kills performance. Sequential writes avoid seeks entirely: just append to the end of the file.

SSDs are faster but still benefit. Sequential writes reduce write amplification and extend SSD lifespan.

The trade-off: you’re writing twice (once to WAL, once to actual location). But sequential + async makes it worth it.

WAL in Real Systems#

PostgreSQL has a pg_wal directory. Every transaction is appended there before commit.

MySQL uses redo logs. InnoDB writes changes to ib_logfile0 and ib_logfile1 before updating data pages.

Kafka? The entire thing is a distributed WAL. That’s the product.

What I’m Still Figuring Out#

Checkpointing. When do you truncate old WAL segments? Keep them forever and you run out of disk. Delete too early and you can’t recover.

PostgreSQL does checkpoints where it flushes all dirty pages to disk, then marks that point in WAL. Older segments can be recycled. But the tuning (checkpoint intervals, WAL segment size) seems like black magic.
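My current mental model, as a toy sketch. The two-record segment size and the "keep only the newest segment after checkpoint" rule are gross simplifications of what PostgreSQL actually does, but they show why truncation is safe only after the flush:

```python
import os
import tempfile

tmp = tempfile.mkdtemp()
segments = []  # list of (segment_path, records) in append order

def append(record):
    # Roll to a new WAL segment every 2 records (tiny, for illustration).
    if not segments or len(segments[-1][1]) == 2:
        path = os.path.join(tmp, f"seg{len(segments):04d}.wal")
        open(path, "wb").close()
        segments.append((path, []))
    path, recs = segments[-1]
    recs.append(record)
    with open(path, "ab") as f:
        f.write(record + b"\n")

def checkpoint(flush_dirty_pages):
    # 1) Flush all dirty pages so the data files are current.
    flush_dirty_pages()
    # 2) Everything before this point is now recoverable from the data
    #    files alone, so older segments can be deleted (or recycled).
    while len(segments) > 1:
        path, _ = segments.pop(0)
        os.remove(path)

for i in range(5):
    append(f"rec{i}".encode())
kept_before = len(segments)
checkpoint(flush_dirty_pages=lambda: None)  # no real buffer pool here
```

Delete a segment before the flush completes and a crash loses data; never delete and disk fills. The tuning question is just where between those two failure modes you sit.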

WAL is the foundation of durability in almost every database. Sequential append beats random write every time.

Does your database use WAL? Have you tuned checkpoint settings?