Database Migrations Without Downtime
ALTER TABLE on a 2M row table. MySQL locks it. Every query queues up behind the lock. Your API returns 503s for two minutes.
I’ve done this. In production. On a Friday. Don’t be me.
The Problem With Direct Migrations
Most schema changes in MySQL acquire a metadata lock, and many rebuild the whole table. Small tables, no problem. Large tables? Reads and writes queue behind that lock for the entire duration of the rebuild.
-- Looks innocent. Blocks everything on a large table.
ALTER TABLE transactions ADD COLUMN status VARCHAR(20) DEFAULT 'pending';
On our 2M row transactions table at Oracle, this took over 90 seconds. Ninety seconds of downtime from a single DDL.
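You can find out in advance whether a DDL will block. MySQL lets you demand a non-locking strategy, and fails fast if it can't honor it — a sketch using the same table as above:

```sql
-- If InnoDB cannot do this without blocking concurrent reads and
-- writes, the statement fails immediately instead of locking for
-- minutes. Safe to try; it either works online or does nothing.
ALTER TABLE transactions
  ADD COLUMN status VARCHAR(20) DEFAULT 'pending',
  ALGORITHM=INPLACE, LOCK=NONE;
```

Run this in a staging copy first if you're unsure: an error here is your warning that the direct migration would have locked production.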
Expand-Contract Pattern
Instead of one big change, break it into safe steps.
Expand: Add the new column as nullable, no default. In MySQL 8.0+ this is usually instant with the INSTANT algorithm.
ALTER TABLE transactions ADD COLUMN status VARCHAR(20) NULL, ALGORITHM=INSTANT;
Migrate: Backfill existing rows in batches. Small updates that hold row locks only briefly — nothing queues.
@Scheduled(fixedRate = 1000)
public void backfillStatus() {
    // Each batch touches at most 1,000 rows, so row locks are held briefly.
    int updated = jdbcTemplate.update(
        "UPDATE transactions SET status = 'completed' " +
        "WHERE status IS NULL LIMIT 1000"
    );
    if (updated == 0) {
        // Backfill complete — disable the job or flip a feature flag here.
    }
}
Contract: Once all rows are backfilled and code no longer reads the old format, add the constraint or default.
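For this example, the contract step might look like the following — assuming every row now has a status, and again asking MySQL to refuse rather than lock if it can't do the change online:

```sql
-- Backfill finished: tighten the schema.
-- INPLACE permits concurrent DML during the rebuild; the statement
-- fails fast instead of locking if INPLACE is not possible.
ALTER TABLE transactions
  MODIFY COLUMN status VARCHAR(20) NOT NULL DEFAULT 'pending',
  ALGORITHM=INPLACE, LOCK=NONE;
```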
Shadow Writes
Renaming a column? Changing a type? Harder. You can’t just add and backfill.
Shadow writes: deploy code that writes to both old and new columns. Backfill old data to the new column. Verify they match. Switch reads to the new column. Stop writing to the old one. Drop the old column.
More deploys. More complexity. Zero downtime.
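A minimal sketch of the dual-write phase, using in-memory maps to stand in for the old and new columns. The class and method names are hypothetical, not from any real codebase — the point is the shape of the pattern:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Simulates shadow writes during a column rename:
// every new write goes to both the old and the new "column".
public class ShadowWriter {
    final Map<Long, String> oldColumn = new HashMap<>();
    final Map<Long, String> newColumn = new HashMap<>();

    // Phase 1: dual write. The old column stays authoritative.
    public void write(long id, String value) {
        oldColumn.put(id, value);
        newColumn.put(id, value);
    }

    // Phase 2: backfill rows written before the dual-write deploy.
    public void backfill() {
        oldColumn.forEach(newColumn::putIfAbsent);
    }

    // Phase 3: verification gate before switching reads over.
    public boolean columnsMatch() {
        return Objects.equals(oldColumn, newColumn);
    }
}
```

Only after `columnsMatch()` holds in production do you switch reads, stop the dual writes, and drop the old column — each as its own deploy, so any step can be rolled back alone.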
The Tools That Help
pt-online-schema-change (Percona) and gh-ost (GitHub) handle this for you. They create a shadow table, copy data in chunks, replay changes via triggers or binlog, then swap tables atomically.
gh-ost --alter="ADD COLUMN status VARCHAR(20)" \
  --database=mydb --table=transactions \
  --execute
At Oracle, we switched to pt-online-schema-change after the Friday incident. Never blocked a table again.
What I’m Learning
Every “simple” migration is a potential outage on a large table. The expand-contract pattern feels like overkill until you’ve experienced the alternative. Three safe deploys beat one scary one.
The rule I follow now: if the table has more than 100K rows, never run a direct ALTER TABLE in production.
How do you handle schema migrations? Do you use online DDL tools?