A fintech startup was moving fast — three engineers, weekly deploys, PostgreSQL on RDS. Then they hit a schema change that required dropping a NOT NULL constraint, adding two columns, and backfilling 40 million rows. They ran it in a transaction at 2am. The migration held an exclusive table lock for 22 minutes. Their payment processing API returned 503s the entire time. Their monitoring caught it at minute 4; by then, the damage was done.
Zero-downtime schema migrations are solvable. Every technique exists. The problem is most teams learn them after the first outage. This guide covers the patterns that let you evolve your database schema aggressively without ever taking your application down.
- Never run DDL that holds a long lock (ALTER TABLE on large tables, DROP COLUMN with rewrite) during business hours
- The expand-contract pattern (add column → backfill → switch app → drop old) is the foundation of all zero-downtime migrations
- Use
gh-ostorpt-online-schema-changefor MySQL;pg_repack+CREATE INDEX CONCURRENTLYfor PostgreSQL - Flyway and Liquibase give you version control for SQL — pick one and use it from day one
- Always test migrations against a production-sized dataset — a migration that takes 30 seconds on staging takes 45 minutes on 100GB prod
Why Schema Migrations Cause Outages
Most schema migration outages come from one of three causes:
- Lock contention —
ALTER TABLEacquires an exclusive lock that blocks all reads and writes until the operation completes. On a large table, this can take minutes to hours. - Table rewrites — operations like adding a NOT NULL column without a default, changing a column type, or rebuilding an index with
REINDEXcopy the entire table, locking it throughout. - Application/database version mismatch — during a rolling deployment, old app code and new app code run simultaneously against the same database. If your migration removes a column the old code still reads, old instances will error immediately.
-- Operations that cause table rewrites (avoid on large tables during business hours): -- MySQL: ALTER TABLE orders ADD COLUMN archived BOOLEAN DEFAULT FALSE NOT NULL; -- rewrites table pre-MySQL 8.0 ALTER TABLE orders MODIFY COLUMN amount DECIMAL(12,2); -- type change = rewrite ALTER TABLE orders DROP COLUMN legacy_field; -- rewrite if inline -- PostgreSQL: ALTER TABLE orders ADD COLUMN processed_at TIMESTAMP NOT NULL; -- requires table scan to set default ALTER TABLE orders ALTER COLUMN amount TYPE BIGINT; -- full rewrite CREATE INDEX orders_user_idx ON orders(user_id); -- blocks all writes during build -- Safe alternatives: -- MySQL: use gh-ost or pt-osc for any table > 1GB -- PostgreSQL: CREATE INDEX CONCURRENTLY; use pg_repack for table rebuilds
The Expand-Contract Pattern
The expand-contract pattern (also called parallel change) is the fundamental building block for zero-downtime schema changes. Instead of changing the schema atomically, you do it in phases over multiple deploys:
- Expand — add the new column/table/index (old code ignores it; new code can write to it)
- Migrate — backfill existing data into the new structure
- Switch — deploy new application code that reads from the new column
- Contract — remove the old column/table/index (only after all app instances are on the new code)
-- Example: rename column 'user_name' → 'display_name' without downtime
-- Phase 1 (Expand): add new column, write to both
ALTER TABLE users ADD COLUMN display_name VARCHAR(255);
-- Deploy app code that writes to BOTH user_name AND display_name
-- Phase 2 (Migrate): backfill existing rows
-- MySQL: use pt-online-schema-change or chunked update
UPDATE users SET display_name = user_name WHERE display_name IS NULL LIMIT 10000;
-- Repeat until complete
-- PostgreSQL: can do in batches
DO $$
DECLARE
last_id BIGINT := 0;
BEGIN
LOOP
UPDATE users SET display_name = user_name
WHERE id > last_id AND display_name IS NULL
LIMIT 10000
RETURNING MAX(id) INTO last_id;
EXIT WHEN last_id IS NULL;
PERFORM pg_sleep(0.1); -- throttle
END LOOP;
END $$;
-- Phase 3 (Switch): deploy app code that reads from display_name only
-- Phase 4 (Contract): after all instances updated, drop the old column
ALTER TABLE users DROP COLUMN user_name;
Online Schema Change Tools
MySQL: gh-ost
gh-ost (GitHub's Online Schema Transmogrifier) is the production standard for MySQL schema changes on large tables. It creates a ghost table, applies changes there, syncs rows via binlog, and atomically cuts over — never holding a long lock.
# gh-ost: alter a large orders table with no downtime
gh-ost \
--host=mysql-primary \
--user=ghuser \
--password=secret \
--database=myapp \
--table=orders \
--alter="ADD COLUMN fulfillment_status ENUM('pending','shipped','delivered') DEFAULT 'pending'" \
--execute \
--chunk-size=2000 \
--max-load=Threads_running=25 \
--critical-load=Threads_running=50 \
--switch-to-rbr \
--allow-master-master # if using Group Replication
# Key flags:
# --chunk-size: rows per sync iteration (lower = less lag on replica)
# --max-load: throttle if MySQL load exceeds this
# --critical-load: abort if load exceeds this
# --postpone-cut-over-flag-file: pause before final cutover (manual approval)MySQL: pt-online-schema-change
# pt-osc: alternative for environments where gh-ost can't connect to binlog
pt-online-schema-change \
--host=mysql-primary \
--user=ptuser \
--password=secret \
D=myapp,t=orders \
--alter "ADD INDEX idx_status (fulfillment_status)" \
--execute \
--chunk-size=5000 \
--max-load="Threads_running:30" \
--no-drop-old-table # keep _orders_old until you verifyPostgreSQL: CREATE INDEX CONCURRENTLY
-- Standard CREATE INDEX blocks all writes during build
CREATE INDEX orders_user_idx ON orders(user_id); -- DON'T do this on large tables in prod
-- CREATE INDEX CONCURRENTLY: builds without write lock (takes 2-3x longer but safe)
CREATE INDEX CONCURRENTLY orders_user_idx ON orders(user_id);
-- If a concurrent build fails (it leaves an INVALID index):
SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'orders';
-- Drop the invalid index and retry:
DROP INDEX CONCURRENTLY orders_user_idx;
CREATE INDEX CONCURRENTLY orders_user_idx ON orders(user_id);PostgreSQL: pg_repack
# pg_repack: reclaim bloat or change table storage without an exclusive lock
# (VACUUM FULL requires a full table lock; pg_repack does not)
# Install
apt-get install postgresql-15-repack # or your PG version
# Repack a bloated table
pg_repack -h localhost -d mydb -t orders
# Repack with index change (equivalent to CLUSTER but online)
pg_repack -h localhost -d mydb -t orders --order-by created_atSchema Version Control: Flyway vs Liquibase
Flyway and Liquibase are the two dominant tools for versioning your migrations. The core concept: every schema change is a numbered migration file that runs exactly once, in order, on each environment.
Flyway
-- Migration files: V{version}__{description}.sql
-- Example: V1__create_users.sql
CREATE TABLE users (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
email VARCHAR(255) NOT NULL UNIQUE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- V2__add_display_name.sql
ALTER TABLE users ADD COLUMN display_name VARCHAR(255);
-- V3__index_users_email.sql (PostgreSQL)
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
-- Run via CLI
flyway -url=jdbc:postgresql://localhost/mydb -user=admin -password=secret migrate
-- Or Java/Spring (auto-migrates on startup):
# application.properties
spring.flyway.enabled=true
spring.flyway.locations=classpath:db/migrationLiquibase
-- changelog.xml (Liquibase uses XML, YAML, or SQL formats)
# Run via CLI
liquibase --url=jdbc:postgresql://localhost/mydb \
--username=admin --password=secret \
--changeLogFile=changelog.xml update| Aspect | Flyway | Liquibase |
|---|---|---|
| Format | SQL (primary) or Java | XML, YAML, JSON, or SQL |
| Learning curve | Lower — SQL files feel natural | Higher — XML/YAML abstraction layer |
| Rollback | Community: manual; Teams: auto | Built-in rollback changesets |
| Multi-DB support | Good | Excellent (abstracted DDL) |
| Best for | SQL-first teams, simpler setups | Multi-DB environments, enterprise |
Our recommendation: Flyway for teams that want to write plain SQL migrations (most teams). Liquibase if you're deploying the same schema across MySQL, PostgreSQL, and Oracle.
Rollback Strategies
Every migration should have a corresponding rollback plan. In practice, most rollbacks are harder than the forward migration — you can't un-delete data. Think through rollback before you write the migration:
-- Safe to roll back:
-- ADD COLUMN: rollback = DROP COLUMN
ALTER TABLE orders ADD COLUMN discount_code VARCHAR(50);
-- Rollback: ALTER TABLE orders DROP COLUMN discount_code;
-- ADD INDEX: rollback = DROP INDEX
CREATE INDEX CONCURRENTLY orders_discount_idx ON orders(discount_code);
-- Rollback: DROP INDEX CONCURRENTLY orders_discount_idx;
-- Harder to roll back:
-- DROP COLUMN: rollback requires a backup restore (data is gone)
-- Mitigation: rename instead of drop, keep data around for 1-2 sprints
-- RENAME COLUMN: both app versions must handle old and new name during rollout
-- Mitigation: use expand-contract (add new column, keep old, switch apps, drop old)
-- DATA MIGRATIONS: backfilling data is usually irreversible
-- Mitigation: write data to new column only; don't modify old column until you're sure
-- Blue/Green deployment + schema migrations:
-- Run migration BEFORE deploying new app code
-- New schema must be backward-compatible with old app code
-- Rollback = point load balancer back to old app stack (schema change stays)Testing Migrations Against Production Scale
The most common mistake: testing migrations on a small staging database. A migration that takes 8 seconds on a 2GB staging table can take 45 minutes on a 120GB production table. Always test on production-sized data before running in prod.
-- Estimate migration time before running
-- MySQL: check table size
SELECT
table_name,
ROUND(data_length / 1024 / 1024, 0) AS data_mb,
ROUND(index_length / 1024 / 1024, 0) AS index_mb,
table_rows
FROM information_schema.tables
WHERE table_schema = 'myapp'
AND table_name = 'orders';
-- PostgreSQL: table size
SELECT
pg_size_pretty(pg_total_relation_size('orders')) AS total_size,
pg_size_pretty(pg_relation_size('orders')) AS table_size,
(SELECT reltuples::BIGINT FROM pg_class WHERE relname = 'orders') AS est_rows;
-- Estimate gh-ost time:
-- gh-ost processes ~2,000 rows/chunk at 1 chunk/second by default
-- 50M rows ÷ 2,000 = 25,000 iterations ≈ 7 hours
-- Increase --chunk-size to 5,000 and it drops to ~3 hours (watch replica lag)
-- Run a dry-run first (gh-ost --test-on-replica)
gh-ost --test-on-replica --host=replica-host --table=orders \
--alter="ADD COLUMN ..." --executeMigration Patterns by Scenario
-- 1. Adding a new table: safe, deploy anytime
CREATE TABLE audit_log (
id BIGSERIAL PRIMARY KEY,
entity_type VARCHAR(50) NOT NULL,
entity_id BIGINT NOT NULL,
action VARCHAR(20) NOT NULL,
created_at TIMESTAMP DEFAULT now()
);
CREATE INDEX CONCURRENTLY audit_log_entity_idx ON audit_log(entity_type, entity_id);
-- 2. Adding a nullable column: safe, no lock (PostgreSQL 11+, MySQL 8+)
ALTER TABLE orders ADD COLUMN notes TEXT; -- no table rewrite for nullable
-- 3. Adding a NOT NULL column WITH DEFAULT: safe in PostgreSQL 11+
ALTER TABLE orders ADD COLUMN priority SMALLINT NOT NULL DEFAULT 0;
-- PostgreSQL 11+ stores the default in catalog, no table rewrite
-- MySQL: use gh-ost or add nullable first, backfill, add constraint separately
-- 4. Adding a column with a computed default (requires backfill)
-- Step 1: add nullable
ALTER TABLE orders ADD COLUMN priority_label VARCHAR(20);
-- Step 2: backfill in chunks
UPDATE orders SET priority_label = CASE
WHEN priority >= 8 THEN 'high'
WHEN priority >= 4 THEN 'medium'
ELSE 'low'
END WHERE priority_label IS NULL LIMIT 10000;
-- Step 3 (optional): add NOT NULL constraint after backfill
ALTER TABLE orders ALTER COLUMN priority_label SET NOT NULL;
-- 5. Dropping a column: two-phase
-- Phase 1: make app stop reading/writing it (deploy new code)
-- Phase 2 (next sprint): drop the column
ALTER TABLE orders DROP COLUMN legacy_tracking_id;Working with JusDB on Schema Migrations
The technical patterns are well-documented. The hard part is making them fit into your deployment pipeline — integrating gh-ost with your CI/CD, getting rollback procedures approved before an incident, and training engineers to think in expand-contract instead of just ALTER TABLE.
We help teams set up migration tooling and review high-risk schema changes as part of our MySQL consulting and PostgreSQL consulting. If you have a large migration coming up — table rebuilds, column renames, index changes on 100M+ row tables — talk to us before you run it.
Related reading: Database Schema Design Fundamentals | MySQL Performance Tuning | PostgreSQL VACUUM Tuning