Zero-Downtime Database Migrations: The Expand/Contract Pattern
Deploying code and schema changes together is fragile: a failed deploy leaves code and schema out of sync. The expand/contract pattern decouples them by keeping every schema change backward-compatible with both old and new code and applying the expand and contract steps in separate deploys.
The problem
Deploying a new service version alongside a schema change creates a window of incompatibility. During a rolling deploy (ECS tasks, Kubernetes pods replacing one at a time), old and new code run simultaneously against the same database. If the schema change is breaking — renaming a column, dropping a table, changing a type — the old code fails while the new code works, or vice versa.
The safest way to handle this is to never apply a breaking schema change while old code is still running.
The expand/contract pattern
Decouple schema changes from code changes by splitting them across separate deploys: first expand the schema to support both old and new code, then deploy the new code, then contract the schema to remove what the old code needed.
Prerequisites
- rolling deployments
- SQL schema management
- transaction semantics
Key Points
- Expand: add new columns/tables alongside old ones. Both old and new code work against the expanded schema.
- Migrate code: deploy new service version that uses the new schema elements.
- Contract: once no old code is running, remove the old columns/tables.
- Each phase is a separate deploy. The schema is never incompatible with running code.
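The invariant in the last point can be checked mechanically. The sketch below is a toy model, not part of any migration tool: the `Step` type, the `REQUIRES` mapping, and the column sets are all hypothetical, chosen to match the rename example used later in this article. It reports every step where a running code version needs a column the schema does not have.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    name: str
    schema: frozenset   # columns present while this step is in effect
    running: tuple      # code versions that may be running during this step

# Hypothetical: the columns each code version needs in order to work.
REQUIRES = {
    "old": frozenset({"user_name"}),
    "new": frozenset({"display_name"}),
}

def incompatible_steps(steps, requires=REQUIRES):
    """Return (step, version, missing columns) for every broken combination."""
    problems = []
    for step in steps:
        for version in step.running:
            missing = requires[version] - step.schema
            if missing:
                problems.append((step.name, version, sorted(missing)))
    return problems

BOTH = frozenset({"user_name", "display_name"})

EXPAND_CONTRACT = [
    Step("expand", BOTH, ("old",)),
    Step("migrate code", BOTH, ("old", "new")),
    Step("contract", frozenset({"display_name"}), ("new",)),
]

SINGLE_DEPLOY = [
    Step("rename + deploy", frozenset({"display_name"}), ("old", "new")),
]
```

Calling `incompatible_steps(EXPAND_CONTRACT)` returns an empty list, while `incompatible_steps(SINGLE_DEPLOY)` flags the old code as missing `user_name`.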
A worked example: renaming a column
You need to rename user_name to display_name in a users table.
The wrong approach: rename in a single migration alongside the code deploy. During the rolling deploy, old pods look for user_name (which no longer exists) and crash.
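The failure is easy to reproduce locally. The following is a minimal sketch that uses an in-memory SQLite database as a stand-in for the production database (an assumption made only so the example is self-contained; `RENAME COLUMN` needs SQLite 3.25 or newer). The function names are illustrative.

```python
import sqlite3

def old_code_read(conn, user_id):
    # What an old pod does: query the pre-rename column.
    row = conn.execute(
        "SELECT user_name FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return row[0]

def demo_breaking_rename():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, user_name TEXT)")
    conn.execute("INSERT INTO users (id, user_name) VALUES (1, 'ada')")

    before = old_code_read(conn, 1)  # works against the old schema

    # The single-step rename, applied while old code is still running.
    conn.execute("ALTER TABLE users RENAME COLUMN user_name TO display_name")

    try:
        old_code_read(conn, 1)
        after = "ok"
    except sqlite3.OperationalError as exc:
        after = f"failed: {exc}"
    finally:
        conn.close()
    return before, after
```

`demo_breaking_rename()` returns the value read before the rename together with the error the same query raises afterwards.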
The expand/contract approach:
Phase 1: Expand
Add the new column and populate it from the old column. New code writes to both columns. Old code continues to read user_name and is unaffected. New code can read display_name because it now exists.
-- Migration 1: add new column
ALTER TABLE users ADD COLUMN display_name VARCHAR(255);
UPDATE users SET display_name = user_name;
# New code: write to both columns during transition
def update_user_name(user_id, name):
    db.execute(
        "UPDATE users SET user_name = %s, display_name = %s WHERE id = %s",
        (name, name, user_id)
    )

# New code: read from new column
def get_display_name(user_id):
    return db.query("SELECT display_name FROM users WHERE id = %s", user_id)
Deploy the new code. Old pods still work (they read user_name). New pods work (they read display_name, which is populated). The database has both columns.
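To see the expand phase end to end, here is a runnable sketch using an in-memory SQLite database in place of the real one (an assumption for self-containment; note that SQLite uses `?` placeholders rather than `%s`). The helper names mirror the functions above but are otherwise illustrative.

```python
import sqlite3

def expand(conn):
    # Migration 1: add the new column and backfill it from the old one.
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
    conn.execute("UPDATE users SET display_name = user_name")

def new_update_user_name(conn, user_id, name):
    # New code: dual write during the transition.
    conn.execute(
        "UPDATE users SET user_name = ?, display_name = ? WHERE id = ?",
        (name, name, user_id),
    )

def old_get_user_name(conn, user_id):
    # Old code: still reads the old column.
    return conn.execute(
        "SELECT user_name FROM users WHERE id = ?", (user_id,)
    ).fetchone()[0]

def new_get_display_name(conn, user_id):
    # New code: reads the new column.
    return conn.execute(
        "SELECT display_name FROM users WHERE id = ?", (user_id,)
    ).fetchone()[0]

def demo_expand():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, user_name TEXT)")
    conn.execute("INSERT INTO users (id, user_name) VALUES (1, 'ada')")
    expand(conn)
    new_update_user_name(conn, 1, "Ada L.")
    result = (old_get_user_name(conn, 1), new_get_display_name(conn, 1))
    conn.close()
    return result
```

After a write from the new code, both the old read path and the new read path see the same value, which is what lets old and new pods coexist.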
Phase 2: Contract
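Once all old pods are gone (rollout complete, no old code running), deploy a follow-up release that stops writing user_name, re-run the backfill to pick up any rows old pods wrote to user_name alone during the rollout, and then drop the old column.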
-- Migration 2: drop old column (run after all new code is deployed)
ALTER TABLE users DROP COLUMN user_name;
By this point the running code reads only display_name and, once the dual write has been removed, no longer writes user_name, so the drop does not affect any running code.
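Before dropping the old column it can be worth confirming that the two columns agree, because any row an old pod wrote during the rollout has only user_name set. The sketch below is one possible pre-contract check, again using in-memory SQLite as a stand-in; `stale_row_count` is a hypothetical helper, and `IS NOT` is SQLite's NULL-safe inequality (Postgres would use `IS DISTINCT FROM`).

```python
import sqlite3

def stale_row_count(conn):
    # Rows where the new column is missing or differs from the old one.
    return conn.execute(
        "SELECT COUNT(*) FROM users WHERE display_name IS NOT user_name"
    ).fetchone()[0]

def demo_contract_guard():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users ("
        "id INTEGER PRIMARY KEY, user_name TEXT, display_name TEXT)"
    )
    # Row written by new code (dual write).
    conn.execute("INSERT INTO users VALUES (1, 'ada', 'ada')")
    # Row written by an old pod during the rollout: display_name stays NULL.
    conn.execute("INSERT INTO users (id, user_name) VALUES (2, 'grace')")

    before = stale_row_count(conn)

    # Re-run the backfill for the rows that drifted.
    conn.execute(
        "UPDATE users SET display_name = user_name "
        "WHERE display_name IS NOT user_name"
    )
    after = stale_row_count(conn)
    conn.close()
    return before, after
```

A nonzero count before the contract migration is a signal to backfill again rather than drop the column.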
What makes a schema change backward-compatible
| Change | Safe to deploy without expand/contract? |
|--------|----------------------------------------|
| Add a nullable column | Yes: old code ignores it |
| Add a NOT NULL column with a default | Yes: old writes satisfy the constraint |
| Add a NOT NULL column without default | No: old writes fail |
| Rename a column | No: use expand/contract |
| Drop a column | No: old code references it |
| Add an index | Yes: non-blocking with CREATE INDEX CONCURRENTLY in Postgres or online DDL (ALGORITHM=INPLACE, LOCK=NONE) in MySQL InnoDB |
| Add a foreign key | Depends: blocking if validating existing rows |
Additive changes (new nullable columns, new tables, new indexes) are generally safe to deploy alongside code. Destructive changes (drops, renames, type changes, adding NOT NULL without a default) require expand/contract.
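The three column-addition rows can be verified with a short experiment. This is a sketch under stated assumptions: the `orders` table and `region` column are made up for illustration, and the table is created with the new column already in place because SQLite refuses to `ALTER TABLE ... ADD COLUMN` a NOT NULL column without a default. What matters is how an INSERT written by old code behaves against each variant.

```python
import sqlite3

# Old code does not know about the new `region` column.
OLD_INSERT = "INSERT INTO orders (id, item) VALUES (?, ?)"

def old_insert_succeeds(region_column_ddl):
    """Run the old INSERT against a table that includes the new column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        f"id INTEGER PRIMARY KEY, item TEXT, {region_column_ddl})"
    )
    try:
        conn.execute(OLD_INSERT, (1, "book"))
        return True
    except sqlite3.IntegrityError:
        # NOT NULL constraint violated: old writes fail.
        return False
    finally:
        conn.close()
```

`old_insert_succeeds("region TEXT")` and `old_insert_succeeds("region TEXT NOT NULL DEFAULT 'unknown'")` both succeed, while `old_insert_succeeds("region TEXT NOT NULL")` fails, matching the table.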
💡Using Flyway or Liquibase to manage migration sequencing
Migration tools (Flyway, Liquibase, Alembic in Python) apply versioned migrations in order and are often wired to run on service startup. They track which migrations have been applied via a schema history table.
For expand/contract to work correctly, the two phases must be in separate migration files:
V1__add_display_name.sql ← expand: runs before new code deploy
V2__drop_user_name.sql ← contract: runs after new code deploy completes
The challenge: if your service applies migrations automatically on startup, V2 runs when the new code starts — potentially before all old pods are gone. Solutions:
- Apply the contract migration manually after confirming rollout completion.
- Use a separate migration-only job that runs after the rollout completes.
- Use a feature flag that enables the contract migration only when explicitly triggered.
Automated migrations are convenient for expand phases. Contract phases often need manual gating to ensure the timing is right.
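The gating idea can be reduced to a small predicate. The sketch below is hypothetical and not a feature of Flyway, Liquibase, or Alembic: it assumes your deployment tooling can list the version reported by each running instance and that an operator sets an explicit approval flag.

```python
def contract_allowed(running_versions, new_version, operator_approved):
    """Allow the contract migration only when it is provably safe.

    running_versions: versions reported by every live instance (assumed
                      to come from your orchestrator or service registry).
    new_version:      the version that no longer needs the old schema.
    operator_approved: explicit manual trigger, e.g. a feature flag.
    """
    versions = list(running_versions)
    if not versions:
        # No visibility into what is running: refuse rather than guess.
        return False
    everyone_upgraded = all(v == new_version for v in versions)
    return operator_approved and everyone_upgraded
```

A migration-only job could call this before applying the contract file and exit without changes when it returns False.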
Blue-green deployment for incompatible changes
When a schema change is fundamentally incompatible (a table restructure, a data format rewrite, a type change that cannot be dual-written), expand/contract does not help because there is no intermediate state where both schemas are valid.
In this case, blue-green deployment separates the environments completely:
- Provision a Green environment with new code and new schema.
- Replicate data from Blue to Green. Keep replication running (for example, log-based logical replication) so Blue writes are mirrored to Green.
- Switch traffic from Blue to Green.
- Keep Blue running briefly for rollback capability. Stop replication.
- Decommission Blue.
The cost is provisioning a full parallel environment. The benefit is a clean cutover with no compatibility window. This is appropriate for major version migrations (Postgres 13 → 16, Rails 5 → 7, database engine swap).
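The cutover sequence above can be written down as a simple ordered plan. This is only a sketch: the replication-lag gate and its one-second threshold are assumptions added for illustration, not part of the steps listed, and the step strings are placeholders for whatever your tooling actually does.

```python
def cutover_steps(replication_lag_seconds, max_lag_seconds=1.0):
    """Return the ordered cutover actions, or a wait step if Green lags Blue."""
    if replication_lag_seconds > max_lag_seconds:
        # Assumed safety gate: do not switch while Green is missing writes.
        return ["wait for replication to catch up"]
    return [
        "switch traffic from Blue to Green",
        "keep Blue running for rollback",
        "stop replication",
        "decommission Blue",
    ]
```

For example, `cutover_steps(5.0)` returns only the wait step, while `cutover_steps(0.2)` returns the four cutover actions in order.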
A team applies a migration that adds a NOT NULL column without a default value, then does a rolling deploy of the new code. During the deploy, old pods start throwing 500 errors on writes. What happened and how should this have been handled?
Difficulty: medium
The migration ran before the deploy started. Old pods are still running the old code, which does not set the new column in INSERT statements.
A. The migration should have run after all old pods were terminated
Incorrect. Running the migration after the deploy means new pods start against the old schema and fail because the column doesn't exist. The migration must run before new code deploys.
B. The column should have been added with a DEFAULT value or as nullable first, then a separate migration to add the NOT NULL constraint
Correct! A NOT NULL column without a default requires all INSERT statements to include it. Old code that doesn't know about this column violates the constraint. The expand/contract fix: first add the column as nullable (old inserts succeed), migrate code to populate it, then add the NOT NULL constraint in a separate migration after old code is gone.
C. Blue-green deployment should be used for all schema changes
Incorrect. Blue-green is appropriate for incompatible schema changes, but it's heavyweight. A nullable column with a subsequent NOT NULL constraint is a standard expand/contract operation, with no need for a full parallel environment.
D. The migration tool should have detected the incompatibility and prevented the deploy
Incorrect. Migration tools (Flyway, Liquibase) do not validate schema compatibility with running application code; they only track which migrations have been applied.
Hint: Think about what INSERT statements from old code look like when a new NOT NULL column exists that they don't know about.