TL;DR
Database migrations are where SaaS teams learn the difference between "works in development" and "works in production." The migration that takes 200ms on your local PostgreSQL instance locks a 50M-row table for 45 minutes in production. I've helped teams recover from migrations that took down production for 4+ hours... every one was preventable. The core principle: never mix schema changes with data changes in the same migration. Use expand-contract for zero-downtime deploys, run CONCURRENTLY on every index creation, and always... always... test against a production-sized dataset before deploying. The tooling matters less than the patterns. Alembic, Prisma Migrate, and raw SQL all work if you follow the zero-downtime playbook.
Part of the Performance Engineering Playbook ... a comprehensive guide to building systems that stay fast under real-world load.
Why Migrations Break Production
Every SaaS company hits the same wall. The team ships a migration that worked perfectly on a 10,000-row staging database. In production, with 50 million rows and 200 concurrent connections, the same migration acquires a table lock that blocks every read and write for 45 minutes. The on-call engineer's phone rings at 3 AM. Customers see 500 errors. The incident postmortem reveals the root cause: nobody tested the migration against production-scale data.
I've seen this pattern at 8 different companies in the past two years. The failure mode is always the same: a migration that's technically correct but operationally catastrophic.
The fix isn't better tooling. It's better patterns.
The Three Categories of Migration Risk
Not all migrations carry the same risk profile. Understanding the category determines your deployment strategy.
Category 1: Schema-Only (Low Risk)
Adding nullable columns, creating new tables, adding indexes with CONCURRENTLY. These don't touch existing data: column additions and table creation complete in under a second on tables of any size, and a CONCURRENTLY index build takes longer but never blocks reads or writes.
-- Safe: adding a nullable column is instant in PostgreSQL 11+
ALTER TABLE orders ADD COLUMN region VARCHAR(50);
-- Safe: new table creation never blocks existing queries
CREATE TABLE order_audit_logs (
id BIGSERIAL PRIMARY KEY,
order_id BIGINT REFERENCES orders(id),
action VARCHAR(50) NOT NULL,
changed_at TIMESTAMPTZ DEFAULT NOW()
);
PostgreSQL 11+ made ADD COLUMN with a default value a metadata-only operation... no table rewrite required. This single optimization eliminated the most common migration failure mode.
Caveat: This only applies to static defaults. DEFAULT 'active', DEFAULT 0, DEFAULT false are instant. But volatile defaults like DEFAULT gen_random_uuid(), DEFAULT now(), or DEFAULT random() still trigger a full table rewrite. The UUID case catches people most often... if you need a UUID column with a default, add it nullable first, backfill in batches, then add the constraint.
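The safe UUID rollout described above can be sketched as an ordered set of statements. Table and column names here are hypothetical, and gen_random_uuid() is built in from PostgreSQL 13 (earlier versions need pgcrypto):

```typescript
// Hypothetical four-step rollout for a UUID column with a default.
// Each DDL step is near-instant; step 2 runs repeatedly as a batched
// background job until it updates zero rows.
const uuidColumnRollout: string[] = [
  // 1. Nullable, no default: metadata-only change, no table rewrite.
  `ALTER TABLE orders ADD COLUMN external_id UUID`,
  // 2. Batched backfill, repeated until ROW_COUNT = 0.
  `UPDATE orders SET external_id = gen_random_uuid()
   WHERE id IN (SELECT id FROM orders WHERE external_id IS NULL
                LIMIT 5000 FOR UPDATE SKIP LOCKED)`,
  // 3. A default added later applies to new rows only: no rewrite.
  `ALTER TABLE orders ALTER COLUMN external_id SET DEFAULT gen_random_uuid()`,
  // 4. NOT NULL last: still a full-table scan under a brief AccessExclusive.
  `ALTER TABLE orders ALTER COLUMN external_id SET NOT NULL`,
];
```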
Category 2: Data Backfill (Medium Risk)
Populating new columns, transforming existing data, copying between tables. These touch every row and compete with production traffic for I/O and locks.
-- DANGEROUS: updates all rows in a single transaction
UPDATE orders SET region = 'us-east-1' WHERE region IS NULL;
-- SAFE: batch update with explicit limits
DO $$
DECLARE
batch_size INT := 5000;
rows_updated INT;
BEGIN
LOOP
UPDATE orders
SET region = 'us-east-1'
WHERE id IN (
SELECT id FROM orders
WHERE region IS NULL
LIMIT batch_size
FOR UPDATE SKIP LOCKED
);
GET DIAGNOSTICS rows_updated = ROW_COUNT;
EXIT WHEN rows_updated = 0;
PERFORM pg_sleep(0.1); -- breathe between batches
END LOOP;
END $$;
The SKIP LOCKED clause is critical... it prevents the backfill from blocking application queries that need the same rows. The pg_sleep between batches lets the WAL writer catch up and prevents replication lag from spiking.
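Before running a loop like this, it's worth estimating whether the backfill fits your window. A rough model: wall time is batches × (batch duration + sleep). The per-batch duration below is an assumption to replace with a measurement from a production-sized copy:

```typescript
// Back-of-envelope sizing for a batched backfill.
function estimateBackfillMinutes(
  totalRows: number,
  batchSize: number,
  secondsPerBatch: number, // measured UPDATE time per batch
  sleepSeconds: number     // the pg_sleep between batches
): number {
  const batches = Math.ceil(totalRows / batchSize);
  return (batches * (secondsPerBatch + sleepSeconds)) / 60;
}
// e.g. 50M rows in 5,000-row batches at 0.2s each + 0.1s sleep ≈ 50 minutes
```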
Category 3: Structural Change (High Risk)
Renaming columns, changing types, splitting tables, adding NOT NULL constraints to populated columns. These require the expand-contract pattern... there is no safe shortcut.
| Operation | Risk | Duration (50M rows) | Lock Type |
|---|---|---|---|
| ADD COLUMN (nullable) | None | < 1ms | AccessExclusive (instant) |
| ADD COLUMN (with static default) | None (PG 11+) | < 1ms | AccessExclusive (instant) |
| CREATE INDEX | Low | 5-30 min | ShareLock (blocks writes) |
| CREATE INDEX CONCURRENTLY | None | 10-60 min | ShareUpdateExclusive (doesn't block reads/writes) |
| ALTER COLUMN TYPE | High | 15-90 min | AccessExclusive (blocks all) |
| ADD NOT NULL constraint | High | 10-30 min | AccessExclusive (blocks all) |
| Backfill 50M rows | Medium | 5-45 min | Row-level (per batch) |
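One guard the table implies but doesn't show: even the "instant" AccessExclusive operations queue behind any long-running query touching the table, and every query arriving after them queues too. Setting a short lock_timeout first (the 2s value is an assumption; tune it per workload) turns a stuck migration into a fast, retryable failure:

```typescript
// Run the timeout before any DDL that takes AccessExclusive: if the
// lock isn't granted within 2s, the statement aborts instead of
// stalling all traffic behind it in the lock queue.
const guardedAddColumn: string[] = [
  `SET lock_timeout = '2s'`,
  `ALTER TABLE orders ADD COLUMN region VARCHAR(50)`,
];
```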
The Expand-Contract Pattern
This is the single most important pattern for zero-downtime migrations. Every structural change follows three phases.
Phase 1: Expand
Add the new structure alongside the old. Both exist simultaneously. The application writes to both.
// Migration: add new column
await db.schema.alterTable("users", (table) => {
table.string("email_normalized").nullable();
});
// Application code: dual-write during transition
async function updateUserEmail(userId: string, email: string) {
await db("users").where({ id: userId }).update({
email: email,
email_normalized: email.toLowerCase().trim(),
});
}
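One detail the dual-write leaves open is the read path during the transition: reads should prefer the new column and fall back to computing the value from the old one for rows the backfill hasn't reached. A minimal sketch, with a hypothetical row shape and helper names:

```typescript
// Transition-window read path for the expand phase.
interface UserRow {
  email: string;
  email_normalized: string | null;
}

function normalizeEmail(email: string): string {
  return email.toLowerCase().trim();
}

function readNormalizedEmail(row: UserRow): string {
  // NULL here means the backfill hasn't touched this row yet
  return row.email_normalized ?? normalizeEmail(row.email);
}
```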
Phase 2: Migrate
Backfill the new structure with existing data. Run in batches during low-traffic periods.
// Backfill migration ... run as a background job, NOT a deployment step
async function backfillNormalizedEmails() {
const batchSize = 5000;
let processed = 0;
while (true) {
const count = await db.raw(
`
UPDATE users
SET email_normalized = LOWER(TRIM(email))
WHERE id IN (
SELECT id FROM users
WHERE email_normalized IS NULL
LIMIT ?
FOR UPDATE SKIP LOCKED
)
`,
[batchSize]
);
if (count.rowCount === 0) break;
processed += count.rowCount;
// Log progress for observability
logger.info(`Backfilled ${processed} email records`);
// Let the database breathe
await sleep(100);
}
}
Phase 3: Contract
Once the new structure is fully populated and the application only reads from it, remove the old structure.
-- Only after verifying:
-- 1. No application code reads the old column
-- 2. Backfill is 100% complete
-- 3. Monitoring shows no errors for 48+ hours
ALTER TABLE users DROP COLUMN email;
ALTER TABLE users RENAME COLUMN email_normalized TO email;
The contract phase is where teams get impatient. Don't drop the old column in the same deploy cycle as the code change. Wait at least one full deployment cycle... preferably a week... before removing the old structure. If something goes wrong, you can revert the application code and the old column still has data.
Index Creation: The Silent Killer
Index creation on large tables is the most common cause of migration-related downtime. A standard CREATE INDEX acquires a ShareLock that blocks all writes to the table for the duration of the build.
On a 50M-row table, that's 5-30 minutes of blocked inserts and updates.
The Fix: CONCURRENTLY
-- NEVER do this in production
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
-- ALWAYS do this
CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders(customer_id);
CONCURRENTLY takes 2-3x longer but doesn't block writes. The tradeoff is worth it in every production scenario I've encountered.
Gotchas with CONCURRENTLY:
- Cannot run inside a transaction. Most migration tools wrap operations in transactions by default. You need to configure them to run this statement outside a transaction block.
- Can fail silently. If the build fails partway through, PostgreSQL marks the index as INVALID. Always verify after creation:
SELECT indexrelid::regclass, indisvalid
FROM pg_index
WHERE NOT indisvalid;
- Replication lag. Replaying the index build's WAL on replicas can cause lag spikes. Monitor pg_stat_replication during the build.
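If the check turns up invalid indexes, the remedy is to drop each one (DROP INDEX also supports CONCURRENTLY) and re-run the build. A small helper over the query result; the interface and names are illustrative:

```typescript
interface IndexStatus {
  name: string;
  indisvalid: boolean;
}

// Generate one DROP per invalid index; re-create them afterwards with
// CREATE INDEX CONCURRENTLY and re-check pg_index.
function invalidIndexCleanup(rows: IndexStatus[]): string[] {
  return rows
    .filter((r) => !r.indisvalid)
    .map((r) => `DROP INDEX CONCURRENTLY IF EXISTS ${r.name}`);
}
```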
Migration Tool Configuration
Every major migration tool handles CONCURRENTLY differently:
| Tool | Configuration |
|---|---|
| Alembic | op.execute('CREATE INDEX CONCURRENTLY ...') inside a with op.get_context().autocommit_block(): block |
| Prisma Migrate | Native support added in 2025 (previously required raw SQL workaround) |
| Knex | knex.raw('CREATE INDEX CONCURRENTLY ...') outside transaction |
| Django | AddIndexConcurrently from django.contrib.postgres.operations, with atomic = False on the migration (available since Django 3.0) |
| Rails | add_index :table, :column, algorithm: :concurrently with disable_ddl_transaction! |
Testing Against Production-Scale Data
The gap between development and production databases is where migrations fail. A migration that runs in 200ms on 10,000 rows takes 45 minutes on 50 million rows. The relationship isn't linear... it's often worse than linear due to index maintenance, WAL generation, and checkpoint pressure.
The Production-Scale Testing Playbook
Step 1: Create a production-sized test database.
# Dump the production schema
pg_dump --schema-only production_db > schema.sql
# pg_dump has no row filter, so export recent rows via \copy, scrubbing emails
psql production_db -c "\copy (SELECT * FROM orders WHERE created_at > '2025-01-01') TO STDOUT WITH CSV" | \
sed 's/[a-zA-Z0-9._%+-]*@[a-zA-Z0-9.-]*/redacted@test.com/g' > orders.csv
Never use actual production data for testing. Anonymize emails, names, and any PII. The schema and row counts are what matter for migration timing.
Step 2: Time the migration.
# Run with timing enabled
psql test_db -c "\timing on" -f migration.sql
Step 3: Compare against your maintenance window.
If the migration takes longer than 30 seconds, it needs the expand-contract pattern. If it takes longer than 5 minutes, it needs batched execution. If it takes longer than 30 minutes, it needs a dedicated maintenance plan.
The 30-Second Rule
Any migration that takes more than 30 seconds on production-scale data is not safe for a standard deployment. This is the threshold I use with every advisory client. Below 30 seconds, you can run it during a deploy. Above 30 seconds, you need a strategy.
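The thresholds above can be captured in a tiny triage function. The cutoffs (30 seconds, 5 minutes, 30 minutes) come straight from the playbook; the labels are mine:

```typescript
// Map measured migration time on production-scale data to a strategy.
function migrationStrategy(measuredSeconds: number): string {
  if (measuredSeconds <= 30) return "standard deploy";
  if (measuredSeconds <= 5 * 60) return "expand-contract";
  if (measuredSeconds <= 30 * 60) return "batched execution";
  return "dedicated maintenance plan";
}
```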
Rollback Strategies That Actually Work
"Just revert the migration" is the most dangerous phrase in database operations. Reversing a migration is rarely the inverse of applying it.
Forward-Only Migrations
For expand-contract patterns, rollback means deploying the previous application code... not reverting the database. The expanded schema works with both old and new code. This is why the contract phase waits a full deployment cycle.
Snapshot Rollback
For high-risk structural changes, take a point-in-time snapshot before the migration:
# PostgreSQL point-in-time recovery setup
pg_basebackup -D /backup/pre-migration -Ft -z -P
# Or use cloud-native snapshots
aws rds create-db-snapshot \
--db-instance-identifier production \
--db-snapshot-identifier pre-migration-2026-03-01
The snapshot gives you a guaranteed rollback point. The cost is storage (on RDS, backup storage beyond a free allocation equal to your provisioned database size runs $0.095/GB-month) and recovery time (15-45 minutes for a 500GB database). Both are cheaper than a botched migration.
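Using the RDS backup-storage price quoted above, the monthly carrying cost is simple arithmetic (the free allocation equals your provisioned size):

```typescript
// Monthly cost of snapshot storage beyond the free allocation.
function snapshotMonthlyCost(
  totalBackupGb: number,
  provisionedGb: number,
  pricePerGbMonth = 0.095 // RDS backup storage price assumed above
): number {
  return Math.max(0, totalBackupGb - provisionedGb) * pricePerGbMonth;
}
// e.g. 1TB of snapshots on a 500GB instance ≈ $47.50/month
```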
The Rollback Decision Framework
| Migration Type | Rollback Strategy | Recovery Time |
|---|---|---|
| Add nullable column | Drop column (safe) | < 1 second |
| Add index CONCURRENTLY | Drop index (safe) | < 1 second |
| Backfill data | Re-run backfill with old values | Minutes to hours |
| Rename column | Rename back (requires code revert) | < 1 second + deploy |
| Change column type | Restore from snapshot | 15-45 minutes |
| Drop column | Restore from snapshot | 15-45 minutes |
Multi-Tenant Migration Considerations
SaaS applications with tenant-level schema isolation face unique migration challenges. If you're using PostgreSQL schemas per tenant (one of the patterns from multi-tenancy with Prisma and RLS), every migration runs N times... once per tenant.
The Cascading Risk
A migration on 1,000 tenant schemas that takes 2 seconds each takes 33 minutes total. If any single tenant's migration fails, you have a partially migrated database with inconsistent schemas across tenants.
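The arithmetic generalizes: total wall time is tenants × per-tenant time ÷ concurrency, which is what motivates running tenants in parallel batches:

```typescript
// Sanity-check for per-tenant migration scheduling. Sequential at 2s
// per tenant gives ~33 minutes; ten-way parallelism cuts it to ~3.3.
function tenantMigrationMinutes(
  tenants: number,
  secondsPerTenant: number,
  concurrency = 1
): number {
  return (tenants * secondsPerTenant) / concurrency / 60;
}
```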
The Parallel Approach
// Run tenant migrations in controlled parallel batches
async function migrateTenants(migrationSql: string) {
const tenants = await db("tenants").select("schema_name");
const concurrency = 10; // Don't overwhelm the connection pool
for (let i = 0; i < tenants.length; i += concurrency) {
const batch = tenants.slice(i, i + concurrency);
await Promise.all(
batch.map(async (tenant) => {
try {
await db.transaction(async (trx) => {
  // SET LOCAL scopes search_path to this transaction, so parallel
  // tenants sharing the connection pool can't clobber each other
  await trx.raw(`SET LOCAL search_path TO ??`, [tenant.schema_name]);
  await trx.raw(migrationSql);
});
} catch (error) {
logger.error(`Migration failed for tenant ${tenant.schema_name}`, error);
// Record failure ... don't abort entire batch
await recordMigrationFailure(tenant.schema_name, error);
}
})
);
logger.info(`Migrated ${Math.min(i + concurrency, tenants.length)}/${tenants.length} tenants`);
}
}
Key decisions:
- Concurrency of 10 keeps connection pool pressure manageable
- Individual error handling prevents one tenant's failure from blocking all others
- Failure recording enables targeted retry for failed tenants
The Migration Checklist
Before running any migration in production:
Pre-Flight:
- Migration tested against production-scale data (row counts and indexes match)
- Execution time measured and within maintenance window
- Rollback plan documented and tested
- Database snapshot taken (or point-in-time recovery verified)
- Monitoring dashboards open (connection pool, query latency, replication lag)
Execution:
- Run during lowest-traffic period (check analytics, not assumptions)
- Schema changes separated from data changes
- Indexes created with CONCURRENTLY
- Backfills batched with SKIP LOCKED and sleep intervals
Post-Flight:
- Verify no invalid indexes (pg_index WHERE NOT indisvalid)
- Verify replication lag returned to baseline
- Verify application error rates unchanged
- Wait one full deployment cycle before contract phase
When to Apply This
- You're running migrations on tables with more than 1 million rows
- Your application serves traffic 24/7 with no maintenance window
- You're operating in a multi-tenant environment with shared or per-tenant schemas
- Your team has been burned by migration-related downtime (once is enough)
When NOT to Apply This
- Pre-launch with no production traffic... run migrations however you want
- Tables under 100K rows... standard migrations complete in under a second
- Development and staging environments... save the complexity for production
Running a SaaS that's outgrowing simple migrations? I help teams design zero-downtime migration strategies that scale with their data.
- Technical Advisor for Startups ... Architecture decisions from MVP to scale
- Next.js Development for SaaS ... Production-grade systems that handle growth
- Technical Due Diligence ... Pre-investment architecture assessment
Continue Reading
This post is part of the Performance Engineering Playbook ... covering TTFB optimization, Core Web Vitals, caching strategies, and monitoring at scale.
More in This Series
- Node.js Memory Leaks in Production ... Finding and fixing memory issues under load
- CDN Caching Strategy ... Multi-layer caching that actually invalidates correctly
- Core Web Vitals Optimization ... From red to green on every metric
- The Lambda Tax: Cold Starts and Hidden Costs ... When serverless costs more than servers
Related Guides
- Multi-Tenancy with Prisma and RLS ... Tenant isolation at the database layer
- Database Query Optimization ... From N+1 to optimal query patterns
- SaaS Reliability Monitoring ... Observability that catches issues before customers do
