February 3, 2026 · 14 min read · infrastructure

Database Migrations That Won't Wake You at 3 AM

Most SaaS teams treat database migrations as a deployment step. They're an architecture decision. Here's the pattern comparison, rollback strategy, and zero-downtime playbook I use with advisory clients.

Tags: database · migrations · saas · postgresql · zero-downtime

TL;DR

Database migrations are where SaaS teams learn the difference between "works in development" and "works in production." The migration that takes 200ms on your local PostgreSQL instance locks a 50M-row table for 45 minutes in production. I've helped teams recover from migrations that took down production for 4+ hours... every one was preventable. The core principle: never mix schema changes with data changes in the same migration. Use expand-contract for zero-downtime deploys, run CONCURRENTLY on every index creation, and always... always... test against a production-sized dataset before deploying. The tooling matters less than the patterns. Alembic, Prisma Migrate, and raw SQL all work if you follow the zero-downtime playbook.

Part of the Performance Engineering Playbook ... a comprehensive guide to building systems that stay fast under real-world load.


Why Migrations Break Production

Every SaaS company hits the same wall. The team ships a migration that worked perfectly on a 10,000-row staging database. In production, with 50 million rows and 200 concurrent connections, the same migration acquires a table lock that blocks every read and write for 45 minutes. The on-call engineer's phone rings at 3 AM. Customers see 500 errors. The incident postmortem reveals the root cause: nobody tested the migration against production-scale data.

I've seen this pattern at 8 different companies in the past two years. The failure mode is always the same: a migration that's technically correct but operationally catastrophic.

The fix isn't better tooling. It's better patterns.


The Three Categories of Migration Risk

Not all migrations carry the same risk profile. Understanding the category determines your deployment strategy.

Category 1: Schema-Only (Low Risk)

Adding nullable columns, creating new tables, adding indexes with CONCURRENTLY. These don't touch existing data. Adding a nullable column or creating a new table completes in milliseconds on tables of any size; a concurrent index build takes longer but never blocks traffic.

```sql
-- Safe: adding a nullable column is instant in PostgreSQL 11+
ALTER TABLE orders ADD COLUMN region VARCHAR(50);

-- Safe: new table creation never blocks existing queries
CREATE TABLE order_audit_logs (
  id BIGSERIAL PRIMARY KEY,
  order_id BIGINT REFERENCES orders(id),
  action VARCHAR(50) NOT NULL,
  changed_at TIMESTAMPTZ DEFAULT NOW()
);
```

PostgreSQL 11+ made ADD COLUMN with a default value a metadata-only operation... no table rewrite required. This single optimization eliminated the most common migration failure mode.

Caveat: This only applies to static defaults. DEFAULT 'active', DEFAULT 0, DEFAULT false are instant. But volatile defaults like DEFAULT gen_random_uuid(), DEFAULT now(), or DEFAULT random() still trigger a full table rewrite. The UUID case catches people most often... if you need a UUID column with a default, add it nullable first, backfill in batches, then add the constraint.
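The nullable-first sequence for a volatile default looks like this ... a sketch with illustrative names (`orders`, `public_id`), ordered so no single step rewrites the table:

```typescript
// Hypothetical sketch: adding a UUID column with a volatile default to a
// large table without a full rewrite. Table and column names are illustrative.
const addUuidColumnSteps: string[] = [
  // 1. Add the column nullable ... metadata-only, instant at any size
  `ALTER TABLE orders ADD COLUMN public_id UUID;`,
  // 2. Backfill in batches as a background job (repeat until zero rows remain)
  `UPDATE orders SET public_id = gen_random_uuid()
   WHERE id IN (SELECT id FROM orders WHERE public_id IS NULL LIMIT 5000);`,
  // 3. Attach the default for new rows only ... no rewrite of existing rows
  `ALTER TABLE orders ALTER COLUMN public_id SET DEFAULT gen_random_uuid();`,
  // 4. Add NOT NULL last; on a populated column this is a Category 3 change
  `ALTER TABLE orders ALTER COLUMN public_id SET NOT NULL;`,
];
```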

Category 2: Data Backfill (Medium Risk)

Populating new columns, transforming existing data, copying between tables. These touch every row and compete with production traffic for I/O and locks.

```sql
-- DANGEROUS: updates all rows in a single transaction
UPDATE orders SET region = 'us-east-1' WHERE region IS NULL;

-- SAFE: batch update with explicit limits
DO $$
DECLARE
  batch_size INT := 5000;
  rows_updated INT;
BEGIN
  LOOP
    UPDATE orders
    SET region = 'us-east-1'
    WHERE id IN (
      SELECT id FROM orders
      WHERE region IS NULL
      LIMIT batch_size
      FOR UPDATE SKIP LOCKED
    );
    GET DIAGNOSTICS rows_updated = ROW_COUNT;
    EXIT WHEN rows_updated = 0;
    PERFORM pg_sleep(0.1); -- breathe between batches
  END LOOP;
END $$;
```

The SKIP LOCKED clause is critical... it prevents the backfill from blocking application queries that need the same rows. The pg_sleep between batches lets the WAL writer catch up and prevents replication lag from spiking.

Category 3: Structural Change (High Risk)

Renaming columns, changing types, splitting tables, adding NOT NULL constraints to populated columns. These require the expand-contract pattern... there is no safe shortcut.

| Operation | Risk | Duration (50M rows) | Lock Type |
| --- | --- | --- | --- |
| ADD COLUMN (nullable) | None | < 1ms | AccessExclusive (instant) |
| ADD COLUMN (with default) | None (PG 11+) | < 1ms | AccessExclusive (instant) |
| CREATE INDEX | Low | 5-30 min | ShareLock (blocks writes) |
| CREATE INDEX CONCURRENTLY | None | 10-60 min | ShareUpdateExclusive (doesn't block writes) |
| ALTER COLUMN TYPE | High | 15-90 min | AccessExclusive (blocks all) |
| ADD NOT NULL constraint | High | 10-30 min | AccessExclusive (blocks all) |
| Backfill 50M rows | Medium | 5-45 min | Row-level (per batch) |

The Expand-Contract Pattern

This is the single most important pattern for zero-downtime migrations. Every structural change follows three phases.

Phase 1: Expand

Add the new structure alongside the old. Both exist simultaneously. The application writes to both.

```typescript
// Migration: add new column
await db.schema.alterTable("users", (table) => {
  table.string("email_normalized").nullable();
});

// Application code: dual-write during transition
async function updateUserEmail(userId: string, email: string) {
  await db("users").where({ id: userId }).update({
    email: email,
    email_normalized: email.toLowerCase().trim(),
  });
}
```

Phase 2: Migrate

Backfill the new structure with existing data. Run in batches during low-traffic periods.

```typescript
// Backfill migration ... run as a background job, NOT a deployment step
async function backfillNormalizedEmails() {
  const batchSize = 5000;
  let processed = 0;

  while (true) {
    const count = await db.raw(
      `
      UPDATE users
      SET email_normalized = LOWER(TRIM(email))
      WHERE id IN (
        SELECT id FROM users
        WHERE email_normalized IS NULL
        LIMIT ?
        FOR UPDATE SKIP LOCKED
      )
      `,
      [batchSize]
    );

    if (count.rowCount === 0) break;
    processed += count.rowCount;

    // Log progress for observability
    logger.info(`Backfilled ${processed} email records`);

    // Let the database breathe
    await sleep(100);
  }
}
```

Phase 3: Contract

Once the new structure is fully populated and the application only reads from it, remove the old structure.

```sql
-- Only after verifying:
-- 1. No application code reads the old column
-- 2. Backfill is 100% complete
-- 3. Monitoring shows no errors for 48+ hours
ALTER TABLE users DROP COLUMN email;
ALTER TABLE users RENAME COLUMN email_normalized TO email;
```

The contract phase is where teams get impatient. Don't drop the old column in the same deploy cycle as the code change. Wait at least one full deployment cycle... preferably a week... before removing the old structure. If something goes wrong, you can revert the application code and the old column still has data.
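The verification gate before contracting can be encoded as an explicit check. A minimal sketch ... the inputs (backfill counts, error-free window, deploy cycles) would come from your own monitoring, and the interface and function names are assumptions:

```typescript
// Hypothetical readiness inputs, gathered from monitoring and deploy history.
interface ContractReadiness {
  rowsPendingBackfill: number;    // rows where the new column is still NULL
  hoursWithoutErrors: number;     // from your error-rate dashboard
  deploysSinceCodeChange: number; // full cycles since reads moved to new column
}

// Returns true only when it is safe to drop the old structure.
function canRunContractPhase(r: ContractReadiness): boolean {
  return (
    r.rowsPendingBackfill === 0 && // backfill 100% complete
    r.hoursWithoutErrors >= 48 &&  // monitoring quiet for 48+ hours
    r.deploysSinceCodeChange >= 1  // at least one full deployment cycle
  );
}
```

Wiring a check like this into the migration runner makes the waiting period a hard gate instead of a team convention.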


Index Creation: The Silent Killer

Index creation on large tables is the most common cause of migration-related downtime. A standard CREATE INDEX acquires a ShareLock that blocks all writes to the table for the duration of the build.

On a 50M-row table, that's 5-30 minutes of blocked inserts and updates.

The Fix: CONCURRENTLY

```sql
-- NEVER do this in production
CREATE INDEX idx_orders_customer_id ON orders(customer_id);

-- ALWAYS do this
CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders(customer_id);
```

CONCURRENTLY takes 2-3x longer but doesn't block writes. The tradeoff is worth it in every production scenario I've encountered.

Gotchas with CONCURRENTLY:

  1. Cannot run inside a transaction. Most migration tools wrap operations in transactions by default. You need to configure them to run this statement outside a transaction block.
  2. Can fail silently. If the build fails partway through, PostgreSQL marks the index as INVALID. Always verify after creation:
```sql
SELECT indexrelid::regclass, indisvalid
FROM pg_index
WHERE NOT indisvalid;
```
  3. Replication lag. On replicas, the index build can cause lag spikes. Monitor pg_stat_replication during the build.
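The invalid-index check is worth automating as a post-migration step. A minimal sketch in TypeScript, assuming you've already fetched rows shaped like the query above (the row type and function names are hypothetical):

```typescript
// One row per index, mirroring the pg_index validity query.
interface IndexStatus {
  indexName: string; // indexrelid::regclass
  isValid: boolean;  // pg_index.indisvalid
}

// A failed CREATE INDEX CONCURRENTLY leaves an INVALID index behind;
// it must be dropped before the build can be retried.
function invalidIndexCleanup(rows: IndexStatus[]): string[] {
  return rows
    .filter((r) => !r.isValid)
    .map((r) => `DROP INDEX CONCURRENTLY IF EXISTS ${r.indexName};`);
}
```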

Migration Tool Configuration

Every major migration tool handles CONCURRENTLY differently:

| Tool | Configuration |
| --- | --- |
| Alembic | `op.execute('CREATE INDEX CONCURRENTLY ...')` with `autocommit=True` in migration context |
| Prisma Migrate | Native support added in 2025 (previously required raw SQL workaround) |
| Knex | `knex.raw('CREATE INDEX CONCURRENTLY ...')` outside transaction |
| Django | `AddIndexConcurrently` operation (built-in since Django 3.0) |
| Rails | `add_index :table, :column, algorithm: :concurrently` with `disable_ddl_transaction!` |

Testing Against Production-Scale Data

The gap between development and production databases is where migrations fail. A migration that runs in 200ms on 10,000 rows takes 45 minutes on 50 million rows. The relationship isn't linear... it's often worse than linear due to index maintenance, WAL generation, and checkpoint pressure.

The Production-Scale Testing Playbook

Step 1: Create a production-sized test database.

```shell
# Dump production schema + anonymized data
pg_dump --schema-only production_db > schema.sql

# pg_dump has no row filter, so use \copy for a bounded data export
psql production_db -c "\copy (SELECT * FROM orders WHERE created_at > '2025-01-01') TO STDOUT" | \
  sed 's/[a-zA-Z0-9._%+-]*@[a-zA-Z0-9.-]*/redacted@test.com/g' > data.sql
```

Never use actual production data for testing. Anonymize emails, names, and any PII. The schema and row counts are what matter for migration timing.
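If your dump pipeline runs through application code instead of sed, the same substitution is a one-liner. A minimal sketch using an equivalent email pattern (the regex and replacement address are assumptions, and this is not a complete PII strategy ... extend it for names and other fields):

```typescript
// Replace anything that looks like an email with a fixed test address.
// Mirrors the sed expression from the dump pipeline.
const EMAIL_PATTERN = /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+/g;

function anonymizeEmails(dump: string): string {
  return dump.replace(EMAIL_PATTERN, "redacted@test.com");
}
```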

Step 2: Time the migration.

```shell
# Run with timing enabled
psql test_db -c "\timing on" -f migration.sql
```

Step 3: Compare against your maintenance window.

If the migration takes longer than 30 seconds, it needs the expand-contract pattern. If it takes longer than 5 minutes, it needs batched execution. If it takes longer than 30 minutes, it needs a dedicated maintenance plan.

The 30-Second Rule

Any migration that takes more than 30 seconds on production-scale data is not safe for a standard deployment. This is the threshold I use with every advisory client. Below 30 seconds, you can run it during a deploy. Above 30 seconds, you need a strategy.
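These thresholds can be encoded directly so the decision is mechanical rather than a judgment call made at deploy time. A sketch ... the function and type names are mine, not part of any tool:

```typescript
// Strategy tiers from the 30-second rule and the thresholds above.
type MigrationStrategy =
  | "standard-deploy"   // < 30 seconds: safe inside a normal deploy
  | "expand-contract"   // 30 seconds to 5 minutes
  | "batched-execution" // 5 to 30 minutes
  | "maintenance-plan"; // > 30 minutes: dedicated maintenance plan

// Map a duration measured against production-scale data to a strategy.
function strategyForDuration(seconds: number): MigrationStrategy {
  if (seconds < 30) return "standard-deploy";
  if (seconds < 5 * 60) return "expand-contract";
  if (seconds < 30 * 60) return "batched-execution";
  return "maintenance-plan";
}
```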


Rollback Strategies That Actually Work

"Just revert the migration" is the most dangerous phrase in database operations. Reversing a migration is rarely the inverse of applying it.

Forward-Only Migrations

For expand-contract patterns, rollback means deploying the previous application code... not reverting the database. The expanded schema works with both old and new code. This is why the contract phase waits a full deployment cycle.

Snapshot Rollback

For high-risk structural changes, take a point-in-time snapshot before the migration:

```shell
# PostgreSQL point-in-time recovery setup
pg_basebackup -D /backup/pre-migration -Ft -z -P

# Or use cloud-native snapshots
aws rds create-db-snapshot \
  --db-instance-identifier production \
  --db-snapshot-identifier pre-migration-2026-03-01
```

The snapshot gives you a guaranteed rollback point. The cost is storage (snapshots on RDS are $0.095/GB-month for storage beyond the free allocation equal to your provisioned database size) and recovery time (15-45 minutes for a 500GB database). Both are cheaper than a botched migration.

The Rollback Decision Framework

| Migration Type | Rollback Strategy | Recovery Time |
| --- | --- | --- |
| Add nullable column | Drop column (safe) | < 1 second |
| Add index CONCURRENTLY | Drop index (safe) | < 1 second |
| Backfill data | Re-run backfill with old values | Minutes to hours |
| Rename column | Rename back (requires code revert) | < 1 second + deploy |
| Change column type | Restore from snapshot | 15-45 minutes |
| Drop column | Restore from snapshot | 15-45 minutes |

Multi-Tenant Migration Considerations

SaaS applications with tenant-level schema isolation face unique migration challenges. If you're using PostgreSQL schemas per tenant (one of the patterns from multi-tenancy with Prisma and RLS), every migration runs N times... once per tenant.

The Cascading Risk

A migration on 1,000 tenant schemas that takes 2 seconds each takes 33 minutes total. If any single tenant's migration fails, you have a partially migrated database with inconsistent schemas across tenants.
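The arithmetic generalizes: wall-clock time is roughly (tenants ÷ concurrency) × per-tenant duration. A quick estimator ... a sketch only, since real runs vary with connection pool pressure and per-tenant data size:

```typescript
// Estimate wall-clock seconds for per-tenant migrations run in
// parallel batches of `concurrency` tenants at a time.
function estimateTenantMigrationSeconds(
  tenantCount: number,
  secondsPerTenant: number,
  concurrency: number
): number {
  const batches = Math.ceil(tenantCount / concurrency);
  return batches * secondsPerTenant;
}
```

At 1,000 tenants and 2 seconds each, serial execution costs about 33 minutes; a concurrency of 10 brings that to roughly 3.3 minutes.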

The Parallel Approach

```typescript
// Run tenant migrations in controlled parallel batches
async function migrateTenants(migrationSql: string) {
  const tenants = await db("tenants").select("schema_name");
  const concurrency = 10; // Don't overwhelm the connection pool

  for (let i = 0; i < tenants.length; i += concurrency) {
    const batch = tenants.slice(i, i + concurrency);
    await Promise.all(
      batch.map(async (tenant) => {
        try {
          // Use a transaction so SET search_path and the migration
          // are guaranteed to run on the same pooled connection
          await db.transaction(async (trx) => {
            await trx.raw(`SET LOCAL search_path TO ??`, [tenant.schema_name]);
            await trx.raw(migrationSql);
          });
        } catch (error) {
          logger.error(`Migration failed for tenant ${tenant.schema_name}`, error);
          // Record failure ... don't abort entire batch
          await recordMigrationFailure(tenant.schema_name, error);
        }
      })
    );
    logger.info(`Migrated ${Math.min(i + concurrency, tenants.length)}/${tenants.length} tenants`);
  }
}
```

Key decisions:

  • Concurrency of 10 keeps connection pool pressure manageable
  • Individual error handling prevents one tenant's failure from blocking all others
  • Failure recording enables targeted retry for failed tenants

The Migration Checklist

Before running any migration in production:

Pre-Flight:

  • Migration tested against production-scale data (row counts and indexes match)
  • Execution time measured and within maintenance window
  • Rollback plan documented and tested
  • Database snapshot taken (or point-in-time recovery verified)
  • Monitoring dashboards open (connection pool, query latency, replication lag)

Execution:

  • Run during lowest-traffic period (check analytics, not assumptions)
  • Schema changes separated from data changes
  • Indexes created with CONCURRENTLY
  • Backfills batched with SKIP LOCKED and sleep intervals

Post-Flight:

  • Verify no invalid indexes (pg_index WHERE NOT indisvalid)
  • Verify replication lag returned to baseline
  • Verify application error rates unchanged
  • Wait one full deployment cycle before contract phase

When to Apply This

  • You're running migrations on tables with more than 1 million rows
  • Your application serves traffic 24/7 with no maintenance window
  • You're operating in a multi-tenant environment with shared or per-tenant schemas
  • Your team has been burned by migration-related downtime (once is enough)

When NOT to Apply This

  • Pre-launch with no production traffic... run migrations however you want
  • Tables under 100K rows... standard migrations complete in under a second
  • Development and staging environments... save the complexity for production

Running a SaaS that's outgrowing simple migrations? I help teams design zero-downtime migration strategies that scale with their data.


Continue Reading

This post is part of the Performance Engineering Playbook ... covering TTFB optimization, Core Web Vitals, caching strategies, and monitoring at scale.
