Zero-Downtime Database Migrations: A Step-by-Step Playbook for Production PostgreSQL

At some point, every growing application needs to change its database schema. Add a column, rename a table, split one table into two, change a data type, add an index on a table with 50 million rows. In the early days, you run the migration during a quiet hour and accept 30 seconds of downtime. But once your application serves customers across multiple time zones, processes payments around the clock, or powers an API that other services depend on, there is no quiet hour. Every minute of downtime is a business impact. Zero-downtime database migrations are not optional for production systems. They are a core engineering discipline, and PostgreSQL provides all the tools you need to make every schema change without interrupting service.

Why Schema Changes Cause Downtime

PostgreSQL uses locks to maintain data consistency during schema modifications. The most disruptive lock is ACCESS EXCLUSIVE, which blocks all reads and writes to the affected table. Operations that acquire this lock include ALTER TABLE ... ADD COLUMN with a DEFAULT value (on PostgreSQL versions before 11), ALTER TABLE ... ALTER COLUMN TYPE, ALTER TABLE ... RENAME, and DROP TABLE. On a table with millions of rows, these operations can hold the lock for seconds to minutes while PostgreSQL rewrites the table data.

Even operations that acquire less restrictive locks can cause downtime indirectly. CREATE INDEX locks the table against writes for the duration of the index build, which can take minutes or hours on large tables. ALTER TABLE ... ADD COLUMN with a NOT NULL constraint (without a default) fails entirely if existing rows violate the constraint. Adding a foreign key constraint requires a full table scan to validate existing data, blocking writes during the scan.

The lock queue compounds the problem. When a migration acquires an ACCESS EXCLUSIVE lock, every subsequent query on that table queues behind it. Even if the migration itself takes only 5 seconds, if 200 queries queue up during those 5 seconds, the effective downtime is much longer as the queue drains. For high-traffic tables receiving hundreds of queries per second, a 5-second exclusive lock can cascade into 30 seconds or more of degraded performance.

The Expand-and-Contract Pattern

The expand-and-contract pattern (also called parallel change) is the fundamental technique for zero-downtime schema evolution. Instead of making a breaking change in one step, you split it into three phases: expand (add the new structure alongside the old), migrate (move data from old to new and update application code), and contract (remove the old structure). Each phase is independently deployable, reversible, and non-breaking.

Consider renaming a column from "username" to "display_name." A naive approach runs ALTER TABLE users RENAME COLUMN username TO display_name, which breaks every query referencing "username" until the application code is also deployed. The expand-and-contract approach works differently. Phase one (expand): add the new column with ALTER TABLE users ADD COLUMN display_name VARCHAR(255), then backfill it with UPDATE users SET display_name = username WHERE display_name IS NULL (in batches to avoid long-running transactions). Add a database trigger that copies new writes to both columns. Phase two (migrate): deploy application code that reads from display_name and writes to both columns. Verify that display_name is fully populated and the application is functioning correctly. Phase three (contract): deploy application code that only references display_name, remove the database trigger, and drop the old column with ALTER TABLE users DROP COLUMN username.

This approach takes three deployments instead of one, but each deployment is safe. If phase two reveals a problem, you roll back the application to read from the original column. The data is still there. No downtime, no data loss, no coordination required between database and application deployments.

Safe Operations in PostgreSQL

Several common schema changes are safe by default in modern PostgreSQL (version 11 and later) and do not require the expand-and-contract pattern. Adding a nullable column without a default is nearly instant regardless of table size because PostgreSQL only updates the catalog metadata without rewriting table data. Adding a column with a DEFAULT value (PostgreSQL 11 and later) is also instant because the default is stored in the catalog and applied on read rather than by rewriting every existing row.

Creating an index concurrently using CREATE INDEX CONCURRENTLY avoids the write lock that a standard CREATE INDEX acquires. The concurrent variant takes longer (two table scans instead of one) but allows reads and writes to continue throughout the build. Always use CONCURRENTLY for index creation on production tables. The trade-off is that if the build fails partway through, you are left with an invalid index that must be dropped and rebuilt. Check for invalid indexes after every concurrent build with a query against pg_index.

Adding a CHECK constraint or NOT NULL constraint with NOT VALID tells PostgreSQL to enforce the constraint on new writes without validating existing rows. You then validate existing data separately with ALTER TABLE ... VALIDATE CONSTRAINT, which acquires a weaker lock (SHARE UPDATE EXCLUSIVE) that allows concurrent reads and writes. This two-step approach is dramatically faster for large tables compared to adding a validated constraint in a single statement.

Dangerous Operations and Their Safe Alternatives

Changing a column type (ALTER COLUMN ... TYPE) is one of the most dangerous operations because PostgreSQL rewrites the entire table while holding an ACCESS EXCLUSIVE lock. The safe alternative uses the expand-and-contract pattern: add a new column with the desired type, backfill data from the old column in batches, switch application reads and writes to the new column, then drop the old column.

Adding a NOT NULL constraint to an existing column that might contain nulls requires backfilling null values first. Run the backfill in batches: UPDATE users SET phone = 'unknown' WHERE phone IS NULL AND id BETWEEN 1 AND 10000. After backfilling all rows, add the constraint with NOT VALID and then validate it separately. Never run a single UPDATE that touches all rows of a large table, as the resulting transaction holds locks and generates WAL that can impact replication lag and disk usage.

Dropping a column that is still referenced by application code causes immediate errors. Always deploy application code that stops referencing the column before running the DROP COLUMN migration. For additional safety, add the column to the EXCLUDE list using a database-level mechanism or simply leave it in place and clean it up during a future maintenance cycle. Unused columns consume minimal storage and pose no performance risk if they are not indexed.

Batch Backfills Without Locking

Large data migrations (backfilling a new column, transforming existing data) must be executed in batches to avoid long-running transactions that bloat the table, increase replication lag, and risk lock timeouts. The standard pattern processes rows in ranges of 1,000 to 10,000 at a time with a short sleep between batches to allow other transactions to proceed.

A practical backfill script uses a loop that selects the next batch of rows by primary key range, updates them within a small transaction, commits, sleeps for 100 to 500 milliseconds, and repeats. In PostgreSQL, this looks like a DO block or an external script that tracks progress. Monitor replication lag during the backfill and increase the sleep interval if lag exceeds your threshold. For tables with tens of millions of rows, a well-tuned batch backfill takes hours instead of the minutes a single UPDATE would take, but it completes without impacting application performance.

For complex data transformations, consider using pg_repack, which can rebuild a table and its indexes without holding exclusive locks for the duration. pg_repack creates a new copy of the table, replays changes that occurred during the copy using triggers, and swaps the old and new tables atomically. It requires briefly acquiring an ACCESS EXCLUSIVE lock for the swap, but this lock is held for milliseconds rather than the minutes required for a full table rewrite.

Migration Tooling and Process

Your migration framework should enforce safety rules automatically. Configure your migration runner to set a statement timeout (e.g., SET statement_timeout = '5s') before running each migration. If any individual statement takes longer than 5 seconds, it is rolled back automatically rather than holding a lock indefinitely. This timeout catches dangerous operations that would block the table for extended periods and forces the developer to use a safe alternative.

Lock timeout (SET lock_timeout = '3s') prevents migrations from waiting indefinitely to acquire a lock. If the table is under heavy load and the lock cannot be acquired within 3 seconds, the migration fails fast rather than queuing behind hundreds of pending queries. Retry the migration during a period of lower traffic or use advisory locks to coordinate with application code.

Code review every migration against a safety checklist: Does any statement acquire an ACCESS EXCLUSIVE lock on a large table? Is CREATE INDEX CONCURRENTLY used instead of CREATE INDEX? Are backfills batched? Are constraints added with NOT VALID? Is there a rollback plan? Automated linters like strong_migrations (Ruby), django-pg-zero-downtime-migrations (Python), or custom CI checks can catch unsafe patterns before they reach production.

MAPL TECH helps teams implement zero-downtime deployment pipelines for PostgreSQL and other production databases. From migration tooling setup to complex schema evolution planning, we ensure your database changes never impact your users. Explore our cloud engineering services or contact us to audit your current migration process.

Zero-Downtime Database Migrations: A Step-by-Step Playbook for Production PostgreSQL

Why Schema Changes Cause Downtime

The Expand-and-Contract Pattern

Safe Operations in PostgreSQL

Dangerous Operations and Their Safe Alternatives

Batch Backfills Without Locking

Migration Tooling and Process

Related Articles

Database Read Replicas: When and How to Use Them for Growing Applications

Multi-Region Deployment Strategies for African and Caribbean Markets

Edge Computing Patterns That Cut Cloud Costs Without Adding Complexity