Database Migration Plan

Write a safe, zero-downtime database migration plan for a schema change. Use when asked to plan a database migration, design a zero-downtime schema change, document an expand/contract migration, produce a rollback procedure for a database change, or coordinate a database schema update with a deployment. Produces a structured migration plan covering migration objectives, backward compatibility analysis, expand/contract phase breakdown, exact SQL, rollback steps per phase, data validation queries, and a deployment runbook.

Published by @Mohit Aggarwal·0 agent reads / 30d·0 saves·

Database Migration Plan Skill

Produce a complete, safe database migration plan for a schema change. A migration plan is not just the SQL — it is a coordinated sequence of steps that ensures the application stays available, data stays consistent, and every step can be rolled back independently.

The expand/contract pattern is the default approach: expand the schema to support both old and new states, migrate the application, then contract to remove the old state. Never combine schema changes and data backfills in a single migration that runs during deployment.

Required Inputs

Ask for these if not already provided:

  • Current schema state — the DDL or description of the table(s) as they are now
  • Target schema state — the DDL or description of what the table(s) should look like after migration
  • Migration reason — why this change is being made (new feature, performance fix, normalization, compliance)
  • Database engine — PostgreSQL, MySQL, SQLite, CockroachDB, etc.
  • Estimated data volume — approximate number of rows in affected tables
  • Deployment constraints — is any downtime allowed? What is the expected traffic level during migration? Are there multiple app instances running?
  • Rollback window — how long after deploy can the team roll back before the migration becomes irreversible?

Output Format


Database Migration Plan: [Migration Name]

Service: [Name] | Team: [Team name] Author: [Name] | Reviewed by: [Name / DBA] Date: [Date] | Target deploy date: [Date] Database engine: [PostgreSQL X.X / MySQL X.X] Ticket: [JIRA-XXX]


1. Migration Overview

What is changing: [1–2 sentences: the specific schema change — e.g. "Adding a non-nullable organisation_id column to the users table and backfilling it from the accounts table."]

Why: [1–2 sentences: the business or technical reason driving the change.]

Migration type: [Additive only / Additive + backfill / Column rename / Column type change / Table restructure / Index change]

Zero-downtime: [Yes — using expand/contract / No — requires maintenance window — state duration]

Estimated migration duration:

  • Expand phase: [~X minutes]
  • Data backfill: [~X minutes/hours — based on X rows at Y rows/second]
  • Contract phase: [~X minutes after app version deployed]

2. Backward Compatibility Analysis

Before writing a single line of SQL, assess whether each change is backward compatible with the currently deployed application code.

ChangeBackward compatible?RiskNotes
[e.g. Add nullable column org_id]YesLowOld app ignores new column
[e.g. Backfill org_id]YesMediumOld app unaffected; new app reads backfilled values
[e.g. Add NOT NULL constraint to org_id]NoHighOld app that inserts without org_id will fail
[e.g. Drop old column account_id]NoHighOld app that reads account_id will fail
[e.g. Add index on org_id]YesLowAdditive; no breaking change
[e.g. Rename column]NoHighNever rename in one step; use expand/contract

Summary: [e.g. "This migration requires the expand/contract pattern across 3 deployment phases because steps 3 and 4 are not backward compatible."]


3. Expand/Contract Phases

Phase Overview

Phase 1 — EXPAND
  Deploy migration: add new column (nullable), create new indexes
  Old app: continues to work (ignores new column)
  New app: not yet deployed
  Duration: [~X min] | Rollback: trivial — drop new column

       │
       ▼

Phase 2 — BACKFILL + DUAL-WRITE
  Deploy app update: writes to both old and new columns
  Run backfill: populate new column for existing rows
  Validate: confirm 100% of rows have non-null new column
  Duration: [~X hours depending on data volume]
  Rollback: deploy previous app version; new column is still nullable

       │
       ▼

Phase 3 — ENFORCE + SWITCH
  Deploy migration: add NOT NULL constraint, drop old column/index
  Deploy app update: reads only from new column
  Duration: [~X min] | Rollback: requires forward-fix (constraint must be dropped first)

       │
       ▼

Phase 4 — CONTRACT (optional cleanup)
  Deploy migration: drop deprecated columns, rename if needed
  Final state matches target schema
  Rollback: not recommended — contract changes are destructive

Phase 1 — Expand Schema

Goal: Add the new column and structures without breaking the existing application. Deploy order: Run migration first, then (optionally) deploy app. Application state: Old app running; no app changes required yet.

-- Migration: 001_add_org_id_to_users.sql
BEGIN;

-- Add nullable column (safe — old app ignores it)
ALTER TABLE users
    ADD COLUMN org_id UUID NULL
        REFERENCES organisations(id) ON DELETE RESTRICT;

-- Add index NOW, not in Phase 3 — building index on large table during Phase 3 is risky
CREATE INDEX CONCURRENTLY users_org_id_idx ON users (org_id);

-- Note: CONCURRENTLY does not lock the table; safe on live traffic
-- Note: Cannot run CONCURRENTLY inside a transaction block; run separately if needed

COMMIT;

Validation after Phase 1:

-- Confirm column exists and is nullable
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'users' AND column_name = 'org_id';
-- Expected: is_nullable = 'YES'

-- Confirm index exists
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'users' AND indexname = 'users_org_id_idx';

Rollback (Phase 1 only):

BEGIN;
DROP INDEX CONCURRENTLY IF EXISTS users_org_id_idx;
ALTER TABLE users DROP COLUMN IF EXISTS org_id;
COMMIT;

Phase 2 — Backfill Existing Data

Goal: Populate the new column for all existing rows before enforcing NOT NULL. When to run: After Phase 1 is live and stable. Can be run as a background job or a one-time script. Application state: Deploy app version that dual-writes to both old and new columns.

App code change required:

// All INSERT and UPDATE operations must now set BOTH old_column and new_column
// until Phase 3 is complete. This ensures new rows are populated during the backfill window.

Backfill script — batch processing:

-- Run in batches to avoid locking. Adjust batch size based on table size and DB load.
-- Target: no single batch takes more than 5 seconds.

DO $$
DECLARE
    batch_size  INT := 1000;
    affected    INT;
BEGIN
    LOOP
        UPDATE users
        SET    org_id = accounts.organisation_id
        FROM   accounts
        WHERE  users.account_id = accounts.id
          AND  users.org_id IS NULL
        LIMIT  batch_size;

        GET DIAGNOSTICS affected = ROW_COUNT;
        EXIT WHEN affected = 0;

        -- Pause between batches to avoid saturating I/O
        PERFORM pg_sleep(0.1);
    END LOOP;
END $$;

Monitoring during backfill:

-- Check progress — run periodically during backfill
SELECT
    COUNT(*) FILTER (WHERE org_id IS NOT NULL) AS backfilled,
    COUNT(*) FILTER (WHERE org_id IS NULL)     AS remaining,
    COUNT(*)                                   AS total,
    ROUND(
        100.0 * COUNT(*) FILTER (WHERE org_id IS NOT NULL) / COUNT(*), 2
    ) AS pct_complete
FROM users;

Backfill completion validation:

-- Must return 0 before proceeding to Phase 3
SELECT COUNT(*) AS unbackfilled_rows
FROM users
WHERE org_id IS NULL;

-- Confirm no new rows written without org_id (dual-write working)
SELECT COUNT(*) AS recent_missing
FROM users
WHERE org_id IS NULL
  AND created_at > now() - INTERVAL '1 hour';

Rollback (Phase 2 — app only):

  • Deploy previous app version (single-write to old column)
  • org_id column remains nullable; no data is lost
  • Backfilled values remain; harmless

Phase 3 — Enforce Constraints

Goal: Add NOT NULL constraint and remove dependency on the old column. Prerequisites: Phase 2 backfill must be 100% complete (zero rows with org_id IS NULL). Deploy order: Run migration, then deploy app version that reads only from org_id.

PostgreSQL — use NOT VALID + VALIDATE for large tables:

-- Step 1: Add constraint as NOT VALID (no full table scan — instant)
ALTER TABLE users
    ADD CONSTRAINT users_org_id_not_null
    CHECK (org_id IS NOT NULL) NOT VALID;

-- Step 2: VALIDATE CONSTRAINT (takes a SHARE UPDATE EXCLUSIVE lock — allows reads and writes)
-- Run this separately, as it can take minutes on large tables
ALTER TABLE users
    VALIDATE CONSTRAINT users_org_id_not_null;

-- Step 3: Once validated, convert to actual NOT NULL
-- (PostgreSQL trusts the validated check constraint — this is instant)
ALTER TABLE users
    ALTER COLUMN org_id SET NOT NULL;

-- Step 4: Drop the now-redundant check constraint
ALTER TABLE users
    DROP CONSTRAINT users_org_id_not_null;

Validation after Phase 3:

-- Confirm NOT NULL is enforced
SELECT column_name, is_nullable
FROM information_schema.columns
WHERE table_name = 'users' AND column_name = 'org_id';
-- Expected: is_nullable = 'NO'

-- Test that insert without org_id fails (run in a transaction and roll back)
BEGIN;
INSERT INTO users (email) VALUES ('[email protected]');
-- Expected: ERROR: null value in column "org_id" violates not-null constraint
ROLLBACK;

Rollback (Phase 3):

-- Drop the NOT NULL constraint (restores nullable state)
ALTER TABLE users ALTER COLUMN org_id DROP NOT NULL;
-- Then deploy previous app version (dual-write)
-- Note: Once app code reading the new column is live, rolling back the constraint
-- without rolling back the app will cause issues — plan this carefully.

Phase 4 — Contract (Remove Old Column)

Goal: Remove the old column once the app no longer references it. Prerequisites: Phase 3 fully deployed and stable for at least [X days/hours rollback window]. Warning: This phase is destructive — the old column's data is permanently deleted.

BEGIN;

-- Drop the old column
ALTER TABLE users DROP COLUMN account_id;

-- Drop any indexes that referenced the old column
DROP INDEX IF EXISTS users_account_id_idx;

COMMIT;

Pre-drop validation:

-- Confirm no application queries still reference the old column
-- (Check this in code review and via a search of the codebase before running)
-- grep -r "account_id" app/

-- Confirm the column is safe to drop
SELECT COUNT(*) FROM users WHERE account_id IS NOT NULL;
-- Should be 0 (or irrelevant once new column is canonical)

Rollback: Not straightforward — dropped column data cannot be recovered. Only proceed to Phase 4 after the rollback window has passed and the change is confirmed stable.


4. Data Validation Plan

Run these queries before and after the full migration to confirm data integrity.

Pre-migration baseline:

-- Record these values before any migration step
SELECT COUNT(*)   AS total_users FROM users;
SELECT COUNT(*)   AS total_orgs  FROM organisations;
SELECT MIN(created_at), MAX(created_at) FROM users;

-- Check for any anomalies in the source data before backfill
SELECT COUNT(*) AS users_without_account
FROM users WHERE account_id IS NULL;

Post-backfill integrity check:

-- All users have an org that exists
SELECT COUNT(*) AS orphaned_org_refs
FROM users u
WHERE u.org_id IS NOT NULL
  AND NOT EXISTS (
      SELECT 1 FROM organisations o WHERE o.id = u.org_id
  );
-- Expected: 0

-- org_id matches expected value from source column
SELECT COUNT(*) AS mismatched_backfill
FROM users u
JOIN accounts a ON u.account_id = a.id
WHERE u.org_id != a.organisation_id;
-- Expected: 0

-- Row count unchanged (no rows created or deleted by migration)
SELECT COUNT(*) AS total_users_after FROM users;
-- Must match pre-migration baseline

Post-contract final check:

-- Old column is gone
SELECT COUNT(*) FROM information_schema.columns
WHERE table_name = 'users' AND column_name = 'account_id';
-- Expected: 0

-- New column is NOT NULL
SELECT is_nullable FROM information_schema.columns
WHERE table_name = 'users' AND column_name = 'org_id';
-- Expected: NO

5. Performance Impact Assessment

StepLock typeLock durationTraffic impact
Add nullable columnACCESS EXCLUSIVEMillisecondsNegligible
CREATE INDEX CONCURRENTLYSHARE UPDATE EXCLUSIVEMinutes (proportional to table size)Reads and writes continue
Batch backfillRow-level locks only<5s per batchLow if batches are small
ADD CONSTRAINT NOT VALIDACCESS EXCLUSIVEMillisecondsNegligible
VALIDATE CONSTRAINTSHARE UPDATE EXCLUSIVEMinutesReads and writes continue
ALTER COLUMN SET NOT NULLACCESS EXCLUSIVEMilliseconds (if check constraint validated)Negligible
DROP COLUMNACCESS EXCLUSIVEMillisecondsNegligible

Expected load increase during backfill:

  • DB CPU: [estimated % increase during batch writes]
  • DB I/O: [estimated increase]
  • Monitoring threshold to pause backfill: [e.g. DB CPU > 80% for >2 minutes]

Backfill rate estimate:

  • Table size: [X million rows]
  • Batch size: [1000 rows]
  • Pause between batches: [100ms]
  • Estimated total duration: [X hours at Y rows/second]

6. Deployment Runbook

Follow this checklist on the day of migration. Mark each step as done before proceeding.

Pre-migration (day before):

  • DBA / tech lead has reviewed the migration plan
  • Performance impact assessed; monitoring dashboards ready
  • Backfill script tested on a staging DB with production-scale data
  • Rollback procedure tested on staging
  • On-call engineer briefed; Slack channel [#db-migrations] set up for coordination
  • Maintenance window scheduled (if required)

Phase 1 — Expand (T+0):

  • Take a manual DB snapshot / verify automated backup is recent
  • Run 001_expand_add_org_id.sql on production
  • Run Phase 1 validation queries — confirm pass
  • Deploy app version with dual-write
  • Monitor error rate for [10 minutes]

Phase 2 — Backfill (T+[X hours]):

  • Confirm Phase 1 has been stable for [X hours]
  • Start backfill script in a screen/tmux session
  • Monitor progress via backfill progress query every [5 minutes]
  • Monitor DB CPU and I/O — pause if thresholds exceeded
  • Run completion validation — confirm 0 unbackfilled rows
  • Run integrity checks — confirm 0 orphaned refs, 0 mismatches

Phase 3 — Enforce (T+[X days]):

  • Confirm backfill 100% complete and stable for [X hours]
  • Add NOT VALID constraint
  • Run VALIDATE CONSTRAINT (monitor duration and lock waits)
  • Alter column to NOT NULL
  • Run Phase 3 validation queries
  • Deploy app version reading only from new column
  • Monitor error rate for [30 minutes]

Phase 4 — Contract (T+[X days after rollback window]):

  • Confirm rollback window has passed — no incidents, no rollback needed
  • Search codebase for references to old column — confirm zero
  • Run DROP COLUMN migration
  • Run final integrity checks
  • Close migration ticket; update schema documentation

Quality Checks

  • Every migration phase has an independent rollback procedure — no phase assumes the next one has run
  • Batch backfill script includes a pause between batches to avoid saturating I/O
  • NOT NULL constraints use the NOT VALID + VALIDATE pattern on tables with >100k rows
  • The app dual-write period is explicitly defined — old column writes are not dropped until Phase 3 is deployed
  • Data validation queries include a row count check to confirm no data loss
  • Lock types are identified for every DDL statement — no "should be fine" assumptions
  • The deployment runbook names who runs each step, not just what to run
  • Phase 4 (contract) is explicitly gated on the rollback window passing — not run on the same day as Phase 3

Anti-Patterns

  • Do not combine the expand and contract phases into a single deployment — they must be separated by a deployment cycle
  • Do not run DDL changes without first testing on a production-sized data clone
  • Do not skip the NOT VALID + VALIDATE pattern for constraint additions on large tables — it causes full table locks
  • Do not define a rollback as "restore from backup" — each phase must have an explicit, fast rollback procedure
  • Do not omit dual-write logic during the transition period — removing the old column before all writers are updated causes data loss

Bundled with this artifact

1 file

Reference files that ship alongside this artifact. Agents pull these in only when the task needs them.

More on the bench

SKILL0

Google Cloud Waf Sustainability

Generates sustainability-focused guidance for Google Cloud workloads based on the design principles and recommendations in the Google Cloud Well-Architected Framework (WAF). Use this skill to evaluate a workload, identify environmental impact requirements, and provide actionable recommendations to build, deploy, and manage the workload sustainably in Google Cloud.

software-engineering+2
0
SKILL0

Google Cloud Waf Reliability

Generates reliability-focused guidance for Google Cloud workloads based on the design principles and recommendations in the Google Cloud Well-Architected Framework. Use this skill to evaluate a workload, identify reliability requirements, and provide actionable recommendations for build, deploy, and manage the workload reliably in Google Cloud.

software-engineering+2
0
SKILL0

Google Cloud Waf Performance Optimization

Generates performance-focused guidance for Google Cloud workloads based on the design principles and recommendations in the Performance Optimization pillar of the Google Cloud Well-Architected Framework (WAF). Use this skill to evaluate a workload, identify performance requirements, and provide actionable recommendations for resource allocation, modular design, and elasticity.

software-engineering+2
0