Disaster Recovery Plan

Write a disaster recovery plan for a service or system — covering RPO/RTO targets, failure scenario runbooks, backup and restore procedures, DR testing cadence, and communication templates. Use when asked to write a DR plan, document failover procedures, create recovery runbooks, define RTO/RPO targets, or prepare for a disaster recovery game day. Produces a full DR document with per-scenario recovery runbooks, backup validation procedures, testing schedule, and communication templates.

Published by @Mohit Aggarwal

Disaster Recovery Plan Skill

Produce a complete disaster recovery plan for a service or system — giving engineers, SREs, and on-call responders everything they need to recover from a disaster scenario in the shortest possible time. A good DR plan is tested regularly, has exact commands (not vague instructions), and makes RTO/RPO targets measurable so the team knows whether recovery succeeded.

Required Inputs

Ask for these if not already provided:

  • Service name and what it does (business function and technical role)
  • Criticality tier — business impact of extended downtime (e.g. Tier 1 = revenue-critical, Tier 2 = ops impact, Tier 3 = internal only)
  • Current infrastructure setup — cloud provider, regions/zones, deployment model (Kubernetes, ECS, VMs, serverless)
  • RPO/RTO requirements — Recovery Point Objective (how much data loss is acceptable) and Recovery Time Objective (how long the service can be down)
  • Backup strategy — what is backed up, how often, where backups are stored, retention policy
  • On-call contacts — names and contact details for the responder chain

Output Format


Disaster Recovery Plan: [Service Name]

Team: [Team name] | Tech lead: [Name]
Criticality tier: [Tier 1 / Tier 2 / Tier 3] | Last tested: [Date]
Next DR test: [Date] | Document owner: [Name]
Last updated: [Date] | Review cycle: Quarterly

Emergency? Skip to Section 3 — Failure Scenario Runbooks. Find the scenario that matches your situation and follow the steps exactly.


1. Recovery Targets

| Target | Value | Rationale |
|---|---|---|
| RPO (Recovery Point Objective) | [X minutes/hours] | [e.g. "Last committed transaction; database replication is synchronous"] |
| RTO (Recovery Time Objective) | [Y minutes/hours] | [e.g. "Revenue impact begins at 30 min; target recovery in 15 min"] |
| MTTR target (non-disaster) | [Z minutes] | [Operational incidents, not DR events] |
| Data retention (backups) | [N days/weeks] | [Compliance requirement or operational policy] |
| Backup frequency | [Every X hours] | [RPO-driven: backup interval must be ≤ RPO] |

What these mean in practice:

  • If a database is corrupted, we can lose at most [X minutes] of transactions before the business impact is unacceptable.
  • The service must be operational again within [Y minutes/hours] of declaring a DR event.
  • If either target cannot be met, escalate to [Engineering Manager] immediately.
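
As a minimal sketch of how the "backup interval must be ≤ RPO" rule could be checked mechanically, the helper below compares a backup interval with the RPO, both in whole minutes. The function name `check_backup_interval` and the minute-based arguments are assumptions made for illustration, not part of any existing tool.

```shell
# Hypothetical helper: verify that the backup interval can satisfy the RPO.
# Arguments: RPO in minutes, backup interval in minutes.
check_backup_interval() {
  rpo_min=$1
  interval_min=$2
  if [ "$interval_min" -le "$rpo_min" ]; then
    echo "OK: backup every ${interval_min} min satisfies RPO of ${rpo_min} min"
  else
    echo "VIOLATION: backup every ${interval_min} min exceeds RPO of ${rpo_min} min"
    return 1
  fi
}
```

For example, `check_backup_interval 60 30` reports OK, while `check_backup_interval 15 60` reports a violation and returns a non-zero status, so it could serve as a gate in a review script.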

2. Failure Scenario Inventory

| Scenario | Likelihood | Impact | RTO target | RPO target | Runbook |
|---|---|---|---|---|---|
| Single availability zone failure | Medium | [Partial / Full outage] | [15 min] | [0, no data loss] | Section 3.1 |
| Full region failure | Low | Full outage | [60 min] | [5 min] | Section 3.2 |
| Database corruption / data loss | Low | Full outage | [90 min] | [RPO value] | Section 3.3 |
| Critical dependency outage | High | [Partial degradation] | [30 min] | [N/A] | Section 3.4 |
| Security breach / ransomware | Very low | Full outage + investigation | [4 hours] | [Last clean backup] | Section 3.5 |
| Accidental bulk data deletion | Low | Partial or full data loss | [60 min] | [RPO value] | Section 3.6 |

3. Failure Scenario Runbooks

3.1 Single Availability Zone Failure

Trigger: One AZ becomes unreachable; pods/instances in that zone stop responding.
Detection: PagerDuty alert [AlertName] fires, or cloud provider status page shows AZ degradation.
Expected RTO: [15 minutes] | Expected RPO: Zero (no data loss if multi-AZ replication is working)

Step 1 — Confirm the failure

# Check pod/instance health across zones
kubectl get pods -o wide -n [namespace] | grep -v Running

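# Check which nodes are affected: prints the header plus every node whose STATUS is not exactly "Ready"
# (a plain `grep -v Ready` would also hide NotReady nodes, because "NotReady" contains "Ready")
kubectl get nodes -o wide | awk 'NR == 1 || $2 != "Ready"'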

# Verify cloud provider AZ status
# AWS: https://health.aws.amazon.com/health/status
# GCP: https://status.cloud.google.com

Step 2 — Assess whether auto-recovery has occurred

# If using auto-scaling, check if replacement instances launched
kubectl get pods -n [namespace] --watch

# Check deployment replica count
kubectl get deployment [service-name] -n [namespace]

# Verify load balancer health checks are passing
[cloud provider CLI command to check target group health]

Step 3 — Force rescheduling if auto-recovery stalled

# Cordon the affected node so no new pods schedule on it
kubectl cordon [node-name]

# Drain the node — moves all pods to healthy nodes
kubectl drain [node-name] --ignore-daemonsets --delete-emptydir-data

# Verify pods have rescheduled successfully
kubectl get pods -o wide -n [namespace]

Step 4 — Verify service health

# Smoke test key endpoints
curl -s -o /dev/null -w "%{http_code}" https://[service-url]/health
curl -s -o /dev/null -w "%{http_code}" https://[service-url]/[critical-endpoint]

# Check error rate in monitoring
[dashboard link or query]

Recovery confirmed when: All pods are Running, health check returns 200, error rate is at baseline.
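
One way to turn the first recovery criterion into a number is to count unhealthy pods in the `kubectl get pods` listing. The sketch below assumes the default column layout for a single namespace (NAME, READY, STATUS, ...) and treats a pod as healthy only when STATUS is Running and every container is ready. `count_unhealthy_pods` is a hypothetical helper name, and under this rule Completed Job pods are counted as unhealthy.

```shell
# Reads `kubectl get pods` output on stdin and prints how many pods are
# either not Running or Running with fewer ready containers than expected.
count_unhealthy_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")            # READY column, e.g. "1/2"
    if ($3 != "Running" || ready[1] != ready[2]) n++
  }
  END { print n + 0 }'               # n + 0 prints 0 when nothing matched
}
```

Usage would look like `kubectl get pods -n [namespace] | count_unhealthy_pods`, with a result of 0 matching the "All pods are Running" criterion.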


3.2 Full Region Failure

Trigger: The primary region is entirely unavailable.
Detection: All service health checks failing, cloud provider status page confirms region-wide event.
Expected RTO: [60 minutes] | Expected RPO: [5 minutes, based on cross-region replication lag]

Step 1 — Confirm regional failure (5 minutes)

# Confirm the primary region is unreachable
# (-c 3 stops after three probes; without a count, ping on Linux runs until interrupted)
ping -c 3 [primary-region-endpoint] || echo "Primary region unreachable"

# Check replication lag on standby region database
[command to check replica lag, e.g. for RDS: read the CloudWatch ReplicaLag metric (namespace AWS/RDS) for [dr-replica-identifier] in [dr-region]]

Step 2 — Declare DR event and notify (2 minutes)

Post to #incidents:

🔴 DR EVENT — [Service Name] — Region Failure
Primary region: [region] — UNREACHABLE
Activating failover to: [dr-region]
Incident commander: [Name]
Next update: 15 minutes

Page [Engineering Manager] and [CTO/VP Eng] via PagerDuty.

Step 3 — Promote DR database (10 minutes)

# AWS RDS — promote read replica to primary
aws rds promote-read-replica \
  --db-instance-identifier [dr-replica-identifier] \
  --region [dr-region]

# Wait for promotion to complete
aws rds wait db-instance-available \
  --db-instance-identifier [dr-replica-identifier] \
  --region [dr-region]

# Record the new database endpoint
aws rds describe-db-instances \
  --db-instance-identifier [dr-replica-identifier] \
  --region [dr-region] \
  --query 'DBInstances[0].Endpoint.Address'

Step 4 — Deploy service in DR region (20 minutes)

# Update service configuration to point at DR database
kubectl set env deployment/[service-name] \
  DATABASE_URL=[new-dr-database-url] \
  -n [namespace] \
  --context [dr-region-context]

# Scale up the DR deployment
kubectl scale deployment/[service-name] --replicas=[N] \
  -n [namespace] \
  --context [dr-region-context]

# Verify all pods are running
kubectl get pods -n [namespace] --context [dr-region-context]

Step 5 — Cut over DNS / load balancer (5 minutes)

# Update DNS to point to DR region load balancer
# AWS Route 53:
aws route53 change-resource-record-sets \
  --hosted-zone-id [zone-id] \
  --change-batch file://dr-failover-dns.json

# Verify DNS propagation (may take up to [TTL] seconds)
dig [service-domain] @8.8.8.8
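
The Route 53 command above reads its change set from `dr-failover-dns.json`, which this runbook does not show. A minimal sketch of generating that file follows, assuming the service is fronted by a plain CNAME record. The function name, the example hostnames, and the TTL value are illustrative assumptions; an apex record or an alias record would need a different `ResourceRecordSet` shape.

```shell
# Prints a Route 53 change batch that repoints a CNAME at the DR load balancer.
# Arguments: record name, DR load balancer hostname, TTL in seconds.
write_failover_batch() {
  record_name=$1
  dr_target=$2
  ttl=$3
  cat <<EOF
{
  "Comment": "DR failover: repoint ${record_name} to the DR region",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "${record_name}",
        "Type": "CNAME",
        "TTL": ${ttl},
        "ResourceRecords": [
          { "Value": "${dr_target}" }
        ]
      }
    }
  ]
}
EOF
}
```

Running `write_failover_batch api.example.com dr-lb.example.net 60 > dr-failover-dns.json` ahead of time, and keeping the file under version control, avoids hand-writing JSON during an incident.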

Step 6 — Verify end-to-end

# Full smoke test against DR endpoint
curl -s https://[service-url]/health
[run automated smoke test suite if available]

Recovery confirmed when: DNS resolves to DR region, smoke tests pass, error rate is at baseline.

Post-failover actions (not urgent — after service is stable):

  • Do not fail back to primary until root cause is confirmed resolved
  • Document data loss window (check replication lag at time of failure)
  • Begin post-incident review — see [incident-postmortem skill]
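
The step headings in this runbook carry time budgets (5, 2, 10, 20, and 5 minutes), so it is worth checking that they fit inside the RTO. The sketch below simply adds them up; `check_step_budget` is a hypothetical helper, and the figures are the placeholder values from this template rather than measured timings.

```shell
# Sums per-step time budgets and compares the total with the RTO.
# Arguments: RTO in minutes, followed by one budget (minutes) per step.
check_step_budget() {
  rto_min=$1
  shift
  total=0
  for step in "$@"; do
    total=$((total + step))
  done
  if [ "$total" -le "$rto_min" ]; then
    echo "total=${total} rto=${rto_min} headroom=$((rto_min - total))"
  else
    echo "total=${total} rto=${rto_min} over=$((total - rto_min))"
    return 1
  fi
}
```

With the template values, `check_step_budget 60 5 2 10 20 5` prints `total=42 rto=60 headroom=18`, which leaves 18 minutes for Step 6, the only step without a stated budget.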

3.3 Database Corruption or Data Loss

Trigger: Data in the database is corrupted, deleted, or otherwise incorrect due to a software bug, operator error, or hardware fault.
Detection: Application errors referencing missing/invalid data, monitoring alerts on query error rate, user reports.
Expected RTO: [90 minutes] | Expected RPO: [Backup interval, e.g. 1 hour]

Step 1 — Stop the bleeding immediately

# Put the service into maintenance mode to prevent further writes to corrupted data
[command to enable maintenance mode — e.g. kubectl set env deployment/[name] MAINTENANCE_MODE=true]

# Or: scale down the service to zero to prevent writes
kubectl scale deployment/[service-name] --replicas=0 -n [namespace]

Step 2 — Assess scope of corruption

# Identify which tables/records are affected
[SQL query to check data integrity — e.g.]
# psql $DATABASE_URL -c "SELECT COUNT(*) FROM [table] WHERE [integrity check condition]"

# Determine when corruption started (cross-reference with deploy times and error logs)
[log query to find earliest error — e.g. in Datadog:]
# service:[service-name] status:error "[corruption error message]" | sort by timestamp asc

Step 3 — Identify the correct restore point

# List available backups
[command to list backups — e.g. for RDS:]
aws rds describe-db-snapshots \
  --db-instance-identifier [db-identifier] \
  --query 'DBSnapshots[*].[SnapshotCreateTime,DBSnapshotIdentifier]' \
  --output table

# Choose the most recent backup BEFORE corruption started
# Record the chosen snapshot ID: [snapshot-id]
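
Picking "the most recent backup before corruption started" by eye is error-prone under pressure, so the selection can be sketched as a small filter. The version below assumes input lines of the form `<timestamp> <snapshot-id>` (whitespace-separated, as a text listing of the two queried fields would give) and assumes every timestamp uses the same UTC ISO-8601 format, so that string order equals time order. `latest_snapshot_before` is a hypothetical name.

```shell
# Reads "<ISO-8601 UTC timestamp> <snapshot-id>" lines on stdin and prints the
# id of the newest snapshot taken strictly before the cutoff timestamp.
# Returns non-zero when no snapshot qualifies.
latest_snapshot_before() {
  cutoff=$1
  awk -v cutoff="$cutoff" '
    BEGIN { best = "" }
    $1 < cutoff && $1 > best { best = $1; id = $2 }
    END { if (id != "") print id; else exit 1 }
  '
}
```

The cutoff should be the earliest evidence of corruption found in Step 2, not the time the alert fired, since the alert usually lags the first bad write.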

Step 4 — Restore from backup

# Restore to a NEW database instance (never overwrite production directly)
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier [service-name]-restored-[date] \
  --db-snapshot-identifier [snapshot-id] \
  --region [region]

# Wait for restore to complete (pass the same region used for the restore)
aws rds wait db-instance-available \
  --db-instance-identifier [service-name]-restored-[date] \
  --region [region]

# Get the restored instance endpoint
aws rds describe-db-instances \
  --db-instance-identifier [service-name]-restored-[date] \
  --region [region] \
  --query 'DBInstances[0].Endpoint.Address'

Step 5 — Validate restored data

# Connect to restored database and verify integrity
# (-h sets the host; a bare first argument would be read as the database name)
psql -h [restored-db-endpoint] -U [user] -d [database] -c "[data integrity query]"

# Confirm record counts match expectations
psql -h [restored-db-endpoint] -U [user] -d [database] -c "SELECT COUNT(*) FROM [critical-table]"

Step 6 — Point service at restored database

kubectl set env deployment/[service-name] \
  DATABASE_URL=postgres://[user]:[pass]@[restored-endpoint]/[db] \
  -n [namespace]

kubectl scale deployment/[service-name] --replicas=[N] -n [namespace]

Recovery confirmed when: Service is running against restored database, data integrity checks pass, error rate is at baseline.


3.4 Critical Dependency Outage

Trigger: A service that [service name] depends on is unavailable or degraded.
Detection: Increased error rate or latency on endpoints that call [dependency], alerts from dependency owner.
Expected RTO: Depends on dependency: [30 minutes for mitigation, resolution depends on dependency owner]

Dependency map:

| Dependency | Criticality | Degraded behaviour | Mitigation |
|---|---|---|---|
| [Database] | Critical: all writes fail | Full outage | Activate DR database (Section 3.2, Step 3) |
| [Cache: Redis] | High: latency increases | Performance degradation | Bypass cache, serve from DB |
| [Auth service] | Critical: auth fails | All authenticated endpoints fail | Return cached tokens (if implemented) |
| [Message queue] | Medium: async processing delays | Writes succeed, async jobs queue | Queue backlog; see on-call runbook |
| [External API: name] | Low: feature X unavailable | Graceful degradation | Feature flag to disable feature X |

Mitigation steps:

# Enable circuit breaker / fallback for [dependency] if implemented
kubectl set env deployment/[service-name] [DEPENDENCY]_CIRCUIT_BREAKER=open -n [namespace]

# Enable feature flag to disable [dependency-backed feature]
[feature flag CLI command or dashboard link]

# Check if dependency has a status page
# [Dependency status URL]

Escalation: Contact [dependency] on-call via [PagerDuty / Slack #[channel]]. Share your service's error rate and the time dependency errors started.


3.5 Security Breach or Ransomware

Trigger: Evidence of unauthorized access, data exfiltration, or encryption of service data.
Detection: Security tooling alert, unusual access patterns, user reports of data exposure.
Expected RTO: [4+ hours; prioritise containment over speed] | Expected RPO: [Last verified clean backup]

Step 1 — Isolate immediately

# Take the service offline — do not attempt to recover while breach is active
kubectl scale deployment/[service-name] --replicas=0 -n [namespace]

# Revoke all API keys and service account credentials immediately
[command to rotate secrets — e.g. via Vault or cloud provider]

# Block all external access at network level
[firewall/security group command to deny all inbound traffic]

Step 2 — Notify security team immediately

Page [Security lead] via PagerDuty. Do NOT attempt to remediate without security team involvement.

Post to #security-incidents (private channel, not #incidents):

🔴 SECURITY INCIDENT — [Service Name]
Time detected: [Time]
Evidence: [One sentence — what was observed]
Actions taken: Service isolated, credentials revoked
Awaiting: Security team guidance

Step 3 — Preserve evidence

# Export current logs before any remediation
[log export command — preserve evidence for forensics]

# Snapshot the current state of all infrastructure
[snapshot/image command]

Steps 4+ — Follow security team guidance. Do not restore from backup until security team confirms the attack vector is closed.


3.6 Accidental Bulk Data Deletion

Trigger: An operator, script, or application bug has deleted records in bulk.
Detection: Sudden drop in record counts, user reports of missing data, application errors.
Expected RTO: [60 minutes] | Expected RPO: [Backup interval]

# Step 1 — Stop further writes immediately
kubectl scale deployment/[service-name] --replicas=0 -n [namespace]

# Step 2 — Determine what was deleted and when
psql $DATABASE_URL -c "
  SELECT schemaname, relname,
         n_dead_tup, last_autovacuum
  FROM pg_stat_user_tables
  ORDER BY n_dead_tup DESC LIMIT 10;
"

# Step 3 — Check if deletion is recoverable via MVCC (PostgreSQL)
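# Committed deletes are invisible to ordinary queries. The dead row versions stay on disk
# only until VACUUM removes them, and reading them needs a low-level tool
# (for example the pageinspect contrib module or the third-party pg_dirtyread extension).
# The query below only finds rows whose delete is still uncommitted or was rolled back.
psql $DATABASE_URL -c "
  SELECT * FROM [table]
  WHERE xmax != 0  -- rows locked, or deleted/updated by a transaction still open or rolled back
  LIMIT 100;
"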

# Step 4 — If not recoverable via MVCC, restore from backup
# Follow Section 3.3 (Database Corruption runbook) from Step 3 onward

4. Backup and Restore Procedures

Backup Configuration

| Data store | Backup type | Frequency | Retention | Location |
|---|---|---|---|---|
| [Primary database] | Automated snapshots | Every [N] hours | [N] days | [S3 bucket / cloud storage path] |
| [Primary database] | Transaction log backups | Continuous | [N] days | [Location] |
| [Secondary store, e.g. Redis] | RDB dump | Daily | [N] days | [Location] |
| [Blob/object storage] | Cross-region replication | Continuous | [N] days | [DR region bucket] |
| [Config / secrets] | Terraform state + Vault backup | On change | Indefinite | [Location] |

Backup Validation (Run Weekly)

# Name the throwaway instance once, so restore, wait, and delete all target the same
# instance even if the run crosses midnight
TEST_INSTANCE="[service-name]-backup-test-$(date +%Y%m%d)"

# Find the most recent snapshot
LATEST_SNAPSHOT=$(aws rds describe-db-snapshots \
  --db-instance-identifier [db-id] \
  --query 'sort_by(DBSnapshots, &SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
  --output text)

# Test restore of latest database backup to a throwaway instance
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier "$TEST_INSTANCE" \
  --db-snapshot-identifier "$LATEST_SNAPSHOT"

# Wait for restore, then run integrity checks
aws rds wait db-instance-available --db-instance-identifier "$TEST_INSTANCE"
psql -h [test-instance-endpoint] -U [user] -d [database] -c "[integrity check query]"

# Confirm row counts match recent production values (allow ≤ RPO difference)
psql -h [test-instance-endpoint] -U [user] -d [database] -c "SELECT COUNT(*) FROM [critical-table]"

# Destroy the test instance
aws rds delete-db-instance \
  --db-instance-identifier "$TEST_INSTANCE" \
  --skip-final-snapshot
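
The "allow ≤ RPO difference" comparison in the row-count check can be made explicit. The sketch below assumes the operator supplies the production count, the restored count, and the largest row difference that one RPO window of writes could explain; `counts_within_tolerance` and its arguments are hypothetical.

```shell
# Succeeds when the restored row count is within max_diff rows of production.
# Arguments: production count, restored count, maximum acceptable difference.
counts_within_tolerance() {
  prod=$1
  restored=$2
  max_diff=$3
  diff=$((prod - restored))
  if [ "$diff" -lt 0 ]; then
    diff=$((0 - diff))               # absolute value
  fi
  if [ "$diff" -le "$max_diff" ]; then
    echo "PASS: difference of ${diff} rows is within ${max_diff}"
  else
    echo "FAIL: difference of ${diff} rows exceeds ${max_diff}"
    return 1
  fi
}
```

A failing result does not by itself prove the backup is bad, since a bulk load between snapshot and check would also trip it, but it is a signal that the weekly validation needs a human look.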

5. DR Testing Cadence

Regular testing is mandatory. An untested DR plan is not a DR plan.

| Test type | Frequency | Who runs it | Pass criteria |
|---|---|---|---|
| Backup restore validation | Weekly (automated) | On-call rotation | Restore completes, integrity checks pass |
| Zone failover drill | Monthly | Engineering team | RTO target met, zero data loss |
| Region failover drill | Quarterly | Engineering + SRE | RTO/RPO targets met |
| Full DR game day | Annually | Engineering + stakeholders | All scenarios exercised, gaps documented |
| Chaos engineering (infra failures) | Weekly (automated) | Chaos engineering tooling | Service degrades gracefully, recovers automatically |

Game Day Procedure

  1. Pre-game day (1 week before): Notify all stakeholders, freeze production changes for the day, prepare DR environment.
  2. Scope definition: Choose 2–3 scenarios from Section 2. Document expected outcomes before the test.
  3. Execute: One person acts as incident commander, others execute the runbook steps, and a separate observer records the timings.
  4. Measure: Record actual RTO and RPO against targets for each scenario.
  5. Debrief (same day): Document gaps, runbook inaccuracies, and automation opportunities.
  6. Action items: File tickets for every gap found. Priority: P1 items must be fixed before next game day.
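
The "Measure" step can be sketched as a small calculation, assuming the observer records the declaration and recovery times as Unix epoch seconds. `measure_against_target` is a hypothetical helper; it rounds the elapsed time up to whole minutes so that a drill never passes by truncation.

```shell
# Prints the elapsed time between two epoch timestamps (rounded up to whole
# minutes) and whether it met the target.
# Arguments: label (e.g. RTO), target in minutes, start epoch, end epoch.
measure_against_target() {
  label=$1
  target_min=$2
  start_epoch=$3
  end_epoch=$4
  actual_min=$(( (end_epoch - start_epoch + 59) / 60 ))
  if [ "$actual_min" -le "$target_min" ]; then
    verdict=MET
  else
    verdict=MISSED
  fi
  echo "${label} target: ${target_min} min, actual: ${actual_min} min, ${verdict}"
}
```

The same function works for RPO if the two timestamps are the last replicated write and the moment of failure. Its output line can be pasted into the "Recovery confirmed" template in Section 6.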

6. Communication Plan

Internal Communication During DR Event

Incident commander responsibilities:

  • Declare the DR event and open the incident channel
  • Post updates every 15 minutes minimum
  • Make the call to fail over (do not let the team decide by committee)
  • Notify business stakeholders of expected recovery time

Notify these people at DR event start:

| Role | Name | Contact | When to notify |
|---|---|---|---|
| Engineering manager | [Name] | [Slack / Phone] | Immediately |
| CTO / VP Engineering | [Name] | [Phone] | Tier 1 services: immediately |
| Customer success lead | [Name] | [Slack] | If customer-facing impact |
| Security lead | [Name] | [Slack / PagerDuty] | If breach suspected |
| Legal / compliance | [Name] | [Email / Phone] | If data loss involves PII |

Communication Templates

DR event declared:

🔴 DR EVENT — [Service Name]
Time: [HH:MM UTC]
Scenario: [Zone failure / Region failure / Data loss / etc.]
Impact: [Who is affected and how]
RTO target: [X minutes]
Incident commander: [Name]
War room: [Slack channel / call link]
Next update: [Time + 15 min]

Status update (every 15 minutes):

🔴 DR UPDATE — [Service Name] — [HH:MM UTC]
Status: [Investigating / Executing recovery / Verifying]
Progress: [One sentence on current step]
Blockers: [Any — or "None"]
Updated RTO estimate: [Time]
Next update: [Time + 15 min]
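
Filling in "Next update: [Time + 15 min]" by hand is an easy place to slip during an incident. The sketch below does the clock arithmetic without relying on any particular `date` implementation; it assumes a zero-padded 24-hour `HH:MM` input, and `next_update` is a hypothetical helper name.

```shell
# Adds a number of minutes to an HH:MM (24-hour) time and prints the result,
# wrapping past midnight.
# Arguments: current time as HH:MM, minutes to add.
next_update() {
  now=$1
  add_min=$2
  hh=${now%%:*}
  mm=${now##*:}
  # Strip one leading zero so that 08 and 09 are not parsed as invalid octal.
  hh=${hh#0}; hh=${hh:-0}
  mm=${mm#0}; mm=${mm:-0}
  total=$(( (hh * 60 + mm + add_min) % 1440 ))
  printf '%02d:%02d UTC\n' $((total / 60)) $((total % 60))
}
```

For instance, `next_update 14:50 15` gives `15:05 UTC`, and `next_update 23:55 15` wraps to `00:10 UTC`.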

Recovery confirmed:

✅ DR RESOLVED — [Service Name] — [HH:MM UTC]
Total downtime: [X minutes]
Data loss: [None / X minutes of transactions]
RTO target: [X min] — Actual: [Y min] — [MET / MISSED]
RPO target: [X min] — Actual: [Y min] — [MET / MISSED]
Root cause: [One sentence]
Post-incident review: [Scheduled for / Link when created]

7. DR Readiness Checklist

Run this checklist quarterly and before any major infrastructure change:

Backups:

  • Automated backups are running and alerts fire if they fail
  • Most recent backup restore was tested within the last 7 days
  • Backup retention meets RPO and compliance requirements
  • Backups are stored in a separate region / account from primary

Failover infrastructure:

  • DR region / environment exists and is provisioned (not just documented)
  • DNS failover procedure is documented with exact commands
  • DR database replica is current (replication lag is within RPO)
  • Service can be deployed in DR region with a single command or automated pipeline

Runbooks:

  • All runbooks in Section 3 have been tested within the last quarter
  • Runbook commands have been verified against current infrastructure (no stale references)
  • Contact list is current (no departed employees)

Access:

  • On-call engineers have access to DR region console / CLI
  • Service account credentials for DR region are provisioned and tested
  • Break-glass accounts exist for emergency access if SSO is unavailable

Monitoring:

  • Monitoring exists in DR region (not just primary)
  • Alerts fire correctly when DR environment has issues

Quality Checks

  • RPO and RTO targets are specific numbers, not ranges, and are agreed with the business
  • Every command in every runbook has been run by a human in the last quarter, not copied untested from documentation
  • DR database exists in the DR region and replication lag is monitored
  • Backup restore has been tested end-to-end within the last 7 days
  • The game day schedule is on the team calendar — not just documented here
  • Contact list contains current phone numbers, not just Slack handles (Slack may be down during a DR event)
  • Security breach runbook (3.5) explicitly names the security team contact and does not attempt self-remediation
  • All thresholds (RTO/RPO) are visible in the monitoring dashboard so actual vs. target is measurable in real time
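
The contact-list check above can be automated if the list is kept in a machine-readable form. The sketch below assumes a hypothetical CSV layout of `role,name,slack,phone` with a header row, and flags any role whose phone field contains no digits; both the layout and the name `contacts_missing_phone` are assumptions for illustration.

```shell
# Reads a CSV contact list (role,name,slack,phone) on stdin and prints the
# role of every entry that has no phone number.
contacts_missing_phone() {
  awk -F',' 'NR > 1 && $4 !~ /[0-9]/ { print $1 }'
}
```

An empty result means every escalation contact can be reached without Slack. Any printed role is a gap to fix before the next DR test.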

Anti-Patterns

  • Do not write runbook commands without testing them — an untested command in a runbook is actively dangerous during a real disaster when cognitive load is highest
  • Do not set RTO/RPO targets without business sign-off — technical teams often set aspirational targets that do not reflect actual business cost tolerance for downtime
  • Do not include only the "happy path" of each failover scenario — runbooks must explicitly cover what to do when the recovery step itself fails
  • Do not list Slack handles as the only escalation contact — Slack may be unavailable during a region-wide failure; phone numbers are mandatory
  • Do not schedule DR game days without pre-committing to fix the gaps found — a game day that produces action items no one owns is theater, not preparedness
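
The third anti-pattern, covering only the happy path, can be countered by pairing every runbook command with a stated fallback. The wrapper below is a minimal sketch of that idea: it runs one step and, if the step fails, prints the fallback instruction and returns the step's exit status. `run_step` and the example fallback text are hypothetical.

```shell
# Runs one runbook step and reports what to do if it fails.
# Arguments: step description, fallback instruction, then the command to run.
run_step() {
  description=$1
  fallback=$2
  shift 2
  echo "STEP: ${description}"
  if "$@"; then
    echo "OK: ${description}"
  else
    status=$?
    echo "FAILED (exit ${status}): ${description}"
    echo "FALLBACK: ${fallback}"
    return "$status"
  fi
}
```

A call might look like `run_step "Promote DR replica" "Page the database on-call and restore from the latest snapshot (Section 3.3)" aws rds promote-read-replica ...`, so the responder sees the next move immediately instead of improvising.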
