
Database Disaster Recovery Runbook: RTO, RPO, and PITR Procedures

Build a production database DR runbook covering RTO/RPO definitions, failover steps, PITR recovery commands, and a verification checklist. Test it quarterly.

JusDB Team
October 14, 2025

A disaster recovery runbook is a step-by-step guide for recovering your database after a catastrophic failure. It should be written in advance and tested regularly, never improvised during an incident.

RTO and RPO Definitions

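  • RPO (Recovery Point Objective): maximum acceptable data loss, expressed as a window of time before the failure (e.g., the last 5 minutes of writes)
  • RTO (Recovery Time Objective): maximum acceptable downtime, measured from the failure until service is restored (e.g., 1 hour)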
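
As a minimal sketch of how the two objectives become measurable checks, the bash functions below compute the data-loss window and the downtime from incident timestamps and compare them with the targets. It assumes GNU `date` (for the `-d` flag); the function names and the example timestamps are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: assumes GNU date (the -d flag); function names are illustrative.

# Seconds elapsed between two ISO 8601 timestamps (start, end).
seconds_between() {
  local start end
  start=$(date -u -d "$1" +%s) || return 2
  end=$(date -u -d "$2" +%s) || return 2
  echo $(( end - start ))
}

# Succeeds when the measured window is within the objective (both in seconds).
within_objective() {
  local measured=$1 objective=$2
  [ "$measured" -le "$objective" ]
}

# Hypothetical incident: last recoverable write at 14:27:30, failure at 14:30:00,
# service restored at 14:52:00.
data_loss=$(seconds_between "2025-10-01T14:27:30Z" "2025-10-01T14:30:00Z")
downtime=$(seconds_between "2025-10-01T14:30:00Z" "2025-10-01T14:52:00Z")

# Targets from the examples above: RPO 5 minutes (300s), RTO 1 hour (3600s).
within_objective "$data_loss" 300  && echo "RPO met: ${data_loss}s of data lost"
within_objective "$downtime" 3600 && echo "RTO met: ${downtime}s of downtime"
```

Recording these three timestamps during every incident and drill makes it possible to state afterwards whether the objectives were actually met.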

Runbook Template: PostgreSQL on RDS

text
INCIDENT: RDS Primary instance failure
SEVERITY: P1
RTO: 30 minutes
RPO: 5 minutes

STEP 1: Confirm failure (2 min)
  - Check RDS console: instance status
  - Check CloudWatch: CPUUtilization, DatabaseConnections
  - Verify application error logs

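STEP 2: Confirm or trigger Multi-AZ failover (5 min)
  - RDS Multi-AZ: failover is automatic when the primary fails; no action needed
  - Verify: the instance endpoint (unchanged) resolves to the new primary, typically within 60-120s
  - Manual failover, if required: aws rds reboot-db-instance --db-instance-identifier prod-postgres --force-failover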

STEP 3: Verify application recovery (5 min)
  - Check application health endpoint
  - Verify write transactions succeed
  - Check ReplicaLag on any read replicas (the Multi-AZ standby replicates synchronously)

STEP 4: Post-incident (30 min)
  - Document timeline
  - Confirm the Multi-AZ standby is healthy; fail back to the original AZ if required
  - Review CloudWatch alarms that should have fired
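
Steps 1 and 2 of the template could be scripted roughly as follows. This is a sketch under stated assumptions: the `DB_ID` default reuses the `prod-postgres` identifier from this article, and the function names, timeout, and polling interval are hypothetical. It relies only on `aws rds describe-db-instances` and its `DBInstanceStatus` and `AvailabilityZone` fields.

```bash
#!/usr/bin/env bash
# Sketch only: DB_ID and the function names are assumptions for illustration.
DB_ID="${DB_ID:-prod-postgres}"

# Print the current status and Availability Zone of the instance (tab-separated).
instance_state() {
  aws rds describe-db-instances \
    --db-instance-identifier "$DB_ID" \
    --query 'DBInstances[0].[DBInstanceStatus,AvailabilityZone]' \
    --output text
}

# Poll until the instance reports "available" or the timeout (seconds) expires.
# Usage: wait_until_available [timeout_seconds] [interval_seconds]
wait_until_available() {
  local timeout=${1:-300} interval=${2:-10} waited=0 status
  while [ "$waited" -lt "$timeout" ]; do
    status=$(instance_state | awk '{print $1}')
    if [ "$status" = "available" ]; then
      echo "available after ${waited}s"
      return 0
    fi
    sleep "$interval"
    waited=$(( waited + interval ))
  done
  echo "still '${status}' after ${timeout}s" >&2
  return 1
}
```

Calling `instance_state` before and after the failover shows whether the Availability Zone changed, and `wait_until_available 300` gives a timed confirmation that can be pasted into the incident timeline.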

PITR Recovery Steps

bash
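# Restore RDS to a point in time (creates a new instance; the source is untouched)
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier prod-postgres \
  --target-db-instance-identifier prod-postgres-recovery \
  --restore-time 2025-10-01T14:30:00Z \
  --db-instance-class db.r6g.large

# Wait for the restored instance, then look up its endpoint
aws rds wait db-instance-available --db-instance-identifier prod-postgres-recovery
RECOVERY_HOST=$(aws rds describe-db-instances \
  --db-instance-identifier prod-postgres-recovery \
  --query 'DBInstances[0].Endpoint.Address' --output text)

# After restore, verify that the newest row is close to the restore time
psql -h "$RECOVERY_HOST" -U postgres -c 'SELECT max(created_at) FROM orders;'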
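
A restore request fails if the requested time is later than the source instance's latest restorable time, so it can help to check that first. The sketch below reads the `LatestRestorableTime` field from `aws rds describe-db-instances` and compares it with the requested time. It assumes GNU `date`; the function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: assumes GNU date; function names are illustrative.

# Latest point in time the source instance can be restored to.
latest_restorable_time() {
  aws rds describe-db-instances \
    --db-instance-identifier "$1" \
    --query 'DBInstances[0].LatestRestorableTime' \
    --output text
}

# Succeeds when the requested restore time is not later than the
# latest restorable time of the source instance.
# Usage: restore_time_is_valid <source-instance-id> <iso8601-time>
restore_time_is_valid() {
  local source_id=$1 requested=$2 latest
  latest=$(latest_restorable_time "$source_id") || return 2
  [ "$(date -u -d "$requested" +%s)" -le "$(date -u -d "$latest" +%s)" ]
}
```

For example, `restore_time_is_valid prod-postgres 2025-10-01T14:30:00Z` can gate the restore command. When the goal is simply the most recent recoverable state, the restore command also accepts `--use-latest-restorable-time` in place of `--restore-time`.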

Verification Checklist

text
After recovery:
[ ] Application can connect to database
[ ] Read queries return correct data
[ ] Write queries succeed
[ ] Replication is established (if applicable)
[ ] Monitoring and alerting are active
[ ] No data loss beyond RPO threshold
[ ] Backup jobs are running
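
The database-facing items of this checklist can be partly automated. The following is a sketch, not a complete implementation: it covers connectivity, reads, writes, and primary status only. It assumes `psql` picks up its connection settings from the standard `PGHOST`, `PGUSER`, and `PGDATABASE` environment variables and reuses the `orders` table from the PITR example; the `dr_check` temp table and all function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: dr_check, the function names, and the reliance on PG*
# environment variables for connection settings are assumptions.

failures=0

# Run one SQL statement; ON_ERROR_STOP makes SQL errors produce a non-zero exit.
sql() { psql -v ON_ERROR_STOP=1 -Atqc "$1"; }

# Run one named check; print PASS or FAIL and count failures.
run_check() {
  local name=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $name"
  else
    echo "FAIL  $name"
    failures=$(( failures + 1 ))
  fi
}

can_connect() { sql 'SELECT 1'; }
can_read()    { sql 'SELECT max(created_at) FROM orders'; }
# A temp table exercises the write path without leaving data behind.
can_write()   { sql 'CREATE TEMP TABLE dr_check AS SELECT now() AS checked_at'; }
# The instance serving writes must not be in recovery mode.
is_primary()  { [ "$(sql 'SELECT pg_is_in_recovery()')" = "f" ]; }

# Run all checks; succeeds only if none failed.
verify_recovery() {
  failures=0
  run_check "application can connect" can_connect
  run_check "read queries return data" can_read
  run_check "write queries succeed" can_write
  run_check "instance accepts writes (not in recovery)" is_primary
  [ "$failures" -eq 0 ]
}
```

Running `verify_recovery` prints one PASS or FAIL line per check and returns non-zero if any check failed, which makes it usable as the final step of a drill script. Monitoring, backup jobs, and the RPO comparison still need their own checks.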

Regular DR Testing

  • Monthly: restore latest backup to test instance and verify row counts
  • Quarterly: execute full failover in staging, time the recovery
  • Annually: full DR test in production (planned maintenance window)
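
Timing the recovery, as the quarterly drill requires, can be done with a small wrapper like the one sketched below. It uses bash's built-in `SECONDS` counter; the `time_drill` name and its messages are hypothetical, and the drill command itself is whatever script or command performs the failover in your staging environment.

```bash
#!/usr/bin/env bash
# Sketch only: time_drill is an illustrative name, not an existing tool.

# Run a recovery drill command, report elapsed seconds, and compare with the RTO.
# Usage: time_drill <rto_seconds> <command> [args...]
time_drill() {
  local rto=$1; shift
  local started=$SECONDS elapsed
  "$@"
  local rc=$?
  elapsed=$(( SECONDS - started ))
  if [ "$rc" -ne 0 ]; then
    echo "drill failed (exit ${rc}) after ${elapsed}s"
    return "$rc"
  fi
  if [ "$elapsed" -le "$rto" ]; then
    echo "drill completed in ${elapsed}s (RTO ${rto}s): within target"
  else
    echo "drill completed in ${elapsed}s (RTO ${rto}s): exceeded target"
    return 1
  fi
}
```

With a 30-minute RTO, a drill would be wrapped as `time_drill 1800 <your-failover-command>`. Keeping the printed result from each quarter shows whether recovery time is drifting toward the limit.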

Key Takeaways

  • Document your runbook before an incident — adrenaline and unclear steps are a dangerous combination
  • Define RTO and RPO explicitly — they drive every architectural decision about backup and replication
  • Test DR quarterly — an untested runbook is just a document, not a recovery plan
  • Automate verification steps where possible — human checklist completion under stress is unreliable

JusDB Can Help

JusDB creates and tests disaster recovery runbooks for production database environments. Contact us to build and validate your DR plan.
