A disaster recovery runbook is a step-by-step guide for recovering your database after a catastrophic failure. It should be written in advance and tested regularly, never improvised during an incident.
RTO and RPO Definitions
- RPO (Recovery Point Objective): maximum acceptable data loss (e.g., 5 minutes)
- RTO (Recovery Time Objective): maximum acceptable downtime (e.g., 1 hour)
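As a worked example of the two definitions, the sketch below checks a made-up incident timeline against the example objectives above (RPO 5 minutes = 300 s, RTO 1 hour = 3600 s). The timeline values and the `objective_status` helper are illustrative assumptions, not part of any tool.

```shell
#!/bin/sh
# Sketch: compare a measured value against its objective (both in seconds).
# Prints "OK" when the measurement is within the objective, "BREACH" otherwise.
objective_status() {
  actual=$1
  limit=$2
  if [ "$actual" -le "$limit" ]; then
    echo "OK"
  else
    echo "BREACH"
  fi
}

# Hypothetical incident timeline, as seconds after midnight UTC.
last_commit=$((14 * 3600 + 27 * 60 + 30))   # 14:27:30 last write that survived
failure=$((14 * 3600 + 30 * 60))            # 14:30:00 primary fails
restored=$((14 * 3600 + 52 * 60))           # 14:52:00 service restored

data_loss=$((failure - last_commit))        # 150 s of lost writes
downtime=$((restored - failure))            # 1320 s of downtime

echo "RPO (limit 300 s): ${data_loss} s -> $(objective_status "$data_loss" 300)"
echo "RTO (limit 3600 s): ${downtime} s -> $(objective_status "$downtime" 3600)"
```

Data loss is measured backwards from the failure to the last write that survived; downtime is measured forwards from the failure to restored service.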
Runbook Template: PostgreSQL on RDS
INCIDENT: RDS Primary instance failure
SEVERITY: P1
RTO: 30 minutes
RPO: 5 minutes
STEP 1: Confirm failure (2 min)
- Check RDS console: instance status
- Check CloudWatch: CPUUtilization, DatabaseConnections
- Verify application error logs
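STEP 2: Fail over to the Multi-AZ standby (automatic or manual, 5 min)
- RDS Multi-AZ: failover is automatic when the primary fails, no action needed
- Manual trigger (only while the instance is still available): aws rds reboot-db-instance --db-instance-identifier prod-postgres --force-failover
- Verify: the instance endpoint (DNS name unchanged) resolves to the promoted standby, typically within 60-120s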
STEP 3: Verify application recovery (5 min)
- Check application health endpoint
- Verify write transactions succeed
- Check replication lag on read replicas, if any (the Multi-AZ standby itself replicates synchronously)
STEP 4: Post-incident (30 min)
- Document timeline
- Confirm the Multi-AZ standby has been re-established; fail back to the original AZ if placement matters
- Review CloudWatch alarms that should have fired
PITR Recovery Steps
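Before running the restore command below, it can help to confirm that the chosen restore time falls inside the instance's restorable window. The following is a minimal sketch: the function names are hypothetical, and it assumes the AWS CLI returns the `RestoreWindow` timestamps in ISO 8601 UTC, so that the first 19 characters can be compared as plain strings.

```shell
#!/bin/sh
# Sketch: succeed only if an ISO 8601 UTC timestamp lies inside [earliest, latest].
# Only the first 19 characters (YYYY-MM-DDTHH:MM:SS) are compared, so fractional
# seconds and a trailing "Z" or "+00:00" are ignored. All inputs are assumed UTC.
in_restore_window() {
  target=$(printf '%.19s' "$1")
  earliest=$(printf '%.19s' "$2")
  latest=$(printf '%.19s' "$3")
  # Equal-format timestamps sort chronologically when compared as strings;
  # the "x" prefix forces expr to do a string comparison.
  LC_ALL=C expr "x$earliest" \<= "x$target" >/dev/null &&
    LC_ALL=C expr "x$target" \<= "x$latest" >/dev/null
}

# Fetch the restorable window for an instance and check a target time against it.
# Not called here; it needs the AWS CLI and valid credentials.
check_restore_time() {
  instance=$1
  wanted=$2
  window=$(aws rds describe-db-instance-automated-backups \
    --db-instance-identifier "$instance" \
    --query 'DBInstanceAutomatedBackups[0].RestoreWindow.[EarliestTime,LatestTime]' \
    --output text) || return 2
  # --output text prints the two timestamps separated by whitespace.
  set -- $window
  in_restore_window "$wanted" "$1" "$2"
}
```

A possible call is `check_restore_time prod-postgres 2025-10-01T14:30:00Z && echo "restore time is inside the window"`. Checking first avoids starting a restore that RDS would reject because the time is outside the backup retention window.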
# Restore RDS to point in time
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier prod-postgres \
--target-db-instance-identifier prod-postgres-recovery \
--restore-time 2025-10-01T14:30:00Z \
--db-instance-class db.r6g.large
# After restore, verify data
psql -h recovery-endpoint -U postgres -c 'SELECT max(created_at) FROM orders;'
Verification Checklist
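Several items in the checklist below can be scripted so they do not rely on manual ticking under stress. This is a minimal sketch of a checklist runner. `run_check` is a hypothetical helper, and the host name, health URL and probe table in the commented examples are placeholders.

```shell
#!/bin/sh
# Sketch of a checklist runner: each check is a label plus a command.
# A check passes when its command exits 0. Failures are counted, not fatal,
# so every item is reported even if an early one fails.
failed_checks=0

run_check() {
  label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "[x] $label"
  else
    echo "[ ] $label (FAILED)"
    failed_checks=$((failed_checks + 1))
  fi
}

# Example wiring (commented out; host, URL and table are placeholders):
# run_check "Application can connect to database" \
#   psql -h recovery-endpoint -U postgres -c 'SELECT 1'
# run_check "Write queries succeed" \
#   psql -h recovery-endpoint -U postgres -c 'CREATE TEMP TABLE dr_probe (id int)'
# run_check "Application health endpoint responds" \
#   curl -fsS https://app.example.com/health
# [ "$failed_checks" -eq 0 ] || echo "$failed_checks check(s) failed"
```

Call `run_check` directly rather than inside `$(...)`: a command substitution runs in a subshell, so the failure counter would not be updated.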
After recovery:
[ ] Application can connect to database
[ ] Read queries return correct data
[ ] Write queries succeed
[ ] Replication is established (if applicable)
[ ] Monitoring and alerting are active
[ ] No data loss beyond RPO threshold
[ ] Backup jobs are running
Regular DR Testing
- Monthly: restore latest backup to test instance and verify row counts
- Quarterly: execute full failover in staging, time the recovery
- Annually: full DR test in production (planned maintenance window)
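The monthly row-count check above could be sketched as follows. The helper names, the endpoints in the usage note and the tolerance are assumptions. The comparison also assumes an append-mostly table such as `orders`, where the restored copy should never hold more rows than production.

```shell
#!/bin/sh
# Sketch: compare a table's row count on the restored test instance against
# production. A backup is older than production, so the restored count may be
# lower; the check passes when it is not higher and the gap is at most max_gap.
row_count_ok() {
  prod_rows=$1
  restored_rows=$2
  max_gap=$3
  gap=$((prod_rows - restored_rows))
  [ "$gap" -ge 0 ] && [ "$gap" -le "$max_gap" ]
}

# Read a row count with psql (-A unaligned, -t tuples only). Not called here.
# The table name is interpolated unescaped, so pass trusted names only.
table_rows() {
  host=$1
  table=$2
  psql -h "$host" -U postgres -At -c "SELECT count(*) FROM $table;"
}
```

A possible call, with placeholder endpoints, is `row_count_ok "$(table_rows prod-endpoint orders)" "$(table_rows test-endpoint orders)" 5000`. The tolerance should reflect how many rows the table normally gains between the backup and the check.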
Key Takeaways
- Document your runbook before an incident — adrenaline and unclear steps are a dangerous combination
- Define RTO and RPO explicitly — they drive every architectural decision about backup and replication
- Test DR quarterly — an untested runbook is just a document, not a recovery plan
- Automate verification steps where possible — human checklist completion under stress is unreliable
JusDB Can Help
JusDB creates and tests disaster recovery runbooks for production database environments. Contact us to build and validate your DR plan.