Database Disaster Recovery Runbook: RTO, RPO, and PITR Procedures

A disaster recovery runbook is a step-by-step guide for recovering your database after a catastrophic failure. It should be written in advance and tested regularly, never improvised during an incident.

RTO and RPO Definitions

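RPO (Recovery Point Objective): maximum acceptable data loss, measured as the window of time before the failure whose writes may be lost (e.g., 5 minutes)
RTO (Recovery Time Objective): maximum acceptable downtime, measured from the failure until the service is usable again (e.g., 1 hour)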
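
To make the two objectives concrete, here is a minimal sketch that compares a measured value with its objective. The function name `check_objective` and the sample numbers are hypothetical and only illustrate the arithmetic.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare a measured value with an objective, both in seconds.
# Usage: check_objective <label> <measured_seconds> <objective_seconds>
check_objective() {
  local label="$1" measured="$2" objective="$3"
  if [ "$measured" -le "$objective" ]; then
    echo "$label met: ${measured}s <= ${objective}s"
  else
    echo "$label MISSED: ${measured}s > ${objective}s"
    return 1
  fi
}

# Example values (assumed): 4 minutes of lost writes against a 5-minute RPO,
# and 75 minutes of downtime against a 1-hour RTO.
check_objective RPO 240 300
check_objective RTO 4500 3600 || true
```

With these sample values the RPO is met and the RTO is missed, which is the kind of result a post-incident review should record.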

Runbook Template: PostgreSQL on RDS

text

INCIDENT: RDS Primary instance failure
SEVERITY: P1
RTO: 30 minutes
RPO: 5 minutes

STEP 1: Confirm failure (2 min)
  - Check RDS console: instance status
  - Check CloudWatch: CPUUtilization, DatabaseConnections
  - Verify application error logs

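STEP 2: Confirm or trigger Multi-AZ failover (5 min)
  - RDS Multi-AZ: failover is automatic when the primary fails; no action needed
  - If failover has not started and the instance is still "available", force it:
    aws rds reboot-db-instance --db-instance-identifier prod-postgres --force-failover
  - Verify: the instance endpoint (DNS name unchanged) resolves to the new primary, typically within 60-120s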

STEP 3: Verify application recovery (5 min)
  - Check application health endpoint
  - Verify write transactions succeed
  - Check ReplicaLag on any read replicas (the Multi-AZ standby replicates synchronously)

STEP 4: Post-incident (30 min)
  - Document timeline
  - Confirm RDS has re-established the Multi-AZ standby; fail back to the original AZ if placement matters
  - Review CloudWatch alarms that should have fired
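
Steps 1 and 2 of the template can be sketched as two small shell functions. This is a rough illustration, not a hardened script: the function names and the `DB_ID` default are assumptions, and it presumes the AWS CLI is installed and configured with credentials for the account.

```shell
#!/usr/bin/env bash
# DB_ID is an assumption; substitute your own instance identifier.
DB_ID="${DB_ID:-prod-postgres}"

# Print the instance status as RDS reports it (for example "available" or "rebooting").
db_status() {
  aws rds describe-db-instances \
    --db-instance-identifier "$1" \
    --query 'DBInstances[0].DBInstanceStatus' \
    --output text
}

# Force a failover to the standby. RDS accepts a reboot only while the
# instance is "available", so check the status first and refuse otherwise.
manual_failover() {
  local id="$1" status
  status="$(db_status "$id")"
  if [ "$status" != "available" ]; then
    echo "status is '$status'; not forcing failover" >&2
    return 1
  fi
  aws rds reboot-db-instance --db-instance-identifier "$id" --force-failover
}
```

Calling `manual_failover "$DB_ID"` is only meaningful for a Multi-AZ instance; when the primary has actually failed, RDS starts the failover on its own and the manual command is not needed.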

PITR Recovery Steps

bash

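# Restore to a point in time (UTC). This creates a NEW instance with its own
# endpoint; the source instance is left untouched.
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier prod-postgres \
  --target-db-instance-identifier prod-postgres-recovery \
  --restore-time 2025-10-01T14:30:00Z \
  --db-instance-class db.r6g.large

# Wait until the restored instance is available
aws rds wait db-instance-available \
  --db-instance-identifier prod-postgres-recovery

# After restore, verify data (replace recovery-endpoint with the new instance's hostname)
psql -h recovery-endpoint -U postgres -c 'SELECT max(created_at) FROM orders;'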
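
Two lookups help around the restore: the latest time the source instance can be restored to, and the hostname of the restored instance. The sketch below wraps both. The function names are hypothetical, and it presumes a configured AWS CLI.

```shell
#!/usr/bin/env bash
# Print the most recent point in time the source instance can be restored to.
# A --restore-time later than this value is rejected by RDS.
latest_restorable_time() {
  aws rds describe-db-instances \
    --db-instance-identifier "$1" \
    --query 'DBInstances[0].LatestRestorableTime' \
    --output text
}

# Print the endpoint hostname of an instance, for use with psql -h.
recovery_endpoint() {
  aws rds describe-db-instances \
    --db-instance-identifier "$1" \
    --query 'DBInstances[0].Endpoint.Address' \
    --output text
}
```

For example, `latest_restorable_time prod-postgres` before the restore shows how close to the failure you can recover, and `psql -h "$(recovery_endpoint prod-postgres-recovery)" -U postgres` connects once the restored instance is available.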

Verification Checklist

text

After recovery:
[ ] Application can connect to database
[ ] Read queries return correct data
[ ] Write queries succeed
[ ] Replication is established (if applicable)
[ ] Monitoring and alerting are active
[ ] No data loss beyond RPO threshold
[ ] Backup jobs are running
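
The first three checklist items lend themselves to automation. The following is a minimal sketch, assuming `psql` is installed and the connection is supplied through the standard `PGHOST`, `PGUSER`, and `PGDATABASE` environment variables. The function names and the `dr_check` temporary table are hypothetical. It only confirms that the queries succeed, so judging whether the data is correct is still up to you.

```shell
#!/usr/bin/env bash
# Run one check and print it in checklist form: [x] on success, [ ] on failure.
run_check() {
  local name="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "[x] $name"
  else
    echo "[ ] $name"
    return 1
  fi
}

# Run the connectivity, read, and write checks; return non-zero if any failed.
verify_recovery() {
  local failed=0
  run_check "Application can connect to database" \
    psql -X -c 'SELECT 1' || failed=1
  run_check "Read queries succeed" \
    psql -X -c 'SELECT max(created_at) FROM orders' || failed=1
  # Creating a temp table fails on a read-only (standby) server, so this
  # doubles as a check that the instance accepts writes.
  run_check "Write queries succeed" \
    psql -X -c 'CREATE TEMP TABLE dr_check (id int); INSERT INTO dr_check VALUES (1)' || failed=1
  return "$failed"
}
```

Running `verify_recovery` prints one checklist line per check, and its exit status can gate the next runbook step.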

Regular DR Testing

Monthly: restore latest backup to test instance and verify row counts
Quarterly: execute full failover in staging, time the recovery
Annually: full DR test in production (planned maintenance window)
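
Timing the quarterly failover drill can be scripted so the result is comparable from one test to the next. This is a small sketch: `time_drill` is a hypothetical name, and the command it wraps is whatever drives your drill.

```shell
#!/usr/bin/env bash
# Usage: time_drill <rto_seconds> <command> [args...]
# Runs the command, prints the elapsed wall-clock time, and returns
# non-zero if the command failed or the elapsed time exceeded the RTO.
time_drill() {
  local rto="$1"; shift
  local start end elapsed status=0
  start="$(date +%s)"
  "$@" || status=$?
  end="$(date +%s)"
  elapsed=$((end - start))
  echo "recovery took ${elapsed}s (RTO ${rto}s)"
  [ "$status" -eq 0 ] && [ "$elapsed" -le "$rto" ]
}
```

For a 30-minute RTO, `time_drill 1800 ./run_failover_drill.sh` would do, where the script name is a placeholder for your own drill procedure.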

Key Takeaways

Document your runbook before an incident — adrenaline and unclear steps are a dangerous combination
Define RTO and RPO explicitly — they drive every architectural decision about backup and replication
Test DR quarterly — an untested runbook is just a document, not a recovery plan
Automate verification steps where possible — human checklist completion under stress is unreliable

JusDB Can Help

JusDB creates and tests disaster recovery runbooks for production database environments. Contact us to build and validate your DR plan.

JusDB Team
