Database Reliability Engineering

Database SRE: Applying Reliability Engineering Principles Specifically to Your Database Tier

Generic SRE covers the whole stack. Database SRE goes deep on the layer that causes most P1 incidents — the database. JusDB defines database-specific SLOs, engineers for your RTO/RPO targets, runs chaos experiments against your database topology, and writes DB-specific runbooks your on-call engineers actually use.

Building an organisation-wide SRE practice with team structure, Kubernetes, and on-call tooling? See our Enterprise SRE Consulting service →


Database SLOs: What to Measure and Why

Generic SLOs measure HTTP success rate and p99 latency. Database SLOs must go deeper: replication lag, query latency at defined percentiles, connection pool saturation, deadlock rate, and replication divergence. JusDB defines the right SLOs for your database technology and workload.

MySQL / MariaDB SLOs

Replication lag SLO: < 5s (alert), < 30s (breach)
Query p99 latency SLO: < 100ms for OLTP
Connection pool utilisation SLO: < 80%
Deadlock rate SLO: < 10/hour
Slow query rate SLO: < 0.1% of total queries
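
As an illustration of how the thresholds above might be encoded, here is a minimal Python sketch. The names (`SloState`, `classify_replication_lag`, `slow_query_rate_ok`) are hypothetical and not part of any JusDB tooling; the sketch assumes lag is already available in seconds and that query counts come from your own metrics pipeline.

```python
from enum import Enum


class SloState(Enum):
    OK = "ok"
    ALERT = "alert"
    BREACH = "breach"


# Thresholds taken from the MySQL / MariaDB list above.
LAG_ALERT_SECONDS = 5.0
LAG_BREACH_SECONDS = 30.0
SLOW_QUERY_RATE_LIMIT = 0.001  # 0.1% of total queries


def classify_replication_lag(lag_seconds: float) -> SloState:
    """Map a replication lag sample onto the alert/breach thresholds."""
    if lag_seconds >= LAG_BREACH_SECONDS:
        return SloState.BREACH
    if lag_seconds >= LAG_ALERT_SECONDS:
        return SloState.ALERT
    return SloState.OK


def slow_query_rate_ok(slow_queries: int, total_queries: int) -> bool:
    """True when slow queries stay below 0.1% of all queries."""
    if total_queries == 0:
        return True  # no traffic, nothing to violate
    return slow_queries / total_queries < SLOW_QUERY_RATE_LIMIT
```

For example, `classify_replication_lag(12.0)` yields `SloState.ALERT`, and 5 slow queries out of 10,000 stays within the slow query SLO.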

PostgreSQL SLOs

Replication lag SLO: < 10s for standbys
WAL sender/receiver lag SLO: < 100MB
Vacuum bloat SLO: table bloat < 20%
Lock wait time SLO: < 5s average
Connection utilisation SLO: < 85% of max_connections

MongoDB SLOs

Replication oplog lag SLO: < 10s
Operation latency SLO: reads p99 < 50ms
Working set in RAM SLO: > 90% cache hit ratio
Connection pool saturation SLO: < 75%
Replica set election frequency SLO: < 1/week

Cassandra SLOs

Read latency p99 SLO: < 10ms
Write latency p99 SLO: < 5ms
Compaction queue depth SLO: < 20 pending
Dropped messages SLO: 0 dropped/hour
Hinted handoff pending SLO: < 1,000
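
Several of the SLOs above are expressed as a p99 latency. The sketch below shows one way to compute a percentile from raw latency samples using the nearest-rank method. This is an assumption about method: monitoring systems often use histograms or other interpolation schemes instead, and the function names are hypothetical.

```python
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: int) -> float:
    """Return the pct-th percentile (1..100) using the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_latency_slo(samples_ms: Sequence[float], limit_ms: float, pct: int = 99) -> bool:
    """True when the chosen percentile is strictly below the SLO limit."""
    return percentile_nearest_rank(samples_ms, pct) < limit_ms
```

With 100 samples, the p99 is the 99th smallest value, so a single outlier does not breach the SLO but two outliers do.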

Database Chaos Engineering

You cannot know your database will survive a failure until you test it. JusDB runs controlled chaos experiments against your database topology — in staging first, then production — to verify your HA mechanisms work as expected.

Primary Crash Test

Kill the primary database process. Measure: time to failover, whether the application reconnects, whether any writes are lost, and whether the secondary promotes correctly.
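
The measurements above could be derived from a simple write probe that runs throughout the experiment. The following sketch assumes a hypothetical probe that records, for each attempt, a timestamp and whether the write succeeded, plus the set of write IDs the application saw acknowledged. It is one possible way to score the test, not a prescribed procedure.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Probe:
    timestamp: float  # seconds since the start of the experiment
    write_ok: bool    # did the probe write succeed?


def failover_seconds(probes: List[Probe]) -> Optional[float]:
    """Time from the first failed write to the next successful one.

    Returns None if no write failed or if writes never recovered.
    """
    first_failure: Optional[float] = None
    for probe in probes:
        if first_failure is None:
            if not probe.write_ok:
                first_failure = probe.timestamp
        elif probe.write_ok:
            return probe.timestamp - first_failure
    return None


def lost_writes(acknowledged_ids: Set[int], ids_after_failover: Set[int]) -> Set[int]:
    """Writes the application saw acknowledged but the new primary lacks."""
    return acknowledged_ids - ids_after_failover
```

A non-empty result from `lost_writes` means acknowledged data did not survive promotion, which is the RPO violation this test is designed to surface.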

Network Partition

Inject network latency and packet loss between primary and replica. Verify: split-brain protection activates, replication lag alerts fire, application degrades gracefully.

DCS Failure (Patroni/etcd)

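Take etcd or Consul offline. Verify: Patroni demotes the primary so the cluster becomes read-only (its default behaviour when DCS failsafe mode is not enabled), no split-brain occurs, and the cluster returns to normal automatically once the DCS recovers.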

Slow Query Storm

Inject a workload of slow queries to simulate a bad deployment. Verify: slow query alerts fire, query killer activates, connection pool degrades gracefully rather than exhausting all connections.
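
To make the "query killer" step concrete, here is a minimal sketch of the selection logic only. It assumes the list of active queries has already been fetched from the database (for example from a process list view) into hypothetical `ActiveQuery` records; actually terminating the queries is database-specific and left out.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class ActiveQuery:
    query_id: int
    user: str
    runtime_seconds: float


def select_queries_to_kill(
    active: Iterable[ActiveQuery],
    max_runtime_seconds: float,
    protected_users: FrozenSet[str] = frozenset(),
) -> List[int]:
    """Return IDs of over-limit queries, longest-running first.

    Queries owned by protected users (for example a replication
    account) are never selected.
    """
    victims = [
        q for q in active
        if q.runtime_seconds > max_runtime_seconds and q.user not in protected_users
    ]
    victims.sort(key=lambda q: q.runtime_seconds, reverse=True)
    return [q.query_id for q in victims]
```

Excluding replication and backup accounts is a design choice worth testing during the storm: killing those sessions can turn a slow query incident into a replication incident.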

Disk Fill Simulation

Fill the database disk to 80%, then 95%, then 100%. Verify: alerts fire at each threshold, writes are paused at 95% before the database crashes at 100%, and the recovery procedure works.
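
The three thresholds above can be expressed as a small classification function. This is a sketch under assumptions: the state names are hypothetical, and `used / total` as reported by the operating system ignores filesystem-reserved blocks, so real usable capacity may run out slightly earlier.

```python
import shutil

# (threshold, state) pairs, checked from most to least severe.
THRESHOLDS = [(1.00, "full"), (0.95, "critical"), (0.80, "warning")]


def disk_state(used_fraction: float) -> str:
    """Classify disk usage against the 80% / 95% / 100% thresholds."""
    for limit, state in THRESHOLDS:
        if used_fraction >= limit:
            return state
    return "ok"


def used_fraction(path: str) -> float:
    """Fraction of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total
```

During the experiment, `disk_state(used_fraction(data_dir))` can be polled alongside the real alerting stack to confirm that alerts fire at the same points the sketch reports; `data_dir` here stands for whatever directory holds your database files.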

Backup & Recovery Drill

Actually restore from backup in a staging environment. Measure recovery time. Verify backup integrity. Most teams discover their backup restoration procedure is broken only during this exercise.

RTO & RPO Engineering

RTO (Recovery Time Objective) and RPO (Recovery Point Objective) are promises. JusDB engineers the actual mechanisms that make those promises achievable — and tests them to verify they hold.

RTO Engineering

Time from failure detection to database serving traffic again.

RTO < 60s: Patroni/repmgrd automatic failover with pre-warmed standby

RTO < 5 min: Automated runbook: promote standby, update DNS/VIP, restart application connections

RTO < 30 min: Manual failover with documented step-by-step runbook, pg_basebackup from recent snapshot

RTO < 4 hours: Full restore from PITR backup to new instance
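
The RTO tiers above can be read as a lookup from target to mechanism. The sketch below picks the slowest tier that still fits within a given target, on the assumption (not stated in the source) that slower tiers are cheaper to operate. The function name is hypothetical.

```python
from typing import List, Optional, Tuple

# (RTO bound in seconds, mechanism), ordered from fastest to slowest.
RTO_TIERS: List[Tuple[int, str]] = [
    (60, "Patroni/repmgrd automatic failover with pre-warmed standby"),
    (5 * 60, "Automated runbook: promote standby, update DNS/VIP, restart application connections"),
    (30 * 60, "Manual failover with documented step-by-step runbook"),
    (4 * 60 * 60, "Full restore from PITR backup to new instance"),
]


def slowest_acceptable_mechanism(target_rto_seconds: int) -> Optional[str]:
    """Return the slowest tier whose bound fits within the target.

    Returns None when even the fastest tier cannot meet the target.
    """
    chosen: Optional[str] = None
    for bound, mechanism in RTO_TIERS:
        if bound <= target_rto_seconds:
            chosen = mechanism
    return chosen
```

A 10-minute target maps to the automated runbook tier, because the manual failover tier only promises recovery within 30 minutes.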

RPO Engineering

Maximum data loss acceptable in a failure scenario.

RPO = 0: Synchronous replication to standby (PostgreSQL synchronous_commit=remote_apply)

RPO < 5s: Asynchronous streaming replication with monitored replication lag SLO

RPO < 1 min: WAL archiving to S3 every 60 seconds (pgBackRest / Barman)

RPO < 24h: Daily automated snapshot with integrity verification
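
When an RPO relies on periodic WAL archiving, the worst-case data loss is roughly the archive interval plus the time needed to ship a segment. The sketch below makes that arithmetic explicit. It assumes WAL is only shipped when a segment is switched (for example via PostgreSQL's `archive_timeout`) and that partial segments are not streamed; the function names are hypothetical.

```python
def worst_case_archive_rpo_seconds(archive_timeout_seconds: float, upload_seconds: float) -> float:
    """Upper bound on data loss when relying on WAL archiving alone.

    A transaction committed just after a segment switch waits up to one
    archive interval for the next switch, then for the upload to finish.
    """
    return archive_timeout_seconds + upload_seconds


def meets_rpo(target_seconds: float, archive_timeout_seconds: float, upload_seconds: float) -> bool:
    """True when the worst case stays within the RPO target."""
    return worst_case_archive_rpo_seconds(archive_timeout_seconds, upload_seconds) <= target_seconds
```

Under these assumptions, a 60-second archive interval with a 5-second upload gives a worst case of 65 seconds, so holding a strict one-minute RPO would need a shorter interval or a streaming mechanism alongside archiving.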


Make your databases reliably reliable

JusDB defines your database SLOs, tests your HA mechanisms, writes your runbooks, and takes over on-call for the database tier.
