Database SRE: Applying Reliability Engineering Principles Specifically to Your Database Tier
Generic SRE covers the whole stack. Database SRE goes deep on the layer that causes most P1 incidents — the database. JusDB defines database-specific SLOs, engineers for your RTO/RPO targets, runs chaos experiments against your database topology, and writes DB-specific runbooks your on-call engineers actually use.
Building an organisation-wide SRE practice with team structure, Kubernetes, and on-call tooling? See our Enterprise SRE Consulting service →
Database SLOs: What to Measure and Why
Generic SLOs measure HTTP success rate and p99 latency. Database SLOs have to go deeper: replication lag, query latency at a stated percentile, connection pool saturation, deadlock rate, and replication divergence. JusDB defines the right SLOs for your database technology and workload.
MySQL / MariaDB SLOs
- Replication lag SLO: < 5s (alert at 5s or more, breach at 30s or more)
- Query p99 latency SLO: < 100ms for OLTP
- Connection pool utilisation SLO: < 80%
- Deadlock rate SLO: < 10/hour
- Slow query rate SLO: < 0.1% of total queries
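As a rough illustration of how the replication lag thresholds above might be evaluated, here is a minimal Python sketch. It assumes the lag value has already been collected in seconds (for example from the replica status output), reads the SLO as "alert at 5s or more, breach at 30s or more", and treats a missing value as a breach. The names `LagStatus` and `classify_replication_lag` are hypothetical, and the handling of a missing value is an assumption rather than part of the SLO itself.

```python
from enum import Enum
from typing import Optional


class LagStatus(Enum):
    OK = "ok"
    ALERT = "alert"
    BREACH = "breach"


# Thresholds taken from the MySQL / MariaDB list above.
ALERT_SECONDS = 5.0
BREACH_SECONDS = 30.0


def classify_replication_lag(lag_seconds: Optional[float]) -> LagStatus:
    """Map a replication lag reading (in seconds) to an SLO status.

    Assumption: None means the replica reported no lag value (for example
    because replication is not running), which this sketch counts as a breach.
    """
    if lag_seconds is None or lag_seconds >= BREACH_SECONDS:
        return LagStatus.BREACH
    if lag_seconds >= ALERT_SECONDS:
        return LagStatus.ALERT
    return LagStatus.OK
```

A reading of 12 seconds would classify as `LagStatus.ALERT`; a replica that has stopped reporting lag would classify as `LagStatus.BREACH`.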
PostgreSQL SLOs
- Replication lag SLO: < 10s for standbys
- WAL sender/receiver lag SLO: < 100MB
- Vacuum bloat SLO: table bloat < 20%
- Lock wait time SLO: < 5s average
- Connection utilisation SLO: < 85% of max_connections
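The WAL lag SLO is expressed in bytes, so it helps to see how a byte distance falls out of two log sequence numbers. PostgreSQL prints an LSN as two hexadecimal numbers separated by a slash (the high and low 32 bits of a 64-bit position), and the server-side function `pg_wal_lsn_diff` performs this subtraction for you. The sketch below does the same arithmetic client-side. The function names are hypothetical, and reading "100MB" as 100 MiB is an assumption.

```python
# Assumption: "100MB" in the SLO is interpreted as 100 MiB.
WAL_LAG_LIMIT_BYTES = 100 * 1024 * 1024


def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_lag_bytes(primary_lsn: str, standby_lsn: str) -> int:
    """Bytes of WAL the standby still has to catch up on (never negative)."""
    return max(0, lsn_to_bytes(primary_lsn) - lsn_to_bytes(standby_lsn))


def wal_lag_within_slo(primary_lsn: str, standby_lsn: str) -> bool:
    """True when the standby is less than the configured limit behind."""
    return wal_lag_bytes(primary_lsn, standby_lsn) < WAL_LAG_LIMIT_BYTES
```

For example, a primary at `0/3000000` and a standby at `0/2000000` are 16 MiB apart, which is inside the limit.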
MongoDB SLOs
- Replication oplog lag SLO: < 10s
- Operation latency SLO: reads p99 < 50ms
- Working set in RAM SLO: cache hit ratio > 90%
- Connection pool saturation SLO: < 75%
- Replica set election frequency SLO: < 1/week
Cassandra SLOs
- Read latency p99 SLO: < 10ms
- Write latency p99 SLO: < 5ms
- Compaction queue depth SLO: < 20 pending
- Dropped messages SLO: 0 dropped/hour
- Hinted handoff pending SLO: < 1,000
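Several of the SLOs above are stated at p99. As a minimal sketch of what that means, the Python below computes a percentile with the nearest-rank method over a window of latency samples and compares it with the 10ms read target. Nearest-rank is only one of several percentile definitions, and monitoring systems often estimate percentiles from histograms instead; the function names and the choice of method are assumptions made for illustration.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def read_p99_within_slo(latencies_ms: Sequence[float], limit_ms: float = 10.0) -> bool:
    """True when the p99 of the sampled read latencies is under the limit."""
    return percentile(latencies_ms, 99) < limit_ms
```

With 100 samples, one slow outlier does not move the p99 under this definition, but two do, which is why a p99 target is more sensitive to tail latency than an average.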
Database Chaos Engineering
You cannot know your database will survive a failure until you test it. JusDB runs controlled chaos experiments against your database topology — in staging first, then production — to verify your HA mechanisms work as expected.
Primary Crash Test
Kill the primary database process. Measure: time to failover, whether the application reconnects, whether any writes are lost, and whether the secondary is promoted correctly.
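One simple way to put a number on "time to failover" is to run a write probe at a fixed interval during the test and read the outage off the probe log afterwards. The sketch below assumes such a log already exists as ordered `(timestamp_seconds, write_succeeded)` pairs; the function name and the log format are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def failover_seconds(probes: Iterable[Tuple[float, bool]]) -> Optional[float]:
    """Seconds from the first failed probe to the first success after it.

    Returns None when no probe failed, or when writes never recovered
    within the recorded window.
    """
    outage_start = None
    for timestamp, succeeded in probes:
        if outage_start is None:
            if not succeeded:
                outage_start = timestamp
        elif succeeded:
            return timestamp - outage_start
    return None
```

The result is only as precise as the probe interval, so a one-second probe cannot distinguish a 6.2 second failover from a 6.9 second one.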
Network Partition
Inject network latency and packet loss between primary and replica. Verify: split-brain protection activates, replication lag alerts fire, application degrades gracefully.
DCS Failure (Patroni/etcd)
Take etcd or Consul offline. Verify: Patroni demotes the primary so the cluster becomes read-only, no split-brain occurs, and the cluster returns to normal automatically once the DCS recovers.
Slow Query Storm
Inject a workload of slow queries to simulate a bad deployment. Verify: slow query alerts fire, query killer activates, connection pool degrades gracefully rather than exhausting all connections.
Disk Fill Simulation
Fill the database disk to 80%, then 95%, then 100%. Verify: alerts fire at each threshold, writes are paused at 95% before the database crashes at 100%, and the recovery procedure works.
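To check that an alert fires at each threshold rather than only once, it can help to compute which thresholds a disk-usage reading has newly crossed since the previous reading. The following is a minimal sketch under that assumption; `crossed_thresholds` is a hypothetical helper, and the 80/95/100 values come from the experiment described above.

```python
from typing import List, Sequence

# Thresholds from the disk fill experiment, in percent of disk used.
DISK_THRESHOLDS = (80.0, 95.0, 100.0)


def crossed_thresholds(
    previous_pct: float,
    current_pct: float,
    thresholds: Sequence[float] = DISK_THRESHOLDS,
) -> List[float]:
    """Thresholds passed on the way up from previous_pct to current_pct.

    Returns an empty list when usage fell or stayed within the same band.
    """
    return [t for t in thresholds if previous_pct < t <= current_pct]
```

During the test, each step of the fill should produce exactly the alerts this function lists for the pair of readings taken before and after the step.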
Backup & Recovery Drill
Actually restore from backup in a staging environment. Measure recovery time. Verify backup integrity. Most teams discover their backup restoration procedure is broken only during this exercise.
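One small part of "verify backup integrity" is confirming that the backup file restored in staging is byte-for-byte the file that was written. A minimal sketch using a SHA-256 checksum follows. It assumes a checksum was recorded when the backup was taken; the function names are hypothetical. A matching checksum only shows the file is unchanged, not that it restores cleanly, so it complements the restore drill rather than replacing it.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Stream a file through SHA-256 so large backups do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches_checksum(path: str, expected_hex: str) -> bool:
    """Compare the file's digest with the checksum recorded at backup time."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```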
RTO & RPO Engineering
RTO (Recovery Time Objective) and RPO (Recovery Point Objective) are promises. JusDB engineers the actual mechanisms that make those promises achievable — and tests them to verify they hold.
RTO Engineering
Time from failure detection to database serving traffic again.
RPO Engineering
Maximum data loss acceptable in a failure scenario.
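Using the two definitions above, the outcome of a failover or recovery drill can be compared with the targets directly. The sketch below assumes four timestamps (in seconds) were recorded during the drill and expresses data loss as a time window, which is one common way to state RPO. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrillResult:
    failure_at: float                # when the failure was injected
    failure_detected_at: float       # when monitoring or HA tooling noticed it
    serving_traffic_at: float        # when the database accepted traffic again
    last_recovered_commit_at: float  # newest commit present after recovery

    @property
    def achieved_rto_seconds(self) -> float:
        """Time from failure detection to serving traffic again."""
        return self.serving_traffic_at - self.failure_detected_at

    @property
    def achieved_rpo_seconds(self) -> float:
        """Window of committed work lost, measured back from the failure."""
        return max(0.0, self.failure_at - self.last_recovered_commit_at)


def meets_targets(result: DrillResult, rto_target_s: float, rpo_target_s: float) -> bool:
    """True when both the measured RTO and RPO are within their targets."""
    return (
        result.achieved_rto_seconds <= rto_target_s
        and result.achieved_rpo_seconds <= rpo_target_s
    )
```

For instance, a drill with detection at 103s, traffic restored at 148s, and the newest surviving commit 2 seconds before the failure gives an achieved RTO of 45s and an achieved RPO of 2s.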
Make your databases reliably reliable
JusDB defines your database SLOs, tests your HA mechanisms, writes your runbooks, and takes your on-call for the database tier.