Database Reliability Engineering

Database SRE: Applying Reliability Engineering Principles Specifically to Your Database Tier

Generic SRE covers the whole stack. Database SRE goes deep on the database, the layer behind many P1 incidents. JusDB defines database-specific SLOs, engineers for your RTO/RPO targets, runs chaos experiments against your database topology, and writes DB-specific runbooks your on-call engineers actually use.


Database SLOs: What to Measure and Why

Generic SLOs measure HTTP success rate and p99 latency. Database SLOs must go deeper: replication lag, query latency at a chosen percentile, connection pool saturation, deadlock rate, and replication divergence. JusDB defines the right SLOs for your database technology and workload.

MySQL / MariaDB SLOs

  • Replication lag SLO: < 5s (alert above 5s, breach above 30s)
  • Query p99 latency SLO: < 100ms for OLTP
  • Connection pool utilisation SLO: < 80%
  • Deadlock rate SLO: < 10/hour
  • Slow query rate SLO: < 0.1% of total queries
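As an illustration of how two of the thresholds above could be evaluated, here is a minimal Python sketch. It assumes the replication lag target means "alert at 5s, breach at 30s" and that lag and query counts have already been collected by your monitoring. The names `LagStatus`, `classify_replication_lag`, and `slow_query_rate_ok` are hypothetical and not part of any JusDB tooling.

```python
from enum import Enum


class LagStatus(Enum):
    OK = "ok"
    ALERT = "alert"
    BREACH = "breach"


# Thresholds taken from the MySQL / MariaDB list above.
# Reading "5s alert, 30s breach" as two escalating levels is an assumption.
LAG_ALERT_SECONDS = 5.0
LAG_BREACH_SECONDS = 30.0
SLOW_QUERY_RATE_LIMIT = 0.001  # 0.1% of total queries


def classify_replication_lag(lag_seconds: float) -> LagStatus:
    """Map one replication lag sample to OK, ALERT, or BREACH."""
    if lag_seconds >= LAG_BREACH_SECONDS:
        return LagStatus.BREACH
    if lag_seconds >= LAG_ALERT_SECONDS:
        return LagStatus.ALERT
    return LagStatus.OK


def slow_query_rate_ok(slow_queries: int, total_queries: int) -> bool:
    """True when slow queries stay under 0.1% of all queries in the window."""
    if total_queries == 0:
        return True  # no traffic in the window, nothing to violate
    return slow_queries / total_queries < SLOW_QUERY_RATE_LIMIT
```

In practice the same comparison would usually live in an alerting rule. The sketch only makes the threshold logic explicit.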

PostgreSQL SLOs

  • Replication lag SLO: < 10s for standbys
  • WAL sender/receiver lag SLO: < 100MB
  • Vacuum bloat SLO: table bloat < 20%
  • Lock wait time SLO: < 5s average
  • Connection utilisation SLO: < 85% of max_connections
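The WAL lag target above is expressed in bytes, so it helps to see how a byte figure can be derived from two LSNs. The sketch below is a hedged illustration: it assumes you have already read the primary's and the standby's positions (for example via `pg_current_wal_lsn()` and `pg_last_wal_replay_lsn()`), and it reads "100MB" as 100 MiB. The helper names are hypothetical. PostgreSQL's own `pg_wal_lsn_diff()` performs the same subtraction server-side.

```python
# "< 100MB" from the list above, read here as 100 MiB (an assumption).
WAL_LAG_LIMIT_BYTES = 100 * 1024 * 1024


def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte offset.

    An LSN prints as two hexadecimal numbers separated by a slash:
    the high 32 bits and the low 32 bits of a 64-bit WAL position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def wal_lag_bytes(primary_lsn: str, standby_lsn: str) -> int:
    """Bytes of WAL the standby is behind the primary (never negative)."""
    return max(0, lsn_to_bytes(primary_lsn) - lsn_to_bytes(standby_lsn))


def wal_lag_within_slo(primary_lsn: str, standby_lsn: str) -> bool:
    """True when the standby is less than the byte limit behind."""
    return wal_lag_bytes(primary_lsn, standby_lsn) < WAL_LAG_LIMIT_BYTES
```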

MongoDB SLOs

  • Replication oplog lag SLO: < 10s
  • Operation latency SLO: reads p99 < 50ms
  • Working set in RAM SLO: > 90% cache hit
  • Connection pool saturation SLO: < 75%
  • Replica set election frequency SLO: < 1/week

Cassandra SLOs

  • Read latency p99 SLO: < 10ms
  • Write latency p99 SLO: < 5ms
  • Compaction queue depth SLO: < 20 pending
  • Dropped messages SLO: 0 dropped/hour
  • Hinted handoff pending SLO: < 1,000

Database Chaos Engineering

You cannot know your database will survive a failure until you test it. JusDB runs controlled chaos experiments against your database topology — in staging first, then production — to verify your HA mechanisms work as expected.

Primary Crash Test

Kill the primary database process. Measure: time to failover, whether the application reconnects, whether any writes are lost, and whether the secondary promotes correctly.
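One way to turn these measurements into numbers is to run a small write probe during the test and analyse its log afterwards. The following Python sketch assumes a hypothetical probe that records a timestamp and a success flag for each write attempt, plus the IDs of acknowledged writes. It is not a description of JusDB's tooling.

```python
from typing import Iterable, List, Optional, Tuple

# One probe sample: (timestamp in seconds, did the write succeed?)
Probe = Tuple[float, bool]


def measure_outage(probes: List[Probe]) -> Optional[float]:
    """Seconds between the first failed probe and the next successful one.

    Returns None if no failure was observed or writes never recovered.
    """
    first_failure = None
    for timestamp, ok in probes:
        if not ok and first_failure is None:
            first_failure = timestamp
        elif ok and first_failure is not None:
            return timestamp - first_failure
    return None


def lost_writes(acknowledged: Iterable[int], present: Iterable[int]) -> List[int]:
    """IDs the old primary acknowledged that are missing after promotion."""
    return sorted(set(acknowledged) - set(present))
```

The measured outage is bounded by the probe interval, so a one-second probe cannot resolve failover times finer than about a second.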

Network Partition

Inject network latency and packet loss between primary and replica. Verify: split-brain protection activates, replication lag alerts fire, application degrades gracefully.

DCS Failure (Patroni/etcd)

Take etcd or Consul offline. Verify: Patroni demotes the primary and the cluster goes read-only, no split-brain occurs, and the cluster returns to normal automatically once the DCS recovers.

Slow Query Storm

Inject a workload of slow queries to simulate a bad deployment. Verify: slow query alerts fire, query killer activates, connection pool degrades gracefully rather than exhausting all connections.

Disk Fill Simulation

Fill the database disk to 80%, then 95%, then 100%. Verify: alerts fire at each threshold, writes are paused at 95% before the database can crash at 100%, and the recovery procedure works.
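A small checker like the one below can confirm which thresholds a given fill level should have triggered, so the alerts that actually fired can be compared against it. This is a sketch under assumptions: the function names are hypothetical, and `shutil.disk_usage` reports usage for the filesystem containing the given path, which may differ slightly from what your monitoring agent reports.

```python
import shutil
from typing import List

# Fill levels from the experiment above, in percent.
THRESHOLDS = (80, 95, 100)


def thresholds_crossed(used_percent: float) -> List[int]:
    """Thresholds that should have alerted at this fill level."""
    return [t for t in THRESHOLDS if used_percent >= t]


def disk_used_percent(path: str) -> float:
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total
```

For example, `thresholds_crossed(disk_used_percent("/var/lib/postgresql"))` would list the alerts expected for a data directory at that hypothetical path.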

Backup & Recovery Drill

Actually restore from backup in a staging environment. Measure recovery time. Verify backup integrity. Most teams discover their backup restoration procedure is broken only during this exercise.
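To make the drill repeatable, the restore can be wrapped in a timer and compared against the RTO target. The sketch below assumes a hypothetical `restore` callable that performs the restore and returns `True` only if integrity checks pass. How the restore itself is done (pgBackRest, Barman, snapshots) is left out on purpose.

```python
import time
from typing import Callable, Dict, Union


def timed_drill(
    restore: Callable[[], bool], rto_target_seconds: float
) -> Dict[str, Union[float, bool]]:
    """Run a restore, time it, and report whether it met the RTO target."""
    start = time.monotonic()  # monotonic clock is immune to wall-clock jumps
    verified = bool(restore())
    elapsed = time.monotonic() - start
    return {
        "elapsed_seconds": elapsed,
        "verified": verified,
        # A fast restore of a corrupt backup does not count as a pass.
        "within_rto": verified and elapsed <= rto_target_seconds,
    }
```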

RTO & RPO Engineering

RTO (Recovery Time Objective) and RPO (Recovery Point Objective) are promises. JusDB engineers the actual mechanisms that make those promises achievable — and tests them to verify they hold.

RTO Engineering

Time from failure detection to database serving traffic again.

  • RTO < 60s: Patroni/repmgrd automatic failover with pre-warmed standby
  • RTO < 5 min: Automated runbook: promote standby, update DNS/VIP, restart application connections
  • RTO < 30 min: Manual failover with documented step-by-step runbook, pg_basebackup from recent snapshot
  • RTO < 4 hours: Full restore from PITR backup to new instance
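The tiers above can be read as a lookup: given a target RTO, which mechanisms are fast enough? The following Python sketch encodes the four tiers as data. The function name `mechanisms_for` is hypothetical, and the bounds are simply the upper limits stated in the list.

```python
from typing import List, Tuple

# (upper bound on recovery time in seconds, mechanism), fastest first.
RTO_TIERS: List[Tuple[int, str]] = [
    (60, "Patroni/repmgrd automatic failover with pre-warmed standby"),
    (5 * 60, "Automated runbook: promote standby, update DNS/VIP, "
             "restart application connections"),
    (30 * 60, "Manual failover with documented step-by-step runbook"),
    (4 * 60 * 60, "Full restore from PITR backup to new instance"),
]


def mechanisms_for(target_rto_seconds: int) -> List[str]:
    """Mechanisms whose recovery-time bound fits inside the target."""
    return [name for bound, name in RTO_TIERS if bound <= target_rto_seconds]
```

A target of five minutes returns the first two tiers. The last entry returned is typically the least demanding mechanism that still meets the target.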

RPO Engineering

Maximum data loss acceptable in a failure scenario.

  • RPO = 0: Synchronous replication to standby (PostgreSQL synchronous_commit=remote_apply)
  • RPO < 5s: Asynchronous streaming replication with monitored replication lag SLO
  • RPO < 1 min: WAL archiving to S3 every 60 seconds (pgBackRest / Barman)
  • RPO < 24h: Daily automated snapshot with integrity verification
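An RPO target can be checked against history by looking at the gaps between recovery points, whether those are archived WAL segments or snapshots. The sketch below is a minimal illustration that assumes you can list the timestamps of successful recovery points. The function names are hypothetical.

```python
from typing import List


def worst_case_rpo(recovery_points: List[float], now: float) -> float:
    """Largest gap in seconds between consecutive recovery points.

    The gap from the newest recovery point to `now` is included, because a
    failure at this moment would lose everything written since that point.
    """
    if not recovery_points:
        return float("inf")  # nothing to restore from
    points = sorted(recovery_points) + [now]
    return max(later - earlier for earlier, later in zip(points, points[1:]))


def meets_rpo(recovery_points: List[float], now: float, rpo_seconds: float) -> bool:
    """True when no observed gap exceeded the RPO target."""
    return worst_case_rpo(recovery_points, now) <= rpo_seconds
```

With recovery points at 0, 60, 120, and 200 seconds and `now` at 230, the worst gap is 80 seconds, so a 60-second RPO would have been missed.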


Make your databases reliably reliable

JusDB defines your database SLOs, tests your HA mechanisms, writes your runbooks, and takes over on-call for your database tier.