Database SRE

SLO and SLA for Databases: A Practical Framework

Define meaningful SLOs and SLAs for your database tier — latency, availability, durability, and error budgets

JusDB Team
January 12, 2026

Your database is on fire at 2 a.m., your on-call engineer is being paged, and your CTO wants to know what your uptime commitment actually was. If you cannot answer that question from a document you wrote before the incident, you do not have an SLA; you have a prayer. SLOs, SLAs, and SLIs are not bureaucratic paperwork; they are the engineering contracts that determine how your team allocates reliability work, when your pager fires, and what you owe your customers when things go wrong. Getting them right for the database tier is harder than for stateless services, because databases carry state, lag, and durability obligations that a simple HTTP availability check will never capture. This post gives you a concrete, working framework.

TL;DR
  • SLIs measure what you observe; SLOs are the targets you set; SLAs are the contractual consequences when you miss them.
  • Database SLIs must cover latency (p99), availability, replication lag, error rate, and durability — not just uptime.
  • 99.9% availability = 43.8 min/month downtime budget; 99.99% = 4.4 min/month. Choose the tier your business actually needs.
  • Error budgets convert abstract percentages into actionable time you can spend on incidents before consequences kick in.
  • Alert on error budget burn rate (fast burn + slow burn), not raw SLI thresholds, to avoid alert fatigue.
  • SLAs only make sense when there is a meaningful consequence — financial credit, contract clause, or customer penalty — attached to a miss.

SLO vs SLA vs SLI: The Definitions That Actually Matter

The three terms get conflated constantly, even by experienced engineers. Here is the precise hierarchy:

Service Level Indicator (SLI) is a quantitative measurement of a specific aspect of service behavior. It is a number derived from real telemetry: "p99 read latency over the past 5 minutes was 18 ms." SLIs live in your monitoring system. They have no opinions about whether that number is good or bad.

Service Level Objective (SLO) is a target range for an SLI over a defined time window. "p99 read latency will be below 50 ms for 99.5% of the 5-minute windows in a rolling 30-day period." SLOs express your reliability goal. Missing an SLO does not automatically trigger a penalty — it burns your error budget and should trigger a reliability review.

Service Level Agreement (SLA) is a contract, usually with a customer or between internal business units, that defines consequences when an SLO is missed. The consequence might be service credits, reduced fees, executive escalation, or contract termination rights. An SLA without a meaningful consequence is just an SLO with extra paperwork.

Warning

Many teams set SLAs before they have reliable SLI data. If you do not know your baseline p99 latency over the past 90 days, you are guessing at an SLO, and any SLA built on that guess will either be trivially easy to meet or reliably broken. Measure first, commit second.

The relationship flows one way: SLIs inform SLO targets, SLOs inform SLA commitments. Never work backwards from a business ask ("our customers want five nines!") without validating that your infrastructure can actually deliver it.
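To make the SLI/SLO split concrete, here is a minimal Python sketch of the window-based objective quoted above ("p99 below 50 ms for 99.5% of 5-minute windows"). The class and function names are illustrative, and the input list stands in for whatever per-window p99 values your monitoring system exports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowSLO:
    """A window-based latency objective (illustrative names, not a standard API)."""
    threshold_ms: float      # a window is "good" if its p99 is below this value
    target_fraction: float   # fraction of windows that must be good


def good_window_fraction(window_p99s_ms, slo):
    """Return the fraction of windows whose p99 is below the SLO threshold."""
    if not window_p99s_ms:
        raise ValueError("need at least one window")
    good = sum(1 for p99 in window_p99s_ms if p99 < slo.threshold_ms)
    return good / len(window_p99s_ms)


def slo_met(window_p99s_ms, slo):
    """True if enough windows were good to satisfy the objective."""
    return good_window_fraction(window_p99s_ms, slo) >= slo.target_fraction


# The SLI is the raw list of measurements; the SLO adds the judgement.
slo = WindowSLO(threshold_ms=50.0, target_fraction=0.995)
windows = [18.0] * 990 + [75.0] * 10   # 1,000 windows, 10 of them slow
print(good_window_fraction(windows, slo))  # 0.99
print(slo_met(windows, slo))               # False: 99.0% is below 99.5%
```

The SLA layer is deliberately absent from the sketch: it is a contractual consequence attached to `slo_met` returning False, not something you compute from telemetry.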

Database-Specific SLIs: What to Measure

Stateless service SLIs are simple: request success rate and latency. Databases require a richer set because they own persistent state, serve mixed read/write workloads, and often run in topologies where leader and replica health diverge.

Query Latency (p99 and p999)

Mean latency is a trap — it hides the tail experience. Use p99 (the 99th percentile) as your primary latency SLI and p999 (99.9th percentile) as a canary for severe tail events. Separate read and write latency; they degrade under different conditions and have different acceptable thresholds.
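A small Python sketch shows how far the mean can sit from the tail. The data is synthetic and the `percentile` helper uses the simple nearest-rank definition; production systems usually derive percentiles from histograms instead.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)   # 1-based rank
    return ordered[max(rank, 1) - 1]


# Synthetic workload: 980 fast queries, 15 slow ones, 5 very slow ones.
latencies_ms = [5.0] * 980 + [200.0] * 15 + [2000.0] * 5

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(round(mean_ms, 1))               # 17.9
print(percentile(latencies_ms, 50))    # 5.0
print(percentile(latencies_ms, 99))    # 200.0
print(percentile(latencies_ms, 99.9))  # 2000.0
```

A mean of 17.9 ms looks healthy, yet one query in a hundred takes at least 200 ms and one in a thousand takes two seconds.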

Availability

Availability for a database is the fraction of time the database successfully accepts connections and executes queries. A common mistake is measuring TCP-level connectivity. A database that accepts connections but returns errors on every query is not available. Measure successful query execution rate instead.
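The difference between the two measurements is easy to see in a sketch. The probe outcomes below are hypothetical; in practice they would come from a synthetic probe that connects and runs a cheap query.

```python
from enum import Enum


class ProbeResult(Enum):
    OK = "ok"                          # connected and the probe query succeeded
    QUERY_FAILED = "query_failed"      # connected, but the query returned an error
    CONNECT_FAILED = "connect_failed"  # could not connect at all


def tcp_availability(results):
    """Naive view: anything that accepted a connection counts as up."""
    up = sum(1 for r in results if r is not ProbeResult.CONNECT_FAILED)
    return up / len(results)


def query_availability(results):
    """Stricter view: only probes whose query actually succeeded count as up."""
    ok = sum(1 for r in results if r is ProbeResult.OK)
    return ok / len(results)


probes = ([ProbeResult.OK] * 90
          + [ProbeResult.QUERY_FAILED] * 8
          + [ProbeResult.CONNECT_FAILED] * 2)
print(tcp_availability(probes))    # 0.98
print(query_availability(probes))  # 0.9
```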

Replication Lag

In any primary-replica or leader-follower setup, replica lag is a distinct SLI. An application reading from a replica with 30-second lag is observing stale data. Define a lag threshold (e.g., replica lag must remain below 5 seconds for 99.9% of measurements) and track it separately from availability.

Error Rate

Track the rate of query errors — connection errors, lock timeouts, deadlocks, out-of-disk failures — as a fraction of total query attempts. Errors that your application retries successfully still count: they represent consumed latency budget and increased client complexity.
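As a rough illustration of why retried errors should still count, the Python sketch below compares the error rate over every attempt with the error rate the caller sees after retries. The data shape (one list of attempt outcomes per logical query) is an assumption made for the example.

```python
def attempt_error_rate(requests):
    """Error rate over every attempt, so retried failures are still counted."""
    attempts = [ok for request in requests for ok in request]
    return attempts.count(False) / len(attempts)


def final_error_rate(requests):
    """Error rate the caller sees after retries: only the last attempt matters."""
    return sum(1 for request in requests if not request[-1]) / len(requests)


# Each inner list is one logical query; each bool is one attempt (True = success).
requests = (
    [[True]] * 90            # succeeded first time
    + [[False, True]] * 8    # failed once, retry succeeded
    + [[False, False]] * 2   # failed even after a retry
)
print(final_error_rate(requests))              # 0.02
print(round(attempt_error_rate(requests), 3))  # 0.109
```

Counting only final outcomes reports 2% errors, while the database actually failed almost 11% of the attempts it received.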

Durability

Durability SLIs are harder to measure continuously but critical to define. The relevant targets are the Recovery Point Objective (RPO: the maximum acceptable data loss, expressed as time) and the Recovery Time Objective (RTO: the maximum acceptable time to restore service). The corresponding SLIs are the data-loss window and restore time you actually achieve, so validate them through regular restore drills, not assumptions.
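One way to turn a restore drill into numbers is sketched below in Python. The function name, the timestamps, and the 5-minute RPO and 30-minute RTO targets are all hypothetical; substitute the values from your own drill log.

```python
from datetime import datetime, timedelta


def drill_result(last_recoverable_at, failure_at, restored_at, rpo, rto):
    """Compare what a restore drill achieved against the RPO and RTO targets."""
    data_loss = failure_at - last_recoverable_at   # achieved recovery point
    downtime = restored_at - failure_at            # achieved recovery time
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= rpo,
        "rto_met": downtime <= rto,
    }


# Hypothetical drill: every number here is made up for illustration.
failure = datetime(2026, 1, 10, 2, 0, 0)
result = drill_result(
    last_recoverable_at=failure - timedelta(minutes=4),
    failure_at=failure,
    restored_at=failure + timedelta(minutes=42),
    rpo=timedelta(minutes=5),
    rto=timedelta(minutes=30),
)
print(result["rpo_met"], result["rto_met"])  # True False
```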

Tip

Start with exactly four SLIs: read latency p99, write latency p99, availability (successful query rate), and replication lag. Add error rate and durability once you have the first four instrumented and stable. More SLIs than your team actively reviews become noise.

Setting Realistic SLO Targets

The most common SLO mistake is aspiring to the highest tier without engineering justification. Here is the math that should anchor every SLO conversation:

Availability % | SLO Tier               | Monthly Budget | Annual Budget | Suitable For
99%            | Two nines              | 7.3 hours      | 3.65 days     | Internal tools, dev environments
99.5%          | Two-and-a-half nines   | 3.65 hours     | 1.83 days     | Non-critical internal services
99.9%          | Three nines            | 43.8 minutes   | 8.77 hours    | Most production SaaS workloads
99.95%         | Three-and-a-half nines | 21.9 minutes   | 4.38 hours    | High-value production services
99.99%         | Four nines             | 4.38 minutes   | 52.6 minutes  | Financial, healthcare, mission-critical
99.999%        | Five nines             | 26.3 seconds   | 5.26 minutes  | Carrier-grade, requires dedicated HA engineering

99.99% sounds like a reasonable goal until you realize your monthly maintenance window, a single failover, or one slow deploy can consume the entire budget. Four nines requires automated failover with sub-minute RTO, zero-downtime deployments, and continuous chaos testing. Be honest about your operational maturity before committing.
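The table's figures can be reproduced with a few lines of Python. The sketch assumes the table uses an average calendar month (365.25 days / 12, or 730.5 hours), which is what its 43.8-minute figure for 99.9% implies; the function and constant names are illustrative.

```python
HOURS_PER_YEAR = 365.25 * 24            # 8,766 hours
HOURS_PER_MONTH = HOURS_PER_YEAR / 12   # 730.5 hours (average month)


def downtime_budget_minutes(slo_percent, period_hours):
    """Allowed downtime, in minutes, for an availability target over a period."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100")
    return period_hours * 60 * (100 - slo_percent) / 100


for slo in (99.0, 99.5, 99.9, 99.95, 99.99, 99.999):
    monthly = downtime_budget_minutes(slo, HOURS_PER_MONTH)
    yearly = downtime_budget_minutes(slo, HOURS_PER_YEAR)
    print(f"{slo}%  monthly={monthly:.2f} min  yearly={yearly:.2f} min")
```

Passing `30 * 24` as `period_hours` gives the slightly smaller budgets for a fixed 30-day window, for example 43.2 minutes at 99.9%.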

For latency SLOs, define targets based on your measured baseline plus a safety margin. If your p99 read latency baseline is 12 ms, an SLO of "p99 below 50 ms" gives real headroom for traffic spikes. An SLO of "p99 below 15 ms" will fire constantly and train your team to ignore alerts.

Warning

Do not set SLO targets at your current best performance. Your SLO should represent the threshold below which users are meaningfully harmed, not the threshold at which you feel proud. A target too tight will drain your error budget during normal variance and create alert fatigue that masks real incidents.

Error Budgets: Turning Percentages Into Operational Time

An error budget is the inverse of your SLO expressed as allowable failure. If your availability SLO is 99.9% over a 30-day window, your error budget is 0.1% of that window: 43.2 minutes of allowable downtime. (The 43.8-minute figure in the tier table assumes an average calendar month of about 30.44 days.)

Error budgets make reliability conversations concrete. Instead of "we need to be more reliable," you can say "we burned 38 of our 43.2 minutes of error budget on Tuesday's schema migration; we have 5.2 minutes remaining this month." That statement drives real decisions: freeze risky deployments, invest in zero-downtime migration tooling, or accept the risk and plan for next month's reset.

The budget calculation for a 30-day window:

text
-- Error budget math
-- 30-day window = 30 * 24 * 60 = 43,200 minutes

99.9%  SLO  →  budget = 43,200 * 0.001  = 43.2  minutes
99.95% SLO  →  budget = 43,200 * 0.0005 = 21.6  minutes
99.99% SLO  →  budget = 43,200 * 0.0001 =  4.32 minutes

-- Budget consumed by an event:
-- event_minutes / window_minutes * 100 = % of window spent in failure
-- event_minutes / budget_minutes * 100 = % of error budget consumed
-- Example: 10-minute outage against 99.9% SLO
-- 10 / 43,200 * 100 = 0.023% of window
-- 10 / 43.2 * 100   = 23.1% of monthly error budget consumed

Track error budget as a percentage remaining, not as raw minutes. "We have 23% of our error budget left with 8 days remaining in the window" is a clear signal to reduce deployment risk. "We have 10 minutes left" requires mental math that slows incident response.
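A minimal Python sketch of the "percentage remaining" view, using the same 30-day window as the budget math above. The function names are illustrative, and `bad_minutes` is assumed to be the total downtime already recorded in the current window.

```python
def budget_minutes(slo, window_days=30):
    """Total error budget in minutes for an availability SLO over the window."""
    return window_days * 24 * 60 * (1 - slo)


def budget_remaining_pct(slo, bad_minutes, window_days=30):
    """Percentage of the error budget still unspent (negative means overspent)."""
    total = budget_minutes(slo, window_days)
    return 100 * (total - bad_minutes) / total


# One 10-minute outage against a 99.9% SLO.
print(round(budget_remaining_pct(0.999, 10), 1))   # 76.9
# Three such outages in the same window.
print(round(budget_remaining_pct(0.999, 30), 1))   # 30.6
# A negative result means the SLO has already been missed for this window.
print(round(budget_remaining_pct(0.999, 50), 1))   # -15.7
```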

Measuring Database SLIs with SQL

You need real queries to instrument your SLIs. The following examples cover PostgreSQL and MySQL, the two most common open-source databases in production environments.

PostgreSQL: Measuring Query Latency SLIs

sql
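-- Requires the pg_stat_statements extension (listed in shared_preload_libraries)
-- Enable with: CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Column names below are for PostgreSQL 13+; older versions use mean_time,
-- stddev_time, and total_time.

-- Rough per-statement p99 estimate, cumulative since the last stats reset.
-- pg_stat_statements does not store percentiles, so treat this as a ranking
-- aid and take the real p99 SLI from histogram-based client or proxy metrics.
SELECT
  queryid,
  LEFT(query, 80)                                    AS query_preview,
  calls,
  ROUND((mean_exec_time)::numeric, 2)                AS mean_ms,
  ROUND((stddev_exec_time)::numeric, 2)              AS stddev_ms,
  -- Approximate p99: mean + 2.33 * stddev. This assumes a normal distribution;
  -- real latency is usually skewed, so the estimate can be far off.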
  ROUND((mean_exec_time + 2.33 * stddev_exec_time)::numeric, 2) AS approx_p99_ms,
  ROUND((total_exec_time / 1000.0)::numeric, 2)      AS total_exec_sec,
  ROUND((rows::numeric / NULLIF(calls, 0)), 2)       AS avg_rows
FROM pg_stat_statements
WHERE calls > 100
ORDER BY approx_p99_ms DESC
LIMIT 20;

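-- Database-level availability proxy: rollback rate from pg_stat_database.
-- Rollbacks include deliberate application rollbacks, so this is a trend
-- signal rather than a true error rate.
SELECT
  datname,
  numbackends                                         AS active_connections,
  xact_commit                                         AS successful_txns,
  xact_rollback                                       AS rolled_back_txns,
  ROUND(
    100.0 * xact_rollback / NULLIF(xact_commit + xact_rollback, 0),
    4
  )                                                   AS rollback_rate_pct,
  deadlocks,
  -- blk_read_time and blk_write_time are only populated when track_io_timing = on
  blk_read_time + blk_write_time                      AS total_io_ms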
FROM pg_stat_database
WHERE datname NOT IN ('template0', 'template1', 'postgres')
ORDER BY xact_commit + xact_rollback DESC;

PostgreSQL: Measuring Replication Lag SLI

sql
-- On the primary: check lag for every replica, in bytes and in seconds
SELECT
  application_name,
  client_addr,
  state,
  sync_state,
  -- Lag in bytes
  pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)    AS send_lag_bytes,
  pg_wal_lsn_diff(sent_lsn, write_lsn)               AS write_lag_bytes,
  pg_wal_lsn_diff(write_lsn, flush_lsn)              AS flush_lag_bytes,
  pg_wal_lsn_diff(flush_lsn, replay_lsn)             AS replay_lag_bytes,
  -- Lag in time (PostgreSQL 10+)
  EXTRACT(EPOCH FROM write_lag)                        AS write_lag_sec,
  EXTRACT(EPOCH FROM flush_lag)                        AS flush_lag_sec,
  EXTRACT(EPOCH FROM replay_lag)                       AS replay_lag_sec
FROM pg_stat_replication
ORDER BY replay_lag DESC NULLS LAST;

-- SLI check: flag replicas exceeding 5-second lag threshold
SELECT
  application_name,
  EXTRACT(EPOCH FROM replay_lag)  AS lag_seconds,
  CASE
    WHEN EXTRACT(EPOCH FROM replay_lag) > 5  THEN 'SLO_BREACH'
    WHEN EXTRACT(EPOCH FROM replay_lag) > 2  THEN 'WARNING'
    ELSE 'OK'
  END                             AS sli_status
FROM pg_stat_replication;

MySQL / MariaDB: Measuring Replication Lag SLI

sql
-- On a replica: check lag and thread status
-- (MySQL 8.0.22+; use SHOW SLAVE STATUS on older versions)
SHOW REPLICA STATUS\G

-- Seconds_Behind_Source in the output above is the primary lag SLI.
-- It is not exposed in performance_schema; the query below adds applier
-- health (worker state and last error) per replication channel in MySQL.
SELECT
  CHANNEL_NAME,
  WORKER_ID,
  SERVICE_STATE                                   AS applier_state,
  LAST_ERROR_NUMBER,
  LAST_ERROR_MESSAGE,
  LAST_ERROR_TIMESTAMP
FROM performance_schema.replication_applier_status_by_worker;

-- MySQL 8.0: query latency percentiles from performance_schema
-- Timer and quantile columns are in picoseconds; dividing by 1e9 gives milliseconds.
SELECT
  SCHEMA_NAME,
  DIGEST_TEXT                                     AS query_pattern,
  COUNT_STAR                                      AS executions,
  ROUND(AVG_TIMER_WAIT / 1e9, 2)                 AS avg_latency_ms,
  ROUND(QUANTILE_99 / 1e9, 2)                    AS p99_latency_ms,
  ROUND(QUANTILE_999 / 1e9, 2)                   AS p999_latency_ms,
  ROUND(MAX_TIMER_WAIT / 1e9, 2)                 AS max_latency_ms,
  ROUND(SUM_ERRORS * 100.0 / NULLIF(COUNT_STAR, 0), 4) AS error_rate_pct
FROM performance_schema.events_statements_summary_by_digest
WHERE COUNT_STAR > 50
  AND SCHEMA_NAME NOT IN ('mysql', 'performance_schema', 'information_schema', 'sys')
ORDER BY p99_latency_ms DESC
LIMIT 20;

Alerting on Error Budget Burn Rate

Alerting directly on SLI thresholds creates alert fatigue. A p99 spike that lasts 30 seconds and resolves itself will fire your alert, wake an engineer, and consume zero meaningful error budget. Instead, alert on error budget burn rate: how fast you are consuming the monthly budget right now.

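The standard approach (borrowed from Google's SRE workbook) uses two burn rate alerts, each evaluated over a pair of windows: a fast-burn alert to catch acute incidents and a slow-burn alert to catch gradual degradation, where each blip looks fine in isolation but the damage accumulates. The multipliers below are this post's choices; the workbook's own example uses 14.4x and 6x for the same window pairs: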

text
-- Burn rate alerting logic (pseudo-code / Prometheus-style rules)

-- FAST BURN: catches acute incidents
-- Alert if you are burning budget 14x faster than the sustainable rate
-- over a 1-hour window AND a 5-minute window (both must be true)
-- At 14x burn rate, a 99.9% SLO exhausts its 30-day budget in ~2.1 days
-- (30 days / 14), and a single hour at that rate consumes ~2% of it

ALERT DatabaseAvailabilityFastBurn
  IF (
    avg_over_time(sli:db_success_rate:ratio_rate5m[5m]) < (1 - 14 * 0.001)
    AND
    avg_over_time(sli:db_success_rate:ratio_rate1h[1h]) < (1 - 14 * 0.001)
  )
  SEVERITY: page

-- SLOW BURN: catches gradual degradation
-- Alert if you are burning budget 3x faster than sustainable rate
-- over a 6-hour window AND a 30-minute window
-- At 3x burn rate, the monthly budget exhausts in ~10 days

ALERT DatabaseAvailabilitySlowBurn
  IF (
    avg_over_time(sli:db_success_rate:ratio_rate30m[30m]) < (1 - 3 * 0.001)
    AND
    avg_over_time(sli:db_success_rate:ratio_rate6h[6h])   < (1 - 3 * 0.001)
  )
  SEVERITY: ticket

-- Burn rate multipliers for common SLO tiers:
-- 99.9%  SLO: error_threshold = 0.001, fast_burn = 14x, slow_burn = 3x
-- 99.95% SLO: error_threshold = 0.0005, same multipliers
-- 99.99% SLO: error_threshold = 0.0001, fast_burn = 14x, slow_burn = 3x

Tip

Run a separate set of burn rate alerts for latency SLOs, not just availability. A database that is up but returning p99 latency of 500 ms when your SLO threshold is 50 ms fails its latency SLI in every measurement window. Against a target of 99.5% good windows, that is a 200x burn rate (100% bad windows divided by the 0.5% allowed), even while availability looks healthy. Latency budget burn is just as important to surface as availability budget burn, and it often predicts an availability incident by 15-30 minutes.
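The burn rate logic can be sketched in a few lines of Python. This is an illustration of the arithmetic rather than a drop-in alerting rule; the function names are made up, and the error ratios would come from your own SLI recording rules.

```python
def burn_rate(error_ratio, slo):
    """How many times faster than sustainable the error budget is being spent."""
    return error_ratio / (1 - slo)


def multiwindow_alert(short_error_ratio, long_error_ratio, slo, factor):
    """Fire only when both the short and the long window exceed the burn factor."""
    return (burn_rate(short_error_ratio, slo) >= factor
            and burn_rate(long_error_ratio, slo) >= factor)


def days_to_exhaustion(rate, window_days=30):
    """Days until the whole budget is gone if the burn rate stays constant."""
    return window_days / rate


SLO = 0.999
# Acute incident: 2% of queries failing in both the 5-minute and 1-hour windows.
print(multiwindow_alert(0.02, 0.02, SLO, factor=14))    # True (burn rate is about 20x)
# Brief blip: the 5-minute window is bad but the 1-hour window is still healthy.
print(multiwindow_alert(0.02, 0.0005, SLO, factor=14))  # False
print(round(days_to_exhaustion(14), 1))                 # 2.1
print(round(days_to_exhaustion(3), 1))                  # 10.0
# Latency SLO of 99.5% good windows with every window bad.
print(round(burn_rate(1.0, 0.995)))                     # 200
```

Requiring both windows to exceed the factor is what suppresses the 30-second spike: the short window trips, but the long window has not accumulated enough failures to confirm a real budget drain.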

When SLAs Make Sense (and When They Do Not)

An SLA requires three things to be meaningful: a measurable SLI, an agreed SLO, and a consequence. Missing any one of these turns an SLA into a marketing document.

SLAs make sense when:

  • You are selling database-as-a-service or infrastructure to external customers who make purchasing decisions based on reliability commitments.
  • There is a financial consequence (service credits, refunds, contract penalties) that you are prepared to honor and that the customer values.
  • You have at least 90 days of SLI data proving you can consistently meet the target before you commit to it contractually.
  • Internal chargebacks or business unit cost allocation create a genuine incentive to honor the commitment.

SLAs do not make sense when:

  • You have no SLI instrumentation in place. You cannot honor a commitment you cannot measure.
  • The consequence is so small it creates no incentive to improve (a 1% credit on a $50/month bill does not change behavior).
  • The target is aspirational rather than based on measured baseline performance.
  • The service is purely internal with no financial or contractual relationship between provider and consumer.

Warning

Many organizations create internal SLAs between the database team and product teams as an accountability mechanism. This works only if there are real consequences — priority escalation, headcount allocation, or executive attention — when the SLA is breached. An internal SLA with no consequence is an SLO with extra meetings.

Key Takeaways
  • SLIs are measurements, SLOs are targets, SLAs are contracts with consequences. Never conflate them.
  • Database SLIs must cover latency (p99 reads and writes separately), availability (successful query rate), replication lag, error rate, and durability (RPO/RTO).
  • 99.9% availability gives you 43.8 minutes of monthly error budget; 99.99% gives you 4.38 minutes. Choose based on your operational maturity, not your ambition.
  • Express error budgets as percentage remaining, not raw minutes, to make burn rate intuitive during incidents.
  • Alert on fast-burn (14x rate, 1-hour window) and slow-burn (3x rate, 6-hour window) patterns, not raw SLI thresholds.
  • Run latency burn rate alerts alongside availability burn rate alerts — latency degradation usually precedes availability failure.
  • Commit to SLAs only after 90+ days of SLI data, with a meaningful consequence attached, and at a target your baseline proves you can meet.
  • Use pg_stat_statements and pg_stat_replication in PostgreSQL, and performance_schema in MySQL, as your primary SLI data sources.

How JusDB Helps You Operationalize Database SLOs

Defining SLOs is the easy part. Maintaining the instrumentation, reviewing burn rates every sprint, and keeping SLO targets calibrated as your workload changes is where most teams fall short. JusDB provides continuous database performance intelligence — query latency percentiles, replication lag tracking, error rate trending, and availability reporting — built specifically for the SLI metrics described in this post.

Instead of writing and maintaining bespoke SQL monitoring queries against pg_stat_statements and performance_schema, JusDB surfaces the SLI data your SLO framework needs out of the box. You define the thresholds; JusDB tracks the burn.

Explore JusDB's database observability platform and start measuring the SLIs that make your SLOs and SLAs credible.
