Database SRE

Database Incident Response Runbook: From Alert to Resolution

A practical runbook for database incidents — triage, diagnosis, escalation, and post-mortem for MySQL and PostgreSQL

JusDB Team
January 11, 2026
11 min read
228 views

Your monitoring fires at 2:47 AM. The database is down, the on-call engineer is half-awake, and the clock is ticking. Without a clear runbook, those first critical minutes become a blur of guesswork, Slack messages, and duplicate effort. A well-structured incident response runbook eliminates ambiguity and compresses your Mean Time to Resolution (MTTR) — turning a chaotic fire drill into a disciplined, repeatable process. This guide is that runbook: practical, opinionated, and battle-tested for MySQL and PostgreSQL production environments.

TL;DR
  • Follow a strict Triage → Diagnosis → Contain → Resolve → Post-mortem sequence to avoid costly missteps under pressure.
  • Severity classification (P1/P2/P3) determines who gets paged and how fast you must act — define these before incidents happen.
  • The four most common database incidents are connection exhaustion, lock waits, disk full, and replication lag — each has a distinct diagnostic fingerprint.
  • Every incident ends with a blameless 5-why post-mortem and tracked action items. No exceptions.
  • MTTR is your north-star metric: measure it per incident type and use post-mortems to drive it down over time.

Severity Classification: Know Your P1 from Your P3

Before a single query is run or a single engineer is woken up, classify the incident. Severity level dictates your response SLA, escalation path, and communication cadence. Misclassifying a P1 as a P3 is how you lose SLA compliance and customer trust simultaneously.

Severity Definition Response SLA Who Gets Paged
P1 — Critical Production database down or data loss in progress Acknowledge in 5 min, resolve in 60 min On-call DBA, engineering lead, VP Engineering
P2 — High Significant performance degradation, partial outage, replication broken Acknowledge in 15 min, resolve in 4 hours On-call DBA, engineering lead
P3 — Medium Degraded performance, non-critical service affected, warnings in logs Acknowledge in 1 hour, resolve in 24 hours On-call DBA during business hours
Warning

When in doubt, escalate severity upward. It is far easier to stand down a P1 response than it is to scramble a team after you realize a P3 has caused data corruption.

Initial Triage: The First Five Minutes

Triage is not diagnosis. Your goal in the first five minutes is to establish a baseline picture, not to fix anything. Resist the urge to start killing connections or restarting services before you understand the state of the system. Work through this checklist in order.

Step 1: Check Active Processes and Connections

The process list is your first window into what the database is actually doing right now.

MySQL:

sql
-- View all active queries, sorted by runtime descending
SHOW FULL PROCESSLIST;

-- More detail via information_schema (filterable)
SELECT
  id,
  user,
  host,
  db,
  command,
  time,
  state,
  LEFT(info, 200) AS query_preview
FROM information_schema.processlist
WHERE command != 'Sleep'
ORDER BY time DESC;

PostgreSQL:

sql
-- View active queries with wait events
SELECT
  pid,
  usename,
  application_name,
  client_addr,
  state,
  wait_event_type,
  wait_event,
  now() - query_start AS query_duration,
  LEFT(query, 200) AS query_preview
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start ASC;

Step 2: Check Disk Usage

bash
# Overall disk usage
df -h

# MySQL data directory specifically
du -sh /var/lib/mysql/

# PostgreSQL data directory
du -sh /var/lib/postgresql/

# Find the largest tables in PostgreSQL
# (run inside psql)
sql
-- PostgreSQL: largest tables by size
SELECT
  schemaname,
  tablename,
  pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS total_size
FROM pg_tables
ORDER BY pg_total_relation_size(schemaname || '.' || tablename) DESC
LIMIT 20;

Step 3: Check Replication Lag

MySQL (replica side):

sql
SHOW SLAVE STATUS\G
-- Key fields: Seconds_Behind_Master, Slave_IO_Running, Slave_SQL_Running

PostgreSQL (replica side):

sql
-- On the replica
SELECT
  now() - pg_last_xact_replay_timestamp() AS replication_lag;

-- On the primary, view all connected replicas
SELECT
  client_addr,
  state,
  sent_lsn,
  write_lsn,
  flush_lsn,
  replay_lsn,
  (sent_lsn - replay_lsn) AS replication_lag_bytes
FROM pg_stat_replication;
Tip

Automate this triage checklist into a script that dumps all four outputs to a timestamped file the moment an alert fires. That way, you have a point-in-time snapshot even if the situation changes before you finish reading it.

Diagnosis: Common Incident Types

Once triage gives you a snapshot, pattern-match against the four most common database incident types. Each has a distinct diagnostic fingerprint and a corresponding containment strategy.

1. Connection Exhaustion

Symptoms: "Too many connections" errors, application timeouts, new connections refused.

sql
-- MySQL: check current vs max connections
SHOW VARIABLES LIKE 'max_connections';
SHOW STATUS LIKE 'Threads_connected';
SHOW STATUS LIKE 'Connection_errors_max_connections';

-- Who is holding the most connections?
SELECT user, host, COUNT(*) AS connection_count
FROM information_schema.processlist
GROUP BY user, host
ORDER BY connection_count DESC;
sql
-- PostgreSQL: check connection limits
SELECT
  max_conn,
  used,
  res_for_super,
  max_conn - used - res_for_super AS available
FROM
  (SELECT count(*) used FROM pg_stat_activity) t1,
  (SELECT setting::int res_for_super FROM pg_settings WHERE name = 'superuser_reserved_connections') t2,
  (SELECT setting::int max_conn FROM pg_settings WHERE name = 'max_connections') t3;

Containment: Kill idle connections immediately, then address root cause (missing connection pooler, connection leak in application code).

sql
-- MySQL: kill a specific connection
KILL CONNECTION 12345;

-- MySQL: kill all sleeping connections older than 60 seconds (generate statements)
SELECT CONCAT('KILL ', id, ';')
FROM information_schema.processlist
WHERE command = 'Sleep' AND time > 60;

-- PostgreSQL: terminate idle connections older than 10 minutes
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND query_start < now() - interval '10 minutes'
  AND pid != pg_backend_pid();

2. Lock Waits and Deadlocks

Symptoms: Queries queued or timing out, Lock wait timeout exceeded in MySQL, deadlock log entries.

sql
-- MySQL: find blocking and waiting transactions
SELECT
  r.trx_id AS waiting_trx,
  r.trx_mysql_thread_id AS waiting_thread,
  r.trx_query AS waiting_query,
  b.trx_id AS blocking_trx,
  b.trx_mysql_thread_id AS blocking_thread,
  b.trx_query AS blocking_query
FROM information_schema.innodb_lock_waits w
INNER JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
INNER JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id;

-- View the most recent deadlock
SHOW ENGINE INNODB STATUS\G
sql
-- PostgreSQL: find blocking queries
SELECT
  blocked.pid AS blocked_pid,
  blocked.query AS blocked_query,
  blocking.pid AS blocking_pid,
  blocking.query AS blocking_query,
  blocked.wait_event_type,
  blocked.wait_event
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE cardinality(pg_blocking_pids(blocked.pid)) > 0;

-- Kill a blocking query in PostgreSQL
SELECT pg_cancel_backend(blocking_pid);   -- Graceful cancel
SELECT pg_terminate_backend(blocking_pid); -- Hard terminate

3. Slow Queries Causing Cascade Degradation

Symptoms: CPU at 100%, query queue growing, specific tables or indexes under heavy load.

sql
-- MySQL: enable and read slow query log
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1; -- Log queries over 1 second

-- Find currently running long queries
SELECT id, user, db, time, state, LEFT(info, 300) AS query
FROM information_schema.processlist
WHERE command != 'Sleep' AND time > 10
ORDER BY time DESC;

-- Kill the offending query
KILL QUERY 12345;
sql
-- PostgreSQL: find long-running queries
SELECT
  pid,
  now() - pg_stat_activity.query_start AS duration,
  query,
  state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes'
  AND state != 'idle';

-- Cancel a query gracefully (allows rollback)
SELECT pg_cancel_backend(pid);

-- Terminate if cancel doesn't work within 30 seconds
SELECT pg_terminate_backend(pid);

4. Disk Full

Symptoms: Write failures, InnoDB errors, WAL cannot be written, binary log fill-up.

Warning

A disk-full condition on a database server can cause corruption if writes are partially committed. Do not simply delete random files to free space — identify the correct safe targets (old binary logs, WAL segments, temp files) before removing anything.

sql
-- MySQL: find binary logs consuming space
SHOW BINARY LOGS;

-- Safely purge old binary logs (keep at least enough for replica lag)
PURGE BINARY LOGS BEFORE NOW() - INTERVAL 3 DAY;
-- Or by name:
PURGE BINARY LOGS TO 'mysql-bin.001234';
bash
# PostgreSQL: find and remove old WAL segments safely
# DO NOT manually delete WAL files — use pg_archivecleanup
pg_archivecleanup /var/lib/postgresql/wal_archive/ 000000010000000000000050

# Find large temp files left by crashed backends
find /var/lib/postgresql/ -name "pgsql_tmp*" -ls

Escalation Criteria

Not every incident requires all hands on deck, but every incident requires someone to own the escalation decision. Escalate immediately when any of the following are true:

  • Data loss is confirmed or suspected.
  • The incident has persisted beyond 30 minutes without a clear path to resolution.
  • Replication is broken and the replica is more than 60 minutes behind the primary.
  • Disk usage has crossed 90% and is still rising.
  • The on-call engineer does not have sufficient access or expertise to proceed safely.
  • A change made during the incident appears to have made things worse.

Escalation is not failure — it is the correct process working. Document the escalation time in your incident timeline alongside the initial alert time. This delta is a key MTTR component.

Resolution and MTTR Tracking

Resolution is declared only when the following conditions are all true: the immediate impact has been stopped, services are returning to normal baselines, monitoring confirms stability for at least 10 consecutive minutes, and a short-term mitigation is in place even if a permanent fix is still pending.

Log the following timestamps for every incident in your incident tracker:

  • T0 — Time the incident began (not when the alert fired — when the problem started).
  • T1 — Time the alert fired.
  • T2 — Time the on-call engineer acknowledged.
  • T3 — Time the incident was contained (impact stopped spreading).
  • T4 — Time of full resolution.

MTTR = T4 - T0. Detection lag = T1 - T0. Response lag = T2 - T1. Track all three. Your monitoring alerting thresholds should be tuned to minimize T1 - T0; your runbooks should be tuned to minimize T2 - T1 and T3 - T2.

Post-Mortem: The 5-Why Template

Every P1 and P2 incident must be followed by a blameless post-mortem within 48 hours of resolution. The 5-why method is a structured approach to tracing a symptom back to its root cause without stopping at the first convenient explanation.

text
INCIDENT POST-MORTEM TEMPLATE
==============================
Incident ID:        INC-XXXX
Date:               YYYY-MM-DD
Severity:           P1 / P2 / P3
Duration:           XX minutes (T0 to T4)
MTTR:               XX minutes
Responders:         [Names]
Incident Commander: [Name]

TIMELINE
--------
T0 (incident start):   HH:MM UTC — [what happened]
T1 (alert fired):      HH:MM UTC
T2 (acknowledged):     HH:MM UTC
T3 (contained):        HH:MM UTC
T4 (resolved):         HH:MM UTC

IMPACT
------
- Services affected:
- Users/requests affected (estimated):
- Revenue impact (if known):

ROOT CAUSE ANALYSIS — 5 WHYS
------------------------------
Symptom: [e.g., "Application could not write to the database"]

Why 1: [e.g., "The disk was full"]
Why 2: [e.g., "Binary logs were not being purged"]
Why 3: [e.g., "The automatic purge job failed silently last week"]
Why 4: [e.g., "The purge job had no alerting on failure"]
Why 5: [e.g., "Operational jobs are not covered by our monitoring standards"]

Root Cause: [e.g., "Operational maintenance jobs lack failure alerting"]

WHAT WENT WELL
--------------
- [List items honestly]

WHAT WENT POORLY
----------------
- [List items honestly]

ACTION ITEMS
------------
| # | Action Item                                | Owner  | Due Date   | Priority |
|---|-------------------------------------------|--------|------------|----------|
| 1 | Add failure alerting to all cron jobs      | SRE    | YYYY-MM-DD | P1       |
| 2 | Set up automatic binary log rotation       | DBA    | YYYY-MM-DD | P1       |
| 3 | Add disk usage alert at 80% threshold      | SRE    | YYYY-MM-DD | P2       |
| 4 | Update runbook with disk-full procedure    | DBA    | YYYY-MM-DD | P3       |
Tip

Post-mortem action items must be tracked in your project management system with a due date and a named owner. Action items that live only in a post-mortem document are action items that never get done. Review open action items at your next weekly SRE or DBA sync.

Key Takeaways

Key Takeaways
  • Classify severity (P1/P2/P3) first — it governs every decision that follows, from who you wake up to how fast you need to resolve.
  • Triage before you touch anything: run SHOW PROCESSLIST / pg_stat_activity, check disk, check replication. Snapshot the state before you change it.
  • The four incident archetypes — connection exhaustion, lock waits, slow queries, disk full — each have a distinct SQL fingerprint. Knowing these cold saves critical minutes at 3 AM.
  • Kill queries surgically: use pg_cancel_backend() before pg_terminate_backend() in PostgreSQL; use KILL QUERY before KILL CONNECTION in MySQL.
  • Escalate when in doubt, when you're stuck, or when a change made things worse. Escalation is process, not failure.
  • Track T0 through T4 on every incident. MTTR only improves if you measure it.
  • A blameless 5-why post-mortem with tracked action items is the only mechanism that prevents the same incident from recurring. Skip it and you will see the incident again.

Spend Less Time Fighting Fires with JusDB

The best database incident is the one that never happens — and the second best is the one you resolve in minutes, not hours. JusDB gives your team the managed database infrastructure, automated monitoring, and expert DBA support to compress both categories. With continuous performance monitoring, automated backups with point-in-time recovery, built-in replication health checks, and a team of database engineers available when you need them, JusDB turns the runbook above from a reactive emergency procedure into a rarely-needed safety net.

Whether you are running MySQL, PostgreSQL, or a multi-database production environment, JusDB handles the operational complexity so your engineering team can focus on building instead of firefighting. Talk to the JusDB team to learn how managed database support can reduce your incident frequency and cut your MTTR.

Share this article