Database SRE

Database Observability: Metrics, Logs, Traces, and Prometheus Alerting

Build a complete database observability stack with Prometheus metrics, slow query logs, and OpenTelemetry distributed traces. Includes alert rules for replication lag and cache hit rate.

JusDB Team
September 10, 2025
5 min read
171 views

Database observability goes beyond monitoring. It means having enough context to understand why the database is slow — not just that it is slow. Here is how to build a complete observability stack.

The Three Pillars

  • Metrics: aggregated numbers over time (TPS, latency, cache hit rate)
  • Logs: timestamped records of individual events (slow queries, errors)
  • Traces: end-to-end request paths showing time spent in each layer

Key Metrics to Track

text
PostgreSQL:
  - transactions per second (tps)
  - cache hit ratio (from pg_stat_bgwriter)
  - replication lag (pg_stat_replication.replay_lag)
  - table/index bloat (n_dead_tup)
  - connection count (pg_stat_activity)

MySQL:
  - Questions/Queries per second
  - InnoDB buffer pool hit ratio
  - Threads_running vs Threads_connected
  - Innodb_row_lock_waits
  - Seconds_Behind_Master

Prometheus + postgres_exporter

bash
# Run postgres_exporter
docker run -e DATA_SOURCE_NAME='postgresql://postgres:pass@localhost:5432/postgres?sslmode=disable' \
  quay.io/prometheuscommunity/postgres-exporter

# Key metrics exposed:
# pg_stat_database_tup_fetched
# pg_stat_bgwriter_buffers_alloc
# pg_replication_lag
# pg_stat_user_tables_n_dead_tup

Prometheus Alert Rules

yaml
groups:
  - name: postgres
    rules:
    - alert: PostgreSQLHighReplicationLag
      expr: pg_replication_lag > 30
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: PostgreSQL replication lag > 30s

    - alert: PostgreSQLLowCacheHitRate
      expr: |
        pg_stat_database_blks_hit /
        (pg_stat_database_blks_hit + pg_stat_database_blks_read) < 0.95
      for: 5m
      labels:
        severity: critical

Distributed Tracing with OpenTelemetry

python
from opentelemetry import trace
from opentelemetry.instrumentation.psycopg2 import Psycopg2Instrumentor

# Auto-instrument database calls
Psycopg2Instrumentor().instrument()

# Every SQL call now appears in your trace with:
# - SQL query text
# - Execution time
# - Row counts
# - Connection attributes

Key Takeaways

  • Instrument metrics, logs, and traces — each answers different diagnostic questions
  • Set up Prometheus alerts for replication lag, low cache hit rate, and connection count spikes
  • OpenTelemetry auto-instrumentation adds database spans to distributed traces with zero code changes
  • Slow query logs are the most actionable log — ensure they are enabled and shipped to your log aggregator

JusDB Can Help

Building a complete database observability stack is complex. JusDB can design and implement monitoring, alerting, and tracing for your database layer.

Share this article

JusDB Team

Official JusDB content team