At 3:12 AM on a Saturday, an on-call DBA at a logistics company received a cascade of alerts — not from their databases, but from the complete silence of their monitoring. Their single PMM Server had gone down due to a failed Docker volume mount, and suddenly 47 MySQL and PostgreSQL instances were invisible: no slow query data, no replication lag graphs, no InnoDB buffer pool metrics. A failover that would normally take minutes to diagnose stretched to two hours because the team was flying blind, piecing together what happened from application logs and guesswork. The database was fine; the monitoring system was the single point of failure that paralyzed the response.
- A single PMM Server is a SPOF — when it goes down, you lose visibility across every monitored database instance simultaneously.
- Active-passive HA with a shared persistent volume (NFS, EBS, or GCP Persistent Disk) is the lowest-complexity path to PMM Server redundancy.
- In Kubernetes, deploy PMM Server as a StatefulSet with a PersistentVolumeClaim backed by a retained storage class for automatic pod rescheduling.
- VictoriaMetrics clustering mode decouples metric ingestion (vminsert) from storage (vmstorage) and query (vmselect), enabling horizontal HA for the metrics tier.
- PMM Clients buffer locally and reconnect automatically after a server restart, so short outages (under 5 minutes) cause minimal metric gaps.
- Back up PMM data with VictoriaMetrics snapshots and clickhouse-backup; test restores before you need them in production.
PMM Architecture: What Is Actually Running Inside the Server
Before designing HA, you need to understand what PMM Server is: not a single process, but a composition of several systems running inside a container or pod. Each component has different HA characteristics and failure modes.
Core Components
VictoriaMetrics is PMM's time-series metrics backend, replacing the Prometheus server used in PMM 1.x and early 2.x releases. It stores all numerical metrics (query latency histograms, InnoDB stats, replication lag, connection counts) collected by PMM Clients via the vmagent pipeline. VictoriaMetrics stores data in its own on-disk format under /srv/victoriametrics/data inside the container.
ClickHouse stores Query Analytics (QAN) data: the per-fingerprint query digests, example queries, explain plans, and per-schema breakdowns that make PMM's "Query Analytics" tab useful. This is the data engineers reach for first during a performance incident. ClickHouse stores its data under /srv/clickhouse inside the PMM container. For high-traffic databases, the ClickHouse dataset is often significantly larger than the VictoriaMetrics dataset.
Grafana is the visualization layer. It reads from VictoriaMetrics (via the VictoriaMetrics data source) and from ClickHouse (via the ClickHouse data source). Grafana stores its own state — dashboards, users, data sources, alerts — in a SQLite database at /srv/grafana/grafana.db by default, though PMM can be configured to use an external PostgreSQL backend for Grafana.
PMM Managed (pmm-managed) is the internal API server that handles PMM Client registration, agent configuration, and service inventory. It maintains its state in a PostgreSQL database running on port 5432 inside the container, stored at /srv/postgres.
PMM Clients
Each monitored database host runs a lightweight PMM Client (pmm-agent) that spawns per-service exporters — mysqld_exporter, postgres_exporter, node_exporter, mongodb_exporter — and a vmagent process that scrapes those exporters and pushes metrics to the PMM Server. Critically, pmm-agent uses a gRPC-based connection to the server for configuration, separate from the push path for metrics data.
Why a Single PMM Server Is a Monitoring SPOF
The PMM Server container bundles VictoriaMetrics, ClickHouse, Grafana, PostgreSQL, and pmm-managed into a single addressable endpoint. When that container crashes, restarts, or its underlying host becomes unavailable, every one of those subsystems disappears at once. Your dashboards go blank. QAN data stops accumulating. Alerts based on Grafana rules stop firing. Engineers responding to database incidents have no historical context for the five minutes before the problem started.
VictoriaMetrics and ClickHouse are designed to recover after a crash, but recently ingested data that has not yet been flushed to disk is not protected against an abrupt container kill. A SIGKILL or OOM kill without a graceful shutdown can lose or corrupt the last few seconds of data. Always configure your container runtime to send SIGTERM and allow a 30-second graceful shutdown window before forcing termination.
The data volume on the PMM Server host is equally critical. The entire monitoring history — weeks or months of metrics and QAN data — lives on a single disk. Without HA storage, a failed volume means starting from scratch even after the server comes back up.
Option 1: Active-Passive PMM with Shared Persistent Storage
The simplest production-grade HA pattern is two PMM Server instances sharing a single persistent volume, with only one active at a time. A virtual IP, or a DNS record with a 30-second TTL, handles failover. This requires no changes to PMM Clients, which point to the VIP or DNS name.
Storage Options by Platform
On AWS: use a single EBS volume in the same AZ, or EFS (NFS) if you want cross-AZ capability with slightly higher latency. On GCP: use a Regional Persistent Disk, which is replicated across two zones. On bare metal or VMware: an NFS share from a NAS appliance or Ceph RBD with RWO semantics. The requirement is that only one PMM Server mounts the volume read-write at any time — concurrent writes from two instances will corrupt VictoriaMetrics and ClickHouse data.
Docker Compose: PMM with an External Volume
version: "3.9"
services:
  pmm-server:
    image: percona/pmm-server:2.42.0   # pin an explicit version, not the floating "2" tag
    container_name: pmm-server
    restart: unless-stopped
    ports:
      - "443:443"
      - "80:80"
    volumes:
      # Mount the shared NFS or EBS volume to /srv
      - /mnt/pmm-data:/srv
    environment:
      - PMM_DEBUG=0
    stop_grace_period: 30s   # allow graceful shutdown before SIGKILL
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost/v1/readyz"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 60s

On the passive node, the identical docker-compose.yml is present but the service is stopped. A Pacemaker or Keepalived agent monitors the primary and, on failure, mounts the shared volume on the secondary host and starts the container. The total failover time, including VictoriaMetrics and ClickHouse startup recovery, is typically 60–90 seconds.
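Mount /mnt/pmm-data over NFS with the soft option and a reasonable timeo value. A hung NFS mount that blocks indefinitely will stall PMM Server startup and confuse your health check logic. Be aware that a soft mount returns I/O errors to the application once the timeout expires, which can itself damage in-flight writes, so prefer hard if your NFS server is highly reliable. The intr option is ignored on modern Linux kernels and can be omitted.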
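The failover decision made by the monitoring agent can be sketched as a count of consecutive failed health probes. This is a minimal illustration, not Pacemaker's or Keepalived's actual logic; the function name and the threshold of 3 are hypothetical and simply mirror the retries value in the Compose health check.

```python
from typing import Iterable


def should_fail_over(probe_results: Iterable[bool], unhealthy_threshold: int = 3) -> bool:
    """Return True once `unhealthy_threshold` consecutive probes have failed.

    `probe_results` holds health-check outcomes in time order (True = healthy).
    A single successful probe resets the failure streak, so a brief blip on
    the primary does not trigger a failover.
    """
    streak = 0
    for healthy in probe_results:
        streak = 0 if healthy else streak + 1
        if streak >= unhealthy_threshold:
            return True
    return False
```

Requiring several consecutive failures trades a slightly longer detection time for fewer spurious failovers, which matters here because every failover remounts the shared volume.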
Option 2: Kubernetes StatefulSet Deployment
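On Kubernetes, the natural PMM HA primitive is a StatefulSet with a single replica and a PersistentVolumeClaim. Kubernetes reschedules the pod to a healthy node when the original node fails, reattaching the same PVC. One caveat: if the node becomes unreachable rather than being cleanly drained, the StatefulSet controller waits until the node object is removed or the pod is force-deleted before it creates a replacement. This gives you the same active-passive behavior without managing Pacemaker or Keepalived.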
StatefulSet Manifest
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: pmm-server
  namespace: monitoring
spec:
  serviceName: pmm-server
  replicas: 1
  selector:
    matchLabels:
      app: pmm-server
  template:
    metadata:
      labels:
        app: pmm-server
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: pmm-server
          image: percona/pmm-server:2.42.0
          ports:
            - containerPort: 80
            - containerPort: 443
          volumeMounts:
            - name: pmm-data
              mountPath: /srv
          livenessProbe:
            httpGet:
              path: /v1/readyz
              port: 80
            initialDelaySeconds: 60
            periodSeconds: 15
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /v1/readyz
              port: 80
            initialDelaySeconds: 30
            periodSeconds: 10
  volumeClaimTemplates:
    - metadata:
        name: pmm-data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3-retain   # use a Retain reclaim policy
        resources:
          requests:
            storage: 500Gi
---
apiVersion: v1
kind: Service
metadata:
  name: pmm-server
  namespace: monitoring
spec:
  selector:
    app: pmm-server
  ports:
    - name: http
      port: 80
      targetPort: 80
    - name: https
      port: 443
      targetPort: 443

Use a StorageClass with reclaimPolicy: Retain, not Delete. If the StatefulSet is accidentally deleted, a Retain policy keeps your months of monitoring data on the PV. With Delete, the PV and all its data are destroyed immediately when the PVC is removed.
For PMM Clients in Kubernetes or outside the cluster, configure the --server-url to point to the Service's ClusterIP (for in-cluster agents) or to an Ingress / LoadBalancer Service (for external agents). The Service address remains stable across pod restarts, so PMM Clients reconnect without reconfiguration.
VictoriaMetrics Clustering Mode for Metrics HA
The single-node VictoriaMetrics bundled inside PMM Server is not horizontally scalable. For very high cardinality environments (hundreds of database instances, thousands of metrics per instance), VictoriaMetrics cluster mode separates ingestion, storage, and querying into independently scalable components.
- vminsert receives metric writes from PMM Clients' vmagent and fans them out to multiple vmstorage nodes using consistent hashing. Run two or more replicas for HA.
- vmstorage stores raw time-series data on local disk. With replication factor 2 (-replicationFactor=2), each metric is written to two storage nodes, so one node failure causes no data loss.
- vmselect handles queries from Grafana, merging results from all storage nodes. Run two or more replicas behind a load balancer.
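To see why a replication factor of 2 tolerates one storage node failure, here is a small placement sketch. It is an illustration only: it uses simple modulo placement on a SHA-256 hash rather than a real consistent-hash ring, and it is not VictoriaMetrics' actual algorithm. The function names and series keys are hypothetical.

```python
import hashlib
from typing import Iterable, List


def replica_nodes(series_key: str, nodes: List[str], replication_factor: int = 2) -> List[str]:
    """Pick `replication_factor` distinct nodes for one series.

    The first node is chosen by hashing the series key; the remaining
    copies go to the following nodes in the list, wrapping around.
    """
    if not 1 <= replication_factor <= len(nodes):
        raise ValueError("replication_factor must be between 1 and len(nodes)")
    digest = hashlib.sha256(series_key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]


def survives_single_failure(series_keys: Iterable[str], nodes: List[str],
                            replication_factor: int = 2) -> bool:
    """True if every series keeps at least one copy after any one node fails."""
    keys = list(series_keys)
    for failed in nodes:
        for key in keys:
            placement = replica_nodes(key, nodes, replication_factor)
            if all(node == failed for node in placement):
                return False
    return True
```

With three nodes and two distinct copies per series, no single failure can remove both copies. With a replication factor of 1, whichever node holds a series takes it offline when it fails.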
# vminsert: two replicas, writing to three vmstorage nodes
docker run -d --name vminsert \
  -p 8480:8480 \
  victoriametrics/vminsert:stable \
  -storageNode=vmstorage-1:8400 \
  -storageNode=vmstorage-2:8400 \
  -storageNode=vmstorage-3:8400 \
  -replicationFactor=2

# vmstorage: three nodes, each with its own data volume
docker run -d --name vmstorage-1 \
  -p 8400:8400 -p 8401:8401 \
  -v /mnt/vmstorage-1:/storage \
  victoriametrics/vmstorage:stable \
  -storageDataPath=/storage \
  -retentionPeriod=90d

# vmselect: queries across all storage nodes
# (the deduplication flag is spelled -dedup.minScrapeInterval)
docker run -d --name vmselect \
  -p 8481:8481 \
  victoriametrics/vmselect:stable \
  -storageNode=vmstorage-1:8401 \
  -storageNode=vmstorage-2:8401 \
  -storageNode=vmstorage-3:8401 \
  -dedup.minScrapeInterval=30s

To integrate VictoriaMetrics cluster with PMM, replace the bundled VictoriaMetrics data source in Grafana with the vmselect endpoint, and configure PMM's vmagent (or each PMM Client's vmagent) to push to the vminsert endpoint instead of the local PMM Server. This configuration is currently a manual integration step: Percona does not officially bundle cluster-mode VictoriaMetrics in the PMM Server image.
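In cluster mode the vminsert and vmselect endpoints are tenant-scoped URLs rather than bare host:port pairs. The sketch below builds both. The hostnames follow the container names used in this section, and tenant 0 is an assumption that only fits a deployment that does not use multitenancy.

```python
def vminsert_remote_write_url(host: str, account_id: int = 0, port: int = 8480) -> str:
    """Remote-write URL that a vmagent would push to in cluster mode."""
    return f"http://{host}:{port}/insert/{account_id}/prometheus/api/v1/write"


def vmselect_datasource_url(host: str, account_id: int = 0, port: int = 8481) -> str:
    """Prometheus-compatible query URL for a Grafana data source in cluster mode."""
    return f"http://{host}:{port}/select/{account_id}/prometheus"


if __name__ == "__main__":
    print(vminsert_remote_write_url("vminsert"))
    print(vmselect_datasource_url("vmselect"))
```

In an HA setup the host would normally be the load balancer in front of the vminsert or vmselect replicas rather than a single container.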
PMM Client Reconnection Behavior
A common concern during planned PMM Server maintenance is whether PMM Clients will lose data or need manual reconfiguration. The behavior is more resilient than most engineers expect.
pmm-agent maintains a local vmagent process that continues collecting and buffering metrics to disk even when the PMM Server is unreachable. The default buffer is 1 GiB. Once the server returns, vmagent replays the buffered data in chronological order, backfilling the gap. For outages under 5–10 minutes, the backfill completes quickly and graphs show no visible gaps.
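How long an outage a 1 GiB buffer can absorb depends on how fast a host produces metrics. The arithmetic can be sketched as below. The sample rate and bytes-per-sample figures are hypothetical placeholders, not measured PMM values, so substitute numbers observed on your own hosts.

```python
GIB = 1024 ** 3


def buffer_duration_seconds(buffer_bytes: int, samples_per_second: float,
                            bytes_per_sample: float) -> float:
    """Seconds of outage a disk buffer can absorb before it fills up."""
    if samples_per_second <= 0 or bytes_per_sample <= 0:
        raise ValueError("rates must be positive")
    return buffer_bytes / (samples_per_second * bytes_per_sample)


if __name__ == "__main__":
    # Hypothetical host: 5,000 samples/s at about 2 bytes per buffered sample.
    seconds = buffer_duration_seconds(GIB, samples_per_second=5000, bytes_per_sample=2.0)
    print(f"{seconds / 3600:.1f} hours")
```

Under these assumed numbers the buffer lasts far longer than a typical failover, so the limiting factor for short outages is replay time rather than buffer size.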
The gRPC control channel from pmm-agent to the server will show disconnected, and pmm-agent will retry the connection with exponential backoff. No restart of pmm-agent or reconfiguration is needed — it reconnects automatically when the server becomes reachable again at the same address.
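Exponential backoff of the kind described above can be sketched as a generator of retry delays. The base delay, growth factor, cap, and jitter here are hypothetical defaults chosen for illustration; they are not pmm-agent's actual values.

```python
import random
from itertools import islice
from typing import Iterator, List, Optional


def backoff_delays(base: float = 1.0, factor: float = 2.0, cap: float = 60.0,
                   jitter: float = 0.0,
                   rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield retry delays that grow by `factor` up to `cap`.

    `jitter` is a fraction of the current delay (0.1 = plus or minus 10%)
    added at random so that many agents do not reconnect in lockstep.
    """
    rng = rng or random.Random()
    delay = base
    while True:
        spread = delay * jitter
        offset = rng.uniform(-spread, spread) if spread else 0.0
        yield delay + offset
        delay = min(delay * factor, cap)


def first_delays(count: int, **kwargs) -> List[float]:
    """Convenience wrapper returning the first `count` delays as a list."""
    return list(islice(backoff_delays(**kwargs), count))
```

The cap matters for failover: without it, an agent that has been retrying for a long time could wait many minutes after the server returns before trying again.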
If you use a load balancer in front of PMM Server, register PMM Clients against the load balancer DNS name, not the individual node IPs. When the active PMM Server changes, the PMM Client gRPC reconnect will naturally land on the new active node through the load balancer, with no agent reconfiguration required.
Health Check Endpoints for Load Balancer Configuration
PMM Server 2.x exposes a readiness endpoint, /v1/readyz, for load balancer probes. Use it to determine when the server is ready to accept traffic after startup or failover. The StatefulSet manifest earlier uses the same path for both its liveness and readiness probes.

# Readiness: returns 200 OK when all components are initialized.
# Use this as the LB health check; it fails during startup/initialization.
curl -sf http://pmm-server/v1/readyz

# Example nginx upstream block with a passive standby
upstream pmm_backend {
    server pmm-server-1:80;
    server pmm-server-2:80 backup;  # passive standby
}

# Example AWS ALB target group health check settings:
#   Protocol: HTTP
#   Path: /v1/readyz
#   Healthy threshold: 2
#   Unhealthy threshold: 3
#   Interval: 10s
#   Timeout: 5s

VictoriaMetrics itself exposes /health on its HTTP port and /api/v1/status/tsdb for storage diagnostics. Port 8428 is the upstream VictoriaMetrics default; confirm the listen address and path prefix actually used inside your PMM container. In a clustered setup, each component exposes its own /health endpoint independently.
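The same healthy-threshold logic a load balancer applies can be scripted for failover runbooks or CI checks. The sketch below polls a readiness probe until it has succeeded a set number of times in a row. The function names and defaults are hypothetical; the defaults mirror the ALB settings above, and the URL in the usage note assumes the pmm-server hostname used throughout this article.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_ready(url: str, timeout: float = 5.0) -> bool:
    """Return True if `url` answers with HTTP 200 within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def wait_until_ready(probe: Callable[[], bool], healthy_threshold: int = 2,
                     interval: float = 10.0, max_attempts: int = 30,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `probe` until it succeeds `healthy_threshold` times in a row."""
    streak = 0
    for attempt in range(max_attempts):
        streak = streak + 1 if probe() else 0
        if streak >= healthy_threshold:
            return True
        if attempt < max_attempts - 1:
            sleep(interval)
    return False
```

A typical call would be wait_until_ready(lambda: http_ready("http://pmm-server/v1/readyz")). The probe and sleep functions are injectable so the loop can be tested without a network or real delays.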
Backing Up PMM Data
VictoriaMetrics Snapshots
VictoriaMetrics supports consistent on-disk snapshots without stopping the process. Trigger a snapshot via the management API, then copy the snapshot directory to S3 or another durable store.
# 8428 is the VictoriaMetrics default port; adjust host, port, and path prefix
# to match how VictoriaMetrics is exposed in your PMM deployment.

# Trigger a snapshot (the JSON response carries the name in the "snapshot" field)
SNAP=$(curl -s 'http://pmm-server:8428/snapshot/create' | python3 -c "import sys,json; print(json.load(sys.stdin)['snapshot'])")
echo "Snapshot created: $SNAP"

# Snapshot directory is at:
#   /srv/victoriametrics/data/snapshots/${SNAP}/

# Copy to S3
aws s3 sync "/srv/victoriametrics/data/snapshots/${SNAP}/" \
  "s3://your-bucket/pmm-backups/victoriametrics/${SNAP}/" \
  --storage-class STANDARD_IA

# List and clean up old snapshots (keep last 7)
# OLD_SNAP is a snapshot name taken from the list output
curl -s 'http://pmm-server:8428/snapshot/list'
curl -s "http://pmm-server:8428/snapshot/delete?snapshot=${OLD_SNAP}"

ClickHouse Backup
ClickHouse backup for PMM QAN data requires backing up the pmm database tables. Use clickhouse-backup (the open-source tool by Altinity) for consistent, S3-capable backups.
# Inside the PMM container or via docker exec
# List ClickHouse databases and tables used by PMM
clickhouse-client --query "SHOW TABLES FROM pmm"

# Using clickhouse-backup (install separately)
clickhouse-backup create pmm-$(date +%Y%m%d-%H%M)

# List backups
clickhouse-backup list

# Upload to S3
clickhouse-backup upload pmm-20240115-0300

# Restore (on a new instance)
clickhouse-backup download pmm-20240115-0300
clickhouse-backup restore pmm-20240115-0300

The Grafana SQLite database at /srv/grafana/grafana.db is not included in VictoriaMetrics snapshots. Back it up separately, or migrate Grafana to use an external PostgreSQL database (configurable via GF_DATABASE_TYPE=postgres environment variables) so that Grafana state is decoupled from the PMM container filesystem entirely.
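For the "keep last 7" clean-up of VictoriaMetrics snapshots mentioned earlier in this section, the selection step can be sketched as follows. It assumes snapshot names begin with a creation timestamp, so that sorting them as strings also sorts them by age; verify that against the names your server returns from /snapshot/list. The function name is hypothetical.

```python
from typing import List


def snapshots_to_delete(snapshot_names: List[str], keep: int = 7) -> List[str]:
    """Return the snapshots that fall outside the newest `keep`, oldest first.

    Assumes names start with a sortable timestamp, so lexicographic order
    matches chronological order.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    ordered = sorted(snapshot_names)
    if keep == 0:
        return ordered
    return ordered[:-keep]
```

Each returned name can then be passed to the /snapshot/delete call in place of OLD_SNAP. Deleting only after the S3 copy has succeeded avoids removing the last good local snapshot.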
Upgrade Strategy for HA PMM Deployments
Upgrading PMM Server in an HA configuration requires sequencing to avoid data corruption. Never upgrade both nodes simultaneously.
For active-passive Docker deployments: take a VictoriaMetrics snapshot before starting, upgrade the passive node first (pull the new image, mount the volume in read-only mode to verify it starts), then fail over to the upgraded passive node and upgrade the formerly active node. If the upgrade introduces a migration failure, the old active node can be brought back immediately.
For Kubernetes StatefulSet deployments, update the image field in the StatefulSet spec. Kubernetes will perform a rolling update of the single replica — deleting the old pod and creating a new one with the updated image against the same PVC. The upgrade is equivalent to a controlled restart with a new image. Monitor the readiness probe; the new pod will not receive traffic until /v1/readyz returns healthy.
# Kubernetes: patch the PMM Server image to a new version
kubectl set image statefulset/pmm-server \
  pmm-server=percona/pmm-server:2.43.0 \
  -n monitoring

# Watch the rollout
kubectl rollout status statefulset/pmm-server -n monitoring

# If the upgrade fails, roll back immediately
kubectl rollout undo statefulset/pmm-server -n monitoring

Pin PMM Server to explicit version tags (2.42.0) rather than latest or 2 in all production manifests. The latest tag changes on every release. An unexpected pull during a container restart will upgrade PMM automatically, which can be dangerous if the new version includes ClickHouse schema migrations that run on startup.
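The pinning rule is easy to enforce in CI. Below is a minimal sketch of a check that accepts only explicit MAJOR.MINOR.PATCH tags; the function name is hypothetical, and image references pinned by digest are deliberately treated as unpinned here to keep the example short.

```python
import re

# Matches tags such as 2.42.0; rejects floating tags such as "2" or "latest".
_PINNED_TAG = re.compile(r"^\d+\.\d+\.\d+$")


def is_pinned(image: str) -> bool:
    """True if the image reference ends in an explicit MAJOR.MINOR.PATCH tag."""
    _, sep, tag = image.rpartition(":")
    if not sep or "/" in tag:
        # No tag at all, or the colon belonged to a registry port.
        return False
    return bool(_PINNED_TAG.match(tag))
```

Running this over every image field in your Compose files and manifests catches a floating tag before it reaches production.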
- Treat PMM Server as a critical piece of database infrastructure, not an afterthought — single-node PMM is a SPOF that blinds your team during the incidents where monitoring matters most.
- Deploy PMM Server with persistent shared storage (NFS, EBS, Regional PD) so the monitoring data survives server restarts and node failures without rebuilding history.
- Use a Kubernetes StatefulSet with a Retain-policy PVC for the lowest operational overhead HA path — Kubernetes handles pod rescheduling automatically without Pacemaker or Keepalived.
- Configure load balancer health checks against /v1/readyz to avoid routing traffic to a PMM Server that is still initializing after a failover.
- Automate VictoriaMetrics snapshots and ClickHouse backups daily and test restore procedures quarterly; backup jobs that are never tested are not backups.
- PMM Clients buffer metrics locally and reconnect automatically, so short planned outages (under 10 minutes) for maintenance cause no metric data loss and require no manual agent reconfiguration.
- Pin PMM Server to explicit version tags in all production manifests and follow a passive-first upgrade sequence to preserve rollback capability.
Working with JusDB on PMM and Database Monitoring HA
JusDB manages database observability infrastructure for engineering teams who need production-grade monitoring reliability without building and maintaining it themselves. Our DBAs design and deploy PMM HA configurations — Docker active-passive, Kubernetes StatefulSet, VictoriaMetrics clustering — tuned to your fleet size, retention requirements, and cloud platform. We handle the initial setup, alert rule configuration, PMM Client deployment across every host, backup automation, and upgrade sequencing so your team has uninterrupted visibility into every database instance at all times.
When a database incident happens at 3 AM, your monitoring needs to be the most reliable thing in the stack. Talk to a JusDB DBA about building a monitoring setup that stays up when everything else is on fire.