Failover incidents — sound familiar?
- ▸ Sentinel quorum problems: a 2-node Sentinel deployment with only 1 healthy Sentinel can't promote anything; you discovered that min-replicas-to-write isn't tuned correctly only when production failed over.
- ▸ Cluster slot migration stuck: one slot in MIGRATING/IMPORTING state across two nodes; CLUSTER FAILOVER won't run, ASK redirects piling up, latency tail growing.
- ▸ Replica failover taking 30s+: Sentinel takes 10s to detect, 5s to vote, 15s to reconfigure replicas; the app sees 30+ seconds of timeouts during what should be a graceful failover.
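The 30-second figure in the last bullet is the sum of the three failover phases. A minimal Python sketch of that budget, using the bullet's own numbers; the function name and the 30-second ceiling (the upper end of the 15 to 30 second failover SLA quoted on this page) are illustrative assumptions, not part of any Valkey tooling:

```python
def failover_budget(detect_s: float, vote_s: float, reconfigure_s: float) -> float:
    """Total seconds the application may see timeouts during one failover."""
    return detect_s + vote_s + reconfigure_s


# Phase durations taken from the bullet: 10s detect, 5s vote, 15s reconfigure.
total = failover_budget(detect_s=10, vote_s=5, reconfigure_s=15)

SLA_CEILING_S = 30  # assumed ceiling for this sketch
print(f"total={total}s within_sla={total <= SLA_CEILING_S}")
```

Detection time is governed by down-after-milliseconds, so it is usually the first phase to tune when the total exceeds the target.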
JusDB HA consultants own the failover playbook and a 15-minute incident SLA. Book an HA architecture review.
Single-primary + Sentinel — not multi-shard
Valkey High Availability
Production Sentinel quorum design, replica lag monitoring, split-brain prevention, and 15–30 second automated failover SLAs. For horizontal multi-shard scaling, see Valkey Cluster.
Production HA capabilities
Sentinel Quorum Design
3- or 5-Sentinel deployments, quorum/majority sizing, sentinel monitor configuration, anti-affinity across AZs.
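Quorum and majority are separate conditions: the quorum is the number of Sentinels that must agree the primary is down, while the failover itself must be authorized by a majority of all known Sentinels. A minimal Python sketch of that sizing rule follows; the function name and the sample deployments are illustrative assumptions.

```python
def can_failover(total_sentinels: int, healthy_sentinels: int, quorum: int) -> bool:
    """True if the healthy Sentinels can both flag the primary as down
    (quorum) and elect a leader to run the failover (majority)."""
    majority = total_sentinels // 2 + 1
    return healthy_sentinels >= quorum and healthy_sentinels >= majority


# (total, healthy, quorum) for a few hypothetical deployments.
for total, healthy, quorum in [(2, 1, 1), (3, 2, 2), (5, 3, 3), (5, 2, 2)]:
    ok = can_failover(total, healthy, quorum)
    print(f"{total} Sentinels, {healthy} healthy, quorum {quorum}: failover possible = {ok}")
```

The first case shows why a 2-Sentinel deployment cannot fail over after losing one Sentinel: the majority of 2 is 2, so even a quorum of 1 does not help.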
Replication Lag Monitoring
Sub-second lag tracking, master_link_status alerting, replication-offset deltas, lag SLO enforcement.
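As a rough illustration of replication-offset deltas, the Python sketch below parses the text of an INFO replication reply from a primary and reports, per replica, how many bytes it trails by and the lag in seconds. The sample reply, the addresses, and the function name are made up for the example; the field names follow the Redis-compatible INFO layout and should be checked against the Valkey version in use.

```python
def replica_lag(info_text: str) -> dict[str, dict[str, int]]:
    """Return per-replica offset delta (bytes) and lag (seconds)
    from a primary's INFO replication output."""
    fields = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields[key] = value

    primary_offset = int(fields["master_repl_offset"])
    result = {}
    for key, value in fields.items():
        # Replica lines look like: slave0:ip=...,port=...,state=...,offset=...,lag=...
        if key.startswith("slave") and key[5:].isdigit():
            attrs = dict(pair.split("=", 1) for pair in value.split(","))
            result[f"{attrs['ip']}:{attrs['port']}"] = {
                "offset_delta": primary_offset - int(attrs["offset"]),
                "lag_s": int(attrs["lag"]),
            }
    return result


# Hypothetical sample reply; addresses and offsets are invented.
SAMPLE = """\
# Replication
role:master
connected_slaves:2
slave0:ip=10.0.1.10,port=6379,state=online,offset=104857,lag=0
slave1:ip=10.0.2.10,port=6379,state=online,offset=104000,lag=1
master_repl_offset:104857
"""

for replica, stats in replica_lag(SAMPLE).items():
    print(replica, stats)
```

In a real monitor the text would come from a client library's INFO call rather than a hard-coded string, and the deltas would be exported as metrics instead of printed.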
Split-Brain Prevention
min-replicas-to-write + min-replicas-max-lag tuning, partition-tolerance config, failover-timeout to prevent thrashing.
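The two min-replicas settings work together: the primary refuses writes unless enough replicas have acknowledged recently. A simplified Python model of that rule is sketched below. It is not the server's code, and the function name and sample lag values are assumptions for illustration.

```python
def primary_accepts_writes(
    replica_lags_s: list[int],
    min_replicas_to_write: int,
    min_replicas_max_lag: int,
) -> bool:
    """Model of the write gate: at least `min_replicas_to_write` replicas
    must have a lag of at most `min_replicas_max_lag` seconds."""
    # Setting either option to 0 disables the check.
    if min_replicas_to_write == 0 or min_replicas_max_lag == 0:
        return True
    good = sum(1 for lag in replica_lags_s if lag <= min_replicas_max_lag)
    return good >= min_replicas_to_write


# Healthy: both replicas acknowledged within the last second.
print(primary_accepts_writes([0, 1], 1, 10))    # True
# Partitioned primary: no replica has acknowledged for longer than max-lag.
print(primary_accepts_writes([12, 40], 1, 10))  # False
```

The second case is the split-brain guard: an isolated old primary stops accepting writes about max-lag seconds after losing its replicas, which bounds how much data can be lost when Sentinel promotes a replica on the other side of the partition.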
Automated Failover
down-after-milliseconds tuning, parallel-syncs config, failover SLA tracking, client-side reconnect strategy review.
Health & Status Observability
Prometheus exporter setup, Sentinel & primary dashboards, alert routing for replica-out / split-brain / failover events.
Cross-Region DR
Cross-region async replica placement, snapshot offsite, DR-runbook engineering, RPO/RTO target validation drills.
A typical Valkey HA deployment
The shape we deploy by default unless something in the workload pushes us to cluster mode.
Topology
1 primary + 2 replicas (one node per AZ, across 3 AZs): survives any single-node or single-AZ outage.
3 Sentinel instances co-located with application servers, spread across the same 3 AZs.
Cross-region async replica for DR (RPO ~30s, manual promotion).
Key parameters
down-after-milliseconds: 10000-15000 (10-15 seconds before a Sentinel marks the primary as down)
min-replicas-to-write: 1 (≥1 connected replica required)
min-replicas-max-lag: 10 (seconds)
failover-timeout: 180000 (milliseconds, i.e. 3 minutes; prevents thrashing)
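These parameters live in two places: the Sentinel directives go in each Sentinel's configuration, and the min-replicas settings go in the primary's configuration. The Python sketch below renders both fragments from the values listed above. The primary name `mymaster`, the address `10.0.1.5`, the quorum of 2 (for 3 Sentinels), and `parallel-syncs 1` are assumptions added for illustration.

```python
def render_sentinel_conf(name: str, host: str, port: int, quorum: int,
                         down_after_ms: int, failover_timeout_ms: int,
                         parallel_syncs: int) -> str:
    """Render the Sentinel directives for one monitored primary."""
    return "\n".join([
        f"sentinel monitor {name} {host} {port} {quorum}",
        f"sentinel down-after-milliseconds {name} {down_after_ms}",
        f"sentinel failover-timeout {name} {failover_timeout_ms}",
        f"sentinel parallel-syncs {name} {parallel_syncs}",
    ])


def render_primary_conf(min_replicas_to_write: int, min_replicas_max_lag: int) -> str:
    """Render the write-safety directives for the primary's config file."""
    return "\n".join([
        f"min-replicas-to-write {min_replicas_to_write}",
        f"min-replicas-max-lag {min_replicas_max_lag}",
    ])


sentinel_conf = render_sentinel_conf(
    name="mymaster",        # hypothetical primary name
    host="10.0.1.5",        # hypothetical primary address
    port=6379,
    quorum=2,               # assumed: 2 of 3 Sentinels
    down_after_ms=10000,
    failover_timeout_ms=180000,
    parallel_syncs=1,       # assumed: resync one replica at a time
)
primary_conf = render_primary_conf(min_replicas_to_write=1, min_replicas_max_lag=10)

print(sentinel_conf)
print(primary_conf)
```

Generating the fragments from one set of values keeps the Sentinels consistent with each other, which matters because each Sentinel evaluates down-after-milliseconds independently.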