Failover incidents — sound familiar?
- ▸ Sentinel quorum problems: a 2-node Sentinel deployment with only 1 healthy node can't promote anything, and you discovered that min-replicas-to-write wasn't tuned correctly only when production failed over.
- ▸ Cluster slot migration stuck: one slot left in MIGRATING/IMPORTING state across two nodes; CLUSTER FAILOVER won't run, ASK redirects are piling up, and the latency tail is growing.
- ▸ Replica failover taking 30s+: Sentinel takes 10s to detect, 5s to vote, and 15s to reconfigure replicas, so the app sees 30+ seconds of timeouts during what should be a graceful failover.
JusDB HA consultants own the failover playbook + 15-minute incident SLA. Book an HA architecture review →
Valkey High Availability
In short: Valkey high availability (single-primary + replicas + Sentinel) involves quorum-based Sentinel deployment across AZs, sub-second replica lag monitoring, split-brain prevention via min-replicas-to-write and failover-timeout tuning, and automated failover in 15–30 seconds — plus cross-region async replicas and RDB snapshots for disaster recovery beyond local HA.
Production Sentinel quorum design, replica lag monitoring, split-brain prevention, and 15–30 second automated failover SLAs. For horizontal multi-shard scaling, see Valkey Cluster.
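To make the quorum and split-brain points concrete, here is a minimal Python sketch of the two rules involved. It is an illustration, not part of any Valkey client library: the `SentinelGroup` class and the `primary_accepts_writes` helper are hypothetical names, and the code only models the documented behavior that a failover needs both the configured quorum and a majority of all Sentinels, and that a primary with min-replicas-to-write set refuses writes when too few replicas are within min-replicas-max-lag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentinelGroup:
    """Hypothetical model of one Sentinel deployment watching one primary."""
    total: int      # Sentinels configured for this primary
    quorum: int     # the quorum argument given to `sentinel monitor`
    reachable: int  # Sentinels that are healthy and can reach each other

    def can_mark_down(self) -> bool:
        # `quorum` Sentinels must agree before the primary is flagged as down.
        return self.reachable >= self.quorum

    def can_failover(self) -> bool:
        # Promotion additionally needs a leader authorized by a majority
        # of ALL Sentinels, not just of the ones still reachable.
        majority = self.total // 2 + 1
        return self.can_mark_down() and self.reachable >= majority


def primary_accepts_writes(replica_lags_s, min_replicas, max_lag_s):
    """Model of min-replicas-to-write / min-replicas-max-lag.

    replica_lags_s: seconds since each connected replica last acknowledged.
    Returns True when at least `min_replicas` replicas are within `max_lag_s`.
    """
    healthy = sum(1 for lag in replica_lags_s if lag <= max_lag_s)
    return healthy >= min_replicas


if __name__ == "__main__":
    # Two Sentinels with one healthy: the failure can be detected
    # (quorum=1) but never acted on, because the majority of 2 is 2.
    print(SentinelGroup(total=2, quorum=1, reachable=1).can_failover())
    # Three Sentinels across AZs survive the loss of one.
    print(SentinelGroup(total=3, quorum=2, reachable=2).can_failover())
    # An isolated primary with no in-sync replicas stops taking writes.
    print(primary_accepts_writes([], min_replicas=1, max_lag_s=10))
```

The majority check is why an odd number of Sentinels spread across at least three failure domains is the usual starting point: losing any single domain still leaves enough voters to promote a replica.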
Production HA capabilities
A typical Valkey HA deployment
The shape we deploy by default unless something in the workload pushes us to cluster mode.