Failover incidents — sound familiar?

  • Sentinel quorum problems: a 2-node Sentinel deployment with only 1 healthy node can't promote anything; you discovered min-replicas-to-write wasn't tuned correctly only when production failed over.
  • Cluster slot migration stuck: one slot left in MIGRATING/IMPORTING state across two nodes; CLUSTER FAILOVER won't run, ASK redirects are piling up, and tail latency is growing.
  • Replica failover taking 30s+: Sentinel takes 10s to detect, 5s to vote, and 15s to reconfigure replicas; the app sees 30+ seconds of timeouts during what should be a graceful failover.
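
The first incident above comes down to vote arithmetic. Sentinel needs the configured quorum to flag the primary as down, and a strict majority of all known Sentinels to authorize the failover. The following is a minimal sketch of that check; the function names are illustrative and are not part of any Valkey tooling.

```python
def sentinel_majority(total_sentinels: int) -> int:
    """Votes needed to authorize a failover: a strict majority of all known Sentinels."""
    if total_sentinels < 1:
        raise ValueError("need at least one Sentinel")
    return total_sentinels // 2 + 1


def can_fail_over(total_sentinels: int, healthy_sentinels: int, quorum: int) -> bool:
    """True if the healthy Sentinels can both flag the primary as down (quorum)
    and authorize the failover (majority)."""
    return (healthy_sentinels >= quorum
            and healthy_sentinels >= sentinel_majority(total_sentinels))


def tolerated_sentinel_failures(total_sentinels: int) -> int:
    """How many Sentinels can be lost while a failover is still possible."""
    return total_sentinels - sentinel_majority(total_sentinels)


if __name__ == "__main__":
    for total in (2, 3, 5):
        print(total, "Sentinels tolerate", tolerated_sentinel_failures(total), "failure(s)")
```

With 2 Sentinels the majority is 2, so a single healthy Sentinel can never promote a replica, whatever the quorum setting. That is the reason 3 or 5 Sentinels are the usual starting point.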

JusDB HA consultants own the failover playbook + 15-minute incident SLA. Book an HA architecture review →

Single-primary + Sentinel — not multi-shard

Valkey High Availability

Production Sentinel quorum design, replica lag monitoring, split-brain prevention, and 15–30 second automated failover SLAs. For horizontal multi-shard scaling, see Valkey Cluster.

Production HA capabilities

Sentinel Quorum Design
3 or 5 Sentinel deployment, quorum/majority sizing, sentinel monitor configuration, anti-affinity across AZs.
Replication Lag Monitoring
Sub-second lag tracking, master_link_status alerting, replication-offset deltas, lag SLO enforcement.
Split-Brain Prevention
min-replicas-to-write + min-replicas-max-lag tuning, partition-tolerance config, failover-timeout to prevent thrashing.
Automated Failover
down-after-milliseconds tuning, parallel-syncs config, failover SLA tracking, client-side reconnect strategy review.
Health & Status Observability
Prometheus exporter setup, Sentinel & primary dashboards, alert routing for replica-out / split-brain / failover events.
Cross-Region DR
Cross-region async replica placement, snapshot offsite, DR-runbook engineering, RPO/RTO target validation drills.
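
As an illustration of the replication-offset deltas mentioned under Replication Lag Monitoring, here is a minimal sketch that parses the text of INFO replication taken from the primary and reports how many bytes each replica is behind. It assumes the Redis-compatible field names (master_repl_offset, slave0, slave1, ...); the sample output and IP addresses are made up for the example.

```python
# Hypothetical INFO replication output from a primary with two replicas.
SAMPLE_INFO = """\
# Replication
role:master
connected_slaves:2
slave0:ip=10.0.1.10,port=6379,state=online,offset=104800,lag=0
slave1:ip=10.0.2.10,port=6379,state=online,offset=103000,lag=1
master_repl_offset:104857
"""


def parse_info(text: str) -> dict[str, str]:
    """Turn 'key:value' lines into a dict, skipping blanks and '# Section' headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def replica_offset_deltas(info_text: str) -> dict[str, int]:
    """Bytes each replica is behind the primary's replication offset."""
    fields = parse_info(info_text)
    primary_offset = int(fields["master_repl_offset"])
    deltas = {}
    for key, value in fields.items():
        # Per-replica lines are named slave0, slave1, ... on the primary.
        if key.startswith("slave") and key[5:].isdigit():
            attrs = dict(pair.split("=", 1) for pair in value.split(","))
            deltas[f"{attrs['ip']}:{attrs['port']}"] = primary_offset - int(attrs["offset"])
    return deltas


if __name__ == "__main__":
    for replica, behind in replica_offset_deltas(SAMPLE_INFO).items():
        print(f"{replica} is {behind} bytes behind")
```

In practice the same delta would be fed to an exporter or alert rule rather than printed; on the replica side, master_link_status is the field to alert on.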

A typical Valkey HA deployment

The shape we deploy by default unless something in the workload pushes us to cluster mode.

Topology
1 primary + 2 replicas (one per AZ) — survives any single-node or AZ outage.
3 Sentinel instances co-located with application servers, spread across the same 3 AZs.
Cross-region async replica for DR (RPO ~30s, manual promotion).
Key parameters
down-after-milliseconds: 10000-15000
min-replicas-to-write: 1 (≥1 connected replica required)
min-replicas-max-lag: 10 seconds
failover-timeout: 180000 (prevents thrashing)
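
To show where the key parameters above land in configuration, the sketch below renders the matching Sentinel and primary directives. The primary name mymaster, the address 10.0.1.10, a quorum of 2 (for 3 Sentinels), and parallel-syncs 1 are assumptions made for illustration; the other defaults are the values listed above, with down-after-milliseconds set to the low end of the range.

```python
def render_sentinel_conf(name: str, host: str, port: int, quorum: int,
                         down_after_ms: int = 10000,
                         failover_timeout_ms: int = 180000,
                         parallel_syncs: int = 1) -> str:
    """Render Sentinel directives for one monitored primary."""
    return "\n".join([
        f"sentinel monitor {name} {host} {port} {quorum}",
        f"sentinel down-after-milliseconds {name} {down_after_ms}",
        f"sentinel failover-timeout {name} {failover_timeout_ms}",
        # parallel-syncs value is an assumption; the page does not state one.
        f"sentinel parallel-syncs {name} {parallel_syncs}",
    ])


def render_primary_conf(min_replicas_to_write: int = 1,
                        min_replicas_max_lag: int = 10) -> str:
    """Render the write-safety directives for the primary's own config file."""
    return "\n".join([
        f"min-replicas-to-write {min_replicas_to_write}",
        f"min-replicas-max-lag {min_replicas_max_lag}",
    ])


if __name__ == "__main__":
    # Hypothetical primary name and address.
    print(render_sentinel_conf("mymaster", "10.0.1.10", 6379, quorum=2))
    print()
    print(render_primary_conf())
```

With min-replicas-to-write 1 and min-replicas-max-lag 10, a primary cut off from all replicas stops accepting writes after roughly 10 seconds, which limits how much data an isolated primary can accept during a partition.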

Build your Valkey HA topology

Send us your current Valkey/Redis HA setup (or none at all). We'll come back with a sized topology proposal, Sentinel quorum recommendation, and DR runbook outline within 48 hours.