Question 1

Sentinel or Cluster mode - which gives me high availability?

Accepted Answer

Both, but for different shapes of workload. Sentinel + 1 primary + N replicas (the topology on this page) is the right answer when your working set fits on one node and you need automated failover for that primary. Cluster mode shards data across multiple primaries, each with its own failover via gossip - that's a separate page (/databases/valkey/cluster) covering horizontal scaling. If you don't need to shard, Sentinel is simpler to operate and reason about.

Question 2

How many Sentinel instances do I need?

Accepted Answer

Minimum 3 for quorum-based failover decisions; 5 is the production sweet spot. Use an odd count: a failover has to be authorized by a majority of all known Sentinels, so an even count adds a node without tolerating any more failures (4 Sentinels survive one failure, the same as 3). With 2 Sentinels and 1 unreachable, the remaining Sentinel can't obtain a majority and won't trigger failover. Co-locate Sentinels with application servers (not on the same boxes as Valkey nodes) so a Valkey-node outage doesn't take Sentinels offline too. Spread across at least 3 availability zones.
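
The sizing rule can be checked with a little arithmetic. This is a minimal sketch, assuming a failover needs a strict majority of all known Sentinels; the function names are made up for illustration and are not part of any Valkey API.

```python
def sentinel_majority(total_sentinels: int) -> int:
    """Votes needed to authorize a failover: a strict majority of all Sentinels."""
    if total_sentinels < 1:
        raise ValueError("need at least one Sentinel")
    return total_sentinels // 2 + 1


def tolerated_sentinel_failures(total_sentinels: int) -> int:
    """Sentinels that can be unreachable while a failover can still be authorized."""
    return total_sentinels - sentinel_majority(total_sentinels)


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(
            f"{n} Sentinels: majority={sentinel_majority(n)}, "
            f"tolerated failures={tolerated_sentinel_failures(n)}"
        )
```

Going from 3 to 4 Sentinels raises the majority from 2 to 3 while the tolerated failure count stays at 1, which is why the next useful step up is 5.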

Question 3

What's a realistic failover SLA with Sentinel?

Accepted Answer

Time-to-failover-decision: configurable via down-after-milliseconds (default 30s; we typically tune to 10-15s). Election + promotion: 2-5 seconds. Client reconnection: depends on the client's reconnect backoff (typically <5s). Total: 15-30 seconds of observed write outage in the typical case. Reads can continue on replicas during this window if the client opts in.
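
To see where the total comes from, the three phases can be added up as best-case/worst-case ranges. This is a rough sketch using only the figures quoted above; treating "<5s" reconnection as a 0-5 s range is an assumption, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SecondsRange:
    """A best-case / worst-case pair, in seconds."""
    low: float
    high: float

    def __add__(self, other: "SecondsRange") -> "SecondsRange":
        return SecondsRange(self.low + other.low, self.high + other.high)


def write_outage(detection: SecondsRange, election: SecondsRange,
                 reconnect: SecondsRange) -> SecondsRange:
    """Sum the three failover phases into one outage estimate."""
    return detection + election + reconnect


# down-after-milliseconds tuned to 10-15 s, as described above.
TUNED = write_outage(SecondsRange(10, 15), SecondsRange(2, 5), SecondsRange(0, 5))
# down-after-milliseconds left at its 30 s default.
DEFAULT = write_outage(SecondsRange(30, 30), SecondsRange(2, 5), SecondsRange(0, 5))

if __name__ == "__main__":
    print(f"tuned:   {TUNED.low:.0f}-{TUNED.high:.0f} s")
    print(f"default: {DEFAULT.low:.0f}-{DEFAULT.high:.0f} s")
```

With the tuned detection window the phases sum to roughly 12-25 s, a little under the 15-30 s observed figure, which presumably includes some margin. Leaving down-after-milliseconds at its default pushes the estimate past 30 s.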

Question 4

How do you prevent split-brain after a network partition?

Accepted Answer

Two layers. (1) min-replicas-to-write on the primary, paired with min-replicas-max-lag - the primary refuses writes when fewer than N replicas have acknowledged within the lag limit, so a stranded former primary stops accepting new writes once its replicas fall silent. (2) Sentinel quorum sets how many Sentinels must agree the primary is down before a failover starts, and failover-timeout spaces out repeated failover attempts so the topology doesn't thrash. Together these limit the period in which both sides of a partition could accept writes to roughly the lag limit; the losing side is reconfigured as a replica when the partition heals.
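
The write-refusal rule in layer (1) can be modeled in a few lines. This is a simplified sketch of the check, not Valkey's implementation: the function name is hypothetical, lag is taken as seconds since each replica's last acknowledgement, and the 10-second default for min-replicas-max-lag is assumed.

```python
from typing import Sequence


def primary_accepts_writes(
    replica_lags_s: Sequence[float],
    min_replicas_to_write: int,
    min_replicas_max_lag_s: float = 10,
) -> bool:
    """Return True if the primary would still accept writes.

    replica_lags_s holds, per connected replica, the seconds since its last
    acknowledgement. A replica counts as healthy only if that lag is within
    min_replicas_max_lag_s.
    """
    if min_replicas_to_write == 0:
        return True  # the safeguard is disabled
    healthy = sum(1 for lag in replica_lags_s if lag <= min_replicas_max_lag_s)
    return healthy >= min_replicas_to_write


if __name__ == "__main__":
    print(primary_accepts_writes([0, 1], min_replicas_to_write=1))    # healthy topology
    print(primary_accepts_writes([30, 45], min_replicas_to_write=1))  # partitioned primary
```

In this model a partitioned primary keeps accepting writes until its replicas' lag crosses the limit, so the limit itself caps how long the stale side stays writable.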

Question 5

What replication lag is acceptable?

Accepted Answer

Targets depend on your RPO budget. For session/cache workloads that tolerate second-level lag: keep master_repl_offset minus slave_repl_offset under 10 MB, which equates to <1 second of replication lag at typical write rates. For workloads that use replicas for read-after-write: lag must stay in the millisecond range, which constrains your write throughput. We monitor master_link_status and slave_repl_offset deltas with sub-second resolution.
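
The offset delta can be derived from the text of INFO replication on the primary. This is a minimal parsing sketch: the sample output is invented for illustration, the helper names are hypothetical, and it assumes the primary reports each replica on a `slaveN:` line with an `offset=` field.

```python
from typing import Dict

LAG_BUDGET_BYTES = 10 * 1024 * 1024  # the 10 MB target discussed above

# Made-up sample of INFO replication output from a primary.
SAMPLE_INFO = """\
# Replication
role:master
connected_slaves:2
slave0:ip=10.0.0.2,port=6379,state=online,offset=1000500,lag=0
slave1:ip=10.0.0.3,port=6379,state=online,offset=990000,lag=1
master_repl_offset:1001000
"""


def replica_byte_lag(info_text: str) -> Dict[str, int]:
    """Map 'ip:port' of each replica to bytes it trails the primary by."""
    primary_offset = None
    replica_offsets: Dict[str, int] = {}
    for raw in info_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        if key == "master_repl_offset":
            primary_offset = int(value)
        elif key.startswith("slave") and key[5:].isdigit():
            fields = dict(item.split("=", 1) for item in value.split(","))
            replica_offsets[f"{fields['ip']}:{fields['port']}"] = int(fields["offset"])
    if primary_offset is None:
        raise ValueError("master_repl_offset not found; is this a primary?")
    return {addr: primary_offset - off for addr, off in replica_offsets.items()}


def over_budget(lags: Dict[str, int], budget: int = LAG_BUDGET_BYTES) -> Dict[str, int]:
    """Return only the replicas whose byte lag exceeds the budget."""
    return {addr: lag for addr, lag in lags.items() if lag > budget}


if __name__ == "__main__":
    lags = replica_byte_lag(SAMPLE_INFO)
    print(lags)
    print(over_budget(lags))
```

A byte delta only translates into seconds once it is divided by the current write rate, so an alert on this number is best tuned against your own traffic.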

Question 6

What's the disaster-recovery story beyond local HA?

Accepted Answer

Cross-region async replica + RDB snapshot offsite. Sentinel doesn't handle cross-region failover (its down-after-milliseconds failure detection assumes low, stable intra-region latency, and it has no cross-region topology awareness); cross-region promotion is a manual or app-orchestrated step with an explicit, non-zero RPO (typically <30 seconds of writes lost during the regional failover). For zero-RPO DR, you need active-active replication, which is a Valkey 8.x roadmap item - currently not production-grade.
