TiDB's Raft-Based Replication: How Distributed Consensus Keeps Your Data Safe

TiDB uses Raft consensus protocol through TiKV to replicate every region across three or more nodes. Here's how leader election, log replication, and region management work in practice.

JusDB Team
May 20, 2024

Your e-commerce platform is processing tens of thousands of orders per minute when one of your database nodes crashes at 2 AM. In a traditional MySQL setup with async replication, you're now facing a tense question: how much data did you just lose? With TiDB, that question doesn't arise — by the time the write was acknowledged to your application, Raft consensus had already committed it to a majority of replicas. The node failure triggers an automatic leader election, and your cluster is serving traffic again within seconds, with zero data loss.

TiDB achieves this through TiKV, its distributed key-value storage layer, which implements the Raft consensus protocol across every region of data. Unlike MySQL's asynchronous or semi-synchronous binlog replication, Raft replication is synchronous by definition: a write is not committed until a quorum of peers have written it to their logs. Understanding how this works in practice — leader election, log replication, region management — helps you operate TiDB clusters more confidently and debug issues faster when they arise.

This post walks through the complete Raft replication lifecycle in TiDB, from the moment a write hits TiKV to how the cluster self-heals after node failures, and compares the model against MySQL's binlog-based replication for teams evaluating the trade-offs.

TL;DR
  • TiDB splits data into 96 MB regions; each region is replicated across three (or more) TiKV nodes as a Raft group.
  • Every write must be acknowledged by a majority of Raft peers (quorum) before it is committed, so an acknowledged write survives the loss of any minority of replicas.
  • PD (Placement Driver) orchestrates leader balancing, region splits/merges, and detects node failures that trigger new leader elections.
  • Raft leader election completes in seconds; reads and writes resume automatically without operator intervention.
  • MySQL binlog replication is async by default and requires extra configuration (semi-sync, GTID) to approach Raft-level durability guarantees.

TiDB Architecture Refresher

Before diving into Raft mechanics, it helps to understand the three layers of a TiDB cluster and how they relate to replication.

TiDB (SQL Layer)

TiDB servers are stateless SQL processing nodes. They parse SQL, plan queries, and translate relational operations into key-value operations against TiKV. Because they hold no data themselves, you can scale TiDB servers horizontally without any replication concern at this layer. Replication lives entirely in TiKV.

TiKV (Storage Layer)

TiKV is the distributed key-value store where all data actually lives. It organizes the entire keyspace into contiguous ranges called regions, each approximately 96 MB by default. Every region is replicated across multiple TiKV nodes, and this replication is managed by Raft. Each TiKV node runs many Raft groups simultaneously — one per region it hosts — making it a highly parallel replication engine.

PD (Placement Driver)

PD is the cluster brain. It maintains the authoritative mapping of regions to TiKV nodes, monitors node health via heartbeats, and issues scheduling commands: move a region replica, split a hot region, merge two cold regions, balance leaders across nodes. PD is itself a Raft group (typically three nodes) so its metadata is also replicated with consensus guarantees.

Key relationship: TiDB servers talk to PD to discover which TiKV node holds the leader replica for a given region, then send reads and writes directly to that TiKV leader. PD never sits in the data path for normal operations.
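To make the routing step concrete, here is a minimal Python sketch of a region lookup. The routing table, the store names, and the function leader_for_key are hypothetical; TiDB's real region cache is more involved and refreshes itself from PD when entries go stale.

```python
from bisect import bisect_right

# Hypothetical routing table: (start_key, leader_store), sorted by start_key.
# Each region covers [start_key, start_key of the next region).
REGIONS = [(b"", "tikv-1"), (b"g", "tikv-2"), (b"p", "tikv-3")]

def leader_for_key(key: bytes) -> str:
    """Return the store holding the leader of the region that contains key."""
    starts = [start for start, _ in REGIONS]
    # The owning region is the last one whose start key is <= key.
    idx = bisect_right(starts, key) - 1
    return REGIONS[idx][1]

print(leader_for_key(b"order:42"))
```

Because regions are contiguous key ranges, a binary search over start keys is enough to find the owner of any key.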

What is Raft Consensus?

Raft is a distributed consensus algorithm designed by Diego Ongaro and John Ousterhout as a more understandable alternative to Paxos. Its core guarantee: as long as a majority of nodes in a group are alive and can communicate, the group can make progress and will never return inconsistent results.

The Three Raft Roles

Every node in a Raft group is always in one of three states:

  • Leader: Handles all client writes and replicates log entries to followers. Raft guarantees at most one leader per term in each group.
  • Follower: Receives log entries from the leader, applies them to its state machine, and votes in elections. Followers redirect clients to the leader.
  • Candidate: A follower that has timed out waiting for a heartbeat and is actively campaigning to become the new leader.

Terms and Log Entries

Raft divides time into terms — monotonically increasing integers that serve as logical clocks. Each term begins with an election. Every write operation becomes a log entry with a term number and log index. Followers can always determine whether a message from a claimed leader is stale by comparing term numbers, which prevents split-brain scenarios.
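The term comparison can be sketched in a few lines of Python. This is an illustration of the rule as described in the Raft paper, not TiKV's code; the Peer class and its field names are made up for the example.

```python
from dataclasses import dataclass

@dataclass
class Peer:
    current_term: int = 0
    role: str = "follower"

    def on_message(self, sender_term: int) -> bool:
        """Return True if the message should be processed, False if it is stale."""
        if sender_term < self.current_term:
            # The sender is from an older term: reject, it cannot be the leader.
            return False
        if sender_term > self.current_term:
            # A newer term exists: adopt it and step down to follower.
            self.current_term = sender_term
            self.role = "follower"
        return True
```

A deposed leader that comes back after a partition still carries its old term, so every peer that has moved on rejects its messages.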

Quorum Requirement

With three replicas, Raft requires two acknowledgments (the leader plus one follower) before committing an entry. With five replicas, it requires three. This means a three-replica group tolerates one node failure; a five-replica group tolerates two. The trade-off is write latency — more replicas means waiting for more acknowledgments.
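The arithmetic behind those numbers is simple enough to write down. The helper names below are illustrative.

```python
def quorum(replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replicas // 2 + 1

def tolerated_failures(replicas: int) -> int:
    """How many replicas can fail while a majority is still reachable."""
    return replicas - quorum(replicas)

for n in (3, 4, 5):
    print(n, quorum(n), tolerated_failures(n))
```

Note that an even replica count buys nothing: four replicas need a quorum of three and still tolerate only one failure, which is why Raft groups are normally sized at three or five.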

How TiDB Implements Raft

Regions as Raft Groups

TiKV maps each region to a Raft group. A region covering the key range [a, b) has three replicas — one on each of three TiKV nodes — and these three replicas form a single Raft group. The leader replica processes all reads (by default) and writes for that key range. When data grows, PD splits the region into two, each becoming its own independent Raft group.

A production TiDB cluster with hundreds of gigabytes of data will have thousands of regions, meaning thousands of concurrent Raft groups, each independently electing leaders and replicating logs. This parallelism is why TiDB can sustain high write throughput across the cluster even though each individual write requires quorum agreement.
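A back-of-the-envelope calculation shows where those numbers come from. The sketch assumes every region is filled to exactly 96 MB and that each region has three replicas, which real clusters only approximate.

```python
import math

REGION_SIZE_MB = 96   # default region size quoted in this article
REPLICAS = 3          # default replica count

def estimated_regions(data_gb: float) -> int:
    """Approximate number of regions (Raft groups) for a given data size."""
    return math.ceil(data_gb * 1024 / REGION_SIZE_MB)

def estimated_peers(data_gb: float) -> int:
    """Approximate number of region replicas hosted across all TiKV nodes."""
    return estimated_regions(data_gb) * REPLICAS

print(estimated_regions(500), estimated_peers(500))
```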

Replica Placement Policy

PD uses placement rules to ensure region replicas are distributed across failure domains. In a default three-replica setup, PD attempts to place each replica on a different host. In multi-zone deployments, you can configure placement rules to ensure replicas span availability zones:

# Example: placement rule for 3-replica, multi-AZ deployment
# Applied via pd-ctl or the TiDB Dashboard

pd-ctl config placement-rules save <

With this rule, PD refuses to schedule two replicas of the same region onto TiKV nodes with the same az label, ensuring a single AZ outage never takes down quorum.
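The constraint itself is easy to express. The following Python sketch checks whether a proposed replica set would put two replicas in the same zone; the store names, the label map, and the function shares_az are hypothetical and only illustrate the rule, not how PD evaluates placement rules internally.

```python
from typing import Dict, List

def shares_az(replica_stores: List[str], store_az: Dict[str, str]) -> bool:
    """Return True if any two replicas of a region sit in the same AZ."""
    zones = [store_az[store] for store in replica_stores]
    return len(set(zones)) < len(zones)

# Hypothetical store-to-zone labels.
STORE_AZ = {"tikv-1": "az-a", "tikv-2": "az-b", "tikv-3": "az-c", "tikv-4": "az-a"}

print(shares_az(["tikv-1", "tikv-2", "tikv-3"], STORE_AZ))  # spread across zones
print(shares_az(["tikv-1", "tikv-2", "tikv-4"], STORE_AZ))  # two replicas in az-a
```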

Leader Election

When a follower stops receiving heartbeats from the leader for longer than the election timeout, it increments its term, transitions to Candidate, and sends RequestVote RPCs to other peers. The Raft paper suggests a randomized timeout of 150–300 ms; TiKV derives its timeout from tick settings, which works out to roughly 10 seconds with default values. A peer grants its vote if the candidate's log is at least as up-to-date as the peer's own log and the peer hasn't already voted in this term. The first candidate to collect votes from a majority becomes the new leader and immediately begins sending heartbeats to suppress further elections.

Election timeout tuning: TiKV's election timeout is raft-election-timeout-ticks multiplied by raft-base-tick-interval (10 ticks of 1 second each by default; verify against the configuration reference for your version). Each peer randomizes its actual timeout so that followers rarely campaign at the same moment. In high-latency network environments (cross-region clusters), you may need to raise raft-election-timeout-ticks to avoid spurious elections, but be aware that doing so directly increases recovery time after genuine node failures.
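The vote-granting rule can be sketched as follows. This follows the rule as stated in the Raft paper (compare the term of the last log entry first, then the log length); the VoterState class and the grant_vote signature are invented for the example and do not mirror TiKV's raft-rs API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VoterState:
    current_term: int
    voted_for: Optional[str]
    last_log_term: int
    last_log_index: int

def grant_vote(voter: VoterState, candidate_id: str, candidate_term: int,
               cand_last_log_term: int, cand_last_log_index: int) -> bool:
    """Decide whether this peer votes for the candidate."""
    if candidate_term < voter.current_term:
        return False  # stale candidate
    if candidate_term > voter.current_term:
        # New term: forget any vote cast in an older term.
        voter.current_term = candidate_term
        voter.voted_for = None
    if voter.voted_for not in (None, candidate_id):
        return False  # already voted for someone else in this term
    # "At least as up-to-date": higher last term wins, ties broken by index.
    log_ok = (cand_last_log_term, cand_last_log_index) >= (
        voter.last_log_term, voter.last_log_index)
    if log_ok:
        voter.voted_for = candidate_id
    return log_ok
```

The up-to-date check is what keeps committed entries safe: a candidate missing a committed entry cannot gather a majority, because at least one peer in any majority holds that entry and will refuse the vote.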

Log Replication Flow

Understanding the exact sequence of a write through TiDB's Raft implementation clarifies both the performance characteristics and the durability guarantees.

Step-by-Step Write Path

  1. Client sends SQL to TiDB server. The TiDB layer parses the statement, resolves the affected key range, and contacts PD to find the TiKV leader for the relevant region.
  2. TiDB sends a Prewrite request to the leader TiKV node. TiDB uses a two-phase commit (2PC) protocol internally. The Prewrite phase locks the keys involved in the transaction and writes the tentative values alongside those locks, without yet making them visible to other transactions.
  3. Leader appends the log entry and sends AppendEntries RPCs to followers. The leader does not apply the entry to its state machine yet — it first replicates.
  4. Followers write the entry to their local Raft logs and respond. Each follower sends an acknowledgment to the leader once the entry is durably written (fsync'd) to its WAL.
  5. Leader commits once quorum is reached. With three replicas, the leader commits after receiving one follower acknowledgment. It then applies the entry to RocksDB (TiKV's local storage engine) and sends the commit index to followers.
  6. Leader responds to TiDB. TiDB receives the Prewrite success and proceeds to the Commit phase, which follows the same Raft replication path.
  7. TiDB acknowledges the client. The application receives its success response only after both Prewrite and Commit have achieved Raft quorum.
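The commit decision in step 5 can be sketched as follows: the leader tracks the highest log index each replica has durably written and commits up to the highest index present on a majority. The function below is a simplified illustration with a made-up name; real Raft implementations additionally require the entry at that index to belong to the leader's current term.

```python
from typing import Sequence

def commit_index(match_indexes: Sequence[int]) -> int:
    """Highest log index stored on a majority of replicas.

    match_indexes holds the last durably written index of every replica,
    including the leader's own.
    """
    ranked = sorted(match_indexes, reverse=True)
    quorum = len(ranked) // 2 + 1
    # The quorum-th highest index is present on at least `quorum` replicas.
    return ranked[quorum - 1]

# Leader at index 10, one follower caught up, one lagging at 7.
print(commit_index([10, 10, 7]))
```

With three replicas, a single slow follower does not hold back the commit index; it only matters once a second replica also falls behind.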

Read Handling

By default, TiDB routes reads to the Raft leader to guarantee linearizability. This avoids the risk of reading stale data from a follower whose log hasn't fully caught up. For read-heavy workloads, TiDB supports Follower Read, which serves strongly consistent reads from follower replicas to reduce leader pressure:

-- Enable follower read at session level
SET tidb_replica_read = 'follower';

-- Or for closest replica (useful in multi-AZ)
SET tidb_replica_read = 'closest-replicas';

Follower reads use a mechanism called ReadIndex — the follower contacts the leader to obtain the current commit index and waits until its own log has caught up to that index before serving the read. This preserves linearizability while offloading read traffic.
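A toy model of that wait-until-caught-up behaviour is sketched below. The FollowerReplica class and its methods are hypothetical; a real follower obtains the commit index over an RPC to the leader and blocks or retries rather than returning a flag.

```python
from typing import Dict, Optional, Tuple

class FollowerReplica:
    def __init__(self) -> None:
        self.applied_index = 0
        self.state: Dict[str, str] = {}
        # Entries received from the leader but not yet applied: index -> (key, value).
        self.pending: Dict[int, Tuple[str, str]] = {}

    def receive(self, index: int, key: str, value: str) -> None:
        self.pending[index] = (key, value)

    def apply_up_to(self, index: int) -> None:
        """Apply received entries in order, stopping at the first gap."""
        while self.applied_index < index and (self.applied_index + 1) in self.pending:
            self.applied_index += 1
            key, value = self.pending.pop(self.applied_index)
            self.state[key] = value

    def read_index_read(self, key: str, leader_commit_index: int) -> Tuple[bool, Optional[str]]:
        """Serve the read only once this replica has applied the leader's commit index."""
        self.apply_up_to(leader_commit_index)
        if self.applied_index < leader_commit_index:
            return False, None  # not caught up: caller retries or reads from the leader
        return True, self.state.get(key)
```

The cost is one extra hop to the leader per read, but the leader only answers with an index and does not have to serve the data itself.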

Region Splits and Merges

Automatic Region Splits

When a region's size approaches 96 MB (configurable via region-split-size), TiKV reports this to PD, which schedules a split. The region is divided at its midpoint key into two new regions, each becoming its own Raft group. PD then monitors the new regions and may rebalance their leaders across the cluster to avoid hotspots.
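One way to picture the split is as picking the key where roughly half of the region's bytes lie to its left. The sketch below is an illustration only: choose_split_key and its input format are hypothetical, and TiKV's split checker works against its storage engine rather than an in-memory list.

```python
from typing import List, Optional, Tuple

def choose_split_key(entries: List[Tuple[str, int]]) -> Optional[str]:
    """Pick the first key of the right-hand region.

    entries is a list of (key, size_in_bytes) sorted by key.
    Returns None when the region cannot be split (fewer than two keys).
    """
    total = sum(size for _, size in entries)
    running = 0
    for key, size in entries:
        if running > 0 and running >= total / 2:
            return key  # everything before this key stays in the left region
        running += size
    return None

print(choose_split_key([("a", 10), ("b", 10), ("c", 10), ("d", 10)]))
```

The original region [start, end) becomes [start, split_key) and [split_key, end), and each half continues as its own Raft group.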

High-velocity writes to a single key range (a common pattern with auto-increment primary keys or time-series data) can cause region hotspots where one region receives disproportionate write traffic. TiDB provides SHARD_ROW_ID_BITS and AUTO_RANDOM to spread such writes across multiple regions from the start:

-- Use AUTO_RANDOM to scatter inserts across regions
CREATE TABLE orders (
  id BIGINT AUTO_RANDOM PRIMARY KEY,
  user_id BIGINT NOT NULL,
  amount DECIMAL(10,2) NOT NULL,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
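The scattering effect comes from placing a few shard bits above the incrementing part of the ID. The Python sketch below is a simplification that assumes a signed 64-bit layout with 5 shard bits; the function name and the way shards are chosen are illustrative and not TiDB's implementation.

```python
SHARD_BITS = 5                    # assumed shard width
INCREMENT_BITS = 63 - SHARD_BITS  # remaining non-sign bits hold the counter

def scattered_id(shard: int, counter: int) -> int:
    """Combine a shard prefix with an incrementing counter into one 63-bit ID."""
    if not 0 <= shard < (1 << SHARD_BITS):
        raise ValueError("shard out of range")
    if not 0 <= counter < (1 << INCREMENT_BITS):
        raise ValueError("counter out of range")
    return (shard << INCREMENT_BITS) | counter

# Four consecutive inserts that happen to land in different shards.
ids = [scattered_id(shard, counter)
       for counter, shard in enumerate([3, 17, 0, 29], start=1)]
print(ids)
```

Because the shard occupies the high bits, consecutive inserts end up far apart in the keyspace and therefore in different regions, instead of all appending to the last one.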

Region Merges

After large deletes or TTL expiration, regions can become nearly empty. PD identifies adjacent small regions and schedules merges to reduce the total number of Raft groups the cluster must maintain. Merges are executed as a series of Raft log entries across both involved groups, eventually consolidating them into one.

Pre-splitting for bulk loads: Before importing large datasets, pre-split regions using SPLIT TABLE to avoid sequential region splits during import, which can cause write stalls and leader rebalancing churn.

-- Pre-split the orders table into 16 regions
SPLIT TABLE orders BETWEEN (1) AND (10000000) REGIONS 16;

Raft vs MySQL Binlog Replication

Teams migrating from MySQL to TiDB often want to understand the concrete differences in replication semantics. The comparison is significant.

| Dimension | TiDB (Raft via TiKV) | MySQL (Binlog Replication) |
| --- | --- | --- |
| Replication mode | Synchronous (quorum-based) | Asynchronous by default; semi-sync optional |
| Data loss on failure | Zero (committed = replicated to quorum) | Possible with async; reduced with semi-sync |
| Failover | Automatic, seconds (Raft election) | Manual or via MHA/Orchestrator; minutes |
| Consistency guarantee | Linearizable (strong consistency) | Eventual (async); monotonic (semi-sync) |
| Topology | Peer-to-peer within Raft group | Primary-replica hierarchy |
| Write overhead | 2 network round trips minimum (quorum) | Near-zero (async); 1 round trip (semi-sync) |
| Read scaling | Follower Read (strongly consistent via ReadIndex) | Route reads to replicas (eventual consistency) |
| Split-brain protection | Inherent (quorum required for writes) | Requires external tooling (fencing tokens, VIP) |
| Operational complexity | Managed by PD automatically | Requires manual topology management |

The headline trade-off: Raft replication costs an extra network round trip per write compared to async MySQL replication, but it eliminates an entire class of operational incidents — data loss on failover, stale reads after promotion, and split-brain during network partitions. For most production workloads, this trade-off is strongly favorable.

Monitoring Raft Health in TiDB

TiDB exposes rich Raft metrics through Prometheus and Grafana. The key panels and metrics to watch in production are below.

Leader Distribution

An uneven leader distribution means some TiKV nodes are handling more write traffic than others. Check the TiKV-Details > Cluster > Leader Balance Grafana panel. For CLI inspection:

# Check region and leader counts per TiKV store
pd-ctl store

# Force a leader rebalance if distribution is skewed
pd-ctl operator add transfer-leader <region_id> <to_store_id>

Raft Log Lag

The metric tikv_raftstore_log_lag measures how far behind followers are from the leader's log index. A persistently high lag indicates a follower is struggling to keep up — often due to disk I/O saturation or network issues. Alert on this metric to catch degraded replicas before they fall behind enough to trigger a leader election storm.
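As a rough illustration of what the lag measures, the sketch below compares each follower's replicated index with the leader's last index. The store names, the function lagging_followers, and the default threshold are illustrative; in practice you would alert on the Prometheus metric rather than compute this yourself.

```python
from typing import Dict, List

def lagging_followers(leader_last_index: int,
                      follower_indexes: Dict[str, int],
                      threshold: int = 1000) -> List[str]:
    """Return the stores whose replicated index trails the leader by more than threshold."""
    return sorted(
        store for store, index in follower_indexes.items()
        if leader_last_index - index > threshold
    )

print(lagging_followers(50_000, {"tikv-2": 49_990, "tikv-3": 47_500}))
```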

Region Health

# Check for regions with fewer than 3 healthy replicas (under-replicated)
pd-ctl region --jq '.regions[] | select(.peers | length < 3)'

# Check for regions with no leader (unavailable)
pd-ctl region --jq '.regions[] | select(.leader == null)'
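If you would rather post-process the output in a script, the same two checks can be written in Python. The JSON shape used here (a top-level regions array whose items carry id, peers, and leader) is an assumption inferred from the jq filters above, and the function name is made up; check it against the actual output of your pd-ctl version.

```python
import json
from typing import List, Tuple

def unhealthy_regions(pd_region_json: str,
                      expected_replicas: int = 3) -> Tuple[List[int], List[int]]:
    """Return (under_replicated_ids, leaderless_ids) from a region dump."""
    regions = json.loads(pd_region_json)["regions"]
    under = [r["id"] for r in regions if len(r.get("peers", [])) < expected_replicas]
    leaderless = [r["id"] for r in regions if not r.get("leader")]
    return under, leaderless

# Hypothetical sample in the assumed shape.
SAMPLE = json.dumps({"regions": [
    {"id": 1, "peers": [{"id": 11}, {"id": 12}, {"id": 13}], "leader": {"id": 11}},
    {"id": 2, "peers": [{"id": 21}, {"id": 22}], "leader": {"id": 21}},
    {"id": 3, "peers": [{"id": 31}, {"id": 32}, {"id": 33}], "leader": None},
]})

print(unhealthy_regions(SAMPLE))
```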

Raft Proposal Metrics

In the TiKV Grafana dashboard, watch Raft Proposals per Second and Raft Apply Duration. A spike in apply duration without a corresponding spike in proposals typically indicates RocksDB compaction pressure on the follower nodes.

Critical alert thresholds: Configure alerts for (1) any region with 0 peers or no leader, since such a region is unavailable for reads and writes; (2) tikv_raftstore_log_lag exceeding 1000 entries on any store; (3) PD scheduler errors, which indicate PD cannot place replicas according to your placement rules.

Key Takeaways
  • TiDB's Raft replication operates at the region level — each 96 MB region is an independent Raft group, enabling massive write parallelism across the cluster.
  • A write is committed only after a quorum of Raft peers acknowledge it, so acknowledged writes survive the failure of a minority of replicas with no extra configuration.
  • PD continuously monitors cluster health and schedules region splits, merges, and leader transfers automatically, reducing operator burden significantly compared to MySQL replication topologies.
  • Leader election after node failure completes in seconds, with no manual intervention needed to resume normal operations.
  • Use AUTO_RANDOM keys and pre-split regions for write-heavy tables to avoid hotspot regions and ensure Raft write load is distributed evenly across TiKV nodes.
  • Monitor tikv_raftstore_log_lag and region health via pd-ctl to catch replication degradation before it affects availability.
  • Follower Read (tidb_replica_read = 'closest-replicas') is a safe way to scale read throughput without sacrificing consistency, using the ReadIndex mechanism.

Working with JusDB on TiDB

JusDB manages TiDB clusters for engineering teams who need distributed SQL without the operational complexity. Our DBAs handle cluster sizing, Raft health monitoring, region balance, and production incident response.
