TiDB Write Path: How Distributed Writes Work

Your e-commerce platform is processing a flash sale — 80,000 orders per minute, each touching five tables — and your on-call engineer opens Grafana to find write latency climbing past 200ms while TiKV compaction is saturating disk IO. The Raft log is accumulating faster than RocksDB can flush memtables, the GC worker is chasing a growing MVCC version history, and the auto-increment primary key is slamming every new insert into the same region leader. Understanding this failure mode requires understanding TiDB's write path end-to-end: from the SQL layer through two-phase commit, across Raft-replicated regions, and into RocksDB's LSM-tree. This guide walks through every layer with enough operational detail to diagnose and prevent write performance degradation before it becomes a 3 AM incident.

TL;DR

TiDB writes travel SQL layer → TiKV → Raft consensus → RocksDB memtable → SST compaction. Each hop adds latency that compounds under write-heavy workloads.
Two-phase commit (2PC) coordinates distributed transactions across TiKV regions: prewrite locks all keys first, then a single commit phase makes them visible atomically.
Use AUTO_RANDOM instead of AUTO_INCREMENT to scatter inserts across regions and eliminate write hotspots on the primary key.
Pessimistic transactions reduce retry storms under high contention; optimistic transactions maximize throughput when conflicts are rare.
MVCC version accumulation drives GC pressure — tune tidb_gc_life_time conservatively and monitor GC worker lag.
For bulk ingestion, TiDB Lightning's local backend bypasses the SQL layer entirely, achieving 10–50x throughput compared to batched INSERT statements.
Monitor tidb_tikvclient_txn_write_kv_num and tikv_engine_flow_bytes to catch write amplification and compaction bottlenecks early.

The TiDB Write Path: Four Layers in Sequence

Every write in TiDB traverses a deterministic four-layer stack. Understanding what happens at each layer — and where it can stall — is the foundation for write performance engineering.

Layer 1: SQL Layer (TiDB Server)

The TiDB SQL server receives the statement, parses it, runs the planner, and translates the logical row mutation into a set of key-value pairs. TiDB stores table rows as KV pairs with the key formatted as t{tableID}_r{rowID} and index entries as t{tableID}_i{indexID}_{indexValues}_{rowID}. For a single INSERT touching three indexes, the SQL layer generates four KV mutations before a single byte reaches TiKV. The SQL layer is stateless — all durability and coordination happen downstream.

Layer 2: TiKV and the Region Model

TiKV stores data as a sorted KV namespace partitioned into regions of roughly 96 MB each (tunable via region-split-size). Each region is replicated across three TiKV nodes via Raft. When the SQL layer sends a write, it consults the PD (Placement Driver) to locate the Raft leader for each key's region, then sends mutation RPCs directly to those leaders. A transaction spanning five regions requires coordination with five separate Raft groups — all within a single 2PC protocol.

Layer 3: Raft Consensus

The region leader receives the write, appends it to the Raft log, and replicates it to the two follower replicas. The write is acknowledged to the SQL layer only after a majority quorum (two of three nodes) confirms the log entry is durable. This is the source of TiDB's cross-AZ write latency: every write waits for at least one network round trip to a follower. In a three-AZ deployment with 2ms inter-AZ latency, the Raft quorum write adds a minimum 2ms floor to every transaction commit.

Layer 4: RocksDB Storage Engine

Each TiKV node uses RocksDB as its local storage engine. Writes land first in the memtable (an in-memory skip list, default 128 MB). When the memtable fills, RocksDB flushes it to an L0 SST file on disk. Background compaction threads then merge and sort L0 files into L1, L2, and deeper levels. This LSM-tree design delivers high sequential write throughput but introduces write amplification — each logical byte written may be physically written 10–30x across compaction rounds.

Warning

Write stalls occur when the RocksDB memtable fills faster than the flush thread can drain it, or when L0 file count exceeds the level0-slowdown-writes-trigger threshold (default 20 files). When a write stall activates, TiKV artificially slows incoming writes — which propagates back through Raft to the SQL layer as commit latency spikes. These stalls appear in Grafana as sawtooth patterns on TiKV write duration metrics.

Two-Phase Commit in TiDB

TiDB implements Percolator-style 2PC to guarantee atomicity across multiple TiKV regions. The protocol has two phases and a single designated primary key per transaction.

Phase 1: Prewrite

The TiDB SQL layer selects one mutation as the primary key. It then sends prewrite RPCs to every TiKV region involved in the transaction. Each prewrite writes a lock record alongside the pending value. If any region returns a lock conflict (another transaction holds a lock on the same key), the current transaction either waits (pessimistic mode) or aborts and retries (optimistic mode). A prewrite failure on any single key causes the entire transaction to abort — no partial writes are ever committed.

sql

-- Observe active transaction locks
SELECT * FROM information_schema.data_lock_waits\G

-- Check for long-running transactions holding locks
SELECT id, user, host, db, time, info
FROM information_schema.processlist
WHERE command != 'Sleep'
  AND time > 10
ORDER BY time DESC;

Phase 2: Commit

Once all prewrites succeed, TiDB writes a single commit record for the primary key to TiKV with a commit timestamp obtained from PD. This single write is the atomic commit point — the transaction is considered committed the moment this record is durable. Secondary keys are then committed asynchronously; TiDB does not wait for them before returning success to the client. Any reader encountering an uncommitted secondary lock resolves it by checking whether the primary key's commit record exists.

Tip

The Percolator protocol means that long-running transactions leave prewrite locks held across many regions for extended periods. A transaction open for 60 seconds with 500 prewrite locks across 50 regions blocks any conflicting writes on those keys for the entire duration. Keep OLTP transactions short and use SET tidb_idle_transaction_timeout to auto-kill idle open transactions before they accumulate lock pressure.

Pessimistic vs. Optimistic Transactions

TiDB supports both transaction models, each with distinct write path behavior. Since TiDB 3.0, pessimistic mode is the default.

Optimistic Transactions

In optimistic mode, TiDB buffers all writes client-side during the transaction body. No locks are acquired on TiKV until COMMIT is issued. At commit time, the prewrite phase checks for conflicts. If a conflict is detected — meaning another transaction modified one of the keys since the current transaction's start timestamp — the entire transaction aborts with error 9007: Write conflict. The application must then retry the full transaction. Optimistic mode delivers higher throughput in low-contention workloads because lock RPCs are eliminated during the transaction body.

Pessimistic Transactions

In pessimistic mode, SELECT ... FOR UPDATE and DML statements immediately acquire for-update locks on TiKV. Conflicts are resolved at lock-acquisition time rather than at commit time — the conflicting transaction blocks (or times out) before writing. This eliminates retry storms in high-contention workloads like inventory decrement, seat booking, and balance transfers, at the cost of one additional RPC per locked key during the transaction body.

sql

-- Check current transaction mode
SHOW VARIABLES LIKE 'tidb_txn_mode';

-- Switch to optimistic for a low-contention bulk session
SET SESSION tidb_txn_mode = 'optimistic';

-- Force pessimistic for a critical inventory update
SET SESSION tidb_txn_mode = 'pessimistic';
BEGIN PESSIMISTIC;
SELECT stock FROM inventory WHERE sku_id = 12345 FOR UPDATE;
UPDATE inventory SET stock = stock - 1 WHERE sku_id = 12345;
COMMIT;

Region Splitting and Hotspot Avoidance

The most common write performance mistake in TiDB migrations from MySQL is keeping AUTO_INCREMENT as the primary key. In MySQL, sequential auto-increment is desirable because it minimizes B-tree page splits. In TiDB, sequential keys cause every new insert to land on the rightmost region — the region covering the highest key range — creating a write hotspot that a single TiKV leader node must absorb alone, regardless of cluster size.

AUTO_RANDOM: The Right Default

AUTO_RANDOM assigns random shard bits to the high bits of the generated ID while preserving uniqueness and monotonicity within each shard. The result is that inserts are distributed uniformly across all regions from the moment data is written — no hotspot, no manual pre-splitting required.

sql

-- Problematic: all inserts land on one region leader
CREATE TABLE orders (
  id BIGINT AUTO_INCREMENT PRIMARY KEY,
  user_id BIGINT NOT NULL,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Correct: inserts scatter across all regions
CREATE TABLE orders (
  id BIGINT AUTO_RANDOM(5) PRIMARY KEY,
  user_id BIGINT NOT NULL,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Identify hotspot regions
SELECT region_id, leader_store_id, approximate_keys, approximate_size
FROM information_schema.tikv_region_status
WHERE db_name = 'mydb' AND table_name = 'orders'
ORDER BY approximate_keys DESC
LIMIT 10;

Warning

AUTO_RANDOM IDs are not suitable for use in application code that expects sequential, user-visible ordering (such as invoice numbers). They are designed for surrogate primary keys only. If you need a sequential user-visible identifier, add a separate order_seq column and generate it from a sequence or application counter — keep AUTO_RANDOM as the internal PK.

MVCC in TiKV: Version Timestamps and GC

TiKV uses multi-version concurrency control (MVCC) to serve consistent reads without blocking writes. Every write creates a new version of a key tagged with the transaction's commit timestamp. Reads use a snapshot timestamp to select the appropriate version — they never block on concurrent writes.

GC Worker and tidb_gc_life_time

MVCC versions accumulate indefinitely until the GC worker cleans them up. The GC safe point is calculated as NOW() - tidb_gc_life_time. All versions older than the safe point are eligible for deletion. The default tidb_gc_life_time is 10 minutes — appropriate for OLTP, but too aggressive if you run long-running analytics queries directly against TiDB that need consistent snapshots older than 10 minutes.

sql

-- Check current GC life time
SHOW VARIABLES LIKE 'tidb_gc_life_time';

-- Extend GC life time before a long-running analytics job
SET GLOBAL tidb_gc_life_time = '2h';

-- Monitor GC worker progress
SELECT variable_name, variable_value
FROM mysql.tidb
WHERE variable_name IN (
  'tikv_gc_leader_uuid',
  'tikv_gc_last_run_time',
  'tikv_gc_run_interval',
  'tikv_gc_life_time',
  'tikv_gc_safe_point'
);

Tip

A common production mistake is extending tidb_gc_life_time before a long operation and forgetting to restore it. Every minute of extended GC life time means an additional minute of MVCC version accumulation across the entire cluster. On high-write clusters, this can grow stored data by gigabytes per hour. Always restore the original value immediately after the long operation completes.

Bulk Writes: TiDB Lightning vs. Batched INSERT

For loading large datasets — migrations, backups, initial data loads — the SQL write path is not the right tool. Each INSERT is a full 2PC transaction with prewrite and commit RPCs, Raft replication, and RocksDB memtable pressure. Batching helps but has limits.

Batched INSERT Best Practices

sql

-- Batch INSERT with explicit transaction boundaries
-- Target: 1,000–5,000 rows per transaction
START TRANSACTION;
INSERT INTO events (user_id, event_type, created_at) VALUES
  (1001, 'click', '2026-02-01 10:00:00'),
  (1002, 'view',  '2026-02-01 10:00:01'),
  -- ... up to 5000 rows
  (6000, 'click', '2026-02-01 10:05:00');
COMMIT;

-- Monitor write throughput during bulk load
SELECT sum(value) AS total_write_kv
FROM metrics_schema.tidb_tikvclient_txn_write_kv_num
WHERE time > NOW() - INTERVAL 5 MINUTE;

TiDB Lightning: Bypass the SQL Layer

TiDB Lightning's local backend bypasses TiKV's write path entirely. It sorts data locally, generates SST files in the correct RocksDB format, and ingests them directly into TiKV using the IngestSST RPC. This eliminates 2PC, Raft replication during load, and compaction pressure during the write phase. For a 500 GB dataset, the difference between batched INSERT and Lightning local backend is typically 8–12 hours versus 45–90 minutes.

toml

# tidb-lightning.toml — local backend configuration
[lightning]
level = "info"

[tikv-importer]
backend = "local"
sorted-kv-dir = "/data/lightning-tmp"
disk-quota = "400GiB"

[mydumper]
data-source-dir = "/data/export"

[tidb]
host = "tidb-host"
port = 4000
user = "root"
password = "secret"
status-port = 10080
pd-addr = "pd-host:2379"

[checkpoint]
enable = true
driver = "file"
dsn = "/data/lightning-checkpoint.pb"

Warning

TiDB Lightning local backend requires exclusive access to the target table during ingestion — DML against the target table while Lightning is running will cause data inconsistency. Schedule Lightning loads during maintenance windows or against freshly created tables. The TiDB backend mode (which routes through the SQL layer) is safe for live tables but delivers throughput comparable to batched INSERT.

Monitoring Write Latency and Write Amplification

Two metric families give the clearest signal on write path health:

tidb_tikvclient_txn_write_kv_num

This histogram tracks the number of KV pairs written per transaction from the TiDB SQL layer's perspective. A sudden increase in the p99 value — without a corresponding increase in application transaction count — indicates write amplification from cascading index updates, unintended full-table scans causing large transaction mutations, or runaway UPDATE statements.

tikv_engine_flow_bytes

This gauge tracks bytes written to the RocksDB engine per second, broken down by type: wal_file_bytes, write_done_by_self, and compaction_bytes_read/written. The ratio of compaction_bytes_written to write_done_by_self is your write amplification factor. A ratio above 20 indicates compaction is consuming disproportionate IO — time to tune compaction or reduce write throughput.

bash

# Query Prometheus directly for write amplification ratio
# (Run from TiDB Grafana or via promtool)
sum(rate(tikv_engine_flow_bytes{type="compaction_bytes_written"}[5m]))
  /
sum(rate(tikv_engine_flow_bytes{type="write_done_by_self"}[5m]))

RocksDB Compaction Tuning

When write amplification is high, two RocksDB parameters have the most impact. max-bytes-for-level-base controls the total size of L1 before compaction triggers — increasing it reduces compaction frequency at the cost of larger L0-to-L1 compaction bursts. write-buffer-size controls memtable size — larger memtables absorb more writes before flushing, reducing L0 file accumulation rate.

toml

# tikv.toml — RocksDB tuning for write-heavy workloads
[rocksdb.defaultcf]
write-buffer-size = "256MB"       # default: 128MB
max-write-buffer-number = 5       # default: 5
level0-slowdown-writes-trigger = 40   # default: 20 — increase to tolerate bursts
level0-stop-writes-trigger = 64       # default: 36
max-bytes-for-level-base = "1GB"  # default: 512MB

Key Takeaways

TiDB writes traverse four layers in order: SQL layer (KV translation), TiKV Raft (consensus replication), and RocksDB (LSM-tree storage). Write latency is the sum of all four — diagnose by layer, not holistically.
Two-phase commit (prewrite + commit) is atomic: a single primary key commit timestamp is the true commit point. Secondary key commits are asynchronous and do not block the client response.
Replace AUTO_INCREMENT with AUTO_RANDOM on all high-write tables. Sequential keys concentrate writes on a single region leader and defeat TiDB's horizontal scaling for inserts.
Use pessimistic transactions for high-contention workloads (inventory, balances) to eliminate retry storms; use optimistic transactions for low-contention bulk processing to eliminate per-statement lock RPCs.
MVCC version accumulation is silent until GC worker lag becomes visible as disk growth. Tune tidb_gc_life_time conservatively and always restore it after long-running analytics operations.
For bulk loads above a few hundred GB, TiDB Lightning local backend delivers 10–50x the throughput of batched INSERT by generating and ingesting SST files directly into TiKV.
Monitor tidb_tikvclient_txn_write_kv_num for write amplification at the transaction layer and tikv_engine_flow_bytes for RocksDB compaction amplification at the storage layer — these two metrics surface the most common write performance regressions before they become incidents.

Tuning TiDB's write path — sizing RocksDB compaction for your write volume, migrating schemas from AUTO_INCREMENT without downtime, calibrating GC life time against analytics query duration, and configuring TiDB Lightning for terabyte-scale loads — is the kind of hands-on distributed database work that JusDB does daily for engineering teams running TiDB in production. Whether you are diagnosing a write stall at 3 AM or architecting a greenfield TiDB deployment for a high-throughput application, reach out to JusDB for an architecture review, write performance audit, and operational support from engineers who have run TiDB at scale.

Demystifying TiDB's Write Path: A Guide to Scalable Data Management

The TiDB Write Path: Four Layers in Sequence

Layer 1: SQL Layer (TiDB Server)

Layer 2: TiKV and the Region Model

Layer 3: Raft Consensus

Layer 4: RocksDB Storage Engine

Two-Phase Commit in TiDB

Phase 1: Prewrite

Phase 2: Commit

Pessimistic vs. Optimistic Transactions

Optimistic Transactions

Pessimistic Transactions

Region Splitting and Hotspot Avoidance

AUTO_RANDOM: The Right Default

MVCC in TiKV: Version Timestamps and GC

GC Worker and tidb_gc_life_time

Bulk Writes: TiDB Lightning vs. Batched INSERT

Batched INSERT Best Practices

TiDB Lightning: Bypass the SQL Layer

Monitoring Write Latency and Write Amplification

tidb_tikvclient_txn_write_kv_num

tikv_engine_flow_bytes

RocksDB Compaction Tuning

Share this article

Keep reading

Normalization vs Denormalization (2026): When Each Wins for Database Performance

Database Schema Design for Scalability (2026): 12 Patterns DBAs Use in Production

Solving Cache Invalidation and Stampede in Valkey with Debezium CDC