At 2:47 a.m. on a Tuesday, a JusDB client's on-call engineer got paged: their e-commerce checkout was serving stale inventory data. Customers were completing orders for items that had sold out minutes earlier. The root cause was a read replica sitting 94 seconds behind the primary — a bulk import of 4.2 million SKU price updates had been pushed through a single large transaction, and the replica's single-threaded SQL thread was still grinding through it long after the primary had committed. Every checkout that read from the replica saw inventory counts that no longer existed. The fix took under ten minutes once the lag was understood; understanding it took two hours of frantic log-diving.
MySQL replication lag is one of those problems that seems abstract until it causes a production incident. It is the gap between the moment a write commits on the primary and the moment that write becomes visible on a replica. In a healthy setup this gap is under a second. In a degraded setup it can be minutes or hours. The danger is that most applications treat replicas as if they are perfectly synchronized with the primary — and they usually are, until they suddenly are not.
This guide covers exactly why replication lag occurs, why the standard monitoring metric lies to you, and the concrete steps — from tuning parallel replication workers to chunking large transactions — that eliminate lag in practice.
- Replication lag is caused by the replica SQL thread being single-threaded by default, plus large transactions, DDL statements, and network latency between primary and replica.
Seconds_Behind_MasterinSHOW SLAVE STATUSis unreliable — it reportsNULLduring I/O thread outages and can overstate lag after a replica restart.- Use pt-heartbeat or
performance_schemareplication tables for accurate, continuous lag measurement. - Enable multi-threaded replication (
slave_parallel_workers > 0withslave_parallel_type = LOGICAL_CLOCK) to let replicas replay transactions in parallel. - On the primary, set
binlog_group_commit_sync_delayto batch commits into groups so replicas see more parallelism to exploit. - Break large DML into small chunks with pt-archiver or batch scripts; avoid long-running DDL by using pt-online-schema-change or gh-ost.
Background
MySQL replication works by streaming a binary log (binlog) from the primary to one or more replicas. On the primary, every committed transaction is written to the binlog. On each replica, two threads do the work:
- I/O thread: connects to the primary, reads binlog events, and writes them into a local relay log on the replica's disk.
- SQL thread (also called the applier): reads from the relay log and replays each event against the replica's data files.
Lag accumulates when the SQL thread cannot keep up with the rate of events arriving in the relay log. In statement-based replication (SBR), the SQL thread re-executes the original SQL statement. In row-based replication (RBR), which is the recommended default since MySQL 5.7, the SQL thread applies pre-computed row-change images. RBR is generally faster to apply, but the fundamental bottleneck — that the SQL thread runs serially by default — still applies.
The key insight is that the primary can accept writes from hundreds of concurrent connections simultaneously, committing transactions in parallel. The replica, by default, replays those transactions one at a time, in the exact order they were written to the binlog. Even if each individual transaction is tiny, a high-throughput primary will eventually outrun a single-threaded applier.
Root Causes of Replication Lag
Single-Threaded SQL Thread
Out of the box, MySQL uses a single SQL thread to apply relay log events. This means every transaction — regardless of which table or database it touches — must wait for the previous one to finish. On a primary receiving 5,000 write transactions per second, the replica must apply those same 5,000 transactions per second through one thread. On a busy OLTP workload this gap opens fast.
The single-threaded default catches many teams by surprise after an initial period where replication looks fine. As write volume grows gradually, lag often stays near zero for months and then suddenly starts climbing and never comes back down. The tipping point is when average transaction apply time exceeds the interval between incoming transactions.
Large Transactions
A single transaction touching millions of rows — a bulk UPDATE, a large DELETE, a data migration — produces a massive binlog event or event group. The SQL thread on the replica must hold an open transaction for the entire duration of applying it. During that time, no other transactions can be applied (in single-threaded mode). A 30-second bulk update on the primary becomes a 30-second-plus lag spike on every replica.
Long transactions also increase the size of the relay log on disk. If disk is limited on the replica host, large relay log files can fill the filesystem, causing replication to stop entirely.
DDL Statements
Schema changes — ALTER TABLE, CREATE INDEX, DROP COLUMN — are replicated as DDL events and must be applied on the replica just as they ran on the primary. An ALTER TABLE that takes 8 minutes on the primary (while using online DDL to avoid locking) will take a similar 8 minutes on the replica. More critically, during that time the replica SQL thread is blocked on the DDL and all other transactions queue behind it.
Even with MySQL's online DDL (ALGORITHM=INPLACE, LOCK=NONE), the operation is still serialized on the replica's SQL thread. The replica does not benefit from online DDL in the same way the primary does. For tables with tens of millions of rows, expect significant replica lag during every schema change.
Network Latency Between Primary and Replica
The I/O thread introduces its own delay: binlog events must travel across the network from primary to replica before the SQL thread can apply them. In same-datacenter setups this is typically under 1ms and is rarely a bottleneck. In cross-region replication or cloud setups with network congestion, I/O thread lag can account for 50–500ms of baseline lag before the SQL thread adds its own delay.
Network-induced lag appears in SHOW SLAVE STATUS as a difference between Master_Log_File/Read_Master_Log_Pos (what the I/O thread has read) and Relay_Master_Log_File/Exec_Master_Log_Pos (what the SQL thread has applied). If both gaps are growing, you have both network lag and applier lag. If only the SQL thread position is behind, the I/O thread is keeping up and the problem is pure applier throughput.
Detecting Lag
SHOW SLAVE STATUS and Its Limitations
The first tool every MySQL DBA reaches for is SHOW SLAVE STATUS (or SHOW REPLICA STATUS in MySQL 8.0.22+):
SHOW SLAVE STATUS\GThe field most teams look at is Seconds_Behind_Master. This is calculated as the difference between the current timestamp on the replica and the timestamp embedded in the binlog event currently being applied. It sounds reliable. It is not, for several reasons:
- NULL during I/O thread failures: If the I/O thread is not connected to the primary — due to network issues, primary restart, or credential problems —
Seconds_Behind_MasterreturnsNULL. Your monitoring may interpretNULLas zero lag, masking a replication outage. - Overstatement after replica restarts: When a replica restarts and the SQL thread begins catching up from a relay log that accumulated during downtime,
Seconds_Behind_Mastermay report extremely high values even when the replica is catching up quickly. - Clock skew: If the primary and replica have different system clocks (even by a few seconds), the reported lag will be offset by that difference in the wrong direction.
- It only reports current event timestamp: If the SQL thread is idle (no new events),
Seconds_Behind_Masteris 0, regardless of whether the replica has applied all events. This can give a false sense of zero lag during quiet periods that follow a lag event.
The two log position fields are more trustworthy for understanding the applier state:
-- Check I/O thread position (what has been received from primary)
-- vs SQL thread position (what has been applied)
SELECT
MASTER_LOG_FILE,
READ_MASTER_LOG_POS,
RELAY_MASTER_LOG_FILE,
EXEC_MASTER_LOG_POS,
(READ_MASTER_LOG_POS - EXEC_MASTER_LOG_POS) AS bytes_behind
FROM performance_schema.replication_connection_status
JOIN performance_schema.replication_applier_status_by_worker
USING (CHANNEL_NAME)\GWhen using GTID-based replication, you can compare gtid_executed on the primary vs replica to get an exact transaction count difference, which is often more intuitive than byte positions. Run SELECT @@gtid_executed on both hosts and compare the sets.
Accurate Lag Measurement with pt-heartbeat
The most reliable lag measurement in the MySQL ecosystem is Percona's pt-heartbeat. It works by writing a timestamp row to a sentinel table on the primary at a configurable interval (typically every second), and then reading that same row from the replica and comparing the current time to the written timestamp. This sidesteps all the Seconds_Behind_Master failure modes because it is measuring actual data propagation, not binlog event metadata.
# On the primary: start the heartbeat writer (run as a daemon)
pt-heartbeat \
--host=primary.db.internal \
--user=repl_monitor \
--password=secret \
--database=percona \
--update \
--daemonize \
--pid=/var/run/pt-heartbeat.pid
# On the replica: read the current lag
pt-heartbeat \
--host=replica.db.internal \
--user=repl_monitor \
--password=secret \
--database=percona \
--monitor \
--master-server-id=1The output stream from --monitor gives lag in seconds with decimal precision, updated every second. This is exactly the metric you want feeding into Prometheus, Grafana, or PagerDuty.
performance_schema Replication Tables
MySQL 5.7+ exposes detailed replication state through performance_schema. These tables are particularly useful for multi-threaded replication:
-- Overall applier status
SELECT * FROM performance_schema.replication_applier_status\G
-- Per-worker status (when slave_parallel_workers > 0)
SELECT
WORKER_ID,
THREAD_ID,
SERVICE_STATE,
LAST_ERROR_NUMBER,
LAST_ERROR_MESSAGE,
LAST_APPLIED_TRANSACTION,
APPLYING_TRANSACTION
FROM performance_schema.replication_applier_status_by_worker\G
-- Connection status (I/O thread)
SELECT
CHANNEL_NAME,
SERVICE_STATE,
RECEIVED_TRANSACTION_SET,
LAST_ERROR_MESSAGE,
LAST_HEARTBEAT_TIMESTAMP,
COUNT_RECEIVED_HEARTBEATS
FROM performance_schema.replication_connection_status\GThe replication_applier_status_by_worker table is especially valuable when diagnosing why parallel workers are not helping — you can see exactly which transactions each worker is currently applying and whether any worker is stalled on a long-running transaction.
Fixing Replication Lag
Chunking Large DML with pt-archiver
The single most impactful operational change for teams experiencing bulk-DML lag is to stop running large transactions and start chunking. Percona's pt-archiver was designed for this, but the principle applies to any batch script.
# Example: archive rows older than 90 days in 1,000-row chunks
# with a 50ms sleep between chunks to let the replica breathe
pt-archiver \
--source h=primary.db.internal,D=myapp,t=events \
--dest h=archive.db.internal,D=myapp_archive,t=events \
--where "created_at < DATE_SUB(NOW(), INTERVAL 90 DAY)" \
--limit 1000 \
--sleep 0.05 \
--no-check-charset \
--statisticsFor custom batch deletes or updates without pt-archiver, the pattern is:
-- Instead of this (creates one massive binlog event):
DELETE FROM events WHERE created_at < '2024-01-01';
-- Do this (small chunks, each a tiny binlog event):
DELETE FROM events
WHERE created_at < '2024-01-01'
ORDER BY id
LIMIT 1000;
-- Repeat until 0 rows affected, with application-side sleep between iterationsAdd a short sleep (50–200ms) between chunks. The goal is to give the replica SQL thread time to catch up between each batch so lag never accumulates beyond a second or two. On a replica with multi-threaded replication enabled, smaller sleeps are sufficient because the parallel workers can process the chunks faster.
Avoiding Long DDL — pt-online-schema-change and gh-ost
For schema changes on large tables, avoid ALTER TABLE directly. Use either pt-online-schema-change (pt-osc) or gh-ost (GitHub's online schema change tool). Both work by creating a shadow copy of the table, applying changes to it incrementally in small batches, and then doing a fast table-name swap at the end. The binlog events they produce are small row changes rather than a single large DDL event, so replicas handle them without lag spikes.
# pt-online-schema-change: add an index without blocking replica
pt-online-schema-change \
--host=primary.db.internal \
--user=admin \
--password=secret \
--database=myapp \
--table=orders \
--alter "ADD INDEX idx_customer_status (customer_id, status)" \
--max-lag=2 \
--check-interval=1 \
--executeThe --max-lag flag is critical: pt-osc monitors replica lag and automatically pauses its own copy operations when lag exceeds the threshold. This is self-throttling schema migration.
Multi-Threaded Replication
Multi-threaded replication (MTS) allows the replica to use multiple SQL applier threads in parallel, directly addressing the single-threaded bottleneck. It is controlled by two key variables on the replica:
# /etc/mysql/mysql.conf.d/mysqld.cnf (replica host) # Number of parallel applier threads (0 = single-threaded, legacy default) slave_parallel_workers = 8 # Parallelism strategy # DATABASE: parallel only across different schemas (limited benefit on single-schema apps) # LOGICAL_CLOCK: parallel within a schema, based on binlog group commit timestamps slave_parallel_type = LOGICAL_CLOCK # Preserve commit order on replica (important for read-your-writes consistency) slave_preserve_commit_order = ON
slave_preserve_commit_order = ON is strongly recommended when using LOGICAL_CLOCK parallelism. Without it, transactions can be committed on the replica in a different order than on the primary, which can cause inconsistent reads and GTID gaps that are painful to recover from.
DATABASE vs LOGICAL_CLOCK
DATABASE parallelism assigns each schema to its own worker thread. Transactions against different schemas run in parallel; transactions against the same schema are serialized. For applications using a single database schema (the vast majority), this provides almost no benefit.
LOGICAL_CLOCK parallelism is far more powerful. It uses the binlog group commit timestamps embedded by the primary to determine which transactions can safely be applied in parallel. Any transactions that were in the same group commit on the primary — meaning they were all prepared before any of them committed — can be safely applied concurrently on the replica because they do not conflict with each other by definition.
Tuning the Primary for Better Parallel Replay
The amount of parallelism available to LOGICAL_CLOCK workers depends on how many transactions the primary batches into each group commit. On a lightly loaded primary, group commits may contain only one or two transactions, giving replicas little parallelism to exploit. The binlog_group_commit_sync_delay variable addresses this by introducing a small artificial delay on the primary before flushing the binlog, giving more transactions time to join each commit group:
# /etc/mysql/mysql.conf.d/mysqld.cnf (PRIMARY host) # Wait up to 1000 microseconds (1ms) to accumulate transactions before fsync # This increases group commit batch size, giving replicas more parallel work binlog_group_commit_sync_delay = 1000 # Maximum transactions to batch regardless of the delay binlog_group_commit_sync_no_delay_count = 100
Start with binlog_group_commit_sync_delay = 100 (100 microseconds) and increase toward 1000 while monitoring primary write latency. The added delay is paid by the primary's writers but the benefit — significantly more parallel replay capacity on replicas — usually outweighs the cost for write-heavy workloads. Set binlog_group_commit_sync_no_delay_count to cap the wait at a maximum batch size.
Choosing the Right Worker Count
For slave_parallel_workers, a value between 4 and 16 covers most production workloads. The optimal number depends on the replica host's CPU count and the mix of transactions in your workload. Start at the number of CPU cores minus 2 (leaving headroom for I/O thread, monitoring, and OS tasks), then measure lag under realistic load. More workers do not always mean faster replay — contention on the replica's InnoDB buffer pool and I/O subsystem can create diminishing returns above a certain thread count.
After applying changes, verify worker activity in real time:
-- Watch worker utilization (run repeatedly)
SELECT
WORKER_ID,
SERVICE_STATE,
LAST_APPLIED_TRANSACTION,
APPLYING_TRANSACTION,
LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP
FROM performance_schema.replication_applier_status_by_worker
ORDER BY WORKER_ID\GIf most workers show SERVICE_STATE = 'WAITING' while one worker is consistently busy, your workload may have a hot table or schema that serializes transactions through a single worker despite MTS being enabled. In this case, application-level chunking and transaction decomposition will help more than adding more workers.
Key Takeaways
- Replication lag is fundamentally a throughput gap between the primary's parallel write capacity and the replica's serial (by default) apply capacity. Multi-threaded replication with
LOGICAL_CLOCKparallelism closes most of this gap. - Never rely solely on
Seconds_Behind_Masterfor lag monitoring. Use pt-heartbeat orperformance_schemareplication tables for production alerting — they cannot returnNULLwhen the I/O thread disconnects. - Large transactions are the single largest operational contributor to lag spikes. Chunking bulk DML into 500–2,000 row batches with inter-batch sleeps is the most reliable mitigation.
- DDL on large tables will cause lag on every replica. Use pt-osc or gh-ost with
--max-lagthrottling to avoid it. binlog_group_commit_sync_delayon the primary is a free performance lever: it increases group commit batch sizes, directly increasing the parallelism available toLOGICAL_CLOCKworkers on replicas.- Set
slave_preserve_commit_order = ONwhenever using multi-threaded replication. The consistency guarantees it provides are not optional for most applications.
Working with JusDB on MySQL Replication
MySQL replication lag issues rarely have a single root cause. In most production incidents, it is a combination of factors: a workload that outgrew the default single-threaded applier, a cron job that runs a large weekly batch without chunking, and a schema migration that nobody realised would stall the replica for twelve minutes. Diagnosing and resolving lag requires visibility into binlog group commit patterns, replica hardware limits, and application transaction shapes simultaneously.
JusDB's MySQL managed service includes replication health monitoring by default — we run pt-heartbeat on every replica, alert on lag above configurable thresholds, and maintain multi-threaded replication configuration tuned to each client's workload. When incidents occur, we have the tooling and production experience to diagnose them quickly.
- Learn more about our MySQL managed database service, including replication setup, monitoring, and incident response.
- Ready to fix a current replication issue or harden your setup before the next one? Contact the JusDB team.
Related Articles
- MySQL GTID Replication: A Complete Setup and Troubleshooting Guide — GTID-based replication eliminates binlog position management and makes failover dramatically simpler. This guide covers enabling GTIDs, handling GTID gaps, and migrating from position-based replication.
- ProxySQL for MySQL Load Balancing: Configuration and Query Routing — Once your replicas are healthy and lag-free, ProxySQL lets you route read queries to replicas automatically, distribute write traffic, and fail over connections without application changes.
- Database Monitoring with Prometheus, Grafana, and PMM — Percona Monitoring and Management (PMM) ships a pre-built MySQL replication dashboard with lag graphs, worker utilization, and relay log size — everything you need to catch lag before it becomes an incident.