High Availability

Mastering MySQL Group Replication: The Road to Fault-Tolerant Databases

Build fault-tolerant MySQL clusters with Group Replication. Learn multi-primary vs single-primary modes, conflict resolution, network partitioning handling, and failover strategies.

JusDB Team
July 12, 2022
4 min read
3947 views

MySQL Group Replication is the native distributed consensus layer that powers InnoDB Cluster. Understanding how it works — and more importantly, where it breaks — is essential if you're running or planning any HA MySQL deployment. This guide goes deeper than the setup tutorial: it covers the consensus mechanics, real failure scenarios, and the configuration decisions that determine whether your cluster survives a production incident.

TL;DR
  • Group Replication uses Paxos-based consensus — a write is committed only when a majority of nodes acknowledge it
  • Always use an odd number of nodes (3, 5, 7) — even-node clusters can split-brain on a 50/50 failure
  • Single-primary mode is production-safe; multi-primary is not for most workloads
  • Flow control is Group Replication's most misunderstood feature — it throttles writes when replicas fall behind
  • Group Replication is the engine under InnoDB Cluster — if you use MySQL Shell + Router, you're already using GR

How Group Replication Consensus Works

Traditional MySQL replication is asynchronous: the primary writes, logs it to the binlog, and lets replicas catch up whenever they can. This is fast, but it means replicas can be seconds (or minutes) behind the primary. If the primary crashes, you might lose those in-flight transactions.

Group Replication takes a fundamentally different approach. Every transaction goes through a certification phase before committing:

  1. Application issues a transaction to the primary
  2. Primary executes the transaction locally (not yet committed)
  3. Before committing, it broadcasts the transaction's write set to all group members
  4. Each member checks if the transaction conflicts with any concurrent transaction it knows about
  5. If a majority of nodes certify the transaction (no conflicts, quorum reached), it commits on all nodes simultaneously
  6. If the write set conflicts with another concurrent transaction, one of them is rolled back

The result: every committed transaction is guaranteed to be on a majority of nodes. Failover can happen without data loss because the surviving majority already has all committed transactions.

-- View certification statistics
SELECT * FROM performance_schema.replication_group_member_stats\G

-- Key fields:
-- COUNT_TRANSACTIONS_IN_QUEUE: transactions waiting for certification (should be near 0)
-- COUNT_TRANSACTIONS_CHECKED: total certified
-- COUNT_CONFLICTS_DETECTED: rollbacks due to conflicts (high values = too much concurrency)
-- FLOW_CONTROL_TIME_ELAPSED_MS: time lost to flow control throttling

Quorum: Why Node Count Matters

Group Replication requires a majority (quorum) to function. For a 3-node cluster, quorum = 2 nodes. For a 5-node cluster, quorum = 3 nodes.

Cluster Size Quorum Required Tolerable Node Failures Write Overhead
1 node10Minimal
3 nodes (recommended)21Low
5 nodes32Moderate
7 nodes43Higher

Never run an even-numbered cluster in production. A 2-node cluster cannot self-heal — if one node goes down, the survivor has only 1 of 2 nodes (50%, not a majority) and refuses to accept writes to prevent split-brain. A 4-node cluster has the same problem on a 2-2 split. Always use 3, 5, or 7.


Setup: Single-Primary Mode

-- Prerequisites (add to my.cnf on all nodes before starting):
[mysqld]
server_id = 1                          # unique per node
gtid_mode = ON
enforce_gtid_consistency = ON
binlog_checksum = NONE                 # required for GR
log_bin = mysql-bin
binlog_format = ROW
binlog_row_image = FULL
sync_binlog = 1
plugin_load_add = group_replication.so

group_replication_group_name = "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"  # same UUID on all nodes
group_replication_start_on_boot = OFF
group_replication_local_address = "node1:33061"
group_replication_group_seeds = "node1:33061,node2:33061,node3:33061"
group_replication_bootstrap_group = OFF
loose-group_replication_single_primary_mode = ON
loose-group_replication_enforce_update_everywhere_checks = OFF

Bootstrap the first node

-- On node1 only, once
SET GLOBAL group_replication_bootstrap_group = ON;
START GROUP_REPLICATION;
SET GLOBAL group_replication_bootstrap_group = OFF;

-- Confirm node1 is PRIMARY
SELECT MEMBER_HOST, MEMBER_ROLE, MEMBER_STATE
FROM performance_schema.replication_group_members;

Join the remaining nodes

-- On node2 and node3 (no bootstrap needed):
START GROUP_REPLICATION;

-- Verify all 3 nodes are ONLINE
SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_ROLE, MEMBER_STATE
FROM performance_schema.replication_group_members;
-- Expected: 1 PRIMARY, 2 SECONDARY, all ONLINE

Flow Control: The Hidden Performance Killer

Flow control is one of the least-understood Group Replication features, and the source of the most surprising performance problems. When a secondary node's certification queue or applier queue grows too large (because it's falling behind the primary), Group Replication throttles the primary's write throughput to let the secondary catch up.

This means: a slow secondary can directly slow down all writes to the primary. Under burst write scenarios, you can see primary write throughput drop 50–90% because a secondary is struggling with disk I/O.

-- Check flow control activity
SELECT
  MEMBER_HOST,
  COUNT_TRANSACTIONS_IN_QUEUE AS cert_queue,
  FLOW_CONTROL_TIME_ELAPSED_MS AS flow_control_ms,
  COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE AS applier_queue
FROM performance_schema.replication_group_member_stats;

-- flow_control_ms > 0 means writes were throttled on this node
-- cert_queue > 1000 means certification backlog — secondary is falling behind

-- Tune flow control thresholds (default: 25,000 transactions):
SET GLOBAL group_replication_flow_control_certifier_threshold = 50000;
SET GLOBAL group_replication_flow_control_applier_threshold = 50000;

-- Increase only if you have fast SSD storage on all nodes and understand the trade-off:
-- higher threshold = less throttling but more divergence before replication catches up

Monitoring Group Replication in Production

-- Cluster health overview
SELECT
  MEMBER_ID,
  MEMBER_HOST,
  MEMBER_PORT,
  MEMBER_STATE,   -- ONLINE, RECOVERING, UNREACHABLE, ERROR
  MEMBER_ROLE,    -- PRIMARY, SECONDARY
  MEMBER_VERSION
FROM performance_schema.replication_group_members;

-- Replication applier lag per secondary
SELECT
  member_id,
  member_host,
  ROUND(SUM(TRANSACTIONS_COMMITTED_ALL_MEMBERS) / COUNT(*)) AS avg_committed,
  ROUND(SUM(LAST_CONFLICT_FREE_TIMESTAMP)) AS latest_conflict_free
FROM performance_schema.replication_group_member_stats
GROUP BY member_id, member_host;

-- Transactions waiting for certification (spikes here = write contention)
SELECT
  MEMBER_HOST,
  COUNT_TRANSACTIONS_IN_QUEUE AS queue_depth
FROM performance_schema.replication_group_member_stats
ORDER BY queue_depth DESC;

-- Long-running transactions blocking replication (find and kill these)
SELECT
  p.ID,
  p.USER,
  p.TIME AS running_sec,
  LEFT(p.INFO, 100) AS query
FROM information_schema.PROCESSLIST p
WHERE p.COMMAND != 'Sleep'
  AND p.TIME > 30
ORDER BY p.TIME DESC;

Alert thresholds

MetricWarningCritical
Any member MEMBER_STATERECOVERINGUNREACHABLE or ERROR
cert_queue (COUNT_TRANSACTIONS_IN_QUEUE)> 1,000> 10,000
FLOW_CONTROL_TIME_ELAPSED_MS (delta)> 0 (any throttling)> 5,000ms/minute
COUNT_CONFLICTS_DETECTEDGrowing rapidly> 100/minute sustained

Real Failure Scenarios

Scenario 1: Single node crashes (common)

-- Before: 3 nodes, 1 PRIMARY (node1), 2 SECONDARY
-- Event: node1 (PRIMARY) crashes

-- Group Replication:
-- 1. node2 and node3 detect node1 is UNREACHABLE (after group_replication_member_expel_timeout)
-- 2. Election happens: node with highest transaction count and memberWeight wins
-- 3. New PRIMARY starts accepting writes

-- Check new primary:
SELECT MEMBER_HOST, MEMBER_ROLE FROM performance_schema.replication_group_members
WHERE MEMBER_STATE = 'ONLINE';

-- Rejoin node1 after it recovers:
START GROUP_REPLICATION;  -- run on node1
-- It will automatically use clone or incremental sync to catch up

Scenario 2: Network partition (split-brain prevention)

-- Before: 3 nodes in a cluster
-- Event: node3 gets network-isolated (can't see node1 or node2)

-- node3's behavior:
-- It cannot reach a majority (1 of 3 = no quorum)
-- It transitions to ERROR state and stops accepting writes
-- This prevents split-brain: only node1+node2 partition can write

-- node1+node2 behavior:
-- They have quorum (2 of 3)
-- They continue operating normally

-- When node3's network recovers:
-- node3 automatically rejoins via autoRejoinTries
-- Or manually: START GROUP_REPLICATION; on node3

Scenario 3: Large transaction causing replication delay

-- Symptom: flow control kicks in, writes slow down across the cluster
-- Cause: a large batch operation on primary generates a huge binlog event

-- Identify the large transaction:
SHOW ENGINE INNODB STATUS\G  -- look for TRANSACTION section
SELECT * FROM information_schema.INNODB_TRX ORDER BY trx_rows_modified DESC LIMIT 5;

-- Prevention: break large batch operations into smaller chunks
-- Instead of:
DELETE FROM events WHERE created_at < '2023-01-01';  -- might affect millions of rows

-- Do:
REPEAT
  DELETE FROM events WHERE created_at < '2023-01-01' LIMIT 10000;
UNTIL ROW_COUNT() = 0 END REPEAT;

Group Replication vs Async Replication

Aspect Group Replication Traditional Async Replication
FailoverAutomatic, <30sManual or via Orchestrator (~minutes)
Data loss on failoverZero (committed txns on majority)Possible (async lag = potential loss)
Write throughputLower (certification overhead)Higher (no consensus per write)
Write latency~+RTT between nodes (sub-ms intra-DC)Local only (no cross-node overhead)
ComplexityHigher (quorum, flow control, GCS)Lower (familiar replication model)
Best forStrong HA requirements, automated failoverHigh write throughput, read scaling

Common Mistakes and How to Avoid Them

  • Using 2-node clusters: Cannot form quorum after one failure. Always use 3+ nodes.
  • Large transactions in the workload: A single 10M-row DELETE causes replication lag and triggers flow control. Break large batch operations into smaller chunks (10,000 rows max per transaction).
  • Different MySQL versions across nodes: Group Replication requires the same major version across all members. Mixed 8.0 + 8.4 clusters only work during rolling upgrades, not as a permanent state.
  • Not setting group_replication_start_on_boot = OFF: If set to ON, a node that restarts during a split-brain event might bootstrap a new group by itself, causing data divergence.
  • Ignoring flow control metrics: Flow control throttling appears as "intermittent slowdowns" that are hard to diagnose without specifically monitoring FLOW_CONTROL_TIME_ELAPSED_MS.

Working with JusDB

Group Replication is the foundation of MySQL's HA story, but it has real operational complexity — flow control tuning, quorum management, large transaction handling, and the transition to InnoDB Cluster all require hands-on expertise.

We deploy and operate Group Replication clusters as part of our MySQL consulting and managed MySQL SRE services. If you're debugging flow control issues, planning a migration from async replication to GR, or designing a multi-DC InnoDB ClusterSet, reach out.

Related reading: MySQL InnoDB Cluster Guide | MySQL Performance Tuning | MySQL 8.0 EOL Upgrade Guide

Share this article

JusDB Team

Official JusDB content team