A startup engineering team spends months building their MySQL primary-replica setup, confident they have high availability covered. Then, during a late-night primary failure, they discover their async replica is 45 seconds behind, and those 45 seconds of transactions are simply gone. The manual failover takes 20 minutes, the on-call engineer promotes the wrong host, and the application throws errors the entire time. This scenario plays out constantly because MySQL offers several distinct replication approaches, and teams frequently choose based on familiarity rather than actual durability requirements.
MySQL's replication ecosystem has grown significantly: you have traditional asynchronous replication, semi-synchronous replication, Group Replication (GA since MySQL 5.7), and InnoDB Cluster — a full HA stack built on top of Group Replication. Each solves a different problem. Picking the wrong one means either over-engineering a simple use case or under-protecting a critical one. This guide breaks down exactly how each option behaves under failure, what you give up, and when to reach for each.
Traditional Asynchronous Replication
Async replication is MySQL's default and the most widely deployed topology. The primary commits transactions and writes them to the binary log. Replicas connect, read the binlog, and apply events independently — with no acknowledgment back to the primary. This makes writes on the primary completely decoupled from replica state.
How It Works
The replica runs two threads: an IO thread that fetches binlog events from the primary and writes them to the local relay log, and a SQL thread (or parallel applier threads) that replays those events. The primary never waits for the replica.
# Primary my.cnf
[mysqld]
server_id = 1
log_bin = /var/log/mysql/mysql-bin.log
binlog_format = ROW
gtid_mode = ON
enforce_gtid_consistency = ON
# Replica my.cnf
[mysqld]
server_id = 2
log_bin = /var/log/mysql/mysql-bin.log
relay_log = /var/log/mysql/relay-bin.log
gtid_mode = ON
enforce_gtid_consistency = ON
read_only = ON
-- On replica: configure and start replication
CHANGE REPLICATION SOURCE TO
SOURCE_HOST = '10.0.1.10',
SOURCE_USER = 'replicator',
SOURCE_PASSWORD = 'strongpassword',
SOURCE_AUTO_POSITION = 1;
START REPLICA;
SHOW REPLICA STATUS\G
Failure Behavior and Data Loss
When the primary crashes, any transactions committed but not yet received by the replica are permanently lost. Replication lag is the enemy here — a replica that is 10 seconds behind at the moment of failure loses 10 seconds of data. Failover is manual unless you layer on an external orchestrator like Orchestrator or ProxySQL with custom scripts.
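Because the loss window tracks replica state, it helps to classify a replica before anyone promotes it. The following is a minimal Python sketch, assuming one row of SHOW REPLICA STATUS has already been fetched as a dictionary (for example through a driver's dictionary cursor). The function name `classify_replica` and the 5-second threshold are illustrative assumptions, not MySQL defaults. Seconds_Behind_Source is only an approximation of how far behind the replica is, so treat the result as a hint rather than a guarantee.

```python
from typing import Any, Mapping


def classify_replica(status: Mapping[str, Any], max_lag_seconds: int = 5) -> str:
    """Classify one row of SHOW REPLICA STATUS (MySQL 8.0.22+ column names).

    `max_lag_seconds` is a hypothetical threshold; pick one that matches your RPO.
    """
    # If either replication thread is stopped, the replica is falling behind.
    if status.get("Replica_IO_Running") != "Yes":
        return "broken: receiver (IO) thread not running"
    if status.get("Replica_SQL_Running") != "Yes":
        return "broken: applier (SQL) thread not running"

    lag = status.get("Seconds_Behind_Source")
    # MySQL reports NULL here when it cannot compute the lag.
    if lag is None:
        return "unknown: lag not reported"
    if int(lag) > max_lag_seconds:
        return f"lagging: {int(lag)}s behind"
    return "ok"


if __name__ == "__main__":
    row = {
        "Replica_IO_Running": "Yes",
        "Replica_SQL_Running": "Yes",
        "Seconds_Behind_Source": 45,
    }
    print(classify_replica(row))
```

A check like this fits naturally into a pre-promotion runbook step: refuse to promote a replica that is classified as broken or lagging unless the operator explicitly accepts the loss.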
When to Use It
Async replication is appropriate for read scaling where data loss is acceptable (analytics replicas, reporting), for disaster recovery with relaxed RPO requirements, and for teams that need simplicity above all else. Do not rely on it as your sole HA mechanism for transactional workloads where data loss is unacceptable.
Semi-Synchronous Replication
Semi-sync is a plugin-based enhancement that requires the primary to wait for at least one replica to acknowledge receipt of the binlog event before returning success to the client. "Acknowledged" means the replica has written the event to its relay log — not that it has applied it — but this is enough to guarantee the event survives a primary failure.
Configuration
-- Source plugin on the primary, replica plugin on each replica
-- (these plugin names require MySQL 8.0.26 or later)
INSTALL PLUGIN rpl_semi_sync_source SONAME 'semisync_source.so';
INSTALL PLUGIN rpl_semi_sync_replica SONAME 'semisync_replica.so';
-- On primary
SET GLOBAL rpl_semi_sync_source_enabled = 1;
SET GLOBAL rpl_semi_sync_source_timeout = 1000; -- ms before fallback to async
SET GLOBAL rpl_semi_sync_source_wait_for_replica_count = 1;
-- On replica
SET GLOBAL rpl_semi_sync_replica_enabled = 1;
# Persist in my.cnf (primary)
[mysqld]
plugin-load-add = semisync_source.so
rpl_semi_sync_source_enabled = 1
rpl_semi_sync_source_timeout = 1000
rpl_semi_sync_source_wait_for_replica_count = 1
The Timeout Fallback Problem
The critical gotcha with semi-sync is the timeout behavior. If a replica falls behind or disconnects, the primary waits up to rpl_semi_sync_source_timeout milliseconds and then falls back to fully asynchronous mode silently. You can check current mode with SHOW STATUS LIKE 'Rpl_semi_sync_source_status'. A value of OFF means you are currently running async without knowing it.
Track Rpl_semi_sync_source_status and Rpl_semi_sync_source_no_tx (the count of commits that were not acknowledged by a replica) in your metrics pipeline. An alert on Rpl_semi_sync_source_status = OFF is essential.
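The monitoring advice above can be sketched as a small check. This Python example assumes the status values have already been collected (for instance from SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_source%') into a dictionary of strings. The function name `semi_sync_alerts` and the idea of passing in the previous counter value are illustrative assumptions about how your metrics pipeline works.

```python
from typing import Dict, List


def semi_sync_alerts(current: Dict[str, str], previous_no_tx: int = 0) -> List[str]:
    """Return alert messages derived from Rpl_semi_sync_source_* status values.

    `previous_no_tx` is the counter value seen at the last check (hypothetical
    bookkeeping done by the caller).
    """
    alerts: List[str] = []

    # Anything other than ON means the primary is replicating asynchronously.
    if current.get("Rpl_semi_sync_source_status", "OFF").upper() != "ON":
        alerts.append("semi-sync is OFF: primary is replicating asynchronously")

    # A growing no_tx counter means commits completed without a replica ack.
    no_tx = int(current.get("Rpl_semi_sync_source_no_tx", "0"))
    if no_tx > previous_no_tx:
        alerts.append(
            f"{no_tx - previous_no_tx} commits were not acknowledged since last check"
        )
    return alerts


if __name__ == "__main__":
    sample = {"Rpl_semi_sync_source_status": "OFF", "Rpl_semi_sync_source_no_tx": "7"}
    for message in semi_sync_alerts(sample, previous_no_tx=2):
        print(message)
```

Alerting on the counter delta as well as the ON/OFF flag catches brief fallbacks that recover between two polls.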
When to Use It
Semi-sync is appropriate when you need near-zero data loss without the operational complexity of Group Replication, when your replicas are on the same network segment (low RTT keeps added latency minimal), and when you have a small number of replicas — typically one or two. It is a meaningful improvement over async for durability but still requires manual failover orchestration.
Group Replication
Group Replication (GR) is MySQL's built-in distributed consensus mechanism, available since MySQL 5.7.17 and significantly improved in 8.0. Rather than a single primary writing to passive replicas, GR uses Paxos-based consensus: every write transaction must be agreed on by a majority of group members and pass certification before it commits. This provides strong consistency guarantees that async and semi-sync cannot match.
Single-Primary vs Multi-Primary Mode
GR supports two modes. Single-primary (recommended for most workloads) routes all writes to one elected primary while other members are read-only replicas. If the primary fails, the group elects a new primary automatically — no external orchestrator needed. Multi-primary allows writes on all members simultaneously, but requires application-level handling of write conflicts and has significant restrictions on DDL and foreign keys.
Setting Up Group Replication
# my.cnf — all members need these settings
[mysqld]
server_id = 1 # unique per member
gtid_mode = ON
enforce_gtid_consistency = ON
binlog_checksum = NONE # required before MySQL 8.0.21; later versions support checksums
log_replica_updates = ON
plugin-load-add = group_replication.so
group_replication_group_name = "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
group_replication_start_on_boot = OFF
group_replication_local_address = "10.0.1.10:33061" # this member's own address
group_replication_group_seeds = "10.0.1.10:33061,10.0.1.11:33061,10.0.1.12:33061"
group_replication_bootstrap_group = OFF
-- Bootstrap the group from the first node only
SET GLOBAL group_replication_bootstrap_group = ON;
START GROUP_REPLICATION;
SET GLOBAL group_replication_bootstrap_group = OFF;
-- Join remaining members
START GROUP_REPLICATION;
-- Check group membership and primary
SELECT MEMBER_ID, MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE
FROM performance_schema.replication_group_members;
Consistency and Failure Behavior
GR uses a certification process: before a transaction commits, its write set is broadcast to all members. If no conflicting transaction is in-flight, it certifies and commits. If the primary fails, surviving members elect a new primary within seconds — typically 5–15 seconds depending on failure detection settings. No data loss occurs for committed transactions because certification ensures quorum acknowledgment.
Failure detection timing is controlled by group_replication_member_expel_timeout and related parameters.
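The certification step described above can be illustrated with a toy model. This Python sketch is a deliberate simplification, not Group Replication's implementation: it uses integer sequence numbers where GR uses GTID sets, and plain string row keys where GR uses hashes of primary keys. The class name `Certifier` and its methods are hypothetical. It shows only the first-committer-wins rule: a transaction is rejected if any row it wrote was changed by a transaction certified after its snapshot.

```python
from typing import Dict, Iterable


class Certifier:
    """Toy write-set certifier: first committer wins on conflicting rows."""

    def __init__(self) -> None:
        self.next_seq = 1
        # Maps a row key to the sequence number of the last certified
        # transaction that wrote it.
        self.last_writer: Dict[str, int] = {}

    def certify(self, snapshot_seq: int, write_set: Iterable[str]) -> bool:
        """Certify a transaction that executed against state `snapshot_seq`.

        Returns True (commit) if no row in the write set was changed by a
        transaction certified after the snapshot, otherwise False (rollback).
        """
        rows = list(write_set)
        if any(self.last_writer.get(row, 0) > snapshot_seq for row in rows):
            return False
        seq = self.next_seq
        self.next_seq += 1
        for row in rows:
            self.last_writer[row] = seq
        return True


if __name__ == "__main__":
    certifier = Certifier()
    # Two transactions start from the same snapshot and touch the same row.
    print(certifier.certify(0, ["accounts:1"]))  # first one is certified
    print(certifier.certify(0, ["accounts:1"]))  # second one conflicts
```

In single-primary mode conflicts like the second call are rare because all writes originate on one member; in multi-primary mode they are the normal way concurrent writers to the same row are resolved.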
When to Use It
Group Replication is appropriate when you need automatic primary election without external tooling, when you require zero data loss for committed transactions, and when you can accept slightly higher write latency due to consensus overhead. It is the foundation for InnoDB Cluster — but you can run it standalone if you want to manage the MySQL Shell and Router layers yourself.
MySQL InnoDB Cluster
InnoDB Cluster is Oracle's complete HA solution that bundles Group Replication (the consistency engine) with MySQL Shell (the management interface) and MySQL Router (the transparent application proxy). Where raw Group Replication requires you to manage failover awareness at the application layer, InnoDB Cluster handles it: MySQL Router monitors group membership and automatically redirects write traffic to the current primary and read traffic to replicas.
Architecture Overview
The three components work together. MySQL Shell provides a JavaScript/Python AdminAPI for cluster lifecycle management: creating clusters, adding instances, and triggering failover. MySQL Router sits between your application and the cluster, exposing a read/write port (default 6446) and a read-only port (default 6447). Router reads the cluster metadata stored on the cluster members, tracks Group Replication membership, and updates its routing table when membership changes. Applications connect to Router and are transparently redirected without any awareness of which member is currently primary.
Creating an InnoDB Cluster with MySQL Shell
# Install MySQL Shell, then connect to the seed instance (shell command)
mysqlsh --uri root@10.0.1.10:3306
// The remaining commands run inside MySQL Shell in JavaScript mode
// Check instance readiness
dba.checkInstanceConfiguration('root@10.0.1.10:3306')
// Configure instance (auto-fixes configuration issues); repeat for each instance
dba.configureInstance('root@10.0.1.10:3306')
// Create the cluster
var cluster = dba.createCluster('ProductionCluster')
// Add secondary members
cluster.addInstance('root@10.0.1.11:3306')
cluster.addInstance('root@10.0.1.12:3306')
// Check cluster status
cluster.status()
# Deploy MySQL Router (run on application servers or dedicated proxy hosts)
mysqlrouter --bootstrap root@10.0.1.10:3306 --directory /etc/mysqlrouter
mysqlrouter --config /etc/mysqlrouter/mysqlrouter.conf &
# Application connects to Router ports:
# Read/Write: 127.0.0.1:6446
# Read-Only: 127.0.0.1:6447
Automatic Failover in Practice
When a primary fails, Group Replication elects a new primary within seconds. MySQL Router detects the topology change (it refreshes its view of the cluster on a short interval, one second or less by default in 8.0) and begins routing writes to the new primary. From the application's perspective, existing connections on port 6446 receive a connection error, which the application must handle with retry logic, and new connections route to the new primary. Total client-visible downtime is typically 10–30 seconds depending on detection timing.
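The retry logic the application needs can be sketched as a small wrapper. This Python example is driver-agnostic and makes several assumptions: the name `run_with_retry` is hypothetical, `ConnectionError` stands in for whatever exception your MySQL driver raises on a dropped connection, and the default delays are illustrative values chosen to span roughly the failover window described above. Only wrap operations that are safe to repeat, such as a whole transaction that did not commit.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def run_with_retry(
    operation: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    attempts: int = 7,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying with capped exponential backoff.

    `operation` should open its own connection (through Router) each time,
    so that a retry lands on whichever member is primary by then.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles on each failure, capped at max_delay.
            sleep(min(max_delay, base_delay * (2 ** attempt)))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the wrapper easy to test without real waiting, and lets callers substitute a jittered delay if many clients would otherwise retry in lockstep.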
// Monitor cluster health
cluster.status()
// Manual switchover (planned maintenance)
cluster.setPrimaryInstance('root@10.0.1.11:3306')
// Restore quorum from a surviving member after the group has lost its majority
// (not needed for an ordinary primary failure, which the group handles itself)
cluster.forceQuorumUsingPartitionOf('root@10.0.1.11:3306')
// Rejoin a recovered member
cluster.rejoinInstance('root@10.0.1.10:3306')
When to Use It
InnoDB Cluster is the right choice when you need a complete, supported HA stack without building custom orchestration, when you want automatic failover with transparent application routing, and when your team is willing to learn MySQL Shell's AdminAPI. It is more operationally complex than standalone async replication but dramatically less complex than building equivalent automation yourself. For production MySQL workloads where downtime is measured in revenue impact, InnoDB Cluster is the current best-practice recommendation.
Comparison Table
| Feature | Async Replication | Semi-Sync | Group Replication | InnoDB Cluster |
|---|---|---|---|---|
| Data Loss on Failover | Yes (lag-dependent) | Near-zero (relay log receipt) | Zero (committed txns) | Zero (committed txns) |
| Failover Type | Manual | Manual | Automatic | Automatic + transparent routing |
| Failover Time | Minutes (manual) | Minutes (manual) | 5–15 seconds | 10–30 seconds (includes router) |
| Write Latency Overhead | None | Low (1 RTT) | Medium (consensus round) | Medium (consensus round) |
| Read Scaling | Yes (replicas) | Yes (replicas) | Yes (secondary members) | Yes (Router port 6447) |
| Multi-Primary Writes | No | No | Yes (with restrictions) | Yes (not recommended) |
| External Orchestrator Needed | Yes (Orchestrator, etc.) | Yes | No | No (MySQL Router handles routing) |
| Minimum Nodes | 2 | 2 | 3 | 3 |
| Network Sensitivity | Low | Low-medium | High | High |
| Operational Complexity | Low | Medium | High | High (but managed via Shell) |
| Best For | Read replicas, DR | Improved durability, small clusters | Self-managed HA | Production HA with full automation |
Choosing the Right Option for Your Workload
Start with Your RPO and RTO Requirements
Recovery Point Objective (RPO) defines how much data loss you can tolerate. Recovery Time Objective (RTO) defines how long your system can be unavailable. If your RPO is zero and your RTO is under 30 seconds, InnoDB Cluster is the only option that reliably delivers both without custom tooling. If you can tolerate 5 minutes of downtime and some data loss, async replication with an orchestrator is far simpler to operate.
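The RPO/RTO reasoning above can be written down as a rough decision helper. This Python sketch encodes only the guidance in this guide; the function name `recommend_topology` and both cut-offs (60 seconds as the point where manual or orchestrated failover becomes too slow, and 1 second as the meaning of "near-zero" loss) are assumptions made for illustration, not vendor recommendations.

```python
def recommend_topology(rpo_seconds: float, rto_seconds: float) -> str:
    """Map RPO/RTO targets to one of the options discussed in this guide."""
    # Assumed cut-off: below this, failover measured in minutes is too slow.
    automatic_failover_needed = rto_seconds < 60

    # Zero tolerated loss, or a tight RTO, points at the consensus-based stack.
    if rpo_seconds == 0 or automatic_failover_needed:
        return "InnoDB Cluster"
    # Assumed meaning of "near-zero": less than one second of tolerated loss.
    if rpo_seconds < 1:
        return "semi-synchronous replication + orchestrator"
    return "asynchronous replication + orchestrator"


if __name__ == "__main__":
    print(recommend_topology(rpo_seconds=0, rto_seconds=30))
    print(recommend_topology(rpo_seconds=60, rto_seconds=300))
```

A helper like this is a starting point for discussion only; network layout and team experience, covered in the following sections, can override its answer.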
Operational Maturity Matters
Group Replication and InnoDB Cluster introduce new failure modes — split-brain scenarios, quorum loss, members expelled for being too far behind — that require operational knowledge to handle correctly. A team that has never run Group Replication will struggle during an incident. If your team is smaller or less experienced with MySQL internals, async replication with Orchestrator for automated failover is often a better risk-adjusted choice than deploying InnoDB Cluster without the expertise to operate it.
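Quorum loss is easier to reason about with the arithmetic written out. The following Python sketch shows standard majority-quorum math as it applies to a group of N members; the function names are hypothetical helpers, not part of any MySQL tooling.

```python
def quorum_size(members: int) -> int:
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can fail while a majority still remains."""
    return members - quorum_size(members)


def has_quorum(total_members: int, reachable_members: int) -> bool:
    """True if the reachable members still form a majority of the group."""
    return reachable_members >= quorum_size(total_members)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, quorum_size(n), tolerated_failures(n))
```

Note that a fourth member does not buy extra fault tolerance over three: both tolerate one failure, which is why odd group sizes are the usual choice.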
Network Architecture Is Non-Negotiable for GR
Group Replication requires reliable, low-latency connectivity between all members. Deploying GR across availability zones with 2–3ms RTT is reasonable. Deploying across regions with 50ms RTT will cause constant performance problems and member expulsions. If your architecture requires geographic distribution, consider async replication for the cross-region leg and Group Replication within each region.
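A back-of-the-envelope calculation shows why round-trip time matters so much. This Python sketch assumes, as a simplification, that each commit costs some local work plus one network round trip to the group; the function name and the 1 ms local commit cost are hypothetical. It bounds the commit rate of a single session that issues commits one after another.

```python
def max_serial_commits_per_second(rtt_ms: float, local_commit_ms: float = 1.0) -> float:
    """Upper bound on commits per second for ONE session committing back to back.

    Simplifying assumption: every commit waits for local work plus one
    round trip between group members.
    """
    return 1000.0 / (local_commit_ms + rtt_ms)


if __name__ == "__main__":
    for rtt in (2.5, 50.0):
        rate = max_serial_commits_per_second(rtt)
        print(f"RTT {rtt} ms -> at most {rate:.1f} commits/s per session")
```

Under these assumptions a session drops from a few hundred commits per second at 2.5 ms RTT to about twenty at 50 ms. More concurrent sessions raise total throughput, but each individual transaction still pays the round trip.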
Application Connection Handling
With async and semi-sync replication, your application likely connects directly to MySQL hosts or through a proxy like ProxySQL. With InnoDB Cluster, applications connect to MySQL Router — which means Router becomes a critical component in your stack. Ensure Router is deployed with redundancy (multiple Router instances) and that your application implements connection retry logic, since a failover will momentarily drop active connections.
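Router redundancy only helps if clients can move between Router instances. One way to sketch that on the client side, in Python, is to probe a list of Router endpoints and use the first that accepts a TCP connection. The endpoint addresses below are made-up examples, the function names are hypothetical, and a TCP probe only proves the port is open, not that the cluster behind it is healthy.

```python
import socket
from typing import Callable, Optional, Sequence, Tuple

Endpoint = Tuple[str, int]


def tcp_reachable(endpoint: Endpoint, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection(endpoint, timeout=timeout):
            return True
    except OSError:
        return False


def pick_router(
    endpoints: Sequence[Endpoint],
    is_up: Callable[[Endpoint], bool] = tcp_reachable,
) -> Optional[Endpoint]:
    """Return the first reachable Router endpoint, or None if all are down."""
    for endpoint in endpoints:
        if is_up(endpoint):
            return endpoint
    return None


if __name__ == "__main__":
    # Hypothetical Router hosts, both exposing the read/write port 6446.
    routers = [("10.0.2.10", 6446), ("10.0.2.11", 6446)]
    print(pick_router(routers))
```

Running Router on each application host and connecting to 127.0.0.1 is another common layout; in that case this kind of endpoint selection is unnecessary because each application instance has its own Router.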
- Async replication is simple but loses data proportional to replication lag at failover time — do not use it as your sole HA mechanism for transactional workloads.
- Semi-sync reduces data loss risk to near-zero but can silently fall back to async mode; monitor Rpl_semi_sync_source_status continuously.
- Group Replication provides automatic primary election and zero data loss for committed transactions, but requires low-latency networks and 3+ members for quorum.
- InnoDB Cluster (Group Replication + MySQL Shell + MySQL Router) is the complete HA stack — it handles failover detection, primary election, and transparent write routing without custom orchestration.
- Choose your replication topology based on RPO, RTO, network architecture, and your team's operational maturity — not default familiarity with async setups.
- Always test failover under load before relying on any topology in production; theoretical behavior and real-world behavior diverge significantly under stress.
Working with JusDB on MySQL Replication
JusDB designs and manages MySQL replication topologies for engineering teams — from simple primary-replica setups to full InnoDB Cluster with automatic failover. Our DBAs match your replication strategy to your actual durability and availability requirements.