MySQL InnoDB Cluster Deployment: Step-by-Step Guide

At 2:47 AM on a Black Friday, a fintech team's primary MySQL node went down mid-transaction. Their traditional async replica did not automatically promote — someone had to SSH in, run STOP SLAVE, update application connection strings, and restart services while 40,000 payment sessions timed out. The outage lasted 23 minutes. MySQL InnoDB Cluster exists precisely to prevent that scenario: it provides automatic primary election, connection routing abstraction, and a quorum-based consensus protocol that handles single-node failures without human intervention. This guide covers the full deployment — Group Replication, MySQL Router, and MySQL Shell — with the production detail that the official documentation buries in footnotes.

TL;DR

InnoDB Cluster is three components working together: MySQL Group Replication handles replication and consensus, MySQL Router abstracts connection routing, and MySQL Shell provides the management API.
Prerequisites before setup: GTID mode enabled, binary logging on, MySQL 8.0.11+ (8.0.27+ recommended for production), and matching server versions across all nodes.
Use dba.configureInstance() to pre-validate every node, dba.createCluster() on the primary, and cluster.addInstance() for each additional member.
MySQL Router bootstraps against the cluster metadata and exposes port 6446 for read/write traffic and port 6447 for read-only traffic — your application points to the router, never directly to nodes.
Single-primary mode is the default and strongly recommended for production; multi-primary is available but introduces certification-based conflict detection that adds latency.
After any node failure, use cluster.rejoinInstance() to bring it back; after a total outage, dba.rebootClusterFromCompleteOutage() is the recovery path.

InnoDB Cluster Architecture: The Three-Component Stack

Understanding why each component exists prevents the most common misconfigurations. InnoDB Cluster is not a single product — it is three independently deployable components that the MySQL team designed to interoperate.

MySQL Group Replication

Group Replication is the replication engine underneath the cluster. Unlike traditional async replication, it uses a distributed consensus protocol (Paxos-based) to certify transactions before they commit. In single-primary mode, only one member accepts writes at a time; all other members apply replicated transactions and serve reads. Automatic primary election fires when the primary is unreachable by a majority of members — typically within 5 to 30 seconds depending on your group_replication_member_expel_timeout setting.

MySQL Shell

MySQL Shell is the management interface. It wraps the Group Replication SQL primitives behind a JavaScript or Python AdminAPI. dba.createCluster(), cluster.addInstance(), and cluster.status() are Shell commands that translate to dozens of underlying SQL statements — configuration validation, metadata writes, and replication channel setup — that would be painful to run manually.

MySQL Router

MySQL Router sits between your application and the cluster nodes. It reads the cluster's metadata to know which node is currently the primary, and routes incoming connections to the correct target. When primary election occurs, the router detects the topology change and redirects new connections within seconds. Your application connection string never changes.

Prerequisites: Getting Every Node Ready

Skipping the prerequisite phase is the single most common reason InnoDB Cluster deployments fail. Run this checklist on every node before touching MySQL Shell.

MySQL Version and Server IDs

All members must run MySQL 8.0.11 or later. Production clusters should be on 8.0.27+ where Group Replication's single-consensus-leader mode significantly reduces write latency. Each node must have a unique server_id.

ini

# /etc/mysql/mysql.conf.d/mysqld.cnf — node 1
[mysqld]
server_id                    = 1
gtid_mode                    = ON
enforce_gtid_consistency     = ON
log_bin                      = mysql-bin
binlog_format                = ROW
log_replica_updates          = ON
plugin_load_add              = group_replication.so
group_replication_group_name = "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
group_replication_local_address = "node1:33061"
group_replication_group_seeds  = "node1:33061,node2:33061,node3:33061"
group_replication_start_on_boot = OFF
group_replication_bootstrap_group = OFF

Warning

Set group_replication_bootstrap_group = OFF in the config file and only enable it transiently in the shell when bootstrapping. Leaving it ON in a running cluster causes a split-brain: the node will elect itself as the sole primary regardless of what the rest of the group has decided.

Network Requirements

Group Replication requires TCP connectivity on two ports per node: the standard MySQL port (3306) for client and replication traffic, and the GR communication port (33061 by default). Both must be open bidirectionally between all cluster members. Verify with:

bash

nc -zv node2 3306
nc -zv node2 33061

Create the Cluster Admin Account

MySQL Shell's AdminAPI requires a dedicated account with a specific privilege set. Create it identically on all three nodes before running any Shell commands:

sql

CREATE USER 'clusteradmin'@'%' IDENTIFIED BY 'StrongPassw0rd!';
GRANT ALL PRIVILEGES ON *.* TO 'clusteradmin'@'%' WITH GRANT OPTION;
FLUSH PRIVILEGES;

Tip

dba.configureInstance() can create this account automatically if you pass clusterAdmin and clusterAdminPassword options. Use the automatic path in new deployments — it sets the exact privilege set the AdminAPI expects and avoids the "insufficient privileges" errors that trip up manual account creation.

Step-by-Step Cluster Setup with MySQL Shell

Step 1: Validate and Configure Each Node

Run this against every node before creating the cluster. MySQL Shell will check GTID mode, binary log settings, server ID uniqueness, and dozens of other parameters, then offer to fix any issues it finds:

javascript

// Connect to node1 in MySQL Shell
\connect clusteradmin@node1:3306

// Validate and autoconfigure the instance
dba.configureInstance('clusteradmin@node1:3306', {
  clusterAdmin: 'clusteradmin',
  clusterAdminPassword: 'StrongPassw0rd!',
  restart: true
});

Repeat for node2 and node3. The restart: true option allows the shell to restart the MySQL service if configuration changes require it. After all three nodes pass without errors, proceed to cluster creation.

Step 2: Create the Cluster on the Primary Node

javascript

// Connect to the intended primary
var cluster = dba.createCluster('ProductionCluster', {
  ipAllowlist: '10.0.1.0/24',
  exitStateAction: 'READ_ONLY',
  memberWeight: 50,
  expelTimeout: 5
});

The exitStateAction: 'READ_ONLY' setting is critical for production: if a member loses contact with the group, it transitions to super-read-only mode instead of continuing to accept writes in isolation. This is the default in MySQL 8.0.12+ but always set it explicitly.

Step 3: Add the Remaining Members

javascript

cluster.addInstance('clusteradmin@node2:3306', {
  recoveryMethod: 'clone',
  memberWeight: 50
});

cluster.addInstance('clusteradmin@node3:3306', {
  recoveryMethod: 'clone',
  memberWeight: 50
});

The recoveryMethod: 'clone' option uses MySQL Clone Plugin to transfer a full snapshot from the donor to the joining member. This is far more reliable than the default incremental recovery for nodes joining a cluster with significant transaction history. After each addInstance() completes, the new member will be online and replicating.

Step 4: Verify Cluster Status

javascript

cluster.status();

The output shows each member's role (PRIMARY or SECONDARY), health status (OK), and replication lag. A healthy three-node cluster looks like:

json

{
  "clusterName": "ProductionCluster",
  "defaultReplicaSet": {
    "name": "default",
    "primary": "node1:3306",
    "ssl": "REQUIRED",
    "status": "OK",
    "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
    "topology": {
      "node1:3306": { "memberRole": "PRIMARY", "status": "ONLINE" },
      "node2:3306": { "memberRole": "SECONDARY", "status": "ONLINE" },
      "node3:3306": { "memberRole": "SECONDARY", "status": "ONLINE" }
    }
  }
}

MySQL Router Deployment

Bootstrap the Router

Bootstrap connects the router to the cluster metadata schema and generates its configuration automatically. Run this on the application server or on a dedicated router host:

bash

mysqlrouter --bootstrap clusteradmin@node1:3306 \
  --user=mysqlrouter \
  --directory=/etc/mysqlrouter/production \
  --conf-use-gr-notifications

The --conf-use-gr-notifications flag enables Group Replication change notifications so the router receives topology updates via push rather than polling, reducing failover detection time from the default polling interval to under a second.

Connection Ports and Routing Modes

After bootstrap, the router exposes four ports by default. The two you will use in production:

Port 6446 — Read/write endpoint. Routes all connections to the current primary. Your application's write connection pool points here.
Port 6447 — Read-only endpoint. Routes connections to secondary members in round-robin. Your read replicas, reporting queries, and read-heavy microservices point here.

ini

# /etc/mysqlrouter/production/mysqlrouter.conf (auto-generated, key section)
[routing:ProductionCluster_rw]
bind_address  = 0.0.0.0
bind_port     = 6446
destinations  = metadata-cache://ProductionCluster/?role=PRIMARY
routing_strategy = first-available

[routing:ProductionCluster_ro]
bind_address  = 0.0.0.0
bind_port     = 6447
destinations  = metadata-cache://ProductionCluster/?role=SECONDARY
routing_strategy = round-robin-with-fallback

Tip

Deploy at least two router instances on separate hosts and put them behind a hardware or software load balancer (HAProxy, AWS NLB). MySQL Router itself is stateless — it does not hold session state — so load balancing across multiple router instances is safe and eliminates the router as a single point of failure.

Monitoring Cluster Health

Checking Status and Topology

javascript

// Summary view
cluster.status();

// Detailed topology and configuration
cluster.describe();

// Verbose output with per-member replication lag
cluster.status({ extended: 1 });

For automated monitoring, you can query the cluster metadata schema directly from any application:

sql

-- Current primary member
SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
FROM performance_schema.replication_group_members
WHERE MEMBER_ROLE = 'PRIMARY';

-- All members with replication lag
SELECT
  m.MEMBER_HOST,
  m.MEMBER_ROLE,
  m.MEMBER_STATE,
  s.COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE AS applier_queue_depth
FROM performance_schema.replication_group_members m
LEFT JOIN performance_schema.replication_group_member_stats s
  ON s.MEMBER_ID = m.MEMBER_ID;

Handling Node Failures and Recovery

Automatic Failover

When the primary loses contact with a majority of members (or vice versa), Group Replication triggers an automatic election. The member with the highest memberWeight (configured at addInstance() time) wins the election, becomes the new primary, and transitions to read/write mode. MySQL Router detects this change and starts routing writes to the new primary within seconds. No application restart is required.

Rejoining a Failed Node

When the failed node comes back online, it will not automatically rejoin the group. Bring it back with:

javascript

cluster.rejoinInstance('clusteradmin@node1:3306');

If the node has fallen too far behind (its binary log position is outside the group's purged GTID set), rejoinInstance() will fail with a gap error. In that case, use clone-based recovery:

javascript

cluster.rejoinInstance('clusteradmin@node1:3306', {
  recoveryMethod: 'clone'
});

Updating Cluster Metadata After Topology Changes

After manually adding or removing nodes outside the Shell, or after a node's hostname changes, the cluster metadata may diverge from reality. Rescan reconciles it:

javascript

cluster.rescan();

Split-Brain Protection and Full Outage Recovery

Quorum Requirements

A three-node cluster requires two members to be reachable to form a quorum. If only one node is running, the cluster is quorum-less and will reject all writes — including from the surviving node. This is intentional: the cluster would rather be unavailable than allow a single node to accept writes that may conflict with writes accepted by an isolated partition.

Warning

Never force quorum (cluster.forceQuorumUsingPartitionOf()) unless you are certain the other members are genuinely unreachable and not just partitioned. Forcing quorum on the wrong partition while the other partition is still processing writes will create a split-brain that cannot be automatically resolved — you will need to discard one partition's transactions manually.

Recovering from a Complete Outage

If all three nodes go down simultaneously — data center power failure, network partition, coordinated OS crash — none of them will auto-bootstrap because none can confirm they have the most recent transaction set. Recovery requires explicitly identifying the most advanced node and rebooting from it:

javascript

// Identify the node with the highest GTID executed set
// Run on each node to compare:
// SELECT @@global.gtid_executed;

// Then on the most advanced node:
dba.rebootClusterFromCompleteOutage('ProductionCluster', {
  primary: 'node1'
});

After the cluster reboots, the other nodes will rejoin as secondaries and receive missing transactions via clone or incremental recovery.

InnoDB Cluster vs. Galera Cluster

Feature	InnoDB Cluster (Group Replication)	Galera Cluster
Write topology	Single-primary (default) or multi-primary	Multi-primary (all nodes writable)
Conflict detection	Certification-based (optimistic, post-commit)	Certification-based (optimistic, pre-commit)
Management API	MySQL Shell AdminAPI	No built-in; third-party tools (ClusterControl)
Connection routing	MySQL Router (official, metadata-aware)	HAProxy or ProxySQL (manual config)
Storage engine support	InnoDB only	InnoDB only (MyISAM unsupported)
Oracle support	Yes (native MySQL product)	Community (Percona XtraDB, MariaDB)
Cross-WAN latency sensitivity	High (consensus on every commit)	High (flow control triggers under lag)
DDL handling	Total order isolation (TOI) on MySQL 8.0+	Total order isolation (TOI)

For most production MySQL shops already on 8.0+, InnoDB Cluster is the lower-friction choice: it ships with MySQL, uses the same tooling, and the single-primary default avoids the write conflict overhead that plagues Galera in write-heavy workloads. Galera's multi-primary model is compelling only if you genuinely need writes distributed across geographic regions with no single bottleneck.

Backup Strategies for InnoDB Cluster

Take Backups from a Secondary

Always run backups against a secondary member to avoid impacting primary write throughput. MySQL Enterprise Backup and Percona XtraBackup both support hot backups from Group Replication members.

bash

# Percona XtraBackup from a secondary node
xtrabackup \
  --backup \
  --host=node2 \
  --user=backupuser \
  --password=BackupPass! \
  --target-dir=/backup/$(date +%Y%m%d_%H%M%S) \
  --compress \
  --parallel=4

GTID Consistency for Cluster Restores

When restoring a backup to a new node and adding it to the cluster, the restored node's GTID set must be a subset of the cluster's current GTID set — or you must reset it explicitly:

sql

-- On the restored node, before rejoining:
RESET MASTER;
SET GLOBAL gtid_purged = '';

Then use cluster.addInstance() with recoveryMethod: 'clone' to let the cluster bring it fully up to date from a live donor rather than relying on the backup's GTID state.

Tip

Enable point-in-time recovery by retaining binary logs on all cluster members for at least 7 days. Set binlog_expire_logs_seconds = 604800 in mysqld.cnf. Group Replication's purge mechanism can aggressively expire logs if a member is offline — ensure your slowest-rejoining member scenario still has the binary logs it needs for incremental recovery before clone becomes mandatory.

Key Takeaways

Run dba.configureInstance() against every node before cluster creation — it catches the GTID, binary log, and server ID mismatches that will silently break Group Replication at the worst possible time.
Set exitStateAction: 'READ_ONLY' explicitly on cluster creation so that any member that loses group contact enters super-read-only mode instead of accepting isolated writes.
Deploy MySQL Router on at least two hosts with a load balancer in front — the router is stateless and free to scale horizontally, and it is the only component that abstracts your application from the cluster topology.
Use port 6446 for read/write connections and port 6447 for read-only connections; never hardcode node IP addresses in application connection strings.
Recover failed nodes with cluster.rejoinInstance() and use recoveryMethod: 'clone' when the node's binary logs have been purged — do not attempt manual replication catch-up.
After a total cluster outage, identify the node with the highest gtid_executed set and use dba.rebootClusterFromCompleteOutage() with that node as the forced primary.
Take all backups from a secondary member using XtraBackup or MySQL Enterprise Backup, retain binary logs for at least 7 days, and test full restore-to-cluster procedures quarterly — not on the night of an incident.

Working with JusDB on MySQL InnoDB Cluster

JusDB manages MySQL InnoDB Cluster for engineering teams who need production-grade high availability without a full-time DBA team dedicated to it. We handle the full lifecycle: initial cluster deployment and Router configuration, ongoing health monitoring, failover testing, backup validation, and 24/7 incident response when a node goes down at 2:47 AM. Our team has deployed and recovered InnoDB Clusters across fintech, logistics, and SaaS platforms — we have seen every failure mode this guide describes, and then some.

If you are evaluating InnoDB Cluster for an upcoming migration from async replication, planning a Galera-to-InnoDB Cluster move, or dealing with a cluster in a degraded state right now, reach out.

Explore JusDB MySQL Consulting → | Talk to a MySQL DBA

Related reading:

A Practical Guide to Deploying and Managing MySQL InnoDB Cluster

InnoDB Cluster Architecture: The Three-Component Stack

MySQL Group Replication

MySQL Shell

MySQL Router

Prerequisites: Getting Every Node Ready

MySQL Version and Server IDs

Network Requirements

Create the Cluster Admin Account

Step-by-Step Cluster Setup with MySQL Shell

Step 1: Validate and Configure Each Node

Step 2: Create the Cluster on the Primary Node

Step 3: Add the Remaining Members

Step 4: Verify Cluster Status

MySQL Router Deployment

Bootstrap the Router

Connection Ports and Routing Modes

Monitoring Cluster Health

Checking Status and Topology

Handling Node Failures and Recovery

Automatic Failover

Rejoining a Failed Node

Updating Cluster Metadata After Topology Changes

Split-Brain Protection and Full Outage Recovery

Quorum Requirements

Recovering from a Complete Outage

InnoDB Cluster vs. Galera Cluster

Backup Strategies for InnoDB Cluster

Take Backups from a Secondary

GTID Consistency for Cluster Restores

Working with JusDB on MySQL InnoDB Cluster

Share this article

Need Expert Help?

MySQL Performance Tuning

MySQL Consulting

MySQL Migration

MySQL High Availability

MySQL on Kubernetes

MySQL Support

Related Articles

MySQL Group Replication (2026): Multi-Primary HA, Conflict Resolution & Production Setup

MariaDB Galera Cluster Production Guide: Mariabackup, MaxScale & Schema Strategy

Patroni with Consul DCS: PostgreSQL HA Without etcd