High Availability

MySQL Galera Cluster: Setup, Tuning, and Common Pitfalls

Set up MySQL Galera Cluster for synchronous multi-master replication — configuration, split-brain avoidance, and monitoring

JusDB Team
April 18, 2022
13 min read
143 views
MySQL Galera Cluster: Setup, Tuning, and Pitfalls | JusDB

Running MySQL in a single-master setup means every write path goes through one machine — when that machine goes down, so does your application. MySQL Galera Cluster changes the equation by providing synchronous multi-master replication where every node can accept writes and every write is committed across all nodes before the client receives an acknowledgement. That guarantee sounds expensive, and it is, but for workloads where data loss is not an option it is the right trade-off. This post walks through a full production setup, the tuning knobs that matter most, and the failure modes that catch teams off guard.

TL;DR
  • Galera uses the WSRep API and write-set certification to achieve synchronous multi-master replication with no single point of failure.
  • Bootstrap only the first node with --wsrep-new-cluster; all subsequent nodes join via SST or IST.
  • Keep wsrep_cluster_size at an odd number (3, 5) to guarantee quorum and prevent split-brain.
  • Large transactions are the number-one source of certification conflicts — batch writes, not single giant statements.
  • Monitor wsrep_local_recv_queue and wsrep_flow_control_paused continuously; spikes mean a node is lagging behind.
  • Front the cluster with ProxySQL to route reads and writes intelligently and to handle node failures transparently.

What is Galera Cluster?

Galera is a synchronous multi-master replication plugin for MySQL and MariaDB developed by Codership. It exposes the WSRep API (Write-Set Replication API) — a standardised interface between the database engine and the replication provider. When a transaction commits on any node, Galera packages the row-level changes into a write set, broadcasts that write set to every other node, and runs a deterministic certification algorithm to detect conflicts before the commit is acknowledged. If no conflicts exist, all nodes apply the write set and commit in lockstep. If a conflict is detected, the originating node rolls back its local transaction and returns a deadlock error to the application.

Three properties define the model: synchronous replication (no replica lag), true multi-master (any node accepts writes), and automatic membership control (nodes leaving or joining do not require manual reconfiguration). The trade-off is write latency — each commit incurs at least one network round-trip across all nodes, so Galera is unsuitable for write-heavy workloads that require sub-millisecond commit times.

Architecture Overview

A minimal production cluster has three nodes. The cluster uses a group communication system (GCS) — Galera ships with its own called gcomm — to maintain membership and order write sets. Each node maintains a GCache (the Galera cache, a ring buffer on disk) that holds recent write sets. When a joining node is slightly behind it can catch up via Incremental State Transfer (IST), pulling only the missing write sets from the GCache of a donor node. If the joining node is too far behind — or is brand new — it requires a State Snapshot Transfer (SST), which is a full copy of the data.

ConceptDescription
Write SetRow-level changes + metadata packaged for replication
CertificationDeterministic conflict detection before commit
ISTIncremental catch-up from GCache; fast, non-blocking
SSTFull data copy from donor; blocks donor's writes briefly
GCacheRing buffer holding recent write sets for IST
Flow ControlBack-pressure mechanism when a node falls behind

Setting Up a Three-Node Cluster

Prerequisites

  • Three Linux nodes (bare metal or VM) with the same MySQL/MariaDB major version installed
  • Galera library installed: galera-4 (MySQL 8) or galera-3 (MySQL 5.7 / MariaDB 10.x)
  • Firewall rules open: port 3306 (MySQL), 4567 (Galera replication), 4568 (IST), 4444 (SST)
  • xtrabackup or mariabackup installed on all nodes for SST

Core Configuration (my.cnf)

Add the following to /etc/mysql/mysql.conf.d/galera.cnf on every node, adjusting the addresses for each:

text
[mysqld]
# Galera provider
wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so

# Cluster topology
wsrep_cluster_name="prod_galera"
wsrep_cluster_address="gcomm://10.0.1.11,10.0.1.12,10.0.1.13"

# Node identity
wsrep_node_name="db-node-1"          # unique per node
wsrep_node_address="10.0.1.11"       # this node's IP

# SST method
wsrep_sst_method=mariabackup         # or xtrabackup-v2 for MySQL 8

# InnoDB settings required by Galera
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2           # required for Galera
binlog_format=ROW                    # required

# Optional but recommended
wsrep_slave_threads=4                # parallel apply threads
innodb_flush_log_at_trx_commit=0    # relax durability slightly for speed
Warning

innodb_autoinc_lock_mode=2 is mandatory. Modes 0 and 1 use table-level locks for AUTO_INCREMENT which are incompatible with multi-master writes and will cause deadlocks under concurrent inserts.

Bootstrapping the First Node

The first node must be started with a special flag to initialise the cluster. Never run this command on a node that is joining an existing cluster — it will create a new cluster of one, causing split-brain.

text
# On node 1 ONLY — one time only at cluster creation
sudo mysqld_bootstrap
# or via systemd with the Galera-aware unit:
sudo galera_new_cluster

Confirm the node is the sole member:

text
mysql> SHOW STATUS LIKE 'wsrep_cluster_size';
+--------------------+-------+
| Variable_name      | Value |
+--------------------+-------+
| wsrep_cluster_size | 1     |
+--------------------+-------+

Joining Subsequent Nodes

Start MySQL normally on nodes 2 and 3. Galera detects the cluster address in the config, connects via gcomm, and automatically initiates SST or IST:

text
sudo systemctl start mysql

Watch /var/log/mysql/error.log for SST progress. Once joined, verify on any node:

text
mysql> SHOW STATUS LIKE 'wsrep_cluster_size';
+--------------------+-------+
| Variable_name      | Value |
+--------------------+-------+
| wsrep_cluster_size | 3     |
+--------------------+-------+

mysql> SHOW STATUS LIKE 'wsrep_local_state_comment';
+---------------------------+--------+
| Variable_name             | Value  |
+---------------------------+--------+
| wsrep_local_state_comment | Synced |
+---------------------------+--------+

Key Tuning Parameters

GCache Size

GCache is the single most important tuning target for IST success. If a node goes offline and comes back before GCache wraps around, it will use IST (fast, non-blocking). If GCache has overwritten the needed write sets, SST is required (slow, donor is briefly read-only). Size GCache based on your write rate and expected node downtime:

text
wsrep_provider_options="gcache.size=2G; gcache.page_size=512M"

A rule of thumb: GCache size = write throughput (bytes/sec) × maximum acceptable downtime (seconds) × 1.5 safety factor. For a cluster doing 50 MB/s of writes and a 30-minute maintenance window, that is roughly 135 GB — plan accordingly.

Flow Control Threshold

Galera pauses replication (flow control) when a node's receive queue exceeds a threshold:

text
wsrep_provider_options="gcs.fc_limit=100; gcs.fc_factor=0.5"

gcs.fc_limit is the receive queue length at which flow control fires. Raising it allows a slower node more headroom but risks it falling further behind. Do not tune this until you have profiled the actual lag cause — it is often a missing index or a large transaction, not a capacity issue.

Parallel Apply Threads

text
wsrep_slave_threads=8    # match to CPU count; start at 4
Tip

Set wsrep_slave_threads to a value equal to or slightly below the number of vCPUs on the node. Galera's parallel applier is safe — it uses the same dependency tracking as MySQL Group Replication. Higher values reduce apply lag on busy clusters but add per-thread memory overhead.

Common Pitfalls

Split-Brain Scenarios

A split-brain occurs when the cluster loses quorum — more than half of the nodes are unreachable from the remaining nodes — and two sub-clusters continue accepting writes independently. When the partition heals, there is no safe automatic merge; you have divergent data. Galera prevents this by halting writes on any partition that does not hold a majority of wsrep_cluster_size.

Warning

Always deploy an odd number of nodes (3, 5, 7). A two-node cluster has no quorum mechanism — if the network partitions, both nodes halt rather than one continuing. A third node (even a lightweight arbitrator running garbd) breaks the tie without storing data.

text
# Run the Galera Arbitrator daemon on a lightweight node
garbd --group=prod_galera \
      --address="gcomm://10.0.1.11,10.0.1.12,10.0.1.13" \
      --log=/var/log/garbd.log \
      --daemon

Large Transaction Certification Conflicts

Galera certifies write sets at commit time. If two nodes modify the same row during the same time window — even without an explicit conflict at the SQL level — one transaction will be rolled back with ER_LOCK_DEADLOCK. Large transactions that touch many rows dramatically increase the window of conflict. Common culprits: batch deletes, mass updates, and ALTER TABLE on large tables.

Warning

Online DDL (ALTER TABLE) with Galera requires Total Order Isolation (TOI), which blocks all writes cluster-wide for the duration of the DDL. Use pt-online-schema-change or gh-ost with wsrep_OSU_method=RSU for large table alterations to avoid cluster-wide stalls.

Decompose large batches:

text
-- Instead of one giant delete:
-- DELETE FROM events WHERE created_at < '2024-01-01';

-- Use a loop in small batches:
DELETE FROM events WHERE created_at < '2024-01-01' LIMIT 1000;
-- repeat until ROW_COUNT() = 0

SST Donor Blocking

When a node requires SST, the donor node is briefly set to read-only while the snapshot is taken (with xtrabackup or mariabackup this window is very short — under a second — because backup tools use InnoDB's non-blocking backup lock). Avoid SST during peak traffic if possible, and always prefer IST by keeping GCache appropriately sized.

Monitoring wsrep Status

All Galera health metrics are available via SHOW STATUS LIKE 'wsrep_%'. The critical ones to alert on:

text
SHOW STATUS LIKE 'wsrep_%';
VariableHealthy ValueWhat it Signals
wsrep_cluster_sizeExpected node count (e.g., 3)Cluster membership
wsrep_local_state_commentSyncedNode is fully caught up
wsrep_cluster_statusPrimaryNode is in the primary component (has quorum)
wsrep_local_recv_queue0 or near 0Receive queue depth; rising = node is lagging
wsrep_flow_control_pausedClose to 0Fraction of time cluster was paused; >0.1 needs investigation
wsrep_cert_deps_distanceHigher = better parallelismGuides wsrep_slave_threads tuning
wsrep_local_cert_failuresLow / stable rateRising = certification conflicts, large transactions

Export these via mysqld_exporter with the info_schema.wsrep_status collector and graph them in Grafana. Alert when wsrep_cluster_size drops below expected or wsrep_cluster_status != Primary.

text
# Prometheus mysqld_exporter snippet
--collect.info_schema.innodb_metrics \
--collect.global_status \
--collect.global_variables

ProxySQL Integration

ProxySQL is the standard connection proxy for Galera clusters. It monitors wsrep_local_state_comment on each backend and automatically removes nodes that are not Synced from the active pool. Configure a hostgroup for writers (any single node) and a hostgroup for readers (all synced nodes):

text
-- In ProxySQL admin (port 6032):
INSERT INTO mysql_servers (hostgroup_id, hostname, port, weight)
VALUES
  (10, '10.0.1.11', 3306, 1),  -- writer hostgroup
  (20, '10.0.1.11', 3306, 1),  -- reader hostgroup
  (20, '10.0.1.12', 3306, 1),
  (20, '10.0.1.13', 3306, 1);

-- Galera health check user
INSERT INTO mysql_users (username, password, default_hostgroup, transaction_persistent)
VALUES ('app_user', 'secret', 10, 1);

-- Enable Galera-aware monitoring
UPDATE global_variables
SET variable_value='true'
WHERE variable_name='mysql-monitor_galera_healthcheck_interval';

LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
Tip

Set transaction_persistent=1 in ProxySQL's mysql_users table so that once a connection begins a transaction on a writer node, all subsequent queries in that session are pinned to the same backend. Without this, a BEGIN followed by a SELECT could be routed to a reader, breaking read-your-own-writes semantics.

Key Takeaways
  • Galera's WSRep protocol delivers synchronous multi-master replication through write-set certification — every commit is guaranteed on all nodes before the client receives a response.
  • Bootstrap the cluster on exactly one node with --wsrep-new-cluster / galera_new_cluster; all other nodes join by starting normally.
  • Always run an odd number of nodes (minimum 3) or add a garbd arbitrator to ensure quorum and prevent split-brain.
  • Size gcache.size based on write throughput × maximum downtime to maximise IST success and avoid expensive SST operations.
  • Break large batch operations into small transactions to reduce certification conflicts; use pt-online-schema-change or gh-ost for DDL on large tables.
  • Monitor wsrep_local_recv_queue, wsrep_flow_control_paused, and wsrep_cluster_status continuously — these three metrics surface the most common operational issues.
  • Place ProxySQL in front of the cluster with Galera health checking enabled to route traffic away from lagging or failed nodes automatically.

Need Help Running Galera in Production?

Configuring Galera is one thing — keeping it healthy across upgrades, traffic spikes, and network partitions is another. At JusDB, our database engineers have run multi-datacenter Galera clusters at scale and can help you design, tune, and operate your cluster with confidence.

Talk to a JusDB engineer about a health check on your current setup, or explore our MySQL managed services if you would rather hand off the operational burden entirely.

Share this article