MySQL

MySQL Full-Text Search: FULLTEXT Indexes, MATCH AGAINST, and Boolean Mode

A production guide to MySQL full-text search — FULLTEXT index creation, MATCH AGAINST in natural language and boolean modes, the 50% rule, InnoDB vs MyISAM FTS, combining with B-tree indexes, and comparison against Elasticsearch.

JusDB Team
November 14, 2022
11 min read
149 views

Your analytics-heavy SaaS platform has just crossed 500 tenants, each generating millions of events per day, and your single PostgreSQL instance is saturating its write throughput ceiling — VACUUM can barely keep pace, autovacuum is always running, and pg_stat_activity shows a growing queue of waiting locks. You've already maxed out your vertical scaling options: 96 vCPUs, 768 GB RAM, and NVMe SSDs. Adding a read replica helps with SELECT queries, but your INSERT and UPDATE bottleneck lives entirely on the primary. This is the exact scenario where Citus — distributed PostgreSQL with transparent horizontal sharding — was built to solve. Rather than rewriting your application for a completely different database paradigm, Citus extends PostgreSQL itself so your existing SQL, drivers, and ORM code continue to work unchanged while writes fan out across a cluster of worker nodes.

TL;DR
  • Citus extends PostgreSQL with a coordinator/worker architecture, distributing table rows across shards using a hash, range, or append strategy.
  • create_distributed_table('orders', 'tenant_id') is the single command that transforms a local PostgreSQL table into a horizontally sharded distributed table.
  • Choosing the right distribution column — almost always a high-cardinality tenant or entity key — is the most critical design decision; it controls whether JOINs are local or expensive cross-shard.
  • Colocation (distributing two tables on the same column) keeps related rows on the same worker shard, enabling fast local JOINs without network round-trips.
  • Reference tables are small, globally replicated tables copied to every worker so lookups like countries or plans are always local.
  • Azure Cosmos DB for PostgreSQL is the fully managed Citus offering — ideal if you want the distributed model without operating the coordinator/worker cluster yourself.

Background

Citus originated as a commercial product at Citus Data before the company was acquired by Microsoft in 2019. Since then, the core extension has been open-sourced under the PostgreSQL License and is available on GitHub. The project's central insight is that horizontal sharding does not require abandoning the relational model: because Citus is a PostgreSQL extension (not a fork or a separate database engine), it inherits the full SQL dialect, ACID semantics within a shard, all PostgreSQL index types, pg_stat_* views, EXPLAIN ANALYZE, and every client library that speaks the PostgreSQL wire protocol.

The problem Citus targets is write-heavy workloads at a scale that outgrows a single PostgreSQL node. Pure read scaling is well-served by streaming replication and read replicas. But when your primary node is the bottleneck — whether due to write throughput, table size, or working set exceeding RAM — you need a way to split the data itself across multiple machines. Citus accomplishes this by dividing each distributed table into a configurable number of logical shards (default: 32 per table) and placing those shards across worker nodes. Queries that specify the distribution column value are routed directly to the single worker that owns the matching shard. Queries that do not specify it are broadcast to all workers and results are merged at the coordinator.

Citus Architecture

A Citus cluster consists of three logical components: one coordinator node, one or more worker nodes, and the Citus extension installed on all of them.

Coordinator Node

The coordinator is the single entry point for all client connections. It holds the distributed metadata catalog — tables like pg_dist_shard and pg_dist_placement — and is responsible for query planning and routing. When a client executes a query, the coordinator's planner inspects the query's filter predicates, looks up which shard owns the target rows, and either sends the query directly to that worker (single-shard fast path) or generates a distributed execution plan that fans the query out across multiple workers in parallel. The coordinator never stores the actual user data rows for distributed tables; it only stores metadata and reference table copies.

Worker Nodes

Worker nodes are ordinary PostgreSQL instances with the Citus extension installed. Each worker holds a subset of the shards for every distributed table. Shards are implemented as regular PostgreSQL tables with names like orders_102008, so all standard PostgreSQL tooling — pg_dump, VACUUM, individual index creation — works directly on them. Workers execute query fragments sent by the coordinator and return results. Because each worker handles only its own shards, write throughput scales linearly as you add workers.

Distributed Tables vs. Reference Tables

Citus recognizes three table types:

  • Distributed tables — rows are split across shards on multiple workers based on the distribution column. This is the primary scaling mechanism.
  • Reference tables — small lookup tables replicated in full to every worker. JOINs against reference tables are always local, never cross-shard. Typical candidates: countries, currencies, subscription_plans.
  • Local tables — ordinary PostgreSQL tables that only exist on the coordinator. Useful for application metadata not involved in distributed queries.
sql
-- Install Citus on coordinator and all workers
CREATE EXTENSION citus;

-- Register worker nodes from the coordinator
SELECT citus_add_node('worker-1.internal', 5432);
SELECT citus_add_node('worker-2.internal', 5432);
SELECT citus_add_node('worker-3.internal', 5432);

-- Verify cluster topology
SELECT * FROM pg_dist_node;

Setting Up Citus

The setup process follows a predictable pattern: install the extension, register workers, create your schema, then distribute the tables. The critical constraint is that create_distributed_table() must be called before you insert production data — migrating an existing large table requires an online migration strategy using logical replication or the create_distributed_table_concurrently() function introduced in Citus 11.

sql
-- Multi-tenant SaaS schema
CREATE TABLE tenants (
    tenant_id  BIGSERIAL PRIMARY KEY,
    name       TEXT NOT NULL,
    plan       TEXT NOT NULL DEFAULT 'starter',
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE events (
    event_id   BIGSERIAL,
    tenant_id  BIGINT      NOT NULL,
    event_type TEXT        NOT NULL,
    payload    JSONB,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (tenant_id, event_id)
);

CREATE TABLE sessions (
    session_id UUID        NOT NULL DEFAULT gen_random_uuid(),
    tenant_id  BIGINT      NOT NULL,
    user_id    BIGINT      NOT NULL,
    started_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (tenant_id, session_id)
);

-- Distribute on tenant_id so all data for one tenant lives on one worker
SELECT create_distributed_table('events',   'tenant_id');
SELECT create_distributed_table('sessions', 'tenant_id');

-- tenants is small — replicate it to every worker as a reference table
SELECT create_reference_table('tenants');
Warning: Distribution column is immutable

Once a table is distributed, you cannot change its distribution column without re-distributing the entire table. Choose the distribution column carefully before you distribute, ideally before any data is loaded. Changing it on a table with hundreds of millions of rows requires an offline rebuild or a carefully orchestrated swap using logical replication.

sql
-- Create indexes on each shard (runs in parallel across workers)
CREATE INDEX idx_events_tenant_type
    ON events (tenant_id, event_type, created_at DESC);

-- Verify shard placement
SELECT shardid, nodename, nodeport, shardstate
FROM pg_dist_shard_placement
JOIN pg_dist_shard USING (shardid)
WHERE logicalrelid = 'events'::regclass
ORDER BY shardid
LIMIT 10;

Distribution Strategies

Citus supports three distribution methods, each suited to different access patterns. Choosing the wrong one degrades cross-shard query performance significantly.

Hash Distribution

Hash distribution is the default and most common strategy. Citus computes hashint8(distribution_column_value) % shard_count to assign each row to a shard. The result is an even distribution of rows regardless of value skew — a tenant with 10 million events and a tenant with 10 events both hash to a single shard. Hash distribution is optimal when you query by exact distribution column value (e.g., WHERE tenant_id = $1) and when the distribution column has high cardinality.

sql
-- Default hash distribution
SELECT create_distributed_table('events', 'tenant_id');

-- Explicit hash (same as default)
SELECT create_distributed_table('events', 'tenant_id',
    distribution_type := 'hash');

Range Distribution

Range distribution maps value ranges to specific shards. It is useful when your queries frequently include range predicates on the distribution column — for example, time-series data queried by date windows. However, Citus requires you to manually manage shard boundaries for range-distributed tables, which adds operational overhead. For most new deployments, hash distribution with a time column as a secondary index is simpler and equally effective.

Append Distribution

Append distribution is designed for bulk-load, immutable data sets — think event log ingestion where data is written once and never updated. Each bulk load creates a new shard. Append tables are append-only by convention; UPDATE and DELETE will work but perform a broadcast across all shards. This strategy is less commonly used in new Citus deployments since hash distribution with partitioning handles most time-series use cases cleanly.

Tip: Colocation for JOIN performance

When two distributed tables share the same distribution column type and shard count, you can colocate them so matching rows always land on the same worker. This makes JOINs between the two tables purely local — no network data movement required. Pass colocate_with => 'events' to create_distributed_table() to explicitly set colocation. In the multi-tenant pattern, all tables distributed on tenant_id are automatically colocated as long as the column type matches.

sql
-- Explicit colocation: sessions shares shards with events
SELECT create_distributed_table('sessions', 'tenant_id',
    colocate_with := 'events');

-- This JOIN is now local on each worker — no cross-shard movement
SELECT e.event_type, count(*) AS cnt
FROM   events   e
JOIN   sessions s USING (tenant_id)
WHERE  e.tenant_id = 42
  AND  e.created_at >= now() - interval '7 days'
GROUP BY e.event_type;

Performance Characteristics

Citus's performance profile differs meaningfully from single-node PostgreSQL. Understanding where it wins — and where it has overhead — prevents surprises in production.

Parallel Query Execution

Queries that include the distribution column in the WHERE clause are routed to a single shard on one worker. Latency for these queries is comparable to a well-indexed single-node PostgreSQL query — often sub-millisecond for point lookups. Queries that omit the distribution column are broadcast across all workers in parallel and results are merged at the coordinator. Parallel aggregations across 8 workers complete roughly 6–7x faster than the same query on a single node, because the coordinator merges partial aggregates rather than doing full scans itself.

sql
-- Single-shard fast path (tenant_id specified) — routed to one worker
SELECT count(*) FROM events WHERE tenant_id = 42;

-- Cross-shard parallel aggregation — broadcast to all workers
SELECT tenant_id, count(*) AS event_count
FROM   events
WHERE  created_at >= now() - interval '30 days'
GROUP BY tenant_id
ORDER BY event_count DESC
LIMIT 20;

-- Use EXPLAIN to see distributed query plan
EXPLAIN (VERBOSE, COSTS OFF)
SELECT event_type, count(*)
FROM   events
WHERE  tenant_id = 42
GROUP BY event_type;
Important: Cross-shard JOIN cost

JOINs between two distributed tables that are not colocated require data reshuffling across the network — Citus implements this via a repartition join, which creates temporary tables on workers, redistributes rows, performs the local join, and then collects results at the coordinator. This is significantly more expensive than a local join. Always design your schema so frequently joined tables are colocated on the same distribution column.

Write Throughput Scaling

Because each worker handles independent shards, write throughput scales linearly with the number of workers for workloads where writes are distributed evenly. A cluster with 8 workers can sustain roughly 8x the single-node write throughput for INSERT statements that specify the distribution column. Bulk loads using COPY are automatically parallelized by the Citus coordinator — it splits the input stream and routes rows to the appropriate worker in parallel.

sql
-- Bulk ingestion: coordinator splits and routes automatically
COPY events (tenant_id, event_type, payload, created_at)
FROM '/data/events_2026_02.csv'
WITH (FORMAT csv, HEADER true);

-- Monitor per-worker shard sizes to detect hotspots
SELECT nodename,
       pg_size_pretty(sum(shard_size)) AS total_data
FROM citus_shards
GROUP BY nodename
ORDER BY sum(shard_size) DESC;

Citus vs Native PostgreSQL Partitioning

Citus is frequently compared to native PostgreSQL declarative partitioning (often managed by pg_partman). They solve related but distinct problems.

Dimension pg_partman (Native Partitioning) Citus (Distributed)
Write scaling Single node only — all writes go to one machine Horizontal — writes fan across multiple worker nodes
Partition pruning Native planner — very efficient range/list pruning Shard pruning on distribution column predicate
Data archival Excellent — detach and move old partitions to cold storage Append tables support archival; hash tables require re-distribution
Operational complexity Low — managed by pg_partman extension Medium-High — coordinator + multiple workers to operate
JOIN complexity Standard local JOINs Colocated JOINs are local; non-colocated are expensive
Use case fit Time-series retention, large tables on one node Multi-tenant SaaS, write throughput beyond single node

Native partitioning and Citus are complementary: Citus distributed tables can themselves be partitioned tables. A common production pattern combines pg_partman-managed time-based partitioning on the events table with Citus distribution on tenant_id. Each worker holds a subset of tenant shards, and within each shard the data is further partitioned by month, enabling both parallel writes and efficient partition pruning for time-range queries.

Citus vs. Vitess for MySQL: Vitess is the horizontally sharded MySQL proxy used at YouTube and PlanetScale. The architectural comparison is instructive: Vitess sits in front of MySQL as a proxy, shards via a VSchema configuration, and requires application awareness of keyspace/shard topology for some operations. Citus sits inside PostgreSQL as an extension, is transparent to most SQL clients, and leverages PostgreSQL's native planner for distributed query optimization. If your stack is already PostgreSQL, Citus is the more natural choice. If you are on MySQL and need horizontal sharding, Vitess is the established solution. Citus does not have a MySQL equivalent in the same architectural sense.

Tip: Azure Cosmos DB for PostgreSQL (managed Citus)

If you want the Citus distributed model without operating the coordinator and worker fleet yourself, Azure Cosmos DB for PostgreSQL is the fully managed offering. It handles node provisioning, Citus version upgrades, connection pooling via PgBouncer, and automated backups. You interact with it through a standard PostgreSQL connection string. Worker node count can be scaled up (and down) without downtime. For teams without dedicated database infrastructure engineers, this is often the fastest path to production with Citus.

Key Takeaways

Key Takeaways
  • Citus adds horizontal sharding to PostgreSQL as an extension — your SQL, drivers, and ORM code continue to work without rewriting for a new database paradigm.
  • The distribution column is the most important design decision. For multi-tenant SaaS, tenant_id is almost always the right choice — it ensures all data for one tenant is co-resident on a single worker and enables fast, local query execution.
  • Colocation is the mechanism that makes distributed JOINs practical. Distribute all frequently joined tables on the same column with the same shard count to keep JOINs local.
  • Reference tables (small lookup tables replicated to every worker) eliminate the most common source of cross-shard data movement for lookup JOINs.
  • Native PostgreSQL partitioning (pg_partman) and Citus are complementary: combine time-based partitioning with distributed sharding for the best of both — retention management and horizontal write throughput.
  • For managed infrastructure, Azure Cosmos DB for PostgreSQL delivers the full Citus feature set without the operational burden of running a coordinator/worker cluster.

Working with JusDB on PostgreSQL Scaling

Deciding whether Citus is the right tool — versus vertical scaling, native partitioning, read replicas, or a different distributed database entirely — depends on your specific write throughput numbers, tenant count, query patterns, and team's operational capacity. The cost of getting the distribution column wrong after you have hundreds of gigabytes in production is significant. JusDB engineers help teams evaluate these tradeoffs before committing to a distributed architecture.

Our PostgreSQL managed services include distributed Citus cluster design, distribution column analysis, colocation schema review, shard rebalancing after worker additions, and ongoing performance monitoring. If you are already running PostgreSQL and are approaching single-node limits, we can audit your workload and provide a concrete scaling recommendation — whether that is Citus, pg_partman partitioning, connection pooling via PgBouncer, or a combination.

Contact JusDB to discuss your PostgreSQL scaling requirements with an engineer who has deployed Citus in production.

Related reading:

Share this article