PostgreSQL

PostgreSQL Full-Text Search: tsvector, GIN Indexes, and ts_rank in Production

A production guide to PostgreSQL full-text search — building tsvector generated columns, GIN indexes, multi-column search with setweight, ts_rank scoring, websearch_to_tsquery, and comparison against Elasticsearch.

JusDB Team
January 24, 2023
13 min read
259 views

A mid-sized e-commerce platform came to JusDB with a product catalog of 4.2 million SKUs. Their search endpoint was hitting a LIKE '%keyword%' query against a single TEXT column, bypassing every index on the table and triggering sequential scans that pegged CPU at 100% for three to eight seconds per request during peak hours. Adding an index on the column helped not at all — PostgreSQL cannot use a B-tree index for a leading-wildcard LIKE pattern. They were one marketing campaign away from a full outage.

We replaced the LIKE query with PostgreSQL's built-in full-text search engine in a single afternoon. The same product search query now runs in under 12ms, uses a GIN index on a pre-computed tsvector column, and returns ranked results sorted by relevance — something the LIKE approach could never provide. The index took 18 minutes to build and added 380 MB of storage on a 22 GB table.

PostgreSQL ships a complete full-text search system — lexer, stemmer, stop-word lists, language dictionaries, relevance ranking, and query syntax — with no extensions required. Most engineering teams reach for Elasticsearch when a LIKE query stops scaling, without realizing the solution is already in the database they are already running.

This guide covers everything you need to deploy production-grade full-text search inside PostgreSQL: the tsvector and tsquery types, GIN vs GiST indexes, ranking functions, multi-column search, and an honest comparison against Elasticsearch and pg_trgm.

TL;DR
  • tsvector is a pre-processed, normalized list of lexemes; tsquery is a parsed search query. The @@ operator matches them.
  • Store a pre-computed tsvector column and keep it current with a trigger or generated column — do not call to_tsvector() inline on every query row.
  • GIN indexes make full-text search fast (sub-20ms on millions of rows); GiST is smaller but slower — use GIN for production workloads.
  • plainto_tsquery() handles raw user input safely; websearch_to_tsquery() supports Google-style syntax (-exclude, "exact phrase"); to_tsquery() requires pre-formatted input.
  • ts_rank() scores results by term frequency; ts_headline() generates keyword-highlighted snippets for display.
  • Language configuration matters critically — indexing with 'english' and querying with 'simple' produces zero matches for stemmed words.

Background

PostgreSQL has shipped a full-text search subsystem since version 8.3 (2008). Unlike an external search engine, it runs inside the database process, shares the same transaction and MVCC semantics as your relational data, and requires no separate infrastructure to deploy, monitor, or keep in sync.

The system is built around two specialized data types:

  • tsvector — a sorted list of normalized lexemes (stems) with their positional information. Produced by parsing and normalizing a text string: stop words are removed, words are reduced to their root form ("running" becomes "run"), and positions are recorded for phrase-distance queries.
  • tsquery — a parsed search query consisting of lexemes combined with Boolean operators (& for AND, | for OR, ! for NOT) and optionally phrase operators (<-> for adjacent, <2> for within-2-positions).

The match operator @@ returns true when a tsquery matches a tsvector. This is the operator GIN and GiST indexes are built to accelerate.

Full-text search in PostgreSQL is language-aware. The text search configuration — 'english', 'french', 'german', 'simple', etc. — determines which dictionary is used for stemming and which words are treated as stop words. This is both the most powerful and the most commonly misunderstood aspect of PostgreSQL FTS.

How It Works

When you call to_tsvector('english', 'The quick brown foxes are running fast'), PostgreSQL applies the english text search configuration in sequence:

  1. Tokenize — split into individual tokens: words, numbers, URLs, email addresses, etc.
  2. Normalize — apply the dictionary: "foxes" stems to "fox", "running" stems to "run", "are" and "The" are stop words and are removed entirely.
  3. Store with positions — the result is 'brown':3 'fast':7 'fox':4 'quick':2 'run':6. Stop words preserve position gaps so phrase-distance queries remain accurate.

The result is a compact, indexed representation of the text's semantic content. Two documents with different surface forms but the same root words — "The system is running" and "Systems run" — produce overlapping tsvector values and will both match a query for "run".

Warning

Stop words are silently discarded. A query for to_tsquery('english', 'the') produces an error because "the" has no lexeme after normalization. Use plainto_tsquery() or websearch_to_tsquery() for user-provided input — they handle stop words gracefully by simply dropping them from the query rather than raising an error.

The four query-building functions serve different purposes:

  • to_tsquery('english', 'run & fast') — expects pre-formatted Boolean query syntax. Raises an error on stop words. Use for programmatically constructed queries.
  • plainto_tsquery('english', 'quick brown fox') — converts plain text to an AND query. Safe for direct user input. Equivalent to 'quick' & 'brown' & 'fox'.
  • phraseto_tsquery('english', 'quick brown fox') — requires the terms to appear in adjacent positions. Useful for exact-phrase matching.
  • websearch_to_tsquery('english', 'quick -slow "brown fox"') — parses Google-style syntax: quoted phrases, minus-prefix exclusion, OR keywords. The safest and most user-friendly function for search boxes.

Configuration and Setup

Step 1: Understand Your Schema

Start with a representative table. This example uses a product catalog:

sql
CREATE TABLE products (
    id          BIGSERIAL PRIMARY KEY,
    sku         TEXT NOT NULL UNIQUE,
    name        TEXT NOT NULL,
    description TEXT,
    brand       TEXT,
    category    TEXT,
    tags        TEXT[],
    is_active   BOOLEAN DEFAULT TRUE,
    created_at  TIMESTAMPTZ DEFAULT now()
);

Step 2: Add a Pre-Computed tsvector Column

Calling to_tsvector() inline on every query row is the most common performance mistake. For a 4-million-row table, it means re-parsing and re-stemming every row on every query — even with a GIN index, the index cannot be used if the expression does not match the indexed expression exactly. The correct approach is a stored, generated column:

sql
-- PostgreSQL 12+: use a generated column (automatically maintained)
ALTER TABLE products
    ADD COLUMN search_vector tsvector
    GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(name, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(brand, '')), 'B') ||
        setweight(to_tsvector('english', coalesce(description, '')), 'C') ||
        setweight(to_tsvector('english', coalesce(category, '')), 'D')
    ) STORED;

setweight() assigns a relevance weight to each source column: 'A' is highest, 'D' is lowest. When ts_rank() scores results, matches in the product name outrank matches in the description. The || operator concatenates multiple tsvector values.

Tip

On PostgreSQL 11 and earlier, generated columns are not available. Use a trigger instead — create a BEFORE INSERT OR UPDATE trigger that sets NEW.search_vector using the same expression. Performance and query-plan behavior are identical; only the maintenance mechanism differs.

Step 3: Build the GIN Index

sql
-- GIN index on the pre-computed column
-- This is what makes full-text search fast
CREATE INDEX products_search_vector_gin_idx
    ON products
    USING gin (search_vector);

-- For large tables, set maintenance_work_mem before building
-- to avoid slow disk-based sort steps:
SET maintenance_work_mem = '1GB';
CREATE INDEX products_search_vector_gin_idx
    ON products USING gin (search_vector);

On the 4.2-million-row product table, this index build took 18 minutes with maintenance_work_mem = '2GB' and produced a 380 MB index on disk. After the build, search queries execute in 8–15ms.

Step 4: Query with Ranking

sql
-- Basic full-text search with ranking
SELECT
    id,
    name,
    brand,
    ts_rank(search_vector, query) AS rank
FROM products,
     websearch_to_tsquery('english', 'wireless noise cancelling headphones') query
WHERE
    is_active = TRUE
    AND search_vector @@ query
ORDER BY rank DESC
LIMIT 20;
sql
-- With ts_headline for result snippets
SELECT
    id,
    name,
    ts_headline(
        'english',
        description,
        websearch_to_tsquery('english', 'wireless noise cancelling headphones'),
        'StartSel=, StopSel=, MaxWords=35, MinWords=15, MaxFragments=2'
    ) AS excerpt,
    ts_rank(search_vector, websearch_to_tsquery('english', 'wireless noise cancelling headphones')) AS rank
FROM products
WHERE
    is_active = TRUE
    AND search_vector @@ websearch_to_tsquery('english', 'wireless noise cancelling headphones')
ORDER BY rank DESC
LIMIT 10;

ts_headline() is intentionally called on the raw description column, not the tsvector — it operates on the original text to produce human-readable snippets with highlighted terms. The options string controls snippet length, the highlight tags, and whether to return multiple fragments.

Step 5: Multi-Column Search with Custom Weights

The generated column approach above already handles multi-column search. For cases where you need dynamic weight adjustment at query time:

sql
-- Construct a weighted tsvector on the fly for dynamic weight tuning
-- (only necessary if weights need to change per query; use generated column otherwise)
SELECT
    id,
    name,
    ts_rank(
        setweight(to_tsvector('english', coalesce(name, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(description, '')), 'C'),
        plainto_tsquery('english', $1)
    ) AS rank
FROM products
WHERE
    (
        setweight(to_tsvector('english', coalesce(name, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(description, '')), 'C')
    ) @@ plainto_tsquery('english', $1)
ORDER BY rank DESC
LIMIT 20;
-- Note: this form does NOT use the pre-built GIN index on search_vector.
-- Use only for prototyping or infrequent queries on small tables.
Important

The GIN index is only used when the WHERE clause references the exact same expression that was indexed. If you indexed search_vector (the generated column), you must query WHERE search_vector @@ query. Querying WHERE to_tsvector('english', name) @@ query will not use the index — it will sequential scan. Run EXPLAIN (ANALYZE, BUFFERS) to verify index usage before deploying.

Step 6: Custom Text Search Configurations

For domain-specific content — medical records, legal documents, source code — the default English dictionary may stem words incorrectly or treat important terms as stop words. Create a custom configuration:

sql
-- Create a custom text search configuration based on English
-- but with no stop words (useful for product SKUs, part numbers, codes)
CREATE TEXT SEARCH CONFIGURATION product_search (COPY = english);

-- Replace the English stop words dictionary with a simple one (no stop words)
ALTER TEXT SEARCH CONFIGURATION product_search
    ALTER MAPPING FOR asciiword, word
    WITH english_stem;

-- Use the custom configuration
SELECT to_tsvector('product_search', 'the quick brown fox')::text;
-- Result includes 'the' as a lexeme instead of discarding it

Performance and Best Practices

Full-text search performance in PostgreSQL is largely determined by index type selection and whether the pre-computed vector is being used correctly.

GIN vs GiST: The Real Trade-off

Both GIN (Generalized Inverted Index) and GiST (Generalized Search Tree) support the @@ operator, but they have fundamentally different characteristics:

  • GIN stores an inverted index: for each lexeme, a sorted list of all document IDs containing that lexeme. Lookups are fast because finding documents containing "fox" is a direct index entry lookup. Build is slower than GiST and the index is larger (typically 2–4x the tsvector column size), but query performance is 3–10x better than GiST for typical FTS workloads.
  • GiST stores a lossy, balanced tree structure. It is faster to build and smaller on disk, but queries require re-checking each candidate row (the index may have false positives). For write-heavy tables where the index is rebuilt frequently, GiST may be preferable. For read-heavy production search, use GIN.
sql
-- Check index size and usage statistics
SELECT
    indexname,
    pg_size_pretty(pg_relation_size(indexname::regclass)) AS index_size,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes
WHERE tablename = 'products'
ORDER BY pg_relation_size(indexname::regclass) DESC;

Handling Update Overhead

GIN indexes have higher update overhead than B-tree indexes. Each insert or update must modify the inverted index entries for every lexeme in the document. For high-write tables, PostgreSQL mitigates this with a "pending list" — new entries are collected in a small pending list and merged into the main GIN structure in batches.

sql
-- Tune the GIN pending list size (default 4MB)
-- Larger pending list = fewer, larger merges = better insert throughput
-- but slightly slower queries until the next merge
ALTER INDEX products_search_vector_gin_idx
    SET (fastupdate = on, gin_pending_list_limit = 16384); -- 16MB

-- Manually trigger a pending list flush if needed
SELECT gin_clean_pending_list('products_search_vector_gin_idx'::regclass);
Warning

Language configuration must be consistent between index build time and query time. If you build the index with to_tsvector('english', ...) but query with to_tsquery('simple', ...), stemming will not match. "Running" indexed under 'english' becomes the lexeme "run". Querying with 'simple' config looks for the literal token "running" — and finds nothing. Always use the same text search configuration for both indexing and querying.

Phrase Search and Proximity Queries

sql
-- Exact phrase search (adjacent words)
SELECT id, name
FROM products
WHERE search_vector @@ phraseto_tsquery('english', 'noise cancelling');

-- Words within N positions of each other
-- <2> means "within 2 positions"
SELECT id, name
FROM products
WHERE search_vector @@ to_tsquery('english', 'noise <2> cancelling');

-- Negation: find documents with "wireless" but not "bluetooth"
SELECT id, name
FROM products
WHERE search_vector @@ to_tsquery('english', 'wireless & !bluetooth');

Comparison: PostgreSQL FTS vs Elasticsearch vs pg_trgm

Criteria PostgreSQL FTS (tsvector + GIN) Elasticsearch / OpenSearch pg_trgm (trigram)
Infrastructure Zero extra infra. Runs inside your existing PostgreSQL instance. Separate cluster: JVM heap, shard management, index rotation, snapshot policy. PostgreSQL extension. Zero extra infra. No new process.
Indexing model Linguistic: stemming, stop words, dictionaries. Language-aware. Linguistic + BM25 scoring, pluggable analyzers, custom tokenizers per field. Character-level trigrams. Language-agnostic. No stemming.
Fuzzy / typo tolerance None natively. Requires pg_trgm for similarity or custom dictionary for synonyms. Native fuzzy matching with Levenshtein distance. Configurable per-field. Excellent. similarity() and word_similarity() handle typos and partial matches naturally.
Ranking quality Good. ts_rank uses term frequency + position weights. No IDF by default. Excellent. BM25 (TF-IDF with saturation). Field-level boosts. Learned-to-rank support. Similarity score only (0–1). Not a relevance ranking system.
Partial / prefix matching Requires tsquery prefix syntax (run:*) or pre-indexed prefixes. Not as natural as trigram. Native prefix and wildcard queries. Edge n-gram analyzer for autocomplete. Excellent. Handles LIKE '%fox%' via index on tables with trigram GIN/GiST.
SQL joins & transactions Full SQL. Joins, CTEs, RLS, ACID. Search results can be joined to any table in the same query. No SQL. Must join in application code after fetching IDs from Elasticsearch. Full SQL. Same as PostgreSQL FTS — it is PostgreSQL.
Operational cost Effectively zero if already running PostgreSQL. High. Dedicated cluster, heap tuning, shard rebalancing, separate monitoring, data sync pipeline. Effectively zero if already running PostgreSQL.
Scale ceiling Tens of millions of documents on a well-sized Postgres instance. Beyond that, consider Elasticsearch. Billions of documents across distributed shards. Works well up to a few million rows. GIN index can grow large on high-cardinality text.
Best use case Application search where data already lives in PostgreSQL and linguistic matching is sufficient. Large-scale search platforms, log analytics, enterprise search, multilingual content with custom analyzers. Autocomplete, fuzzy name matching, email search, short-string similarity — where typo tolerance matters more than linguistic accuracy.
Tip

For most production applications, PostgreSQL FTS and pg_trgm are complementary, not competing. Use FTS for semantic search and relevance ranking on long-form content, and add a pg_trgm GIN index on short fields (name, SKU, email) for autocomplete and fuzzy matching. Both indexes can live on the same table. Many teams run both without needing Elasticsearch at all.

Key Takeaways

Key Takeaways
  • Always store a pre-computed tsvector generated column rather than calling to_tsvector() inline in queries — inline calls bypass the GIN index and scan the full table.
  • Build a GIN index on the tsvector column for production workloads. GiST is acceptable for rarely-queried or frequently-updated tables, but GIN queries are 3–10x faster for typical read workloads.
  • Use websearch_to_tsquery() for user-facing search boxes. It parses Google-style syntax safely and handles stop words gracefully without raising errors.
  • Language configuration must match between index definition and query execution. Mismatched configurations produce zero results silently — always verify with EXPLAIN (ANALYZE).
  • ts_rank() with setweight() on multi-column vectors gives meaningful relevance ordering. Weight the title and name fields heavily ('A') and body text lightly ('C' or 'D').
  • For typo tolerance, combine PostgreSQL FTS with pg_trgm in the same database — you get linguistic accuracy from FTS and fuzzy matching from trigrams without operating a second system.
  • GIN fastupdate with a larger gin_pending_list_limit reduces index write amplification on high-insert tables. Tune this before enabling FTS on a write-heavy workload.

Working with JusDB on PostgreSQL Full-Text Search

Implementing PostgreSQL full-text search correctly — choosing between GIN and GiST, setting up generated columns with proper weights, tuning fastupdate for your write volume, and combining FTS with pg_trgm for autocomplete — requires production PostgreSQL experience. The difference between a working FTS implementation and a fast one is often a single misconfigured index expression or a language mismatch that shows up as mysteriously empty results in a specific locale.

JusDB's PostgreSQL consultants have deployed full-text search on catalogs ranging from 50,000 to 40 million rows, across e-commerce, legal document platforms, SaaS knowledge bases, and multi-language content sites. We cover schema design, index strategy, query optimization, ts_headline snippet rendering, custom dictionary configuration, and monitoring — before go-live, not after the first production incident.

PostgreSQL Consulting Services Talk to a PostgreSQL Expert

We also offer a PostgreSQL search audit for teams currently running LIKE queries or considering an Elasticsearch migration — reviewing your query plans, index configuration, and data model to identify the fastest path to production-quality search before committing to a new system.


Share this article