Building RAG Pipelines with PostgreSQL and pgvector

Build production RAG pipelines using PostgreSQL + pgvector — chunking, embedding storage, similarity search, and retrieval

JusDB Team
February 5, 2026
9 min read
154 views

Most teams building retrieval-augmented generation systems reach for a dedicated vector database by default — Pinecone, Weaviate, Qdrant — and then spend the next six months maintaining the synchronization layer between it and their primary PostgreSQL database. The irony is that PostgreSQL, with the pgvector extension, can serve as both the document store and the vector index for the majority of production RAG workloads. You get ACID transactions, row-level security, standard SQL filtering, and full-text hybrid search all in one system, without the operational overhead of a second datastore. This post walks through every layer of a production RAG pipeline built on PostgreSQL and pgvector: schema design, chunking strategy, embedding insertion with Python, HNSW index tuning, and the retrieval query patterns that matter at scale.

TL;DR
  • Install pgvector and create a documents table with a VECTOR(1536) column to store OpenAI embeddings alongside your document metadata.
  • Chunk documents into 300–600 token windows with 10–15% overlap before embedding; chunk size is the single biggest lever on retrieval quality.
  • Use psycopg2 with register_vector() to insert embedding arrays and run cosine similarity queries with the <=> operator.
  • Build an HNSW index with USING hnsw (embedding vector_cosine_ops) for sub-10ms p99 retrieval on datasets up to ~10 million chunks.
  • Tune hnsw.ef_search at query time to trade recall for latency: 40 is a reasonable production default for most RAG use cases.
  • PostgreSQL's hybrid search — WHERE filters, full-text tsvector, and vector similarity in one query — eliminates the need for a dedicated vector store in almost every mid-scale RAG deployment.

What Is RAG and Why PostgreSQL

Retrieval-augmented generation is an architectural pattern for grounding large language model responses in specific, up-to-date knowledge. Instead of relying on what a model memorised during pre-training, RAG retrieves semantically relevant passages from a document corpus at query time and injects them into the prompt as context. The LLM then answers based on those passages, which drastically reduces hallucination and keeps responses current without fine-tuning.

The retrieval step requires a vector index: every document chunk is encoded into a dense embedding vector, and a nearest-neighbour search identifies which chunks are closest to the user's query embedding in that vector space. This is where dedicated vector databases traditionally enter the picture.

PostgreSQL with pgvector earns its place here for several structural reasons. First, your document metadata — author, source URL, access permissions, creation date, tenant ID — already lives in PostgreSQL. Splitting retrieval across two systems means you must either duplicate that metadata in the vector store or do a two-hop query: vector search, then SQL join. Both approaches add latency and synchronisation risk. Second, PostgreSQL's MVCC and transaction model mean you can update document chunks and their embeddings atomically. There is no eventual consistency gap where the vector index returns a chunk whose text was already updated in the relational store. Third, row-level security policies apply automatically to vector searches, so multi-tenant RAG systems get access control for free without any application-layer filtering logic.

Tip

pgvector is available on Amazon RDS PostgreSQL 15+, Aurora PostgreSQL 15.2+, Google Cloud SQL PostgreSQL 14+, Azure Database for PostgreSQL Flexible Server 13+, Supabase, and Neon. No custom installation step is required on these managed platforms — just run CREATE EXTENSION IF NOT EXISTS vector;.

Schema Design for Embeddings

A well-designed RAG schema separates the source document record from the individual chunk records. This separation matters because a single PDF or web page typically yields dozens of chunks, and you need to be able to re-chunk or re-embed without touching the source document row. It also makes it straightforward to query document-level metadata (author, permissions, publication date) alongside chunk-level content during retrieval.

sql
-- Enable pgvector (run once per database)
CREATE EXTENSION IF NOT EXISTS vector;

-- Source documents table
CREATE TABLE sources (
    id          BIGSERIAL PRIMARY KEY,
    title       TEXT        NOT NULL,
    source_url  TEXT,
    author      TEXT,
    tenant_id   BIGINT      NOT NULL,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Document chunks with embeddings
CREATE TABLE documents (
    id              BIGSERIAL    PRIMARY KEY,
    source_id       BIGINT       NOT NULL REFERENCES sources(id) ON DELETE CASCADE,
    tenant_id       BIGINT       NOT NULL,
    chunk_index     INTEGER      NOT NULL,          -- ordinal position in source
    chunk_text      TEXT         NOT NULL,          -- raw text of this chunk
    token_count     INTEGER,                        -- useful for prompt budgeting
    embedding       VECTOR(1536) NOT NULL,          -- OpenAI text-embedding-3-small
    embedding_model TEXT         NOT NULL DEFAULT 'text-embedding-3-small',
    metadata        JSONB        NOT NULL DEFAULT '{}',
    created_at      TIMESTAMPTZ  NOT NULL DEFAULT NOW()
);

-- Fast lookup by source (for re-embedding on document update)
CREATE INDEX ON documents (source_id, chunk_index);

-- Tenant isolation index (multi-tenant RAG)
CREATE INDEX ON documents (tenant_id);

-- Full-text search column for hybrid retrieval
ALTER TABLE documents ADD COLUMN tsv TSVECTOR
    GENERATED ALWAYS AS (to_tsvector('english', chunk_text)) STORED;

CREATE INDEX ON documents USING gin (tsv);

A few design choices here deserve explanation. The VECTOR(1536) dimension is fixed at text-embedding-3-small's output size. If you switch to a different model — text-embedding-3-large (3072 dimensions), Cohere embed-v3 (1024), or a locally hosted model — you will need to create a new column or migrate the table. It is worth recording embedding_model in the row so you can detect staleness when you upgrade models. The metadata JSONB column stores arbitrary per-chunk attributes (page number, section heading, confidence score from a pre-processing step) without requiring schema changes as your pipeline evolves.

Warning

Do not store embeddings from different models in the same column. Cosine similarity between a 1536-dim OpenAI vector and a 1024-dim Cohere vector is meaningless, and PostgreSQL will reject the query if the stored dimension does not match the query vector dimension. Use the embedding_model column and a WHERE clause to partition queries by model if you must mix models during a migration window.

Chunking Strategies for Documents

Chunking is the step most teams underestimate. The size and overlap of your chunks control the quality of retrieved context more directly than any index parameter or embedding model choice. A chunk that is too small loses surrounding context that disambiguates meaning. A chunk that is too large wastes prompt tokens and dilutes the similarity signal with off-topic content.

There are three practical chunking approaches for production RAG pipelines:

Fixed-size token window

Split the document every N tokens, with an M-token overlap between adjacent chunks. This is the simplest approach and works well for homogeneous documents like support tickets, legal clauses, or product descriptions. A window of 400 tokens with 60-token overlap (15%) is a reasonable starting point for OpenAI embeddings.

python
import tiktoken

def chunk_by_tokens(text: str, max_tokens: int = 400, overlap: int = 60) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        chunk_tokens = tokens[start:end]
        chunks.append(enc.decode(chunk_tokens))
        if end == len(tokens):
            break
        start += max_tokens - overlap
    return chunks

Sentence-boundary chunking

Use a sentence tokenizer (spaCy, NLTK, or a simple regex) to split at sentence boundaries, then greedily pack sentences into chunks up to the token limit. This preserves grammatical completeness and tends to produce higher-quality embeddings for prose documents.

Structural chunking

For documents with explicit structure — Markdown headings, HTML sections, PDF bookmarks — split at structural boundaries first, then apply token-window chunking within each section. Inject the section heading as a prefix to every chunk: "Introduction: PostgreSQL was released in 1996..." The heading prefix dramatically improves recall for queries that name a specific section topic.

Tip

Run a retrieval evaluation before choosing a chunking strategy. Generate 50–100 representative user questions, embed them, run cosine similarity search, and measure whether the correct ground-truth passage appears in the top-5 results. A 15-minute evaluation harness will tell you more than any rule of thumb.

Storing and Querying Embeddings with pgvector

The following Python module handles both embedding insertion and similarity-based retrieval. It uses psycopg2 with pgvector's register_vector() helper, which teaches psycopg2 how to serialise Python lists and NumPy arrays into the PostgreSQL vector wire format.

python
import os
import psycopg2
from psycopg2.extras import execute_values
from pgvector.psycopg2 import register_vector
import openai

# --- Connection setup ---
conn = psycopg2.connect(os.environ["DATABASE_URL"])
register_vector(conn)  # Must be called after connect()

openai_client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])

EMBEDDING_MODEL = "text-embedding-3-small"
EMBEDDING_DIM   = 1536


def embed_texts(texts: list[str]) -> list[list[float]]:
    """Embed a batch of texts using OpenAI. Rate-limit-safe batch size: 100."""
    response = openai_client.embeddings.create(
        model=EMBEDDING_MODEL,
        input=texts,
    )
    return [item.embedding for item in response.data]


def insert_chunks(source_id: int, tenant_id: int, chunks: list[str]) -> None:
    """Embed and insert document chunks into the documents table."""
    embeddings = embed_texts(chunks)

    rows = [
        (source_id, tenant_id, idx, text, len(text.split()), emb, EMBEDDING_MODEL)
        for idx, (text, emb) in enumerate(zip(chunks, embeddings))
    ]

    with conn.cursor() as cur:
        execute_values(
            cur,
            """
            INSERT INTO documents
                (source_id, tenant_id, chunk_index, chunk_text,
                 token_count, embedding, embedding_model)
            VALUES %s
            """,
            rows,
            template="(%s, %s, %s, %s, %s, %s::vector, %s)",
        )
    conn.commit()


def retrieve(
    query: str,
    tenant_id: int,
    top_k: int = 5,
    similarity_threshold: float = 0.75,
) -> list[dict]:
    """
    Retrieve the top-k most similar chunks for a query.
    Uses cosine similarity via the <=> operator (lower = more similar).
    """
    [query_embedding] = embed_texts([query])

    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT
                d.id,
                d.chunk_text,
                d.metadata,
                s.title        AS source_title,
                s.source_url,
                1 - (d.embedding <=> %s::vector) AS cosine_similarity
            FROM documents d
            JOIN sources s ON s.id = d.source_id
            WHERE d.tenant_id = %s
              AND 1 - (d.embedding <=> %s::vector) >= %s
            ORDER BY d.embedding <=> %s::vector
            LIMIT %s
            """,
            (
                query_embedding, tenant_id,
                query_embedding, similarity_threshold,
                query_embedding, top_k,
            ),
        )
        rows = cur.fetchall()

    return [
        {
            "id": r[0],
            "chunk_text": r[1],
            "metadata": r[2],
            "source_title": r[3],
            "source_url": r[4],
            "cosine_similarity": float(r[5]),
        }
        for r in rows
    ]

The <=> operator computes cosine distance (0 = identical, 2 = opposite), so ORDER BY embedding <=> query_vector returns the most similar chunks first. Converting to similarity for application use: similarity = 1 - distance. The threshold of 0.75 (cosine similarity) is a starting point; tune it by inspecting retrieval results on representative queries. Values below 0.6 typically indicate the query is outside the document corpus's topic coverage.

Hybrid search: vector + full-text

For knowledge bases where users ask keyword-specific questions ("What is the exact clause number for termination?"), pure vector search sometimes misses exact matches that full-text search would catch. Combining both with ts_rank and normalised vector similarity gives the best of both worlds:

sql
-- Hybrid retrieval: vector similarity + full-text rank, fused with RRF
SELECT
    d.id,
    d.chunk_text,
    s.title AS source_title,
    (
        0.7 * (1 - (d.embedding <=> $1::vector))   -- vector similarity weight
      + 0.3 * ts_rank(d.tsv, plainto_tsquery('english', $2))  -- BM25 weight
    ) AS hybrid_score
FROM documents d
JOIN sources s ON s.id = d.source_id
WHERE d.tenant_id = $3
ORDER BY hybrid_score DESC
LIMIT 10;

Optimizing HNSW Index for Production

The sequential scan (no index) works acceptably for datasets under ~50,000 chunks on fast NVMe storage. Beyond that, you need an index. pgvector ships two index types: IVFFlat and HNSW. For RAG retrieval, HNSW is almost always the right choice: it delivers lower query latency at the cost of higher memory usage and a slower one-time build.

Building the HNSW index

sql
-- Build the HNSW index for cosine similarity
-- m: max connections per layer (default 16; use 32 for higher recall at ~2x memory cost)
-- ef_construction: search width during build (default 64; higher = better index quality)
CREATE INDEX ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- To increase recall at query time (session-scoped):
SET hnsw.ef_search = 40;

-- Verify the index is being used:
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, chunk_text, embedding <=> '[0.1, 0.2, ...]'::vector AS distance
FROM documents
WHERE tenant_id = 42
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector
LIMIT 5;

The index build for 1 million 1536-dimensional vectors typically takes 5–15 minutes on a 4-vCPU instance with maintenance_work_mem = 2GB. Set max_parallel_maintenance_workers to the number of available cores to parallelise the build. Do not build the HNSW index on a live production database during peak traffic — it acquires a ShareLock that blocks writes for the duration of the build. Use CREATE INDEX CONCURRENTLY syntax, which pgvector supports as of version 0.6.0.

sql
-- Non-blocking index build (pgvector >= 0.6.0)
CREATE INDEX CONCURRENTLY ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

HNSW tuning reference

ParameterDefaultEffectRecommended range
m 16 Max edges per node per layer. Higher = better recall, more memory, slower build. 16–32 for most RAG workloads
ef_construction 64 Beam width during index build. Higher = better index quality, slower build. 64–128
hnsw.ef_search 40 Beam width during query. Higher = better recall, higher latency. 20–100; tune against recall@5

Memory consumption for the HNSW index is approximately 1.2 × m × dim × 4 bytes × num_vectors. For 1 million 1536-dim vectors with m=16, expect roughly 118 GB for the index to fit entirely in RAM — which is not realistic for most deployments. PostgreSQL's buffer pool will cache the hot portion of the HNSW graph automatically. In practice, for a typical RAG corpus where 10–20% of chunks receive 80% of queries, the effective working set is much smaller.

Warning

HNSW indexes do not support WHERE clause pre-filtering natively. PostgreSQL will compute the nearest-neighbour candidates first, then apply your WHERE tenant_id = $1 filter to the result set. If your WHERE filter is very selective (e.g., a single tenant in a large shared table), the planner may fall back to a sequential scan. Partition your documents table by tenant_id using PostgreSQL declarative partitioning to ensure the HNSW index is built and queried per partition, keeping each partition's vector count manageable.

Monitoring index health

sql
-- Check index size and bloat
SELECT
    indexrelname,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes
WHERE relname = 'documents';

-- Check if queries are using the HNSW index
SELECT query, calls, mean_exec_time, rows
FROM pg_stat_statements
WHERE query LIKE '%<=>%'
ORDER BY mean_exec_time DESC
LIMIT 10;

If idx_scan is zero after deploying the index, run ANALYZE documents; to refresh planner statistics. The planner will not use the HNSW index if it estimates a sequential scan is cheaper, which can happen when the table statistics are stale or when LIMIT is not present in the query. Always include ORDER BY embedding <=> $1 LIMIT n together — the HNSW index is only activated by this combined pattern.

Tip

Set hnsw.ef_search as a connection-level parameter in your application's connection pool configuration rather than as a session SET statement on every query. PgBouncer and pgpool-II both support per-pool startup commands. This avoids the per-query overhead of a SET round-trip and ensures all queries in the pool use consistent recall settings.

Key Takeaways
  • PostgreSQL with pgvector eliminates the dedicated vector store for most RAG workloads, consolidating document metadata and embeddings into a single ACID-compliant system.
  • The documents table schema should store VECTOR(1536) embeddings alongside chunk text, source reference, tenant ID, and a generated TSVECTOR column for hybrid search.
  • Chunk documents at 300–600 tokens with 10–15% overlap; record the embedding model name in each row so you can detect staleness when upgrading models.
  • Use psycopg2 with register_vector() and the <=> cosine distance operator for insertion and retrieval; batching embed calls to 100 texts at a time keeps OpenAI API costs manageable.
  • Build the HNSW index with CREATE INDEX CONCURRENTLY ... USING hnsw (embedding vector_cosine_ops) to avoid blocking writes; tune hnsw.ef_search at the connection pool level.
  • Hybrid vector + full-text queries with weighted scoring outperform pure vector search for knowledge bases where users ask keyword-specific questions.
  • Partition large multi-tenant tables by tenant_id so the HNSW index size per partition stays tractable and pre-filtering remains efficient.

Working with JusDB on AI Database Infrastructure

JusDB's database engineers design and operate production RAG infrastructure for teams at every stage — from the initial pgvector schema to multi-tenant partitioning strategies, HNSW index tuning under production load, and migrations from dedicated vector databases back to PostgreSQL. We handle the index build scheduling, vacuum configuration, connection pool tuning, and monitoring setup so your team can focus on building the application layer.

If your RAG pipeline is hitting latency walls, facing escalating vector-store bills, or struggling with the synchronisation complexity of a split PostgreSQL + external vector database architecture, we can run a one-day audit and give you a concrete remediation plan.

Explore JusDB pgvector Services →  |  Talk to a DBA

Share this article