Most teams building retrieval-augmented generation systems reach for a dedicated vector database by default — Pinecone, Weaviate, Qdrant — and then spend the next six months maintaining the synchronization layer between it and their primary PostgreSQL database. The irony is that PostgreSQL, with the pgvector extension, can serve as both the document store and the vector index for the majority of production RAG workloads. You get ACID transactions, row-level security, standard SQL filtering, and full-text hybrid search all in one system, without the operational overhead of a second datastore. This post walks through every layer of a production RAG pipeline built on PostgreSQL and pgvector: schema design, chunking strategy, embedding insertion with Python, HNSW index tuning, and the retrieval query patterns that matter at scale.
- Install pgvector and create a
documentstable with aVECTOR(1536)column to store OpenAI embeddings alongside your document metadata. - Chunk documents into 300–600 token windows with 10–15% overlap before embedding; chunk size is the single biggest lever on retrieval quality.
- Use
psycopg2withregister_vector()to insert embedding arrays and run cosine similarity queries with the<=>operator. - Build an HNSW index with
USING hnsw (embedding vector_cosine_ops)for sub-10ms p99 retrieval on datasets up to ~10 million chunks. - Tune
hnsw.ef_searchat query time to trade recall for latency: 40 is a reasonable production default for most RAG use cases. - PostgreSQL's hybrid search —
WHEREfilters, full-texttsvector, and vector similarity in one query — eliminates the need for a dedicated vector store in almost every mid-scale RAG deployment.
What Is RAG and Why PostgreSQL
Retrieval-augmented generation is an architectural pattern for grounding large language model responses in specific, up-to-date knowledge. Instead of relying on what a model memorised during pre-training, RAG retrieves semantically relevant passages from a document corpus at query time and injects them into the prompt as context. The LLM then answers based on those passages, which drastically reduces hallucination and keeps responses current without fine-tuning.
The retrieval step requires a vector index: every document chunk is encoded into a dense embedding vector, and a nearest-neighbour search identifies which chunks are closest to the user's query embedding in that vector space. This is where dedicated vector databases traditionally enter the picture.
PostgreSQL with pgvector earns its place here for several structural reasons. First, your document metadata — author, source URL, access permissions, creation date, tenant ID — already lives in PostgreSQL. Splitting retrieval across two systems means you must either duplicate that metadata in the vector store or do a two-hop query: vector search, then SQL join. Both approaches add latency and synchronisation risk. Second, PostgreSQL's MVCC and transaction model mean you can update document chunks and their embeddings atomically. There is no eventual consistency gap where the vector index returns a chunk whose text was already updated in the relational store. Third, row-level security policies apply automatically to vector searches, so multi-tenant RAG systems get access control for free without any application-layer filtering logic.
pgvector is available on Amazon RDS PostgreSQL 15+, Aurora PostgreSQL 15.2+, Google Cloud SQL PostgreSQL 14+, Azure Database for PostgreSQL Flexible Server 13+, Supabase, and Neon. No custom installation step is required on these managed platforms — just run CREATE EXTENSION IF NOT EXISTS vector;.
Schema Design for Embeddings
A well-designed RAG schema separates the source document record from the individual chunk records. This separation matters because a single PDF or web page typically yields dozens of chunks, and you need to be able to re-chunk or re-embed without touching the source document row. It also makes it straightforward to query document-level metadata (author, permissions, publication date) alongside chunk-level content during retrieval.
-- Enable pgvector (run once per database)
CREATE EXTENSION IF NOT EXISTS vector;
-- Source documents table
CREATE TABLE sources (
id BIGSERIAL PRIMARY KEY,
title TEXT NOT NULL,
source_url TEXT,
author TEXT,
tenant_id BIGINT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Document chunks with embeddings
CREATE TABLE documents (
id BIGSERIAL PRIMARY KEY,
source_id BIGINT NOT NULL REFERENCES sources(id) ON DELETE CASCADE,
tenant_id BIGINT NOT NULL,
chunk_index INTEGER NOT NULL, -- ordinal position in source
chunk_text TEXT NOT NULL, -- raw text of this chunk
token_count INTEGER, -- useful for prompt budgeting
embedding VECTOR(1536) NOT NULL, -- OpenAI text-embedding-3-small
embedding_model TEXT NOT NULL DEFAULT 'text-embedding-3-small',
metadata JSONB NOT NULL DEFAULT '{}',
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Fast lookup by source (for re-embedding on document update)
CREATE INDEX ON documents (source_id, chunk_index);
-- Tenant isolation index (multi-tenant RAG)
CREATE INDEX ON documents (tenant_id);
-- Full-text search column for hybrid retrieval
ALTER TABLE documents ADD COLUMN tsv TSVECTOR
GENERATED ALWAYS AS (to_tsvector('english', chunk_text)) STORED;
CREATE INDEX ON documents USING gin (tsv);A few design choices here deserve explanation. The VECTOR(1536) dimension is fixed at text-embedding-3-small's output size. If you switch to a different model — text-embedding-3-large (3072 dimensions), Cohere embed-v3 (1024), or a locally hosted model — you will need to create a new column or migrate the table. It is worth recording embedding_model in the row so you can detect staleness when you upgrade models. The metadata JSONB column stores arbitrary per-chunk attributes (page number, section heading, confidence score from a pre-processing step) without requiring schema changes as your pipeline evolves.
Do not store embeddings from different models in the same column. Cosine similarity between a 1536-dim OpenAI vector and a 1024-dim Cohere vector is meaningless, and PostgreSQL will reject the query if the stored dimension does not match the query vector dimension. Use the embedding_model column and a WHERE clause to partition queries by model if you must mix models during a migration window.
Chunking Strategies for Documents
Chunking is the step most teams underestimate. The size and overlap of your chunks control the quality of retrieved context more directly than any index parameter or embedding model choice. A chunk that is too small loses surrounding context that disambiguates meaning. A chunk that is too large wastes prompt tokens and dilutes the similarity signal with off-topic content.
There are three practical chunking approaches for production RAG pipelines:
Fixed-size token window
Split the document every N tokens, with an M-token overlap between adjacent chunks. This is the simplest approach and works well for homogeneous documents like support tickets, legal clauses, or product descriptions. A window of 400 tokens with 60-token overlap (15%) is a reasonable starting point for OpenAI embeddings.
import tiktoken
def chunk_by_tokens(text: str, max_tokens: int = 400, overlap: int = 60) -> list[str]:
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode(text)
chunks = []
start = 0
while start < len(tokens):
end = min(start + max_tokens, len(tokens))
chunk_tokens = tokens[start:end]
chunks.append(enc.decode(chunk_tokens))
if end == len(tokens):
break
start += max_tokens - overlap
return chunksSentence-boundary chunking
Use a sentence tokenizer (spaCy, NLTK, or a simple regex) to split at sentence boundaries, then greedily pack sentences into chunks up to the token limit. This preserves grammatical completeness and tends to produce higher-quality embeddings for prose documents.
Structural chunking
For documents with explicit structure — Markdown headings, HTML sections, PDF bookmarks — split at structural boundaries first, then apply token-window chunking within each section. Inject the section heading as a prefix to every chunk: "Introduction: PostgreSQL was released in 1996..." The heading prefix dramatically improves recall for queries that name a specific section topic.
Run a retrieval evaluation before choosing a chunking strategy. Generate 50–100 representative user questions, embed them, run cosine similarity search, and measure whether the correct ground-truth passage appears in the top-5 results. A 15-minute evaluation harness will tell you more than any rule of thumb.
Storing and Querying Embeddings with pgvector
The following Python module handles both embedding insertion and similarity-based retrieval. It uses psycopg2 with pgvector's register_vector() helper, which teaches psycopg2 how to serialise Python lists and NumPy arrays into the PostgreSQL vector wire format.
import os
import psycopg2
from psycopg2.extras import execute_values
from pgvector.psycopg2 import register_vector
import openai
# --- Connection setup ---
conn = psycopg2.connect(os.environ["DATABASE_URL"])
register_vector(conn) # Must be called after connect()
openai_client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
EMBEDDING_MODEL = "text-embedding-3-small"
EMBEDDING_DIM = 1536
def embed_texts(texts: list[str]) -> list[list[float]]:
"""Embed a batch of texts using OpenAI. Rate-limit-safe batch size: 100."""
response = openai_client.embeddings.create(
model=EMBEDDING_MODEL,
input=texts,
)
return [item.embedding for item in response.data]
def insert_chunks(source_id: int, tenant_id: int, chunks: list[str]) -> None:
"""Embed and insert document chunks into the documents table."""
embeddings = embed_texts(chunks)
rows = [
(source_id, tenant_id, idx, text, len(text.split()), emb, EMBEDDING_MODEL)
for idx, (text, emb) in enumerate(zip(chunks, embeddings))
]
with conn.cursor() as cur:
execute_values(
cur,
"""
INSERT INTO documents
(source_id, tenant_id, chunk_index, chunk_text,
token_count, embedding, embedding_model)
VALUES %s
""",
rows,
template="(%s, %s, %s, %s, %s, %s::vector, %s)",
)
conn.commit()
def retrieve(
query: str,
tenant_id: int,
top_k: int = 5,
similarity_threshold: float = 0.75,
) -> list[dict]:
"""
Retrieve the top-k most similar chunks for a query.
Uses cosine similarity via the <=> operator (lower = more similar).
"""
[query_embedding] = embed_texts([query])
with conn.cursor() as cur:
cur.execute(
"""
SELECT
d.id,
d.chunk_text,
d.metadata,
s.title AS source_title,
s.source_url,
1 - (d.embedding <=> %s::vector) AS cosine_similarity
FROM documents d
JOIN sources s ON s.id = d.source_id
WHERE d.tenant_id = %s
AND 1 - (d.embedding <=> %s::vector) >= %s
ORDER BY d.embedding <=> %s::vector
LIMIT %s
""",
(
query_embedding, tenant_id,
query_embedding, similarity_threshold,
query_embedding, top_k,
),
)
rows = cur.fetchall()
return [
{
"id": r[0],
"chunk_text": r[1],
"metadata": r[2],
"source_title": r[3],
"source_url": r[4],
"cosine_similarity": float(r[5]),
}
for r in rows
]The <=> operator computes cosine distance (0 = identical, 2 = opposite), so ORDER BY embedding <=> query_vector returns the most similar chunks first. Converting to similarity for application use: similarity = 1 - distance. The threshold of 0.75 (cosine similarity) is a starting point; tune it by inspecting retrieval results on representative queries. Values below 0.6 typically indicate the query is outside the document corpus's topic coverage.
Hybrid search: vector + full-text
For knowledge bases where users ask keyword-specific questions ("What is the exact clause number for termination?"), pure vector search sometimes misses exact matches that full-text search would catch. Combining both with ts_rank and normalised vector similarity gives the best of both worlds:
-- Hybrid retrieval: vector similarity + full-text rank, fused with RRF
SELECT
d.id,
d.chunk_text,
s.title AS source_title,
(
0.7 * (1 - (d.embedding <=> $1::vector)) -- vector similarity weight
+ 0.3 * ts_rank(d.tsv, plainto_tsquery('english', $2)) -- BM25 weight
) AS hybrid_score
FROM documents d
JOIN sources s ON s.id = d.source_id
WHERE d.tenant_id = $3
ORDER BY hybrid_score DESC
LIMIT 10;Optimizing HNSW Index for Production
The sequential scan (no index) works acceptably for datasets under ~50,000 chunks on fast NVMe storage. Beyond that, you need an index. pgvector ships two index types: IVFFlat and HNSW. For RAG retrieval, HNSW is almost always the right choice: it delivers lower query latency at the cost of higher memory usage and a slower one-time build.
Building the HNSW index
-- Build the HNSW index for cosine similarity
-- m: max connections per layer (default 16; use 32 for higher recall at ~2x memory cost)
-- ef_construction: search width during build (default 64; higher = better index quality)
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- To increase recall at query time (session-scoped):
SET hnsw.ef_search = 40;
-- Verify the index is being used:
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, chunk_text, embedding <=> '[0.1, 0.2, ...]'::vector AS distance
FROM documents
WHERE tenant_id = 42
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector
LIMIT 5;The index build for 1 million 1536-dimensional vectors typically takes 5–15 minutes on a 4-vCPU instance with maintenance_work_mem = 2GB. Set max_parallel_maintenance_workers to the number of available cores to parallelise the build. Do not build the HNSW index on a live production database during peak traffic — it acquires a ShareLock that blocks writes for the duration of the build. Use CREATE INDEX CONCURRENTLY syntax, which pgvector supports as of version 0.6.0.
-- Non-blocking index build (pgvector >= 0.6.0)
CREATE INDEX CONCURRENTLY ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);HNSW tuning reference
| Parameter | Default | Effect | Recommended range |
|---|---|---|---|
m |
16 | Max edges per node per layer. Higher = better recall, more memory, slower build. | 16–32 for most RAG workloads |
ef_construction |
64 | Beam width during index build. Higher = better index quality, slower build. | 64–128 |
hnsw.ef_search |
40 | Beam width during query. Higher = better recall, higher latency. | 20–100; tune against recall@5 |
Memory consumption for the HNSW index is approximately 1.2 × m × dim × 4 bytes × num_vectors. For 1 million 1536-dim vectors with m=16, expect roughly 118 GB for the index to fit entirely in RAM — which is not realistic for most deployments. PostgreSQL's buffer pool will cache the hot portion of the HNSW graph automatically. In practice, for a typical RAG corpus where 10–20% of chunks receive 80% of queries, the effective working set is much smaller.
HNSW indexes do not support WHERE clause pre-filtering natively. PostgreSQL will compute the nearest-neighbour candidates first, then apply your WHERE tenant_id = $1 filter to the result set. If your WHERE filter is very selective (e.g., a single tenant in a large shared table), the planner may fall back to a sequential scan. Partition your documents table by tenant_id using PostgreSQL declarative partitioning to ensure the HNSW index is built and queried per partition, keeping each partition's vector count manageable.
Monitoring index health
-- Check index size and bloat
SELECT
indexrelname,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
idx_scan,
idx_tup_read,
idx_tup_fetch
FROM pg_stat_user_indexes
WHERE relname = 'documents';
-- Check if queries are using the HNSW index
SELECT query, calls, mean_exec_time, rows
FROM pg_stat_statements
WHERE query LIKE '%<=>%'
ORDER BY mean_exec_time DESC
LIMIT 10;If idx_scan is zero after deploying the index, run ANALYZE documents; to refresh planner statistics. The planner will not use the HNSW index if it estimates a sequential scan is cheaper, which can happen when the table statistics are stale or when LIMIT is not present in the query. Always include ORDER BY embedding <=> $1 LIMIT n together — the HNSW index is only activated by this combined pattern.
Set hnsw.ef_search as a connection-level parameter in your application's connection pool configuration rather than as a session SET statement on every query. PgBouncer and pgpool-II both support per-pool startup commands. This avoids the per-query overhead of a SET round-trip and ensures all queries in the pool use consistent recall settings.
- PostgreSQL with pgvector eliminates the dedicated vector store for most RAG workloads, consolidating document metadata and embeddings into a single ACID-compliant system.
- The
documentstable schema should storeVECTOR(1536)embeddings alongside chunk text, source reference, tenant ID, and a generatedTSVECTORcolumn for hybrid search. - Chunk documents at 300–600 tokens with 10–15% overlap; record the embedding model name in each row so you can detect staleness when upgrading models.
- Use
psycopg2withregister_vector()and the<=>cosine distance operator for insertion and retrieval; batching embed calls to 100 texts at a time keeps OpenAI API costs manageable. - Build the HNSW index with
CREATE INDEX CONCURRENTLY ... USING hnsw (embedding vector_cosine_ops)to avoid blocking writes; tunehnsw.ef_searchat the connection pool level. - Hybrid vector + full-text queries with weighted scoring outperform pure vector search for knowledge bases where users ask keyword-specific questions.
- Partition large multi-tenant tables by
tenant_idso the HNSW index size per partition stays tractable and pre-filtering remains efficient.
Working with JusDB on AI Database Infrastructure
JusDB's database engineers design and operate production RAG infrastructure for teams at every stage — from the initial pgvector schema to multi-tenant partitioning strategies, HNSW index tuning under production load, and migrations from dedicated vector databases back to PostgreSQL. We handle the index build scheduling, vacuum configuration, connection pool tuning, and monitoring setup so your team can focus on building the application layer.
If your RAG pipeline is hitting latency walls, facing escalating vector-store bills, or struggling with the synchronisation complexity of a split PostgreSQL + external vector database architecture, we can run a one-day audit and give you a concrete remediation plan.