pgvector: PostgreSQL Vector Search with HNSW, Hybrid Queries & RAG

Last April, a Series B SaaS team reached out to JusDB in a familiar situation. Their AI-powered document search feature was running on two separate systems: PostgreSQL for user data, permissions, and document metadata, and Pinecone for vector similarity search. The Pinecone bill had climbed to $1,800 per month. Their infrastructure engineer was maintaining two connection pools, two sets of credentials, two monitoring dashboards, and a custom sync layer to keep document IDs consistent between both systems.

Their average similarity search latency was 180ms end-to-end, most of which was the round-trip to Pinecone plus the subsequent PostgreSQL join to fetch the actual document content and check access permissions. Users searching over their own documents were waiting nearly two-tenths of a second per query.

We migrated them to pgvector inside their existing PostgreSQL 16 instance over a single weekend. The Pinecone bill went to zero. End-to-end query latency dropped to 28ms because the vector search and the permission check now run in the same query, on the same machine, inside a single transaction. The sync layer was deleted. Their codebase got 400 lines shorter.

This guide explains exactly how we did it, when this approach is the right call, and when pgvector will not be enough.

TL;DR

pgvector is a PostgreSQL extension that adds a vector data type and approximate nearest-neighbor (ANN) indexes directly inside Postgres.
It eliminates a dedicated vector database for most teams with fewer than 50–100 million vectors and latency requirements above 5–10ms.
HNSW indexes give the best query latency at the cost of larger memory and slower build time. IVFFlat builds faster and uses less RAM but queries slightly slower.
The real superpower is hybrid search: vector similarity and SQL WHERE clauses in a single query, with proper row-level security, joins, and transactions.
For datasets exceeding 100M vectors or requiring sub-5ms p99 at massive QPS, dedicated vector databases (Pinecone, Qdrant, Weaviate) will outperform pgvector.

What pgvector Actually Does

pgvector is a PostgreSQL extension maintained by Andrew Kane, now with significant contributions from AWS, Supabase, and Neon. It was first released in 2021 and reached the critical HNSW index support milestone in version 0.5.0 (October 2023), which fundamentally changed its performance characteristics at scale.

The extension adds three things to PostgreSQL:

A vector(n) data type that stores a fixed-dimension array of 32-bit floats. You declare the dimensionality at schema definition time. OpenAI text-embedding-3-small produces 1536-dimensional vectors; text-embedding-3-large produces 3072; Cohere embed-v3 produces 1024.
Two index types: HNSW (Hierarchical Navigable Small World) and IVFFlat (Inverted File with Flat quantization). Both are approximate nearest-neighbor indexes. Exact search (sequential scan) is also available for small datasets.
Three distance operators: <-> for L2 (Euclidean), <=> for cosine distance, and <#> for negative inner product.

Under the hood, pgvector stores vector data in standard PostgreSQL heap pages. This means vectors participate fully in PostgreSQL's MVCC, WAL, VACUUM, replication, and backup systems. There is no separate process to manage, no separate WAL stream to monitor, and no separate backup job to schedule.

Setting Up pgvector

Installation

On most Linux distributions, pgvector is available through the PostgreSQL PGDG repository:

text

# Debian / Ubuntu
sudo apt install postgresql-16-pgvector

# RHEL / Amazon Linux 2023
sudo yum install pgvector_16

# macOS via Homebrew
brew install pgvector

# Build from source (any platform)
git clone --branch v0.8.0 https://github.com/pgvector/pgvector.git
cd pgvector
make
make install

On managed services: pgvector is available on Amazon RDS PostgreSQL 15+, Amazon Aurora PostgreSQL 15.2+, Google Cloud SQL PostgreSQL 14+, Azure Database for PostgreSQL Flexible Server 13+, Supabase (all tiers), and Neon (all tiers). No installation step is required on these platforms.

Enable the Extension

text

-- Run once per database, requires SUPERUSER or CREATE privilege
CREATE EXTENSION IF NOT EXISTS vector;

-- Verify
SELECT extversion FROM pg_extension WHERE extname = 'vector';
-- 0.8.0

Create a Table with a Vector Column

This example models a RAG (Retrieval-Augmented Generation) pipeline for a document search application:

text

CREATE TABLE documents (
    id          BIGSERIAL PRIMARY KEY,
    user_id     UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    title       TEXT NOT NULL,
    body        TEXT NOT NULL,
    source_url  TEXT,
    embedding   vector(1536),          -- OpenAI text-embedding-3-small
    created_at  TIMESTAMPTZ DEFAULT now(),
    updated_at  TIMESTAMPTZ DEFAULT now()
);

-- Standard indexes for relational queries
CREATE INDEX ON documents (user_id);
CREATE INDEX ON documents (created_at DESC);

Insert Embeddings

Generate embeddings in application code (Python example using OpenAI SDK), then insert:

text

-- Individual insert
INSERT INTO documents (user_id, title, body, embedding)
VALUES (
    '550e8400-e29b-41d4-a716-446655440000',
    'Q3 2024 Financial Report',
    'Revenue increased 23% year-over-year...',
    '[0.0023, -0.0112, 0.0891, ...]'::vector   -- 1536 floats
);

-- Bulk insert with COPY (fastest for large loads)
-- Generate a CSV with the embedding column as a bracketed float array
COPY documents (user_id, title, body, embedding)
FROM '/tmp/documents_with_embeddings.csv'
WITH (FORMAT csv, HEADER true);

From Python using psycopg2 or asyncpg, pass the embedding as a list of floats and let the pgvector Python adapter handle serialization:

text

import psycopg2
from pgvector.psycopg2 import register_vector
import openai

conn = psycopg2.connect(DATABASE_URL)
register_vector(conn)

client = openai.OpenAI()

def embed(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

embedding = embed("Q3 2024 Financial Report revenue...")

with conn.cursor() as cur:
    cur.execute(
        "INSERT INTO documents (user_id, title, body, embedding) VALUES (%s, %s, %s, %s)",
        (user_id, title, body, embedding)
    )
conn.commit()

Index Types: HNSW vs IVFFlat

Choosing the wrong index type is the most common pgvector performance mistake. Here is the honest trade-off:

HNSW (Hierarchical Navigable Small World)

HNSW builds a multi-layer graph structure. During a query, it navigates the graph greedily from a coarse upper layer down to fine-grained lower layers. The result is very fast queries with high recall, at the cost of significant memory and slow index builds.

Use HNSW when:

Query latency matters more than index build time
You can afford the memory overhead (~2-4x the raw vector data size)
The dataset is relatively stable (infrequent bulk inserts)
You need recall above 95%

text

-- HNSW index for cosine similarity (most common for text embeddings)
CREATE INDEX documents_embedding_hnsw_idx
    ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (
        m = 16,               -- Number of bidirectional links per node.
                              -- Range: 2-100. Higher = better recall, more memory.
                              -- Default 16 is good for most cases. Use 32 for higher
                              -- recall requirements.

        ef_construction = 64  -- Size of candidate list during index build.
                              -- Range: 4-1000. Higher = better recall, slower build.
                              -- Default 64 is reasonable. Use 128 for higher quality.
    );

Memory estimate for HNSW with m=16: approximately 1.2 * dimensions * 4 bytes * num_vectors + graph_overhead. For 1M documents with 1536-dimensional embeddings and m=16, expect roughly 7–10 GB of RAM needed during queries when the index is hot in shared_buffers or the OS page cache.

IVFFlat (Inverted File with Flat Quantization)

IVFFlat partitions vectors into lists (Voronoi cells) using k-means clustering. During a query, it searches only the nearest probes lists rather than all vectors. Builds fast, uses less memory, but queries are slightly slower than HNSW at equivalent recall.

Use IVFFlat when:

You need fast index builds (high-velocity insert workloads, frequent REINDEX)
RAM is constrained
Recall of 90–95% is acceptable
Dataset size is between 100K and 10M vectors

text

-- IVFFlat index for cosine similarity
-- Rule of thumb: lists = sqrt(num_rows) for datasets up to ~1M rows
--                lists = num_rows / 1000 for larger datasets
CREATE INDEX documents_embedding_ivfflat_idx
    ON documents
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);

-- At query time, set probes to control recall vs speed trade-off
-- More probes = higher recall, slower query
-- Aim for probes = sqrt(lists) as a starting point
SET ivfflat.probes = 10;

Important IVFFlat caveat: The index must be built after you have loaded a representative sample of data. An IVFFlat index built on an empty or small table will have poor cluster quality. The general rule is: load at least 3x the target lists count in rows before building the IVFFlat index.

Tuning ef_search for HNSW at Query Time

text

-- ef_search controls the size of the dynamic candidate list during search.
-- Higher = better recall, slower query. Default is 40.
-- Tune this per-query or per-session based on your recall requirements.

SET hnsw.ef_search = 100;   -- higher recall, ~1.5-2x slower queries
SET hnsw.ef_search = 20;    -- faster queries, ~5-10% recall drop

Similarity Search Operators: Which Distance Metric to Use

pgvector supports three distance operators, and picking the wrong one silently returns wrong results:

`<->` L2 Distance (Euclidean)

text

-- Returns documents closest in Euclidean space
SELECT id, title, embedding <-> query_embedding AS distance
FROM documents
ORDER BY distance
LIMIT 10;

Use L2 when: embeddings are normalized to unit length (many models do this by default) or when geometric proximity in the raw embedding space is meaningful. L2 and cosine are equivalent on unit-normalized vectors.

`<=>` Cosine Distance

text

-- Returns documents with highest cosine similarity (lowest cosine distance)
SELECT id, title, 1 - (embedding <=> query_embedding) AS cosine_similarity
FROM documents
ORDER BY embedding <=> query_embedding
LIMIT 10;

Use cosine distance when: working with text embeddings from OpenAI, Cohere, Voyage AI, or most sentence-transformer models. Cosine similarity measures the angle between vectors and is invariant to magnitude, making it ideal for semantic similarity tasks.

`<#>` Negative Inner Product

text

-- Inner product: returns negative of dot product (lower = more similar)
SELECT id, title, (embedding <#> query_embedding) * -1 AS inner_product
FROM documents
ORDER BY embedding <#> query_embedding
LIMIT 10;

Use inner product when: your embedding model was trained with inner product as the similarity metric (some bi-encoder models for dense retrieval like SPLADE variants, or when the embedding provider documentation explicitly recommends it). Also the fastest of the three operators to compute.

Hybrid Search: The Key Advantage Over Dedicated Vector Databases

This is where pgvector's real value becomes obvious. In a dedicated vector database, similarity search returns a list of IDs. You then take those IDs back to your relational database for a second query to fetch actual data, apply business logic filters, and check permissions. That second query is a round-trip, and it breaks transactional consistency.

With pgvector, the similarity search and all relational logic run in a single SQL statement:

Semantic Search with Permission Filtering

text

-- Find the 10 most semantically similar documents for a given user
-- Permission check (user_id) runs inside the same query as vector similarity
-- No round-trips, no consistency gaps

SELECT
    d.id,
    d.title,
    d.body,
    1 - (d.embedding <=> $1::vector) AS similarity
FROM documents d
WHERE
    d.user_id = $2                          -- permission filter
    AND d.embedding <=> $1::vector < 0.4   -- similarity threshold (cosine distance)
ORDER BY d.embedding <=> $1::vector
LIMIT 10;

RAG Pipeline with Metadata Filters

text

-- RAG retrieval: find relevant document chunks for a given query,
-- filtered by document type and recency, joined to source metadata

SELECT
    dc.chunk_text,
    dc.chunk_index,
    d.title         AS source_title,
    d.source_url,
    1 - (dc.embedding <=> $1::vector) AS relevance_score
FROM document_chunks dc
JOIN documents d ON d.id = dc.document_id
WHERE
    d.user_id      = $2
    AND d.doc_type IN ('contract', 'report', 'memo')
    AND d.created_at >= now() - interval '90 days'
    AND dc.embedding <=> $1::vector < 0.35
ORDER BY dc.embedding <=> $1::vector
LIMIT 20;

Reciprocal Rank Fusion: Combining Vector and Full-Text Search

For the highest retrieval quality in RAG pipelines, combine pgvector similarity with PostgreSQL's native full-text search using Reciprocal Rank Fusion (RRF):

text

-- Hybrid search: vector similarity + full-text, merged with RRF
WITH vector_results AS (
    SELECT
        id,
        ROW_NUMBER() OVER (ORDER BY embedding <=> $1::vector) AS rank
    FROM documents
    WHERE user_id = $2
    ORDER BY embedding <=> $1::vector
    LIMIT 60
),
fts_results AS (
    SELECT
        id,
        ROW_NUMBER() OVER (ORDER BY ts_rank(to_tsvector('english', body), query) DESC) AS rank
    FROM documents,
         plainto_tsquery('english', $3) query
    WHERE user_id = $2
      AND to_tsvector('english', body) @@ query
    ORDER BY rank
    LIMIT 60
),
rrf AS (
    SELECT
        COALESCE(v.id, f.id) AS id,
        COALESCE(1.0 / (60 + v.rank), 0) +
        COALESCE(1.0 / (60 + f.rank), 0) AS rrf_score
    FROM vector_results v
    FULL OUTER JOIN fts_results f ON v.id = f.id
)
SELECT d.id, d.title, r.rrf_score
FROM rrf r
JOIN documents d ON d.id = r.id
ORDER BY r.rrf_score DESC
LIMIT 10;

This query is impossible to write against a dedicated vector database without two separate systems and application-layer result merging. In PostgreSQL, it is a single query with a single network round-trip.

Performance Tuning

memory for Index Builds

HNSW and IVFFlat index builds are memory-intensive. The default maintenance_work_mem (64MB) will cause index builds to use temporary disk files and take 10–50x longer than necessary:

text

-- Set before building a pgvector index (session level)
SET maintenance_work_mem = '8GB';   -- Use 25-50% of available RAM for large indexes

-- Then build the index
CREATE INDEX documents_embedding_hnsw_idx
    ON documents USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Or configure it permanently in postgresql.conf for index builds
-- maintenance_work_mem = 4GB

Parallel Index Builds

text

-- HNSW index builds are parallelizable in pgvector 0.7+
-- This dramatically reduces build time on multi-core machines

SET max_parallel_maintenance_workers = 7;   -- use 7 additional workers
SET maintenance_work_mem = '2GB';           -- per worker

CREATE INDEX documents_embedding_hnsw_idx
    ON documents USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

With 8 workers and sufficient RAM, HNSW index build time on 10M vectors drops from ~4 hours to ~35 minutes on a 32-core machine.

Tuning ef_search at Query Time

text

-- Lower ef_search for latency-sensitive paths (e.g., autocomplete)
SET hnsw.ef_search = 20;
SELECT id, title, embedding <=> $1::vector AS dist
FROM documents ORDER BY dist LIMIT 5;

-- Higher ef_search for quality-sensitive paths (e.g., RAG context retrieval)
SET hnsw.ef_search = 100;
SELECT id, title, embedding <=> $1::vector AS dist
FROM documents ORDER BY dist LIMIT 20;

Partitioning for Scale

For datasets with natural partitioning boundaries (multi-tenant SaaS, time-series documents), table partitioning reduces the number of vectors each query must consider:

text

-- Partition documents table by user organization (tenant)
CREATE TABLE documents (
    id          BIGSERIAL,
    org_id      UUID NOT NULL,
    user_id     UUID NOT NULL,
    embedding   vector(1536),
    created_at  TIMESTAMPTZ DEFAULT now(),
    PRIMARY KEY (id, org_id)
) PARTITION BY HASH (org_id);

-- Create 8 partitions
CREATE TABLE documents_0 PARTITION OF documents FOR VALUES WITH (MODULUS 8, REMAINDER 0);
CREATE TABLE documents_1 PARTITION OF documents FOR VALUES WITH (MODULUS 8, REMAINDER 1);
-- ... through documents_7

-- Create HNSW index on each partition separately
CREATE INDEX ON documents_0 USING hnsw (embedding vector_cosine_ops) WITH (m = 16);
CREATE INDEX ON documents_1 USING hnsw (embedding vector_cosine_ops) WITH (m = 16);
-- ...

With partition pruning, a query filtered by org_id only scans one-eighth of the total data. This is 8x less memory pressure on the HNSW graph and proportionally faster queries at scale.

When pgvector Is the Wrong Choice

This section matters. Many teams reach for pgvector because it is familiar, and end up fighting performance problems they should have anticipated. Be honest with yourself about these scenarios:

Dataset Exceeds 100 Million Vectors

HNSW graph memory consumption scales linearly with dataset size. At 100M 1536-dimensional vectors with m=16, the HNSW index alone requires approximately 700 GB–1 TB of RAM to serve queries at full speed. No practical PostgreSQL server has that. You will fall back to disk I/O, and your latency will become unpredictable.

Dedicated vector databases (Qdrant, Weaviate, Milvus) use memory-mapped files, scalar quantization, and product quantization to serve billion-scale indexes from disk with predictable latency. pgvector has experimental halfvec and binary vector support in 0.7+ for compression, but dedicated systems are still significantly ahead at this scale.

Sub-5ms p99 Latency at High QPS

If your SLA requires p99 similarity search latency under 5ms at 1,000+ QPS on a dataset of tens of millions of vectors, dedicated vector databases will beat pgvector. Pinecone, Qdrant, and Weaviate are built exclusively for this workload, with SIMD-optimized distance computations, purpose-built memory layouts, and no competition from relational workloads on shared storage.

pgvector on a dedicated instance with high ef_search can achieve 10–30ms p99 on 10M vectors, which is excellent for most applications but not for real-time product recommendation at Netflix or Spotify scale.

pgvector stores dense float vectors. If you need sparse vector search (SPLADE, BM25 dense representations), multi-modal indexes, or vector quantization pipelines, purpose-built systems like Qdrant (which supports sparse vectors natively) or Weaviate (which has multi-modal modules) are more capable today.

pgvector vs Dedicated Vector Databases

Criteria	pgvector (PostgreSQL)	Pinecone	Weaviate	Qdrant
Managed overhead	Runs inside your existing Postgres. Zero additional infra if already on RDS/Cloud SQL.	Fully managed SaaS. No infra to manage; vendor lock-in.	Self-hosted or managed (Weaviate Cloud). Separate cluster to operate.	Self-hosted or Qdrant Cloud. Separate cluster to operate.
SQL joins & transactions	Full SQL. Joins, CTEs, row-level security, ACID transactions.	No. Metadata filters only; must join in application code.	No SQL. GraphQL interface; joins require application code.	No SQL. Payload filters only; joins in application code.
Cost at 5M vectors	~$0 extra if Postgres already running. ~$50–150/mo on a dedicated Postgres instance.	~$700–900/mo (Standard plan, s1.x2 pod).	~$150–300/mo (Weaviate Cloud Standard).	~$80–200/mo (Qdrant Cloud).
Max practical scale	~50M vectors with adequate RAM. Harder beyond that.	Billions of vectors. Designed for this.	Hundreds of millions to billions (with quantization).	Hundreds of millions with scalar/product quantization.
Query latency (10M vectors)	10–40ms p99 with HNSW, tuned ef_search, adequate RAM.	5–20ms p99 (managed, dedicated pods).	10–30ms p99 (self-hosted, tuned).	5–15ms p99 (self-hosted with HNSW, tuned).

Latency figures are approximate and vary significantly by hardware, index parameters, vector dimensionality, and query concurrency. Benchmark your specific workload before making architectural decisions.

A Realistic pgvector Decision Framework

Given everything above, here is the practical decision logic we use when advising clients:

Fewer than 50M vectors and you already run PostgreSQL: Start with pgvector. You can always migrate to a dedicated system later if you genuinely outgrow it. Most teams never do.
Need SQL joins, RLS, or transactional consistency between vector search and relational data: pgvector, without question. Dedicated vector databases cannot do this.
More than 100M vectors, sub-5ms p99, or billion-scale: Evaluate Qdrant or Weaviate (self-hosted for cost control) or Pinecone (fully managed). Quantify the operational cost honestly before choosing Pinecone's convenience premium.
Multi-tenant SaaS with per-tenant isolation requirements: pgvector with partitioning + row-level security is dramatically simpler than replicating RLS logic across a separate vector database.

Get Expert Help With pgvector and PostgreSQL

Implementing pgvector correctly—index selection, memory tuning, hybrid search architecture, and integration with your embedding pipeline—requires hands-on PostgreSQL experience. Getting the index parameters wrong on a 10M-vector dataset means either wasting gigabytes of RAM or watching your queries degrade under load.

JusDB's team of PostgreSQL consultants has helped SaaS companies, fintech platforms, and enterprise teams implement pgvector in production. We cover the full lifecycle: schema design, index strategy, embedding pipeline integration, query optimization, monitoring, and capacity planning.

PostgreSQL Consulting Services Talk to a PostgreSQL Expert

We also offer a PostgreSQL health check specifically for teams adding vector workloads to existing Postgres instances — reviewing index configuration, memory allocation, and query plans before a production launch, not after the first incident.

PostgreSQL Explained: A Complete Guide for Modern Applications — foundations before adding vector capabilities
PostgreSQL VACUUM Tuning: A Comprehensive Guide — critical for tables with frequent embedding updates
Database Schema Design Fundamentals for Scalability — how to model document chunks and metadata alongside vectors

PostgreSQL as a Vector Database: A Complete Guide

What pgvector Actually Does

Setting Up pgvector

Installation

Enable the Extension

Create a Table with a Vector Column

Insert Embeddings

Index Types: HNSW vs IVFFlat

HNSW (Hierarchical Navigable Small World)

IVFFlat (Inverted File with Flat Quantization)

Tuning ef_search for HNSW at Query Time

Similarity Search Operators: Which Distance Metric to Use

<-> L2 Distance (Euclidean)

<=> Cosine Distance

<#> Negative Inner Product

Hybrid Search: The Key Advantage Over Dedicated Vector Databases

Semantic Search with Permission Filtering

RAG Pipeline with Metadata Filters

Reciprocal Rank Fusion: Combining Vector and Full-Text Search

Performance Tuning

memory for Index Builds

Parallel Index Builds

Tuning ef_search at Query Time

Partitioning for Scale

When pgvector Is the Wrong Choice

Dataset Exceeds 100 Million Vectors

Sub-5ms p99 Latency at High QPS

Multi-Modal or Sparse Vector Requirements

pgvector vs Dedicated Vector Databases

A Realistic pgvector Decision Framework

Get Expert Help With pgvector and PostgreSQL

Related Reading

Share this article

JusDB Team

Keep reading

PostgreSQL 19 Beta: Every New Feature That Matters to DBAs

PostgreSQL Performance Tuning Playbook: A Top-Down Method for Faster Queries

PostgreSQL Architecture Deep Dive: Process Model, MVCC, WAL & Replication Explained

Need Expert Help?

PostgreSQL Consulting

PostgreSQL Migration

PostgreSQL Support

PostgreSQL High Availability

PostgreSQL Cloud Migration

PostgreSQL on Kubernetes

`<->` L2 Distance (Euclidean)

`<=>` Cosine Distance

`<#>` Negative Inner Product