Last April, a Series B SaaS team reached out to JusDB in a familiar situation. Their AI-powered document search feature was running on two separate systems: PostgreSQL for user data, permissions, and document metadata, and Pinecone for vector similarity search. The Pinecone bill had climbed to $1,800 per month. Their infrastructure engineer was maintaining two connection pools, two sets of credentials, two monitoring dashboards, and a custom sync layer to keep document IDs consistent between both systems.
Their average similarity search latency was 180ms end-to-end, most of which was the round-trip to Pinecone plus the subsequent PostgreSQL join to fetch the actual document content and check access permissions. Users searching over their own documents were waiting nearly two-tenths of a second per query.
We migrated them to pgvector inside their existing PostgreSQL 16 instance over a single weekend. The Pinecone bill went to zero. End-to-end query latency dropped to 28ms because the vector search and the permission check now run in the same query, on the same machine, inside a single transaction. The sync layer was deleted. Their codebase got 400 lines shorter.
This guide explains exactly how we did it, when this approach is the right call, and when pgvector will not be enough.
- pgvector is a PostgreSQL extension that adds a
vectordata type and approximate nearest-neighbor (ANN) indexes directly inside Postgres. - It eliminates a dedicated vector database for most teams with fewer than 50–100 million vectors and latency requirements above 5–10ms.
- HNSW indexes give the best query latency at the cost of larger memory and slower build time. IVFFlat builds faster and uses less RAM but queries slightly slower.
- The real superpower is hybrid search: vector similarity and SQL
WHEREclauses in a single query, with proper row-level security, joins, and transactions. - For datasets exceeding 100M vectors or requiring sub-5ms p99 at massive QPS, dedicated vector databases (Pinecone, Qdrant, Weaviate) will outperform pgvector.
What pgvector Actually Does
pgvector is a PostgreSQL extension maintained by Andrew Kane, now with significant contributions from AWS, Supabase, and Neon. It was first released in 2021 and reached the critical HNSW index support milestone in version 0.5.0 (October 2023), which fundamentally changed its performance characteristics at scale.
The extension adds three things to PostgreSQL:
- A
vector(n)data type that stores a fixed-dimension array of 32-bit floats. You declare the dimensionality at schema definition time. OpenAItext-embedding-3-smallproduces 1536-dimensional vectors;text-embedding-3-largeproduces 3072; Cohere embed-v3 produces 1024. - Two index types: HNSW (Hierarchical Navigable Small World) and IVFFlat (Inverted File with Flat quantization). Both are approximate nearest-neighbor indexes. Exact search (sequential scan) is also available for small datasets.
- Three distance operators:
<->for L2 (Euclidean),<=>for cosine distance, and<#>for negative inner product.
Under the hood, pgvector stores vector data in standard PostgreSQL heap pages. This means vectors participate fully in PostgreSQL's MVCC, WAL, VACUUM, replication, and backup systems. There is no separate process to manage, no separate WAL stream to monitor, and no separate backup job to schedule.
Setting Up pgvector
Installation
On most Linux distributions, pgvector is available through the PostgreSQL PGDG repository:
# Debian / Ubuntu sudo apt install postgresql-16-pgvector # RHEL / Amazon Linux 2023 sudo yum install pgvector_16 # macOS via Homebrew brew install pgvector # Build from source (any platform) git clone --branch v0.8.0 https://github.com/pgvector/pgvector.git cd pgvector make make install
On managed services: pgvector is available on Amazon RDS PostgreSQL 15+, Amazon Aurora PostgreSQL 15.2+, Google Cloud SQL PostgreSQL 14+, Azure Database for PostgreSQL Flexible Server 13+, Supabase (all tiers), and Neon (all tiers). No installation step is required on these platforms.
Enable the Extension
-- Run once per database, requires SUPERUSER or CREATE privilege CREATE EXTENSION IF NOT EXISTS vector; -- Verify SELECT extversion FROM pg_extension WHERE extname = 'vector'; -- 0.8.0
Create a Table with a Vector Column
This example models a RAG (Retrieval-Augmented Generation) pipeline for a document search application:
CREATE TABLE documents (
id BIGSERIAL PRIMARY KEY,
user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
title TEXT NOT NULL,
body TEXT NOT NULL,
source_url TEXT,
embedding vector(1536), -- OpenAI text-embedding-3-small
created_at TIMESTAMPTZ DEFAULT now(),
updated_at TIMESTAMPTZ DEFAULT now()
);
-- Standard indexes for relational queries
CREATE INDEX ON documents (user_id);
CREATE INDEX ON documents (created_at DESC);
Insert Embeddings
Generate embeddings in application code (Python example using OpenAI SDK), then insert:
-- Individual insert
INSERT INTO documents (user_id, title, body, embedding)
VALUES (
'550e8400-e29b-41d4-a716-446655440000',
'Q3 2024 Financial Report',
'Revenue increased 23% year-over-year...',
'[0.0023, -0.0112, 0.0891, ...]'::vector -- 1536 floats
);
-- Bulk insert with COPY (fastest for large loads)
-- Generate a CSV with the embedding column as a bracketed float array
COPY documents (user_id, title, body, embedding)
FROM '/tmp/documents_with_embeddings.csv'
WITH (FORMAT csv, HEADER true);
From Python using psycopg2 or asyncpg, pass the embedding as a list of floats and let the pgvector Python adapter handle serialization:
import psycopg2
from pgvector.psycopg2 import register_vector
import openai
conn = psycopg2.connect(DATABASE_URL)
register_vector(conn)
client = openai.OpenAI()
def embed(text: str) -> list[float]:
response = client.embeddings.create(
model="text-embedding-3-small",
input=text
)
return response.data[0].embedding
embedding = embed("Q3 2024 Financial Report revenue...")
with conn.cursor() as cur:
cur.execute(
"INSERT INTO documents (user_id, title, body, embedding) VALUES (%s, %s, %s, %s)",
(user_id, title, body, embedding)
)
conn.commit()
Index Types: HNSW vs IVFFlat
Choosing the wrong index type is the most common pgvector performance mistake. Here is the honest trade-off:
HNSW (Hierarchical Navigable Small World)
HNSW builds a multi-layer graph structure. During a query, it navigates the graph greedily from a coarse upper layer down to fine-grained lower layers. The result is very fast queries with high recall, at the cost of significant memory and slow index builds.
Use HNSW when:
- Query latency matters more than index build time
- You can afford the memory overhead (~2-4x the raw vector data size)
- The dataset is relatively stable (infrequent bulk inserts)
- You need recall above 95%
-- HNSW index for cosine similarity (most common for text embeddings)
CREATE INDEX documents_embedding_hnsw_idx
ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (
m = 16, -- Number of bidirectional links per node.
-- Range: 2-100. Higher = better recall, more memory.
-- Default 16 is good for most cases. Use 32 for higher
-- recall requirements.
ef_construction = 64 -- Size of candidate list during index build.
-- Range: 4-1000. Higher = better recall, slower build.
-- Default 64 is reasonable. Use 128 for higher quality.
);
Memory estimate for HNSW with m=16: approximately 1.2 * dimensions * 4 bytes * num_vectors + graph_overhead. For 1M documents with 1536-dimensional embeddings and m=16, expect roughly 7–10 GB of RAM needed during queries when the index is hot in shared_buffers or the OS page cache.
IVFFlat (Inverted File with Flat Quantization)
IVFFlat partitions vectors into lists (Voronoi cells) using k-means clustering. During a query, it searches only the nearest probes lists rather than all vectors. Builds fast, uses less memory, but queries are slightly slower than HNSW at equivalent recall.
Use IVFFlat when:
- You need fast index builds (high-velocity insert workloads, frequent REINDEX)
- RAM is constrained
- Recall of 90–95% is acceptable
- Dataset size is between 100K and 10M vectors
-- IVFFlat index for cosine similarity
-- Rule of thumb: lists = sqrt(num_rows) for datasets up to ~1M rows
-- lists = num_rows / 1000 for larger datasets
CREATE INDEX documents_embedding_ivfflat_idx
ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
-- At query time, set probes to control recall vs speed trade-off
-- More probes = higher recall, slower query
-- Aim for probes = sqrt(lists) as a starting point
SET ivfflat.probes = 10;
Important IVFFlat caveat: The index must be built after you have loaded a representative sample of data. An IVFFlat index built on an empty or small table will have poor cluster quality. The general rule is: load at least 3x the target lists count in rows before building the IVFFlat index.
Tuning ef_search for HNSW at Query Time
-- ef_search controls the size of the dynamic candidate list during search. -- Higher = better recall, slower query. Default is 40. -- Tune this per-query or per-session based on your recall requirements. SET hnsw.ef_search = 100; -- higher recall, ~1.5-2x slower queries SET hnsw.ef_search = 20; -- faster queries, ~5-10% recall drop
Similarity Search Operators: Which Distance Metric to Use
pgvector supports three distance operators, and picking the wrong one silently returns wrong results:
<-> L2 Distance (Euclidean)
-- Returns documents closest in Euclidean space SELECT id, title, embedding <-> query_embedding AS distance FROM documents ORDER BY distance LIMIT 10;
Use L2 when: embeddings are normalized to unit length (many models do this by default) or when geometric proximity in the raw embedding space is meaningful. L2 and cosine are equivalent on unit-normalized vectors.
<=> Cosine Distance
-- Returns documents with highest cosine similarity (lowest cosine distance) SELECT id, title, 1 - (embedding <=> query_embedding) AS cosine_similarity FROM documents ORDER BY embedding <=> query_embedding LIMIT 10;
Use cosine distance when: working with text embeddings from OpenAI, Cohere, Voyage AI, or most sentence-transformer models. Cosine similarity measures the angle between vectors and is invariant to magnitude, making it ideal for semantic similarity tasks.
<#> Negative Inner Product
-- Inner product: returns negative of dot product (lower = more similar) SELECT id, title, (embedding <#> query_embedding) * -1 AS inner_product FROM documents ORDER BY embedding <#> query_embedding LIMIT 10;
Use inner product when: your embedding model was trained with inner product as the similarity metric (some bi-encoder models for dense retrieval like SPLADE variants, or when the embedding provider documentation explicitly recommends it). Also the fastest of the three operators to compute.
Hybrid Search: The Key Advantage Over Dedicated Vector Databases
This is where pgvector's real value becomes obvious. In a dedicated vector database, similarity search returns a list of IDs. You then take those IDs back to your relational database for a second query to fetch actual data, apply business logic filters, and check permissions. That second query is a round-trip, and it breaks transactional consistency.
With pgvector, the similarity search and all relational logic run in a single SQL statement:
Semantic Search with Permission Filtering
-- Find the 10 most semantically similar documents for a given user
-- Permission check (user_id) runs inside the same query as vector similarity
-- No round-trips, no consistency gaps
SELECT
d.id,
d.title,
d.body,
1 - (d.embedding <=> $1::vector) AS similarity
FROM documents d
WHERE
d.user_id = $2 -- permission filter
AND d.embedding <=> $1::vector < 0.4 -- similarity threshold (cosine distance)
ORDER BY d.embedding <=> $1::vector
LIMIT 10;
RAG Pipeline with Metadata Filters
-- RAG retrieval: find relevant document chunks for a given query,
-- filtered by document type and recency, joined to source metadata
SELECT
dc.chunk_text,
dc.chunk_index,
d.title AS source_title,
d.source_url,
1 - (dc.embedding <=> $1::vector) AS relevance_score
FROM document_chunks dc
JOIN documents d ON d.id = dc.document_id
WHERE
d.user_id = $2
AND d.doc_type IN ('contract', 'report', 'memo')
AND d.created_at >= now() - interval '90 days'
AND dc.embedding <=> $1::vector < 0.35
ORDER BY dc.embedding <=> $1::vector
LIMIT 20;
Reciprocal Rank Fusion: Combining Vector and Full-Text Search
For the highest retrieval quality in RAG pipelines, combine pgvector similarity with PostgreSQL's native full-text search using Reciprocal Rank Fusion (RRF):
-- Hybrid search: vector similarity + full-text, merged with RRF
WITH vector_results AS (
SELECT
id,
ROW_NUMBER() OVER (ORDER BY embedding <=> $1::vector) AS rank
FROM documents
WHERE user_id = $2
ORDER BY embedding <=> $1::vector
LIMIT 60
),
fts_results AS (
SELECT
id,
ROW_NUMBER() OVER (ORDER BY ts_rank(to_tsvector('english', body), query) DESC) AS rank
FROM documents,
plainto_tsquery('english', $3) query
WHERE user_id = $2
AND to_tsvector('english', body) @@ query
ORDER BY rank
LIMIT 60
),
rrf AS (
SELECT
COALESCE(v.id, f.id) AS id,
COALESCE(1.0 / (60 + v.rank), 0) +
COALESCE(1.0 / (60 + f.rank), 0) AS rrf_score
FROM vector_results v
FULL OUTER JOIN fts_results f ON v.id = f.id
)
SELECT d.id, d.title, r.rrf_score
FROM rrf r
JOIN documents d ON d.id = r.id
ORDER BY r.rrf_score DESC
LIMIT 10;
This query is impossible to write against a dedicated vector database without two separate systems and application-layer result merging. In PostgreSQL, it is a single query with a single network round-trip.
Performance Tuning
memory for Index Builds
HNSW and IVFFlat index builds are memory-intensive. The default maintenance_work_mem (64MB) will cause index builds to use temporary disk files and take 10–50x longer than necessary:
-- Set before building a pgvector index (session level)
SET maintenance_work_mem = '8GB'; -- Use 25-50% of available RAM for large indexes
-- Then build the index
CREATE INDEX documents_embedding_hnsw_idx
ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- Or configure it permanently in postgresql.conf for index builds
-- maintenance_work_mem = 4GB
Parallel Index Builds
-- HNSW index builds are parallelizable in pgvector 0.7+
-- This dramatically reduces build time on multi-core machines
SET max_parallel_maintenance_workers = 7; -- use 7 additional workers
SET maintenance_work_mem = '2GB'; -- per worker
CREATE INDEX documents_embedding_hnsw_idx
ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
With 8 workers and sufficient RAM, HNSW index build time on 10M vectors drops from ~4 hours to ~35 minutes on a 32-core machine.
Tuning ef_search at Query Time
-- Lower ef_search for latency-sensitive paths (e.g., autocomplete) SET hnsw.ef_search = 20; SELECT id, title, embedding <=> $1::vector AS dist FROM documents ORDER BY dist LIMIT 5; -- Higher ef_search for quality-sensitive paths (e.g., RAG context retrieval) SET hnsw.ef_search = 100; SELECT id, title, embedding <=> $1::vector AS dist FROM documents ORDER BY dist LIMIT 20;
Partitioning for Scale
For datasets with natural partitioning boundaries (multi-tenant SaaS, time-series documents), table partitioning reduces the number of vectors each query must consider:
-- Partition documents table by user organization (tenant)
CREATE TABLE documents (
id BIGSERIAL,
org_id UUID NOT NULL,
user_id UUID NOT NULL,
embedding vector(1536),
created_at TIMESTAMPTZ DEFAULT now(),
PRIMARY KEY (id, org_id)
) PARTITION BY HASH (org_id);
-- Create 8 partitions
CREATE TABLE documents_0 PARTITION OF documents FOR VALUES WITH (MODULUS 8, REMAINDER 0);
CREATE TABLE documents_1 PARTITION OF documents FOR VALUES WITH (MODULUS 8, REMAINDER 1);
-- ... through documents_7
-- Create HNSW index on each partition separately
CREATE INDEX ON documents_0 USING hnsw (embedding vector_cosine_ops) WITH (m = 16);
CREATE INDEX ON documents_1 USING hnsw (embedding vector_cosine_ops) WITH (m = 16);
-- ...
With partition pruning, a query filtered by org_id only scans one-eighth of the total data. This is 8x less memory pressure on the HNSW graph and proportionally faster queries at scale.
When pgvector Is the Wrong Choice
This section matters. Many teams reach for pgvector because it is familiar, and end up fighting performance problems they should have anticipated. Be honest with yourself about these scenarios:
Dataset Exceeds 100 Million Vectors
HNSW graph memory consumption scales linearly with dataset size. At 100M 1536-dimensional vectors with m=16, the HNSW index alone requires approximately 700 GB–1 TB of RAM to serve queries at full speed. No practical PostgreSQL server has that. You will fall back to disk I/O, and your latency will become unpredictable.
Dedicated vector databases (Qdrant, Weaviate, Milvus) use memory-mapped files, scalar quantization, and product quantization to serve billion-scale indexes from disk with predictable latency. pgvector has experimental halfvec and binary vector support in 0.7+ for compression, but dedicated systems are still significantly ahead at this scale.
Sub-5ms p99 Latency at High QPS
If your SLA requires p99 similarity search latency under 5ms at 1,000+ QPS on a dataset of tens of millions of vectors, dedicated vector databases will beat pgvector. Pinecone, Qdrant, and Weaviate are built exclusively for this workload, with SIMD-optimized distance computations, purpose-built memory layouts, and no competition from relational workloads on shared storage.
pgvector on a dedicated instance with high ef_search can achieve 10–30ms p99 on 10M vectors, which is excellent for most applications but not for real-time product recommendation at Netflix or Spotify scale.
Multi-Modal or Sparse Vector Requirements
pgvector stores dense float vectors. If you need sparse vector search (SPLADE, BM25 dense representations), multi-modal indexes, or vector quantization pipelines, purpose-built systems like Qdrant (which supports sparse vectors natively) or Weaviate (which has multi-modal modules) are more capable today.
pgvector vs Dedicated Vector Databases
| Criteria | pgvector (PostgreSQL) | Pinecone | Weaviate | Qdrant |
|---|---|---|---|---|
| Managed overhead | Runs inside your existing Postgres. Zero additional infra if already on RDS/Cloud SQL. | Fully managed SaaS. No infra to manage; vendor lock-in. | Self-hosted or managed (Weaviate Cloud). Separate cluster to operate. | Self-hosted or Qdrant Cloud. Separate cluster to operate. |
| SQL joins & transactions | Full SQL. Joins, CTEs, row-level security, ACID transactions. | No. Metadata filters only; must join in application code. | No SQL. GraphQL interface; joins require application code. | No SQL. Payload filters only; joins in application code. |
| Cost at 5M vectors | ~$0 extra if Postgres already running. ~$50–150/mo on a dedicated Postgres instance. | ~$700–900/mo (Standard plan, s1.x2 pod). | ~$150–300/mo (Weaviate Cloud Standard). | ~$80–200/mo (Qdrant Cloud). |
| Max practical scale | ~50M vectors with adequate RAM. Harder beyond that. | Billions of vectors. Designed for this. | Hundreds of millions to billions (with quantization). | Hundreds of millions with scalar/product quantization. |
| Query latency (10M vectors) | 10–40ms p99 with HNSW, tuned ef_search, adequate RAM. | 5–20ms p99 (managed, dedicated pods). | 10–30ms p99 (self-hosted, tuned). | 5–15ms p99 (self-hosted with HNSW, tuned). |
Latency figures are approximate and vary significantly by hardware, index parameters, vector dimensionality, and query concurrency. Benchmark your specific workload before making architectural decisions.
A Realistic pgvector Decision Framework
Given everything above, here is the practical decision logic we use when advising clients:
- Fewer than 50M vectors and you already run PostgreSQL: Start with pgvector. You can always migrate to a dedicated system later if you genuinely outgrow it. Most teams never do.
- Need SQL joins, RLS, or transactional consistency between vector search and relational data: pgvector, without question. Dedicated vector databases cannot do this.
- More than 100M vectors, sub-5ms p99, or billion-scale: Evaluate Qdrant or Weaviate (self-hosted for cost control) or Pinecone (fully managed). Quantify the operational cost honestly before choosing Pinecone's convenience premium.
- Multi-tenant SaaS with per-tenant isolation requirements: pgvector with partitioning + row-level security is dramatically simpler than replicating RLS logic across a separate vector database.
Get Expert Help With pgvector and PostgreSQL
Implementing pgvector correctly—index selection, memory tuning, hybrid search architecture, and integration with your embedding pipeline—requires hands-on PostgreSQL experience. Getting the index parameters wrong on a 10M-vector dataset means either wasting gigabytes of RAM or watching your queries degrade under load.
JusDB's team of PostgreSQL consultants has helped SaaS companies, fintech platforms, and enterprise teams implement pgvector in production. We cover the full lifecycle: schema design, index strategy, embedding pipeline integration, query optimization, monitoring, and capacity planning.
PostgreSQL Consulting Services Talk to a PostgreSQL Expert
We also offer a PostgreSQL health check specifically for teams adding vector workloads to existing Postgres instances — reviewing index configuration, memory allocation, and query plans before a production launch, not after the first incident.
Related Reading
- PostgreSQL Explained: A Complete Guide for Modern Applications — foundations before adding vector capabilities
- PostgreSQL VACUUM Tuning: A Comprehensive Guide — critical for tables with frequent embedding updates
- Database Schema Design Fundamentals for Scalability — how to model document chunks and metadata alongside vectors