Embedding Storage at Scale: PostgreSQL, Redis, and Dedicated Vector DBs

Choose where to store embeddings at scale — pgvector vs Redis Stack vs dedicated vector databases for production AI apps

JusDB Team
February 7, 2026

Choosing where to store your embeddings is one of the most consequential infrastructure decisions you will make when building a production AI application. Get it wrong at 10 million vectors and you are looking at a costly migration under load — something no engineering team wants to explain to their users. The landscape has matured rapidly: PostgreSQL with pgvector, Redis Stack with RediSearch, and purpose-built vector databases like Pinecone, Qdrant, and Weaviate each occupy a distinct niche defined by scale, latency requirements, and operational overhead. This post cuts through the marketing noise and gives you the engineering tradeoffs you need to pick the right store before you are locked in.

TL;DR
  • PostgreSQL + pgvector is the right default up to roughly 10 million vectors when you already run Postgres and need ACID guarantees alongside your embeddings.
  • Redis Stack with RediSearch delivers sub-millisecond to low-single-digit-millisecond p99 latency up to ~50 million vectors, but every byte lives in RAM, so budget accordingly.
  • Dedicated vector databases (Pinecone, Qdrant, Weaviate) are purpose-built for 100M+ vectors with horizontal sharding, built-in replication, and cloud-native cost models.
  • HNSW indexes offer better recall and faster query times; IVFFlat uses significantly less memory at the cost of recall tuning complexity.
  • There is no universal winner — the right answer depends on your existing stack, your vector count trajectory, and how much operational complexity your team can absorb.

Scale Thresholds for Each Option

Before diving into the mechanics of each system, it helps to anchor the conversation with concrete numbers. "Scale" in vector search is a function of three variables: total vector count, query throughput (queries per second), and the dimensionality of your embeddings. A 1536-dimensional float32 vector, the default output size of OpenAI's text-embedding-3-small, occupies 1536 × 4 bytes, or about 6 KB per row. At 10 million vectors that is roughly 60 GB of raw vector data before index overhead, which is still manageable on a single Postgres instance. At 100 million vectors you are looking at roughly 600 GB, and at that point the assumptions baked into pgvector's single-node architecture start to work against you.
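
The sizing arithmetic above is easy to rerun for other dimensionalities. A minimal sketch, assuming float32 storage and counting raw vector bytes only (no index graph, row headers, or replicas); the function names are illustrative, not part of any library:

```python
BYTES_PER_FLOAT32 = 4


def raw_vector_bytes(dim: int, bytes_per_component: int = BYTES_PER_FLOAT32) -> int:
    """Raw payload of one embedding, excluding index and row overhead."""
    return dim * bytes_per_component


def corpus_gb(num_vectors: int, dim: int) -> float:
    """Raw corpus size in decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_vectors * raw_vector_bytes(dim) / 1e9


if __name__ == "__main__":
    for n in (10_000_000, 100_000_000):
        print(f"{n:,} vectors x 1536 dims = {corpus_gb(n, 1536):,.1f} GB")
```

Treat the result as a floor: index structures, replication, and row overhead all sit on top of it.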

As a practical guide, the thresholds below represent where each system starts to show strain rather than where it completely fails. Teams routinely push past these limits with careful tuning, but doing so requires increasingly specialized knowledge and operational investment.

| System | Practical Max Vectors | p99 Query Latency | Cost / 1M Vectors / Month | Self-Hosted | ACID Transactions |
|---|---|---|---|---|---|
| PostgreSQL + pgvector | ~10M | 5–50 ms | ~$2–8 (RDS/Aurora) | Yes | Yes (full) |
| Redis Stack (RediSearch) | ~50M | 0.5–5 ms | ~$15–40 (ElastiCache) | Yes | No (eventual) |
| Pinecone (managed) | Unlimited (sharded) | 10–80 ms | ~$8–25 (pod-based) | No | No |
| Qdrant | Unlimited (sharded) | 2–30 ms | ~$3–12 (self-hosted) | Yes | Partial (WAL) |
| Weaviate | Unlimited (sharded) | 5–40 ms | ~$5–20 (self-hosted) | Yes | Partial (WAL) |

Tip

Cost estimates above assume 1536-dimensional float32 vectors with moderate query throughput (100–500 QPS). Your actual numbers will vary significantly based on dimensionality, replication factor, and whether you use quantization. Always benchmark with a representative sample of your own data before committing to a storage strategy.

pgvector in PostgreSQL (up to ~10M vectors)

pgvector is the most pragmatic choice for teams already running PostgreSQL. It ships as a first-class extension, works seamlessly with connection poolers like PgBouncer, participates fully in Postgres transactions, and requires zero additional infrastructure. If your product already stores user data, content metadata, and embeddings in the same transactional context — and you need JOIN semantics between your vector results and relational rows — pgvector is often the only sane answer.

The extension supports two index types, and choosing between them is a genuine engineering decision rather than a default-accept situation.

HNSW (Hierarchical Navigable Small World) builds a multi-layer proximity graph over your vectors at index time. Query traversal is fast, recall at reasonable hnsw.ef_search settings is typically above 95%, and once the index is built it does not require any cluster-level rebalancing. The tradeoff is memory and build time: queries stay fast only while the graph is cached, so size shared_buffers (and the instance's RAM) so that the index fits, and raise maintenance_work_mem for index builds, or you will see index scans reading from disk and blowing out your latency budget.

IVFFlat (inverted file with flat, that is uncompressed, vector storage) partitions the vector space into lists clusters using k-means, then performs an exhaustive search within the probes nearest clusters at query time. This approach builds faster and uses significantly less memory than HNSW because there is no graph structure to keep in RAM. The cost is tuning complexity: the pgvector documentation suggests starting with lists = rows / 1000 for up to 1 million rows and sqrt(rows) beyond that, while probes (a reasonable starting point is sqrt(lists)) controls the recall-latency tradeoff at query time. Because the centroids are computed once at build time, an IVFFlat index should be created after the table has data and rebuilt as the data grows or its distribution shifts, which can be operationally painful on large tables.

sql
-- Enable the extension once per database
CREATE EXTENSION IF NOT EXISTS vector;

-- Create a table with an embedding column
CREATE TABLE document_chunks (
    id          BIGSERIAL PRIMARY KEY,
    content     TEXT NOT NULL,
    source_url  TEXT,
    embedding   vector(1536),
    created_at  TIMESTAMPTZ DEFAULT now()
);

-- HNSW index: better recall, higher memory usage
-- m = number of connections per layer (default 16)
-- ef_construction = build-time search depth (default 64)
CREATE INDEX ON document_chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- IVFFlat index: less memory, requires tuning (in practice, pick one index type per column)
-- lists: start near rows / 1000 up to 1M rows, sqrt(rows) above that
CREATE INDEX ON document_chunks
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);

-- Set probes at query time for IVFFlat recall tuning
SET ivfflat.probes = 10;

-- Nearest-neighbor search with metadata filter
SELECT id, content, 1 - (embedding <=> '[0.12, 0.34, ...]'::vector) AS similarity
FROM document_chunks
WHERE source_url LIKE '%docs.example.com%'
ORDER BY embedding <=> '[0.12, 0.34, ...]'::vector
LIMIT 10;
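
Choosing lists and probes for the IVFFlat index is mostly arithmetic on the row count. A minimal sketch, assuming the starting points given in the pgvector README (rows / 1000 up to 1 million rows, sqrt(rows) above that, and probes near sqrt(lists)); the function name is hypothetical and the values are starting points to benchmark, not tuned settings:

```python
import math


def suggested_ivfflat_params(row_count: int) -> tuple[int, int]:
    """Return (lists, probes) starting points for a pgvector IVFFlat index."""
    if row_count <= 1_000_000:
        lists = max(1, row_count // 1000)
    else:
        lists = round(math.sqrt(row_count))
    # More probes means higher recall and slower queries.
    probes = max(1, round(math.sqrt(lists)))
    return lists, probes


if __name__ == "__main__":
    for rows in (100_000, 1_000_000, 10_000_000):
        lists, probes = suggested_ivfflat_params(rows)
        print(f"rows={rows:,} lists={lists} probes={probes}")
```

Under this heuristic, the lists = 100 and probes = 10 values in the SQL above correspond to a table of about 100,000 rows.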

One pgvector limitation worth calling out explicitly: parallel index scans are not supported in the same way as B-tree indexes. On tables with tens of millions of rows, query planning can sometimes choose a sequential scan over the vector index when the planner's cost estimates are off — something you will want to monitor with EXPLAIN (ANALYZE, BUFFERS) and potentially address with enable_seqscan = off in your session configuration for search endpoints.

Redis Stack with RediSearch (up to ~50M vectors, in-memory)

Redis Stack bundles the RediSearch module into the standard Redis distribution, enabling vector similarity search on top of the in-memory data structure you likely already use for caching and session management. The fundamental value proposition is latency: because the vectors and the index live entirely in RAM and a query never waits on disk I/O, you can achieve sub-millisecond to low-millisecond p99 query times that a disk-backed system like Postgres struggles to match once its working set no longer fits in memory.

RediSearch supports both HNSW and FLAT (brute-force) index types for vector fields. HNSW is the right default for any collection above a few thousand vectors. Index creation uses the FT.CREATE command with a VECTOR field type and an explicit algorithm specification.

redis
-- Redis CLI: create a vector index on a hash field
-- The number after HNSW counts the attribute tokens that follow (6 name/value pairs = 12)
FT.CREATE document_idx
  ON HASH
  PREFIX 1 "doc:"
  SCHEMA
    content TEXT WEIGHT 1.0
    source_url TAG
    embedding VECTOR HNSW 12
      TYPE FLOAT32
      DIM 1536
      DISTANCE_METRIC COSINE
      M 16
      EF_CONSTRUCTION 200
      EF_RUNTIME 10

-- Insert a document with embedding (the embedding value is raw float32 bytes)
HSET doc:001
  content "PostgreSQL is an advanced relational database"
  source_url "docs.example.com"
  embedding "\x3d\xcc\x8c\x3e..."

-- KNN vector search with pre-filter on TAG field
-- The filter and the KNN clause must be passed as a single query string
FT.SEARCH document_idx
  "(@source_url:{docs\\.example\\.com})=>[KNN 10 @embedding $query_vec AS score]"
  PARAMS 2 query_vec "\x3d\xcc\x8c\x3e..."
  SORTBY score ASC
  DIALECT 2
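
The embedding value in the HSET and PARAMS arguments above is a raw byte string, not a JSON array, so client code has to serialize the vector first. A minimal sketch using only the Python standard library, assuming little-endian float32 (the layout common hardware produces); the helper names are illustrative:

```python
import struct


def to_float32_blob(vector: list[float]) -> bytes:
    """Pack floats as consecutive little-endian float32 values (4 bytes each)."""
    return struct.pack(f"<{len(vector)}f", *vector)


def from_float32_blob(blob: bytes) -> list[float]:
    """Inverse of to_float32_blob, useful when debugging stored values."""
    count = len(blob) // 4
    return list(struct.unpack(f"<{count}f", blob))


if __name__ == "__main__":
    blob = to_float32_blob([0.5, -1.25, 2.0])
    print(len(blob), from_float32_blob(blob))
```

A 1536-dimensional vector packed this way is exactly 6,144 bytes, which must match the DIM declared in the index schema.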

The operational constraint that dominates every Redis vector deployment is memory. At 1536 dimensions and float32 precision, each vector consumes 6 KB before overhead. An HNSW index adds graph metadata that can push per-vector memory to 8–12 KB depending on your M setting. At 50 million vectors that translates to 400–600 GB of RAM — at cloud instance prices, that is a significant monthly cost that must be weighed against the latency benefits. Redis persistence (RDB + AOF) mitigates durability risk but does not reduce the memory footprint, and Redis Cluster sharding distributes the memory requirement across nodes but adds latency jitter from cross-slot operations.
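
To turn the per-vector figures above into a RAM budget, multiply through and account for replicas. A rough sketch, assuming the 8–12 KB per-vector footprint quoted above and decimal units; the copies parameter (primary plus replicas) is an assumption added for illustration:

```python
def redis_ram_gb(num_vectors: int, kb_per_vector: float, copies: int = 1) -> float:
    """Estimated RAM in decimal GB (1 KB = 1000 bytes, 1 GB = 10**9 bytes).

    copies counts the primary plus any replicas holding the full index.
    """
    return num_vectors * kb_per_vector * 1000 * copies / 1e9


if __name__ == "__main__":
    low = redis_ram_gb(50_000_000, 8)
    high = redis_ram_gb(50_000_000, 12)
    print(f"single copy: {low:.0f}-{high:.0f} GB")
    print(f"with one replica: {redis_ram_gb(50_000_000, 8, copies=2):.0f} GB and up")
```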

Tip

Use Redis vector search for the "hot" fraction of your corpus — the most recently accessed or highest-priority embeddings — and fall back to pgvector or a dedicated vector DB for the long tail. This tiered architecture can cut your Redis memory costs by 60–80% while preserving sub-millisecond latency for the queries that matter most.
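
One way to wire up such a tiered lookup is to query the hot tier first and consult the cold tier only when the hot results are too few or too distant. The sketch below is an assumption about how a team might do this, not a feature of Redis or pgvector: the two search callables, the Hit shape, and the max_distance threshold are all hypothetical.

```python
from typing import Callable, Sequence

Hit = tuple[str, float]  # (document id, cosine distance; lower is closer)
SearchFn = Callable[[Sequence[float], int], list[Hit]]


def tiered_search(
    query: Sequence[float],
    k: int,
    hot: SearchFn,
    cold: SearchFn,
    max_distance: float = 0.25,
) -> list[Hit]:
    """Serve from the hot tier when it yields k close hits, else merge in the cold tier."""
    hot_hits = [hit for hit in hot(query, k) if hit[1] <= max_distance]
    if len(hot_hits) >= k:
        return hot_hits[:k]
    # Fall back: merge both tiers, keeping the smallest distance per document id.
    merged = dict(cold(query, k))
    for doc_id, dist in hot_hits:
        merged[doc_id] = min(dist, merged.get(doc_id, dist))
    return sorted(merged.items(), key=lambda hit: hit[1])[:k]
```

The threshold decides how often the slow path runs, so it needs tuning against real query logs; set it too loosely and the hot tier will mask better matches in the long tail.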

Dedicated Vector DBs at 100M+ Scale (Pinecone, Qdrant, Weaviate)

When your vector count crosses 100 million or your throughput requirements exceed what a single-node system can provide, dedicated vector databases become the pragmatic choice. Each of the major players has made different architectural bets, and those bets have meaningful operational implications.

Pinecone is fully managed with no self-hosted option. Its pod-based pricing model is predictable and its API is deliberately simple: you get upsert, query, fetch, and delete, and that is roughly the full surface area. Pinecone uses a proprietary ANN algorithm and stores indexes across distributed SSDs with in-memory caching for hot segments. The managed nature means very little operational overhead but also means you cannot inspect or tune the underlying index, and egress costs can surprise teams with high-throughput read workloads.

Qdrant is a Rust-based open-source vector database with an optional managed cloud offering. It supports HNSW as its primary index type with extensive tuning knobs, payload filtering with a rich query language, sparse vectors (useful for hybrid dense+sparse retrieval), and on-disk indexes for cost-sensitive deployments. Qdrant's WAL-based persistence model gives you crash recovery without full ACID semantics. Its quantization support (scalar, product, binary) can reduce vector memory usage from roughly 4x with scalar quantization to 32x with binary quantization, at a recall penalty that grows with the compression ratio, which is a critical feature at 100M+ scale.
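
To make the 4x figure for scalar quantization concrete: each float32 component (4 bytes) is mapped to a single byte. The sketch below is a simplified per-vector min/max scheme written for illustration; it is not Qdrant's implementation, which calibrates ranges differently, and the function names are hypothetical.

```python
def scalar_quantize(vector: list[float]) -> tuple[bytes, float, float]:
    """Map each component to one unsigned byte over the vector's own [min, max] range."""
    lo, hi = min(vector), max(vector)
    scale = (hi - lo) / 255 or 1.0  # constant vectors would otherwise divide by zero
    codes = bytes(min(255, round((x - lo) / scale)) for x in vector)
    return codes, lo, scale


def dequantize(codes: bytes, lo: float, scale: float) -> list[float]:
    """Approximate reconstruction; each component is off by at most scale / 2."""
    return [lo + code * scale for code in codes]


if __name__ == "__main__":
    original = [-1.0, -0.2, 0.3, 1.0]
    codes, lo, scale = scalar_quantize(original)
    print(len(codes), "bytes instead of", 4 * len(original))
    print(dequantize(codes, lo, scale))
```

The reconstruction error is bounded by half a quantization step, which is why the recall penalty at this compression level tends to be small.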

Weaviate positions itself as a "vector-first" database with built-in support for generative search modules. It can call embedding models directly via its vectorizer module system, which reduces the operational surface area for teams that want a more integrated pipeline. Weaviate uses HNSW by default and supports multi-tenancy at the index level, which matters for SaaS products where each customer's data must be isolated. Its ACORN filter pushdown optimizes filtered vector search significantly compared to naive post-filtering approaches.

Cost and Operational Comparison

Cost comparisons across these systems are genuinely difficult to make on a per-vector basis because the cost drivers are so different. Pinecone charges by pod type and count; Redis charges by memory; Postgres charges by compute and storage; Qdrant and Weaviate on self-hosted infrastructure charge by whatever your cloud provider charges for the underlying VMs and block storage.

A more useful framing is total cost of ownership at a given operational maturity level. For a team of three engineers with no dedicated infra specialty, a managed service like Pinecone or a simple pgvector deployment on RDS is almost always cheaper in total than self-hosting Qdrant or Weaviate, even if the raw infrastructure cost is lower. The hidden costs are in on-call burden, upgrade cycles, backup verification, and the engineering time spent tuning index parameters rather than building product features.

For teams with dedicated platform engineering, self-hosted Qdrant on spot instances with quantization enabled is typically the most cost-efficient path above 50 million vectors. Binary quantization in Qdrant can reduce memory requirements by 32x relative to float32, bringing 500 million vectors into the realm of a modestly sized cluster rather than a datacenter-scale deployment.
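
The 32x figure follows directly from storing one bit per dimension instead of a 32-bit float. A minimal sketch of sign-based binary quantization with Hamming distance as the comparison, written for illustration only; it is not Qdrant's code, and production systems typically rescore the top candidates against the original vectors to recover recall.

```python
def binary_quantize(vector: list[float]) -> bytes:
    """Keep one bit per dimension: 1 if the component is positive, else 0."""
    bits = bytearray((len(vector) + 7) // 8)
    for i, x in enumerate(vector):
        if x > 0:
            bits[i // 8] |= 1 << (i % 8)
    return bytes(bits)


def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of differing bits; a cheap proxy for distance between the originals."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


if __name__ == "__main__":
    dim = 1536
    code = binary_quantize([0.1] * dim)
    print(f"{dim * 4} bytes as float32, {len(code)} bytes as bits")
```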

ACID compliance deserves special attention in this comparison. PostgreSQL + pgvector is the only option in this survey that provides full ACID transactions spanning both your relational rows and your embeddings. This matters most when your embeddings are tightly coupled to relational data that must be consistent, for example if deleting a user record must atomically remove all their associated embeddings. Redis, Pinecone, Qdrant, and Weaviate offer weaker guarantees: Qdrant and Weaviate use a write-ahead log for crash recovery, but none of the four can take part in a transaction with your relational database, so you need application-level logic to handle edge cases like partial writes or visible-but-not-yet-indexed vectors.
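
The atomic-delete case is worth seeing in code. The sketch below uses Python's built-in sqlite3 purely as a stand-in so it runs without a server; with pgvector the same two DELETE statements would sit inside one Postgres transaction. The users table and the user_id column are hypothetical and do not appear in the schema earlier in this post.

```python
import sqlite3


def delete_user_atomically(conn: sqlite3.Connection, user_id: int) -> None:
    """Remove a user and all of their embeddings in one transaction."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute("DELETE FROM document_chunks WHERE user_id = ?", (user_id,))
        conn.execute("DELETE FROM users WHERE id = ?", (user_id,))


def demo() -> tuple[int, int]:
    """Return (remaining users, remaining chunks) after deleting user 1."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    conn.execute(
        "CREATE TABLE document_chunks "
        "(id INTEGER PRIMARY KEY, user_id INTEGER, embedding BLOB)"
    )
    conn.execute("INSERT INTO users VALUES (1, 'a@example.com')")
    conn.executemany(
        "INSERT INTO document_chunks (user_id, embedding) VALUES (?, ?)",
        [(1, b"\x00"), (1, b"\x01")],
    )
    conn.commit()
    delete_user_atomically(conn, 1)
    users = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
    chunks = conn.execute("SELECT COUNT(*) FROM document_chunks").fetchone()[0]
    return users, chunks
```

With a separate vector store, the second delete becomes a network call that can fail independently, which is where the application-level reconciliation logic comes in.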

Key Takeaways
  • Default to pgvector when you are already on PostgreSQL and your vector count is below 10 million. The operational simplicity and ACID guarantees outweigh the performance gap at this scale.
  • Choose HNSW over IVFFlat when memory is not your primary constraint. HNSW provides consistently better recall with simpler tuning, and recall degradation is the silent killer of RAG application quality.
  • IVFFlat is worth the tuning complexity when you are pushing the memory limits of a single Postgres instance and cannot yet justify a dedicated vector store.
  • Redis Stack is the right choice when your application already relies on Redis and you need sub-millisecond latency for a bounded, high-value subset of your corpus.
  • Dedicated vector databases become the correct default above 100 million vectors. Qdrant with quantization is the most cost-efficient self-hosted option; Pinecone is the lowest operational overhead managed option.
  • Never evaluate a vector store without benchmarking recall alongside latency. A system that returns results 2x faster but with 20% lower recall is often a net negative for RAG quality.
  • Plan your migration path before you need it. Moving 50 million embeddings between stores under production load is a multi-week project. Design your embedding pipeline with a storage abstraction layer from day one.
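
The last two takeaways combine naturally: a thin storage interface lets you swap backends, and the same interface lets you measure recall for any backend against exact results. A minimal sketch under stated assumptions: the VectorStore protocol, ExactStore, and recall_at_k are hypothetical names, vectors are assumed to be non-zero, and a real adapter would wrap pgvector, Redis, or Qdrant client calls behind the same two methods.

```python
import math
from typing import Protocol, Sequence


class VectorStore(Protocol):
    def upsert(self, doc_id: str, vector: Sequence[float]) -> None: ...
    def search(self, query: Sequence[float], k: int) -> list[str]: ...


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


class ExactStore:
    """Brute-force in-memory store; always returns the true nearest neighbours."""

    def __init__(self) -> None:
        self._rows: dict[str, Sequence[float]] = {}

    def upsert(self, doc_id: str, vector: Sequence[float]) -> None:
        self._rows[doc_id] = vector

    def search(self, query: Sequence[float], k: int) -> list[str]:
        ranked = sorted(self._rows, key=lambda d: cosine_distance(query, self._rows[d]))
        return ranked[:k]


def recall_at_k(candidate: VectorStore, exact: VectorStore,
                queries: Sequence[Sequence[float]], k: int) -> float:
    """Fraction of the true top-k neighbours that the candidate store also returned."""
    found = total = 0
    for query in queries:
        truth = set(exact.search(query, k))
        found += len(truth & set(candidate.search(query, k)))
        total += len(truth)
    return found / total if total else 0.0
```

Running recall_at_k on a sample of production queries, with ExactStore loaded from the same sample, gives a recall number to report next to every latency figure.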

Working with JusDB on AI Infrastructure

JusDB's engineering team has helped production teams navigate embedding storage decisions across all of the systems covered in this post — from pgvector index tuning on RDS Aurora to Qdrant cluster sizing for billion-scale retrieval pipelines. We bring deep operational experience with the failure modes that do not show up in benchmarks: index corruption recovery, quantization recall degradation in production, Redis memory fragmentation under heavy write load, and Pinecone pod migration strategies during traffic spikes.

If you are evaluating your vector storage architecture, our DBAs can audit your current setup, model your scale trajectory, and deliver a concrete recommendation with implementation support — whether that means optimizing your existing pgvector deployment or architecting a zero-downtime migration to a dedicated vector database.
