DuckDB runs analytical SQL queries in-process — no server, no cluster, no ETL pipeline. A single Python import and you're querying 10GB Parquet files at 2GB/s. For database engineers who've spent years managing Redshift clusters, this changes the economics of ad-hoc analytics.
- DuckDB is an in-process OLAP database — it runs inside your Python, R, or CLI session
- Vectorized execution with columnar storage delivers sub-second queries on GB-scale data
- Native Parquet, CSV, JSON, and Arrow support — query files directly without ingestion
- Not a replacement for Postgres/MySQL: DuckDB is for analytics, not OLTP transactions
What Is DuckDB?
DuckDB is an in-process SQL OLAP database management system. Unlike traditional databases that run as separate server processes, DuckDB embeds directly into your application. It uses vectorized query execution — processing data in column-oriented batches of 1024–2048 values — which makes it 10–100x faster than row-at-a-time processing for analytical workloads.
How DuckDB's Execution Engine Works
Vectorized Columnar Processing
DuckDB stores data in column-oriented format and processes it in vectors. A filter on a 100M-row table reads only the relevant column bytes, then applies SIMD instructions to evaluate predicates across 1024 values simultaneously.
import duckdb
con = duckdb.connect()
# Query a 10GB Parquet file directly from S3
result = con.sql(
"SELECT date_trunc('month', event_ts) as month, event_type, "
"count(*) as events, avg(duration_ms) as avg_duration "
"FROM read_parquet('s3://my-bucket/events/*.parquet') "
"WHERE event_ts >= '2024-01-01' "
"GROUP BY 1, 2 ORDER BY 1, 3 DESC"
).fetchdf()
print(result.head(20))Direct File Querying
DuckDB's killer feature for data engineers: query Parquet, CSV, JSON, and Arrow files without any ingestion step. It uses predicate pushdown and projection pushdown to read only the required bytes from object storage.
-- Query multiple Parquet files with glob
SELECT * FROM read_parquet('data/events/2024-*.parquet')
WHERE user_id = 12345;
-- Query CSV with automatic type inference
SELECT AVG(revenue), MAX(order_date)
FROM read_csv_auto('orders.csv');
-- Join across different file formats
SELECT p.product_name, SUM(o.quantity) as total_sold
FROM read_parquet('products.parquet') p
JOIN read_csv_auto('orders.csv') o ON p.id = o.product_id
GROUP BY 1 ORDER BY 2 DESC;DuckDB vs PostgreSQL vs ClickHouse
| Dimension | DuckDB | PostgreSQL | ClickHouse |
|---|---|---|---|
| Architecture | In-process (embedded) | Client-server | Client-server cluster |
| Best for | Ad-hoc analytics, ETL | OLTP transactions | High-volume OLAP |
| Data scale | GB–low TB | GB–TB | TB–PB |
| Setup time | 0 (pip install duckdb) | Minutes | Hours |
| Parquet support | Native, predicate pushdown | Via extension | Native |
Production Use Cases for Database Engineers
Replacing Heavy ETL Pipelines
import duckdb
con = duckdb.connect('analytics.duckdb')
# Incremental load from S3 Parquet
con.execute(
"INSERT INTO daily_aggregates "
"SELECT date_trunc('day', event_ts) as day, user_segment, "
"count(*) as events, sum(revenue_cents) / 100.0 as revenue "
"FROM read_parquet('s3://events/2025/06/24/*.parquet') "
"GROUP BY 1, 2"
)
con.execute("CHECKPOINT")PostgreSQL Integration via postgres_scanner
import duckdb
con = duckdb.connect()
con.execute("INSTALL postgres; LOAD postgres;")
con.execute(
"ATTACH 'host=prod-db.internal port=5432 dbname=appdb user=analyst' "
"AS pg (TYPE postgres, READ_ONLY)"
)
# Vectorized analytics directly against Postgres tables
result = con.execute(
"SELECT p.category, count(*) as orders, sum(o.total_cents)/100.0 as revenue "
"FROM pg.orders o JOIN pg.products p ON o.product_id = p.id "
"WHERE o.created_at >= '2025-01-01' "
"GROUP BY 1 ORDER BY 3 DESC"
).fetchdf()Use PRAGMA threads=8 to control DuckDB's parallelism. On a 16-core machine, DuckDB defaults to all cores — set this explicitly in shared environments.
DuckDB allows only one writer at a time per database file. For multi-process write workloads, use the MotherDuck managed service or partition writes by file. Do not use a DuckDB file on NFS — locking semantics are undefined.
- DuckDB eliminates the server/cluster overhead for analytics — install it as a library and query GB-scale data in seconds.
- Native Parquet and S3 support with predicate pushdown replaces many ETL pipeline steps.
- The postgres_scanner extension lets you run vectorized analytics directly against PostgreSQL tables without a separate data warehouse.
- DuckDB is single-writer; for high-concurrency OLAP, consider ClickHouse or BigQuery instead.
Working with JusDB on Analytics Architecture
JusDB helps engineering teams design analytics pipelines that use the right tool for each layer — DuckDB for embedded analytics, ClickHouse for high-volume OLAP, and PostgreSQL for transactional workloads.
Explore JusDB Architecture Services → | Talk to a DBA
Related reading: