DuckDB for Database Engineers: In-Process Analytics Guide 2025

DuckDB runs analytical SQL queries in-process — no server, no cluster, no ETL pipeline. A single Python import and you're querying 10GB Parquet files at 2GB/s. For database engineers who've spent years managing Redshift clusters, this changes the economics of ad-hoc analytics.

TL;DR

DuckDB is an in-process OLAP database — it runs inside your Python, R, or CLI session
Vectorized execution with columnar storage delivers sub-second queries on GB-scale data
Native Parquet, CSV, JSON, and Arrow support — query files directly without ingestion
Not a replacement for Postgres/MySQL: DuckDB is for analytics, not OLTP transactions

What Is DuckDB?

DuckDB is an in-process SQL OLAP database management system. Unlike traditional databases that run as separate server processes, DuckDB embeds directly into your application. It uses vectorized query execution — processing data in column-oriented batches of 1024–2048 values — which makes it 10–100x faster than row-at-a-time processing for analytical workloads.

How DuckDB's Execution Engine Works

Vectorized Columnar Processing

DuckDB stores data in column-oriented format and processes it in vectors. A filter on a 100M-row table reads only the relevant column bytes, then applies SIMD instructions to evaluate predicates across 1024 values simultaneously.

python

import duckdb

con = duckdb.connect()

# Query a 10GB Parquet file directly from S3
result = con.sql(
    "SELECT date_trunc('month', event_ts) as month, event_type, "
    "count(*) as events, avg(duration_ms) as avg_duration "
    "FROM read_parquet('s3://my-bucket/events/*.parquet') "
    "WHERE event_ts >= '2024-01-01' "
    "GROUP BY 1, 2 ORDER BY 1, 3 DESC"
).fetchdf()

print(result.head(20))

Direct File Querying

DuckDB's killer feature for data engineers: query Parquet, CSV, JSON, and Arrow files without any ingestion step. It uses predicate pushdown and projection pushdown to read only the required bytes from object storage.

sql

-- Query multiple Parquet files with glob
SELECT * FROM read_parquet('data/events/2024-*.parquet')
WHERE user_id = 12345;

-- Query CSV with automatic type inference
SELECT AVG(revenue), MAX(order_date)
FROM read_csv_auto('orders.csv');

-- Join across different file formats
SELECT p.product_name, SUM(o.quantity) as total_sold
FROM read_parquet('products.parquet') p
JOIN read_csv_auto('orders.csv') o ON p.id = o.product_id
GROUP BY 1 ORDER BY 2 DESC;

DuckDB vs PostgreSQL vs ClickHouse

Dimension	DuckDB	PostgreSQL	ClickHouse
Architecture	In-process (embedded)	Client-server	Client-server cluster
Best for	Ad-hoc analytics, ETL	OLTP transactions	High-volume OLAP
Data scale	GB–low TB	GB–TB	TB–PB
Setup time	0 (pip install duckdb)	Minutes	Hours
Parquet support	Native, predicate pushdown	Via extension	Native

Production Use Cases for Database Engineers

Replacing Heavy ETL Pipelines

python

import duckdb

con = duckdb.connect('analytics.duckdb')

# Incremental load from S3 Parquet
con.execute(
    "INSERT INTO daily_aggregates "
    "SELECT date_trunc('day', event_ts) as day, user_segment, "
    "count(*) as events, sum(revenue_cents) / 100.0 as revenue "
    "FROM read_parquet('s3://events/2025/06/24/*.parquet') "
    "GROUP BY 1, 2"
)

con.execute("CHECKPOINT")

PostgreSQL Integration via postgres_scanner

python

import duckdb

con = duckdb.connect()
con.execute("INSTALL postgres; LOAD postgres;")

con.execute(
    "ATTACH 'host=prod-db.internal port=5432 dbname=appdb user=analyst' "
    "AS pg (TYPE postgres, READ_ONLY)"
)

# Vectorized analytics directly against Postgres tables
result = con.execute(
    "SELECT p.category, count(*) as orders, sum(o.total_cents)/100.0 as revenue "
    "FROM pg.orders o JOIN pg.products p ON o.product_id = p.id "
    "WHERE o.created_at >= '2025-01-01' "
    "GROUP BY 1 ORDER BY 3 DESC"
).fetchdf()

Tip

Use PRAGMA threads=8 to control DuckDB's parallelism. On a 16-core machine, DuckDB defaults to all cores — set this explicitly in shared environments.

Warning

DuckDB allows only one writer at a time per database file. For multi-process write workloads, use the MotherDuck managed service or partition writes by file. Do not use a DuckDB file on NFS — locking semantics are undefined.

Key Takeaways

DuckDB eliminates the server/cluster overhead for analytics — install it as a library and query GB-scale data in seconds.
Native Parquet and S3 support with predicate pushdown replaces many ETL pipeline steps.
The postgres_scanner extension lets you run vectorized analytics directly against PostgreSQL tables without a separate data warehouse.
DuckDB is single-writer; for high-concurrency OLAP, consider ClickHouse or BigQuery instead.

Working with JusDB on Analytics Architecture

JusDB helps engineering teams design analytics pipelines that use the right tool for each layer — DuckDB for embedded analytics, ClickHouse for high-volume OLAP, and PostgreSQL for transactional workloads.

Explore JusDB Architecture Services → | Talk to a DBA

Related reading:

DuckDB for Database Engineers: In-Process Analytics Without the Cluster

What Is DuckDB?

How DuckDB's Execution Engine Works

Vectorized Columnar Processing

Direct File Querying

DuckDB vs PostgreSQL vs ClickHouse

Production Use Cases for Database Engineers

Replacing Heavy ETL Pipelines

PostgreSQL Integration via postgres_scanner

Working with JusDB on Analytics Architecture

Share this article

JusDB Team

Keep reading

TimescaleDB Hypertables, Continuous Aggregates & Compression (2026 Production Guide)

ClickHouse Explained (2026): MergeTree, Distributed Engine & Real-Time OLAP

StarRocks vs ClickHouse: Architecture, Table Models & When to Choose Each (2026)

Need Expert Help?

PostgreSQL Consulting

PostgreSQL Migration

PostgreSQL Support

PostgreSQL High Availability

PostgreSQL Cloud Migration

PostgreSQL on Kubernetes