Analytics & OLAP

DuckDB for Database Engineers: In-Process Analytics Without the Cluster

DuckDB brings OLAP query performance to your Python process without a server or cluster. Learn how to query Parquet files, replace heavy ETL jobs, and run analytics directly against PostgreSQL.

JusDB Team
June 24, 2025
8 min read
185 views

DuckDB runs analytical SQL queries in-process — no server, no cluster, no ETL pipeline. A single Python import and you're querying 10GB Parquet files at 2GB/s. For database engineers who've spent years managing Redshift clusters, this changes the economics of ad-hoc analytics.

TL;DR
  • DuckDB is an in-process OLAP database — it runs inside your Python, R, or CLI session
  • Vectorized execution with columnar storage delivers sub-second queries on GB-scale data
  • Native Parquet, CSV, JSON, and Arrow support — query files directly without ingestion
  • Not a replacement for Postgres/MySQL: DuckDB is for analytics, not OLTP transactions

What Is DuckDB?

DuckDB is an in-process SQL OLAP database management system. Unlike traditional databases that run as separate server processes, DuckDB embeds directly into your application. It uses vectorized query execution — processing data in column-oriented batches of 1024–2048 values — which makes it 10–100x faster than row-at-a-time processing for analytical workloads.

How DuckDB's Execution Engine Works

Vectorized Columnar Processing

DuckDB stores data in column-oriented format and processes it in vectors. A filter on a 100M-row table reads only the relevant column bytes, then applies SIMD instructions to evaluate predicates across 1024 values simultaneously.

python
import duckdb

con = duckdb.connect()

# Query a 10GB Parquet file directly from S3
result = con.sql(
    "SELECT date_trunc('month', event_ts) as month, event_type, "
    "count(*) as events, avg(duration_ms) as avg_duration "
    "FROM read_parquet('s3://my-bucket/events/*.parquet') "
    "WHERE event_ts >= '2024-01-01' "
    "GROUP BY 1, 2 ORDER BY 1, 3 DESC"
).fetchdf()

print(result.head(20))

Direct File Querying

DuckDB's killer feature for data engineers: query Parquet, CSV, JSON, and Arrow files without any ingestion step. It uses predicate pushdown and projection pushdown to read only the required bytes from object storage.

sql
-- Query multiple Parquet files with glob
SELECT * FROM read_parquet('data/events/2024-*.parquet')
WHERE user_id = 12345;

-- Query CSV with automatic type inference
SELECT AVG(revenue), MAX(order_date)
FROM read_csv_auto('orders.csv');

-- Join across different file formats
SELECT p.product_name, SUM(o.quantity) as total_sold
FROM read_parquet('products.parquet') p
JOIN read_csv_auto('orders.csv') o ON p.id = o.product_id
GROUP BY 1 ORDER BY 2 DESC;

DuckDB vs PostgreSQL vs ClickHouse

DimensionDuckDBPostgreSQLClickHouse
ArchitectureIn-process (embedded)Client-serverClient-server cluster
Best forAd-hoc analytics, ETLOLTP transactionsHigh-volume OLAP
Data scaleGB–low TBGB–TBTB–PB
Setup time0 (pip install duckdb)MinutesHours
Parquet supportNative, predicate pushdownVia extensionNative

Production Use Cases for Database Engineers

Replacing Heavy ETL Pipelines

python
import duckdb

con = duckdb.connect('analytics.duckdb')

# Incremental load from S3 Parquet
con.execute(
    "INSERT INTO daily_aggregates "
    "SELECT date_trunc('day', event_ts) as day, user_segment, "
    "count(*) as events, sum(revenue_cents) / 100.0 as revenue "
    "FROM read_parquet('s3://events/2025/06/24/*.parquet') "
    "GROUP BY 1, 2"
)

con.execute("CHECKPOINT")

PostgreSQL Integration via postgres_scanner

python
import duckdb

con = duckdb.connect()
con.execute("INSTALL postgres; LOAD postgres;")

con.execute(
    "ATTACH 'host=prod-db.internal port=5432 dbname=appdb user=analyst' "
    "AS pg (TYPE postgres, READ_ONLY)"
)

# Vectorized analytics directly against Postgres tables
result = con.execute(
    "SELECT p.category, count(*) as orders, sum(o.total_cents)/100.0 as revenue "
    "FROM pg.orders o JOIN pg.products p ON o.product_id = p.id "
    "WHERE o.created_at >= '2025-01-01' "
    "GROUP BY 1 ORDER BY 3 DESC"
).fetchdf()
Tip

Use PRAGMA threads=8 to control DuckDB's parallelism. On a 16-core machine, DuckDB defaults to all cores — set this explicitly in shared environments.

Warning

DuckDB allows only one writer at a time per database file. For multi-process write workloads, use the MotherDuck managed service or partition writes by file. Do not use a DuckDB file on NFS — locking semantics are undefined.

Key Takeaways
  • DuckDB eliminates the server/cluster overhead for analytics — install it as a library and query GB-scale data in seconds.
  • Native Parquet and S3 support with predicate pushdown replaces many ETL pipeline steps.
  • The postgres_scanner extension lets you run vectorized analytics directly against PostgreSQL tables without a separate data warehouse.
  • DuckDB is single-writer; for high-concurrency OLAP, consider ClickHouse or BigQuery instead.

Working with JusDB on Analytics Architecture

JusDB helps engineering teams design analytics pipelines that use the right tool for each layer — DuckDB for embedded analytics, ClickHouse for high-volume OLAP, and PostgreSQL for transactional workloads.

Explore JusDB Architecture Services →  |  Talk to a DBA

Related reading:

Share this article

JusDB Team

Official JusDB content team