If you have spent years writing stored procedures, views, and ETL scripts that live in undocumented corners of your data warehouse, dbt offers a fundamentally different way to work. It treats your SQL transformations as version-controlled software — testable, documented, and deployable through a CI/CD pipeline. Data build tool (dbt) has become the de facto standard for the transformation layer in the modern data stack, and understanding it is now an expected skill for any database engineer working with analytical workloads. This guide walks through every major concept you need to go from zero to a production-ready dbt project.
- dbt transforms data inside your warehouse using SELECT-only SQL files called models.
- The
ref()function builds a DAG of dependencies so dbt runs models in the correct order. - Built-in tests (
not_null,unique,accepted_values,relationships) catch data quality issues before they reach consumers. - Materializations (table, view, incremental, ephemeral) control how each model is persisted.
- Snapshots handle slowly changing dimensions (SCD Type 2) out of the box.
- dbt Core is free and open source; dbt Cloud adds a managed scheduler, IDE, and CI runner.
- GitHub Actions integrates cleanly with dbt Core for fully automated CI/CD.
What Is dbt?
dbt (data build tool) is an open-source command-line tool that enables data teams to define, run, and test SQL-based transformations directly inside a data warehouse. Unlike traditional ETL tools that move data out of the database to transform it and then load it back, dbt follows an ELT pattern: data lands in the warehouse first, and dbt handles the T — entirely in-warehouse using the compute you are already paying for.
Every transformation in dbt is a plain .sql file containing a single SELECT statement. dbt compiles these files into the appropriate DDL or DML (CREATE TABLE AS SELECT, CREATE VIEW AS, MERGE, etc.) and executes them against your warehouse. This means your team writes only SELECT logic — dbt handles materialization.
dbt supports all major analytical warehouses through a pluggable adapter system. First-class adapters exist for BigQuery, Snowflake, PostgreSQL, Amazon Redshift, Databricks, DuckDB, and more. Switching warehouses requires changing the adapter and connection profile, not your model SQL.
Core Concepts
Models
A model is a .sql file that contains a single SELECT statement. When you run dbt run, dbt materializes each model in your warehouse according to its configured materialization type. The file name becomes the relation name in the database.
The ref() Function
The ref() function is how models declare dependencies on other models. When model B references model A via {{ ref('model_a') }}, dbt builds a directed acyclic graph (DAG) and guarantees that model A is materialized before model B. This replaces the fragile, hand-maintained ordering found in most legacy ETL frameworks.
Sources
Sources declare raw tables that dbt does not own — typically tables loaded by an ingestion tool like Fivetran or Airbyte. Declaring them in sources.yml lets you reference them with {{ source('schema', 'table') }}, apply freshness checks, and document them alongside your models.
Materializations
Materializations control how dbt persists each model's results:
- view — dbt creates or replaces a database view. No data is stored; the query runs on every read. Cheapest to build, most expensive to query repeatedly.
- table — dbt creates or replaces a physical table on every run. Expensive to build, fast to query.
- incremental — dbt inserts or merges only new or changed rows since the last run. Ideal for large, append-heavy event tables.
- ephemeral — the model is never materialized; dbt inlines its SQL as a CTE wherever it is referenced. Useful for reusable logic that does not need to exist as a standalone object.
Project Structure
A standard dbt project follows a well-established directory layout:
my_dbt_project/
├── dbt_project.yml # Project configuration
├── profiles.yml # Connection credentials (keep out of version control)
├── models/
│ ├── staging/ # 1:1 mappings from raw sources, light renaming/casting
│ │ ├── _sources.yml # Source declarations
│ │ └── stg_orders.sql
│ ├── intermediate/ # Business logic, joins, aggregations
│ │ └── int_order_items_pivoted.sql
│ └── marts/ # Final, consumer-facing tables
│ ├── schema.yml # Column-level tests and documentation
│ └── fct_orders.sql
├── seeds/ # Static CSV files loaded as tables
├── snapshots/ # SCD Type 2 snapshot definitions
├── tests/ # Custom singular tests (SQL files)
├── macros/ # Reusable Jinja macros
└── analyses/ # Ad-hoc SQL not materialized by dbtThe three-layer architecture (staging → intermediate → marts) is the most widely adopted convention. Staging models are thin wrappers that rename columns and cast types. Intermediate models contain business logic. Mart models are the final, query-optimized tables exposed to BI tools and analysts.
Writing Models
A Staging Model
-- models/staging/stg_orders.sql
with source as (
select * from {{ source('raw', 'orders') }}
),
renamed as (
select
id as order_id,
customer_id,
status,
cast(created_at as timestamp) as created_at,
cast(amount_cents as numeric) / 100 as order_amount
from source
)
select * from renamedA Mart Model Using ref()
-- models/marts/fct_orders.sql
with orders as (
select * from {{ ref('stg_orders') }}
),
customers as (
select * from {{ ref('stg_customers') }}
),
final as (
select
o.order_id,
o.order_amount,
o.status,
o.created_at,
c.email as customer_email,
c.country as customer_country
from orders o
left join customers c using (customer_id)
)
select * from finalIncremental Models
For tables with millions or billions of rows, rebuilding from scratch on every run is wasteful. Incremental models solve this by processing only new data:
-- models/marts/fct_events.sql
{{
config(
materialized='incremental',
unique_key='event_id',
on_schema_change='sync_all_columns'
)
}}
select
event_id,
user_id,
event_type,
occurred_at
from {{ ref('stg_events') }}
{% if is_incremental() %}
where occurred_at > (select max(occurred_at) from {{ this }})
{% endif %}
The is_incremental() macro evaluates to true only when the target table already exists and the run is not a full refresh. On the first run — or when you pass --full-refresh — dbt rebuilds the table from scratch.
Late-arriving data is a common pitfall with incremental models. If your source ever delivers rows out of order, a simple max(occurred_at) filter will miss them. Add a lookback window — e.g., occurred_at > (select max(occurred_at) - interval '3 days' from {{ this }}) — to catch late arrivals at the cost of reprocessing a small window of data.
Sources Definition
# models/staging/_sources.yml
version: 2
sources:
- name: raw
database: my_warehouse
schema: raw_data
freshness:
warn_after: {count: 12, period: hour}
error_after: {count: 24, period: hour}
tables:
- name: orders
loaded_at_field: _ingested_at
description: "Raw orders table loaded by Fivetran."
- name: customers
loaded_at_field: _ingested_at
description: "Raw customers table loaded by Fivetran."Testing Your Models
dbt ships with four generic tests that cover the most common data quality assertions. You declare them in schema.yml files alongside your model definitions:
# models/marts/schema.yml
version: 2
models:
- name: fct_orders
description: "One row per order, joined with customer dimensions."
columns:
- name: order_id
description: "Primary key for the orders table."
tests:
- not_null
- unique
- name: status
description: "Current fulfillment status of the order."
tests:
- not_null
- accepted_values:
values: ['placed', 'shipped', 'delivered', 'returned', 'cancelled']
- name: customer_email
description: "Email of the customer who placed the order."
tests:
- relationships:
to: ref('stg_customers')
field: emailRun tests with:
dbt test # Run all tests
dbt test --select fct_orders # Run tests for a specific model
dbt test --select source:raw # Run freshness + schema tests on sourcesBeyond the four generic tests, you can write custom singular tests as plain SQL files in the tests/ directory. Any query that returns rows is considered a failing test. This lets you encode complex, multi-table business rules — for example, asserting that total revenue in the mart matches total revenue in the staging layer.
Seeds and Snapshots
Seeds
Seeds are CSV files stored in the seeds/ directory that dbt loads into the warehouse as tables. They are ideal for small, slowly changing reference data — country codes, product category mappings, fiscal calendar definitions — that you want under version control but do not need to ingest via a data pipeline.
dbt seed # Load all seed files
dbt seed --select country_codes # Load a specific seedSnapshots (SCD Type 2)
Snapshots capture the state of a source table over time, automatically maintaining historical records using slowly changing dimension Type 2 logic. dbt adds dbt_valid_from and dbt_valid_to columns to track when each version of a record was current.
-- snapshots/snp_customers.sql
{% snapshot snp_customers %}
{{
config(
target_schema='snapshots',
unique_key='customer_id',
strategy='timestamp',
updated_at='updated_at'
)
}}
select * from {{ source('raw', 'customers') }}
{% endsnapshot %}dbt snapshot # Run all snapshotsDocumentation
dbt generates a static documentation site directly from your project. Column descriptions written in schema.yml files, model descriptions, and the dependency DAG are all compiled into an interactive web app:
dbt docs generate # Compile docs and DAG
dbt docs serve # Serve locally on http://localhost:8080The generated site includes a visual DAG explorer that lets anyone on your team trace the lineage of any table — from raw source through every intermediate model to the final mart. This replaces the tribal knowledge that typically lives in Confluence pages or Slack threads, and it is always in sync with your code because it is generated from the same files.
Use the description field in schema.yml aggressively. Even a single sentence per column eliminates hours of archaeology when onboarding new engineers or debugging a dashboard discrepancy. dbt Cloud and many BI tools can surface these descriptions directly in their interfaces.
dbt Core vs dbt Cloud
dbt Core is the open-source CLI. It is free, has no seat licensing, and runs anywhere Python runs — your laptop, a Docker container, a GitHub Actions runner, or an Airflow worker. All model authoring, testing, and documentation generation happens in Core. The vast majority of dbt's functionality lives here.
dbt Cloud is a managed SaaS platform built on top of dbt Core. It adds a browser-based IDE, a managed job scheduler with a visual UI, integrated CI (automated runs on pull requests), Semantic Layer (centralized metric definitions), and hosted documentation. If your team lacks the infrastructure to self-host a scheduler or wants a polished UI without operational overhead, dbt Cloud is worth evaluating. It uses a per-developer seat model with a free tier for individuals.
CI/CD with GitHub Actions
Automating dbt runs in CI catches breaking changes before they merge to main. The following workflow runs dbt build — which runs models, tests, seeds, and snapshots in dependency order — against a dedicated CI schema on every pull request:
# .github/workflows/dbt-ci.yml
name: dbt CI
on:
pull_request:
branches: [main]
jobs:
dbt-build:
runs-on: ubuntu-latest
env:
DBT_PROFILES_DIR: .
DBT_TARGET: ci
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.12"
- name: Install dbt
run: pip install dbt-snowflake==1.8.0 # swap adapter as needed
- name: Install dbt packages
run: dbt deps
- name: Run dbt build (models + tests)
run: |
dbt build \
--target ci \
--vars '{"is_ci_run": true}' \
--defer \
--state ./prod-artifacts
env:
SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }}
SNOWFLAKE_USER: ${{ secrets.SNOWFLAKE_USER }}
SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
- name: Generate docs
run: dbt docs generate--defer and --state
The --defer flag tells dbt to resolve ref() calls for models that are not selected in the current run against the production manifest stored in ./prod-artifacts/manifest.json. This means a CI run that only touches two changed models does not need to rebuild the entire DAG — it borrows the rest from production. This dramatically reduces CI run time and warehouse costs for large projects.
For the CD (continuous deployment) side, a separate workflow triggered on merge to main runs dbt build --target prod against your production warehouse and uploads the resulting manifest.json and catalog.json as build artifacts for the next CI run's --state reference.
- dbt keeps all transformation logic in version-controlled SELECT-only SQL files, making your warehouse code as reviewable and testable as application code.
- The
ref()function replaces brittle, hand-maintained job ordering with an automatically computed DAG. - Choose materializations deliberately: views for lightweight or infrequently queried logic, tables for stability, incremental for large event streams, and ephemeral for reusable CTEs.
- Always add a lookback buffer to incremental models to handle late-arriving data from upstream sources.
- Generic tests (
not_null,unique,accepted_values,relationships) inschema.ymlare the minimum baseline; augment with custom singular tests for business rules. - Snapshots give you SCD Type 2 history without writing any merge logic yourself.
- Slim CI with
--deferand--statekeeps pull request feedback loops fast even as your project grows to hundreds of models. - dbt Core covers every technical capability; dbt Cloud adds managed scheduling, a browser IDE, and integrated CI for teams that want less operational overhead.
Run Your dbt Models on JusDB
dbt is only as good as the database underneath it. JusDB gives your dbt project a high-performance, fully managed analytical database — no infrastructure to provision, no connection pools to tune, no vacuum jobs to schedule. The PostgreSQL-compatible wire protocol means your existing dbt-postgres adapter connects without modification, and JusDB's columnar storage and vectorized execution engine make even complex mart queries return in milliseconds.
Whether you are migrating an existing dbt project from a self-managed Postgres instance or starting a new data warehouse from scratch, JusDB handles the database layer so your team can focus on what dbt does best: building clean, tested, documented transformations that analysts actually trust.
Start your free JusDB trial and connect your dbt project in minutes.