Database Engineering

dbt (data build tool): A Guide for Database Engineers

Use dbt to build modular SQL transformations — models, tests, documentation, and CI/CD for your data warehouse

JusDB Team
January 26, 2026
12 min read
186 views
dbt Guide for Database Engineers: Models, Tests, CI/CD | JusDB

If you have spent years writing stored procedures, views, and ETL scripts that live in undocumented corners of your data warehouse, dbt offers a fundamentally different way to work. It treats your SQL transformations as version-controlled software — testable, documented, and deployable through a CI/CD pipeline. Data build tool (dbt) has become the de facto standard for the transformation layer in the modern data stack, and understanding it is now an expected skill for any database engineer working with analytical workloads. This guide walks through every major concept you need to go from zero to a production-ready dbt project.

TL;DR
  • dbt transforms data inside your warehouse using SELECT-only SQL files called models.
  • The ref() function builds a DAG of dependencies so dbt runs models in the correct order.
  • Built-in tests (not_null, unique, accepted_values, relationships) catch data quality issues before they reach consumers.
  • Materializations (table, view, incremental, ephemeral) control how each model is persisted.
  • Snapshots handle slowly changing dimensions (SCD Type 2) out of the box.
  • dbt Core is free and open source; dbt Cloud adds a managed scheduler, IDE, and CI runner.
  • GitHub Actions integrates cleanly with dbt Core for fully automated CI/CD.

What Is dbt?

dbt (data build tool) is an open-source command-line tool that enables data teams to define, run, and test SQL-based transformations directly inside a data warehouse. Unlike traditional ETL tools that move data out of the database to transform it and then load it back, dbt follows an ELT pattern: data lands in the warehouse first, and dbt handles the T — entirely in-warehouse using the compute you are already paying for.

Every transformation in dbt is a plain .sql file containing a single SELECT statement. dbt compiles these files into the appropriate DDL or DML (CREATE TABLE AS SELECT, CREATE VIEW AS, MERGE, etc.) and executes them against your warehouse. This means your team writes only SELECT logic — dbt handles materialization.

dbt supports all major analytical warehouses through a pluggable adapter system. First-class adapters exist for BigQuery, Snowflake, PostgreSQL, Amazon Redshift, Databricks, DuckDB, and more. Switching warehouses requires changing the adapter and connection profile, not your model SQL.

Core Concepts

Models

A model is a .sql file that contains a single SELECT statement. When you run dbt run, dbt materializes each model in your warehouse according to its configured materialization type. The file name becomes the relation name in the database.

The ref() Function

The ref() function is how models declare dependencies on other models. When model B references model A via {{ ref('model_a') }}, dbt builds a directed acyclic graph (DAG) and guarantees that model A is materialized before model B. This replaces the fragile, hand-maintained ordering found in most legacy ETL frameworks.

Sources

Sources declare raw tables that dbt does not own — typically tables loaded by an ingestion tool like Fivetran or Airbyte. Declaring them in sources.yml lets you reference them with {{ source('schema', 'table') }}, apply freshness checks, and document them alongside your models.

Materializations

Materializations control how dbt persists each model's results:

  • view — dbt creates or replaces a database view. No data is stored; the query runs on every read. Cheapest to build, most expensive to query repeatedly.
  • table — dbt creates or replaces a physical table on every run. Expensive to build, fast to query.
  • incremental — dbt inserts or merges only new or changed rows since the last run. Ideal for large, append-heavy event tables.
  • ephemeral — the model is never materialized; dbt inlines its SQL as a CTE wherever it is referenced. Useful for reusable logic that does not need to exist as a standalone object.

Project Structure

A standard dbt project follows a well-established directory layout:

text
my_dbt_project/
├── dbt_project.yml          # Project configuration
├── profiles.yml             # Connection credentials (keep out of version control)
├── models/
│   ├── staging/             # 1:1 mappings from raw sources, light renaming/casting
│   │   ├── _sources.yml     # Source declarations
│   │   └── stg_orders.sql
│   ├── intermediate/        # Business logic, joins, aggregations
│   │   └── int_order_items_pivoted.sql
│   └── marts/               # Final, consumer-facing tables
│       ├── schema.yml       # Column-level tests and documentation
│       └── fct_orders.sql
├── seeds/                   # Static CSV files loaded as tables
├── snapshots/               # SCD Type 2 snapshot definitions
├── tests/                   # Custom singular tests (SQL files)
├── macros/                  # Reusable Jinja macros
└── analyses/                # Ad-hoc SQL not materialized by dbt

The three-layer architecture (staging → intermediate → marts) is the most widely adopted convention. Staging models are thin wrappers that rename columns and cast types. Intermediate models contain business logic. Mart models are the final, query-optimized tables exposed to BI tools and analysts.

Writing Models

A Staging Model

sql
-- models/staging/stg_orders.sql
with source as (
    select * from {{ source('raw', 'orders') }}
),

renamed as (
    select
        id                          as order_id,
        customer_id,
        status,
        cast(created_at as timestamp) as created_at,
        cast(amount_cents as numeric) / 100 as order_amount
    from source
)

select * from renamed

A Mart Model Using ref()

sql
-- models/marts/fct_orders.sql
with orders as (
    select * from {{ ref('stg_orders') }}
),

customers as (
    select * from {{ ref('stg_customers') }}
),

final as (
    select
        o.order_id,
        o.order_amount,
        o.status,
        o.created_at,
        c.email          as customer_email,
        c.country        as customer_country
    from orders o
    left join customers c using (customer_id)
)

select * from final

Incremental Models

For tables with millions or billions of rows, rebuilding from scratch on every run is wasteful. Incremental models solve this by processing only new data:

sql
-- models/marts/fct_events.sql
{{
  config(
    materialized='incremental',
    unique_key='event_id',
    on_schema_change='sync_all_columns'
  )
}}

select
    event_id,
    user_id,
    event_type,
    occurred_at
from {{ ref('stg_events') }}

{% if is_incremental() %}
  where occurred_at > (select max(occurred_at) from {{ this }})
{% endif %}

The is_incremental() macro evaluates to true only when the target table already exists and the run is not a full refresh. On the first run — or when you pass --full-refresh — dbt rebuilds the table from scratch.

Warning

Late-arriving data is a common pitfall with incremental models. If your source ever delivers rows out of order, a simple max(occurred_at) filter will miss them. Add a lookback window — e.g., occurred_at > (select max(occurred_at) - interval '3 days' from {{ this }}) — to catch late arrivals at the cost of reprocessing a small window of data.

Sources Definition

yaml
# models/staging/_sources.yml
version: 2

sources:
  - name: raw
    database: my_warehouse
    schema: raw_data
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders
        loaded_at_field: _ingested_at
        description: "Raw orders table loaded by Fivetran."
      - name: customers
        loaded_at_field: _ingested_at
        description: "Raw customers table loaded by Fivetran."

Testing Your Models

dbt ships with four generic tests that cover the most common data quality assertions. You declare them in schema.yml files alongside your model definitions:

yaml
# models/marts/schema.yml
version: 2

models:
  - name: fct_orders
    description: "One row per order, joined with customer dimensions."
    columns:
      - name: order_id
        description: "Primary key for the orders table."
        tests:
          - not_null
          - unique

      - name: status
        description: "Current fulfillment status of the order."
        tests:
          - not_null
          - accepted_values:
              values: ['placed', 'shipped', 'delivered', 'returned', 'cancelled']

      - name: customer_email
        description: "Email of the customer who placed the order."
        tests:
          - relationships:
              to: ref('stg_customers')
              field: email

Run tests with:

bash
dbt test                         # Run all tests
dbt test --select fct_orders     # Run tests for a specific model
dbt test --select source:raw     # Run freshness + schema tests on sources
Tip

Beyond the four generic tests, you can write custom singular tests as plain SQL files in the tests/ directory. Any query that returns rows is considered a failing test. This lets you encode complex, multi-table business rules — for example, asserting that total revenue in the mart matches total revenue in the staging layer.

Seeds and Snapshots

Seeds

Seeds are CSV files stored in the seeds/ directory that dbt loads into the warehouse as tables. They are ideal for small, slowly changing reference data — country codes, product category mappings, fiscal calendar definitions — that you want under version control but do not need to ingest via a data pipeline.

bash
dbt seed                         # Load all seed files
dbt seed --select country_codes  # Load a specific seed

Snapshots (SCD Type 2)

Snapshots capture the state of a source table over time, automatically maintaining historical records using slowly changing dimension Type 2 logic. dbt adds dbt_valid_from and dbt_valid_to columns to track when each version of a record was current.

sql
-- snapshots/snp_customers.sql
{% snapshot snp_customers %}

{{
  config(
    target_schema='snapshots',
    unique_key='customer_id',
    strategy='timestamp',
    updated_at='updated_at'
  )
}}

select * from {{ source('raw', 'customers') }}

{% endsnapshot %}
bash
dbt snapshot                     # Run all snapshots

Documentation

dbt generates a static documentation site directly from your project. Column descriptions written in schema.yml files, model descriptions, and the dependency DAG are all compiled into an interactive web app:

bash
dbt docs generate                # Compile docs and DAG
dbt docs serve                   # Serve locally on http://localhost:8080

The generated site includes a visual DAG explorer that lets anyone on your team trace the lineage of any table — from raw source through every intermediate model to the final mart. This replaces the tribal knowledge that typically lives in Confluence pages or Slack threads, and it is always in sync with your code because it is generated from the same files.

Tip

Use the description field in schema.yml aggressively. Even a single sentence per column eliminates hours of archaeology when onboarding new engineers or debugging a dashboard discrepancy. dbt Cloud and many BI tools can surface these descriptions directly in their interfaces.

dbt Core vs dbt Cloud

dbt Core is the open-source CLI. It is free, has no seat licensing, and runs anywhere Python runs — your laptop, a Docker container, a GitHub Actions runner, or an Airflow worker. All model authoring, testing, and documentation generation happens in Core. The vast majority of dbt's functionality lives here.

dbt Cloud is a managed SaaS platform built on top of dbt Core. It adds a browser-based IDE, a managed job scheduler with a visual UI, integrated CI (automated runs on pull requests), Semantic Layer (centralized metric definitions), and hosted documentation. If your team lacks the infrastructure to self-host a scheduler or wants a polished UI without operational overhead, dbt Cloud is worth evaluating. It uses a per-developer seat model with a free tier for individuals.

CI/CD with GitHub Actions

Automating dbt runs in CI catches breaking changes before they merge to main. The following workflow runs dbt build — which runs models, tests, seeds, and snapshots in dependency order — against a dedicated CI schema on every pull request:

yaml
# .github/workflows/dbt-ci.yml
name: dbt CI

on:
  pull_request:
    branches: [main]

jobs:
  dbt-build:
    runs-on: ubuntu-latest

    env:
      DBT_PROFILES_DIR: .
      DBT_TARGET: ci

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install dbt
        run: pip install dbt-snowflake==1.8.0   # swap adapter as needed

      - name: Install dbt packages
        run: dbt deps

      - name: Run dbt build (models + tests)
        run: |
          dbt build \
            --target ci \
            --vars '{"is_ci_run": true}' \
            --defer \
            --state ./prod-artifacts
        env:
          SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }}
          SNOWFLAKE_USER: ${{ secrets.SNOWFLAKE_USER }}
          SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}

      - name: Generate docs
        run: dbt docs generate
Note: Slim CI with --defer and --state

The --defer flag tells dbt to resolve ref() calls for models that are not selected in the current run against the production manifest stored in ./prod-artifacts/manifest.json. This means a CI run that only touches two changed models does not need to rebuild the entire DAG — it borrows the rest from production. This dramatically reduces CI run time and warehouse costs for large projects.

For the CD (continuous deployment) side, a separate workflow triggered on merge to main runs dbt build --target prod against your production warehouse and uploads the resulting manifest.json and catalog.json as build artifacts for the next CI run's --state reference.

Key Takeaways
  • dbt keeps all transformation logic in version-controlled SELECT-only SQL files, making your warehouse code as reviewable and testable as application code.
  • The ref() function replaces brittle, hand-maintained job ordering with an automatically computed DAG.
  • Choose materializations deliberately: views for lightweight or infrequently queried logic, tables for stability, incremental for large event streams, and ephemeral for reusable CTEs.
  • Always add a lookback buffer to incremental models to handle late-arriving data from upstream sources.
  • Generic tests (not_null, unique, accepted_values, relationships) in schema.yml are the minimum baseline; augment with custom singular tests for business rules.
  • Snapshots give you SCD Type 2 history without writing any merge logic yourself.
  • Slim CI with --defer and --state keeps pull request feedback loops fast even as your project grows to hundreds of models.
  • dbt Core covers every technical capability; dbt Cloud adds managed scheduling, a browser IDE, and integrated CI for teams that want less operational overhead.

Run Your dbt Models on JusDB

dbt is only as good as the database underneath it. JusDB gives your dbt project a high-performance, fully managed analytical database — no infrastructure to provision, no connection pools to tune, no vacuum jobs to schedule. The PostgreSQL-compatible wire protocol means your existing dbt-postgres adapter connects without modification, and JusDB's columnar storage and vectorized execution engine make even complex mart queries return in milliseconds.

Whether you are migrating an existing dbt project from a self-managed Postgres instance or starting a new data warehouse from scratch, JusDB handles the database layer so your team can focus on what dbt does best: building clean, tested, documented transformations that analysts actually trust.

Start your free JusDB trial and connect your dbt project in minutes.

Share this article