Database Engineering

Airbyte: Open-Source ELT for Database Replication

Deploy Airbyte to replicate data from MySQL, PostgreSQL, and MongoDB to your warehouse — connectors, scheduling, and transforms

JusDB Team
July 14, 2022
Updated May 31, 2026
11 min read
136 views

Moving data from production databases into your analytics warehouse is one of those tasks that sounds simple until you're debugging CDC offsets at 2am. Airbyte, the open-source ELT platform, gives data engineers a self-hosted alternative to expensive SaaS connectors — with over 350 pre-built connectors, flexible sync modes, and a pluggable architecture that doesn't lock you into a single vendor. Whether you're replicating a MySQL transactional database, pulling documents from MongoDB, or syncing a fleet of PostgreSQL instances, Airbyte handles the plumbing so your team can focus on the analytics. In this guide we walk through a production-ready Airbyte deployment: Docker setup, source connector configuration for MySQL, PostgreSQL, and MongoDB, sync mode selection, dbt transformations post-sync, and an honest comparison against Fivetran.

TL;DR
  • Airbyte OSS is free to self-host via Docker Compose; Airbyte Cloud adds managed infrastructure and a usage-based price.
  • Configure MySQL, PostgreSQL, and MongoDB source connectors with CDC or cursor-based incremental syncs to minimise load on production.
  • Choose from Full Refresh, Incremental Append, or Incremental Deduped History sync modes depending on whether your downstream consumers need a full snapshot or an up-to-date view.
  • Airbyte ships data raw into a staging schema; run dbt transformations afterwards to produce clean, business-ready models.
  • Airbyte beats Fivetran on total cost of ownership for high-volume pipelines and self-hosted requirements; Fivetran wins on managed reliability and SLA guarantees.
  • PyAirbyte lets you drive syncs programmatically from Python scripts, CI pipelines, or orchestrators like Airflow.

What is Airbyte?

Airbyte is an open-source ELT (Extract, Load, Transform) platform that moves data from sources — databases, APIs, files — into destinations — warehouses, data lakes, object storage — with minimal custom code. It was founded in 2020 and quickly became the most-starred data integration project on GitHub, largely because it ships connector development to the community rather than maintaining a closed catalog.

The platform has two deployment options:

  • Airbyte OSS: Self-hosted, free, Apache 2.0 licensed. You own the infrastructure, the secrets, and the operational burden. Best for teams with existing DevOps capacity and high-volume pipelines where per-row SaaS pricing becomes painful.
  • Airbyte Cloud: Managed service hosted by Airbyte Inc. Usage is billed on credits (roughly tied to rows synced). Best for teams that want connectors without the ops overhead. Includes features like data residency controls and SOC 2 compliance.

Both flavours use the same connector catalog and the same core sync engine, so migrating from Cloud to OSS (or vice versa) is mostly a configuration exercise rather than a re-architecture.

Deploying Airbyte with Docker Compose

For a local or single-node production deployment, Docker Compose is the fastest path. Airbyte ships an official docker-compose.yaml that wires together the web server, scheduler, worker, and database services.

bash
# Clone the Airbyte repository
git clone --depth=1 https://github.com/airbytehq/airbyte.git
cd airbyte

# Launch all services (first run pulls ~2 GB of images)
./run-ab-platform.sh

Once the stack is healthy, the UI is available at http://localhost:8000. Default credentials are airbyte / password — change these immediately for anything beyond a laptop demo.

Warning

The default Docker Compose stack runs everything on a single host and stores Airbyte's internal state in a local PostgreSQL container. For production workloads, externalise the Airbyte database to a managed RDS or Cloud SQL instance so a host failure doesn't corrupt your job history and connection configs. See the Airbyte docs on External Database configuration.

For Kubernetes deployments, Airbyte publishes a Helm chart (airbyte/airbyte) that separates the scheduler, workers, and API server into individual pods, enabling horizontal scaling of the worker tier.

Configuring Source Connectors

Airbyte treats every data origin as a source. Each source connector is a Docker image that implements the Airbyte Protocol — a set of JSON messages for catalog discovery and record emission. Here is how to configure the three most common database sources.

MySQL

The MySQL source connector supports two replication methods:

  • Standard (cursor-based): Queries the table on each sync using a cursor column (typically a timestamp or auto-increment ID). Simple to set up but misses hard-deletes.
  • CDC via Debezium: Reads the MySQL binary log for real-time, row-level changes including deletes. Requires binlog_format = ROW and a replication user with REPLICATION SLAVE privileges.
sql
-- Create a dedicated replication user for Airbyte
CREATE USER 'airbyte_replication'@'%' IDENTIFIED BY 'strong_password';
GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT
  ON *.* TO 'airbyte_replication'@'%';
FLUSH PRIVILEGES;

In the Airbyte UI, set the replication method to Logical Replication (CDC) and supply the host, port, database name, and the replication user credentials.

PostgreSQL

The PostgreSQL connector also supports both cursor-based and CDC replication. CDC uses PostgreSQL logical replication slots and the pgoutput plugin (available since PG 10).

sql
-- Enable logical replication in postgresql.conf
-- wal_level = logical

-- Create a publication for the tables you want to replicate
CREATE PUBLICATION airbyte_pub FOR TABLE orders, customers, products;

-- Grant replication permissions
CREATE USER airbyte_user WITH REPLICATION LOGIN PASSWORD 'strong_password';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO airbyte_user;
Warning

PostgreSQL replication slots accumulate WAL segments if Airbyte falls behind or is paused. An unbounded slot can fill your disk and crash the primary. Set max_slot_wal_keep_size (PG 13+) or monitor pg_replication_slots and drop stale slots proactively.

MongoDB

MongoDB replication in Airbyte uses the change stream API, which requires a replica set (or sharded cluster with mongos). Standalone mongod instances do not support change streams.

javascript
// Create a least-privilege user in the admin database
use admin
db.createUser({
  user: "airbyte",
  pwd: "strong_password",
  roles: [
    { role: "read", db: "your_database" },
    { role: "read", db: "local" }
  ]
})

Because MongoDB documents are schemaless, Airbyte infers a schema by sampling documents. You can configure the sample size and whether Airbyte should expand nested objects into columns or emit them as JSON strings.

Sync Modes Explained

Sync mode is the most consequential configuration choice in Airbyte. It determines how records are extracted from the source and how they land in the destination.

  • Full Refresh | Overwrite: Airbyte reads the entire source table and replaces the destination table on every sync. Simple and correct, but expensive on large tables and destructive to any downstream transformations applied on top.
  • Full Refresh | Append: Reads the entire source and appends a new snapshot to the destination. Useful for slowly changing dimension history but produces duplicates that dbt must deduplicate.
  • Incremental | Append: Reads only new or updated rows since the last sync using a cursor field (e.g. updated_at). Efficient for large tables; hard-deletes are invisible unless you use CDC.
  • Incremental | Deduped History: Reads incrementally and maintains a deduplicated destination table keyed on a primary key. The destination always reflects the latest version of each row. This is the recommended mode for most OLTP-to-warehouse pipelines.
Tip

For the Incremental Deduped History mode to work correctly, every stream must have a primary key configured and a reliable cursor field. Use a database-generated updated_at timestamp (with microsecond precision if possible) rather than application-layer timestamps, which can be skewed by client clock drift.

Destination Connectors

Airbyte supports all major cloud warehouses and object stores as destinations. Configuration follows the same pattern: provide credentials, choose a destination schema, and let Airbyte handle staging and loading.

  • BigQuery: Airbyte uses a GCS staging bucket for bulk loads. Supply a service account JSON key with bigquery.dataEditor and storage.objectAdmin roles.
  • Snowflake: Airbyte stages data in an internal Snowflake stage by default, or an S3/GCS/Azure Blob external stage. Use key-pair authentication for service accounts rather than password auth.
  • Redshift: Airbyte stages records in S3 then issues a COPY command. The S3 bucket must be in the same region as your Redshift cluster to avoid egress charges.
  • S3 (Data Lake): Output raw records as JSON, CSV, or Parquet to S3. Useful as a cheap landing zone before a separate transformation layer processes the files.

dbt Transformations Post-Sync

Airbyte's job ends when raw records land in the destination. The raw tables are prefixed with _airbyte_raw_ and include Airbyte metadata columns (_airbyte_emitted_at, _airbyte_ab_id). The standard workflow is to run dbt after each Airbyte sync to produce clean, typed, and business-ready models.

yaml
# dbt/models/staging/stg_orders.sql
-- Reads from Airbyte raw table and casts types
SELECT
    JSON_VALUE(_airbyte_data, '$.id')          AS order_id,
    JSON_VALUE(_airbyte_data, '$.customer_id') AS customer_id,
    CAST(JSON_VALUE(_airbyte_data, '$.amount') AS FLOAT64) AS amount,
    TIMESTAMP(JSON_VALUE(_airbyte_data, '$.created_at')) AS created_at,
    _airbyte_emitted_at
FROM {{ source('airbyte_raw', '_airbyte_raw_orders') }}

Airbyte Cloud offers a native dbt Cloud integration that triggers a dbt job automatically after a successful sync. In OSS deployments, orchestrate the dependency with Airflow using the AirbyteTriggerSyncOperator followed by a DbtRunOperator, or with Dagster's Airbyte and dbt assets.

Connection Scheduling

Each Airbyte connection has its own schedule, configurable as a cron expression or a fixed interval (every 1 hour, every 6 hours, etc.). For near-real-time requirements, CDC connections can be run every 5 minutes with low overhead since they only read WAL deltas rather than full tables.

Tip

Avoid scheduling all connections at the same time (e.g. every hour on the hour). Stagger starts by 5–10 minutes to prevent thundering-herd load spikes on both your source databases and your Airbyte worker pool.

PyAirbyte for Programmatic Use

PyAirbyte is the official Python library for driving Airbyte syncs without the UI. It is useful for dynamic pipeline generation, CI validation of connector configs, and embedding Airbyte into orchestration frameworks.

python
import airbyte as ab

# Define a source connector
source = ab.get_source(
    "source-postgres",
    config={
        "host": "prod-db.example.com",
        "port": 5432,
        "database": "analytics",
        "username": "airbyte_user",
        "password": "strong_password",
        "replication_method": {"method": "CDC"},
    },
    streams=["orders", "customers"],
)

# Read into an in-memory cache (DuckDB by default)
source.check()
cache = ab.get_default_cache()
result = source.read(cache=cache)

# Access as a pandas DataFrame
orders_df = result["orders"].to_pandas()
print(orders_df.head())

PyAirbyte is particularly powerful in notebook-driven EDA workflows — you can pull a fresh sample from a production source into DuckDB in a few lines of code without configuring a full Airbyte server.

Airbyte vs Fivetran

Both platforms solve the same problem — automated data pipeline management — but make different trade-offs.

Dimension Airbyte OSS Airbyte Cloud Fivetran
Cost model Free (infra costs only) Usage-based credits MAR or row-based pricing
Self-hosted Yes No Fivetran HVR only (enterprise)
Connector count 350+ 350+ 500+
Connector quality Variable (community-built) Variable High (vendor-maintained)
Custom connectors Connector Builder / CDK Connector Builder Fivetran SDK (limited)
Ops burden High (self-managed) Low Very low
SLA / uptime guarantee None 99.9% (paid tiers) 99.9%+
Data residency control Full Partial (region selection) Limited

Choose Airbyte OSS when you have high row volumes where per-MAR Fivetran pricing becomes prohibitive, when data sovereignty requires on-premise or VPC-only processing, or when you need to build custom connectors without paying enterprise fees.

Choose Fivetran when your team lacks the DevOps capacity to manage a self-hosted platform, when you need ironclad SLAs for business-critical pipelines, or when your source connectors are in Fivetran's premium catalog (Salesforce, NetSuite, etc.) and the quality delta justifies the cost.

Warning

Airbyte's community-built connectors vary significantly in quality. Before committing to a connector for a production pipeline, check the connector's GitHub issues and the community Slack for known bugs. Critical-path connectors (your core transactional DB) should be validated with a parallel run against your existing pipeline before cutover.

Key Takeaways
  • Airbyte OSS is the most cost-effective path for high-volume database replication pipelines where self-hosting is viable.
  • Use CDC replication (Debezium for MySQL/PostgreSQL, change streams for MongoDB) over cursor-based sync for accurate, low-latency replication that captures deletes.
  • The Incremental | Deduped History sync mode is the right default for OLTP-to-warehouse pipelines — it keeps a current view of each row without full table scans.
  • Manage replication slot WAL accumulation proactively on PostgreSQL sources; unbounded growth will crash your primary.
  • Chain Airbyte syncs with dbt using Airflow, Dagster, or Airbyte Cloud's native dbt integration to produce clean warehouse models from raw Airbyte landing tables.
  • PyAirbyte enables programmatic connector use in Python scripts and notebooks without a running Airbyte server.
  • Fivetran is the better choice when managed reliability and SLAs outweigh total cost of ownership concerns.

How JusDB Fits In

Running Airbyte is the data movement layer — JusDB is the layer that helps you understand, document, and query what landed. Once your MySQL, PostgreSQL, or MongoDB data is replicated into your warehouse via Airbyte, JusDB automatically catalogues the arriving schemas, tracks column lineage from raw Airbyte tables through your dbt models, and surfaces query performance insights so you can optimise the most expensive post-sync transformations. Teams using JusDB alongside Airbyte cut time-to-insight after pipeline incidents by surfacing exactly which upstream source change caused a downstream schema drift. If you're standing up a new Airbyte pipeline or migrating an existing Fivetran workflow, try JusDB free to add observability and documentation to your replication stack from day one.

Share this article