Database Engineering

Database-to-Data-Warehouse Replication: Patterns and Anti-Patterns

Replicate OLTP databases to analytical warehouses — CDC, bulk load, incremental sync, and schema evolution handling

JusDB Team
January 17, 2023
12 min read
168 views
DB to Data Warehouse Replication: Patterns Guide | JusDB

Your OLTP database is doing its job — handling thousands of transactions per second while your application stays responsive. But the moment your analytics team runs a complex aggregation query across three years of order history, everything grinds to a halt. Replicating operational data into an analytical warehouse is the standard solution, yet the implementation details determine whether you get a reliable pipeline or a slow-motion data disaster. Getting it right requires understanding not just the mechanics but the failure modes of each approach. This guide walks through the three core replication patterns, how to handle schema evolution without breaking downstream consumers, and the anti-patterns that will quietly corrupt your warehouse data.

TL;DR
  • Bulk load (mysqldump / pg_dump + reload) is the simplest pattern but introduces hours of latency and hammers your source database.
  • Incremental sync with an updated_at cursor cuts load and latency but silently misses hard deletes.
  • CDC with Debezium + Kafka gives you near-real-time replication including deletes — at the cost of operational complexity.
  • Schema evolution is the silent killer: adding nullable columns is safe; renaming or dropping columns breaks pipelines.
  • Staging tables + MERGE (upsert) is the correct landing pattern for all three strategies.
  • Always extract from a read replica, never from your primary.

Why Replicate to a Warehouse at All?

OLTP databases — MySQL, PostgreSQL, SQL Server — are optimized for transactional workloads: point reads, short writes, and strict ACID guarantees. Their storage formats (row-oriented pages, B-tree indexes) are deliberately tuned for these access patterns. The moment you run a full-table analytical scan or a multi-table join across millions of rows, you are working against every design decision the engine made.

Analytical warehouses like BigQuery, Snowflake, Redshift, and ClickHouse use columnar storage, vectorized execution, and massively parallel processing. The same aggregation that takes 45 seconds on your production Postgres instance takes 400 milliseconds in Snowflake — not because the warehouse is faster in general, but because it was built for that specific shape of query.

Replication also decouples your analytics workloads from your production database. A rogue BI query cannot take down your application if it runs against a warehouse copy of the data. This separation of concerns is reason enough to invest in a solid replication strategy.


Pattern 1: Bulk Load

The bulk load pattern is exactly what it sounds like: export the entire source table, transfer the file, and reload it into the warehouse on a schedule. Tools like mysqldump, pg_dump, and COPY TO produce flat files (CSV, Parquet, or a database-native format) that can be staged in object storage (S3, GCS) and then loaded using LOAD DATA, BigQuery's bq load, or Snowflake's COPY INTO.

The appeal is simplicity. There is no state to manage, no cursor to maintain, and no change-tracking infrastructure to operate. The pipeline is a cron job that runs a dump and a load. For small tables (under a few million rows) refreshed nightly, this is often the right answer.

The problems surface at scale. A full dump of a 500 GB table takes time, consumes I/O on the source, and produces a snapshot that is already stale by the time it lands. Latency measured in hours is acceptable for overnight batch reports but untenable for dashboards that business teams expect to reflect today's activity. The export also puts real read pressure on the source — which is why you should always run bulk dumps against a read replica, never primary.

Warning

Never run mysqldump or pg_dump against your primary database instance. Even with --single-transaction on MySQL, large exports hold metadata locks and increase replication lag on your replica set. Configure a dedicated read replica for extraction and accept the minor replica lag as the trade-off.

Pattern 2: Incremental Sync

Incremental sync improves on bulk load by only transferring rows that changed since the last run. The mechanism relies on a high-watermark column — typically updated_at — that the application populates on every insert and update. The pipeline stores the last successfully processed timestamp, queries WHERE updated_at > :last_run, and merges the results into the warehouse.

This dramatically reduces data transfer volume and source load. A table with 100 million rows but only 50,000 changes per day transfers 50,000 rows per run rather than 100 million. Latency drops from hours to minutes depending on your run frequency.

The critical blind spot is hard deletes. When a row is deleted from the source, it leaves no trace that an incremental query can detect. Your warehouse accumulates ghost rows — records that no longer exist in the source but persist in your analytical copy indefinitely. This matters acutely for compliance (GDPR right-to-erasure) and for any metric that counts distinct active records.

Warning

Incremental sync with updated_at will silently miss all hard deletes. If your application ever calls DELETE FROM orders WHERE ..., those rows will remain in your warehouse forever. Either switch to soft deletes (a deleted_at column) on the source, run periodic full reconciliations, or move to CDC.

The soft delete pattern is a practical workaround: instead of issuing a DELETE, the application sets deleted_at = NOW() and the incremental query picks up the change on the next run. The warehouse then filters WHERE deleted_at IS NULL for active-record queries. This requires application-level discipline but keeps the pipeline simple.

Pattern 3: CDC Streaming with Debezium and Kafka

Change Data Capture reads the database's internal replication log — the MySQL binlog or the PostgreSQL WAL — and converts every insert, update, and delete into a structured event stream. Debezium is the most widely deployed open-source CDC connector and publishes these events to Kafka topics, one topic per source table.

A consumer then reads from Kafka and applies the changes to the warehouse. Deletes appear as explicit tombstone events rather than absences, so the warehouse can execute the corresponding delete or mark the row as deleted. Latency drops to seconds. Because the extraction reads the replication log rather than executing queries, the impact on the source database is minimal — approximately equivalent to adding one more replica.

Tip

For PostgreSQL CDC with Debezium, use the pgoutput logical replication plugin (available natively since PG 10) rather than decoderbufs. It requires no additional extensions and is maintained by the Postgres project itself. Set wal_level = logical and create a replication slot — and monitor slot lag closely, as an unconsumed slot will cause WAL to accumulate and can fill your disk.

The trade-off is operational complexity. You are now running Kafka, Debezium (typically as a Kafka Connect worker), schema registry for Avro serialization, and a consumer that understands the warehouse landing pattern. Each component has its own failure modes, monitoring surface, and upgrade cycle. For teams without existing Kafka infrastructure, this is a significant investment.

The canonical landing pattern for CDC events is staging tables plus MERGE. Events land in a staging table first. A warehouse procedure then runs MERGE staging INTO target ON id = id WHEN MATCHED AND op = 'DELETE' THEN DELETE WHEN MATCHED THEN UPDATE WHEN NOT MATCHED THEN INSERT. This handles all three event types atomically and is idempotent — reprocessing the same batch of events produces the same result.


Schema Evolution: The Silent Pipeline Killer

Source schemas change. Engineers add columns, rename fields, split tables, and occasionally drop columns that "nobody is using." Each of these operations interacts differently with your replication pipeline.

Adding a nullable column to the source is safe. Debezium and most incremental sync tools will propagate the new column, and warehouse ALTER TABLE ADD COLUMN operations are typically online and non-blocking. Downstream queries that do not reference the new column continue working unchanged.

Renaming a column breaks everything. From the pipeline's perspective, a rename is an old column disappearing and a new column appearing. CDC consumers that reference the old field name by string will start dropping data. Downstream BI queries and dbt models that reference the old name fail silently or loudly, depending on your error handling. If a rename is unavoidable, add the new column, backfill it, update the application to write both columns, cut over downstream consumers, then drop the old column — a multi-step migration measured in days, not minutes.

Warning

Never rename a column in a production source table without first coordinating with every downstream pipeline consumer. A rename is a breaking change. Treat it with the same gravity as a public API change: version it, communicate it, and execute it in stages.

Dropping a column that is referenced by a warehouse materialization will cause the next pipeline run to either fail (if the schema is strict) or silently produce nulls (if it is permissive). Maintain a schema registry — Confluent Schema Registry for Avro-based CDC or a dbt source freshness + column lineage graph — so you know which downstream artifacts depend on which source columns before making any DDL change.


Denormalization at Load Time

OLTP schemas are normalized: orders reference customers by customer_id, line items reference products by product_id. Joins are cheap on row-oriented databases with indexes. In a columnar warehouse, repeated joins across large tables are expensive and complicate the SQL that analysts write.

The load step is the right moment to denormalize. When landing an orders batch, join against the current customersproducts, and promotions dimensions and embed the relevant attributes directly into the fact table. Analysts get a single wide table. Aggregation queries avoid joins entirely. The cost is increased storage, which is cheap, and the need to re-denormalize when dimension attributes change — which your CDC pipeline handles automatically.


Tools Comparison

Tool Pattern Latency Handles Deletes Operational Cost Best For
Airbyte Bulk / Incremental / CDC Minutes–Hours CDC mode only Low (managed or self-hosted) Teams wanting managed connectors without Kafka
Fivetran Incremental / CDC Minutes Yes (log-based) Very low (fully managed) Teams prioritizing reliability over cost control
Debezium + Kafka CDC streaming Seconds Yes (native) High (self-managed Kafka cluster) High-volume pipelines with existing Kafka infra
Custom scripts Bulk / Incremental Minutes–Hours No (unless soft deletes) Low initial, high long-term Simple tables, small teams, tight budgets
AWS DMS Full load / CDC Seconds–Minutes Yes Medium (AWS-managed) AWS-native stacks, RDS to Redshift pipelines

Anti-Patterns to Avoid

Extracting from primary. Running any bulk or incremental extraction query against your production primary adds read load at exactly the wrong time. Configure a dedicated read replica for all analytical extraction. The few milliseconds of replication lag are an acceptable trade-off.

Trusting updated_at without a delete strategy. If you use incremental sync and your application issues hard deletes, your warehouse will diverge from the source over time. Audit this explicitly: compare row counts between source and warehouse on a schedule, and alert on discrepancies above a threshold.

Loading directly into production warehouse tables. Always land data in a staging table first. Direct loads that fail mid-batch leave production tables in a partially updated state. A staging + MERGE pattern is atomic: either the merge completes or it rolls back.

Ignoring replication slot lag. PostgreSQL logical replication slots prevent WAL cleanup until the slot consumer has processed the events. A paused or slow Debezium consumer will cause WAL files to accumulate and eventually fill the disk — a database-down event. Monitor pg_replication_slots and alert when confirmed_flush_lsn falls too far behind pg_current_wal_lsn().

Replicating without schema contracts. Undocumented source schemas evolve unpredictably. Define schema contracts using Avro or Protobuf with a schema registry, or enforce column documentation in dbt. Downstream breakage from surprise schema changes is entirely preventable.

Tip

Run a weekly automated reconciliation job that compares row counts and a checksum of a sample column between your source and warehouse for every replicated table. This catches silent drift from missed deletes, failed merges, and off-by-one cursor bugs before they compound into month-old data quality issues.


Key Takeaways
  • Match the replication pattern to your latency requirement: bulk load for nightly batches, incremental sync for hourly refreshes, CDC for near-real-time dashboards.
  • Incremental sync misses hard deletes — use soft deletes or CDC if your application issues DELETE statements.
  • Always extract from a read replica, never from your primary database.
  • Use staging tables and atomic MERGE operations for all warehouse landing — never load directly into production tables.
  • Schema evolution is a pipeline concern: adding nullable columns is safe, renaming is a breaking change that requires a multi-step migration.
  • Monitor PostgreSQL replication slot lag; an unconsumed slot can bring down your database by filling disk with WAL files.
  • Denormalize at load time to simplify downstream analytics queries and avoid expensive joins at query time.
  • Run periodic reconciliation jobs to detect and alert on data drift between source and warehouse.

How JusDB Helps

Implementing and maintaining replication pipelines requires deep expertise across database internals, streaming infrastructure, and warehouse landing patterns. JusDB works with engineering teams to design replication architectures that match their latency requirements, scale constraints, and existing tooling — whether that means a well-structured Airbyte configuration, a production-hardened Debezium deployment, or an optimized custom pipeline.

If you are running incremental sync today and worried about delete drift, planning a CDC migration, or debugging a pipeline that quietly fell behind weeks ago, talk to a JusDB engineer. We have instrumented, migrated, and rescued pipelines at every scale — and we can help you build one that you can trust.

Share this article