Every data pipeline decision carries a hidden cost: pick the wrong integration pattern and you're either drowning in stale reports, buckling under source-system load, or wrestling with a transformation layer that breaks every time a schema changes. CDC, ETL, and ELT are not interchangeable — each one makes a different trade-off between latency, infrastructure complexity, source impact, and operational overhead. Getting that trade-off right separates pipelines that scale from pipelines that become technical debt. This post dissects all three patterns so you can make the choice with eyes open.
- ETL transforms data before loading — reliable, battle-tested, but limited to batch windows (hours of latency).
- ELT loads raw data first, then transforms inside the warehouse — ideal for modern cloud DWHs like BigQuery, Snowflake, and Redshift with dbt.
- CDC streams database log events in near real-time (sub-second to seconds) without hammering the source system.
- Most mature data platforms use all three: CDC for operational feeds, ELT for analytics, ETL for regulated or legacy pipelines.
- Choose by latency requirement, source system sensitivity, schema volatility, and existing tooling.
Data Integration Fundamentals
Data integration is the discipline of moving data from where it is produced to where it can be analyzed or acted upon. Every integration pattern answers the same three questions: when does the movement happen, where does the transformation happen, and how much of the source dataset moves each time?
The answers have compounding downstream effects. A pattern optimized for throughput (full-table batch) will saturate OLTP read replicas during the extract window. A pattern optimized for freshness (streaming CDC) requires a more complex operational stack. A pattern that transforms in the source-side ETL layer couples your pipeline tightly to schema changes in the upstream system. None of these is inherently wrong — the key is matching the pattern to your actual requirements.
Three patterns dominate the industry: ETL (Extract, Transform, Load), ELT (Extract, Load, Transform), and CDC (Change Data Capture). They differ fundamentally in where transformation logic lives, how frequently data moves, and what they demand from source and destination systems.
ETL Explained
ETL is the original enterprise data integration pattern, formalized in the 1970s alongside the first data warehouses. The sequence is literal: extract data from source systems, apply transformations in an intermediate processing layer, then load the cleaned and structured result into the destination warehouse.
How ETL Works
An ETL job typically runs on a schedule — nightly is the classic cadence, though hourly batches are common in modern implementations. The extraction step queries or exports source data (full table dumps or incremental deltas based on a timestamp column). The transformation step runs in a dedicated ETL engine: aggregations, joins, type casting, deduplication, business rule application, and data quality checks all happen here before a single row reaches the warehouse. The load step writes clean, conformed data into fact and dimension tables.
Classic ETL tooling includes Informatica PowerCenter, IBM DataStage, Microsoft SSIS, Talend, and Apache Spark for large-scale batch processing. Cloud-native options like AWS Glue and Azure Data Factory follow the same extract-transform-load sequence.
ETL Strengths
- Data quality and business logic are enforced before the warehouse, keeping the destination clean by construction.
- Works well with legacy on-premise warehouses with limited compute (Teradata, Oracle DW) that cannot afford to run heavy transformations internally.
- Regulatory and compliance use cases benefit from a documented transformation audit trail that exists independently of the warehouse.
- Predictable resource utilization — the ETL window is bounded and schedulable.
ETL's biggest liability is latency. Nightly batch jobs mean your analysts are always working with yesterday's data. For operational dashboards, fraud detection, or any use case where hours matter, ETL is the wrong starting point. Additionally, schema changes in source systems break ETL transformation logic immediately — every column rename or type change requires a pipeline code change before the next run.
ELT Explained
ELT emerged as cloud data warehouses — BigQuery, Snowflake, Redshift, Databricks — made it economically viable to store and process raw data at scale. The insight behind ELT is simple: modern cloud DWHs are extremely good at running complex SQL transformations against large datasets. There is no longer a compelling reason to do that work outside the warehouse.
How ELT Works
In ELT, extraction and loading happen first — often using dedicated ingestion tools like Fivetran, Airbyte, or Stitch. Raw data lands in a staging schema inside the warehouse exactly as it came from the source, preserving full fidelity. Transformation then happens inside the warehouse using SQL, orchestrated by dbt (data build tool), which has become the de facto standard for the transform layer. dbt models define a DAG of SQL transformations that produce clean, tested, documented datasets consumed by analysts and BI tools.
ELT Strengths
- Raw data is always available — you can reprocess historical data with new transformation logic without re-extracting from source.
- Separation of concerns: ingestion tools handle connectivity, dbt handles transformation, the warehouse handles compute.
- Schema evolution is more forgiving — new source columns land in staging and can be incorporated into transformations on the next dbt run without pipeline failure.
- Analysts with SQL skills can own transformation logic; no specialized ETL developer required.
- Latency is better than nightly ETL — many ELT setups run every 15–60 minutes.
Use dbt's incremental materialization strategy to avoid reprocessing your entire raw dataset on every transformation run. Combined with a short ingestion interval, this can bring ELT latency down to 10–15 minutes for many use cases — good enough for most analytical dashboards without the operational complexity of streaming CDC.
ELT still relies on periodic extraction from source systems. If your source is a high-traffic OLTP database, bulk SELECT queries on a schedule will create read load spikes. Always run ingestion against a read replica, and be aware that timestamp-based incremental extraction can miss hard-deletes unless your source preserves soft-delete records.
CDC Explained
Change Data Capture takes a fundamentally different approach: instead of periodically querying the source system for new or changed rows, CDC taps into the database's internal change log — the transaction log (WAL in PostgreSQL, binlog in MySQL, redo log in Oracle) — and streams every insert, update, and delete as it happens.
How Log-Based CDC Works
Log-based CDC tools like Debezium act as logical replication consumers. Debezium connects to the database replication slot (PostgreSQL) or reads the binlog (MySQL/MariaDB) and publishes structured change events to Apache Kafka topics. Each event contains the before and after state of the row, the operation type (INSERT/UPDATE/DELETE), and the source transaction timestamp. Consumers downstream — whether a streaming processor like Apache Flink, a Kafka Connect sink, or a purpose-built CDC platform like Fivetran, Airbyte CDC mode, or Striim — apply these events to the destination system.
Cloud databases offer managed CDC equivalents: AWS DMS with ongoing replication, Google Datastream, and Azure Database Migration Service all implement log-based CDC without requiring Debezium deployment.
CDC Strengths
- Sub-second to low-second latency — change events arrive at the destination within seconds of the source commit.
- Minimal source load — reading from the transaction log imposes negligible overhead compared to periodic bulk SELECT queries.
- Captures hard deletes natively — a DELETE operation in the source produces a delete event in the stream; no soft-delete workarounds needed.
- Exact event ordering is preserved within a transaction, enabling accurate event-driven downstream processing.
CDC introduces significant operational complexity. You must configure the source database for logical replication (requires superuser or replication role), manage Kafka topic retention and consumer lag, handle schema evolution across the event stream (Confluent Schema Registry with Avro/Protobuf is the standard approach), and ensure connector idempotency so that consumer restarts do not produce duplicate records. Schema changes that drop or rename columns can break CDC consumers immediately if not handled with a schema evolution strategy.
For PostgreSQL CDC with Debezium, set wal_level = logical and use a dedicated replication slot per connector. Monitor replication slot lag carefully — an unconsumed slot causes WAL files to accumulate on disk and can fill the data volume, bringing the source database to a halt.
Comparison Table
| Dimension | ETL | ELT | CDC |
|---|---|---|---|
| Typical latency | Hours (batch window) | Minutes to hours | Sub-second to seconds |
| Where transformation runs | External ETL engine | Inside the warehouse (dbt/SQL) | Stream processor (Flink, ksqlDB) or post-landing |
| Source load impact | High — bulk SELECT during extract window | Medium — periodic bulk SELECT or API call | Very low — reads transaction log only |
| Hard deletes captured | No (requires soft delete or CDC supplement) | No (same limitation) | Yes — natively |
| Schema evolution handling | Brittle — pipeline code must be updated | Resilient — new columns land in staging | Requires schema registry and versioning strategy |
| Operational complexity | Medium | Low to medium | High |
| Infrastructure cost | ETL server / Spark cluster | Ingestion tool + warehouse compute | Kafka cluster + connectors + warehouse |
| Best destination type | On-premise DWH, regulated systems | Cloud DWH (BigQuery, Snowflake, Redshift) | Kafka, operational databases, real-time analytics (StarRocks, Druid, ClickHouse) |
| Primary tooling | Informatica, SSIS, Talend, AWS Glue | Fivetran, Airbyte + dbt | Debezium, Kafka, AWS DMS, Datastream, Striim |
| Ideal use cases | Nightly reporting, compliance, legacy DWH | Analytical dashboards, data lake modeling | Real-time dashboards, event-driven microservices, operational analytics |
Choosing the Right Pattern
The decision matrix below maps use case characteristics to the appropriate pattern. In practice, most mid-to-large data platforms combine all three: CDC for operational feeds requiring freshness, ELT for analytical modeling, and ETL for legacy or regulated pipelines that cannot be migrated.
Choose ETL when:
- Your destination is an on-premise or legacy data warehouse with limited internal compute.
- Regulatory requirements mandate that transformation logic and data quality checks exist as auditable artifacts outside the warehouse.
- The source system exports data in flat files or proprietary formats that require pre-processing before they can be loaded.
- Your team has deep expertise in a specific ETL platform (Informatica, SSIS) and migration cost outweighs latency improvement.
Choose ELT when:
- Your destination is a cloud data warehouse — BigQuery, Snowflake, Redshift, or Databricks — with ample compute elasticity.
- Analysts and analytics engineers need to iterate on transformation logic frequently using SQL and dbt.
- You want to preserve raw data fidelity for reprocessing and historical analysis.
- Latency requirements are in the 15-minute to 1-hour range and source systems can tolerate periodic incremental queries.
Choose CDC when:
- You need sub-minute or real-time data freshness for operational dashboards, fraud detection, inventory visibility, or event-driven applications.
- Source systems are high-traffic OLTP databases where bulk SELECT queries during business hours are not acceptable.
- You need to capture hard deletes accurately — critical for CRM sync, compliance deletion tracking, or any GDPR-driven right-to-erasure workflow.
- You are building event-driven microservices that react to data changes rather than polling for them.
Hybrid Approaches
The most scalable architectures do not pick one pattern — they compose them. A common hybrid is CDC into ELT: Debezium streams change events into Kafka, a Kafka Connect sink writes raw events to a warehouse staging area, and dbt transforms them on a 5–15 minute schedule. This delivers near-real-time freshness with the maintainability of SQL-based transformations.
Another proven pattern is CDC into a real-time OLAP engine (StarRocks, Apache Druid, ClickHouse) alongside a parallel ELT pipeline into Snowflake or BigQuery. The real-time engine serves operational dashboards requiring second-level freshness; the cloud DWH serves complex analytical queries that need full historical depth and dbt-modeled semantics.
When evaluating whether to introduce CDC, calculate the cost of Kafka cluster operations, connector management, and schema registry maintenance against your actual latency SLA. If your stakeholders can tolerate 30-minute-old data, a well-tuned ELT pipeline with a short sync interval is significantly simpler to operate and is often the right starting point — with CDC introduced surgically for the subset of feeds that genuinely require real-time freshness.
Key Takeaways
- ETL transforms before loading — proven for legacy warehouses and regulated pipelines, but carries hours of latency and brittle schema coupling.
- ELT loads raw data first and transforms inside the cloud warehouse — the default choice for modern analytics stacks using dbt, Snowflake, or BigQuery.
- CDC reads from the database transaction log, delivering sub-second change events with minimal source impact and native hard-delete capture.
- Latency hierarchy: ETL (hours) > ELT (minutes) > CDC (seconds).
- Source load impact runs in the opposite direction: ETL hits hardest, CDC is nearly invisible to the source.
- Schema evolution is most forgiving in ELT (new columns land in staging safely), most operationally demanding in CDC (requires schema registry and versioning).
- Most production data platforms combine all three: CDC for real-time operational feeds, ELT for analytical modeling, ETL for legacy or compliance-critical pipelines.
- Start with ELT for new analytical use cases. Add CDC when you hit a latency wall that short-interval ELT cannot solve cost-effectively.
How JusDB Helps You Navigate Data Integration
Choosing the right integration pattern is only the first decision. Implementing it correctly — selecting the right tools, configuring your database for log-based replication, designing your warehouse schemas for incremental loads, and building the observability to catch lag and schema drift before they surface as broken dashboards — requires deep operational experience across all three patterns.
JusDB's database guides, architecture comparisons, and tooling deep-dives are written by engineers who have built and operated CDC pipelines on Debezium and Kafka, ELT stacks on dbt and Snowflake, and legacy ETL migrations at scale. Whether you're evaluating Airbyte against Fivetran for your ELT layer, configuring PostgreSQL logical replication slots for production CDC, or designing a hybrid architecture that combines both, JusDB has the reference material to help you move faster and avoid the pitfalls.
Explore the JusDB blog for in-depth guides on Debezium configuration, dbt incremental strategies, Kafka Connect sink tuning, and real-time OLAP engine comparisons — the technical depth you need to build data pipelines that actually hold up in production.