Apache SeaTunnel: Unified CDC and Batch Pipelines at Scale

Use Apache SeaTunnel for CDC and batch ingestion — Zeta engine, 100+ connectors, exactly-once semantics, and Kubernetes deployment

JusDB Team
February 17, 2026
Updated June 20, 2026

Change data capture pipelines have a reputation for being brittle: you wire together a Debezium connector, a Kafka topic, a Flink job, and a sink adapter, then spend the next six months babysitting each seam. Apache SeaTunnel takes a different approach: one unified runtime, a declarative config file, and a connector library that covers both CDC streams and bulk batch loads without forcing you to maintain two separate codebases. Whether you are migrating a 10 TB data warehouse overnight or keeping a real-time analytics table in sync within seconds of the source commit, the same tool handles both workloads. This post walks through the architecture, a complete MySQL-to-ClickHouse CDC job, connector options, and a Kubernetes deployment pattern that production teams are actually running today.

TL;DR
  • Apache SeaTunnel is an open-source data integration platform with its own lightweight Zeta engine, so pipelines do not need a separate Spark or Flink cluster.
  • It ships with 100+ connectors covering MySQL, PostgreSQL, Kafka, S3, ClickHouse, StarRocks, and dozens more, all usable from a single declarative job file (HOCON or JSON).
  • CDC pipelines use Debezium under the hood with exactly-once semantics enforced through two-phase commit across source and sink.
  • Batch and streaming jobs share the same API, so you can backfill historical data and then seamlessly hand off to incremental CDC without rewriting anything.
  • Kubernetes deployment is first-class: Zeta runs as a cluster of master and worker nodes (the counterparts of Flink's JobManager and TaskManagers), and Helm charts are available for quick provisioning.

What Is Apache SeaTunnel

Apache SeaTunnel (formerly known as Waterdrop) entered the Apache Incubator in late 2021 and graduated to a top-level Apache project in 2023. Its core purpose is deceptively simple: read data from source A, optionally transform it, and write it to sink B, reliably and at petabyte scale, for both batch snapshots and real-time CDC streams. What makes it stand out in a crowded integration space is the refusal to treat batch and streaming as fundamentally different problems. SeaTunnel represents both as pipelines described in the same HOCON configuration format (JSON is also accepted), executed on the same runtime.

The project is structured around three abstractions:

  • Source connectors — read data from external systems using either a bounded (batch) or unbounded (streaming) reader interface.
  • Transform plugins — pure in-process operations such as field mapping, SQL filtering, type casting, and multi-table routing.
  • Sink connectors — write data to target systems with configurable delivery guarantees ranging from at-least-once to exactly-once.

All three plug into the engine through a stable SPI interface, meaning a connector written for Zeta also runs on Flink or Spark with no code changes — only the engine stanza in the job file changes.
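
The three abstractions can be sketched in a few lines of Python. This is only an illustration of the source, transform, and sink split: SeaTunnel's real SPI is a Java API, and every name here (bounded_source, rename_fields, ListSink, run_pipeline) is hypothetical.

```python
from typing import Callable, Iterable, Iterator, List

Row = dict


def bounded_source(rows: List[Row]) -> Iterator[Row]:
    """Batch-style reader: yields a finite snapshot, then stops."""
    yield from rows


def rename_fields(mapping: dict) -> Callable[[Row], Row]:
    """Transform: keep only the mapped fields, under their new names."""
    def apply(row: Row) -> Row:
        return {new: row[old] for old, new in mapping.items()}
    return apply


class ListSink:
    """Sink: collects rows in memory instead of writing to a real system."""

    def __init__(self) -> None:
        self.rows: List[Row] = []

    def write(self, row: Row) -> None:
        self.rows.append(row)


def run_pipeline(source: Iterable[Row], transforms, sink) -> None:
    """Pull rows from the source, apply each transform in order, write to the sink."""
    for row in source:
        for transform in transforms:
            row = transform(row)
        sink.write(row)


sink = ListSink()
run_pipeline(
    bounded_source([{"id": 1, "total_amount": 9.5, "internal_flag": True}]),
    [rename_fields({"id": "order_id", "total_amount": "amount_usd"})],
    sink,
)
print(sink.rows)  # [{'order_id': 1, 'amount_usd': 9.5}]
```

An unbounded (streaming) source would simply be an iterator that never finishes; run_pipeline does not need to change, which is the point of treating batch and streaming as the same kind of pipeline.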

Zeta Engine Architecture (vs Flink/Spark)

Most SeaTunnel users eventually ask: why does this project need its own engine when Flink and Spark already exist? The answer comes down to operational weight. Both Flink and Spark carry significant JVM overhead: Flink's JobManager alone consumes hundreds of megabytes before your first record flows, and Spark's driver adds further per-job startup latency. For organisations running dozens of small-to-medium pipelines — the typical data platform scenario — that overhead multiplies painfully.

Zeta is SeaTunnel's answer. It is a purpose-built, lightweight distributed execution engine designed specifically for data integration workloads. Key architectural decisions:

| Dimension | Zeta Engine | Apache Flink | Apache Spark |
|---|---|---|---|
| Process model | Single JVM per node, shared thread pool | Separate JVM per TaskManager slot group | Separate JVM per executor |
| State backend | Built-in IMap (Hazelcast-based) | RocksDB / heap | Checkpoint to HDFS/S3 |
| Startup latency | ~2–5 s for small jobs | 10–30 s typical | 30–90 s typical |
| Multi-job isolation | Thread-level (low overhead) | Process-level | Process-level |
| CDC support | Native, first-class | Via Flink CDC plugin | Limited, batch-centric |
| Connector reuse | Same connector JAR | Requires Flink connector API | Requires Spark connector API |

Zeta uses pipeline-level parallelism rather than operator-level, which simplifies checkpointing. Each pipeline is an independent unit of failure recovery. The state store is backed by Hazelcast IMap distributed across the cluster, giving you in-memory speed with optional persistence to local disk or an external store. This design makes Zeta particularly well-suited for fleets of medium-throughput pipelines that would be wasteful to deploy as dedicated Flink clusters.

Tip

If you already operate a Flink cluster and want to leverage existing operational knowledge, SeaTunnel's Flink engine mode is a valid choice — your connector configs remain identical. Zeta becomes the better default once you want to consolidate many pipelines onto shared infrastructure without the per-cluster overhead.

Setting Up a CDC Pipeline with SeaTunnel

The most common real-world pipeline pattern is MySQL CDC flowing into an OLAP store such as ClickHouse for real-time analytics. SeaTunnel handles this in a single job file. The MySQL CDC source connector wraps Debezium internally, manages binlog offsets, and emits row-level change events (INSERT, UPDATE, DELETE) as a typed data stream. The ClickHouse sink uses its native HTTP interface and batches writes for efficiency, applying two-phase commit to enforce exactly-once delivery.

Exactly-once works as follows: the source records the current binlog position into a checkpoint snapshot. The sink buffers uncommitted rows in a staging table. On checkpoint completion, the sink promotes staged rows atomically and the source advances its committed offset. If the job crashes mid-flight, the next startup replays from the last committed checkpoint, and the sink discards any partially staged data from the failed attempt.
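
The checkpoint cycle described above can be modelled as a small single-process simulation. It is a sketch of the idea, not SeaTunnel's implementation: the Source and Sink classes and run_until_checkpoint are hypothetical, and a real engine has to coordinate the sink commit and the offset advance across processes.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    """Replays change events from a log, starting at the committed offset."""
    log: list
    committed_offset: int = 0

    def read_from(self, offset, count):
        return self.log[offset:offset + count]


@dataclass
class Sink:
    """Buffers rows in a staging area until a checkpoint promotes them."""
    committed: list = field(default_factory=list)
    staged: list = field(default_factory=list)

    def write(self, rows):
        self.staged.extend(rows)

    def commit(self):
        self.committed.extend(self.staged)
        self.staged.clear()

    def abort(self):
        self.staged.clear()


def run_until_checkpoint(source, sink, batch, crash_before_commit=False):
    """Process one checkpoint interval; return True if it committed."""
    sink.abort()  # recovery: discard rows staged by a failed attempt
    rows = source.read_from(source.committed_offset, batch)
    sink.write(rows)
    if crash_before_commit:
        return False  # offset not advanced, staged rows left behind
    sink.commit()
    source.committed_offset += len(rows)
    return True


source = Source(log=["e1", "e2", "e3", "e4"])
sink = Sink()
run_until_checkpoint(source, sink, batch=2)                            # commits e1, e2
run_until_checkpoint(source, sink, batch=2, crash_before_commit=True)  # stages e3, e4, then fails
run_until_checkpoint(source, sink, batch=2)                            # replays e3, e4 and commits
print(sink.committed)  # ['e1', 'e2', 'e3', 'e4']
```

The failed attempt leaves e3 and e4 staged but never visible; the retry drops them, re-reads from offset 2, and each event ends up committed exactly once.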

Here is a complete working job configuration:

hocon
# seatunnel-mysql-cdc-to-clickhouse.yaml (HOCON syntax; SeaTunnel job files are conventionally named *.conf)
env {
  job.name        = "mysql-cdc-to-clickhouse"
  job.mode        = "STREAMING"
  checkpoint.interval = 30000          # ms — triggers two-phase commit cycle
  parallelism     = 4
}

source {
  MySQL-CDC {
    result_table_name = "orders_cdc"

    # SeaTunnel's MySQL-CDC source takes a JDBC URL rather than host/port
    base-url      = "jdbc:mysql://mysql.internal:3306/commerce"
    username      = "seatunnel_reader"
    password      = "${MYSQL_PASSWORD}"
    database-names = ["commerce"]
    table-names    = ["commerce.orders", "commerce.order_items"]

    # "initial" takes a full snapshot on first run, then switches to the binlog;
    # subsequent runs resume from the checkpointed offset.
    startup.mode  = "initial"

    # Debezium properties passed through
    debezium {
      snapshot.mode                = "initial"
      decimal.handling.mode        = "double"
      bigint.unsigned.handling.mode = "long"
    }
  }
}

transform {
  FieldMapper {
    source_table_name = "orders_cdc"
    result_table_name = "orders_flat"

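    field_mapper = {
      id            = order_id
      customer_id   = customer_id
      total_amount  = amount_usd
      created_at    = event_time
      status        = status        # kept so the row filter below can use it
      # Columns not listed here are dropped from the output
    }
  }

  # Row-level filtering is done with the Sql transform; the Filter
  # transform selects fields, not rows.
  Sql {
    source_table_name = "orders_flat"
    result_table_name = "orders_filtered"
    # Only forward completed or shipped orders
    query = "SELECT * FROM orders_flat WHERE status = 'completed' OR status = 'shipped'"
  }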
}

sink {
  ClickHouse {
    source_table_name = "orders_filtered"

    host          = "clickhouse.internal:8123"
    database      = "analytics"
    table         = "orders_realtime"
    username      = "seatunnel_writer"
    password      = "${CLICKHOUSE_PASSWORD}"

    # Apply CDC updates and deletes as upserts keyed on the primary key
    support_upsert         = true
    primary_key            = "order_id"
    allow_experimental_lightweight_delete = true

    # Batch tuning
    bulk_size     = 20000
    flush_interval = 5000               # ms — flush even if bulk_size not reached
  }
}

Submit this job to a running Zeta cluster with:

bash
# Run in local mode (starts an embedded Zeta engine for this job only).
# Older 2.3.x releases use --deploy-mode instead of --master.
./bin/seatunnel.sh \
  --config ./seatunnel-mysql-cdc-to-clickhouse.yaml \
  --master local

# Submit to a running Zeta cluster; the cluster address is read from
# config/hazelcast-client.yaml (for example seatunnel-master:5801)
./bin/seatunnel.sh \
  --config ./seatunnel-mysql-cdc-to-clickhouse.yaml \
  --master cluster

# Check job status
./bin/seatunnel.sh --list
Warning

The MySQL user referenced in the source config needs SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, and REPLICATION CLIENT privileges. Missing any of these will cause the initial snapshot to fail silently or the binlog stream to drop without a clear error message. Always verify the grant set before deploying to production.
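
One way to verify the grant set before deploying is to run SHOW GRANTS for the user and check the output programmatically. The helper below is a hypothetical sketch: it assumes the rows look like MySQL's usual `GRANT ... ON ... TO ...` lines, ignores the scope of each grant (except for ALL PRIVILEGES, which only counts on `*.*`), and does not handle column-level privileges.

```python
import re

# Privileges the CDC reader needs, as listed in the warning above.
REQUIRED = {
    "SELECT", "RELOAD", "SHOW DATABASES",
    "REPLICATION SLAVE", "REPLICATION CLIENT",
}

GRANT_LINE = re.compile(r"GRANT (.+?) ON (\S+) TO ")


def missing_privileges(show_grants_rows):
    """Return the required privileges that no SHOW GRANTS row mentions."""
    granted = set()
    for row in show_grants_rows:
        match = GRANT_LINE.match(row)
        if not match:
            continue
        privileges = {p.strip().upper() for p in match.group(1).split(",")}
        if "ALL PRIVILEGES" in privileges and match.group(2) == "*.*":
            return set()  # global ALL covers everything in REQUIRED
        granted |= privileges
    return REQUIRED - granted


# Example row: REPLICATION SLAVE was forgotten.
rows = [
    "GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION CLIENT "
    "ON *.* TO `seatunnel_reader`@`%`",
]
print(sorted(missing_privileges(rows)))  # ['REPLICATION SLAVE']
```

Failing a deployment step when this returns a non-empty set turns the silent snapshot failure into an explicit error.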

Connector Ecosystem (Sources and Sinks)

SeaTunnel's connector library is one of its strongest practical arguments. The project ships with over 100 connectors maintained in the main repository, covering the majority of systems a data platform team encounters. They are divided into two categories: SeaTunnel V2 connectors (the modern, engine-independent API, which runs on Zeta, Flink, and Spark) and legacy V1 connectors (written against a specific engine's API and being phased out).

A representative selection of the V2 connector catalogue:

| Category | Source Connectors | Sink Connectors |
|---|---|---|
| OLTP / CDC | MySQL CDC, PostgreSQL CDC, Oracle CDC, SQL Server CDC, MongoDB CDC, TiDB CDC | MySQL, PostgreSQL, Oracle, SQL Server, TiDB |
| OLAP | ClickHouse, StarRocks, Doris, Hive, Iceberg, Hudi | ClickHouse, StarRocks, Doris, Hive, Iceberg, Hudi, BigQuery |
| Streaming / Queue | Kafka, Pulsar, RocketMQ, RabbitMQ | Kafka, Pulsar, RocketMQ |
| Object Storage | S3, GCS, Azure Blob, OSS, MinIO (all via unified File connector) | S3, GCS, Azure Blob, OSS, MinIO |
| SaaS / API | HTTP, Salesforce, HubSpot, Slack, GitHub | HTTP, Feishu, DingTalk, Slack |
| Search / Cache | Elasticsearch, OpenSearch, Redis | Elasticsearch, OpenSearch, Redis, Neo4j |

Every connector exposes a consistent set of options for parallelism, retry policy, and schema inference. The schema block is optional — SeaTunnel will infer column types from the source when possible — but specifying it explicitly is strongly recommended for production pipelines to catch upstream schema changes at job startup rather than mid-run.
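
The benefit of an explicit schema is easiest to see as a comparison between what the job declares and what the source currently exposes. The function below is a hypothetical sketch of that startup check, not SeaTunnel code; the column names and type strings are made up for illustration.

```python
def schema_drift(declared, observed):
    """Return human-readable differences between two {column: type} maps."""
    problems = []
    for column, expected in declared.items():
        actual = observed.get(column)
        if actual is None:
            problems.append(f"missing column: {column}")
        elif actual != expected:
            problems.append(f"type changed: {column} {expected} -> {actual}")
    for column in sorted(observed.keys() - declared.keys()):
        problems.append(f"unexpected column: {column}")
    return problems


declared = {"order_id": "bigint", "amount_usd": "decimal(10,2)", "status": "string"}
observed = {"order_id": "bigint", "amount_usd": "double", "coupon": "string"}

for problem in schema_drift(declared, observed):
    print(problem)
# type changed: amount_usd decimal(10,2) -> double
# missing column: status
# unexpected column: coupon
```

With inference alone, the same upstream change would only surface once a row with the new shape reaches the sink.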

Deploying SeaTunnel on Kubernetes

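For production deployments, Kubernetes is the natural home for SeaTunnel's Zeta engine. The cluster topology mirrors Flink: one SeaTunnel Master (equivalent to JobManager) handles job scheduling and state coordination, while multiple SeaTunnel Workers (equivalent to TaskManagers) execute pipeline tasks. The manifests below omit the Service objects; workers reach the master through the name in ST_DOCKER_MEMBER_LIST, so a Service called seatunnel-master exposing port 5801 must exist in the same namespace.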

yaml
# seatunnel-master-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: seatunnel-master
  namespace: data-platform
spec:
  replicas: 1
  selector:
    matchLabels:
      app: seatunnel-master
  template:
    metadata:
      labels:
        app: seatunnel-master
    spec:
      containers:
        - name: master
          image: apache/seatunnel:2.3.5
          command: ["/opt/seatunnel/bin/seatunnel-cluster.sh"]
          args: ["-r", "master"]
          ports:
            - containerPort: 5801    # Hazelcast cluster port
            - containerPort: 8080    # REST API / UI
          env:
            - name: ST_DOCKER_MEMBER_LIST
              value: "seatunnel-master"
            - name: JAVA_OPTS
              value: "-Xms512m -Xmx2g"
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "2"
              memory: "4Gi"
          volumeMounts:
            - name: config
              mountPath: /opt/seatunnel/config
      volumes:
        - name: config
          configMap:
            name: seatunnel-config
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: seatunnel-worker
  namespace: data-platform
spec:
  replicas: 3
  selector:
    matchLabels:
      app: seatunnel-worker
  template:
    metadata:
      labels:
        app: seatunnel-worker
    spec:
      containers:
        - name: worker
          image: apache/seatunnel:2.3.5
          command: ["/opt/seatunnel/bin/seatunnel-cluster.sh"]
          args: ["-r", "worker"]
          env:
            - name: ST_DOCKER_MEMBER_LIST
              value: "seatunnel-master"
            - name: JAVA_OPTS
              value: "-Xms1g -Xmx4g"
          resources:
            requests:
              cpu: "1"
              memory: "2Gi"
            limits:
              cpu: "4"
              memory: "8Gi"

A few production considerations worth encoding in your Helm values:

  • Checkpoint storage: point the checkpoint storage section of seatunnel.yaml at HDFS or S3 (S3 is configured through the hdfs storage type with storage.type: s3) so that job state survives pod restarts. Storage that lives only inside the pod is fine for development but loses the offset position on any pod eviction.
  • Slot count: each worker exposes a configurable number of task slots (dynamic-slot and slot-num under seatunnel.engine.slot-service). Start with two slots per CPU core and tune down if you observe GC pressure.
  • Connector JARs: mount a shared PVC or use an init container to pull connector JARs from an internal artifact store. Baking all 100+ connectors into the image bloats it unnecessarily — only include what each deployment needs.
Tip

The SeaTunnel community maintains an unofficial Helm chart at seatunnel/seatunnel-helm. It handles the Hazelcast service discovery configuration automatically, which is the most common source of worker-to-master connectivity failures when setting up Kubernetes deployments manually.

Key Takeaways
  • Apache SeaTunnel unifies batch snapshot ingestion and real-time CDC streaming in one declarative job configuration, eliminating the need for separate pipeline codebases.
  • The Zeta engine is a purpose-built lightweight runtime that avoids Flink and Spark JVM overhead — startup latency is measured in seconds, making it practical to run dozens of small-to-medium pipelines on shared infrastructure.
  • Exactly-once semantics are achieved through two-phase commit: the source checkpoints its binlog offset and the sink stages writes atomically, ensuring no data loss or duplication on failure recovery.
  • With 100+ connectors covering MySQL, PostgreSQL, Kafka, S3, ClickHouse, StarRocks, Iceberg, and more, SeaTunnel handles the full breadth of a modern data platform's integration surface from a single tool.
  • Kubernetes deployment is production-grade: configure durable checkpoint storage (S3 or HDFS), tune slot counts per worker, and use an init container pattern to manage connector JAR distribution without bloating your base image.
  • SeaTunnel's connector SPI means the same connector code runs on Zeta, Flink, or Spark — so adopting SeaTunnel does not lock you into a single execution engine.

Working with JusDB on Apache SeaTunnel

Standing up a SeaTunnel cluster is straightforward; keeping it healthy in production is a different exercise. Offset drift, schema evolution in source tables, ClickHouse merge-tree tuning, and checkpoint storage configuration all require domain-specific knowledge that accumulates slowly and painfully through incidents. JusDB's engineering team has deployed and operated SeaTunnel pipelines across a range of production environments — from single-node Zeta clusters handling modest CDC volumes to multi-worker Kubernetes deployments ingesting billions of rows per day into StarRocks and ClickHouse OLAP stores.

Engagements typically cover initial pipeline design and job file authoring, connector configuration review (the MySQL CDC privilege issue described above is a real example from a client onboarding), checkpoint storage setup, Kubernetes resource sizing, and ongoing incident response. If you are evaluating SeaTunnel for a specific migration project or need help tuning an existing deployment, the team is available for both short advisory engagements and longer managed service arrangements.
