Change data capture pipelines have a reputation for being brittle: you wire together a Debezium connector, a Kafka topic, a Flink job, and a sink adapter, then spend the next six months babysitting each seam. Apache SeaTunnel takes a different approach: one unified runtime, a declarative config file, and a connector library that covers both CDC streams and bulk batch loads without forcing you to maintain two separate codebases. Whether you are migrating a 10 TB data warehouse overnight or keeping a real-time analytics table in sync within seconds of the source, the same tool handles both workloads. This post walks through the architecture, a complete MySQL-to-ClickHouse CDC job, connector options, and a Kubernetes deployment pattern that production teams are actually running today.
- Apache SeaTunnel is an open-source data integration platform with its own lightweight Zeta engine, which avoids the operational overhead of running a separate Spark or Flink cluster.
- It ships with 100+ connectors covering MySQL, PostgreSQL, Kafka, S3, ClickHouse, StarRocks, and dozens more, all usable from a single declarative job file.
- CDC pipelines use Debezium under the hood. Combined with checkpointing and a sink that supports two-phase commit or idempotent writes, delivery is exactly-once end to end.
- Batch and streaming jobs share the same API, so you can backfill historical data and then seamlessly hand off to incremental CDC without rewriting anything.
- Kubernetes deployment is first-class: Zeta runs as a native cluster of master and worker nodes (analogous to Flink's JobManager and TaskManagers), and Helm charts are available for quick provisioning.
What Is Apache SeaTunnel
Apache SeaTunnel (formerly known as Waterdrop) graduated to a top-level Apache project in 2023. Its core purpose is deceptively simple: read data from source A, optionally transform it, and write it to sink B, but do that reliably at petabyte scale for both batch snapshots and real-time CDC streams. What makes it stand out in a crowded integration space is the refusal to treat batch and streaming as fundamentally different problems. SeaTunnel represents both as pipelines described in the same declarative configuration format (HOCON by default), executed on the same runtime.
The project is structured around three abstractions:
- Source connectors — read data from external systems using either a bounded (batch) or unbounded (streaming) reader interface.
- Transform plugins — pure in-process operations such as field mapping, SQL filtering, type casting, and multi-table routing.
- Sink connectors — write data to target systems with configurable delivery guarantees ranging from at-least-once to exactly-once.
All three plug into the engine through a stable SPI, meaning a connector written against it runs on Zeta, Flink, or Spark with no code changes. The job file stays essentially the same across engines; what changes is the launcher script used to submit it.
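The three abstractions map directly onto the stanzas of a job file. The following is a minimal sketch, assuming the bundled `FakeSource` and `Console` connectors are installed; the table names `people` and `people_renamed` and the two fields are purely illustrative.

```hocon
# Minimal sketch: one source, one transform, one sink.
env {
  job.mode = "BATCH"
  parallelism = 1
}

source {
  # FakeSource generates synthetic rows, so no external system is needed
  FakeSource {
    result_table_name = "people"
    row.num = 16
    schema = {
      fields {
        name = "string"
        age = "int"
      }
    }
  }
}

transform {
  # Rename one field; fields not listed in field_mapper are dropped
  FieldMapper {
    source_table_name = "people"
    result_table_name = "people_renamed"
    field_mapper = {
      name = full_name
      age = age
    }
  }
}

sink {
  # Console prints each row to stdout, which is handy for smoke tests
  Console {
    source_table_name = "people_renamed"
  }
}
```

Each stanza names the table it reads (`source_table_name`) and the table it produces (`result_table_name`), which is how the stages are chained together.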
Zeta Engine Architecture (vs Flink/Spark)
Most SeaTunnel users eventually ask: why does this project need its own engine when Flink and Spark already exist? The answer comes down to operational weight. Both Flink and Spark carry significant JVM overhead: Flink's JobManager alone consumes hundreds of megabytes before your first record flows, and Spark's driver adds further per-job startup latency. For organisations running dozens of small-to-medium pipelines — the typical data platform scenario — that overhead multiplies painfully.
Zeta is SeaTunnel's answer. It is a purpose-built, lightweight distributed execution engine designed specifically for data integration workloads. Key architectural decisions:
| Dimension | Zeta Engine | Apache Flink | Apache Spark |
|---|---|---|---|
| Process model | Single JVM per node, shared thread pool | One JVM per TaskManager, task slots share it | Separate JVM per executor |
| State backend | Built-in IMap (Hazelcast-based) | RocksDB / heap | Checkpoint to HDFS/S3 |
| Startup latency | ~2–5 s for small jobs | 10–30 s typical | 30–90 s typical |
| Multi-job isolation | Thread-level (low overhead) | Process-level | Process-level |
| CDC support | Native, first-class | Via Flink CDC plugin | Limited, batch-centric |
| Connector reuse | Same connector JAR | Requires Flink connector API | Requires Spark connector API |
Zeta uses pipeline-level parallelism rather than operator-level, which simplifies checkpointing. Each pipeline is an independent unit of failure recovery. The state store is backed by Hazelcast IMap distributed across the cluster, giving you in-memory speed with optional persistence to local disk or an external store. This design makes Zeta particularly well-suited for fleets of medium-throughput pipelines that would be wasteful to deploy as dedicated Flink clusters.
If you already operate a Flink cluster and want to leverage existing operational knowledge, SeaTunnel's Flink engine mode is a valid choice — your connector configs remain identical. Zeta becomes the better default once you want to consolidate many pipelines onto shared infrastructure without the per-cluster overhead.
Setting Up a CDC Pipeline with SeaTunnel
The most common real-world pipeline pattern is MySQL CDC flowing into an OLAP store such as ClickHouse for real-time analytics. SeaTunnel handles this in a single job file. The MySQL CDC source connector wraps Debezium internally, manages binlog offsets, and emits row-level change events (INSERT, UPDATE, DELETE) as a typed data stream. The ClickHouse sink writes over ClickHouse's HTTP interface and batches writes for efficiency. Because ClickHouse offers no general-purpose transactional commit, the sink gets exactly-once results from idempotent upserts keyed on the primary key rather than from a true two-phase commit.
For sinks that do implement two-phase commit, exactly-once works as follows: at each checkpoint the source records the current binlog position in the checkpoint snapshot, and the sink holds the rows written since the previous checkpoint in an uncommitted state (a transaction or staging area). When the checkpoint completes, the sink commits those rows atomically and the recorded offset becomes the new restart point. If the job crashes mid-flight, the next startup replays from the last completed checkpoint, and the sink discards anything left uncommitted by the failed attempt. With an idempotent sink such as the ClickHouse upsert path, the replay simply rewrites the same keys with the same values.
Here is a complete working job configuration:
# seatunnel-mysql-cdc-to-clickhouse.yaml
env {
  job.name = "mysql-cdc-to-clickhouse"
  job.mode = "STREAMING"
  checkpoint.interval = 30000 # ms; each checkpoint persists the source offsets
  parallelism = 4
}
source {
  MySQL-CDC {
    result_table_name = "orders_cdc"
    # The connector takes a JDBC URL rather than separate host/port options
    base-url = "jdbc:mysql://mysql.internal:3306/commerce"
    username = "seatunnel_reader"
    password = "${MYSQL_PASSWORD}"
    database-names = ["commerce"]
    table-names = ["commerce.orders", "commerce.order_items"]
    # Take a full snapshot on first run, then switch to the binlog;
    # subsequent runs resume from the checkpointed offset.
    startup.mode = "initial"
    # Debezium properties passed through
    debezium {
      snapshot.mode = "initial"
      decimal.handling.mode = "double"
      bigint.unsigned.handling.mode = "long"
    }
  }
}
transform {
  FieldMapper {
    source_table_name = "orders_cdc"
    result_table_name = "orders_flat"
    field_mapper = {
      id = order_id
      customer_id = customer_id
      total_amount = amount_usd
      created_at = event_time
      # Keep status so the filter below can see it
      status = status
      # Fields not listed here are dropped from the output
    }
  }
  # Row-level filtering uses the Sql transform; the Filter transform
  # selects fields, not rows.
  Sql {
    source_table_name = "orders_flat"
    result_table_name = "orders_filtered"
    # Only forward completed or shipped orders
    query = "select order_id, customer_id, amount_usd, event_time, status from orders_flat where status = 'completed' or status = 'shipped'"
  }
}
sink {
  ClickHouse {
    source_table_name = "orders_filtered"
    host = "clickhouse.internal:8123"
    database = "analytics"
    table = "orders_realtime"
    username = "seatunnel_writer"
    password = "${CLICKHOUSE_PASSWORD}"
    # Idempotent upserts keyed on order_id make checkpoint replays safe
    support_upsert = true
    primary_key = "order_id"
    allow_experimental_lightweight_delete = true
    # Batch tuning
    bulk_size = 20000
    flush_interval = 5000 # ms; flush even if bulk_size not reached
  }
}
Submit this job to a running Zeta cluster with:
# Run in local mode (starts an embedded Zeta instance for this job)
./bin/seatunnel.sh \
  --config ./seatunnel-mysql-cdc-to-clickhouse.yaml \
  --master local
# Submit to a remote Zeta cluster; the client reads the cluster address
# (for example seatunnel-master:5801) from config/hazelcast-client.yaml
./bin/seatunnel.sh \
  --config ./seatunnel-mysql-cdc-to-clickhouse.yaml \
  --master cluster
# Check job status
./bin/seatunnel.sh --list
The MySQL user referenced in the source config needs SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, and REPLICATION CLIENT privileges. Missing any of these can cause the initial snapshot to fail or the binlog stream to stop without a clear error message. Always verify the grant set before deploying to production.
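To illustrate the shared configuration model, here is a minimal sketch of a one-off batch backfill of the same orders table. It assumes the standard `Jdbc` source connector with the MySQL driver on the classpath, and reuses the hostnames and credentials from the CDC job; the table name `orders_snapshot` and the column list are illustrative rather than taken from a real schema.

```hocon
# Hypothetical batch backfill: same stanza layout as the streaming job
env {
  job.name = "mysql-batch-backfill"
  job.mode = "BATCH"   # bounded run; the job exits when the query is drained
  parallelism = 4
}

source {
  Jdbc {
    result_table_name = "orders_snapshot"
    url = "jdbc:mysql://mysql.internal:3306/commerce"
    driver = "com.mysql.cj.jdbc.Driver"
    user = "seatunnel_reader"
    password = "${MYSQL_PASSWORD}"
    # Aliases line the columns up with the ClickHouse target table
    query = "select id as order_id, customer_id, total_amount as amount_usd, created_at as event_time, status from orders"
  }
}

sink {
  ClickHouse {
    source_table_name = "orders_snapshot"
    host = "clickhouse.internal:8123"
    database = "analytics"
    table = "orders_realtime"
    username = "seatunnel_writer"
    password = "${CLICKHOUSE_PASSWORD}"
    bulk_size = 20000
  }
}
```

Only `job.mode` and the source stanza differ from the streaming job. In practice the CDC connector's `startup.mode = "initial"` already performs a snapshot before tailing the binlog, so a separate batch job like this is mainly useful for re-loading history on demand.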
Connector Ecosystem (Sources and Sinks)
SeaTunnel's connector library is one of its strongest practical arguments. The project ships with over 100 connectors maintained in the main repository, covering the majority of systems a data platform team encounters. They are divided into two categories: SeaTunnel V2 connectors (the modern engine-independent API, which runs on Zeta, Flink, and Spark) and legacy V1 connectors (written directly against the Flink or Spark APIs, now phased out).
A representative selection of the V2 connector catalogue:
| Category | Source Connectors | Sink Connectors |
|---|---|---|
| OLTP / CDC | MySQL CDC, PostgreSQL CDC, Oracle CDC, SQL Server CDC, MongoDB CDC, TiDB CDC | MySQL, PostgreSQL, Oracle, SQL Server, TiDB |
| OLAP | ClickHouse, StarRocks, Doris, Hive, Iceberg, Hudi | ClickHouse, StarRocks, Doris, Hive, Iceberg, Hudi, BigQuery |
| Streaming / Queue | Kafka, Pulsar, RocketMQ, RabbitMQ | Kafka, Pulsar, RocketMQ |
| Object Storage | S3, GCS, Azure Blob, OSS, MinIO (all via unified File connector) | S3, GCS, Azure Blob, OSS, MinIO |
| SaaS / API | HTTP, Salesforce, HubSpot, Slack, GitHub | HTTP, Feishu, DingTalk, Slack |
| Search / Cache | Elasticsearch, OpenSearch, Redis | Elasticsearch, OpenSearch, Redis, Neo4j |
Every connector exposes a consistent set of options for parallelism, retry policy, and schema inference. The schema block is optional — SeaTunnel will infer column types from the source when possible — but specifying it explicitly is strongly recommended for production pipelines to catch upstream schema changes at job startup rather than mid-run.
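As a sketch of what an explicit schema looks like, the fragment below declares column types on a Kafka source. The broker address, topic, and consumer group are hypothetical; the field names follow the orders example used earlier in this post.

```hocon
source {
  Kafka {
    result_table_name = "order_events"
    bootstrap.servers = "kafka.internal:9092"   # hypothetical broker
    topic = "orders"                            # hypothetical topic
    consumer.group = "seatunnel-orders"         # hypothetical group id
    format = "json"
    # Explicit schema: rows are mapped onto these declared types
    # instead of relying on inference
    schema = {
      fields {
        order_id = "bigint"
        customer_id = "bigint"
        amount_usd = "decimal(18, 2)"
        status = "string"
        event_time = "timestamp"
      }
    }
  }
}
```

Declaring `amount_usd` as a decimal rather than letting it fall back to a floating-point type is the kind of decision an explicit schema makes visible in code review.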
Deploying SeaTunnel on Kubernetes
For production deployments, Kubernetes is the natural home for SeaTunnel's Zeta engine. The cluster topology mirrors Flink: one SeaTunnel Master (equivalent to JobManager) handles job scheduling and state coordination, while multiple SeaTunnel Workers (equivalent to TaskManagers) execute pipeline tasks.
# seatunnel-master-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: seatunnel-master
  namespace: data-platform
spec:
  replicas: 1
  selector:
    matchLabels:
      app: seatunnel-master
  template:
    metadata:
      labels:
        app: seatunnel-master
    spec:
      containers:
        - name: master
          image: apache/seatunnel:2.3.5
          command: ["/opt/seatunnel/bin/seatunnel-cluster.sh"]
          args: ["-r", "master"]
          ports:
            - containerPort: 5801 # Hazelcast cluster port
            - containerPort: 8080 # REST API / UI
          env:
            - name: ST_DOCKER_MEMBER_LIST
              value: "seatunnel-master"
            - name: JAVA_OPTS
              value: "-Xms512m -Xmx2g"
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "2"
              memory: "4Gi"
          volumeMounts:
            - name: config
              mountPath: /opt/seatunnel/config
      volumes:
        - name: config
          configMap:
            name: seatunnel-config
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: seatunnel-worker
  namespace: data-platform
spec:
  replicas: 3
  selector:
    matchLabels:
      app: seatunnel-worker
  template:
    metadata:
      labels:
        app: seatunnel-worker
    spec:
      containers:
        - name: worker
          image: apache/seatunnel:2.3.5
          command: ["/opt/seatunnel/bin/seatunnel-cluster.sh"]
          args: ["-r", "worker"]
          env:
            - name: ST_DOCKER_MEMBER_LIST
              value: "seatunnel-master"
            - name: JAVA_OPTS
              value: "-Xms1g -Xmx4g"
          resources:
            requests:
              cpu: "1"
              memory: "2Gi"
            limits:
              cpu: "4"
              memory: "8Gi"
A few production considerations worth encoding in your Helm values:
- Checkpoint storage: configure durable checkpoint storage (HDFS or S3) under `checkpoint.storage` in `seatunnel.yaml` so that job state survives pod restarts. The default writes to the pod's local filesystem, which is fine for development but loses the offset position on any pod eviction.
- Slot count: each worker exposes a configurable number of task slots (`seatunnel.engine.slot-service.dynamic-slot` and `slot-num`). Start with two slots per CPU core and tune down if you observe GC pressure.
- Connector JARs: mount a shared PVC or use an init container to pull connector JARs from an internal artifact store. Baking all 100+ connectors into the image bloats it unnecessarily, so include only what each deployment needs.
The SeaTunnel community maintains an unofficial Helm chart at seatunnel/seatunnel-helm. It handles the Hazelcast service discovery configuration automatically, which is the most common source of worker-to-master connectivity failures when setting up Kubernetes deployments manually.
Key Takeaways
- Apache SeaTunnel unifies batch snapshot ingestion and real-time CDC streaming in one declarative job configuration, eliminating the need for separate pipeline codebases.
- The Zeta engine is a purpose-built lightweight runtime that avoids the per-cluster overhead of Flink and Spark. Startup latency is measured in seconds, making it practical to run dozens of small-to-medium pipelines on shared infrastructure.
- Exactly-once results rest on checkpointing: the source checkpoints its binlog offset, and the sink either commits staged writes atomically (two-phase commit) or applies idempotent upserts, so failure recovery neither loses nor duplicates data.
- With 100+ connectors covering MySQL, PostgreSQL, Kafka, S3, ClickHouse, StarRocks, Iceberg, and more, SeaTunnel handles the full breadth of a modern data platform's integration surface from a single tool.
- Kubernetes deployment is production-grade: configure durable checkpoint storage (S3 or HDFS), tune slot counts per worker, and use an init container pattern to manage connector JAR distribution without bloating your base image.
- SeaTunnel's connector SPI means the same connector code runs on Zeta, Flink, or Spark — so adopting SeaTunnel does not lock you into a single execution engine.
Working with JusDB on Apache SeaTunnel
Standing up a SeaTunnel cluster is straightforward; keeping it healthy in production is a different exercise. Offset drift, schema evolution in source tables, ClickHouse merge-tree tuning, and checkpoint storage configuration all require domain-specific knowledge that accumulates slowly and painfully through incidents. JusDB's engineering team has deployed and operated SeaTunnel pipelines across a range of production environments — from single-node Zeta clusters handling modest CDC volumes to multi-worker Kubernetes deployments ingesting billions of rows per day into StarRocks and ClickHouse OLAP stores.
Engagements typically cover initial pipeline design and YAML job authoring, connector configuration review (the MySQL CDC privilege issue described above is a real example from a client onboarding), checkpoint storage setup, Kubernetes resource sizing, and ongoing incident response. If you are evaluating SeaTunnel for a specific migration project or need help tuning an existing deployment, the team is available for both short advisory engagements and longer managed service arrangements.