Cloud Databases

AWS DMS Performance Tuning: Zero-Downtime Database Migration at Scale

AWS DMS can migrate terabytes of data with zero downtime — but default settings leave significant performance on the table. Here's how to tune DMS for production-grade migrations.

JusDB Team
November 29, 2025
10 min read
186 views

A fintech platform running on Oracle 19c needed to migrate 4.2 TB of transactional data to Aurora PostgreSQL without a maintenance window — their SLA permitted zero downtime and the engineering lead had a hard requirement: no data loss, row count reconciled to zero. They had three weeks, one shot, and an angry CTO watching the Slack channel. AWS Database Migration Service (DMS) was the right tool, but only after the team understood exactly how its three-layer architecture behaves under production load, where it silently fails on LOB columns, and how to time a cutover so tightly that the application layer never noticed the switch. This guide covers everything that migration required, in the exact order you need it.

TL;DR
  • AWS DMS has three components: a replication instance (the engine), source and target endpoints (the connections), and replication tasks (the actual migration work). All three must be sized and configured independently.
  • Use Full Load + CDC task type for zero-downtime migrations. Full Load seeds the target; CDC (Change Data Capture) keeps it current until cutover.
  • Enable binary logging (MySQL) or logical replication slots (PostgreSQL) on the source before the DMS task starts — CDC will fail silently without these prerequisites.
  • LOB columns are the #1 cause of DMS failures. Set LobMaxSize explicitly and use Limited LOB Mode unless your LOBs genuinely vary in size.
  • Right-size the replication instance: dms.r6i.xlarge handles most migrations under 1 TB; go to dms.r6i.2xlarge for parallel full-load at scale.
  • Cutover is a four-step process: stop writes on source, confirm CDC lag reaches zero, flip DNS, verify application connectivity on the target.

AWS DMS Architecture: The Three-Layer Model

Before touching the console, understand what DMS actually is. It is not a single service — it is a pipeline with three independently configured components, and performance problems almost always trace back to one specific layer being misconfigured.

Replication Instance

The replication instance is an EC2 instance managed by AWS that runs the DMS engine. It reads from your source endpoint, buffers change data in its local storage, and writes to your target endpoint. The instance class determines CPU, RAM, and network throughput. The attached storage (Multi-AZ or single-AZ) determines how much CDC buffering you can sustain during a target slowdown.

Instance class selection rules of thumb:

  • dms.t3.medium — dev/test only. Burstable CPU will throttle under any sustained load.
  • dms.r6i.xlarge — production migrations up to ~1 TB with moderate CDC rate.
  • dms.r6i.2xlarge — parallel full-load tasks, high LOB volume, or sustained CDC above 10K TPS on the source.
  • dms.r6i.4xlarge — multi-TB migrations with tight SLA, or when target is Aurora and you want maximum write parallelism.
Tip

Always enable Multi-AZ on the replication instance for production migrations. If the instance fails mid-migration, DMS resumes from the last committed checkpoint rather than restarting the full load. A failed single-AZ instance mid-LOB-load is a multi-hour recovery.

Source and Target Endpoints

Endpoints define the connection parameters and engine-specific settings for your source and target databases. DMS supports MySQL, PostgreSQL, Oracle, SQL Server, MongoDB, Amazon S3, Amazon Redshift, and several others as sources or targets. Each engine has its own endpoint extra connection attributes (ECA) that control CDC behavior, SSL mode, and character set handling.

Key source endpoint settings for CDC-capable configurations:

ini
-- MySQL source ECA
server_id=12345;
binlogPosition=ROW;
afterConnectScript=SET GLOBAL binlog_format='ROW';

-- PostgreSQL source ECA
pluginName=pglogical;
slotName=dms_replication_slot;
heartbeatEnable=true;
heartbeatFrequency=5;

-- Oracle source ECA
useLogminerReader=Y;
useBfile=N;
accessAlternateDirectly=false;

Replication Tasks

Replication tasks define what to migrate and how. A task ties together one replication instance, one source endpoint, and one target endpoint. It contains table mappings (which schemas and tables to include or exclude) and task settings (LOB mode, error handling, parallel load rules).


Task Types: Choosing the Right Migration Mode

Full Load

Full Load copies existing data from source to target in bulk, then stops. No CDC. Use this only when the source database is offline during the migration or when downtime is acceptable. DMS reads source tables in parallel using SELECT queries (or direct Oracle read APIs) and bulk-inserts into the target. It does not capture changes made to the source while the load is running.

CDC (Ongoing Replication)

CDC only starts replicating changes from a specified point in time or LSN without doing an initial full load. Use this when you have already seeded the target by another mechanism (for example, an RDS snapshot restore or pg_dump restore) and only need DMS to capture the delta going forward.

Full Load + CDC

Full Load + CDC is the zero-downtime path. DMS first completes a full load of the source tables, then automatically transitions to CDC mode to replicate ongoing changes. The target catches up to the source incrementally. You cut over only when CDC lag is near zero. This is the mode you want for production migrations.

Warning

In Full Load + CDC mode, DMS does not apply DDL changes made to the source during the full load phase. Schema changes (ALTER TABLE, DROP COLUMN, CREATE INDEX) made on the source while DMS is in full-load will not be replicated and will break CDC. Freeze DDL on the source for the duration of the migration.


Pre-Migration: SCT Assessment and Source Prerequisites

AWS Schema Conversion Tool (SCT)

Before creating a single DMS task, run the AWS Schema Conversion Tool against your source database. SCT analyzes your schema and generates a conversion report that identifies objects DMS cannot migrate automatically: stored procedures using proprietary syntax, triggers, sequences, views with engine-specific functions, and custom data types.

SCT assigns each object a complexity rating (Simple / Medium / Complex / Action Required). Every object marked Action Required must be manually rewritten for the target engine before your DMS task starts. Discovering these on cutover day is how migrations fail.

bash
# Install SCT (macOS)
brew install --cask aws-schema-conversion-tool

# Or download from:
# https://docs.aws.amazon.com/SchemaConversionTool/latest/userguide/CHAP_Installing.html

# SCT generates a PDF/HTML assessment report.
# Focus on: stored procedures, triggers, sequences, custom types, package bodies (Oracle)

MySQL: Enable Binary Logging

DMS CDC from MySQL requires binary logging in ROW format. This is not enabled by default on self-managed MySQL. On RDS MySQL, set the parameter group value and reboot; binlog is already enabled with binlog_format=ROW on RDS MySQL by default.

ini
# my.cnf (self-managed MySQL)
[mysqld]
server-id           = 1
log_bin             = /var/log/mysql/mysql-bin.log
binlog_format       = ROW
binlog_row_image    = FULL
expire_logs_days    = 7
max_binlog_size     = 512M
sql
-- Grant DMS user the required privileges (MySQL)
CREATE USER 'dms_user'@'%' IDENTIFIED BY 'strong_password';
GRANT SELECT, RELOAD, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'dms_user'@'%';
FLUSH PRIVILEGES;

PostgreSQL: Create a Logical Replication Slot

DMS CDC from PostgreSQL requires logical replication to be enabled and a replication slot created. The slot retains WAL segments until DMS has consumed them — do not forget to drop it after migration or WAL will accumulate indefinitely and fill your disk.

sql
-- postgresql.conf (or RDS parameter group)
wal_level           = logical
max_replication_slots = 5
max_wal_senders     = 5

-- Create the replication slot (DMS can create this automatically, but explicit is safer)
SELECT pg_create_logical_replication_slot('dms_replication_slot', 'pgoutput');

-- Grant privileges to DMS user
CREATE ROLE dms_user WITH LOGIN REPLICATION PASSWORD 'strong_password';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO dms_user;
GRANT rds_replication TO dms_user;  -- RDS only
Warning

If you create a PostgreSQL logical replication slot and then pause or delete the DMS task without dropping the slot, WAL segments will accumulate until disk is exhausted. Monitor pg_replication_slots and alert on pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) exceeding a safe threshold.


Performance Tuning: LOB Settings, Parallel Load, and Commit Rate

LOB Column Configuration

Large Object (LOB) columns — TEXT, BLOB, CLOB, BYTEA, JSONB at high sizes, XML — are the most common cause of DMS task failures and performance problems. DMS handles LOBs differently from standard columns because their size is not known until read time.

DMS offers three LOB modes:

  • Full LOB Mode: fetches the full LOB regardless of size. Accurate but ~10x slower than Limited LOB Mode for tables with large LOBs.
  • Limited LOB Mode: truncates LOBs to LobMaxSize KB. Fast, but data loss if any LOB exceeds the configured limit.
  • Inline LOB Mode (per-table): applies different LOB settings per table. Available in DMS 3.4+.
json
// DMS Task Settings JSON — LOB configuration
{
  "TargetMetadata": {
    "LobMaxSize": 32768,
    "FullLobMode": false,
    "LimitedSizeLobMode": true,
    "LobChunkSize": 0
  }
}

Before setting LobMaxSize, query your source to find the actual maximum LOB size in each table:

sql
-- MySQL: find max BLOB/TEXT column sizes
SELECT table_name, column_name, MAX(LENGTH(column_name))
FROM information_schema.columns
JOIN your_table ON 1=1   -- replace with actual table scan
WHERE data_type IN ('blob','mediumblob','longblob','text','mediumtext','longtext')
GROUP BY table_name, column_name;

-- PostgreSQL: find max bytea / text column sizes
SELECT max(octet_length(your_lob_column))
FROM your_schema.your_table;

Parallel Load for Full Load Phase

By default, DMS loads tables one at a time. For large migrations, enable parallel full-load to process multiple tables simultaneously and use range-based partitioning to split individual large tables across parallel sub-tasks.

json
// DMS Task Settings JSON — Parallel Full Load
{
  "FullLoadSettings": {
    "MaxFullLoadSubTasks": 8,
    "TransactionConsistencyTimeout": 600,
    "CommitRate": 50000,
    "StopTaskCachedChangesApplied": false,
    "StopTaskCachedChangesNotApplied": false
  }
}

MaxFullLoadSubTasks controls how many tables load in parallel. The default is 8, but increasing to 16 or 32 on a large replication instance (dms.r6i.4xlarge) with a high-throughput target (Aurora with writer auto-scaling) reduces full-load time significantly. Do not set this above what the target can sustain — watch for lock contention and write queue depth on Aurora.

CommitRate controls how many rows DMS commits in a single transaction to the target. The default of 10,000 is conservative. For Aurora PostgreSQL targets, raising to 50,000–100,000 reduces transaction overhead and improves throughput by 30–60% on bulk loads.

Tip

For very large tables (above 100 million rows), use table segmentation in DMS table mappings to split a single table across multiple parallel sub-tasks based on a numeric partition key. This is the fastest way to load a partitioned MySQL or PostgreSQL table without changing application code.


Common Failure Modes and How to Prevent Them

LOB Truncation Without Warning

If LobMaxSize is set lower than the actual maximum LOB in a row, DMS silently truncates the data. The task does not error; the row inserts successfully with a shorter value. You discover this in production weeks later when an application tries to read a truncated document.

Prevention: Always run the max LOB size query above before setting LobMaxSize. Add 20% headroom. After full load, spot-check octet_length(column) distributions between source and target.

Unsupported DDL During CDC

DMS CDC does not support all DDL statements. ALTER TABLE ADD COLUMN, TRUNCATE, and RENAME TABLE during CDC will either error the task, corrupt the target, or silently drop the change depending on the source engine and DMS version. TRUNCATE in particular is not captured by MySQL binlog in ROW format — the target table will be out of sync after a truncate on the source.

Prevention: Freeze all DDL on the source for the duration of the migration. If schema changes are unavoidable, apply them to the target first, then the source, and restart CDC from a fresh position.

Transaction Rollback Errors

DMS maintains a cache of in-flight transactions. If a transaction on the source remains open for longer than TransactionConsistencyTimeout (default: 600 seconds), DMS abandons it and moves forward. This is correct behavior for most workloads, but long-running batch jobs on the source that hold open transactions can cause DMS to skip rows or log errors.

Prevention: Identify and pause long-running batch jobs on the source during the full load + early CDC phase. Monitor the CDCLatencySource CloudWatch metric — a sustained spike indicates DMS is waiting on a long transaction.

Warning

DMS error handling defaults to logging and continuing past row-level errors. This means a failed row insert on the target does not stop the task — it logs the error and moves to the next row. Always review the DMS task log in CloudWatch after full load completes and treat any ERROR entries as blocking before proceeding to cutover.


Monitoring: CloudWatch Metrics and Replication Lag

DMS publishes metrics to CloudWatch under the AWS/DMS namespace. These are the metrics that matter:

Metric What It Measures Action Threshold
CDCLatencySource Lag (seconds) between source commit and DMS read Alert above 60s; investigate above 300s
CDCLatencyTarget Lag (seconds) between DMS read and target apply Alert above 60s; target bottleneck if rising independently
CDCIncomingChanges Change events queued in DMS buffer Should trend to zero as CDC stabilizes
FreeableMemory Available RAM on the replication instance Upgrade instance class if below 500 MB sustained
CPUUtilization CPU on the replication instance Investigate above 80%; upgrade if sustained above 90%
NetworkTransmitThroughput Bytes/sec from DMS to target Compare to target write throughput to identify bottleneck

Set up a CloudWatch alarm on CDCLatencyTarget above 300 seconds. If CDC lag on the target is growing, the replication instance or the target write capacity is the bottleneck — not the source. On Aurora, check WriteThroughput and WriteLatency on the cluster to confirm.


Cutover Strategy: Zero-Downtime in Four Steps

The cutover is the highest-risk moment of the migration. Executing it with a clear, pre-rehearsed sequence prevents data loss and minimizes the window where the application is unreachable.

Step 1: Stop Writes on the Source

Put the application into read-only mode or take it offline. This can mean: setting a feature flag, pointing the load balancer to a maintenance page, revoking write privileges on the source schema for the application user, or disabling the write path in your service configuration. The source must stop accepting new transactions.

sql
-- Option: revoke writes on source to freeze the database
-- PostgreSQL
REVOKE INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public FROM app_user;

-- MySQL
REVOKE INSERT, UPDATE, DELETE ON mydb.* FROM 'app_user'@'%';
FLUSH PRIVILEGES;

Step 2: Wait for CDC Lag to Reach Zero

Monitor CDCLatencyTarget in CloudWatch. When it reaches zero and stays there for 30–60 seconds, the target is fully caught up. Do not rush this step. A premature cutover with non-zero lag means rows written to the source after your last committed CDC event are lost.

bash
# Poll CDCLatencyTarget from CLI until it hits zero
aws cloudwatch get-metric-statistics \
  --namespace AWS/DMS \
  --metric-name CDCLatencyTarget \
  --dimensions Name=ReplicationInstanceIdentifier,Value=my-replication-instance \
  --start-time $(date -u -d '5 minutes ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Average \
  --output table

Step 3: Validate Row Counts and Sample Data

Before switching DNS, run a quick validation. Row counts alone are insufficient — also spot-check a sample of high-value rows using a checksum or hash comparison.

sql
-- Compare row counts (run on both source and target)
SELECT 'orders' AS tbl, COUNT(*) FROM orders
UNION ALL
SELECT 'order_items', COUNT(*) FROM order_items
UNION ALL
SELECT 'users', COUNT(*) FROM users;

-- Spot-check: compare a recent order on source vs target
-- Source (MySQL)
SELECT id, total, status, updated_at FROM orders WHERE id = 99999999;

-- Target (Aurora PostgreSQL)
SELECT id, total, status, updated_at FROM orders WHERE id = 99999999;

Step 4: Switch DNS and Verify Application

Update your database connection string — either via DNS TTL flip, Secrets Manager secret rotation, or environment variable push + rolling deploy — to point to the target. Verify application health checks are green. Restore write privileges on the target.

sql
-- Restore write privileges on target after cutover
-- Aurora PostgreSQL
GRANT INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO app_user;
Tip

Keep the DMS task running in CDC mode for at least 24 hours after cutover. If a critical issue is discovered on the target, you can reverse-replicate back to the source by creating a new DMS task in the opposite direction — but only if the original task is still active and the replication slot on the source has not been dropped.


Cost Optimization: Right-Sizing Replication Instances

DMS replication instances are billed by the hour regardless of whether a task is running. A common mistake is provisioning a large instance for the full-load phase and leaving it running at full cost during idle periods or slow CDC phases.

  • Run the full-load phase on a larger instance (dms.r6i.2xlarge or 4xlarge) to minimize wall-clock time. A full load that takes 4 hours on a 4xlarge versus 14 hours on an xlarge costs more per hour but less in total.
  • Downsize to a smaller instance for the CDC phase. Once full load is complete and CDC lag is stable, CDC typically uses a fraction of the compute. Modify the replication instance to dms.r6i.xlarge during the CDC stabilization window. Modifying a replication instance causes a brief CDC pause (under 30 seconds) — not a task restart.
  • Delete the replication instance within 24 hours of cutover confirmation. There is no reason to keep a DMS instance running after the migration is validated. The replication slot on the source can be dropped and the instance deleted.
  • Use reserved pricing for migrations planned months in advance. A 1-year reserved dms.r6i.xlarge is approximately 40% cheaper than on-demand, and migrations often run for 2–6 weeks including the CDC stabilization period.

Key Takeaways
  • Use Full Load + CDC for zero-downtime migrations — seed the target with full load, let CDC close the gap, then cut over when lag reaches zero.
  • Run AWS SCT before creating any DMS task to surface unsupported objects (stored procedures, triggers, sequences) that must be manually converted.
  • Set binlog ROW format (MySQL) or logical replication with a named slot (PostgreSQL) on the source before the DMS task starts — CDC will not function without these prerequisites.
  • Query actual maximum LOB sizes in your source tables before configuring LobMaxSize; silent truncation is the most common cause of undetected data corruption in DMS migrations.
  • Monitor CDCLatencyTarget as the primary cutover readiness signal — only proceed when it has been at zero for at least 60 consecutive seconds.
  • Right-size the replication instance: use a larger class for the full-load phase, downsize for the CDC phase, and delete within 24 hours of a confirmed cutover.
  • Keep the DMS task active for 24 hours post-cutover to preserve the option of reverse replication if rollback becomes necessary.

Working with JusDB on Database Migrations

AWS DMS gets you 80% of the way there. The remaining 20% — SCT object conversion, LOB audit, CDC validation, cutover timing, and post-migration performance tuning on the target — is where migrations fail in production. JusDB's DBA team has executed zero-downtime migrations across MySQL, PostgreSQL, Oracle, and SQL Server using DMS, and we have the runbooks, validation scripts, and on-call coverage to make sure your cutover lands cleanly. Learn more at jusdb.com.

Explore JusDB Migration Services →  |  Talk to a DBA

Share this article