Running a Spark job that reads from a production database through a single JDBC connection is one of the most reliable ways to bring down your application at 2 AM. A single-threaded read pulls every row through one socket, pins the database, and still takes twenty minutes to load a table you could have partitioned across fifty executors in under two. The good news is that Spark's JDBC connector is remarkably capable once you understand its partition model — but the defaults are quietly dangerous, and the documentation glosses over several production-critical details. This post covers the full lifecycle: reading, parallel partitioning, writing, performance tuning, and the pitfalls that catch engineers off guard in their first week running Spark against a relational database.
- Use
spark.read.jdbc()withpartitionColumn,lowerBound,upperBound, andnumPartitionsto enable parallel reads — without these, Spark issues a single-threaded query. - Push filters down to the database with the
queryparameter instead ofdbtablewhen you need to limit rows before Spark touches them. - Choose your write mode carefully:
overwritedrops and recreates the table,appendinserts,ignoreis a no-op if the table exists, anderror(the default) throws if the table already exists. - Set
batchsizeabove the default of 1000 for bulk writes and tuneisolationLevelfor consistency requirements. - Never point Spark at your primary database — use a read replica.
Spark and JDBC: How They Fit Together
Apache Spark communicates with relational databases through JDBC — the same Java Database Connectivity API that application servers have used for decades. From Spark's perspective, a JDBC source is just another data source: you give it a URL, credentials, and a table name, and it returns a DataFrame. The difference from an application query is scale. A typical Spark job might spin up dozens of executor JVMs, each holding an open JDBC connection and reading a chunk of data in parallel. Your database sees this as a connection storm, not a single query.
The JDBC connector ships with Spark's core distribution and works with any database that provides a JDBC driver. You add the driver JAR to your Spark session and the rest is configuration. Common JDBC URL formats look like this:
# MySQL
jdbc:mysql://hostname:3306/database_name?useSSL=true&requireSSL=true
# PostgreSQL
jdbc:postgresql://hostname:5432/database_name?sslmode=require
# Amazon Aurora (MySQL-compatible)
jdbc:mysql://cluster.cluster-xxxx.us-east-1.rds.amazonaws.com:3306/mydbThe JDBC driver JAR needs to be available to every executor. When running on a cluster, use --jars or package it into your deployment artifact. On AWS Glue, managed JDBC connectors are built in — more on that later.
Reading from Databases with spark.read.jdbc()
The simplest read looks like this:
# Python
jdbc_url = "jdbc:postgresql://db-host:5432/analytics"
properties = {
"user": "spark_reader",
"password": "secret",
"driver": "org.postgresql.Driver"
}
df = spark.read.jdbc(
url=jdbc_url,
table="orders",
properties=properties
)
df.printSchema()
df.show(5)// Scala
val jdbcUrl = "jdbc:postgresql://db-host:5432/analytics"
val properties = new java.util.Properties()
properties.setProperty("user", "spark_reader")
properties.setProperty("password", "secret")
properties.setProperty("driver", "org.postgresql.Driver")
val df = spark.read.jdbc(
url = jdbcUrl,
table = "orders",
properties = properties
)
df.printSchema()
df.show(5)This works — and it uses exactly one JDBC connection. Every row in orders flows through a single executor thread. For tables with millions of rows, this is a bottleneck. For tables with hundreds of millions, it is effectively unusable.
Partitioned Reads: Parallelism That Actually Scales
Spark's JDBC connector supports parallel reads through four parameters that divide the table into numeric ranges, each read by a separate executor task:
- partitionColumn: a numeric or date column to split on (must be indexable for good performance)
- lowerBound: the minimum value of that column in the range to read
- upperBound: the maximum value
- numPartitions: how many parallel tasks (and therefore JDBC connections) to create
# Python — parallel read across order_id ranges
df = spark.read.jdbc(
url=jdbc_url,
table="orders",
column="order_id",
lowerBound=1,
upperBound=50_000_000,
numPartitions=50,
properties=properties
)// Scala — same parallel read
val df = spark.read.jdbc(
url = jdbcUrl,
table = "orders",
columnName = "order_id",
lowerBound = 1L,
upperBound = 50_000_000L,
numPartitions = 50,
properties = properties
)Spark generates 50 queries like WHERE order_id >= 1 AND order_id < 1000001, WHERE order_id >= 1000001 AND order_id < 2000001, and so on — one per executor task, all running simultaneously. The database sees 50 concurrent connections reading different row ranges.
numPartitions directly controls how many simultaneous JDBC connections Spark opens. Setting it to 200 on a PostgreSQL instance with max_connections = 100 will cause connection exhaustion and kill other application traffic. Size this parameter relative to your database's connection limit, not your Spark cluster's parallelism.
For non-uniform data (e.g., order IDs that are sparse or clustered), the range-based partitioning will produce skewed partitions. In these cases, compute buckets manually or use a hash-based surrogate:
# Use a subquery to create an artificial uniform partition key
query = "(SELECT *, MOD(order_id, 50) AS partition_key FROM orders) AS subq"
df = spark.read.jdbc(
url=jdbc_url,
table=query,
column="partition_key",
lowerBound=0,
upperBound=50,
numPartitions=50,
properties=properties
)Predicate Pushdown: Filter at the Database, Not in Spark
Spark can push WHERE clause predicates down to the database, so the database filters rows before sending them over the network. This happens automatically for simple filters when you use dbtable. However, you can be explicit by passing a subquery as the table parameter or using the query option:
# Using query parameter for explicit pushdown (Spark 3.0+)
df = spark.read \
.format("jdbc") \
.option("url", jdbc_url) \
.option("query", "SELECT * FROM orders WHERE status = 'COMPLETED' AND created_at >= '2024-01-01'") \
.option("user", "spark_reader") \
.option("password", "secret") \
.option("driver", "org.postgresql.Driver") \
.load()You cannot use both query and partitionColumn together. If you need predicate pushdown AND parallel reads, wrap the filtered query as a subquery in the dbtable parameter: dbtable="(SELECT * FROM orders WHERE status='COMPLETED') AS filtered_orders". Then apply partitionColumn on top of that subquery.
Writing to Databases with spark.write.jdbc()
Writing from Spark to a database reverses the pattern. Each executor holds an open JDBC connection and inserts rows in batches. The write API is straightforward:
# Python — write a DataFrame to a table
result_df.write.jdbc(
url=jdbc_url,
table="order_aggregates",
mode="append",
properties={
"user": "spark_writer",
"password": "secret",
"driver": "org.postgresql.Driver",
"batchsize": "10000",
"isolationLevel": "READ_COMMITTED"
}
)// Scala — write with options
resultDf.write
.mode("append")
.option("batchsize", "10000")
.option("isolationLevel", "READ_COMMITTED")
.jdbc(
url = jdbcUrl,
table = "order_aggregates",
connectionProperties = properties
)Write Modes
Spark's four write modes control what happens when the target table already exists:
- overwrite: drops the existing table, recreates it from the DataFrame schema, and inserts. This drops indexes, constraints, and foreign keys — almost never what you want in production.
- append: inserts rows into the existing table without modifying the schema or existing data. The most common mode for ETL pipelines.
- ignore: if the table already exists, do nothing (no error, no write). Useful for idempotent pipelines where the table is pre-populated.
- error (default): throws an
AnalysisExceptionif the table exists. Safe for one-time migrations.
mode("overwrite") with JDBC drops and recreates the entire table — including all indexes, primary keys, and foreign key constraints. If you need to replace data without losing schema metadata, truncate the table externally first and then use mode("append").
batchsize and isolationLevel
Two properties have an outsized impact on write performance and consistency:
- batchsize: controls how many rows Spark buffers before issuing a single JDBC
executeBatch()call. The default is 1000 rows per batch, which is far too small for bulk loads. For most databases, values between 5000 and 50000 perform best — test against your specific database and row width. - isolationLevel: sets the JDBC transaction isolation level. Options are
NONE,READ_UNCOMMITTED,READ_COMMITTED,REPEATABLE_READ, andSERIALIZABLE. The default isREAD_UNCOMMITTED. For production writes where partial visibility matters, set this toREAD_COMMITTED.
If you are writing large volumes and the database supports it, disable auto-commit and use explicit transactions by setting isolationLevel to READ_COMMITTED and keeping batchsize high. This reduces round-trips and gives you cleaner failure semantics — if an executor task fails and retries, the entire batch for that partition is retried rather than leaving a partial insert.
Performance Tuning
Beyond partitioning and batch sizes, several other levers matter for production throughput:
fetchsize: controls how many rows the JDBC driver fetches per round-trip during reads. Many drivers default to fetching one row at a time (or a very small number), which creates excessive network round-trips for large tables. Set this to 10000 or higher:
properties["fetchsize"] = "10000"numPartitions for writes: the number of partitions in your DataFrame before writing controls write parallelism. If you have 200 DataFrame partitions, Spark opens 200 simultaneous JDBC connections for writes. Repartition down before writing to match your database's capacity:
result_df.repartition(10).write.jdbc(
url=jdbc_url,
table="order_aggregates",
mode="append",
properties=properties
)Use a read replica: pointing Spark at your primary database is a risk every time. A Spark job doing a full table scan holds read locks (or at minimum saturates I/O) for its entire duration. Use a read replica or a standby — it is the single most important operational change you can make for JDBC-based Spark pipelines.
AWS Glue for Managed Spark and JDBC
AWS Glue provides a managed Spark environment with built-in JDBC connectivity. Glue Connections abstract away the JDBC URL, credentials (via Secrets Manager), and VPC networking so your job code never contains a plaintext password:
# AWS Glue — using a Glue Connection
from awsglue.context import GlueContext
from pyspark.context import SparkContext
sc = SparkContext()
glue_ctx = GlueContext(sc)
dynamic_frame = glue_ctx.create_dynamic_frame.from_catalog(
database="my_glue_database",
table_name="orders",
additional_options={
"hashfield": "order_id",
"hashpartitions": "50",
"jobbookmarkenabled": "true"
}
)
df = dynamic_frame.toDF()Glue's hashfield and hashpartitions options are the Glue equivalents of partitionColumn and numPartitions. The job bookmark feature tracks which rows have already been processed, enabling incremental loads without manual watermark management.
Pitfalls to Avoid
- Connection pool exhaustion: Spark does not use a connection pool — each task opens its own JDBC connection for the duration of the task. With 200 partitions, that is 200 simultaneous connections. Database connection limits are a hard ceiling, not a suggestion.
- Schema type mismatch on write: Spark infers column types from the DataFrame schema and maps them to SQL types. The mapping is not always what you expect — a Spark
LongTypebecomesBIGINT, aDecimalType(38, 18)might overflow aNUMERIC(10, 2)column. Define thecreateTableColumnTypesproperty or write to a pre-existing table. - Null partition column values are lost: rows where
partitionColumnis NULL are silently dropped during partitioned reads. If your partition column is nullable, add a filter or use a derived non-null column. - overwrite mode deletes your indexes: covered above, but worth repeating —
mode("overwrite")callsDROP TABLEandCREATE TABLE. Your carefully crafted composite indexes are gone. - Network latency kills small-batch writes: if your Spark cluster and database are in different availability zones (or worse, different regions), each JDBC round-trip pays network latency. Keep
batchsizelarge enough that you are amortizing that latency across thousands of rows per round-trip.
Rows where the partitionColumn is NULL are silently excluded from partitioned reads — Spark does not read them and does not warn you. Always verify row counts after a partitioned JDBC read if your partition column has any nullable values.
Key Takeaways
- Always specify
partitionColumn,lowerBound,upperBound, andnumPartitionsfor reads on large tables — the default single-connection read does not scale. - Size
numPartitionsrelative to your database'smax_connections, not your Spark cluster size — each partition opens one JDBC connection. - Use a subquery in
dbtableto combine predicate pushdown with parallel partitioned reads. - Avoid
mode("overwrite")in production; prefer external truncation followed bymode("append")to preserve indexes and constraints. - Increase
batchsizesignificantly above the default 1000 for bulk writes and setisolationLeveltoREAD_COMMITTEDfor consistency. - Set
fetchsizeto at least 10000 to reduce round-trips during reads. - Always read from a read replica, never the primary, to protect application traffic.
- On AWS, Glue Connections handle credential management and VPC networking, and job bookmarks automate incremental load tracking.
- NULL values in the partition column are silently dropped — always validate row counts after partitioned reads.
Run Your Spark JDBC Pipelines Against a Managed Database
The patterns in this post assume you have a reliable, scalable relational database on the other end of the JDBC connection. JusDB provides fully managed relational databases built for the throughput and connection patterns that Spark pipelines produce — read replicas provisioned automatically, connection limits configurable to your cluster size, and monitoring that surfaces per-query impact before it reaches production. Whether you are running nightly aggregations into PostgreSQL or streaming Spark Structured Streaming micro-batches into MySQL, JusDB handles the infrastructure so your data engineering work stays focused on the pipeline logic.
Get started with JusDB and connect your first Spark job in minutes.