Apache Spark JDBC Databases

Running a Spark job that reads from a production database through a single JDBC connection is one of the most reliable ways to bring down your application at 2 AM. A single-threaded read pulls every row through one socket, pins the database, and still takes twenty minutes to load a table you could have partitioned across fifty executors in under two. The good news is that Spark's JDBC connector is remarkably capable once you understand its partition model — but the defaults are quietly dangerous, and the documentation glosses over several production-critical details. This post covers the full lifecycle: reading, parallel partitioning, writing, performance tuning, and the pitfalls that catch engineers off guard in their first week running Spark against a relational database.

TL;DR

Use spark.read.jdbc() with partitionColumn, lowerBound, upperBound, and numPartitions to enable parallel reads — without these, Spark issues a single-threaded query.
Push filters down to the database with the query parameter instead of dbtable when you need to limit rows before Spark touches them.
Choose your write mode carefully: overwrite drops and recreates the table, append inserts, ignore is a no-op if the table exists, and error (the default) throws if the table already exists.
Set batchsize above the default of 1000 for bulk writes and tune isolationLevel for consistency requirements.
Never point Spark at your primary database — use a read replica.

Spark and JDBC: How They Fit Together

Apache Spark communicates with relational databases through JDBC — the same Java Database Connectivity API that application servers have used for decades. From Spark's perspective, a JDBC source is just another data source: you give it a URL, credentials, and a table name, and it returns a DataFrame. The difference from an application query is scale. A typical Spark job might spin up dozens of executor JVMs, each holding an open JDBC connection and reading a chunk of data in parallel. Your database sees this as a connection storm, not a single query.

The JDBC connector ships with Spark's core distribution and works with any database that provides a JDBC driver. You add the driver JAR to your Spark session and the rest is configuration. Common JDBC URL formats look like this:

text

# MySQL
jdbc:mysql://hostname:3306/database_name?useSSL=true&requireSSL=true

# PostgreSQL
jdbc:postgresql://hostname:5432/database_name?sslmode=require

# Amazon Aurora (MySQL-compatible)
jdbc:mysql://cluster.cluster-xxxx.us-east-1.rds.amazonaws.com:3306/mydb

The JDBC driver JAR needs to be available to every executor. When running on a cluster, use --jars or package it into your deployment artifact. On AWS Glue, managed JDBC connectors are built in — more on that later.

Reading from Databases with spark.read.jdbc()

The simplest read looks like this:

python

# Python
jdbc_url = "jdbc:postgresql://db-host:5432/analytics"

properties = {
    "user": "spark_reader",
    "password": "secret",
    "driver": "org.postgresql.Driver"
}

df = spark.read.jdbc(
    url=jdbc_url,
    table="orders",
    properties=properties
)

df.printSchema()
df.show(5)

scala

// Scala
val jdbcUrl = "jdbc:postgresql://db-host:5432/analytics"

val properties = new java.util.Properties()
properties.setProperty("user", "spark_reader")
properties.setProperty("password", "secret")
properties.setProperty("driver", "org.postgresql.Driver")

val df = spark.read.jdbc(
  url = jdbcUrl,
  table = "orders",
  properties = properties
)

df.printSchema()
df.show(5)

This works — and it uses exactly one JDBC connection. Every row in orders flows through a single executor thread. For tables with millions of rows, this is a bottleneck. For tables with hundreds of millions, it is effectively unusable.

Partitioned Reads: Parallelism That Actually Scales

Spark's JDBC connector supports parallel reads through four parameters that divide the table into numeric ranges, each read by a separate executor task:

partitionColumn: a numeric or date column to split on (must be indexable for good performance)
lowerBound: the minimum value of that column in the range to read
upperBound: the maximum value
numPartitions: how many parallel tasks (and therefore JDBC connections) to create

python

# Python — parallel read across order_id ranges
df = spark.read.jdbc(
    url=jdbc_url,
    table="orders",
    column="order_id",
    lowerBound=1,
    upperBound=50_000_000,
    numPartitions=50,
    properties=properties
)

scala

// Scala — same parallel read
val df = spark.read.jdbc(
  url = jdbcUrl,
  table = "orders",
  columnName = "order_id",
  lowerBound = 1L,
  upperBound = 50_000_000L,
  numPartitions = 50,
  properties = properties
)

Spark generates 50 queries like WHERE order_id >= 1 AND order_id < 1000001, WHERE order_id >= 1000001 AND order_id < 2000001, and so on — one per executor task, all running simultaneously. The database sees 50 concurrent connections reading different row ranges.

Warning

numPartitions directly controls how many simultaneous JDBC connections Spark opens. Setting it to 200 on a PostgreSQL instance with max_connections = 100 will cause connection exhaustion and kill other application traffic. Size this parameter relative to your database's connection limit, not your Spark cluster's parallelism.

For non-uniform data (e.g., order IDs that are sparse or clustered), the range-based partitioning will produce skewed partitions. In these cases, compute buckets manually or use a hash-based surrogate:

python

# Use a subquery to create an artificial uniform partition key
query = "(SELECT *, MOD(order_id, 50) AS partition_key FROM orders) AS subq"

df = spark.read.jdbc(
    url=jdbc_url,
    table=query,
    column="partition_key",
    lowerBound=0,
    upperBound=50,
    numPartitions=50,
    properties=properties
)

Predicate Pushdown: Filter at the Database, Not in Spark

Spark can push WHERE clause predicates down to the database, so the database filters rows before sending them over the network. This happens automatically for simple filters when you use dbtable. However, you can be explicit by passing a subquery as the table parameter or using the query option:

python

# Using query parameter for explicit pushdown (Spark 3.0+)
df = spark.read \
    .format("jdbc") \
    .option("url", jdbc_url) \
    .option("query", "SELECT * FROM orders WHERE status = 'COMPLETED' AND created_at >= '2024-01-01'") \
    .option("user", "spark_reader") \
    .option("password", "secret") \
    .option("driver", "org.postgresql.Driver") \
    .load()

Warning

You cannot use both query and partitionColumn together. If you need predicate pushdown AND parallel reads, wrap the filtered query as a subquery in the dbtable parameter: dbtable="(SELECT * FROM orders WHERE status='COMPLETED') AS filtered_orders". Then apply partitionColumn on top of that subquery.

Writing to Databases with spark.write.jdbc()

Writing from Spark to a database reverses the pattern. Each executor holds an open JDBC connection and inserts rows in batches. The write API is straightforward:

python

# Python — write a DataFrame to a table
result_df.write.jdbc(
    url=jdbc_url,
    table="order_aggregates",
    mode="append",
    properties={
        "user": "spark_writer",
        "password": "secret",
        "driver": "org.postgresql.Driver",
        "batchsize": "10000",
        "isolationLevel": "READ_COMMITTED"
    }
)

scala

// Scala — write with options
resultDf.write
  .mode("append")
  .option("batchsize", "10000")
  .option("isolationLevel", "READ_COMMITTED")
  .jdbc(
    url = jdbcUrl,
    table = "order_aggregates",
    connectionProperties = properties
  )

Write Modes

Spark's four write modes control what happens when the target table already exists:

overwrite: drops the existing table, recreates it from the DataFrame schema, and inserts. This drops indexes, constraints, and foreign keys — almost never what you want in production.
append: inserts rows into the existing table without modifying the schema or existing data. The most common mode for ETL pipelines.
ignore: if the table already exists, do nothing (no error, no write). Useful for idempotent pipelines where the table is pre-populated.
error (default): throws an AnalysisException if the table exists. Safe for one-time migrations.

Warning

mode("overwrite") with JDBC drops and recreates the entire table — including all indexes, primary keys, and foreign key constraints. If you need to replace data without losing schema metadata, truncate the table externally first and then use mode("append").

batchsize and isolationLevel

Two properties have an outsized impact on write performance and consistency:

batchsize: controls how many rows Spark buffers before issuing a single JDBC executeBatch() call. The default is 1000 rows per batch, which is far too small for bulk loads. For most databases, values between 5000 and 50000 perform best — test against your specific database and row width.
isolationLevel: sets the JDBC transaction isolation level. Options are NONE, READ_UNCOMMITTED, READ_COMMITTED, REPEATABLE_READ, and SERIALIZABLE. The default is READ_UNCOMMITTED. For production writes where partial visibility matters, set this to READ_COMMITTED.

Tip

If you are writing large volumes and the database supports it, disable auto-commit and use explicit transactions by setting isolationLevel to READ_COMMITTED and keeping batchsize high. This reduces round-trips and gives you cleaner failure semantics — if an executor task fails and retries, the entire batch for that partition is retried rather than leaving a partial insert.

Performance Tuning

Beyond partitioning and batch sizes, several other levers matter for production throughput:

fetchsize: controls how many rows the JDBC driver fetches per round-trip during reads. Many drivers default to fetching one row at a time (or a very small number), which creates excessive network round-trips for large tables. Set this to 10000 or higher:

python

properties["fetchsize"] = "10000"

numPartitions for writes: the number of partitions in your DataFrame before writing controls write parallelism. If you have 200 DataFrame partitions, Spark opens 200 simultaneous JDBC connections for writes. Repartition down before writing to match your database's capacity:

python

result_df.repartition(10).write.jdbc(
    url=jdbc_url,
    table="order_aggregates",
    mode="append",
    properties=properties
)

Use a read replica: pointing Spark at your primary database is a risk every time. A Spark job doing a full table scan holds read locks (or at minimum saturates I/O) for its entire duration. Use a read replica or a standby — it is the single most important operational change you can make for JDBC-based Spark pipelines.

AWS Glue for Managed Spark and JDBC

AWS Glue provides a managed Spark environment with built-in JDBC connectivity. Glue Connections abstract away the JDBC URL, credentials (via Secrets Manager), and VPC networking so your job code never contains a plaintext password:

python

# AWS Glue — using a Glue Connection
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glue_ctx = GlueContext(sc)

dynamic_frame = glue_ctx.create_dynamic_frame.from_catalog(
    database="my_glue_database",
    table_name="orders",
    additional_options={
        "hashfield": "order_id",
        "hashpartitions": "50",
        "jobbookmarkenabled": "true"
    }
)

df = dynamic_frame.toDF()

Glue's hashfield and hashpartitions options are the Glue equivalents of partitionColumn and numPartitions. The job bookmark feature tracks which rows have already been processed, enabling incremental loads without manual watermark management.

Pitfalls to Avoid

Connection pool exhaustion: Spark does not use a connection pool — each task opens its own JDBC connection for the duration of the task. With 200 partitions, that is 200 simultaneous connections. Database connection limits are a hard ceiling, not a suggestion.
Schema type mismatch on write: Spark infers column types from the DataFrame schema and maps them to SQL types. The mapping is not always what you expect — a Spark LongType becomes BIGINT, a DecimalType(38, 18) might overflow a NUMERIC(10, 2) column. Define the createTableColumnTypes property or write to a pre-existing table.
Null partition column values are lost: rows where partitionColumn is NULL are silently dropped during partitioned reads. If your partition column is nullable, add a filter or use a derived non-null column.
overwrite mode deletes your indexes: covered above, but worth repeating — mode("overwrite") calls DROP TABLE and CREATE TABLE. Your carefully crafted composite indexes are gone.
Network latency kills small-batch writes: if your Spark cluster and database are in different availability zones (or worse, different regions), each JDBC round-trip pays network latency. Keep batchsize large enough that you are amortizing that latency across thousands of rows per round-trip.

Warning

Rows where the partitionColumn is NULL are silently excluded from partitioned reads — Spark does not read them and does not warn you. Always verify row counts after a partitioned JDBC read if your partition column has any nullable values.

Key Takeaways

Key Takeaways

Always specify partitionColumn, lowerBound, upperBound, and numPartitions for reads on large tables — the default single-connection read does not scale.
Size numPartitions relative to your database's max_connections, not your Spark cluster size — each partition opens one JDBC connection.
Use a subquery in dbtable to combine predicate pushdown with parallel partitioned reads.
Avoid mode("overwrite") in production; prefer external truncation followed by mode("append") to preserve indexes and constraints.
Increase batchsize significantly above the default 1000 for bulk writes and set isolationLevel to READ_COMMITTED for consistency.
Set fetchsize to at least 10000 to reduce round-trips during reads.
Always read from a read replica, never the primary, to protect application traffic.
On AWS, Glue Connections handle credential management and VPC networking, and job bookmarks automate incremental load tracking.
NULL values in the partition column are silently dropped — always validate row counts after partitioned reads.

Run Your Spark JDBC Pipelines Against a Managed Database

The patterns in this post assume you have a reliable, scalable relational database on the other end of the JDBC connection. JusDB provides fully managed relational databases built for the throughput and connection patterns that Spark pipelines produce — read replicas provisioned automatically, connection limits configurable to your cluster size, and monitoring that surfaces per-query impact before it reaches production. Whether you are running nightly aggregations into PostgreSQL or streaming Spark Structured Streaming micro-batches into MySQL, JusDB handles the infrastructure so your data engineering work stays focused on the pipeline logic.

Get started with JusDB and connect your first Spark job in minutes.

Apache Spark and JDBC: Reading and Writing Databases at Scale

Spark and JDBC: How They Fit Together

Reading from Databases with spark.read.jdbc()

Partitioned Reads: Parallelism That Actually Scales

Predicate Pushdown: Filter at the Database, Not in Spark

Writing to Databases with spark.write.jdbc()

Write Modes

batchsize and isolationLevel

Performance Tuning

AWS Glue for Managed Spark and JDBC

Pitfalls to Avoid

Key Takeaways

Run Your Spark JDBC Pipelines Against a Managed Database

Share this article

Need Expert Help?

Database Performance Optimization

Related Articles

Open Source Databases (2026): PostgreSQL, MySQL, ClickHouse, Cassandra & Beyond

Liquibase vs Flyway: Which Database Migration Tool to Choose?

Apache Airflow for Database Workflows: Scheduling and Orchestration