Database Architecture

Partitioning and Sharding Techniques for Database Scalability

Master horizontal and vertical partitioning, plus database sharding strategies for handling billions of records. Learn partition pruning, shard key selection, and cross-shard query optimization.

JusDB Team
April 12, 2022
10 min read
1210 views

A fintech company came to us after their largest table hit 800 million rows and queries that used to take 200ms were averaging 4 seconds — even with the right indexes. The table had no partitioning. Adding range partitioning on the created_at column (monthly partitions) brought query time back to 180ms by allowing partition pruning: instead of scanning 800M rows, queries scanned the relevant month's partition of ~5M rows. The fix took a weekend; the decision to partition should have been made at table creation.

This guide covers database partitioning and sharding — when each is appropriate, how to implement them, and the operational trade-offs that matter at scale.

Core Concept: Partitioning and sharding are fundamental to building truly scalable database systems that can handle massive data volumes and high transaction rates by distributing the load across multiple storage units or servers.

Table Partitioning: Dividing Within a Database

Table partitioning involves dividing a single logical table into multiple physical storage units according to defined rules. The database engine handles this division transparently, so applications can interact with a partitioned table as if it were a single entity.

Horizontal Partitioning (Row-Based)

Horizontal partitioning divides a table by rows, with each partition containing a subset of rows based on a partition key.

1. Range Partitioning

Rows are distributed based on a range of values in the partition key, such as date ranges:

text
-- Example: Partitioning orders by date ranges
CREATE TABLE orders (
    order_id BIGINT PRIMARY KEY,
    customer_id BIGINT NOT NULL,
    order_date DATE NOT NULL,
    total_amount DECIMAL(10,2),
    status VARCHAR(20)
) PARTITION BY RANGE (order_date);

-- Create quarterly partitions
CREATE TABLE orders_2024_q1 PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');

CREATE TABLE orders_2024_q2 PARTITION OF orders
    FOR VALUES FROM ('2024-04-01') TO ('2024-07-01');

CREATE TABLE orders_2024_q3 PARTITION OF orders
    FOR VALUES FROM ('2024-07-01') TO ('2024-10-01');

CREATE TABLE orders_2024_q4 PARTITION OF orders
    FOR VALUES FROM ('2024-10-01') TO ('2025-01-01');
Range Partitioning Benefits ✅ Range Partitioning Challenges ❌
Excellent for time-series data Potential uneven distribution
Efficient date-based queries Hot partitions with recent data
Easy archiving of old data Requires predictable range patterns
Partition pruning optimization Manual partition management

2. List Partitioning

Rows are distributed based on discrete values in the partition key:

text
-- Example: Partitioning customers by geographic region
CREATE TABLE customers (
    customer_id BIGINT PRIMARY KEY,
    name VARCHAR(100),
    country_code CHAR(2),
    email VARCHAR(255)
) PARTITION BY LIST (country_code);

-- Create regional partitions
CREATE TABLE customers_americas PARTITION OF customers
    FOR VALUES IN ('US', 'CA', 'MX', 'BR', 'AR');

CREATE TABLE customers_europe PARTITION OF customers
    FOR VALUES IN ('UK', 'FR', 'DE', 'IT', 'ES', 'NL');

CREATE TABLE customers_asia PARTITION OF customers
    FOR VALUES IN ('CN', 'JP', 'IN', 'SG', 'KR');

CREATE TABLE customers_other PARTITION OF customers
    DEFAULT;

3. Hash Partitioning

Rows are distributed evenly across partitions using a hash function:

text
-- Example: Even distribution of user activities
CREATE TABLE user_activities (
    activity_id BIGINT PRIMARY KEY,
    user_id BIGINT NOT NULL,
    activity_type VARCHAR(50),
    activity_date TIMESTAMP,
    data JSONB
) PARTITION BY HASH (user_id);

-- Create 16 hash partitions for even distribution
CREATE TABLE user_activities_p0 PARTITION OF user_activities
    FOR VALUES WITH (modulus 16, remainder 0);

CREATE TABLE user_activities_p1 PARTITION OF user_activities
    FOR VALUES WITH (modulus 16, remainder 1);

-- ... continue for all 16 partitions

Hash Partitioning Best Practice: Use a power of 2 for the number of partitions (4, 8, 16, 32) to ensure even distribution and simplify future partition splitting.

Vertical Partitioning (Column-Based)

Vertical partitioning divides a table by columns rather than rows, splitting logical entities across multiple related tables:

text
-- Original wide table
CREATE TABLE products_monolithic (
    product_id BIGINT PRIMARY KEY,
    name VARCHAR(100),
    description TEXT,
    price DECIMAL(10,2),
    inventory_count INT,
    specifications JSON,
    images JSON,
    reviews_summary JSON
);

-- Vertically partitioned structure
CREATE TABLE products_core (
    product_id BIGINT PRIMARY KEY,
    name VARCHAR(100),
    price DECIMAL(10,2),
    inventory_count INT,
    last_updated TIMESTAMP
);

CREATE TABLE products_content (
    product_id BIGINT PRIMARY KEY,
    description TEXT,
    specifications JSON,
    FOREIGN KEY (product_id) REFERENCES products_core(product_id)
);

CREATE TABLE products_media (
    product_id BIGINT PRIMARY KEY,
    images JSON,
    videos JSON,
    FOREIGN KEY (product_id) REFERENCES products_core(product_id)
);

When to Use Vertical Partitioning:

  • Tables with very wide rows (many columns)
  • Columns accessed with different frequencies
  • Large BLOB/TEXT columns that slow down common queries
  • Different security requirements for different column groups

Database Sharding: Scaling Across Servers

While partitioning divides tables within a single database instance, sharding distributes data across multiple database instances or servers. Sharding is fundamental for horizontal scaling beyond single-machine capacity.

Sharding Architecture

Horizontal sharding by user-ID range Application traffic flows through a shard router that maps each request to one of three physical shards based on the user-ID range. Each shard holds 1 million users plus their related rows. Application Layer stateless · horizontal pool Shard Router / Proxy hash(user_id) → shard ProxySQL · pgcat · Vitess vtgate Shard 1 Users 1 – 1M + related rows orders · sessions · prefs Shard 2 Users 1M – 2M + related rows orders · sessions · prefs Shard 3 Users 2M – 3M + related rows orders · sessions · prefs
Range-based horizontal sharding. Cross-shard joins require fan-out or materialized views; pick the shard key so the dominant access pattern stays single-shard.

Sharding Strategies

1. Key-Based Sharding

Range-based sharding: Similar to range partitioning, but across separate database instances.

text
-- Example: Customer ID ranges
Shard 1: Customer IDs 1-1,000,000
Shard 2: Customer IDs 1,000,001-2,000,000
Shard 3: Customer IDs 2,000,001-3,000,000
  • Pros: Simple to implement, efficient range queries within a shard
  • Cons: Potential for uneven distribution, "hot spots" if certain ranges are accessed more frequently

Hash-based sharding: A hash function determines which shard holds each record.

text
-- Example: Hash-based distribution
function determine_shard(customer_id) {
    return "shard_" + (hash(customer_id) % num_shards);
}
  • Pros: Even distribution of data, minimizes hot spots
  • Cons: Range queries become inefficient as they must query all shards

Directory-based sharding: A lookup service maps keys to shards.

  • Pros: Flexible, allows for resharding without changing application logic
  • Cons: Additional complexity, lookup service becomes a potential bottleneck

2. Entity-Based Sharding

Functional sharding: Dividing by functional area (e.g., user data on one shard, product data on another).

  • Pros: Isolates workloads, allows for specialized optimization of each shard
  • Cons: Complicates queries that span functional areas

Tenant-based sharding: In multi-tenant applications, each tenant's data is placed on a specific shard.

  • Pros: Natural isolation, tenant-specific scaling, simplified tenant onboarding/offboarding
  • Cons: Potential for uneven distribution if tenants vary significantly in size

Shard Key Selection: Critical Design Decision

The shard key determines how data is distributed across shards and has profound implications for performance and scalability:

Shard Key Criteria

  • High cardinality: The shard key should have enough unique values to allow even distribution
  • Low frequency of updates: Changing a shard key value typically requires moving data between shards
  • Query patterns: Most queries should be satisfiable within a single shard by including the shard key

Common Shard Key Choices

Shard Key Type Best Use Case Example
Customer/Tenant ID Multi-tenant systems SaaS applications
Geographic location Location-based services Ride-sharing apps
Time-based identifiers Time-series data IoT sensor data
User ID Social platforms Social media feeds

Managing Cross-Shard Operations

Queries that span multiple shards (cross-shard queries) are typically more expensive and complex:

Strategies to Minimize Cross-Shard Operations

  • Minimize cross-shard joins: Design your schema to keep related data on the same shard
  • Denormalize across shards: Strategic denormalization can reduce the need for cross-shard operations
  • Global tables: Consider replicating small, relatively static lookup tables across all shards
  • Aggregation services: For operations that must span all shards, implement application-level aggregation services
text
-- Example: Keeping related data together
-- Good: All customer data on same shard
Shard 1: customers (1-1000), orders (customer_id 1-1000), payments (customer_id 1-1000)

-- Poor: Related data scattered across shards
Shard 1: customers (1-1000)
Shard 2: orders (all orders)
Shard 3: payments (all payments)

Implementation Approaches

Application-Level Sharding

The application contains the logic to route queries to the appropriate shard.

  • Pros: More control, can work with databases that don't natively support sharding
  • Cons: Increases application complexity, potential for logic inconsistencies
text
// Example application-level sharding logic
function getShardConnection(customerId) {
    const shardId = customerId % NUMBER_OF_SHARDS;
    return connectionPool.getConnection(`shard_${shardId}`);
}

function getCustomerOrders(customerId) {
    const connection = getShardConnection(customerId);
    return connection.query(
        'SELECT * FROM orders WHERE customer_id = ?', 
        [customerId]
    );
}

Middleware Sharding

A separate middleware layer handles shard routing.

  • Pros: Separates sharding logic from application code, consistent implementation
  • Cons: Additional infrastructure component, potential performance overhead

Database-Level Sharding

The database system itself handles sharding.

  • Pros: Simplifies application development, leverages database-specific optimizations
  • Cons: Limited to databases with native sharding support, less flexibility

Sharding in Different Database Systems

Relational Databases

  • PostgreSQL: Built-in declarative partitioning, external tools like Citus for sharding
  • MySQL: Native partitioning, external solutions like Vitess for sharding
  • SQL Server: Partitioning through partition functions, Azure SQL Database elastic pools

NoSQL Databases

  • MongoDB: Built-in sharding with automatic balancing
  • Cassandra: Distributed by design using consistent hashing
  • DynamoDB: Automatic partitioning based on partition key

Resharding Considerations

As data grows, the initial sharding scheme may become suboptimal, necessitating resharding:

Planning for Resharding

  • Design for resharding from the start: Assume you'll need to reshard eventually
  • Implement shard splitting: The ability to split a shard into multiple shards as it grows
  • Consider shard rebalancing: Mechanisms to move data between shards to maintain even distribution
  • Zero-downtime resharding: Techniques like double-writing and background data migration
text
-- Example: Gradual resharding process
-- Step 1: Create new shards
-- Step 2: Start double-writing to old and new shards
-- Step 3: Migrate historical data in background
-- Step 4: Switch reads to new shards
-- Step 5: Stop writing to old shards
-- Step 6: Decommission old shards

Monitoring and Maintenance

Key Metrics to Monitor

  • Shard distribution: Ensure even data distribution across shards
  • Query performance: Monitor performance on each shard
  • Cross-shard operations: Track frequency and performance of cross-shard queries
  • Hot spots: Identify shards receiving disproportionate load
  • Resource utilization: Monitor CPU, memory, and I/O on each shard

Operational Considerations

  • Backup and recovery: Coordinate backups across all shards
  • Schema changes: Apply schema changes consistently across all shards
  • Monitoring and alerting: Centralized monitoring for distributed infrastructure
  • Disaster recovery: Plan for shard-level failures and recovery procedures

Real-World Examples

E-commerce Platform

text
-- Shard by customer ID
Shard 1: Customers 1-1M + their orders, payments, reviews
Shard 2: Customers 1M-2M + their orders, payments, reviews
Shard 3: Customers 2M-3M + their orders, payments, reviews

-- Global tables (replicated across all shards)
- Product catalog
- Category hierarchy
- Shipping zones

Social Media Platform

text
-- Shard by user ID
Shard 1: Users 1-1M + their posts, followers, feeds
Shard 2: Users 1M-2M + their posts, followers, feeds

-- Separate functional shards
- Media shard: All images, videos
- Search shard: Search indexes
- Analytics shard: Usage metrics

Best Practices Summary

  1. Start simple: Begin with partitioning before moving to sharding
  2. Choose the right shard key: Consider cardinality, query patterns, and update frequency
  3. Keep related data together: Minimize cross-shard operations
  4. Plan for growth: Design resharding strategies from the beginning
  5. Monitor continuously: Track distribution, performance, and hot spots
  6. Automate operations: Implement automated backup, monitoring, and maintenance
  7. Test thoroughly: Validate sharding strategies with realistic data and workloads

Conclusion

Partitioning and sharding are powerful techniques for scaling databases beyond the capacity of a single server. However, they introduce complexity and require careful planning:

Key Takeaway: The best partitioning and sharding strategy depends on your specific workload characteristics, query patterns, and scaling requirements. Start with the simplest approach that meets your needs, and evolve your strategy as your application grows and your understanding of its data access patterns matures.

By understanding the trade-offs and designing your schema with partitioning and sharding in mind from the beginning, you can create database systems that scale gracefully as your data and user base grow.


Next in our series: Data Types and Constraints Selection for Scalable Database Schemas

Further Reading

For more in-depth information, check out these authoritative resources:

Explore more database insights from JusDB:

Share this article

JusDB Team

Official JusDB content team

Deeper Reading

Curated companion guides for readers who want to go deeper on this topic.