NoSQL Databases

Cassandra Data Modeling for Time-Series: Partitioning, Clustering, and Tombstone Management

Design Cassandra tables for time-series data that scale to billions of rows. Learn partition key design, clustering columns, TTL tombstone management, and bucket strategies.

JusDB Team
December 9, 2025
10 min read
200 views

Why Cassandra for Time-Series?

Cassandra excels at time-series workloads because its storage model — partitioned, append-oriented SSTables — maps naturally to time-ordered data. But a poorly designed schema can create partition hotspots, huge partitions, or tombstone storms that crush performance.

The Core Data Model

text
-- Anti-pattern: single partition for all data
CREATE TABLE sensor_data_bad (
  sensor_id TEXT,
  ts        TIMESTAMP,
  value     DOUBLE,
  PRIMARY KEY (sensor_id, ts)
) WITH CLUSTERING ORDER BY (ts DESC);
-- Problem: sensor that writes 1000/sec will have a 31M-row partition
-- per month (1000 * 86400 * 31). Cassandra can handle up to ~100MB
-- per partition before performance degrades.
text
-- Better: bucket by time period
CREATE TABLE sensor_data (
  sensor_id TEXT,
  bucket    TEXT,          -- 'sensor_42:2025-11'
  ts        TIMESTAMP,
  value     DOUBLE,
  PRIMARY KEY ((sensor_id, bucket), ts)
) WITH CLUSTERING ORDER BY (ts DESC)
  AND default_time_to_live = 7776000;  -- 90 days

Choosing Bucket Size

text
-- Target: 10-100 MB per partition
-- Each row: ~100 bytes (approximate)
-- Writes per hour: 3600 writes/sensor at 1/sec rate

-- Daily bucket: 3600 * 24 = 86,400 rows * 100B = 8.4 MB  ✓
-- Weekly bucket: 8.4 MB * 7 = 58.8 MB  ✓
-- Monthly bucket: 8.4 MB * 30 = 252 MB  ✗ (too large)

-- For this sensor: daily or weekly buckets are appropriate
-- Bucket key: sensor_id + date_trunc('day', timestamp)

# Application-side bucket calculation (Python)
def get_bucket(sensor_id: str, ts: datetime) -> str:
    day = ts.strftime('%Y-%m-%d')
    return f'{sensor_id}:{day}'

Querying Across Buckets

text
-- Application must query each bucket in the date range
-- For a 7-day range, query 7 buckets:

SELECT ts, value
FROM sensor_data
WHERE sensor_id = 'sensor_42'
  AND bucket = 'sensor_42:2025-11-20'
  AND ts >= '2025-11-20 00:00:00'
  AND ts < '2025-11-21 00:00:00'
ORDER BY ts DESC
LIMIT 1000;

-- Do this for each day in the range and merge in application

Tombstone Storms: The Hidden Killer

Every DELETE in Cassandra creates a tombstone — a special marker saying 'this data was deleted'. Cassandra must scan past tombstones when reading. Tombstones accumulate until compaction removes them.

Warning signs:

  • Read latency increasing over time despite no change in load
  • Cassandra logs: Read X live rows and Y tombstone cells
  • GC pressure from tombstone scanning
text
-- Bad pattern: deleting individual rows (creates many tombstones)
DELETE FROM sensor_data WHERE sensor_id='x' AND bucket='x:2025-10' AND ts='...';

-- Good pattern: use TTL (creates efficient TTL tombstones)
INSERT INTO sensor_data (sensor_id, bucket, ts, value)
VALUES ('sensor_42', 'sensor_42:2025-11-27', toTimestamp(now()), 98.6)
USING TTL 7776000;  -- 90 days in seconds

-- Or set TTL at table level (as shown in CREATE TABLE above)

Compaction Strategy for Time-Series

text
-- TWCS (TimeWindowCompactionStrategy) is ideal for time-series with TTL
CREATE TABLE sensor_data (
  -- columns as above
) WITH compaction = {
  'class': 'TimeWindowCompactionStrategy',
  'compaction_window_unit': 'DAYS',
  'compaction_window_size': 1
};

-- TWCS groups SSTables by time window (1 day here)
-- When a window expires (all data TTL'd), it drops the entire SSTable
-- This avoids reading past tombstones

Monitoring Partition Health

text
-- Check for large partitions (Cassandra 3.x+ with nodetool)
nodetool tablestats keyspace.sensor_data

-- Look for:
-- Maximum live cells per slice: should be < 100,000
-- Maximum tombstones per slice: should be < 1,000

-- In Cassandra 4.x, use virtual tables
SELECT * FROM system_views.sstable_tasks;

Key Takeaways

  • Design partition keys with time buckets to keep partition sizes under 100 MB
  • Use TTL instead of DELETEs to avoid tombstone accumulation
  • Use TWCS compaction strategy for time-series data — it aligns with TTL expiry
  • Application must iterate over buckets for multi-day range queries
  • Monitor max tombstones per slice — over 1,000 signals a problem

Share this article

JusDB Team

Official JusDB content team