Cassandra Time-Series Data Modeling: Partitioning, Clustering & Tombstones

Why Cassandra for Time-Series?

Cassandra excels at time-series workloads because its storage model — partitioned, append-oriented SSTables — maps naturally to time-ordered data. But a poorly designed schema can create partition hotspots, huge partitions, or tombstone storms that crush performance.

The Core Data Model

text

-- Anti-pattern: single partition for all data
CREATE TABLE sensor_data_bad (
  sensor_id TEXT,
  ts        TIMESTAMP,
  value     DOUBLE,
  PRIMARY KEY (sensor_id, ts)
) WITH CLUSTERING ORDER BY (ts DESC);
-- Problem: sensor that writes 1000/sec will have a 31M-row partition
-- per month (1000 * 86400 * 31). Cassandra can handle up to ~100MB
-- per partition before performance degrades.

text

-- Better: bucket by time period
CREATE TABLE sensor_data (
  sensor_id TEXT,
  bucket    TEXT,          -- 'sensor_42:2025-11'
  ts        TIMESTAMP,
  value     DOUBLE,
  PRIMARY KEY ((sensor_id, bucket), ts)
) WITH CLUSTERING ORDER BY (ts DESC)
  AND default_time_to_live = 7776000;  -- 90 days

Choosing Bucket Size

text

-- Target: 10-100 MB per partition
-- Each row: ~100 bytes (approximate)
-- Writes per hour: 3600 writes/sensor at 1/sec rate

-- Daily bucket: 3600 * 24 = 86,400 rows * 100B = 8.4 MB  ✓
-- Weekly bucket: 8.4 MB * 7 = 58.8 MB  ✓
-- Monthly bucket: 8.4 MB * 30 = 252 MB  ✗ (too large)

-- For this sensor: daily or weekly buckets are appropriate
-- Bucket key: sensor_id + date_trunc('day', timestamp)

# Application-side bucket calculation (Python)
def get_bucket(sensor_id: str, ts: datetime) -> str:
    day = ts.strftime('%Y-%m-%d')
    return f'{sensor_id}:{day}'

Querying Across Buckets

text

-- Application must query each bucket in the date range
-- For a 7-day range, query 7 buckets:

SELECT ts, value
FROM sensor_data
WHERE sensor_id = 'sensor_42'
  AND bucket = 'sensor_42:2025-11-20'
  AND ts >= '2025-11-20 00:00:00'
  AND ts < '2025-11-21 00:00:00'
ORDER BY ts DESC
LIMIT 1000;

-- Do this for each day in the range and merge in application

Tombstone Storms: The Hidden Killer

Every DELETE in Cassandra creates a tombstone — a special marker saying 'this data was deleted'. Cassandra must scan past tombstones when reading. Tombstones accumulate until compaction removes them.

Warning signs:

Read latency increasing over time despite no change in load
Cassandra logs: Read X live rows and Y tombstone cells
GC pressure from tombstone scanning

text

-- Bad pattern: deleting individual rows (creates many tombstones)
DELETE FROM sensor_data WHERE sensor_id='x' AND bucket='x:2025-10' AND ts='...';

-- Good pattern: use TTL (creates efficient TTL tombstones)
INSERT INTO sensor_data (sensor_id, bucket, ts, value)
VALUES ('sensor_42', 'sensor_42:2025-11-27', toTimestamp(now()), 98.6)
USING TTL 7776000;  -- 90 days in seconds

-- Or set TTL at table level (as shown in CREATE TABLE above)

Compaction Strategy for Time-Series

text

-- TWCS (TimeWindowCompactionStrategy) is ideal for time-series with TTL
CREATE TABLE sensor_data (
  -- columns as above
) WITH compaction = {
  'class': 'TimeWindowCompactionStrategy',
  'compaction_window_unit': 'DAYS',
  'compaction_window_size': 1
};

-- TWCS groups SSTables by time window (1 day here)
-- When a window expires (all data TTL'd), it drops the entire SSTable
-- This avoids reading past tombstones

Monitoring Partition Health

text

-- Check for large partitions (Cassandra 3.x+ with nodetool)
nodetool tablestats keyspace.sensor_data

-- Look for:
-- Maximum live cells per slice: should be < 100,000
-- Maximum tombstones per slice: should be < 1,000

-- In Cassandra 4.x, use virtual tables
SELECT * FROM system_views.sstable_tasks;

Key Takeaways

Design partition keys with time buckets to keep partition sizes under 100 MB
Use TTL instead of DELETEs to avoid tombstone accumulation
Use TWCS compaction strategy for time-series data — it aligns with TTL expiry
Application must iterate over buckets for multi-day range queries
Monitor max tombstones per slice — over 1,000 signals a problem

Cassandra Data Modeling for Time-Series: Partitioning, Clustering, and Tombstone Management

Why Cassandra for Time-Series?

The Core Data Model

Choosing Bucket Size

Querying Across Buckets

Tombstone Storms: The Hidden Killer

Compaction Strategy for Time-Series

Monitoring Partition Health

Key Takeaways

Share this article

JusDB Team

Need Expert Help?

Cassandra High Availability

Cassandra on Kubernetes

Cassandra Consulting

Related Articles

High Performance with MongoDB: A Top-Down Tuning Guide

MongoDB Explained (2026): Replica Sets, Sharding, Atlas & Production Patterns

ScyllaDB vs Apache Cassandra: Performance and Operational Differences