Why Cassandra for Time-Series?
Cassandra excels at time-series workloads because its storage model — partitioned, append-oriented SSTables — maps naturally to time-ordered data. But a poorly designed schema can create partition hotspots, huge partitions, or tombstone storms that crush performance.
The Core Data Model
-- Anti-pattern: single partition for all data
CREATE TABLE sensor_data_bad (
sensor_id TEXT,
ts TIMESTAMP,
value DOUBLE,
PRIMARY KEY (sensor_id, ts)
) WITH CLUSTERING ORDER BY (ts DESC);
-- Problem: sensor that writes 1000/sec will have a 31M-row partition
-- per month (1000 * 86400 * 31). Cassandra can handle up to ~100MB
-- per partition before performance degrades.-- Better: bucket by time period
CREATE TABLE sensor_data (
sensor_id TEXT,
bucket TEXT, -- 'sensor_42:2025-11'
ts TIMESTAMP,
value DOUBLE,
PRIMARY KEY ((sensor_id, bucket), ts)
) WITH CLUSTERING ORDER BY (ts DESC)
AND default_time_to_live = 7776000; -- 90 daysChoosing Bucket Size
-- Target: 10-100 MB per partition
-- Each row: ~100 bytes (approximate)
-- Writes per hour: 3600 writes/sensor at 1/sec rate
-- Daily bucket: 3600 * 24 = 86,400 rows * 100B = 8.4 MB ✓
-- Weekly bucket: 8.4 MB * 7 = 58.8 MB ✓
-- Monthly bucket: 8.4 MB * 30 = 252 MB ✗ (too large)
-- For this sensor: daily or weekly buckets are appropriate
-- Bucket key: sensor_id + date_trunc('day', timestamp)
# Application-side bucket calculation (Python)
def get_bucket(sensor_id: str, ts: datetime) -> str:
day = ts.strftime('%Y-%m-%d')
return f'{sensor_id}:{day}'Querying Across Buckets
-- Application must query each bucket in the date range
-- For a 7-day range, query 7 buckets:
SELECT ts, value
FROM sensor_data
WHERE sensor_id = 'sensor_42'
AND bucket = 'sensor_42:2025-11-20'
AND ts >= '2025-11-20 00:00:00'
AND ts < '2025-11-21 00:00:00'
ORDER BY ts DESC
LIMIT 1000;
-- Do this for each day in the range and merge in applicationTombstone Storms: The Hidden Killer
Every DELETE in Cassandra creates a tombstone — a special marker saying 'this data was deleted'. Cassandra must scan past tombstones when reading. Tombstones accumulate until compaction removes them.
Warning signs:
- Read latency increasing over time despite no change in load
- Cassandra logs:
Read X live rows and Y tombstone cells - GC pressure from tombstone scanning
-- Bad pattern: deleting individual rows (creates many tombstones)
DELETE FROM sensor_data WHERE sensor_id='x' AND bucket='x:2025-10' AND ts='...';
-- Good pattern: use TTL (creates efficient TTL tombstones)
INSERT INTO sensor_data (sensor_id, bucket, ts, value)
VALUES ('sensor_42', 'sensor_42:2025-11-27', toTimestamp(now()), 98.6)
USING TTL 7776000; -- 90 days in seconds
-- Or set TTL at table level (as shown in CREATE TABLE above)Compaction Strategy for Time-Series
-- TWCS (TimeWindowCompactionStrategy) is ideal for time-series with TTL
CREATE TABLE sensor_data (
-- columns as above
) WITH compaction = {
'class': 'TimeWindowCompactionStrategy',
'compaction_window_unit': 'DAYS',
'compaction_window_size': 1
};
-- TWCS groups SSTables by time window (1 day here)
-- When a window expires (all data TTL'd), it drops the entire SSTable
-- This avoids reading past tombstonesMonitoring Partition Health
-- Check for large partitions (Cassandra 3.x+ with nodetool)
nodetool tablestats keyspace.sensor_data
-- Look for:
-- Maximum live cells per slice: should be < 100,000
-- Maximum tombstones per slice: should be < 1,000
-- In Cassandra 4.x, use virtual tables
SELECT * FROM system_views.sstable_tasks;Key Takeaways
- Design partition keys with time buckets to keep partition sizes under 100 MB
- Use TTL instead of DELETEs to avoid tombstone accumulation
- Use TWCS compaction strategy for time-series data — it aligns with TTL expiry
- Application must iterate over buckets for multi-day range queries
- Monitor max tombstones per slice — over 1,000 signals a problem