A team migrates from PostgreSQL to Apache Cassandra, keeps their relational schema intact, and then watches their cluster grind to a halt under production load. Queries timeout, hotspots form on individual nodes, and the operations team scrambles to understand why a database famous for linear scalability is performing worse than the single Postgres instance it replaced. The root cause is almost always the same: they modeled their data for storage rather than for queries. Cassandra's data model is not a relational model with a distributed twist — it is a fundamentally different paradigm where the access pattern drives every schema decision you make.
- Cassandra requires a query-first design approach — model your tables around how data will be read, not how it is logically structured.
- The partition key determines which node stores the data and must be chosen to distribute load evenly across the cluster.
- Clustering columns define the sort order within a partition and are the only columns that support range queries natively.
- Avoid
ALLOW FILTERING, secondary indexes on high-cardinality columns, and unbounded partition growth to prevent performance cliffs. - Denormalization and duplicate tables are not code smells in Cassandra — they are the correct solution to multi-access-pattern problems.
The Cassandra Data Model Philosophy
Cassandra is a wide-column store built for write-heavy, low-latency workloads at scale. Its architecture is masterless, meaning every node in the cluster can accept reads and writes without a single point of coordination. Data is distributed across nodes using consistent hashing on the partition key, which means the database can route any request to exactly the right node without broadcasting the query to every machine in the cluster.
This architecture comes with a non-negotiable constraint: the database can only efficiently locate data it can route deterministically. If you ask Cassandra for rows where email = 'user@example.com' and email is not part of the partition key, Cassandra has no choice but to scan every node in the cluster — or refuse the query outright. This is why the data model philosophy in Cassandra starts with one question: what queries will my application execute?
In a relational database, you normalize data into entities and then join them at query time. In Cassandra, joins do not exist. You pre-compute the join at write time by designing one table per query pattern. This seems wasteful to engineers coming from a relational background, but it is the mechanism that allows Cassandra to serve reads from a single partition on a single node with predictable, sub-millisecond latency regardless of dataset size.
Partition Keys: The Foundation of Data Distribution
The partition key is the first and most consequential decision in any Cassandra schema. It determines two things simultaneously: which node in the cluster holds the data, and what the primary lookup key for that data will be. Cassandra passes the partition key through a hash function (Murmur3 by default) to generate a token, and that token maps to a specific node range on the ring.
A good partition key has three properties: high cardinality (many distinct values), even distribution (no single value is accessed far more than others), and direct correlation to the query's WHERE clause. A poor partition key creates hot partitions — nodes that receive a disproportionate share of traffic while other nodes sit idle, wasting cluster capacity.
Consider a naive schema for storing IoT sensor readings:
-- BAD: using device_type as the partition key
CREATE TABLE sensor_readings_bad (
device_type TEXT,
device_id UUID,
recorded_at TIMESTAMP,
temperature DOUBLE,
humidity DOUBLE,
PRIMARY KEY (device_type, device_id, recorded_at)
);If 80% of your devices are of type thermometer, then one token on the ring holds 80% of your data and receives 80% of your reads. The remaining nodes are underutilized. The fix is to choose a more selective partition key:
-- GOOD: device_id has high cardinality and even distribution
CREATE TABLE sensor_readings (
device_id UUID,
recorded_at TIMESTAMP,
temperature DOUBLE,
humidity DOUBLE,
PRIMARY KEY (device_id, recorded_at)
) WITH CLUSTERING ORDER BY (recorded_at DESC);Compound Partition Keys
Sometimes a single column does not provide sufficient distribution or does not match the query pattern. Cassandra supports compound partition keys — multiple columns that together form the partition key. The combined hash of all columns in the compound key determines the node placement.
-- Compound partition key: (tenant_id, sensor_type)
CREATE TABLE tenant_sensor_readings (
tenant_id UUID,
sensor_type TEXT,
device_id UUID,
recorded_at TIMESTAMP,
value DOUBLE,
PRIMARY KEY ((tenant_id, sensor_type), device_id, recorded_at)
) WITH CLUSTERING ORDER BY (device_id ASC, recorded_at DESC);Notice the double parentheses around tenant_id, sensor_type — this is how CQL syntax distinguishes a compound partition key from clustering columns. A query against this table must supply both tenant_id and sensor_type in the WHERE clause to resolve to a single partition. If your application sometimes queries by tenant_id alone, you need a separate table for that access pattern.
Cassandra partitions have a practical size limit of approximately 100MB, and the hard architectural guidance is to keep partitions well below that threshold — ideally under 10MB. A partition that grows unboundedly (such as storing all events for a user in a single partition with no time-bucketing) will eventually degrade read performance, cause compaction pressure, and produce coordinator timeouts. Design your partition key to bound partition size from the start.
Clustering Columns: Ordering Data Within a Partition
Within a partition, Cassandra stores rows sorted by the clustering columns. This is what makes wide-row designs powerful: a single partition can hold millions of rows, all sorted in a defined order, and Cassandra can read a contiguous slice of that sorted data in a single disk operation.
Clustering columns support range queries using operators like >, >=, <, <=, and IN — but only within a partition. You cannot execute a range query on a clustering column across different partitions without reading multiple partitions explicitly. The ORDER BY clause in a SELECT is also constrained: you can only order by the clustering columns, and only in the direction declared in the table's CLUSTERING ORDER BY specification.
-- Reading the last 100 readings for a device (efficient: single partition, sorted)
SELECT device_id, recorded_at, temperature, humidity
FROM sensor_readings
WHERE device_id = 550e8400-e29b-41d4-a716-446655440000
ORDER BY recorded_at DESC
LIMIT 100;
-- Reading readings within a time range (efficient: single partition range scan)
SELECT device_id, recorded_at, temperature
FROM sensor_readings
WHERE device_id = 550e8400-e29b-41d4-a716-446655440000
AND recorded_at >= '2025-01-01 00:00:00'
AND recorded_at < '2025-02-01 00:00:00';Query-First Design in Practice: Time-Series Tables
Time-series data is one of Cassandra's strongest use cases because the write pattern is append-only and the read pattern is almost always "give me data for entity X within time range Y." A well-designed time-series schema puts the entity identifier in the partition key and the timestamp as a clustering column.
For high-volume sensors that could generate millions of readings per day, bucketing by time prevents unbounded partition growth:
-- Time-bucketed time-series table: one partition per device per day
CREATE TABLE sensor_readings_by_day (
device_id UUID,
bucket_date DATE, -- e.g., '2025-01-15'
recorded_at TIMESTAMP,
temperature DOUBLE,
humidity DOUBLE,
pressure DOUBLE,
PRIMARY KEY ((device_id, bucket_date), recorded_at)
) WITH CLUSTERING ORDER BY (recorded_at DESC)
AND compaction = {
'class': 'TimeWindowCompactionStrategy',
'compaction_window_unit': 'DAYS',
'compaction_window_size': 1
};Queries for data within a single day hit a single partition. Queries spanning multiple days require the application to fan out across multiple partitions — a known, bounded cost that scales linearly with the number of days requested. The TimeWindowCompactionStrategy (TWCS) is specified here because it is optimized for time-series workloads: it compacts only SSTables within the same time window, avoiding read amplification and write amplification that LeveledCompactionStrategy would introduce on append-heavy tables.
When querying across multiple day buckets, generate the list of bucket dates in your application layer and issue parallel queries, then merge the results client-side. Cassandra's drivers support asynchronous execution, so a 7-day range query can be issued as seven parallel single-partition reads rather than one sequential scan.
Anti-Patterns to Avoid
ALLOW FILTERING
ALLOW FILTERING is Cassandra's escape hatch for queries that cannot be resolved to a single partition. When you append it to a SELECT, you are telling Cassandra to scan every partition in the table and filter results in memory. On a table with millions of rows, this produces full cluster scans that degrade over time as data grows.
-- DANGEROUS: forces a full table scan across all nodes
SELECT * FROM sensor_readings
WHERE temperature > 37.5
ALLOW FILTERING;The correct solution is to design a table where temperature ranges can be queried within a known partition, or to pre-aggregate high-temperature events into a separate table during write time.
ALLOW FILTERING is acceptable in development against small datasets and in rare operational queries on tables that are known to be small. It should never appear in application code serving production traffic. If you find yourself reaching for ALLOW FILTERING, treat it as a signal that the table needs to be redesigned or that a new table for the missing access pattern needs to be created.
Secondary Indexes on High-Cardinality Columns
Cassandra's built-in secondary indexes (created with CREATE INDEX) are implemented as local indexes — each node indexes only its own data. A query against a secondary index must broadcast to every node in the cluster to guarantee a complete result, making it effectively a full cluster scan with extra overhead. Secondary indexes are only appropriate for low-cardinality columns where the result set is large relative to the number of partitions scanned.
-- This index looks harmless but will produce scatter-gather queries
-- across the entire cluster for a UUID column with millions of distinct values
CREATE INDEX ON sensor_readings (device_id); -- device_id is already the partition key here,
-- but imagine if it were notFor secondary access patterns on high-cardinality data, the correct approach is a denormalized duplicate table with the access column as the partition key.
Tombstone Accumulation
When you delete data in Cassandra, the system writes a tombstone — a marker that tells replicas the data is gone. Tombstones accumulate until compaction runs and the gc_grace_seconds period (default 10 days) has elapsed. In delete-heavy workloads, tombstone accumulation causes read performance to degrade because Cassandra must read and discard tombstones to return live data. Queries that generate warnings about tombstone thresholds (default 1,000 tombstones per query) are a sign that the data model or write pattern needs revision.
-- Prefer TTL-based expiry over explicit deletes for time-series data
INSERT INTO sensor_readings (device_id, recorded_at, temperature)
VALUES (550e8400-e29b-41d4-a716-446655440000, toTimestamp(now()), 22.4)
USING TTL 2592000; -- expires after 30 days, compacted cleanly by TWCSMaterialized Views
Cassandra's materialized views (MVs) automatically maintain a denormalized copy of a base table, updated synchronously on every write. While convenient, MVs add write latency, increase the risk of inconsistency between base and view during node failures, and have historically been unstable in production deployments. The Cassandra community's current consensus is to avoid MVs in favor of application-managed dual writes using batch statements or custom write paths. If you use MVs, test them thoroughly under failure conditions before relying on them in production.
Denormalization as a First-Class Strategy
If an application needs to look up sensor readings by device ID and also by geographic region, the correct Cassandra solution is two tables:
-- Table 1: lookup by device
CREATE TABLE sensor_readings_by_device (
device_id UUID,
recorded_at TIMESTAMP,
region TEXT,
temperature DOUBLE,
PRIMARY KEY (device_id, recorded_at)
) WITH CLUSTERING ORDER BY (recorded_at DESC);
-- Table 2: lookup by region (denormalized duplicate)
CREATE TABLE sensor_readings_by_region (
region TEXT,
recorded_at TIMESTAMP,
device_id UUID,
temperature DOUBLE,
PRIMARY KEY (region, recorded_at, device_id)
) WITH CLUSTERING ORDER BY (recorded_at DESC, device_id ASC);Both tables are written to atomically using a logged batch (for tables in the same keyspace) or coordinated at the application layer. The write cost is doubled, but the read cost is a single partition lookup for either access pattern — exactly the guarantee Cassandra is designed to provide.
- Design tables around query patterns, not around data entities. One access pattern requires one table.
- The partition key determines node placement and must be chosen for high cardinality and even distribution. Avoid hot partitions.
- Compound partition keys (double parentheses in CQL) combine multiple columns into a single partition key — all columns must be provided in the WHERE clause.
- Clustering columns control sort order within a partition and enable efficient range queries — but only within a single partition.
- Keep partition sizes below 100MB. For time-series data, use time bucketing (e.g., one partition per device per day) to bound growth.
- Use
TimeWindowCompactionStrategyfor append-only, time-ordered workloads to minimize compaction overhead. - Never use
ALLOW FILTERINGin application code on production tables. Treat it as a design smell that points to a missing table. - Prefer TTL-based expiry over explicit deletes to control tombstone accumulation in write-heavy workloads.
- Denormalization is correct in Cassandra. Duplicate data across multiple tables to serve multiple access patterns efficiently.
- Avoid materialized views in production unless you have tested them exhaustively under failure conditions.
Running Cassandra at Scale with JusDB
Getting Cassandra's data model right is the difference between a cluster that scales gracefully and one that becomes increasingly expensive to operate as data grows. Even with a well-designed schema, production Cassandra requires ongoing attention: compaction monitoring, partition size alerting, tombstone tracking, and capacity planning across nodes and datacenters.
JusDB provides fully managed Apache Cassandra with built-in observability dashboards that surface partition hotspots, tombstone accumulation rates, and compaction lag before they become incidents. Our team of Cassandra practitioners is available to review your data model and access patterns, helping you catch anti-patterns before they reach production. Automated backups, point-in-time recovery, and multi-datacenter replication are configured and maintained without manual intervention.
If you are designing a new Cassandra schema or migrating an existing workload, talk to a JusDB engineer about your access patterns. We will help you design a data model that stays performant as your data grows from gigabytes to petabytes.