Analytics & OLAP

StarRocks Database (2026): Architecture, Use Cases & Real-Time Analytics Guide

StarRocks is an open-source MPP analytics database. This 2026 guide covers FE/BE/CN architecture, the four table models, vectorized execution, and when to choose StarRocks over ClickHouse.

JusDB Team
May 9, 2026
5 min read
10819 views

TL;DR — StarRocks in 60 seconds: StarRocks is a massively parallel processing (MPP) analytics database forked from Apache Doris and built for sub-second SQL on lakehouse data. Its three-process architecture (FE for metadata + planning, BE for storage + compute, optional CN for compute-only) supports four table models — Duplicate, Aggregate, Unique, and Primary Key — each tuned for different write patterns. Vectorized execution with a cost-based optimizer plus materialized views makes it 5–10× faster than ClickHouse on JOIN-heavy queries. Use cases: real-time dashboards, customer-facing analytics, ad-hoc OLAP on Iceberg/Hudi, and replacing Druid + Trino + ClickHouse with one engine. Deploy on Kubernetes via the official operator.

Two years ago, a product analytics team at a mid-size SaaS company came to us with a familiar problem: their Redshift dashboard queries were taking 8–12 seconds. Business users had stopped trusting the dashboards because the numbers felt stale. The engineering team had already tried caching layers and query rewrites. Nothing moved the needle enough.

We moved their workload to StarRocks. Same data, same queries — P99 query time: 340ms. The dashboards are now live, and users actually use them.

StarRocks has earned a dedicated following in the analytics space because it solves a specific problem extremely well: real-time, high-concurrency OLAP on data you're actively writing to. This guide explains the architecture, when to use it, when not to, and how to operate it in production.

TL;DR
  • StarRocks is a columnar OLAP database optimized for real-time analytics with high query concurrency
  • Best fit: user-facing dashboards, real-time BI, API-served analytics, ad-tech, IoT — anywhere you need sub-second queries at scale
  • Key architectural differentiator: Compute Nodes (CN) let you scale query concurrency independently from storage
  • Natively queries Parquet/ORC/Iceberg without ingestion — this alone replaces complex ETL pipelines for many teams
  • Uses MySQL protocol — existing MySQL tooling, ORMs, and BI connectors work out of the box

Why StarRocks Exists: The Problem It Solves

Traditional data warehouses (Redshift, BigQuery, Snowflake) are optimized for throughput at the cost of latency. A Redshift query scanning 10 billion rows might take 30 seconds — acceptable for a nightly report, unacceptable for a dashboard a user is staring at.

ClickHouse solved the latency problem but had a concurrency problem: it's optimized for a small number of heavy queries, not thousands of lightweight ones hitting it simultaneously from user-facing applications.

StarRocks was designed to handle both: sub-second latency on heavy aggregations AND thousands of concurrent queries from many users simultaneously. The key enabling technology is the Compute Node (CN) layer, which lets you add query-serving capacity without adding storage nodes.


Architecture: FE, BE, and CN Explained

StarRocks uses a shared-nothing distributed architecture with three node types:

Frontend (FE) — Query Planning

FE nodes handle SQL parsing, query planning, metadata management, and client connections. They don't touch data. For high availability, deploy 3 FE nodes in a Raft-based leader election. The leader handles writes; followers handle metadata reads and failover.

-- Check FE status
SHOW FRONTENDS\G

-- FE nodes serve the MySQL-compatible connection endpoint
mysql -h fe-host -P 9030 -u root

Backend (BE) — Storage and Execution

BE nodes store data in columnar format and execute queries. In a classic StarRocks setup, BE nodes handle both storage and compute. Adding BE nodes scales both capacity and query execution in tandem.

-- Check BE health
SHOW BACKENDS\G

-- Key metrics per BE node:
-- TabletNum: number of tablet replicas
-- DataUsedCapacity: storage used
-- AvailCapacity: storage available
-- CpuCores: available for query execution

Compute Node (CN) — Stateless Query Scaling

CN nodes are stateless — they execute queries but hold no data. This is StarRocks' killer feature for cloud deployments: you can scale query concurrency by adding CN nodes without rebalancing data. Scale up before a traffic spike, scale down after. On Kubernetes, CN nodes can be horizontal-pod-autoscaled.

-- Check CN nodes
SHOW COMPUTE NODES\G

-- Add a CN node dynamically
ALTER SYSTEM ADD COMPUTE NODE "cn-host:9050";
Node Type Role Scales Stateful?
FESQL parsing, planning, metadataHA (usually 3 fixed)Yes (metadata)
BEStorage + query executionStorage capacity + throughputYes (data)
CNStateless query execution onlyQuery concurrency onlyNo

Table Models: Getting This Right Is Critical

StarRocks has four table models. Choosing the wrong one is the most common mistake in production deployments — it affects performance, storage efficiency, and whether you can update data at all.

Duplicate Key Table (default, append-only)

-- For log/event data where every row is unique
CREATE TABLE access_logs (
  event_time  DATETIME,
  user_id     BIGINT,
  url         VARCHAR(512),
  status_code SMALLINT,
  duration_ms INT
) DUPLICATE KEY(event_time, user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 16
PROPERTIES ("replication_num" = "3");

Use for: log analytics, clickstream, telemetry, event sourcing. All rows are kept. Best for immutable data.

Aggregate Key Table (pre-aggregation at write time)

-- For pre-aggregated metrics
CREATE TABLE daily_metrics (
  report_date DATE,
  user_id     BIGINT,
  page_views  BIGINT SUM,
  revenue     DECIMAL(12,2) SUM,
  sessions    BIGINT SUM
) AGGREGATE KEY(report_date, user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 16;

Use for: counters, summaries, KPIs where you always want aggregated values. Merges rows sharing the same key on insert. Massively reduces storage and query cost for metric tables.

Unique Key Table (upsert/latest value)

-- For dimension tables or mutable state (user profiles, inventory)
CREATE TABLE user_profiles (
  user_id      BIGINT,
  email        VARCHAR(255),
  plan         VARCHAR(32),
  updated_at   DATETIME
) UNIQUE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 16;

-- Upsert via INSERT INTO ... ON DUPLICATE KEY UPDATE behavior
INSERT INTO user_profiles VALUES (1001, 'alice@example.com', 'pro', NOW());

Use for: dimension tables, entity state (user, product, account). Latest value per primary key wins. Supports row-level updates — unlike most OLAP databases.

Primary Key Table (efficient row-level updates)

-- Best for frequently updated records
CREATE TABLE orders (
  order_id    BIGINT,
  user_id     BIGINT,
  status      VARCHAR(32),
  total       DECIMAL(12,2),
  updated_at  DATETIME
) PRIMARY KEY(order_id)
DISTRIBUTED BY HASH(order_id) BUCKETS 16
PROPERTIES (
  "replication_num" = "3",
  "enable_persistent_index" = "true"  -- for large primary key tables
);

-- Point updates are efficient (no full tablet rewrite)
UPDATE orders SET status = 'shipped' WHERE order_id = 98765;

Use for: order tables, inventory, any table needing real-time UPDATE/DELETE without rewriting the whole partition. This is what enables HTAP — you can run analytics on a table that's actively being updated.


Lakehouse Querying: No ETL Required

One of StarRocks' most underused features is its ability to query data lake files directly without ingesting them into StarRocks storage. This eliminates entire ETL pipelines for teams that already have data in S3, GCS, or HDFS.

-- Create a catalog pointing to your data lake
CREATE EXTERNAL CATALOG iceberg_catalog
PROPERTIES (
  "type" = "iceberg",
  "iceberg.catalog.type" = "rest",
  "iceberg.catalog.uri" = "http://iceberg-rest-catalog:8181",
  "aws.s3.access_key" = "...",
  "aws.s3.secret_key" = "...",
  "aws.s3.region" = "us-east-1"
);

-- Query Iceberg tables directly
SELECT
  date_trunc('month', event_time) AS month,
  country,
  count(*) AS events,
  sum(revenue) AS total_revenue
FROM iceberg_catalog.analytics.events
WHERE event_time > '2025-01-01'
GROUP BY 1, 2
ORDER BY total_revenue DESC;

-- Join lakehouse data with StarRocks-native tables
SELECT u.plan, count(*) AS conversions
FROM iceberg_catalog.events e
JOIN default_catalog.default.user_profiles u ON e.user_id = u.user_id
WHERE e.event_type = 'purchase'
GROUP BY u.plan;

Supported formats: Parquet, ORC, CSV, Avro. Supported catalogs: Apache Iceberg, Delta Lake, Apache Hive, JDBC (for querying MySQL/Postgres directly).


Real-Time Ingestion

Stream from Kafka

-- Routine Load: continuous Kafka consumer built into StarRocks
CREATE ROUTINE LOAD analytics.load_events ON events
COLUMNS (event_time, user_id, url, status_code, duration_ms)
PROPERTIES (
  "desired_concurrent_number" = "4",
  "max_batch_interval" = "5",
  "max_batch_rows" = "500000",
  "format" = "json"
)
FROM KAFKA (
  "kafka_broker_list" = "kafka-broker:9092",
  "kafka_topic" = "page_events",
  "kafka_partitions" = "0,1,2,3",
  "kafka_offsets" = "OFFSET_BEGINNING"
);

-- Monitor the load job
SHOW ROUTINE LOAD FOR analytics.load_events\G

Batch load from S3

-- Broker Load: parallel load from object storage
LOAD LABEL analytics.load_2025_01
(
  DATA INFILE("s3://my-bucket/events/2025/01/*")
  INTO TABLE events
  FORMAT AS "parquet"
)
WITH BROKER
(
  "aws.s3.access_key" = "...",
  "aws.s3.secret_key" = "...",
  "aws.s3.region" = "us-east-1"
)
PROPERTIES ("timeout" = "3600");

-- Check load status
SHOW LOAD WHERE LABEL = "analytics.load_2025_01"\G

Materialized Views for Dashboard Acceleration

-- Synchronous Materialized View (maintained automatically on insert)
CREATE MATERIALIZED VIEW mv_hourly_url_stats AS
SELECT
  date_trunc('hour', event_time) AS hour,
  url,
  count(*) AS views,
  avg(duration_ms) AS avg_duration_ms
FROM access_logs
GROUP BY 1, 2;

-- StarRocks query rewrite: this query automatically hits the MV
SELECT url, count(*) AS views
FROM access_logs
WHERE event_time > now() - INTERVAL 24 HOUR
GROUP BY url
ORDER BY views DESC
LIMIT 20;
-- No need to query the MV explicitly — optimizer rewrites it

-- Async Materialized View (scheduled refresh, for complex aggregations)
CREATE MATERIALIZED VIEW mv_daily_revenue
REFRESH ASYNC EVERY (INTERVAL 1 HOUR)
AS
SELECT
  date_trunc('day', order_date) AS day,
  region,
  sum(total) AS revenue,
  count(*) AS orders
FROM orders
GROUP BY 1, 2;

StarRocks vs ClickHouse vs Redshift

Dimension StarRocks ClickHouse Redshift
Query latencySub-second (heavy aggregations)Sub-second (heavy aggregations)Seconds to minutes
ConcurrencyHigh (CN layer scales independently)Moderate (single-threaded core historically)Moderate (WLM queuing)
Row updatesYes (Primary Key table model)Expensive (part rewrite)Yes (VACUUM required)
Lakehouse queriesNative (Iceberg, Delta, Hive)Via table engines (more complex)Redshift Spectrum (additional cost)
ProtocolMySQL-compatibleCustom HTTP + nativePostgreSQL-compatible
Compute/storage scalingIndependent (CN nodes)Coupled (BE nodes)Coupled (RA3 separates, expensive)
Best forUser-facing analytics, HTAP, API servingLog analytics, observability, ad-tech batchTraditional DW, large-scale BI

For most teams choosing between StarRocks and ClickHouse: if you need high concurrency (many users hitting dashboards simultaneously) and/or row-level updates, StarRocks wins. If you're doing log analytics or batch aggregations with fewer concurrent users, ClickHouse is simpler to operate. See our ClickHouse guide for a deeper comparison.


When NOT to Use StarRocks

  • OLTP workloads: StarRocks is not a replacement for MySQL or PostgreSQL for transactional systems. Use it alongside them, not instead.
  • Single-user, infrequent queries: If you're running a few analyst queries per day on <10M rows, the operational overhead of StarRocks isn't worth it. PostgreSQL with proper indexes and pg_trgm handles this fine.
  • Complex ad-hoc joins across many normalized tables: StarRocks, like most OLAP systems, prefers wide denormalized tables. Star schemas work well; deeply normalized 3NF schemas do not.
  • Budget-constrained small teams: StarRocks is operationally heavier than ClickHouse or TimescaleDB. If you don't have someone who can manage it, ClickHouse Cloud or a managed StarRocks offering is easier to start with.

Production Operations

Key metrics to monitor

-- Query latency distribution
SELECT
  fe_host,
  query_type,
  count(*) AS query_count,
  avg(query_duration_ms) AS avg_ms,
  percentile_cont(0.99) WITHIN GROUP (ORDER BY query_duration_ms) AS p99_ms
FROM information_schema.be_tablets
LIMIT 10;

-- BE node health
SHOW BACKENDS\G
-- Watch: isAlive, TabletNum (should be balanced), DataUsedCapacity, AvailCapacity

-- Compaction status (high compaction score = write amplification risk)
SELECT BE_ID, TABLET_NUM, DATA_SIZE, COMPACTION_STATUS
FROM information_schema.be_tablets
ORDER BY COMPACTION_STATUS DESC
LIMIT 20;

-- Replication health
SHOW PROC '/statistic'\G
-- UnhealthyTabletNum should be 0

Bucket sizing

Bucket count is one of the most common configuration mistakes. Too few buckets = hotspots; too many = excessive metadata overhead.

-- Rule of thumb: aim for 1–10 GB per bucket per BE node
-- For a 3-BE cluster storing 300 GB total:
-- 300 GB / 3 BEs / 5 GB per bucket = 20 buckets

DISTRIBUTED BY HASH(user_id) BUCKETS 20

-- Auto-bucketing (StarRocks 3.x+ recommends this for new tables)
DISTRIBUTED BY HASH(user_id)  -- StarRocks calculates bucket count automatically

Deployment

Kubernetes with StarRocks Operator

# Install the StarRocks Operator
helm repo add starrocks https://starrocks.github.io/starrocks-kubernetes-operator
helm install starrocks-operator starrocks/starrocks-operator -n starrocks --create-namespace

# Deploy a cluster
kubectl apply -f - <<EOF
apiVersion: starrocks.com/v1
kind: StarRocksCluster
metadata:
  name: prod-cluster
spec:
  starRocksFeSpec:
    replicas: 3
    limits: {cpu: "4", memory: "8Gi"}
  starRocksBeSpec:
    replicas: 3
    limits: {cpu: "16", memory: "64Gi"}
    storageVolumes:
    - name: be-storage
      storageClassName: gp3
      storageSize: 2Ti
  starRocksCnSpec:
    replicas: 3
    limits: {cpu: "8", memory: "16Gi"}
EOF

Working with JusDB on StarRocks

Most StarRocks performance problems we see stem from two things: wrong table model selection (using Duplicate Key when you need Primary Key, or vice versa), and under-designed bucket distribution causing hotspots. A 2-hour architecture review before you build usually prevents months of pain later.

Our StarRocks consulting covers:

  • Table model and partition design for your specific query patterns
  • Materialized view strategy for dashboard workloads
  • Kafka ingestion pipeline setup and monitoring
  • Lakehouse catalog integration (Iceberg, Delta, Hive)
  • Production cluster sizing, CN auto-scaling on Kubernetes

If you're migrating from ClickHouse, Redshift, or a traditional data warehouse, reach out — we've run those migrations and can map your schema and query patterns to the right StarRocks table models.

Related reading: ClickHouse vs StarRocks deep dive | StarRocks as a Vector Database | StarRocks Consulting

Frequently Asked Questions

What is StarRocks and how does it compare to ClickHouse?
StarRocks is an open-source MPP analytics engine that excels at JOIN-heavy queries — where ClickHouse traditionally struggles. Both use vectorized execution and columnar storage, but StarRocks has a cost-based optimizer and a richer JOIN implementation. ClickHouse is still typically faster on flat aggregations of huge single tables. StarRocks wins for star/snowflake schemas; ClickHouse wins for log/event analytics.
What's the difference between StarRocks FE, BE, and CN nodes?
FE (Frontend) handles metadata, query planning, and SQL parsing — like Postgres's coordinator. BE (Backend) stores data and executes query fragments locally. CN (Compute Node) is BE without storage — used for compute-storage separation when you store data in S3/HDFS and scale compute elastically. Production: 3 FE (Raft quorum) + N BE or N CN.
Which StarRocks table model should I choose?
Duplicate: append-only logs, no aggregation. Aggregate: pre-aggregated rollups for dashboards (SUM/MAX/MIN at write time). Unique Key: row-level updates (replace by key). Primary Key: row-level updates with optimised partial-column updates (MySQL CDC pattern). Default to Primary Key for OLTP-mirrored fact tables, Aggregate for materialised metrics.
Can StarRocks query data in Iceberg or Hudi?
Yes — StarRocks 3.0+ supports external catalogs for Iceberg, Hudi, Delta Lake, Hive, and Paimon. Configure once per catalog, then query lakehouse tables with full SQL including JOINs to StarRocks-native tables. Performance is competitive with Trino + 30% better on join-heavy queries.
How does StarRocks achieve sub-second query latency?
Three things: 1) vectorized execution engine processes batches of rows in CPU SIMD registers, not one row at a time; 2) cost-based optimiser with histogram statistics picks better plans than rule-based; 3) materialised views auto-refresh on write, so dashboard queries hit pre-computed aggregates.
Is StarRocks open source and what's its relationship to Apache Doris?
StarRocks (Apache 2.0) was forked from Apache Doris in 2020 and has diverged substantially since — different optimiser, different table models, different storage. The commercial entity behind StarRocks is CelerData. Apache Doris is also open source but is a separate project; the two are not API-compatible.
When should I use StarRocks vs ClickHouse vs Snowflake?
StarRocks for self-hosted JOIN-heavy analytics, customer-facing dashboards, lakehouse queries. ClickHouse for log/event analytics, high-cardinality time series, single-table aggregations. Snowflake when you don't want to operate the database — managed, elastic, but 5–10× the cost at scale and slower than both on hot queries.
How do I deploy StarRocks on Kubernetes?
Use the official starrocks-kubernetes-operator from CelerData. Define a StarRocksCluster CRD specifying FE replica count, BE replica count, persistent volume claims, and resource limits. The operator handles rolling upgrades, FE quorum management, and BE decommissioning automatically. Production setups use 3 FE on dedicated nodes and BE on tainted DB nodes.

Share this article

JusDB Team

Official JusDB content team

Deeper Reading

Curated companion guides for readers who want to go deeper on this topic.