StarRocks Monitoring & Alerting: Complete Production Guide

Two months ago, a Series C e-commerce company called us at 2 AM. Their real-time analytics dashboard—powered by StarRocks—had gone dark during a flash sale. The culprit: a single BE node ran out of disk at 98% capacity, triggering a cascade that brought down compaction, stalled ingestion from Kafka, and left the executive team staring at a blank Grafana screen while $400K/hour in GMV rolled through untracked.

The cluster had been running without a single alert rule. Prometheus was scraping metrics, Grafana dashboards existed, but nobody had configured thresholds or notification channels. The disk had been climbing at 2% per day for six weeks. A 90% disk usage alert would have fired four days earlier—enough time to expand storage during business hours instead of scrambling at 2 AM.

This guide is the monitoring and alerting configuration we now deploy for every StarRocks cluster we manage. It covers the three layers that matter: resource saturation, cluster service health, and application availability.

TL;DR

StarRocks monitoring splits into three domains: resource saturation (CPU, memory, disk I/O, disk space), cluster service health (FE/BE node status, JVM, compaction, metadata), and application availability (query errors, latency, ingestion lag).
Resource saturation alerts catch slow-building problems days before they become outages. Set CPU and memory alerts at 90%, disk at 90%, and filesystem free space at 5–10 GB.
Service health alerts fire on immediate dangers: dead nodes, JVM exhaustion, checkpoint failures, and compaction backlogs. These need PagerDuty-level routing.
Every alert in this guide includes the PromQL expression, the threshold, and the specific investigation steps—not just "check the logs."
A well-monitored StarRocks cluster surfaces problems early enough that you can fix them before they turn into incidents.

Why StarRocks Monitoring Is Different

StarRocks is an MPP (Massively Parallel Processing) analytical database with a split architecture: FE nodes handle query parsing, planning, metadata management, and catalog operations, while BE nodes handle storage, data ingestion, compaction, and query execution. Each layer has its own failure modes, its own resource bottlenecks, and its own metrics.

This means you cannot monitor StarRocks the way you monitor PostgreSQL or MySQL. A single-node database has one process to watch. StarRocks is a distributed system in which an FE metadata checkpoint failure or a BE compaction backlog can silently degrade query performance for hours before a user notices. The monitoring must be layered.

The three layers we use:

Resource Saturation — Host-level metrics that develop gradually. These are your early warning system.
Cluster Service Health — Node availability, JVM pressure, compaction state, metadata integrity. These signal immediate service disruption.
Application Availability — Query errors, latency, throughput, ingestion lag. These are what your users actually feel.

Layer 1: Resource Saturation

Resource saturation alerts are the foundation. They are slow-building, predictable, and—critically—actionable before they cause outages. If you only configure one layer of monitoring, make it this one.

CPU Utilization (BE Nodes)

Alert threshold: >90% sustained for 5 minutes

text

# PromQL
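# Busy percentage = 1 - (idle CPU time / total CPU time), per BE instance
(1 - sum(rate(starrocks_be_cpu{mode="idle", job="$job_name"}[5m])) by (instance)
   / sum(rate(starrocks_be_cpu{job="$job_name"}[5m])) by (instance)) * 100 > 90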

Investigation steps:

Check running queries: SHOW PROC '/current_queries' on the FE leader.
Review the audit log for resource-intensive statements—look for queries scanning large amounts of data without partition pruning.
On the affected BE node, use perf top to identify hot functions. Common culprits: hash join build, expression evaluation, or JSON parsing.
If a single query is responsible, capture its query profile with EXPLAIN ANALYZE before killing it.

Emergency mitigation: If the node is unresponsive and the steps above cannot isolate a query to kill, restart the BE process. Queries in flight on that node will fail and need to be retried; new queries are planned against surviving replicas (assuming replication factor ≥2). Collect a stack trace with pstack before restarting for post-incident analysis.
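A CPU-busy percentage is normally derived as one minus the idle share of all CPU time between two samples. The sketch below reproduces that arithmetic in Python, assuming starrocks_be_cpu behaves as a per-mode, monotonically increasing counter; the function name and the sample values are hypothetical and only illustrate the calculation.

```python
def cpu_utilization_percent(before: dict[str, float], after: dict[str, float]) -> float:
    """Return the non-idle CPU share (0-100) between two counter snapshots.

    Each snapshot maps a CPU mode ("user", "system", "idle", ...) to a
    cumulative counter value (assumed shape of the per-mode metric).
    """
    deltas = {mode: after[mode] - before.get(mode, 0.0) for mode in after}
    total = sum(deltas.values())
    if total <= 0:
        return 0.0  # no CPU time elapsed between the two samples
    return (1.0 - deltas.get("idle", 0.0) / total) * 100.0


# Hypothetical samples taken a few seconds apart on one BE node.
t0 = {"user": 1000.0, "system": 200.0, "iowait": 50.0, "idle": 8750.0}
t1 = {"user": 1460.0, "system": 230.0, "iowait": 60.0, "idle": 8800.0}
usage = cpu_utilization_percent(t0, t1)
print(f"CPU busy: {usage:.1f}%")
```

Dividing by the total across all modes is what turns raw counter increments into a percentage; a rate over the non-idle modes alone is not bounded by 100.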

Memory Usage (BE Process)

Alert threshold: >90% of configured mem_limit

text

# PromQL
starrocks_be_process_mem_bytes{job="$job_name"} / starrocks_be_mem_limit{job="$job_name"} * 100 > 90

Investigation steps:

Check the memory tracker hierarchy at http://<be_ip>:8040/mem_tracker. This shows memory consumption broken down by component: query pool, load, compaction, schema change, storage page cache.
In Grafana, review starrocks_be_*_mem_bytes panels to identify which memory consumer is growing.
Common causes: large hash joins building oversized in-memory hash tables, excessive concurrent loads, or a page cache growing beyond what the node can spare. If page cache is the culprit, check storage_page_cache_limit in be.conf.

Tip: When spilling is enabled, the StarRocks BE will attempt to spill intermediate results to disk under memory pressure, but this dramatically slows queries. The goal is to catch memory pressure before spill happens, not after.

Machine-Level Memory (>90%)

This is separate from BE process memory. Other processes on the same host—node exporter, log agents, monitoring sidecars—can consume memory that the BE process then cannot use. Use standard node_exporter metrics:

text

# PromQL
(1 - node_memory_MemAvailable_bytes{job="$node_job"} / node_memory_MemTotal_bytes{job="$node_job"}) * 100 > 90

If machine memory is high but BE process memory is normal, investigate co-located services with ps aux --sort=-%mem | head -20.
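For a quick check on a host without Prometheus access, the same percentage can be computed from /proc/meminfo, which is where node_exporter reads these values. This is a minimal sketch; the function name and the sample text are hypothetical.

```python
def memory_used_percent(meminfo_text: str) -> float:
    """Compute (1 - MemAvailable / MemTotal) * 100 from /proc/meminfo text."""
    fields: dict[str, int] = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])  # values are reported in kB
    return (1 - fields["MemAvailable"] / fields["MemTotal"]) * 100


# Hypothetical excerpt; on a real host, read the text from /proc/meminfo.
SAMPLE = """MemTotal:       65536000 kB
MemFree:         1024000 kB
MemAvailable:    5242880 kB
"""
print(f"machine memory used: {memory_used_percent(SAMPLE):.1f}%")
```

Using MemAvailable rather than MemFree matters: MemFree ignores reclaimable page cache and would overstate memory pressure on a database host.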

Disk I/O Saturation

Alert threshold: >90% I/O utilization sustained for 5 minutes

text

# PromQL
rate(node_disk_io_time_seconds_total{device=~"sd.*|nvme.*", job="$node_job"}[5m]) * 100 > 90

Investigation steps:

Run iostat -xz 1 5 on the affected node to determine if the bottleneck is reads (query scans) or writes (ingestion/compaction).
Check for concurrent operations: heavy ingestion + compaction + large analytical queries competing for the same disks.
Review partition pruning—queries scanning all partitions when they should scan one will saturate disk I/O. Check query plans with EXPLAIN.
If compaction is the primary I/O consumer, consider scheduling large compactions during off-peak hours or tuning cumulative_compaction_num_threads_per_disk.

Disk Capacity

Alert threshold: >90% usage on data directories

text

# PromQL
(1 - node_filesystem_avail_bytes{mountpoint=~"/data.*", job="$node_job"}
  / node_filesystem_size_bytes{mountpoint=~"/data.*", job="$node_job"}) * 100 > 90

Key detail most teams miss: When you DROP a table in StarRocks, the data files are moved to a trash directory and retained for a configurable period (default: 1 day, controlled by trash_file_expire_time_sec in be.conf). This means disk space is not reclaimed immediately after a DROP. If you need immediate space recovery, use TRUNCATE TABLE instead, or manually clear the trash directory.
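A capacity threshold is only useful if it leaves enough time to act. The sketch below estimates that lead time, assuming roughly linear growth (an assumption; real growth is often bursty). The function name is hypothetical, and the example figures are the ones from the incident described at the top of this guide.

```python
import math


def days_until(threshold_pct: float, current_pct: float, growth_pct_per_day: float) -> float:
    """Days until disk usage reaches threshold_pct, assuming linear growth."""
    if current_pct >= threshold_pct:
        return 0.0  # already at or past the threshold
    if growth_pct_per_day <= 0:
        return math.inf  # usage is flat or shrinking
    return (threshold_pct - current_pct) / growth_pct_per_day


# An alert at 90% with 2% daily growth leaves four days before 98%.
lead_time = days_until(threshold_pct=98, current_pct=90, growth_pct_per_day=2)
print(f"lead time: {lead_time:.0f} days")
```

If the lead time for your cluster comes out shorter than the time needed to provision storage, lower the alert threshold rather than relying on a faster response.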

FE Metadata Directory (<10 GB free)

text

# PromQL
node_filesystem_avail_bytes{mountpoint="/path/to/fe/meta"} < 10737418240

When the FE metadata directory runs low on space, the BDB-JE (Berkeley DB Java Edition) replication layer cannot write journals, and the FE node will crash. Excessive metadata directory growth almost always indicates checkpoint failures—the FE is accumulating journal entries that should have been compacted into an image file. Check ${meta_dir}/bdb for abnormally large or numerous files.

Root Filesystem (<5 GB free)

text

# PromQL
node_filesystem_avail_bytes{mountpoint="/"} < 5368709120

Standard Linux housekeeping: check /var/log, /tmp, and core dump directories. StarRocks BE can generate large core files on crash—ensure core_pattern is configured to write to a dedicated volume, not the root filesystem.
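The byte values in the two free-space expressions above are 10 GiB and 5 GiB. A minimal sketch of the same check using only the Python standard library, useful in a pre-flight script before an upgrade or a large load; the function name is hypothetical and the path is whatever directory you want to check.

```python
import shutil

GIB = 1024 ** 3  # bytes per GiB


def free_space_below(path: str, min_free_gib: float) -> bool:
    """Return True when the filesystem holding `path` has less free space than the limit."""
    return shutil.disk_usage(path).free < min_free_gib * GIB


print(10 * GIB)  # threshold used for the FE metadata directory
print(5 * GIB)   # threshold used for the root filesystem
print(free_space_below("/", 5))
```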

Layer 2: Cluster Service Health

Unlike resource saturation alerts that develop over hours or days, service health alerts signal immediate problems. These should route to PagerDuty or your on-call channel with high urgency.

FE Service Down

text

# PromQL: alert when active FE nodes drop below the expected count (3 in this example; set it to your FE node count)
count(up{group="fe", job="$job_name"} == 1) < 3

An FE node going down affects query routing, metadata operations, and—if it is the leader—all DDL and catalog operations. Common causes and their fixes:

Cause	Symptom	Fix
Clock skew >5 seconds	"Environment invalid" error in FE log	Synchronize NTP across all nodes: `ntpdate -b pool.ntp.org`
Metadata disk full (<5 GB)	"DiskLimitException" in FE log	Expand metadata volume or clean old BDB journal files
Corrupted journal entry	Journal replay exception on startup	Add `metadata_journal_skip_bad_journal_ids=<id>` to `fe.conf`
Unknown operation type	FE cannot replay after downgrade	Back up meta, set `metadata_ignore_unknown_operation_type=true`, restart, create new image, then disable and restart

BE Service Down

text

# PromQL — alert when any BE node is reported dead
node_info{type="be_node_num", job="$job_name", state="dead"} > 0

A dead BE node means lost replicas, potential query failures on tablets that only existed on that node, and increased load on surviving nodes. Remediation depends on the crash cause:

Tablet metadata corruption: Add ignore_load_tablet_failure = true to be.conf and restart. The BE will skip corrupt tablets, which can then be repaired via replica rebuild.
Disk failure: Remove the failed disk from storage_root_path in be.conf and restart. Replicas on the failed disk will be rebuilt automatically from surviving copies.
Query-induced crash: If a specific SQL pattern crashes the BE, add it to the SQL blacklist: ADD SQLBLACKLIST "pattern". Then restart the BE and file a bug with the StarRocks community.

JVM Memory Exhaustion (FE)

text

# PromQL
sum(jvm_heap_size_bytes{type="used", job="$job_name"})
  / sum(jvm_heap_size_bytes{type="max", job="$job_name"}) * 100 > 90

FE nodes run on the JVM. When heap usage exceeds 90%, you are one large metadata operation away from an OutOfMemoryError and an FE crash. Investigation:

On StarRocks 3.2+, async-profiler memory profiles are stored in fe/log/proc_profile. These show exactly which objects are consuming heap.
Common causes: excessive UNION ALL queries generating large plan trees, or a metadata catalog with millions of tablets.
Emergency fix: restart the FE. Long-term fix: increase -Xmx in fe.conf or reduce the tablet count by using larger partitions.

Checkpoint Failure (FE Metadata)

text

# PromQL — normal operation triggers checkpoint at 50,000 logs
starrocks_fe_meta_log_count{job="$job_name", instance="$fe_leader"} > 100000

If the meta log count exceeds 100,000, the FE leader has failed to create a checkpoint image. This means recovery time after a restart will be extremely long (replaying all those journal entries). Search FE logs for "begin to generate new image" to check if checkpoints are running. If not, investigate disk I/O on the metadata volume—checkpoint writes compete with journal writes.

Compaction Health

Compaction is StarRocks' background process for merging small data files (rowsets) into larger ones. When compaction falls behind, query performance degrades because each scan must open more files, and eventually tablets hit the version limit and reject new writes.

Compaction Failures (>3/min):

text

# PromQL
increase(starrocks_be_engine_requests_total{status="failed", type="cumulative_compaction", job="$job_name"}[1m]) > 3

Search BE logs: grep -E 'compaction' be.INFO | grep failed. Common causes: corrupt data files, disk I/O errors, or OOM during compaction.

High Compaction Pressure Score (>100):

text

# PromQL
starrocks_fe_max_tablet_compaction_score{job="$job_name"} > 100

This means ingestion is producing rowsets faster than compaction can merge them. Mitigations:

Increase the ingestion interval to ≥5 seconds between batches.
Avoid high-concurrency DELETE operations—each DELETE creates a new version that compaction must merge.
Increase cumulative_compaction_num_threads_per_disk if CPU and I/O headroom exist.
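The first mitigation, spacing batches at least 5 seconds apart, is usually implemented on the client side by buffering rows. Here is a minimal sketch of such a buffer; the class name, the flush callback, and the injectable clock are all hypothetical, and the callback stands in for whatever load call your pipeline uses.

```python
import time
from typing import Any, Callable


class MicroBatcher:
    """Buffer rows and hand them to `flush_fn` at most once per `min_interval_s`."""

    def __init__(self, flush_fn: Callable[[list], Any],
                 min_interval_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic):
        self._flush_fn = flush_fn
        self._min_interval_s = min_interval_s
        self._clock = clock
        self._rows: list = []
        self._last_flush = clock()

    def add(self, row: Any) -> None:
        self._rows.append(row)
        # Only flush once the minimum interval has elapsed since the last batch.
        if self._clock() - self._last_flush >= self._min_interval_s:
            self.flush()

    def flush(self) -> None:
        if self._rows:
            self._flush_fn(self._rows)
            self._rows = []  # start a fresh list so the flushed batch is not mutated
        self._last_flush = self._clock()
```

Fewer, larger batches mean fewer rowsets per tablet, which is what reduces the merge work compaction has to do. A production version would also flush on a timer and on shutdown so that a quiet stream does not hold rows indefinitely.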

Excessive Tablet Versions (>700 rowsets):

text

# PromQL
starrocks_be_max_tablet_rowset_num{job="$job_name"} > 700

When a tablet exceeds 1000 versions (configurable via tablet_max_versions), new writes to that tablet are rejected. At 700, you have a window to act. Identify the problem tablets:

text

SELECT BE_ID, TABLET_ID, NUM_ROWSET
FROM information_schema.be_tablets
WHERE NUM_ROWSET > 700
ORDER BY NUM_ROWSET DESC;

If compaction cannot catch up, mark the problematic replica as BAD to trigger an automatic rebuild from a healthy replica:

text

ADMIN SET REPLICA STATUS PROPERTIES(
  "tablet_id" = "12345",
  "backend_id" = "10001",
  "status" = "bad"
);
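When several tablets are affected, writing these statements by hand is error-prone. The sketch below turns rows from the information_schema.be_tablets query above into statements of the same form; the function name is hypothetical, and the output is meant to be reviewed by a human before anything is executed, since marking the only healthy replica as bad would lose data.

```python
def bad_replica_statements(rows: list[tuple[int, int, int]], threshold: int = 700) -> list[str]:
    """Build ADMIN SET REPLICA STATUS statements for replicas above `threshold` rowsets.

    `rows` holds (BE_ID, TABLET_ID, NUM_ROWSET) tuples from the query result.
    """
    statements = []
    for be_id, tablet_id, num_rowset in sorted(rows, key=lambda r: r[2], reverse=True):
        if num_rowset > threshold:
            statements.append(
                "ADMIN SET REPLICA STATUS PROPERTIES("
                f'"tablet_id" = "{int(tablet_id)}", '
                f'"backend_id" = "{int(be_id)}", '
                '"status" = "bad");'
            )
    return statements


# Hypothetical query result: one replica over the threshold, one healthy.
for stmt in bad_replica_statements([(10001, 12345, 950), (10002, 12346, 120)]):
    print(stmt)
```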

Excessive FE Thread Count

text

# PromQL
starrocks_fe_thread_pool{job="$job_name"} > 3000

Often caused by UNION ALL queries that create parallel execution fragments per union branch. Mitigation: SET GLOBAL pipeline_dop = 8; to cap per-query parallelism, or increase thrift_server_max_worker_threads to 8192 if the hardware supports it.

Layer 3: Application Availability

Resource and service health alerts catch infrastructure problems. Application availability alerts catch what your users actually experience: failed queries, slow dashboards, and stalled data pipelines.

Query Error Rate

text

# PromQL — alert if more than 5% of queries are failing
rate(starrocks_fe_query_err{job="$job_name"}[5m])
  / rate(starrocks_fe_query_total{job="$job_name"}[5m]) * 100 > 5

Query errors fall into two buckets: user errors (syntax, permission, timeout) and internal errors (BE crash, memory limit exceeded, tablet unavailable). Filter on starrocks_fe_query_internal_err to isolate system-side failures that indicate a real problem versus application bugs.
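The error-rate expression is a ratio of two counter increases. A small sketch of that arithmetic, including the two edge cases a hand-rolled check has to handle: counter resets after an FE restart and windows with no queries at all. The function names and sample values are hypothetical.

```python
def counter_delta(before: float, after: float) -> float:
    """Increase of a cumulative counter; a drop is treated as a reset to zero."""
    return after if after < before else after - before


def error_rate_percent(err_delta: float, total_delta: float) -> float:
    """Share of failed queries in the window, as a percentage."""
    if total_delta <= 0:
        return 0.0  # no queries in the window, so nothing to alert on
    return err_delta / total_delta * 100.0


# Hypothetical counter samples at the start and end of a 5-minute window.
errors = counter_delta(before=100, after=130)
total = counter_delta(before=1000, after=1500)
print(f"query error rate: {error_rate_percent(errors, total):.1f}%")
```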

Query Latency (p99)

text

# PromQL
histogram_quantile(0.99, rate(starrocks_fe_query_latency_ms_bucket{job="$job_name"}[5m])) > 5000

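A p99 above 5 seconds (adjust for your SLA) means 1 in 100 queries is unacceptably slow. This expression assumes the latency metric is exported as histogram buckets; if your StarRocks version exposes precomputed quantile series instead, alert on the 99th-percentile series directly. Correlate with CPU, memory, and disk I/O alerts, because latency spikes rarely exist in isolation. Check if the slow queries are hitting unpartitioned tables, missing materialized views, or joining large fact tables without bucket pruning.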
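It helps to know how a bucket-based p99 is estimated, because the result is only as precise as the bucket boundaries. The sketch below mirrors the linear interpolation Prometheus applies to classic histograms; the function is a simplified re-implementation for illustration, and the bucket values are hypothetical.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile `q` from cumulative buckets.

    `buckets` is a list of (upper_bound, cumulative_count) sorted by upper
    bound, ending with the (+inf, total) bucket.
    """
    if not 0 < q <= 1 or len(buckets) < 2:
        raise ValueError("need 0 < q <= 1 and at least two buckets")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return buckets[-2][0]  # cap at the highest finite bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical latency buckets in milliseconds: (upper bound, cumulative count).
BUCKETS = [(100, 50), (500, 90), (1000, 98), (5000, 100), (math.inf, 100)]
print(histogram_quantile(0.99, BUCKETS))
```

With these buckets the p99 falls somewhere between 1000 ms and 5000 ms, and the estimate is an interpolated midpoint. If your alert threshold sits inside a wide bucket, the alert can fire or stay silent purely because of interpolation.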

Kafka Ingestion Lag

text

# PromQL
starrocks_be_routine_load_lag{job="$job_name"} > 100000

For Routine Load jobs consuming from Kafka, a lag exceeding 100K messages means your real-time analytics are no longer real-time. Check:

Is the Routine Load job in PAUSED or CANCELLED state? Run SHOW ROUTINE LOAD to check.
Are BE nodes under memory or CPU pressure, throttling ingestion?
Has the Kafka topic partition count changed without updating the Routine Load job?
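Lag is the sum, across partitions, of the gap between the newest offset in Kafka and the offset the job has consumed. The sketch below shows that arithmetic and also flags partitions the job is not tracking, which is how a changed partition count shows up. The function names and offsets are hypothetical, and counting an untracked partition from offset 0 is an assumption made for illustration.

```python
def total_lag(latest_offsets: dict[int, int], consumed_offsets: dict[int, int]) -> int:
    """Sum of per-partition lag; partitions with no consumed offset count from 0."""
    return sum(
        max(latest - consumed_offsets.get(partition, 0), 0)
        for partition, latest in latest_offsets.items()
    )


def untracked_partitions(latest_offsets: dict[int, int], consumed_offsets: dict[int, int]) -> list[int]:
    """Partitions that exist in the topic but have no consumed offset in the job."""
    return sorted(set(latest_offsets) - set(consumed_offsets))


# Hypothetical offsets: partition 2 was added to the topic after the job was created.
latest = {0: 1000, 1: 2000, 2: 500}
consumed = {0: 900, 1: 1500}
print(total_lag(latest, consumed), untracked_partitions(latest, consumed))
```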

Failed Transactions

text

# PromQL
increase(starrocks_fe_txn_failed{job="$job_name"}[5m]) > 10

Failed transactions often indicate publish timeouts (data is loaded but not yet visible) or tablet version overflow. Cross-reference with compaction pressure alerts—the two are frequently related.

Materialized View Refresh Failures

text

# PromQL
increase(starrocks_fe_mv_refresh_failed{job="$job_name"}[30m]) > 0

A failed MV refresh means downstream dashboards are serving stale data. Check if the base tables have changed schema, if the refresh query exceeds memory limits, or if there is a lock conflict with concurrent DDL operations.

Putting It Together: Grafana Dashboard Layout

We organize our StarRocks Grafana dashboards into four rows:

Row	Panels	Purpose
Cluster Overview	FE/BE node count, cluster version, uptime	Instant health check—green means all nodes are alive
Resource Saturation	CPU %, memory %, disk I/O %, disk capacity per BE	Trend lines showing pressure building over time
Service Health	JVM heap %, meta log count, compaction score, tablet versions	Internal system state that users never see directly
Application Metrics	QPS, p50/p99 latency, error rate, Kafka lag, active connections	What your end-users actually experience

Each panel should have an alert rule linked to the corresponding PromQL expression from this guide. Use Grafana's alert grouping to prevent notification storms: a disk full event will trigger disk capacity, compaction failure, and transaction failure alerts simultaneously. Group them so your on-call gets one page, not three.
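Grouping boils down to choosing a label set and collapsing every firing alert that shares it into one notification. A minimal sketch of that idea, not of Grafana's implementation; the alert names and the cluster label are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts: list[dict], keys: tuple[str, ...] = ("cluster",)) -> dict[tuple, list[str]]:
    """Collapse firing alerts into one group per distinct value of `keys`."""
    groups: dict[tuple, list[str]] = defaultdict(list)
    for alert in alerts:
        group_key = tuple(alert.get(key, "") for key in keys)
        groups[group_key].append(alert["alertname"])
    return dict(groups)


# Hypothetical alerts fired by a single disk-full event, plus one unrelated alert.
firing = [
    {"alertname": "DiskCapacityHigh", "cluster": "prod"},
    {"alertname": "CompactionFailed", "cluster": "prod"},
    {"alertname": "TxnFailed", "cluster": "prod"},
    {"alertname": "FEJvmHeapHigh", "cluster": "staging"},
]
for key, names in group_alerts(firing).items():
    print(key, names)
```

The choice of grouping labels is the design decision: grouping by cluster yields one page per incident, while grouping by instance would still page once per node during a cluster-wide event.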

Alert Routing Best Practices

Not all alerts deserve the same response. We route StarRocks alerts into three tiers:

Tier	Alerts	Channel	Response Time
P1 — Page	FE/BE down, query error rate >10%, disk full	PagerDuty / phone call	<15 minutes
P2 — Urgent	CPU >90%, memory >90%, JVM >90%, compaction backlog, tablet versions >700	Slack #starrocks-alerts	<1 hour
P3 — Warning	Disk >80%, Kafka lag >50K, MV refresh failure, checkpoint lag	Slack #starrocks-monitoring	Next business day

Important: Resist the urge to make everything P1. Alert fatigue is the number one reason monitoring systems get ignored. If your on-call is getting paged for P3 warnings, they will start ignoring P1 pages too.
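The tiers above can be encoded as a small lookup so that the routing decision lives in one reviewed place. This is a sketch only: the channels and response times come from the table, while the alert names are hypothetical placeholders for whatever names your rule files use.

```python
# Tier -> (channel, target response time), taken from the routing table.
TIERS = {
    "P1": ("PagerDuty / phone call", "<15 minutes"),
    "P2": ("Slack #starrocks-alerts", "<1 hour"),
    "P3": ("Slack #starrocks-monitoring", "Next business day"),
}

# Hypothetical alert names mapped to tiers.
ALERT_TIER = {
    "FEServiceDown": "P1",
    "BEServiceDown": "P1",
    "DiskFull": "P1",
    "BECpuHigh": "P2",
    "BEMemoryHigh": "P2",
    "FEJvmHeapHigh": "P2",
    "TabletVersionsHigh": "P2",
    "KafkaLagWarning": "P3",
    "MVRefreshFailed": "P3",
}


def route(alertname: str) -> tuple[str, str, str]:
    """Return (tier, channel, response time); unknown alerts never page by default."""
    tier = ALERT_TIER.get(alertname, "P3")
    channel, response_time = TIERS[tier]
    return tier, channel, response_time


print(route("BEServiceDown"))
print(route("SomeNewAlert"))
```

Defaulting unknown alerts to P3 means a newly added rule has to be promoted to a paging tier deliberately, which works against the alert fatigue described above.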

Key Takeaways

Monitor StarRocks across three layers: resource saturation (early warning), service health (immediate danger), and application availability (user impact).
Set resource alerts at 90% thresholds—this gives you a window to act before cascading failures begin.
Watch compaction health obsessively. Tablet version overflow is the most common cause of StarRocks write failures in production, and it is entirely preventable with monitoring.
Route alerts into severity tiers. Not every metric deserves a 2 AM phone call.
Use the specific PromQL expressions in this guide as your starting point, then tune thresholds based on your cluster's baseline behavior over 2–4 weeks.
Pair every alert with documented investigation steps. An alert without a runbook is just noise.

Working with JusDB on StarRocks Monitoring

Setting up StarRocks monitoring correctly—PromQL expressions, Grafana dashboards, alert routing, runbooks, and capacity planning—requires hands-on experience with StarRocks internals and production failure modes. Getting the compaction thresholds wrong means either false alarms every day or missed warnings that lead to write rejections at peak traffic.

JusDB's database engineering team has deployed and monitored StarRocks clusters for e-commerce analytics, real-time reporting, and data lakehouse architectures. We cover the full observability lifecycle: Prometheus configuration, Grafana dashboard design, alert rule tuning, PagerDuty integration, runbook creation, and capacity forecasting.

We also offer a StarRocks health check for teams running production clusters without monitoring—we audit your current Prometheus scrape config, identify missing alert rules, and deploy a complete Grafana dashboard in a single engagement.

Related Reading

Database Monitoring Best Practices Guide — general principles that apply across all database engines
Prometheus + Grafana for Database Observability — setting up the monitoring stack itself
Real-Time Analytics Architecture Guide — where StarRocks fits in the modern data stack
