Two months ago, a Series C e-commerce company called us at 2 AM. Their real-time analytics dashboard—powered by StarRocks—had gone dark during a flash sale. The culprit: a single BE node ran out of disk at 98% capacity, triggering a cascade that brought down compaction, stalled ingestion from Kafka, and left the executive team staring at a blank Grafana screen while $400K/hour in GMV rolled through untracked.
The cluster had been running without a single alert rule. Prometheus was scraping metrics, Grafana dashboards existed, but nobody had configured thresholds or notification channels. The disk had been climbing at 2% per day for six weeks. A 90% disk usage alert would have fired four days earlier—enough time to expand storage during business hours instead of scrambling at 2 AM.
This guide is the monitoring and alerting configuration we now deploy for every StarRocks cluster we manage. It covers the three layers that matter: resource saturation, cluster service health, and application availability.
- StarRocks monitoring splits into three domains: resource saturation (CPU, memory, disk I/O, disk space), cluster service health (FE/BE node status, JVM, compaction, metadata), and application availability (query errors, latency, ingestion lag).
- Resource saturation alerts catch slow-building problems days before they become outages. Set CPU and memory alerts at 90%, disk at 90%, and filesystem free space at 5–10 GB.
- Service health alerts fire on immediate dangers: dead nodes, JVM exhaustion, checkpoint failures, and compaction backlogs. These need PagerDuty-level routing.
- Every alert in this guide includes the PromQL expression, the threshold, and the specific investigation steps—not just "check the logs."
- A well-monitored StarRocks cluster catches problems early and fixes them before they turn into real incidents.
Why StarRocks Monitoring Is Different
StarRocks is an MPP (Massively Parallel Processing) analytical database with a split architecture: FE nodes handle query parsing, planning, metadata management, and catalog operations, while BE nodes handle storage, data ingestion, compaction, and query execution. Each layer has its own failure modes, its own resource bottlenecks, and its own metrics.
This means you cannot monitor StarRocks the way you monitor PostgreSQL or MySQL. A single-node database has one process to watch. StarRocks has a distributed system where an FE metadata checkpoint failure or a BE compaction backlog can silently degrade query performance for hours before a user notices. The monitoring must be layered.
The three layers we use:
- Resource Saturation — Host-level metrics that develop gradually. These are your early warning system.
- Cluster Service Health — Node availability, JVM pressure, compaction state, metadata integrity. These signal immediate service disruption.
- Application Availability — Query errors, latency, throughput, ingestion lag. These are what your users actually feel.
Layer 1: Resource Saturation
Resource saturation alerts are the foundation. They are slow-building, predictable, and—critically—actionable before they cause outages. If you only configure one layer of monitoring, make it this one.
CPU Utilization (BE Nodes)
Alert threshold: >90% sustained for 5 minutes
# PromQL
avg(rate(starrocks_be_cpu{mode!="idle", job="$job_name"}[5m])) by (instance) * 100 > 90
Investigation steps:
- Check running queries:
SHOW PROC '/current_queries'on the FE leader. - Review the audit log for resource-intensive statements—look for queries scanning large amounts of data without partition pruning.
- On the affected BE node, use
perf topto identify hot functions. Common culprits: hash join build, expression evaluation, or JSON parsing. - If a single query is responsible, capture its query profile with
EXPLAIN ANALYZEbefore killing it.
Emergency mitigation: If the node is unresponsive and cannot execute SHOW PROC, restart the BE process. StarRocks will automatically re-route in-flight queries to surviving nodes (assuming replication factor ≥2). Collect a stack trace with pstack before restarting for post-incident analysis.
Memory Usage (BE Process)
Alert threshold: >90% of configured mem_limit
# PromQL
starrocks_be_process_mem_bytes{job="$job_name"} / starrocks_be_mem_limit{job="$job_name"} * 100 > 90
Investigation steps:
- Check the memory tracker hierarchy at
http://<be_ip>:8040/mem_tracker. This shows memory consumption broken down by component: query pool, load, compaction, schema change, storage page cache. - In Grafana, review
starrocks_be_*_mem_bytespanels to identify which memory consumer is growing. - Common causes: large hash joins spilling to memory, excessive concurrent loads, or page cache growing unbounded. If page cache is the culprit, check
storage_page_cache_limitinbe.conf.
Machine-Level Memory (>90%)
This is separate from BE process memory. Other processes on the same host—node exporter, log agents, monitoring sidecars—can consume memory that the BE process then cannot use. Use standard node_exporter metrics:
# PromQL
(1 - node_memory_MemAvailable_bytes{job="$node_job"} / node_memory_MemTotal_bytes{job="$node_job"}) * 100 > 90
If machine memory is high but BE process memory is normal, investigate co-located services with ps aux --sort=-%mem | head -20.
Disk I/O Saturation
Alert threshold: >90% I/O utilization sustained for 5 minutes
# PromQL
rate(node_disk_io_time_seconds_total{device=~"sd.*|nvme.*", job="$node_job"}[5m]) * 100 > 90
Investigation steps:
- Run
iostat -xz 1 5on the affected node to determine if the bottleneck is reads (query scans) or writes (ingestion/compaction). - Check for concurrent operations: heavy ingestion + compaction + large analytical queries competing for the same disks.
- Review partition pruning—queries scanning all partitions when they should scan one will saturate disk I/O. Check query plans with
EXPLAIN. - If compaction is the primary I/O consumer, consider scheduling large compactions during off-peak hours or tuning
cumulative_compaction_num_threads_per_disk.
Disk Capacity
Alert threshold: >90% usage on data directories
# PromQL
(1 - node_filesystem_avail_bytes{mountpoint=~"/data.*", job="$node_job"}
/ node_filesystem_size_bytes{mountpoint=~"/data.*", job="$node_job"}) * 100 > 90
Key detail most teams miss: When you DROP a table in StarRocks, the data files are moved to a trash directory and retained for a configurable period (default: 1 day, controlled by trash_file_expire_time_sec in be.conf). This means disk space is not reclaimed immediately after a DROP. If you need immediate space recovery, use TRUNCATE TABLE instead, or manually clear the trash directory.
FE Metadata Directory (<10 GB free)
# PromQL
node_filesystem_avail_bytes{mountpoint="/path/to/fe/meta"} < 10737418240
When the FE metadata directory runs low on space, the BDB-JE (Berkeley DB Java Edition) replication layer cannot write journals, and the FE node will crash. Excessive metadata directory growth almost always indicates checkpoint failures—the FE is accumulating journal entries that should have been compacted into an image file. Check ${meta_dir}/bdb for abnormally large or numerous files.
Root Filesystem (<5 GB free)
# PromQL
node_filesystem_avail_bytes{mountpoint="/"} < 5368709120
Standard Linux housekeeping: check /var/log, /tmp, and core dump directories. StarRocks BE can generate large core files on crash—ensure core_pattern is configured to write to a dedicated volume, not the root filesystem.
Layer 2: Cluster Service Health
Unlike resource saturation alerts that develop over hours or days, service health alerts signal immediate problems. These should route to PagerDuty or your on-call channel with high urgency.
FE Service Down
# PromQL — alert when active FE nodes drop below expected count
count(up{group="fe", job="$job_name"} == 1) < 3
An FE node going down affects query routing, metadata operations, and—if it is the leader—all DDL and catalog operations. Common causes and their fixes:
| Cause | Symptom | Fix |
|---|---|---|
| Clock skew >5 seconds | "Environment invalid" error in FE log | Synchronize NTP across all nodes: ntpdate -b pool.ntp.org |
| Metadata disk full (<5 GB) | "DiskLimitException" in FE log | Expand metadata volume or clean old BDB journal files |
| Corrupted journal entry | Journal replay exception on startup | Add metadata_journal_skip_bad_journal_ids=<id> to fe.conf |
| Unknown operation type | FE cannot replay after downgrade | Back up meta, set metadata_ignore_unknown_operation_type=true, restart, create new image, then disable and restart |
BE Service Down
# PromQL — alert when any BE node is reported dead
node_info{type="be_node_num", job="$job_name", state="dead"} > 0
A dead BE node means lost replicas, potential query failures on tablets that only existed on that node, and increased load on surviving nodes. Remediation depends on the crash cause:
- Tablet metadata corruption: Add
ignore_load_tablet_failure = truetobe.confand restart. The BE will skip corrupt tablets, which can then be repaired via replica rebuild. - Disk failure: Remove the failed disk from
storage_root_pathinbe.confand restart. Replicas on the failed disk will be rebuilt automatically from surviving copies. - Query-induced crash: If a specific SQL pattern crashes the BE, add it to the SQL blacklist:
ADD SQLBLACKLIST "pattern". Then restart the BE and file a bug with the StarRocks community.
JVM Memory Exhaustion (FE)
# PromQL
sum(jvm_heap_size_bytes{type="used", job="$job_name"})
/ sum(jvm_heap_size_bytes{type="max", job="$job_name"}) * 100 > 90
FE nodes run on the JVM. When heap usage exceeds 90%, you are one large metadata operation away from an OutOfMemoryError and an FE crash. Investigation:
- On StarRocks 3.2+, async-profiler memory profiles are stored in
fe/log/proc_profile. These show exactly which objects are consuming heap. - Common causes: excessive
UNION ALLqueries generating large plan trees, or a metadata catalog with millions of tablets. - Emergency fix: restart the FE. Long-term fix: increase
-Xmxinfe.confor reduce the tablet count by using larger partitions.
Checkpoint Failure (FE Metadata)
# PromQL — normal operation triggers checkpoint at 50,000 logs
starrocks_fe_meta_log_count{job="$job_name", instance="$fe_leader"} > 100000
If the meta log count exceeds 100,000, the FE leader has failed to create a checkpoint image. This means recovery time after a restart will be extremely long (replaying all those journal entries). Search FE logs for "begin to generate new image" to check if checkpoints are running. If not, investigate disk I/O on the metadata volume—checkpoint writes compete with journal writes.
Compaction Health
Compaction is StarRocks' background process for merging small data files (rowsets) into larger ones. When compaction falls behind, query performance degrades because each scan must open more files, and eventually tablets hit the version limit and reject new writes.
Compaction Failures (≥3/min):
# PromQL
increase(starrocks_be_engine_requests_total{status="failed", type="cumulative_compaction", job="$job_name"}[1m]) > 3
Search BE logs: grep -E 'compaction' be.INFO | grep failed. Common causes: corrupt data files, disk I/O errors, or OOM during compaction.
High Compaction Pressure Score (>100):
# PromQL
starrocks_fe_max_tablet_compaction_score{job="$job_name"} > 100
This means ingestion is producing rowsets faster than compaction can merge them. Mitigations:
- Increase the ingestion interval to ≥5 seconds between batches.
- Avoid high-concurrency
DELETEoperations—each DELETE creates a new version that compaction must merge. - Increase
cumulative_compaction_num_threads_per_diskif CPU and I/O headroom exist.
Excessive Tablet Versions (>700 rowsets):
# PromQL
starrocks_be_max_tablet_rowset_num{job="$job_name"} > 700
When a tablet exceeds 1000 versions (configurable via tablet_max_versions), new writes to that tablet are rejected. At 700, you have a window to act. Identify the problem tablets:
SELECT BE_ID, TABLET_ID, NUM_ROWSET FROM information_schema.be_tablets WHERE NUM_ROWSET > 700 ORDER BY NUM_ROWSET DESC;
If compaction cannot catch up, mark the problematic replica as BAD to trigger an automatic rebuild from a healthy replica:
ADMIN SET REPLICA STATUS PROPERTIES( "tablet_id" = "12345", "backend_id" = "10001", "status" = "bad" );
Excessive FE Thread Count
# PromQL
starrocks_fe_thread_pool{job="$job_name"} > 3000
Often caused by UNION ALL queries that create parallel execution fragments per union branch. Mitigation: SET GLOBAL pipeline_dop = 8; to cap per-query parallelism, or increase thrift_server_max_worker_threads to 8192 if the hardware supports it.
Layer 3: Application Availability
Resource and service health alerts catch infrastructure problems. Application availability alerts catch what your users actually experience: failed queries, slow dashboards, and stalled data pipelines.
Query Error Rate
# PromQL — alert if more than 5% of queries are failing
rate(starrocks_fe_query_err{job="$job_name"}[5m])
/ rate(starrocks_fe_query_total{job="$job_name"}[5m]) * 100 > 5
Query errors fall into two buckets: user errors (syntax, permission, timeout) and internal errors (BE crash, memory limit exceeded, tablet unavailable). Filter on starrocks_fe_query_internal_err to isolate system-side failures that indicate a real problem versus application bugs.
Query Latency (p99)
# PromQL
histogram_quantile(0.99, rate(starrocks_fe_query_latency_ms_bucket{job="$job_name"}[5m])) > 5000
A p99 above 5 seconds (adjust for your SLA) means 1 in 100 queries is unacceptably slow. Correlate with CPU, memory, and disk I/O alerts—latency spikes rarely exist in isolation. Check if the slow queries are hitting unpartitioned tables, missing materialized views, or joining large fact tables without bucket pruning.
Kafka Ingestion Lag
# PromQL
starrocks_be_routine_load_lag{job="$job_name"} > 100000
For Routine Load jobs consuming from Kafka, a lag exceeding 100K messages means your real-time analytics are no longer real-time. Check:
- Is the Routine Load job in
PAUSEDorCANCELLEDstate? RunSHOW ROUTINE LOADto check. - Are BE nodes under memory or CPU pressure, throttling ingestion?
- Has the Kafka topic partition count changed without updating the Routine Load job?
Failed Transactions
# PromQL
increase(starrocks_fe_txn_failed{job="$job_name"}[5m]) > 10Failed transactions often indicate publish timeouts (data is loaded but not yet visible) or tablet version overflow. Cross-reference with compaction pressure alerts—the two are frequently related.
Materialized View Refresh Failures
# PromQL
increase(starrocks_fe_mv_refresh_failed{job="$job_name"}[30m]) > 0A failed MV refresh means downstream dashboards are serving stale data. Check if the base tables have changed schema, if the refresh query exceeds memory limits, or if there is a lock conflict with concurrent DDL operations.
Putting It Together: Grafana Dashboard Layout
We organize our StarRocks Grafana dashboards into four rows:
| Row | Panels | Purpose |
|---|---|---|
| Cluster Overview | FE/BE node count, cluster version, uptime | Instant health check—green means all nodes are alive |
| Resource Saturation | CPU %, memory %, disk I/O %, disk capacity per BE | Trend lines showing pressure building over time |
| Service Health | JVM heap %, meta log count, compaction score, tablet versions | Internal system state that users never see directly |
| Application Metrics | QPS, p50/p99 latency, error rate, Kafka lag, active connections | What your end-users actually experience |
Each panel should have an alert rule linked to the corresponding PromQL expression from this guide. Use Grafana's alert grouping to prevent notification storms—a disk full event will trigger disk capacity, compaction failure, and transaction failure alerts simultaneously. Group them so your on-call gets one page, not six.
Alert Routing Best Practices
Not all alerts deserve the same response. We route StarRocks alerts into three tiers:
| Tier | Alerts | Channel | Response Time |
|---|---|---|---|
| P1 — Page | FE/BE down, query error rate >10%, disk full | PagerDuty / phone call | <15 minutes |
| P2 — Urgent | CPU >90%, memory >90%, JVM >90%, compaction backlog, tablet versions >700 | Slack #starrocks-alerts | <1 hour |
| P3 — Warning | Disk >80%, Kafka lag >50K, MV refresh failure, checkpoint lag | Slack #starrocks-monitoring | Next business day |
- Monitor StarRocks across three layers: resource saturation (early warning), service health (immediate danger), and application availability (user impact).
- Set resource alerts at 90% thresholds—this gives you a window to act before cascading failures begin.
- Watch compaction health obsessively. Tablet version overflow is the most common cause of StarRocks write failures in production, and it is entirely preventable with monitoring.
- Route alerts into severity tiers. Not every metric deserves a 2 AM phone call.
- Use the specific PromQL expressions in this guide as your starting point, then tune thresholds based on your cluster's baseline behavior over 2–4 weeks.
- Pair every alert with documented investigation steps. An alert without a runbook is just noise.
Working with JusDB on StarRocks Monitoring
Setting up StarRocks monitoring correctly—PromQL expressions, Grafana dashboards, alert routing, runbooks, and capacity planning—requires hands-on experience with StarRocks internals and production failure modes. Getting the compaction thresholds wrong means either false alarms every day or missed warnings that lead to write rejections at peak traffic.
JusDB's database engineering team has deployed and monitored StarRocks clusters for e-commerce analytics, real-time reporting, and data lakehouse architectures. We cover the full observability lifecycle: Prometheus configuration, Grafana dashboard design, alert rule tuning, PagerDuty integration, runbook creation, and capacity forecasting.
Database Monitoring Services Talk to a Database Expert
We also offer a StarRocks health check for teams running production clusters without monitoring—we audit your current Prometheus scrape config, identify missing alert rules, and deploy a complete Grafana dashboard in a single engagement.
Related Reading
- Database Monitoring Best Practices Guide — general principles that apply across all database engines
- Prometheus + Grafana for Database Observability — setting up the monitoring stack itself
- Real-Time Analytics Architecture Guide — where StarRocks fits in the modern data stack