At 2:47 AM on a Tuesday, a MongoDB 7.0 cluster at a mid-sized fintech started throwing OOM kill signals. The mongod process had consumed 94% of available RAM on a 64 GB instance — memory that should have been headroom for working set and index caching. The on-call engineer restarted the process, latency recovered, and the incident was logged as "unexplained memory spike." Three nights later it happened again, same trajectory, same time window, same result. The root cause was not a runaway aggregation pipeline or a missing index: it was the Slot-Based Execution (SBE) plan cache growing unbounded due to a confirmed bug in MongoDB 7.0, silently accumulating plan cache entries until the process ran out of memory.
- MongoDB 7.0 introduced SBE (Slot-Based Execution engine) as the default query execution backend, and a confirmed bug causes its plan cache to grow unbounded under certain workloads.
- Identify the issue via db.serverStatus().metrics.query.planCache and process-level memory metrics. SBE plan cache entries may number in the tens of thousands with no eviction.
- Immediate relief: clear the plan cache with db.collection.getPlanCache().clear() or the planCacheClear command across all collections.
- Workaround: disable SBE by setting internalQueryFrameworkControl to "forceClassicEngine" to revert to the classic query engine whose plan cache respects size limits.
- Long-term fix: upgrade to a patched MongoDB release (7.0.4+ or 7.1+) where the SBE plan cache eviction is corrected.
- Monitor plan cache health continuously in production — this class of memory leak produces no obvious error in the logs until the OOM kill.
How MongoDB's Plan Cache Works
MongoDB maintains an in-memory plan cache per collection to avoid re-running the query planner on every execution of a query with the same shape. When a query arrives, the planner identifies its query shape — a normalized representation of the filter, sort, projection, and collection namespace — and checks whether a cached plan exists. If one does, the cached plan is used directly; if not, the planner generates candidate plans, trials them under the multi-plan selection mechanism, and writes the winner to the cache.
Classic Engine Plan Cache
Before SBE, all queries ran on the classic execution engine. Its plan cache is keyed by query shape and is bounded by entry count: MongoDB enforces a default limit of 5,000 plan cache entries per collection (the internalQueryCacheMaxEntriesPerCollection parameter). The separate planCacheSize parameter is a server-level memory cap that applies to the SBE plan cache, not the classic one. When the classic cache is full, the least-recently-used (LRU) entry is evicted. The classic cache stores PlanCacheEntry objects that contain the winning plan's stage tree, indexed field references, and a feedback counter used for plan invalidation.
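The bounded, LRU-evicting behavior described above can be modeled in a few lines. This is a toy sketch only: the class name, its methods, and the capacity of 2 are illustrative assumptions, not MongoDB internals.

```javascript
// Toy LRU cache keyed by query shape. Illustrative only: LruPlanCache and
// its methods are hypothetical names, not part of the MongoDB server.
class LruPlanCache {
  constructor(maxEntries) {
    this.maxEntries = maxEntries;
    this.entries = new Map(); // Map preserves insertion order, oldest first
  }

  // Return the cached plan for a shape and mark it as most recently used.
  get(shape) {
    if (!this.entries.has(shape)) return undefined;
    const plan = this.entries.get(shape);
    this.entries.delete(shape);
    this.entries.set(shape, plan);
    return plan;
  }

  // Store a plan, evicting the least-recently-used entry when the cache is full.
  put(shape, plan) {
    if (this.entries.has(shape)) {
      this.entries.delete(shape);
    } else if (this.entries.size >= this.maxEntries) {
      const oldestShape = this.entries.keys().next().value;
      this.entries.delete(oldestShape);
    }
    this.entries.set(shape, plan);
  }
}

const cache = new LruPlanCache(2);
cache.put("shapeA", "IXSCAN a_1");
cache.put("shapeB", "IXSCAN b_1");
cache.get("shapeA");             // shapeA becomes most recently used
cache.put("shapeC", "COLLSCAN"); // evicts shapeB, the least recently used
console.log([...cache.entries.keys()]); // [ 'shapeA', 'shapeC' ]
```

The point of the model is that the entry count never exceeds the configured maximum, which is the property the bug described later breaks.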
SBE (Slot-Based Execution) Plan Cache
MongoDB 7.0 made SBE the default execution engine for a broader set of query types, including most find and aggregate operations that previously ran on the classic engine. SBE compiles queries into a lower-level slot-machine IR (intermediate representation) that is closer to a physical execution plan, enabling vectorized execution and tighter integration with the storage layer. The SBE plan cache stores these compiled plan representations rather than logical stage trees, which makes them significantly larger per entry in memory footprint.
SBE plan cache entries are materially larger in memory than classic engine entries. A workload generating 10,000 distinct query shapes — common in applications that embed user IDs, timestamps, or tenant identifiers directly in query predicates rather than using parameterized shapes — can accumulate several gigabytes of plan cache data under the bug conditions.
The MongoDB 7.0 SBE Plan Cache Memory Leak Bug
The bug (tracked internally by MongoDB and referenced in SERVER-79206 and related tickets) manifests when the SBE plan cache does not correctly enforce eviction against the planCacheSize limit. Under certain conditions — particularly when SBE-compiled plans are invalidated due to index changes, collection drops, or chunk migrations in sharded environments — stale entries remain in the cache without being evicted. New shapes continue to be added while old invalidated entries accumulate, causing the cache to grow monotonically.
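To make the accumulation pattern concrete, here is a deliberately simplified model in which eviction only considers live entries and invalidated entries are never reclaimed. It is a sketch of the behavior described above under that assumption, not the server's actual code path; the function name and parameters are hypothetical.

```javascript
// Toy model of the failure mode: invalidated entries stay resident and are
// not counted by eviction. simulateLeak and its parameters are hypothetical.
function simulateLeak(maxActive, steps, invalidateEvery) {
  const active = []; // live shapes counted against the limit, oldest first
  let stale = 0;     // invalidated entries that remain in memory
  for (let i = 1; i <= steps; i++) {
    active.push("shape" + i);
    if (active.length > maxActive) active.shift(); // normal eviction path
    if (i % invalidateEvery === 0) {
      // Models an index change or chunk migration: entries are marked
      // invalid but never freed.
      stale += active.length;
      active.length = 0;
    }
  }
  return { active: active.length, stale, resident: active.length + stale };
}

// 1,000 new shapes against a limit of 100, with an invalidation every 250 shapes.
console.log(simulateLeak(100, 1000, 250));
```

In this model the live set never exceeds the limit, yet the resident total keeps rising with every invalidation event, which matches the monotonic growth pattern.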
The growth is gradual enough to be invisible on daily memory graphs but compounds over hours and days. Workloads with high query shape diversity are worst affected: multi-tenant SaaS applications routing per-tenant queries, time-series applications with timestamp ranges embedded in predicates, and analytics workloads with dynamic filter generation all produce high shape cardinality. The cache can reach tens of gigabytes before the mongod process exhausts available memory.
The mongod log does not emit a warning or error as SBE plan cache memory grows. There is no planCache eviction failure message. The first observable signal at the application layer is typically elevated query latency as the system begins paging, followed by the OOM kill. By that point the cache may have been growing for 12–72 hours.
Identifying the Issue: Diagnostics
Step 1: Check Plan Cache Metrics via serverStatus
The fastest first check is db.serverStatus(). The relevant section is nested under metrics.query.planCache:
db.adminCommand({ serverStatus: 1, repl: 0, connections: 0 }).metrics.query.planCache
The output will resemble:
{
"classicEngineSkippedDueToNotMatchingSbeCompatible": NumberLong(0),
"sbeNotUsedDueToNotMatchingSbeCompatible": NumberLong(0),
"totalQueryShapes": NumberLong(14872),
"totalSizeOfPlanCacheEntriesInBytes": NumberLong(3724189952),
"totalPlanCacheEntriesInvalidated": NumberLong(12091)
}
A totalQueryShapes value well above 5,000 combined with totalSizeOfPlanCacheEntriesInBytes in the gigabytes range is the fingerprint of the bug. In a healthy cluster, both values should be stable and bounded: typically below 5,000 shapes and a few hundred megabytes at most.
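The check above is easy to script. The following sketch operates on a plain object shaped like the sample output, so it can run anywhere; the field names are taken from that sample, while the function name and the default limits (5,000 shapes, 512 MiB) are assumptions chosen for illustration.

```javascript
// Classify a planCache metrics document as healthy or suspicious.
// assessPlanCache and the default limits are illustrative assumptions.
function assessPlanCache(pc, limits = { maxShapes: 5000, maxBytes: 512 * 1024 * 1024 }) {
  // Number() also handles values that arrive as 64-bit wrapper objects.
  const shapes = Number(pc.totalQueryShapes);
  const bytes = Number(pc.totalSizeOfPlanCacheEntriesInBytes);
  const reasons = [];
  if (shapes > limits.maxShapes) reasons.push("shapes");
  if (bytes > limits.maxBytes) reasons.push("bytes");
  return {
    shapes,
    sizeMB: Math.round(bytes / 1048576),
    suspicious: reasons.length > 0,
    reasons,
  };
}

// Values copied from the sample output above.
const sample = {
  totalQueryShapes: 14872,
  totalSizeOfPlanCacheEntriesInBytes: 3724189952,
};
console.log(assessPlanCache(sample));
```

Returning the list of tripped limits rather than a bare boolean makes it easier to tell a shape explosion from a small number of very large entries.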
Step 2: Inspect Per-Collection Plan Cache Entries
Drill into the collections contributing the most entries:
// List all plan cache entries for a collection
db.orders.getPlanCache().list()
// Count entries per collection across all collections in the DB
db.getCollectionNames().forEach(function(col) {
var count = db[col].getPlanCache().list().length;
if (count > 100) {
print(col + ": " + count + " plan cache entries");
}
});
If a single collection reports thousands of entries, that collection's query shapes are the source. Examine a sample of entries to understand whether the shapes are genuinely distinct or whether application code is embedding variable literal values in predicates that should be parameterized:
// Inspect shape details of cached plans
db.orders.getPlanCache().list().slice(0, 5).forEach(function(entry) {
printjson({
queryHash: entry.queryHash,
planCacheKey: entry.planCacheKey,
isActive: entry.isActive,
createdFromQuery: entry.createdFromQuery
});
});
Step 3: Correlate with Process Memory
Confirm that plan cache growth tracks overall process memory growth by comparing it with the serverStatus memory counters:
var ss = db.adminCommand({ serverStatus: 1 });
printjson({
resident_MB: ss.mem.resident,
virtual_MB: ss.mem.virtual,
planCacheBytes: ss.metrics.query.planCache.totalSizeOfPlanCacheEntriesInBytes,
connections: ss.connections.current
});
If planCacheBytes, converted to megabytes, is a significant fraction of resident_MB and both are climbing together without a corresponding increase in connections or active query count, the plan cache is the culprit.
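Because the plan cache figure is in bytes and resident memory is in megabytes, the comparison is easy to get wrong by hand. A minimal sketch of the correlation, assuming you have collected two or more snapshots with the fields printed above (the function name and the 10% connection tolerance are hypothetical):

```javascript
// Compare the first and last snapshot to see how much of the resident-memory
// growth is explained by plan cache growth. correlateGrowth is a hypothetical
// helper; samples are plain objects built from the fields printed above.
function correlateGrowth(samples, connTolerance = 0.1) {
  const first = samples[0];
  const last = samples[samples.length - 1];
  const toMB = (bytes) => bytes / 1048576;
  const cacheGrowthMB = toMB(last.planCacheBytes - first.planCacheBytes);
  const residentGrowthMB = last.residentMB - first.residentMB;
  const connChange = Math.abs(last.connections - first.connections) / first.connections;
  return {
    // Fraction of resident memory currently held by the plan cache.
    cacheShare: toMB(last.planCacheBytes) / last.residentMB,
    // Fraction of the resident growth accounted for by the plan cache.
    explainedGrowth: residentGrowthMB > 0 ? cacheGrowthMB / residentGrowthMB : 0,
    connectionsStable: connChange <= connTolerance,
  };
}

const snapshots = [
  { planCacheBytes: 1 * 1024 ** 3, residentMB: 20000, connections: 400 },
  { planCacheBytes: 3 * 1024 ** 3, residentMB: 22100, connections: 410 },
];
console.log(correlateGrowth(snapshots));
```

An explainedGrowth value close to 1 with stable connections points at the plan cache; a low value suggests looking elsewhere, such as the storage engine cache or connection growth.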
Immediate Remediation: Clearing the Plan Cache
Once confirmed, clear the plan cache to reclaim memory immediately. This is safe to run on a live production instance — it forces re-planning on next execution but does not affect correctness or data integrity:
// Clear plan cache for a specific collection
db.orders.getPlanCache().clear();
// Equivalent command form for a single collection (runs against the current database)
db.runCommand({ planCacheClear: "orders" });
// Iterate all databases and collections to clear everything
db.adminCommand({ listDatabases: 1 }).databases.forEach(function(dbInfo) {
var targetDb = db.getSiblingDB(dbInfo.name);
targetDb.getCollectionNames().forEach(function(col) {
targetDb[col].getPlanCache().clear();
print("Cleared: " + dbInfo.name + "." + col);
});
});
After clearing the plan cache, monitor db.serverStatus().metrics.query.planCache.totalSizeOfPlanCacheEntriesInBytes on a 5-minute interval. If the value begins climbing again immediately, the workload is actively regenerating a high volume of unique shapes and the workaround below (disabling SBE) should be applied before the next OOM event.
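To judge how urgent the regrowth is, the 5-minute samples can be turned into a growth rate and a rough time-to-threshold estimate. This is a linear extrapolation sketch; both function names and the 4 GiB budget in the example are assumptions, and real growth may not be linear.

```javascript
// Estimate plan cache regrowth after a clear. Both helpers are hypothetical;
// samples are { tsMs, bytes } objects ordered by time.
function regrowthRate(samples) {
  const first = samples[0];
  const last = samples[samples.length - 1];
  const minutes = (last.tsMs - first.tsMs) / 60000;
  return minutes > 0 ? (last.bytes - first.bytes) / minutes : 0; // bytes per minute
}

// Minutes until the cache reaches limitBytes at the current rate.
function minutesUntil(limitBytes, currentBytes, bytesPerMinute) {
  if (bytesPerMinute <= 0) return Infinity;
  return Math.max(0, (limitBytes - currentBytes) / bytesPerMinute);
}

const MB = 1048576;
const afterClear = [
  { tsMs: 0, bytes: 10 * MB },
  { tsMs: 5 * 60000, bytes: 60 * MB },
  { tsMs: 10 * 60000, bytes: 110 * MB },
];
const rate = regrowthRate(afterClear);
console.log("MB per minute:", rate / MB);
console.log("minutes until 4 GiB:", Math.round(minutesUntil(4096 * MB, 110 * MB, rate)));
```

If the estimate is measured in hours rather than days, applying the workaround should not wait for the next maintenance window.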
Workaround: Disabling SBE
Disabling SBE reverts query execution to the classic engine, whose plan cache is capped per collection by entry count and does not exhibit the unbounded growth behavior. This can be applied at runtime without a restart:
// Disable SBE at runtime (takes effect immediately for new queries)
db.adminCommand({
setParameter: 1,
internalQueryFrameworkControl: "forceClassicEngine"
});
// Verify the setting
db.adminCommand({ getParameter: 1, internalQueryFrameworkControl: 1 });
// { "internalQueryFrameworkControl": "forceClassicEngine", "ok": 1 }
To make the setting persistent across restarts, add it to mongod.conf:
setParameter:
  internalQueryFrameworkControl: forceClassicEngine
Disabling SBE will reduce query performance for workloads that benefited from SBE's vectorized execution, particularly aggregation pipelines with $lookup, $group, and window functions. Benchmark critical query paths after applying this workaround and plan to re-enable SBE once you have upgraded to a patched release.
Understanding planCacheSize and Its Limits
The planCacheSize parameter sets a server-wide cap on the memory used by the SBE plan cache; the classic cache is limited by entry count per collection instead. The value can be given as a percentage of physical memory or as an absolute size, and the documented default is 5% of RAM. Under the 7.0 SBE bug, this limit is not correctly enforced for SBE entries, which is why memory grows beyond the configured ceiling.
// Check current planCacheSize setting
db.adminCommand({ getParameter: 1, planCacheSize: 1 });
// Reduce planCacheSize as a defensive measure
// (the value is a string: an absolute size such as "100MB" or a percentage such as "2%")
db.adminCommand({
setParameter: 1,
planCacheSize: "100MB"
});
Even with a reduced planCacheSize, the SBE eviction bug means the cache may still grow past the configured ceiling. Treat planCacheSize tuning as a complement to, not a replacement for, either the SBE disable workaround or the version upgrade.
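One way to confirm that the ceiling is being ignored is to compute what the cap should be in bytes and compare it with the observed cache size. The sketch below assumes a percentage-style cap; the helper names and the 2% figure are hypothetical values for illustration, not defaults.

```javascript
// Convert a percentage-of-RAM cap into bytes and compare it with the observed
// plan cache size. capFromPercent and overshoot are hypothetical helpers.
function capFromPercent(totalRamBytes, percent) {
  return Math.floor((totalRamBytes * percent) / 100);
}

function overshoot(observedBytes, capBytes) {
  return { over: observedBytes > capBytes, ratio: observedBytes / capBytes };
}

const ramBytes = 64 * 1024 ** 3;           // the 64 GB instance from the incident
const capBytes = capFromPercent(ramBytes, 2); // assumed operator-configured 2% cap
console.log(overshoot(3724189952, capBytes));
```

A ratio persistently above 1 means the cache is larger than its configured cap, which is consistent with eviction not being enforced.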
Monitoring Plan Cache Health in Production
A plan cache monitoring routine should be part of your standard MongoDB operational runbook. Add the following to your monitoring infrastructure (Prometheus with mongodb_exporter, Datadog MongoDB integration, or a custom cron-based script):
// Monitoring script — run every 5 minutes via cron or monitoring agent
var ss = db.adminCommand({ serverStatus: 1, repl: 0, connections: 0 });
var pc = ss.metrics.query.planCache;
var report = {
ts: new Date().toISOString(),
totalShapes: pc.totalQueryShapes,
cacheSizeBytes: pc.totalSizeOfPlanCacheEntriesInBytes,
cacheSizeMB: Math.round(pc.totalSizeOfPlanCacheEntriesInBytes / 1048576),
invalidated: pc.totalPlanCacheEntriesInvalidated,
residentMB: ss.mem.resident
};
printjson(report);
// Alert threshold: cache exceeds 512 MB or more than 5000 shapes
if (report.cacheSizeMB > 512 || report.totalShapes > 5000) {
print("ALERT: Plan cache threshold exceeded — investigate SBE growth");
}
Define alert thresholds appropriate to your workload. As a starting point: alert if totalSizeOfPlanCacheEntriesInBytes exceeds 512 MB (the threshold used in the script above), or if totalQueryShapes grows by more than 500 new shapes within a 15-minute window (indicating a query shape explosion rather than normal steady-state caching).
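The rate-based threshold needs a little state, since it compares the latest sample with the oldest one still inside the window. A minimal sketch, assuming the monitoring script's reports are retained as an array ordered by time (the function name and the tsMs field are hypothetical):

```javascript
// Alert when more than maxNewShapes shapes appear within windowMs.
// shapeGrowthAlert is a hypothetical helper; samples are { tsMs, totalShapes }
// objects ordered from oldest to newest.
function shapeGrowthAlert(samples, windowMs = 15 * 60 * 1000, maxNewShapes = 500) {
  const latest = samples[samples.length - 1];
  // Oldest sample that still falls inside the window.
  const baseline = samples.find((s) => latest.tsMs - s.tsMs <= windowMs);
  const newShapes = latest.totalShapes - baseline.totalShapes;
  return { newShapes, alert: newShapes > maxNewShapes };
}

const minute = 60000;
const history = [
  { tsMs: 0 * minute, totalShapes: 4000 },
  { tsMs: 5 * minute, totalShapes: 4100 },
  { tsMs: 10 * minute, totalShapes: 4300 },
  { tsMs: 15 * minute, totalShapes: 4500 },
  { tsMs: 20 * minute, totalShapes: 4900 },
];
console.log(shapeGrowthAlert(history)); // { newShapes: 800, alert: true }
```

Comparing against a sliding baseline rather than an absolute count keeps the alert quiet for workloads that legitimately hold a large but stable set of shapes.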
Addressing Query Shape Explosion
Even after patching, high plan cache cardinality is a latent performance problem. The root architectural issue is application code constructing query predicates with embedded literal values rather than using driver-level variable bindings. MongoDB's query planner treats each unique set of literal constants as a distinct query shape if the constants appear in positions that affect plan selection — field value ranges used for index range scans are the most common offender.
// Problematic: each unique userId generates a new plan cache entry
// if userId appears in a compound index's leading field with high cardinality
db.events.find({ userId: "usr_a3f9c1", eventType: "purchase", ts: { $gte: ISODate("2024-01-01") } });
// Better: use $expr with $$NOW or pass values as variables where the driver supports it
// In practice, ensure compound indexes cover (eventType, ts) so userId is filtered
// post-index scan, reducing shape sensitivity to the userId literal
db.events.find({ eventType: "purchase", ts: { $gte: ISODate("2024-01-01") }, userId: "usr_a3f9c1" });
Audit collections with the highest plan cache entry counts and review whether indexes can be restructured so that high-cardinality per-user or per-tenant fields are not leading index components. This reduces both plan cache pressure and overall planner CPU cost.
Upgrading to a Fixed Release
The SBE plan cache eviction bug is addressed in MongoDB 7.0.4 and later in the 7.0.x series, and is not present in MongoDB 7.1+. The upgrade path from 7.0.x to 7.0.4+ is an in-place binary upgrade — no data format changes, no replica set reconfiguration required:
- Upgrade secondaries one at a time: replace the mongod binary, restart the secondary, confirm it reaches SECONDARY state and catches up in oplog.
- Step down the primary: rs.stepDown()
- Upgrade the former primary (now secondary) and restart.
- Confirm all members are on the patched version: use rs.status() to check member health, and run db.version() on each member to check the binary version.
- Re-enable SBE if you disabled it as a workaround: setParameter internalQueryFrameworkControl: "trySbeEngine"
After upgrading and re-enabling SBE, run your plan cache monitoring script for 24 hours before confirming the fix. The totalSizeOfPlanCacheEntriesInBytes metric should stabilize at a constant level rather than climbing monotonically. If it continues to grow after the upgrade, confirm that every member is actually running the patched binary. Also verify that internalQueryFrameworkControl is set to "trySbeEngine" and not still locked to "forceClassicEngine"; otherwise a flat metric only shows that the workaround is still active, not that the fix works.
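Checking versions across members is easy to automate once the version strings have been collected. The sketch below assumes you have gathered them into a plain object keyed by host; the helper names and host names are hypothetical, and the 7.0.4 minimum is the patched release named in this article.

```javascript
// Find replica set members still below the patched version.
// parseVersion, isAtLeast, and unpatchedMembers are hypothetical helpers.
function parseVersion(v) {
  // Drop any suffix such as "-rc0" and split into numeric parts.
  return v.split("-")[0].split(".").map(Number);
}

function isAtLeast(version, minimum) {
  const a = parseVersion(version);
  const b = parseVersion(minimum);
  for (let i = 0; i < 3; i++) {
    const x = a[i] || 0;
    const y = b[i] || 0;
    if (x !== y) return x > y;
  }
  return true; // all three parts are equal
}

function unpatchedMembers(memberVersions, minimum = "7.0.4") {
  return Object.entries(memberVersions)
    .filter(([, version]) => !isAtLeast(version, minimum))
    .map(([host]) => host);
}

// Hypothetical hosts and versions for illustration.
const members = {
  "db1.example.net:27017": "7.0.4",
  "db2.example.net:27017": "7.0.2",
  "db3.example.net:27017": "7.1.0",
};
console.log(unpatchedMembers(members)); // [ 'db2.example.net:27017' ]
```

Comparing numeric parts rather than raw strings avoids the classic mistake of treating "7.0.10" as older than "7.0.4".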
- MongoDB 7.0's SBE plan cache has a confirmed eviction bug that causes unbounded memory growth. It produces no log warnings and is only detectable via serverStatus().metrics.query.planCache metrics.
- Immediately clear the plan cache with db.collection.getPlanCache().clear() on affected instances to stop the OOM bleed.
- Disable SBE via setParameter internalQueryFrameworkControl: "forceClassicEngine" as a runtime workaround if the cache regrows after clearing.
- Instrument totalQueryShapes and totalSizeOfPlanCacheEntriesInBytes as first-class production alerts, and treat any unbounded growth as a P1 signal.
- Address query shape explosion at the application layer: avoid embedding high-cardinality literal values in leading index fields, and structure compound indexes so that selective low-cardinality fields lead.
- Upgrade to MongoDB 7.0.4+ or 7.1+ to permanently resolve the SBE eviction bug, then re-enable SBE to recover the query performance benefits.
Working with JusDB on MongoDB Performance and Reliability
JusDB manages MongoDB for engineering teams who need production-grade reliability without the operational overhead. Our DBAs handle version upgrades, plan cache monitoring, query shape audits, index strategy reviews, and 24/7 incident response — including OOM investigations like the one described in this post. When your on-call engineer should be sleeping instead of digging through serverStatus output at 3 AM, that is where we come in.