In production MongoDB deployments, the aggregation pipeline is where raw data transforms into business intelligence — and where poorly written queries quietly consume CPU, RAM, and patience. Engineers who treat the pipeline as a simple map-reduce shortcut routinely ship queries that run fine on 10,000 documents and collapse at 10 million. This post is a practitioner's guide to the stages, patterns, and index strategies that separate pipeline code you're proud of from pipeline code you inherit and fear. Whether you're building analytics dashboards, generating reports, or materializing views, the patterns here apply directly to production workloads. By the end you'll have the mental model to write pipelines that stay fast as your collections grow.
- Place
$matchas early as possible to reduce document flow through later stages. - Use
$lookupwith a sub-pipeline for filtered, index-friendly joins. - Use
$facetto run multiple aggregations in a single round-trip. - Use
$bucketfor histogram-style analytics on numeric fields. - Always run
explain("executionStats")on aggregation pipelines before deploying to production. - Set
allowDiskUse: truewhen pipelines process more than 100 MB in memory. $mergelets you write pipeline output to a collection, enabling materialized views.
Pipeline Fundamentals
The aggregation pipeline is a sequence of stages. Each stage receives a stream of documents, transforms them, and passes the result downstream. MongoDB executes stages in order, and the order matters enormously for performance. The optimizer can reorder some stages automatically — pushing $match and $limit earlier — but it cannot fix a fundamentally backwards pipeline design.
The single most important rule is: reduce document count and field size as early as possible. Every document that passes a stage costs CPU and memory. A $match that filters 90% of a collection at the start of a pipeline is doing more work for your query than any index trick later in the chain.
// Anti-pattern: $group first, then $match
db.orders.aggregate([
{ $group: { _id: "$customerId", total: { $sum: "$amount" } } },
{ $match: { total: { $gt: 1000 } } }
]);
// Correct: $match first to reduce input to $group
db.orders.aggregate([
{ $match: { status: "completed", createdAt: { $gte: ISODate("2025-01-01") } } },
{ $group: { _id: "$customerId", total: { $sum: "$amount" } } },
{ $match: { total: { $gt: 1000 } } }
]);The second pipeline filters on indexed fields before the expensive $group, dramatically reducing the number of documents MongoDB has to accumulate in memory.
Key Stages in Depth
$group with $sum, $avg, and $push
$group is the workhorse of analytical pipelines. It collapses many documents into fewer grouped documents, accumulating values along the way.
db.orders.aggregate([
{ $match: { status: "completed" } },
{
$group: {
_id: { customerId: "$customerId", month: { $month: "$createdAt" } },
orderCount: { $sum: 1 },
totalRevenue: { $sum: "$amount" },
avgOrderValue: { $avg: "$amount" },
productIds: { $push: "$productId" }
}
},
{ $sort: { totalRevenue: -1 } }
]);$push collects values into an array. Use it carefully — on high-cardinality groups, the accumulated arrays can grow very large and hit the 16 MB document size limit. When you only need unique values, prefer $addToSet.
$lookup with Sub-Pipeline
The basic $lookup joins on equality. The sub-pipeline form is more powerful: it lets you filter the joined collection before the join occurs, apply expressions, and reference pipeline variables.
db.orders.aggregate([
{ $match: { status: "completed" } },
{
$lookup: {
from: "customers",
let: { custId: "$customerId", orderDate: "$createdAt" },
pipeline: [
{
$match: {
$expr: {
$and: [
{ $eq: ["$_id", "$$custId"] },
{ $eq: ["$active", true] }
]
}
}
},
{ $project: { name: 1, email: 1, tier: 1 } }
],
as: "customer"
}
},
{ $unwind: { path: "$customer", preserveNullAndEmptyArrays: true } }
]);$lookup on an unindexed field in the foreign collection will trigger a full collection scan for every input document. If you're joining 50,000 orders to a customers collection with no index on _id, MongoDB will scan the entire customers collection 50,000 times. Always ensure the field referenced in the $lookup localField/foreignField (or $expr equality) has a supporting index on the foreign collection.
$unwind with preserveNullAndEmptyArrays
$unwind deconstructs an array field, emitting one document per array element. By default, documents where the array is null, missing, or empty are dropped entirely — a subtle bug source when you're doing outer-join style lookups.
// preserveNullAndEmptyArrays keeps documents even when the array is empty
{ $unwind: { path: "$customer", preserveNullAndEmptyArrays: true } }This is especially important after a $lookup that might return no matches — without the flag you'll silently lose input documents that had no join match.
$addFields for Computed Columns
$addFields adds new fields to documents without removing existing ones. It's the pipeline equivalent of a computed column.
db.orders.aggregate([
{ $match: { status: "completed" } },
{
$addFields: {
marginAmount: { $subtract: ["$amount", "$cost"] },
marginPct: {
$multiply: [
{ $divide: [{ $subtract: ["$amount", "$cost"] }, "$amount"] },
100
]
},
isHighValue: { $gt: ["$amount", 500] }
}
}
]);Advanced Patterns
$facet for Multi-Dimensional Aggregation
$facet runs multiple independent sub-pipelines on the same input documents in a single aggregation call. This is the pattern behind faceted search (think e-commerce filter panels) and multi-metric dashboard queries.
db.products.aggregate([
{ $match: { inStock: true } },
{
$facet: {
byCategory: [
{ $group: { _id: "$category", count: { $sum: 1 } } },
{ $sort: { count: -1 } }
],
byPriceRange: [
{
$bucket: {
groupBy: "$price",
boundaries: [0, 25, 50, 100, 250, 500],
default: "500+",
output: { count: { $sum: 1 }, avgPrice: { $avg: "$price" } }
}
}
],
summary: [
{
$group: {
_id: null,
totalProducts: { $sum: 1 },
avgPrice: { $avg: "$price" },
maxPrice: { $max: "$price" }
}
}
]
}
}
]);The output is a single document with three fields — byCategory, byPriceRange, and summary — each containing the results of their respective sub-pipelines. One round-trip, three analytical results.
Each $facet sub-pipeline starts from the same set of documents that entered the $facet stage. Place your heaviest filtering $match stages before $facet, not inside each sub-pipeline, to avoid duplicating filter work.
$bucket for Histogram Analytics
$bucket distributes documents into user-defined ranges. It's ideal for age distributions, price histograms, response time analysis, and any other numeric grouping.
db.users.aggregate([
{ $match: { active: true } },
{
$bucket: {
groupBy: "$age",
boundaries: [18, 25, 35, 45, 55, 65],
default: "65+",
output: {
count: { $sum: 1 },
avgLifetimeValue: { $avg: "$lifetimeValue" }
}
}
}
]);$bucketAuto is the sibling operator that automatically determines bucket boundaries to distribute documents evenly — useful for exploratory analysis when you don't know your data distribution.
$merge for Materialized Views
$merge writes pipeline output into a target collection, supporting upsert, replace, merge, and fail strategies. This is the primitive for building materialized views that refresh on a schedule.
db.orders.aggregate([
{ $match: { createdAt: { $gte: ISODate("2025-01-01") } } },
{
$group: {
_id: { customerId: "$customerId", month: { $dateToString: { format: "%Y-%m", date: "$createdAt" } } },
totalRevenue: { $sum: "$amount" },
orderCount: { $sum: 1 }
}
},
{
$merge: {
into: "monthly_customer_revenue",
on: "_id",
whenMatched: "replace",
whenNotMatched: "insert"
}
}
]);Run this in a scheduled job (a cron task or a MongoDB Atlas Scheduled Trigger) and your monthly_customer_revenue collection becomes a fast, pre-aggregated table your application queries directly — without hitting the raw orders collection at all.
Performance Optimization
explain("executionStats") for Pipelines
The aggregation pipeline has its own explain output. Always inspect it before deploying a new pipeline to production.
db.orders.explain("executionStats").aggregate([
{ $match: { status: "completed", createdAt: { $gte: ISODate("2025-01-01") } } },
{ $group: { _id: "$customerId", total: { $sum: "$amount" } } }
]);Look for these signals in the explain output:
- COLLSCAN in the first stage: your leading
$matchis not using an index. Add a compound index on the fields used in the match. - High
nReturnedvs lowkeysExamined: index is being used but may not be selective enough. usedDisk: true: the pipeline spilled to disk. This is a warning sign, not necessarily a failure — addallowDiskUse: trueto allow it, then investigate why memory limits are being hit.
allowDiskUse for Large Aggregations
By default, each pipeline stage is limited to 100 MB of RAM. When you're grouping or sorting millions of documents, this limit is easily exceeded.
db.orders.aggregate(
[
{ $match: { status: "completed" } },
{ $group: { _id: "$productId", totalSold: { $sum: "$quantity" } } },
{ $sort: { totalSold: -1 } }
],
{ allowDiskUse: true }
);allowDiskUse is not a performance fix — it's a safety valve. Pipelines that spill to disk are significantly slower than those that run in memory. Use it to prevent failures on large datasets, but treat it as a signal to investigate your pipeline design, index coverage, and whether $project or early filtering can reduce the working set size.
Index Usage in Aggregation Pipelines
Indexes apply only to the first stage of an aggregation pipeline (or to a $match immediately after a $geoNear). After the first stage, documents are in memory as pipeline buffers — indexes on downstream fields have no effect.
This means your compound index strategy must be designed around the leading $match stage. For a pipeline that filters on status and createdAt and then sorts by amount, an index on { status: 1, createdAt: -1, amount: -1 } can cover the match and provide a pre-sorted stream to the sort stage, eliminating an in-memory sort entirely.
// Create a compound index to support this common pipeline pattern
db.orders.createIndex(
{ status: 1, createdAt: -1 },
{ name: "idx_orders_status_created" }
);
// Pipeline benefits from the index on the leading $match
db.orders.aggregate([
{ $match: { status: "completed", createdAt: { $gte: ISODate("2025-01-01") } } },
{ $group: { _id: "$customerId", total: { $sum: "$amount" } } },
{ $sort: { total: -1 } },
{ $limit: 100 }
]);For $lookup joins, indexes on the foreign collection's join field are critical. MongoDB uses the index to look up matching documents in the foreign collection for each input document. Without the index, each lookup triggers a full collection scan.
// Ensure the foreign collection has an index on the join field
db.customers.createIndex({ _id: 1 }); // Usually exists by default
db.customers.createIndex({ externalId: 1 }); // Custom join fields need explicit indexes
// Now $lookup on externalId is index-backed
db.orders.aggregate([
{
$lookup: {
from: "customers",
localField: "customerId",
foreignField: "externalId",
as: "customer"
}
}
]);Use db.collection.aggregate([...]).explain("executionStats") on the foreign collection's lookup query independently to verify index usage. The overall pipeline explain output can sometimes obscure whether the lookup sub-scan is index-backed.
Key Takeaways
- Place
$matchwith indexed fields at the very start of every pipeline. Document reduction early is the highest-leverage performance action available. - Use the sub-pipeline form of
$lookupto filter joined documents before the join — and always index the foreign join field. $faceteliminates round-trips by running multiple aggregations over the same input in a single call. Filter before$facet, not inside each sub-pipeline.$bucketand$bucketAutoprovide histogram-style grouping for numeric distributions without application-side post-processing.$unwindwithpreserveNullAndEmptyArrays: trueprevents silent document loss when arrays are missing or empty after a$lookup.$mergeenables scheduled materialized views — write aggregation output to a collection and query that instead of the raw data at runtime.- Always run
explain("executionStats")before deploying pipelines. Look for COLLSCAN, high document counts mid-pipeline, andusedDisk: true. allowDiskUse: trueprevents OOM failures on large aggregations but is a diagnostic signal, not a solution.
Run Aggregation Pipelines on JusDB
JusDB gives you a managed MongoDB environment where aggregation pipelines get the full power of dedicated compute without shared-tenant resource contention. Built-in performance monitoring surfaces slow pipelines automatically, and the query advisor highlights missing indexes and suggests compound index candidates for your most common aggregation patterns.
Whether you're running real-time dashboards with $facet, nightly ETL with $merge, or ad-hoc analytics on hundreds of millions of documents, JusDB handles the infrastructure so your team can focus on the pipeline logic that drives your product. Explore MongoDB on JusDB and see how managed database infrastructure changes what your team can ship.