MongoDB Aggregation Pipeline

In production MongoDB deployments, the aggregation pipeline is where raw data transforms into business intelligence — and where poorly written queries quietly consume CPU, RAM, and patience. Engineers who treat the pipeline as a simple map-reduce shortcut routinely ship queries that run fine on 10,000 documents and collapse at 10 million. This post is a practitioner's guide to the stages, patterns, and index strategies that separate pipeline code you're proud of from pipeline code you inherit and fear. Whether you're building analytics dashboards, generating reports, or materializing views, the patterns here apply directly to production workloads. By the end you'll have the mental model to write pipelines that stay fast as your collections grow.

TL;DR

Place $match as early as possible to reduce document flow through later stages.
Use $lookup with a sub-pipeline for filtered, index-friendly joins.
Use $facet to run multiple aggregations in a single round-trip.
Use $bucket for histogram-style analytics on numeric fields.
Always run explain("executionStats") on aggregation pipelines before deploying to production.
Set allowDiskUse: true when pipelines process more than 100 MB in memory.
$merge lets you write pipeline output to a collection, enabling materialized views.

Pipeline Fundamentals

The aggregation pipeline is a sequence of stages. Each stage receives a stream of documents, transforms them, and passes the result downstream. MongoDB executes stages in order, and the order matters enormously for performance. The optimizer can reorder some stages automatically — pushing $match and $limit earlier — but it cannot fix a fundamentally backwards pipeline design.

The single most important rule is: reduce document count and field size as early as possible. Every document that passes a stage costs CPU and memory. A $match that filters 90% of a collection at the start of a pipeline is doing more work for your query than any index trick later in the chain.

javascript

// Anti-pattern: $group first, then $match
db.orders.aggregate([
  { $group: { _id: "$customerId", total: { $sum: "$amount" } } },
  { $match: { total: { $gt: 1000 } } }
]);

// Correct: $match first to reduce input to $group
db.orders.aggregate([
  { $match: { status: "completed", createdAt: { $gte: ISODate("2025-01-01") } } },
  { $group: { _id: "$customerId", total: { $sum: "$amount" } } },
  { $match: { total: { $gt: 1000 } } }
]);

The second pipeline filters on indexed fields before the expensive $group, dramatically reducing the number of documents MongoDB has to accumulate in memory.

Key Stages in Depth

$group with $sum, $avg, and $push

$group is the workhorse of analytical pipelines. It collapses many documents into fewer grouped documents, accumulating values along the way.

javascript

db.orders.aggregate([
  { $match: { status: "completed" } },
  {
    $group: {
      _id: { customerId: "$customerId", month: { $month: "$createdAt" } },
      orderCount: { $sum: 1 },
      totalRevenue: { $sum: "$amount" },
      avgOrderValue: { $avg: "$amount" },
      productIds: { $push: "$productId" }
    }
  },
  { $sort: { totalRevenue: -1 } }
]);

$push collects values into an array. Use it carefully — on high-cardinality groups, the accumulated arrays can grow very large and hit the 16 MB document size limit. When you only need unique values, prefer $addToSet.

$lookup with Sub-Pipeline

The basic $lookup joins on equality. The sub-pipeline form is more powerful: it lets you filter the joined collection before the join occurs, apply expressions, and reference pipeline variables.

javascript

db.orders.aggregate([
  { $match: { status: "completed" } },
  {
    $lookup: {
      from: "customers",
      let: { custId: "$customerId", orderDate: "$createdAt" },
      pipeline: [
        {
          $match: {
            $expr: {
              $and: [
                { $eq: ["$_id", "$$custId"] },
                { $eq: ["$active", true] }
              ]
            }
          }
        },
        { $project: { name: 1, email: 1, tier: 1 } }
      ],
      as: "customer"
    }
  },
  { $unwind: { path: "$customer", preserveNullAndEmptyArrays: true } }
]);

Warning

$lookup on an unindexed field in the foreign collection will trigger a full collection scan for every input document. If you're joining 50,000 orders to a customers collection with no index on _id, MongoDB will scan the entire customers collection 50,000 times. Always ensure the field referenced in the $lookup localField/foreignField (or $expr equality) has a supporting index on the foreign collection.

$unwind with preserveNullAndEmptyArrays

$unwind deconstructs an array field, emitting one document per array element. By default, documents where the array is null, missing, or empty are dropped entirely — a subtle bug source when you're doing outer-join style lookups.

javascript

// preserveNullAndEmptyArrays keeps documents even when the array is empty
{ $unwind: { path: "$customer", preserveNullAndEmptyArrays: true } }

This is especially important after a $lookup that might return no matches — without the flag you'll silently lose input documents that had no join match.

$addFields for Computed Columns

$addFields adds new fields to documents without removing existing ones. It's the pipeline equivalent of a computed column.

javascript

db.orders.aggregate([
  { $match: { status: "completed" } },
  {
    $addFields: {
      marginAmount: { $subtract: ["$amount", "$cost"] },
      marginPct: {
        $multiply: [
          { $divide: [{ $subtract: ["$amount", "$cost"] }, "$amount"] },
          100
        ]
      },
      isHighValue: { $gt: ["$amount", 500] }
    }
  }
]);

Advanced Patterns

$facet runs multiple independent sub-pipelines on the same input documents in a single aggregation call. This is the pattern behind faceted search (think e-commerce filter panels) and multi-metric dashboard queries.

javascript

db.products.aggregate([
  { $match: { inStock: true } },
  {
    $facet: {
      byCategory: [
        { $group: { _id: "$category", count: { $sum: 1 } } },
        { $sort: { count: -1 } }
      ],
      byPriceRange: [
        {
          $bucket: {
            groupBy: "$price",
            boundaries: [0, 25, 50, 100, 250, 500],
            default: "500+",
            output: { count: { $sum: 1 }, avgPrice: { $avg: "$price" } }
          }
        }
      ],
      summary: [
        {
          $group: {
            _id: null,
            totalProducts: { $sum: 1 },
            avgPrice: { $avg: "$price" },
            maxPrice: { $max: "$price" }
          }
        }
      ]
    }
  }
]);

The output is a single document with three fields — byCategory, byPriceRange, and summary — each containing the results of their respective sub-pipelines. One round-trip, three analytical results.

Tip

Each $facet sub-pipeline starts from the same set of documents that entered the $facet stage. Place your heaviest filtering $match stages before $facet, not inside each sub-pipeline, to avoid duplicating filter work.

$bucket for Histogram Analytics

$bucket distributes documents into user-defined ranges. It's ideal for age distributions, price histograms, response time analysis, and any other numeric grouping.

javascript

db.users.aggregate([
  { $match: { active: true } },
  {
    $bucket: {
      groupBy: "$age",
      boundaries: [18, 25, 35, 45, 55, 65],
      default: "65+",
      output: {
        count: { $sum: 1 },
        avgLifetimeValue: { $avg: "$lifetimeValue" }
      }
    }
  }
]);

$bucketAuto is the sibling operator that automatically determines bucket boundaries to distribute documents evenly — useful for exploratory analysis when you don't know your data distribution.

$merge for Materialized Views

$merge writes pipeline output into a target collection, supporting upsert, replace, merge, and fail strategies. This is the primitive for building materialized views that refresh on a schedule.

javascript

db.orders.aggregate([
  { $match: { createdAt: { $gte: ISODate("2025-01-01") } } },
  {
    $group: {
      _id: { customerId: "$customerId", month: { $dateToString: { format: "%Y-%m", date: "$createdAt" } } },
      totalRevenue: { $sum: "$amount" },
      orderCount: { $sum: 1 }
    }
  },
  {
    $merge: {
      into: "monthly_customer_revenue",
      on: "_id",
      whenMatched: "replace",
      whenNotMatched: "insert"
    }
  }
]);

Run this in a scheduled job (a cron task or a MongoDB Atlas Scheduled Trigger) and your monthly_customer_revenue collection becomes a fast, pre-aggregated table your application queries directly — without hitting the raw orders collection at all.

Performance Optimization

explain("executionStats") for Pipelines

The aggregation pipeline has its own explain output. Always inspect it before deploying a new pipeline to production.

javascript

db.orders.explain("executionStats").aggregate([
  { $match: { status: "completed", createdAt: { $gte: ISODate("2025-01-01") } } },
  { $group: { _id: "$customerId", total: { $sum: "$amount" } } }
]);

Look for these signals in the explain output:

COLLSCAN in the first stage: your leading $match is not using an index. Add a compound index on the fields used in the match.
High nReturned vs low keysExamined: index is being used but may not be selective enough.
usedDisk: true: the pipeline spilled to disk. This is a warning sign, not necessarily a failure — add allowDiskUse: true to allow it, then investigate why memory limits are being hit.

allowDiskUse for Large Aggregations

By default, each pipeline stage is limited to 100 MB of RAM. When you're grouping or sorting millions of documents, this limit is easily exceeded.

javascript

db.orders.aggregate(
  [
    { $match: { status: "completed" } },
    { $group: { _id: "$productId", totalSold: { $sum: "$quantity" } } },
    { $sort: { totalSold: -1 } }
  ],
  { allowDiskUse: true }
);

Warning

allowDiskUse is not a performance fix — it's a safety valve. Pipelines that spill to disk are significantly slower than those that run in memory. Use it to prevent failures on large datasets, but treat it as a signal to investigate your pipeline design, index coverage, and whether $project or early filtering can reduce the working set size.

Index Usage in Aggregation Pipelines

Indexes apply only to the first stage of an aggregation pipeline (or to a $match immediately after a $geoNear). After the first stage, documents are in memory as pipeline buffers — indexes on downstream fields have no effect.

This means your compound index strategy must be designed around the leading $match stage. For a pipeline that filters on status and createdAt and then sorts by amount, an index on { status: 1, createdAt: -1, amount: -1 } can cover the match and provide a pre-sorted stream to the sort stage, eliminating an in-memory sort entirely.

javascript

// Create a compound index to support this common pipeline pattern
db.orders.createIndex(
  { status: 1, createdAt: -1 },
  { name: "idx_orders_status_created" }
);

// Pipeline benefits from the index on the leading $match
db.orders.aggregate([
  { $match: { status: "completed", createdAt: { $gte: ISODate("2025-01-01") } } },
  { $group: { _id: "$customerId", total: { $sum: "$amount" } } },
  { $sort: { total: -1 } },
  { $limit: 100 }
]);

For $lookup joins, indexes on the foreign collection's join field are critical. MongoDB uses the index to look up matching documents in the foreign collection for each input document. Without the index, each lookup triggers a full collection scan.

javascript

// Ensure the foreign collection has an index on the join field
db.customers.createIndex({ _id: 1 });           // Usually exists by default
db.customers.createIndex({ externalId: 1 });    // Custom join fields need explicit indexes

// Now $lookup on externalId is index-backed
db.orders.aggregate([
  {
    $lookup: {
      from: "customers",
      localField: "customerId",
      foreignField: "externalId",
      as: "customer"
    }
  }
]);

Tip

Use db.collection.aggregate([...]).explain("executionStats") on the foreign collection's lookup query independently to verify index usage. The overall pipeline explain output can sometimes obscure whether the lookup sub-scan is index-backed.

Key Takeaways

Key Takeaways

Place $match with indexed fields at the very start of every pipeline. Document reduction early is the highest-leverage performance action available.
Use the sub-pipeline form of $lookup to filter joined documents before the join — and always index the foreign join field.
$facet eliminates round-trips by running multiple aggregations over the same input in a single call. Filter before $facet, not inside each sub-pipeline.
$bucket and $bucketAuto provide histogram-style grouping for numeric distributions without application-side post-processing.
$unwind with preserveNullAndEmptyArrays: true prevents silent document loss when arrays are missing or empty after a $lookup.
$merge enables scheduled materialized views — write aggregation output to a collection and query that instead of the raw data at runtime.
Always run explain("executionStats") before deploying pipelines. Look for COLLSCAN, high document counts mid-pipeline, and usedDisk: true.
allowDiskUse: true prevents OOM failures on large aggregations but is a diagnostic signal, not a solution.

Run Aggregation Pipelines on JusDB

JusDB gives you a managed MongoDB environment where aggregation pipelines get the full power of dedicated compute without shared-tenant resource contention. Built-in performance monitoring surfaces slow pipelines automatically, and the query advisor highlights missing indexes and suggests compound index candidates for your most common aggregation patterns.

Whether you're running real-time dashboards with $facet, nightly ETL with $merge, or ad-hoc analytics on hundreds of millions of documents, JusDB handles the infrastructure so your team can focus on the pipeline logic that drives your product. Explore MongoDB on JusDB and see how managed database infrastructure changes what your team can ship.

MongoDB Aggregation Pipeline: Advanced Patterns and Performance

Pipeline Fundamentals

Key Stages in Depth

$group with $sum, $avg, and $push

$lookup with Sub-Pipeline

$unwind with preserveNullAndEmptyArrays

$addFields for Computed Columns

Advanced Patterns

$facet for Multi-Dimensional Aggregation

$bucket for Histogram Analytics

$merge for Materialized Views

Performance Optimization

explain("executionStats") for Pipelines

allowDiskUse for Large Aggregations

Index Usage in Aggregation Pipelines

Key Takeaways

Run Aggregation Pipelines on JusDB

Share this article

Keep reading

High Performance with MongoDB: A Top-Down Tuning Guide

MongoDB Explained (2026): Replica Sets, Sharding, Atlas & Production Patterns

ScyllaDB vs Apache Cassandra: Performance and Operational Differences

Need Expert Help?

Database Performance Optimization

MongoDB Services

MongoDB on Kubernetes

MongoDB Consulting

MongoDB Performance Tuning

MongoDB High Availability