
Performing the equivalent of a SQL JOIN in MongoDB

Author: JIYIK · Last Updated: 2025/04/29

MongoDB thrives on unstructured data and belongs in any database developer's toolset. However, some aggregation operations are much slower than in relational databases, even with the right indexes in place.

This is the case when joining collections. $lookup, MongoDB's counterpart to a SQL JOIN, currently cannot perform merge joins or hash joins, so in its current state it will never be fast.

It is clearly better suited to enumerations with a limited number of options. You can help $lookup by providing an index that allows it to perform an indexed nested-loop join, but beyond that you will struggle to make joins more efficient.

Of course, one could argue that document collections eliminate the need for joins, but that is only true for very static, unchanging data. Data that is likely to change should always be kept in one place.

This article shows you how to get a MongoDB database up and running that reports on historical and slowly changing data.

Joins in MongoDB

Why use joins in a document database at all? More and more databases are being migrated from relational systems to MongoDB, and new databases are being built on MongoDB directly.

These databases, especially those used for reporting, require many lookups. Some have suggested that document databases should denormalize their data to eliminate the need for lookups.

One argument is that it hardly matters, as long as you provide and maintain summary collections (also called aggregates or pre-aggregates) and use a thread separate from the application to update them whenever the underlying data changes.
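As a sketch of that idea — assuming a MongoDB 4.2+ server for the $merge stage and a hypothetical summary collection named "Sales.CustomerTotals", neither of which comes from the original setup — a maintenance job might recompute and upsert totals like this:

use AdventureWorks;
db.getCollection("Sales.SalesOrderHeader").aggregate(
    [
        {
            // Recompute the order totals per customer
            "$group" : {
                "_id" : "$CustomerID",
                "TotalDue" : { "$sum" : "$TotalDue" },
                "Orders" : { "$sum" : NumberInt(1) }
            }
        },
        {
            // Upsert into the summary collection instead of replacing it
            // ("Sales.CustomerTotals" is a hypothetical name)
            "$merge" : {
                "into" : "Sales.CustomerTotals",
                "on" : "_id",
                "whenMatched" : "replace",
                "whenNotMatched" : "insert"
            }
        }
    ]
);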

As a test platform, we will use a conversion to MongoDB of SQL Server's traditional practice database, AdventureWorks. It was chosen because it requires multiple lookups to generate reports.

It is unlikely to cause any troublesome migration issues, which suits our needs, and it also allows a direct comparison of the two database systems.

Perform SQL JOIN equivalent queries without indexes in MongoDB

For the following examples, we will start without any indexes and then add them later, monitoring the timing to ensure they are used. The MongoDB Profiler will also be used to double-check the approach.

The following SQL was run against MongoDB using Studio 3T.

Example:

SELECT p.PersonType, Sum(soh.TotalDue), Count(*)
    FROM "Sales.SalesOrderHeader" soh
        INNER JOIN "Sales.Customer" c
            ON soh.CustomerID = c.CustomerID
        INNER JOIN "Person.Person" p
        ON c.PersonID = p.BusinessEntityID
    GROUP BY p.PersonType
--Primary type of person: SC = Store Contact,
--IN = Individual (retail) customer

The results show how many customers and store contacts there are and the total value of their orders.

The only difference from the SQL Server version is the string delimiters around the collection names. Two $lookup stages are used to implement the two joins.

Now, suppose you press the button. The query finishes in 5 minutes and 17 seconds.

Soon after, a concerned but accusatory person from the Society for the Prevention of Cruelty to Databases calls. Some indexing, she claims, would save her much grief.

cursor.maxTimeMS() is a cursor method that applies only to queries, so it is no help here. At this stage, it is worth looking at the automatically generated code.

use AdventureWorks;
db.getCollection("Sales.SalesOrderHeader").aggregate(
    [
        {
            "$project" : {
                "_id" : NumberInt(0),
                "soh" : "$$ROOT"
            }
        },
        {
            "$lookup" : {
                "localField" : "soh.CustomerID",
                "from" : "Sales.Customer",
                "foreignField" : "_id",
                "as" : "c"
            }
        },
        {
            "$unwind" : {
                "path" : "$c",
                "preserveNullAndEmptyArrays" : false
            }
        },
        {
            "$lookup" : {
                "localField" : "c.PersonID",
                "from" : "Person.Person",
                "foreignField" : "BusinessEntityID",
                "as" : "p"
            }
        },
        {
            "$unwind" : {
                "path" : "$p",
                "preserveNullAndEmptyArrays" : false
            }
        },
        {
            "$group" : {
                "_id" : {
                    "p᎐PersonType" : "$p.PersonType"
                },
                "SUM(soh᎐TotalDue)" : {
                    "$sum" : "$soh.TotalDue"
                },
                "COUNT(*)" : {
                    "$sum" : NumberInt(1)
                }
            }
        },
        {
            "$project" : {
                "p.PersonType" : "$_id.p᎐PersonType",
                "SUM(soh᎐TotalDue)" : "$SUM(soh᎐TotalDue)",
                "COUNT(*)" : "$COUNT(*)",
                "_id" : NumberInt(0)
            }
        }
    ],
    {
        "allowDiskUse" : true
    }
);

When $lookup runs, the local field you specify in the aggregation stage is matched against a field in the documents of the collection being looked up. The matching documents are returned as an array embedded in each input document.

In other words, the $lookup operation compares the foreign field in the looked-up collection's documents to the local field in the input documents.

The key field may not exist in a referenced document, in which case it is treated as null. If the foreign field is not indexed, $lookup performs a full collection scan (COLLSCAN) for every document passing through the pipeline.

This becomes very expensive very quickly; what you want is an index hit instead of a collection scan.
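One way to check is to run the pipeline through explain(), using the collections from this article (note that explain output for $lookup stages is limited on older servers):

use AdventureWorks;
// Look for IXSCAN (index hit) versus COLLSCAN (full scan) in the output
db.getCollection("Sales.SalesOrderHeader").explain("executionStats").aggregate(
    [
        {
            "$lookup" : {
                "localField" : "CustomerID",
                "from" : "Sales.Customer",
                "foreignField" : "_id",
                "as" : "c"
            }
        }
    ]
);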

Indexes in MongoDB

If you only need a few fields from a collection, creating a covering index that contains those fields together with the fields in the query condition will be significantly more efficient. MongoDB can then deliver results straight from the index without ever accessing the documents.

This should be done for any query that is expected to be used frequently. What fields should be indexed?

Compound indexes should be used when the key spans multiple fields (such as a first-name/last-name combination). When sorting on multiple fields, think about how you want your reports to be ordered, because that determines the best order of the fields within the index.
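For instance, here is a minimal sketch, assuming Person.Person keeps its LastName and FirstName fields from the import; the index field order matches a report sorted by last name, then first name:

use AdventureWorks;
// Compound index whose field order matches the report's sort order
db.getCollection("Person.Person").createIndex(
    { "LastName" : 1, "FirstName" : 1 }
);
// A query that filters and projects only indexed fields is covered:
// MongoDB answers it from the index without touching the documents
db.getCollection("Person.Person").find(
    { "LastName" : "Smith" },
    { "LastName" : 1, "FirstName" : 1, "_id" : 0 }
);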

Perform SQL JOIN equivalent queries using indexes created in MongoDB

If your relational database uses a single column as a primary key, map it to the _id field during the import. This unique _id field functions like a clustered index.

Naming the field _id is what gets it adopted as the clustered index. So that existing queries are not disrupted, the original field is also imported under its original name.

You must create indexes for all the other fields used in a $lookup, in particular the foreignField of the from collection.

These are the same fields that would appear in the ON clause of the JOIN. Then, to use the _id field we have indexed, which holds the same value as the customer ID, reference _id in place of CustomerID.
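A sketch of the index setup this implies, using the collections from the earlier examples (the exact field list depends on how your import named things):

use AdventureWorks;
// Index the foreignField of each $lookup, as in the ON clause of a JOIN
db.getCollection("Person.Person").createIndex({ "BusinessEntityID" : 1 });
// Useful for the reversed join order shown later in this article
db.getCollection("Sales.SalesOrderHeader").createIndex({ "CustomerID" : 1 });
// Sales.Customer needs no extra index here: CustomerID was mapped to _id
// on import, and _id is always indexed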

I retested it and found that the response time had dropped to 6.6 seconds. This is faster than the 5 minutes and 17 seconds without the index, but still not as good as the original SQL Server database.

SQL Server managed to complete the same aggregation in 160 milliseconds on the same server as MongoDB.

Unfortunately, the MongoDB profiler cannot tell us anything beyond the fact that a COLLSCAN was used. This is unavoidable: while the individual lookups make careful use of the indexes, an index cannot serve a larger aggregation unless the aggregation begins with a matching stage.

Suppose you rearrange the order of the joins in the SQL query in Studio 3T. In this case, SQL Server uses the same approach as before: a hash-match inner join of the Customer and Person tables, with a clustered index scan on both, followed by an inner join of the result with SalesOrderHeader.

The Studio 3T version is as follows.

Example:

SELECT p.PersonType, Sum(soh.TotalDue), Count(*)
    FROM "Sales.Customer" c
        INNER JOIN "Person.Person" p
            ON c.PersonID = p.BusinessEntityID
        INNER JOIN  "Sales.SalesOrderHeader" soh
            ON soh.CustomerID = c.CustomerID
    GROUP BY p.PersonType
--Primary type of person: SC = Store Contact,
--IN = Individual (retail) customer

The order of the aggregation stages in Studio 3T reflects the order of the joins, so the execution order is different and performs much better, at 4.2 seconds. Optimizing the aggregation script in the Aggregation Editor made only a small further difference, taking it to just over three seconds.

The optimization simply trims the fields that pass through the pipeline down to the essential ones.

use AdventureWorks2016;
db.getCollection("Sales.Customer").aggregate(
    [
        {
            "$project" : {
                "_id" : NumberInt(0),
                "CustomerID" : 1.0,
                "PersonID" : 1.0
            }
        },
        {
            "$lookup" : {
                "localField" : "PersonID",
                "from" : "Person.Person",
                "foreignField" : "BusinessEntityID",
                "as" : "p"
            }
        },
        {
            "$unwind" : {
                "path" : "$p",
                "preserveNullAndEmptyArrays" : false
            }
        },
        {
            "$project" : {
                "CustomerID" : 1.0,
                "PersonID" : 1.0,
                "PersonType" : "$p.PersonType"
            }
        },
        {
            "$lookup" : {
                "localField" : "CustomerID",
                "from" : "Sales.SalesOrderHeader",
                "foreignField" : "CustomerID",
                "as" : "soh"
            }
        },
        {
            "$unwind" : {
                "path" : "$soh",
                "preserveNullAndEmptyArrays" : false
            }
        },
        {
            "$project" : {
                "CustomerID" : 1.0,
                "PersonID" : 1.0,
                "PersonType" : 1.0,
                "TotalDue" : "$soh.TotalDue"
            }
        },
        {
            "$group" : {
                "_id" : {
                    "PersonType" : "$PersonType"
                },
                "SUM(TotalDue)" : {
                    "$sum" : "$TotalDue"
                },
                "COUNT(*)" : {
                    "$sum" : NumberInt(1)
                }
            }
        },
        {
            "$project" : {
                "PersonType" : "$_id.PersonType",
                "Total" : "$SUM(TotalDue)",
                "Transactions" : "$COUNT(*)",
                "_id" : NumberInt(0)
            }
        }
    ],
    {
        "allowDiskUse" : false
    }
);

If you continue down this path, you will have to put a lot of effort into optimizing each query.

Imagine you have a bunch of managers chasing you for a bunch of income reports. Now what?

Use pre-aggregated collections to simplify reporting in MongoDB

It helps to aggregate collections at the finest granularity that you could ever report on. This is the equivalent of an OLAP cube.

In this case, we are dealing with trading records extracted from invoices. These will not change, and for good reason.

If you use such an intermediate collection for the pre-aggregation, the reports themselves become quick and simple to run.

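What follows is a minimal sketch of how such an intermediate collection might be built; the target name "Sales.OrderSummary" is hypothetical, and $out replaces the whole target collection with the pre-aggregated results:

use AdventureWorks2016;
db.getCollection("Sales.SalesOrderHeader").aggregate(
    [
        {
            // Summarize at the finest granularity we might report on
            "$group" : {
                "_id" : {
                    "CustomerID" : "$CustomerID",
                    "SalesPersonID" : "$SalesPersonID"
                },
                "TotalDue" : { "$sum" : "$TotalDue" },
                "Transactions" : { "$sum" : NumberInt(1) }
            }
        },
        // Write the summary to the (hypothetical) target collection
        { "$out" : "Sales.OrderSummary" }
    ],
    {
        "allowDiskUse" : true
    }
);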

You wouldn't want to store such a specialized collection of aggregates in practice. Instead, you should group your more general data into periods, such as weeks, months, or years, so that you can plot sales over time.

You should also provide the ID of the salesperson and store so that someone can get credit for the transaction.

The aggregation can be sketched out in SQL. However, things such as the date calculations and the output stage are left commented out of the SQL, because the SQL-to-MongoDB translation is limited in what it can accomplish.

SELECT c.PersonID, p.PersonType, soh.SalesPersonID, psp.Name, psp.CountryRegionCode,
    Sum(soh.TotalDue), Count(*)
    --,
    --Year(soh.OrderDate) AS year, Month(soh.OrderDate) AS month,
    --DatePart(WEEK, soh.OrderDate) AS week
    FROM "Sales.SalesOrderHeader" AS soh
        INNER JOIN "Sales.Customer" AS c
            ON c.CustomerID = soh.CustomerID
        INNER JOIN "Person.Person" AS p
            ON p.BusinessEntityID = c.PersonID
        INNER JOIN "Person.Address" AS pa
            ON pa.AddressID = soh.BillToAddressID
        INNER JOIN "Person.StateProvince" AS psp
            ON psp.StateProvinceID = pa.StateProvinceID
    GROUP BY c.PersonID, p.PersonType, soh.SalesPersonID, psp.Name,
    psp.CountryRegionCode
    --, Year(soh.OrderDate), Month(soh.OrderDate),
    --DatePart(WEEK, soh.OrderDate);
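On the MongoDB side, the commented-out date calculations can be restored with the $year, $month, and $week operators, assuming OrderDate was imported as a date. This is a sketch of that grouping, not the full report pipeline:

use AdventureWorks2016;
db.getCollection("Sales.SalesOrderHeader").aggregate(
    [
        {
            "$group" : {
                "_id" : {
                    "SalesPersonID" : "$SalesPersonID",
                    // $year/$month/$week replace SQL's Year(), Month(),
                    // and DatePart(WEEK, ...)
                    "year" : { "$year" : "$OrderDate" },
                    "month" : { "$month" : "$OrderDate" },
                    "week" : { "$week" : "$OrderDate" }
                },
                "Total" : { "$sum" : "$TotalDue" },
                "Transactions" : { "$sum" : NumberInt(1) }
            }
        }
    ],
    {
        "allowDiskUse" : true
    }
);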

Ensure the correct stage sequence and tie up loose ends

After that, copy the mongo shell query code and paste it into Studio 3T's MongoDB aggregation query builder, also known as the Aggregation Editor.

The aggregations should then be fine-tuned. After that, you can run the report directly from the SQL Query tab in Studio 3T.

A little thought can take the query from nearly five minutes down to about 100 milliseconds. Simply put common-sense indexes on keys and foreign-key references, and then try covering and intersecting indexes, to escape the pain of waiting several minutes.

Then, check to see if you are unnecessarily scanning old or unchanged data. Unfortunately, this is such a common mistake that it's almost epidemic.

This post has shown how a cube, a pre-aggregated collection, can help speed up the design and generation of the many reports that use the same core data.

Finally, getting the order of the stages in the aggregation pipeline right is critical. Save operations such as lookups and sorts until you have only the documents you need for the final report; matching and projection should be done in the early stages.
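As a minimal sketch of that ordering, using the collections from this article (the $match condition is purely illustrative):

use AdventureWorks;
db.getCollection("Sales.SalesOrderHeader").aggregate(
    [
        // Filter early, while the stream is still large
        { "$match" : { "TotalDue" : { "$gt" : 0 } } },
        // Trim each document down to the fields the report needs
        { "$project" : { "_id" : 0, "CustomerID" : 1, "TotalDue" : 1 } },
        // Look up late, against the indexed _id of Sales.Customer
        {
            "$lookup" : {
                "localField" : "CustomerID",
                "from" : "Sales.Customer",
                "foreignField" : "_id",
                "as" : "c"
            }
        },
        { "$unwind" : "$c" },
        // Group near the end, when only the needed documents remain
        {
            "$group" : {
                "_id" : "$c.PersonID",
                "Total" : { "$sum" : "$TotalDue" }
            }
        }
    ]
);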

Exactly when you do the grouping is a more tactical choice, although grouping is not an expensive operation for MongoDB. It makes sense to keep the pipeline lean, pushing only the data you need through each document.

But this is best viewed as part of the final cleanup, and while it will speed things up, it won't provide a significant advantage.

Current transaction information, on the other hand, can never be pre-aggregated in this way; you would never want outdated information about a recent transaction. However, since this is a small amount of data, it is unlikely to cause problems with lookups.
