Description
Prerequisites
- I have written a descriptive issue title
- I have searched existing issues to ensure the feature has not already been requested
🚀 Feature Proposal
I'm opening this issue as a place to discuss potential improvements for `limit` and `perDocumentLimit` with `populate`. My thoughts may not be very organized, since we have so many options.
Mongoose current behavior
Given the following posts, when populating comments with `limit: 2`:

```js
[
  {
    title: 'Post 1',
    commentsIds: ['1', '2', '3', '4', '5']
  },
  {
    title: 'Post 2',
    commentsIds: ['6', '7', '8', '9', '10']
  },
]
```

Currently, if we use `limit`, Mongoose sends a query on comments with a limit of 4 (`posts.length * limit`) for the following ids: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. This returns 4 documents based on natural order, meaning that even though both posts have comments, there's no guarantee they'll both get 2 comments populated: when tested, Mongo sent back 1 to 4, but we shouldn't rely on that, since natural order can change.
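Roughly, the query Mongoose sends today looks like the following. This is a simplified sketch of the behavior described above, not the exact internal code:

```js
// All referenced ids from every post, with a combined limit of
// posts.length * limit (2 * 2 = 4 for the data above).
const allIds = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10'];
const comments = await Comment.find({ _id: { $in: allIds } }).limit(4);
// Which 4 comments come back depends on natural order, so one post may
// end up with all of its comments populated and the other with none.
```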
This is explained in the docs: limit vs. perDocumentLimit.
Considerations for potential improvements
- When populating documents like `commentsIds` above, there's no guarantee that they refer to documents that exist in the `comments` collection. From my experience, most people never hard-delete documents (not without cleaning up their references).
- We don't necessarily need a single `limit`/`perDocumentLimit` algorithm to handle all scenarios. We could find optimizations that are reliable when certain conditions are met; this would be better for performance, but it adds to the maintenance cost of the project.
- For most of these optimizations, when users pass `sort` or `match` inside the `populate` object, those optimizations would be invalidated and we'd fall back to the default/current behavior, at least in the initial phase.
- If we decide to proceed with any of those
Optimization #1 for perDocumentLimit
Instead of always sending a query per document for `perDocumentLimit`, there is a scenario where `allCommentsIds.length <= posts.length * perDocumentLimit`. In that case, sending a single `find` query would guarantee we get back all the comments. The implementation for this change can be as simple as adding these two lines in a condition somewhere:

```js
opts.limit = opts.perDocumentLimit;
delete opts.perDocumentLimit;
```
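For illustration, here is a minimal sketch of the condition this could live behind. The names `posts`, `allCommentsIds`, and `opts` are placeholders, not the actual Mongoose internals:

```js
// Hypothetical sketch: if every referenced comment id fits within
// posts.length * perDocumentLimit, a single find() with a plain limit
// is guaranteed to return all referenced comments, so the N
// per-document queries are unnecessary.
const allCommentsIds = posts.flatMap(post => post.commentsIds);

if (opts.perDocumentLimit != null &&
    allCommentsIds.length <= posts.length * opts.perDocumentLimit) {
  opts.limit = opts.perDocumentLimit;
  delete opts.perDocumentLimit;
}
```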
Optimization #2 for limit
- This is not really a performance optimization, but a behavioral one. The main issue with the current `limit` implementation is that it's not guaranteed to return `limit` populated docs for each parent doc even if it has them, because we send all of the subdocument ids to MongoDB.
- If we assume that in most cases people don't hard-delete documents and the references are always valid, we can make the default behavior send only the first `limit` ids per document. For the data above, with `limit: 2` we would call `Comment.find({ _id: { $in: [1, 2, 6, 7] } });`
- That way we only look up two comments per post, assuming they're not hard-deleted. If we find out that some documents are hard-deleted (`foundDocs.length < ids.length`), we can just send another query with the remaining ids and accept the drawback of an additional query (see the fallback sketch below, after the aggregation example). The point is that this should rarely happen, assuming that most people don't hard-delete data.
- In addition to using this with `limit`, this optimization can be introduced as an alternative way to do `perDocumentLimit`. The current `perDocumentLimit` sends N queries; if users don't hard-delete data, this approach should fetch all the data in 1 query, and the worst-case scenario would be N queries.
- An alternative to sending a query with all ids and `limit = docs.length * opts.limit` is to use an aggregation pipeline that prioritizes ids and does not rely on natural order. This would behave better than the current `limit`, with the guarantee that hard-deleted documents won't cause issues. It can also be used if users pass custom `match` and `sort`, because we can just pass them as extra stages in the pipeline. This would work by getting the first 2 comment ids from each post, and then rotating through the remaining ids from each post: `1, 2, 6, 7, 3, 8, 4, 9, 5, 10`:
```js
// `limit` ids from each doc, then rotate an id from each doc.
const ids = [1, 2, 6, 7, 3, 8, 4, 9, 5, 10];
const comments = await Comment.aggregate([
  { $match: { _id: { $in: ids } } },
  { $addFields: { __order: { $indexOfArray: [ids, '$_id'] } } },
  { $sort: { __order: 1 } },
  { $limit: 4 },
  { $project: { __order: 0 } }
]);
```

This overrides MongoDB's natural-order behavior. The downside is that it can be ~50% slower than a normal `find`, but it is more reliable (though not perfect, since if some comments have invalid references we might end up with ids being used up by a previous post).
We could use other aggregation pipelines that would give guaranteed results similar to `perDocumentLimit`, but I would need to explore them in further detail.
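Separately, here is a rough sketch of the fallback idea from the list above: query only the first `limit` ids per post, and retry with the remaining ids if some referenced comments turn out to be hard-deleted. The names (`posts`, `limit`, `foundDocs`) are hypothetical, not Mongoose's actual populate code:

```js
// First `limit` ids from each post: [1, 2] and [6, 7] for the data above.
const limit = 2;
const firstIds = posts.flatMap(post => post.commentsIds.slice(0, limit));

let foundDocs = await Comment.find({ _id: { $in: firstIds } });

// If some of those ids were hard-deleted, fall back to one extra query
// over the ids we skipped, accepting the cost of the extra round trip.
if (foundDocs.length < firstIds.length) {
  const remainingIds = posts.flatMap(post => post.commentsIds.slice(limit));
  const extraDocs = await Comment.find({ _id: { $in: remainingIds } });
  foundDocs = foundDocs.concat(extraDocs);
  // The caller would still cap each post at `limit` comments when
  // assigning the populated docs back to their parents.
}
```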
For now, I'm really looking to see what people think about this: whether it's an important improvement worth the effort/maintenance cost or not.
I can see at least Optimization #1 being a simple enough improvement that has potential for implementation.
In the process of tuning the performance, I created this benchmark, which outputs results as both JSON and a nice HTML page; it would be helpful if anyone decides to try different approaches.
What do you think? @vkarpov15 @hasezoey