Description
Prerequisites
- I have written a descriptive issue title
- I have searched existing issues to ensure the feature has not already been requested
🚀 Feature Proposal
I'm opening this issue as a place to discuss potential improvements for `limit` and `perDocumentLimit` with `populate`. My thoughts may not be very organized, since we have so many options.
Mongoose current behavior
Given the following posts, when populating comments with `limit: 2`:

```js
[
  {
    title: 'Post 1',
    commentsIds: ['1', '2', '3', '4', '5']
  },
  {
    title: 'Post 2',
    commentsIds: ['6', '7', '8', '9', '10']
  },
]
```

Currently, if we use `limit`, Mongoose sends a query on comments with a limit of 4 (`posts.length * limit`) for the following ids: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. This returns 4 documents based on natural order, meaning that even though both posts have comments, there's no guarantee they'll both get 2 comments populated: when tested, Mongo sent back 1 to 4, but we shouldn't rely on that, since natural order can change.
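Roughly, the query Mongoose sends today looks like the following. This is a simplified sketch of the behavior described above, not the exact internal code:

```js
// All referenced ids from every post, with a combined limit of
// posts.length * limit (2 * 2 = 4 for the data above).
const allIds = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10'];
const comments = await Comment.find({ _id: { $in: allIds } }).limit(4);
// Which 4 comments come back depends on natural order, so one post may
// end up with all of its comments populated and the other with none.
```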
This is explained in the docs: limit vs. perDocumentLimit.
Considerations for potential improvements
- When populating documents like `commentsIds` above, there's no guarantee that they refer to documents that exist in the `comments` collection. From my experience, most people never hard-delete documents (not without cleaning up their references).
- We don't necessarily need a single `limit`/`perDocumentLimit` algorithm to handle all scenarios. We could find optimizations that are reliable when certain conditions are met; this would be better for performance, but it adds to the maintenance cost of the project.
- For most of these optimizations, when users pass `sort` or `match` inside the `populate` object, those optimizations would be invalidated and we'd fall back to the default/current behavior, at least in the initial phase.
- If we decide to proceed with any of those
Optimization #1 for perDocumentLimit
Instead of always sending a query per document for `perDocumentLimit`, there is a scenario where `allCommentsIds.length <= posts.length * perDocumentLimit`. In that case, sending a single `find` query would guarantee we get back all the comments. The implementation for this change can be as simple as adding these two lines in a condition somewhere:

```js
opts.limit = opts.perDocumentLimit;
delete opts.perDocumentLimit;
```
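For illustration, here is a minimal sketch of the condition this could live behind. The names `posts`, `allCommentsIds`, and `opts` are placeholders, not the actual Mongoose internals:

```js
// Hypothetical sketch: if every referenced comment id fits within
// posts.length * perDocumentLimit, a single find() with a plain limit
// is guaranteed to return all referenced comments, so the N
// per-document queries are unnecessary.
const allCommentsIds = posts.flatMap(post => post.commentsIds);

if (opts.perDocumentLimit != null &&
    allCommentsIds.length <= posts.length * opts.perDocumentLimit) {
  opts.limit = opts.perDocumentLimit;
  delete opts.perDocumentLimit;
}
```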
Optimization #2 for limit
- This is not really a performance optimization, but a behavioral one. The main issue with the current `limit` implementation is that it's not guaranteed to return `limit` populated docs for each parent doc even if it has them, because we send all of the subdocument ids to MongoDB.
- If we assume that in most cases people don't hard-delete documents and the references are always valid, we can make the default behavior send only the first `limit` ids per document. For the data above, with `limit: 2` we would call `Comment.find({ _id: { $in: [1, 2, 6, 7] } });`
- That way we only look up two comments per post, assuming they're not hard-deleted. If we find out that some documents are hard-deleted (`foundDocs.length < ids.length`), we can just send another query with the remaining ids and accept the drawback of an additional query (see the fallback sketch below, after the aggregation example). The point is that this should rarely happen, assuming that most people don't hard-delete data.
- In addition to using this with `limit`, this optimization can be introduced as an alternative way to do `perDocumentLimit`. The current `perDocumentLimit` sends N queries; if users don't hard-delete data, this approach should fetch all the data in 1 query, and the worst-case scenario would be N queries.
- An alternative to sending a query with all ids and `limit = docs.length * opts.limit` is to use an aggregation pipeline that prioritizes ids and does not rely on natural order. This would behave better than the current `limit`, with the guarantee that hard-deleted documents won't cause issues. It can also be used if users pass custom `match` and `sort`, because we can just pass them as extra stages in the pipeline. This would work by getting the first 2 comment ids from each post, and then rotating through the remaining ids from each post: `1, 2, 6, 7, 3, 8, 4, 9, 5, 10`:
```js
// `limit` ids from each doc, then rotate an id from each doc.
const ids = [1, 2, 6, 7, 3, 8, 4, 9, 5, 10];
const comments = await Comment.aggregate([
  { $match: { _id: { $in: ids } } },
  { $addFields: { __order: { $indexOfArray: [ids, '$_id'] } } },
  { $sort: { __order: 1 } },
  { $limit: 4 },
  { $project: { __order: 0 } }
]);
```

This overrides MongoDB's natural-order behavior. The downside is that it can be ~50% slower than a normal `find`, but it is more reliable (though not perfect, since if some comments have invalid references we might end up with ids being used up by a previous post).
We could use other aggregation pipelines that would give guaranteed results similar to `perDocumentLimit`, but I would need to explore them in further detail.
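Separately, here is a rough sketch of the fallback idea from the list above: query only the first `limit` ids per post, and retry with the remaining ids if some referenced comments turn out to be hard-deleted. The names (`posts`, `limit`, `foundDocs`) are hypothetical, not Mongoose's actual populate code:

```js
// First `limit` ids from each post: [1, 2] and [6, 7] for the data above.
const limit = 2;
const firstIds = posts.flatMap(post => post.commentsIds.slice(0, limit));

let foundDocs = await Comment.find({ _id: { $in: firstIds } });

// If some of those ids were hard-deleted, fall back to one extra query
// over the ids we skipped, accepting the cost of the extra round trip.
if (foundDocs.length < firstIds.length) {
  const remainingIds = posts.flatMap(post => post.commentsIds.slice(limit));
  const extraDocs = await Comment.find({ _id: { $in: remainingIds } });
  foundDocs = foundDocs.concat(extraDocs);
  // The caller would still cap each post at `limit` comments when
  // assigning the populated docs back to their parents.
}
```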
For now, I'm really looking to see what people think about this: whether it's an important improvement worth the effort/maintenance cost or not.
I can see at least Optimization #1 being a simple enough improvement that has potential for implementation.
In the process of tuning the performance, I created this benchmark, which outputs results as both JSON and a nice HTML page; it would be helpful if anyone decides to try different approaches.
What do you think? @vkarpov15 @hasezoey