Optimize joins to use index when possible #335
🦋 Changeset detected. Latest commit: 7140573. The changes in this PR will be included in the next version bump. This PR includes changesets to release 9 packages:
@tanstack/db
@tanstack/db-ivm
@tanstack/electric-db-collection
@tanstack/query-db-collection
@tanstack/react-db
@tanstack/solid-db
@tanstack/svelte-db
@tanstack/trailbase-db-collection
@tanstack/vue-db
Size Change: +1.66 kB (+2.84%) Total Size: 60.1 kB
Size Change: 0 B Total Size: 1.05 kB
inner(collection: MultiSet<T>): MultiSet<T> {
  return collection.map((data) => {
    this.#f(data)
It could be useful in future to pass the multiplicity to the callback so that it's aware if it's an insert/delete. Not important for now.
We can add this later when we need it :-)
    indexType: BTreeIndex,
  })
} catch (error) {
  console.warn(`Failed to create auto-index for field "${fieldName}":`, error)
we've used the debug package elsewhere for logging. We should maybe use it here.
I didn't actually change this. I took this piece of code from ensureIndexForExpression and moved it here. We could use debugLog but that would only log it in debug mode. I think this warning is useful also in non-debug mode to warn you that for some reason the index could not be created and thus the queries might be less efficient.
… and loading matching keys dynamically.
samwillis
left a comment
This is great, let's get it merged!
the note I've added is for later
const activePipelineWithLoading: IStreamBuilder<
  [key: unknown, [originalKey: string, namespacedRow: NamespacedRow]]
> = activePipeline.pipe(
  tap(([joinKey, _]) => {
Not something to change right now, but with the current version of tap we process and ask for each joined key one at a time. If in future we want to batch load via sync, we would need to try to reassemble a batch to ask for.
The tap operator iterates over the items in the multiset, calling this function, which then asks, for each row, for the joined items to be injected. The alternative would be to do this at the multiset level: we have a batch of items from the left, so we ask for a batch from the right to be injected all at once. These batches can then naturally be pushed back down to the sync layer and requested from there.
But for later!
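To make the per-row versus per-batch distinction in the comment above concrete, here is a minimal sketch. The `LoadKeys` callback and both function names are hypothetical; they only stand in for whatever would eventually ask the sync layer for rows and are not part of the PR.

```typescript
// Hypothetical loader: stands in for a request that injects rows for the given join keys.
type LoadKeys = (keys: unknown[]) => void

// Per-row: one request per joined key, which is what a per-item tap callback amounts to.
function requestPerRow(joinKeys: unknown[], load: LoadKeys): void {
  for (const key of joinKeys) {
    load([key])
  }
}

// Per-batch: collect the distinct join keys of the whole batch, then ask once.
function requestPerBatch(joinKeys: unknown[], load: LoadKeys): void {
  const distinct = Array.from(new Set(joinKeys))
  if (distinct.length > 0) {
    load(distinct)
  }
}
```

With three incoming rows whose join keys are 1, 2 and 2, the per-row variant issues three requests while the per-batch variant issues a single request for the two distinct keys.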
This PR optimizes joins based on available indexes.
For a left join we always have to iterate over the left collection, but we don't need to iterate over the entire right collection. Based on the rows in the left collection we can look up the matching rows in the right collection (based on the key we're joining on). That lookup is efficient if there's an index on the join key. We can do the same for right joins. For inner joins, we can loop over the smaller collection and look up matching rows in the bigger collection, so that we don't have to loop over the bigger collection.
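The lookup strategy described above can be sketched in isolation as an index-probing left join. This is a simplified illustration over plain arrays, not the D2 pipeline or the @tanstack/db API; `Row`, `buildIndex` and `indexedLeftJoin` are hypothetical names, and a `Map` stands in for the collection index.

```typescript
// Hypothetical row type for illustration only.
type Row = Record<string, unknown>

// A minimal index: join-key value -> all rows that carry that value.
function buildIndex<R extends Row>(rows: R[], field: keyof R): Map<unknown, R[]> {
  const index = new Map<unknown, R[]>()
  for (const row of rows) {
    const bucket = index.get(row[field])
    if (bucket) {
      bucket.push(row)
    } else {
      index.set(row[field], [row])
    }
  }
  return index
}

// Left join: iterate the left side fully and probe the right side's index per row,
// so the right side is never scanned.
function indexedLeftJoin<L extends Row, R extends Row>(
  left: L[],
  leftKey: keyof L,
  rightIndex: Map<unknown, R[]>,
): Array<[L, R | undefined]> {
  const out: Array<[L, R | undefined]> = []
  for (const l of left) {
    const matches = rightIndex.get(l[leftKey])
    if (!matches || matches.length === 0) {
      out.push([l, undefined]) // a left join keeps unmatched left rows
    } else {
      for (const r of matches) {
        out.push([l, r])
      }
    }
  }
  return out
}
```

The cost of the join is then proportional to the size of the left side plus the number of matches, rather than to the size of both sides.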
Here's a concrete example: imagine we're left joining a Comments collection with a Users collection on Comments.user_id = Users.id, and imagine we have an index on Users.id. Now we can loop over all Comments and, for each comment, look up its user_id in the index for Users.id. This gives us the corresponding user that we need to join with the comment, without having to loop over the entire Users collection.
Implementation Overview
We don't actually have to modify the existing D2 join operator to do this. The join operator takes two streams, a stream for the left collection and a stream for the right collection. For left/right/inner joins we only need to loop over one of the streams and we don't need to loop over the entire other stream. Therefore, the idea is to modify the streams such that there is an active stream and a lazy stream. We process the entire active stream and use it to dynamically populate the lazy stream. This is depicted in the following diagram:
The diagram above depicts a left join of comments with users (the example from before). The comments are filtered (e.g. to only get the comments for a certain issue). Then, we want to left-join them with users. To do this, we add a tap operator after the filter and before the join. This operator doesn't modify the stream, but for every row it looks up the join key in the index of the Users collection and dynamically loads the matching user into the lazy users stream. In other words, we're populating the lazy stream with the users that match the comments as we process them. Note that the lazy users stream can apply additional operators before being joined in. In the diagram, we're doing an additional filter over the lazy stream before joining it in.
Implementation Challenges
The D2 pipeline from the diagram above is created at compile time but the indexes are created at runtime. Hence, when creating this special tap operator we don't know whether the collection that we want to load lazily will actually have the required index on the join key. We will only know this at runtime, when the tap operator runs for the first time. At that point, if we notice that the index does not exist, we cannot apply the optimization, so we turn the Users collection back into a regular collection (instead of a lazy collection).
Currently, inner joins loop over the smaller collection and look up matching rows in the index on the bigger collection. But that index may not exist, in which case we will need to loop over the entire bigger collection. In that case, it may be more efficient to loop over the bigger collection and try to find matching rows in the smaller collection (because that one might have an index on the join key). However, flipping these collections around would complicate the code quite a lot (because, again, this would need to happen at runtime), so we decided not to do it yet.
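The lazy loading and its runtime fallback can be sketched as follows. This is a simplified model under stated assumptions, not the code in this PR: plain arrays stand in for D2 streams, a `Map` stands in for the index on Users.id, and `UserRow`, `CommentRow`, `createLazyUsers` and this `tap` helper are all hypothetical names.

```typescript
// Hypothetical row shapes for the Comments/Users example.
type UserRow = { id: number; name: string }
type CommentRow = { id: number; user_id: number }

// Models the lazy users stream. `idIndex` being undefined models
// "the index on Users.id does not exist at runtime".
function createLazyUsers(allUsers: UserRow[], idIndex?: Map<number, UserRow>) {
  const loaded: UserRow[] = [] // rows pushed into the lazy stream so far
  const seen = new Set<number>()
  let fellBack = false

  // Called once per comment flowing through the active stream.
  function loadFor(comment: CommentRow): void {
    if (fellBack) return
    if (!idIndex) {
      // Discovered on the first run: no index, so the optimization cannot apply.
      // Treat users as a regular collection and load everything once.
      fellBack = true
      loaded.push(...allUsers)
      return
    }
    const user = idIndex.get(comment.user_id)
    if (user && !seen.has(user.id)) {
      seen.add(user.id)
      loaded.push(user) // load only the user this comment needs
    }
  }

  return { loaded, loadFor, usedFallback: () => fellBack }
}

// tap: run a side effect per item and pass the items through unchanged.
function tap<T>(items: T[], f: (item: T) => void): T[] {
  for (const item of items) {
    f(item)
  }
  return items
}
```

With an index, only the users referenced by the processed comments end up in the lazy stream; without one, the first call loads the whole collection and later calls do nothing.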
TODOs
eager mode)
Always create an index on the PK of a collection (in eager and pk mode). Better as a follow-up PR.