Optimize joins to use index when possible #335

Merged
kevin-dp merged 9 commits into main from kevindp/join-with-index on Aug 18, 2025

Conversation

Contributor

@kevin-dp kevin-dp commented Jul 30, 2025

This PR optimizes joins based on available indexes.

For a left join we always have to iterate over the left collection, but we don't need to iterate over the entire right collection. Based on the rows in the left collection, we can look up the matching rows in the right collection by the key we're joining on. That lookup is efficient if there's an index on the join key. We can do the same for right joins. For inner joins, we can loop over the smaller collection and look up matching rows in the bigger collection, so that we don't have to loop over the bigger collection.

Here's a concrete example. Imagine we're left joining a Comments collection with a Users collection on Comments.user_id = Users.id, and imagine we have an index on Users.id. Now we can loop over all Comments and, for each comment, look up its user_id in the index for Users.id. This gives us the corresponding user that we need to join with the comment, without having to loop over the entire Users collection.
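
To make the Comments/Users example concrete, here is a minimal sketch in plain TypeScript. It does not use the D2 operators or the real index classes: the `CommentRow` and `UserRow` shapes, `buildIndex`, and `leftJoinWithIndex` are hypothetical names, and a `Map` stands in for the index on Users.id.

```typescript
// Hypothetical row shapes for the Comments/Users example (assumptions, not the library's types).
type CommentRow = { id: number; user_id: number; text: string }
type UserRow = { id: number; name: string }

// Stand-in for an index on a join key: a Map from key to the rows with that key.
function buildIndex<T, K>(rows: T[], keyOf: (row: T) => K): Map<K, T[]> {
  const index = new Map<K, T[]>()
  rows.forEach((row) => {
    const key = keyOf(row)
    const bucket = index.get(key)
    if (bucket) bucket.push(row)
    else index.set(key, [row])
  })
  return index
}

// Left join: iterate over every left row, but only look up the matching right rows.
function leftJoinWithIndex(
  comments: CommentRow[],
  usersById: Map<number, UserRow[]>
): Array<[CommentRow, UserRow | undefined]> {
  const result: Array<[CommentRow, UserRow | undefined]> = []
  comments.forEach((comment) => {
    const matches = usersById.get(comment.user_id)
    if (!matches || matches.length === 0) {
      // Left rows without a match are kept, paired with undefined.
      result.push([comment, undefined])
    } else {
      matches.forEach((user) => result.push([comment, user]))
    }
  })
  return result
}

const demoUsers: UserRow[] = [
  { id: 1, name: "Ada" },
  { id: 2, name: "Grace" },
]
const demoComments: CommentRow[] = [
  { id: 1, user_id: 1, text: "first" },
  { id: 2, user_id: 2, text: "second" },
  { id: 3, user_id: 99, text: "orphan" },
]

const joinedRows = leftJoinWithIndex(
  demoComments,
  buildIndex(demoUsers, (user) => user.id)
)
```

The Users collection is never scanned during the join itself; each comment costs one index lookup.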

Implementation Overview

We don't actually have to modify the existing D2 join operator to do this. The join operator takes two streams, a stream for the left collection and a stream for the right collection. For left/right/inner joins we only need to loop over one of the streams and we don't need to loop over the entire other stream. Therefore, the idea is to modify the streams such that there is an active stream and a lazy stream. We process the entire active stream and use it to dynamically populate the lazy stream. This is depicted in the following diagram:

[Diagram: D2 pipeline for a left join of filtered comments (active stream) with lazily loaded users (lazy stream)]

The diagram above depicts a left join of comments with users (the example from before). The comments are filtered (e.g. to only get the comments for a certain issue). Then, we want to left join them with users. To do this, we add a tap operator after the filter and before the join. This operator doesn't modify the stream but, for every row, looks up the join key in the index of the Users collection and dynamically loads the matching user into the lazy users stream. In other words, we populate the lazy stream with the users that match the comments as we process them. Note that the lazy users stream can apply additional operators before being joined in. In the diagram, we apply an additional filter to the lazy stream before joining it in.
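
The active/lazy arrangement can be sketched with plain arrays in place of D2 streams. This is an illustration only: `tap` here is a hypothetical helper (not the db-ivm operator), and `usersByIdIndex`, `lazyUsers`, and the row shapes are assumptions made for the example.

```typescript
// Hypothetical row shapes (assumptions for this sketch).
type CommentRow = { id: number; issue_id: number; user_id: number }
type UserRow = { id: number; name: string; active: boolean }

// tap: run a side effect for every row and pass the rows through unchanged.
function tap<T>(rows: T[], sideEffect: (row: T) => void): T[] {
  rows.forEach(sideEffect)
  return rows
}

// Stand-in for the index on Users.id.
const usersByIdIndex = new Map<number, UserRow>([
  [1, { id: 1, name: "Ada", active: true }],
  [2, { id: 2, name: "Grace", active: false }],
  [3, { id: 3, name: "Linus", active: true }],
])

const allComments: CommentRow[] = [
  { id: 1, issue_id: 10, user_id: 1 },
  { id: 2, issue_id: 10, user_id: 2 },
  { id: 3, issue_id: 11, user_id: 3 },
  { id: 4, issue_id: 10, user_id: 1 },
]

// The lazy stream starts empty and is populated while the active stream is processed.
const lazyUsers: UserRow[] = []
const loadedUserIds = new Set<number>()

// Active stream: filter the comments, then tap to load the matching users.
const activeComments = tap(
  allComments.filter((comment) => comment.issue_id === 10),
  (comment) => {
    const user = usersByIdIndex.get(comment.user_id)
    if (user && !loadedUserIds.has(user.id)) {
      loadedUserIds.add(user.id)
      lazyUsers.push(user)
    }
  }
)

// The lazy stream can still apply its own operators before the join.
const filteredLazyUsers = lazyUsers.filter((user) => user.active)

// Left join of the active stream with the (filtered) lazy stream.
const joinedPairs: Array<[CommentRow, UserRow | undefined]> = activeComments.map(
  (comment): [CommentRow, UserRow | undefined] => [
    comment,
    filteredLazyUsers.find((user) => user.id === comment.user_id),
  ]
)
```

User 3 is never loaded because no comment on issue 10 references it, which is the point of the optimization.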

Implementation Challenges

The D2 pipeline from the diagram above is created at compile time, but the indexes are created at runtime. Hence, when creating this special tap operator, we don't know whether the collection that we want to load lazily will actually have the required index on the join key. We will only know this at runtime, when the tap operator runs for the first time. At that point, if we notice that the index does not exist, we cannot apply the optimization, so we turn the Users collection back into a regular collection (instead of a lazy collection).
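
The runtime decision could look roughly like the sketch below. It is not the code from this PR: `DemoCollection`, `makeLazyLoader`, and the `mode` flag are hypothetical, and the fallback is modelled as simply loading every row once.

```typescript
type UserRow = { id: number; name: string }

// Hypothetical collection: rows plus indexes that may or may not exist at runtime.
interface DemoCollection {
  rows: UserRow[]
  indexes: Map<string, Map<number, UserRow[]>>
}

function makeLazyLoader(collection: DemoCollection, field: string) {
  const loaded: UserRow[] = []
  const seenKeys = new Set<number>()
  let mode: "undecided" | "lazy" | "eager" = "undecided"

  // Called by the tap operator for every row of the active stream.
  function onActiveRow(joinKey: number): void {
    if (mode === "undecided") {
      // The index can only be checked now, on the first run.
      if (collection.indexes.has(field)) {
        mode = "lazy"
      } else {
        // No index: fall back to a regular collection by loading everything once.
        mode = "eager"
        collection.rows.forEach((row) => loaded.push(row))
      }
    }
    if (mode !== "lazy" || seenKeys.has(joinKey)) return
    seenKeys.add(joinKey)
    const matches = collection.indexes.get(field)!.get(joinKey) || []
    matches.forEach((row) => loaded.push(row))
  }

  return { onActiveRow, loaded, getMode: () => mode }
}

const demoRows: UserRow[] = [
  { id: 1, name: "Ada" },
  { id: 2, name: "Grace" },
  { id: 3, name: "Linus" },
]
const idIndex = new Map<number, UserRow[]>()
demoRows.forEach((row) => idIndex.set(row.id, [row]))

// With an index on "id": only the requested user is loaded.
const withIndex = makeLazyLoader(
  { rows: demoRows, indexes: new Map<string, Map<number, UserRow[]>>([["id", idIndex]]) },
  "id"
)
withIndex.onActiveRow(2)
withIndex.onActiveRow(2)

// Without an index: the first call switches to eager mode and loads all rows.
const withoutIndex = makeLazyLoader(
  { rows: demoRows, indexes: new Map<string, Map<number, UserRow[]>>() },
  "id"
)
withoutIndex.onActiveRow(2)
withoutIndex.onActiveRow(1)
```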

Currently, inner joins loop over the smaller collection and look up matching rows in the index on the bigger collection. But that index may not exist, in which case we need to loop over the entire bigger collection. In that case, it may be more efficient to loop over the bigger collection and look for matching rows in the smaller collection (because that one might have an index on the join key). However, flipping the collections around would complicate the code quite a lot (because, again, this would need to happen at runtime), so we decided not to do it yet.
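
A small sketch of that inner-join strategy, including the unoptimized case where the bigger side has no index. `DemoSide` and `innerJoin` are hypothetical names, and the sides are not flipped when the index is missing, mirroring the limitation described above.

```typescript
type KeyedRow = { key: number; label: string }

// Hypothetical join input: rows plus an optional index on the join key.
interface DemoSide {
  rows: KeyedRow[]
  indexByKey?: Map<number, KeyedRow[]>
}

function innerJoin(left: DemoSide, right: DemoSide) {
  // The smaller side is the active one; the other side is looked up.
  const leftIsActive = left.rows.length <= right.rows.length
  const active = leftIsActive ? left : right
  const other = leftIsActive ? right : left
  const usedIndex = other.indexByKey !== undefined

  const pairs: Array<[KeyedRow, KeyedRow]> = []
  active.rows.forEach((a) => {
    const matches = other.indexByKey
      ? other.indexByKey.get(a.key) || []
      : other.rows.filter((b) => b.key === a.key) // no index: scan the bigger side
    matches.forEach((b) => {
      // Keep [left, right] order regardless of which side was active.
      if (leftIsActive) pairs.push([a, b])
      else pairs.push([b, a])
    })
  })
  return { pairs, usedIndex, activeSide: leftIsActive ? "left" : "right" }
}

const smallRows: KeyedRow[] = [
  { key: 1, label: "a" },
  { key: 2, label: "b" },
]
const bigRows: KeyedRow[] = [
  { key: 1, label: "x" },
  { key: 2, label: "y" },
  { key: 3, label: "z" },
]
const bigIndex = new Map<number, KeyedRow[]>()
bigRows.forEach((row) => bigIndex.set(row.key, [row]))

const indexedJoin = innerJoin({ rows: smallRows }, { rows: bigRows, indexByKey: bigIndex })
const scannedJoin = innerJoin({ rows: smallRows }, { rows: bigRows })
const swappedJoin = innerJoin({ rows: bigRows, indexByKey: bigIndex }, { rows: smallRows })
```

In `scannedJoin` the result is the same, but every active row triggers a scan of the bigger side.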

TODOs

  • Add unit tests to check that indexes are correctly used for left/right/inner joins
  • Automatically create indexes for join keys (in eager mode)
  • Always create an index on the PK of a collection (in eager and pk mode). Better as a follow-up PR.


changeset-bot bot commented Jul 30, 2025

🦋 Changeset detected

Latest commit: 7140573

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 9 packages
Name Type
@tanstack/db-ivm Patch
@tanstack/db Patch
@tanstack/electric-db-collection Patch
@tanstack/query-db-collection Patch
@tanstack/react-db Patch
@tanstack/solid-db Patch
@tanstack/svelte-db Patch
@tanstack/trailbase-db-collection Patch
@tanstack/vue-db Patch


@kevin-dp kevin-dp force-pushed the kevindp/join-with-index branch from ec65765 to 492f9fa Compare July 30, 2025 11:34

pkg-pr-new bot commented Jul 30, 2025

More templates

@tanstack/db

npm i https://pkg.pr.new/@tanstack/db@335

@tanstack/db-ivm

npm i https://pkg.pr.new/@tanstack/db-ivm@335

@tanstack/electric-db-collection

npm i https://pkg.pr.new/@tanstack/electric-db-collection@335

@tanstack/query-db-collection

npm i https://pkg.pr.new/@tanstack/query-db-collection@335

@tanstack/react-db

npm i https://pkg.pr.new/@tanstack/react-db@335

@tanstack/solid-db

npm i https://pkg.pr.new/@tanstack/solid-db@335

@tanstack/svelte-db

npm i https://pkg.pr.new/@tanstack/svelte-db@335

@tanstack/trailbase-db-collection

npm i https://pkg.pr.new/@tanstack/trailbase-db-collection@335

@tanstack/vue-db

npm i https://pkg.pr.new/@tanstack/vue-db@335

commit: 7140573

Contributor

github-actions bot commented Jul 30, 2025

Size Change: +1.66 kB (+2.84%)

Total Size: 60.1 kB

Filename Size Change
./packages/db/dist/esm/collection.js 9.86 kB +13 B (+0.13%)
./packages/db/dist/esm/errors.js 3 kB +27 B (+0.91%)
./packages/db/dist/esm/index.js 1.52 kB +11 B (+0.73%)
./packages/db/dist/esm/indexes/auto-index.js 718 B +29 B (+4.21%)
./packages/db/dist/esm/query/compiler/index.js 2.1 kB +366 B (+21.1%) 🚨
./packages/db/dist/esm/query/compiler/joins.js 2.31 kB +754 B (+48.33%) 🚨
./packages/db/dist/esm/query/live-query-collection.js 2.91 kB +463 B (+18.93%) ⚠️
ℹ️ View Unchanged
Filename Size
./packages/db/dist/esm/change-events.js 1.13 kB
./packages/db/dist/esm/deferred.js 230 B
./packages/db/dist/esm/indexes/base-index.js 605 B
./packages/db/dist/esm/indexes/btree-index.js 1.47 kB
./packages/db/dist/esm/indexes/lazy-index.js 1.25 kB
./packages/db/dist/esm/local-only.js 827 B
./packages/db/dist/esm/local-storage.js 2.03 kB
./packages/db/dist/esm/optimistic-action.js 294 B
./packages/db/dist/esm/proxy.js 4.19 kB
./packages/db/dist/esm/query/builder/functions.js 575 B
./packages/db/dist/esm/query/builder/index.js 3.79 kB
./packages/db/dist/esm/query/builder/ref-proxy.js 890 B
./packages/db/dist/esm/query/compiler/evaluators.js 1.48 kB
./packages/db/dist/esm/query/compiler/expressions.js 631 B
./packages/db/dist/esm/query/compiler/group-by.js 2.03 kB
./packages/db/dist/esm/query/compiler/order-by.js 677 B
./packages/db/dist/esm/query/compiler/select.js 655 B
./packages/db/dist/esm/query/ir.js 318 B
./packages/db/dist/esm/query/optimizer.js 2.44 kB
./packages/db/dist/esm/SortedMap.js 1.24 kB
./packages/db/dist/esm/transactions.js 2.29 kB
./packages/db/dist/esm/utils.js 419 B
./packages/db/dist/esm/utils/btree.js 5.93 kB
./packages/db/dist/esm/utils/comparison.js 718 B
./packages/db/dist/esm/utils/index-optimization.js 1.62 kB

compressed-size-action::db-package-size

Contributor

github-actions bot commented Jul 30, 2025

Size Change: 0 B

Total Size: 1.05 kB

ℹ️ View Unchanged
Filename Size
./packages/react-db/dist/esm/index.js 152 B
./packages/react-db/dist/esm/useLiveQuery.js 902 B

compressed-size-action::react-db-package-size

@kevin-dp kevin-dp requested a review from samwillis August 5, 2025 07:11
@kevin-dp kevin-dp force-pushed the kevindp/join-with-index branch from 2e81220 to e706f15 Compare August 5, 2025 07:32
Collaborator

@samwillis samwillis left a comment

We have central error classes in src/errors.ts; we should use one from there or create a new one.

Ignore, clicked the wrong button.

Collaborator

@samwillis samwillis left a comment

This is all absolutely awesome!
Approving it, but it may be worth swapping out the errors for the central errors and using the debug package for logging.


inner(collection: MultiSet<T>): MultiSet<T> {
  return collection.map((data) => {
    this.#f(data)
Collaborator

It could be useful in future to pass the multiplicity to the callback so that it's aware if it's an insert/delete. Not important for now.

Contributor Author

We can add this later when we need it :-)
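
For reference, passing the multiplicity through could look like the sketch below. It models a multiset as an array of `[value, multiplicity]` pairs; `tapWithMultiplicity` is a hypothetical helper and not the current tap operator's signature.

```typescript
// Hypothetical variant of tap whose callback also receives the multiplicity.
// A positive multiplicity is treated as an insert, a negative one as a delete.
function tapWithMultiplicity<T>(
  entries: Array<[T, number]>,
  f: (value: T, multiplicity: number) => void
): Array<[T, number]> {
  entries.forEach(([value, multiplicity]) => f(value, multiplicity))
  return entries // the entries pass through unchanged
}

const demoEntries: Array<[string, number]> = [
  ["a", 1],
  ["b", -1],
  ["c", 2],
]

const insertedValues: string[] = []
const deletedValues: string[] = []

const passedThrough = tapWithMultiplicity(demoEntries, (value, multiplicity) => {
  if (multiplicity > 0) insertedValues.push(value)
  else if (multiplicity < 0) deletedValues.push(value)
})
```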

    indexType: BTreeIndex,
  })
} catch (error) {
  console.warn(`Failed to create auto-index for field "${fieldName}":`, error)
Collaborator

We've used the debug package elsewhere for logging. We should maybe use it here.

Contributor Author

I didn't actually change this. I took this piece of code from ensureIndexForExpression and moved it here. We could use debugLog but that would only log it in debug mode. I think this warning is useful also in non-debug mode to warn you that for some reason the index could not be created and thus the queries might be less efficient.

@kevin-dp kevin-dp force-pushed the kevindp/join-with-index branch from 6065ac1 to 7140573 Compare August 14, 2025 07:57
Collaborator

@samwillis samwillis left a comment

This is great, let's get it merged!

The note I've added is for later.

const activePipelineWithLoading: IStreamBuilder<
  [key: unknown, [originalKey: string, namespacedRow: NamespacedRow]]
> = activePipeline.pipe(
  tap(([joinKey, _]) => {
Collaborator

Not something to change right now, but with the current version of tap we process and ask for each joined key one at a time. If in future we want to batch load via sync, we'll need to try to reassemble a batch to ask for.

The tap operator iterates over the items in the multiset, calling this function here, and then for each row asks for the joined items to be injected. The alternative would be to do so at the multiset level: we have a batch of items from the left, so we ask for a batch from the right to be injected all at once. These batches can then naturally be pushed back down to the sync layer and asked for from there.

But for later!
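
A rough sketch of what a multiset-level variant might look like, assuming entries of the form `[[joinKey, row], multiplicity]`. `tapBatch` and `loadKeys` are hypothetical names; the point is only that the callback sees one batch of distinct join keys instead of one key per row.

```typescript
// Hypothetical batch-level tap: collect the distinct join keys of the whole batch
// and request them in a single call instead of one call per row.
function tapBatch<K, V>(
  entries: Array<[[K, V], number]>,
  loadKeys: (keys: K[]) => void
): Array<[[K, V], number]> {
  const distinctKeys = new Set<K>()
  entries.forEach((entry) => distinctKeys.add(entry[0][0]))

  const keys: K[] = []
  distinctKeys.forEach((key) => keys.push(key))

  // One request per batch; this is the unit that could be pushed down to the sync layer.
  if (keys.length > 0) loadKeys(keys)
  return entries // the entries pass through unchanged
}

// A batch of comments keyed by user_id (the join key).
const demoBatch: Array<[[number, string], number]> = [
  [[1, "comment-1"], 1],
  [[2, "comment-2"], 1],
  [[1, "comment-4"], 1],
]

const requestedBatches: number[][] = []
tapBatch(demoBatch, (keys) => requestedBatches.push(keys))
```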

@kevin-dp kevin-dp merged commit 68538b4 into main Aug 18, 2025
6 checks passed
@kevin-dp kevin-dp deleted the kevindp/join-with-index branch August 18, 2025 07:38
@github-actions github-actions bot mentioned this pull request Aug 18, 2025