WIP: Rewrite NestedLoopJoin to limit intermediate size (up to 3.2X faster) #16889
Conversation
I'll need some time to read through the code, but reduced memory usage is also impressive 👍
I ran
Thanks for the profiler 👍🏼 I'll review that soon. I think Q4 and Q10 should not consume that much memory 😅 -- q10:
```sql
SELECT *
FROM range(30000) AS t1
FULL JOIN range(30000) AS t2
ON (t1.value + t2.value) % 10 <> 0;
```
It is supposed to only buffer (30k x 8 bytes), so 20G RSS is not expected. UPDATE: maybe it's just because the nlj benchmark is buffering all the results (since we're only measuring the memory usage of the NLJ operator, the downstream operator is free to throw away results immediately). I'll fix and confirm it.
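(A rough size check, with illustrative arithmetic assuming 8-byte integer values and two 8-byte columns per output row: the buffered left side is only 30,000 × 8 B ≈ 240 KB, while the join output is roughly 30,000 × 30,000 × 9/10 ≈ 810M rows, since the filter passes 9 of every 10 combinations — on the order of 13 GB. So a benchmark that buffers all results, rather than the operator's own state, would plausibly explain a ~20 GB RSS.)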
@ding-young For q10, the maxRSS I got using … The code for …
Run extended tests
Re: #16889 (comment) I think the reason is:
I didn't expect this PR to have such a big performance improvement. As with #16443, I still don't understand where the performance improvement comes from.
```rust
/// include decoding Parquet files), so it tries to buffer more left
/// batches at once to minimize the scan passes.
/// 2. Read right side batch one at a time. For each iteration, it only
/// evaluates the join filter on (1-left-row x right-batch), and puts the
```
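To make the quoted doc comment concrete, here is a minimal, self-contained sketch of that loop structure in plain Rust. `filter_matches`, `probe_right_batch`, and the slice-based types are illustrative stand-ins, not DataFusion's actual stream code:

```rust
/// Illustrative join filter: keep pairs where the predicate holds
/// (here (l + r) % 10 != 0, as in the q10 example above).
fn filter_matches(l: i64, r: i64) -> bool {
    (l + r) % 10 != 0
}

/// Probe one streamed right batch against the fully buffered left side.
/// The filter is only ever evaluated on (1-left-row x right-batch), so
/// each emitted chunk holds at most `right_batch.len()` pairs; the
/// intermediate never grows to left_len * right_len.
fn probe_right_batch(
    left_buffer: &[i64],
    right_batch: &[i64],
    mut emit: impl FnMut(Vec<(i64, i64)>),
) {
    for &l in left_buffer {
        let chunk: Vec<(i64, i64)> = right_batch
            .iter()
            .copied()
            .filter(|&r| filter_matches(l, r))
            .map(|r| (l, r))
            .collect();
        if !chunk.is_empty() {
            emit(chunk);
        }
    }
}

fn main() {
    let left_buffer = [0_i64, 1, 2];
    // The right side arrives one batch at a time.
    for right_batch in [&[10_i64, 11, 12][..], &[13, 14][..]] {
        probe_right_batch(&left_buffer, right_batch, |chunk| {
            println!("{chunk:?}");
        });
    }
}
```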
With this join method, it seems we can't preserve the order of the right table.
I think it's okay to relax that property. The new implementation can also preserve left order in some cases.
> The new implementation can also preserve left order in some cases

Only when `right_table.num_rows() < batch_size`?
For join types like inner join, we are no longer preserving the order of the right table. I'm not sure if this is a breaking change. It's possible that some users rely on this behavior, especially those who create an `ExecutionPlan` directly, bypassing the DataFusion optimizer. (Are there any users who do this?)
Yes, we should mark it as a breaking change and add it to the upgrade guide.
If anyone is relying heavily on this ordering property for optimization, we can keep the old impl (maybe renamed to `RightOrderPreservingNLJ`) and use a configuration option to control it.
I’m aware of two key differences:
However, I was just hoping to clean up the codebase a bit; I also didn't expect it to be an easy 2X.
I can't use `perf` to analyze cache misses in my Hyper-V VM. My speculation is that the previous version suffered from memory-management overhead due to needing to allocate very large blocks of memory, as it materializes the huge Cartesian-product intermediate before filtering.
I tried to eliminate redundant batch transformations, and the speedup goes from 2X to 3X! Through the flamegraph, I can see that over 50% of the time is now spent evaluating the join filter. After a few more chores, this should be ready for review:
Moved to the final PR #16996, so closing this one.

Which issue does this PR close?
Rationale for this change
Now the NLJ execution logic is:
This approach includes an extra step that computes a very large Cartesian product, which can be both memory-consuming and inefficient. A better approach is to perform the join directly on small intermediates.
What changes are included in this PR?
Summary
This PR introduces a somewhat unusual change: it rewrites the Nested Loop Join operator from scratch.
Motivation
The original implementation performs a Cartesian product of (all-left-batches x right-batch), materializes that intermediate result for predicate evaluation, and then materializes the (potentially very large) final result all at once. This design is inherently inefficient, and although many patches have attempted to alleviate the problem, the fundamental issue remains.
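For a concrete sense of scale (illustrative arithmetic, assuming DataFusion's default `batch_size` of 8192 and the 30k-row left side from the q10 example above): the original design materializes a 30,000 × 8,192 ≈ 246M-row intermediate for every right batch before the filter runs, whereas a design that evaluates the filter per left row keeps the intermediate at no more than one batch (≤ 8,192 rows) at a time.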
A key challenge is that the original design and the ideal design (i.e., one that produces small intermediates during execution) are fundamentally different. As a result, it's practically impossible to make small incremental changes that fully address the inefficiency. These patches may also increase code complexity, making long-term maintenance more difficult.
Example of Prior Work
Here's a recent example of a small patch intended to improve the situation:
#16443
Even with careful engineering, I still feel the entropy in the code increases.
Why a Rewrite?
Since NLJ is a relatively straightforward operator, a full rewrite seemed worthwhile. This allows for a clean, simplified design focused on current goals—performance and memory efficiency—without being constrained by the legacy implementation.
Current Status and Outlook
The rewrite went more smoothly than I expected: the current minimal implementation, which reuses existing utilities (despite some inefficiencies), is already up to 2× faster and likely significantly more memory-efficient (although memory usage has not yet been measured).
I also noticed several more optimization opportunities to make it even faster. The implementation now passes the regular tests, but several `join_fuzz` tests are still failing; after fixing them we can iterate on more optimizations.
Benchmark Result
The NLJ micro-bench: #16819
Implementation
The design/implementation doc can be found in the source.
Next Steps
Given there are still many opportunities to make it even faster, I plan to include several major optimizations in the current PR to save some review bandwidth (the core logic stays a contiguous few hundred LoC, even with more optimizations included).
- [ ] Fix the remaining `join_fuzz` tests that fail
Potential Optimizations
- [x] Eliminate redundant `indices <--> batch` conversions for the buffered right batch. Update: this change improved the speedup from 2X to 3X!
For the following ideas, I've checked the profiling flamegraph; they're not on the critical path.
- [ ] Don't concat input batches into a single one in `JoinLeftData`
- [ ] Batch bitmap ops. Now we maintain a concurrent bitmap for matched rows in the left buffer; every once in a while each partition grabs the lock, sets matched bits one by one, then releases the lock. A more efficient way: each partition first sets a local bitmap (like a 512B chunk), then applies all of it per lock acquisition (if matches are rare, maybe we can skip this step). See the sketch below.
- [ ] Arrow: let `BatchCoalescer` support directly appending the joined result into the output buffer, without extra copies. Potential new interface: `push_joined_batch(left_batch, right_batch, left_indices, right_indices)`

Would love to hear your thoughts on this @UBarney @jonathanc-n @ding-young, and I'm looking forward to collaborating to make NLJ as fast as possible.
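A minimal sketch of the batched-bitmap idea above, in plain std Rust. `SharedBitmap`, `merge`, and the whole-range local bitmap are illustrative stand-ins (the list item suggests ~512 B local chunks), not DataFusion's actual types:

```rust
use std::sync::Mutex;

/// Shared matched-row bitmap for the buffered left side (one bit per row),
/// protected by a single lock.
struct SharedBitmap {
    words: Mutex<Vec<u64>>,
}

impl SharedBitmap {
    fn new(num_rows: usize) -> Self {
        Self {
            words: Mutex::new(vec![0; (num_rows + 63) / 64]),
        }
    }

    /// Publish a whole local bitmap with one lock acquisition by OR-ing
    /// words, instead of taking the lock and setting bits one by one.
    fn merge(&self, local: &[u64]) {
        let mut words = self.words.lock().unwrap();
        for (shared, &l) in words.iter_mut().zip(local) {
            *shared |= l;
        }
    }
}

fn main() {
    let shared = SharedBitmap::new(256);

    // Each partition first records its matches in a local bitmap...
    let mut local = vec![0u64; 4]; // 256 bits of partition-local state
    for matched_row in [3usize, 64, 120, 255] {
        local[matched_row / 64] |= 1 << (matched_row % 64);
    }

    // ...then applies them all under a single lock acquisition.
    shared.merge(&local);
    println!("{:x?}", shared.words.lock().unwrap());
}
```

If matches are rare, a partition could also check whether its local bitmap is all-zero and skip taking the lock entirely, matching the skip idea in the list above.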
Are these changes tested?
Are there any user-facing changes?