limit intermediate batch size in nested_loop_join #16443
Conversation
Force-pushed from ec04210 to 00140d3
Benchmark: I use this script to do the benchmark. I'll find out why there is a performance improvement. TPC-H benchmark result:

Memory usage: I use this script to get SQL memory usage.
| let enforce_batch_size_in_joins = | ||
| context.session_config().enforce_batch_size_in_joins(); |
We can remove the enforce_batch_size_in_joins configuration for nested loop join since
- The new implementation in this PR achieves both improved performance and lower memory usage. This surpasses the previous state where
`enforce_batch_size_in_joins` was used to toggle between better performance (false) and lower memory usage (true).
datafusion/datafusion/common/src/config.rs
Lines 404 to 408 in e6df27c
```rust
/// Should DataFusion enforce batch size in joins or not. By default,
/// DataFusion will not enforce batch size in joins. Enforcing batch size
/// in joins can reduce memory usage when joining large
/// tables with a highly-selective join filter, but is also slightly slower.
pub enforce_batch_size_in_joins: bool, default = false
```
- Verification confirms that results remain correct without this configuration.
Those benchmark helper functions are really cool, I'll see if I can take a look today.
From the flame graph (when executing the SQL …):
But I still can't explain why these two functions performed better. 😂
When you are running the benchmarks, do they stay consistent?

Yes, the benchmark results are almost consistent. I ran the benchmarks a few minutes ago on commit. It's worth noting that the join type in query 3 (q3) was modified from a right join to a left join.
jonathanc-n left a comment
I'm a bit confused where the performance increase is coming from as well. I noted down some nits; I'll take a better look tonight.
| datafusion_common::_internal_datafusion_err!( | ||
| "should have join_result_status" |
I think we can change this to be more verbose, and we can use `internal_err!`.
| datafusion_common::_internal_datafusion_err!( | |
| "should have join_result_status" | |
| internal_err!( | |
| "get_next_join_result called without initializing join_result_status" |
> we can use internal_err!

`internal_err!` returns a `Result`, but `ok_or_else` requires a closure that returns an error.
Should we import _internal_datafusion_err! so we do not need the path qualifier? small nit, just looks cleaner that way
@korowa do you by any chance have time to review this PR?
| // - probe_indices: row indices from probe-side table (right table) | ||
| // - processed_count: number of index pairs already processed into output batches | ||
| // We have completed join result for indices [0..processed_count) | ||
| join_result_status: Option<( |
It may be better to create a separate struct for ProcessProbeBatch state, extended with all attributes required to track join progress (example for hash join)
In a hash join, ProcessProbeBatch is solely responsible for tracking the join progress on the probe side. In contrast, join_result_status serves a broader purpose: it tracks progress for both the probe side and for the unmatched rows from the build side.
I still think we should make this into a struct
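For illustration, a stdlib-only sketch of what such a struct could look like (field and method names are hypothetical; the real code would hold Arrow `PrimitiveArray<UInt64Type>` / `PrimitiveArray<UInt32Type>` rather than `Vec`s):

```rust
/// Hypothetical replacement for the (build_indices, probe_indices,
/// processed_count) tuple; tracks incremental join-result production.
#[derive(Debug)]
struct JoinResultStatus {
    build_indices: Vec<u64>, // row indices from the build (left) side
    probe_indices: Vec<u32>, // row indices from the probe (right) side
    processed_count: usize,  // pairs in [0..processed_count) already emitted
}

impl JoinResultStatus {
    /// Return the next chunk of at most `max_size` unprocessed index pairs,
    /// or None once every pair has been turned into output batches.
    fn next_chunk(&mut self, max_size: usize) -> Option<(&[u64], &[u32])> {
        if self.processed_count >= self.build_indices.len() {
            return None;
        }
        let start = self.processed_count;
        let end = (start + max_size).min(self.build_indices.len());
        self.processed_count = end;
        Some((&self.build_indices[start..end], &self.probe_indices[start..end]))
    }
}
```

A struct like this also gives the progress-tracking logic a single place to live, instead of destructuring a three-element tuple at each call site.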
| fn new_task_ctx() -> Arc<TaskContext> { | ||
| let base = TaskContext::default(); | ||
| // limit max size of intermediate batch used in nlj to 1 | ||
| let cfg = base.session_config().clone().with_batch_size(1); |
Could you, please, parameterize batch_size value and run all unit tests for various batch sizes (e.g. 1, 2, 4, 10, 8192)?
DONE
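As a rough sketch of that parameterization (a plain loop, no extra test framework; `run_join_with_batch_size` is a hypothetical stand-in for the real test body):

```rust
/// Hypothetical stand-in: runs the join under test with the given batch size
/// and returns how many output batches were produced.
fn run_join_with_batch_size(batch_size: usize) -> usize {
    let total_output_rows: usize = 100; // stand-in for the known result size
    total_output_rows.div_ceil(batch_size)
}

/// Run the same correctness check for every batch size of interest.
fn check_all_batch_sizes() {
    for batch_size in [1usize, 2, 4, 10, 8192] {
        let batches = run_join_with_batch_size(batch_size);
        assert!(batches >= 1, "batch_size {batch_size} produced no batches");
    }
}
```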
| .expression() | ||
| .evaluate(&intermediate_batch)? | ||
| .into_array(intermediate_batch.num_rows())?; | ||
| let filter_result = if let Some(max_size) = max_intermediate_size { |
Why batch_size enforcement should take place during filtering? Can we enforce it before filtering, while calculating build/probe_indices args for this function (in NestedLoopJoinExec::build_join_indices)?
> Can we enforce it before filtering, while calculating build/probe_indices args for this function (in NestedLoopJoinExec::build_join_indices)?

I'll do it in the next PR. #16364 (comment)

> Why batch_size enforcement should take place during filtering

- Although the "Process the Cartesian Product Incrementally" step is designed to limit the input size for `apply_join_filter_to_indices`, the size of a single batch can still be very large (up to `left_table.num_rows() * N`). When the left table itself is large, this can lead to the creation of a large `record_batch`.
- Benchmarks indicate that executing joins is faster with this enforcement in place. limit intermediate batch size in nested_loop_join #16443 (comment)
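The enforcement idea can be sketched in plain Rust (a simplified model: the closure stands in for evaluating the real `JoinFilter`, and each slice stands in for a bounded intermediate `RecordBatch`):

```rust
/// Apply a per-pair filter in chunks of at most `max_size` pairs, so the
/// intermediate data materialized at any one time stays bounded.
fn apply_filter_chunked(
    build_indices: &[u64],
    probe_indices: &[u32],
    max_size: usize,
    filter: impl Fn(u64, u32) -> bool,
) -> (Vec<u64>, Vec<u32>) {
    let mut out_build = Vec::new();
    let mut out_probe = Vec::new();
    for chunk_start in (0..build_indices.len()).step_by(max_size) {
        let end = (chunk_start + max_size).min(build_indices.len());
        // In the real code this slice becomes a small intermediate
        // RecordBatch whose row count never exceeds max_size.
        for i in chunk_start..end {
            if filter(build_indices[i], probe_indices[i]) {
                out_build.push(build_indices[i]);
                out_probe.push(probe_indices[i]);
            }
        }
    }
    (out_build, out_probe)
}
```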
Shouldn't we have this done in this pull request? I think it would make more sense (just moving this logic to build_join_indices).
Yes but I believe this removes the purpose of the pull request if we are building the entire amount of indices? I may be missing something though
> I believe this removes the purpose of the pull request if we are building the entire amount of indices

This PR only limits the size of the intermediate `record_batch`. The Cartesian product of the entire `left_table` and `right_batch` is still generated at once (this will be limited in a subsequent PR).
> I believe this removes the purpose of the pull request if we are building the entire amount of indices

This PR only limits the size of the intermediate `record_batch`. The Cartesian product of the entire `left_table` and `right_batch` is still generated at once (this will be limited in a subsequent PR).
Additionally, making the Cartesian product step incremental likely requires a larger refactor (compared to this PR), so it may be better suited for a separate PR.
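For a sense of what that follow-up could involve, here is a stdlib-only sketch (hypothetical names, not the actual planned implementation) of producing the Cartesian-product index pairs in bounded chunks instead of all at once:

```rust
/// Produce one chunk of the left_rows x right_rows Cartesian product,
/// starting at flat offset `start` and containing at most `max_size` pairs.
fn cartesian_chunk(
    left_rows: usize,
    right_rows: usize,
    start: usize,
    max_size: usize,
) -> (Vec<u64>, Vec<u32>) {
    let total = left_rows * right_rows;
    let end = (start + max_size).min(total);
    let mut build = Vec::with_capacity(end.saturating_sub(start));
    let mut probe = Vec::with_capacity(end.saturating_sub(start));
    for flat in start..end {
        build.push((flat / right_rows) as u64); // build-side (left) row
        probe.push((flat % right_rows) as u32); // probe-side (right) row
    }
    (build, probe)
}
```

The caller would advance `start` by the chunk length after each call until it reaches `left_rows * right_rows`.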
| ); | ||
| timer.done(); | ||
| if self.join_result_status.is_none() { | ||
| self.join_metrics.input_batches.add(1); |
fetch_probe_batch seems to be a better fit for tracking these two metrics (input_batches/input_rows).
| /// Current state of the stream | ||
| state: NestedLoopJoinStreamState, | ||
| #[allow(dead_code)] | ||
| // TODO: remove this field ?? |
Since there is no more need in splitting output batch, and the output is generating progressively, I suppose it can be removed.
DONE
| let current_start = *start; | ||
| if left_indices.is_empty() && right_indices.is_empty() && current_start == 0 { |
If both index arrays are empty, maybe it is ok to simply return None here, instead of building batch and setting start to 1?
That was my initial approach. However, it resulted in an output with 0 rows and 0 columns, which seems to be incorrect and caused the test to fail.
You can see the failed CI run here:
https://github.com/apache/datafusion/actions/runs/15734253347/job/44343070926?pr=16443#step:5:1208
Can we build and return an empty batch instead of calling build_batch_from_indices?
I don't get this status.processed_count = 1 logic either, perhaps you can add a quick comment to explain it?
> That was my initial approach. However, it resulted in an output with 0 rows and 0 columns, which seems to be incorrect and caused the test to fail.
> You can see the failed CI run here:
> https://github.com/apache/datafusion/actions/runs/15734253347/job/44343070926?pr=16443#step:5:1208
Now I understand why this test passed after I changed the return value from None to RecordBatch::new_empty.
In this unit test, the join result is converted to a string and then compared with the expected output. When converting to a string, it retrieves the schema from the record_batch (as the passed schema_opt is None).
https://github.com/apache/arrow-rs/blob/7b219f98c25fcd318a0c207f51a41398d1b23724/arrow-cast/src/pretty.rs#L183-L187
When executed in the CLI, there's no issue even if the Nested Loop Join (NLJ) returns 0 record batches.
> select t1.value from range(1) t1 join range(1) t2 on t1.value + t2.value >100;
+-------+
| value |
+-------+
+-------+
0 row(s) fetched.
From a compatibility standpoint, I think it's better to keep it consistent with the previous behavior.
| if self.join_result_status.is_none() { | ||
| self.join_metrics.input_batches.add(1); | ||
| self.join_metrics.input_rows.add(batch.num_rows()); | ||
| let _timer = self.join_metrics.join_time.timer(); |
Maybe only one timer covering the whole function will be enough (instead of this timer, and the one on L1023)?
> Maybe only one timer covering the whole function will be enough (instead of this timer, and the one on L1023)?

That was my initial approach, but it caused a borrow checker error (E0502).
The issue is a conflict between an immutable borrow for the timer (self.join_metrics.join_time) and a mutable borrow required by self.get_next_join_result.
We can create the timer inside get_next_join_result itself, rather than in the caller.
I think you can clone it, the underlying structure of this metric is Arc<AtomicUsize>, so the cloned version points to the same counter.
Done. I've updated the code to clone join_metrics.join_time.
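The pattern can be illustrated with a simplified stand-in for DataFusion's `metrics::Time` (the real type differs; this only shows why cloning an `Arc`-backed metric sidesteps the borrow conflict):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Simplified stand-in for a shared metric: clones share the same counter.
#[derive(Clone, Default)]
struct Time(Arc<AtomicUsize>);

impl Time {
    fn add(&self, nanos: usize) {
        self.0.fetch_add(nanos, Ordering::Relaxed);
    }
    fn value(&self) -> usize {
        self.0.load(Ordering::Relaxed)
    }
}

struct Stream {
    join_time: Time,
}

impl Stream {
    fn poll(&mut self) {
        // Clone the handle first: the clone is independent of `self`, so the
        // `&mut self` call below no longer conflicts with the timer borrow.
        let timer = self.join_time.clone();
        self.mutate(); // stands in for self.get_next_join_result(...)
        timer.add(42); // recorded on the same underlying counter
    }
    fn mutate(&mut self) {}
}
```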
Pull Request Overview
This PR refactors join logic to limit the intermediate batch size during filtering and to yield partial batches on demand.
- Updated helper functions and state management to support an optional maximum intermediate batch size
- Refactored nested loop join execution logic and test contexts to integrate the new batching mechanism
- Propagated changes across related join implementations (symmetric hash join and hash join)
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| datafusion/physical-plan/src/joins/utils.rs | Adds an import of std::cmp::min and updates apply_join_filter_to_indices to support an optional max intermediate batch size. |
| datafusion/physical-plan/src/joins/symmetric_hash_join.rs | Passes None for the new intermediate batch size parameter to maintain compatibility. |
| datafusion/physical-plan/src/joins/nested_loop_join.rs | Introduces incremental join result production, state transitions, and test context updates for limiting batch sizes. |
| datafusion/physical-plan/src/joins/hash_join.rs | Updated join filter call to pass None for the intermediate batch size parameter. |
Comments suppressed due to low confidence (1)
datafusion/physical-plan/src/joins/nested_loop_join.rs:1068
- The error message uses 'OutputUnmatchBatch' which is inconsistent with the enum variant named 'OutputUnmatchedBuildRows'. Please update the error message for consistency.
return internal_err!(
| // TODO: remove this field ?? | ||
| /// Transforms the output batch before returning. | ||
| batch_transformer: T, | ||
| /// Result of the left data future | ||
| left_data: Option<Arc<JoinLeftData>>, | ||
| // Tracks progress when building join result batches incrementally | ||
| // Contains (build_indices, probe_indices, processed_count) where: | ||
| // - build_indices: row indices from build-side table (left table) | ||
| // - probe_indices: row indices from probe-side table (right table) | ||
| // - processed_count: number of index pairs already processed into output batches | ||
| // We have completed join result for indices [0..processed_count) | ||
| join_result_status: Option<( | ||
| PrimitiveArray<UInt64Type>, | ||
| PrimitiveArray<UInt32Type>, | ||
| usize, | ||
| )>, | ||
Copilot AI, Jun 29, 2025
[nitpick] The 'join_result_status' field contains a TODO comment indicating uncertainty. Consider either removing this field if it is no longer necessary or clarifying its purpose.
| // TODO: remove this field ?? | |
| /// Transforms the output batch before returning. | |
| batch_transformer: T, | |
| /// Result of the left data future | |
| left_data: Option<Arc<JoinLeftData>>, | |
| // Tracks progress when building join result batches incrementally | |
| // Contains (build_indices, probe_indices, processed_count) where: | |
| // - build_indices: row indices from build-side table (left table) | |
| // - probe_indices: row indices from probe-side table (right table) | |
| // - processed_count: number of index pairs already processed into output batches | |
| // We have completed join result for indices [0..processed_count) | |
| join_result_status: Option<( | |
| PrimitiveArray<UInt64Type>, | |
| PrimitiveArray<UInt32Type>, | |
| usize, | |
| )>, | |
| /// Transforms the output batch before returning. | |
| batch_transformer: T, | |
| /// Result of the left data future | |
| left_data: Option<Arc<JoinLeftData>>, |
jonathanc-n left a comment
Thanks @UBarney, just some comments
| .expression() | ||
| .evaluate(&intermediate_batch)? | ||
| .into_array(intermediate_batch.num_rows())?; | ||
| let filter_result = if let Some(max_size) = max_intermediate_size { |
Shouldn't we have this done in this pull request? I think it would make more sense (just moving this logic to build_join_indices).
| fn build_unmatched_output( | ||
| &mut self, | ||
| ) -> Result<StatefulStreamResult<Option<RecordBatch>>> { | ||
| if matches!( |
I do not think we need this check, it is already guaranteed that this function will only run when we have the OutputUnmatchedBuildRows state
Yes, we don't need this check
| datafusion_common::_internal_datafusion_err!( | ||
| "should have join_result_status" |
Should we import _internal_datafusion_err! so we do not need the path qualifier? small nit, just looks cleaner that way
| let current_start = *start; | ||
| if left_indices.is_empty() && right_indices.is_empty() && current_start == 0 { |
Can we build and return an empty batch instead of calling build_batch_from_indices?
| // - probe_indices: row indices from probe-side table (right table) | ||
| // - processed_count: number of index pairs already processed into output batches | ||
| // We have completed join result for indices [0..processed_count) | ||
| join_result_status: Option<( |
I still think we should make this into a struct
| NestedLoopJoinStreamState::ProcessProbeBatch(record_batch) => record_batch, | ||
| NestedLoopJoinStreamState::OutputUnmatchedBuildRows(record_batch) => { |
| NestedLoopJoinStreamState::ProcessProbeBatch(record_batch) => record_batch, | |
| NestedLoopJoinStreamState::OutputUnmatchedBuildRows(record_batch) => { | |
| NestedLoopJoinStreamState::ProcessProbeBatch(record_batch) | NestedLoopJoinStreamState::OutputUnmatchedBuildRows(record_batch) => { |
| } | ||
| _ => { | ||
| return internal_err!( | ||
| "state should be ProcessProbeBatch or OutputUnmatchBatch" |
| "state should be ProcessProbeBatch or OutputUnmatchBatch" | |
| "State should be ProcessProbeBatch or OutputUnmatchedBuildRows" |
| } | ||
| } | ||
| } else { | ||
| internal_err!("state should be OutputUnmatchBatch") |
| internal_err!("state should be OutputUnmatchBatch") | |
| internal_err!("State should be OutputUnmatchedBuildRows") |
Thanks @jonathanc-n for reviewing. I have addressed all of your comments.
jonathanc-n left a comment
Thanks @UBarney, this looks good to me! Looking forward to reviewing the follow-up PRs.
@korowa @2010YOUY01 Are you able to take a quick look? Thanks!
Thank you so much for this optimization. It's on my list, but due to the complexity of the join operator, I need to find a time when my mind is clear to review it — which is challenging, as I often feel slow recently 😇 BTW I think the micro-benchmarks for NLJ are quite valuable; it would be great to see them in df's benchmark suite. The same goes for the memory profiling functionality in the benchmark scripts.
| filter.column_indices(), | ||
| build_side, | ||
| )?; | ||
| let filter_result = filter |
Perhaps the performance improvement comes from the fact that the data is still in cache when doing the filtering step in the subsequent operation?
When doing it on the entire array, it will be wiped out from the cache if it is large enough.
This version boosts performance with a much higher IPC of 1.75 (vs 0.87), achieved by dramatically cutting LLC misses from 109M to 25M, even with a similar L1 miss rate.
Details
sudo perf stat -e cycles,instructions,L1-dcache-load-misses,L1-dcache-loads,LLC-loads,LLC-load-misses ./limit_batch_size@36991aca -c 'select t1.value from range(100) t1 join range(819200) t2 on (t1.value + t2.value) % 1000 = 0; ' --maxrows 1
sudo perf stat -e cycles,instructions,L1-dcache-load-misses,L1-dcache-loads,LLC-loads,LLC-load-misses ./join_base@6965fd32 -c 'select t1.value from range(100) t1 join range(819200) t2 on (t1.value + t2.value) % 1000 = 0; ' --maxrows 1
DataFusion CLI v48.0.0
+-------+
| value |
+-------+
| 40 |
| . |
| . |
| . |
+-------+
81901 row(s) fetched. (First 1 displayed. Use --maxrows to adjust)
Elapsed 0.067 seconds.
Performance counter stats for './limit_batch_size@36991aca -c select t1.value from range(100) t1 join range(819200) t2 on (t1.value + t2.value) % 1000 = 0; --maxrows 1':
1901401922 cycles
3325776634 instructions # 1.75 insn per cycle
32419611 L1-dcache-load-misses # 5.27% of all L1-dcache accesses
614645891 L1-dcache-loads
<not supported> LLC-loads
<not supported> LLC-load-misses
0.073244586 seconds time elapsed
0.448238000 seconds user
0.044823000 seconds sys
DataFusion CLI v48.0.0
+-------+
| value |
+-------+
| 99 |
| . |
| . |
| . |
+-------+
81901 row(s) fetched. (First 1 displayed. Use --maxrows to adjust)
Elapsed 0.131 seconds.
Performance counter stats for './join_base@6965fd32 -c select t1.value from range(100) t1 join range(819200) t2 on (t1.value + t2.value) % 1000 = 0; --maxrows 1':
3696196789 cycles
3201132508 instructions # 0.87 insn per cycle
21781750 L1-dcache-load-misses # 3.68% of all L1-dcache accesses
592094439 L1-dcache-loads
<not supported> LLC-loads
<not supported> LLC-load-misses
0.139081088 seconds time elapsed
0.835575000 seconds user
0.111277000 seconds sys
(venv) √ devhomeinsp ~/c/d/t/release > valgrind --cache-sim=yes --tool=cachegrind ./join_base@6965fd32 -c 'select t1.value from range(8192) t1 join range(8192) t2 on t1.value + t2.value > t1.value * t2.value;' --maxrows 1
==94454== Cachegrind, a high-precision tracing profiler
==94454== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==94454== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==94454== Command: ./join_base@6965fd32 -c select\ t1.value\ from\ range(8192)\ t1\ join\ range(8192)\ t2\ on\ t1.value\ +\ t2.value\ \>\ t1.value\ *\ t2.value; --maxrows 1
==94454==
--94454-- warning: L3 cache found, using its data for the LL simulation.
--94454-- warning: specified LL cache: line_size 64 assoc 12 total_size 31,457,280
--94454-- warning: simulated LL cache: line_size 64 assoc 15 total_size 31,457,280
DataFusion CLI v48.0.0
+-------+
| value |
+-------+
| 1 |
| . |
| . |
| . |
+-------+
32763 row(s) fetched. (First 1 displayed. Use --maxrows to adjust)
Elapsed 7.948 seconds.
==94454==
==94454== I refs: 3,555,994,712
==94454== I1 misses: 66,444
==94454== LLi misses: 26,028
==94454== I1 miss rate: 0.00%
==94454== LLi miss rate: 0.00%
==94454==
==94454== D refs: 813,250,121 (475,263,085 rd + 337,987,036 wr)
==94454== D1 misses: 118,285,307 ( 71,937,864 rd + 46,347,443 wr)
==94454== LLd misses: 109,455,796 ( 63,122,399 rd + 46,333,397 wr)
==94454== D1 miss rate: 14.5% ( 15.1% + 13.7% )
==94454== LLd miss rate: 13.5% ( 13.3% + 13.7% )
==94454==
==94454== LL refs: 118,351,751 ( 72,004,308 rd + 46,347,443 wr)
==94454== LL misses: 109,481,824 ( 63,148,427 rd + 46,333,397 wr)
==94454== LL miss rate: 2.5% ( 1.6% + 13.7% )
(venv) √ devhomeinsp ~/c/d/t/release > valgrind --cache-sim=yes --tool=cachegrind ./limit_batch_size@36991aca -c 'select t1.value from range(8192) t1 join range(8192) t2 on t1.value + t2.value > t1.value * t2.value;' --maxrows 1
==96086== Cachegrind, a high-precision tracing profiler
==96086== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==96086== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==96086== Command: ./limit_batch_size@36991aca -c select\ t1.value\ from\ range(8192)\ t1\ join\ range(8192)\ t2\ on\ t1.value\ +\ t2.value\ \>\ t1.value\ *\ t2.value; --maxrows 1
==96086==
--96086-- warning: L3 cache found, using its data for the LL simulation.
--96086-- warning: specified LL cache: line_size 64 assoc 12 total_size 31,457,280
--96086-- warning: simulated LL cache: line_size 64 assoc 15 total_size 31,457,280
DataFusion CLI v48.0.0
+-------+
| value |
+-------+
| 1 |
| . |
| . |
| . |
+-------+
32763 row(s) fetched. (First 1 displayed. Use --maxrows to adjust)
Elapsed 8.163 seconds.
==96086==
==96086== I refs: 3,663,944,959
==96086== I1 misses: 944,257
==96086== LLi misses: 27,378
==96086== I1 miss rate: 0.03%
==96086== LLi miss rate: 0.00%
==96086==
==96086== D refs: 847,265,289 (495,073,876 rd + 352,191,413 wr)
==96086== D1 misses: 122,392,985 ( 74,620,328 rd + 47,772,657 wr)
==96086== LLd misses: 25,750,761 ( 12,815,239 rd + 12,935,522 wr)
==96086== D1 miss rate: 14.4% ( 15.1% + 13.6% )
==96086== LLd miss rate: 3.0% ( 2.6% + 3.7% )
==96086==
==96086== LL refs: 123,337,242 ( 75,564,585 rd + 47,772,657 wr)
==96086== LL misses: 25,778,139 ( 12,842,617 rd + 12,935,522 wr)
==96086== LL miss rate: 0.6% ( 0.3% + 3.7% )
| let filter_refs: Vec<&dyn Array> = | ||
| filter_results.iter().map(|a| a.as_ref()).collect(); | ||
| compute::concat(&filter_refs)? |
Would be nice to avoid this and rely on CoalesceBatches instead
oh wait this is filter, I have another suggestion
| .expression() | ||
| .evaluate(&intermediate_batch)? | ||
| .into_array(intermediate_batch.num_rows())?; | ||
| filter_results.push(filter_result); |
What about executing the filter directly on build/probe indices slice here and concatenating the indices later?
Ideally, I think the filtering operation should be done here, directly on the sub-batch (and using coalescing kernel (apache/arrow-rs#7652 ) to push results.
coalesce is now available in datafusion (we have upgraded to a new arrow version)
I hope to continue improving coalesce over time (especially for this common usecase of building up the output of filter)
I'm not sure I fully understand your suggestions. Could you please elaborate?
What about executing the filter directly on build/probe indices slice here and concatenating the indices later?
I'm a bit confused by this. Do you mean we can avoid constructing the intermediate_batch? It seems this approach would require rewriting the filter logic to work directly on index slices
and using coalescing kernel
I'm also not sure why the coalesce kernel should be used here. The current function takes build_indices and probe_indices, builds an intermediate_batch, executes the filter, and returns the (build_indices, probe_indices) that passed the filter.
My understanding is that coalesce is used to merge multiple RecordBatches with a row count less than a target into new RecordBatches with a row count greater than the target. How would that apply in this situation?
> I'm not sure I fully understand your suggestions. Could you please elaborate?
>
> > What about executing the filter directly on build/probe indices slice here and concatenating the indices later?
>
> I'm a bit confused by this. Do you mean we can avoid constructing the `intermediate_batch`? It seems this approach would require rewriting the filter logic to work directly on index slices.
>
> > and using coalescing kernel
>
> I'm also not sure why the `coalesce` kernel should be used here. The current function takes `build_indices` and `probe_indices`, builds an `intermediate_batch`, executes the filter, and returns the `(build_indices, probe_indices)` that passed the filter. My understanding is that `coalesce` is used to merge multiple `RecordBatch`es with a row count less than a target into new `RecordBatch`es with a row count greater than the target. How would that apply in this situation?
This is an idea for future optimization:
If we use this interface https://docs.rs/arrow-select/55.2.0/src/arrow_select/coalesce.rs.html#189-193 instead of the current `concat_batches()` approach, it can (1) use less memory and (2) run faster.
Note that the interface above hasn't been implemented with the fast path yet.
You can see the motivations in apache/arrow-rs#6692
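As a rough illustration of the coalescing idea being discussed (a hypothetical, simplified API sketch, not the actual arrow-rs `BatchCoalescer`): buffer rows that pass the filter and emit fixed-size batches as you go, instead of concatenating everything at the end.

```rust
// Hypothetical sketch of the coalescing idea: buffer filtered rows and
// emit them as batches of a target size, instead of concatenating
// everything in one final pass.
struct Coalescer {
    target: usize,
    buffer: Vec<u64>,
    completed: Vec<Vec<u64>>,
}

impl Coalescer {
    fn new(target: usize) -> Self {
        Self { target, buffer: Vec::new(), completed: Vec::new() }
    }

    // Push rows that pass a filter mask; flush a batch whenever the
    // buffer reaches the target size.
    fn push_with_filter(&mut self, rows: &[u64], mask: &[bool]) {
        for (&row, &keep) in rows.iter().zip(mask) {
            if keep {
                self.buffer.push(row);
                if self.buffer.len() == self.target {
                    self.completed.push(std::mem::take(&mut self.buffer));
                }
            }
        }
    }

    // Flush any remaining buffered rows as a final (possibly short) batch.
    fn finish(mut self) -> Vec<Vec<u64>> {
        if !self.buffer.is_empty() {
            self.completed.push(self.buffer);
        }
        self.completed
    }
}

fn main() {
    let mut c = Coalescer::new(2);
    c.push_with_filter(&[1, 2, 3], &[true, false, true]);
    c.push_with_filter(&[4, 5], &[true, true]);
    println!("{:?}", c.finish()); // prints [[1, 3], [4, 5]]
}
```

The real `BatchCoalescer` works on `RecordBatch`es and arrow arrays, but the buffering-and-flushing shape is the same.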
However, after replacing `concat` with `coalescer.push_batch_with_filter`, performance actually decreased.
code
```rust
pub(crate) fn apply_join_filter_to_indices(
    build_input_buffer: &RecordBatch,
    probe_batch: &RecordBatch,
    build_indices: UInt64Array,
    probe_indices: UInt32Array,
    filter: &JoinFilter,
    build_side: JoinSide,
    max_intermediate_size: Option<usize>,
) -> Result<(UInt64Array, UInt32Array)> {
    if build_indices.is_empty() && probe_indices.is_empty() {
        return Ok((build_indices, probe_indices));
    };
    if let Some(max_size) = max_intermediate_size {
        let indices_schema = Arc::new(Schema::new(vec![
            Field::new("build_indices", arrow::datatypes::DataType::UInt64, false),
            Field::new("probe_indices", arrow::datatypes::DataType::UInt32, false),
        ]));
        let build_indices = Arc::new(build_indices);
        let probe_indices = Arc::new(probe_indices);
        let indices_batch = RecordBatch::try_new(
            indices_schema,
            vec![
                Arc::clone(&build_indices) as Arc<dyn Array>,
                Arc::clone(&probe_indices) as Arc<dyn Array>,
            ],
        )?;
        let mut coalescer =
            BatchCoalescer::new(indices_batch.schema(), indices_batch.num_rows());
        for i in (0..build_indices.len()).step_by(max_size) {
            let end = min(build_indices.len(), i + max_size);
            let len = end - i;
            let intermediate_batch = build_batch_from_indices(
                filter.schema(),
                build_input_buffer,
                probe_batch,
                &build_indices.slice(i, len),
                &probe_indices.slice(i, len),
                filter.column_indices(),
                build_side,
            )?;
            let filter_result = filter
                .expression()
                .evaluate(&intermediate_batch)?
                .into_array(intermediate_batch.num_rows())?;
            coalescer.push_batch_with_filter(
                indices_batch.slice(i, len),
                as_boolean_array(&filter_result)?,
            )?;
        }
        coalescer.finish_buffered_batch()?;
        let result = coalescer.next_completed_batch();
        if result.is_none() {
            return Ok((build_indices.slice(0, 0), probe_indices.slice(0, 0)));
        }
        if coalescer.has_completed_batch() {
            return internal_err!("should not have completed_batch");
        }
        let (_, arrays, _) = result.unwrap().into_parts();
        return Ok((
            downcast_array(arrays[0].as_ref()),
            downcast_array(arrays[1].as_ref()),
        ));
    }
    let intermediate_batch = build_batch_from_indices(
        filter.schema(),
        build_input_buffer,
        probe_batch,
        &build_indices,
        &probe_indices,
        filter.column_indices(),
        build_side,
    )?;
    let filter_result = filter
        .expression()
        .evaluate(&intermediate_batch)?
        .into_array(intermediate_batch.num_rows())?;
    let mask = as_boolean_array(&filter_result)?;
    let left_filtered = compute::filter(&build_indices, mask)?;
    let right_filtered = compute::filter(&probe_indices, mask)?;
    Ok((
        downcast_array(left_filtered.as_ref()),
        downcast_array(right_filtered.as_ref()),
    ))
}
```
bench result
| ID | SQL | join_limit_join_batch_size Time(s) | use_BatchCoalescer Time(s) | Performance Change |
|---|---|---|---|---|
| 1 | select t1.value from range(8192) t1 join range(8192) t2 on t1.value + t2.value < t1.value * t2.value; | 0.559 | 0.671 | 1.20x slower 🐌 |
| 2 | select t1.value from range(8192) t1 join range(8192) t2 on t1.value + t2.value > t1.value * t2.value; | 0.377 | 0.371 | +1.02x faster 🚀 |
| 3 | select t1.value from range(8192) t1 left join range(8192) t2 on t1.value + t2.value > t1.value * t2.value; | 0.363 | 0.363 | +1.00x faster 🚀 |
| 4 | select t1.value from range(8192) t1 join range(81920) t2 on t1.value + t2.value < t1.value * t2.value; | 1.556 | 2.031 | 1.30x slower 🐌 |
| 5 | select t1.value from range(100) t1 join range(819200) t2 on t1.value + t2.value > t1.value * t2.value; | 0.063 | 0.057 | +1.11x faster 🚀 |
| 6 | select t1.value from range(100) t1 join range(819200) t2 on t1.value + t2.value < t1.value * t2.value; | 0.153 | 0.194 | 1.27x slower 🐌 |

| SQL Query | join_limit_join_batch_size Memory | use_BatchCoalescer Memory | Improvement |
|---|---|---|---|
| select t1.value from range(8192) t1 join range(8192) t2 on t1.value + t2.value < t1.value * t2.value; | 1.57 GB | 2.31 GB | 1.47x more 🐌 |
| select t1.value from range(8192) t1 join range(8192) t2 on t1.value + t2.value > t1.value * t2.value; | 841.5 MB | 824.9 MB | +1.02x saved 🚀 |
| select t1.value from range(8192) t1 left join range(8192) t2 on t1.value + t2.value > t1.value * t2.value; | 845.3 MB | 824.6 MB | +1.03x saved 🚀 |
| select t1.value from range(8192) t1 join range(81920) t2 on t1.value + t2.value < t1.value * t2.value; | 15.00 GB | 20.36 GB | 1.36x more 🐌 |
| select t1.value from range(100) t1 join range(819200) t2 on t1.value + t2.value > t1.value * t2.value; | 328.1 MB | 327.6 MB | +1.00x saved 🚀 |
| select t1.value from range(100) t1 join range(819200) t2 on t1.value + t2.value < t1.value * t2.value; | 659.8 MB | 810.4 MB | 1.23x more 🐌 |
I think batch coalescer won't make this faster as this is buffering everything in memory anyway.
The main idea would be to apply the filters iteratively to the incoming RecordBatch instead of the indices, so we have to change the API / implementation a bit more.
It's fine to leave this as a future change.
Yes, that's expected, now we only got the interface ready, the efficient implementation is still WIP.
See https://docs.rs/arrow-select/55.2.0/src/arrow_select/coalesce.rs.html#194
2010YOUY01 left a comment
I think this PR's idea is great, and the implementation overall looks good to me.
I recommend documenting more of the high-level ideas in the key functions, to make this module easier to maintain in the future; specifically: `build_unmatched_output()`, `prepare_unmatched_output_indices()`, and `get_next_join_result()`.
```rust
.expression()
.evaluate(&intermediate_batch)?
.into_array(intermediate_batch.num_rows())?;
let filter_result = if let Some(max_size) = max_intermediate_size {
```
I believe this defeats the purpose of the pull request if we are still building the entire set of indices.
This PR only limits the size of the intermediate `record_batch`. The Cartesian product of the entire `left_table` and `right_batch` is still generated at once (this will be limited in a subsequent PR).
Additionally, making the Cartesian product step incremental likely requires a larger refactor (comparing to this PR), so it may be better suited for a separate PR.
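The chunking scheme this PR applies to the intermediate batch can be sketched like this (plain Rust; `process` is a stand-in for the real `build_batch_from_indices` plus filter evaluation): walk the matched index range in windows of at most `max_size` rows, so only one window's worth of rows is materialized at a time.

```rust
// Sketch of the chunking scheme: instead of materializing one huge
// intermediate batch, walk the index range in windows of at most
// `max_size` rows and process each window separately.
fn process_in_chunks(total: usize, max_size: usize, mut process: impl FnMut(usize, usize)) {
    for start in (0..total).step_by(max_size) {
        let end = std::cmp::min(total, start + max_size);
        process(start, end - start); // (offset, length), like Array::slice
    }
}

fn main() {
    let mut chunks = Vec::new();
    process_in_chunks(10, 4, |off, len| chunks.push((off, len)));
    println!("{:?}", chunks); // prints [(0, 4), (4, 4), (8, 2)]
}
```

This mirrors the `(0..build_indices.len()).step_by(max_size)` loop in the code above, which slices the index arrays by `(offset, length)` before building each intermediate batch.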
```rust
}
}

/// Tracks progress when building join result batches incrementally.
```
```suggestion
/// Tracks incremental output of join result batches.
///
/// Initialized with all matching pairs that satisfy the join predicate.
/// Pairs are stored as indices in `build_indices` and `probe_indices`.
/// Each poll outputs a batch within the configured size limit and updates
/// `processed_count` until all pairs are consumed.
///
/// Example: 5000 matches, batch size limit is 100
/// - Poll 1: output batch[0..100], processed_count = 100
/// - Poll 2: output batch[100..200], processed_count = 200
/// - ...continues until processed_count = 5000
```
It would be helpful to doc high-level ideas and examples, for key structs and functions.
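The incremental output described in the doc-comment example can be sketched as follows (illustrative types only; the real struct holds arrow index arrays rather than a `Vec` of pairs): each poll emits at most `limit` matched pairs and advances a cursor until everything is consumed.

```rust
// Illustrative sketch of the processed_count bookkeeping: each poll
// emits the next slice of matched pairs and advances the cursor.
struct JoinResultProgress {
    pairs: Vec<(u64, u32)>, // matched (build, probe) index pairs
    processed_count: usize,
    limit: usize,
}

impl JoinResultProgress {
    // Emit the next slice of pairs, or None once all are consumed.
    fn poll(&mut self) -> Option<&[(u64, u32)]> {
        if self.processed_count >= self.pairs.len() {
            return None;
        }
        let start = self.processed_count;
        let end = (start + self.limit).min(self.pairs.len());
        self.processed_count = end;
        Some(&self.pairs[start..end])
    }
}

fn main() {
    let mut p = JoinResultProgress {
        pairs: (0..5).map(|i| (i as u64, i as u32)).collect(),
        processed_count: 0,
        limit: 2,
    };
    let mut sizes = Vec::new();
    while let Some(batch) = p.poll() {
        sizes.push(batch.len());
    }
    println!("{:?}", sizes); // prints [2, 2, 1]
}
```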
```rust
match join_result {
    Some(res) => {
        self.join_metrics.output_batches.add(1);
        self.join_metrics.output_rows.add(res.num_rows());
```
We don't have to count `output_rows` here: it is automatically counted in the outer poll.
This was made possible by a recent change: #16500
While `output_batches` still needs to be manually tracked here, it could also be automatically counted in the future.
```rust
fn build_unmatched_output(
    &mut self,
) -> Result<StatefulStreamResult<Option<RecordBatch>>> {
    let start = Instant::now();
```
nit: I think we can just construct a timer guard here and let it stop on drop.
```rust
ProcessProbeBatch(RecordBatch),
/// Indicates that probe-side has been fully processed
ExhaustedProbeSide,
/// Output unmatched build-side rows
```
```suggestion
/// Output unmatched build-side rows.
/// The indices for rows to output have already been calculated in the previous
/// `ExhaustedProbeSide` state. In this state the final batch will be materialized
/// incrementally.
/// The inner `RecordBatch` is an empty dummy batch used to get the right schema.
```
Maybe we can also rename `ExhaustedProbeSide` to `PrepareUnmatchedBuildRows` to be more accurate.
I have addressed all of your comments. @2010YOUY01 please take another look.

Since the logic of …
I'm starting a second pass. I haven't fully grasped the internal logic of … yet; I think documenting the semantics of left/right indices for different join types can help readability, like …
@2010YOUY01 Special types need to return only the matching rows, so only one side needs to return rows while the other side can return a null array and not be projected in the final result. This functionality was only moved and had already existed before this pull request, where most of it sits in …
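As a toy illustration of the "only one side returns rows" point (plain Rust, not the DataFusion implementation): a LeftSemi result reduces to the distinct build-side indices that matched at least once, and the probe side is never projected.

```rust
use std::collections::BTreeSet;

// Hedged sketch: a LeftSemi join output is just the distinct build-side
// indices that matched at least once, deduplicated and kept sorted here
// via a BTreeSet.
fn left_semi_indices(matched_build_indices: &[u64]) -> Vec<u64> {
    matched_build_indices
        .iter()
        .copied()
        .collect::<BTreeSet<_>>()
        .into_iter()
        .collect()
}

fn main() {
    // Build rows 0, 1, 2 each matched at least once; duplicates collapse.
    println!("{:?}", left_semi_indices(&[2, 0, 2, 1])); // prints [0, 1, 2]
}
```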
LGTM, thanks again. 🙏🏼
I’ll keep it open for a few more days in case other reviewers have additional concerns they'd like to raise.
```rust
let current_start = *start;

if left_indices.is_empty() && right_indices.is_empty() && current_start == 0 {
```
I don't get this `status.processed_count = 1` logic either; perhaps you can add a quick comment to explain it?
```rust
    return Ok(Some(res));
}

if matches!(self.join_type, JoinType::RightSemi | JoinType::RightAnti) {
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think from here to the end of the function, it would look nicer if we structured it like this:

```rust
match self.join_type {
    JoinType::RightSemi | JoinType::RightAnti => { ... }
    JoinType::RightMark => { ... }
    JoinType::...(others) => { ... }
    _ => unreachable!(),
}
```
I'd prefer to stick with the current implementation. The reason is that the code block from L925 to L939 is shared by several `JoinType`s, including `RightMark`, `Inner`, `LeftSemi`, etc.
If we refactored this into the match structure as suggested, we would have to duplicate that block of logic in multiple match arms.
```rust
/// - Poll 1: output batch[0..100], processed_count = 100
/// - Poll 2: output batch[100..200], processed_count = 200
/// - ...continues until processed_count = 5000
struct JoinResultStatus {
```
nit: `Status` is most commonly used for error codes / state flags; perhaps we can use `JoinResultProgress` here to avoid confusion?
done
Is this one ready to merge?
@alamb Yes, I believe all comments have been addressed. I think we have two notable follow-ups: …
Awesome -- thanks @jonathanc-n and @UBarney -- I am very happy to see this moving along!
Yes, I'll work on this very soon.

I'm happy to include this benchmark in the bench suite this week, unless you were already planning to add it yourself @UBarney.

That would be fantastic, thank you! I hadn't planned on adding it myself, so your help is much appreciated. Please go right ahead.


Which issue does this PR close?
part of #16364
Rationale for this change
see issue
What changes are included in this PR?
Are these changes tested?
Yes
Are there any user-facing changes?