Skip to content

Conversation

@2010YOUY01
Copy link
Contributor

@2010YOUY01 2010YOUY01 commented Oct 20, 2025

Which issue does this PR close?

Rationale for this change

See the above issue and its comment #18070 (comment)

What changes are included in this PR?

In nested loop join, when the join column includes List(Utf8View), use take() instead of to_array_of_size() to avoid deep copying the utf8 buffers inside Utf8View array.

This is the quick fix, avoiding deep copy inside to_array_of_size() is a bit tricky.
Here is ListArray's physical layout: https://arrow.apache.org/rust/arrow/array/struct.GenericListArray.html
If multiple elements is pointing to the same list range, the underlying payload can't be reused.So the potential fix in to_array_of_size can only avoids copying the inner-inner utf8view array buffers, but can't avoid copying the inner array (i.e. views are still copied), and deep copying for other primitive types also can't be avoided. Seems this can be better solved when ListView type is ready 🤔

Benchmark

I tried query 1 in #18070, but only used 3 randomly sampled places parquet file.

49.0.0: 4s
50.0.0: stuck > 1 minute
PR: 4s

Now the performance are similar, I suspect the most time is spend evaluating the expensive array_has so the optimization in #16996 can't help much.

Are these changes tested?

Existing tests

Are there any user-facing changes?

No

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified that with this fix the reproducer from @ianthetechie from this issue is fixed:

However, am not sure this code has test coverage. I checked via

nice cargo llvm-cov --html test --test sqllogictests
nice cargo llvm-cov --html test -p datafusion -p datafusion-physical-plan

And both show no coverage
Screenshot 2025-10-20 at 10 01 37 AM

I will look into adding some coverage

Now the performance are similar, I suspect the most time is spend evaluating the expensive array_has so the optimization in #16996 can't help much.

Yes, I looked at the array_has implementation and it is doing a lot of work. I will file a follow on ticket

Also, it seems to me that the fix / improvement for ScalarValue::to_array_of_size() is more general than just NLJ, so I will also file a ticket about that as well

scalar_value.to_array_of_size(filtered_probe_batch.num_rows())?
// Avoid using `ScalarValue::to_array_of_size()` for `List(Utf8View)` to avoid
// deep copies for buffers inside `Utf8View` array. See below for details.
// https://github.com/apache/datafusion/issues/18159
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +1674 to +1683
DataType::List(field) | DataType::LargeList(field)
if field.data_type() == &DataType::Utf8View =>
{
let indices_iter = std::iter::repeat_n(
build_side_index as u64,
filtered_probe_batch.num_rows(),
);
let indices_array = UInt64Array::from_iter_values(indices_iter);
take(original_left_array.as_ref(), &indices_array, None)?
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this approach could be ported into ScalarValue::to_array_of_size itself rather than special cased here -- which would improve performance in potentially other places

fn list_to_array_of_size(arr: &dyn Array, size: usize) -> Result<ArrayRef> {
let arrays = repeat_n(arr, size).collect::<Vec<_>>();
let ret = match !arrays.is_empty() {
true => arrow::compute::concat(arrays.as_slice())?,
false => arr.slice(0, 0),
};
Ok(ret)
}

That being said, I think this is a nice point fix that we can safely backport to the datafusion 50 branch, so I think we should merge this PR / backport it and I will file a follow on PR to further improve the code

@alamb
Copy link
Contributor

alamb commented Oct 20, 2025

THANK YOU very much for this fix and diagnosis @2010YOUY01

@2010YOUY01
Copy link
Contributor Author

THANK YOU very much for this fix and diagnosis @2010YOUY01

Thank you for the review. The feedback makes sense to me, but I can only address it tomorrow. If you're waiting on this patch to be included in the release, feel free to push changes directly to the PR.

@alamb
Copy link
Contributor

alamb commented Oct 20, 2025

Thank you @2010YOUY01

I just pushed a test that adds coverage for this case

I verified it is covered using

$ cargo llvm-cov test --html --profile=ci --test sqllogictests -- join_lists

@@ -0,0 +1,63 @@
# Licensed to the Apache Software Foundation (ASF) under one
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new test added here

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @2010YOUY01

@alamb alamb added this pull request to the merge queue Oct 20, 2025
Merged via the queue into apache:main with commit 7f75e58 Oct 20, 2025
32 checks passed
@alamb
Copy link
Contributor

alamb commented Oct 20, 2025

@alamb
Copy link
Contributor

alamb commented Oct 20, 2025

Filed a ticket to track making array_has faster:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-plan Changes to the physical-plan crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Datafusion 50 Performance Regression (array_has style filter/join for Parquet data set)

2 participants