feat: Optimize hash util for MapArray#20179
Conversation
|
@Jefffrey I'm not sure to also do this optimization for I was thinking of doing a take() operation on the children array using the offsets. but I don't know if its worth the allocation overhead. What exact strategy were you thinking of for union array? |
xudong963
left a comment
There was a problem hiding this comment.
Do we have a micro benchmark for the PR?
Or do you need me to enable the general benchmark in datafusion?
|
@xudong963 I added some benchmarks for the sliced values + added |
|
Not sure why, I cant find the option to change the pr title |
Would you be able to post the benchmark results here? |
| let sliced_columns: Vec<ArrayRef> = entries | ||
| .columns() | ||
| .iter() | ||
| .map(|col| col.slice(first_offset, entries_len)) |
There was a problem hiding this comment.
I do wonder if this slicing of the key/value children adds any noticeable overhead for cases where there isn't any actual slicing? i.e. all values are referenced
There was a problem hiding this comment.
I don't think so, its quite a cheap operation compared to the hashing overhead.
There was a problem hiding this comment.
Just used Claude to run a quick benchmark:
Overhead for Full Arrays (no actual slicing)
| Array Type | With slice() | Without slice() | Overhead |
|---|---|---|---|
| MapArray | 68.7 µs | 67.1 µs | +2.4% |
| ListArray | 39.4 µs | 38.1 µs | +3.4% |
Seems to be minimal overhead
| let indices_array = UInt32Array::from(indices.clone()); | ||
|
|
||
| // Extract only the elements we need using take() | ||
| let filtered = take(child.as_ref(), &indices_array, None)?; |
There was a problem hiding this comment.
I do wonder if theres a way to skip this extract/scatter pattern, maybe by trying to override the null buffer of these child arrays (though that runs into problems for arrays that don't have a null buffer like run arrays)
Or perhaps an alternate create_hashes to hash only specified indices in a separate argument
Or maybe its unnecessary to go that far if we still get an improvement in benchmarks anyway 🤔
(Not advocating for any of the above in particular, just airing out some thoughts)
There was a problem hiding this comment.
We get improvements, but I think this should be considered. Can write it on the issue.
I do wonder if theres a way to skip this extract/scatter pattern, maybe by trying to override the null buffer of these child arrays (though that runs into problems for arrays that don't have a null buffer like run arrays)
I'll try to run some benchmarks on this idea
There was a problem hiding this comment.
This is a bit awkward to test due to the run array. I think they're good ideas, I'll open an issue for it.
|
@Jefffrey Here is the results: Benchmarks for sliced| ListArray | 39.2 µs | 227.3 µs | 5.8x | Sparse is a bit faster. |
|
@Jefffrey @xudong963 Good for a quick merge |
|
Thanks @jonathanc-n & @xudong963 |
## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Closes apache#20151 . ## Rationale for this change Reduce the irrelevant data being used to hash for `MapArray` <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> ## What changes are included in this PR? <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> ## Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. -->
Which issue does this PR close?
Rationale for this change
Reduce the irrelevant data being used to hash for
MapArrayWhat changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?