GH-14946: [C++] Add flattening FieldPath/FieldRef::Get methods #35197
Probably worth mentioning that this (indirectly) addresses most of #34830 as well - at least on the
@benibus can you ping the people who understood the issue in this PR?
@westonpace I feel you're probably most qualified to look at this given your comments on the original issue.
cc @pitrou
Thanks a lot for this work! This looks generally good, just a bunch of relatively minor suggestions.
cpp/src/arrow/type.h
Outdated
Uh. Unlike GetAll, this can fail for various reasons such as failure to allocate enough memory. In that case, we'd probably want to return an error instead of dying out.
In that case, I'm a little confused as to why the non-flattened variants don't forward those errors either - since the standard FieldPath::Get methods can also fail (which was true prior to this PR).
I'm mostly referring to GetOne and GetOneAndNone here, as they already return a Result. Regardless, I'll propagate those errors for the new methods.
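To illustrate the point about forwarding errors rather than aborting, here is a minimal sketch. `ToyResult`, `ToyGetChild` and `ToyGetOne` are hypothetical stand-ins written for this example; they are not Arrow's `Result`, `FieldPath::Get` or `FieldRef::GetOne`, and the real signatures differ.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, minimal Result-like type (NOT arrow::Result).
template <typename T>
struct ToyResult {
  bool ok;
  T value;
  std::string error;

  static ToyResult Ok(T v) { return {true, std::move(v), ""}; }
  static ToyResult Err(std::string msg) { return {false, T{}, std::move(msg)}; }
};

// Stand-in for a child lookup that can fail.
ToyResult<int> ToyGetChild(const std::vector<int>& children, std::size_t index) {
  if (index >= children.size()) {
    return ToyResult<int>::Err("index out of range");
  }
  return ToyResult<int>::Ok(children[index]);
}

// Stand-in for a "get exactly one match" helper. It already returns a
// result type, so a failure from the inner lookup is forwarded to the
// caller instead of terminating the process.
ToyResult<int> ToyGetOne(const std::vector<std::size_t>& matches,
                         const std::vector<int>& children) {
  if (matches.size() != 1) {
    return ToyResult<int>::Err("expected exactly one match");
  }
  ToyResult<int> child = ToyGetChild(children, matches[0]);
  if (!child.ok) {
    return child;  // propagate the inner error unchanged
  }
  return child;
}
```

A caller can then check `ok` and inspect `error` on failure, whether the failure came from the match count or from the inner lookup.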
cpp/src/arrow/type.h
Outdated
Same here (and this method is already returning a Result!).
cpp/src/arrow/type.h
Outdated
Same here.
cpp/src/arrow/type_test.cc
Outdated
I might have missed it somehow, but we probably want to call ValidateFull on the actual results here (and below).
cpp/src/arrow/type_test.cc
Outdated
Given that this is adding quite a lot of test code, and this file is already long, I think it may be time to move the Field{Ref,Path} tests to a separate test module.
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
pitrou
left a comment
+1, thanks for the update! Will merge if CI is green.
CI failures are unrelated.
Benchmark runs are scheduled for baseline = 2216a0a and contender = f3500f6. f3500f6 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
['Python', 'R'] benchmarks have high level of regressions.
…eys (#35727) ### Rationale for this change We don't currently support sorting `StructArray`s despite already having the high-level facilities to do so. For instance, we allow passing multiple sort keys (based on `FieldRef`s) to sort record batches and tables - but the current implementations are fairly limited since nested refs aren't allowed (due to the burden of null flattening). Since #35197, we now have an easier way to do this. ### What changes are included in this PR? - Adds support for `StructArray` in `sort_indices` - Adds support for nested sort keys in `sort_indices` for `RecordBatch`, `ChunkedArray`, and `Table` ### Are these changes tested? Yes (tests are included) ### Are there any user-facing changes? Yes * Closes: #33206 Authored-by: benibus <bpharks@gmx.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
### Rationale for this change #35197 appears to have introduced significant performance regressions in `FieldPath::Get` - indicated [here](https://conbench.ursa.dev/compare/runs/9cf73ac83f0a44179e6538b2c1c7babd...3d76cb5ffb8849bf8c3ea9b32d08b3b7/), in a benchmark that uses a wide (10K column) dataframe. ### What changes are included in this PR? - Adds basic benchmarks for `FieldPath::Get` across various input types, as they didn't previously exist - Addresses several performance issues. These came in the form of extremely high upfront costs for the `RecordBatch` and `ArrayData` overloads specifically - Some minor refactoring of `NestedSelector` ### Are these changes tested? Yes (covered by existing tests) ### Are there any user-facing changes? No * Closes: #36892 Lead-authored-by: benibus <bpharks@gmx.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
…apache#37032) ### Rationale for this change apache#35197 appears to have introduced significant performance regressions in `FieldPath::Get` - indicated [here](https://conbench.ursa.dev/compare/runs/9cf73ac83f0a44179e6538b2c1c7babd...3d76cb5ffb8849bf8c3ea9b32d08b3b7/), in a benchmark that uses a wide (10K column) dataframe. ### What changes are included in this PR? - Adds basic benchmarks for `FieldPath::Get` across various input types, as they didn't previously exist - Addresses several performance issues. These came in the form of extremely high upfront costs for the `RecordBatch` and `ArrayData` overloads specifically - Some minor refactoring of `NestedSelector` ### Are these changes tested? Yes (covered by existing tests) ### Are there any user-facing changes? No * Closes: apache#36892 Lead-authored-by: benibus <bpharks@gmx.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
### Rationale for this change

The current `FieldPath::Get` methods - when extracting nested child values - don't combine the child's null bitmap with higher-level parent bitmaps. While this is often preferable (it allows for zero-copy), there are cases where a higher-level "flattening" version is useful - e.g. when pre-processing sort keys for structs.

### What changes are included in this PR?

- `FieldPath::GetFlattened` methods alongside the existing `FieldPath::Get` overloads
- `GetAllFlattened`, `GetOneFlattened` and `GetOneOrNoneFlattened` methods to `FieldRef`
- `Get` variants in templates
- `FieldPath` tests in an effort to improve coverage and generalize across the supported input types

More significantly, this alters the `FieldPathGetImpl` internals to use a new `NestedSelector` class. The reason for this is that the prior method required presenting a vector of instantiated child values for each depth level prior to selection. With support for flattening (and recently, `ChunkedArray`s), this becomes a problem since we need to explicitly create all of those child values for each depth level despite the fact that we're only going to select one. So these changes allow any expensive instantiations to be deferred to selection time.

This also indirectly solves an issue that surfaced in the new tests, which is that `FieldPath::Get` would return incorrect nested values when sliced `Array`s are involved. This is because the underlying child data's offset/length weren't being adjusted based on the parent.

### Are these changes tested?

Yes (tests are included)

### Are there any user-facing changes?

Yes, this adds methods to a public API
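
The flattening and slice handling described above can be sketched with a toy model. This is only an illustration under simplifying assumptions: `ToyChild`, `ToyStruct` and the free function `GetFlattened` are hypothetical names invented for this example, validity is stored as `std::vector<bool>` rather than a packed bitmap, and none of this is the actual Arrow implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-ins for a struct array and one of its children.
// These are NOT the real Arrow types.
struct ToyChild {
  std::vector<int> values;
  std::vector<bool> valid;  // the child's own validity, one entry per value
};

struct ToyStruct {
  std::vector<bool> valid;  // parent validity, in unsliced coordinates
  ToyChild child;           // child storage, not sliced itself
  std::size_t offset = 0;   // logical slice start of the parent
  std::size_t length = 0;   // logical slice length of the parent
};

// Returns a copy of the child restricted to the parent's slice, with the
// parent's validity combined into the child's validity. The copy is the
// price paid for flattening; returning the child storage as-is would be
// zero-copy but would ignore the parent's nulls and slice.
ToyChild GetFlattened(const ToyStruct& parent) {
  assert(parent.child.values.size() == parent.child.valid.size());
  assert(parent.valid.size() == parent.child.valid.size());
  assert(parent.offset + parent.length <= parent.child.values.size());

  ToyChild out;
  for (std::size_t i = 0; i < parent.length; ++i) {
    const std::size_t j = parent.offset + i;  // adjust by the parent's offset
    out.values.push_back(parent.child.values[j]);
    // A slot is valid only if both the parent and the child say so.
    out.valid.push_back(parent.valid[j] && parent.child.valid[j]);
  }
  return out;
}
```

With a parent sliced to offset 1 and length 3, the result has three entries, and an entry is null if either the parent or the child marks it null at the corresponding unsliced position.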