Skip to content

Conversation

@benibus
Copy link
Contributor

@benibus benibus commented Apr 17, 2023

Rationale for this change

The current FieldPath::Get methods - when extracting nested child values - don't combine the child's null bitmap with higher-level parent bitmaps. While this is often preferable (it allows for zero-copy), there are cases where higher level "flattening" version is useful - e.g. when pre-processing sort keys for structs.

What changes are included in this PR?

  • Adds FieldPath::GetFlattened methods alongside the existing FieldPath::Get overloads
  • Adds GetAllFlattened, GetOneFlattened and GetOneOrNoneFlattened methods to FieldRef
  • Adds a couple internal helpers for dealing with both Get variants in templates
  • Overhauls the FieldPath tests in an effort to improve coverage and generalize across the supported input types

More significantly, this alters the FieldPathGetImpl internals to use a new NestedSelector class. The reason for this is that the prior method required presenting a vector of instantiated child values for each depth level prior to selection. With support for flattening (and recently, ChunkedArrays), this becomes a problem since we need to explicitly create all of those child values for each depth level despite the fact that we're only going to select one. So these changes allow any expensive instantiations to be deferred to selection time.

This also indirectly solves an issue that surfaced in the new tests, which is that FieldPath::Get would return incorrect nested values when sliced Arrays are involved. This is because the underlying child data's offset/length weren't being adjusted based on the parent.

Are these changes tested?

Yes (tests are included)

Are there any user-facing changes?

Yes, this adds methods to a public API

@github-actions
Copy link

@github-actions
Copy link

⚠️ GitHub issue #14946 has been automatically assigned in GitHub to PR creator.

@benibus benibus force-pushed the GH-14946-field-ref-flattened-get branch 2 times, most recently from 564e319 to 5e0920d Compare April 18, 2023 16:27
@benibus benibus marked this pull request as ready for review April 18, 2023 20:58
@benibus
Copy link
Contributor Author

benibus commented Apr 18, 2023

Probably worth mentioning that this (indirectly) addresses most of #34830 as well - at least on the FieldPath side of things.

@benibus
Copy link
Contributor Author

benibus commented Apr 19, 2023

@zeroshade @felipecrv

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Apr 24, 2023
@benibus benibus force-pushed the GH-14946-field-ref-flattened-get branch 2 times, most recently from 8c83fd6 to a9fddef Compare May 1, 2023 19:21
@felipecrv
Copy link
Contributor

@benibus can you ping the people who understood the issue in this PR?

@benibus
Copy link
Contributor Author

benibus commented May 3, 2023

@westonpace I feel you're probably most qualified to look at this given your comments on the original issue.

@benibus
Copy link
Contributor Author

benibus commented May 10, 2023

cc @pitrou

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for this work! This looks generally good, just a bunch of relatively minor suggestions.

@benibus benibus force-pushed the GH-14946-field-ref-flattened-get branch 2 times, most recently from 15bb2ee to 96cebfe Compare May 16, 2023 21:42
@benibus benibus requested a review from pitrou May 17, 2023 12:50
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh. Unlike GetAll, this can fail for various reasons such as failure to allocate enough memory. In that case, we'd probably want to return an error instead of dying out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, I'm a little confused as to why the non-flattened variants don't forward those errors either - since the standard FieldPath::Get methods can also fail (which was true prior to this PR).

I'm mostly referring to GetOne and GetOneAndNone here, as they already return a Result. Regardless, I'll propagate those errors for the new methods.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here (and this method is already returning a Result!).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might have missed it somehow, but we probably want to call ValidateFull on the actual results here (and below).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that this is adding quite a lot of test code, and this file is already long, I think it may be time to move the Field{Ref,Path} tests to a separate test module.

@benibus benibus force-pushed the GH-14946-field-ref-flattened-get branch from 96cebfe to 17a6591 Compare May 17, 2023 18:10
@benibus benibus requested a review from pitrou May 18, 2023 01:36
@pitrou pitrou force-pushed the GH-14946-field-ref-flattened-get branch from 17a6591 to a1662aa Compare May 22, 2023 13:33
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thanks for the update! Will merge if CI is green.

@pitrou
Copy link
Member

pitrou commented May 22, 2023

CI failures are unrelated.

@ursabot
Copy link

ursabot commented May 30, 2023

Benchmark runs are scheduled for baseline = 2216a0a and contender = f3500f6. f3500f6 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.45% ⬆️0.06%] test-mac-arm
[Finished ⬇️2.94% ⬆️1.63%] ursa-i9-9960x
[Finished ⬇️1.82% ⬆️0.73%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] f3500f65 ec2-t3-xlarge-us-east-2
[Finished] f3500f65 test-mac-arm
[Finished] f3500f65 ursa-i9-9960x
[Finished] f3500f65 ursa-thinkcentre-m75q
[Finished] 2216a0a4 ec2-t3-xlarge-us-east-2
[Finished] 2216a0a4 test-mac-arm
[Finished] 2216a0a4 ursa-i9-9960x
[Finished] 2216a0a4 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot
Copy link

ursabot commented May 31, 2023

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

pitrou pushed a commit that referenced this pull request Jun 20, 2023
…eys (#35727)

### Rationale for this change

We don't currently support sorting `StructArray`s despite already having the high-level facilities to do so. For instance, we allow passing multiple sort keys (based on `FieldRef`s) to sort record batches and tables - but the current implementations are fairly limited since nested refs aren't allowed (due to the burden of null flattening). Since #35197, we now have an easier way to do this.

### What changes are included in this PR?

- Adds support for `StructArray` in `sort_indices`
- Adds support for nested sort keys in `sort_indices` for `RecordBatch`, `ChunkedArray`, and `Table`

### Are these changes tested?

Yes (tests are included)

### Are there any user-facing changes?

Yes
* Closes: #33206

Authored-by: benibus <bpharks@gmx.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
pitrou added a commit that referenced this pull request Aug 8, 2023
### Rationale for this change

#35197 appears to have introduced significant performance regressions in `FieldPath::Get` - indicated [here](https://conbench.ursa.dev/compare/runs/9cf73ac83f0a44179e6538b2c1c7babd...3d76cb5ffb8849bf8c3ea9b32d08b3b7/), in a benchmark that uses a wide (10K column) dataframe.

### What changes are included in this PR?

- Adds basic benchmarks for `FieldPath::Get` across various input types, as they didn't previously exist
- Addresses several performance issues. These came in the form of extremely high upfront costs for the `RecordBatch` and `ArrayData` overloads specifically
- Some minor refactoring of `NestedSelector`

### Are these changes tested?

Yes (covered by existing tests)

### Are there any user-facing changes?

No

* Closes: #36892

Lead-authored-by: benibus <bpharks@gmx.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
raulcd pushed a commit that referenced this pull request Aug 9, 2023
### Rationale for this change

#35197 appears to have introduced significant performance regressions in `FieldPath::Get` - indicated [here](https://conbench.ursa.dev/compare/runs/9cf73ac83f0a44179e6538b2c1c7babd...3d76cb5ffb8849bf8c3ea9b32d08b3b7/), in a benchmark that uses a wide (10K column) dataframe.

### What changes are included in this PR?

- Adds basic benchmarks for `FieldPath::Get` across various input types, as they didn't previously exist
- Addresses several performance issues. These came in the form of extremely high upfront costs for the `RecordBatch` and `ArrayData` overloads specifically
- Some minor refactoring of `NestedSelector`

### Are these changes tested?

Yes (covered by existing tests)

### Are there any user-facing changes?

No

* Closes: #36892

Lead-authored-by: benibus <bpharks@gmx.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…apache#37032)

### Rationale for this change

apache#35197 appears to have introduced significant performance regressions in `FieldPath::Get` - indicated [here](https://conbench.ursa.dev/compare/runs/9cf73ac83f0a44179e6538b2c1c7babd...3d76cb5ffb8849bf8c3ea9b32d08b3b7/), in a benchmark that uses a wide (10K column) dataframe.

### What changes are included in this PR?

- Adds basic benchmarks for `FieldPath::Get` across various input types, as they didn't previously exist
- Addresses several performance issues. These came in the form of extremely high upfront costs for the `RecordBatch` and `ArrayData` overloads specifically
- Some minor refactoring of `NestedSelector`

### Are these changes tested?

Yes (covered by existing tests)

### Are there any user-facing changes?

No

* Closes: apache#36892

Lead-authored-by: benibus <bpharks@gmx.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[C++] Should FieldRef::Get* for a StructArray return the "raw" child array or the "flattened" field array?

4 participants