
refactor: Optimize required_columns from BTreeSet to Vec in struct PushdownChecker #19678

Merged
kosiew merged 3 commits into apache:main from kumarUjjawal:refactor/btree_to_vec
Jan 14, 2026

Conversation


@kumarUjjawal kumarUjjawal commented Jan 7, 2026

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Yes

Are there any user-facing changes?

@github-actions github-actions Bot added the datasource Changes to the datasource crate label Jan 7, 2026
@kumarUjjawal

cc @kosiew

@kumarUjjawal kumarUjjawal force-pushed the refactor/btree_to_vec branch from 9e5d7ba to 8c2462e Compare January 8, 2026 07:42

@kosiew kosiew left a comment


Thanks @kumarUjjawal for contributing.
I left some comments for your consideration.

Comment thread datafusion/datasource-parquet/src/row_filter.rs
Comment on lines +310 to +312
if !self.required_columns.contains(&idx) {
    self.required_columns.push(idx);
}

I think it would help future contributors to either:

  • Add a comment explaining that a linear search is acceptable for small n
  • Or switch to HashSet for O(1) deduplication if n might grow
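For readers following the trade-off raised here, a minimal sketch of the `contains`-then-`push` pattern is shown below. The `Checker` struct, `record`, and `collect` are hypothetical stand-ins that model only the `required_columns` field; they are not the actual DataFusion types. Each insert scans the whole vector, which is usually cheap when only a handful of column indices are involved.

```rust
/// Hypothetical stand-in for the checker; only the field under discussion is modeled.
struct Checker {
    required_columns: Vec<usize>,
}

impl Checker {
    /// Record a column index, skipping it if it was already seen.
    fn record(&mut self, idx: usize) {
        // Linear scan: O(n) per insert, assumed acceptable because n
        // (the number of columns referenced by a filter) is small.
        if !self.required_columns.contains(&idx) {
            self.required_columns.push(idx);
        }
    }
}

/// Feed a sequence of indices through the checker and return what it kept.
fn collect(indices: &[usize]) -> Vec<usize> {
    let mut checker = Checker { required_columns: Vec::new() };
    for &idx in indices {
        checker.record(idx);
    }
    checker.required_columns
}

fn main() {
    // Duplicates are dropped; first-seen order is preserved (not sorted).
    let cols = collect(&[3, 1, 3, 0, 1]);
    println!("{cols:?}"); // prints [3, 1, 0]
}
```

Note that this approach keeps insertion order, so a separate sort is still needed if callers expect ascending indices.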

Comment on lines +415 to +418
let prevents_pushdown = checker.prevents_pushdown();
let nested = checker.nested_behavior;
let mut required_columns = checker.required_columns;
required_columns.sort_unstable();

How about adding:

impl PushdownChecker {
    fn into_sorted_columns(mut self) -> PushdownColumns {
        self.required_columns.sort_unstable();
        self.required_columns.dedup(); // this removes the need for contains check
        PushdownColumns {
            required_columns: self.required_columns,
            nested: self.nested_behavior,
        }
    }
}

Comment on lines 419 to 422
Ok((!prevents_pushdown).then_some(PushdownColumns {
    required_columns,
    nested,
}))

and then rewriting this to:

Ok((!checker.prevents_pushdown())
    .then_some(checker.into_sorted_columns()))
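The suggestion in this thread can be tried out in isolation with the sketch below. It assumes simplified, hypothetical types (`Checker`, `Columns`, a `blocked` flag, and a `finish` helper) rather than the real `PushdownChecker` and `PushdownColumns`. It illustrates two points: `sort_unstable` followed by `dedup` yields a sorted, duplicate-free vector without any `contains` check during collection, and `then_some` evaluates its argument eagerly, so the sort runs even when pushdown is prevented.

```rust
/// Hypothetical, simplified result type.
#[derive(Debug, PartialEq)]
struct Columns {
    required_columns: Vec<usize>,
}

/// Hypothetical, simplified checker; `blocked` stands in for whatever
/// state makes `prevents_pushdown` return true.
struct Checker {
    required_columns: Vec<usize>,
    blocked: bool,
}

impl Checker {
    fn prevents_pushdown(&self) -> bool {
        self.blocked
    }

    /// Consume the checker and return sorted, unique column indices.
    fn into_sorted_columns(mut self) -> Columns {
        self.required_columns.sort_unstable();
        // `dedup` only removes *adjacent* duplicates, so it must run after sorting.
        self.required_columns.dedup();
        Columns { required_columns: self.required_columns }
    }
}

/// The borrow from `prevents_pushdown` ends before `checker` is moved.
fn finish(checker: Checker) -> Option<Columns> {
    (!checker.prevents_pushdown()).then_some(checker.into_sorted_columns())
}

fn main() {
    let ok = finish(Checker { required_columns: vec![3, 1, 3, 0, 1], blocked: false });
    let no = finish(Checker { required_columns: vec![2, 2], blocked: true });
    println!("{ok:?} {no:?}");
}
```

If the eager sort matters, `bool::then` with a closure would defer the work; whether that is worthwhile here is a judgment call not settled in the review.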

Comment on lines +310 to +312
if !self.required_columns.contains(&idx) {
    self.required_columns.push(idx);
}

See the suggested fn into_sorted_columns below.

@kumarUjjawal

> Thanks @kumarUjjawal for contributing. I left some comments for your consideration.

Thanks for the feedback. Incorporated the changes.


@kosiew kosiew left a comment


LGTM


@alamb alamb left a comment


Thanks @kumarUjjawal and @kosiew -- this is really nice

  projected_columns: bool,
  /// Indices into the file schema of columns required to evaluate the expression.
- required_columns: BTreeSet<usize>,
+ required_columns: Vec<usize>,

❤️

  #[derive(Debug)]
  struct PushdownColumns {
- required_columns: BTreeSet<usize>,
+ required_columns: Vec<usize>,

Maybe it is worth adding a comment here explaining the assumption that required_columns is sorted and unique (no duplicates).
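One way such an assumption could be documented and checked is sketched below. The `Columns` struct, its `new` constructor, and the `is_sorted_unique` helper are hypothetical and are not part of the merged change; the idea is simply that "sorted and unique" is equivalent to "strictly increasing", which can be verified with a single pass over adjacent pairs.

```rust
/// Returns true if the slice is strictly increasing, which is the same
/// as being sorted in ascending order with no duplicates.
fn is_sorted_unique(cols: &[usize]) -> bool {
    cols.windows(2).all(|pair| pair[0] < pair[1])
}

/// Hypothetical, simplified version of the struct under discussion.
#[derive(Debug)]
struct Columns {
    /// Invariant (assumed by consumers): sorted ascending and free of
    /// duplicates. A `Vec` does not enforce this the way a `BTreeSet`
    /// did, so the producer has to uphold it.
    required_columns: Vec<usize>,
}

impl Columns {
    /// Hypothetical constructor that checks the invariant in debug builds only.
    fn new(required_columns: Vec<usize>) -> Self {
        debug_assert!(
            is_sorted_unique(&required_columns),
            "required_columns must be sorted and unique"
        );
        Self { required_columns }
    }
}

fn main() {
    let cols = Columns::new(vec![0, 1, 3]);
    println!("{:?}", cols.required_columns);
}
```

A `debug_assert!` keeps the check out of release builds; a doc comment alone, as suggested in the review, is the lighter-weight alternative.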

@kosiew kosiew added this pull request to the merge queue Jan 14, 2026
Merged via the queue into apache:main with commit 429f5a7 Jan 14, 2026
28 checks passed
de-bgunter pushed a commit to de-bgunter/datafusion that referenced this pull request Mar 24, 2026
…uct `PushdownChecker` (apache#19678)

## Which issue does this PR close?


- Closes apache#19673.

## Rationale for this change


## What changes are included in this PR?

- Changed `required_columns` in
[row_filter.rs](https://github.com/apache/datafusion/blob/main/datafusion/datasource-parquet/src/row_filter.rs)
to use `Vec<usize>` instead of `BTreeSet<usize>`


## Are these changes tested?

Yes


## Are there any user-facing changes?


Labels

datasource Changes to the datasource crate

Development

Successfully merging this pull request may close these issues.

Optimize required_columns from BTreeSet<usize> to Vec<usize> in struct PushdownChecker<'schema>

3 participants