Improve documentation about ParquetExec / Parquet predicate pushdown #11994

Merged

alamb merged 6 commits into apache:main from alamb:alamb/parquet_exec_dcs
Aug 16, 2024

Conversation

Contributor

@alamb alamb commented Aug 14, 2024

Which issue does this PR close?

part of #4028

Rationale for this change

While reviewing this code with @itsjunetime, we discovered some interesting things that I would like to encode in comments.

What changes are included in this PR?

  1. Improve documentation in the row pushdown code

Are these changes tested?

Yes, CI

Are there any user-facing changes?

Documentation change only (no functional changes)

Note most of the docs are internal (don't appear on docs.rs)

@github-actions bot added the `core` (Core DataFusion crate) and `common` (Related to common crate) labels Aug 14, 2024
/// * User provided [`ParquetAccessPlan`]s to skip row groups and/or pages
/// based on external information. See "Implementing External Indexes" below
///
/// # Predicate Pushdown
Contributor Author

I tried to consolidate the description of the predicate pushdown that is done in the ParquetExec
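The pruning that the consolidated docs describe can be sketched generically: given min/max statistics for a column in a row group, a predicate such as `col > 5` can rule out the whole group without reading it. A minimal sketch, assuming hypothetical names (`ColumnStats`, `may_contain_gt` are illustrative, not DataFusion's API):

```rust
/// Hypothetical min/max statistics for one column in a row group.
struct ColumnStats {
    /// `min` would be consulted for predicates like `col < x`; unused here.
    #[allow(dead_code)]
    min: i64,
    max: i64,
}

/// Returns true if the row group *might* contain rows satisfying `col > value`,
/// i.e. the group cannot be skipped based on statistics alone.
fn may_contain_gt(stats: &ColumnStats, value: i64) -> bool {
    // If even the maximum value is <= the literal, no row can satisfy `col > value`.
    stats.max > value
}

fn main() {
    let g1 = ColumnStats { min: 0, max: 4 };
    let g2 = ColumnStats { min: 3, max: 10 };
    // For `col > 5`: g1 can be pruned, g2 must still be scanned.
    println!("{} {}", may_contain_gt(&g1, 5), may_contain_gt(&g2, 5));
}
```

Note this is the conservative direction the real pruning must take: statistics can prove a group contains no matches, but never that it does.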

// specific language governing permissions and limitations
// under the License.

//! Utilities to push down of DataFusion filter predicates (any DataFusion
Contributor Author

this is mostly the same content, reformatted and made more concise.

///
/// Sorted columns may be queried more efficiently in the presence of
/// a PageIndex.
fn columns_sorted(
Contributor Author

It is interesting that we never connected up the `columns_sorted` information -- is this on your list, @thinkharderdev?

Should I file a ticket to do this?

@alamb alamb marked this pull request as ready for review August 14, 2024 19:14
/// A [Visitor](https://en.wikipedia.org/wiki/Visitor_pattern) for recursively
/// rewriting [`TreeNode`]s via [`TreeNode::rewrite`].
///
/// For example you can implement this trait on a struct to rewrite `Expr` or
Contributor

should we add an example of it? 🤔
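In the spirit of the example requested here, a bottom-up tree rewrite can be sketched with a toy AST. This is a hedged illustration of the visitor idea, not DataFusion's actual `TreeNode`/`TreeNodeRewriter` trait (which threads `Transformed` results and recursion control); the `Expr`, `Rewriter`, and `ConstFold` names below are hypothetical:

```rust
/// A toy expression tree, standing in for DataFusion's `Expr`.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Literal(i64),
    Add(Box<Expr>, Box<Expr>),
}

/// A rewriter in the spirit of `TreeNodeRewriter`: children are rewritten
/// first, then the rewriter may replace the node on the way back up.
trait Rewriter {
    fn mutate(&mut self, expr: Expr) -> Expr;
}

fn rewrite(expr: Expr, rewriter: &mut impl Rewriter) -> Expr {
    // Recurse into children (bottom-up), then apply `mutate` to this node.
    let rewritten = match expr {
        Expr::Add(l, r) => Expr::Add(
            Box::new(rewrite(*l, rewriter)),
            Box::new(rewrite(*r, rewriter)),
        ),
        other => other,
    };
    rewriter.mutate(rewritten)
}

/// Example rewriter: constant-folds `Literal + Literal` into one `Literal`.
struct ConstFold;

impl Rewriter for ConstFold {
    fn mutate(&mut self, expr: Expr) -> Expr {
        match expr {
            Expr::Add(l, r) => match (*l, *r) {
                (Expr::Literal(a), Expr::Literal(b)) => Expr::Literal(a + b),
                (l, r) => Expr::Add(Box::new(l), Box::new(r)),
            },
            other => other,
        }
    }
}

fn main() {
    // (1 + 2) + 3 folds to 6 because children are rewritten before parents.
    let e = Expr::Add(
        Box::new(Expr::Add(Box::new(Expr::Literal(1)), Box::new(Expr::Literal(2)))),
        Box::new(Expr::Literal(3)),
    );
    println!("{:?}", rewrite(e, &mut ConstFold));
}
```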

@comphead (Contributor) left a comment

Thanks, LGTM @alamb -- it was a nice read. Left some minor comments.

//! 6. Partition the predicates according to whether they are sorted (from step 4)
//! 7. "Compile" each predicate `Expr` to a `DatafusionArrowPredicate`.
//! 8. Build the `RowFilter` with the sorted predicates followed by
//! the unsorted predicates. Within each partition, predicates are
Contributor

this explanation is a gem

Contributor Author

I think @thinkharderdev wrote it back in the day.

This PR just simplifies the wording slightly
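Steps 6-8 quoted above (partition by sortedness, then build the filter with sorted predicates first) can be sketched as follows. The `Candidate` struct and `order_predicates` function are hypothetical stand-ins, not the actual `RowFilter` builder:

```rust
/// Hypothetical summary of one candidate filter predicate.
#[derive(Debug, PartialEq)]
struct Candidate {
    name: &'static str,
    uses_sorted_columns: bool,
}

/// Mirrors steps 6-8: evaluate predicates on sorted columns first, since
/// with a page index they tend to be cheaper to decode and more selective,
/// so later (more expensive) predicates see fewer surviving rows.
fn order_predicates(candidates: Vec<Candidate>) -> Vec<Candidate> {
    let (sorted, unsorted): (Vec<_>, Vec<_>) = candidates
        .into_iter()
        .partition(|c| c.uses_sorted_columns);
    // `partition` keeps the relative order within each group.
    sorted.into_iter().chain(unsorted).collect()
}

fn main() {
    let ordered = order_predicates(vec![
        Candidate { name: "b_like", uses_sorted_columns: false },
        Candidate { name: "a_eq", uses_sorted_columns: true },
    ]);
    println!("{:?}", ordered);
}
```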

/// # Return values
///
/// * `Ok(Some(candidate))` if the expression can be used as an ArrowFilter
/// * `Ok(None)` if the expression cannot be used as an ArrowFilter
Contributor

👍
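The `Result<Option<_>>` convention documented here is worth a small illustration: `Ok(Some)` and `Ok(None)` both mean "no error", distinguishing "pushable" from "not pushable", while `Err` is reserved for genuine failures. A toy sketch with a hypothetical pushability rule (the names and the rule are illustrative only):

```rust
/// Hypothetical stand-in for a filter candidate.
#[derive(Debug, PartialEq)]
struct Candidate(String);

/// Illustrates the documented return convention:
/// * `Ok(Some(candidate))` -- the expression can be used as a filter
/// * `Ok(None)`            -- the expression cannot be used as a filter
/// * `Err(..)`             -- something actually went wrong
fn build_candidate(
    expr: &str,
    file_columns: &[&str],
) -> Result<Option<Candidate>, String> {
    // Hypothetical rule: "col = value" is pushable only when `col`
    // exists in the file being scanned.
    let col = expr.split('=').next().ok_or("empty expression")?.trim();
    if file_columns.contains(&col) {
        Ok(Some(Candidate(expr.to_string())))
    } else {
        Ok(None)
    }
}

fn main() {
    println!("{:?}", build_candidate("a = 5", &["a", "b"]));
    println!("{:?}", build_candidate("c = 5", &["a", "b"]));
}
```

The design point: returning `Ok(None)` instead of an error lets the caller fall back to evaluating the predicate after decoding, rather than failing the query.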

@itsjunetime (Contributor) left a comment

This helps a lot with (at least my own) comprehension, I think. Thank you

if self.file_schema.field_with_name(column.name()).is_err() {
// the column expr must be in the table schema
// Replace the column reference with a NULL (using the type from the table schema)
// e.g. `column = 'foo'` is rewritten to `NULL = 'foo'`
Contributor

This is obviously a much better comment than before, but I think it could be further improved with an explanation stating why we do this column rewriting, or what purpose it serves.

Contributor Author

I agree -- I tried to provide this information in c0b9012
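The "why" behind this rewrite is schema evolution: when a file predates a column that was later added to the table schema, every row in that file reads as NULL for that column, so replacing the column reference with a NULL literal (of the table-schema type) evaluates the predicate with exactly the right semantics. A toy sketch, with a hypothetical `Pred` enum rather than DataFusion's real `PhysicalExpr` machinery:

```rust
/// Toy predicate AST (hypothetical; DataFusion rewrites `PhysicalExpr`s).
#[derive(Debug, Clone, PartialEq)]
enum Pred {
    Column(String),
    /// `None` models SQL NULL; a real rewrite would carry the table-schema type.
    Literal(Option<i64>),
    Eq(Box<Pred>, Box<Pred>),
}

/// Replace references to columns missing from this file with NULL:
/// `column = 'foo'` becomes `NULL = 'foo'`. Because the missing column is
/// NULL for every row in the file, the rewrite preserves the predicate's
/// semantics exactly rather than merely approximating it.
fn rewrite_missing(pred: Pred, file_columns: &[&str]) -> Pred {
    match pred {
        Pred::Column(name) if !file_columns.contains(&name.as_str()) => {
            Pred::Literal(None)
        }
        Pred::Eq(l, r) => Pred::Eq(
            Box::new(rewrite_missing(*l, file_columns)),
            Box::new(rewrite_missing(*r, file_columns)),
        ),
        other => other,
    }
}

fn main() {
    let p = Pred::Eq(
        Box::new(Pred::Column("added_later".to_string())),
        Box::new(Pred::Literal(Some(1))),
    );
    // The file only has column "a", so "added_later" becomes NULL.
    println!("{:?}", rewrite_missing(p, &["a"]));
}
```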

alamb and others added 2 commits August 16, 2024 11:03
@alamb (Contributor Author) left a comment

Thank you @comphead and @itsjunetime for the review


Contributor Author

alamb commented Aug 16, 2024

Thanks again -- let me know if you have additional suggestions and I'll make them in a follow on PR

@alamb alamb merged commit 2a16704 into apache:main Aug 16, 2024
@alamb alamb deleted the alamb/parquet_exec_dcs branch August 16, 2024 17:35