Improve documentation about ParquetExec / Parquet predicate pushdown #11994

Merged

alamb merged 6 commits into apache:main from alamb:alamb/parquet_exec_dcs
Aug 16, 2024

Conversation

Contributor

@alamb alamb commented Aug 14, 2024

Which issue does this PR close?

part of #4028

Rationale for this change

While reviewing this code with @itsjunetime, we discovered some interesting things that I would like to encode in comments.

What changes are included in this PR?

  1. Improve documentation in the row pushdown code

Are these changes tested?

Yes, CI

Are there any user-facing changes?

Documentation change only (no functional changes)

Note most of the docs are internal (don't appear on docs.rs)

@github-actions bot added the `core` (Core DataFusion crate) and `common` (Related to common crate) labels Aug 14, 2024
/// * User provided [`ParquetAccessPlan`]s to skip row groups and/or pages
/// based on external information. See "Implementing External Indexes" below
///
/// # Predicate Pushdown
Contributor Author

I tried to consolidate the description of the predicate pushdown that is done in the ParquetExec
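The pruning that the consolidated docs describe can be sketched generically: given min/max statistics for a column in a row group, a predicate such as `col > 5` can rule out the whole group without reading it. A minimal sketch, assuming hypothetical names (`ColumnStats`, `may_contain_gt` are illustrative, not DataFusion's API):

```rust
/// Hypothetical min/max statistics for one column in a row group.
struct ColumnStats {
    /// `min` would be consulted for predicates like `col < x`; unused here.
    #[allow(dead_code)]
    min: i64,
    max: i64,
}

/// Returns true if the row group *might* contain rows satisfying `col > value`,
/// i.e. the group cannot be skipped based on statistics alone.
fn may_contain_gt(stats: &ColumnStats, value: i64) -> bool {
    // If even the maximum value is <= the literal, no row can satisfy `col > value`.
    stats.max > value
}

fn main() {
    let g1 = ColumnStats { min: 0, max: 4 };
    let g2 = ColumnStats { min: 3, max: 10 };
    // For `col > 5`: g1 can be pruned, g2 must still be scanned.
    println!("{} {}", may_contain_gt(&g1, 5), may_contain_gt(&g2, 5));
}
```

Note this is the conservative direction the real pruning must take: statistics can prove a group contains no matches, but never that it does.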

// specific language governing permissions and limitations
// under the License.

//! Utilities to push down of DataFusion filter predicates (any DataFusion
Contributor Author

this is mostly the same content, reformatted and made more concise.

///
/// Sorted columns may be queried more efficiently in the presence of
/// a PageIndex.
fn columns_sorted(
Contributor Author

It is interesting that we never connected up the `columns_sorted` information -- is this on your list, @thinkharderdev?

Should I file a ticket to do this?

@alamb alamb marked this pull request as ready for review August 14, 2024 19:14
/// A [Visitor](https://en.wikipedia.org/wiki/Visitor_pattern) for recursively
/// rewriting [`TreeNode`]s via [`TreeNode::rewrite`].
///
/// For example you can implement this trait on a struct to rewrite `Expr` or
Contributor

should we add an example of it? 🤔
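In the spirit of the example requested here, a bottom-up tree rewrite can be sketched with a toy AST. This is a hedged illustration of the visitor idea, not DataFusion's actual `TreeNode`/`TreeNodeRewriter` trait (which threads `Transformed` results and recursion control); the `Expr`, `Rewriter`, and `ConstFold` names below are hypothetical:

```rust
/// A toy expression tree, standing in for DataFusion's `Expr`.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Literal(i64),
    Add(Box<Expr>, Box<Expr>),
}

/// A rewriter in the spirit of `TreeNodeRewriter`: children are rewritten
/// first, then the rewriter may replace the node on the way back up.
trait Rewriter {
    fn mutate(&mut self, expr: Expr) -> Expr;
}

fn rewrite(expr: Expr, rewriter: &mut impl Rewriter) -> Expr {
    // Recurse into children (bottom-up), then apply `mutate` to this node.
    let rewritten = match expr {
        Expr::Add(l, r) => Expr::Add(
            Box::new(rewrite(*l, rewriter)),
            Box::new(rewrite(*r, rewriter)),
        ),
        other => other,
    };
    rewriter.mutate(rewritten)
}

/// Example rewriter: constant-folds `Literal + Literal` into one `Literal`.
struct ConstFold;

impl Rewriter for ConstFold {
    fn mutate(&mut self, expr: Expr) -> Expr {
        match expr {
            Expr::Add(l, r) => match (*l, *r) {
                (Expr::Literal(a), Expr::Literal(b)) => Expr::Literal(a + b),
                (l, r) => Expr::Add(Box::new(l), Box::new(r)),
            },
            other => other,
        }
    }
}

fn main() {
    // (1 + 2) + 3 folds to 6 because children are rewritten before parents.
    let e = Expr::Add(
        Box::new(Expr::Add(Box::new(Expr::Literal(1)), Box::new(Expr::Literal(2)))),
        Box::new(Expr::Literal(3)),
    );
    println!("{:?}", rewrite(e, &mut ConstFold));
}
```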

@comphead (Contributor) left a comment

Thanks, LGTM @alamb -- it was a nice read. Left some minor comments.

//! 6. Partition the predicates according to whether they are sorted (from step 4)
//! 7. "Compile" each predicate `Expr` to a `DatafusionArrowPredicate`.
//! 8. Build the `RowFilter` with the sorted predicates followed by
//! the unsorted predicates. Within each partition, predicates are
Contributor

this explanation is a gem

Contributor Author

I think @thinkharderdev wrote it back in the day.

This PR just simplifies the wording slightly
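Steps 6-8 quoted above (partition by sortedness, then build the filter with sorted predicates first) can be sketched as follows. The `Candidate` struct and `order_predicates` function are hypothetical stand-ins, not the actual `RowFilter` builder:

```rust
/// Hypothetical summary of one candidate filter predicate.
#[derive(Debug, PartialEq)]
struct Candidate {
    name: &'static str,
    uses_sorted_columns: bool,
}

/// Mirrors steps 6-8: evaluate predicates on sorted columns first, since
/// with a page index they tend to be cheaper to decode and more selective,
/// so later (more expensive) predicates see fewer surviving rows.
fn order_predicates(candidates: Vec<Candidate>) -> Vec<Candidate> {
    let (sorted, unsorted): (Vec<_>, Vec<_>) = candidates
        .into_iter()
        .partition(|c| c.uses_sorted_columns);
    // `partition` keeps the relative order within each group.
    sorted.into_iter().chain(unsorted).collect()
}

fn main() {
    let ordered = order_predicates(vec![
        Candidate { name: "b_like", uses_sorted_columns: false },
        Candidate { name: "a_eq", uses_sorted_columns: true },
    ]);
    println!("{:?}", ordered);
}
```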

/// # Return values
///
/// * `Ok(Some(candidate))` if the expression can be used as an ArrowFilter
/// * `Ok(None)` if the expression cannot be used as an ArrowFilter
Contributor

👍
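The `Result<Option<_>>` convention documented here is worth a small illustration: `Ok(Some)` and `Ok(None)` both mean "no error", distinguishing "pushable" from "not pushable", while `Err` is reserved for genuine failures. A toy sketch with a hypothetical pushability rule (the names and the rule are illustrative only):

```rust
/// Hypothetical stand-in for a filter candidate.
#[derive(Debug, PartialEq)]
struct Candidate(String);

/// Illustrates the documented return convention:
/// * `Ok(Some(candidate))` -- the expression can be used as a filter
/// * `Ok(None)`            -- the expression cannot be used as a filter
/// * `Err(..)`             -- something actually went wrong
fn build_candidate(
    expr: &str,
    file_columns: &[&str],
) -> Result<Option<Candidate>, String> {
    // Hypothetical rule: "col = value" is pushable only when `col`
    // exists in the file being scanned.
    let col = expr.split('=').next().ok_or("empty expression")?.trim();
    if file_columns.contains(&col) {
        Ok(Some(Candidate(expr.to_string())))
    } else {
        Ok(None)
    }
}

fn main() {
    println!("{:?}", build_candidate("a = 5", &["a", "b"]));
    println!("{:?}", build_candidate("c = 5", &["a", "b"]));
}
```

The design point: returning `Ok(None)` instead of an error lets the caller fall back to evaluating the predicate after decoding, rather than failing the query.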

@itsjunetime (Contributor) left a comment

This helps a lot with (at least my own) comprehension, I think. Thank you

if self.file_schema.field_with_name(column.name()).is_err() {
// the column expr must be in the table schema
// Replace the column reference with a NULL (using the type from the table schema)
// e.g. `column = 'foo'` is rewritten to `NULL = 'foo'`
Contributor

This is obviously a much better comment than before, but I think it could be further improved with an explanation stating why we do this column rewriting, or what purpose it serves.

Contributor Author

I agree -- I tried to provide this information in c0b9012
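The "why" behind this rewrite is schema evolution: when a file predates a column that was later added to the table schema, every row in that file reads as NULL for that column, so replacing the column reference with a NULL literal (of the table-schema type) evaluates the predicate with exactly the right semantics. A toy sketch, with a hypothetical `Pred` enum rather than DataFusion's real `PhysicalExpr` machinery:

```rust
/// Toy predicate AST (hypothetical; DataFusion rewrites `PhysicalExpr`s).
#[derive(Debug, Clone, PartialEq)]
enum Pred {
    Column(String),
    /// `None` models SQL NULL; a real rewrite would carry the table-schema type.
    Literal(Option<i64>),
    Eq(Box<Pred>, Box<Pred>),
}

/// Replace references to columns missing from this file with NULL:
/// `column = 'foo'` becomes `NULL = 'foo'`. Because the missing column is
/// NULL for every row in the file, the rewrite preserves the predicate's
/// semantics exactly rather than merely approximating it.
fn rewrite_missing(pred: Pred, file_columns: &[&str]) -> Pred {
    match pred {
        Pred::Column(name) if !file_columns.contains(&name.as_str()) => {
            Pred::Literal(None)
        }
        Pred::Eq(l, r) => Pred::Eq(
            Box::new(rewrite_missing(*l, file_columns)),
            Box::new(rewrite_missing(*r, file_columns)),
        ),
        other => other,
    }
}

fn main() {
    let p = Pred::Eq(
        Box::new(Pred::Column("added_later".to_string())),
        Box::new(Pred::Literal(Some(1))),
    );
    // The file only has column "a", so "added_later" becomes NULL.
    println!("{:?}", rewrite_missing(p, &["a"]));
}
```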

alamb and others added 2 commits August 16, 2024 11:03
@alamb (Contributor Author) left a comment

Thank you @comphead and @itsjunetime for the review


Contributor Author

alamb commented Aug 16, 2024

Thanks again -- let me know if you have additional suggestions and I'll make them in a follow on PR

@alamb alamb merged commit 2a16704 into apache:main Aug 16, 2024
@alamb alamb deleted the alamb/parquet_exec_dcs branch August 16, 2024 17:35