Improve documentation about ParquetExec / Parquet predicate pushdown#11994
Improve documentation about ParquetExec / Parquet predicate pushdown#11994alamb merged 6 commits intoapache:mainfrom
ParquetExec / Parquet predicate pushdown#11994Conversation
| /// * User provided [`ParquetAccessPlan`]s to skip row groups and/or pages | ||
| /// based on external information. See "Implementing External Indexes" below | ||
| /// | ||
| /// # Predicate Pushdown |
There was a problem hiding this comment.
I tried to consolidate the description of what predicate pushdown is done in the ParquetExec
| // specific language governing permissions and limitations | ||
| // under the License. | ||
|
|
||
| //! Utilities to push down of DataFusion filter predicates (any DataFusion |
There was a problem hiding this comment.
this is mostly the same content, reformatted and made more concise.
| /// | ||
| /// Sorted columns may be queried more efficiently in the presence of | ||
| /// a PageIndex. | ||
| fn columns_sorted( |
There was a problem hiding this comment.
This is interesting that we never connected up the "columns_sorted" information -- is this on your list @thinkharderdev ?
Should I file a ticket to do this?
| /// A [Visitor](https://en.wikipedia.org/wiki/Visitor_pattern) for recursively | ||
| /// rewriting [`TreeNode`]s via [`TreeNode::rewrite`]. | ||
| /// | ||
| /// For example you can implement this trait on a struct to rewrite `Expr` or |
There was a problem hiding this comment.
should we add an example of it? 🤔
| //! 6. Partition the predicates according to whether they are sorted (from step 4) | ||
| //! 7. "Compile" each predicate `Expr` to a `DatafusionArrowPredicate`. | ||
| //! 8. Build the `RowFilter` with the sorted predicates followed by | ||
| //! the unsorted predicates. Within each partition, predicates are |
There was a problem hiding this comment.
this explanation is a gem
There was a problem hiding this comment.
I think @thinkharderdev wrote it back in the day.
This PR just simplifies the wording slightly
| /// # Return values | ||
| /// | ||
| /// * `Ok(Some(candidate))` if the expression can be used as an ArrowFilter | ||
| /// * `Ok(None)` if the expression cannot be used as an ArrowFilter |
itsjunetime
left a comment
There was a problem hiding this comment.
This helps a lot with (at least my own) comprehension, I think. Thank you
| if self.file_schema.field_with_name(column.name()).is_err() { | ||
| // the column expr must be in the table schema | ||
| // Replace the column reference with a NULL (using the type from the table schema) | ||
| // e.g. `column = 'foo'` is rewritten be transformed to `NULL = 'foo'` |
There was a problem hiding this comment.
This is obviously a much better comment than before, but I think it could be further improved with an explanation stating why we do this column rewriting, or what purpose it serves.
There was a problem hiding this comment.
I agree -- I tried to provide this information in c0b9012
Co-authored-by: Oleks V <comphead@users.noreply.github.com>
alamb
left a comment
There was a problem hiding this comment.
Thank you @comphead and @itsjunetime for the review
| //! 6. Partition the predicates according to whether they are sorted (from step 4) | ||
| //! 7. "Compile" each predicate `Expr` to a `DatafusionArrowPredicate`. | ||
| //! 8. Build the `RowFilter` with the sorted predicates followed by | ||
| //! the unsorted predicates. Within each partition, predicates are |
There was a problem hiding this comment.
I think @thinkharderdev wrote it back in the day.
This PR just simplifies the wording slightly
| if self.file_schema.field_with_name(column.name()).is_err() { | ||
| // the column expr must be in the table schema | ||
| // Replace the column reference with a NULL (using the type from the table schema) | ||
| // e.g. `column = 'foo'` is rewritten be transformed to `NULL = 'foo'` |
There was a problem hiding this comment.
I agree -- I tried to provide this information in c0b9012
|
Thanks again -- let me know if you have additional suggestions and I'll make them in a follow on PR |
Which issue does this PR close?
part of #4028
Rationale for this change
While reviewing this code with @itsjunetime, we discovered some interesting things that I would like to encode in comments.
What changes are included in this PR?
Are these changes tested?
Yes, CI
Are there any user-facing changes?
Documentation change only (no functional changes)
Note most of the docs are internal (don't appear on docs.rs)