ARROW-11895: [Rust][DataFusion] Add support for more column statistics #9646

Dandandan · 2021-03-06T11:04:59Z

This PR:

- adds max/min/distinct_count column statistics
- implements null count statistics for the parquet reader & table provider
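
As a rough illustration of what such per-column statistics carry, here is a minimal, self-contained sketch. It is not the code from this PR: the `ColumnStats` struct and `compute_stats` function are hypothetical names, and the column is simplified to nullable `i64` values rather than DataFusion's own value types. Every field is optional on the assumption that a data source may not be able to provide it.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified per-column statistics (illustration only).
/// Each field is optional because a source may not know the value.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct ColumnStats {
    pub null_count: Option<usize>,
    pub min_value: Option<i64>,
    pub max_value: Option<i64>,
    pub distinct_count: Option<usize>,
}

/// Compute exact statistics for an in-memory nullable integer column.
pub fn compute_stats(column: &[Option<i64>]) -> ColumnStats {
    // Nulls are counted separately and excluded from min/max/distinct.
    let null_count = column.iter().filter(|v| v.is_none()).count();
    let non_null: Vec<i64> = column.iter().flatten().copied().collect();
    let distinct: HashSet<i64> = non_null.iter().copied().collect();

    ColumnStats {
        null_count: Some(null_count),
        // `min`/`max` return `None` when the column has no non-null values.
        min_value: non_null.iter().copied().min(),
        max_value: non_null.iter().copied().max(),
        distinct_count: Some(distinct.len()),
    }
}
```

For an in-memory column the values can be computed exactly, as above. For a file format such as Parquet, the equivalent numbers would typically be read from file metadata where available and left as `None` otherwise.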

This will allow implementing more advanced statistics-based optimizations, such as those outlined in this article:

https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html
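
To make the idea of a statistics-based optimization concrete, the sketch below estimates how many rows an equality filter `col = value` would return, using only the kinds of statistics listed above. It is an assumption-laden illustration, not part of this PR: `ColumnSummary` and `estimate_eq_rows` are hypothetical names, and the estimate assumes non-null rows are spread uniformly across the distinct values.

```rust
/// Hypothetical summary of one column, assumed to be fully known here.
pub struct ColumnSummary {
    pub num_rows: usize,
    pub null_count: usize,
    pub distinct_count: usize,
    pub min_value: i64,
    pub max_value: i64,
}

/// Estimate the number of rows matching `col = value`.
/// Assumes non-null rows are uniformly distributed over distinct values.
pub fn estimate_eq_rows(s: &ColumnSummary, value: i64) -> usize {
    // A value outside [min, max] cannot match any row.
    if value < s.min_value || value > s.max_value || s.distinct_count == 0 {
        return 0;
    }
    let non_null = s.num_rows.saturating_sub(s.null_count);
    // Ceiling division: with distinct_count > 0 and non_null > 0 this is at least 1.
    (non_null + s.distinct_count - 1) / s.distinct_count
}
```

An optimizer could compare such estimates for the two inputs of a join and, for example, build the hash table on the side expected to be smaller.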

github-actions · 2021-03-06T11:05:18Z

https://issues.apache.org/jira/browse/ARROW-11895

codecov-io · 2021-03-06T11:24:54Z

Codecov Report

Merging #9646 (b5bafa5) into master (bfa99d9) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #9646   +/-   ##
=======================================
  Coverage   82.49%   82.50%           
=======================================
  Files         245      245           
  Lines       57347    57375   +28     
=======================================
+ Hits        47311    47339   +28     
  Misses      10036    10036

| Impacted Files | Coverage Δ | |
|---|---|---|
| rust/datafusion/src/datasource/datasource.rs | `100.00% <ø> (ø)` | |
| rust/datafusion/src/datasource/memory.rs | `85.15% <100.00%> (+1.04%)` | ⬆️ |
| rust/datafusion/src/physical_plan/parquet.rs | `88.08% <100.00%> (+0.24%)` | ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Dandandan · 2021-03-07T10:05:30Z

FYI @andygrove I plan to use this later for some ANALYZE / COMPUTE STATISTICS support (would be useful for in memory data) and/or use parquet statistics to make the join order optimization more advanced.

alamb · 2021-03-07T11:45:58Z

BTW the test workspace check is also failing on master, so it may not be related to this PR. See more details on https://issues.apache.org/jira/browse/ARROW-11896

andygrove

LGTM. Thanks @Dandandan

alamb · 2021-03-07T17:57:55Z

FYI I merged #9653 / ARROW-11896 for the Rust CI checks which may affect this PR. If you see "Rust / AMD64 Debian 10 Rust stable test workspace" failing with a linker error or no logs, rebasing against master will hopefully fix the problem

seddonm1 · 2021-03-07T23:35:31Z

> FYI @andygrove I plan to use this later for some ANALYZE / COMPUTE STATISTICS support (would be useful for in memory data) and/or use parquet statistics to make the join order optimization more advanced.

This is really cool @Dandandan. And we don't have to apply the silly limitation that Spark does, where the table has to be registered in a Hive Metastore to allow you to calculate statistics!

Dandandan · 2021-03-08T09:17:53Z

> > FYI @andygrove I plan to use this later for some ANALYZE / COMPUTE STATISTICS support (would be useful for in memory data) and/or use parquet statistics to make the join order optimization more advanced.
>
> This is really cool @Dandandan. And we don't have to apply the silly limitation that Spark does, where the table has to be registered in a Hive Metastore to allow you to calculate statistics!

I am also pretty excited about using this for data science / BI use cases where you could load all data in memory. I think that would be a pretty unique feature.

alamb

Just to be clear, this is the infrastructure to implement more sophisticated statistics, but nothing (yet) takes advantage of them. I think it is a good step in the right direction. Thanks @Dandandan

github-actions bot added Component: Rust - DataFusion Component: Rust labels Mar 6, 2021

Dandandan changed the title ~~ARROW-11895: [Rust][DataFusion] Add support for extra columns statistics~~ ARROW-11895: [Rust][DataFusion] Add support for more column statistics Mar 6, 2021

andygrove approved these changes Mar 7, 2021

Dandandan added 4 commits March 7, 2021 22:25

- c75ac61 Add support for extra columns statistics
- f47365e Add null countst to parquet reader stats
- 36ef586 Fix comments
- 0b9f3fa Add null stats to parquet table

Dandandan force-pushed the extra_column_statistics branch from d33854a to 0b9f3fa Compare March 7, 2021 21:25

alamb reviewed Mar 9, 2021

alamb approved these changes Mar 9, 2021

alamb closed this in 37ef3ad Mar 9, 2021

asfimport mentioned this pull request Mar 9, 2021

[Rust][DataFusion] Add support for extra column statistics #27737

Closed
