@Dandandan
Contributor

@Dandandan Dandandan commented Mar 6, 2021

This PR

  • adds max/min/distinct_count column statistics
  • implements null count statistics for the parquet reader & table provider

This will allow implementing more advanced statistics-based optimizations, such as those outlined in this article:

https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html
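
As a rough illustration of the shape of these statistics, here is a minimal sketch, not the code in this PR. `ColumnStats` and `merge` are hypothetical names, and `i64` stands in for whatever scalar type the real implementation uses. It shows how per-chunk values (for example, per parquet row group) could be combined into table-level statistics:

```rust
/// Hypothetical per-column statistics. Every field is optional because a
/// data source may not know a given value. In this sketch `None` always
/// means "unknown".
#[derive(Debug, Clone, PartialEq, Default)]
pub struct ColumnStats {
    pub null_count: Option<usize>,
    pub min_value: Option<i64>,
    pub max_value: Option<i64>,
    pub distinct_count: Option<usize>,
}

/// Combine the statistics of two disjoint chunks of the same column.
pub fn merge(a: &ColumnStats, b: &ColumnStats) -> ColumnStats {
    ColumnStats {
        // Null counts add up, but only when both chunks report one.
        null_count: a.null_count.zip(b.null_count).map(|(x, y)| x + y),
        // The overall minimum is the smaller of the two minimums.
        min_value: a.min_value.zip(b.min_value).map(|(x, y)| x.min(y)),
        // The overall maximum is the larger of the two maximums.
        max_value: a.max_value.zip(b.max_value).map(|(x, y)| x.max(y)),
        // Distinct counts cannot be combined without seeing the values,
        // so the merged result is left unknown.
        distinct_count: None,
    }
}
```

If either chunk is missing a value, the merged value is also unknown, which keeps the result conservative for any optimizer that relies on it.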

@codecov-io

codecov-io commented Mar 6, 2021

Codecov Report

Merging #9646 (b5bafa5) into master (bfa99d9) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #9646   +/-   ##
=======================================
  Coverage   82.49%   82.50%           
=======================================
  Files         245      245           
  Lines       57347    57375   +28     
=======================================
+ Hits        47311    47339   +28     
  Misses      10036    10036           
| Impacted Files | Coverage Δ |
|---|---|
| rust/datafusion/src/datasource/datasource.rs | 100.00% <ø> (ø) |
| rust/datafusion/src/datasource/memory.rs | 85.15% <100.00%> (+1.04%) ⬆️ |
| rust/datafusion/src/physical_plan/parquet.rs | 88.08% <100.00%> (+0.24%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update bfa99d9...b5bafa5.

@Dandandan changed the title from "ARROW-11895: [Rust][DataFusion] Add support for extra columns statistics" to "ARROW-11895: [Rust][DataFusion] Add support for more column statistics" on Mar 6, 2021
@Dandandan
Contributor Author

FYI @andygrove I plan to use this later for some ANALYZE / COMPUTE STATISTICS support (would be useful for in memory data) and/or use parquet statistics to make the join order optimization more advanced.
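
To make the ANALYZE / COMPUTE STATISTICS idea concrete for in-memory data, a single pass over a column could look roughly like the sketch below. This is only an illustration under assumptions: `ComputedStats` and `analyze_column` are hypothetical names, the column is modeled as a slice of `Option<i64>` rather than an Arrow array, and the distinct count is exact (a real implementation might prefer an approximate sketch for large inputs).

```rust
use std::collections::HashSet;

/// Hypothetical result of analyzing one in-memory column.
#[derive(Debug, PartialEq)]
pub struct ComputedStats {
    pub null_count: usize,
    pub min_value: Option<i64>,
    pub max_value: Option<i64>,
    pub distinct_count: usize,
}

/// Compute null count, min, max, and exact distinct count in one pass.
/// `None` entries model SQL NULLs and are excluded from min/max/distinct.
pub fn analyze_column(values: &[Option<i64>]) -> ComputedStats {
    let mut null_count = 0;
    let mut min_value: Option<i64> = None;
    let mut max_value: Option<i64> = None;
    let mut seen = HashSet::new();

    for value in values {
        match value {
            None => null_count += 1,
            Some(v) => {
                // The first non-null value seeds both min and max.
                min_value = Some(min_value.map_or(*v, |m| m.min(*v)));
                max_value = Some(max_value.map_or(*v, |m| m.max(*v)));
                seen.insert(*v);
            }
        }
    }

    ComputedStats {
        null_count,
        min_value,
        max_value,
        distinct_count: seen.len(),
    }
}
```

For a column with no non-null values, min and max stay `None` and the distinct count is zero.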

@alamb
Contributor

alamb commented Mar 7, 2021

BTW the test workspace check is also failing on master, so it may not be related to this PR. See more details on https://issues.apache.org/jira/browse/ARROW-11896

Member

@andygrove andygrove left a comment

LGTM. Thanks @Dandandan

@alamb
Contributor

alamb commented Mar 7, 2021

FYI I merged #9653 / ARROW-11896 for the Rust CI checks which may affect this PR. If you see "Rust / AMD64 Debian 10 Rust stable test workspace" failing with a linker error or no logs, rebasing against master will hopefully fix the problem

@Dandandan Dandandan force-pushed the extra_column_statistics branch from d33854a to 0b9f3fa Compare March 7, 2021 21:25
@seddonm1
Contributor

seddonm1 commented Mar 7, 2021

> FYI @andygrove I plan to use this later for some ANALYZE / COMPUTE STATISTICS support (would be useful for in memory data) and/or use parquet statistics to make the join order optimization more advanced.

This is really cool @Dandandan. And we don't have to apply the silly limitation that Spark has, where the table has to be registered in a Hive Metastore before you can calculate statistics!

@Dandandan
Contributor Author

> > FYI @andygrove I plan to use this later for some ANALYZE / COMPUTE STATISTICS support (would be useful for in memory data) and/or use parquet statistics to make the join order optimization more advanced.
>
> This is really cool @Dandandan. And we don't have to apply the silly limitation that Spark has, where the table has to be registered in a Hive Metastore before you can calculate statistics!

I am also pretty excited about using this for data science / BI use cases where you could load all data in memory. I think that would be a pretty unique feature.

Contributor

@alamb alamb left a comment

Just to be clear, this is the infrastructure to implement more sophisticated statistics, but nothing (yet) takes advantage of them. I think it is a good step in the right direction. Thanks @Dandandan

5 participants