ARROW-11895: [Rust][DataFusion] Add support for more column statistics #9646
Codecov Report
@@ Coverage Diff @@
## master #9646 +/- ##
=======================================
Coverage 82.49% 82.50%
=======================================
Files 245 245
Lines 57347 57375 +28
=======================================
+ Hits 47311 47339 +28
Misses 10036 10036
FYI @andygrove I plan to use this later for some
BTW the test workspace check is also failing on master, so it may not be related to this PR. See more details at https://issues.apache.org/jira/browse/ARROW-11896
andygrove
left a comment
LGTM. Thanks @Dandandan
FYI I merged #9653 / ARROW-11896 for the Rust CI checks, which may affect this PR. If you see "Rust / AMD64 Debian 10 Rust stable test workspace" failing with a linker error or no logs, rebasing against master will hopefully fix the problem.
d33854a to
0b9f3fa
This is really cool, @Dandandan. And we don't have to apply the silly limitation that Spark has, where the table has to be registered in a Hive Metastore before you can calculate statistics!
I am also pretty excited about using this for data science / BI use cases where you could load all data in memory. I think that would be a pretty unique feature.
alamb
left a comment
Just to be clear, this is the infrastructure to implement more sophisticated statistics, but nothing (yet) takes advantage of them. I think it is a good step in the right direction. Thanks @Dandandan
This PR
This will allow implementing more advanced statistics-based optimizations, such as those outlined in this article:
https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html
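To make the idea concrete, here is a minimal sketch of how per-column statistics could feed a cardinality estimate. Everything in it is an assumption for illustration: `ColumnStats`, `TableStats`, their fields, and both estimator functions are hypothetical names, not the types or API added by this PR. The equality estimate relies on the common assumption that values are uniformly distributed over the distinct values of a column.

```rust
/// Hypothetical per-column statistics (illustrative names, not the PR's types).
#[derive(Debug, Clone, Default)]
struct ColumnStats {
    /// Number of NULL values in the column, if known.
    null_count: Option<usize>,
    /// Number of distinct non-NULL values in the column, if known.
    distinct_count: Option<usize>,
}

/// Hypothetical table-level statistics.
#[derive(Debug, Clone, Default)]
struct TableStats {
    /// Total number of rows, if known.
    num_rows: Option<usize>,
    /// One entry per column, in schema order.
    columns: Vec<ColumnStats>,
}

/// Estimate how many rows survive `col IS NOT NULL`.
/// Returns `None` when any required statistic is missing.
fn estimate_not_null_rows(stats: &TableStats, col: usize) -> Option<usize> {
    let rows = stats.num_rows?;
    let nulls = stats.columns.get(col)?.null_count?;
    Some(rows.saturating_sub(nulls))
}

/// Estimate how many rows survive `col = <literal>`, assuming the non-NULL
/// values are spread uniformly over the distinct values.
fn estimate_equality_rows(stats: &TableStats, col: usize) -> Option<usize> {
    let non_null = estimate_not_null_rows(stats, col)?;
    let distinct = stats.columns.get(col)?.distinct_count?;
    if distinct == 0 {
        // No distinct values means no row can match.
        return Some(0);
    }
    Some(non_null / distinct)
}
```

An optimizer could compare such estimates for two inputs of a join and, for example, build the hash table on the side expected to be smaller. Returning `None` when a statistic is unknown keeps the planner from acting on guesses.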