-
Notifications
You must be signed in to change notification settings - Fork 270
feat: Port Datafusion Covariance to Comet #234
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #234 +/- ##
============================================
+ Coverage 33.32% 33.36% +0.04%
- Complexity 769 776 +7
============================================
Files 107 108 +1
Lines 37037 37179 +142
Branches 8106 8194 +88
============================================
+ Hits 12341 12404 +63
- Misses 22098 22154 +56
- Partials 2598 2621 +23 ☔ View full report in Codecov by Sentry. |
5b2ba99 to
a80cee2
Compare
5324464 to
47604a3
Compare
| expr1: Arc<dyn PhysicalExpr>, | ||
| expr2: Arc<dyn PhysicalExpr>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there better names for these two children expressions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spark uses left and right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will leave this as is unless you prefer left and right.
| fn create_accumulator(&self) -> Result<Box<dyn Accumulator>> { | ||
| Ok(Box::new(CovarianceAccumulator::try_new( | ||
| StatsType::Population, | ||
| )?)) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the difference between CovariancePop and Covariance is only the StatsType parameter when creating CovarianceAccumulator?
If so, maybe we only need Covariance and make StatsType as parameter when creating Covariance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/apache/arrow-datafusion-comet/blob/955a0b9b9b026b477b5b527aba133d9e1402bd7e/core/src/execution/datafusion/expressions/covariance.rs#L368
Population Covariance is calculated over the entire dataset(N) whereas Sample Covariance is calculated over a sample (N-1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, but the behavior is decided in CovarianceAccumulator based on its StatsType.
I mean this two struct CovariancePop and Covariance. They are basically the same, except that name and StatsType are different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we can simply have one Covariance and add one more parameter of StatsType.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should update EXPRESSIONS.md too.
804dba9 to
263c8af
Compare
| PhysicalExpr, | ||
| }; | ||
|
|
||
| /// COVAR_SAMP and COVAR_POP aggregate expression |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is better to mention why we need to port it in Comet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment added.
| message CovSample { | ||
| Expr child1 = 1; | ||
| Expr child2 = 2; | ||
| bool null_on_divide_by_zero = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't use this yet, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also don't see it is specified in JVM side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, this is not used yet. I will have a follow up to address this.
viirya
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor comments.
|
Merged. Thanks. |
|
Thanks @viirya |
* feat: Port Datafusion Covariance to Comet * feat: Port Datafusion Covariance to Comet * fmt * update EXPRESSIONS.md * combine COVAR_SAMP and COVAR_POP * fix fmt * address comment --------- Co-authored-by: Huaxin Gao <huaxin.gao@apple.com> Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Which issue does this PR close?
Closes #.
Port
Covariancein Comet. The reason we can't use DataFusionCovarianceimplementation is that SparkCovarianceaggregate expression's state is different from DataFusion's state.Rationale for this change
What changes are included in this PR?
How are these changes tested?