Coverage: Add a manual test to show what Spark built-in expressions the DF can support directly #331
Conversation
Hi @comphead, is this ready for review?
Thanks @advancedxy, I think so.
kazuyukitanimura left a comment
Just to understand: being backed by DataFusion does not automatically mean Spark compatibility?
```scala
private val rawCoverageFilePath = "doc/spark_builtin_expr_coverage.txt"
private val aggCoverageFilePath = "doc/spark_builtin_expr_coverage_agg.txt"
private val rawCoverageFileDatafusionPath = "doc/spark_builtin_expr_df_coverage.txt"
```
nit: now we have a docs dir. Let's organize the output location...
```scala
 * Manual test to calculate Spark builtin expressions coverage support by the Datafusion
 * directly
 *
 * The test will update files doc/spark_builtin_expr_df_coverage.txt,
```
I think it would be better to add the new column to the previous file (spark_builtin_expr_coverage.txt), so that we can clearly see which function has already been implemented, which has not, and whether that function would be supported directly by the DataFusion kernel, without looking at two files.
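For illustration, a merged row could carry the Comet and DataFusion support flags side by side. A minimal Scala sketch, with field names that are hypothetical rather than taken from this PR:

```scala
// Hypothetical shape of one merged coverage row; all names here are
// illustrative assumptions, not code from this PR.
case class CoverageRecord(
    expression: String,           // Spark builtin expression name, e.g. "abs"
    cometSupported: Boolean,      // already covered by Comet
    datafusionSupported: Boolean, // datafusion-cli can evaluate it directly
    details: String)              // failure reason or other notes
```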
Good point, let me think about it.
done
```scala
 *
 * The test will update files doc/spark_builtin_expr_df_coverage.txt,
 *
 * Requires to set DATAFUSIONCLI_PATH env variable to valid path to datafusion-cli
```
If we are going to build the coverage file directly in the CI pipeline, I think this is fine. Otherwise, we may need to update the development.md file or other documentation to explain how to run this coverage suite.
In my opinion, I would prefer to set up the CI pipeline earlier so that others don't have to figure out how to run this file.
It will be added to CI in follow-up PRs.
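Until that CI wiring lands, running the suite locally only needs the environment variable set. A minimal sketch of how the suite might resolve it, where the exact lookup and error message are assumptions:

```scala
// Sketch only: resolve the datafusion-cli location from the environment.
// The actual validation and failure handling in the suite may differ.
val datafusionCliDir: String = sys.env.getOrElse(
  "DATAFUSIONCLI_PATH",
  sys.error("Set DATAFUSIONCLI_PATH to the directory containing datafusion-cli"))
```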
spark/src/test/scala/org/apache/comet/CometExpressionCoverageSuite.scala
```diff
  }

- // TODO: convert results into HTML
+ // TODO: convert results into HTML or .md file
```
Perhaps log an issue for this and refer to that instead?
It would be great to link the output directly into the documentation.
Yes, in the next PR it's expected to generate an .md file in docs, rather than txt.
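A rough sketch of what that follow-up .md generation could look like, assuming the results arrive as (expression, support) pairs; the helper name and column layout are hypothetical:

```scala
// Hypothetical helper: render coverage results as a Markdown table.
// The row type and columns are assumptions for illustration only.
def toMarkdown(rows: Seq[(String, String)]): String = {
  val header = Seq("| Expression | Support |", "|---|---|")
  val body = rows.map { case (expr, support) => s"| $expr | $support |" }
  (header ++ body).mkString("\n")
}
```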
Can you remove the commented-out code from this PR?
I hope in most cases yes, as we have a generic wrapper; this file is more of a reference for which builtin functions should probably be added to DF.
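In spirit, the probe can be as simple as handing datafusion-cli a one-line query per builtin and recording whether it succeeds. A hedged sketch, where the process invocation and query construction are simplified assumptions rather than the suite's actual implementation:

```scala
import scala.sys.process._

// Sketch: ask datafusion-cli to evaluate a query built around a Spark
// builtin expression; a zero exit code is treated as "supported directly".
// datafusion-cli's -c flag runs a single command and exits.
def supportedByDatafusion(cliDir: String, query: String): Boolean =
  Seq(s"$cliDir/datafusion-cli", "-c", query).! == 0
```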
…uite.scala Co-authored-by: advancedxy <xianjin@apache.org>
Changes:
The test is optional now, meaning it can be run manually to update what is supported, what is not, and the reason.
@parthchandra @advancedxy @kazuyukitanimura @viirya @andygrove, if you have time to take a second look.
I have a similar question, and I'm not sure that I really understand the motivation for the tool. As we add support for more Spark expressions, I can see that it makes sense to see if these expressions are already implemented in DataFusion, but we would still have to implement Spark-specific tests in Comet and then determine if the DataFusion expression is sufficient. If not, we then have to implement our own custom version. I don't think there is a scenario where we see that DataFusion has an expression and we just enable it in Comet.
Thanks @andygrove for the review. Basically, the idea of this test is to show how much coverage we currently have in Comet for Spark builtin functions.
Having this info, we can later generate the coverage .md dynamically instead of a static one.
advancedxy left a comment
Sorry for the late response, I was quite busy the last two weeks.
LGTM except one minor comment about the javadoc.
```scala
 * The test will update files doc/spark_builtin_expr_coverage.txt,
 * doc/spark_builtin_expr_coverage_agg.txt
```
It seems like the latest code only generates spark_builtin_expr_coverage.txt? Maybe these two lines should be updated.
I agree. However, I believe the value of the generated expr coverage is that it can provide developers with insights about whether the related function has already been implemented in DataFusion or not. If so, developers can use that as a reference and don't have to reinvent the whole thing from scratch.
If we don't need to expose the datafusion-cli message, it's possible to just hide it from the output file.
andygrove left a comment
Thanks @comphead. It would be good to remove the commented-out code and maybe add a link to any follow-up issues.
Thanks everyone for the review.
…e DF can support directly (apache#331)
* Coverage: Add a manual test for DF supporting Spark expressions directly
Co-authored-by: advancedxy <xianjin@apache.org>
(cherry picked from commit 9ca63a2)
…pache#331)
* chore: Update version to 0.2.0 and add 0.1.0 changelog (apache#696)
* Generate changelog
* fix error in release verification script
* Change version from 0.1.0 to 0.2.0
* add changelog page and update release instructions
* address feedback
* update version to 0.2.0-SNAPSHOT
* fix: do not overwrite withInfo data
* update test
Which issue does this PR close?
Closes #.
Related to #240
Rationale for this change
Adding a tool that generates up-to-date coverage of Spark builtin expressions by DataFusion directly. It is supposed to give the developer a better understanding of whether a Spark builtin function should be backed by a DataFusion match or implemented as a custom function in the Comet native core.
What changes are included in this PR?
How are these changes tested?
A test which generates doc/spark_builtin_expr_df_coverage.txt