Coverage: Add a manual test to show what Spark built-in expressions the DF can support directly #331
Conversation
Hi @comphead, is this ready for review?
Thanks @advancedxy, I think so.
kazuyukitanimura left a comment
Just to understand: being backed by DataFusion does not automatically mean Spark compatibility?
```scala
private val rawCoverageFilePath = "doc/spark_builtin_expr_coverage.txt"
private val aggCoverageFilePath = "doc/spark_builtin_expr_coverage_agg.txt"
private val rawCoverageFileDatafusionPath = "doc/spark_builtin_expr_df_coverage.txt"
```
nit: now we have a docs dir. Let's organize the output location...
```scala
 * Manual test to calculate Spark builtin expressions coverage support by the Datafusion
 * directly
 *
 * The test will update files doc/spark_builtin_expr_df_coverage.txt,
```
I think it would be better to add the new column to the previous file (spark_builtin_expr_coverage.txt), so that we can clearly see which function has already been implemented, which has not, and whether that function would be supported directly by the DataFusion kernel, without looking at two files.
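For illustration, a merged row could carry the Comet and DataFusion support flags side by side. A minimal Scala sketch, with field names that are hypothetical rather than taken from this PR:

```scala
// Hypothetical shape of one merged coverage row; all names here are
// illustrative assumptions, not code from this PR.
case class CoverageRecord(
    expression: String,           // Spark builtin expression name, e.g. "abs"
    cometSupported: Boolean,      // already covered by Comet
    datafusionSupported: Boolean, // datafusion-cli can evaluate it directly
    details: String)              // failure reason or other notes
```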
Good point, let me think about it.
done
```scala
 *
 * The test will update files doc/spark_builtin_expr_df_coverage.txt,
 *
 * Requires to set DATAFUSIONCLI_PATH env variable to valid path to datafusion-cli
```
If we are going to build the coverage file directly in the CI pipeline, I think this is fine. Otherwise, we may need to update the development.md file or other documentation to explain how to run this coverage suite.
In my opinion, I would prefer to set up the CI pipeline earlier so that others don't have to figure out how to run this file.
It will be added to CI in follow-up PRs.
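Until that CI wiring lands, running the suite locally only needs the environment variable set. A minimal sketch of how the suite might resolve it, where the exact lookup and error message are assumptions:

```scala
// Sketch only: resolve the datafusion-cli location from the environment.
// The actual validation and failure handling in the suite may differ.
val datafusionCliDir: String = sys.env.getOrElse(
  "DATAFUSIONCLI_PATH",
  sys.error("Set DATAFUSIONCLI_PATH to the directory containing datafusion-cli"))
```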
spark/src/test/scala/org/apache/comet/CometExpressionCoverageSuite.scala
```diff
  }

- // TODO: convert results into HTML
+ // TODO: convert results into HTML or .md file
```
Perhaps log an issue for this and refer to that instead?
It would be great to link the output directly into the documentation.
Yes, in the next PR it's expected to generate an .md file in docs, rather than txt.
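A rough sketch of what that follow-up .md generation could look like, assuming the results arrive as (expression, support) pairs; the helper name and column layout are hypothetical:

```scala
// Hypothetical helper: render coverage results as a Markdown table.
// The row type and columns are assumptions for illustration only.
def toMarkdown(rows: Seq[(String, String)]): String = {
  val header = Seq("| Expression | Support |", "|---|---|")
  val body = rows.map { case (expr, support) => s"| $expr | $support |" }
  (header ++ body).mkString("\n")
}
```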
Can you remove the commented-out code from this PR?
I hope in most cases yes, as we have a generic wrapper; this file is more of a reference for which builtin functions should probably be added to DF.
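In spirit, the probe can be as simple as handing datafusion-cli a one-line query per builtin and recording whether it succeeds. A hedged sketch, where the process invocation and query construction are simplified assumptions rather than the suite's actual implementation:

```scala
import scala.sys.process._

// Sketch: ask datafusion-cli to evaluate a query built around a Spark
// builtin expression; a zero exit code is treated as "supported directly".
// datafusion-cli's -c flag runs a single command and exits.
def supportedByDatafusion(cliDir: String, query: String): Boolean =
  Seq(s"$cliDir/datafusion-cli", "-c", query).! == 0
```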
…uite.scala Co-authored-by: advancedxy <xianjin@apache.org>
Changes:
The test is optional now, meaning it can be run manually to update what is supported, what is not, and the reason.
@parthchandra @advancedxy @kazuyukitanimura @viirya @andygrove, if you have time to take a second look.
I have a similar question, and I'm not sure that I really understand the motivation for the tool. As we add support for more Spark expressions, I can see that it makes sense to see if these expressions are already implemented in DataFusion, but we would still have to implement Spark-specific tests in Comet and then determine if the DataFusion expression is sufficient. If not, we then have to implement our own custom version. I don't think there is a scenario where we see that DataFusion has an expression and we just enable it in Comet.
Thanks @andygrove for the review. Basically, the idea of this test is to show how much coverage we currently have in Comet for Spark builtin functions.
Having this info, we can later generate the coverage .md dynamically instead of a static one.
advancedxy left a comment
Sorry for the late response, I was quite busy the last two weeks.
LGTM except one minor comment about the javadoc.
```scala
 * The test will update files doc/spark_builtin_expr_coverage.txt,
 * doc/spark_builtin_expr_coverage_agg.txt
```
It seems like the latest code only generates spark_builtin_expr_coverage.txt? Maybe these two lines should be updated.
I agree. However, I believe the value of the generated expr coverage is that it can provide developers with insights about whether the related function has already been implemented in DataFusion or not. If so, developers can use that as a reference and don't have to reinvent the whole thing from scratch.
If we don't need to expose the datafusion-cli message, it's possible to just hide it from the output file.
andygrove left a comment
Thanks @comphead. It would be good to remove the commented-out code and maybe add a link to any follow-up issues.
Thanks everyone for the review.
…e DF can support directly (apache#331)
* Coverage: Add a manual test for DF supporting Spark expressions directly
Co-authored-by: advancedxy <xianjin@apache.org>
(cherry picked from commit 9ca63a2)
…pache#331)
* chore: Update version to 0.2.0 and add 0.1.0 changelog (apache#696)
* Generate changelog
* fix error in release verification script
* Change version from 0.1.0 to 0.2.0
* add changelog page and update release instructions
* address feedback
* update version to 0.2.0-SNAPSHOT
* fix: do not overwrite withInfo data
* update test
Which issue does this PR close?
Closes #.
Related to #240
Rationale for this change
Adding a tool that generates up-to-date coverage of Spark builtin expressions by DataFusion directly. It is supposed to give the developer a better understanding of whether a Spark builtin function should be backed by a DataFusion match or implemented as a custom function in the Comet native core.
What changes are included in this PR?
How are these changes tested?
A test which generates doc/spark_builtin_expr_df_coverage.txt