[SPARK-44131][SQL][PYTHON][CONNECT][FOLLOWUP] Support qualified function name for call_function #41932
Conversation
The CI failure looks unrelated.
Can we add a test in HiveUDFSuite to make sure we can invoke a persistent Hive function?
OK
Will this also support Spark Connect?

Considering the end users, I think we should keep the behavior of
Shall we add a private method that takes a `Seq[String]`, so that we can call it if a method does not want to support qualified names?
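(For illustration, the suggested shape might look like the sketch below; the parser lookup and the `UnresolvedFunction` constructor are assumptions about Spark's internals, not the actual source.)

```scala
// Sketch of the suggested refactoring: the public API parses the
// possibly-qualified name, while a private overload takes the pre-split
// name parts, so callers that do not want qualified names can bypass parsing.
def call_function(funcName: String, cols: Column*): Column = {
  val parser = SparkSession.active.sessionState.sqlParser
  call_function(parser.parseMultipartIdentifier(funcName), cols)
}

private def call_function(nameParts: Seq[String], cols: Seq[Column]): Column =
  Column(UnresolvedFunction(nameParts, cols.map(_.expr), isDistinct = false))
```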
This is not a persistent function. Can you check the other tests in this file? We need to use CREATE FUNCTION to create persistent functions.
Got it.
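(For reference, a test along the requested lines might look like the following sketch; the function name `custom_sum`, the sample data, and the test helpers follow the suite's conventions but are illustrative. `test.org.apache.spark.sql.MyDoubleSum` is the existing test UDAF that also appears in the failure log later in this thread.)

```scala
// Sketch of a HiveUDFSuite-style test: create a persistent function with
// CREATE FUNCTION, then invoke it through call_function. The `false` in the
// tuple marks the function as non-temporary for cleanup.
test("call_function supports persistent functions") {
  withUserDefinedFunction("custom_sum" -> false) {
    sql("CREATE FUNCTION custom_sum AS 'test.org.apache.spark.sql.MyDoubleSum'")
    val df = Seq(1.0, 2.0, 3.0).toDF("value")
    checkAnswer(
      df.select(call_function("custom_sum", $"value")),
      df.selectExpr("custom_sum(value)"))
  }
}
```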
Force-pushed 820fb59 to 3fcf654.
Can we also test calling the function with the qualified name `spark_catalog.default.custom_func`?
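(For example, something like the sketch below, assuming the persistent function created in the earlier sketch:)

```scala
// Sketch: the same persistent function invoked via its fully qualified name.
checkAnswer(
  df.select(call_function("spark_catalog.default.custom_sum", $"value")),
  df.selectExpr("custom_sum(value)"))
```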
@beliefer The branch cut is soon; shall we also support it in Spark Connect? Otherwise, the behaviors will be different.

It's better to support it too.
Force-pushed 58aad28 to 4513675.
Force-pushed 4513675 to 80323e9.
@LuciferYang Would you mind helping to check this part? I am not familiar with it.
ok
Run the following commands:

```
build/sbt clean
build/sbt "connect-client-jvm/test" -Phive
```
One test failed:
```
[info] - call_function *** FAILED *** (150 milliseconds)
[info]   org.apache.spark.SparkException: [CANNOT_LOAD_FUNCTION_CLASS] Cannot load class test.org.apache.spark.sql.MyDoubleSum when registering the function `spark_catalog`.`default`.`custom_sum`, please make sure it is on the classpath.
[info]   at org.apache.spark.sql.connect.client.GrpcExceptionConverter$.toSparkThrowable(GrpcExceptionConverter.scala:53)
[info]   at org.apache.spark.sql.connect.client.GrpcExceptionConverter$.convert(GrpcExceptionConverter.scala:30)
[info]   at org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.hasNext(GrpcExceptionConverter.scala:38)
[info]   at org.apache.spark.sql.connect.client.SparkResult.org$apache$spark$sql$connect$client$SparkResult$$processResponses(SparkResult.scala:80)
[info]   at org.apache.spark.sql.connect.client.SparkResult.length(SparkResult.scala:133)
[info]   at org.apache.spark.sql.connect.client.SparkResult.toArray(SparkResult.scala:150)
[info]   at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2813)
[info]   at org.apache.spark.sql.Dataset.withResult(Dataset.scala:3252)
[info]   at org.apache.spark.sql.Dataset.collect(Dataset.scala:2812)
[info]   at org.apache.spark.sql.ClientE2ETestSuite.$anonfun$new$139(ClientE2ETestSuite.scala:1175)
[info]   at org.apache.spark.sql.connect.client.util.RemoteSparkSession.$anonfun$test$1(RemoteSparkSession.scala:246)
```
@beliefer We should add `(LocalProject("sql") / Test / Keys.package).value` to spark/project/SparkBuild.scala, lines 875 to 878 in 228b5db:

```scala
buildTestDeps := {
  (LocalProject("assembly") / Compile / Keys.`package`).value
  (LocalProject("catalyst") / Test / Keys.`package`).value
},
```

Then the sql test jar will be built and packaged before testing.
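With that addition, the block would read as follows (a sketch of the suggested change):

```scala
buildTestDeps := {
  (LocalProject("assembly") / Compile / Keys.`package`).value
  (LocalProject("catalyst") / Test / Keys.`package`).value
  // Newly added: package the sql module's test jar as well, so the test
  // UDF classes are available to the Connect E2E tests.
  (LocalProject("sql") / Test / Keys.`package`).value
},
```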
For Maven, let me do more checking.
@LuciferYang Thank you for your reminder. I will add it.
Maven testing of ClientE2ETestSuite and ReplE2ESuite is OK, but another 68 Maven tests in the connect-client-jvm module failed; I will track them with a new ticket. @zhengruifeng
To clarify, in my local testing:
- master branch: all tests passed.
- with this PR: 68 tests failed.
I think this is truly unrelated to this PR, and I think the way `--jars` is being used in the code is incorrect. When submitting the args as

```
--jars spark-catalyst-xx.jar
--jars spark-connect-client-jvm-xx.jar
--jars spark-sql-xx.jar
```

the final effective arg will be `--jars spark-sql-xx.jar`. If we enable debug logging, we will find that only the Added JAR logs for spark-sql_2.12-3.5.0-SNAPSHOT-tests.jar and spark-connect_2.12-3.5.0-SNAPSHOT.jar are present:

```
23/07/19 14:00:34 INFO SparkContext: Added JAR file:///Users/yangjie01/SourceCode/git/spark-mine-12/sql/core/target/spark-sql_2.12-3.5.0-SNAPSHOT-tests.jar at spark://localhost:56841/jars/spark-sql_2.12-3.5.0-SNAPSHOT-tests.jar with timestamp 1689746434318
23/07/19 14:00:34 INFO SparkContext: Added JAR file:/Users/yangjie01/SourceCode/git/spark-mine-12/connector/connect/server/target/spark-connect_2.12-3.5.0-SNAPSHOT.jar at spark://localhost:56841/jars/spark-connect_2.12-3.5.0-SNAPSHOT.jar with timestamp 1689746434318
```

and the configuration item `spark.jars` also only includes these two jars:

```
Array((spark.app.name,org.apache.spark.sql.connect.SimpleSparkConnectService), (spark.jars,file:///Users/yangjie01/SourceCode/git/spark-mine-12/sql/core/target/spark-sql_2.12-3.5.0-SNAPSHOT-tests.jar,file:/Users/yangjie01/SourceCode/git/spark-mine-12/connector/connect/server/target/spark-connect_2.12-3.5.0-SNAPSHOT.jar), ...
```

We should correct the usage to a single comma-separated flag, `--jars spark-catalyst-xx.jar,spark-connect-client-jvm-xx.jar,spark-sql-xx.jar`; then the Maven tests should pass.
I think we can merge this PR first and then fix this issue separately. But @beliefer, if you prefer, you can also address it in this one :)
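(A minimal sketch of the fix described above, assuming the jar paths are collected in a sequence; the variable names are illustrative, not the actual SparkConnectServerUtils code:)

```scala
// Build one --jars flag with a comma-separated value instead of repeating
// --jars per jar; with repeated flags, only the last occurrence takes effect.
val jars = Seq(catalystTestJar, connectClientJar, sqlTestJar)
val jarArgs = Seq("--jars", jars.mkString(","))
```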
@LuciferYang Thank you for the investigation. I will take it.
Both Maven and sbt are OK now, thanks @beliefer
Thank you for the double check. @LuciferYang
```diff
- * function name that can be qualified using the SQL syntax
+ * function name that follows the SQL identifier syntax (can be quoted, can be qualified)
```
Call a SQL function. It supports any function
| case "call_function" if fun.getArgumentsCount > 1 => | |
| case "call_function" if fun.getArgumentsCount >= 1 => |
We should support no-arg functions as well: the target function name is itself the first argument here, so a no-arg call still has exactly one argument.
Shall we add a new proto message for it? Currently it may conflict with calling a temp function named `call_function`.
Yeah. Good suggestion.
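(A rough sketch of what the client side could do with such a dedicated message; the message and builder names (`CallFunction`, `setFunctionName`, `addAllArguments`) are assumptions based on this discussion, not confirmed API:)

```scala
import scala.collection.JavaConverters._

// Sketch: build a dedicated CallFunction proto carrying the unparsed name,
// instead of encoding the target name as the first argument of a generic
// unresolved function named "call_function".
def call_function(funcName: String, cols: Column*): Column = Column { builder =>
  builder.getCallFunctionBuilder
    .setFunctionName(funcName)
    .addAllArguments(cols.map(_.expr).asJava)
}
```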
Force-pushed 37ec1f6 to c553691.
Force-pushed f220e63 to 1891af5.
I think it's OK not to test persistent functions in Spark Connect, as it seems hard to include the jar containing the UDF. The client-side implementation is quite simple: it constructs a small proto message, and the server side turns it into UnresolvedFunction. Making sure it works for built-in functions is good enough.
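(As a sketch of the server-side step described here; the method and accessor names are illustrative:)

```scala
import scala.collection.JavaConverters._

// Sketch: the server parses the unparsed function name into its parts and
// builds an UnresolvedFunction, which the analyzer then resolves against
// built-in, temporary, or persistent functions.
private def transformCallFunction(fun: proto.CallFunction): Expression = {
  val nameParts = session.sessionState.sqlParser
    .parseMultipartIdentifier(fun.getFunctionName)
  UnresolvedFunction(
    nameParts,
    fun.getArgumentsList.asScala.map(transformExpression).toSeq,
    isDistinct = false)
}
```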
I have fixed the issue. Please wait for the CI.
Is it really worth it? @zhengruifeng
I have communicated with @zhengruifeng and he agreed with your opinion. Let's remove the test case for Connect.
+1, I think we don't need to include this change in this PR
```diff
- // (Required) Name of the SQL function.
+ // (Required) Unparsed name of the SQL function.
```
python/pyspark/sql/functions.py
Can we update the doc in all places?
Let's make sure the docs are consistent in all places.
Force-pushed 411fdb7 to 7456dc5.
The CI failure is unrelated to this PR.
…rameters for jars

### What changes were proposed in this pull request?
#41932 tried to add a test case for Connect; we then found the Maven build failure caused by the bug discussed at #41932 (comment). After some communication, cloud-fan and zhengruifeng suggested ignoring the test case for Connect, so I submitted this PR to fix the bug.

### Why are the changes needed?
Fix the bug that `SparkConnectServerUtils` generated incorrect parameters for jars.

### Does this PR introduce _any_ user-facing change?
'No'. Just updates the inner implementation.

### How was this patch tested?
N/A

Closes #42121 from beliefer/SPARK-44519.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
(cherry picked from commit 4644344)
Signed-off-by: yangjie01 <yangjie01@baidu.com>
The failure is unrelated, merging to master/3.5, thanks!
…ion name for call_function

### What changes were proposed in this pull request?
#41687 added `call_function` and deprecated `call_udf` for the Scala API. Sometimes the function name can be qualified; we should let users use it to invoke persistent functions as well.

### Why are the changes needed?
Support qualified function name for `call_function`.

### Does this PR introduce _any_ user-facing change?
'No'. New feature.

### How was this patch tested?
New test cases.

Closes #41932 from beliefer/SPARK-44131_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit d97a4e2)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan @zhengruifeng @LuciferYang Thank you all!
What changes were proposed in this pull request?

#41687 added `call_function` and deprecated `call_udf` for the Scala API. Sometimes the function name can be qualified; we should let users use it to invoke persistent functions as well.

Why are the changes needed?

Support qualified function name for `call_function`.

Does this PR introduce any user-facing change?

'No'. New feature.

How was this patch tested?

New test cases.
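(For illustration, after this change `call_function` accepts a qualified name; this snippet assumes a persistent function `custom_func` exists in the default database, and `df` and column `a` are placeholders:)

```scala
import org.apache.spark.sql.functions.{call_function, col}

// A bare function name, as before:
df.select(call_function("custom_func", col("a")))
// A qualified name, newly supported, resolving to the persistent function:
df.select(call_function("spark_catalog.default.custom_func", col("a")))
```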