[SPARK-27340][SS][2.4] Alias on TimeWindow expression may cause watermark metadata lost #28377
…data lost
Credit to LiangchangZ: this PR reuses the UT as well as the integration test from #24457. Thanks, Liangchang, for your solid work.
Make metadata propagatable between Aliases.
In Structured Streaming, we added an Alias for TimeWindow by default.
https://github.com/apache/spark/blob/590b9a0132b68d9523e663997def957b2e46dfb1/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3272-L3273
For some cases, like a stream join with watermark and window, users need to add an alias for convenience (we also added one in StreamingJoinSuite). The current metadata handling logic for `as` will lose the watermark metadata
https://github.com/apache/spark/blob/590b9a0132b68d9523e663997def957b2e46dfb1/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L1049-L1054
and eventually causes the AnalysisException:
```
Stream-stream outer join between two streaming DataFrame/Datasets is not supported without a watermark in the join keys, or a watermark on the nullable side and an appropriate range condition
```
Bugfix for an alias on time window with watermark.
New UTs added. One for the functionality and one for explaining the common scenario.
Closes #28326 from xuanyuanking/SPARK-27340.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit ba7adc4)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit d272482)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
cc @cloud-fan, @zsxwing, @LiangchangZ, @xuanyuanking, @HeartSaVioR.
From a backporting perspective it seems like this is maybe a bug given that
Yes. Correct. In
@dongjoon-hyun @holdenk My understanding is different from what you said above. It sounds like the metadata is not available for TimeWindow expression when the function
@gatorsmile. Are you assuming
Any other Expressions except
Got it. You are right.
That's right, but I don't think the only reason for the fix is that we ran into an actual problem there. The code has simply been wrong: we set explicit metadata on Alias when nothing asked for it to be overwritten. The Javadoc of Alias clearly describes the usage of explicitMetadata: (that's why I couldn't agree with the initial proposal from #28326.)
That said, I'd support porting this back to 2.4, as it fixes wrong code. This change might affect a broader audience (so, technically speaking, the PR/commit title doesn't represent the actual change; sorry, I should have found this earlier), hence the risk might be bigger than we imagine, but I can't think of a case that relies on the previous behavior (the bug) to work. If we can think of one, it might be a signal to reconsider.
Test build #121930 has finished for PR 28377 at commit
Test build #121936 has finished for PR 28377 at commit
Are we still planning on backporting this or not? I see you deleted the branch, so I'm assuming this particular attempt at backporting is abandoned, but the root issue does seem important.
We have no viable solution for branch-2.4. You can proceed with the 2.4.6 release without this.
This is a backport of #28326. The authorship is kept. (Credit to @LiangchangZ and @xuanyuanking.)
What changes were proposed in this pull request?
Make metadata propagatable between Aliases.
Why are the changes needed?
In Structured Streaming, we added an Alias for TimeWindow by default.
spark/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
Lines 3272 to 3273 in 590b9a0
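For some cases, like a stream join with watermark and window, users need to add an alias for convenience (we also added one in StreamingJoinSuite). The current metadata handling logic for `as` will lose the watermark metadata
spark/sql/core/src/main/scala/org/apache/spark/sql/Column.scala
Lines 1049 to 1054 in 590b9a0
and finally cause the AnalysisException:
Stream-stream outer join between two streaming DataFrame/Datasets is not supported without a watermark in the join keys, or a watermark on the nullable side and an appropriate range condition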
Does this PR introduce any user-facing change?
Bugfix for an alias on time window with watermark.
How was this patch tested?
New UTs added. One for the functionality and one for explaining the common scenario.
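To make the metadata loss described in this PR more concrete, here is a minimal, self-contained sketch. It is a toy model, not Spark's Catalyst code: `Expr`, `Attr`, `Alias`, `aliasWithSnapshot`, `aliasPropagating`, and `attachWatermark` are hypothetical stand-ins written for illustration. It assumes the watermark metadata only reaches the inner expression after the user-level alias has been created, and the metadata key is used purely as an illustrative string.

```scala
// Toy model only: hypothetical stand-ins, NOT Spark's Catalyst classes.
object AliasMetadataSketch {
  type Metadata = Map[String, String]

  // Illustrative key for the watermark delay stored in column metadata.
  val watermarkKey = "spark.watermarkDelayMs"

  sealed trait Expr { def metadata: Metadata }

  // A leaf column carrying its own metadata.
  final case class Attr(name: String, metadata: Metadata) extends Expr

  // An alias reports its explicit metadata if one was set, else the child's.
  final case class Alias(
      child: Expr,
      name: String,
      explicitMetadata: Option[Metadata] = None) extends Expr {
    def metadata: Metadata = explicitMetadata.getOrElse(child.metadata)
  }

  // Models an `as` that snapshots the child's metadata at alias time.
  def aliasWithSnapshot(e: Expr, name: String): Alias =
    Alias(e, name, Some(e.metadata))

  // Models an `as` that sets no explicit metadata, so the child's shows through.
  def aliasPropagating(e: Expr, name: String): Alias =
    Alias(e, name, None)

  // Models a later step that attaches the watermark delay to the innermost column.
  def attachWatermark(a: Alias, delayMs: Long): Alias = a.child match {
    case inner: Alias => a.copy(child = attachWatermark(inner, delayMs))
    case Attr(n, md)  => a.copy(child = Attr(n, md + (watermarkKey -> delayMs.toString)))
  }

  // The default alias around the window column, created before any watermark metadata exists.
  val defaultWindowAlias: Alias = Alias(Attr("window", Map.empty), "window")

  // Snapshot-style alias: the empty snapshot hides metadata attached later.
  val lost: Alias = attachWatermark(aliasWithSnapshot(defaultWindowAlias, "w"), 10000L)

  // Propagating alias: metadata attached later is still visible through both aliases.
  val kept: Alias = attachWatermark(aliasPropagating(defaultWindowAlias, "w"), 10000L)
}
```

In this model, `lost.metadata` stays empty while `kept.metadata` contains the watermark entry, which mirrors the reported symptom: a join that looks for watermark metadata on the aliased window column no longer finds it.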