Update Arrow to 0.15.1 and fix Broadcast and GroupedMapUdf Tests for Spark-3.0.0 #653
Conversation
Can you resolve conflicts? Thanks!
enable tests for spark 3.0
Merged and resolved conflicts. Re-enabled some tests.
fix exception message.
@suhsteve is this ready for review?
Yeah, looks like tests are passing.
azure-pipelines.yml (comment on an outdated diff):
```yaml
backwardCompatibleRelease: '0.9.0'
TestsToFilterOut: "(FullyQualifiedName!=Microsoft.Spark.E2ETest.IpcTests.DataFrameTests.TestDataFrameGroupedMapUdf)&\
  (FullyQualifiedName!=Microsoft.Spark.E2ETest.IpcTests.DataFrameTests.TestDataFrameVectorUdf)&\
  (FullyQualifiedName!=Microsoft.Spark.E2ETest.IpcTests.DataFrameTests.TestGroupedMapUdf)&\
```
This is a breaking change, since these are not newly introduced APIs, no?
Moved to Spark 3.0.0 tests only.
Resolved review threads (outdated):
- src/csharp/Microsoft.Spark.Worker/Command/SqlCommandExecutor.cs
- src/csharp/Microsoft.Spark.Worker.UnitTest/CommandExecutorTests.cs
imback82 left a comment:
LGTM except for a few nit comments.
Btw, I think this is a breaking change, but it can be addressed as a follow-up PR.
azure-pipelines.yml:
```yaml
env:
  SPARK_HOME: $(Build.BinariesDirectory)\spark-2.4.6-bin-hadoop2.7
```
```yaml
# Spark 3.0.0 uses Arrow 0.15.1, which contains a new Arrow spec. This breaks backward
```
Is it easy to track whether we have different backward compatibility for different Spark versions?
At the moment there are no published Workers that are backward compatible with 3.0 (the previous Workers only use Arrow 0.14.1 and aren't aware of the new spec). But I agree that this is a breaking change.
For backward compatibility, do we want to differentiate between different Spark versions and test them against different Worker versions? Or one Worker version that we say is backward compatible with all Spark versions?
This can be addressed in a separate PR if needed.
> For backward compatibility, do we want to differentiate between different Spark versions and test them against different Worker versions? Or one Worker version that we say is backward compatible with all Spark versions?
I would say the latter.
Then I think we will have to wait until the next official Worker release before we can remove these extra filters.
... since we have one Worker binary to support all Spark versions.
Well, you can remove the backward compatibility test (a breaking change), then add it back when the new one is released.
Do you want me to remove the extra filters for 3.0 and add the skip attribute in the unit tests? Or just remove the Spark 3.0.0 section in the backward compatibility tests?
Removed 3.0 backward compatibility testing.
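(For reference, the skip-attribute option mentioned above is plain xUnit; a minimal sketch, where the class name, test name, and reason string are illustrative rather than the PR's actual code:)

```csharp
using Xunit;

public class DataFrameTests
{
    // Hypothetical example: skipping a test that depends on unsupported
    // Spark 3.0.0 functionality, instead of filtering it out in the pipeline.
    [Fact(Skip = "GroupedMapUdf is not yet supported on Spark 3.0.0")]
    public void TestGroupedMapUdf()
    {
        // Test body omitted in this sketch.
    }
}
```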
Resolved review thread (outdated): src/csharp/Microsoft.Spark.Worker.UnitTest/CommandExecutorTests.cs
@suhsteve Can you update the title/description with more details? This is more like upgrading the Arrow version.
Addressed the comments and updated title/description.
imback82 left a comment:
LGTM, thanks @suhsteve!
This PR updates the Arrow library from 0.14.1 to 0.15.1 and fixes the Broadcast and GroupedMapUdf tests for Spark 3.0.0. Support for GroupedMapUdf on Spark 3.0.0 is currently blocked, so those tests are skipped.
See https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format for the new message format; the end-of-stream marker is now 0xFFFFFFFF 0x00000000.
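To illustrate the framing change, here is a minimal sketch (not the project's actual reader; the type and method names are assumptions) of how a reader can handle both the pre-0.15 and 0.15+ Arrow IPC stream framing:

```csharp
using System.IO;

internal static class ArrowIpcFraming
{
    // Reads the metadata-length prefix of the next IPC message.
    // Pre-0.15 framing:  <int32 length> <metadata> <body>
    // 0.15+ framing:     <0xFFFFFFFF continuation> <int32 length> <metadata> <body>
    // End of stream:     length == 0 (so the new EOS marker is 0xFFFFFFFF 0x00000000).
    internal static bool TryReadMetadataLength(BinaryReader reader, out int metadataLength)
    {
        int first = reader.ReadInt32();
        metadataLength = (first == -1)  // 0xFFFFFFFF read as a signed little-endian int32
            ? reader.ReadInt32()        // 0.15+: continuation marker, then the real length
            : first;                    // pre-0.15: the first word is the length itself
        return metadataLength != 0;     // zero length signals end of stream
    }
}
```

A Worker built against Arrow 0.14.1 would read the 0xFFFFFFFF continuation marker as if it were the metadata length, which is why streams written by Spark 3.0.0 break it.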
This is a part of the effort to bring in CI for Spark 3.0: #348