
Conversation

@suhsteve suhsteve commented Sep 4, 2020

This PR updates the Arrow library from 0.14.1 to 0.15.1 and also addresses the Broadcast and GroupedMapUdf tests for Spark 3.0.0. Support for GroupedMapUdf on Spark 3.0.0 is currently blocked, so those tests are skipped.

This is part of the effort to bring in CI for Spark 3.0: #348
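The tests are skipped in CI through a test-exclusion filter (the YAML excerpt appears later in this thread). As a minimal illustration of how such a filter string is composed, here is a sketch in Python; `build_exclusion_filter` is a hypothetical helper, not code from this repo, while the `FullyQualifiedName!=` / `&` syntax is the standard `dotnet test --filter` grammar:

```python
def build_exclusion_filter(test_names):
    # Each clause excludes one fully qualified test name; '&' means AND
    # in the vstest filter grammar, so all exclusions apply together.
    return "&".join(f"FullyQualifiedName!={name}" for name in test_names)

excluded = [
    "Microsoft.Spark.E2ETest.IpcTests.DataFrameTests.TestDataFrameGroupedMapUdf",
    "Microsoft.Spark.E2ETest.IpcTests.DataFrameTests.TestDataFrameVectorUdf",
]
print(build_exclusion_filter(excluded))
```

The resulting string is what the pipeline's `TestsToFilterOut` variable holds, and what would be passed to `dotnet test --filter "..."` when running the E2E suite locally.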

@suhsteve suhsteve self-assigned this Sep 5, 2020
@imback82 imback82 mentioned this pull request Sep 6, 2020

imback82 commented Sep 8, 2020

Can you resolve conflicts? Thanks!


suhsteve commented Sep 8, 2020

> Can you resolve conflicts? Thanks!

Merged and resolved conflicts. Re-enabled some tests.


imback82 commented Sep 8, 2020

@suhsteve is this ready for review?

@suhsteve suhsteve requested a review from imback82 September 8, 2020 22:05

suhsteve commented Sep 8, 2020

> @suhsteve is this ready for review?

Yeah, looks like tests are passing.

backwardCompatibleRelease: '0.9.0'
TestsToFilterOut: "(FullyQualifiedName!=Microsoft.Spark.E2ETest.IpcTests.DataFrameTests.TestDataFrameGroupedMapUdf)&\
(FullyQualifiedName!=Microsoft.Spark.E2ETest.IpcTests.DataFrameTests.TestDataFrameVectorUdf)&\
(FullyQualifiedName!=Microsoft.Spark.E2ETest.IpcTests.DataFrameTests.TestGroupedMapUdf)&\
Contributor

This is a breaking change, since these are existing APIs rather than newly introduced ones, no?

Member Author

Moved to Spark 3.0.0 tests only.

@suhsteve suhsteve requested a review from imback82 September 10, 2020 23:13
@imback82 imback82 left a comment

LGTM except for a few nit comments.

Btw, I think this is a breaking change, but it can be addressed in a follow-up PR.

env:
SPARK_HOME: $(Build.BinariesDirectory)\spark-2.4.6-bin-hadoop2.7

# Spark 3.0.0 uses Arrow 0.15.1, which contains a new Arrow spec. This breaks backward
Contributor

Is it easy to track whether we have different backward compatibility for different Spark versions?

Member Author

At the moment there are no published workers that are backward compatible with Spark 3.0 (the previous workers only use Arrow 0.14.1 and aren't aware of the new spec). But I agree that this is a breaking change.

For backward compatibility, do we want to differentiate between Spark versions and test them against different Worker versions? Or one Worker version that we say is backward compatible with all Spark versions?

This can be addressed in a separate PR if needed.
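For background on the spec change discussed above: Arrow 0.15 changed the IPC encapsulated-message framing so that each message now begins with a 4-byte 0xFFFFFFFF continuation marker before the little-endian metadata length, whereas pre-0.15 streams begin with the length directly; this is why a 0.14.1-era worker cannot read streams from Spark 3.0.0. A minimal sketch (not code from this repo) of distinguishing the two framings:

```python
import struct

CONTINUATION_MARKER = 0xFFFFFFFF  # introduced in the Arrow 0.15 IPC format


def read_metadata_length(header):
    """Return (metadata_length, is_post_0_15_format) for an IPC message header.

    Streams written by Arrow >= 0.15 prefix each message with 0xFFFFFFFF
    followed by the 32-bit little-endian metadata length; older streams
    start with the length itself. Sketch only; real readers use the
    Arrow libraries, not hand-rolled parsing.
    """
    (first,) = struct.unpack_from("<I", header, 0)
    if first == CONTINUATION_MARKER:
        (length,) = struct.unpack_from("<I", header, 4)
        return length, True
    return first, False


new_style = struct.pack("<II", CONTINUATION_MARKER, 128)  # Arrow >= 0.15 framing
old_style = struct.pack("<I", 128)                        # Arrow < 0.15 framing
```

(As a related data point, PySpark on Spark 2.x documents an `ARROW_PRE_0_15_IPC_FORMAT=1` environment variable to force the legacy framing when a newer pyarrow is installed.)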

Contributor

> For backward compatibility, do we want to differentiate between Spark versions and test them against different Worker versions? Or one Worker version that we say is backward compatible with all Spark versions?

I would say the latter.

Member Author

Then I think we will have to wait until the next official Worker release before we can remove these extra filters.

Contributor

... since we have one Worker binary to support all Spark versions.

@imback82 imback82 Sep 11, 2020

Well, you can remove the backward compatibility test (breaking change), then add it back when the new one is released.

Member Author

Do you want me to remove the extra filters for 3.0 and add the skip attribute in the unit tests? Or just remove the Spark 3.0.0 section in the backward compatibility tests?

Member Author

Removed Spark 3.0 backward compatibility testing.

@imback82

@suhsteve Can you update the title/description with more details? This is more about upgrading the Arrow version.

@suhsteve suhsteve changed the title from "Broadcast and GroupedMapUdf Tests for Spark-3.0.0" to "Update Arrow to 0.15.1 and fix Broadcast and GroupedMapUdf Tests for Spark-3.0.0" Sep 11, 2020
@suhsteve

> @suhsteve Can you update the title/description with more details? This is more about upgrading the Arrow version.

Addressed the comments and updated title/description.

imback82 previously approved these changes Sep 11, 2020
@imback82 imback82 left a comment

LGTM, thanks @suhsteve!

@imback82 imback82 merged commit a5f707c into dotnet:master Sep 11, 2020
@suhsteve suhsteve added this to the 1.0.0 milestone Sep 24, 2020