[SPARK-22232][PYTHON][SQL] Fixed Row pickling to include __from_dict__ flag #20280
Changes from all commits: 315b8de, a7d3396, cfb5be6, 10bf2d0
```diff
@@ -1392,9 +1392,11 @@ def _create_row_inbound_converter(dataType):
     return lambda *a: dataType.fromInternal(a)
 
 
-def _create_row(fields, values):
+def _create_row(fields, values, from_dict=False):
     row = Row(*values)
     row.__fields__ = fields
+    if from_dict:
+        row.__from_dict__ = True
     return row
```
```diff
@@ -1445,15 +1447,15 @@ def __new__(self, *args, **kwargs):
             raise ValueError("Can not use both args "
                              "and kwargs to create Row")
         if kwargs:
-            # create row objects
+            # create row object from named arguments, order not guaranteed so will be sorted
             names = sorted(kwargs.keys())
```
Member: Thanks for pointing that out. In my opinion, dropping 3.5 support should be followed by removal of sorting by name (from what I've seen this behavior is confusing at best, and it will have no use once 3.5 support is dropped), although that might be hard to sell, especially in a minor version. Ideally I would see
Member: Since we're in 3.0, we might have to consider dropping that sorting by name, yes.
Member: That'd be great, but 3.0 only deprecates Python 2 support, it doesn't drop it. So the main reason why sorting was introduced stays valid. If Spark 2.5 were accepted, Python 2 could be deprecated there and dropped in 3.0. In general I think that bringing Python 2 into Spark 3 is a mistake, but I am the first to admit I am biased.
Member: I think the Spark 2.5 idea looks almost failed (and I don't personally support that idea either). This change is rather legitimate to make between minor releases. Actually, 3.0 is getting close. We could just let 3.0 release first and optionally remove that sorting behaviour too.
```diff
             row = tuple.__new__(self, [kwargs[n] for n in names])
             row.__fields__ = names
-            row.__from_dict__ = True
+            row.__from_dict__ = True  # Row elements will be accessed by field name, not position
             return row
 
         else:
-            # create row class or objects
+            # create a row class for generating objects or a tuple-like object
             return tuple.__new__(self, args)
 
     def asDict(self, recursive=False):
```
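The sorting in `Row.__new__` means the stored value order follows alphabetical field order, not the order written at the call site, which is why elements of a kwargs-created Row should be accessed by name. A plain-Python illustration of that step (no Spark required, names chosen for illustration):

```python
kwargs = {"key": "k1", "aaa": 1}         # call-site order: key, aaa
names = sorted(kwargs.keys())            # storage order:  ['aaa', 'key']
values = tuple(kwargs[n] for n in names)
print(names, values)                     # ['aaa', 'key'] (1, 'k1')
```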
```diff
@@ -1532,7 +1534,10 @@ def __setattr__(self, key, value):
     def __reduce__(self):
         """Returns a tuple so Python knows how to pickle Row."""
         if hasattr(self, "__fields__"):
-            return (_create_row, (self.__fields__, tuple(self)))
+            if hasattr(self, "__from_dict__"):
+                return (_create_row, (self.__fields__, tuple(self), True))
+            else:
+                return (_create_row, (self.__fields__, tuple(self)))
         else:
             return tuple.__reduce__(self)
```
This test was flawed: it only worked because ("a", "b") is in the same alphabetical order as ("key", "value"). If it were ("key", "aaa"), it would fail.
But wasn't this testing that field names can differ by user's schema input?
Yeah, I think you're right. The schema should be able to override names no matter what. I was thinking it was flawed because it relied on the row fields being sorted internally, so just changing the kwarg names (not the order) could cause it to fail. That seems a little strange, but maybe that is the intent?
I still think this test is invalid, because it only works as the Rows get serialized and lose the `__from_dict__` flag. This should be the same thing, but would fail:

And it would also fail if the named args of the `Row` objects were not in the same alphabetical order as the schema:

This fails because the Row fields would be sorted in a different order, switching the type order.
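The ordering hazard described above can be reproduced without Spark. The sketch below (a hypothetical helper mimicking the kwargs-sorting step, not PySpark code) shows that names already in alphabetical order line up with the call-site order only by coincidence, while names like ("key", "aaa") end up reordered:

```python
def row_values(**kwargs):
    # Mimics Row(**kwargs): values are stored in sorted-field order
    names = sorted(kwargs)
    return names, tuple(kwargs[n] for n in names)

# ("a", "b") happens to already be alphabetical: value order matches the call site
print(row_values(a="x", b=1))      # (['a', 'b'], ('x', 1))

# ("key", "aaa") is not: values end up swapped relative to the call-site order
print(row_values(key="x", aaa=1))  # (['aaa', 'key'], (1, 'x'))
```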