-
Notifications
You must be signed in to change notification settings - Fork 4.5k
[BEAM-3981] Futurize coders subpackage #5053
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
R: @robertwb |
|
R: @charlesccychen cc: @tvalentyn Some high level comments:
What do you think about that?
|
|
|
Working on some other subpackages, I've come across an additional reason to upgrade the cython version. Currently, the Cython has added support for the special comparison methods in version 0.27.0, which allows us to use |
charlesccychen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! This is a great change. Added a few comments.
| # pylint: enable=protected-access | ||
|
|
||
| def __hash__(self): | ||
| return hash(type(self)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any particular reason for this change? Previously, the hash would default to object.__hash__, which tries to give a different hash code for each instance. This change would give the same hash code for each class, which may not be desirable, since coders in general could be parameterized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. I've changed the hash to match the __eq__ check.
| def encode(self, value): | ||
| try: # Python 2 | ||
| if isinstance(value, unicode): | ||
| return value.encode('utf-8') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we do the same unicode = str thing here? (In the new version, we will raise an error if the value is a non-ascii unicode string; eg: str(u'😋') raises an error, while this worked before.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The unicode = str trick won't work here on Python 3:
>>> str('test'.encode('utf-8'))
"b'test'"
I've reverted this change to the previous solution
| stream.write_byte(UNICODE_TYPE) | ||
| stream.write(unicode_value.encode('utf-8'), nested) | ||
| elif t is unicode: | ||
| text_value = value # for typing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use the same try: unicode except: unicode = str in the corresponding part of the .pxd file so that the type annotation directive for text_value is respected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried this, but it throws a 'not a type' error for unicode.
| stream.write_byte(DICT_TYPE) | ||
| stream.write_var_int64(len(dict_value)) | ||
| for k, v in dict_value.iteritems(): | ||
| for k, v in dict_value.items(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any particular reason for the iteritems() -> items() change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iteritems() isn't available anymore in Python 3. Instead, items() returns an iterator. I've replaced it with iteritems(dict) from future.utils, which returns an iterator on both versions.
tvalentyn
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, Robbe! A few minor comments below.
| and self._dict_without_impl() == other._dict_without_impl()) | ||
| # pylint: enable=protected-access | ||
|
|
||
| def __hash__(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this change required by Python3 migration or we are just fixing an omission that hash was not previously defined, while eq was?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On Python 2, __hash__ defaults to give a different value for each instance. On Python 3, __hash__ defaults to None if __eq__ is implemented. By implementing __hash__, we get consistent behavior on both versions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revisiting this in light of other PRs. I think, it would be safer to guarantee the contract that hash does not change for the same object if we compute it here based on object type, sent #5390.
Another possibility to guarantee consistent behavior between Python 2 and 3 would be to set __hash__ = None if we can infer that a class is obviously non-hashable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can also use the id which is guaranteed to stay the same for an object:
hash(id(self))
The default Python 2 hash also relies on id.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true. Although that wouldn't honor the contract between eq and hash.
| self.check_coder(coders.TupleCoder((CustomCoder(), coders.BytesCoder())), | ||
| (1, 'a'), (-10, 'b'), (5, 'c')) | ||
| (1, b'a'), (-10, b'b'), (5, b'c')) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably not critical, but looks like 'a' is not replaced with b'a' here - are these changes done by some tool or manually?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is aimed at the 'a' in line 109?
The marking of the strings as bytes literals is done manually. I've only marked strings as bytes literals when it's clear that they're meant to represent bytes (when testing BytesCoder, when the content of the string are clearly bytes, ...). When a string is not marked, it represents str on both versions, which seems ok for the 'a' at line 109 for example.
| self.check_coder(coder, None, 1, -1, 1.5, b'str\0str', u'unicode\0\u0101') | ||
| self.check_coder(coder, (), (1, 2, 3)) | ||
| self.check_coder(coder, [], [1, 2, 3]) | ||
| self.check_coder(coder, dict(), {'a': 'b'}, {0: dict(), 1: len}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also here 'a' and 'b' are not bytestrings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment above.
The unmarked 'a' and 'b' here represent str on both versions, which seems ok for this test.
| pip --version | ||
| time {toxinidir}/run_pylint.sh | ||
|
|
||
| [testenv:py27-lint3] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a comment how this is different from py3-lint? Or perhaps we don't need both of them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
py27-lint3 checks for portability issues, while py3-lint checks for python 3 issues. I'll add a comment to clarify.
|
Thanks for the reviews @charlesccychen, @tvalentyn. I've tried to address all of your comments and have committed some changes based on your input. |
|
run java precommit |
sdks/python/run_pylint.sh
Outdated
| echo | ||
| exit 1 | ||
| fi | ||
| fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add a new line at the end of file here and in sdks/python/run_pylint_2to3.sh.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
Thank you Robbe! The PR and approach look good to me. |
|
Run Python Dataflow ValidatesRunner |
|
I see the following error in the output I assume this is because The new code requires the new Cython version however dataflow workers do not have it. @tvalentyn Could you upgrade the workers at head to use the 0.28.1 cython version? |
|
Containers are released and PR #5131 is out to upgrade them on master, we can also apply the changes from that PR here as well and rerun the postcommit suite. |
|
Run Python Dataflow ValidatesRunner |
|
With #5131 merged, we probably need to rebase this PR off the current master, for the ValidatesRunner tests to pass. @RobbeSneyders can we do that please? |
|
Actually, looking at Jenkins logs I see that Jenkins already merges this PR with latest commit on master when we run a Postcommit suite:
|
|
There is a proto generation error in the ValidatesRunner tests, which seems to be unrelated to this PR, I also see that error in other Postcommit test suites. |
|
I've rebased the branch anyway. Is there anything else I should do before we can merge? Like squash some commits? |
|
Yes, please squash the commits; I am looking into postcommit issue, hopefully that does not affect this PR. |
|
Run Python Dataflow ValidatesRunner |
|
Run Python Dataflow ValidatesRunner |
|
Run Python Precommit |
|
retest this please |
|
@tvalentyn if the test issues are unrelated to this PR and related to the current ongoing Jenkins issues should we move forward with it? (I suggest that we run a few manual tests locally to check whether this PR introduces issues or not.) |
|
I have added comments in the file and added a remark in the documentation of the used approach. |
|
Run Python Dataflow ValidatesRunner |
|
@RobbeSneyders Thank you for the comments. |
|
Run Python Dataflow ValidatesRunner |
1 similar comment
|
Run Python Dataflow ValidatesRunner |
|
ValidatesRunner suite just takes time to start. It failed due to an unrelated issue, filed: https://issues.apache.org/jira/browse/INFRA-16508 to clarify. I'll do a sanity check locally. |
|
Regarding items vs. iteritems, let's just go with dict_value.items everywhere. There's little if any need for explicit iteritems in this case (and if dict is typed, I think Cython optimizes Though macrobenchmarks can be good for final validation, this is the kind of code that could probably benefit from some pretty simple microbenchmarks. |
|
We chatted with @tvalentyn. The test failure is unrelated and @tvalentyn will implement @robertwb's suggestions in a follow up PR. Merging this now to unblock progress. Thank you @RobbeSneyders and @tvalentyn for pushing it thus far! |
|
An update on I don't observe a difference in performance of As far as With With With I also observe a 2.5x degradation in coder implementation with With 10000 elements, python 2 With We should try to use microbenchmarks for performance evaluations moving forward since they can provide feedback in a matter of seconds. |
|
|
|
Hmm. I am pretty sure my microbenchmark uses a Cython codepath since in order for any code change to take effect I have to run My microbenchmark setup is here: https://github.com/apache/beam/compare/master...tvalentyn:coders_dict_microbencmark?expand=1 |
|
For the record, #5586 improves the performance of |
This pull request prepares the coders subpackage for Python 3 support. This pull request is the first of a series in which all subpackages will be updated using the same approach.
This approach has been documented here and the WIP pull request can be found at #4990.
The used approach can be summarized as follows:
The future package provides tools to forward-port our code, while the six package focuses more on backporting. Whenever possible, we will therefore use the future package instead of the six package.
The future package provides backported Python 3 builtins, which can be used to write Python 3 style code with python 2 compatibility.
One of the biggest problems in porting python 2 code to python 3, is the changed handling of strings and bytes.
The future package provides backported Python 3
strandbytestypes. While these can be convenient to write Python 2/3 compatible code in end products, we don’t believe they are the right choice for Beam.These backported types are new classes, subclassed from existing Python 2 builtins (e.g.
from future.builtins import intimports the classfuture.builtins.newint.newint, which is a subclass op the Python 2longtype).While these new classes behave like the Python 3 types, they don’t give the same results when used in type checks, which are constantly used in beam (e.g. typecoders, typechecks, …)
Instead, we propose to rewrite everything using the default
strandbytestypes. On Python 2, thebytestype is an alias forstr. On Python 3,stris equivalent to Python 2unicode.A consistent behaviour between Python 2/3 can be reached by using the
bytestype wheneverstrbehaviour is desired on Python 2 andbytesbehaviour is desired on Python 3 (= bytes data).The
unicodetype can be used wheneverunicodebehaviour is desired on Python 2 andstrbehaviour is desired on Python 3 (= text data). Theunicodetype is not available in Python 3, which can be solved by addingat the top of the module.
All string literals which represent
bytes, should be marked asb’’. String literals representingunicodein test modules should not be marked u’’. These will automatically be interpreted asunicodeliterals in Python 3, but we still want to test for unmarked Python 2 code.Do not use the
from __future__ import unicode_literalsimport since its changes are too implicit and introduces a risk of subtle regressions on python 2.The used approach for the long / int types is equivalent to the
unicode/strapproach outlined above.The
longtype is not available in Python 3, sinceintnow haslongbehaviour. This can be solved by addingat the top of the module.
Regression should be avoided as much as possible between the application of step 2 and step 3. This document proposes to take following measures to keep the probability of regression as low as possible:
from __future__ import absolute_importWe can also add following imports to the top of every measure. This will ensures that no new code can be added using for instance the old python 2 division and adds consistency across modules. We would like to hear the community’s opinion on this.
range()anditeritems()were not imported in from future.builtins in coder_impl.py to avoid performance regression in Cython compiled code.@aaltay @charlesccychen