-
Notifications
You must be signed in to change notification settings - Fork 4.5k
[BEAM-3981] [WIP] Futurize and fix python 2 compatibility for coders subpackage #4990
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| libc.stdlib.free(self.data) | ||
|
|
||
| cpdef write(self, bytes b, bint nested=False): | ||
| cpdef write(self, b, bint nested=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bytes type has been removed from this module for now, since a mismatch between the cython bytes type and future bytes type results in a TypeError: expected bytes, got newbytes.
I have tried replacing bytes with a memory view as explained here, but this resulted in a packaging error.
Any help on this would be appreciated.
charlesccychen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this very useful change! Added some comments below.
| """Tests common to all coder implementations.""" | ||
| from __future__ import absolute_import | ||
| from __future__ import division | ||
| from __future__ import print_function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which of these is really necessary? Do we use print in this file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All three have been added to the top of each updated module. This is safest to avoid regression before we add full python 3 support.
| try: | ||
| long # Python 2 | ||
| except NameError: | ||
| long = int # Python 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to keep long for Python 2 compatibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The int from the future.builtins is a subclass of python 2's long.
| def encode(self, x): | ||
| return str(x+1) | ||
| x = x + 1 | ||
| return int(x).to_bytes((x.bit_length() + 7) // 8, 'big', signed=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the rationale for changing the coder behavior here to use a binary encoding instead of using a printable decimal string, for what is intended to be a simple example? Is there something simpler we can do? (e.g. we could coerce to str and encode as .encode('latin-1'))
| cdef size_t pos | ||
|
|
||
| cpdef write(self, bytes b, bint nested=*) | ||
| cpdef write(self, b, bint nested=*) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure removing this type annotation is the right thing to do, for performance...
CC: @robertwb, who could provide more guidance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This wasn't meant to be permanent. I wanted to get some feedback on this in the pr reviews and then add a commit to fix it.
| libc.stdlib.free(self.data) | ||
|
|
||
| cpdef write(self, bytes b, bint nested=False): | ||
| cpdef write(self, b, bint nested=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RobbeSneyders wrote:
The bytes type has been removed from this module for now, since a mismatch between the cython bytes type and future bytes type results in a TypeError: expected bytes, got newbytes.
I have tried replacing bytes with a memory view as explained here, but this resulted in a packaging error.
Any help on this would be appreciated.
CC: @robertwb, who could provide more guidance.
| def encode(self, value): | ||
| return str(value.number) | ||
| x = value.number | ||
| return int(x).to_bytes((x.bit_length() + 7) // 8, 'big', signed=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the rationale for changing the coder behavior here to use a binary encoding instead of using a printable decimal string, for what is intended to be a simple example? Is there something simpler we can do? (e.g. we could coerce to str and encode as .encode('latin-1'))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be simpler indeed. It clashed with the cython bytes, since str().encode('latin-1') will return newbytes with the future package.
However, seems like we need to fix this cython bytes issue anyway, so I have changed it back. This is the drawback of working on a per module basis I guess :)
| import sys | ||
| from __future__ import absolute_import | ||
| from __future__ import division | ||
| from __future__ import print_function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we revert the changes in this generated file?
Also, which of these is really necessary? Do we use division and print in this file?
| start=Timestamp(micros=(x['end'] - x['span']) * 1000), | ||
| end=Timestamp(micros=x['end'] * 1000)), | ||
| 'urn:beam:coders:stream:0.1': lambda x, parser: map(parser, x), | ||
| 'urn:beam:coders:stream:0.1': lambda x, parser: list(map(parser, x)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this list call here? This may not be intended behavior, to materialize the list, since the stream may be a long sequence that doesn't fit in memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need it. Seems like I missed this one. Thanks for pointing out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's one test that fails without the list call. I'll look into it.
| """Unit tests for uncompiled implementation of coder impls.""" | ||
| from __future__ import absolute_import | ||
| from __future__ import division | ||
| from __future__ import print_function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which of these is really necessary? Do we use division and print in this file?
| """Unit tests for compiled implementation of coder impls.""" | ||
| from __future__ import absolute_import | ||
| from __future__ import division | ||
| from __future__ import print_function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which of these is really necessary? Do we use division and print in this file?
| """Tests for the Observable mixin class.""" | ||
| from __future__ import absolute_import | ||
| from __future__ import division | ||
| from __future__ import print_function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which of these is really necessary? Do we use division and print in this file?
|
|
||
| from __future__ import absolute_import | ||
| from __future__ import division | ||
| from __future__ import print_function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which of these is really necessary? Do we use division and print in this file?
| """ | ||
| from __future__ import absolute_import | ||
| from __future__ import division | ||
| from __future__ import print_function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which of these is really necessary? Do we use division and print in this file?
| try: | ||
| import cPickle as pickle | ||
| except ImportError: | ||
| import pickle |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add comment that this is for Py2/3 compatibility, since in Python 3, cPickle was renamed pickle.
|
|
||
|
|
||
| REQUIRED_CYTHON_VERSION = '0.26.1' | ||
| REQUIRED_CYTHON_VERSION = '0.28.1' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How was this version bump determined?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seems to be an incompatibility between the future package and earlier versions of cython. On version 0.26.1, cython throws a compile error because it doesn't recognize the future builtins as types.
| t = type(value) | ||
| if t is NoneType: | ||
| if value is None: | ||
| stream.write_byte(NONE_TYPE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you revert the change removing t = type(value)? I believe this is done because isinstance is much slower than is, and this is very performance-sensitive code.
CC: @robertwb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was changed because is doesn't check for subclasses. The whole future.builtins package depends on subclasses to add compatibility.
This can however be changed by depending on six again if necessary. Problem with this approach is that the str and bytes types will work different across modules.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have just checked the timing myself.
type(list()) is list is exactly as fast as isinstance(list(), list), but with an average of 5 if checks for every type call, the type check is 2.5x faster.
|
|
||
| def _check_safe(self, value): | ||
| if isinstance(value, (str, six.text_type, long, int, float)): | ||
| if isinstance(value, (str, bytes, int, float)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to keep long for Python 2 compatibility.
|
|
||
| import six | ||
| from builtins import chr | ||
| from builtins import int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my edification, is this the recommended way of using these types / methods in python 3? Are they no longer in the global namespace? It seems a bit verbose to have to import object like this in every file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for compatibility. On python 2, they import from future.builtins, on python 3 it has no effect.
|
Thanks for the review @charlesccychen. Some general points based on your feedback and my answers:
|
d02e52f to
e826515
Compare
|
FAILURE --none-- |
Also remove bytes type from stream.pyx due to imcompatibility with future.builtins.bytes
Remove unicode & dict type from .pxd file because of incompatibility with future.builtins
This reverts commit a9c0193
|
I've added some changes. Most notable:
I've also added the applied strategy to the Python 3 proposal document. It would be great to get some feedback on this so we can start moving forward with the other subpackages. |
|
I ran into some problems while applying the same strategy to some more subpackages (mostly the typehints package). Most of these problems are caused by the use of future.builtins str, bytes and int. I've therefore decided to change the used approach, which can be found in the new pull request #5053. A new pull request was made because a lot of the comments in this thread are now irrelevant. The new approach is unfortunately a bit less resistant to regression, but a better fit for the beam project and I suspect we will be able to move forward more quickly. Closing this PR in favor of #5053 |
This pull request is the result of applying the automatic conversion provided by the future package to the coders subpackage, after which all python 2 errors were fixed. The result is python 3 styled code with python 2 compatibility.
This pull request is the first of a series in which all subpackages will be updated. We should therefore discuss the chosen approach, so we can agree on one strategy to apply throughout.
The approach I've taken, is to focus on writing python 3 code with python 2 compatibility:
The future package provides tools to forward-port our code, while the six package focuses more on backporting. I've therefore replaced six with future everywhere it was already used.
One of the biggest problems in porting python 2 code to python 3, is the changed handling of strings and bytes. To get a consistent behavior between versions, I have tried to rewrite everything to use the
strandbytestype provided by the future.builtins package. I have not used thefrom __future__ import unicode_literalsimport since its changes are too implicit and introduces a risk of subtle regressions on python 2I started out with running futurize on the complete coders subpackage and then tried to fix the errors introduced by the automatic conversion. This however proved to be difficult, because it's not obvious where certain errors were introduced.
I therefore switched to a per module approach, in which I first updated all non-test modules. This way, I could check if everything still ran with the native python 2 tests. Afterwards, I updated all test modules.
This pull request also contains updates to run_pylint.sh and tox.ini, so pylint can be run with the --py3k parameter. This should help avoid regression between the different steps of the update process.