@RobbeSneyders
Contributor

This pull request is the result of applying the automatic conversion provided by the future package to the coders subpackage and then fixing all resulting Python 2 errors. The result is Python 3 style code that remains compatible with Python 2.
This pull request is the first in a series that will update all subpackages. We should therefore discuss the chosen approach so that we can agree on one strategy to apply throughout.

The approach I've taken is to focus on writing Python 3 code with Python 2 compatibility:

  • The future package provides tools to forward-port our code, while the six package focuses more on backporting. I've therefore replaced six with future everywhere it was already used.

  • One of the biggest problems in porting Python 2 code to Python 3 is the changed handling of strings and bytes. To get consistent behavior between versions, I have tried to rewrite everything to use the str and bytes types provided by the future.builtins package. I have not used from __future__ import unicode_literals, since its changes are too implicit and introduce a risk of subtle regressions on Python 2.
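
A minimal sketch of the explicit text/bytes discipline described above, runnable on Python 3 with the standard library alone. It assumes that on Python 2 the same `from builtins import ...` line would resolve to the backported types from the future package. The helper names `to_wire` and `from_wire` are hypothetical and not part of this PR.

```python
# Python 3: these are the plain built-ins. Python 2: assumed to come from `future`.
from builtins import bytes, str


def to_wire(text):
  """Encode text to bytes explicitly at the I/O boundary."""
  if not isinstance(text, str):
    raise TypeError('expected text, got %s' % type(text).__name__)
  return text.encode('utf-8')


def from_wire(data):
  """Decode bytes back to text explicitly."""
  if not isinstance(data, bytes):
    raise TypeError('expected bytes, got %s' % type(data).__name__)
  return data.decode('utf-8')


payload = to_wire(u'caf\xe9')
first_byte = payload[0]  # indexing bytes yields an int under Python 3 semantics
round_trip = from_wire(payload)
```

Keeping the conversion in two small functions makes every implicit str/bytes mix surface as a TypeError instead of a silent coercion.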

I started out by running futurize on the complete coders subpackage and then tried to fix the errors introduced by the automatic conversion. This proved difficult, however, because it was not obvious where certain errors had been introduced.
I therefore switched to a per-module approach, in which I first updated all non-test modules. This way, I could check that everything still ran with the native Python 2 tests. Afterwards, I updated all test modules.

This pull request also contains updates to run_pylint.sh and tox.ini, so that pylint can be run with the --py3k parameter. This should help avoid regressions between the different steps of the update process.

libc.stdlib.free(self.data)

- cpdef write(self, bytes b, bint nested=False):
+ cpdef write(self, b, bint nested=False):
Contributor Author

The bytes type has been removed from this module for now, since a mismatch between the cython bytes type and future bytes type results in a TypeError: expected bytes, got newbytes.
I have tried replacing bytes with a memory view as explained here, but this resulted in a packaging error.
Any help on this would be appreciated.

@aaltay
Member

aaltay commented Apr 2, 2018

cc: @tvalentyn @charlesccychen

Contributor

@charlesccychen left a comment

Thanks for this very useful change! Added some comments below.

"""Tests common to all coder implementations."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
Contributor

Which of these is really necessary? Do we use print in this file?

Contributor Author

All three have been added to the top of every updated module. That is the safest way to avoid regressions before we add full Python 3 support.
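
To illustrate what the division import guards against, here is a small sketch written for Python 3, where `/` is always true division. With `from __future__ import division` at the top of a module, Python 2 is expected to behave the same way. The function `midpoint_index` is a hypothetical example, not code from this PR.

```python
def midpoint_index(lo, hi):
  """Return an integer index halfway between lo and hi.

  Floor division (//) is spelled out so the result stays an int
  whether or not true division is in effect.
  """
  return (lo + hi) // 2


true_quotient = 7 / 2    # true division yields a float
floor_quotient = 7 // 2  # floor division yields an int
mid = midpoint_index(0, 7)
```

Code that relied on `/` truncating integers has to switch to `//`, which is the kind of change the import forces authors to make up front.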

try:
  long        # Python 2
except NameError:
  long = int  # Python 3
Contributor

I think we need to keep long for Python 2 compatibility.

Contributor Author

The int from future.builtins is a subclass of Python 2's long.

  def encode(self, x):
-   return str(x+1)
+   x = x + 1
+   return int(x).to_bytes((x.bit_length() + 7) // 8, 'big', signed=True)
Contributor

What is the rationale for changing the coder behavior here to use a binary encoding instead of using a printable decimal string, for what is intended to be a simple example? Is there something simpler we can do? (e.g. we could coerce to str and encode as .encode('latin-1'))
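
For comparison, the two encodings under discussion can be sketched as follows. This is Python 3 only, since the native Python 2 int has no `to_bytes` method, and all function names are hypothetical. One thing the sketch surfaces: with `signed=True`, a length of `(x.bit_length() + 7) // 8` bytes leaves no room for the sign bit in some cases, so a value such as 128 raises OverflowError. The binary variant below reserves one extra bit.

```python
def encode_decimal(x):
  """Printable decimal encoding, as suggested in the review."""
  return str(x).encode('latin-1')


def decode_decimal(data):
  return int(data.decode('latin-1'))


def encode_binary(x):
  """Big-endian two's complement with room for the sign bit."""
  length = (x.bit_length() + 8) // 8
  return x.to_bytes(length, 'big', signed=True)


def decode_binary(data):
  return int.from_bytes(data, 'big', signed=True)


def encode_binary_as_in_diff(x):
  """Same length formula as the diff, kept here to show its edge case."""
  return x.to_bytes((x.bit_length() + 7) // 8, 'big', signed=True)


try:
  encode_binary_as_in_diff(128)
  diff_formula_overflows = False
except OverflowError:
  diff_formula_overflows = True
```

The decimal form is longer on the wire but trivially readable in test failures, which fits a coder meant as a simple example.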

cdef size_t pos

- cpdef write(self, bytes b, bint nested=*)
+ cpdef write(self, b, bint nested=*)
Contributor

I'm not sure removing this type annotation is the right thing to do, for performance...

CC: @robertwb, who could provide more guidance.

Contributor Author

This wasn't meant to be permanent. I wanted to get some feedback on this in the PR review and then add a commit to fix it.

libc.stdlib.free(self.data)

- cpdef write(self, bytes b, bint nested=False):
+ cpdef write(self, b, bint nested=False):
Contributor

RobbeSneyders wrote:
The bytes type has been removed from this module for now, since a mismatch between the cython bytes type and future bytes type results in a TypeError: expected bytes, got newbytes.
I have tried replacing bytes with a memory view as explained here, but this resulted in a packaging error.
Any help on this would be appreciated.

CC: @robertwb, who could provide more guidance.

  def encode(self, value):
-   return str(value.number)
+   x = value.number
+   return int(x).to_bytes((x.bit_length() + 7) // 8, 'big', signed=True)
Contributor

What is the rationale for changing the coder behavior here to use a binary encoding instead of using a printable decimal string, for what is intended to be a simple example? Is there something simpler we can do? (e.g. we could coerce to str and encode as .encode('latin-1'))

Contributor Author

That would be simpler indeed. It clashed with the Cython bytes type, since str().encode('latin-1') returns newbytes with the future package.
However, it seems we need to fix this Cython bytes issue anyway, so I have changed it back. This is the drawback of working on a per-module basis, I guess :)

import sys
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
Contributor

Should we revert the changes in this generated file?

Also, which of these is really necessary? Do we use division and print in this file?

start=Timestamp(micros=(x['end'] - x['span']) * 1000),
end=Timestamp(micros=x['end'] * 1000)),
- 'urn:beam:coders:stream:0.1': lambda x, parser: map(parser, x),
+ 'urn:beam:coders:stream:0.1': lambda x, parser: list(map(parser, x)),
Contributor

Do we need this list call here? Materializing the list may not be the intended behavior, since the stream may be a long sequence that doesn't fit in memory.

Contributor Author

I don't think we need it. It seems I missed this one. Thanks for pointing it out.

Contributor Author

There's one test that fails without the list call. I'll look into it.
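
The trade-off in this thread can be sketched in plain Python 3, where `map` returns a lazy iterator. The names below are hypothetical. One plausible reason a test would fail without the `list` call, stated here as an assumption rather than a diagnosis, is that a map object never compares equal to a list and can be consumed only once.

```python
def parse_stream(parser, encoded_items):
  """Lazily apply parser to each item; nothing is materialized."""
  return map(parser, encoded_items)


calls = []


def recording_parser(item):
  """Record each parsed item so laziness can be observed."""
  calls.append(item)
  return item * 2


lazy = parse_stream(recording_parser, iter([1, 2, 3]))
calls_before_iteration = len(calls)  # no parsing has happened yet
equal_to_list = (lazy == [2, 4, 6])  # a map object is not equal to a list
first_pass = list(lazy)              # iterating drives the parser
second_pass = list(lazy)             # the iterator is now exhausted
```

If the stream has to stay lazy, materializing inside the test's comparison instead of in the parser would keep memory use bounded in production code.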

"""Unit tests for uncompiled implementation of coder impls."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
Contributor

Which of these is really necessary? Do we use division and print in this file?

"""Unit tests for compiled implementation of coder impls."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
Contributor

Which of these is really necessary? Do we use division and print in this file?

"""Tests for the Observable mixin class."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
Contributor

Which of these is really necessary? Do we use division and print in this file?


from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
Contributor

Which of these is really necessary? Do we use division and print in this file?

"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
Contributor

Which of these is really necessary? Do we use division and print in this file?

try:
  import cPickle as pickle
except ImportError:
  import pickle
Contributor

Please add a comment that this is for Py2/3 compatibility, since cPickle does not exist in Python 3, where pickle uses the C implementation automatically.
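
A sketch of how the requested comment could look. The wording of the comments is only a suggestion, and the round-trip at the end is there purely to show that either branch yields a working `pickle` name.

```python
try:
  # Python 2: prefer the C-accelerated implementation.
  import cPickle as pickle
except ImportError:
  # Python 3: there is no cPickle module; pickle picks up the
  # C accelerator on its own when it is available.
  import pickle

restored = pickle.loads(pickle.dumps({'coder': 'example'}))
```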



- REQUIRED_CYTHON_VERSION = '0.26.1'
+ REQUIRED_CYTHON_VERSION = '0.28.1'
Contributor

How was this version bump determined?

Contributor Author

There seems to be an incompatibility between the future package and earlier versions of cython. On version 0.26.1, cython throws a compile error because it doesn't recognize the future builtins as types.

- t = type(value)
- if t is NoneType:
+ if value is None:
    stream.write_byte(NONE_TYPE)
Contributor

Can you revert the change removing t = type(value)? I believe this is done because isinstance is much slower than is, and this is very performance-sensitive code.

CC: @robertwb

Contributor Author

@RobbeSneyders Apr 2, 2018

This was changed because an is comparison on the type doesn't match subclasses, and the whole future.builtins package depends on subclasses to add compatibility.
This can, however, be changed by depending on six again if necessary. The problem with that approach is that the str and bytes types will behave differently across modules.

Contributor Author

I have just checked the timing myself.
A single type(list()) is list check is exactly as fast as isinstance(list(), list). However, because type(value) is computed once and then compared in an average of 5 if checks, the type-based approach is 2.5x faster overall.
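
The difference between the two dispatch styles can be sketched as below. This is a simplified, hypothetical stand-in for the performance-sensitive code under review, with `MyInt` playing the role of a backported subclass. It shows only the behavioral difference, not the timing.

```python
class MyInt(int):
  """Hypothetical subclass, standing in for a backported type."""


def tag_exact(value):
  """Dispatch on the exact type; subclasses fall through."""
  t = type(value)  # looked up once, then compared by identity
  if t is type(None):
    return 'none'
  if t is int:
    return 'int'
  if t is float:
    return 'float'
  if t is bytes:
    return 'bytes'
  if t is str:
    return 'str'
  return 'unknown'


def tag_isinstance(value):
  """Dispatch with isinstance; subclasses match their base type."""
  if value is None:
    return 'none'
  if isinstance(value, int):
    return 'int'
  if isinstance(value, float):
    return 'float'
  if isinstance(value, bytes):
    return 'bytes'
  if isinstance(value, str):
    return 'str'
  return 'unknown'
```

With exact-type dispatch, a subclass instance such as `MyInt(3)` reaches the fallback branch, which is why subclass-based compatibility types and identity checks do not combine well.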


  def _check_safe(self, value):
-   if isinstance(value, (str, six.text_type, long, int, float)):
+   if isinstance(value, (str, bytes, int, float)):
Contributor

I think we need to keep long for Python 2 compatibility.


import six
from builtins import chr
from builtins import int
Contributor

For my edification, is this the recommended way of using these types / methods in python 3? Are they no longer in the global namespace? It seems a bit verbose to have to import object like this in every file.

Contributor Author

This is for compatibility. On Python 2, these import from future.builtins; on Python 3, they have no effect.
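
A quick way to see the "no effect" half of that statement on Python 3, using only the standard library. The Python 2 half (that the same lines pick up backports from the future package) is taken from the comment above and is not exercised here.

```python
from builtins import chr
from builtins import int

# On Python 3 the imported name is the interpreter's own int type.
int_is_native = int is type(0)

big = int(2) ** 100  # arbitrary precision, no separate long type
accented = chr(233)  # chr returns a text character, not a byte
```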

@RobbeSneyders changed the title from "[BEAM-3981] Futurize and fix python 2 compatibility for coders subpackage" to "[BEAM-3981] [WIP] Futurize and fix python 2 compatibility for coders subpackage" on Apr 2, 2018
@RobbeSneyders
Contributor Author

RobbeSneyders commented Apr 2, 2018

Thanks for the review @charlesccychen.

Some general points based on your feedback and my answers:

  • The imports:
    from __future__ import absolute_import
    from __future__ import division
    from __future__ import print_function
    were added at the top of each updated module to prevent regressions before full Python 3 support is added. This way, no new code can be added that relies on, for instance, the old Python 2 division. Another benefit is the consistency of division and print across modules.

  • from builtins import ... imports from future.builtins on python 2 and has no effect on python 3. future.builtins contains a bunch of backported python 3 builtins for compatibility.

  • The bytes type annotation was removed in the stream Cython files because it was breaking due to a mismatch between the Cython bytes type and the future bytes type. This is not meant to be merged as is; I wanted to submit the pull request with working code to get some feedback on it. I have tried replacing bytes with a memory view as explained here, but this resulted in a packaging error. Any help on this is appreciated.

  • The Cython version was upgraded from 0.26.1 to 0.28.1 because of an incompatibility between cython and future types. I have not noticed any backward incompatibility.

  • The is type checks were replaced by isinstance checks because the future.builtins types are all subclasses of the standard Python classes. However, isinstance is a lot slower here. I could revert this change if I go back to six for compatibility. The drawback is that str and bytes would then behave differently across modules.

@RobbeSneyders force-pushed the master branch 3 times, most recently from d02e52f to e826515, April 5, 2018 22:08
@asfgit

asfgit commented Apr 5, 2018

FAILURE

--none--

@RobbeSneyders
Contributor Author

RobbeSneyders commented Apr 6, 2018

I've added some changes. Most notably:

  • Replace bytes with memoryview in stream cython files. This also works with subtypes of bytes like the future.builtins bytes type.

  • Revert isinstance checks to typechecks for performance with the use of past.builtins.
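
As a pure-Python analogue of the memoryview change (the actual change lives in the Cython stream files, which are not shown in this thread), the sketch below accepts any object that exposes the buffer protocol, including subclasses of bytes. `NewBytes` and `OutputStream` are hypothetical stand-ins.

```python
class NewBytes(bytes):
  """Hypothetical stand-in for a bytes subclass such as future's newbytes."""


class OutputStream(object):
  """Toy byte sink that does not insist on the exact bytes type."""

  def __init__(self):
    self._buffer = bytearray()

  def write(self, b):
    view = memoryview(b)  # works for bytes, bytes subclasses, bytearray
    self._buffer.extend(view)

  def get(self):
    return bytes(self._buffer)


stream = OutputStream()
stream.write(b'ab')
stream.write(NewBytes(b'cd'))
stream.write(bytearray(b'ef'))
result = stream.get()
```

Accepting a buffer view instead of an exact type sidesteps the "expected bytes, got newbytes" mismatch without dropping the typed parameter entirely.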

I've also added the applied strategy to the Python 3 proposal document. It would be great to get some feedback on this so we can start moving forward with the other subpackages.

@aaltay @charlesccychen

@RobbeSneyders
Contributor Author

I ran into some problems while applying the same strategy to some more subpackages (mostly the typehints package). Most of these problems are caused by the use of future.builtins str, bytes and int.

I've therefore decided to change the approach; the new one can be found in pull request #5053. I opened a new pull request because many of the comments in this thread are now irrelevant.

The new approach is unfortunately a bit less resistant to regressions, but it is a better fit for the Beam project, and I suspect we will be able to move forward more quickly.

Closing this PR in favor of #5053


4 participants