[BEAM-3981] [WIP] Futurize and fix python 2 compatibility for coders subpackage #4990

RobbeSneyders · 2018-04-02T18:52:02Z

This pull request is the result of applying the automatic conversion provided by the future package to the coders subpackage, after which all python 2 errors were fixed. The result is python 3 styled code with python 2 compatibility.
This pull request is the first of a series in which all subpackages will be updated. We should therefore discuss the chosen approach, so we can agree on one strategy to apply throughout.

The approach I've taken is to focus on writing python 3 code with python 2 compatibility:

The future package provides tools to forward-port our code, while the six package focuses more on backporting. I've therefore replaced six with future everywhere it was already used.
One of the biggest problems in porting python 2 code to python 3 is the changed handling of strings and bytes. To get consistent behavior between versions, I have tried to rewrite everything to use the str and bytes types provided by the future.builtins package. I have not used the from __future__ import unicode_literals import, since its changes are too implicit and introduce a risk of subtle regressions on python 2.
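
A minimal sketch of the explicit str/bytes handling described above. It assumes the `from builtins import ...` form documented by the future package: on python 3 this resolves to the standard library `builtins` module, while on python 2 it requires future to be installed. The helper name `to_wire` is hypothetical and not part of the coders subpackage.

```python
from builtins import bytes, str  # native types on python 3, future's backports on python 2


def to_wire(text):
  """Encode text to bytes explicitly instead of relying on implicit conversion."""
  if not isinstance(text, str):
    raise TypeError('expected text, got %s' % type(text).__name__)
  # Wrapping in bytes() keeps the result in the same bytes type on both versions.
  return bytes(text.encode('utf-8'))


encoded = to_wire(u'caf\xe9')
first_byte = encoded[0]            # indexing bytes yields an int on python 3
decoded = encoded.decode('utf-8')  # decoding is explicit as well
```

The point of the explicit `encode`/`decode` calls is that every text/bytes boundary is visible in the code, which is what `unicode_literals` would hide.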

I started out by running futurize on the complete coders subpackage and then tried to fix the errors introduced by the automatic conversion. This, however, proved difficult, because it was not obvious where certain errors were introduced.
I therefore switched to a per-module approach, in which I first updated all non-test modules. This way, I could check that everything still ran with the native python 2 tests. Afterwards, I updated all test modules.

This pull request also contains updates to run_pylint.sh and tox.ini, so pylint can be run with the --py3k parameter. This should help avoid regression between the different steps of the update process.

RobbeSneyders · 2018-04-02T18:59:45Z

sdks/python/apache_beam/coders/stream.pyx

      libc.stdlib.free(self.data)

-  cpdef write(self, bytes b, bint nested=False):
+  cpdef write(self, b, bint nested=False):


The bytes type has been removed from this module for now, since a mismatch between the cython bytes type and future bytes type results in a TypeError: expected bytes, got newbytes.
I have tried replacing bytes with a memory view as explained here, but this resulted in a packaging error.
Any help on this would be appreciated.
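
The mismatch can be illustrated in plain python. This is only a sketch: it assumes that an argument declared as `bytes` in Cython accepts the exact built-in type and rejects subclasses, and it uses a hypothetical `NewBytes` subclass as a stand-in for future's newbytes. `write_exact` and `write_coerced` are illustrative names, not functions from stream.pyx.

```python
class NewBytes(bytes):
  """Hypothetical stand-in for a bytes subclass such as future's newbytes."""


def write_exact(b):
  # Mimics an exact-type check on the argument (an assumption about
  # how the typed Cython signature behaves).
  if type(b) is not bytes:
    raise TypeError('expected bytes, got %s' % type(b).__name__)
  return len(b)


def write_coerced(b):
  # One possible workaround on python 3: copy into a native bytes object first.
  return write_exact(bytes(b))


payload = NewBytes(b'abc')
try:
  write_exact(payload)
  rejected = False
except TypeError:
  rejected = True

written = write_coerced(payload)
```

Coercing at the call site keeps the typed signature, but it copies the data, so it may not be acceptable on a hot path.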

aaltay · 2018-04-02T19:09:14Z

cc: @tvalentyn @charlesccychen

charlesccychen

Thanks for this very useful change! Added some comments below.

charlesccychen · 2018-04-02T19:57:06Z

sdks/python/apache_beam/coders/coders_test_common.py

 """Tests common to all coder implementations."""
 from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function


Which of these is really necessary? Do we use print in this file?

All three have been added to the top of each updated module. This is safest to avoid regression before we add full python 3 support.
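
As a rough illustration of why these imports are harmless on python 3, the standard library `__future__` module records when each feature became mandatory. The snippet below inspects that metadata instead of using the `from __future__` statement itself, so it can be pasted anywhere in a file.

```python
import __future__

# Each feature records the release in which it became mandatory.
features = ('absolute_import', 'division', 'print_function')
mandatory_in = {
    name: getattr(__future__, name).getMandatoryRelease()[:2]
    for name in features
}

# With `division` in effect (always the case on python 3), `/` is true
# division and `//` is floor division.
true_quotient = 7 / 2
floor_quotient = 7 // 2
```

All three features are mandatory from python 3.0 on, so the imports only change behavior on python 2.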

charlesccychen · 2018-04-02T19:57:06Z

sdks/python/apache_beam/coders/coder_impl.py

-try:
-  long        # Python 2
-except NameError:
-  long = int  # Python 3


I think we need to keep long for Python 2 compatibility.

The int from future.builtins is a subclass of python 2's long.
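
A small sketch of what replaces the removed shim, assuming the `from builtins import int` form documented by the future package. On python 3 this is the native int; on python 2 it requires future to be installed.

```python
from builtins import int  # native int on python 3; future's int on python 2

big = 2 ** 70  # large enough to be a `long` on python 2
is_int = isinstance(big, int)
```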

charlesccychen · 2018-04-02T19:57:06Z

sdks/python/apache_beam/coders/coders_test_common.py

  def encode(self, x):
-    return str(x+1)
+    x = x + 1
+    return int(x).to_bytes((x.bit_length() + 7) // 8, 'big', signed=True)


What is the rationale for changing the coder behavior here to use a binary encoding instead of using a printable decimal string, for what is intended to be a simple example? Is there something simpler we can do? (e.g. we could coerce to str and encode as .encode('latin-1'))

charlesccychen · 2018-04-02T19:57:06Z

sdks/python/apache_beam/coders/stream.pxd

  cdef size_t pos

-  cpdef write(self, bytes b, bint nested=*)
+  cpdef write(self, b, bint nested=*)


I'm not sure removing this type annotation is the right thing to do, for performance...

CC: @robertwb, who could provide more guidance.

This wasn't meant to be permanent. I wanted to get some feedback on this in the pr reviews and then add a commit to fix it.

charlesccychen · 2018-04-02T19:57:07Z

sdks/python/apache_beam/coders/stream.pyx

      libc.stdlib.free(self.data)

-  cpdef write(self, bytes b, bint nested=False):
+  cpdef write(self, b, bint nested=False):


RobbeSneyders wrote:
The bytes type has been removed from this module for now, since a mismatch between the cython bytes type and future bytes type results in a TypeError: expected bytes, got newbytes.
I have tried replacing bytes with a memory view as explained here, but this resulted in a packaging error.
Any help on this would be appreciated.

CC: @robertwb, who could provide more guidance.

charlesccychen · 2018-04-02T19:57:08Z

sdks/python/apache_beam/coders/typecoders_test.py

  def encode(self, value):
-    return str(value.number)
+    x = value.number
+    return int(x).to_bytes((x.bit_length() + 7) // 8, 'big', signed=True)


What is the rationale for changing the coder behavior here to use a binary encoding instead of using a printable decimal string, for what is intended to be a simple example? Is there something simpler we can do? (e.g. we could coerce to str and encode as .encode('latin-1'))

That would be simpler indeed. It clashed with the cython bytes, since str().encode('latin-1') will return newbytes with the future package.
However, seems like we need to fix this cython bytes issue anyway, so I have changed it back. This is the drawback of working on a per module basis I guess :)
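
The two encodings discussed above can be compared side by side. This is a sketch for python 3, loosely modelled on the coders_test_common.py diff; the function names and the matching `decode_decimal` are hypothetical and not taken from the test files.

```python
def encode_decimal(x):
  """Encode an integer as printable ASCII digits, e.g. 41 -> b'42'."""
  return str(x + 1).encode('latin-1')


def decode_decimal(encoded):
  """Invert encode_decimal."""
  return int(encoded.decode('latin-1')) - 1


def encode_binary(x):
  """The binary form from the diff, reproduced for comparison."""
  x = x + 1
  return int(x).to_bytes((x.bit_length() + 7) // 8, 'big', signed=True)


samples = [-3, 0, 41, 126, 127]
round_trips = [decode_decimal(encode_decimal(x)) for x in samples]

try:
  # 128 needs 9 bits when signed, but the length formula allocates 1 byte.
  encode_binary(127)
  binary_overflowed = False
except OverflowError:
  binary_overflowed = True
```

Besides being readable in test output, the decimal form sidesteps the byte-length calculation, which in the binary variant does not leave room for the sign bit.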

charlesccychen · 2018-04-02T19:57:09Z

sdks/python/apache_beam/coders/proto2_coder_test_messages_pb2.py

-import sys
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function


Should we revert the changes in this generated file?

Also, which of these is really necessary? Do we use division and print in this file?

charlesccychen · 2018-04-02T19:57:09Z

sdks/python/apache_beam/coders/standard_coders_test.py

              start=Timestamp(micros=(x['end'] - x['span']) * 1000),
              end=Timestamp(micros=x['end'] * 1000)),
-      'urn:beam:coders:stream:0.1': lambda x, parser: map(parser, x),
+      'urn:beam:coders:stream:0.1': lambda x, parser: list(map(parser, x)),


Do we need this list call here? This may not be intended behavior, to materialize the list, since the stream may be a long sequence that doesn't fit in memory.

I don't think we need it. Seems like I missed this one. Thanks for pointing that out.

There's one test that fails without the list call. I'll look into it.
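
One plausible reason for the failure (an assumption, not confirmed in this thread) is that `map` returns a lazy iterator on python 3, which never compares equal to a list. The sketch below uses a hypothetical `parser` and shows materializing only at the comparison site, so the parser itself stays lazy.

```python
def parser(x):
  """Hypothetical element parser used only for this illustration."""
  return x * 2


values = [1, 2, 3]
lazy = map(parser, values)  # an iterator on python 3, a list on python 2
materialized = list(map(parser, values))

lazy_equals_list = (lazy == [2, 4, 6])  # False on python 3
materialized_equals_list = (materialized == [2, 4, 6])

# Materialize only where the comparison happens, not in the parser.
compared_at_call_site = (list(lazy) == [2, 4, 6])
```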

charlesccychen · 2018-04-02T19:57:10Z

sdks/python/apache_beam/coders/slow_coders_test.py

 """Unit tests for uncompiled implementation of coder impls."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function