[BEAM-3981] Futurize coders subpackage #5053

RobbeSneyders · 2018-04-09T14:36:18Z

This pull request prepares the coders subpackage for Python 3 support. This pull request is the first of a series in which all subpackages will be updated using the same approach.
This approach has been documented here and the WIP pull request can be found at #4990.

The used approach can be summarized as follows:

The future package provides tools to forward-port our code, while the six package focuses more on backporting. Whenever possible, we will therefore use the future package instead of the six package.
The future package provides backported Python 3 builtins, which can be used to write Python 3 style code with Python 2 compatibility.
One of the biggest problems in porting Python 2 code to Python 3 is the changed handling of strings and bytes.
- The future package provides backported Python 3 str and bytes types. While these can be convenient to write Python 2/3 compatible code in end products, we don’t believe they are the right choice for Beam.
  These backported types are new classes, subclassed from existing Python 2 builtins (e.g. from future.builtins import int imports the class future.builtins.newint.newint, which is a subclass of the Python 2 long type).
  While these new classes behave like the Python 3 types, they don’t give the same results when used in type checks, which are constantly used in beam (e.g. typecoders, typechecks, …)
- Instead, we propose to rewrite everything using the default str and bytes types. On Python 2, the bytes type is an alias for str. On Python 3, str is equivalent to Python 2 unicode.
  A consistent behaviour between Python 2/3 can be reached by using the bytes type whenever str behaviour is desired on Python 2 and bytes behaviour is desired on Python 3 (= bytes data).
  The unicode type can be used whenever unicode behaviour is desired on Python 2 and str behaviour is desired on Python 3 (= text data). The unicode type is not available in Python 3, which can be solved by adding
```
try:
  unicode           # pylint: disable=unicode-builtin
except NameError:   # Python 3
  unicode = str
```
  at the top of the module.
- All string literals which represent bytes should be marked as b''. String literals representing unicode in test modules should not be marked u''. These will automatically be interpreted as unicode literals in Python 3, but we still want to test for unmarked Python 2 code.
  Do not use the from __future__ import unicode_literals import, since its changes are too implicit and introduce a risk of subtle regressions on Python 2.
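As a rough illustration of the bytes / unicode convention described above, here is a minimal sketch. The classify helper is hypothetical and not part of Beam; it only shows that exact type checks against the native bytes and unicode names give the same answer on both versions once the shim is in place.

```python
try:
  unicode           # pylint: disable=unicode-builtin
except NameError:   # Python 3
  unicode = str


def classify(value):
  """Hypothetical helper: exact type check, as a type-based dispatch would do."""
  t = type(value)
  if t is bytes:      # str on Python 2, bytes on Python 3
    return 'bytes'
  elif t is unicode:  # unicode on Python 2, str on Python 3
    return 'text'
  return 'other'
```

With this, classify(b'abc') is 'bytes' and classify(u'abc') is 'text' on both versions, while an unmarked 'abc' is 'bytes' on Python 2 and 'text' on Python 3.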
The used approach for the long / int types is equivalent to the unicode / str approach outlined above.
The long type is not available in Python 3, since int now has long behaviour. This can be solved by adding
```
try:
  long          # pylint: disable=long-builtin
except NameError:   # Python 3
  long = int
```
at the top of the module.
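A small sketch of how the long shim could be used in a type check. The is_integer helper is hypothetical and only meant to show that the same isinstance call works on both versions.

```python
try:
  long          # pylint: disable=long-builtin
except NameError:   # Python 3
  long = int


def is_integer(value):
  """Hypothetical helper: True for int / long values, excluding bool."""
  # bool is a subclass of int, so it is excluded explicitly.
  return isinstance(value, (int, long)) and not isinstance(value, bool)
```

On Python 3 the tuple (int, long) simply collapses to (int, int), so the check stays valid without a version switch.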
Regression should be avoided as much as possible between the application of step 2 and step 3. This document proposes to take the following measures to keep the probability of regression as low as possible:
- Add the following import to the top of every module:
  from __future__ import absolute_import
  We can also add the following imports to the top of every module. This ensures that no new code can be added using, for instance, the old Python 2 division, and adds consistency across modules. We would like to hear the community's opinion on this.
```
from __future__ import division
from __future__ import print_function
```
- A new tox environment has been added which runs pylint --py3k to check for python 3 compatibility.
range() and iteritems() were not imported from future.builtins in coder_impl.py to avoid performance regression in Cython compiled code.

@aaltay @charlesccychen

angoenka · 2018-04-09T17:25:28Z

R: @robertwb

aaltay · 2018-04-10T05:39:25Z

R: @charlesccychen cc: @tvalentyn

Some high level comments:

six could be handy in some cases, for example six.string_types could be used instead of

try:
  unicode           # pylint: disable=unicode-builtin
except NameError:
  unicode = str

What do you think about that?

Is there a reason to change cython version?

RobbeSneyders · 2018-04-10T09:40:21Z

I tried to remove all six references so we don't have 2 dependencies for py2/3 compatibility. If the six method is preferred, I can revert the changes.
The previous cython version is not compatible with future.builtin types. Without this bump, any pure python cythonized file cannot use the builtins.

RobbeSneyders · 2018-04-11T09:46:00Z

Working on some other subpackages, I've come across an additional reason to upgrade the cython version.

Currently, the __cmp__ method is used in cythonized python files instead of the __eq__ method because cython did not support the __eq__ or any other special comparison methods. In Python 3, support for __cmp__ is gone, so this workaround will not work anymore.

Cython has added support for the special comparison methods in version 0.27.0, which allows us to use __eq__ again.
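For readers unfamiliar with the change, here is a minimal pure-Python sketch of replacing a __cmp__ method with rich comparison methods. The Priority class is hypothetical and not taken from Beam; functools.total_ordering derives the remaining ordering methods from __eq__ and __lt__.

```python
import functools


@functools.total_ordering
class Priority(object):
  """Hypothetical value class used only to illustrate rich comparisons."""

  def __init__(self, level):
    self.level = level

  def __eq__(self, other):
    if not isinstance(other, Priority):
      return NotImplemented
    return self.level == other.level

  def __ne__(self, other):
    # Python 2 does not derive __ne__ from __eq__, so define it explicitly.
    result = self.__eq__(other)
    return result if result is NotImplemented else not result

  def __lt__(self, other):
    if not isinstance(other, Priority):
      return NotImplemented
    return self.level < other.level

  def __hash__(self):
    return hash(self.level)
```

Whether a given cythonized class can use this pattern depends on the Cython version in use, as described above.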

charlesccychen

Thank you! This is a great change. Added a few comments.

charlesccychen · 2018-04-12T01:31:17Z

sdks/python/apache_beam/coders/coders.py

    # pylint: enable=protected-access

+  def __hash__(self):
+    return hash(type(self))


Any particular reason for this change? Previously, the hash would default to object.__hash__, which tries to give a different hash code for each instance. This change would give the same hash code for each class, which may not be desirable, since coders in general could be parameterized.

You're right. I've changed the hash to match the __eq__ check.
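A sketch of what "matching the __eq__ check" can look like for a parameterized class. ParamCoder and its precision attribute are hypothetical names for illustration, not the actual Beam code.

```python
class ParamCoder(object):
  """Hypothetical parameterized coder, for illustration only."""

  def __init__(self, precision):
    self.precision = precision

  def __eq__(self, other):
    return type(self) == type(other) and self.precision == other.precision

  def __ne__(self, other):
    # Needed on Python 2, which does not derive __ne__ from __eq__.
    return not self == other

  def __hash__(self):
    # Hash exactly the fields that __eq__ compares, so equal objects
    # always share a hash while differently parameterized ones usually differ.
    return hash((type(self), self.precision))
```

Equal instances then collapse to one entry in a set or dict, while instances with different parameters stay distinct.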

charlesccychen · 2018-04-12T01:31:28Z

sdks/python/apache_beam/coders/coders.py

  def encode(self, value):
-    try:               # Python 2
-      if isinstance(value, unicode):
-        return value.encode('utf-8')


Should we do the same unicode = str thing here? (In the new version, we will raise an error if the value is a non-ascii unicode string; eg: str(u'😋') raises an error, while this worked before.)

The unicode = str trick won't work here on Python 3:

>>> str('test'.encode('utf-8'))
"b'test'"

I've reverted this change to the previous solution
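To make the pitfall concrete, here is a hedged sketch of an encode helper that avoids calling str() on the value. The to_utf8_bytes function is hypothetical and is not the code in coders.py; it only shows text being encoded explicitly and bytes being passed through.

```python
try:
  unicode           # pylint: disable=unicode-builtin
except NameError:   # Python 3
  unicode = str


def to_utf8_bytes(value):
  """Hypothetical helper: returns UTF-8 bytes for text or bytes input."""
  if isinstance(value, bytes):
    return value                  # already bytes, pass through unchanged
  if isinstance(value, unicode):
    return value.encode('utf-8')  # works for non-ASCII text on both versions
  raise TypeError('expected bytes or text, got %s' % type(value).__name__)
```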

charlesccychen · 2018-04-12T01:31:46Z

sdks/python/apache_beam/coders/coder_impl.py

-      stream.write_byte(UNICODE_TYPE)
-      stream.write(unicode_value.encode('utf-8'), nested)
+    elif t is unicode:
+      text_value = value  # for typing


Can you use the same try: unicode except: unicode = str in the corresponding part of the .pxd file so that the type annotation directive for text_value is respected?

I tried this, but it throws a 'not a type' error for unicode.

charlesccychen · 2018-04-12T01:33:41Z

sdks/python/apache_beam/coders/coder_impl.py

      stream.write_byte(DICT_TYPE)
      stream.write_var_int64(len(dict_value))
-      for k, v in dict_value.iteritems():
+      for k, v in dict_value.items():


Any particular reason for the iteritems() -> items() change?

iteritems() isn't available anymore in Python 3. Instead, items() returns a lazy view rather than a list. I've replaced it with iteritems(dict) from future.utils, which returns an iterator on both versions.
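For illustration, a stand-in with the same intent as future.utils.iteritems can be sketched with the standard library only. This is an assumption about the behaviour being relied on, not the future package's actual implementation.

```python
def iteritems(d):
  """Hypothetical stand-in: iterate over (key, value) pairs lazily."""
  # Python 2 dicts have iteritems(); Python 3 dicts only have items().
  func = getattr(d, 'iteritems', None)
  if func is None:
    func = d.items
  return iter(func())


def total_length(dict_value):
  """Example use: sum of key and value lengths without building a list."""
  return sum(len(k) + len(v) for k, v in iteritems(dict_value))
```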

tvalentyn

Thank you, Robbe! A few minor comments below.

tvalentyn · 2018-04-13T06:34:46Z

sdks/python/apache_beam/coders/coders.py

            and self._dict_without_impl() == other._dict_without_impl())
-    # pylint: enable=protected-access
+
+  def __hash__(self):


Is this change required by the Python 3 migration, or are we just fixing an omission where __hash__ was not previously defined while __eq__ was?

On Python 2, __hash__ defaults to give a different value for each instance. On Python 3, __hash__ defaults to None if __eq__ is implemented. By implementing __hash__, we get consistent behavior on both versions.

Revisiting this in light of other PRs. I think it would be safer to guarantee the contract that the hash does not change for the same object if we compute it here based on the object type; sent #5390.
Another possibility to guarantee consistent behavior between Python 2 and 3 would be to set __hash__ = None if we can infer that a class is obviously non-hashable.

We can also use the id which is guaranteed to stay the same for an object:
hash(id(self))
The default Python 2 hash also relies on id.

That's true. Although that wouldn't honor the contract between eq and hash.
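The __hash__ = None option mentioned above can be sketched as follows. MutableSpec and is_hashable are hypothetical names; the point is only that the class is unhashable on both Python 2 and Python 3 once __hash__ is set to None explicitly.

```python
class MutableSpec(object):
  """Hypothetical class whose equality depends on mutable state."""

  def __init__(self, fields):
    self.fields = list(fields)

  def __eq__(self, other):
    return type(self) == type(other) and self.fields == other.fields

  def __ne__(self, other):
    return not self == other

  # Explicitly unhashable, independent of the interpreter's default.
  __hash__ = None


def is_hashable(obj):
  """Returns True if hash(obj) succeeds."""
  try:
    hash(obj)
  except TypeError:
    return False
  return True
```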

tvalentyn · 2018-04-13T06:35:09Z

sdks/python/apache_beam/coders/coders_test_common.py

    self.check_coder(coders.TupleCoder((CustomCoder(), coders.BytesCoder())),
-                     (1, 'a'), (-10, 'b'), (5, 'c'))
+                     (1, b'a'), (-10, b'b'), (5, b'c'))



Probably not critical, but looks like 'a' is not replaced with b'a' here - are these changes done by some tool or manually?

I guess this is aimed at the 'a' in line 109?
The marking of the strings as bytes literals is done manually. I've only marked strings as bytes literals when it's clear that they're meant to represent bytes (when testing BytesCoder, when the contents of the string are clearly bytes, ...). When a string is not marked, it represents str on both versions, which seems ok for the 'a' at line 109 for example.

tvalentyn · 2018-04-13T06:35:28Z

sdks/python/apache_beam/coders/coders_test_common.py

+    self.check_coder(coder, None, 1, -1, 1.5, b'str\0str', u'unicode\0\u0101')
    self.check_coder(coder, (), (1, 2, 3))
    self.check_coder(coder, [], [1, 2, 3])
    self.check_coder(coder, dict(), {'a': 'b'}, {0: dict(), 1: len})


Also here 'a' and 'b' are not bytestrings.

See comment above.
The unmarked 'a' and 'b' here represent str on both versions, which seems ok for this test.

tvalentyn · 2018-04-13T06:36:17Z

sdks/python/tox.ini

  pip --version
  time {toxinidir}/run_pylint.sh

+[testenv:py27-lint3]


Can we add a comment explaining how this is different from py3-lint? Or perhaps we don't need both of them?

py27-lint3 checks for portability issues, while py3-lint checks for python 3 issues. I'll add a comment to clarify.

RobbeSneyders · 2018-04-13T08:42:52Z

Thanks for the reviews @charlesccychen, @tvalentyn. I've tried to address all of your comments and have committed some changes based on your input.
Please have another look if everything seems ok now.

RobbeSneyders · 2018-04-13T14:41:53Z

run java precommit

tvalentyn · 2018-04-13T14:50:55Z

sdks/python/run_pylint.sh

  echo
  exit 1
-fi
+fi


Let's add a new line at the end of file here and in sdks/python/run_pylint_2to3.sh.

tvalentyn · 2018-04-13T15:25:21Z

Thank you Robbe! The PR and approach look good to me.

tvalentyn · 2018-04-13T15:27:10Z

Run Python Dataflow ValidatesRunner

aaltay · 2018-04-13T18:00:45Z

I see the following error in the output

I      /tmp/pip-cVZPmP-build/setup.py:79: UserWarning: You are using version 0.27.3 of cython. However, version 0.28.1 is recommended. 
I        _CYTHON_VERSION, REQUIRED_CYTHON_VERSION 
I       
I      Error compiling Cython file: 
I      ------------------------------------------------------------ 
I      ... 
I      except NameError:   # Python 3 
I        long = int 
I        unicode = str 
I       
I       
I      class CoderImpl(object): 
I      ^ 
I      ------------------------------------------------------------ 
I       
I      apache_beam/coders/coder_impl.py:68:0: 'object' is not a type name 
I      Compiling apache_beam/coders/stream.pyx because it changed. 
I      Compiling apache_beam/runners/worker/statesampler_fast.pyx because it changed. 
I      Compiling apache_beam/coders/coder_impl.py because it changed. 
I      Compiling apache_beam/metrics/execution.py because it changed. 
I      Compiling apache_beam/runners/common.py because it changed. 
I      Compiling apache_beam/runners/worker/logger.py because it changed. 
I      Compiling apache_beam/runners/worker/opcounters.py because it changed. 
I      Compiling apache_beam/runners/worker/operations.py because it changed. 
I      Compiling apache_beam/transforms/cy_combiners.py because it changed. 
I      Compiling apache_beam/utils/counters.py because it changed. 
I      Compiling apache_beam/utils/windowed_value.py because it changed. 
I      [ 1/11] Cythonizing apache_beam/coders/coder_impl.py 
I      Traceback (most recent call last): 
I        File "<string>", line 1, in <module> 
I        File "/tmp/pip-cVZPmP-build/setup.py", line 204, in <module> 
I          'apache_beam/utils/windowed_value.py', 
I        File "/usr/local/lib/python2.7/dist-packages/Cython/Build/Dependencies.py", line 1039, in cythonize 
I          cythonize_one(*args) 
I        File "/usr/local/lib/python2.7/dist-packages/Cython/Build/Dependencies.py", line 1161, in cythonize_one 
I          raise CompileError(None, pyx_file) 
I      Cython.Compiler.Errors.CompileError: apache_beam/coders/coder_impl.py 
I       
I      ----------------------------------------

I assume this is because the new code requires the new Cython version, but the Dataflow workers do not have it.

@tvalentyn Could you upgrade the workers at head to use the 0.28.1 cython version?

tvalentyn · 2018-04-14T04:25:11Z

Containers are released and PR #5131 is out to upgrade them on master, we can also apply the changes from that PR here as well and rerun the postcommit suite.

aaltay · 2018-04-16T23:43:38Z

Run Python Dataflow ValidatesRunner

tvalentyn · 2018-04-16T23:46:49Z

With #5131 merged, we probably need to rebase this PR off the current master, for the ValidatesRunner tests to pass. @RobbeSneyders can we do that please?

tvalentyn · 2018-04-17T01:02:07Z

Actually, looking at Jenkins logs I see that Jenkins already merges this PR with latest commit on master when we run a Postcommit suite:

Checking out Revision 1b8df077a60fc0188d786a146d5c5edb9eb2732f (refs/remotes/origin/pr/5053/merge)

git config core.sparsecheckout # timeout=10
git checkout -f 1b8df077a60fc0188d786a146d5c5edb9eb2732f
Commit message: "Merge 6909ff3 into e1c526d"
...

tvalentyn · 2018-04-17T01:10:26Z

There is a proto generation error in the ValidatesRunner tests, which seems to be unrelated to this PR, I also see that error in other Postcommit test suites.

RobbeSneyders · 2018-04-17T08:12:15Z

I've rebased the branch anyway. Is there anything else I should do before we can merge? Like squash some commits?

tvalentyn · 2018-04-17T20:35:08Z

Yes, please squash the commits; I am looking into postcommit issue, hopefully that does not affect this PR.
Thank you.

tvalentyn · 2018-04-17T20:35:29Z

Run Python Dataflow ValidatesRunner

RobbeSneyders · 2018-04-17T21:15:28Z

Run Python Dataflow ValidatesRunner

RobbeSneyders · 2018-04-17T21:44:46Z

Run Python Precommit

aaltay · 2018-04-18T01:29:23Z

retest this please

aaltay · 2018-04-18T01:36:24Z

@tvalentyn if the test issues are unrelated to this PR and related to the current ongoing Jenkins issues should we move forward with it? (I suggest that we run a few manual tests locally to check whether this PR introduces issues or not.)

RobbeSneyders · 2018-05-09T17:19:17Z

I have added comments in the file and added a remark in the documentation of the used approach.

tvalentyn · 2018-05-09T17:52:57Z

Run Python Dataflow ValidatesRunner

tvalentyn · 2018-05-09T17:53:12Z

@RobbeSneyders Thank you for the comments.

tvalentyn · 2018-05-09T17:53:53Z

Run Python Dataflow ValidatesRunner

tvalentyn · 2018-05-09T17:57:54Z

Run Python Dataflow ValidatesRunner

tvalentyn · 2018-05-09T21:28:53Z

ValidatesRunner suite just takes time to start. It failed due to an unrelated issue, filed: https://issues.apache.org/jira/browse/INFRA-16508 to clarify. I'll do a sanity check locally.

robertwb · 2018-05-09T22:44:08Z

Regarding items vs. iteritems, let's just go with dict_value.items everywhere. There's little if any need for explicit iteritems in this case (and if dict is typed, I think Cython optimizes for k, v in dict_value.items() optimally without the intermediate list anyways).

Though macrobenchmarks can be good for final validation, this is the kind of code that could probably benefit from some pretty simple microbenchmarks.

aaltay · 2018-05-10T16:08:48Z

We chatted with @tvalentyn. The test failure is unrelated and @tvalentyn will implement @robertwb's suggestions in a follow up PR. Merging this now to unblock progress.

Thank you @RobbeSneyders and @tvalentyn for pushing it thus far!

tvalentyn · 2018-06-06T22:08:11Z

An update on dict.iteritems vs dict.items() vs future.utils.iteritems(dict) - I did more performance testing of encode-decode operation using a microbenchmark (currently in flight: #5565).

I don't observe a difference in performance of dict.iteritems() and future.utils.iteritems(dict).

As far as dict.items() vs dict.iteritems() goes, I saw a 2x performance slowdown in coder implementation with dict.items() for dictionaries with over 100000 entries, but did not observe a significant difference on dictionaries with 10000 entries or less. That said I think it would not hurt to keep using iteritems() for Python 2 as we do now.

With future.utils.iteritems():

Median time cost:
Dict[int, int], FastPrimitiveCoder         : per element median time cost: 3.27529e-07 sec

With dict.iteritems():

Median time cost:
Dict[int, int], FastPrimitiveCoder         : per element median time cost: 3.4485e-07 sec

With dict.items():

Median time cost:
Dict[int, int], FastPrimitiveCoder         : per element median time cost: 7.3393e-07 sec

I also observe a 2.5x degradation in coder implementation with builtins.range() compared to range() on lists as small as 1000 - 10000 elements. I did not try smaller lists.

With 10000 elements, python 2 range():

Median time cost:
List[int], FastPrimitiveCoder              : per element median time cost: 1.17695e-07 sec

With builtins.range():

Median time cost:
List[int], FastPrimitiveCoder              : per element median time cost: 3.22402e-07 sec

We should try to use microbenchmarks for performance evaluations moving forward since they can provide feedback in a matter of seconds.
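As a rough idea of what such a microbenchmark can look like, here is a minimal timeit-based sketch. It is not the benchmark from #5565; the function names are hypothetical and it measures plain Python iteration rather than the Cython-compiled coder path.

```python
import timeit


def sum_items(dict_value):
  """Workload under test: iterate over all items of a dict."""
  total = 0
  for k, v in dict_value.items():
    total += k + v
  return total


def median_time_per_element(fn, size, repeats=5):
  """Returns the median wall time of fn(data) divided by the dict size."""
  data = {i: i for i in range(size)}
  timings = sorted(timeit.repeat(lambda: fn(data), number=1, repeat=repeats))
  return timings[len(timings) // 2] / size


if __name__ == '__main__':
  for size in (1000, 10000, 100000):
    cost = median_time_per_element(sum_items, size)
    print('Dict[int, int], size %d: per element median time cost: %g sec'
          % (size, cost))
```

Reporting the median over a few repeats keeps a single slow run from dominating the result.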

robertwb · 2018-06-06T22:41:03Z

dict.items() should have the same performance when compiled with Cython, as it unpacks it into item-iterating code without the function call (or intermediate object generation). Having two cases for Py2 and Py3 will break this optimization (as well as the fact that the case checking code could be (relatively) expensive for small dicts.)

tvalentyn · 2018-06-07T22:05:02Z

Hmm. I am pretty sure my microbenchmark uses a Cython codepath, since in order for any code change to take effect I have to run python setup.py build_ext --inplace to recompile the associated C extensions. I checked once again and I do see a 2x slowdown with items() once the size of the dictionary exceeds 100000 elements. Here's the code generated by Cython: https://docs.google.com/document/d/1S-oeqJGiMHt_L3iCgr9dYfQdR0_ukQE25mcvK-BqudU/edit#heading=h.drcukhvo4hd6. Perhaps the slowdown is related to materializing the list of keys?

My microbenchmark setup is here: https://github.com/apache/beam/compare/master...tvalentyn:coders_dict_microbencmark?expand=1

tvalentyn · 2018-06-12T20:27:42Z

For the record, #5586 improves the performance of dict.items() with a Cython directive to use Python 3 interpretation.

RobbeSneyders mentioned this pull request Apr 9, 2018

[BEAM-3981] [WIP] Futurize and fix python 2 compatibility for coders subpackage #4990

Closed

RobbeSneyders force-pushed the coders branch from a744cf4 to 4ee7eb1 Compare April 9, 2018 14:58

RobbeSneyders force-pushed the coders branch from 4ee7eb1 to 7cbfd8d Compare April 11, 2018 15:25

charlesccychen reviewed Apr 12, 2018

tvalentyn reviewed Apr 13, 2018

RobbeSneyders force-pushed the coders branch from 6909ff3 to c18b2f3 Compare April 17, 2018 08:06

RobbeSneyders force-pushed the coders branch from c18b2f3 to ab65b54 Compare April 17, 2018 21:14

Futurize coders subpackage

cc8bc3f

RobbeSneyders force-pushed the coders branch from 179286a to cc8bc3f Compare May 9, 2018 17:16

aaltay merged commit e0e6b50 into apache:master May 10, 2018

RobbeSneyders mentioned this pull request May 11, 2018

[BEAM-3999] Futurize internal subpackage #5334

Merged

Fematich mentioned this pull request May 11, 2018

[BEAM-4001] Futurize metrics subpackage #5335

Merged

This was referenced May 11, 2018

[BEAM-4008] Futurize utils subpackage #5336

Merged

[BEAM-4007] Futurize typehints subpackage #5337

Merged

This was referenced May 11, 2018

[BEAM-4002] Futurize options subpackage #5339

Merged

[BEAM-4005] Futurize tools subpackage #5343

Closed

RobbeSneyders mentioned this pull request May 12, 2018

Add py27-lint3 test to gradle.build #5350

Merged

This was referenced May 14, 2018

[BEAM-4004] Futurize testing subpackage #5352

Closed

[BEAM-4003] Futurize runners subpackage #5373

Merged

[BEAM-4229] Futurize portability subpackage #5385

Merged

[BEAM-4009] Futurize unpackaged files #5387

Closed

This was referenced Jun 14, 2018

[BEAM-3998] Futurize examples subpackage #5652

Merged

[BEAM-4000] Futurize io subpackage #5715

Merged

[BEAM-4006] Futurize transforms subpackage #5729

Merged

Fematich mentioned this pull request Jul 10, 2018

[BEAM-4751] fix missing pylint3 check for io subpackage #5916

Merged
