@RobbeSneyders
Contributor

@RobbeSneyders RobbeSneyders commented Apr 9, 2018

This pull request prepares the coders subpackage for Python 3 support. This pull request is the first of a series in which all subpackages will be updated using the same approach.
This approach has been documented here and the WIP pull request can be found at #4990.

The approach can be summarized as follows:

  • The future package provides tools to forward-port our code, while the six package focuses more on backporting. Whenever possible, we will therefore use the future package instead of the six package.
    The future package provides backported Python 3 builtins, which can be used to write Python 3 style code with Python 2 compatibility.

  • One of the biggest problems in porting Python 2 code to Python 3 is the changed handling of strings and bytes.

    • The future package provides backported Python 3 str and bytes types. While these can be convenient to write Python 2/3 compatible code in end products, we don’t believe they are the right choice for Beam.
      These backported types are new classes, subclassed from existing Python 2 builtins (e.g. from future.builtins import int imports the class future.builtins.newint.newint, which is a subclass of the Python 2 long type).
      While these new classes behave like the Python 3 types, they don’t give the same results when used in type checks, which are used constantly in Beam (e.g. typecoders, typechecks, …).

    • Instead, we propose to rewrite everything using the default str and bytes types. On Python 2, the bytes type is an alias for str. On Python 3, str is equivalent to Python 2 unicode.
      A consistent behaviour between Python 2/3 can be reached by using the bytes type whenever str behaviour is desired on Python 2 and bytes behaviour is desired on Python 3 (= bytes data).
      The unicode type can be used whenever unicode behaviour is desired on Python 2 and str behaviour is desired on Python 3 (= text data). The unicode type is not available in Python 3, which can be solved by adding

      try:
        unicode           # pylint: disable=unicode-builtin
      except NameError:
        unicode = str
      

      at the top of the module.

    • All string literals which represent bytes should be marked as b''. String literals representing unicode in test modules should not be marked u''. These will automatically be interpreted as unicode literals in Python 3, but we still want to test for unmarked Python 2 code.
      Do not use the from __future__ import unicode_literals import, since its changes are too implicit and introduce a risk of subtle regressions on Python 2.

  • The approach for the long / int types is equivalent to the unicode / str approach outlined above.
    The long type is not available in Python 3, since int now has long behaviour. This can be solved by adding

    try:
      long          # pylint: disable=long-builtin
    except NameError:
      long = int
    

    at the top of the module.

  • Regressions should be avoided as much as possible between the application of step 2 and step 3. This document proposes the following measures to keep the probability of regression as low as possible:

    • Add the following import to the top of every module:
      from __future__ import absolute_import
      We can also add the following imports to the top of every module. This ensures that no new code can be added using, for instance, the old Python 2 division, and it adds consistency across modules. We would like to hear the community’s opinion on this.
      from __future__ import division
      from __future__ import print_function
      
    • A new tox environment has been added which runs pylint --py3k to check for Python 3 compatibility.
  • range() and iteritems() were not imported from the future package in coder_impl.py, to avoid a performance regression in Cython-compiled code.
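
The str / bytes and long / int conventions described above can be sketched as follows. This is a minimal illustration, not code from the Beam sources; `describe` is a hypothetical helper that only exists to show how the resulting type checks behave the same way on both versions.

```python
try:                # Python 2
  unicode           # pylint: disable=unicode-builtin
  long              # pylint: disable=long-builtin
except NameError:   # Python 3
  unicode = str
  long = int


def describe(value):
  """Hypothetical helper: classify a value identically on Python 2 and 3."""
  if isinstance(value, bytes):      # str on Python 2, bytes on Python 3
    return 'bytes data'
  if isinstance(value, unicode):    # unicode on Python 2, str on Python 3
    return 'text data'
  # bool is a subclass of int, so exclude it explicitly.
  if isinstance(value, (int, long)) and not isinstance(value, bool):
    return 'integer'
  return 'other'


print(describe(b'abc'))    # bytes data
print(describe(u'abc'))    # text data
print(describe(2 ** 70))   # integer
```

Note that an unmarked literal such as 'abc' is classified as bytes data on Python 2 and as text data on Python 3, which is why bytes literals are marked explicitly.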

@aaltay @charlesccychen

@angoenka
Contributor

angoenka commented Apr 9, 2018

R: @robertwb

@aaltay
Member

aaltay commented Apr 10, 2018

R: @charlesccychen cc: @tvalentyn

Some high level comments:

  • six could be handy in some cases, for example six.string_types could be used instead of
try:
  unicode           # pylint: disable=unicode-builtin
except NameError:
  unicode = str

What do you think about that?

  • Is there a reason to change cython version?

@RobbeSneyders
Contributor Author

  • I tried to remove all six references so we don't have 2 dependencies for py2/3 compatibility. If the six method is preferred, I can revert the changes.

  • The previous cython version is not compatible with future.builtin types. Without this bump, any pure python cythonized file cannot use the builtins.

@RobbeSneyders
Contributor Author

Working on some other subpackages, I've come across an additional reason to upgrade the cython version.

Currently, the __cmp__ method is used in cythonized python files instead of the __eq__ method because cython did not support the __eq__ or any other special comparison methods. In Python 3, support for __cmp__ is gone, so this workaround will not work anymore.

Cython has added support for the special comparison methods in version 0.27.0, which allows us to use __eq__ again.
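
As an illustration of what this enables, a class that previously relied on __cmp__ can define __eq__, __ne__ and __hash__ instead. The sketch below is plain Python with a hypothetical `ExampleImpl` class; it is not taken from the Beam sources.

```python
class ExampleImpl(object):
  """Hypothetical class, used only to illustrate the comparison methods."""

  def __init__(self, name):
    self.name = name

  def __eq__(self, other):
    return type(self) == type(other) and self.name == other.name

  def __ne__(self, other):
    # Python 2 does not derive __ne__ from __eq__, so define it explicitly.
    return not self == other

  def __hash__(self):
    # Hash the same fields that __eq__ compares.
    return hash((type(self), self.name))
```

Defining __ne__ explicitly keeps the behaviour of != identical on Python 2, where it is not derived from __eq__.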

Contributor

@charlesccychen charlesccychen left a comment

Thank you! This is a great change. Added a few comments.

# pylint: enable=protected-access

  def __hash__(self):
    return hash(type(self))
Contributor

Any particular reason for this change? Previously, the hash would default to object.__hash__, which tries to give a different hash code for each instance. This change would give the same hash code for each class, which may not be desirable, since coders in general could be parameterized.

Contributor Author

You're right. I've changed the hash to match the __eq__ check.

def encode(self, value):
  try:  # Python 2
    if isinstance(value, unicode):
      return value.encode('utf-8')
Contributor

Should we do the same unicode = str thing here? (In the new version, we will raise an error if the value is a non-ASCII unicode string; e.g. str(u'😋') raises an error on Python 2, while this worked before.)

Contributor Author

The unicode = str trick won't work here on Python 3:

>>> str('test'.encode('utf-8'))
"b'test'"

I've reverted this change to the previous solution.
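
The two failure modes discussed here (str() on text fails for non-ASCII input on Python 2, and str() on bytes yields "b'test'" on Python 3) can be avoided by branching on the type instead of calling str(). The following is a minimal sketch with a hypothetical `to_utf8_bytes` helper; it is not the code that was restored in the PR.

```python
try:                # Python 2
  unicode           # pylint: disable=unicode-builtin
except NameError:   # Python 3
  unicode = str


def to_utf8_bytes(value):
  """Hypothetical helper: encode text as UTF-8 and pass bytes through."""
  if isinstance(value, unicode):
    return value.encode('utf-8')
  if isinstance(value, bytes):
    return value
  raise TypeError('expected text or bytes, got %s' % type(value).__name__)


print(to_utf8_bytes(u'unicode\u0101') == b'unicode\xc4\x81')  # True
print(to_utf8_bytes(b'test') == b'test')                      # True
```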

  stream.write_byte(UNICODE_TYPE)
  stream.write(unicode_value.encode('utf-8'), nested)
elif t is unicode:
  text_value = value  # for typing
Contributor

Can you use the same try: unicode except: unicode = str in the corresponding part of the .pxd file so that the type annotation directive for text_value is respected?

Contributor Author

I tried this, but it throws a 'not a type' error for unicode.

  stream.write_byte(DICT_TYPE)
  stream.write_var_int64(len(dict_value))
- for k, v in dict_value.iteritems():
+ for k, v in dict_value.items():
Contributor

Any particular reason for the iteritems() -> items() change?

Contributor Author

iteritems() isn't available anymore in Python 3. Instead, items() returns a view that is iterated lazily, without building a list. I've replaced it with iteritems(dict) from future.utils, which returns an iterator on both versions.
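
For readers unfamiliar with the helper, the idea can be sketched in a few lines. The `iter_items` function below is a hypothetical stand-alone equivalent written for illustration; it is similar in spirit to future.utils.iteritems but is not the library's code.

```python
def iter_items(mapping):
  """Hypothetical helper: iterate over (key, value) pairs without a list copy."""
  try:
    return mapping.iteritems()    # Python 2: lazy iterator
  except AttributeError:
    return iter(mapping.items())  # Python 3: items() is a lazy view


for key, value in iter_items({'a': 1}):
  print(key, value)
```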

Contributor

@tvalentyn tvalentyn left a comment

Thank you, Robbe! A few minor comments below.

and self._dict_without_impl() == other._dict_without_impl())
# pylint: enable=protected-access

def __hash__(self):
Contributor

Is this change required by the Python 3 migration, or are we just fixing an omission where __hash__ was not previously defined while __eq__ was?

Contributor Author

On Python 2, __hash__ defaults to giving a different value for each instance. On Python 3, __hash__ defaults to None if __eq__ is implemented without __hash__. By implementing __hash__, we get consistent behavior on both versions.
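
The Python 3 behaviour can be checked with a small experiment. The classes `OnlyEq` and `EqAndHash` are hypothetical and exist only for this demonstration.

```python
import sys


class OnlyEq(object):
  """Defines __eq__ but not __hash__."""

  def __eq__(self, other):
    return type(self) == type(other)


class EqAndHash(object):
  """Defines __eq__ and a __hash__ that agrees with it."""

  def __eq__(self, other):
    return type(self) == type(other)

  def __hash__(self):
    return hash(type(self))


if sys.version_info[0] >= 3:
  # Python 3 sets __hash__ to None when a class defines only __eq__,
  # so instances of OnlyEq cannot be used in sets or as dict keys.
  assert OnlyEq.__hash__ is None

# With an explicit __hash__, equal instances collapse to one set entry
# on both Python 2 and Python 3.
assert len({EqAndHash(), EqAndHash()}) == 1
```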

Contributor

Revisiting this in light of other PRs: I think it would be safer to guarantee the contract that the hash does not change for the same object if we compute it here based on the object type. I sent #5390 for this.
Another possibility to guarantee consistent behavior between Python 2 and 3 would be to set __hash__ = None if we can infer that a class is obviously non-hashable.

Contributor Author

@RobbeSneyders RobbeSneyders May 16, 2018

We can also use the id which is guaranteed to stay the same for an object:
hash(id(self))
The default Python 2 hash also relies on id.

Contributor

That's true, although that wouldn't honor the contract between __eq__ and __hash__.
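
To make the concern concrete: the contract requires that objects which compare equal also have equal hashes. The sketch below uses a hypothetical `IdHashed` class to show what happens when the hash is based on id while __eq__ compares values.

```python
class IdHashed(object):
  """Hypothetical class: equal by value, but hashed by identity."""

  def __init__(self, value):
    self.value = value

  def __eq__(self, other):
    return type(self) == type(other) and self.value == other.value

  def __hash__(self):
    return hash(id(self))


a = IdHashed(1)
b = IdHashed(1)
print(a == b)              # True
print(hash(a) == hash(b))  # False: equal objects, different hashes
print(b in {a})            # False, although a == b
```

Set and dict lookups compare hashes before calling __eq__, so an equal object with a different hash is never found.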

  self.check_coder(coders.TupleCoder((CustomCoder(), coders.BytesCoder())),
-                  (1, 'a'), (-10, 'b'), (5, 'c'))
+                  (1, b'a'), (-10, b'b'), (5, b'c'))

Contributor

Probably not critical, but it looks like 'a' is not replaced with b'a' here. Are these changes made by a tool or manually?

Contributor Author

I guess this is aimed at the 'a' in line 109?
The marking of the strings as bytes literals is done manually. I've only marked strings as bytes literals when it's clear that they're meant to represent bytes (when testing BytesCoder, when the content of the string is clearly bytes, ...). When a string is not marked, it represents str on both versions, which seems OK for the 'a' at line 109, for example.

self.check_coder(coder, None, 1, -1, 1.5, b'str\0str', u'unicode\0\u0101')
self.check_coder(coder, (), (1, 2, 3))
self.check_coder(coder, [], [1, 2, 3])
self.check_coder(coder, dict(), {'a': 'b'}, {0: dict(), 1: len})
Contributor

Also here 'a' and 'b' are not bytestrings.

Contributor Author

See comment above.
The unmarked 'a' and 'b' here represent str on both versions, which seems ok for this test.
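
What "represents str on both versions" means in practice can be shown with a short check. This is an illustrative sketch, not part of the test module under review.

```python
import sys

unmarked = 'a'   # native str: bytes on Python 2, text on Python 3
marked = b'a'    # bytes on both versions
text = u'a'      # text on both versions

assert isinstance(unmarked, str)
assert isinstance(marked, bytes)

if sys.version_info[0] >= 3:
  # On Python 3 the unmarked literal is text and never equals bytes.
  assert unmarked == text and unmarked != marked
else:
  # On Python 2 the unmarked literal is the same type as a bytes literal.
  assert unmarked == marked
```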

pip --version
time {toxinidir}/run_pylint.sh

[testenv:py27-lint3]
Contributor

Can we add a comment explaining how this differs from py3-lint? Or perhaps we don't need both of them?

Contributor Author

py27-lint3 checks for portability issues, while py3-lint checks for Python 3 issues. I'll add a comment to clarify.

@RobbeSneyders
Contributor Author

RobbeSneyders commented Apr 13, 2018

Thanks for the reviews @charlesccychen, @tvalentyn. I've tried to address all of your comments and have committed some changes based on your input.
Please take another look to see whether everything is OK now.

@RobbeSneyders
Contributor Author

run java precommit

echo
exit 1
fi
fi
Contributor

Let's add a new line at the end of file here and in sdks/python/run_pylint_2to3.sh.

Contributor Author

done

@tvalentyn
Contributor

Thank you Robbe! The PR and approach look good to me.

@tvalentyn
Contributor

Run Python Dataflow ValidatesRunner

@aaltay
Member

aaltay commented Apr 13, 2018

I see the following error in the output:

I      /tmp/pip-cVZPmP-build/setup.py:79: UserWarning: You are using version 0.27.3 of cython. However, version 0.28.1 is recommended. 
I        _CYTHON_VERSION, REQUIRED_CYTHON_VERSION 
I       
I      Error compiling Cython file: 
I      ------------------------------------------------------------ 
I      ... 
I      except NameError:   # Python 3 
I        long = int 
I        unicode = str 
I       
I       
I      class CoderImpl(object): 
I      ^ 
I      ------------------------------------------------------------ 
I       
I      apache_beam/coders/coder_impl.py:68:0: 'object' is not a type name 
I      Compiling apache_beam/coders/stream.pyx because it changed. 
I      Compiling apache_beam/runners/worker/statesampler_fast.pyx because it changed. 
I      Compiling apache_beam/coders/coder_impl.py because it changed. 
I      Compiling apache_beam/metrics/execution.py because it changed. 
I      Compiling apache_beam/runners/common.py because it changed. 
I      Compiling apache_beam/runners/worker/logger.py because it changed. 
I      Compiling apache_beam/runners/worker/opcounters.py because it changed. 
I      Compiling apache_beam/runners/worker/operations.py because it changed. 
I      Compiling apache_beam/transforms/cy_combiners.py because it changed. 
I      Compiling apache_beam/utils/counters.py because it changed. 
I      Compiling apache_beam/utils/windowed_value.py because it changed. 
I      [ 1/11] Cythonizing apache_beam/coders/coder_impl.py 
I      Traceback (most recent call last): 
I        File "<string>", line 1, in <module> 
I        File "/tmp/pip-cVZPmP-build/setup.py", line 204, in <module> 
I          'apache_beam/utils/windowed_value.py', 
I        File "/usr/local/lib/python2.7/dist-packages/Cython/Build/Dependencies.py", line 1039, in cythonize 
I          cythonize_one(*args) 
I        File "/usr/local/lib/python2.7/dist-packages/Cython/Build/Dependencies.py", line 1161, in cythonize_one 
I          raise CompileError(None, pyx_file) 
I      Cython.Compiler.Errors.CompileError: apache_beam/coders/coder_impl.py 
I       
I      ---------------------------------------- 

I assume this is because the new code requires the new Cython version, but the Dataflow workers do not have it.

@tvalentyn Could you upgrade the workers at head to use the 0.28.1 cython version?

@tvalentyn
Contributor

Containers are released and PR #5131 is out to upgrade them on master. We can also apply the changes from that PR here and rerun the postcommit suite.

@aaltay
Member

aaltay commented Apr 16, 2018

Run Python Dataflow ValidatesRunner

@tvalentyn
Contributor

tvalentyn commented Apr 16, 2018

With #5131 merged, we probably need to rebase this PR off the current master, for the ValidatesRunner tests to pass. @RobbeSneyders can we do that please?

@tvalentyn
Contributor

Actually, looking at Jenkins logs I see that Jenkins already merges this PR with latest commit on master when we run a Postcommit suite:

Checking out Revision 1b8df077a60fc0188d786a146d5c5edb9eb2732f (refs/remotes/origin/pr/5053/merge)

git config core.sparsecheckout # timeout=10
git checkout -f 1b8df077a60fc0188d786a146d5c5edb9eb2732f
Commit message: "Merge 6909ff3 into e1c526d"
...

@tvalentyn
Contributor

There is a proto generation error in the ValidatesRunner tests, which seems to be unrelated to this PR; I also see that error in other Postcommit test suites.

@RobbeSneyders
Contributor Author

I've rebased the branch anyway. Is there anything else I should do before we can merge? Like squash some commits?

@tvalentyn
Contributor

Yes, please squash the commits. I am looking into the postcommit issue; hopefully it does not affect this PR.
Thank you.

@tvalentyn
Contributor

Run Python Dataflow ValidatesRunner

@RobbeSneyders
Contributor Author

Run Python Dataflow ValidatesRunner

@RobbeSneyders
Contributor Author

Run Python Precommit

@aaltay
Member

aaltay commented Apr 18, 2018

retest this please

@aaltay
Member

aaltay commented Apr 18, 2018

@tvalentyn if the test issues are unrelated to this PR and related to the current ongoing Jenkins issues should we move forward with it? (I suggest that we run a few manual tests locally to check whether this PR introduces issues or not.)

@RobbeSneyders
Contributor Author

I have added comments in the file and added a remark in the documentation of the used approach.

@tvalentyn
Contributor

Run Python Dataflow ValidatesRunner

@tvalentyn
Contributor

@RobbeSneyders Thank you for the comments.

@tvalentyn
Contributor

Run Python Dataflow ValidatesRunner

@tvalentyn
Contributor

Run Python Dataflow ValidatesRunner

@tvalentyn
Contributor

The ValidatesRunner suite just takes time to start. It failed due to an unrelated issue; I filed https://issues.apache.org/jira/browse/INFRA-16508 to clarify. I'll do a sanity check locally.

@robertwb
Contributor

robertwb commented May 9, 2018

Regarding items vs. iteritems, let's just go with dict_value.items everywhere. There's little if any need for explicit iteritems in this case (and if dict is typed, I think Cython compiles for k, v in dict_value.items() without the intermediate list anyway).

Though macrobenchmarks can be good for final validation, this is the kind of code that could probably benefit from some pretty simple microbenchmarks.

@aaltay
Member

aaltay commented May 10, 2018

We chatted with @tvalentyn. The test failure is unrelated and @tvalentyn will implement @robertwb's suggestions in a follow up PR. Merging this now to unblock progress.

Thank you @RobbeSneyders and @tvalentyn for pushing it thus far!

@tvalentyn
Contributor

An update on dict.iteritems vs dict.items() vs future.utils.iteritems(dict) - I did more performance testing of encode-decode operation using a microbenchmark (currently in flight: #5565).

I don't observe a difference in performance of dict.iteritems() and future.utils.iteritems(dict).

As far as dict.items() vs dict.iteritems() goes, I saw a 2x performance slowdown in the coder implementation with dict.items() for dictionaries with over 100000 entries, but did not observe a significant difference on dictionaries with 10000 entries or fewer. That said, I think it would not hurt to keep using iteritems() for Python 2 as we do now.

With future.utils.iteritems():

Median time cost:
Dict[int, int], FastPrimitiveCoder         : per element median time cost: 3.27529e-07 sec                                                  

With dict.iteritems():

Median time cost:
Dict[int, int], FastPrimitiveCoder         : per element median time cost: 3.4485e-07 sec

With dict.items():

Median time cost:
Dict[int, int], FastPrimitiveCoder         : per element median time cost: 7.3393e-07 sec

I also observe a 2.5x degradation in coder implementation with builtins.range() compared to range() on lists as small as 1000 - 10000 elements. I did not try smaller lists.

With 10000 elements, Python 2 range():

Median time cost:
List[int], FastPrimitiveCoder              : per element median time cost: 1.17695e-07 sec

With builtins.range():

Median time cost:
List[int], FastPrimitiveCoder              : per element median time cost: 3.22402e-07 sec

We should try to use microbenchmarks for performance evaluations moving forward since they can provide feedback in a matter of seconds.
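
A microbenchmark of this kind can be as small as the sketch below, which reports a median per-element time in the same shape as the numbers above. It uses only the standard timeit module; `median_time_per_element` and `iterate_items` are hypothetical names, and this is not the benchmark from #5565.

```python
import timeit


def median_time_per_element(func, num_elements, repeats=5):
  """Hypothetical helper: median wall time per element over several runs."""
  timings = sorted(timeit.repeat(func, number=1, repeat=repeats))
  return timings[len(timings) // 2] / num_elements


data = {i: i for i in range(10000)}


def iterate_items():
  """Walks the dictionary the way an encoding loop would."""
  total = 0
  for k, v in data.items():
    total += k + v
  return total


cost = median_time_per_element(iterate_items, len(data))
print('per element median time cost: %g sec' % cost)
```

Taking the median over several repeats reduces the influence of one-off scheduling noise on a single run.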

@robertwb
Contributor

robertwb commented Jun 6, 2018

dict.items() should have the same performance when compiled with Cython, as it unpacks it into item-iterating code without the function call (or intermediate object generation). Having two cases for Py2 and Py3 will break this optimization, and the case-checking code could itself be relatively expensive for small dicts.

@tvalentyn
Contributor

tvalentyn commented Jun 7, 2018

Hmm. I am pretty sure my microbenchmark uses a Cython codepath, since in order for any code change to take effect I have to run python setup.py build_ext --inplace to recompile the associated C extensions. I checked once again and I do see a 2x slowdown with items() once the size of the dictionary exceeds 100000 elements. Here's the code generated by Cython: https://docs.google.com/document/d/1S-oeqJGiMHt_L3iCgr9dYfQdR0_ukQE25mcvK-BqudU/edit#heading=h.drcukhvo4hd6. Perhaps the slowdown is related to materializing the list of keys?

My microbenchmark setup is here: https://github.com/apache/beam/compare/master...tvalentyn:coders_dict_microbencmark?expand=1

@tvalentyn
Contributor

For the record, #5586 improves the performance of dict.items() with a Cython directive to use Python 3 interpretation.
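
For context, Cython reads compiler directives from special comments at the top of a module. The sketch below assumes the directive in question is language_level=3, which is Cython's standard way of requesting Python 3 semantics; the exact change made in #5586 is not shown in this thread. `sum_items` is a hypothetical function with the same loop shape as the dict encoding code.

```python
# cython: language_level=3
# Assumed directive: asks Cython to compile this module with Python 3
# semantics. Under the plain Python interpreter it is an ordinary comment.


def sum_items(dict_value):
  """Hypothetical loop with the same shape as the coder's dict loop."""
  total = 0
  for k, v in dict_value.items():
    total += k + v
  return total


print(sum_items({1: 2, 3: 4}))  # 10
```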
