[BEAM-4000] Futurize io subpackage #5715

Fematich · 2018-06-21T15:29:51Z

This pull request prepares the io subpackage for Python 3 support. This PR is part of a series in which all subpackages will be updated using the same approach.
This approach has been documented here, and the first pull request in the series (Futurize coders subpackage), which demonstrates it, can be found at #5053.

R: @aaltay @tvalentyn @RobbeSneyders @charlesccychen

superbobry · 2018-06-29T07:23:09Z

sdks/python/apache_beam/io/avroio.py

 from apache_beam.io.iobase import Read
 from apache_beam.transforms import PTransform

+standard_library.install_aliases()


Is this needed?

This is needed for import io, but is misplaced.

io is available starting from Python 2.6.

I removed all the install_aliases caused by io
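For reference, a minimal sketch of the point made above: `io` is a plain standard-library module on Python 2.6+ and Python 3, so importing it does not depend on `standard_library.install_aliases()`. The buffer contents are illustrative only and are not taken from the PR.

```python
import io  # plain standard-library import; no aliasing hook is involved

# BytesIO is the bytes-oriented in-memory buffer on both Python 2.6+ and 3.
buf = io.BytesIO()
buf.write(b"record-1\n")
buf.write(b"record-2\n")

# Rewind and read everything back as a list of byte strings.
buf.seek(0)
lines = buf.read().splitlines()
```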

superbobry · 2018-06-29T07:29:54Z

sdks/python/apache_beam/io/filebasedsink.py


      exception_batches = util.run_using_threadpool(
-          _rename_batch, zip(source_file_batch, destination_file_batch),
+          _rename_batch, list(zip(source_file_batch, destination_file_batch)),


IIUC run_using_threadpool accepts any iterable, so forcing a list is redundant.

Reading util.run_using_threadpool, it looks like the code calls len(inputs), which may not be compatible with all iterables.

Ack, please discard my initial comment.
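The exchange above can be illustrated with a small sketch. `run_in_batches` is a hypothetical stand-in, not the real `util.run_using_threadpool`; the only assumption carried over from the thread is that the helper calls `len(inputs)`. The file names are made up.

```python
def run_in_batches(fn, inputs):
    """Hypothetical helper that, like the one discussed, calls len(inputs)."""
    if len(inputs) == 0:  # raises TypeError for a Python 3 zip iterator
        return []
    return [fn(item) for item in inputs]


sources = ["a.tmp", "b.tmp"]
destinations = ["a.out", "b.out"]

try:
    run_in_batches(lambda pair: pair, zip(sources, destinations))
    lazy_zip_ok = True   # Python 2: zip() already returns a list
except TypeError:
    lazy_zip_ok = False  # Python 3: a zip object has no len()

# Materializing the pairs works on both major versions.
renamed = run_in_batches(lambda pair: pair, list(zip(sources, destinations)))
```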

superbobry · 2018-06-29T07:34:50Z

sdks/python/apache_beam/io/filebasedsource.py

 __all__ = ['FileBasedSource']

+try:
+  unicode       # pylint: disable=unicode-builtin


How about

from past.builtins import long, unicode

This applies to other similar try-except imports in this (and related) PRs.

I used the try/except method to be consistent with the methodology implemented in #5053 and outlined in the doc by @RobbeSneyders. I believe there were some issues in the coder package with the typechecks when using past.builtins.

I'd humbly suggest reconsidering (unless there are indeed issues with using past in some modules) as try-except is much more verbose syntactically.

The try/except was indeed used to prevent problems with typechecking when using builtins from the future package.
However, I tested this again, and it seems that there are no such problems using the past.builtins.

I have created the issue BEAM-4730 to apply this change to already ported packages.

This is done in #5869.

Great, thank you!
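For readers comparing the two styles discussed in this thread, here is a sketch of the NameError-driven idiom next to the one-line import that replaces it. The `is_text` helper is hypothetical and exists only to exercise the name; the `past.builtins` import is shown as a comment because it needs the third-party `future` package.

```python
# NameError-driven dispatch, the style used earlier in the series.
try:
    unicode  # Python 2: the builtin exists, nothing to do
except NameError:
    unicode = str  # Python 3: text strings are plain str

# The replacement suggested in the review is the single line
#   from past.builtins import unicode
# which requires the `future` package to be installed.


def is_text(value):
    """Hypothetical helper: True for text strings, False for bytes."""
    return isinstance(value, unicode)
```

The try/except form costs four lines per name, which is the verbosity the reviewer objected to.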

superbobry · 2018-06-29T07:36:09Z

sdks/python/apache_beam/io/filebasedsource.py

  def __init__(self, file_based_source, file_name, start_offset, stop_offset,
               min_bundle_size=0, splittable=True):
-    if not isinstance(start_offset, integer_types):
+    if not isinstance(start_offset, (int, long)):


If you import from past you can replace this with just int.

See http://python-future.org/compatible_idioms.html#long-integers

If you want to replace this with just int, we would have to import int from (future.)builtins, which gives problems when used for typechecks. To stay consistent with other modules, which do use typechecks, I would advise against this.

Instead, we can replace the try/except block with from past.builtins import long as mentioned in the comment above.
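A sketch of the `(int, long)` check discussed above, written so that it runs without the `future` package. `check_offset` and its error message are hypothetical; in the PR the name `long` would come from `from past.builtins import long` instead of the fallback shown here.

```python
try:
    long  # Python 2 has a separate long type
except NameError:
    long = int  # Python 3: int already has arbitrary precision


def check_offset(name, value):
    """Hypothetical validator mirroring the isinstance check in the diff."""
    if not isinstance(value, (int, long)):
        raise TypeError("%s must be an integer, got %r" % (name, value))
    return value
```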

superbobry · 2018-06-29T07:37:10Z

sdks/python/apache_beam/io/filesystem.py

    if self.readable():
      self._read_size = read_size
-      self._read_buffer = cStringIO.StringIO()
+      self._read_buffer = BytesIO()


Consistency nitpick: other modules use io in a qualified way.

superbobry · 2018-06-29T07:38:00Z

sdks/python/apache_beam/io/filesystems_test.py

                                 r'^Unable to get the Filesystem') as error:
      FileSystems.match([None])
-    self.assertEqual(error.exception.exception_details.keys(), [None])
+    self.assertEqual(list(error.exception.exception_details.keys()), [None])


Consider using list(d) instead of list(d.keys()).
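A short sketch of why the test needed a change and why `list(d)` is enough. The dictionary contents are illustrative, not the real `exception_details`.

```python
exception_details = {None: "Unable to get the Filesystem"}  # illustrative data

# On Python 3, keys() returns a view object, which never equals a list.
view_equals_list = (exception_details.keys() == [None])

# Iterating a dict yields its keys, so both spellings build the same list.
keys_from_dict = list(exception_details)
keys_from_view = list(exception_details.keys())
```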

superbobry · 2018-06-29T07:39:54Z

sdks/python/apache_beam/io/gcp/datastore/v1/helper.py

    return resp

+  def __next__(self):
+    return next(self.__iter__())


This will always return the first element. Is this the expected behaviour?

No, this is indeed an error. This method is also superfluous now if I use your suggestion for the test in #5715 (comment). I removed it.
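The bug flagged above can be reproduced with a small sketch. `BrokenIterator` is a hypothetical class, not the real `QueryIterator`; it only assumes, as in the diff, that `__iter__` is a generator and that `__next__` delegates to a fresh call of it.

```python
class BrokenIterator(object):
    """Hypothetical class reproducing the __next__ pattern from the diff."""

    def __init__(self, items):
        self._items = items

    def __iter__(self):
        for item in self._items:
            yield item

    def __next__(self):
        # Each call builds a brand-new generator, so iteration never advances.
        return next(self.__iter__())


broken = BrokenIterator([1, 2, 3])
repeated = [broken.__next__() for _ in range(3)]

# Holding on to a single generator advances as expected.
single = iter(BrokenIterator([1, 2, 3]))
advanced = [next(single) for _ in range(3)]
```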

superbobry · 2018-06-29T07:41:39Z

sdks/python/apache_beam/io/gcp/datastore/v1/helper_test.py

    query_iterator = helper.QueryIterator("project", None, self._query,
                                          self._mock_datastore)
-    self.assertRaises(RPCError, iter(query_iterator).next)
+    self.assertRaises(RPCError, query_iterator.__next__)


This will not work on Python2. How about

from future.builtins import next
self.assertRaises(..., lambda: next(query_iterator))
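A self-contained sketch of a test written in this style. It swaps in the plain builtin `next()` (available since Python 2.6) instead of `future.builtins.next`, and `FailingQuery` plus the `RuntimeError` are hypothetical stand-ins for `QueryIterator` and `RPCError`.

```python
import unittest


class FailingQuery(object):
    """Hypothetical iterable whose first fetch fails."""

    def __iter__(self):
        raise RuntimeError("simulated RPC failure")
        yield  # unreachable, but makes __iter__ a generator function


class FailingQueryTest(unittest.TestCase):

    def test_first_fetch_raises(self):
        query_iterator = iter(FailingQuery())
        # next() dispatches to .next() on Python 2 and __next__() on Python 3,
        # so the test does not name either method directly.
        self.assertRaises(RuntimeError, lambda: next(query_iterator))
```

Because `__iter__` is a generator, the error surfaces on the first `next()` call and not when `iter()` is called.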

superbobry · 2018-06-29T07:42:38Z

sdks/python/apache_beam/io/gcp/gcsio.py

    return self._size

  def get_range(self, start, end):
+    self._download_stream.seek(0)


Why is this needed in Python 3?

Difference in behavior between cStringIO and BytesIO. In BytesIO truncate doesn't move the file pointer, see here

Yes, thanks for the clarification.
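The behavior described above can be checked with a few lines on Python 3. This is a sketch with made-up contents, not the actual `get_range` code; it only shows what happens when `truncate` is called without rewinding first.

```python
import io

stream = io.BytesIO()
stream.write(b"hello")

stream.truncate(0)  # contents are dropped, but the position stays at 5
position_after_truncate = stream.tell()

stream.write(b"x")  # written at offset 5; the gap is filled with zero bytes
padded = stream.getvalue()

stream.seek(0)      # rewind first, as the added line in the diff does
stream.truncate(0)
stream.write(b"x")
clean = stream.getvalue()
```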

superbobry · 2018-06-29T07:43:29Z

sdks/python/apache_beam/io/gcp/pubsub.py

  pubsub_pb2 = None

+try:
+  basestring


from past.builtins import basestring

charlesccychen

Thanks! Also, please rebase to head.

charlesccychen · 2018-06-26T22:22:23Z

sdks/python/apache_beam/io/avroio.py

 from apache_beam.io.iobase import Read
 from apache_beam.transforms import PTransform

+standard_library.install_aliases()


Should we do this before import io on line 46? This only works now because, given the current arbitrary import order, some other module has already called install_aliases() before we get to line 46.

charlesccychen · 2018-06-26T22:26:05Z

sdks/python/apache_beam/io/filebasedsource_test.py

 from apache_beam.transforms.display import DisplayData
 from apache_beam.transforms.display_test import DisplayDataItemMatcher

+standard_library.install_aliases()


Should we do this before import io on line 21? This only works now because, given the current arbitrary import order, some other module has already called install_aliases() before we get to line 21.

charlesccychen · 2018-06-26T22:29:43Z

sdks/python/apache_beam/io/filesystem.py


 from apache_beam.utils.plugin import BeamPlugin

+standard_library.install_aliases()


Should we do this before import io?

charlesccychen · 2018-06-26T22:29:48Z

sdks/python/apache_beam/io/filesystem_test.py

 from apache_beam.io.filesystem import FileMetadata
 from apache_beam.io.filesystem import FileSystem

+standard_library.install_aliases()


Should we do this before import io?

charlesccychen · 2018-06-26T22:29:53Z

sdks/python/apache_beam/io/gcp/gcsio.py

 from apache_beam.io.filesystemio import UploaderStream
 from apache_beam.utils import retry

+standard_library.install_aliases()


Should we do this before import io?

charlesccychen · 2018-06-26T22:29:59Z

sdks/python/apache_beam/io/tfrecordio_test.py

 from apache_beam.testing.util import assert_that
 from apache_beam.testing.util import equal_to

+standard_library.install_aliases()


Should we do this before import io?

charlesccychen · 2018-06-26T22:30:41Z

sdks/python/apache_beam/io/gcp/gcsio.py

    return self._size

  def get_range(self, start, end):
+    self._download_stream.seek(0)


This is fine, but I'm curious why you had to make this change--did you encounter an error before this?

Yes, this is a difference in behavior between cStringIO and BytesIO. In BytesIO truncate doesn't move the file pointer, see here

charlesccychen · 2018-07-03T00:01:23Z

sdks/python/apache_beam/io/avroio.py

 from apache_beam.io.iobase import Read
 from apache_beam.transforms import PTransform

+standard_library.install_aliases()


This is needed for import io, but is misplaced.

charlesccychen · 2018-07-03T00:02:54Z

sdks/python/apache_beam/io/filebasedsink.py


      exception_batches = util.run_using_threadpool(
-          _rename_batch, zip(source_file_batch, destination_file_batch),
+          _rename_batch, list(zip(source_file_batch, destination_file_batch)),


Reading util.run_using_threadpool, it looks like the code calls len(inputs), which may not be compatible with all iterables.

charlesccychen

Thank you! This LGTM. (ccy-benchmark-ok)

tvalentyn · 2018-08-04T00:18:34Z

sdks/python/apache_beam/io/gcp/pubsub.py

+from builtins import object

-from six import text_type
+from past.builtins import basestring


@Fematich
Do you remember why we decided to use past.builtins.basestring here instead of past.builtins.unicode?

It seems that on Python 2 this change translates to changing the element_type type from unicode to bytes/str in line 176.
@chamikaramj Do you have an intuition if this could bite us?
CC: @altay

Correction: past.builtins.basestring is configured to mimic a superclass of both str and unicode in Python 2. So possibly there is no significant change in behavior but I am not 100% sure.

It's indeed a superclass of both str and unicode. I only observed a real need for this superclass in #5373 (comment). For this example, however, past.builtins.unicode seems like a more compatible replacement (https://pythonhosted.org/six/#six.text_type), since I can't recall an explicit reason to use basestring instead of unicode here.

#6144 cleans this up here as well as in the place you mentioned, since we have reverted # cython: language_level=3.

superbobry suggested changes Jun 29, 2018

charlesccychen reviewed Jul 3, 2018

Fematich and others added 2 commits July 6, 2018 11:35

Futurize io subpackage

8bf17d4

incorporated all feedback for futurize io subpackage

071bc35

Fematich force-pushed the io branch from 81406c6 to 071bc35 on July 6, 2018 10:10

charlesccychen approved these changes Jul 6, 2018

charlesccychen merged commit 6f6feaa into apache:master Jul 6, 2018

Fematich deleted the io branch July 10, 2018 18:59

Fematich mentioned this pull request Jul 10, 2018

[BEAM-4751] fix missing pylint3 check for io subpackage #5916

Merged

RobbeSneyders mentioned this pull request Jul 12, 2018

[BEAM-1251] Replace NameError-driven dispatch with past #5869

Merged

tvalentyn reviewed Aug 4, 2018
