-
Notifications
You must be signed in to change notification settings - Fork 4.5k
[BEAM-4000] Futurize io subpackage #5715
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
sdks/python/apache_beam/io/avroio.py
Outdated
| from apache_beam.io.iobase import Read | ||
| from apache_beam.transforms import PTransform | ||
|
|
||
| standard_library.install_aliases() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is needed for import io, but is misplaced.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
io is available starting from Python 2.6.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed all the install_aliases caused by io
|
|
||
| exception_batches = util.run_using_threadpool( | ||
| _rename_batch, zip(source_file_batch, destination_file_batch), | ||
| _rename_batch, list(zip(source_file_batch, destination_file_batch)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC run_using_threadpool accepts any iterable, so forcing a list is redundant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading util.run_using_threadpool, it looks like the code calls len(inputs), which may not be compatible with all iterables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack, please discard my initial comment.
| __all__ = ['FileBasedSource'] | ||
|
|
||
| try: | ||
| unicode # pylint: disable=unicode-builtin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about
from past.builtins import long, unicodeThis applies to other similar try-except imports in this (and related) PRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used the try/except method to be consistent with the methodology implemented in #5053 and outlined in the doc by @RobbeSneyders. I believe there where some issues in the coder package for the typechecks using past.builtins
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd humbly suggest reconsidering (unless there are indeed issues with using past in some modules) as try-except is much more verbose syntactically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The try/except was indeed used to prevent problems with typechecking when using builtins from the future package.
However, I tested this again, and it seems that there are no such problems using the past.builtins.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have created the issue BEAM-4730 to apply this change to already ported packages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is done in #5869.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thank you!
| def __init__(self, file_based_source, file_name, start_offset, stop_offset, | ||
| min_bundle_size=0, splittable=True): | ||
| if not isinstance(start_offset, integer_types): | ||
| if not isinstance(start_offset, (int, long)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you import from past you can replace this with just int.
See http://python-future.org/compatible_idioms.html#long-integers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to replace this with just int, we would have to import int from (future.)builtins, which gives problems when used for typechecks. To stay consistent with other modules, which do use typechecks, I would advice against this.
Instead, we can replace the try/except block with from past.builtins import long as mentioned in the comment above.
| if self.readable(): | ||
| self._read_size = read_size | ||
| self._read_buffer = cStringIO.StringIO() | ||
| self._read_buffer = BytesIO() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consistency nitpick: other modules use io in a qualified way.
| r'^Unable to get the Filesystem') as error: | ||
| FileSystems.match([None]) | ||
| self.assertEqual(error.exception.exception_details.keys(), [None]) | ||
| self.assertEqual(list(error.exception.exception_details.keys()), [None]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using list(d) instead of list(d.keys()).
| return resp | ||
|
|
||
| def __next__(self): | ||
| return next(self.__iter__()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will always return the first element. Is this the expected behaviour?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No this is indeed an error. This method is actually now also superfluous if I use your suggestion for the test in #5715 (comment). I removed it
| query_iterator = helper.QueryIterator("project", None, self._query, | ||
| self._mock_datastore) | ||
| self.assertRaises(RPCError, iter(query_iterator).next) | ||
| self.assertRaises(RPCError, query_iterator.__next__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will not work on Python2. How about
from future.builtins import next
self.assertRaises(..., lambda: next(query_iterator))| return self._size | ||
|
|
||
| def get_range(self, start, end): | ||
| self._download_stream.seek(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed in Python 3?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Difference in behavior between cStringIO and BytesIO. In BytesIO truncate doesn't move the file pointer, see here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thanks for the clarification.
| pubsub_pb2 = None | ||
|
|
||
| try: | ||
| basestring |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from past.builtins import basestring
charlesccychen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Also, please rebase to head.
sdks/python/apache_beam/io/avroio.py
Outdated
| from apache_beam.io.iobase import Read | ||
| from apache_beam.transforms import PTransform | ||
|
|
||
| standard_library.install_aliases() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we do this before import io on line 46? This only works now because by the arbitrary current import order, some other module has called install_aliases() before we get to line 46.
| from apache_beam.transforms.display import DisplayData | ||
| from apache_beam.transforms.display_test import DisplayDataItemMatcher | ||
|
|
||
| standard_library.install_aliases() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we do this before import io on line 21? This only works now because by the arbitrary current import order, some other module has called install_aliases() before we get to line 21.
|
|
||
| from apache_beam.utils.plugin import BeamPlugin | ||
|
|
||
| standard_library.install_aliases() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we do this before import io?
| from apache_beam.io.filesystem import FileMetadata | ||
| from apache_beam.io.filesystem import FileSystem | ||
|
|
||
| standard_library.install_aliases() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we do this before import io?
| from apache_beam.io.filesystemio import UploaderStream | ||
| from apache_beam.utils import retry | ||
|
|
||
| standard_library.install_aliases() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we do this before import io?
| from apache_beam.testing.util import assert_that | ||
| from apache_beam.testing.util import equal_to | ||
|
|
||
| standard_library.install_aliases() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we do this before import io?
| return self._size | ||
|
|
||
| def get_range(self, start, end): | ||
| self._download_stream.seek(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine, but I'm curious why you had to make this change--did you encounter an error before this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is a difference in behavior between cStringIO and BytesIO. In BytesIO truncate doesn't move the file pointer, see here
sdks/python/apache_beam/io/avroio.py
Outdated
| from apache_beam.io.iobase import Read | ||
| from apache_beam.transforms import PTransform | ||
|
|
||
| standard_library.install_aliases() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is needed for import io, but is misplaced.
|
|
||
| exception_batches = util.run_using_threadpool( | ||
| _rename_batch, zip(source_file_batch, destination_file_batch), | ||
| _rename_batch, list(zip(source_file_batch, destination_file_batch)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading util.run_using_threadpool, it looks like the code calls len(inputs), which may not be compatible with all iterables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! This LGTM. (ccy-benchmark-ok)
| from builtins import object | ||
|
|
||
| from six import text_type | ||
| from past.builtins import basestring |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Fematich
Do you remember, why did we decide to use past.builtins.basestring here instead of past.builtins.unicode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that on Python 2 this change translates to changing the element_type type from unicode to bytes/str in line 176.
@chamikaramj Do you have an intuition if this could bite us?
CC: @altay
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correction: past.builtins.basestring is configured to mimic a superclass of both str and unicode in Python 2. So possibly there is no significant change in behavior but I am not 100% sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's indeed a superclass of both str and unicode. Only in #5373 (comment) I observed the real need to use this superclass. However for this example, it seems like past.builtins.unicode would be a more compatible replacement (https://pythonhosted.org/six/#six.text_type) since I can't recall an explicit reason to use basestring instead of unicode here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#6144 cleans this up here as well as in the place you mentioned, since we have reverted # cython: language_level=3.
This pull request prepares the io subpackage for Python 3 support. This PR is part of a series in which all subpackages will be updated using the same approach.
This approach has been documented here and the first pull request in the series (Futurize coders subpackage) demonstrating this approach can be found at #5053.
R: @aaltay @tvalentyn @RobbeSneyders @charlesccychen