Skip to content

Conversation

@Fematich
Copy link
Contributor

@Fematich Fematich commented Jun 21, 2018

This pull request prepares the io subpackage for Python 3 support. This PR is part of a series in which all subpackages will be updated using the same approach.
This approach has been documented here and the first pull request in the series (Futurize coders subpackage) demonstrating this approach can be found at #5053.

R: @aaltay @tvalentyn @RobbeSneyders @charlesccychen

from apache_beam.io.iobase import Read
from apache_beam.transforms import PTransform

standard_library.install_aliases()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needed for import io, but is misplaced.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

io is available starting from Python 2.6.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed all the install_aliases caused by io


exception_batches = util.run_using_threadpool(
_rename_batch, zip(source_file_batch, destination_file_batch),
_rename_batch, list(zip(source_file_batch, destination_file_batch)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC run_using_threadpool accepts any iterable, so forcing a list is redundant.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading util.run_using_threadpool, it looks like the code calls len(inputs), which may not be compatible with all iterables.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack, please discard my initial comment.

__all__ = ['FileBasedSource']

try:
unicode # pylint: disable=unicode-builtin
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about

from past.builtins import long, unicode

This applies to other similar try-except imports in this (and related) PRs.

Copy link
Contributor Author

@Fematich Fematich Jul 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used the try/except method to be consistent with the methodology implemented in #5053 and outlined in the doc by @RobbeSneyders. I believe there where some issues in the coder package for the typechecks using past.builtins

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd humbly suggest reconsidering (unless there are indeed issues with using past in some modules) as try-except is much more verbose syntactically.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The try/except was indeed used to prevent problems with typechecking when using builtins from the future package.
However, I tested this again, and it seems that there are no such problems using the past.builtins.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have created the issue BEAM-4730 to apply this change to already ported packages.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done in #5869.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thank you!

def __init__(self, file_based_source, file_name, start_offset, stop_offset,
min_bundle_size=0, splittable=True):
if not isinstance(start_offset, integer_types):
if not isinstance(start_offset, (int, long)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you import from past you can replace this with just int.

See http://python-future.org/compatible_idioms.html#long-integers

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to replace this with just int, we would have to import int from (future.)builtins, which gives problems when used for typechecks. To stay consistent with other modules, which do use typechecks, I would advice against this.

Instead, we can replace the try/except block with from past.builtins import long as mentioned in the comment above.

if self.readable():
self._read_size = read_size
self._read_buffer = cStringIO.StringIO()
self._read_buffer = BytesIO()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consistency nitpick: other modules use io in a qualified way.

r'^Unable to get the Filesystem') as error:
FileSystems.match([None])
self.assertEqual(error.exception.exception_details.keys(), [None])
self.assertEqual(list(error.exception.exception_details.keys()), [None])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using list(d) instead of list(d.keys()).

return resp

def __next__(self):
return next(self.__iter__())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will always return the first element. Is this the expected behaviour?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No this is indeed an error. This method is actually now also superfluous if I use your suggestion for the test in #5715 (comment). I removed it

query_iterator = helper.QueryIterator("project", None, self._query,
self._mock_datastore)
self.assertRaises(RPCError, iter(query_iterator).next)
self.assertRaises(RPCError, query_iterator.__next__)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will not work on Python2. How about

from future.builtins import next

self.assertRaises(..., lambda: next(query_iterator))

return self._size

def get_range(self, start, end):
self._download_stream.seek(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed in Python 3?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Difference in behavior between cStringIO and BytesIO. In BytesIO truncate doesn't move the file pointer, see here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thanks for the clarification.

pubsub_pb2 = None

try:
basestring
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from past.builtins import basestring

Copy link
Contributor

@charlesccychen charlesccychen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Also, please rebase to head.

from apache_beam.io.iobase import Read
from apache_beam.transforms import PTransform

standard_library.install_aliases()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do this before import io on line 46? This only works now because by the arbitrary current import order, some other module has called install_aliases() before we get to line 46.

from apache_beam.transforms.display import DisplayData
from apache_beam.transforms.display_test import DisplayDataItemMatcher

standard_library.install_aliases()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do this before import io on line 21? This only works now because by the arbitrary current import order, some other module has called install_aliases() before we get to line 21.


from apache_beam.utils.plugin import BeamPlugin

standard_library.install_aliases()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do this before import io?

from apache_beam.io.filesystem import FileMetadata
from apache_beam.io.filesystem import FileSystem

standard_library.install_aliases()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do this before import io?

from apache_beam.io.filesystemio import UploaderStream
from apache_beam.utils import retry

standard_library.install_aliases()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do this before import io?

from apache_beam.testing.util import assert_that
from apache_beam.testing.util import equal_to

standard_library.install_aliases()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do this before import io?

return self._size

def get_range(self, start, end):
self._download_stream.seek(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine, but I'm curious why you had to make this change--did you encounter an error before this?

Copy link
Contributor Author

@Fematich Fematich Jul 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a difference in behavior between cStringIO and BytesIO. In BytesIO truncate doesn't move the file pointer, see here

from apache_beam.io.iobase import Read
from apache_beam.transforms import PTransform

standard_library.install_aliases()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needed for import io, but is misplaced.


exception_batches = util.run_using_threadpool(
_rename_batch, zip(source_file_batch, destination_file_batch),
_rename_batch, list(zip(source_file_batch, destination_file_batch)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading util.run_using_threadpool, it looks like the code calls len(inputs), which may not be compatible with all iterables.

Copy link
Contributor

@charlesccychen charlesccychen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! This LGTM. (ccy-benchmark-ok)

from builtins import object

from six import text_type
from past.builtins import basestring
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fematich
Do you remember, why did we decide to use past.builtins.basestring here instead of past.builtins.unicode?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that on Python 2 this change translates to changing the element_type type from unicode to bytes/str in line 176.
@chamikaramj Do you have an intuition if this could bite us?
CC: @altay

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correction: past.builtins.basestring is configured to mimic a superclass of both str and unicode in Python 2. So possibly there is no significant change in behavior but I am not 100% sure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's indeed a superclass of both str and unicode. Only in #5373 (comment) I observed the real need to use this superclass. However for this example, it seems like past.builtins.unicode would be a more compatible replacement (https://pythonhosted.org/six/#six.text_type) since I can't recall an explicit reason to use basestring instead of unicode here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#6144 cleans this up here as well as in the place you mentioned, since we have reverted # cython: language_level=3.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants