Skip to content

Conversation

@chadrik
Copy link
Contributor

@chadrik chadrik commented Aug 15, 2019

Please add a meaningful description for your change here


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • [x ] Choose reviewer(s) and mention them in a comment (R: @username).
  • [ x] Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • [ x] If this contribution is large, please file an Apache Individual Contributor License Agreement.

Post-Commit Tests Status (on master branch)

Lang SDK Apex Dataflow Flink Gearpump Samza Spark
Go Build Status --- --- Build Status --- --- Build Status
Java Build Status Build Status Build Status Build Status
Build Status
Build Status
Build Status Build Status Build Status
Build Status
Python Build Status
Build Status
Build Status
Build Status
--- Build Status
Build Status
Build Status --- --- Build Status
XLang --- --- --- Build Status --- --- ---

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website
Non-portable Build Status Build Status Build Status Build Status
Portable --- Build Status --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

@chadrik
Copy link
Contributor Author

chadrik commented Aug 15, 2019

R: @robertwb

Copy link
Contributor

@robertwb robertwb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the user specifies the type as List, should we ensure that we always return a List? (I.e should we explicitly state our assumption here that IterableCoder returns a list and also adda note in the implementation of IterableCoderImpl that if this changes we need to create a separate ListCoder? Or should we go ahead and create a separate ListCoder now?

@chadrik
Copy link
Contributor Author

chadrik commented Aug 27, 2019

If the user specifies the type as List, should we ensure that we always return a List? (I.e should we explicitly state our assumption here that IterableCoder returns a list and also adda note in the implementation of IterableCoderImpl that if this changes we need to create a separate ListCoder? Or should we go ahead and create a separate ListCoder now?

I dug into this a bit more and realized that I was mistaken about my assertion that IterableCoderImpl.decode always return a list. Here's the relevant code from the base class:

class SequenceCoderImpl(StreamCoderImpl):

  def __init__(self, elem_coder,
               read_state=None, write_state=None, write_state_threshold=0):
    self._elem_coder = elem_coder
    self._read_state = read_state
    self._write_state = write_state
    self._write_state_threshold = write_state_threshold

  ...

  def decode_from_stream(self, in_stream, nested):
    size = in_stream.read_bigendian_int32()

    if size >= 0:
      elements = [self._elem_coder.decode_from_stream(in_stream, True)
                  for _ in range(size)]
    else:
      elements = []
      count = in_stream.read_var_int64()
      while count > 0:
        for _ in range(count):
          elements.append(self._elem_coder.decode_from_stream(in_stream, True))
        count = in_stream.read_var_int64()

      if count == -1:
        if self._read_state is None:
          raise ValueError(
              'Cannot read state-written iterable without state reader.')

        state_token = in_stream.read_all(True)
        elements = _ConcatSequence(
            elements, self._read_state(state_token, self._elem_coder))

    return self._construct_from_sequence(elements)

_ConcatSequence and self._read_state are both iterators.

Some options:

  • create a ListCoder: the problem is this won't be portable, which defeats the original motivation here, for external transforms
  • make IterableCoder always return a list: It usually returns a list, but returns an iterator if a read_state is provided (not sure when this happens). In that case it seems like it progressively yields elements as they are decoded, but I'm not 100% clear on that. If that is the case, converting to a list on decode would undermine the usefulness of this feature.
  • leave it as is: list is iterable after all, and in the case that the user provides a list to encode, they will get a list back, since there shouldn't be a read_state in that case.

So as it is, IterableCoder is true to its name in that it really only guarantees that the decoded value is iterable. However, as the tests that I added imply, in the typical case, the result is actually a list. So I think the best thing to do is for me to update the test that I added to assert that the returned type is a list, so that if that ever changes, someone is forced to consider it.

@robertwb
Copy link
Contributor

Yeah, leaving as is, with some comments, seems the best path forward.

@chadrik chadrik force-pushed the python-typecoders-list branch from 5568da6 to f88989b Compare August 29, 2019 00:22
@chadrik
Copy link
Contributor Author

chadrik commented Aug 29, 2019

Addressed

@robertwb
Copy link
Contributor

Run Python PreCommit

Copy link
Contributor

@robertwb robertwb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM pending tests passing.

@chadrik
Copy link
Contributor Author

chadrik commented Aug 30, 2019

I find the python precommit test rather inscrutable. It seems that it always fails on :sdks:python:setupVirtualenv and I don't understand whether I should care or not.

@mxm
Copy link
Contributor

mxm commented Aug 30, 2019

Run Python PreCommit

@mxm
Copy link
Contributor

mxm commented Aug 30, 2019

These failures could be related to the PR changes: https://builds.apache.org/job/beam_PreCommit_Python_Phrase/761/

@robertwb
Copy link
Contributor

robertwb commented Sep 3, 2019

Run Python PreCommit

@robertwb
Copy link
Contributor

robertwb commented Sep 3, 2019

I can't reproduce these test failures locally.

@mxm
Copy link
Contributor

mxm commented Sep 4, 2019

Tests are also failing locally for me with the same error:

TypeError: can only concatenate list (not "bytes") to list [while running 'FormatOutput']```

@chadrik
Copy link
Contributor Author

chadrik commented Sep 4, 2019 via email

@mxm
Copy link
Contributor

mxm commented Sep 4, 2019

Haven't fully figured out how to run a specific test through Gradle. What I do is to run this from the test directory (/sdks/python/test-suites/tox/py2/build/srcs/sdks/python:

python setup.py nosetests --tests apache_beam.testing.synthetic_pipeline_test

You need to activate the virtual environment that Gradle uses first. It's in /build/gradleenv

@chadrik
Copy link
Contributor Author

chadrik commented Sep 4, 2019

I gave up trying to figure out how to run specific tests from gradle as well, so I just created a PR #9474 to enable running individual tests using tox. This should theoretically be the same environment as gradle, since gradle calls tox.

I'm cherry-picking the commit from my PR and running the tests like this:

tox -e py27 -- --tests apache_beam.testing.synthetic_pipeline_test

Running this on my Macbook Pro succeeds, so I'm not sure where to go from here.

@chadrik
Copy link
Contributor Author

chadrik commented Sep 4, 2019

Update: I get the error using python 3.7, and I confirmed that I do not get it on master:

tox -e py37 -- --tests apache_beam.testing.synthetic_pipeline_test.SyntheticPipelineTest

@mxm
Copy link
Contributor

mxm commented Sep 4, 2019

That makes sense. I was testing with 3.7 as well.

@chadrik
Copy link
Contributor Author

chadrik commented Sep 4, 2019

What's happened here is that this change has exposed a difference in the inference code between python2 and python3.

Here's the function that's being analyzed differently:

def rotate_key(element):
  """Returns a new key-value pair of the same size but with a different key."""
  (key, value) = element
  return key[-1:] + key[:-1], value

python3:

In [1]: from apache_beam.typehints.trivial_inference import infer_return_type                                                                                                                                                                                           
In [2]: from apache_beam.testing.synthetic_pipeline import rotate_key                                                                                                                                                                                                   
In [3]: from apache_beam.typehints import Any                                                                                                                                                                                                                           
In [4]: infer_return_type(rotate_key, [Any])                                                                                                                                                                                                                            
Out[4]: Tuple[List[Any], Any]

python2:

In [1]: from apache_beam.typehints.trivial_inference import infer_return_type                                                                                                                                                                                           
In [2]: from apache_beam.testing.synthetic_pipeline import rotate_key                                                                                                                                                                                                   
In [3]: from apache_beam.typehints import Any                                                                                                                                                                                                                           
In [4]: infer_return_type(rotate_key, [Any])                                                                                                                                                                                                                            
Out[4]: Any

The actual return type of rotate_key is Tuple[bytes, bytes]. Previously, List[Any] resulted in the FastPrimitiveCoder, so even though the hint for the first element of the tuple was wrong in python3, the data for that element was properly round-tripped. After this change, the coder for that element became IterableCoder[FastPrimitiveCoder] and so it failed, because the data was in fact bytes.

Personally, it seems like an overreach to infer that key[-1:] + key[:-1] is a List.

Edit: clarity

@robertwb
Copy link
Contributor

robertwb commented Sep 5, 2019

Thanks for tracking this down. #9486

@robertwb
Copy link
Contributor

robertwb commented Sep 5, 2019

Run Python PreCommit

@chadrik
Copy link
Contributor Author

chadrik commented Sep 5, 2019

wow, Python PreCommit takes forever

@chadrik
Copy link
Contributor Author

chadrik commented Sep 6, 2019

yay, it works!

@robertwb
Copy link
Contributor

robertwb commented Sep 6, 2019

Long queue...

@robertwb robertwb merged commit d21bbaf into apache:master Sep 6, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants