[BEAM-7984] The coder returned for typehints.List should be IterableCoder #9344

chadrik · 2019-08-15T02:49:47Z

Please add a meaningful description for your change here

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

[x ] Choose reviewer(s) and mention them in a comment (R: @username).
[ x] Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
[ x] If this contribution is large, please file an Apache Individual Contributor License Agreement.

Post-Commit Tests Status (on master branch)

Lang	SDK	Apex	Dataflow	Gearpump	Samza	Spark
Go		---	---	---	---
Java
Python		---		---	---
XLang	---	---	---	---	---	---

Pre-Commit Tests Status (on master branch)

---	Java	Python	Go	Website
Non-portable
Portable	---		---	---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

chadrik · 2019-08-15T02:50:03Z

R: @robertwb

robertwb

If the user specifies the type as List, should we ensure that we always return a List? (I.e should we explicitly state our assumption here that IterableCoder returns a list and also adda note in the implementation of IterableCoderImpl that if this changes we need to create a separate ListCoder? Or should we go ahead and create a separate ListCoder now?

chadrik · 2019-08-27T01:36:24Z

If the user specifies the type as List, should we ensure that we always return a List? (I.e should we explicitly state our assumption here that IterableCoder returns a list and also adda note in the implementation of IterableCoderImpl that if this changes we need to create a separate ListCoder? Or should we go ahead and create a separate ListCoder now?

I dug into this a bit more and realized that I was mistaken about my assertion that IterableCoderImpl.decode always return a list. Here's the relevant code from the base class:

class SequenceCoderImpl(StreamCoderImpl):

  def __init__(self, elem_coder,
               read_state=None, write_state=None, write_state_threshold=0):
    self._elem_coder = elem_coder
    self._read_state = read_state
    self._write_state = write_state
    self._write_state_threshold = write_state_threshold

  ...

  def decode_from_stream(self, in_stream, nested):
    size = in_stream.read_bigendian_int32()

    if size >= 0:
      elements = [self._elem_coder.decode_from_stream(in_stream, True)
                  for _ in range(size)]
    else:
      elements = []
      count = in_stream.read_var_int64()
      while count > 0:
        for _ in range(count):
          elements.append(self._elem_coder.decode_from_stream(in_stream, True))
        count = in_stream.read_var_int64()

      if count == -1:
        if self._read_state is None:
          raise ValueError(
              'Cannot read state-written iterable without state reader.')

        state_token = in_stream.read_all(True)
        elements = _ConcatSequence(
            elements, self._read_state(state_token, self._elem_coder))

    return self._construct_from_sequence(elements)

_ConcatSequence and self._read_state are both iterators.

Some options:

create a ListCoder: the problem is this won't be portable, which defeats the original motivation here, for external transforms
make IterableCoder always return a list: It usually returns a list, but returns an iterator if a read_state is provided (not sure when this happens). In that case it seems like it progressively yields elements as they are decoded, but I'm not 100% clear on that. If that is the case, converting to a list on decode would undermine the usefulness of this feature.
leave it as is: list is iterable after all, and in the case that the user provides a list to encode, they will get a list back, since there shouldn't be a read_state in that case.

So as it is, IterableCoder is true to its name in that it really only guarantees that the decoded value is iterable. However, as the tests that I added imply, in the typical case, the result is actually a list. So I think the best thing to do is for me to update the test that I added to assert that the returned type is a list, so that if that ever changes, someone is forced to consider it.

robertwb · 2019-08-28T01:43:53Z

Yeah, leaving as is, with some comments, seems the best path forward.

chadrik · 2019-08-29T00:22:31Z

Addressed

robertwb · 2019-08-30T16:26:19Z

Run Python PreCommit

robertwb

LGTM pending tests passing.

chadrik · 2019-08-30T17:59:58Z

I find the python precommit test rather inscrutable. It seems that it always fails on :sdks:python:setupVirtualenv and I don't understand whether I should care or not.

mxm · 2019-08-30T19:28:27Z

Run Python PreCommit

mxm · 2019-08-30T20:40:09Z

These failures could be related to the PR changes: https://builds.apache.org/job/beam_PreCommit_Python_Phrase/761/

robertwb · 2019-09-03T23:29:58Z

Run Python PreCommit

robertwb · 2019-09-03T23:44:44Z

I can't reproduce these test failures locally.

mxm · 2019-09-04T16:36:33Z

Tests are also failing locally for me with the same error:

TypeError: can only concatenate list (not "bytes") to list [while running 'FormatOutput']```

chadrik · 2019-09-04T16:49:22Z

There are a handful of levels from which the python tests can be invoked (gradle, tox, setup.py, nose): can you provide the exact command you're running?

…

On Wed, Sep 4, 2019 at 9:36 AM Maximilian Michels ***@***.***> wrote: Tests are also failing locally for me with the same error: TypeError: can only concatenate list (not "bytes") to list [while running 'FormatOutput']``` — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#9344?email_source=notifications&email_token=AAAPOE6CF4RUU7HEMAGQJTDQH7PZ7A5CNFSM4IL2TD2KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD54GEJQ#issuecomment-527983142>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAAPOE5RXOZFVWDYBVHRSMDQH7PZ7ANCNFSM4IL2TD2A> .

mxm · 2019-09-04T16:54:31Z

Haven't fully figured out how to run a specific test through Gradle. What I do is to run this from the test directory (/sdks/python/test-suites/tox/py2/build/srcs/sdks/python:

python setup.py nosetests --tests apache_beam.testing.synthetic_pipeline_test

You need to activate the virtual environment that Gradle uses first. It's in /build/gradleenv

chadrik · 2019-09-04T17:25:38Z

I gave up trying to figure out how to run specific tests from gradle as well, so I just created a PR #9474 to enable running individual tests using tox. This should theoretically be the same environment as gradle, since gradle calls tox.

I'm cherry-picking the commit from my PR and running the tests like this:

tox -e py27 -- --tests apache_beam.testing.synthetic_pipeline_test

Running this on my Macbook Pro succeeds, so I'm not sure where to go from here.

chadrik · 2019-09-04T17:30:57Z

Update: I get the error using python 3.7, and I confirmed that I do not get it on master:

tox -e py37 -- --tests apache_beam.testing.synthetic_pipeline_test.SyntheticPipelineTest

mxm · 2019-09-04T21:07:27Z

That makes sense. I was testing with 3.7 as well.

chadrik · 2019-09-04T22:32:58Z

What's happened here is that this change has exposed a difference in the inference code between python2 and python3.

Here's the function that's being analyzed differently:

def rotate_key(element):
  """Returns a new key-value pair of the same size but with a different key."""
  (key, value) = element
  return key[-1:] + key[:-1], value

python3:

In [1]: from apache_beam.typehints.trivial_inference import infer_return_type                                                                                                                                                                                           
In [2]: from apache_beam.testing.synthetic_pipeline import rotate_key                                                                                                                                                                                                   
In [3]: from apache_beam.typehints import Any                                                                                                                                                                                                                           
In [4]: infer_return_type(rotate_key, [Any])                                                                                                                                                                                                                            
Out[4]: Tuple[List[Any], Any]

python2:

In [1]: from apache_beam.typehints.trivial_inference import infer_return_type                                                                                                                                                                                           
In [2]: from apache_beam.testing.synthetic_pipeline import rotate_key                                                                                                                                                                                                   
In [3]: from apache_beam.typehints import Any                                                                                                                                                                                                                           
In [4]: infer_return_type(rotate_key, [Any])                                                                                                                                                                                                                            
Out[4]: Any

The actual return type of rotate_key is Tuple[bytes, bytes]. Previously, List[Any] resulted in the FastPrimitiveCoder, so even though the hint for the first element of the tuple was wrong in python3, the data for that element was properly round-tripped. After this change, the coder for that element became IterableCoder[FastPrimitiveCoder] and so it failed, because the data was in fact bytes.

Personally, it seems like an overreach to infer that key[-1:] + key[:-1] is a List.

Edit: clarity

robertwb · 2019-09-05T17:42:38Z

Thanks for tracking this down. #9486

robertwb · 2019-09-05T21:06:57Z

Run Python PreCommit

chadrik · 2019-09-05T22:51:21Z

wow, Python PreCommit takes forever

chadrik · 2019-09-06T01:29:04Z

yay, it works!

robertwb · 2019-09-06T01:49:01Z

Long queue...

chadrik mentioned this pull request Aug 15, 2019

[7746] Create a more user friendly external transform API #9098

Merged

robertwb reviewed Aug 23, 2019

View reviewed changes

chadrik added 2 commits August 28, 2019 17:13

The coder returned for typehints.List should be IterableCoder

846ada6

Clarify the story surrounding the list return value guarantee

f88989b

chadrik force-pushed the python-typecoders-list branch from 5568da6 to f88989b Compare August 29, 2019 00:22

robertwb approved these changes Aug 30, 2019

View reviewed changes

lint

eb5c794

robertwb merged commit d21bbaf into apache:master Sep 6, 2019

cisaacstern mentioned this pull request Jul 27, 2023

On dataflow, combine_fragments arg fragments is intermittently decoded incorrectly pangeo-forge/pangeo-forge-recipes#552

Open

[BEAM-7984] The coder returned for typehints.List should be IterableCoder #9344

[BEAM-7984] The coder returned for typehints.List should be IterableCoder #9344

Uh oh!

Conversation

chadrik commented Aug 15, 2019

Post-Commit Tests Status (on master branch)

Pre-Commit Tests Status (on master branch)

Uh oh!

chadrik commented Aug 15, 2019

Uh oh!

robertwb left a comment

Choose a reason for hiding this comment

Uh oh!

chadrik commented Aug 27, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

robertwb commented Aug 28, 2019

Uh oh!

chadrik commented Aug 29, 2019

Uh oh!

robertwb commented Aug 30, 2019

Uh oh!

robertwb left a comment

Choose a reason for hiding this comment

Uh oh!

chadrik commented Aug 30, 2019

Uh oh!

mxm commented Aug 30, 2019

Uh oh!

mxm commented Aug 30, 2019

Uh oh!

robertwb commented Sep 3, 2019

Uh oh!

robertwb commented Sep 3, 2019

Uh oh!

mxm commented Sep 4, 2019

Uh oh!

chadrik commented Sep 4, 2019 via email

Uh oh!

mxm commented Sep 4, 2019

Uh oh!

chadrik commented Sep 4, 2019

Uh oh!

chadrik commented Sep 4, 2019

Uh oh!

mxm commented Sep 4, 2019

Uh oh!

chadrik commented Sep 4, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

robertwb commented Sep 5, 2019

Uh oh!

robertwb commented Sep 5, 2019

Uh oh!

chadrik commented Sep 5, 2019

Uh oh!

chadrik commented Sep 6, 2019

Uh oh!

robertwb commented Sep 6, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

chadrik commented Aug 27, 2019 •

edited

Loading

chadrik commented Sep 4, 2019 •

edited

Loading