[7746] Create a more user friendly external transform API #9098

chadrik · 2019-07-18T07:09:39Z

As a python developer I'd like to have a more opinionated API for implementing external transforms so that A) it can guide me toward more standardized solutions and B) reduce the amount of boilerplate code that I need to write.

Improvements:

Create a constant for the expansion service address, to avoid copying it into the __init__ args of every external transform
Move the urn to the class-level to enforce consistency (which enables us to reduce boilerplate) and also for the sake of correctness: the value does not vary per instance
Reduce boilerplate:
- Provide a helper to create a ConfigValue with encoded value
  - Automatically determine the list of coder urns from a compound coder
  - Handle utf-8 encoding, if requested
- Create the ExternalTransform for the user based on the configuration dictionary that they provide

Post-Commit Tests Status (on master branch)

Lang	Apex	Dataflow	Gearpump	Samza
Go	---	---	---	---
Java
Python	---		---	---

Pre-Commit Tests Status (on master branch)

---	Java	Python	Go	Website
Non-portable
Portable	---		---	---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

chadrik · 2019-07-18T07:11:02Z

R: @robertwb
R: @mxm
R: @lukecwik

robertwb

This looks like a nice improvement.

sdks/python/apache_beam/coders/coders.py

sdks/python/apache_beam/transforms/external.py

robertwb · 2019-07-18T11:43:53Z

sdks/python/apache_beam/transforms/external.py

+    """
+    return ConfigValue(
+      coder_urn=[urn for urn in iter_urns(coder)
+                 if urn not in FILTERED_CODERS],


I don't think we can safely filter out certain coders, as that changes the encoding.

Glad you mentioned it :)

I'm confused about why the LengthPrefixCoder urn is not included in the ConfigValue .coder_urns list in the original code. The original list or the kafka consumer_config param was ['beam:coder:iterable:v1', 'beam:coder:kv:v1', 'beam:coder:bytes:v1', 'beam:coder:bytes:v1']. I was trying to keep the end result the same, and I assumed that there were coders which were "write-only", and thus did not need to be transmitted to Java.

We do not need the length prefix coder here, do we? That's because we get the encoded bytes via Protobuf and do not have to reason about the length.

But we do need the lengths if it is, say, KV<LengthPrefixedX, LenthPrefixedY>. What's more, we're getting the already encoded value here (which is different than if we had tried to filter them out before doing the encoding.

Are you seeing instances of LengthPrefixCoder that need filtering out? If so, this deserves further investigation. Otherwise, let's just drop this.

I removed this filter.

Note that between this change and the StrUtf8Coder change, the list of coder urns produced by ReadFromKafka has changed from this:

coder_urn: "beam:coder:iterable:v1" coder_urn: "beam:coder:kv:v1" coder_urn: "beam:coder:bytes:v1" coder_urn: "beam:coder:bytes:v1"

to this:

coder_urn: "beam:coder:iterable:v1" coder_urn: "beam:coder:kv:v1" coder_urn: "beam:coder:length_prefix:v1" coder_urn: "beam:coder:string_utf8:v1" coder_urn: "beam:coder:length_prefix:v1" coder_urn: "beam:coder:string_utf8:v1"

Likewise for WriteToKafka,

@mxm can you please confirm whether this will require adjustments to the Kafka Java code?

We should not worry about LengthPrefixCoder at all.

Sounds good.

Eventually this can all be replaced by schemas anyway, so not something to worry to much about here.

Can you point me at a document or issue for this? I'd love to learn more.

I played around with this some more. The current design seems to require the LengthPrefixCoder. If I don't wrap the str/bytes coder with length-prefix I get this error:

Caused by: org.apache.beam.sdk.coders.CoderException: java.io.EOFException: reached end of stream after reading 38 bytes; 112 bytes expected

@mxm can you confirm that the current design requires the LengthPrefixCoder for strings?

@robertwb I saw the schema / row-coder PR at #9188. Is this the schema support you were referring to? Is anyone assigned to porting external transforms to using schema coders? Luckily I think most of the discussion here about the design of the interface remains valid even after that's complete.

Coders also have the notion of nested/unnested. It happens that LengthPrefix(UnnestedBytes) == NestedBytes. Perhaps this is part of the issue here?

Yes, that's the schema/row stuff. I was just commenting that this particular part is going to change (no, no one's on it yet) so whatever works here is fine (and should simplify the above issues).

@chadrik Sorry for the late reply. Vacations and open-source work do not always go together well :) Actually, for the StringUtf8Coder, the length should be added automatically in the nested context, no need to add the length prefix.

how does "elements_per_period": 20 in our python-based payload end up calling GenerateSequence.Builder.setElementsPerPeriod(20) in Java?

It is a simple mapping scheme which we also use for the pipeline options. It converts snake_case to camelCase and then looks up the setter in the Java configuration class.

@mxm can you confirm that the current design requires the LengthPrefixCoder for strings?

I don't think so, even with the byte encoding in master, the length prefix is added automatically by the ByteArrayCoder.

Perhaps paste the full stack trace here if you are still seeing problems.

sdks/python/apache_beam/transforms/external.py

mxm

Looks great @chadrik. Thanks a lot. Some comments inline.

sdks/python/apache_beam/io/external/generate_sequence.py

sdks/python/apache_beam/transforms/external.py

mxm · 2019-07-23T14:01:00Z

sdks/python/apache_beam/transforms/external.py

+    """
+    return ConfigValue(
+      coder_urn=[urn for urn in iter_urns(coder)
+                 if urn not in FILTERED_CODERS],


We do not need the length prefix coder here, do we? That's because we get the encoded bytes via Protobuf and do not have to reason about the length.

sdks/python/apache_beam/transforms/external.py

sdks/python/apache_beam/coders/coders.py

robertwb

The design looks sound to me.

sdks/python/apache_beam/io/external/kafka.py

robertwb · 2019-07-29T12:10:49Z

sdks/python/apache_beam/transforms/external.py

+          typehint is None or not isinstance(typehint, typehints.Optional)):
+        # make it easy for user to filter None by default
+        continue
+      result[k] = cls.config_value(v, typehint)


Do we want to strip the optional wrapping here (e.g. so an Optional[int] is encoded just as a raw into or omitted)?

yes, I think so. I didn't realize that Union/Optional caused coders to fall back to FastPrimitive, which is kind of a bummer.

chamikaramj

Thanks.

sdks/python/apache_beam/transforms/external.py

mxm

Thanks @chadrik. For the naming, please see #9098 (comment).

robertwb · 2019-08-02T13:42:24Z

OK, taking a step back and thinking about this some more, there are three essential pieces of data that the user must provide for this transform: the urn and the configuration parameters, and how to encode the configuration parameters. This suggests the methods get_urn, get_config, and get_schema which are invoked by the base class.

For simplicity, as urns are typically static, we could provide a default get_urn that returns self._urn.

get_config and get_schema are a bit more interesting as there is non-trivial redundancy between them. If get_schema is not implemented, we can infer it from the types in get_config. If get_config is not implemented, we can infer it from the identically-named self attributes in get_schema. I propose we let the user implement either (or both).

Additionally, get_schema could be inferred for data classes, or those derived from cattrs. (Should this be done implicitly due to subclassing, or explicitly via a class decorator?)

This is related to the further magic/ease-of-use, which is allowing a single _schema attribute rather than implementing get_schemea, and also auto-generating __init__ methods from it.

Does this sound right?

chadrik · 2019-08-02T15:29:20Z

Does this sound right?

Spot on, as usual. A few comments below:

For simplicity, as urns are typically static, we could provide a default get_urn that returns self._urn.

Note, we might want get_urn to be a classmethod.

get_config and get_schema are a bit more interesting as there is non-trivial redundancy between them. If get_schema is not implemented, we can infer it from the types in get_config. If get_config is not implemented, we can infer it from the identically-named self attributes in get_schema. I propose we let the user implement either (or both).

I'm leaning toward requiring get_schema and not deriving schema from the config, because I think consistency of the schema is important, especially for the author of the native transform who would (likely) be expecting each key to use the same coder every time. If we derive the coder from the value, then it could change for a given config key depending on how the external transform is initialized (x=0 vs x=0.5) . The schema lets us be consistent and concrete.

Additionally, get_schema could be inferred for data classes, or those derived from cattrs. (Should this be done implicitly due to subclassing, or explicitly via a class decorator?)

I'd like to avoid the sub-class requirement. More info below.

This is related to the further magic/ease-of-use, which is allowing a single _schema attribute rather than implementing get_schemea, and also auto-generating init methods from it.

yes, basically. let me clarify this a little bit more.

A user should be able to write a complete class including schema using type annotations, like this:

class ReadFromKafka(External):
  _urn = 'beam:external:java:kafka:read:v1'

  def __init__(self,
               consumer_config: Iterable[Tuple[str, str]],
               topics: Iterable[str],
               key_deserializer: str,
               value_deserializer: str,
               expansion_service: Optional[str] = None):
    super(ReadFromKafka, self).__init__(expansion_service)
    self.consumer_config = list(consumer_config.items())
    self.topics = topics
    self.key_deserializer = key_deserializer
    self.value_deserializer = value_deserializer

And attrs / dataclasses can be used as a shortcut for deriving the __init__:

@dataclass
class ReadFromKafka(External):
  _urn = 'beam:external:java:kafka:read:v1'

  consumer_config: Iterable[Tuple[str, str]]
  topics: Iterable[str]
  key_deserializer: str
  value_deserializer: str
  expansion_service: Optional[str] = field(default=None)

expansion_service is obviously the odd-man-out here, so I'll give some thought to how this could be simplified.

The thing I wanted to make clear is that I don't think get_schema would need special support for attrs/dataclasses in this case: both approaches outlined above will add type __annotations__ to __init__, from which we can derive our schema.

robertwb · 2019-08-05T11:12:21Z

On may have transforms with no config params, or whose parameter is a single string. Requiring a schema in this case is extra boilerplate (though arguably good practice, but in the spirit of PEP 484 very optional). To pass a wrong value (e.g. a float when an int is required) would be a runtime type error (like one would expect).

A default implementation that looks at the __annotations__ of __init__, if present, allow succinct creation of such classes as mentioned.

chadrik · 2019-08-07T06:28:42Z

Run Python PreCommit

chamikaramj

Thanks. Updated API LGTM.

chamikaramj · 2019-08-08T17:05:50Z

sdks/python/apache_beam/io/external/kafka.py

+from apache_beam.transforms.external import ExternalTransform, NamedTupleBasedPayloadBuilder
+
+
+ReadFromKafkaSchema = typing.NamedTuple(


nit: 'ReadFromKafkaPayloadTuple' will be a better name I think.

btw, I used the name "schema" since ultimately we will switch to using the upcoming schema support when serializing this payload. Not sure if that changes your mind on the naming at all.

chamikaramj · 2019-08-08T17:08:36Z

sdks/python/apache_beam/io/external/kafka.py

+    )
+
+
+WriteToKafkaSchema = typing.NamedTuple(


chamikaramj · 2019-08-08T18:10:23Z

sdks/python/apache_beam/transforms/external.py

+    return ExternalConfigurationPayload(configuration=args)
+
+
+class NamedTupleBasedPayloadBuilder(SchemaBasedPayloadBuilder):


Please add unit tests for each of these payload builder types.

chamikaramj · 2019-08-08T18:11:45Z

sdks/python/apache_beam/transforms/external.py

+
+  Supported in python 3 only.
+  """
+  def __init__(self, transform, **values):


Please add a pydocs and define parameters (here and in other places)

chamikaramj · 2019-08-08T18:15:19Z

sdks/python/apache_beam/transforms/external.py

+    for k, v in config.items():
+      typehint = schema.get(k)
+      if v is None and (
+          typehint is None or not isinstance(typehint, typehints.Optional)):


So seems like 'None' here means, used default value in remote SDK, right ? We should clarify that in documentation.

The behavior is a bit undefined at this point, since we only have GenerateSequence as an example. In that case, the field is null on the Java side as well.

The reason we skip the fields if they're none here is ultimately because we don't have a cross-platform coder for none/null or a cross-platform way of declaring a type as optional/nullable. So the behavior would be better defined if we could serialize None and send that value to Java.

mxm

@chadrik I think everything looks good now. Would be great if we could add the final touches and get this in.

chadrik · 2019-08-13T17:29:31Z

@mxm I'll start working on the serialization tests for this today.

I haven't tested this recently, but I'm pretty sure that deserialization in Java still does not work because of the LengthPrefix error I described earlier (the Java side seems to expect LengthPrefix, and the python side no longer adds it, since it's difficult to know when to do that automatically). I'll get you the full stack trace soon.

chadrik · 2019-08-15T03:27:50Z

Pushed a lot of updates to this.

I fixed up a number of issues and created a new ImplicitSchemaPayloadBuilder which covers the use case of determining coders from the provided values, though I'm still not 100% convinced that this is a good idea. It's fine for python-to-python, but it's going to be brittle if it's intended for expansion via other languages, where expectations are more rigid.

The tests fail because #9344 is required for the implicit schema generation, so it'd be good to get that PR merged soon.

@mxm this is ready for you to look at why the string decoding is failing on the java side.

robertwb · 2019-08-26T21:55:30Z

This is looking great, and will be a very nice simplification for authoring external transform stubs.

I think the ImplicitSchemaPayloadBuilder is pretty safe because most of the types here will be simple basic ones (I'd say 90%+ just ints, floats, and strings).

I reviewed #9344, just needs some more comments. Could you clarify what string decoding issues you were seeing that @mxm was going to look into? I agree with Cham that it'd be good to have tests for these various payload builders. After that, this looks good to go and it'd be great to get in.

chadrik · 2019-08-27T01:43:05Z

I reviewed #9344, just needs some more comments.

I started to respond there and realized it was more complicated than I thought. I just updated that.

Could you clarify what string decoding issues you were seeing that @mxm was going to look into?

Yes, I have to revisit that test to get the full stack trace, but the gist is that the external transform code in java assumes all strings have been wrapped in LengthPrefixCoder, which is no longer happening in this PR.

I agree with Cham that it'd be good to have tests for these various payload builders. After that, this looks good to go and it'd be great to get in.

Added tests. They will fail until #9344 is merged.

robertwb · 2019-09-06T01:50:47Z

Run Python PreCommit

robertwb · 2019-09-06T01:51:00Z

Run RAT PreCommit

robertwb · 2019-09-06T01:51:08Z

Run Python_PVR_Flink PreCommit

chadrik · 2019-09-06T02:21:54Z

Now that all the dependent MRs are in, I can finally demonstrate the coder problem that I was running into. I'll do that tomorrow.

chadrik · 2019-09-06T06:49:27Z

@mxm I could use your help now!

With the latest from this branch, I'm using the following to expand an external KafkIO xform:

import apache_beam as beam
from apache_beam.io.external.kafka import ReadFromKafka, WriteToKafka

def main(pipeline_options, args):

    pipe = beam.Pipeline(options=pipeline_options)

    (
        pipe
        | 'PubSubInflow' >> ReadFromKafka(
            consumer_config={'foo': 'bar'},
            topics=['this', 'that'])
    )

    result = pipe.run()

    try:
        result.wait_until_finish()
    except KeyboardInterrupt:
        pass

And I get this error in the expansion service:

[grpc-default-executor-0] INFO org.apache.beam.runners.core.construction.expansion.ExpansionService - Expanding 'root' with URN 'beam:external:java:kafka:read:v1'
Sep 05, 2019 11:33:38 PM org.apache.beam.vendor.grpc.v1p21p0.io.grpc.internal.SerializingExecutor run
SEVERE: Exception while executing runnable org.apache.beam.vendor.grpc.v1p21p0.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed@31d95dcf
java.lang.RuntimeException: Failed to build transform beam:external:java:kafka:read:v1 from spec urn: "beam:external:java:kafka:read:v1"
payload: "\nl\n\020key_deserializer\022X\n\031beam:coder:string_utf8:v1\022;org.apache.kafka.common.serialization.ByteArrayDeserializer\nn\n\022value_deserializer\022X\n\031beam:coder:string_utf8:v1\022;org.apache.kafka.common.serialization.ByteArrayDeserializer\n\201\001\n\017consumer_config\022n\n\026beam:coder:iterable:v1\n\020beam:coder:kv:v1\n\031beam:coder:string_utf8:v1\n\031beam:coder:string_utf8:v1\022\f\000\000\000\001\003foo\003bar\nM\n\006topics\022C\n\026beam:coder:iterable:v1\n\031beam:coder:string_utf8:v1\022\016\000\000\000\002\004this\004that"

        at org.apache.beam.runners.core.construction.expansion.ExpansionService$ExternalTransformRegistrarLoader.lambda$knownTransforms$0(ExpansionService.java:116)
        at org.apache.beam.runners.core.construction.expansion.ExpansionService$TransformProvider.apply(ExpansionService.java:290)
        at org.apache.beam.runners.core.construction.expansion.ExpansionService.expand(ExpansionService.java:335)
        at org.apache.beam.runners.core.construction.expansion.ExpansionService.expand(ExpansionService.java:372)
        at org.apache.beam.model.expansion.v1.ExpansionServiceGrpc$MethodHandlers.invoke(ExpansionServiceGrpc.java:224)
        at org.apache.beam.vendor.grpc.v1p21p0.io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:171)
        at org.apache.beam.vendor.grpc.v1p21p0.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:322)
        at org.apache.beam.vendor.grpc.v1p21p0.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:762)
        at org.apache.beam.vendor.grpc.v1p21p0.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at org.apache.beam.vendor.grpc.v1p21p0.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.beam.sdk.coders.CoderException: java.io.EOFException: reached end of stream after reading 58 bytes; 111 bytes expected
        at org.apache.beam.sdk.coders.StringUtf8Coder.decode(StringUtf8Coder.java:104)
        at org.apache.beam.sdk.coders.StringUtf8Coder.decode(StringUtf8Coder.java:90)
        at org.apache.beam.sdk.coders.StringUtf8Coder.decode(StringUtf8Coder.java:37)
        at org.apache.beam.runners.core.construction.expansion.ExpansionService$ExternalTransformRegistrarLoader.populateConfiguration(ExpansionService.java:186)
        at org.apache.beam.runners.core.construction.expansion.ExpansionService$ExternalTransformRegistrarLoader.translate(ExpansionService.java:134)
        at org.apache.beam.runners.core.construction.expansion.ExpansionService$ExternalTransformRegistrarLoader.lambda$knownTransforms$0(ExpansionService.java:113)
        ... 12 more
Caused by: java.io.EOFException: reached end of stream after reading 58 bytes; 111 bytes expected
        at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.io.ByteStreams.readFully(ByteStreams.java:780)
        at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.io.ByteStreams.readFully(ByteStreams.java:762)
        at org.apache.beam.sdk.coders.StringUtf8Coder.readString(StringUtf8Coder.java:60)
        at org.apache.beam.sdk.coders.StringUtf8Coder.decode(StringUtf8Coder.java:100)
        ... 17 more

mxm · 2019-09-06T14:33:38Z

sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java

+        private Iterable<KV<String, String>> producerConfig;
+        private String topic;
+        private String keySerializer;
+        private String valueSerializer;


If you change the types here this will affect the lookup of the configuration fields in ExpansionService. The lookup is performed based on the returned type of the coder. The test still uses the bytes coder, so this will attempt to lookup a byte[] field.

This is for the KafkaIOExternal test which I see failing.

chadrik · 2019-09-06T15:05:48Z

Forget about the tests for a moment, I don’t think they are relevant to the problem I’m seeing. I will fix them shortly. Try the example code that I sent.

The original Kafka code was manually handling conversion between utf8 strings and bytes in both python and java, instead of using the proper string coders. Now that the coders are based on types/schemas in python we are forced to use the correct coder (using bytes in the schema would not have been appropriate because it would have have handled utf8 properly). So fixing this in python forced us to fix it in Java.

If you think it would be helpful I can make a separate PR to change KafkaIO to use StringUtf8Coder.

mxm · 2019-09-06T15:31:53Z

sdks/python/apache_beam/transforms/external_test.py

+          payload=VarIntCoder().encode(values['integer_example'])),
+      'string_example': ConfigValue(
+          coder_urn=['beam:coder:string_utf8:v1'],
+          payload=StrUtf8Coder().encode(values['string_example'])),


This is not sufficient because the Python StrUtf8Coder does not add a length prefix while the Java one does. Maybe that is actually a bug. The old version was using a length prefix coder before the bytes coder.

See

beam/sdks/python/apache_beam/coders/coders.py

Line 326 in 9678149

def encode(self, value):

Difference between Python and Java UTF-8 coders was discussed before in dev list: https://lists.apache.org/thread.html/a3a0d9e7c4196bb6be14ba0bec103209317dcb98e781560eb3ccd48c@%3Cdev.beam.apache.org%3E

Unfortunately I don't believe we resolved this though.

This is the nested vs. unnested issue back to bite us again.

Here we should be using the nested context consistently. For Python, write coder.get_impl().encode_nested(value). Probably would be worth adding an encode_nested method to coder itself.

@robertwb Can you provide a bit more explanation for the beam-newbs in the audience (me), please? Does this just ensure lengths prefixed for applicable types?

Basically, ever coder actually represents two concrete encodings, a "nested" one for use in an unbounded input stream (e.g. as a stream of many elements, or within a composite element type like list), and a "outer" one for use when the end-of-record must be provided explicitly (e.g. when writing to a text file, were the newline delimits strings and prefixing each line with a string length would be bad). (In Java this is reified in the second Context argument of the encode/decode functions. For Python it's only visible in the impl layer).

For some coders, nested and unnested are identical (e.g. varints or doubles, where the element length is implicit in the encoding). For others, a length prefix is added (e.g. utf8 strings and bytes). For others still, it gets pushed down (e.g. for kv coders, the key is always encoded nested, and the value is encoded in the outer context).

We're trying to move to use the "nested" one everywhere, as this is really confusing (see the linked thread).

I think we should be good here and elsewhere to use the nested encoding. On the Java side, we need to change the decode call in ExpansionServer to use the NESTED parameter. The old code here did not need that because it simply wrapped everything in an additional LengthPrefixCoder, which encode_nested will essentially do as well. So either reverting back to adding a wrapping LengthPrefixCoder or using encode_nested should work. Note that the latter also requires the additional NESTED parameter on the Java side in ExpansionServer.

chadrik · 2019-09-11T06:12:07Z

Run Java PreCommit

chadrik · 2019-09-11T19:42:22Z

Run Java PreCommit

chadrik · 2019-09-11T20:47:01Z

woohoo!

chadrik · 2019-09-11T21:14:27Z

@robertwb there's a commit in this PR that makes it possible to specify a test's minimum major and minor version for python. I use it in this PR to ensure that the dataclasses tests, which rely on syntax changes introduced in python 3.6, and the dataclasses module introduced in python 3.7, are not loaded and run by the python 3.5 tests. To do so I just extended the regexes a bit. Let me know if you think this should be in a separate PR.

robertwb · 2019-09-11T23:14:36Z

In this particular case, can you just guard the test from running with a runtime check of sys.version, rather than the boilerplate to exclude tests in the testing scripts? (The py3 stuff was needed for exercising new syntax).

After that, could you squash to more logical commits, and then I think it's ready to merge. Thanks again for doing this!

mxm · 2019-09-12T00:07:56Z

Thanks you for your persistent work on the PR @chadrik. Will be great to merge this! :)

chadrik · 2019-09-12T01:40:03Z

In this particular case, can you just guard the test from running with a runtime check of sys.version, rather than the boilerplate to exclude tests in the testing scripts? (The py3 stuff was needed for exercising new syntax).

I can't do a runtime check because this is protecting against a syntax change, introduced in python 3.6 (variable annotations, PEP-526) so the py35 tests fail with a syntax error during load. I figured this solution was simpler (though uglier) than trying to lobby the mailing to drop python 3.5 support :)

robertwb · 2019-09-12T03:49:36Z

Oh, I missed that new syntax... I would be a fan of dropping 3.5 support, it's security-fix-only, source-release-only by this point, and might be easier to pull now than after we have a bunch of Python 3 users. However, I agree that's bigger than this PR, so it's fine as is.

…

On Wed, Sep 11, 2019 at 6:40 PM Chad Dombrova ***@***.***> wrote: In this particular case, can you just guard the test from running with a runtime check of sys.version, rather than the boilerplate to exclude tests in the testing scripts? (The py3 stuff was needed for exercising new syntax). I can't do a runtime check because this *is* a syntax change, introduced in python 3.6 (variable annotations, PEP-526) so the py35 tests fail with a syntax error during load. I figured this solution was simpler (though uglier) than trying to lobby the mailing to drop python 3.5 support :) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#9098?email_source=notifications&email_token=AADWVAJTTB2E7MX5LNKQNCLQJGMXJA5CNFSM4IEXTNYKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD6QMFFI#issuecomment-530629269>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AADWVAI6GRRNYWJ73VREEXTQJGMXJANCNFSM4IEXTNYA> .

Standardize and reduce boilerplate

previously was manually handling conversion from byte[] to String

…anges in 3.6

chadrik · 2019-09-12T14:42:49Z

After that, could you squash to more logical commits, and then I think it's ready to merge.

Done.

aaltay · 2019-09-13T16:10:43Z

I will revert this, postcommit tests are failing: https://issues.apache.org/jira/browse/BEAM-8229

aaltay · 2019-09-13T16:17:03Z

Filed an issues for having py 3.x tests as part of precommits: https://issues.apache.org/jira/browse/BEAM-8230

This seems like a simple issue, feel free to fix forward if that can happen quicker than a rollback.

chadrik · 2019-09-13T16:23:47Z

There's a misunderstanding here. Pre-commit does run the python 3.x tests, but it runs many of the tests on python 3.5. This PR introduced tests that can only run on python 3.6 or higher, and I had to do some extra work to the beam test framework to make that possible: 0c31f7c

I appears that same work to properly exclude/include tests at the major + minor version needs to be done for post-commit tests. I didn't do that simply because I didn't know it could be a problem. I assumed everything went through tox.

aaltay · 2019-09-13T16:27:38Z

I appears that same work to properly exclude/include tests at the major + minor version needs to be done for post-commit tests.

@chadrik I agree this seems be the issue.

robertwb · 2019-09-13T16:55:08Z

Sorry about that.

This PR has a lot going on in it. Chad, could you pull out the bulk of this PR (minus the data classes, and associated test and test framework changes) and we could get that latter part in via a subsequent PR?

chadrik force-pushed the py_external_api branch from 6f77aa2 to 2b309f9 Compare July 18, 2019 07:12

robertwb reviewed Jul 18, 2019

View reviewed changes

mxm self-requested a review July 18, 2019 15:00

mxm reviewed Jul 23, 2019

View reviewed changes

robertwb reviewed Jul 29, 2019

View reviewed changes

chamikaramj reviewed Jul 31, 2019

View reviewed changes

sdks/python/apache_beam/transforms/external.py Outdated Show resolved Hide resolved

sdks/python/apache_beam/transforms/external.py Outdated Show resolved Hide resolved

mxm reviewed Aug 1, 2019

View reviewed changes

chadrik force-pushed the py_external_api branch from e2af3ba to 7396fdd Compare August 1, 2019 23:42

chamikaramj reviewed Aug 8, 2019

View reviewed changes

mxm approved these changes Aug 13, 2019

View reviewed changes

mxm mentioned this pull request Aug 13, 2019

[BEAM-7738] Add external transform support to PubsubIO #9268

Merged

chadrik force-pushed the py_external_api branch 2 times, most recently from c86f017 to 250a584 Compare August 15, 2019 00:21

chadrik force-pushed the py_external_api branch from 250a584 to dafe890 Compare August 26, 2019 00:32

chadrik force-pushed the py_external_api branch from dafe890 to 250a584 Compare August 27, 2019 01:40

chadrik force-pushed the py_external_api branch from 250a584 to b9de4bc Compare September 6, 2019 02:18

mxm reviewed Sep 6, 2019

View reviewed changes

chadrik force-pushed the py_external_api branch from c2314c1 to 2f90bc7 Compare September 11, 2019 18:21

chadrik and others added 7 commits September 11, 2019 23:25

Create a more user friendly external transform API

551faa1

Standardize and reduce boilerplate

KafkaIO: use StringUtf8Coder in Java to match python

356f470

previously was manually handling conversion from byte[] to String

Encode and decode using nested context

ef6a5dc

Add better error message for configuration type mismatch

932a696

Allow tests to specify a python 3 minor version, to isolate syntax ch…

0c31f7c

…anges in 3.6

Split dataclass-based external transform tests to run in python 3.7 only

e470c42

Add dataclasses as a native library for isort

0ae5582

chadrik force-pushed the py_external_api branch from 2f90bc7 to 0ae5582 Compare September 12, 2019 06:26

robertwb approved these changes Sep 12, 2019

View reviewed changes

robertwb merged commit 535bf37 into apache:master Sep 12, 2019

aaltay mentioned this pull request Sep 13, 2019

[8229] Revert "[7746] Create a more user friendly external transform API" #9565

Closed

chadrik mentioned this pull request Sep 13, 2019

[BEAM-8229] Exclude python 3-specific test files from running in post-commit #9570

Merged

3 tasks

		from apache_beam.transforms.external import ExternalTransform, NamedTupleBasedPayloadBuilder


		ReadFromKafkaSchema = typing.NamedTuple(

		return ExternalConfigurationPayload(configuration=args)


		class NamedTupleBasedPayloadBuilder(SchemaBasedPayloadBuilder):

		)


		WriteToKafkaSchema = typing.NamedTuple(

[7746] Create a more user friendly external transform API #9098

[7746] Create a more user friendly external transform API #9098

Uh oh!

Conversation

chadrik commented Jul 18, 2019

Post-Commit Tests Status (on master branch)

Pre-Commit Tests Status (on master branch)

Uh oh!

chadrik commented Jul 18, 2019

Uh oh!

robertwb left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

robertwb Aug 5, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

mxm left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

robertwb left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chamikaramj left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

mxm left a comment

Choose a reason for hiding this comment

Uh oh!

robertwb commented Aug 2, 2019

Uh oh!

chadrik commented Aug 2, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

robertwb commented Aug 5, 2019

Uh oh!

chadrik commented Aug 7, 2019

Uh oh!

chamikaramj left a comment

Choose a reason for hiding this comment

robertwb Aug 5, 2019 •

edited

Loading

chadrik commented Aug 2, 2019 •

edited

Loading