-
Notifications
You must be signed in to change notification settings - Fork 4.5k
[BEAM-5967] Add handling of DynamicMessage in ProtoCoder #8496
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@mxm @reuvenlax @kennknowles This is the first draft of supporting ProtoBuf DynamicMessages. I took a design decision here (see commit) and like your feedback. |
|
Thanks @alexvanboxel. I'll defer to Kenn and Reuven for now. |
b7929e8 to
cb4fac3
Compare
|
Added extra check to make sure that when a DynamicMessage is used a Descriptor is provided. |
cb4fac3 to
0660124
Compare
|
Rebased against master to keep PR uptodate |
| this.protoMessageDescriptor = protoMessageDescriptor; | ||
| } | ||
|
|
||
| private void writeObject(ObjectOutputStream oos) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, interesting. It is a problem that ProtoCoder is a CustomCoder serialized via Java serialization rather than a StructuredCoder. Not related to your change, just noting. This essentially makes it incompatible with pipeline update. @reuvenlax
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I pulled the upgradability of by using (and making protoMessageClass transient):
oos.defaultWriteObject();
if (DynamicMessage.class.equals(this.protoMessageClass)) {
I don't know if a StructuredCoder would help because a Descriptor is quite a complex beast (I serialize the complete proto binary on the stream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oof, that's pretty bad. Maybe we should change that?
An alternative would be to just put the DynamicMessage support in a ProtoSchemaProvider, and encourage people to stop using ProtoCoder altogether.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, | don't see another way todo this. I need to get the Descriptor to the workers. The only way I see it working is sending the FileDescriptorSet as a proto steam onto the Java serialized object stream.
Getting everyone on ProtoSchemaProvider will have the exact same problem I'm afraid and I even think it will be worse as there is no way of sharing payload between the toRow and fromRow functions. I'll know in a few days when I got a working prototype.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be cleaner if you had class DynamicMessageProtoCoder extends ProtoCoder? Then you wouldn't need to branch based on protoMessageClass == DynamicMessage you just override the relevant functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering the same thing. Might be cleaner to have a separate Coder for this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I split the DynamicProtoCoder from the ProtoCoder
...va/extensions/protobuf/src/main/java/org/apache/beam/sdk/extensions/protobuf/ProtoCoder.java
Outdated
Show resolved
Hide resolved
...xtensions/protobuf/src/test/java/org/apache/beam/sdk/extensions/protobuf/ProtoCoderTest.java
Outdated
Show resolved
Hide resolved
0660124 to
6ca429b
Compare
|
Force pushed to resolve weak type of |
|
Alex, Kenn's concern is that people who make no changes to their pipeline
(other than updating the Beam version) will be unable to update.
ProtoSchemaCoder will not have this problem as people will need to make
code changes to use it.
*From: *Alex Van Boxel <notifications@github.com>
*Date: *Sun, May 12, 2019 at 10:11 AM
*To: *apache/beam
*Cc: *reuvenlax, Mention
*@alexvanboxel* commented on this pull request.
… ------------------------------
In
sdks/java/extensions/protobuf/src/main/java/org/apache/beam/sdk/extensions/protobuf/ProtoCoder.java
<#8496 (comment)>:
> /** Private constructor. */
private ProtoCoder(Class<T> protoMessageClass, Set<Class<?>> extensionHostClasses) {
this.protoMessageClass = protoMessageClass;
this.extensionHostClasses = extensionHostClasses;
+ this.protoMessageDescriptor = null;
+ }
+
+ private ProtoCoder(
+ Descriptors.Descriptor protoMessageDescriptor, Set<Class<?>> extensionHostClasses) {
+ @SuppressWarnings("unchecked")
+ Class<T> protoMessageClass = (Class<T>) DynamicMessage.class;
+ this.protoMessageClass = protoMessageClass;
+ this.extensionHostClasses = extensionHostClasses;
+ this.protoMessageDescriptor = protoMessageDescriptor;
+ }
+
+ private void writeObject(ObjectOutputStream oos) throws IOException {
Well, | don't see another way todo this. I need to get the Descriptor to
the workers. The only way I see it working is sending the FileDescriptorSet
as a proto steam onto the Java serialized object stream.
Getting everyone on ProtoSchemaProvider will have the exact same problem
I'm afraid. I'll know in a few days when I got a working prototype.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#8496 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AFAYJVKDNB5Q7JLUP6IWKGDPVBFUBANCNFSM4HKZOJIQ>
.
|
|
@reuvenlax @kennknowles if upgradeability is a problem I think I handled this correctly, please note this extract from the serialization spec:
ref: https://docs.oracle.com/javase/8/docs/platform/serialization/spec/version.html This is why:
|
|
You should also set serialVersionUID to a fixed value. |
|
Setting the serialVersionUID will break compatibility. I admit, my implementation is only forward compatible, although backward compatibility is not an issue because I only add bytes to the stream when it's a DynamicMessage. Anyway, I'm have the impression this will never get accepted this way: I'll propose I override the ProtoCoder (calling it DynamicMessageCoder) and do the special logic there only for DynamicMessage. WDYT? |
|
If you set |
6ca429b to
0402c5b
Compare
|
@kennknowles had to take some time for this. I set the I also rebased again master. |
|
@kennknowles @reuvenlax Should this go through another review round? |
|
Well, as I'm sharing some code with the Protobuf Schema support, I like to
keep this one on hold a bit, if thats ok.
_/
_/ Alex Van Boxel
…On Thu, Jul 4, 2019 at 1:29 PM Maximilian Michels ***@***.***> wrote:
@kennknowles <https://github.com/kennknowles> @reuvenlax
<https://github.com/reuvenlax> Should this go through another review
round?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#8496?email_source=notifications&email_token=AAE4EM7KL4G6F64UKAQDPP3P5XNJXA5CNFSM4HKZOJI2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZHFBQI#issuecomment-508448961>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAE4EMZNO34OLSQWSFPTJSTP5XNJXANCNFSM4HKZOJIQ>
.
|
0402c5b to
02304ca
Compare
|
@reuvenlax @kennknowles , The ProtoDomain code from the Proto Schema support is now backported to this pull-request. I've created a pipeline on Dataflow with lots of static Protobufs and performed an upgrade for 2.14.0 -> 2.16.0-SNAPSHOT (build locally from this PR). What is How is upgradability handled? ProtoCoder has now a fixed It would be great if this still made the 2.16.0 release. |
02304ca to
545ba9d
Compare
kennknowles
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay. This seems quite nice and useful. I am not expert in proto descriptor wrangling. I would guess if there is a problem in that, @reuvenlax is more likely to catch it.
Maybe follow up with more extensive testing exercising lots of cases of ProtoDomain?
|
|
||
| /** | ||
| * Verifies that for the given {@code Coder<T>}, {@code Coder.Context}, and value of type {@code | ||
| * T}, encoding followed by decoding yields an equal value of type {@code T}. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one doesn't test that it is equal, but tests that the matcher succeeds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed description
| return map; | ||
| } | ||
|
|
||
| private static Descriptors.FileDescriptor convertToFileDescriptorMap( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Nullable. I'm a bit sad our analyses didn't catch this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added @nullable (stange that github doesn't mark this file as Outdated)
|
The ProtoDomain is tested though the Protoaware support, I aligned the
class with that PR, having this PR approved for this would make the schema
aware PR smalled, as I will rebase the PR.
_/
_/ Alex Van Boxel
…On Thu, Oct 10, 2019 at 6:55 PM Kenn Knowles ***@***.***> wrote:
***@***.**** approved this pull request.
Sorry for the delay. This seems quite nice and useful. I am not expert in
proto descriptor wrangling. I would guess if there is a problem in that,
@reuvenlax <https://github.com/reuvenlax> is more likely to catch it.
Maybe follow up with more extensive testing exercising lots of cases of
ProtoDomain?
------------------------------
In
sdks/java/core/src/main/java/org/apache/beam/sdk/testing/CoderProperties.java
<#8496 (comment)>:
> @@ -104,6 +104,16 @@
assertThat(decodeEncode(coder, context, value), equalTo(value));
}
+ /**
+ * Verifies that for the given ***@***.*** Coder<T>}, ***@***.*** Coder.Context}, and value of type ***@***.***
+ * T}, encoding followed by decoding yields an equal value of type ***@***.*** T}.
This one doesn't test that it is equal, but tests that the matcher
succeeds.
------------------------------
In
sdks/java/extensions/protobuf/src/main/java/org/apache/beam/sdk/extensions/protobuf/ProtoDomain.java
<#8496 (comment)>:
> + }
+
+ private ProtoDomain(DescriptorProtos.FileDescriptorSet fileDescriptorSet) {
+ this.fileDescriptorSet = fileDescriptorSet;
+ hashCode = java.util.Arrays.hashCode(this.fileDescriptorSet.toByteArray());
+ crosswire();
+ }
+
+ private static Map<String, DescriptorProtos.FileDescriptorProto> extractProtoMap(
+ DescriptorProtos.FileDescriptorSet fileDescriptorSet) {
+ HashMap<String, DescriptorProtos.FileDescriptorProto> map = new HashMap<>();
+ fileDescriptorSet.getFileList().forEach(fdp -> map.put(fdp.getName(), fdp));
+ return map;
+ }
+
+ private static Descriptors.FileDescriptor convertToFileDescriptorMap(
@nullable. I'm a bit sad our analyses didn't catch this.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#8496?email_source=notifications&email_token=AAE4EM6SXFD4LZQQJJBWNBLQN5M6XA5CNFSM4HKZOJI2YY3PNVWWK3TUL52HS4DFWFIHK3DMKJSXC5LFON2FEZLWNFSXPKTDN5WW2ZLOORPWSZGOCHSNI2Q#pullrequestreview-300209258>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAE4EM2TBV4JI73PGIM7QHLQN5M6XANCNFSM4HKZOJIQ>
.
|
|
I'll fix the comments tommorow (it's evening now) |
|
Thanks Alex! @reuvenlax @TheNeuralBit to review : ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be too onerous to add a few more tests here? Maybe set some more fields and some nested fields?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added another test testDynamicNestedRepeatedMessage that adds another repeated nested message.
|
LGTM overall. I'm not an expert in proto descriptor wrangling either though, so some more testing (or confirmation from Reuven) would make me feel better. On the other hand, #8690 does test this more thoroughly, and it would be nice to get to remove these changes from that diff. |
442eeee to
ed1d964
Compare
Backposted the tests from the schema aware PR. If we get at least this PR in 2.17 the schema aware one will be somewhat smaller. |
|
Can this be merged before the 2.17 cut (scheduled for Wednesday, October 23)? |
|
The ProtoCoder was unable to handle DynamicMessage as it was unable to get a message specific parser. The Coder is expanded to handle DynamicMessage as a special case. It stores the complete descriptor set when (de)serializing. Design decision: Although DynamicMessage could in theory have a different schema per message in a single stream this doesn't make sense. The common use-case for using DynamicMessages is when the schema is not known at compile type, but is known at pipeline build time (example, like pulled from a schema registry). With this restriction we only need to store the schema (or descriptor) once when the pipeline is serialized and send to the workers.
63dca7f to
1117a40
Compare
1117a40 to
64829e1
Compare
|
lgtm |
|
Thanks everyone! Glad we were able to get this in : ) |
|
Congrats @alexvanboxel! :) |
|
Hi! Awesome work, thanks! When will this feature be available at maven central? |
|
Hey @TheMatrix97! I think this made it into the 2.17.0 release cut that happened last week. Usually it takes a few weeks to get a release ready after the cut. When it's finished there will be an announcement on the mailing list and the blog. Alternatively you should be able to get this feature if you use the 2.18.0-SNAPSHOT version. |
|
This doesn't appear to be in the |
|
@TheNeuralBit Actually I don't see the 2.18-SNAPSHOT version at Maven central repo :( The most recent one is 2.16. |
|
ah every now and then the snapshot publishing seems to run into trouble. I'll ping the mailing list about it... |
|
@pabloem seems it's not published at maven central, but you can find it in the apache repo https://repository.apache.org/content/groups/snapshots/org/apache/beam/beam-sdks-java-core/2.18.0-SNAPSHOT/ |
The ProtoCoder was unable to handle DynamicMessage as it was unable to
get a message specific parser. The Coder is expanded to handle
DynamicMessage as a special case. It stores the complete descriptor set
when (de)serializing.
Design decision: Although DynamicMessage could in theory have a
different schema per message in a single stream this doesn't make sense.
The common use-case for using DynamicMessages is when the schema is not
known at compile type, but is known at pipeline build time (example,
like pulled from a schema registry).
With this restriction we only need to store the schema (or descriptor)
once when the pipeline is serialized and send to the workers.
Please add a meaningful description for your change here
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username).[BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replaceBEAM-XXXwith the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.Post-Commit Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.