-
Notifications
You must be signed in to change notification settings - Fork 4.5k
[BEAM-9035] Typed options for Row Schema and Field #10413
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
3c69b4a to
432b319
Compare
|
@reuvenlax this is draft, but before investing more into testing the builders and documentation I like your opinion. |
432b319 to
cc7c1a1
Compare
sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java
Outdated
Show resolved
Hide resolved
sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java
Outdated
Show resolved
Hide resolved
sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java
Outdated
Show resolved
Hide resolved
829c44c to
0816326
Compare
|
Run Java PreCommit |
|
@TheNeuralBit @reuvenlax the first PR of the schema options feature is ready for review. Reuven already done a pre-review when this PR was in draft. See tickets for more details, but coming in later pull requests are: options on Logical types (removing metadata and making LT portable), conversion from proto and avro, ... note: it seems the build infrastructure for java is hanging. |
fd49963 to
58bb737
Compare
|
Run Java PreCommit |
58bb737 to
6b3f5ec
Compare
6b3f5ec to
64b644d
Compare
64b644d to
5c23fae
Compare
|
Run Java PreCommit |
|
As promised, schema aware topics adding: @iemejia @aromanenko-dev (please read the associated doc: https://docs.google.com/document/d/1yCCRU5pViVQIO8-YAb66VRh3I-kl0F7bMez616tgM8Q/edit# ) |
|
@iemejia would appreciate if you also had a look at this. |
|
Sure @alexvanboxel sorry for delay taking a look later today. |
|
Run Java11 PreCommit |
|
Run Java PreCommit |
|
Run Python PreCommit |
|
Verified that the failed Python 2 build isn't due to the PR. This PR is ready for review. I have 2 PR's based on this one ready: one for AVRO and other for Protobuf. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This fields redundant since Option already contains a name. Maybe just make repeated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking back to the history I see that it was a repeated option. I don't have a preference. I can set it back to repeated option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the same as arrayValueFromProto.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this refactoring needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because validation is used by both setOption as the Row builder. It's the same validation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if its worth having all these overload. Maybe just have the setOption() method? Having users pass in the FieldType doesn't seem horrible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thru, I can remove this. Most people will just use the getters anyway. The setters will be used by proto/avro/ etc implementations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why special case null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Options should never be null, if a null is set it's equivalent to removing the option. Or do you prefer and Exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simpler to use Preconditions.checkNotNull
9685373 to
5ec1553
Compare
|
@reuvenlax the fixup addresses the following issues:
Answered:
|
|
@reuvenlax can you have a look at the changes? |
Ping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simpler to use Preconditions.checkNotNull
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you instead use Row.Equals.deepEquals(thisOption.getValue(), otherOption.getValue(), thisOption.getType()).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any actually you should move that call into Option.equals, and not run that code here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned above, use Row.Equals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use Row.Equals.deepHashCode on the value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's kind of odd to have a remove method in a builder - usually a builder only has add methods. What's the use case here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make final
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe best to throw exception if optionName isn't found? That's what Row.getValue does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think best to throw if option not found.
ba12eb1 to
95f2cb1
Compare
|
@reuvenlax did most of your fixup's. Except:
if it would throw an exception... or you try catch everything (no no no!) or you use the hasOption method like this: I and my colleagues prefer the simpler null check. |
|
Run Java PreCommit |
I also want to understand better why we can't just use Row here (revisiting Brian's question). You mentioned using dots in field names, but AFAIK protobuf also prohibits dots. It's true that dots are often used in option specifications in proto, but that's just to set individual fields of a message. In Beam, we really should model those as a single row-valued option, not as multiple individual options. I.e. (from the proto docs): should translate to a single Beam option named "my_method_option" not to two separate options. dots are also used to represent package names of course. Is this the main reason you need to support dots? |
most of the use cases we have result in a transformation based on options they generally result on a change of the schema. But it's important to not drop all the meta data because in another (following) transform maybe needs that meta data. Examples message PaymentServiceProvider {
PaymentPlatform payment_platform = 1 [
(vptech.data.contract.v1.description) = "The booking platform payment",
(vptech.data.contract.v1.enum_prefix) = "PAYMENT_PLATFORM_",
(vptech.data.contract.v1.optional) = true ]
}in beam the
as you see, each transform changes the schema slightly.
The implementation for each format (proto, avro, zeta, calcite) has control over setting it. It should set type nullable too
I disagree that they are the same. A schema describe the types of rows and fields, options are meta data on top of the those rows and fields. An option being there has a meaning, I never saw an option that was null. It's not possible in proto, neither in avro. If it's a blocker, I'll change it too move this along (but then I need to change the BIP-1: https://cwiki.apache.org/confluence/display/BEAM/%5BBIP-1%5D+Beam+Schema+Options)
No, it would translate to
yes. How could I ever map the above options if options would really be a row? I also have an avro implementation ready. It also has don't. Here is a real world example: this results in a Beam type of |
|
Some thoughts: I know your use case for options involves mapping specific input types to schemas, however we need to make sure that the actual mechanism is more generic than that. We have many uses for field metadata (or options) inside of Beam, completely unrelated to any option concept on input data types. I would not like to make a design choice that precludes nullable FieldTypes now, because we may very well find that we need nullable field types. That being said: I notice that you haven't removed the FieldType metadata in this PR. Is that planned for a future PR? I don't think we want both metadata and options in the protocol - Options seemed to me to be a better-typed version of metadata. I don't personally think that the hasOption code is too bad, and it's definitely better than constraining the type system for Options in a way that we quite likely will regret. That being said, you've already added the getValueOrDefault method. So you can still write this code: .getFields() A tiny bit more verbose, but much more explicit about what you are trying to do. (FYI the reason I don't suggest allowing getValue to return an Optional - even though it might be the "better" solution - is because the rest of the code base does not use Optional, and I want things to stay consistent). About the removeOption builder method. I'm struggling with this a bit, since it feels like a weird pattern to put remove into a builder. I'm also not sure whether copying a a Schema field with just a subset of options is a common use case or not. Generally we should start off by keeping such things out of the core Beam API until we are convinced that there are multiple users who need them. I think you can still accomplish the same thing by simply adding a new Options.Builder.setOptions override that takes in a Map. newField = newField.withOptions(Options.builder().setOptions( This code is probably a bit more annoying for you to write. If we are convinced that this is a common use case then I'm ok with adding removeOption, but if it's specific to your use case then I think we're better off adding the new setOptions method and using that. |
|
I think I addresses now all the issues in the latest fixup. Talking about field metadata: yes this will be another PR. I didn't want this all in one PR. This is the groundwork to get a decent API in. I have at least 2 related PR's (proto and avro support) waiting before tackling removal of metadata though. |
|
Taking another look. I'm fine with metadata being in a separate PR - it
makes it easier to review. I just wanted to ensure that metadata was still
in the plan.
…On Sun, Mar 15, 2020 at 4:24 AM Alex Van Boxel ***@***.***> wrote:
I think I addresses now all the issues in the latest fixup.
Talking about field metadata: yes this will be another PR. I didn't want
this all in one PR. This is the groundwork to get a decent API in. I have
at least 2 related PR's (proto and avro support) waiting before tackling
removal of metadata though.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#10413 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AFAYJVLQZ6BGMQQ37KJ65QTRHS3HPANCNFSM4J4M5UTA>
.
|
Let me know if you see other concerns, otherwise I need a lgtm. I don't expect anyone else to review this. |
|
@alexvanboxel I'm still not 100% sure that I agree that options are not schemas, and I don't love making SchemaVerification public. However I think it will be easier to iterate on those issues in isolation once the basic support is in, so I'm approving for now. LGTM |
This is the first PR of a multipart commit: this ticket implements the basic infrastructure of options on row and field. Options in Beam Schema add extra context to fields and schema. In contracts to metadata, options would be added to fields, logical types and rows. Options are key/typed value combination. The type system is using the beam schema itself and the value can be any type that is supported by the beam schema, including row.
4bb0a79 to
d851e6c
Compare
Thanks, I'm always open for the schema discussion. |
| /** Find the index of a given field. */ | ||
| public int indexOf(String fieldName) { | ||
| Integer index = fieldIndices.get(fieldName); | ||
| if (index == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change caused BEAM-10631, we should always use if-blocks (as it was), or lazy Preconditions.checkArgument(boolean condition, String format, Object args...).
This is the first PR of a multipart commit: this ticket implements the
basic infrastructure of options on row and field. Options in Beam
Schema add extra context to fields and schema. In contracts to
metadata, options would be added to fields, logical types and rows.
Options are key/typed value combination. The type system is using
the beam schema itself and the value can be any type that is supported
by the beam schema, including row.
Post-Commit Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.