Member

@TheNeuralBit TheNeuralBit commented Jun 13, 2019

Also adds tests in SchemaTranslationTest.

Things that are not currently included in this PR:

  • LogicalType registration and resolution by URN. As a result, a Schema that contains a logical type cannot be decoded yet.
  • Representing datetime and decimal as logical types (see BEAM-7554). Instead they are still represented as primitive types in Java's Schema.FieldType and they are mapped to the appropriate URNs when converting to/from the proto representation.

R: @reuvenlax, @robertwb, @kennknowles

Member

@kennknowles kennknowles left a comment

This is great, nice and clear diff. Meaningful change. The only comment that is actionable here is that it might be an opportune time to reset the 0 value for the AtomicType enum.

 // Experimental: A representation of a Beam Schema.
 message Schema {
-  enum TypeName {
+  enum AtomicType {
Member

Isn't 0 in an enum supposed to be reserved for unknown? It is wise, because of defaulting in proto libs.

Member

Yes. To my knowledge, explicitly adding an UNSPECIFIED entry with a value of 0 will make this clearer.
For example:
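A minimal Java sketch of the defaulting concern, written without the protobuf library so it stays self-contained. In proto3, an enum field that was never set reads back as the value numbered 0, so whichever name holds 0 is what a decoder sees for missing data. The enum AtomicTypeSketch, its numbers other than 0, and the map-based decode helper are all hypothetical stand-ins, not the generated Beam classes.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the proto enum; names and numbers are illustrative only. */
enum AtomicTypeSketch {
  UNSPECIFIED(0), // reserved so that "field not set" never looks like a real type
  BYTE(1),
  INT16(2);

  private final int number;

  AtomicTypeSketch(int number) {
    this.number = number;
  }

  /** Looks up the enum constant for a wire number. */
  static AtomicTypeSketch forNumber(int number) {
    for (AtomicTypeSketch type : values()) {
      if (type.number == number) {
        return type;
      }
    }
    throw new IllegalArgumentException("Unknown enum number: " + number);
  }
}

final class ZeroDefaultSketch {
  /** Mimics proto3 decoding: an absent enum field is read as number 0. */
  static AtomicTypeSketch decode(Map<String, Integer> wireFields) {
    int number = wireFields.getOrDefault("atomic_type", 0);
    return AtomicTypeSketch.forNumber(number);
  }

  public static void main(String[] args) {
    Map<String, Integer> messageWithoutType = new HashMap<>();
    // Prints UNSPECIFIED rather than silently claiming the field is a BYTE.
    System.out.println(decode(messageWithoutType));
  }
}
```

Had BYTE been assigned 0 in this sketch, the empty message would decode as BYTE, which is the ambiguity the reserved zero value avoids.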

Member Author

Done.

     ArrayType array_type = 3;
     MapType map_type = 4;
-    Schema row_schema = 5;
+    Schema row_type = 5;
Member

Proto best practice I think is to go ahead and have a RowType message with one field. It has overhead, yes.

Member Author

Done. Agreed, this is much cleaner.

.put(TypeName.BYTES, RunnerApi.Schema.AtomicType.BYTES)
.build();

private static final String URN_BEAM_LOGICAL_DATETIME = "urn:beam:logical:datetime";
Member

Two stylistic nits:

  1. We tend to omit the urn, don't we? While it does make a valid URI out of the thing, it seems a bit silly.
  2. I would leave out logical but put in something like type or schema_type or fieldtype to namespace.

Member Author

Done. I went with fieldtype.
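
The rename can be sketched as a one-line regex substitution, following the rule recorded in the follow-up commit notes (urn:beam:logical:(.*) -> beam:fieldtype:\1). The class and method names below are hypothetical; only the two URN prefixes come from this thread.

```java
final class UrnRenameSketch {
  /** Rewrites an old-style logical type URN to the fieldtype form; other strings pass through. */
  static String rename(String oldUrn) {
    // $1 re-inserts whatever followed the old "urn:beam:logical:" prefix.
    return oldUrn.replaceFirst("^urn:beam:logical:(.*)$", "beam:fieldtype:$1");
  }

  public static void main(String[] args) {
    System.out.println(rename("urn:beam:logical:datetime")); // beam:fieldtype:datetime
  }
}
```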

.put(TypeName.ROW, RunnerApi.Schema.TypeName.ROW)
.put(TypeName.LOGICAL_TYPE, RunnerApi.Schema.TypeName.LOGICAL_TYPE)

private static final BiMap<TypeName, RunnerApi.Schema.AtomicType> ATOMIC_TYPE_MAPPING =
Member

TBH I find a switch clearer than a map lookup, and it takes the same amount of code space. Not for this PR, though, since you are just editing the existing structure, not restructuring it.
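
For comparison, here is a small sketch of the two styles side by side. The enums JavaTypeSketch and ProtoTypeSketch are hypothetical stand-ins for the real type enums, and a plain java.util.EnumMap is swapped in for the Guava BiMap used in the PR so the example needs no dependencies.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical stand-ins for the Java-side and proto-side type enums. */
enum JavaTypeSketch { BYTE, INT16, STRING }

enum ProtoTypeSketch { UNSPECIFIED, BYTE, INT16, STRING }

final class MappingStylesSketch {
  // Style 1: a lookup table populated once.
  private static final Map<JavaTypeSketch, ProtoTypeSketch> MAPPING =
      new EnumMap<>(JavaTypeSketch.class);

  static {
    MAPPING.put(JavaTypeSketch.BYTE, ProtoTypeSketch.BYTE);
    MAPPING.put(JavaTypeSketch.INT16, ProtoTypeSketch.INT16);
    MAPPING.put(JavaTypeSketch.STRING, ProtoTypeSketch.STRING);
  }

  static ProtoTypeSketch viaMap(JavaTypeSketch type) {
    ProtoTypeSketch result = MAPPING.get(type);
    if (result == null) {
      throw new IllegalArgumentException("Unmapped type: " + type);
    }
    return result;
  }

  // Style 2: the same translation written as a switch.
  static ProtoTypeSketch viaSwitch(JavaTypeSketch type) {
    switch (type) {
      case BYTE:
        return ProtoTypeSketch.BYTE;
      case INT16:
        return ProtoTypeSketch.INT16;
      case STRING:
        return ProtoTypeSketch.STRING;
      default:
        throw new IllegalArgumentException("Unmapped type: " + type);
    }
  }

  public static void main(String[] args) {
    for (JavaTypeSketch type : JavaTypeSketch.values()) {
      System.out.println(type + ": " + viaMap(type) + " / " + viaSwitch(type));
    }
  }
}
```

One trade-off worth noting: a BiMap provides the reverse direction through inverse(), whereas the switch style needs a second function for proto-to-Java translation.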

-    switch (typeName) {
-      case ROW:
-        fieldType = FieldType.row(fromProto(protoFieldType.getRowSchema()));
+    switch (protoFieldType.getTypeInfoCase()) {
Member

Another not-for-this-PR comment: this would be cleaner with the switch in its own function, so that every branch could simply return.
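
A rough illustration of that suggestion, using a hypothetical Kind enum and string results rather than the real FieldType translation. The first method assigns to a local variable and needs a break per branch; the second isolates the switch in a function so each branch returns directly.

```java
final class SwitchReturnSketch {
  /** Hypothetical stand-in for the type_info cases. */
  enum Kind { ATOMIC, ARRAY, ROW }

  // Current shape: assign inside the switch, return afterwards.
  static String describeWithLocal(Kind kind) {
    String result;
    switch (kind) {
      case ATOMIC:
        result = "atomic";
        break;
      case ARRAY:
        result = "array";
        break;
      case ROW:
        result = "row";
        break;
      default:
        throw new IllegalArgumentException("Unexpected kind: " + kind);
    }
    return result;
  }

  // Suggested shape: every branch returns, so no local and no break statements.
  static String describe(Kind kind) {
    switch (kind) {
      case ATOMIC:
        return "atomic";
      case ARRAY:
        return "array";
      case ROW:
        return "row";
      default:
        throw new IllegalArgumentException("Unexpected kind: " + kind);
    }
  }

  public static void main(String[] args) {
    System.out.println(describe(Kind.ROW)); // row
  }
}
```

With returns in every branch, a forgotten break cannot cause accidental fall-through, and the compiler rejects any path that neither returns nor throws.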

@lukecwik
Member

Is the beam_runner_api.proto the right place to put all the schema stuff?

- add UNSPECIFIED to AtomicType
- add RowType
- urn:beam:logical:(.*) -> beam:fieldtype:\1
@TheNeuralBit
Member Author

I guess I don't have a strong opinion; I was just updating it in place. Do you think it should get its own schema.proto file?

@kennknowles
Member

I think it would be great to have a separate schema.proto file. I wouldn't block merging on this. I would definitely like that to be a separate commit if you do add it here. Moving + editing in one commit would be bad form IMO.

@TheNeuralBit
Member Author

Agreed. I can follow up with PR(s) for that move and the other code cleanup suggestions.

@robertwb
Contributor

LGTM too. Thanks.

@robertwb robertwb merged commit e65c176 into apache:master Jun 25, 2019
TheNeuralBit added a commit to TheNeuralBit/beam that referenced this pull request Aug 28, 2019
TheNeuralBit added a commit to TheNeuralBit/beam that referenced this pull request Sep 18, 2019
soyrice pushed a commit to soyrice/beam that referenced this pull request Sep 19, 2019
@TheNeuralBit TheNeuralBit deleted the new-schema-representation branch October 10, 2019 00:03