avro: Abstract AvroWithPartnerSchemaVisitor #1235

JingsongLi · 2020-07-23T04:12:12Z

Abstract AvroWithPartnerSchemaVisitor to extract the Avro-specific logic from AvroSchemaWithTypeVisitor and AvroWithSparkSchemaVisitor.

openinx · 2020-07-24T08:32:21Z

core/src/main/java/org/apache/iceberg/avro/AvroWithPartnerSchemaVisitor.java

+        Schema.Field field = fields.get(i);
+        Preconditions.checkArgument(AvroSchemaUtil.makeCompatibleName(fieldName).equals(field.name()),
+            "Structs do not match: field %s != %s", fieldName, field.name());
+        results.add(visit(fieldTypes[i], field.schema(), visitor));


So the difference between this sub-field visit and the previous one is the method used to get the data type of the inner field. I don't think we need both structFieldTypeById and structFieldTypes. Is it possible to abstract them into one method, so that we could finally remove the if (visitor.schemaEvolution()) {} else {...} branch?

The core difference is matching up schemas by ID or not.

rdblue · 2020-07-26T20:37:35Z

core/src/main/java/org/apache/iceberg/avro/AvroSchemaWithTypeVisitor.java

-    return null;
+  @Override
+  public Type mapValueType(Type mapType) {
+    return mapType == null ? null : mapType.asMapType().valueType();


If this defined isNullType, then the visit method could do these checks instead:

if (isNullType(mapType)) { return nullType(); } else { return mapValueType(mapType); }
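
A minimal sketch of that idea, assuming a simplified partner type. `SketchType`, `PartnerAccessor`, `NullTypeSketch`, and `mapValuePartner` are hypothetical names used only for illustration; only `isNullType`, `nullType`, and `mapValueType` come from the comment above, and none of this is Iceberg's actual API.

```java
// Hypothetical stand-in for a partner type; not Iceberg's Type API.
final class SketchType {
  final String name;
  final SketchType valueType; // non-null only for map types in this sketch

  SketchType(String name, SketchType valueType) {
    this.name = name;
    this.valueType = valueType;
  }
}

// Hooks a visitor would supply; isNullType and nullType follow the review comment.
interface PartnerAccessor<P> {
  boolean isNullType(P type);

  P nullType();

  P mapValueType(P mapType);
}

final class NullTypeSketch {
  // The shared traversal performs the null check once for every implementation.
  static <P> P mapValuePartner(PartnerAccessor<P> accessor, P mapType) {
    if (accessor.isNullType(mapType)) {
      return accessor.nullType();
    } else {
      return accessor.mapValueType(mapType);
    }
  }

  // Example implementation: mapValueType no longer needs its own null check.
  static final PartnerAccessor<SketchType> ACCESSOR =
      new PartnerAccessor<SketchType>() {
        @Override
        public boolean isNullType(SketchType type) {
          return type == null;
        }

        @Override
        public SketchType nullType() {
          return null;
        }

        @Override
        public SketchType mapValueType(SketchType mapType) {
          return mapType.valueType;
        }
      };

  public static void main(String[] args) {
    SketchType mapType = new SketchType("map", new SketchType("long", null));
    System.out.println(mapValuePartner(ACCESSOR, mapType).name);
    System.out.println(mapValuePartner(ACCESSOR, null) == null);
  }
}
```

With this shape, each accessor such as mapValueType can assume a non-null argument, and the null handling lives in one place.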

rdblue · 2020-07-26T22:28:33Z

core/src/main/java/org/apache/iceberg/avro/AvroWithPartnerSchemaVisitor.java

+ * - For writing, the avro schema should be consistent with partner type.
+ *
+ * @param <P> Partner type.
+ * @param <T> Return T.


I like that this standardizes the logic to traverse a schema with a partner, but I don't think that it makes sense to mix the two cases together into a single class for a few reasons:

The meaning of schemaEvolution is hard to understand. The case where the schemas must have the same structure is more related to when we don't have IDs for one type of schema, like Spark. In that case, we rely on the structure matching exactly and are guaranteed that because both schemas are derived from the same Iceberg schema. We prefer to match up schemas by ID, even for the write path, but require it for the read path (because of evolution as you correctly noted).

It isn't clear which methods should be implemented for a visitor. Even if we added documentation, that's not going to be as easy to understand as having two types, one for traversing by IDs and the other for traversing by structure.

There isn't much benefit to sharing because record visiting is very different between cases. The union, array, and map methods aren't very complicated.

I think it would be better to have the two cases broken out into AvroWithPartnerByIDVisitor and AvroWithPartnerByStructureVisitor. Then it is clear what needs to be implemented in both cases and there is no schemaEvolution flag.

+1
I also struggled with this for a long time; forcing the two cases together is more confusing.
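
To make the proposed split concrete, here is a rough sketch with one base type per traversal style. `SketchField`, `ByIdTraversal`, `ByStructureTraversal`, and `TraversalDemo` are hypothetical names, the partner is reduced to a plain type-name string, and the struct argument is omitted for brevity. It is not the code in this PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for an Avro record field that carries a field ID.
final class SketchField {
  final int id;
  final String name;

  SketchField(int id, String name) {
    this.id = id;
    this.name = name;
  }
}

// ID-based traversal: the only struct hook is a lookup by field ID.
abstract class ByIdTraversal<P> {
  // May return null when the partner has no field with this ID.
  abstract P fieldType(int fieldId);

  List<P> partnersFor(List<SketchField> avroFields) {
    List<P> partners = new ArrayList<>();
    for (SketchField field : avroFields) {
      partners.add(fieldType(field.id));
    }
    return partners;
  }
}

// Structure-based traversal: the partner has no IDs, so position and name must match.
abstract class ByStructureTraversal<P> {
  abstract String fieldName(int pos);

  abstract P fieldType(int pos);

  List<P> partnersFor(List<SketchField> avroFields) {
    List<P> partners = new ArrayList<>();
    for (int pos = 0; pos < avroFields.size(); pos++) {
      String avroName = avroFields.get(pos).name;
      if (!avroName.equals(fieldName(pos))) {
        throw new IllegalArgumentException(
            "Structs do not match: field " + fieldName(pos) + " != " + avroName);
      }
      partners.add(fieldType(pos));
    }
    return partners;
  }
}

final class TraversalDemo {
  static ByIdTraversal<String> byId(final Map<Integer, String> typesById) {
    return new ByIdTraversal<String>() {
      @Override
      String fieldType(int fieldId) {
        return typesById.get(fieldId);
      }
    };
  }

  static ByStructureTraversal<String> byStructure(
      final List<String> names, final List<String> types) {
    return new ByStructureTraversal<String>() {
      @Override
      String fieldName(int pos) {
        return names.get(pos);
      }

      @Override
      String fieldType(int pos) {
        return types.get(pos);
      }
    };
  }

  public static void main(String[] args) {
    // The file lists "data" before "id"; ID lookup is unaffected by field order.
    List<SketchField> fileFields =
        Arrays.asList(new SketchField(2, "data"), new SketchField(1, "id"));
    Map<Integer, String> typesById = new HashMap<>();
    typesById.put(1, "long");
    typesById.put(2, "string");
    System.out.println(byId(typesById).partnersFor(fileFields));
    System.out.println(
        byStructure(Arrays.asList("data", "id"), Arrays.asList("string", "long"))
            .partnersFor(fileFields));
  }
}
```

Each base type exposes only the hooks its traversal needs, so an implementer never has to guess which methods apply, and no schemaEvolution flag is required.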

rdblue · 2020-07-26T22:31:36Z

core/src/main/java/org/apache/iceberg/avro/AvroWithPartnerSchemaVisitor.java

+    throw new UnsupportedOperationException();
+  }
+
+  public P[] structFieldTypes(P structType) {


Arrays are a bit difficult to work with. For the structure-based traversal, how about combining structFieldNames and structFieldTypes into a single method indexed by position in the struct?

Pair<String, P> fieldNameAndType(P structType, int pos);

That corresponds to the fieldType(int id) for the id-based lookup.
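
A small sketch of how the positional accessor could be used, assuming a toy partner type. `SimplePair`, `RowPartner`, `PositionalAccessorSketch`, and `visitFields` are hypothetical; only the `fieldNameAndType(structType, pos)` shape comes from the suggestion above. The name check compares raw names and skips the `AvroSchemaUtil.makeCompatibleName` step used in the diff.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal stand-in for a pair type; real code would reuse an existing Pair class.
final class SimplePair<A, B> {
  final A first;
  final B second;

  SimplePair(A first, B second) {
    this.first = first;
    this.second = second;
  }
}

// Hypothetical partner type: a named type with optional nested fields.
final class RowPartner {
  final String typeName;
  final List<String> fieldNames;
  final List<RowPartner> fieldTypes;

  RowPartner(String typeName) {
    this(typeName, new ArrayList<String>(), new ArrayList<RowPartner>());
  }

  RowPartner(String typeName, List<String> fieldNames, List<RowPartner> fieldTypes) {
    this.typeName = typeName;
    this.fieldNames = fieldNames;
    this.fieldTypes = fieldTypes;
  }
}

final class PositionalAccessorSketch {
  // One positional accessor replaces the two array-returning methods.
  static SimplePair<String, RowPartner> fieldNameAndType(RowPartner structType, int pos) {
    return new SimplePair<>(structType.fieldNames.get(pos), structType.fieldTypes.get(pos));
  }

  // Walks the Avro field names in order and collects the matching partner type names.
  static List<String> visitFields(List<String> avroFieldNames, RowPartner structType) {
    List<String> results = new ArrayList<>();
    for (int pos = 0; pos < avroFieldNames.size(); pos++) {
      SimplePair<String, RowPartner> nameAndType = fieldNameAndType(structType, pos);
      String avroName = avroFieldNames.get(pos);
      if (!nameAndType.first.equals(avroName)) {
        throw new IllegalArgumentException(
            String.format("Structs do not match: field %s != %s", nameAndType.first, avroName));
      }
      results.add(nameAndType.second.typeName);
    }
    return results;
  }

  public static void main(String[] args) {
    RowPartner row =
        new RowPartner(
            "row",
            Arrays.asList("id", "data"),
            Arrays.asList(new RowPartner("long"), new RowPartner("string")));
    System.out.println(visitFields(Arrays.asList("id", "data"), row));
  }
}
```

The traversal never materializes an array of field types, and the name and type for a position always come from the same call.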

rdblue · 2020-07-26T22:34:01Z

Thanks @JingsongLi! This looks like a good thing to do, but I would keep the two cases (id- or structure-based traversal) separate to simplify implementations and make it easier to read.

JingsongLi · 2020-07-28T07:09:11Z

> Thanks @JingsongLi! This looks like a good thing to do, but I would keep the two cases (id- or structure-based traversal) separate to simplify implementations and make it easier to read.

Thanks @rdblue for your review. I think we can keep AvroSchemaWithTypeVisitor as it is (it has no other implementation) and introduce AvroWithPartnerByStructureVisitor for the Flink implementation.

JingsongLi mentioned this pull request Jul 23, 2020

Flink: Using RowData to avro reader and writer #1231

Closed

openinx reviewed Jul 24, 2020

rdblue reviewed Jul 26, 2020

avro: Abstract AvroWithPartnerSchemaVisitor

449688f

JingsongLi force-pushed the AvroWithPartnerSchemaVisitor branch from f1fd7d1 to 9795d36 Compare July 28, 2020 07:09

Address Ryan's comments

24cf7cd

JingsongLi force-pushed the AvroWithPartnerSchemaVisitor branch from 9795d36 to 24cf7cd Compare July 28, 2020 08:09

rdblue merged commit 18d52e0 into apache:master Jul 28, 2020

rdblue mentioned this pull request Jul 29, 2020

Flink: Using RowData to avro reader and writer #1232

Merged

rdblue pushed a commit to rdblue/iceberg that referenced this pull request Jul 29, 2020

Avro: Extract AvroWithPartnerSchemaVisitor base visitor (apache#1235)

2774324

cmathiesen pushed a commit to ExpediaGroup/iceberg that referenced this pull request Aug 19, 2020

Avro: Extract AvroWithPartnerSchemaVisitor base visitor (apache#1235)

8f0f957

JingsongLi deleted the AvroWithPartnerSchemaVisitor branch November 5, 2020 09:42
