@JingsongLi
Contributor

Abstract AvroWithPartnerSchemaVisitor to extract the Avro-specific logic from AvroSchemaWithTypeVisitor and AvroWithSparkSchemaVisitor.

Schema.Field field = fields.get(i);
Preconditions.checkArgument(AvroSchemaUtil.makeCompatibleName(fieldName).equals(field.name()),
"Structs do not match: field %s != %s", fieldName, field.name());
results.add(visit(fieldTypes[i], field.schema(), visitor));
Member

So the only difference between this sub-field visit and the previous one is the method used to get the data type of the inner field. I don't think we need both structFieldTypeById and structFieldTypes. Is it possible to abstract them into one method, so that we could remove the if (visitor.schemaEvolution()) {...} else {...} branch?

Contributor Author

The core difference is matching up schemas by ID or not.

return null;
@Override
public Type mapValueType(Type mapType) {
return mapType == null ? null : mapType.asMapType().valueType();
Contributor

If this defined isNullType, then the visit method could do these checks instead:

  if (isNullType(mapType)) {
    return nullType();
  } else {
    return mapValueType(mapType);
  }
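
To make the suggestion concrete, here is a minimal, self-contained sketch of hoisting the null check into the traversal so that the accessor no longer needs its own `mapType == null` guard. The class names and the string encoding of map types are hypothetical and only for illustration; `isNullType` and `nullType` are the methods proposed above, not existing API.

```java
// Hypothetical sketch: P is the partner type, as in the visitor under review.
abstract class NullAwareVisitorSketch<P> {
  // Proposed hooks: the visitor decides what "no partner type" looks like.
  abstract boolean isNullType(P type);

  abstract P nullType();

  // May assume mapType is present, because the traversal checks first.
  abstract P mapValueType(P mapType);

  // The traversal performs the check once, for every implementation.
  final P valueTypeOf(P mapType) {
    if (isNullType(mapType)) {
      return nullType();
    } else {
      return mapValueType(mapType);
    }
  }
}

// Toy implementation (assumption): map types are strings like "map<string,long>".
final class ToyNullAwareVisitor extends NullAwareVisitorSketch<String> {
  @Override
  boolean isNullType(String type) {
    return type == null;
  }

  @Override
  String nullType() {
    return null;
  }

  @Override
  String mapValueType(String mapType) {
    // Take the text between the comma and the closing angle bracket.
    return mapType.substring(mapType.indexOf(',') + 1, mapType.length() - 1);
  }
}
```

With this shape, each accessor stays a one-liner and the null handling lives in a single place.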

* - For writing, the avro schema should be consistent with partner type.
*
* @param <P> Partner type.
* @param <T> Return T.
Contributor

I like that this standardizes the logic to traverse a schema with a partner, but I don't think that it makes sense to mix the two cases together into a single class for a few reasons:

  • The meaning of schemaEvolution is hard to understand. The case where the schemas must have the same structure is more related to when we don't have IDs for one type of schema, like Spark. In that case, we rely on the structure matching exactly and are guaranteed that because both schemas are derived from the same Iceberg schema. We prefer to match up schemas by ID, even for the write path, but require it for the read path (because of evolution as you correctly noted).
  • It isn't clear which methods should be implemented for a visitor. Even if we added documentation, that's not going to be as easy to understand as having two types, one for traversing by IDs and the other for traversing by structure.
  • There isn't much benefit to sharing because record visiting is very different between cases. The union, array, and map methods aren't very complicated.

I think it would be better to have the two cases broken out into AvroWithPartnerByIDVisitor and AvroWithPartnerByStructureVisitor. Then it is clear what needs to be implemented in both cases and there is no schemaEvolution flag.
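
As a rough illustration of why the two cases differ, the sketch below contrasts the two matching strategies on record fields. It uses hypothetical stand-in classes (`FileField`, `PartnerField`, `PartnerMatching`) rather than real Avro or Iceberg types, and partner types are plain strings, so it only shows the shape of each traversal.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a field in the file (Avro-like) record schema.
final class FileField {
  final String name;
  final Integer id; // null when the schema carries no field IDs

  FileField(String name, Integer id) {
    this.name = name;
    this.id = id;
  }
}

// Hypothetical stand-in for a field of the partner struct.
final class PartnerField {
  final String name;
  final int id;
  final String type;

  PartnerField(String name, int id, String type) {
    this.name = name;
    this.id = id;
    this.type = type;
  }
}

final class PartnerMatching {
  private PartnerMatching() {
  }

  // ID-based: order and names may differ; a file field without a partner gets null.
  static List<String> matchById(List<FileField> file, List<PartnerField> partner) {
    Map<Integer, String> typesById = new HashMap<>();
    for (PartnerField field : partner) {
      typesById.put(field.id, field.type);
    }
    List<String> result = new ArrayList<>();
    for (FileField field : file) {
      result.add(typesById.get(field.id));
    }
    return result;
  }

  // Structure-based: both sides must have the same fields in the same order.
  static List<String> matchByStructure(List<FileField> file, List<PartnerField> partner) {
    if (file.size() != partner.size()) {
      throw new IllegalArgumentException("Structs do not match: different field counts");
    }
    List<String> result = new ArrayList<>();
    for (int i = 0; i < file.size(); i++) {
      String expected = partner.get(i).name;
      String actual = file.get(i).name;
      if (!expected.equals(actual)) {
        throw new IllegalArgumentException(
            String.format("Structs do not match: field %s != %s", expected, actual));
      }
      result.add(partner.get(i).type);
    }
    return result;
  }
}
```

The ID-based path tolerates reordered or missing fields, which is what evolution on the read path needs, while the structure-based path fails fast on any mismatch.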

Contributor Author

+1
I also struggled with this for a long time; forcing them together is more confusing.

throw new UnsupportedOperationException();
}

public P[] structFieldTypes(P structType) {
Contributor

Arrays are a bit difficult to work with. For the structure-based traversal, how about combining structFieldNames and structFieldTypes into a single method indexed by position in the struct?

  Pair<String, P> fieldNameAndType(P structType, int pos);

That corresponds to the fieldType(int id) for the id-based lookup.
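
A sketch of how the record traversal might look with that single positional method. `Map.Entry` stands in for `Pair` to keep the example free of dependencies, and the class names and the string encoding of struct types are assumptions made for illustration. The real traversal would also apply `AvroSchemaUtil.makeCompatibleName` before comparing names, which is omitted here.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a structure-based visitor: P is the partner type, T the result.
abstract class ByStructureVisitorSketch<P, T> {
  // One positional accessor replaces structFieldNames and structFieldTypes.
  abstract Map.Entry<String, P> fieldNameAndType(P structType, int pos);

  // Called for each matched field; stands in for visiting the field's schema.
  abstract T field(String name, P fieldType);

  List<T> visitRecord(List<String> fileFieldNames, P structType) {
    List<T> results = new ArrayList<>();
    for (int pos = 0; pos < fileFieldNames.size(); pos++) {
      Map.Entry<String, P> nameAndType = fieldNameAndType(structType, pos);
      String fileName = fileFieldNames.get(pos);
      if (!nameAndType.getKey().equals(fileName)) {
        throw new IllegalArgumentException(
            String.format("Structs do not match: field %s != %s",
                nameAndType.getKey(), fileName));
      }
      results.add(field(fileName, nameAndType.getValue()));
    }
    return results;
  }
}

// Toy partner (assumption): a struct is encoded as "name:type,name:type".
final class ToyByStructureVisitor extends ByStructureVisitorSketch<String, String> {
  @Override
  Map.Entry<String, String> fieldNameAndType(String structType, int pos) {
    String[] nameAndType = structType.split(",")[pos].split(":");
    return new SimpleImmutableEntry<>(nameAndType[0], nameAndType[1]);
  }

  @Override
  String field(String name, String fieldType) {
    return name + " -> " + fieldType;
  }
}
```

The traversal never handles an array of partner types; it asks for one name and type per position, which mirrors the per-ID lookup in the other visitor.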

@rdblue
Contributor

rdblue commented Jul 26, 2020

Thanks @JingsongLi! This looks like a good thing to do, but I would keep the two cases (id- or structure-based traversal) separate to simplify implementations and make it easier to read.

@JingsongLi
Contributor Author

> Thanks @JingsongLi! This looks like a good thing to do, but I would keep the two cases (id- or structure-based traversal) separate to simplify implementations and make it easier to read.

Thanks @rdblue for your review. I think we can keep AvroSchemaWithTypeVisitor as it is (there is no other implementation) and introduce AvroWithPartnerByStructureVisitor for the Flink implementation.

@JingsongLi JingsongLi force-pushed the AvroWithPartnerSchemaVisitor branch from f1fd7d1 to 9795d36 Compare July 28, 2020 07:09
@JingsongLi JingsongLi force-pushed the AvroWithPartnerSchemaVisitor branch from 9795d36 to 24cf7cd Compare July 28, 2020 08:09
@rdblue rdblue merged commit 18d52e0 into apache:master Jul 28, 2020
rdblue pushed a commit to rdblue/iceberg that referenced this pull request Jul 29, 2020
cmathiesen pushed a commit to ExpediaGroup/iceberg that referenced this pull request Aug 19, 2020
@JingsongLi JingsongLi deleted the AvroWithPartnerSchemaVisitor branch November 5, 2020 09:42