-
Notifications
You must be signed in to change notification settings - Fork 3k
avro: Abstract AvroWithPartnerSchemaVisitor #1235
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
avro: Abstract AvroWithPartnerSchemaVisitor #1235
Conversation
| Schema.Field field = fields.get(i); | ||
| Preconditions.checkArgument(AvroSchemaUtil.makeCompatibleName(fieldName).equals(field.name()), | ||
| "Structs do not match: field %s != %s", fieldName, field.name()); | ||
| results.add(visit(fieldTypes[i], field.schema(), visitor)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the difference between this sub field visit and the previous sub field visit is: we use the different methods to get the data type of inner field. I think we don't need both the structFieldTypeById and structFieldTypes, is it possible to abstract them to be one method and then we could remove the if (visitor.schemaEvolution()) {} else {...} finally ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The core difference is matching up schemas by ID or not.
| return null; | ||
| @Override | ||
| public Type mapValueType(Type mapType) { | ||
| return mapType == null ? null : mapType.asMapType().valueType(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this defined isNullType, then the visit method could do these checks instead:
if (isNullType(mapType)) {
return nullType();
} else {
return mapValueType(mapType);
}| * - For writing, the avro schema should be consistent with partner type. | ||
| * | ||
| * @param <P> Partner type. | ||
| * @param <T> Return T. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that this standardizes the logic to traverse a schema with a partner, but I don't think that it makes sense to mix the two cases together into a single class for a few reasons:
- The meaning of
schemaEvolutionis hard to understand. The case where the schemas must have the same structure is more related to when we don't have IDs for one type of schema, like Spark. In that case, we rely on the structure matching exactly and are guaranteed that because both schemas are derived from the same Iceberg schema. We prefer to match up schemas by ID, even for the write path, but require it for the read path (because of evolution as you correctly noted). - It isn't clear which methods should be implemented for a visitor. Even if we added documentation, that's not going to be as easy to understand as having two types, one for traversing by IDs and the other for traversing by structure.
- There isn't much benefit to sharing because record visiting is very different between cases. The union, array, and map methods aren't very complicated.
I think it would be better to have the two cases broken out into AvroWithPartnerByIDVisitor and AvroWithPartnerByStructureVisitor. Then it is clear what needs to be implemented in both cases and there is no schemaEvolution flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
I also tangled for a long time, it is more confusing to put them together by force.
| throw new UnsupportedOperationException(); | ||
| } | ||
|
|
||
| public P[] structFieldTypes(P structType) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arrays are a bit difficult to work with. For the structure-based traversal, how about combining structFieldNames and structFieldTypes into a single method indexed by position in the struct?
Pair<String, P> fieldNameAndType(P structType, int pos);That corresponds to the fieldType(int id) for the id-based lookup.
|
Thanks @JingsongLi! This looks like a good thing to do, but I would keep the two cases (id- or structure-based traversal) separate to simplify implementations and make it easier to read. |
Thanks @rdblue for your review, I think we can keep |
f1fd7d1 to
9795d36
Compare
9795d36 to
24cf7cd
Compare
Abstract
AvroWithPartnerSchemaVisitorto extract specific avro logical fromAvroSchemaWithTypeVisitorandAvroWithSparkSchemaVisitor.