Skip to content

Conversation

@pvary
Copy link
Contributor

@pvary pvary commented Dec 1, 2020

Implements the Deserializer for Hive writes, and creates the corresponding test suite

@pvary
Copy link
Contributor Author

pvary commented Dec 2, 2020

@marton-bod, @lcspinter: Could you please review?
Thanks,
Peter

Copy link
Collaborator

@marton-bod marton-bod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally looks good, just a few minor questions

@pvary
Copy link
Contributor Author

pvary commented Dec 2, 2020

@rdblue, @shardulm94: Could you please review if you have some time? This is a part of #1407 (Hive: HiveIcebergOutputFormat first implementation for handling Hive inserts into unpartitioned Iceberg tables)

Thanks!
Peter

Object value(Object object);
}

private static FieldDeserializer deserializer(Type type, ObjectInspector fieldInspector) throws SerDeException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the other modules use the visitor pattern to traverse a type. That keeps the logic for traversing a schema in just one place so you don't need to mix it in with your domain-specific code.

It looks like this method is called from with the StructDeserializer constructor, so there is a recursive traversal of the schema that goes back and forth between object constructors and this method. That's a bit hard to follow, so I think it would be simpler if you took the visitor approach.

A GenericAvroReader is similar to what you're building here, so take a look at that as an example: https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/avro/GenericAvroReader.java

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I moved to the Visitor pattern. The the code is quite nice, and compact 😄

Hive struct names are not matched to the Iceberg struct name fields in case of the main struct which contains the result columns of a query. In this case we have the following col names: 0:col1, 1:col2 and they should be matched by the positions of the fields and not by the names.

Handling this will add more complexity. I propose to handle this in the next PR. What do you think?

PrimitiveObjectInspectorFactory.writableLongObjectInspector,
PrimitiveObjectInspectorFactory.writableStringObjectInspector
));

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: double whitespace.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we use 2 spaces for Continuation indent as well?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Continuation indents are 2 indents, which are 4 spaces. What you have here looks correct to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then I am not sure what was the comment about. 😢

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was an extra newline.

@rdblue
Copy link
Contributor

rdblue commented Dec 5, 2020

Thanks, @pvary! This looks great so far. I had a few comments, but it looks like a good improvement!

return null;
}

List<Object> result = new ArrayList<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: prefer Lists.newArrayList().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember on some other PR got a review from someone else to avoid unnecessary guava uses.
I am find with both, but I think we should stick to one or the other. Shall it be Lists.newArrayList then?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want to bring in any additional Guava classes if possible, but using the ones that are already there is a good thing. For this case, we can easily replace map class implementations with an import change later.

}

private static class PartnerObjectInspectorByNameAccessors
implements SchemaWithPartnerVisitor.PartnerAccessors<ObjectInspector> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: This could probably be a singleton.

* wrapper around the ObjectInspector. This wrapper uses the Iceberg schema column names instead of the Hive column
* names for {@link #getStructFieldRef(String) getStructFieldRef}
*/
private static class FixNameMappingObjectInspector extends StructObjectInspector {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't all of the structs need to be wrapped by this?

@rdblue rdblue merged commit ddba7fc into apache:master Dec 8, 2020
@rdblue
Copy link
Contributor

rdblue commented Dec 8, 2020

I don't have any major concerns about this, so I'm going to merge it to unblock the other work. Thanks @pvary!

It would be nice to figure out a better way to traverse the object inspector tree that doesn't require the FixNameMappingObjectInspector. I'm not quite sure what that is, but I think it might be a matter of converting the name into a position and accessing the object inspectors by position. That can be done later, though.

@pvary
Copy link
Contributor Author

pvary commented Dec 9, 2020

Thanks @rdblue for the review and the merge!

In Hive you always need to provide the name-value pair for structs. AFAIK the only exception is the Schema level struct.
I have been thinking about 2 other solutions:

  • Move field matching to be based on the ordering of the fields. We can do it by adding a new input to parameter to SchemaWithPartnerVisitor<P, R>.PartnerAccessors<P>.fieldPartner(P partnerStruct, int fieldId, String name, int fieldOrderId). I do not like this because I do not feel that Struct fields should have specific ordering. It could work, because we are the ones who generate the ObjectInspectors, but still...
  • Add the field name to the ObjectInspectors themselves when generating them for the Schema. I do not like this because then we will have specific ObjectInspector instances we have to create and drop.

So this is why I have kept the current one which has its own drawbacks, so I am happy to go for a more general solution if we find one.

@pvary pvary deleted the deserializer branch January 7, 2021 08:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants