-
Notifications
You must be signed in to change notification settings - Fork 3k
API: Add Variant data type #11324
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: Add Variant data type #11324
Conversation
3a3282d to
15d8ed6
Compare
15d8ed6 to
15b2b71
Compare
ca226ee to
1d3056b
Compare
1d3056b to
d4af2b3
Compare
| case FIXED: | ||
| return ((Types.FixedType) type).length(); | ||
| case BINARY: | ||
| case VARIANT: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the rationale for this size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't have the accurate size for Variant similar to Binary. So I use the same value as Binary. I'm wondering how we come up with 80 for Binary.
| return sanitizeNumber(((Literals.DoubleLiteral) literal).value(), "float"); | ||
| } else { | ||
| // for uuid, decimal, fixed, and binary, match the string result | ||
| // for uuid, decimal, fixed, variant, and binary, match the string result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is okay but we may be missing information by not sanitizing based on the variant's type (i.e. date) and it would be nice to have some idea of the structure in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you suggesting something like
{
foo:
bar: 3
baz:
bozz: "flew"
}
to
{
(hash-foo):
(hash-bar) : (1 digit number)
(hash-baz) :
(hash-bozz) : (hash-xaxa)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that would be a nice feature but probably ok for it's own issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for helping me understand the concept. I have filed a followup issue #11479 and will work on separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually to support this, we need to have defined the interface VariantLike on how to access the subfields and then recursively sanitize the values based on the subfield types.
When we promote the VariantLike to API, then we can add such implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I don't think this needs to be done right away, but being able to have some representation that goes into the variant types would be great.
d4af2b3 to
5c07698
Compare
| } | ||
|
|
||
| @Test | ||
| public void testV3VariantTypeSupport() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is copying a lot of tests in this class but we should start future proofing a bit more imho. We also have tests around this sort of thing in TestSchema.java. I think it is probably ok to just keep all of our schema validation tests there, but it wouldn't hurt to have some redundancy here as well.
Refactoring the whole suite can come in another pr but I think we should build a templated test that's something like
@ParameterizedTest
@ValueSource(types = {Types.TimetstampNanos, Types.Variant, ....})
testTypeSupport(Type type) {
Schema schemaWithType = new Schema(
Types.NestedField.required(1, "id", Types.LongType.get()),
Types.NestedField.optional(2, type.name, type),
Types.NestedField.optional(3, "arr", Types.ListType.ofRequired(4, type)),
Types.NestedField.required(5, "struct",
Types.StructType.of(
Types.NestedField.optional(6, "inner_" + type.name, type),
Types.NestedField.required(7, "data", Types.StringType.get()))),
Types.NestedField.optional(8, "struct_arr",
Types.StructType.of(
Types.NestedField.optional(9, "ts", type))));
//Psuedo code here
from 0 -> MIN_FORMAT_VERSION.get(type)
fail to make metadata
from MIN_FORMAT_VERSION.get(type) -> SUPPORTED_TABLE_VERSION
succeed
}The most important part about this is that we wouldn't have to continually update tests every time a new valid metadata version is added. It also would be much easier to test type compatibility. (I'm thinking that Geo is going to need the exact same thing soon)
For this PR I think it is enough to write a parameterized version for just Variant, then we could raise another PR to add in nanos and remove the redundant tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not this is part of my goal to remove all tests that have V3 or V2 in their title.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also todo add Variant to Schema.MIN_FORMAT_VERSION
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @RussellSpitzer for already creating a PR to address that. I will include that when it's merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for moving this to TestSchema. The test here should really just be validating that schemas are validated against the table version, but the actual validation tests should be in TestSchema.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets just drop this test then (or add a single variant field into the above test) and just change the above test to be called testV3SchemaValidation or something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the actual validation for both variant and timestamp_ns since it's already covered in TestSchema. I added version check/validation coverage for TableMetadata.newTableMetadata().
14665cb to
ca2b5de
Compare
| @Deprecated | ||
| public static <I> Identity<I> get(Type type) { | ||
| Preconditions.checkArgument( | ||
| type.typeId() != Type.TypeID.VARIANT, "Unsupported type for identity: %s", type); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we also need to make sure that canTransform doesn't accept variants. We could special case it, but I'm actually wondering if it makes sense for variant to be a primitive type because it is only primitive sometimes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do only need to special case it if we make Variant a primitive. I'm not sure what primitive really means in this context though. I think anything that may or may not be scalar is probably not a primitive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have the test coverage in TestBucketing.java TestIdentity.java to make canTransform doesn't accept variants.
| } | ||
| } | ||
|
|
||
| public static class VariantType extends PrimitiveType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that variant should be either a primitive type or a nested type. There are at least a few places where making variant a primitive creates problems. For example, Schema checks that a field is a primitive to allow it as an identifier field and the Identity transform will bind to a type if it is a primitive. There are also places that check whether a type is nested and make decisions based on whether there are sub-fields.
Neither one of those seems appropriate to me so I'd suggest making this extend just Type. Then we can adjust the logic in those places without assuming either one and potentially introducing bugs when the assumption is bad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have isPrimitiveType() to return false. I was debating on that - if VariantType extends PrimitiveType, some internal change would be easier. But concept-wise t should be from Type directly and let me change it.
| } | ||
|
|
||
| @Test | ||
| public void testVariantUnsupported() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about other transforms? None of them are allowed, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right. I have the coverage in TestIdentity.java and TestBucketing.java and I think I have checked all the transforms.
|
|
||
| @Override | ||
| public boolean isPrimitiveType() { | ||
| return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see that although this extends PrimitiveType, it reports that it is not one. I think it would be safer not to extend PrimitiveType. Good that this is already used to get the tests passing.
… full implementation. Block Transforms, SortOrder
ca2b5de to
5eebb6e
Compare
RussellSpitzer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, I think we could clean up TestTableMetadata more (i left a comment there) but we can always push that out as well. I mean to take a pass on cleaning some of that up later as well.
9b77a5f to
4050ac0
Compare
4050ac0 to
1e2fc66
Compare
|
@rdblue I think you are the last reviewer on this, do you have any further comments for this one? |
| } | ||
| } | ||
|
|
||
| public static class VariantType implements Type { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aihuaxu, could you also implement the same Java serialization that we use for PrimitiveType? The existing PrimitiveHolder should work:
Object writeReplace() throws ObjectStreamException {
return new PrimitiveHolder(toString());
}
|
Late +1 from me! Thanks @aihuaxu! |
Add Variant data type to API module. We only add limited required type to API for now and we will promote stable methods to API after full implementation to the interface
VariantLikelater.Fix #11178.