KAFKA-8885: The Kafka Protocol should Support Optional Tagged Fields #7325
hachikuji merged 19 commits into apache:trunk
WithClientID was annoying because it required specifying a string pointer; instead, this adds a second opt to nil the client ID if necessary. Doing so is likely highly uncommon. Also adds the correct encoding for controlled shutdown v0; noticed when scanning apache/kafka#7325.
* "string": a string. Strings are serialized as a length followed by the
contents as UTF-8. The contents must be less than 64kb in size. In
non-flexible versions, the string length will always be 2 bytes. In flexible
versions, the length is a variable-length integer.
KIP-482 indicates that flexible versions will actually use unsigned variable length numbers, offset by 1 to reserve 0 indicating null. If that's the case, this should change from variable-length integer, which implies the old zig-zag varint, to unsigned variable-length integer offset by 1, with 0 indicating null (or something like that).
Also, it might be worth mentioning that the numbers are always offset by one, even if the field is non-nullable.
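To make the encoding described in this comment concrete, here is a minimal sketch of a string written with an unsigned variable-length length that is offset by one, with 0 reserved for null. The class and method names are hypothetical and are not taken from the Kafka code base; the layout follows the description of KIP-482 given in the comment above.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of the Kafka code base.
final class CompactStringSketch {
    // Writes an unsigned variable-length integer: 7 bits per byte, low bits
    // first, with the high bit set on every byte except the last.
    static void writeUnsignedVarint(int value, ByteArrayOutputStream out) {
        while ((value & 0xFFFFFF80) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Encodes a string as (length + 1) followed by its UTF-8 bytes.
    // A null string is encoded as the single byte 0.
    static byte[] encodeCompactString(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (s == null) {
            writeUnsignedVarint(0, out);
        } else {
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            writeUnsignedVarint(utf8.length + 1, out);
            out.write(utf8, 0, utf8.length);
        }
        return out.toByteArray();
    }
}
```

With this sketch, the empty string encodes as the single byte 1 and "ab" encodes as 3 followed by the two UTF-8 bytes, which shows why the offset applies even when a field is non-nullable.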
I'm going to split the documentation part off into a separate PR, since there are some tricky questions about what should go in what documentation section. Let's discuss it there.
Hi all, I have split this PR up into several other PRs to be more reviewable, including #7372, #7344, #7340, and more to come. I'm going to leave this one up for now so that people can see the bigger context of some of the changes, however. Maybe I will rebase this one and use it for the last part once the other ones are in. Thanks.
//
// Version 3 is the first flexible version.
"validVersions": "0-3",
"flexibleVersions": "3+",
Leaving the comment here because I haven't found a better place. It would be great if you could add tests in RequestResponseTest#testSerialization to cover all the versions which have been bumped.
Hmm. We don't generally do exhaustive testing of all versions in RequestResponseTest. There would be a lot of entries! But I agree we should add more stuff in RequestResponseTest. Let's think about it in a follow-on PR.
+1 for additional tests. Testing every version is not crazy. We have been bitten a few times already due to untested version bumps.
ijuma left a comment
Thanks for the PR. I took an initial pass and left some high level questions and a couple of nits.
default List<RawTaggedField> readRawTaggedField(List<RawTaggedField> unknowns, int tag, int size) {
    if (unknowns == null) {
        unknowns = new ArrayList<>();
    }
Are we doing this to avoid the allocation of unknowns unless there is at least one unknown?
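The pattern the question refers to can be sketched in isolation. The following is a minimal illustration of lazy allocation, assuming that is the intent of the snippet; the thread does not confirm it. `RawField` and `addUnknown` are hypothetical stand-ins, not the real `RawTaggedField` or `readRawTaggedField`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of lazily allocating the list of unknown fields.
final class LazyUnknownsSketch {
    // Stand-in for a raw tagged field: a tag plus its undecoded bytes.
    static final class RawField {
        final int tag;
        final byte[] data;

        RawField(int tag, byte[] data) {
            this.tag = tag;
            this.data = data;
        }
    }

    // Returns the list the caller should keep using. The list is created
    // only when the first unknown field is seen, so messages with no
    // unknown fields never allocate one.
    static List<RawField> addUnknown(List<RawField> unknowns, int tag, byte[] data) {
        if (unknowns == null) {
            unknowns = new ArrayList<>();
        }
        unknowns.add(new RawField(tag, data));
        return unknowns;
    }
}
```

A caller starts with `unknowns = null` and reassigns it from the return value on every call, so the same list instance is reused after the first allocation.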
}

@Test
public void testInvalidFieldName() {
It would be helpful to indicate what is invalid about the name. Is it the underscore at the start?
I added some JavaDoc
" { \"name\": \"_badName\", \"type\": \"[]int32\", \"versions\": \"0+\" }",
" ]",
"}")), MessageSpec.class);
fail("Expected MessageDataGenerator constructor to fail");
I would suggest using assertThrows in this and other tests added in this PR that validate that an exception is thrown.
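As a rough illustration of the suggestion, the sketch below shows the shape of an `assertThrows`-style check compared with the try/`fail` idiom. So that it runs without JUnit on the classpath, it defines a small local stand-in for `assertThrows`; in a real test you would use the JUnit method instead. `validateFieldName`, the leading-underscore rule, and the exception type are assumptions made for illustration and are not confirmed by this thread.

```java
// Hypothetical, self-contained illustration; not the actual test code.
final class AssertThrowsSketch {
    interface ThrowingRunnable {
        void run() throws Throwable;
    }

    // Local stand-in for JUnit's assertThrows: runs the body and returns the
    // exception if it has the expected type, otherwise fails.
    static <T extends Throwable> T assertThrows(Class<T> expected, ThrowingRunnable body) {
        try {
            body.run();
        } catch (Throwable t) {
            if (expected.isInstance(t)) {
                return expected.cast(t);
            }
            throw new AssertionError("Unexpected exception type: " + t.getClass().getName(), t);
        }
        throw new AssertionError("Expected " + expected.getName() + " to be thrown");
    }

    // Assumed validation rule: names starting with an underscore are rejected.
    static void validateFieldName(String name) {
        if (name.startsWith("_")) {
            throw new IllegalArgumentException("Invalid field name: " + name);
        }
    }

    public static void main(String[] args) {
        IllegalArgumentException e = assertThrows(
            IllegalArgumentException.class, () -> validateFieldName("_badName"));
        System.out.println(e.getMessage());
    }
}
```

Compared with calling `fail(...)` after the statement that should throw, this form keeps the expected exception type and the code under test in one expression and returns the exception for further assertions.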
 * A compact array represents its length with a varint rather than a
 * fixed-length field.
 */
public class CompactArrayOf extends DocumentedType {
We also talked about arrays of primitives having a compact representation where tags are not needed per element. How would we describe such arrays, are they packed arrays versus compact arrays?
They could be either ArrayOf or CompactArrayOf. We don't have a separate array type for arrays of objects vs. arrays of non-object types.
There are some compiler errors after the latest updates.
The compiler errors should be fixed now.
hachikuji left a comment
Thanks, looking good overall. I left a few comments.
Seems a bit messy to support different value types in the same map. Are we saving that much by not having separate maps?
The memory overhead of having a separate map would be pretty large in the common case where objects are small.
I couldn't find any uses for this code in any of the generated classes. Do we have test cases which exercise this logic?
The test message file SimpleExampleMessage.json contains a tagged array, which will use this logic. I will add a test that uses that field.
Maybe readUnknownTaggedField?
nit: use vararg constructor. A couple below as well
nit: would be nice to document the expected type of fields
Could probably use Collections.emptyList() here
the problem is that this field is mutable, and Collections.emptyList returns something immutable
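The point about mutability can be checked directly. The sketch below is a hypothetical helper, not code from the PR: it tries to append to a list and reports whether the list accepted the element.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical demonstration of why Collections.emptyList() cannot back a
// field that is later mutated.
final class EmptyListSketch {
    // Returns true if an element can be appended to the given list.
    static boolean canAppend(List<Integer> list) {
        try {
            list.add(1);
            return true;
        } catch (UnsupportedOperationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(canAppend(Collections.<Integer>emptyList())); // immutable
        System.out.println(canAppend(new ArrayList<Integer>()));         // mutable
    }
}
```

`Collections.emptyList()` returns a shared immutable list, so `add` throws `UnsupportedOperationException`, while a fresh `ArrayList` accepts the element.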
I wonder if you have given any thought to limiting allocations like this. For example, in the case of the byte array, we may be able to validate the size using the available bytes in the request
I do think we're kind of goofy to allow arrays with 2**31 elements. There must be a reasonable maximum we could set lower than that. But there will probably be some compatibility implications to this, so it will take time to impose a reasonable limit now....
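One way to read the suggestion about validating against the available bytes is sketched below. It is an illustration under assumptions, not the approach taken in the PR: it assumes a 4-byte length prefix and uses a hypothetical `readBytes` helper that rejects a declared length larger than what is left in the buffer before allocating anything.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of bounding an allocation by the bytes actually present.
final class BoundedReadSketch {
    // Reads a 4-byte length followed by that many bytes. The length is checked
    // against the remaining bytes first, so a corrupt or hostile length cannot
    // trigger a huge allocation.
    static byte[] readBytes(ByteBuffer buf) {
        int length = buf.getInt();
        if (length < 0 || length > buf.remaining()) {
            throw new IllegalArgumentException(
                "Declared length " + length + " is invalid with "
                    + buf.remaining() + " bytes remaining");
        }
        byte[] out = new byte[length];
        buf.get(out);
        return out;
    }
}
```

This check works for byte arrays because each element occupies exactly one byte. For arrays of larger elements, a similar bound would need a minimum per-element size, which is part of why a general limit has compatibility implications.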
Not a big deal, but there are a few cases where we could use a null check of _taggedField instead of a version check. Might make the generated code a little more readable.
It would be a bit complex to change now since we're also filtering versions that aren't present at all and so on
I think there's a bug in the handling of nullable arrays when the default is not null. For example, consider the following field:

    { "name": "field2", "type": "[]BlahType",
      "versions": "1+", "taggedVersions": "1+", "tag": 1,
      "nullableVersions": "1+",
      "fields": [
        { "name": "wootId", "versions": "1+", "type": "int32" },
      ]
    }

This results in the following code:

    if (_version >= 1) {
        if (!field2.isEmpty()) {
            if (field2 == null) {
                _taggedFields.put(1, null);
            } else {
                Struct[] _nestedObjects = new Struct[field2.size()];
                int i = 0;
                for (BlahType element : this.field2) {
                    _nestedObjects[i++] = element.toStruct(_version);
                }
                _taggedFields.put(1, _nestedObjects);
            }
        }
    }

The null check should come first. Seems like the default value optimization needs to take into account nullable values. The same bug affects size.
In general, we probably need more testing, especially for default value handling.
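The ordering problem can be shown with a small standalone sketch. This is not the fix that went into the code generator; it is a hypothetical method that puts the null check ahead of the default-value check, assuming the default is an empty list and using `Integer` elements in place of `BlahType`.

```java
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch of the corrected check order for a nullable tagged array.
final class NullableTaggedArraySketch {
    // Stores the field under tag 1 unless it equals the assumed default
    // (an empty list). Null is checked first, so calling isEmpty() can
    // never throw a NullPointerException.
    static void putField2(List<Integer> field2, TreeMap<Integer, Object> taggedFields) {
        if (field2 == null) {
            // Null differs from the empty-list default, so it must be written.
            taggedFields.put(1, null);
        } else if (!field2.isEmpty()) {
            taggedFields.put(1, field2.toArray());
        }
    }
}
```

With this ordering a null field is recorded explicitly, an empty list is skipped as the default, and a non-empty list is written. The same ordering would apply when computing the size.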
Thanks for finding this. It might be better to address it in a follow on, since the fix could get complicated. I'll push what I have for now.
* Rename ObjectSizeCache to ObjectSerializationCache
* Prefix readable, writable, and size with an underscore in the generated code to avoid conflicting with message fields that have these names
* Create MessageTestUtil

* Fix code generation for tagged array fields
* Rename TestUUID to SimpleExampleMessage and add some tests for tagged fields there
* Fix a bug in generating the code for tagged array fields
I filed some follow-on JIRAs:
Still at least one test failure. This one is reproducible locally:
retest this please
Failures are all flakes.
retest this please
Tests passed locally:
I will go ahead and merge. The failing tests are known to be flaky prior to this patch.
Thanks, @ijuma and @hachikuji! And all the other reviewers who helped with this
Edit: concern retracted; after consideration, I think that having tags on every struct level is fine.
Hi @twmb, thanks for looking at this. As you probably figured out (I see you edited your comment a bit), it's important to allow tagged fields to be added without a version bump. Otherwise we don't get a lot of the benefits of a flexible schema. This does require an extra byte per struct.

There was a lot of discussion about this on the mailing list. The discussion period was actually much longer than the implementation period and definitely was not done at the last minute. I looked for alternate solutions that didn't require the extra byte, but they were all very awkward and complex.

To counteract the extra space taken, we implemented more efficient serialization for strings, bytes, and arrays. In the common case where these fields are small, we save between 1 and 3 bytes per object. So if the objects in your hypothetical array of objects contain any of these things, the overhead is already cancelled out.

I agree that it is annoying that an

I hope this answers all the questions (and potential ones?) :) There is more discussion about this on the mailing list if you want to go in depth. I always appreciate feedback and I made a point of pulling in some Kafka client authors before this was finalized.
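The byte accounting in this reply can be worked through with a short sketch. It rests on assumptions: an empty tagged-field section is taken to be a single unsigned varint count of 0, the non-flexible string length is 2 bytes (as the documentation snippet earlier in the thread states), and the non-flexible array and bytes lengths are taken to be 4 bytes. The class and method names are hypothetical.

```java
// Hypothetical byte-accounting helper; not part of the Kafka code base.
final class FlexibleOverheadSketch {
    // Number of bytes an unsigned varint needs for a non-negative value.
    static int sizeOfUnsignedVarint(int value) {
        int bytes = 1;
        while ((value & 0xFFFFFF80) != 0) {
            bytes++;
            value >>>= 7;
        }
        return bytes;
    }

    // Bytes saved on one length prefix when a fixed-width prefix is replaced
    // by an unsigned varint holding (length + 1).
    static int prefixBytesSaved(int fixedPrefixBytes, int length) {
        return fixedPrefixBytes - sizeOfUnsignedVarint(length + 1);
    }

    public static void main(String[] args) {
        // Assumed cost of an empty tagged-field section: a count of 0.
        int extraPerStruct = sizeOfUnsignedVarint(0);
        // A 10-byte string: 2-byte prefix replaced by a varint.
        int stringSaving = prefixBytesSaved(2, 10);
        // A 10-element array: assumed 4-byte prefix replaced by a varint.
        int arraySaving = prefixBytesSaved(4, 10);
        System.out.println("extra per struct: " + extraPerStruct);
        System.out.println("saved per small string: " + stringSaving);
        System.out.println("saved per small array: " + arraySaving);
    }
}
```

Under these assumptions a struct pays 1 extra byte, a small string saves 1 byte, and a small array saves 3 bytes, which matches the "between 1 and 3 bytes" range quoted above. A struct containing at least one small string, bytes, or array field therefore breaks even or comes out smaller.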