add protobuf flattener, direct to plain java conversion for faster flattening #13519
Merged
clintropolis merged 9 commits into apache:master on Dec 9, 2022
…attening, nested column tests
imply-cheddar
approved these changes
Dec 7, 2022
@Override
public Object createMap()
{
  return new HashMap<>();
Contributor
Nit: LinkedHashMaps are nicer 'cause they maintain order on iteration. This makes things a lot nicer for like, toString() and other debug-style activities.
Member
Author
oops, I forgot to do this, will try to remember to do in a follow-up to not churn through ci again
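The ordering point in the review comment above can be illustrated with a minimal sketch. The class name `MapOrderDemo`, the helper `keysInIterationOrder`, and the field names are hypothetical and only serve the illustration; they are not part of this PR.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MapOrderDemo
{
  // Hypothetical helper: inserts the same three keys into the supplied map
  // and returns the keys in the order the map iterates them.
  static List<String> keysInIterationOrder(Map<String, Object> map)
  {
    map.put("timestamp", 1L);
    map.put("someOtherId", 2L);
    map.put("description", "text");
    return new ArrayList<>(map.keySet());
  }

  public static void main(String[] args)
  {
    // LinkedHashMap iterates in insertion order, so debug output is stable.
    System.out.println(keysInIterationOrder(new LinkedHashMap<>()));
    // HashMap makes no ordering guarantee; the order depends on key hashes.
    System.out.println(keysInIterationOrder(new HashMap<>()));
  }
}
```

With `LinkedHashMap`, `toString()` and similar debug output list fields in the order they were added, which is the readability benefit the reviewer describes.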
    return ((ByteString) value).toByteArray();
  case ENUM:
    // Special-case google.protobuf.NullValue (it's an Enum).
    if (field.getEnumType().getFullName().equals("google.protobuf.NullValue")) {
Contributor
Is String equality really the best way?
If it is, do it as "google.protobuf...".equals() instead.
Member
Author
not sure if this is best or not, this is what JsonFormat was doing, which I adapted most of this code from. I can switch the order though on next PR since I think I'm going to have to add some test coverage anyway to make 🤖 happy
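The reordering suggested in this thread, putting the string constant on the left of `equals`, might look like the sketch below. To keep it free of a protobuf dependency, the hypothetical helper `isNullValueEnum` takes the enum's full name as a plain `String`; in the PR the value comes from `field.getEnumType().getFullName()`. The class and constant names are assumptions made for illustration.

```java
final class NullValueCheck
{
  // Full name of the enum, copied from the reviewed diff.
  private static final String NULL_VALUE_FULL_NAME = "google.protobuf.NullValue";

  // Hypothetical helper: constant-first equals cannot throw a
  // NullPointerException, even if the argument is null.
  static boolean isNullValueEnum(String enumFullName)
  {
    return NULL_VALUE_FULL_NAME.equals(enumFullName);
  }

  public static void main(String[] args)
  {
    System.out.println(isNullValueEnum("google.protobuf.NullValue"));
    System.out.println(isNullValueEnum("my.package.Color"));
    System.out.println(isNullValueEnum(null));
  }
}
```

Whether the descriptor's full name can actually be null here is not established by the thread; the constant-first form is mainly a defensive convention.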
clintropolis
added a commit
to clintropolis/druid
that referenced
this pull request
Dec 9, 2022
…attening (apache#13519) * add protobuf flattener, direct to plain java conversion for faster flattening, nested column tests
kfaraz
pushed a commit
that referenced
this pull request
Dec 16, 2022
…3573) This PR expands `StringDimensionIndexer` to handle conversion of `byte[]` to base64 encoded strings, rather than the current behavior of calling Java `toString`. This issue was uncovered by a regression of sorts introduced by #13519, which updated the protobuf extension to directly convert stuff to Java types, resulting in `bytes` typed values being converted as `byte[]` instead of a base64 string that the previous JSON-based conversion created. While outputting `byte[]` is more consistent with other input formats, and preferable when the bytes can be consumed directly (such as complex types serde), when fed to a `StringDimensionIndexer`, it resulted in an ugly Java `toString` because `processRowValsToUnsortedEncodedKeyComponent` is fed the output of `row.getRaw(..)`. Converting `byte[]` to a base64 string within `StringDimensionIndexer` is consistent with the behavior of calling `row.getDimension(..)`, which does do this coercion (and is why many tests on binary types appeared to be doing the expected thing). I added some protobuf `bytes` tests, but they don't really hit the new `StringDimensionIndexer` behavior because they operate on the `InputRow` directly, and call `getDimension` to validate stuff. The parser based version still uses the old conversion mechanisms, so when not using a flattener it incorrectly calls `toString` on the `ByteString`. I have encoded this behavior in the test for now; if we either update the parser to use the new flattener or just remove parsers, we can remove this test stuff.
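The difference between calling `toString` on a `byte[]` and base64-encoding it, as described in the commit message above, can be sketched with the JDK alone. The class `BytesToStringDemo` and the helper `coerce` are hypothetical and are not Druid's actual `StringDimensionIndexer` code; they only illustrate the coercion idea.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class BytesToStringDemo
{
  // Hypothetical coercion: byte[] values become base64 strings,
  // everything else is passed through unchanged.
  static Object coerce(Object raw)
  {
    if (raw instanceof byte[]) {
      return Base64.getEncoder().encodeToString((byte[]) raw);
    }
    return raw;
  }

  public static void main(String[] args)
  {
    byte[] raw = "druid".getBytes(StandardCharsets.UTF_8);
    // Array toString is identity-based: "[B@" followed by a hash code.
    System.out.println(raw.toString());
    // Base64 gives a stable, content-based string instead.
    System.out.println(coerce(raw));
  }
}
```

The first line printed varies between runs because it embeds an identity hash code, which is why it makes a poor dimension value.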
kfaraz
added a commit
that referenced
this pull request
Dec 16, 2022
…3573) (#13582) Co-authored-by: Clint Wylie <cwylie@apache.org>
Description
This PR overhauls `ProtobufInputFormat` to use a new `ProtobufFlattenerMaker` that works by converting Protobuf `Message` directly to plain java types rather than printing to a JSON string and then deserializing that into `JsonNode` with Jackson. This is roughly based on the `JsonFormat` conversion code that we were previously using, but without the performance penalty this inflicts. I haven't measured it, but for actually flat schemas this should be approximately the same cost as the old way of not using the flattener since it is doing the same `getAllFields` to convert the top level `Message` to a `Map`.

While I was here, I extracted a base type, `FlattenerJsonProvider`, for the common code shared between most implementations.

I've also added tests for using Druid nested columns and JSON transform functions (which is what motivated this change, because they both prefer to work on plain java types).

I did not update `ProtobufInputRowParser` because I don't care about Hadoop, but if anyone does it should be possible to wire this up.

Key changed/added classes in this PR
* `ProtobufInputFormat`
* `ProtobufReader`
* `ProtobufFlattenerMaker`
* `ProtobufJsonProvider`
* `ProtobufConverter`
* `FlattenerJsonProvider`

This PR has: