add protobuf flattener, direct to plain java conversion for faster flattening by clintropolis · Pull Request #13519 · apache/druid

clintropolis · 2022-12-07T10:10:11Z

Description

This PR overhauls ProtobufInputFormat to use a new ProtobufFlattenerMaker, which converts a Protobuf Message directly to plain Java types rather than printing it to a JSON string and then deserializing that string into a JsonNode with Jackson. The conversion is roughly based on the JsonFormat conversion code that we were previously using, but without the performance penalty of the round trip through JSON. I haven't measured it, but for schemas that are actually flat this should cost approximately the same as the old path that did not use the flattener, since it makes the same getAllFields call to convert the top-level Message to a Map.

While I was here, I extracted a base type, FlattenerJsonProvider, for the common code shared between most implementations.

I've also added tests for using Druid nested columns and JSON transform functions (which is what motivated this change, because they both prefer to work on plain Java types).

I did not update ProtobufInputRowParser because I don't care about Hadoop, but if anyone does it should be possible to wire this up.

Key changed/added classes in this PR

ProtobufInputFormat
ProtobufReader
ProtobufFlattenerMaker
ProtobufJsonProvider
ProtobufConverter
FlattenerJsonProvider


imply-cheddar · 2022-12-07T10:33:59Z

+  @Override
+  public Object createMap()
+  {
+    return new HashMap<>();


Nit: LinkedHashMaps are nicer 'cause they maintain insertion order on iteration. This makes things a lot nicer for, like, toString() and other debug-style activities.

Oops, I forgot to do this. I'll try to remember to do it in a follow-up so I don't churn through CI again.
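For anyone skimming this thread later, here is a minimal sketch of the difference being discussed. It uses only java.util; the class name, helper method, and field names are made up for illustration and are not taken from the PR.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MapOrderSketch
{
  // Hypothetical helper: fills the given map in a fixed order and
  // returns its keys in whatever order the map iterates them.
  static List<String> keysInIterationOrder(Map<String, Object> map)
  {
    map.put("timestamp", 1L);
    map.put("someOtherId", 2);
    map.put("isValid", true);
    map.put("description", "text");
    return new ArrayList<>(map.keySet());
  }

  public static void main(String[] args)
  {
    // LinkedHashMap iterates in insertion order, so debug output is stable.
    System.out.println(keysInIterationOrder(new LinkedHashMap<>()));
    // HashMap makes no ordering guarantee; the order depends on hashing.
    System.out.println(keysInIterationOrder(new HashMap<>()));
  }
}
```

The trade-off is a small amount of extra memory per entry for the linked list that LinkedHashMap maintains, which is usually acceptable for per-row maps like the one createMap() returns.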

imply-cheddar · 2022-12-07T10:37:41Z

+        return ((ByteString) value).toByteArray();
+      case ENUM:
+        // Special-case google.protobuf.NullValue (it's an Enum).
+        if (field.getEnumType().getFullName().equals("google.protobuf.NullValue")) {


Is String equality really the best way?

If it is, do it as "google.protobuf...".equals() instead.

Not sure if this is best or not; this is what JsonFormat was doing, which is where I adapted most of this code from. I can switch the order in the next PR, though, since I think I'm going to have to add some test coverage anyway to make 🤖 happy.
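A minimal sketch of the constant-first comparison suggested in this thread. The class, constant, and helper names are hypothetical, and the String parameter stands in for the result of field.getEnumType().getFullName() so that the example does not need protobuf on the classpath.

```java
final class NullValueCheckSketch
{
  // Assumed constant name; the literal itself is the one quoted in the review.
  static final String NULL_VALUE_FULL_NAME = "google.protobuf.NullValue";

  // Hypothetical helper: calling equals() on the constant means a null
  // argument yields false instead of throwing NullPointerException.
  static boolean isProtobufNullValue(String enumFullName)
  {
    return NULL_VALUE_FULL_NAME.equals(enumFullName);
  }

  public static void main(String[] args)
  {
    System.out.println(isProtobufNullValue("google.protobuf.NullValue")); // true
    System.out.println(isProtobufNullValue("some.other.Enum"));           // false
    System.out.println(isProtobufNullValue(null));                        // false
  }
}
```

Whether a descriptor's full name can ever be null in practice is not something this thread establishes, so the constant-first form is best read as a defensive habit rather than a fix for a known bug.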


…3573) This PR expands `StringDimensionIndexer` to handle conversion of `byte[]` to base64 encoded strings, rather than the current behavior of calling java `toString`. This issue was uncovered by a regression of sorts introduced by #13519, which updated the protobuf extension to directly convert stuff to java types, resulting in `bytes` typed values being converted as `byte[]` instead of a base64 string which the previous JSON based conversion created. While outputting `byte[]` is more consistent with other input formats, and preferable when the bytes can be consumed directly (such as complex types serde), when fed to a `StringDimensionIndexer`, it resulted in an ugly java `toString` because `processRowValsToUnsortedEncodedKeyComponent` is fed the output of `row.getRaw(..)`. Converting `byte[]` to a base64 string within `StringDimensionIndexer` is consistent with the behavior of calling `row.getDimension(..)` which does do this coercion (and why many tests on binary types appeared to be doing the expected thing). I added some protobuf `bytes` tests, but they don't really hit the new `StringDimensionIndexer` behavior because they operate on the `InputRow` directly, and call `getDimension` to validate stuff. The parser based version still uses the old conversion mechanisms, so when not using a flattener incorrectly calls `toString` on the `ByteString`. I have encoded this behavior in the test for now, if we either update the parser to use the new flattener or just .. remove parsers we can remove this test stuff.
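The `byte[]` behavior described above can be sketched with the JDK alone. This is an illustration under stated assumptions, not the code from #13573: the class and the `coerceToString` helper are hypothetical, and `java.util.Base64` is used here as a stand-in for whatever encoder Druid actually calls.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class BytesToStringSketch
{
  // Hypothetical coercion: base64-encode byte[] values and fall back to
  // String.valueOf for everything else.
  static String coerceToString(Object raw)
  {
    if (raw instanceof byte[]) {
      return Base64.getEncoder().encodeToString((byte[]) raw);
    }
    return String.valueOf(raw);
  }

  public static void main(String[] args)
  {
    byte[] bytes = "druid".getBytes(StandardCharsets.UTF_8);
    // Arrays inherit Object.toString(), so this prints a type tag and an
    // identity hash (something like "[B@1b6d3586"), not the contents.
    System.out.println(bytes.toString());
    // The coerced form is a stable, readable string.
    System.out.println(coerceToString(bytes)); // ZHJ1aWQ=
  }
}
```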


add protobuf flattener, direct to plain java conversion for faster fl…

216e1ff

…attening, nested column tests

clintropolis added the Area - Ingestion label Dec 7, 2022

imply-cheddar approved these changes Dec 7, 2022


clintropolis added 4 commits December 7, 2022 12:37

more test, fixes

0f1b841

more test

5b32f6e

checkstyle

ec9701c

inspection

1a6e4b4

clintropolis added the Bug label Dec 8, 2022

clintropolis added 4 commits December 7, 2022 22:22

more test

942e716

use LinkedHashMap

ed03856

Merge remote-tracking branch 'upstream/master' into nested-protobuf

3903eb8

my bad, CharsetEncoder is not thread safe

cc2fc7e

clintropolis merged commit 7002ecd into apache:master Dec 9, 2022

clintropolis deleted the nested-protobuf branch December 9, 2022 20:24

clintropolis mentioned this pull request Dec 9, 2022

[Backport] add protobuf flattener, direct to plain java conversion for faster flattening #13546

Merged

kfaraz pushed a commit that referenced this pull request Dec 12, 2022

add protobuf flattener, direct to plain java conversion for faster fl…

93e2a7f

…attening (#13519) (#13546) * add protobuf flattener, direct to plain java conversion for faster flattening, nested column tests

clintropolis mentioned this pull request Dec 15, 2022

allow string dimension indexer to handle byte[] as base64 strings #13573

Merged

10 tasks

vtlim mentioned this pull request Jan 6, 2023

doc: List Protobuf as a supported format #13640

Merged

clintropolis modified the milestones: 26.0, 25.0 Apr 10, 2023

clintropolis merged 9 commits into apache:master from clintropolis:nested-protobuf


2 participants
