Skip to content

add protobuf flattener, direct to plain java conversion for faster flattening#13519

Merged
clintropolis merged 9 commits intoapache:masterfrom
clintropolis:nested-protobuf
Dec 9, 2022
Merged

add protobuf flattener, direct to plain java conversion for faster flattening#13519
clintropolis merged 9 commits intoapache:masterfrom
clintropolis:nested-protobuf

Conversation

@clintropolis
Copy link
Copy Markdown
Member

Description

This PR overhauls ProtobufInputFormat to use a new ProtobufFlattenerMaker that works by converting Protobuf Message directly to plain java types rather than printing to a JSON string and then deserializing that into JsonNode with Jackson. This is roughly based on the JsonFormat conversion code that we were previously using, but without the performance penalty this inflicts. I haven't measured it, but for actually flat schemas this should be approximately the same cost as the old way of not using the flattener since it is doing the same getAllFields to convert the top level Message to a Map.

While I was here, i extracted a base type, FlattenerJsonProvider for the common code shared between most implementations.

I've also added tests for using Druid nested columns and JSON transform functions (which is what motivated this change, because they both prefer to work on plain java types).

I did not update ProtobufInputRowParser because I don't care about Hadoop, but if anyone does it should be possible to wire this up.


Key changed/added classes in this PR
  • ProtobufInputFormat
  • ProtobufReader
  • ProtobufFlattenerMaker
  • ProtobufJsonProvider
  • ProtobufConverter
  • FlattenerJsonProvider

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@Override
public Object createMap()
{
return new HashMap<>();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: LinkedHashMaps are nicer 'cause they maintain order on iteration. This makes things a lot nicer for like, toString() and other debug-style activities.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, I forgot to do this, will try to remember to do in a follow-up to not churn through ci again

return ((ByteString) value).toByteArray();
case ENUM:
// Special-case google.protobuf.NullValue (it's an Enum).
if (field.getEnumType().getFullName().equals("google.protobuf.NullValue")) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is String equality really the best way?

If it is, do it as "google.protobuf...".equals() instead.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if this is best or not, this is what JsonFormat was doing which I adapted most of this code form. I can switch the order though on next PR since I think i'm going to have to add some test coverage anyway to make 🤖 happy

@clintropolis clintropolis merged commit 7002ecd into apache:master Dec 9, 2022
@clintropolis clintropolis deleted the nested-protobuf branch December 9, 2022 20:24
clintropolis added a commit to clintropolis/druid that referenced this pull request Dec 9, 2022
…attening (apache#13519)

* add protobuf flattener, direct to plain java conversion for faster flattening, nested column tests
kfaraz pushed a commit that referenced this pull request Dec 12, 2022
…attening (#13519) (#13546)

* add protobuf flattener, direct to plain java conversion for faster flattening, nested column tests
kfaraz pushed a commit that referenced this pull request Dec 16, 2022
…3573)

This PR expands `StringDimensionIndexer` to handle conversion of `byte[]` to base64 encoded strings, rather than the current behavior of calling java `toString`. 

This issue was uncovered by a regression of sorts introduced by #13519, which updated the protobuf extension to directly convert stuff to java types, resulting in `bytes` typed values being converted as `byte[]` instead of a base64 string which the previous JSON based conversion created. While outputting `byte[]` is more consistent with other input formats, and preferable when the bytes can be consumed directly (such as complex types serde), when fed to a `StringDimensionIndexer`, it resulted in an ugly java `toString` because `processRowValsToUnsortedEncodedKeyComponent` is fed the output of `row.getRaw(..)`. Converting `byte[]` to a base64 string within `StringDimensionIndexer` is consistent with the behavior of calling `row.getDimension(..)` which does do this coercion (and why many tests on binary types appeared to be doing the expected thing).

I added some protobuf `bytes` tests, but they don't really hit the new `StringDimensionIndexer` behavior because they operate on the `InputRow` directly, and call `getDimension` to validate stuff. The parser based version still uses the old conversion mechanisms, so when not using a flattener incorrectly calls `toString` on the `ByteString`. I have encoded this behavior in the test for now, if we either update the parser to use the new flattener or just .. remove parsers we can remove this test stuff.
kfaraz pushed a commit to kfaraz/druid that referenced this pull request Dec 16, 2022
…ache#13573)

This PR expands `StringDimensionIndexer` to handle conversion of `byte[]` to base64 encoded strings, rather than the current behavior of calling java `toString`. 

This issue was uncovered by a regression of sorts introduced by apache#13519, which updated the protobuf extension to directly convert stuff to java types, resulting in `bytes` typed values being converted as `byte[]` instead of a base64 string which the previous JSON based conversion created. While outputting `byte[]` is more consistent with other input formats, and preferable when the bytes can be consumed directly (such as complex types serde), when fed to a `StringDimensionIndexer`, it resulted in an ugly java `toString` because `processRowValsToUnsortedEncodedKeyComponent` is fed the output of `row.getRaw(..)`. Converting `byte[]` to a base64 string within `StringDimensionIndexer` is consistent with the behavior of calling `row.getDimension(..)` which does do this coercion (and why many tests on binary types appeared to be doing the expected thing).

I added some protobuf `bytes` tests, but they don't really hit the new `StringDimensionIndexer` behavior because they operate on the `InputRow` directly, and call `getDimension` to validate stuff. The parser based version still uses the old conversion mechanisms, so when not using a flattener incorrectly calls `toString` on the `ByteString`. I have encoded this behavior in the test for now, if we either update the parser to use the new flattener or just .. remove parsers we can remove this test stuff.
kfaraz added a commit that referenced this pull request Dec 16, 2022
…3573) (#13582)

This PR expands `StringDimensionIndexer` to handle conversion of `byte[]` to base64 encoded strings, rather than the current behavior of calling java `toString`. 

This issue was uncovered by a regression of sorts introduced by #13519, which updated the protobuf extension to directly convert stuff to java types, resulting in `bytes` typed values being converted as `byte[]` instead of a base64 string which the previous JSON based conversion created. While outputting `byte[]` is more consistent with other input formats, and preferable when the bytes can be consumed directly (such as complex types serde), when fed to a `StringDimensionIndexer`, it resulted in an ugly java `toString` because `processRowValsToUnsortedEncodedKeyComponent` is fed the output of `row.getRaw(..)`. Converting `byte[]` to a base64 string within `StringDimensionIndexer` is consistent with the behavior of calling `row.getDimension(..)` which does do this coercion (and why many tests on binary types appeared to be doing the expected thing).

I added some protobuf `bytes` tests, but they don't really hit the new `StringDimensionIndexer` behavior because they operate on the `InputRow` directly, and call `getDimension` to validate stuff. The parser based version still uses the old conversion mechanisms, so when not using a flattener incorrectly calls `toString` on the `ByteString`. I have encoded this behavior in the test for now, if we either update the parser to use the new flattener or just .. remove parsers we can remove this test stuff.

Co-authored-by: Clint Wylie <cwylie@apache.org>
@clintropolis clintropolis modified the milestones: 26.0, 25.0 Apr 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants