Skip to content

[Backport] allow string dimension indexer to handle byte[] as base64 strings (#13573)#13582

Merged
kfaraz merged 1 commit intoapache:25.0.0from
kfaraz:backport_string_indexer
Dec 16, 2022
Merged

[Backport] allow string dimension indexer to handle byte[] as base64 strings (#13573)#13582
kfaraz merged 1 commit intoapache:25.0.0from
kfaraz:backport_string_indexer

Conversation

@kfaraz
Copy link
Copy Markdown
Contributor

@kfaraz kfaraz commented Dec 16, 2022

Backports #13573

…ache#13573)

This PR expands `StringDimensionIndexer` to handle conversion of `byte[]` to base64 encoded strings, rather than the current behavior of calling java `toString`. 

This issue was uncovered by a regression of sorts introduced by apache#13519, which updated the protobuf extension to directly convert stuff to java types, resulting in `bytes` typed values being converted as `byte[]` instead of a base64 string which the previous JSON based conversion created. While outputting `byte[]` is more consistent with other input formats, and preferable when the bytes can be consumed directly (such as complex types serde), when fed to a `StringDimensionIndexer`, it resulted in an ugly java `toString` because `processRowValsToUnsortedEncodedKeyComponent` is fed the output of `row.getRaw(..)`. Converting `byte[]` to a base64 string within `StringDimensionIndexer` is consistent with the behavior of calling `row.getDimension(..)` which does do this coercion (and why many tests on binary types appeared to be doing the expected thing).

I added some protobuf `bytes` tests, but they don't really hit the new `StringDimensionIndexer` behavior because they operate on the `InputRow` directly, and call `getDimension` to validate stuff. The parser based version still uses the old conversion mechanisms, so when not using a flattener incorrectly calls `toString` on the `ByteString`. I have encoded this behavior in the test for now, if we either update the parser to use the new flattener or just .. remove parsers we can remove this test stuff.
@kfaraz kfaraz added this to the 25.0 milestone Dec 16, 2022
@kfaraz kfaraz merged commit cbccde5 into apache:25.0.0 Dec 16, 2022
@kfaraz kfaraz deleted the backport_string_indexer branch May 3, 2023 15:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants