Skip to content

Fix String Frame Readers to read String Arrays correctly#16885

Merged
LakshSingla merged 25 commits intoapache:masterfrom
sreemanamala:string-arr-frames
Sep 10, 2024
Merged

Fix String Frame Readers to read String Arrays correctly#16885
LakshSingla merged 25 commits intoapache:masterfrom
sreemanamala:string-arr-frames

Conversation

@sreemanamala
Copy link
Copy Markdown
Contributor

@sreemanamala sreemanamala commented Aug 13, 2024

Description

While writing to a frame, String arrays are written by setting the multivalue byte.
But while reading, it was hardcoded to false.
Fixed it by reading the byte similar to readColumn method.


Key changed/added classes in this PR
  • StringFrameColumnReader
  • StringArrayFrameColumnReader

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Copy link
Copy Markdown
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just make a separate StringArrayFrameColumnReader/StringArrayFrameColumnWriter? i don't care if they have the same format, this just seems like a dangerous way to mix-up mvds which are not arrays with actual string arrays...

@sreemanamala sreemanamala marked this pull request as draft August 13, 2024 15:47
@sreemanamala sreemanamala marked this pull request as ready for review August 13, 2024 16:41
@LakshSingla
Copy link
Copy Markdown
Contributor

LGTM! Can you make sure the CodeQL issue isn't real, and suppress/fix it accordingly

@Akshat-Jain
Copy link
Copy Markdown
Contributor

@sreemanamala Can you please try out this query once with MSQ engine as well to see if the latest changes of this PR are still fixing it? Thanks!

memory,
positionOfLengths,
positionOfPayloads,
multiValue // Read MVDs as String arrays
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't seem correct, it will implicitly cast the MVD to an array? I guess that is what the code here was intending to do, but is that actually correct? Is there documentation somewhere on what the expected behavior of window functions with mvds is?

I guess if it is a one way coercion to array that is potentially fine, but im pretty against this retaining the type as STRING, e.g. implicitly coercing the string array inside of the window operators back into a multi-value string on the way out. That seems incorrect behavior and would lead to confusion about the differences between ARRAY types and mvds, which I'm trying to avoid creating more of (see arrayIngestMode)

Copy link
Copy Markdown
Contributor

@LakshSingla LakshSingla Aug 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strings (mvd) and string arrays are laid out in a pretty similar fashion. They have 3 sections:

  1. Indicates the number of the elements in a single row
  2. Indicates the end point of the continuously laid out elements ƒor a particular row
  3. Actual string value

So if the string array is [foo, bar], it would be laid out like:
Section 1: .....2....
Section 2: ....190, 193......
Section 3: ....foo, bar,.....

This is the same for MV strings as well as the string arrays. However, single-value strings don't need the first section, therefore they omit it. This leads to 2 different formats a column is represented, and that is handled by the "multiValue" flag (unfortunate naming I guess, but it doesn't have anything to do with string-multi valuedness).


From what I know, the duplication b/w the string arrays and mv strings is only because they are laid out in the same way. It doesn't impact how other parts of the system refer to it, the frames themselves don't store the column type. If window functions begin treating string arrays as multi-value strings, it shouldn't be because of this, but because something upstream is telling it to. With the arrayIngestMode, it happened because of the incorrect implementation in the DimensionHandlerUtils afaik.

That being said, if you as a reader got confused, there's more to what we can do to separate it and make it cleaner:

  1. Separate out the two format implementations in a static block - strings can use either one depending on the flag, and arrays must always use the one where there are three sections (the one mentioned)
  2. Use an alternative naming to "multiValue" to refer the layout so that it doesn't look like we are coercing anything
  3. Kill all mentions of "multivalue" in the string array frame column reader implementation
  4. Separate out everything at the cost of a little duplication (though I'd really like the core logic of reading the values to be kept in a helper method, if possible, since it's easier to fix bugs that way)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are other subtle differences between arrays and mvds too, mvds for example will never have an actual null value, only [] or [null], while with arrays all 3 values are distinct. At least this is true of normal segments, i would assume that frames preserve this (didn't confirm yet). Selectors on MVDs also change single element arrays into scalar string values, while arrays never do this. Mvds are expected to spit out lists from their selectors (or scalar values), while arrays spit out arrays. Mvds use dimension selectors, arrays do not typically ever support dimension selectors, and so on.

I'd personally prefer everything entirely split so that arrays never accidentally become mvds and mvds never accidentally become arrays. It seems like it simplifies both string and string array implementations because the behavior differs quite a bit in a lot of places and we can drop lots of conditional checking. I guess the part that reads a rows worth of values could be shared? That said, since there are only 2 implementations the cost of duplication doesn't seem very high unless I'm missing something.

I guess this doesn't need to be done in this PR, and am happy to have further discussion, i just would feel a lot safer if all string array stuff was completely split out of all string stuff, even if it shares formats. Alternatively, maybe the string reader/writer only ever handles single value strings, and all the stuff is moved into the string array writer, and if anything, the MVD reader/writer can subclass the array reader/writer to override stuff like implementing a dimension selector or whatever.

Copy link
Copy Markdown
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for splitting this up, i feel a lot more comfortable about things

Copy link
Copy Markdown
Contributor

@LakshSingla LakshSingla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM post refactor

@LakshSingla LakshSingla merged commit c7c3307 into apache:master Sep 10, 2024
@sreemanamala sreemanamala deleted the string-arr-frames branch September 10, 2024 09:02
sreemanamala added a commit to sreemanamala/druid that referenced this pull request Sep 18, 2024
While writing to a frame, String arrays are written by setting the multivalue byte.
But while reading, it was hardcoded to false.
sreemanamala added a commit to sreemanamala/druid that referenced this pull request Sep 18, 2024
While writing to a frame, String arrays are written by setting the multivalue byte.
But while reading, it was hardcoded to false.

(cherry picked from commit c7c3307)
sreemanamala added a commit to sreemanamala/druid that referenced this pull request Sep 18, 2024
While writing to a frame, String arrays are written by setting the multivalue byte.
But while reading, it was hardcoded to false.

(cherry picked from commit c7c3307)
sreemanamala added a commit to sreemanamala/druid that referenced this pull request Sep 18, 2024
While writing to a frame, String arrays are written by setting the multivalue byte.
But while reading, it was hardcoded to false.

(cherry picked from commit c7c3307)
LakshSingla pushed a commit that referenced this pull request Sep 19, 2024
…7103)

While writing to a frame, String arrays are written by setting the multivalue byte.
But while reading, it was hardcoded to false.

(cherry picked from commit c7c3307)
@adarshsanjeev adarshsanjeev added this to the 32.0.0 milestone Jan 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants