Fix String Frame Readers to read String Arrays correctly #16885
LakshSingla merged 25 commits into apache:master
LGTM! Can you make sure the CodeQL issue isn't real, and suppress/fix it accordingly?
@sreemanamala Can you please try out this query once with the MSQ engine as well, to see if the latest changes in this PR still fix it? Thanks!
    memory,
    positionOfLengths,
    positionOfPayloads,
    multiValue // Read MVDs as String arrays
This doesn't seem correct: it will implicitly cast the MVD to an array? I guess that is what the code here was intending to do, but is that actually correct? Is there documentation somewhere on the expected behavior of window functions with MVDs?
I guess if it is a one-way coercion to array, that is potentially fine, but I'm pretty against this retaining the type as STRING, e.g. implicitly coercing the string array inside of the window operators back into a multi-value string on the way out. That seems like incorrect behavior and would lead to confusion about the differences between ARRAY types and MVDs, which I'm trying to avoid creating more of (see arrayIngestMode).
Strings (mvd) and string arrays are laid out in a pretty similar fashion. They have 3 sections:
- Indicates the number of the elements in a single row
- Indicates the end point of the continuously laid out elements for a particular row
- Actual string value
So if the string array is [foo, bar], it would be laid out like:
Section 1: .....2....
Section 2: ....190, 193......
Section 3: ....foo, bar,.....
This is the same for MV strings as well as string arrays. However, single-value strings don't need the first section, so they omit it. This leads to two different formats in which a column can be represented, and the choice is handled by the "multiValue" flag (unfortunate naming, I guess, since it has nothing to do with whether the strings are multi-valued).
From what I know, the duplication between string arrays and MV strings exists only because they are laid out in the same way. It doesn't impact how other parts of the system refer to them; the frames themselves don't store the column type. If window functions begin treating string arrays as multi-value strings, it shouldn't be because of this, but because something upstream is telling them to. With arrayIngestMode, it happened because of the incorrect implementation in DimensionHandlerUtils, afaik.
That being said, if you as a reader got confused, there's more we can do to separate the two and make it cleaner:
- Separate out the two format implementations in a static block - strings can use either one depending on the flag, and arrays must always use the one with three sections (the one described above)
- Use an alternative name to "multiValue" to refer to the layout, so that it doesn't look like we are coercing anything
- Kill all mentions of "multivalue" in the string array frame column reader implementation
- Separate out everything at the cost of a little duplication (though I'd really like the core logic of reading the values to be kept in a helper method, if possible, since it's easier to fix bugs that way)
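To make the layout discussion above concrete, here is a minimal sketch of the two formats behind a single shared element-reading helper. It is a simplified in-memory model, not Druid's frame code: the class `FrameLayoutSketch`, the `hasCountSection` flag (standing in for "multiValue"), and the use of plain Java arrays for the three sections are all assumptions made for illustration, and null elements are not modeled.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the three-section layout; not Druid's actual frame code.
class FrameLayoutSketch {
    private final int[] elementCounts; // section 1: elements per row, null when the section is omitted
    private final int[] endOffsets;    // section 2: end offset of each element within the payload
    private final byte[] payload;      // section 3: string bytes laid out back to back

    private FrameLayoutSketch(int[] elementCounts, int[] endOffsets, byte[] payload) {
        this.elementCounts = elementCounts;
        this.endOffsets = endOffsets;
        this.payload = payload;
    }

    // Arrays would always pass hasCountSection = true; plain strings may pass either value.
    static FrameLayoutSketch write(List<List<String>> rows, boolean hasCountSection) {
        List<Integer> ends = new ArrayList<>();
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int[] counts = hasCountSection ? new int[rows.size()] : null;
        for (int row = 0; row < rows.size(); row++) {
            List<String> elements = rows.get(row);
            if (hasCountSection) {
                counts[row] = elements.size();
            } else if (elements.size() != 1) {
                throw new IllegalArgumentException("two-section layout holds exactly one value per row");
            }
            for (String element : elements) {
                byte[] utf8 = element.getBytes(StandardCharsets.UTF_8);
                bytes.write(utf8, 0, utf8.length);
                ends.add(bytes.size());
            }
        }
        int[] endOffsets = ends.stream().mapToInt(Integer::intValue).toArray();
        return new FrameLayoutSketch(counts, endOffsets, bytes.toByteArray());
    }

    // Shared helper: decodes one element regardless of which layout is in use.
    private String readElement(int index) {
        int start = index == 0 ? 0 : endOffsets[index - 1];
        return new String(payload, start, endOffsets[index] - start, StandardCharsets.UTF_8);
    }

    List<String> readRow(int row) {
        if (elementCounts == null) {
            // Two-section layout: row N is simply element N.
            return List.of(readElement(row));
        }
        int first = 0;
        for (int i = 0; i < row; i++) {
            first += elementCounts[i];
        }
        List<String> values = new ArrayList<>();
        for (int i = 0; i < elementCounts[row]; i++) {
            values.add(readElement(first + i));
        }
        return values;
    }

    public static void main(String[] args) {
        FrameLayoutSketch arrays = write(List.of(List.of("foo", "bar"), List.of("baz")), true);
        System.out.println(arrays.readRow(0) + " " + arrays.readRow(1));
    }
}
```

The flag here only selects a layout; nothing in the sketch decides whether the caller treats a row as an array or as a multi-value string, which is the separation the comment argues for.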
There are other subtle differences between arrays and MVDs too. MVDs, for example, will never have an actual null value, only [] or [null], while with arrays all 3 values are distinct. At least this is true of normal segments; I would assume that frames preserve this (didn't confirm yet). Selectors on MVDs also change single-element arrays into scalar string values, while arrays never do this. MVDs are expected to spit out lists from their selectors (or scalar values), while arrays spit out arrays. MVDs use dimension selectors, arrays typically do not support dimension selectors at all, and so on.
I'd personally prefer everything entirely split so that arrays never accidentally become MVDs and MVDs never accidentally become arrays. It seems like it simplifies both the string and string array implementations, because the behavior differs quite a bit in a lot of places and we can drop lots of conditional checking. I guess the part that reads a row's worth of values could be shared? That said, since there are only 2 implementations, the cost of duplication doesn't seem very high unless I'm missing something.
I guess this doesn't need to be done in this PR, and I am happy to have further discussion. I would just feel a lot safer if all string array stuff was completely split out of all string stuff, even if it shares formats. Alternatively, maybe the string reader/writer only ever handles single-value strings, all the multi-element stuff is moved into the string array writer, and, if anything, the MVD reader/writer can subclass the array reader/writer to override stuff like implementing a dimension selector or whatever.
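The behavioral differences listed above can be illustrated with two small read functions. This is only a sketch of the described semantics: `MvdVersusArraySketch`, `readAsMvd`, and `readAsArray` are hypothetical names, not Druid selector classes, and rejecting a null MVD row is an assumption used to express "MVDs never have an actual null value".

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Objects;

// Illustrative only: hypothetical helpers, not Druid's selector classes.
class MvdVersusArraySketch {
    // MVD-style read: a single-element row collapses to a scalar, anything else stays a list.
    static Object readAsMvd(List<String> row) {
        Objects.requireNonNull(row, "an MVD row is never null itself, only [] or [null]");
        return row.size() == 1 ? row.get(0) : row;
    }

    // Array-style read: always an array, and null, [] and [null] remain three distinct values.
    static Object[] readAsArray(List<String> row) {
        return row == null ? null : row.toArray();
    }

    public static void main(String[] args) {
        List<String> single = Collections.singletonList("foo");
        System.out.println(readAsMvd(single));
        System.out.println(Arrays.toString(readAsArray(single)));
    }
}
```

Keeping the two paths in separate functions, rather than one function with a flag, mirrors the argument for splitting the implementations: neither path can silently fall into the other's behavior.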
clintropolis left a comment
Thanks for splitting this up, I feel a lot more comfortable about things.
While writing to a frame, String arrays are written by setting the multivalue byte. But while reading, it was hardcoded to false.
While writing to a frame, String arrays are written by setting the multivalue byte. But while reading, it was hardcoded to false. (cherry picked from commit c7c3307)
Description
While writing to a frame, String arrays are written by setting the multivalue byte.
But while reading, it was hardcoded to false.
Fixed it by reading the byte, similar to the readColumn method.
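The bug described here can be reproduced in miniature. The sketch below uses a made-up, length-prefixed column encoding (one flag byte, an optional element count, then each string as a length plus bytes); it is not Druid's frame format, and `MultiValueByteSketch`, `readHardcoded`, and `readFromByte` are hypothetical names. It only shows why ignoring the flag that the writer set makes the reader misinterpret the bytes that follow.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical encoding for illustration: [flag byte][int count, only if flag == 1][int length, bytes]...
class MultiValueByteSketch {
    // The writer always sets the flag byte when writing a string array row.
    static ByteBuffer writeArrayRow(List<String> row) {
        int size = 1 + Integer.BYTES;
        for (String s : row) {
            size += Integer.BYTES + s.getBytes(StandardCharsets.UTF_8).length;
        }
        ByteBuffer buf = ByteBuffer.allocate(size);
        buf.put((byte) 1);
        buf.putInt(row.size());
        for (String s : row) {
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            buf.putInt(utf8.length);
            buf.put(utf8);
        }
        buf.flip();
        return buf;
    }

    static List<String> readRow(ByteBuffer column, boolean multiValue) {
        ByteBuffer in = column.duplicate();
        in.position(1); // skip the flag byte
        int count = multiValue ? in.getInt() : 1;
        List<String> values = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            byte[] utf8 = new byte[in.getInt()];
            in.get(utf8);
            values.add(new String(utf8, StandardCharsets.UTF_8));
        }
        return values;
    }

    // Before the fix: the flag is hardcoded to false, so the count is misread as a length.
    static List<String> readHardcoded(ByteBuffer column) {
        return readRow(column, false);
    }

    // After the fix: the flag is read from the byte the writer stored.
    static List<String> readFromByte(ByteBuffer column) {
        return readRow(column, column.get(0) != 0);
    }

    public static void main(String[] args) {
        ByteBuffer column = writeArrayRow(List.of("foo", "bar"));
        System.out.println(readFromByte(column));
    }
}
```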
Key changed/added classes in this PR
- StringFrameColumnReader
- StringArrayFrameColumnReader
This PR has: