Properly read SQL-compatible segments in default-value mode. #14142
gianm merged 13 commits into apache:master
Main changes:
1) Dictionary-encoded and front-coded string columns: in default-value mode, detect cases where a dictionary has the empty string in it, then either combine it with null (if null is present) or replace it with null (if null is not present).
2) Numeric nullable columns: in default-value mode, ignore the null value bitmap. This causes all null numbers to be read as zeroes.

Testing strategy:
1) Add a mmappedWithSqlCompatibleNulls case to BaseFilterTest that writes segments under SQL-compatible mode, and reads them under default-value mode.
2) Unit tests for the new wrapper classes (CombineFirstTwoEntriesIndexed, CombineFirstTwoValuesColumnarInts, CombineFirstTwoValuesColumnarMultiInts, CombineFirstTwoValuesIndexedInts).
Currently a draft PR since I haven't written unit tests for the new helper classes yet. I'm interested in feedback on the approach though.
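To make the numeric change in the description concrete, here is a minimal sketch rather than Druid's actual column reader. It assumes, for illustration only, that a nullable long column keeps a placeholder 0 in the value array for null rows and marks those rows in a separate bitmap. `NullableLongColumnSketch` and its methods are hypothetical names.

```java
import java.util.BitSet;

// Hypothetical stand-in for a nullable numeric column; not a Druid class.
final class NullableLongColumnSketch
{
  private final long[] values;   // assumed to hold 0 for rows that were null at write time
  private final BitSet nullRows; // rows the reader treats as null

  NullableLongColumnSketch(long[] values, BitSet writtenNullRows, boolean sqlCompatible)
  {
    this.values = values;
    // In default-value mode the written bitmap is ignored, as if it were empty.
    this.nullRows = sqlCompatible ? writtenNullRows : new BitSet();
  }

  boolean isNull(int row)
  {
    return nullRows.get(row);
  }

  long getLong(int row)
  {
    return values[row];
  }
}
```

Under these assumptions, a reader in default-value mode never reports a null row and simply returns the stored placeholder, which is how null numbers end up being read as zeroes.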
if (index == FIRST_ID) {
  return newFirstValue();
} else {
  return delegate.get(index + 1);
Check failure (Code scanning / CodeQL): User-controlled data in arithmetic expression
if (i == NULL_ID) {
  return i;
} else {
  return i - 1;
Check failure (Code scanning / CodeQL): User-controlled data in arithmetic expression
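The two hunks flagged above appear to be the two halves of the "combine first two entries" idea: one adjusts dictionary lookups, the other adjusts the ids stored in rows. The following is a simplified sketch of how they might fit together, assuming a sorted dictionary whose first two entries are null and the empty string. `CombineFirstTwoSketch` is a hypothetical class, not one of the wrappers added in this PR.

```java
import java.util.List;

// Hypothetical illustration of combining the first two dictionary entries; not Druid code.
final class CombineFirstTwoSketch
{
  private static final int FIRST_ID = 0;

  // Dictionary as written under SQL-compatible mode, e.g. [null, "", "a", "b"].
  private final List<String> delegate;

  CombineFirstTwoSketch(List<String> delegate)
  {
    this.delegate = delegate;
  }

  // The combined dictionary has one entry fewer than the written one.
  int size()
  {
    return delegate.size() - 1;
  }

  // Dictionary view: id 0 is the merged null/"" entry, later ids skip one delegate slot.
  String get(int index)
  {
    if (index == FIRST_ID) {
      return null;
    } else {
      return delegate.get(index + 1);
    }
  }

  // Row view: translate an id written against the original dictionary.
  static int remapId(int writtenId)
  {
    if (writtenId == FIRST_ID) {
      return writtenId;
    } else {
      return writtenId - 1;
    }
  }
}
```

With the example dictionary, written ids 0 (null) and 1 (the empty string) both map to id 0, and every later id shifts down by one, so `get(remapId(id))` still returns the original string for all non-empty values.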
There is going to be some query runtime performance overhead here for the case where a server in default-value mode is reading a segment that was written in SQL-compatible mode. However, I don't really see a way around this. The dictionaries do need to be adjusted and it will add some extra overhead. There's no overhead for default-value mode reading default-value-mode-written segments, or SQL-compatible mode reading any kind of segments.

Added tests and marked this PR as ready for review.
clintropolis left a comment
I think the 'auto' string column should probably (unfortunately) use this stuff too, but I think it can be done as a follow-up, since I also think that it can be totally combined with the front-coded string column (which is no longer specific to front-coding after this PR; rather, it's a string column which only has a utf8 buffer dictionary).
  final ImmutableBitmap bitmap;
  final boolean hasNulls;
- if (buffer.hasRemaining()) {
+ if (buffer.hasRemaining() && NullHandling.sqlCompatible()) {
I think technically we still want to either read this, to move the buffer ahead by side effect, or just move the buffer position like when we read the numeric column? I mean, it's probably fine because nothing actually uses more than one column part, but column parts are just a for loop, so each part expects the buffer to be in the correct position to deserialize. We should at least leave a comment about it, just in case anything ever starts using column parts or we add some additional stuff to the end of this column part.
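The concern about buffer position can be illustrated with a small sketch. It assumes a made-up layout in which each part is an `int` length followed by that many payload bytes; this is not Druid's real serialization format, and `ColumnPartSketch.readOrSkipPart` is a hypothetical helper.

```java
import java.nio.ByteBuffer;

// Hypothetical layout: each part is [int length][payload]. Not Druid's serde format.
final class ColumnPartSketch
{
  // Returns the payload when it is wanted, otherwise null. Either way the buffer
  // position ends up just past this part, so the next part can be deserialized.
  static byte[] readOrSkipPart(ByteBuffer buffer, boolean retain)
  {
    final int length = buffer.getInt();
    if (retain) {
      final byte[] payload = new byte[length];
      buffer.get(payload);
      return payload;
    }
    // Skip without materializing the payload, but still advance the position.
    buffer.position(buffer.position() + length);
    return null;
  }
}
```

Guarding the whole read with a mode check, as in the hunk above, would leave the position at the start of the unread part; reading or skipping unconditionally and then discarding the result keeps later parts aligned.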
  final ImmutableBitmap bitmap;
  final boolean hasNulls;
- if (buffer.hasRemaining()) {
+ if (buffer.hasRemaining() && NullHandling.sqlCompatible()) {
same comment about buffer position
  final ImmutableBitmap bitmap;
  final boolean hasNulls;
- if (buffer.hasRemaining()) {
+ if (buffer.hasRemaining() && NullHandling.sqlCompatible()) {
same comment about buffer position
  @Nullable
  private final ColumnarMultiInts multiValueColumn;
- private final FrontCodedIndexed utf8Dictionary;
+ private final Indexed<ByteBuffer> utf8Dictionary;
I think this and ScalarStringDictionaryEncodedColumn can now be combined, but I can do that in a follow-up PR.
I did this in the latest patch.
 *
 * @see NullHandling#mustReplaceFirstValueWithNullInDictionary(Indexed)
 */
public class ReplaceFirstValueWithNullIndexed<T> implements Indexed<T>
Is this primarily so that callers don't have to worry about emptyToNullIfNeeded/nullToEmptyIfNeeded for the case where the dictionary had "" but no nulls?
Yes, it's for situations where the dictionary has "" but no null. It replaces the "" with a null.
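As a rough illustration of that answer, the sketch below wraps a dictionary that starts with the empty string and has no null entry. It is a simplified stand-in using `java.util.List`, not the actual `ReplaceFirstValueWithNullIndexed` class, and `ReplaceFirstWithNullSketch` is a hypothetical name.

```java
import java.util.List;

// Hypothetical illustration of replacing the first dictionary value with null; not Druid code.
final class ReplaceFirstWithNullSketch
{
  // Dictionary as written under SQL-compatible mode, e.g. ["", "a", "b"] with no null entry.
  private final List<String> delegate;

  ReplaceFirstWithNullSketch(List<String> delegate)
  {
    this.delegate = delegate;
  }

  // Same number of entries as the written dictionary, so row ids need no remapping.
  int size()
  {
    return delegate.size();
  }

  // Id 0 held "" at write time; in default-value mode it is surfaced as null.
  String get(int index)
  {
    return index == 0 ? null : delegate.get(index);
  }
}
```

Unlike the combine case, the dictionary keeps its size here, so only lookups of id 0 change and the ids stored in rows can be read as written.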
Pushed up a patch that addresses these comments:
1) Read bitmaps even if we don't retain them.
2) Combine StringFrontCodedDictionaryEncodedColumn and ScalarStringDictionaryEncodedColumn.
…14142)

* Properly read SQL-compatible segments in default-value mode.

  Main changes:
  1) Dictionary-encoded and front-coded string columns: in default-value mode, detect cases where a dictionary has the empty string in it, then either combine it with null (if null is present) or replace it with null (if null is not present).
  2) Numeric nullable columns: in default-value mode, ignore the null value bitmap. This causes all null numbers to be read as zeroes.

  Testing strategy:
  1) Add a mmappedWithSqlCompatibleNulls case to BaseFilterTest that writes segments under SQL-compatible mode, and reads them under default-value mode.
  2) Unit tests for the new wrapper classes (CombineFirstTwoEntriesIndexed, CombineFirstTwoValuesColumnarInts, CombineFirstTwoValuesColumnarMultiInts, CombineFirstTwoValuesIndexedInts).

* Fix a mistake, use more singlethreadedness.
* WIP
* Tests, improvements.
* Style.
* See Spot bug.
* Remove unused method.
* Address review comments.
  1) Read bitmaps even if we don't retain them.
  2) Combine StringFrontCodedDictionaryEncodedColumn and ScalarStringDictionaryEncodedColumn.
* Add missing tests.