Skip to content

Properly read SQL-compatible segments in default-value mode.#14142

Merged
gianm merged 13 commits intoapache:masterfrom
gianm:null-handling-compat-reads
Jun 28, 2023
Merged

Properly read SQL-compatible segments in default-value mode.#14142
gianm merged 13 commits intoapache:masterfrom
gianm:null-handling-compat-reads

Conversation

@gianm
Copy link
Copy Markdown
Contributor

@gianm gianm commented Apr 21, 2023

Main changes:

  1. Dictionary-encoded and front-coded string columns: in default-value
    mode, detect cases where a dictionary has the empty string in it, then
    either combine it with null (if null is present) or replace it with
    null (if null is not present).

  2. Numeric nullable columns: in default-value mode, ignore the null
    value bitmap. This causes all null numbers to be read as zeroes.

Testing strategy:

  1. Add a mmappedWithSqlCompatibleNulls case to BaseFilterTest that
    writes segments under SQL-compatible mode, and reads them under
    default-value mode.

  2. Unit tests for the new wrapper classes (CombineFirstTwoEntriesIndexed,
    CombineFirstTwoValuesColumnarInts, CombineFirstTwoValuesColumnarMultiInts,
    CombineFirstTwoValuesIndexedInts).

Main changes:

1) Dictionary-encoded and front-coded string columns: in default-value
   mode, detect cases where a dictionary has the empty string in it, then
   either combine it with null (if null is present) or replace it with
   null (if null is not present).

2) Numeric nullable columns: in default-value mode, ignore the null
   value bitmap. This causes all null numbers to be read as zeroes.

Testing strategy:

1) Add a mmappedWithSqlCompatibleNulls case to BaseFilterTest that
   writes segments under SQL-compatible mode, and reads them under
   default-value mode.

2) Unit tests for the new wrapper classes (CombineFirstTwoEntriesIndexed,
   CombineFirstTwoValuesColumnarInts, CombineFirstTwoValuesColumnarMultiInts,
   CombineFirstTwoValuesIndexedInts).
@gianm
Copy link
Copy Markdown
Contributor Author

gianm commented Apr 21, 2023

Currently a draft PR since I haven't written unit tests for the new helper classes yet. I'm interested in feedback on the approach though.

if (index == FIRST_ID) {
return newFirstValue();
} else {
return delegate.get(index + 1);

Check failure

Code scanning / CodeQL

User-controlled data in arithmetic expression

This arithmetic expression depends on a [user-provided value](1), potentially causing an overflow. This arithmetic expression depends on a [user-provided value](2), potentially causing an overflow. This arithmetic expression depends on a [user-provided value](3), potentially causing an overflow.
if (i == NULL_ID) {
return i;
} else {
return i - 1;

Check failure

Code scanning / CodeQL

User-controlled data in arithmetic expression

This arithmetic expression depends on a [user-provided value](1), potentially causing an underflow. This arithmetic expression depends on a [user-provided value](2), potentially causing an underflow. This arithmetic expression depends on a [user-provided value](3), potentially causing an underflow.
@gianm
Copy link
Copy Markdown
Contributor Author

gianm commented Apr 21, 2023

There is going to be some query runtime performance overhead here for the case where a server in default-value mode is reading a segment that was written in SQL-compatible mode. However, I don't really see a way around this. The dictionaries do need to be adjusted and it will add some extra overhead.

There's no overhead for default-value mode reading default-value-mode-written segments, or SQL-compatible mode reading any kind of segments.

@2bethere 2bethere mentioned this pull request Jun 7, 2023
5 tasks
@clintropolis clintropolis added this to the 27.0 milestone Jun 7, 2023
@gianm gianm marked this pull request as ready for review June 26, 2023 03:22
@gianm
Copy link
Copy Markdown
Contributor Author

gianm commented Jun 26, 2023

Added tests and marked this PR as ready for review.

Copy link
Copy Markdown
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the 'auto' string column should probably (unfortunately) use this stuff too, but I think it can be done as a follow-up, since I also think that it can be totally combined with the front-coded string column (which is no longer specific to front-coding after this PR, rather its a string column which only has a utf8 buffer dictionary).

final ImmutableBitmap bitmap;
final boolean hasNulls;
if (buffer.hasRemaining()) {
if (buffer.hasRemaining() && NullHandling.sqlCompatible()) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think technically we still want to either read this to move the buffer ahead by side effect or just move the buffer position like when we read the numeric column? i mean its probably fine because nothing actually uses more than one column part, but column parts are just a for loop so each part is expecting the buffer to be in the correct position to deserialize. We should at least leave a comment about it just in case anything ever starts using column parts or we add some additional stuff to the end of this column part

final ImmutableBitmap bitmap;
final boolean hasNulls;
if (buffer.hasRemaining()) {
if (buffer.hasRemaining() && NullHandling.sqlCompatible()) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment about buffer position

final ImmutableBitmap bitmap;
final boolean hasNulls;
if (buffer.hasRemaining()) {
if (buffer.hasRemaining() && NullHandling.sqlCompatible()) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment about buffer position

@Nullable
private final ColumnarMultiInts multiValueColumn;
private final FrontCodedIndexed utf8Dictionary;
private final Indexed<ByteBuffer> utf8Dictionary;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this and ScalarStringDictionaryEncodedColumn can now be combined but I can do that in a follow-up PR

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this in the latest patch.

*
* @see NullHandling#mustReplaceFirstValueWithNullInDictionary(Indexed)
*/
public class ReplaceFirstValueWithNullIndexed<T> implements Indexed<T>
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this primarily so that callers don't have to worry about emptyToNullIfNeeded/nullToEmptyIfNeeded for the case where the dictionary had "" but no nulls?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's for situations where the dictionary has "" but no null. It replaces the "" with a null.

1) Read bitmaps even if we don't retain them.
2) Combine StringFrontCodedDictionaryEncodedColumn and ScalarStringDictionaryEncodedColumn.
@gianm
Copy link
Copy Markdown
Contributor Author

gianm commented Jun 27, 2023

Pushed up a patch that addresses these comments:

  1. Read bitmaps even if we don't retain them.
  2. Combine StringFrontCodedDictionaryEncodedColumn and ScalarStringDictionaryEncodedColumn.

@gianm gianm merged commit 82fbb31 into apache:master Jun 28, 2023
@gianm gianm deleted the null-handling-compat-reads branch June 28, 2023 17:30
sergioferragut pushed a commit to sergioferragut/druid that referenced this pull request Jul 21, 2023
…14142)

* Properly read SQL-compatible segments in default-value mode.

Main changes:

1) Dictionary-encoded and front-coded string columns: in default-value
   mode, detect cases where a dictionary has the empty string in it, then
   either combine it with null (if null is present) or replace it with
   null (if null is not present).

2) Numeric nullable columns: in default-value mode, ignore the null
   value bitmap. This causes all null numbers to be read as zeroes.

Testing strategy:

1) Add a mmappedWithSqlCompatibleNulls case to BaseFilterTest that
   writes segments under SQL-compatible mode, and reads them under
   default-value mode.

2) Unit tests for the new wrapper classes (CombineFirstTwoEntriesIndexed,
   CombineFirstTwoValuesColumnarInts, CombineFirstTwoValuesColumnarMultiInts,
   CombineFirstTwoValuesIndexedInts).

* Fix a mistake, use more singlethreadedness.

* WIP

* Tests, improvements.

* Style.

* See Spot bug.

* Remove unused method.

* Address review comments.

1) Read bitmaps even if we don't retain them.
2) Combine StringFrontCodedDictionaryEncodedColumn and ScalarStringDictionaryEncodedColumn.

* Add missing tests.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants