Properly read SQL-compatible segments in default-value mode. by gianm · Pull Request #14142 · apache/druid

gianm · 2023-04-21T20:34:55Z

Main changes:

1) Dictionary-encoded and front-coded string columns: in default-value
   mode, detect cases where a dictionary has the empty string in it, then
   either combine it with null (if null is present) or replace it with
   null (if null is not present).
2) Numeric nullable columns: in default-value mode, ignore the null
   value bitmap. This causes all null numbers to be read as zeroes.
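
The string-column rule above comes down to a three-way decision on the first entries of the dictionary. The following is a minimal sketch of that decision, not the code from this PR. It assumes the dictionary is sorted with null first (if present) and the empty string next (if present); the class, enum, and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the decision described above; not Druid's classes.
final class DictionaryAdjustmentSketch
{
  enum Adjustment { NONE, COMBINE_FIRST_TWO, REPLACE_FIRST_WITH_NULL }

  // Assumption: the dictionary is sorted, so null (if present) is entry 0
  // and "" (if present) directly follows it.
  static Adjustment classify(List<String> sortedDictionary)
  {
    if (sortedDictionary.isEmpty()) {
      return Adjustment.NONE;
    }
    String first = sortedDictionary.get(0);
    if (first == null) {
      // null is present: merge it with "" only if "" is the next entry.
      boolean secondIsEmpty = sortedDictionary.size() > 1 && "".equals(sortedDictionary.get(1));
      return secondIsEmpty ? Adjustment.COMBINE_FIRST_TWO : Adjustment.NONE;
    }
    // null is absent: if "" is the first entry, it is read as null instead.
    return first.isEmpty() ? Adjustment.REPLACE_FIRST_WITH_NULL : Adjustment.NONE;
  }

  public static void main(String[] args)
  {
    System.out.println(classify(Arrays.asList(null, "", "a")));
    System.out.println(classify(Arrays.asList("", "a")));
    System.out.println(classify(Arrays.asList(null, "a")));
  }
}
```

Segments written in default-value mode never contain the empty string in the dictionary, so under these assumptions they fall into the NONE case and are read without any wrapper.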

Testing strategy:

1) Add a mmappedWithSqlCompatibleNulls case to BaseFilterTest that
   writes segments under SQL-compatible mode, and reads them under
   default-value mode.
2) Unit tests for the new wrapper classes (CombineFirstTwoEntriesIndexed,
   CombineFirstTwoValuesColumnarInts, CombineFirstTwoValuesColumnarMultiInts,
   CombineFirstTwoValuesIndexedInts).

gianm · 2023-04-21T20:35:35Z

Currently a draft PR since I haven't written unit tests for the new helper classes yet. I'm interested in feedback on the approach though.

+    if (index == FIRST_ID) {
+      return newFirstValue();
+    } else {
+      return delegate.get(index + 1);


+    if (i == NULL_ID) {
+      return i;
+    } else {
+      return i - 1;
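
The two hunks above are the two halves of one remapping: the dictionary wrapper shifts lookups up by one, and the value wrapper shifts stored IDs down by one, so that the old null and empty-string entries share ID 0. A minimal sketch of how the halves fit together follows. It uses a plain List instead of Druid's Indexed, assumes the merged first entry reads as null, and all names in it are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of combining the first two dictionary entries.
final class CombineFirstTwoSketch
{
  private static final int FIRST_ID = 0;

  // Dictionary side: new ID 0 is the merged entry (assumed to read as null);
  // new ID n > 0 maps to old ID n + 1.
  static String lookup(List<String> delegate, int newId)
  {
    return newId == FIRST_ID ? null : delegate.get(newId + 1);
  }

  // The adjusted dictionary has one entry fewer than the original.
  static int adjustedSize(List<String> delegate)
  {
    return delegate.size() - 1;
  }

  // Row side: old ID 0 (null) stays 0, old ID 1 ("") becomes 0,
  // and every other ID shifts down by one.
  static int remapStoredId(int oldId)
  {
    return oldId == FIRST_ID ? oldId : oldId - 1;
  }

  public static void main(String[] args)
  {
    List<String> dictionary = Arrays.asList(null, "", "a", "b");
    for (int oldId = 0; oldId < dictionary.size(); oldId++) {
      int newId = remapStoredId(oldId);
      System.out.println(oldId + " -> " + newId + " -> " + lookup(dictionary, newId));
    }
  }
}
```

The remapping runs on every dictionary lookup and every row read, so it adds per-read work in exactly the case the wrappers are built for.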


gianm · 2023-04-21T23:53:48Z

There will be some query-time performance overhead when a server in default-value mode reads a segment that was written in SQL-compatible mode. I don't really see a way around this: the dictionaries need to be adjusted, and that adjustment has a cost.

There's no overhead for default-value mode reading default-value-mode-written segments, or SQL-compatible mode reading any kind of segments.

gianm · 2023-06-26T03:22:27Z

Added tests and marked this PR as ready for review.

clintropolis

I think the 'auto' string column should probably (unfortunately) use this stuff too, but that can be done as a follow-up. I also think it can be totally combined with the front-coded string column, which is no longer specific to front-coding after this PR; rather, it's a string column that only has a UTF-8 buffer dictionary.

clintropolis · 2023-06-26T23:53:49Z

      final ImmutableBitmap bitmap;
      final boolean hasNulls;
-      if (buffer.hasRemaining()) {
+      if (buffer.hasRemaining() && NullHandling.sqlCompatible()) {


I think technically we still want to either read this, to move the buffer ahead by side effect, or just move the buffer position like we do when we read the numeric column. It's probably fine because nothing actually uses more than one column part, but column parts are read in a for loop, so each part expects the buffer to be in the correct position to deserialize. We should at least leave a comment about it, in case anything ever starts using column parts or we add something to the end of this column part.
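
The concern can be illustrated with a small sketch: always consume the bitmap bytes so the buffer position ends up past them, then discard the result in default-value mode. This is not the PR's code. The length-prefixed layout, the byte-array bitmap, and every name below are assumptions made for illustration.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration of reading a null bitmap without retaining it.
final class NullBitmapReadSketch
{
  // Assumed layout: a 4-byte length followed by that many bitmap bytes.
  // Returns the bitmap in SQL-compatible mode and null in default-value mode,
  // but advances the buffer position identically in both modes.
  static byte[] readNullBitmap(ByteBuffer buffer, boolean sqlCompatible)
  {
    if (!buffer.hasRemaining()) {
      return null;
    }
    int length = buffer.getInt();
    byte[] bitmap = new byte[length];
    buffer.get(bitmap); // always consume, so the next reader starts at the right offset
    return sqlCompatible ? bitmap : null;
  }

  // With no retained bitmap, no row is null, so a reader returns the stored
  // value for every row (zero for null rows, per the PR description).
  static boolean isNull(byte[] bitmap, int row)
  {
    return bitmap != null && (bitmap[row >>> 3] & (1 << (row & 7))) != 0;
  }

  public static void main(String[] args)
  {
    ByteBuffer buffer = ByteBuffer.allocate(9);
    buffer.putInt(1).put((byte) 0b1).putInt(42).flip();
    byte[] bitmap = readNullBitmap(buffer, false);
    System.out.println("row 0 null: " + isNull(bitmap, 0));
    System.out.println("next int: " + buffer.getInt());
  }
}
```

Skipping the read entirely would leave the position at the start of the bitmap, and anything deserialized afterwards would start from the wrong offset.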

clintropolis · 2023-06-26T23:55:21Z

      final ImmutableBitmap bitmap;
      final boolean hasNulls;
-      if (buffer.hasRemaining()) {
+      if (buffer.hasRemaining() && NullHandling.sqlCompatible()) {


same comment about buffer position

clintropolis · 2023-06-26T23:58:32Z

      final ImmutableBitmap bitmap;
      final boolean hasNulls;
-      if (buffer.hasRemaining()) {
+      if (buffer.hasRemaining() && NullHandling.sqlCompatible()) {


same comment about buffer position

clintropolis · 2023-06-27T03:08:46Z

  @Nullable
  private final ColumnarMultiInts multiValueColumn;
-  private final FrontCodedIndexed utf8Dictionary;
+  private final Indexed<ByteBuffer> utf8Dictionary;


I think this and ScalarStringDictionaryEncodedColumn can now be combined, but I can do that in a follow-up PR.

I did this in the latest patch.

clintropolis · 2023-06-27T03:12:53Z

+ *
+ * @see NullHandling#mustReplaceFirstValueWithNullInDictionary(Indexed)
+ */
+public class ReplaceFirstValueWithNullIndexed<T> implements Indexed<T>


is this primarily so that callers don't have to worry about emptyToNullIfNeeded/nullToEmptyIfNeeded for the case where the dictionary had "" but no nulls?

Yes, it's for situations where the dictionary has "" but no null. It replaces the "" with a null.
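
A minimal sketch of that idea, using java.util.List rather than Druid's Indexed interface (the class name is hypothetical, and this is not the PR's implementation): the wrapper reports null for entry 0 and delegates everything else, so dictionary IDs stay unchanged.

```java
import java.util.AbstractList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration: a read-only view whose first entry reads as null.
final class ReplaceFirstWithNullSketch extends AbstractList<String>
{
  private final List<String> delegate;

  // Assumption: the caller has already checked that entry 0 is the empty string.
  ReplaceFirstWithNullSketch(List<String> delegate)
  {
    this.delegate = delegate;
  }

  @Override
  public String get(int index)
  {
    // Delegate first so out-of-range indexes still throw.
    String value = delegate.get(index);
    return index == 0 ? null : value;
  }

  @Override
  public int size()
  {
    return delegate.size();
  }

  public static void main(String[] args)
  {
    List<String> view = new ReplaceFirstWithNullSketch(Arrays.asList("", "a", "b"));
    System.out.println(view);
  }
}
```

Because the size and all other positions are unchanged, the row values need no remapping in this case, unlike the combine-first-two case.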

gianm · 2023-06-27T16:38:59Z

Pushed up a patch that addresses these comments:

1) Read bitmaps even if we don't retain them.
2) Combine StringFrontCodedDictionaryEncodedColumn and ScalarStringDictionaryEncodedColumn.

gianm added the Compatibility and Area - Null Handling labels Apr 21, 2023

github-advanced-security AI found potential problems Apr 21, 2023

Fix a mistake, use more singlethreadedness.

782ecf1

gianm mentioned this pull request Apr 25, 2023

SQL-compatible null handling and strict booleans by default #14154

Closed

gianm added 2 commits May 3, 2023 11:49

Merge branch 'master' into null-handling-compat-reads

2b02d97

WIP

56086b0

2bethere mentioned this pull request Jun 7, 2023

Druid 27.0 release planning #14386

Closed

5 tasks

clintropolis added this to the 27.0 milestone Jun 7, 2023

gianm added 3 commits June 23, 2023 15:06

Merge branch 'master' into null-handling-compat-reads

bfd309f

Merge branch 'master' into null-handling-compat-reads

cd07647

Tests, improvements.

f8fcd31

gianm marked this pull request as ready for review June 26, 2023 03:22

gianm added 4 commits June 25, 2023 22:34

Style.

eef70cb

See Spot bug.

6f13101

Merge branch 'master' into null-handling-compat-reads

a20e3e4

Remove unused method.

e17e0be

clintropolis approved these changes Jun 27, 2023

Address review comments.

897c81a

1) Read bitmaps even if we don't retain them. 2) Combine StringFrontCodedDictionaryEncodedColumn and ScalarStringDictionaryEncodedColumn.

clintropolis approved these changes Jun 27, 2023

Add missing tests.

8690646

clintropolis approved these changes Jun 27, 2023

gianm merged commit 82fbb31 into apache:master Jun 28, 2023

gianm deleted the null-handling-compat-reads branch June 28, 2023 17:30

clintropolis mentioned this pull request Jun 29, 2023

remove druid.processing.columnCache.sizeBytes and CachingIndexed, combine string column implementations #14500

Merged

6 tasks

clintropolis mentioned this pull request Aug 9, 2023

enable sql compatible null handling mode by default #14792

Merged

4 tasks
