Make NULL handling in Druid more compatible to SQL standard by nishantmonu51 · Pull Request #4754 · apache/druid

nishantmonu51 · 2017-09-05T17:54:52Z

This PR improves Druid's handling of null values by treating nulls as missing values that do not actually exist. This makes Druid more compatible with the SQL standard and eases integration with existing BI systems that support ODBC/JDBC. (Detailed description in proposal #4349.)

Changes include -

New Configuration

New configuration - "druid.generic.useDefaultValueForNull" in class NullValueHandlingConfig, which defaults to true to preserve the old behavior.
NullHandlingHelper - a helper class with static methods that perform the optional nullToEmpty / emptyToNull conversions when switching between the old and new behavior. Without this static helper, we would need to inject NullValueHandlingConfig almost everywhere throughout the querying and indexing layers.
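A minimal sketch of what such a static helper could look like. The class name NullHandlingSketch, the two-argument overloads, and reading the flag from a system property are assumptions made for illustration; the PR itself wires the flag through NullValueHandlingConfig.

```java
// Hypothetical sketch, not the PR's actual helper class.
final class NullHandlingSketch
{
  private NullHandlingSketch() {}

  // Assumption: the flag is read from a JVM system property; true keeps the legacy behavior.
  static boolean useDefaultValueForNull()
  {
    return Boolean.parseBoolean(System.getProperty("druid.generic.useDefaultValueForNull", "true"));
  }

  // Legacy mode collapses "" into null; new mode leaves the value untouched.
  static String emptyToNullIfNeeded(String value)
  {
    return emptyToNullIfNeeded(value, useDefaultValueForNull());
  }

  // Explicit-flag overload so the behavior can be exercised without touching global state.
  static String emptyToNullIfNeeded(String value, boolean useDefault)
  {
    if (useDefault && value != null && value.isEmpty()) {
      return null;
    }
    return value;
  }

  // Legacy mode turns null into ""; new mode keeps null as null.
  static String nullToEmptyIfNeeded(String value, boolean useDefault)
  {
    return (useDefault && value == null) ? "" : value;
  }
}
```

Call sites that previously used an unconditional emptyToNull would call the helper instead, so the conversion only happens when the legacy behavior is enabled.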

Storage Layer -

GenericIndexed - Added the ability to distinguish between null and empty bytes. The value buffer for GenericIndexed contains the element size followed by the bytes; for null values the size is written as -1, and for empty bytes the size is 0.
String Columns - The dictionary is stored as a GenericIndexed, which can now contain a null value. An inverted index is created for the id assigned to the null value. Null and empty strings get different ids in the dictionary encoding.
Numeric Columns - The column is serialized as primitive values of long/float/double type; a nullRowsBitmap is also stored, which is used to distinguish rows holding 0 from rows holding null.
Complex columns - Complex columns use GenericIndexed internally for storage, so null support in GenericIndexed makes this work. The ObjectStrategy can choose to return null objects when deserializing data.
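The size-prefix convention described for GenericIndexed can be illustrated with a small stand-alone codec. This is only a sketch of the idea (size -1 for null, size 0 for empty bytes); NullAwareValueCodec is a hypothetical name and the real GenericIndexed layout contains more than a single size-prefixed element.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of a size-prefixed element where -1 marks null.
final class NullAwareValueCodec
{
  static final int NULL_SIZE = -1;

  private NullAwareValueCodec() {}

  // Writes [int size][bytes]; a null value is written as size -1 with no payload.
  static ByteBuffer write(byte[] value)
  {
    int payload = value == null ? 0 : value.length;
    ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + payload);
    buf.putInt(value == null ? NULL_SIZE : value.length);
    if (value != null) {
      buf.put(value);
    }
    buf.flip();
    return buf;
  }

  // Reads one element; returns null for size -1 and an empty array for size 0.
  static byte[] read(ByteBuffer buf)
  {
    int size = buf.getInt();
    if (size == NULL_SIZE) {
      return null;
    }
    byte[] out = new byte[size];
    buf.get(out);
    return out;
  }
}
```

Because null and empty now round-trip to different values, a dictionary built on top of this encoding can assign them different ids.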

Indexing Layer -

Row - Instead of returning primitive types, return Float/Long/Double objects so that the indexer can distinguish between null and 0 values.
StringIndexer, DoubleIndexer, LongIndexer, FloatIndexer - changed to distinguish between null and empty values.
ColumnValueSelector - Added a new method isNull() to check whether a value is null while aggregating. This is mainly done to avoid boxing/unboxing of primitive values during segment scans.
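The following sketch shows why a separate isNull() check lets a scan stay on primitives. NullableLongColumnSketch is a hypothetical stand-in for a long column selector backed by a values array and a null-rows bitmap; it is not the PR's ColumnValueSelector.

```java
import java.util.BitSet;

// Hypothetical sketch: a primitive long column plus a bitmap of null rows.
final class NullableLongColumnSketch
{
  private final long[] values;
  private final BitSet nullRows;
  private int row;

  NullableLongColumnSketch(long[] values, BitSet nullRows)
  {
    this.values = values;
    this.nullRows = nullRows;
  }

  void setRow(int row)
  {
    this.row = row;
  }

  int size()
  {
    return values.length;
  }

  // True when the current row is null; the stored primitive is then only a placeholder.
  boolean isNull()
  {
    return nullRows.get(row);
  }

  long getLong()
  {
    return values[row];
  }

  // Scans the column without allocating a Long per row.
  static long sumSkippingNulls(NullableLongColumnSketch column)
  {
    long sum = 0;
    for (int i = 0; i < column.size(); i++) {
      column.setRow(i);
      if (!column.isNull()) {
        sum += column.getLong();
      }
    }
    return sum;
  }
}
```

The caller asks isNull() first and only then reads the primitive, so a stored 0 and a null row stay distinguishable without boxing.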

Nullability for Aggregators/Metrics

In the current implementation, most aggregators are coded to replace null values with a default, e.g. sum treats them as zeroes, min treats them as positive infinity, etc.
To match the new semantics and make aggregators nullable, so that an aggregator that encounters only null values produces a null result, the following changes are made -

Aggregator/BufferAggregator - Added a new method boolean isNull(), which returns false by default. Aggregators that support nullability can override this and return true if the aggregated result is null.
Added NullableAggregator/NullableBufferAggregator, which are used as decorators for existing aggregator implementations to make them nullable. They return null if all the aggregated values are null; if any aggregated value is non-null, they behave the same as the delegate aggregator.
If the new null behavior is enabled, aggregator factories decorate the aggregators with NullableAggregator/NullableBufferAggregator.
Cardinality and HyperUnique aggregators ignore null values and count only non-null values. If all the encountered values are null, the result is 0. This differs from the current behavior, where null values are also counted.
Count aggregator - Since the count aggregator is not associated with any column and simply counts the number of rows, it behaves the same as before.
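The decorator idea can be sketched as follows. All type names here (SimpleLongAggregator, LongSumSketch, NullableLongAggregatorSketch) are hypothetical and much smaller than Druid's real Aggregator interface; the sketch only shows the "null until the first non-null value" rule.

```java
// Hypothetical, simplified aggregator contract.
interface SimpleLongAggregator
{
  void aggregate(long value);

  long getLong();

  // Non-nullable aggregators keep the default.
  default boolean isNull()
  {
    return false;
  }
}

// A plain sum that knows nothing about nulls.
final class LongSumSketch implements SimpleLongAggregator
{
  private long sum;

  @Override
  public void aggregate(long value)
  {
    sum += value;
  }

  @Override
  public long getLong()
  {
    return sum;
  }
}

// Decorator: the result stays null until at least one non-null value is aggregated.
final class NullableLongAggregatorSketch implements SimpleLongAggregator
{
  private final SimpleLongAggregator delegate;
  private boolean isNull = true;

  NullableLongAggregatorSketch(SimpleLongAggregator delegate)
  {
    this.delegate = delegate;
  }

  // Called by the scan loop; valueIsNull mirrors the selector's isNull().
  void aggregate(boolean valueIsNull, long value)
  {
    if (!valueIsNull) {
      aggregate(value);
    }
  }

  @Override
  public void aggregate(long value)
  {
    isNull = false;
    delegate.aggregate(value);
  }

  @Override
  public long getLong()
  {
    if (isNull) {
      throw new IllegalStateException("aggregated value is null");
    }
    return delegate.getLong();
  }

  @Override
  public boolean isNull()
  {
    return isNull;
  }

  // Boxed view of the result: null when every input was null.
  Long get()
  {
    return isNull ? null : Long.valueOf(delegate.getLong());
  }
}
```

A factory would wrap the delegate only when the new null behavior is enabled, leaving the legacy path unchanged.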

Math Expressions Null Handling

For general calculations like sum, the full expression is considered null if any of its components is null.
StringLiteral now supports null, and an expression containing a non-quoted null is parsed as a StringLiteral with a null value.
A default value for null can be specified with the NVL or IF clause to assign default values at query time.
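As a rough illustration of these semantics (not Druid's expression engine), the helpers below propagate null through an addition and apply an NVL-style default. The class and method names are hypothetical.

```java
// Hypothetical sketch of null-propagating arithmetic and an NVL-style default.
final class NullPropagationSketch
{
  private NullPropagationSketch() {}

  // The whole expression is null if either operand is null.
  static Long add(Long left, Long right)
  {
    if (left == null || right == null) {
      return null;
    }
    return Long.valueOf(left.longValue() + right.longValue());
  }

  // NVL-style helper: replaces null with a caller-chosen default at query time.
  static Long nvl(Long value, long defaultValue)
  {
    return value == null ? Long.valueOf(defaultValue) : value;
  }
}
```

For example, nvl(add(x, y), 0) recovers the legacy "null counts as zero" result for callers that still want it.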

Filtering on Null values

SelectorDimFilter currently supports filtering on null values, but the implementation treats null and empty strings as equivalent. The implementation is changed to treat null and the empty string differently. Cache key generation for SelectorDimFilter is also modified to support null.
InFilter/ExpressionFilter/LikeFilter are also modified so that they do not treat nulls as empty strings.

Changes to Druid built-in SQL layer

NULL and empty strings are treated differently.
IS NULL and IS NOT NULL filter against null values instead of empty strings.
NVL, IFNULL and COALESCE work as per SQL standard.
count now skips null values and works as per SQL standard.

Misc Changes

Above are the major changes in the user-facing APIs and behavior. Other than these, there are multiple places in the code where we convert empty Strings to null and vice versa. They are changed to treat null and empty String values separately.

Backwards Compatibility

"druid.generic.useDefaultValueForNull" is added as a new config. The default value is set to true for backwards compatibility.
Places in the query/ingestion code where we converted emptyToNull or nullToEmpty now use NullHandlingHelper, which does the conversion only if backwards compatibility is enabled.
Storage layer changes are backwards compatible for GenericIndexed/String columns.
For Numeric columns, new V2 serde implementations are introduced which also support serde of nullRowsBitmaps. Previous implementations are kept unchanged to read existing segments.

Testing

Most of the unit tests that tested nulls are modified to have two branches of expected results, to test both the new and the old behavior.
For CI, new runs are added to the test matrix to test the new null behavior.
To run the tests with the new behavior, pass "-Ddruid.generic.useDefaultValueForNull=false".

drcrallen · 2017-09-07T16:54:40Z

Re: Row - instead of returning primitive types return Float/Long/Double objects so that indexer can distinguish between null and 0 values.

We have pretty bad heap memory usage problems during indexing. Does this add extra heap pressure?

leventov · 2017-09-07T19:29:19Z

It could be avoided by changing Long Row.getLongMetric(String) to void Row.getLongMetric(String, LongMetricReceiver), where LongMetricReceiver is a mutable object with long metric and boolean isNull fields.
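A minimal sketch of the receiver pattern suggested here. LongMetricReceiver and its two fields come from the comment above; MapRowSketch and its map-backed storage are assumptions made for illustration and are not Druid's Row implementation.

```java
import java.util.Map;

// Mutable out-parameter, reused across rows so no Long is allocated per call.
final class LongMetricReceiver
{
  long metric;
  boolean isNull;
}

// Hypothetical row backed by a map of raw event values.
final class MapRowSketch
{
  private final Map<String, Object> event;

  MapRowSketch(Map<String, Object> event)
  {
    this.event = event;
  }

  // Fills the receiver instead of returning a boxed Long.
  void getLongMetric(String name, LongMetricReceiver receiver)
  {
    Object raw = event.get(name);
    if (raw == null) {
      receiver.isNull = true;
      receiver.metric = 0L;
    } else {
      receiver.isNull = false;
      receiver.metric = ((Number) raw).longValue();
    }
  }
}
```

The caller allocates one receiver up front and checks receiver.isNull after each call, which keeps null and 0 distinguishable without returning an object.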

nishantmonu51 · 2017-09-13T15:45:18Z

@drcrallen @leventov Row already stores a map underneath, so we are not creating any additional objects or introducing new boxing/unboxing with this change. The column selectors add an isNull method, as suggested by @leventov.

b-slim · 2017-09-13T19:11:02Z

style issue.

b-slim · 2017-09-13T19:13:06Z

annotate with nullable.

b-slim · 2017-09-13T19:13:49Z

b-slim · 2017-09-13T19:14:23Z

same here need to revert those changes i guess

b-slim · 2017-09-13T19:15:37Z

b-slim · 2017-09-13T19:16:09Z

no need for new lines

b-slim · 2017-09-13T20:35:06Z

can we add a comment here what this function is for and when should be implemented?

b-slim · 2017-09-13T20:36:06Z

same here not clear to me what this function is for?

b-slim · 2017-09-13T20:42:16Z

TODO need to be done?

b-slim · 2017-09-13T20:53:02Z

this seems to be duplicated, can we use this one or io.druid.segment.DimensionHandlerUtils#ZERO_DOUBLE

this still open -> https://github.com/druid-io/druid/blob/7552a611f18fe460136f1b4953ef69d64d9e8718/processing/src/main/java/io/druid/segment/DimensionHandlerUtils.java#L48

b-slim · 2017-09-13T20:57:02Z

not sure what does this means?

b-slim · 2017-09-13T21:37:07Z

recommend just using true since it is used only here.

b-slim · 2017-09-13T21:38:33Z

this change is not needed; it creates a style issue rather than fixing one, though.

b-slim · 2017-09-13T21:38:45Z

b-slim · 2017-09-14T19:50:53Z

Need to set this to true though.

this is set to true in reset() method. Called it from constructor too.

I don't think it is true for all the aggregators, for instance, io.druid.query.aggregation.CountAggregator. It is better to initialize it, since it should be true; not sure why you want to avoid this?

b-slim · 2017-09-14T23:00:41Z

unused stuff

used when deserializing from column metadata i.e json.

b-slim · 2017-09-15T01:29:59Z

how is this working if it returns null ? not sure am getting this?

It was an unused method, added earlier before columnSelectors were refactored in another PR. Removed this method.

Storage changes

nishantmonu51 · 2017-12-20T13:25:31Z

@leventov handled comments.

For the BaseNullableColumnValueSelector method isNull() returning a boolean value, I still feel that returning a boolean instead of three-valued logic is cleaner API-wise.
To address your review comment, I have changed BaseObjectColumnValueSelector to not inherit isNull(). boolean isNull() is only meant for Long/Float/Double selectors, where the return type is a primitive value instead of an Object. For selectors used as ObjectSelectors, the user should call getObject(), which returns a Nullable object.

Now for the MapVirtualColumnValueSelector case that you pointed out, I modified the isNull() method to always return false, since it is not supposed to be used as a Long/Float/Double column selector. We could also throw an unsupported-operation exception there, but I chose to return false because the implementations of getDouble/getFloat/getLong also always return 0 instead of throwing.

leventov

Please fix IntelliJ inspections, their failure is not flaky, it's real.

leventov · 2017-12-20T16:24:41Z

  }

-  public static String defaultValue()
+  public static String defaultStringValue()


Add @Nullable.

leventov · 2017-12-20T16:33:31Z

+    return useDefaultValuesForNull() ? Strings.emptyToNull(value) : value;
+  }
+
+  public static String defaultStringValue()


There are at least 6 more places in this PR where this method could be used.

All defaultXxxValue() methods could be optimized via lazy static constant evaluation via a private static inner class.

For the lazy static constant evaluation, I thought the JIT would be able to optimize this.

The JIT could probably throw away branches if the condition were a static final boolean constant, but it's a method call that delegates to another method call on an instance. So I doubt it could optimize this.
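The "lazy static constant evaluation via a private static inner class" mentioned above is the initialization-on-demand holder idiom. A sketch, assuming the flag can be read from a system property (in the PR it comes from injected configuration) and using the hypothetical class name DefaultValuesSketch:

```java
// Hypothetical sketch of lazily evaluated default-value constants.
final class DefaultValuesSketch
{
  private DefaultValuesSketch() {}

  // Assumption: flag read from a JVM system property; true keeps the legacy behavior.
  static boolean useDefaultValuesForNull()
  {
    return Boolean.parseBoolean(System.getProperty("druid.generic.useDefaultValueForNull", "true"));
  }

  // The JVM initializes Holder only on first access to one of its fields,
  // so the flag is evaluated once, after configuration has been set up.
  private static final class Holder
  {
    static final String DEFAULT_STRING = useDefaultValuesForNull() ? "" : null;
    static final Long DEFAULT_LONG = useDefaultValuesForNull() ? Long.valueOf(0L) : null;
  }

  static String defaultStringValue()
  {
    return Holder.DEFAULT_STRING;
  }

  static Long defaultLongValue()
  {
    return Holder.DEFAULT_LONG;
  }
}
```

After the first call, each defaultXxxValue() is a read of a static final field rather than a chain of method calls, which is the kind of code the JIT can fold easily.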

leventov · 2017-12-20T16:34:36Z

+  }
+
+  @Nullable
+  public static Long nullToZeroIfNeeded(@Nullable Long value)


nullToZeroIfNeeded() methods are actually used only with null argument, so I think they should be removed in favor of defaultXxxValue()

leventov · 2017-12-20T16:39:32Z

+    return useDefaultValuesForNull() ? "" : null;
+  }
+
+  public static Long defaultLongValue()


Add @Nullable. Could you please do it yourself, in a PR that is all about nulls, so that I don't need to point this out for each parameter, method return type and field separately?

leventov · 2017-12-20T16:42:25Z

  }

+  /**
+   * returns true if the Aggregator supports returning null values and the aggregated value is Null.


Please rewrite this doc according to the new policy (always false for logical object aggregators)

leventov · 2017-12-20T16:44:34Z

  }

+  /**
+   * Returns true if the aggregator is nullable and the aggregated value is null


Please update this doc according to the new policy

…ling-4

nishantmonu51 · 2018-01-04T07:16:32Z

@leventov: have you finished reviewing the changes ?
Let me know once you are done so that I can handle all review comments and test the changes at once.

leventov · 2018-01-04T08:59:08Z

@nishantmonu51 I won't review until next Tuesday. Could you please fix the conflicts and the failing IntelliJ CI in the meantime?

leventov · 2018-01-09T16:02:54Z

@nishantmonu51

I think returning always false is an implementation detail, not sure if that belongs to the API javadoc.
Can you elaborate more on exactly what you need me to add here ?

I cannot find this above in comments (probably it was discussed during a meeting), that isNull() returns boolean rather than 3-value enum. It's because it effectively returns something reasonable only when the aggregator's output type is primitive long/double/float, but when it's Object, isNull() always returns false. I don't think it's implementation detail, because it yields "unexpected" results sometimes: getObject() could return null, but isNull() returns false. But it allows isNull() to be simpler. It specifies expectations from the interface and how it should be implemented (any impl of the interface), so it should be in the javadoc

leventov

Review up to DoubleCardinalityAggregatorColumnSelectorStrategy

leventov · 2018-01-09T16:07:07Z

+
+import javax.annotation.Nullable;
+
+public class NullHandling


Please add doc to this class, explaining its purpose. Maybe include a link to this PR or the corresponding issue.

leventov · 2018-01-09T16:36:45Z

  public StringExpr(String value)
  {
-    this.value = Strings.emptyToNull(value);
+    this.value = NullHandling.emptyToNullIfNeeded(value);


Is there any way to prevent the situation when somebody forgets to update Strings.emptyToNull() to NullHandling.emptyToNullIfNeeded(), or uses the "old" method in some newly written code?

done. added a checkstyle check.

leventov · 2018-01-09T16:39:41Z

+      if (value == null) {
+        idForNull = index;
+        return false;
+      }


I think it's clearer to add else { return true; } here

leventov · 2018-01-09T16:42:00Z

+      if (value == null) {
+        idForNull = index;
+        return false;
+      }


leventov · 2018-01-09T16:49:25Z

+        @Override
+        public boolean isNull()
+        {
+          return baseSelector.getObject().isNull();


Please align impl with Long and Double variants

leventov · 2018-01-09T17:47:04Z

+public abstract class NullableAggregatorFactory extends AggregatorFactory
+{
+  @Override
+  final public Aggregator factorize(ColumnSelectorFactory metricFactory)


public is usually before final, same below

leventov · 2018-01-09T17:48:44Z

+  @Override
+  public float getFloat(ByteBuffer buf, int position)
+  {
+    return delegate.getFloat(buf, position + Byte.BYTES);


Same as above, suggested to make a defensive check

leventov · 2018-01-09T17:49:07Z

+  @Override
+  public long getLong(ByteBuffer buf, int position)
+  {
+    return delegate.getLong(buf, position + Byte.BYTES);


leventov · 2018-01-09T17:49:15Z

+  @Override
+  public double getDouble(ByteBuffer buf, int position)
+  {
+    return delegate.getDouble(buf, position + Byte.BYTES);


leventov · 2018-01-09T17:49:47Z

+  @Override
+  public boolean isNull(ByteBuffer buf, int position)
+  {
+    return buf.get(position) == IS_NULL_BYTE;


|| delegate.isNull(buf, position + Byte.BYTES)
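The snippets above suggest a layout where one leading byte at position marks nullness and the delegate's state starts at position + Byte.BYTES. A self-contained sketch of that layout for a long sum follows; the byte values chosen for IS_NULL_BYTE and IS_NOT_NULL_BYTE and the class name are assumptions, and the real NullableBufferAggregator delegates to another BufferAggregator rather than summing itself.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: [1 null-flag byte][8-byte long sum] stored at a given buffer position.
final class NullableLongSumBufferSketch
{
  // Assumed flag values; the source does not show the actual constants.
  static final byte IS_NULL_BYTE = 1;
  static final byte IS_NOT_NULL_BYTE = 0;

  private NullableLongSumBufferSketch() {}

  // Starts out null with a zeroed delegate slot.
  static void init(ByteBuffer buf, int position)
  {
    buf.put(position, IS_NULL_BYTE);
    buf.putLong(position + Byte.BYTES, 0L);
  }

  // Called only for non-null rows: clears the flag and updates the sum.
  static void aggregate(ByteBuffer buf, int position, long value)
  {
    buf.put(position, IS_NOT_NULL_BYTE);
    int valuePosition = position + Byte.BYTES;
    buf.putLong(valuePosition, buf.getLong(valuePosition) + value);
  }

  static boolean isNull(ByteBuffer buf, int position)
  {
    return buf.get(position) == IS_NULL_BYTE;
  }

  // Defensive check, as suggested in the review: refuse to read a primitive from a null slot.
  static long getLong(ByteBuffer buf, int position)
  {
    if (isNull(buf, position)) {
      throw new IllegalStateException("cannot read a primitive long from a null aggregate");
    }
    return buf.getLong(position + Byte.BYTES);
  }
}
```

Absolute get/put calls are used throughout so the buffer's own position is never disturbed, which matters when many aggregators share one buffer.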

leventov · 2018-01-10T11:13:58Z

@@ -31,12 +32,16 @@ public class DoubleCardinalityAggregatorColumnSelectorStrategy
  @Override
  public void hashRow(BaseDoubleColumnValueSelector dimSelector, Hasher hasher)


"dimSelector" traditionally means DimensionSelector, so to avoid confusion I suggest to either call this parameter "columnSelector" or just "selector". In FloatCardinalityAggregatorColumnSelectorStrategy, it is called just "selector".

leventov · 2018-01-10T11:19:15Z

@@ -31,12 +32,16 @@ public class DoubleCardinalityAggregatorColumnSelectorStrategy
  @Override


Please add a javadoc to DoubleCardinalityAggregatorColumnSelectorStrategy stating that "if performance of this class appears to be a bottleneck for somebody, one simple way to improve it is to split it into two different classes, one that is used when {@link NullHandling#useDefaultValuesForNull()} is false, and one - when it's true, moving this computation out of the tight loop".
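The split proposed in that javadoc could look roughly like this: the mode is checked once when the strategy is created, and each implementation's per-row method contains no configuration lookup. All names below are hypothetical, and a DoubleConsumer stands in for the real Hasher.

```java
import java.util.function.DoubleConsumer;

// Hypothetical sketch: choose the implementation once, outside the per-row loop.
interface DoubleRowHasherSketch
{
  void hashRow(boolean rowIsNull, double value, DoubleConsumer hasher);

  static DoubleRowHasherSketch create(boolean useDefaultValuesForNull)
  {
    return useDefaultValuesForNull ? new LegacyDoubleRowHasher() : new NullSkippingDoubleRowHasher();
  }
}

// Legacy mode: nulls were already read as the default value, so every row is hashed.
final class LegacyDoubleRowHasher implements DoubleRowHasherSketch
{
  @Override
  public void hashRow(boolean rowIsNull, double value, DoubleConsumer hasher)
  {
    hasher.accept(value);
  }
}

// New mode: null rows are skipped and do not contribute to cardinality.
final class NullSkippingDoubleRowHasher implements DoubleRowHasherSketch
{
  @Override
  public void hashRow(boolean rowIsNull, double value, DoubleConsumer hasher)
  {
    if (!rowIsNull) {
      hasher.accept(value);
    }
  }
}
```

The trade-off is two small classes instead of one, in exchange for a tight loop that never re-reads the null-handling flag.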

leventov · 2018-01-10T11:19:58Z

  }

  @Override
  public void hashValues(BaseDoubleColumnValueSelector dimSelector, HyperLogLogCollector collector)


Same as above

leventov · 2018-01-10T11:21:04Z

 import io.druid.segment.DimensionSelector;
 import io.druid.segment.data.IndexedInts;

 public class DistinctCountAggregator implements Aggregator


Please add a javadoc to DistinctCountAggregator stating that "if performance of this class appears to be a bottleneck for somebody, one simple way to improve it is to split it into two different classes, one that is used when {@link NullHandling#useDefaultValuesForNull()} is false, and one - when it's true, moving this computation out of the tight loop".

leventov · 2018-01-10T11:21:29Z

@@ -32,14 +34,23 @@

 public class DistinctCountBufferAggregator implements BufferAggregator


Same as for DistinctCountAggregator

leventov · 2018-01-10T11:47:40Z

-        final BaseObjectColumnValueSelector selector = metricFactory.makeColumnValueSelector(name);
-        return new LongLastAggregator(null, null)
+        final ColumnValueSelector selector = metricFactory.makeColumnValueSelector(name);
+        return Pair.of(new LongLastAggregator(null, null)


Please reformat

leventov · 2018-01-10T11:47:51Z

-        final BaseObjectColumnValueSelector selector = metricFactory.makeColumnValueSelector(name);
-        return new LongLastBufferAggregator(null, null)
+        final ColumnValueSelector selector = metricFactory.makeColumnValueSelector(name);
+        return Pair.of(new LongLastBufferAggregator(null, null)


Please reformat

leventov · 2018-01-10T11:47:58Z

@@ -163,7 +177,7 @@ public Object deserialize(Object object)
  @Override


leventov · 2018-01-10T11:48:05Z

@@ -163,7 +177,7 @@ public Object deserialize(Object object)
  @Override
  public Object finalizeComputation(Object object)


leventov · 2018-01-10T11:50:20Z

  {
    Iterator<PostAggregator> fieldsIter = fields.iterator();
-    double retVal = 0.0;
+    Double retVal = NullHandling.useDefaultValuesForNull() ? DimensionHandlerUtils.ZERO_DOUBLE : null;


NullHandling.defaultDoubleValue()

leventov · 2018-01-10T12:00:06Z

@nishantmonu51 after you address the comments that I have already left (please answer them individually), I suggest closing this PR and opening several new ones, at least separating the Aggregators/Expressions stuff (largely things that go first in this PR and which I have already mostly reviewed) from everything else, or maybe creating a separate PR for each part that you gave a title in the first comment in this thread, i.e. "Storage Layer", "Indexing Layer", etc.

There are two reasons for that:

It's almost impossible to work with this PR now: GitHub often fails with the Unicorn page, and even when it doesn't, in-page responsiveness is awful.
To merge separate things faster and avoid excessive code conflicts with other PRs.

It's not too important that those parts won't "work" in separation. E.g. see #4676: I added some abstractions in August which are still unused, because I haven't committed the code that uses them yet.

nishantmonu51 · 2018-01-12T15:46:01Z

@leventov Thanks for the review. I am working on your review comments, will reply to individual comments, and will try to break up the PR soon.

nishantmonu51 · 2018-01-22T18:51:29Z

closing in favor of #5278

nishantmonu51 changed the title ~~[WIP] Improve Handling of Nulls in Druid~~ [WIP] Make NULL handling in Druid more compatible to SQL standard Sep 5, 2017

nishantmonu51 requested review from b-slim, gianm and leventov and removed request for leventov September 5, 2017 17:55