Make NULL handling in Druid more compatible to SQL standard#4754
Make NULL handling in Druid more compatible to SQL standard#4754nishantmonu51 wants to merge 67 commits intoapache:masterfrom
Conversation
|
Re: We have pretty bad heap memory usage problems during indexing. Does this add extra heap pressure? |
|
It could be avoided by changing |
|
@drcrallen @leventov Row already stores map underneath, so we are not creating any additonal objects or introducing new boxing/unboxing by this change, The columnSelectors add a isNull method as suggested by @leventov |
There was a problem hiding this comment.
same here need to revert those changes i guess
There was a problem hiding this comment.
can we add a comment here what this function is for and when should be implemented?
There was a problem hiding this comment.
same here not clear to me what this function is for?
There was a problem hiding this comment.
this seems to be duplicated, can we use this one or io.druid.segment.DimensionHandlerUtils#ZERO_DOUBLE
There was a problem hiding this comment.
There was a problem hiding this comment.
not sure what does this means?
There was a problem hiding this comment.
recommend just using true since it is used only here.
There was a problem hiding this comment.
this change is not needed making style issue rather fixing it thought.
dc5a2bc to
3da96e4
Compare
There was a problem hiding this comment.
Need to set this to true thought.
There was a problem hiding this comment.
this is set to true in reset() method. Called it from constructor too.
There was a problem hiding this comment.
I don't think it is true for all the aggregators, for instance, io.druid.query.aggregation.CountAggregator. It is better to initialized since it should be true, not sure why you want to avoid this?
There was a problem hiding this comment.
used when deserializing from column metadata i.e json.
There was a problem hiding this comment.
how is this working if it returns null ? not sure am getting this?
There was a problem hiding this comment.
It was an unused method, was earlier added before columnSelectors were refactored in another PR, removed this method.
|
@leventov handled comments. For the BaseNullableColumnValueSelector method isNull() returning boolean value, I still feel that returning a boolean instead of three-valued logic is cleaner api wise. Now for the MapVirtualColumnValueSelector case that you pointed out, modified isNull() method to return always false, since it is not supposed to used as a Long/Float/Double column selector. We can also throw Unsupported Ex there, but chose to return false as the implementations for Double/getFloat/getLong also return 0 always instead of unsupported ex. |
leventov
left a comment
There was a problem hiding this comment.
Please fix IntelliJ inspections, their failure is not flaky, it's real.
| } | ||
|
|
||
| public static String defaultValue() | ||
| public static String defaultStringValue() |
| return useDefaultValuesForNull() ? Strings.emptyToNull(value) : value; | ||
| } | ||
|
|
||
| public static String defaultStringValue() |
There was a problem hiding this comment.
There are at least 6 more places in this PR where this method could be used.
There was a problem hiding this comment.
All defaultXxxValue() methods could be optimized via lazy static constant evaluation via a private static inner class.
There was a problem hiding this comment.
For the lazy static constant evaluation, thought JIT would be able to optimize this
There was a problem hiding this comment.
JIT could probably throw branches if the condition is a static final boolean constant, but it's a method call, that delegates to another method call on an instance. So I doubt it could optimize this.
| } | ||
|
|
||
| @Nullable | ||
| public static Long nullToZeroIfNeeded(@Nullable Long value) |
There was a problem hiding this comment.
nullToZeroIfNeeded() methods are actually used only with null argument, so I think they should be removed in favor of defaultXxxValue()
| return useDefaultValuesForNull() ? "" : null; | ||
| } | ||
|
|
||
| public static Long defaultLongValue() |
There was a problem hiding this comment.
Add @Nullable. Could you please do it yourself, in a PR that is all about nulls, so that I don't need to point this about each parameter, method return type and field in separation?
| } | ||
|
|
||
| /** | ||
| * returns true if the Aggregator supports returning null values and the aggregated value is Null. |
There was a problem hiding this comment.
Please rewrite this doc according to the new policy (always false for logical object aggregators)
| } | ||
|
|
||
| /** | ||
| * Returns true if the aggregator is nullable and the aggregated value is null |
There was a problem hiding this comment.
Please update this doc according to the new policy
|
@leventov: have you finished reviewing the changes ? |
|
@nishantmonu51 I won't review until the next Tuesday. Could you please fix conflicts and failing IntelliJ CI in the meantime |
I cannot find this above in comments (probably it was discussed during a meeting), that |
leventov
left a comment
There was a problem hiding this comment.
Review up to DoubleCardinalityAggregatorColumnSelectorStrategy
|
|
||
| import javax.annotation.Nullable; | ||
|
|
||
| public class NullHandling |
There was a problem hiding this comment.
Please add doc to this class, explaining it's purpose. Maybe include a link to this PR or corresponding issue.
| public StringExpr(String value) | ||
| { | ||
| this.value = Strings.emptyToNull(value); | ||
| this.value = NullHandling.emptyToNullIfNeeded(value); |
There was a problem hiding this comment.
Is there any way to prevent the situation when somebody forgets to update Strings.emptyToNull() to NullHandling.emptyToNullIfNeeded(), or uses the "old" method in some newly written code?
There was a problem hiding this comment.
done. added a checkstyle check.
| if (value == null) { | ||
| idForNull = index; | ||
| return false; | ||
| } |
There was a problem hiding this comment.
I think it's clearer to add else { return true; } here
| if (value == null) { | ||
| idForNull = index; | ||
| return false; | ||
| } |
| @Override | ||
| public boolean isNull() | ||
| { | ||
| return baseSelector.getObject().isNull(); |
There was a problem hiding this comment.
Please align impl with Long and Double variants
| public abstract class NullableAggregatorFactory extends AggregatorFactory | ||
| { | ||
| @Override | ||
| final public Aggregator factorize(ColumnSelectorFactory metricFactory) |
There was a problem hiding this comment.
public is usually before final, same below
| @Override | ||
| public float getFloat(ByteBuffer buf, int position) | ||
| { | ||
| return delegate.getFloat(buf, position + Byte.BYTES); |
There was a problem hiding this comment.
Same as above, suggested to make a defensive check
| @Override | ||
| public long getLong(ByteBuffer buf, int position) | ||
| { | ||
| return delegate.getLong(buf, position + Byte.BYTES); |
| @Override | ||
| public double getDouble(ByteBuffer buf, int position) | ||
| { | ||
| return delegate.getDouble(buf, position + Byte.BYTES); |
| @Override | ||
| public boolean isNull(ByteBuffer buf, int position) | ||
| { | ||
| return buf.get(position) == IS_NULL_BYTE; |
There was a problem hiding this comment.
|| delegate.isNull(buf, position + Byte.BYTES)
| @@ -31,12 +32,16 @@ public class DoubleCardinalityAggregatorColumnSelectorStrategy | |||
| @Override | |||
| public void hashRow(BaseDoubleColumnValueSelector dimSelector, Hasher hasher) | |||
There was a problem hiding this comment.
"dimSelector" traditionally means DimensionSelector, so to avoid confusion I suggest to either call this parameter "columnSelector" or just "selector". In FloatCardinalityAggregatorColumnSelectorStrategy, it is called just "selector".
| @@ -31,12 +32,16 @@ public class DoubleCardinalityAggregatorColumnSelectorStrategy | |||
| @Override | |||
There was a problem hiding this comment.
Please add a javadoc to DoubleCardinalityAggregatorColumnSelectorStrategy stating that "if performance of this class appears to be a bottleneck for somebody, one simple way to improve it is to split it into two different classes, one that is used when {@link NullHandling.useDefaultValuesForNull()} is false, and one - when it's true, moving this computation out of the tight loop".
| } | ||
|
|
||
| @Override | ||
| public void hashValues(BaseDoubleColumnValueSelector dimSelector, HyperLogLogCollector collector) |
| import io.druid.segment.DimensionSelector; | ||
| import io.druid.segment.data.IndexedInts; | ||
|
|
||
| public class DistinctCountAggregator implements Aggregator |
There was a problem hiding this comment.
Please add a javadoc to DistinctCountAggregator stating that "if performance of this class appears to be a bottleneck for somebody, one simple way to improve it is to split it into two different classes, one that is used when {@link NullHandling.useDefaultValuesForNull()} is false, and one - when it's true, moving this computation out of the tight loop".
| @@ -32,14 +34,23 @@ | |||
|
|
|||
| public class DistinctCountBufferAggregator implements BufferAggregator | |||
There was a problem hiding this comment.
Same as for DistinctCountAggregator
| final BaseObjectColumnValueSelector selector = metricFactory.makeColumnValueSelector(name); | ||
| return new LongLastAggregator(null, null) | ||
| final ColumnValueSelector selector = metricFactory.makeColumnValueSelector(name); | ||
| return Pair.of(new LongLastAggregator(null, null) |
| final BaseObjectColumnValueSelector selector = metricFactory.makeColumnValueSelector(name); | ||
| return new LongLastBufferAggregator(null, null) | ||
| final ColumnValueSelector selector = metricFactory.makeColumnValueSelector(name); | ||
| return Pair.of(new LongLastBufferAggregator(null, null) |
| @@ -163,7 +177,7 @@ public Object deserialize(Object object) | |||
| @Override | |||
| @@ -163,7 +177,7 @@ public Object deserialize(Object object) | |||
| @Override | |||
| public Object finalizeComputation(Object object) | |||
| { | ||
| Iterator<PostAggregator> fieldsIter = fields.iterator(); | ||
| double retVal = 0.0; | ||
| Double retVal = NullHandling.useDefaultValuesForNull() ? DimensionHandlerUtils.ZERO_DOUBLE : null; |
There was a problem hiding this comment.
NullHandling.defaultDoubleValue()
|
@nishantmonu51 after you address comments that I have already left (please answer to them individually), I suggest to close this PR and open several new ones, at least separate Aggregators/Expressions stuff (largely things that go first in this PR and which I already mostly reviewed) from everything else, or maybe create a separate PR for each part that you given a title in the first comment in this thread, i. e. "Storage Layer", "Indexing Layer", etc. There are two reasons for that:
It's not too important that those part won't "work" in separation. E. g. see #4676, I added some abstractions in August which are still unused, because I haven't committed code that uses them yet. |
|
@leventov Thanks for the review. working on your review comments, will reply to individual comments and try to break the PR soon. |
|
closing in favor of #5278 |
This PR contains changes to improve handling of null values in druid by treating nulls as missing values which don’t actually exist. This will make it more compatible to SQL standard and help with integration with other existing BI systems which support ODBC/JDBC. (Detailed description in proposal - #4349)
Changes include -
New Configuration
Storage Layer -
Indexing Layer -
Nullability for Aggregators/Metrics
As per the current implementation, most aggregators are coded to replace null values with default e.g. sum treats them as zeroes, min treats them as positive infinity, etc.
To match the new semantics and make aggregators nullable, where if an aggregator encounters only the null values the result will be null value following changes are made -
Math Expressions Null Handling
Filtering on Null values
Changes to Druid build-in SQL layer
Misc Changes
Above are the major changes in the user facing APIs and behavior. Other than these there are multiple places in the code where we convert empty Strings to null and vice-versa. They are changed in order to treat null and String values separately.
Backwards Compatibility
Testing