Skip to content

move numeric null value coercion out of expression processing engine#13809

Merged
clintropolis merged 11 commits intoapache:masterfrom
clintropolis:expr-null-handling
Mar 1, 2023
Merged

move numeric null value coercion out of expression processing engine#13809
clintropolis merged 11 commits intoapache:masterfrom
clintropolis:expr-null-handling

Conversation

@clintropolis
Copy link
Copy Markdown
Member

@clintropolis clintropolis commented Feb 15, 2023

Description

This PR removes numeric null value coercion from the expression processing engine in favor of doing it on the consumer side. This appears to have minimal impact, since isNumericNull will still return false in default value mode, and numeric operators check this and use the numeric primitive methods of ExprEval instead of value(). This walks back some of the changes in #13781. To help push this responsibility onto callers, a new method has been added ExprEval.valueOrDefault() as a convenience/alternative to value(), which will perform the coercion for callers if they wish, which many of them do.

Since vector expression processors are only used at query time, I've modified the 'null vectors' (boolean[]), I've modified them to consistently only set to true if NullHandling.sqlCompatible() is true. Alternatively, we could have also instead checked NullHandling.sqlCompatible() in the places we are using these null vectors, but doing it this way seemed more consistent with other vectorized selectors.

Backstory, it would be nice to always write null values into segments, shifting druid.generic.useDefaultValueForNull to a query time only setting on whether or not to use the null value bitmap index when scanning/aggregating columns. However, I have not made this change yet in this PR.

Unrelated, I have renamed Expr.buildVectorized to Expr.asVectorProcessor to be consistent with Function.asVectorProcessor and ApplyFunction.asVectorProcessor, and also added some javadocs to VectorProcessors.

Release note

Null values input to and created by the Druid native expression processing engine no longer coerce null to the type appropriate 'default' value if druid.generic.useDefaultValueForNull=true. This should not impact existing behavior since this has been shifted onto the consumer and internally operators will still use default values in this mode, but calling this change to attention in case any subtle behavior changes around the handling of null values. Direct callers can get default values by using the new valueOrDefault() method of ExprEval, instead of value().


This PR has:

  • been self-reviewed.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

@clintropolis clintropolis removed the WIP label Feb 15, 2023
Comment on lines +334 to +336
// null values can (but not always) appear as string typed
// so type isn't necessarily string unless value is non-null
if (NullHandling.sqlCompatible() || (Types.is(leftVal.type(), ExprType.STRING) && leftVal.value() != null)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like I'm not understanding this logic, but it seems like previously you were doing null checks if it was a String. Now, you will only do null checks if it is a String and the left-hand side is not null? This seems to be trying to coerce for boolean operations, if the left side is numerical but also not null, it seems like we should have semantics that are like if == 0 then false, else true. But, I think that this will just ignore it if it's not String?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially made this change before i added valueOrDefault, because in a situation where expressions are evaluated without the type information to tell us that a null is a long (such as ingestion time transforms), an expression like null && 1 with the left side is a null value it will be evaluated as a STRING typed expression, and follow the string/sql compatible rules, in default value mode meaning value() spits out null. if the expression was 1 && null however, it would be evaluated with long rules, and spit out 0 in default value mode.

None of this really matters now though, since callers should be using valueOrDefault() after the changes in this patch, instead of value(), which produces equivalent results, so I will switch it back and switch the tests around null constants to use that.

public final int asInt()
{
if (value == null) {
assert NullHandling.replaceWithDefault();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need the assertions?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no they can be probably removed, i added during testing and for consistency with some of the other ExprEval, but i think i will just remove them all instead.

}

@Override
public boolean isNumericNull()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is a comment on unchanged code, but wondering how you feel about renaming this to just isNull? I've never quite understand why the Numeric part was important.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

heh, so here it sort of is important for the way things work (right now) with non-vectorized expressions because the contract of the method is sort of like that of the ColumnValueSelector isNull method - that if it returns false then the asLong and such primitive returning methods will return a valid value.

StringExprEval for example implements these methods as best effort conversions, so StringExprEval("100") returns false for isNumericNull and asLong returns the value 100L in both default value and sql compat modes. StringExprEval("x") returns true for isNumericNull if in sql compatible mode, but the value() is not in fact null, so callers should know that the value of this is null in sql compatible mode instead of calling asLong to get 0. In default value mode, isNumericNull returns false and so no one cares if the string was a real number or not, if it wasn't it was zero.

Things that do not want to use these number methods can just check the result of value() directly.

At least for the way things currently work, this is actually probably a more accurate name than isNull, and arguably the ColumnValueSelector method should probably be renamed to match this one 😛 . The only things that check isNull on ColumnValueSelector are things that want to use the primitive numeric getters, object selectors often do not implement it since callers are currently expected to call getObject and check that it is null per the current javadoc contract in BaseNullableColumnValueSelector.

I would be in favor of changing this contract into making a correct implementation of isNull a requirement for all ColumnValueSelectors, but I totally don't want to do that in this PR, and it would require I think splitting up anything that does type coercion which might produce nulls into separate selectors (and expressions to more liberally explicitly cast values to different types instead of use these implicit conversions which some of them have)

private DoubleExprEval(@Nullable Number value)
{
super(value == null ? NullHandling.defaultDoubleValue() : (Double) value.doubleValue());
super(value == null ? null : value.doubleValue());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the .doubleValue() add anything here? It looks like it's storing a reference to a Number anyway? I guess it's a forced cast?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really, can be simplified

Copy link
Copy Markdown
Member Author

@clintropolis clintropolis Feb 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah i was wrong, at least a lot of tests were counting on forced casts, i see lots of

Expected :java.lang.Long<1>
Actual   :java.lang.Integer<1>

after removing it, will leave for now

switch (castTo.getElementType().getType()) {
case DOUBLE:
return ExprEval.ofDoubleArray(value == null ? null : new Object[] {computeDouble()});
return ExprEval.ofDoubleArray(value == null ? null : new Object[] {number != null ? number.doubleValue() : null});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: invert the ternary to be == null?

Comment on lines +274 to +275
* Creates an {@link ExprVectorProcessor} for the 'isnull' function, that produces a "boolean" (long) typed output
* vector with values set to 1 if the input value was null or 0 if it was not null.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

am I correct in understanding that this creates a long[]?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, will clarify javadocs

}

/**
* Creates an {@link ExprVectorProcessor} for the 'isnotnull' function, that produces a "boolean" (long) typed output
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, a long[]?

Comment on lines +849 to +850
* true/null, null/true, null/null -> null
* false/null, null/false -> false (0)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does null null out a true, but not null out a false!?!?!?!?!? Is that some SQL-ism to handling nulls?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, SQL compat stuff. null is basically a stand-in for "unknown" for logical operations,true && ??? could be true or false depending on the value of ??? so the result is ???. But false && ??? is definitely false no matter the value of ???.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh, I can see the logic there... it's providing the best answer based on what it can know from the boolean operator. It would be a lot easier to reason about if it consistently nulled things out, but if this is how SQL is expected to work, so be it.

@clintropolis clintropolis merged commit 6cf754b into apache:master Mar 1, 2023
@clintropolis clintropolis deleted the expr-null-handling branch March 1, 2023 02:10
@clintropolis clintropolis added this to the 26.0 milestone Apr 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants