fix json_value sql planning with decimal type, fix vectorized expression math null value handling in default mode #13214
Conversation
…ion math null value handling in default mode

changes:
* json_value 'returning' decimal will now plan to a native double-typed query instead of ending up with the default string typing, allowing decimal vector math expressions to work with this type
* vector math expressions now zero out 'null' values even in 'default' mode (druid.generic.useDefaultValueForNull=true) to prevent downstream consumers that do not check the null vector from producing incorrect results
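To illustrate the second change, here is a minimal sketch (class and method names are hypothetical, not Druid's actual vector processor API) of why the value vector must be zeroed at null positions: a downstream consumer running in default-value mode may ignore the null vector entirely and read every slot of the value vector, so stale data left in "null" slots would corrupt its result.

```java
public class NullVectorZeroing
{
  // Fill 'values' from 'inputs', writing 0.0 wherever the input is null
  // instead of leaving stale data from a previous batch in those slots.
  static double[] evalWithZeroing(double[] values, Double[] inputs, boolean[] nulls)
  {
    for (int i = 0; i < inputs.length; i++) {
      if (inputs[i] == null) {
        nulls[i] = true;
        values[i] = 0.0; // the fix: zero out instead of skipping the slot
      } else {
        nulls[i] = false;
        values[i] = inputs[i] + 100.0;
      }
    }
    return values;
  }

  // A consumer that never consults the null vector, as happens in default-value mode.
  static double sumIgnoringNulls(double[] values)
  {
    double sum = 0;
    for (double v : values) {
      sum += v;
    }
    return sum;
  }
}
```

Without the zeroing, the sum would pick up whatever value happened to be left in the reused vector at the null position.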
```java
@Test
public void testReturningAndSumPathDecimalWithMaths()
{
  testQuery(
```
How did these tests fail before this change? The description makes it sound like the failure is actually a failure to pick the vectorized version rather than a failure in actual results, but I don't see this test ensuring that the vectorized version is used?
The long test doesn't fail; only the decimal one does, because the SQL tests run in both modes. But I looked closer to better answer your question, which caused me to discover another bug: the column inspector that expressions use to determine things like whether they can vectorize, and their output type, does not actually have access to other virtual columns, only physical segment columns.

This causes the 'can vectorize' check to compute the output type of the nested field virtual column as 'null' instead of 'STRING', which is what it would have seen if the virtual column were available, causing it to fail to vectorize (which is not awesome, but at least doesn't result in an exception). Since the type is inferred as 'null', the division expression thinks its inputs are null and double, which is vectorizable as basically all nulls (or zeros in default-value mode). However, when the query actually runs, a sanity check that the query is vectorizable — using an inspector which does have access to the virtual columns — sees the input types as STRING and DOUBLE, which is not vectorizable, causing the exception.
I'll work on fixing this, thanks for asking 👍
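The mismatch described above can be sketched as a layered type lookup (names here are illustrative, not Druid's actual `ColumnInspector` API): both the "can vectorize" check and later expression planning must consult the virtual column layer before falling back to the physical segment, or they will disagree about a column's type.

```java
import java.util.Map;

public class LayeredInspector
{
  // Resolve a column's type by checking virtual columns first, then the
  // physical segment. The bug was an inspector missing the first layer,
  // which made virtual columns look like missing (null-typed) columns.
  static String resolveType(
      String column,
      Map<String, String> virtualColumnTypes,
      Map<String, String> physicalColumnTypes
  )
  {
    String virtual = virtualColumnTypes.get(column);
    if (virtual != null) {
      return virtual;
    }
    return physicalColumnTypes.get(column);
  }
}
```

With both checks using the same layered resolution, the planner and the runtime sanity check see identical input types.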
```diff
  );
- } else if (SqlTypeName.APPROX_TYPES.contains(sqlType.getSqlTypeName())) {
+ } else if (SqlTypeName.APPROX_TYPES.contains(sqlType.getSqlTypeName()) ||
+            SqlTypeName.DECIMAL.equals(sqlType.getSqlTypeName())) {
```
perhaps invert these? Equality is faster than .contains(), so it will be better to check the equality first and let it short-circuit, instead of forcing the .contains() to fail before the equality check runs.
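The suggested reordering can be sketched as follows (the enum and set here are stand-ins for Calcite's `SqlTypeName` and `APPROX_TYPES`, not the real classes): with the equality test first, short-circuit evaluation skips the set scan entirely for the common `DECIMAL` case.

```java
import java.util.Set;

public class ShortCircuitOrder
{
  // Stand-in for Calcite's SqlTypeName; only a few members for illustration.
  enum SqlTypeName { DECIMAL, FLOAT, REAL, DOUBLE, VARCHAR }

  static final Set<SqlTypeName> APPROX_TYPES =
      Set.of(SqlTypeName.FLOAT, SqlTypeName.REAL, SqlTypeName.DOUBLE);

  // Equality first: a single comparison, and for DECIMAL the
  // contains() scan is never evaluated.
  static boolean isApproxOrDecimal(SqlTypeName t)
  {
    return SqlTypeName.DECIMAL.equals(t) || APPROX_TYPES.contains(t);
  }
}
```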
```java
abstract void processIndex(TInput input, int i);

abstract void processNull(int i);
```
Instead of another abstract method call, I wonder how far we could get with a .getValueForNull() method that returns a value we can set? With that setup, we would be able to call it once and then reuse the value, allowing for inlining and such. It should work well for the scope of changes in this PR...
It would have to return an Object, since we don't know whether the value is a double or a long. Maybe the better approach is to split out long and double versions of the value processors and share mostly the same code between them, though that's a more invasive change, so I'm hesitant to make it right now. Would you be OK if I look into this as a follow-up PR?
Actually it's not that bad; I'll go ahead and split them up now. It makes some other things nicer.
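The split discussed above could look roughly like this (a hypothetical sketch; these are not Druid's actual processor classes): one abstract processor per primitive type, so each writes its own typed default for null rows inline, with no Object boxing and no extra per-row abstract call.

```java
public class SplitProcessors
{
  // Double-typed processor: null rows get a typed 0.0 default inline.
  abstract static class DoubleProcessor
  {
    final double[] out;
    DoubleProcessor(int size) { out = new double[size]; }
    abstract double map(double input);

    void process(Double[] inputs)
    {
      for (int i = 0; i < inputs.length; i++) {
        out[i] = (inputs[i] == null) ? 0.0 : map(inputs[i]);
      }
    }
  }

  // Long-typed twin: same shape, typed 0L default, no boxing.
  abstract static class LongProcessor
  {
    final long[] out;
    LongProcessor(int size) { out = new long[size]; }
    abstract long map(long input);

    void process(Long[] inputs)
    {
      for (int i = 0; i < inputs.length; i++) {
        out[i] = (inputs[i] == null) ? 0L : map(inputs[i]);
      }
    }
  }
}
```

The duplication between the two classes is the cost of keeping everything in primitives; a shared `getValueForNull()` returning `Object` would avoid it but box on every null row.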
…ion math null value handling in default mode (apache#13214)

* fix json_value sql planning with decimal type, fix vectorized expression math null value handling in default mode

  changes:
  * json_value 'returning' decimal will now plan to a native double-typed query instead of ending up with the default string typing, allowing decimal vector math expressions to work with this type
  * vector math expressions now zero out 'null' values even in 'default' mode (druid.generic.useDefaultValueForNull=true) to prevent downstream things that do not check the null vector from producing incorrect results
* more better
* test and why not vectorize
* more test, more fix

Related commits:
* use object[] instead of string[] for vector expressions to be consistent with vector object selectors (#13209)
* fix issue with nested column null value index incorrectly matching non-null values (#13211)
* fix json_value sql planning with decimal type, fix vectorized expression math null value handling in default mode (#13214)
Description
This PR fixes two bugs. The first is in the SQL planner for `JSON_VALUE` queries using the "returning" syntax: SQL `DECIMAL` types were incorrectly inferred to have a `STRING` output, which prevented them from working correctly with vectorized math expressions, such as `JSON_VALUE(x, '$.y' RETURNING DECIMAL) + 100`. They are now correctly determined to be the Druid native `DOUBLE` type, which allows vectorized math expressions (which require numeric inputs) to pass their input checks.

This was actually triggering another bug, where the `VirtualColumns.canVectorize` check was using a `ColumnInspector` which did not have access to the column capabilities of the other virtual columns, while `getColumnCapabilities` did, which would sometimes lead to them not agreeing. The inspector missing the virtual columns would report `null` capabilities, which is treated as a missing column for the purposes of the vectorization check, while `getColumnCapabilities`, used later during expression planning, would see the correct virtual column types.

While writing the test for this issue, I stumbled into yet another bug, this one around the fact that vector math processors were not zeroing out 'null' values in the value vector, which would cause anything downstream that was not checking the null vector to produce incorrect results from stale values in the 'null' positions. Most typically, this happens when `druid.generic.useDefaultValueForNull=true`, which is also the default mode, but it is fairly unlikely to occur unless expressions were explicitly producing null values before being used as part of a sum or some other operation that does not check the null vector in default-value mode. Numeric nested literal columns are also affected by this issue, since they always write out a null value bitmap even in "default" value mode (they write out a bitmap for all values and rely on downstream selectors to simply ignore the null vector in default mode).

Finally, I questioned why the numeric math expressions required all inputs to be numeric, rather than scalar, in order to vectorize. Since the vector processors that drive them have implicit casts to ensure the correct type is used, we can allow cases where a string is one of the arguments and provide vectorized versions of the non-vectorized behavior, where if one of the arguments is a `STRING`, the output type is `DOUBLE`. There was also a flaw in `VectorMathProcessors.makeMathProcessor` which didn't match the non-vectorized behavior; that has been fixed as well (this code path could not previously be hit, because the earlier checks required all-numeric arguments, so no behavior has changed).

This PR has:
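The string-argument behavior described in the last paragraph can be sketched as follows (a hypothetical illustration of the implicit cast, not Druid's actual expression code or exact null-handling semantics): a `STRING` input is parsed to `DOUBLE` before the math is applied, with unparseable or missing strings treated as null.

```java
public class StringToDoubleMath
{
  // Add a string argument to a double by implicitly casting the string
  // to DOUBLE first; unparseable or null strings yield a null result
  // in this sketch.
  static Double addStringAndDouble(String s, double d)
  {
    Double parsed;
    try {
      parsed = (s == null) ? null : Double.parseDouble(s);
    } catch (NumberFormatException e) {
      parsed = null;
    }
    return (parsed == null) ? null : parsed + d;
  }
}
```

This is why allowing a string argument does not break the numeric vector processors: after the cast, every lane is a plain `double` (or null), and the expression's output type is `DOUBLE`.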