fix issues with filtering nulls on values coerced to numeric types#14139
Conversation
…-and-null-handling-bugs-are-idk-like-at-least-half-of-them-maybe
…-and-null-handling-bugs-are-idk-like-at-least-half-of-them-maybe
…-and-null-handling-bugs-are-idk-like-at-least-half-of-them-maybe
…-and-null-handling-bugs-are-idk-like-at-least-half-of-them-maybe
There was a problem hiding this comment.
The stuff outside of the nested package looks good to me, other than the BoundFilter piece that doesn't sit right.
Within the nested package I don't really understand the changes. I think I mostly don't understand how the variant column is different from a COMPLEX<json> nested column. How is it meant to work now?
| if (columnType != null) { | ||
| if (ColumnType.LONG.equals(columnType)) { | ||
| return NullHandling.defaultLongValue(); | ||
| } else if (ColumnType.DOUBLE.equals(columnType)) { |
There was a problem hiding this comment.
no FLOAT? If there's a reason it's not needed here, a comment would be helpful.
There was a problem hiding this comment.
expressions cannot be FLOAT so i pretend it doesn't exist, will update javadocs
| return ColumnTypeFactory.getInstance().ofComplex(complexTypeName); | ||
| } | ||
|
|
||
| public static ColumnType leastRestrictiveType(@Nullable ColumnType type, @Nullable ColumnType other) |
There was a problem hiding this comment.
Comments on the method declaration:
- Method itself should be
@Nullable? Since both inputs could be null, and then this'll return null. - Please include javadoc about what happens if a least restrictive type can't be found. Do you get null? An exception? (From the impl, looks like you get IllegalArgumentException.)
| return type; | ||
| } | ||
| if (!Objects.equals(type, other)) { | ||
| throw new IAE("Cannot implicitly cast %s to %s", type, other); |
| } | ||
| } | ||
| catch (NumberFormatException ignored) { | ||
| // bounds are not numeric? |
There was a problem hiding this comment.
The catch here doesn't really sit right. Where does the NumberFormatException get thrown? Is it possible to deal with that case some other way?
My concern here is that there's a lot of code and method calls wrapped into that try block, and we have no way of telling where the NumberFormatException came from. Maybe scoping the try down to a smaller bit of code would help with that.
There was a problem hiding this comment.
They come from the Double.parseDouble in
final Number lower = boundDimFilter.hasLowerBound() ? Double.parseDouble(boundDimFilter.getLower()) : null;
final Number upper = boundDimFilter.hasUpperBound() ? Double.parseDouble(boundDimFilter.getUpper()) : null;
will see if i can shuffle some stuff around
| ); | ||
| catch (ISE ise) { | ||
| // ignore failures resulting from 'auto' | ||
| if (!(testName.contains("AutoTypes") && "Unsupported type[ARRAY<STRING>]".equals(ise.getMessage()))) { |
There was a problem hiding this comment.
nit: this testName.contains("AutoTypes") is used in a bunch of places; would be nice to create a field or method for this.
| import java.util.TreeMap; | ||
|
|
||
| public class VariantArrayColumn<TStringDictionary extends Indexed<ByteBuffer>> implements NestedCommonFormatColumn | ||
| public class VariantArrayColumn<TStringDictionary extends Indexed<ByteBuffer>> |
There was a problem hiding this comment.
VariantArray seems like an inaccurate name now that this can be any kind of variant stuff. Shall we rename it to VariantColumn?
Btw, can a variant column store an object? How's it different from a NestedDataColumn?
I guess I don't really understand variant columns.
Added a bunch of javadocs to hopefully clear up how everything relates to each other, still haven't got everything quite yet but will keep chipping away at it. |
| private final IndexSpec indexSpec; | ||
|
|
||
| public VariantColumnSupplierTest( | ||
| @SuppressWarnings("unused") String name, |
Check notice
Code scanning / CodeQL
Useless parameter
gianm
left a comment
There was a problem hiding this comment.
LGTM after the latest changes.
…pache#14139) * fix issues with filtering nulls on values coerced to numeric types * fix issues with 'auto' type numeric columns in default value mode * optimize variant typed columns without nested data * more tests for 'auto' type column ingestion
Description
This PR fixes some bugs in both SQL compatible and 'default' value modes when filtering
IS NULL/IS NOT NULLon values which are coerced to a numeric type, such asCAST(x AS BIGINT)orJSON_VALUE(nested, '$.x' RETURNING BIGINT). While these expressions had different causes, the underlying effect was more or less the same and involved the query engine using the null value index for the underlying column to build the cursor, which only counts the null string values, and misses out on values which might become null after coercion.For example given the following source data:

trying to filter out nulls in default value mode would produce the following results:
and the same data ingested in SQL compatible mode also incorrect:
However the "correct" results in "default" value mode should match all of the rows because there are no null numbers, and in SQL compatible mode, should only match a single row, since that is the only truly non-null number.
JSON_VALUEsuffered from the same problem when using theRETURNINGsyntax to cast to a numeric type for string or mixed type fields.To fix this, we avoid using the indexes when this is the case. It would be possible to partially use these indexes and a value matcher for the 'is not null' case, but currently not null is handled as a single index, rather a not filter on a selector filter, so its a little bit tricky to implement right now so I'll save this for a future change.
This PR also fixes issues with 'auto' typed numeric columns to ensure they behave consistently with classic numeric columns in 'default' value mode, which means picking up the 'null values dont exist' behavior that much of the query engine has. In default value mode, the numeric index suppliers will combine the null and 0 indexes when searching for 0 to ensure that indexes behave consistently with the value matchers. I have added the 'auto' typed columns to
BaseFilterTestto better cover the 'auto' indexers, only skipping tests related to implicit multi-value handling.Release note
Fixed issues involving filtering
NULLvalues when usingCASTexpressions andJSON_VALUEexpressions using theRETURNINGsyntax to coerce non-numeric types into numeric types, such asCAST(x AS BIGINT) IS NOT NULLorJSON_VALUE(nested, '$.x' RETURNING BIGINT) IS NULL. These expressions were incorrectly using the null value index of the underlying column, which produced results matching the values which wereNULLstring values, but not values which could not be converted into a numeric value and so were effectivelyNULL.This PR has: