Skip to content

fix issues with filtering nulls on values coerced to numeric types#14139

Merged
clintropolis merged 46 commits intoapache:masterfrom
clintropolis:i-got-99-problems-and-null-handling-bugs-are-idk-like-at-least-half-of-them-maybe
May 8, 2023
Merged

fix issues with filtering nulls on values coerced to numeric types#14139
clintropolis merged 46 commits intoapache:masterfrom
clintropolis:i-got-99-problems-and-null-handling-bugs-are-idk-like-at-least-half-of-them-maybe

Conversation

@clintropolis
Copy link
Copy Markdown
Member

@clintropolis clintropolis commented Apr 21, 2023

Description

This PR fixes some bugs in both SQL compatible and 'default' value modes when filtering IS NULL/IS NOT NULL on values which are coerced to a numeric type, such as CAST(x AS BIGINT) or JSON_VALUE(nested, '$.x' RETURNING BIGINT). While these expressions had different causes, the underlying effect was more or less the same and involved the query engine using the null value index for the underlying column to build the cursor, which only counts the null string values, and misses out on values which might become null after coercion.

For example given the following source data:
Screenshot 2023-04-18 at 4 43 30 PM

trying to filter out nulls in default value mode would produce the following results:

Screenshot 2023-04-18 at 4 43 18 PM

and the same data ingested in SQL compatible mode also incorrect:

Screenshot 2023-04-18 at 4 48 30 PM

However the "correct" results in "default" value mode should match all of the rows because there are no null numbers, and in SQL compatible mode, should only match a single row, since that is the only truly non-null number.

JSON_VALUE suffered from the same problem when using the RETURNING syntax to cast to a numeric type for string or mixed type fields.

To fix this, we avoid using the indexes when this is the case. It would be possible to partially use these indexes and a value matcher for the 'is not null' case, but currently not null is handled as a single index, rather a not filter on a selector filter, so its a little bit tricky to implement right now so I'll save this for a future change.

This PR also fixes issues with 'auto' typed numeric columns to ensure they behave consistently with classic numeric columns in 'default' value mode, which means picking up the 'null values dont exist' behavior that much of the query engine has. In default value mode, the numeric index suppliers will combine the null and 0 indexes when searching for 0 to ensure that indexes behave consistently with the value matchers. I have added the 'auto' typed columns to BaseFilterTest to better cover the 'auto' indexers, only skipping tests related to implicit multi-value handling.

Release note

Fixed issues involving filtering NULL values when using CAST expressions and JSON_VALUE expressions using the RETURNING syntax to coerce non-numeric types into numeric types, such as CAST(x AS BIGINT) IS NOT NULL or JSON_VALUE(nested, '$.x' RETURNING BIGINT) IS NULL. These expressions were incorrectly using the null value index of the underlying column, which produced results matching the values which were NULL string values, but not values which could not be converted into a numeric value and so were effectively NULL.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

…-and-null-handling-bugs-are-idk-like-at-least-half-of-them-maybe
)
)
);
bestEffortBindings = InputBindings.forMap(builder.build());

Check notice

Code scanning / CodeQL

Deprecated method or constructor invocation

Invoking [InputBindings.forMap](1) should be avoided because it has been deprecated.
Copy link
Copy Markdown
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stuff outside of the nested package looks good to me, other than the BoundFilter piece that doesn't sit right.

Within the nested package I don't really understand the changes. I think I mostly don't understand how the variant column is different from a COMPLEX<json> nested column. How is it meant to work now?

if (columnType != null) {
if (ColumnType.LONG.equals(columnType)) {
return NullHandling.defaultLongValue();
} else if (ColumnType.DOUBLE.equals(columnType)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no FLOAT? If there's a reason it's not needed here, a comment would be helpful.

Copy link
Copy Markdown
Member Author

@clintropolis clintropolis May 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expressions cannot be FLOAT so i pretend it doesn't exist, will update javadocs

return ColumnTypeFactory.getInstance().ofComplex(complexTypeName);
}

public static ColumnType leastRestrictiveType(@Nullable ColumnType type, @Nullable ColumnType other)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments on the method declaration:

  • Method itself should be @Nullable? Since both inputs could be null, and then this'll return null.
  • Please include javadoc about what happens if a least restrictive type can't be found. Do you get null? An exception? (From the impl, looks like you get IllegalArgumentException.)

return type;
}
if (!Objects.equals(type, other)) {
throw new IAE("Cannot implicitly cast %s to %s", type, other);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Brackets for interpolation?

}
}
catch (NumberFormatException ignored) {
// bounds are not numeric?
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The catch here doesn't really sit right. Where does the NumberFormatException get thrown? Is it possible to deal with that case some other way?

My concern here is that there's a lot of code and method calls wrapped into that try block, and we have no way of telling where the NumberFormatException came from. Maybe scoping the try down to a smaller bit of code would help with that.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They come from the Double.parseDouble in

final Number lower = boundDimFilter.hasLowerBound() ? Double.parseDouble(boundDimFilter.getLower()) : null;
final Number upper = boundDimFilter.hasUpperBound() ? Double.parseDouble(boundDimFilter.getUpper()) : null;

will see if i can shuffle some stuff around

);
catch (ISE ise) {
// ignore failures resulting from 'auto'
if (!(testName.contains("AutoTypes") && "Unsupported type[ARRAY<STRING>]".equals(ise.getMessage()))) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this testName.contains("AutoTypes") is used in a bunch of places; would be nice to create a field or method for this.

import java.util.TreeMap;

public class VariantArrayColumn<TStringDictionary extends Indexed<ByteBuffer>> implements NestedCommonFormatColumn
public class VariantArrayColumn<TStringDictionary extends Indexed<ByteBuffer>>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VariantArray seems like an inaccurate name now that this can be any kind of variant stuff. Shall we rename it to VariantColumn?

Btw, can a variant column store an object? How's it different from a NestedDataColumn?

I guess I don't really understand variant columns.

@clintropolis
Copy link
Copy Markdown
Member Author

Within the nested package I don't really understand the changes. I think I mostly don't understand how the variant column is different from a COMPLEX nested column. How is it meant to work now?

Added a bunch of javadocs to hopefully clear up how everything relates to each other, still haven't got everything quite yet but will keep chipping away at it.

private final IndexSpec indexSpec;

public VariantColumnSupplierTest(
@SuppressWarnings("unused") String name,

Check notice

Code scanning / CodeQL

Useless parameter

The parameter 'name' is never used.
Copy link
Copy Markdown
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after the latest changes.

@clintropolis clintropolis merged commit 8805d8d into apache:master May 8, 2023
@clintropolis clintropolis deleted the i-got-99-problems-and-null-handling-bugs-are-idk-like-at-least-half-of-them-maybe branch May 8, 2023 20:19
clintropolis added a commit to clintropolis/druid that referenced this pull request May 8, 2023
…pache#14139)

* fix issues with filtering nulls on values coerced to numeric types
* fix issues with 'auto' type numeric columns in default value mode
* optimize variant typed columns without nested data
* more tests for 'auto' type column ingestion
clintropolis added a commit that referenced this pull request May 8, 2023
…14139) (#14226)

* fix issues with filtering nulls on values coerced to numeric types
* fix issues with 'auto' type numeric columns in default value mode
* optimize variant typed columns without nested data
* more tests for 'auto' type column ingestion
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants