fix issues with filtering nulls on values coerced to numeric types by clintropolis · Pull Request #14139 · apache/druid

clintropolis · 2023-04-21T09:41:45Z

Description

This PR fixes some bugs in both SQL compatible and 'default' value modes when filtering IS NULL/IS NOT NULL on values which are coerced to a numeric type, such as CAST(x AS BIGINT) or JSON_VALUE(nested, '$.x' RETURNING BIGINT). While these expressions had different causes, the underlying effect was more or less the same and involved the query engine using the null value index for the underlying column to build the cursor, which only counts the null string values, and misses out on values which might become null after coercion.

For example given the following source data:

trying to filter out nulls in default value mode would produce the following results:

and the same data ingested in SQL compatible mode is also incorrect:

However the "correct" results in "default" value mode should match all of the rows because there are no null numbers, and in SQL compatible mode, should only match a single row, since that is the only truly non-null number.
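The mismatch described above can be sketched in plain Java. This is only an illustration and not Druid code: the three sample rows, the class name, and the helper methods are all hypothetical, and `castToLong` is a simplified stand-in for `CAST(x AS BIGINT)`. It compares an IS NOT NULL check that looks only at the nulls of the underlying string column against one that checks the coerced value.

```java
import java.util.ArrayList;
import java.util.List;

class CoercedNullFilterSketch {
  // Hypothetical stand-in for CAST(x AS BIGINT): strings that are null or not
  // parseable as a long become null in SQL compatible mode and 0 in 'default' value mode.
  static Long castToLong(String value, boolean sqlCompatible) {
    Long parsed = null;
    if (value != null) {
      try {
        parsed = Long.parseLong(value);
      } catch (NumberFormatException e) {
        // not a number: leave parsed as null
      }
    }
    if (parsed == null && !sqlCompatible) {
      return 0L;
    }
    return parsed;
  }

  // Mimics the buggy behavior: only the null strings of the underlying column count as null.
  static List<Integer> notNullUsingStringNulls(List<String> rows) {
    List<Integer> matched = new ArrayList<>();
    for (int i = 0; i < rows.size(); i++) {
      if (rows.get(i) != null) {
        matched.add(i);
      }
    }
    return matched;
  }

  // Mimics the intended behavior: the null check is applied to the coerced value.
  static List<Integer> notNullUsingCoercedValue(List<String> rows, boolean sqlCompatible) {
    List<Integer> matched = new ArrayList<>();
    for (int i = 0; i < rows.size(); i++) {
      if (castToLong(rows.get(i), sqlCompatible) != null) {
        matched.add(i);
      }
    }
    return matched;
  }
}
```

With the hypothetical rows `"abc"`, `null`, `"100"`, the string-null check matches rows 0 and 2 in either mode, while the coerced check matches only row 2 in SQL compatible mode and all three rows in 'default' value mode.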

JSON_VALUE suffered from the same problem when using the RETURNING syntax to cast to a numeric type for string or mixed type fields.

To fix this, we avoid using the indexes when this is the case. It would be possible to partially use these indexes together with a value matcher for the 'is not null' case, but currently 'not null' is handled as a single index rather than as a not filter on a selector filter, so it's a little tricky to implement right now and I'll save it for a future change.

This PR also fixes issues with 'auto' typed numeric columns to ensure they behave consistently with classic numeric columns in 'default' value mode, which means picking up the 'null values don't exist' behavior that much of the query engine has. In default value mode, the numeric index suppliers will combine the null and 0 indexes when searching for 0 to ensure that indexes behave consistently with the value matchers. I have added the 'auto' typed columns to BaseFilterTest to better cover the 'auto' indexers, only skipping tests related to implicit multi-value handling.
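The idea of combining the null and 0 indexes can be sketched with `java.util.BitSet` standing in for a row bitmap. This is a minimal illustration under that assumption, not the actual index supplier code; the class and method names are hypothetical.

```java
import java.util.BitSet;

class DefaultModeZeroIndexSketch {
  // Returns the rows that should match a search for the value 0.
  // In 'default' value mode null numbers read as 0, so the null rows are
  // unioned in; in SQL compatible mode only the true zero rows match.
  static BitSet rowsMatchingZero(BitSet nullRows, BitSet zeroRows, boolean defaultValueMode) {
    BitSet result = (BitSet) zeroRows.clone(); // copy so the inputs are not mutated
    if (defaultValueMode) {
      result.or(nullRows);
    }
    return result;
  }
}
```

The union keeps the index result in line with what a value matcher would see in default value mode, where a null number is read back as 0.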

Release note

Fixed issues involving filtering NULL values when using CAST expressions and JSON_VALUE expressions using the RETURNING syntax to coerce non-numeric types into numeric types, such as CAST(x AS BIGINT) IS NOT NULL or JSON_VALUE(nested, '$.x' RETURNING BIGINT) IS NULL. These expressions were incorrectly using the null value index of the underlying column, which produced results matching the values which were NULL string values, but not values which could not be converted into a numeric value and so were effectively NULL.


+                   )
+               )
+           );
+    bestEffortBindings = InputBindings.forMap(builder.build());



gianm

The stuff outside of the nested package looks good to me, other than the BoundFilter piece that doesn't sit right.

Within the nested package I don't really understand the changes. I think I mostly don't understand how the variant column is different from a COMPLEX<json> nested column. How is it meant to work now?

gianm · 2023-05-06T01:35:27Z

+      if (columnType != null) {
+        if (ColumnType.LONG.equals(columnType)) {
+          return NullHandling.defaultLongValue();
+        } else if (ColumnType.DOUBLE.equals(columnType)) {


no FLOAT? If there's a reason it's not needed here, a comment would be helpful.

expressions cannot be FLOAT so i pretend it doesn't exist, will update javadocs

gianm · 2023-05-06T02:15:37Z

    return ColumnTypeFactory.getInstance().ofComplex(complexTypeName);
  }
+
+  public static ColumnType leastRestrictiveType(@Nullable ColumnType type, @Nullable ColumnType other)


Comments on the method declaration:

Method itself should be @Nullable? Since both inputs could be null, and then this'll return null.

Please include javadoc about what happens if a least restrictive type can't be found. Do you get null? An exception? (From the impl, looks like you get IllegalArgumentException.)
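A rough sketch of the contract being asked for here, with the null and failure behavior spelled out in the javadoc. Everything in it is hypothetical: the `SketchType` enum, the widening rule, and the class name are illustrative stand-ins, not Druid's `ColumnType` or its actual type ordering. The `@Nullable` annotation is described in the javadoc rather than imported, to keep the example self-contained.

```java
class LeastRestrictiveTypeSketch {
  // Hypothetical, heavily simplified type universe for illustration only.
  enum SketchType { LONG, DOUBLE, COMPLEX_A, COMPLEX_B }

  /**
   * Finds a type that both inputs can be implicitly cast to.
   *
   * @return null if both inputs are null (the method would carry @Nullable),
   *         the non-null input if only one is null, otherwise the common type
   * @throws IllegalArgumentException if no common type exists
   */
  static SketchType leastRestrictiveType(SketchType type, SketchType other) {
    if (type == null) {
      return other; // also covers the case where both are null
    }
    if (other == null || type == other) {
      return type;
    }
    boolean bothNumeric = isNumeric(type) && isNumeric(other);
    if (bothNumeric) {
      return SketchType.DOUBLE; // assumed widening rule: LONG widens to DOUBLE
    }
    throw new IllegalArgumentException(
        String.format("Cannot implicitly cast [%s] to [%s]", type, other)
    );
  }

  private static boolean isNumeric(SketchType t) {
    return t == SketchType.LONG || t == SketchType.DOUBLE;
  }
}
```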

gianm · 2023-05-06T02:17:02Z

+        return type;
+      }
+      if (!Objects.equals(type, other)) {
+        throw new IAE("Cannot implicitly cast %s to %s", type, other);


Brackets for interpolation?

gianm · 2023-05-06T02:21:17Z

          }
        }
+        catch (NumberFormatException ignored) {
+          // bounds are not numeric?


The catch here doesn't really sit right. Where does the NumberFormatException get thrown? Is it possible to deal with that case some other way?

My concern here is that there's a lot of code and method calls wrapped into that try block, and we have no way of telling where the NumberFormatException came from. Maybe scoping the try down to a smaller bit of code would help with that.

They come from the Double.parseDouble in

final Number lower = boundDimFilter.hasLowerBound() ? Double.parseDouble(boundDimFilter.getLower()) : null;
final Number upper = boundDimFilter.hasUpperBound() ? Double.parseDouble(boundDimFilter.getUpper()) : null;

will see if i can shuffle some stuff around
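One way to scope the try down, sketched under the assumption that the bounds arrive as nullable strings: confine the `NumberFormatException` to a small parsing helper so the rest of the logic runs outside any try block. The class and method names are hypothetical and this is not the code that ended up in BoundFilter.

```java
class BoundParsingSketch {
  // Parses a non-null bound, returning null when it is not numeric.
  // The try block wraps only the parse call, so the origin of the
  // NumberFormatException is unambiguous.
  static Double tryParseDouble(String value) {
    try {
      return Double.parseDouble(value);
    } catch (NumberFormatException e) {
      return null;
    }
  }

  // True when every bound that is present parses as a number.
  // A null argument means that bound is absent.
  static boolean boundsAreNumeric(String lower, String upper) {
    boolean lowerOk = lower == null || tryParseDouble(lower) != null;
    boolean upperOk = upper == null || tryParseDouble(upper) != null;
    return lowerOk && upperOk;
  }
}
```

A caller can then branch on `boundsAreNumeric` and fall back to non-numeric handling without catching anything itself.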

gianm · 2023-05-06T02:54:44Z

-    );
+    catch (ISE ise) {
+      // ignore failures resulting from 'auto'
+      if (!(testName.contains("AutoTypes") && "Unsupported type[ARRAY<STRING>]".equals(ise.getMessage()))) {


nit: this testName.contains("AutoTypes") is used in a bunch of places; would be nice to create a field or method for this.
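A minimal sketch of what such a helper might look like. The class name, method names, and the use of `IllegalStateException` in place of Druid's `ISE` are assumptions made to keep the example self-contained; only the `"AutoTypes"` substring and the exception message come from the diff above.

```java
class AutoTypesTestNameSketch {
  private final String testName;

  AutoTypesTestNameSketch(String testName) {
    this.testName = testName;
  }

  // Single place that decides whether the current parameterization uses 'auto' typed columns.
  boolean isAutoTypes() {
    return testName.contains("AutoTypes");
  }

  // True when a failure is the known unsupported-array case for 'auto' typed columns.
  boolean isExpectedAutoTypeFailure(IllegalStateException e) {
    return isAutoTypes() && "Unsupported type[ARRAY<STRING>]".equals(e.getMessage());
  }
}
```

A catch block could then read `if (!isExpectedAutoTypeFailure(ise)) { throw ise; }` instead of repeating the string checks.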

gianm · 2023-05-06T02:57:58Z

+import java.util.TreeMap;

-public class VariantArrayColumn<TStringDictionary extends Indexed<ByteBuffer>> implements NestedCommonFormatColumn
+public class VariantArrayColumn<TStringDictionary extends Indexed<ByteBuffer>>


VariantArray seems like an inaccurate name now that this can be any kind of variant stuff. Shall we rename it to VariantColumn?

Btw, can a variant column store an object? How's it different from a NestedDataColumn?

I guess I don't really understand variant columns.

clintropolis · 2023-05-08T00:05:51Z

Within the nested package I don't really understand the changes. I think I mostly don't understand how the variant column is different from a COMPLEX<json> nested column. How is it meant to work now?

Added a bunch of javadocs to hopefully clear up how everything relates to each other, still haven't got everything quite yet but will keep chipping away at it.

+  private final IndexSpec indexSpec;
+
+  public VariantColumnSupplierTest(
+      @SuppressWarnings("unused") String name,


gianm

LGTM after the latest changes.

…pache#14139)
* fix issues with filtering nulls on values coerced to numeric types
* fix issues with 'auto' type numeric columns in default value mode
* optimize variant typed columns without nested data
* more tests for 'auto' type column ingestion

…14139) (#14226)
* fix issues with filtering nulls on values coerced to numeric types
* fix issues with 'auto' type numeric columns in default value mode
* optimize variant typed columns without nested data
* more tests for 'auto' type column ingestion

fix issues with filtering nulls on values coerced to numeric types

522ea8c

clintropolis added Bug Area - Querying Area - Null Handling labels Apr 21, 2023

clintropolis added this to the 26.0 milestone Apr 21, 2023

oops remove unused file

0e69986

clintropolis added the WIP label Apr 21, 2023

clintropolis added 3 commits April 24, 2023 18:32

how deep does this rabbit hole go?

896fc72

Merge remote-tracking branch 'upstream/master' into i-got-99-problems…

b541d6e

…-and-null-handling-bugs-are-idk-like-at-least-half-of-them-maybe

style

9399b22

github-advanced-security AI found potential problems Apr 25, 2023


Comment thread processing/src/main/java/org/apache/druid/segment/nested/CompressedNestedDataComplexColumn.java Fixed

fix results that were sad to be still sad but at least consistently sad

0f5a70c

abhishekagarwal87 removed the WIP label Apr 25, 2023

clintropolis added 7 commits April 24, 2023 21:51

fix default mode tests, allow array functions to work with json inputs

ffdb883

json still has nulls

42a85c6

whack-a-mole

991cdb1

unused

b5bb2c1

oops

25937e9

more test

08e30ff

more

d7f12e8

github-advanced-security AI found potential problems Apr 26, 2023


Comment thread processing/src/main/java/org/apache/druid/segment/virtual/NestedFieldVirtualColumn.java Fixed

clintropolis added 3 commits April 25, 2023 19:57

unused

f391028

more test

c869da9

more test, more better

4d54fcc

github-advanced-security AI found potential problems Apr 26, 2023


Comment thread processing/src/main/java/org/apache/druid/segment/nested/CompressedNestedDataComplexColumn.java Fixed

clintropolis added 3 commits April 26, 2023 06:05

oops

637c667

heh oops

5ac07ab

missed a spot

a36b406

clintropolis mentioned this pull request Apr 26, 2023

Update TimeBoundaryQueryRunnerTest.java #14173

Closed

Merge remote-tracking branch 'upstream/master' into i-got-99-problems…

13336c9

…-and-null-handling-bugs-are-idk-like-at-least-half-of-them-maybe

clintropolis added 2 commits May 4, 2023 06:30

cooler variant column, more test, more good

e051d04

hmm, need to find a better way...

32c6467

github-advanced-security AI found potential problems May 4, 2023


clintropolis added 6 commits May 4, 2023 13:53

nil columns, fixes

73b79f3

nullable

97898bb

more testing

b1d0e5f

make directly

1aec153

more test more better

f7e3fa2

Merge remote-tracking branch 'upstream/master' into i-got-99-problems…

cb4d7da

…-and-null-handling-bugs-are-idk-like-at-least-half-of-them-maybe

github-advanced-security AI found potential problems May 5, 2023


clintropolis added 4 commits May 4, 2023 20:26

fixup

8fca218

so picky

cca0a34

Merge remote-tracking branch 'upstream/master' into i-got-99-problems…

c9ae3fb

…-and-null-handling-bugs-are-idk-like-at-least-half-of-them-maybe

tests tests tests

6e796a7

gianm reviewed May 6, 2023


clintropolis added 4 commits May 7, 2023 16:40

javadoc and rearrange boundfilter

1c51ca8

more javadocs

0befe1e

redundant

e3b6c08

more

966960f

github-advanced-security AI found potential problems May 8, 2023


clintropolis added 2 commits May 7, 2023 19:00

oops

ea7e46b

add import for javadoc

f79cbdc

gianm approved these changes May 8, 2023


clintropolis merged commit 8805d8d into apache:master May 8, 2023

clintropolis deleted the i-got-99-problems-and-null-handling-bugs-are-idk-like-at-least-half-of-them-maybe branch May 8, 2023 20:19

clintropolis mentioned this pull request May 8, 2023

[Backport] fix issues with filtering nulls on values coerced to numeric types #14226

Merged

clintropolis mentioned this pull request May 10, 2023

fix npe regression in json_value when filtering non-existent paths #14250

Merged

3 tasks


5 participants
