various nested column (and other) fixes by clintropolis · Pull Request #13732 · apache/druid

clintropolis · 2023-02-01T05:17:06Z

Description

This PR fixes a handful of semi-related issues, mostly to do with with nested columns, and some of them a result (or easier to encounter) after the introduction of the type 'mimic' behavior introduced in #13653.

changes:

modified druid schema column type compution to special case COMPLEX<json> handling to choose COMPLEX<json> if any column in any segment is COMPLEX<json>
NestedFieldVirtualColumn can now work correctly on any type of column, returning either a column selector if a root path, or nil selector if not
fixed a random bug with NilVectorSelector when using a vector size larger than the default and druid.generic.useDefaultValueForNull=false would have the nulls vector set to all false instead of true
fixed an overly aggressive check in ExprEval.ofType when handling complex types which would try to treat any string as base64 without gracefully falling back if it was not in fact base64 encoded, along with special handling for COMPLEX<json>
added ExpressionVectorSelectors.castValueSelectorToObject and ExpressionVectorSelectors.castObjectSelectorToNumeric as convenience methods to cast vector selectors using cast expressions without the trouble of constructing an expression. the polymorphic nature of the non-vectorized engine (and significantly larger overhead of non-vectorized expression processing) made adding similar methods for non-vectorized selectors less attractive and so have not been added at this time
Handles missing cases in ExprEval.bestEffortOf (Integer[], int[], byte[] which is now converted to base64 string instead of 'unknown complex') and fixes a translation error (float[] was not casting values to double), and test coverage
modernize calcite sql tests to not use deprecated Parser to build indexes
more tests more better

This PR has:

been self-reviewed.
added documentation for new or modified features or behaviors.
a release note entry in the PR description.
added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
been tested in a test Druid cluster.

changes: * modified druid schema column type compution to special case COMPLEX<json> handling to choose COMPLEX<json> if any column in any segment is COMPLEX<json> * NestedFieldVirtualColumn can now work correctly on any type of column, returning either a column selector if a root path, or nil selector if not * fixed a random bug with NilVectorSelector when using a vector size larger than the default and druid.generic.useDefaultValueForNull=false would have the nulls vector set to all false instead of true * fixed an overly aggressive check in ExprEval.ofType when handling complex types which would try to treat any string as base64 without gracefully falling back if it was not in fact base64 encoded, along with special handling for complex<json> * added ExpressionVectorSelectors.castValueSelectorToObject and ExpressionVectorSelectors.castObjectSelectorToNumeric as convience methods to cast vector selectors using cast expressions without the trouble of constructing an expression. the polymorphic nature of the non-vectorized engine (and significantly larger overhead of non-vectorized expression processing) made adding similar methods for non-vectorized selectors less attractive and so have not been added at this time * more tests more better

imply-cheddar · 2023-02-03T04:56:07Z

+      return null;
+    }
+
+    return dimensionSpec.decorate(makeDimensionSelectorUndecorated(holder, offset, dimensionSpec.getExtractionFn()));


True statement, should be ignored in this case.

imply-cheddar · 2023-02-03T04:55:57Z

+    ColumnHolder holder = columnSelector.getColumnHolder(columnName);
+    if (holder == null) {
+      // column doesn't exist
+      return dimensionSpec.decorate(DimensionSelector.constant(null, dimensionSpec.getExtractionFn()));


True statement, should be ignored in this case.

imply-cheddar · 2023-02-03T04:46:34Z

+        return new NestedDataColumnSupplier(
+            version,
+            metadata,
+            fields,
+            fieldInfo,
+            compressedRawColumnSupplier,
+            nullValues,
+            stringDictionary,
+            frontCodedStringDictionarySupplier,
+            longDictionarySupplier,
+            doubleDictionarySupplier,
+            columnConfig,
+            mapper,
+            simpleType
+        );


When building this, it appears like we are loading various files to then use in the suppliers. Given the recent sensitivity around the amount of data held on JVM heap per segment, I'm curious if we are being as judicious as possible with our JVM memory. We have to draw a fine line between performance and not performing hard-work multiple times and JVM heap, 'cause we can't fit everything in it.

With the changes here, I would have to dig in further to really get a sense for how it is changing the JVM heap memory usage, so I'm just wondering what your thought process was around these changes and their impact to heap memory?

i didn't actually change anything here, i just moved the logic out of the constructor into a static method per a request of yours on a previous PR. the nested column segments are pretty light due to the use of suppliers, particularly when using FrontCodedIndexed for the string dictionary #12277 (comment), so they consist mostly of a reference to the underlying buffer from the mmap segment and the supplier object until materialized into an indexed for queries. I did this with heap usage specifically in mind.

GenericIndexed is the most expensive part of segment heap usage afaict, and the limited use of it here helps keep things light, because its both materialized into an Indexed and also still functions like a supplier with the 'singleThreaded' method, and usually is the driver of heap size in the dumps i've seen.

Yeah, I remembered asking about the move to static, I didn't realize that this was all the exact same logic as before. It looked like there was stuff added, but perhaps that was just me mis-reading the diff. If this is no change in terms of the things that are pre-cached, then all is good.

imply-cheddar · 2023-02-03T04:48:12Z

      );
    } else {
+      final boolean[] nulls = new boolean[vectorSizeInspector.getMaxVectorSize()];
+      Arrays.fill(nulls, NullHandling.sqlCompatible());


The default initialization for a native boolean is false, so we should really only do this if NullHandling.sqlCompatible() is true.

reasonable, can fix since this one happens every time one of these is requested

imply-cheddar · 2023-02-03T04:55:57Z

+    ColumnHolder holder = columnSelector.getColumnHolder(columnName);
+    if (holder == null) {
+      // column doesn't exist
+      return dimensionSpec.decorate(DimensionSelector.constant(null, dimensionSpec.getExtractionFn()));


True statement, should be ignored in this case.

imply-cheddar · 2023-02-03T04:56:07Z

+      return null;
+    }
+
+    return dimensionSpec.decorate(makeDimensionSelectorUndecorated(holder, offset, dimensionSpec.getExtractionFn()));


True statement, should be ignored in this case.

imply-cheddar · 2023-02-03T04:59:04Z

+    if (!parts.isEmpty()) {
+      // we are being asked for a path that will never exist, so we are null selector
+      return DimensionSelector.constant(null, extractionFn);
+    }


Is this because parts.isEmpty() means that we are looking for the root? And if we are not looking for the root then it's impossible to have something? If so, can we move the stuff that's after this if to be inside of the if (to eliminate the negation) and adjust the comments to explain the relationship between parts.isEmpty() and the "real world/query"?

Is this because parts.isEmpty() means that we are looking for the root? And if we are not looking for the root then it's impossible to have something?

yep, exactly, will try to clarify comments

If so, can we move the stuff that's after this if to be inside of the if (to eliminate the negation) and adjust the comments to explain the relationship between parts.isEmpty() and the "real world/query"?

hah, i originally had it flipped and decided last minute to do this way so less stuff inside the if which seemed slightly more pleasing, but will switch back because is all the same.

I understand the want for less lines inside of a block, but my rule is that the fewest negations is always easier to read.

imply-cheddar · 2023-02-03T05:00:40Z

+      return DimensionSelector.constant(null, extractionFn);
+    }
+
+    // the path was the 'root', we're in luck, spit out a selector that behaves the same way as a nested column


why does the instanceof DictionaryEncoded check tell us that the path was the root? Or is that part intended to be explaining the if before this?

ah this comment is maybe confusing after i re-arranged some stuff. the comment is in relation to making it past the last 'if' statement that would have bailed with a nil selector if we weren't looking for the root path

im checking for a dictionary encoded column here because if so, i want to wrap it in the auto casting gizmo because the string dimension selectors do not have that behavior by default. If it is not dictionary encoded, i can just use the column value selector because it is going to be a numeric column which do all implement all of the 'get number' methods

i'll admit that this is possibly a bit dependent on the way things currently are and that the checks might not be adequate in the future. I guess if i flip this logic around like your other comment suggested i could make these check explicitly for either numeric columns and use value selector or string dictionary column and use the wrapped dim selector, else return nil selector

imply-cheddar · 2023-02-03T05:02:50Z

+    if (processFromRaw || hasNegativeArrayIndex) {
+      // if the path has negative array elements, or has set the flag to process 'raw' values explicitly (JSON_QUERY),
+      // then we use the 'raw' processing of the RawFieldColumnSelector/RawFieldLiteralColumnValueSelector created
+      // with the column selector factory instead of using the optimized nested field column
+      return null;
+    }


I understand why processFromRaw takes us into this if, I don't understand why hasNegativeArrayIndex does?

ah maybe I could explain why a bit better here. the reason is that jsonpath expressions with a negative array element mean "from the end of the array", so $.x[-1] returns the last element of the array x, and so on. We don't currently have an easy way to know how many elements are in an array for each row to be able to use the nested literal columns for those elements, so hasNegativeArrayIndex forces us to fall back to the 'raw' data processing which can get the whole array and find the appropriate element that way instead.

imply-cheddar · 2023-02-03T05:28:25Z

 {"timestamp": "2021-01-01", "dim": "hello", "nester":{ "x": ["x", "y", "z"]}, "list":[{"x": 35, "y": 310}, {"x": 315, "y": 322}]}
 {"timestamp": "2021-01-01", "dim": "hello", "nest":{ "x": 300, "y":  800}, "nester": "hello"}
-{"timestamp": "2021-01-01", "dim": "hello", "nest":{ "y": 500}, "list":[{"x": 115, "y": 410}, {"x": 415, "y": 422}]}
+{"timestamp": "2021-01-01", "dim": "100", "nest":{ "y": 500}, "list":[{"x": 115, "y": 410}, {"x": 415, "y": 422}]}


I noticed that it's still a string. Was this change also wanting to test adding a new type and having it be a number? Or is this being used to validate that the parsing of numbers works?

test parsing numbers

imply-cheddar · 2023-02-03T05:29:42Z

+            columnTypes.compute(column, (c, existingType) -> {
+              if (existingType == null) {
+                return columnType;
+              }
+              if (columnType == null) {
+                return existingType;
+              }
+              // if any are json, are all json
+              if (NestedDataComplexTypeSerde.TYPE.equals(columnType) || NestedDataComplexTypeSerde.TYPE.equals(existingType)) {
+                return NestedDataComplexTypeSerde.TYPE;
+              }
+              // "existing type" is the 'newest' type, since we iterate the segments list by newest start time
+              return existingType;
+            });


This is a question not actually about your code, but the code that existed before it... Given that this is using a compute to put values into the map, how do these things change?

this is an area that could use improvement probably, but currently it will do a 'refresh' that rebuilds the schema for a datasource from the available set of segments, which calls this method, so its like ...iterating over all of them to recompute the schema. See segmentsNeedingRefresh and dataSourcesNeedingRebuild and such

…-schema-handling-and-a-bunch-of-other-stuff

…ual column

…-schema-handling-and-a-bunch-of-other-stuff

* fix inconsistency between nested column indexer and serializer in handling values (coerce non primitive and non arrays of primitives using asString) * ExprEval best effort mode now handles byte[] as string * added test for ExprEval.bestEffortOf, and add missing conversion cases that tests uncovered

imply-cheddar · 2023-02-07T00:02:46Z

+    // in 'best effort' mode, we couldn't possibly use byte[] as a complex or anything else useful without type
+    // knowledge, so lets turn it into a base64 encoded string so at least something downstream can use it by decoding
+    // back into bytes
+    if (val instanceof byte[]) {
+      return new StringExprEval(StringUtils.encodeBase64String((byte[]) val));
+    }


Do we even need to base64 it? why not just keep it as a byte[]?

this is going into the expression system, if we don't handle it like this, it ends up as "unknown complex" and is basically useless in that form since we don't have a native byte[] typed expressions, and complex types cannot be cast to anything else (not that would do anything useful here).

With nested columns, leaving it as a byte[] is a problem, since we are using this method to process inputs to determine their type, we have a default case for anything that leaks through that isn't LONG or DOUBLE or STRING that effectively calls java toString on things that makes these end up not very useful byte[].toString.

I could change this to be a parse exception, but even if we did that, i still think it would still be useful to feed these into the expressions as a base64 encoded string rather than throwing it away. Maybe in the future it would be nice to have a byte blob type to not have to encode it, but until then, I think this is the most useful thing we can do, and its consistent with the behavior we use to handle byte[] in Rows.objectToStrings that normal string columns go through.

changes: * modified druid schema column type compution to special case COMPLEX<json> handling to choose COMPLEX<json> if any column in any segment is COMPLEX<json> * NestedFieldVirtualColumn can now work correctly on any type of column, returning either a column selector if a root path, or nil selector if not * fixed a random bug with NilVectorSelector when using a vector size larger than the default and druid.generic.useDefaultValueForNull=false would have the nulls vector set to all false instead of true * fixed an overly aggressive check in ExprEval.ofType when handling complex types which would try to treat any string as base64 without gracefully falling back if it was not in fact base64 encoded, along with special handling for complex<json> * added ExpressionVectorSelectors.castValueSelectorToObject and ExpressionVectorSelectors.castObjectSelectorToNumeric as convience methods to cast vector selectors using cast expressions without the trouble of constructing an expression. the polymorphic nature of the non-vectorized engine (and significantly larger overhead of non-vectorized expression processing) made adding similar methods for non-vectorized selectors less attractive and so have not been added at this time * fix inconsistency between nested column indexer and serializer in handling values (coerce non primitive and non arrays of primitives using asString) * ExprEval best effort mode now handles byte[] as string * added test for ExprEval.bestEffortOf, and add missing conversion cases that tests uncovered * more tests more better

clintropolis added Bug Area - Querying Area - Ingestion labels Feb 1, 2023

clintropolis changed the title ~~various nested column fixes~~ various nested column (and other) fixes Feb 1, 2023

github-advanced-security AI found potential problems Feb 1, 2023

View reviewed changes

clintropolis added 2 commits February 1, 2023 19:02

better

e5cb7d9

pointless

7a60699

github-advanced-security AI found potential problems Feb 2, 2023

View reviewed changes

constant fix

6ab287f

github-advanced-security AI found potential problems Feb 2, 2023

View reviewed changes

imply-cheddar reviewed Feb 3, 2023

View reviewed changes

clintropolis added 3 commits February 3, 2023 01:40

Merge remote-tracking branch 'upstream/master' into fix-nested-column…

7f88eb1

…-schema-handling-and-a-bunch-of-other-stuff

modernize calcite tests to not use parsers, tidy up nested field virt…

77e1485

…ual column

less parser

fb429ed

github-advanced-security AI found potential problems Feb 3, 2023

View reviewed changes

Comment thread ...src/test/java/org/apache/druid/query/aggregation/bloom/sql/BloomFilterSqlAggregatorTest.java Fixed

clintropolis added 6 commits February 3, 2023 15:51

remove unused

0bdd5a9

Merge remote-tracking branch 'upstream/master' into fix-nested-column…

b073dd7

…-schema-handling-and-a-bunch-of-other-stuff

typo and more test

3f63354

style

f888086

adjust

e0e6745

imply-cheddar approved these changes Feb 7, 2023

View reviewed changes

clintropolis merged commit 2d3bee8 into apache:master Feb 7, 2023

clintropolis deleted the fix-nested-column-schema-handling-and-a-bunch-of-other-stuff branch February 7, 2023 03:48

clintropolis mentioned this pull request Feb 9, 2023

fix filtering nested field virtual column when used with non nested column input #13779

Merged

4 tasks

clintropolis added this to the 26.0 milestone Apr 10, 2023

Conversation

clintropolis commented Feb 1, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Check notice

Choose a reason for hiding this comment

Uh oh!

Check notice

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

clintropolis commented Feb 1, 2023 •

edited

Loading