Skip to content

various nested column (and other) fixes#13732

Merged
clintropolis merged 13 commits intoapache:masterfrom
clintropolis:fix-nested-column-schema-handling-and-a-bunch-of-other-stuff
Feb 7, 2023
Merged

various nested column (and other) fixes#13732
clintropolis merged 13 commits intoapache:masterfrom
clintropolis:fix-nested-column-schema-handling-and-a-bunch-of-other-stuff

Conversation

@clintropolis
Copy link
Copy Markdown
Member

@clintropolis clintropolis commented Feb 1, 2023

Description

This PR fixes a handful of semi-related issues, mostly to do with with nested columns, and some of them a result (or easier to encounter) after the introduction of the type 'mimic' behavior introduced in #13653.

changes:

  • modified druid schema column type compution to special case COMPLEX<json> handling to choose COMPLEX<json> if any column in any segment is COMPLEX<json>
  • NestedFieldVirtualColumn can now work correctly on any type of column, returning either a column selector if a root path, or nil selector if not
  • fixed a random bug with NilVectorSelector when using a vector size larger than the default and druid.generic.useDefaultValueForNull=false would have the nulls vector set to all false instead of true
  • fixed an overly aggressive check in ExprEval.ofType when handling complex types which would try to treat any string as base64 without gracefully falling back if it was not in fact base64 encoded, along with special handling for COMPLEX<json>
  • added ExpressionVectorSelectors.castValueSelectorToObject and ExpressionVectorSelectors.castObjectSelectorToNumeric as convenience methods to cast vector selectors using cast expressions without the trouble of constructing an expression. the polymorphic nature of the non-vectorized engine (and significantly larger overhead of non-vectorized expression processing) made adding similar methods for non-vectorized selectors less attractive and so have not been added at this time
  • Handles missing cases in ExprEval.bestEffortOf (Integer[], int[], byte[] which is now converted to base64 string instead of 'unknown complex') and fixes a translation error (float[] was not casting values to double), and test coverage
  • modernize calcite sql tests to not use deprecated Parser to build indexes
  • more tests more better

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

changes:
* modified druid schema column type compution to special case COMPLEX<json> handling to choose COMPLEX<json> if any column in any segment is COMPLEX<json>
* NestedFieldVirtualColumn can now work correctly on any type of column, returning either a column selector if a root path, or nil selector if not
* fixed a random bug with NilVectorSelector when using a vector size larger than the default and druid.generic.useDefaultValueForNull=false would have the nulls vector set to all false instead of true
* fixed an overly aggressive check in ExprEval.ofType when handling complex types which would try to treat any string as base64 without gracefully falling back if it was not in fact base64 encoded, along with special handling for complex<json>
* added ExpressionVectorSelectors.castValueSelectorToObject and ExpressionVectorSelectors.castObjectSelectorToNumeric as convience methods to cast vector selectors using cast expressions without the trouble of constructing an expression. the polymorphic nature of the non-vectorized engine (and significantly larger overhead of non-vectorized expression processing) made adding similar methods for non-vectorized selectors less attractive and so have not been added at this time
* more tests more better
@clintropolis clintropolis changed the title various nested column fixes various nested column (and other) fixes Feb 1, 2023
return null;
}

return dimensionSpec.decorate(makeDimensionSelectorUndecorated(holder, offset, dimensionSpec.getExtractionFn()));

Check notice

Code scanning / CodeQL

Deprecated method or constructor invocation

Invoking [DimensionSpec.getExtractionFn](1) should be avoided because it has been deprecated.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True statement, should be ignored in this case.

ColumnHolder holder = columnSelector.getColumnHolder(columnName);
if (holder == null) {
// column doesn't exist
return dimensionSpec.decorate(DimensionSelector.constant(null, dimensionSpec.getExtractionFn()));

Check notice

Code scanning / CodeQL

Deprecated method or constructor invocation

Invoking [DimensionSpec.getExtractionFn](1) should be avoided because it has been deprecated.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True statement, should be ignored in this case.

Comment on lines +186 to +200
return new NestedDataColumnSupplier(
version,
metadata,
fields,
fieldInfo,
compressedRawColumnSupplier,
nullValues,
stringDictionary,
frontCodedStringDictionarySupplier,
longDictionarySupplier,
doubleDictionarySupplier,
columnConfig,
mapper,
simpleType
);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When building this, it appears like we are loading various files to then use in the suppliers. Given the recent sensitivity around the amount of data held on JVM heap per segment, I'm curious if we are being as judicious as possible with our JVM memory. We have to draw a fine line between performance and not performing hard-work multiple times and JVM heap, 'cause we can't fit everything in it.

With the changes here, I would have to dig in further to really get a sense for how it is changing the JVM heap memory usage, so I'm just wondering what your thought process was around these changes and their impact to heap memory?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i didn't actually change anything here, i just moved the logic out of the constructor into a static method per a request of yours on a previous PR. the nested column segments are pretty light due to the use of suppliers, particularly when using FrontCodedIndexed for the string dictionary #12277 (comment), so they consist mostly of a reference to the underlying buffer from the mmap segment and the supplier object until materialized into an indexed for queries. I did this with heap usage specifically in mind.

GenericIndexed is the most expensive part of segment heap usage afaict, and the limited use of it here helps keep things light, because its both materialized into an Indexed and also still functions like a supplier with the 'singleThreaded' method, and usually is the driver of heap size in the dumps i've seen.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I remembered asking about the move to static, I didn't realize that this was all the exact same logic as before. It looked like there was stuff added, but perhaps that was just me mis-reading the diff. If this is no change in terms of the things that are pre-cached, then all is good.

);
} else {
final boolean[] nulls = new boolean[vectorSizeInspector.getMaxVectorSize()];
Arrays.fill(nulls, NullHandling.sqlCompatible());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default initialization for a native boolean is false, so we should really only do this if NullHandling.sqlCompatible() is true.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reasonable, can fix since this one happens every time one of these is requested

ColumnHolder holder = columnSelector.getColumnHolder(columnName);
if (holder == null) {
// column doesn't exist
return dimensionSpec.decorate(DimensionSelector.constant(null, dimensionSpec.getExtractionFn()));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True statement, should be ignored in this case.

return null;
}

return dimensionSpec.decorate(makeDimensionSelectorUndecorated(holder, offset, dimensionSpec.getExtractionFn()));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True statement, should be ignored in this case.

Comment on lines +276 to +279
if (!parts.isEmpty()) {
// we are being asked for a path that will never exist, so we are null selector
return DimensionSelector.constant(null, extractionFn);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this because parts.isEmpty() means that we are looking for the root? And if we are not looking for the root then it's impossible to have something? If so, can we move the stuff that's after this if to be inside of the if (to eliminate the negation) and adjust the comments to explain the relationship between parts.isEmpty() and the "real world/query"?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this because parts.isEmpty() means that we are looking for the root? And if we are not looking for the root then it's impossible to have something?

yep, exactly, will try to clarify comments

If so, can we move the stuff that's after this if to be inside of the if (to eliminate the negation) and adjust the comments to explain the relationship between parts.isEmpty() and the "real world/query"?

hah, i originally had it flipped and decided last minute to do this way so less stuff inside the if which seemed slightly more pleasing, but will switch back because is all the same.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the want for less lines inside of a block, but my rule is that the fewest negations is always easier to read.

return DimensionSelector.constant(null, extractionFn);
}

// the path was the 'root', we're in luck, spit out a selector that behaves the same way as a nested column
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does the instanceof DictionaryEncoded check tell us that the path was the root? Or is that part intended to be explaining the if before this?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah this comment is maybe confusing after i re-arranged some stuff. the comment is in relation to making it past the last 'if' statement that would have bailed with a nil selector if we weren't looking for the root path

im checking for a dictionary encoded column here because if so, i want to wrap it in the auto casting gizmo because the string dimension selectors do not have that behavior by default. If it is not dictionary encoded, i can just use the column value selector because it is going to be a numeric column which do all implement all of the 'get number' methods

i'll admit that this is possibly a bit dependent on the way things currently are and that the checks might not be adequate in the future. I guess if i flip this logic around like your other comment suggested i could make these check explicitly for either numeric columns and use value selector or string dictionary column and use the wrapped dim selector, else return nil selector

Comment on lines +308 to +313
if (processFromRaw || hasNegativeArrayIndex) {
// if the path has negative array elements, or has set the flag to process 'raw' values explicitly (JSON_QUERY),
// then we use the 'raw' processing of the RawFieldColumnSelector/RawFieldLiteralColumnValueSelector created
// with the column selector factory instead of using the optimized nested field column
return null;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand why processFromRaw takes us into this if, I don't understand why hasNegativeArrayIndex does?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah maybe I could explain why a bit better here. the reason is that jsonpath expressions with a negative array element mean "from the end of the array", so $.x[-1] returns the last element of the array x, and so on. We don't currently have an easy way to know how many elements are in an array for each row to be able to use the nested literal columns for those elements, so hasNegativeArrayIndex forces us to fall back to the 'raw' data processing which can get the whole array and find the appropriate element that way instead.

{"timestamp": "2021-01-01", "dim": "hello", "nester":{ "x": ["x", "y", "z"]}, "list":[{"x": 35, "y": 310}, {"x": 315, "y": 322}]}
{"timestamp": "2021-01-01", "dim": "hello", "nest":{ "x": 300, "y": 800}, "nester": "hello"}
{"timestamp": "2021-01-01", "dim": "hello", "nest":{ "y": 500}, "list":[{"x": 115, "y": 410}, {"x": 415, "y": 422}]}
{"timestamp": "2021-01-01", "dim": "100", "nest":{ "y": 500}, "list":[{"x": 115, "y": 410}, {"x": 415, "y": 422}]}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that it's still a string. Was this change also wanting to test adding a new type and having it be a number? Or is this being used to validate that the parsing of numbers works?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test parsing numbers

Comment on lines +801 to +814
columnTypes.compute(column, (c, existingType) -> {
if (existingType == null) {
return columnType;
}
if (columnType == null) {
return existingType;
}
// if any are json, are all json
if (NestedDataComplexTypeSerde.TYPE.equals(columnType) || NestedDataComplexTypeSerde.TYPE.equals(existingType)) {
return NestedDataComplexTypeSerde.TYPE;
}
// "existing type" is the 'newest' type, since we iterate the segments list by newest start time
return existingType;
});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a question not actually about your code, but the code that existed before it... Given that this is using a compute to put values into the map, how do these things change?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is an area that could use improvement probably, but currently it will do a 'refresh' that rebuilds the schema for a datasource from the available set of segments, which calls this method, so its like ...iterating over all of them to recompute the schema. See segmentsNeedingRefresh and dataSourcesNeedingRebuild and such

* fix inconsistency between nested column indexer and serializer in handling values (coerce non primitive and non arrays of primitives using asString)
* ExprEval best effort mode now handles byte[] as string
* added test for ExprEval.bestEffortOf, and add missing conversion cases that tests uncovered
Comment on lines +482 to +487
// in 'best effort' mode, we couldn't possibly use byte[] as a complex or anything else useful without type
// knowledge, so lets turn it into a base64 encoded string so at least something downstream can use it by decoding
// back into bytes
if (val instanceof byte[]) {
return new StringExprEval(StringUtils.encodeBase64String((byte[]) val));
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we even need to base64 it? why not just keep it as a byte[]?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is going into the expression system, if we don't handle it like this, it ends up as "unknown complex" and is basically useless in that form since we don't have a native byte[] typed expressions, and complex types cannot be cast to anything else (not that would do anything useful here).

With nested columns, leaving it as a byte[] is a problem, since we are using this method to process inputs to determine their type, we have a default case for anything that leaks through that isn't LONG or DOUBLE or STRING that effectively calls java toString on things that makes these end up not very useful byte[].toString.

I could change this to be a parse exception, but even if we did that, i still think it would still be useful to feed these into the expressions as a base64 encoded string rather than throwing it away. Maybe in the future it would be nice to have a byte blob type to not have to encode it, but until then, I think this is the most useful thing we can do, and its consistent with the behavior we use to handle byte[] in Rows.objectToStrings that normal string columns go through.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay

@clintropolis clintropolis merged commit 2d3bee8 into apache:master Feb 7, 2023
@clintropolis clintropolis deleted the fix-nested-column-schema-handling-and-a-bunch-of-other-stuff branch February 7, 2023 03:48
abhagraw pushed a commit to abhagraw/druid that referenced this pull request Feb 8, 2023
changes:
* modified druid schema column type compution to special case COMPLEX<json> handling to choose COMPLEX<json> if any column in any segment is COMPLEX<json>
* NestedFieldVirtualColumn can now work correctly on any type of column, returning either a column selector if a root path, or nil selector if not
* fixed a random bug with NilVectorSelector when using a vector size larger than the default and druid.generic.useDefaultValueForNull=false would have the nulls vector set to all false instead of true
* fixed an overly aggressive check in ExprEval.ofType when handling complex types which would try to treat any string as base64 without gracefully falling back if it was not in fact base64 encoded, along with special handling for complex<json>
* added ExpressionVectorSelectors.castValueSelectorToObject and ExpressionVectorSelectors.castObjectSelectorToNumeric as convience methods to cast vector selectors using cast expressions without the trouble of constructing an expression. the polymorphic nature of the non-vectorized engine (and significantly larger overhead of non-vectorized expression processing) made adding similar methods for non-vectorized selectors less attractive and so have not been added at this time
* fix inconsistency between nested column indexer and serializer in handling values (coerce non primitive and non arrays of primitives using asString)
* ExprEval best effort mode now handles byte[] as string
* added test for ExprEval.bestEffortOf, and add missing conversion cases that tests uncovered
* more tests more better
@clintropolis clintropolis added this to the 26.0 milestone Apr 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants