fix expression column capabilities to not report dictionary encoded unless input is string#16577

Merged
clintropolis merged 3 commits into apache:master from clintropolis:fix-fallback-vectorize-dictionary-encoded
Jun 8, 2024
Conversation

@clintropolis
Member

Description

Fixes an issue that occurs after #16366 where expressions with a single dictionary-encoded input column that is not a string, but that produce a string output, incorrectly use the string dictionary-encoded vector selector.

The fix for now is to mark the capabilities of the virtual column as dictionary encoded only if the input type is string. In the future, if we make dictionary encoded selectors that are not coupled with handling strings, then we could remove this constraint, but until then this is the safest option to avoid incorrectly using string dimension selectors.

While here also noticed that COMPLEX<json> was needlessly reporting itself as dictionary encoded, which isn't quite true. While the nested field columns are dictionary encoded, the json values themselves are not, so functions like TO_JSON_STRING could also run into this problem.


This PR has:

  • been self-reviewed.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

.setType(ColumnType.STRING)
// this is sad, but currently dictionary encodedness is tied to string
// selectors and sad stuff happens if the input type isn't string
.setDictionaryEncoded(underlyingCapabilities.is(ValueType.STRING))
Contributor

shouldn't this be underlyingCapabilities.isTrue() && underlyingCapabilities.is(ValueType.STRING)?

Contributor

also, imo copyOf is not the right thing to use here, logically speaking. I don't think we actually want to carry through all caps from the underlying selector, just a few specific ones.

Member Author

> shouldn't this be underlyingCapabilities.isTrue() && underlyingCapabilities.is(ValueType.STRING)?

oops yes.
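
The corrected condition agreed on here can be sketched in isolation. This is a minimal illustration, assuming simplified stand-in types: `InputCapabilities`, `Capable`, and `shouldReportDictionaryEncoded` are hypothetical names for this sketch, not Druid's real classes, and the tri-state flag is an assumption about how "dictionary encoded" is represented.

```java
/** Hypothetical, simplified stand-ins for illustration; not Druid's real classes. */
public final class DictionaryEncodedCapabilitySketch {
  enum ValueType { STRING, LONG, DOUBLE, COMPLEX }

  /** Tri-state flag (assumption): a capability may be true, false, or unknown. */
  enum Capable {
    TRUE, FALSE, UNKNOWN;

    boolean isTrue() {
      return this == TRUE;
    }
  }

  /** Minimal capabilities holder for the underlying (input) column. */
  static final class InputCapabilities {
    final ValueType type;
    final Capable dictionaryEncoded;

    InputCapabilities(ValueType type, Capable dictionaryEncoded) {
      this.type = type;
      this.dictionaryEncoded = dictionaryEncoded;
    }

    boolean is(ValueType other) {
      return type == other;
    }
  }

  /**
   * The virtual column reports itself as dictionary encoded only when the
   * input is BOTH dictionary encoded AND a string column.
   */
  static boolean shouldReportDictionaryEncoded(InputCapabilities input) {
    return input.dictionaryEncoded.isTrue() && input.is(ValueType.STRING);
  }

  public static void main(String[] args) {
    // Dictionary-encoded string input: prints true
    System.out.println(shouldReportDictionaryEncoded(
        new InputCapabilities(ValueType.STRING, Capable.TRUE)));
    // Dictionary-encoded but non-string input: prints false
    System.out.println(shouldReportDictionaryEncoded(
        new InputCapabilities(ValueType.LONG, Capable.TRUE)));
    // String input that is not known to be dictionary encoded: prints false
    System.out.println(shouldReportDictionaryEncoded(
        new InputCapabilities(ValueType.STRING, Capable.UNKNOWN)));
  }
}
```

Checking only the type, as in the snippet under review, would return true for the third case as well, which is the gap the review comment points out.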

> also, imo copyOf is not the right thing to use here, logically speaking. I don't think we actually want to carry through all caps from the underlying selector, just a few specific ones.

Ah yea that is reasonable i suppose, i think originally more were preserved than not so i went with the copy. I guess hasBitmapIndexes isn't widely used on query side anymore since stuff can just ask ColumnIndexSupplier for an index and it will return if it has it, though i suppose it should still be preserved just in case.

Member Author

@clintropolis clintropolis Jun 8, 2024

actually, we were explicitly marking hasBitmapIndexes as false, though #15585 allows expressions to use them in many cases, so it probably should be allowed to pass through. hasBitmapIndexes and hasSpatialIndexes should probably be removed from ColumnCapabilities, and the write side stuff that currently uses it be moved to ColumnFormat, though not going to do that on this PR.

bitmapSerdeFactory,
byteOrder
);
ColumnCapabilitiesImpl capabilitiesBuilder = builder.getCapabilitiesBuilder();
Contributor

this is just removing unused code?

Member Author

@clintropolis clintropolis Jun 8, 2024

ah, i had put in the description:

> While here also noticed that COMPLEX<json> was needlessly reporting itself as dictionary encoded, which isn't quite true. While the nested field columns are dictionary encoded, the json values themselves are not, so functions like TO_JSON_STRING could also run into this problem.

Contributor

ahh, I missed that the capabilitiesBuilder is from builder itself. OK, this change makes sense then.

@clintropolis clintropolis merged commit 3fb6ba2 into apache:master Jun 8, 2024
@clintropolis clintropolis deleted the fix-fallback-vectorize-dictionary-encoded branch June 8, 2024 20:05
@kfaraz kfaraz added this to the 31.0.0 milestone Oct 4, 2024
@maytasm
Contributor

maytasm commented Jan 24, 2025

@gianm @clintropolis

I think this caused a regression. Specifically: https://github.com/apache/druid/pull/16577/files#diff-02a6d78c905b352f5be16cac7cecdfb707b9b1475493999a75f7e97de830e005R284

I have a search query that used to run successfully but now fails with

Cannot invoke "org.apache.druid.segment.index.semantic.DictionaryEncodedStringValueIndex.getCardinality()" because "bitmapIndex" is null
java.lang.NullPointerException

This is the query:

{"queryType":"search","dataSource":"wikipedia","granularity":"all","intervals":["2000-11-22/2024-12-21"],"virtualColumns":[{"name":"spark_cluster_group","type":"expression","expression":"case_searched(like(cityName,'C%'), 'foo', like(cityName, 'B%'), 'bar', 'other')","outputType":"STRING"}],"searchDimensions":["spark_cluster_group"],"query":{"type":"insensitive_contains","value":""},"sort":{"type":"lexicographic"},"limit":100}

This query can be reproduced with the wikipedia datasource.

More details:
Before this change, the above query's search dimension would get assigned to nonBitmapDims in UseIndexesStrategy#partitionDimensionList. However, after this change the search dimension gets assigned to bitmapDims instead. As a result, it then causes an NPE at UseIndexesStrategy#execute:

          final DictionaryEncodedStringValueIndex bitmapIndex =
              indexSupplier.as(DictionaryEncodedStringValueIndex.class);
          for (int i = 0; i < bitmapIndex.getCardinality(); ++i) {

Note that a workaround is to set the query context parameter searchStrategy to cursorOnly.
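
The NPE above comes from dereferencing the result of `indexSupplier.as(...)` without checking for null. A minimal sketch of the defensive pattern follows, using hypothetical stand-in types (`ValueIndex`, `IndexSupplier`, `supplierOf`, and `chooseStrategy` are invented for this illustration). It is not Druid's actual code and not necessarily how the regression was or should be fixed upstream.

```java
/** Hypothetical stand-ins for illustration only; not Druid's real interfaces. */
public final class SearchIndexFallbackSketch {
  /** Stand-in for an index that exposes the dictionary cardinality. */
  interface ValueIndex {
    int getCardinality();
  }

  /** Stand-in for a supplier whose lookup returns null when the index type is unavailable. */
  interface IndexSupplier {
    <T> T as(Class<T> clazz);
  }

  /** Builds a supplier that serves the given index, or nothing when index is null. */
  static IndexSupplier supplierOf(final ValueIndex index) {
    return new IndexSupplier() {
      @Override
      public <T> T as(Class<T> clazz) {
        // isInstance(null) is false, so a missing index yields null here
        return clazz.isInstance(index) ? clazz.cast(index) : null;
      }
    };
  }

  /**
   * Picks the index-based path only when the index really exists;
   * otherwise falls back to scanning with a cursor instead of throwing.
   */
  static String chooseStrategy(IndexSupplier supplier) {
    ValueIndex bitmapIndex = supplier == null ? null : supplier.as(ValueIndex.class);
    return bitmapIndex == null ? "cursor" : "index";
  }

  public static void main(String[] args) {
    // An index with cardinality 3 is available: prints index
    System.out.println(chooseStrategy(supplierOf(() -> 3)));
    // No index available, as for the expression virtual column here: prints cursor
    System.out.println(chooseStrategy(supplierOf(null)));
  }
}
```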

