fix expression column capabilities to not report dictionary encoded unless input is string #16577
    .setType(ColumnType.STRING)
    // this is sad, but currently dictionary encodedness is tied to string
    // selectors and sad stuff happens if the input type isn't string
    .setDictionaryEncoded(underlyingCapabilities.is(ValueType.STRING))
shouldn't this be underlyingCapabilities.isTrue() && underlyingCapabilities.is(ValueType.STRING)?
also, imo copyOf is not the right thing to use here, logically speaking. I don't think we actually want to carry through all caps from the underlying selector, just a few specific ones.
> shouldn't this be underlyingCapabilities.isTrue() && underlyingCapabilities.is(ValueType.STRING)?

oops yes.

> also, imo copyOf is not the right thing to use here, logically speaking. I don't think we actually want to carry through all caps from the underlying selector, just a few specific ones.

Ah yeah, that is reasonable I suppose. I think originally more were preserved than not, so I went with the copy. I guess hasBitmapIndexes isn't widely used on the query side anymore, since callers can just ask ColumnIndexSupplier for an index and it will return one if it has it, though I suppose it should still be preserved just in case.
actually, we were explicitly marking hasBitmapIndexes as false, though #15585 allows expressions to use them in many cases, so it should probably be allowed to pass through. hasBitmapIndexes and hasSpatialIndexes should probably be removed from ColumnCapabilities, and the write-side code that currently uses them moved to ColumnFormat, though I'm not going to do that in this PR.
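To make the distinction in this thread concrete, here is a minimal sketch contrasting "copy everything, then override" with "carry through only an explicit set of flags". It is a toy model, not Druid's `ColumnCapabilitiesImpl`: `PassThroughSketch`, `Caps`, and `Type` are hypothetical names, and which flags belong on the pass-through list (here only `hasBitmapIndexes`) is an assumption made for illustration.

```java
// Hypothetical model for illustration only; not Druid's ColumnCapabilitiesImpl.
final class PassThroughSketch {
  enum Type { STRING, LONG, DOUBLE, COMPLEX }

  // A tiny immutable capabilities holder with three example flags.
  static final class Caps {
    final Type type;
    final boolean dictionaryEncoded;
    final boolean hasBitmapIndexes;
    final boolean hasSpatialIndexes;

    Caps(Type type, boolean dictionaryEncoded, boolean hasBitmapIndexes, boolean hasSpatialIndexes) {
      this.type = type;
      this.dictionaryEncoded = dictionaryEncoded;
      this.hasBitmapIndexes = hasBitmapIndexes;
      this.hasSpatialIndexes = hasSpatialIndexes;
    }
  }

  // Copy-style: every flag of the input survives unless it is explicitly overridden.
  static Caps copyThenOverride(Caps in) {
    boolean dict = in.dictionaryEncoded && in.type == Type.STRING;
    return new Caps(Type.STRING, dict, in.hasBitmapIndexes, in.hasSpatialIndexes);
  }

  // Explicit style: start from defaults and carry through only the chosen flags.
  // Assumption: hasBitmapIndexes is carried, everything else starts as false.
  static Caps explicitPassThrough(Caps in) {
    boolean dict = in.dictionaryEncoded && in.type == Type.STRING;
    return new Caps(Type.STRING, dict, in.hasBitmapIndexes, false);
  }
}
```

With the explicit style, a flag added to the underlying capabilities later does not leak into the virtual column unless someone opts it in, which is the trade-off raised above.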
    bitmapSerdeFactory,
    byteOrder
    );
    ColumnCapabilitiesImpl capabilitiesBuilder = builder.getCapabilitiesBuilder();
this is just removing unused code?
ah, i had put in the description:
> While here also noticed that COMPLEX<json> was needlessly reporting itself as dictionary encoded, which isn't quite true. While the nested field columns are dictionary encoded, the json values themselves are not, so functions like TO_JSON_STRING could also run into this problem.
ahh, I missed that the capabilitiesBuilder is from builder itself. OK, this change makes sense then.
I think this caused a regression. Specifically: https://github.com/apache/druid/pull/16577/files#diff-02a6d78c905b352f5be16cac7cecdfb707b9b1475493999a75f7e97de830e005R284
I have a search query that used to run successfully but now fails. The query can be reproduced with the wikipedia datasource.
Note that a workaround is setting the query context searchStrategy to cursorOnly.

Description
Fixes an issue that occurs after #16366 where expressions with single input dictionary encoded columns that are not strings but have a string output incorrectly use the string dictionary encoded vector selector.
The fix for now is to mark the virtual column's capabilities as dictionary encoded only if the input type is string. In the future, if we make dictionary encoded selectors that are not coupled with handling strings, we could remove this constraint, but until then this is the safest option to avoid incorrectly using string dimension selectors.
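The rule can be sketched in isolation as follows. This is a simplified model rather than the actual change: `SketchValueType`, `SketchCapable`, `SketchCapabilities`, and `ExpressionCapabilitiesSketch` are hypothetical stand-ins for the value type, the three-valued capability flag, and the capabilities object, introduced only for illustration.

```java
import java.util.Objects;

// Hypothetical stand-ins for illustration only; these are not Druid's classes.
enum SketchValueType { STRING, LONG, DOUBLE, COMPLEX }

// Three-valued flag: a capability can be known true, known false, or unknown.
enum SketchCapable {
  TRUE, FALSE, UNKNOWN;

  boolean isTrue() {
    return this == TRUE;
  }
}

final class SketchCapabilities {
  final SketchValueType type;
  final SketchCapable dictionaryEncoded;

  SketchCapabilities(SketchValueType type, SketchCapable dictionaryEncoded) {
    this.type = Objects.requireNonNull(type);
    this.dictionaryEncoded = Objects.requireNonNull(dictionaryEncoded);
  }

  boolean is(SketchValueType other) {
    return type == other;
  }
}

final class ExpressionCapabilitiesSketch {
  // Capabilities of a string-output expression over a single input column.
  // The output is reported as dictionary encoded only when the input is
  // definitely dictionary encoded AND the input type is STRING.
  static SketchCapabilities forStringOutput(SketchCapabilities input) {
    boolean dict = input.dictionaryEncoded.isTrue() && input.is(SketchValueType.STRING);
    return new SketchCapabilities(
        SketchValueType.STRING,
        dict ? SketchCapable.TRUE : SketchCapable.FALSE
    );
  }
}
```

Under this model, a dictionary encoded LONG input yields a STRING output that is not reported as dictionary encoded, so a consumer would not pick a string dictionary selector for it.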
While here also noticed that COMPLEX<json> was needlessly reporting itself as dictionary encoded, which isn't quite true. While the nested field columns are dictionary encoded, the json values themselves are not, so functions like TO_JSON_STRING could also run into this problem.
This PR has: