Fix four bugs with numeric dimension output types. #6220
This patch includes the following bug fixes:
- TopNColumnSelectorStrategyFactory: Cast dimension values to the output type during dimExtractionScanAndAggregate instead of updateDimExtractionResults. This fixes a bug where, for example, grouping on doubles-cast-to-longs would fail to merge two doubles that should have been combined into the same long value.
- TopNQueryEngine: Use DimExtractionTopNAlgorithm when treating string columns as numeric dimensions. This fixes a similar bug: grouping on string-cast-to-long would fail to merge two strings that should have been combined.
- GroupByQuery: Cast numeric types to the expected output type before comparing them in compareDimsForLimitPushDown. This fixes apache#6123.
- GroupByQueryQueryToolChest: Convert Jackson-deserialized dimension values into the proper output type. This fixes an inconsistency between results that came from cache vs. not-cache: for example, Jackson sometimes deserializes integers as Integers and sometimes as Longs.
And the following code-cleanup changes, related to the fixes above:
- DimensionHandlerUtils: Introduce convertObjectToType, compareObjectsAsType, and converterFromTypeToType to make it easier to handle casting operations.
- TopN in general: Rename various "dimName" variables to "dimValue" where they actually represent dimension values. The old names were confusing.
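The first fix is about the order of casting and aggregating. The following is a minimal sketch of that idea only, not Druid code: the class and method names are hypothetical, and a plain map of counts stands in for the real aggregators. It contrasts grouping on the raw double (casting only when results are emitted) with casting to the long output type before grouping.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; none of these names exist in Druid.
final class CastBeforeAggregate
{
  // Groups on the raw input value. Casting the keys to long afterwards
  // would leave 1.2 and 1.7 as two separate groups that both read "1".
  static Map<Double, Integer> countByInputValue(double[] values)
  {
    Map<Double, Integer> counts = new LinkedHashMap<>();
    for (double v : values) {
      counts.merge(v, 1, Integer::sum);
    }
    return counts;
  }

  // Casts each value to the long output type first, then groups,
  // so values that truncate to the same long share one group.
  static Map<Long, Integer> countByOutputValue(double[] values)
  {
    Map<Long, Integer> counts = new LinkedHashMap<>();
    for (double v : values) {
      counts.merge((long) v, 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args)
  {
    double[] values = {1.2, 1.7, 2.0};
    System.out.println(countByInputValue(values));
    System.out.println(countByOutputValue(values));
  }
}
```

With the sample input, grouping on the input value yields three groups, while grouping on the output value yields two, with 1.2 and 1.7 merged under the key 1.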
| } | ||
| // Output type must be STRING in order for PooledTopNAlgorithm to make sense; so no need to convert value. |
Maybe add an assert about the output type here.
| { | ||
| @Override | ||
| public int getCardinality(DimensionSelector selector) | ||
| private final Function<Object, Comparable<?>> dimensionValueConverter; |
| DimensionSelector selector, | ||
| Aggregator[][] rowSelector, | ||
| Map<String, Aggregator[]> aggregatesStore | ||
| Map<Comparable, Aggregator[]> aggregatesStore |
Comparable and Comparable<?> are used inconsistently in this class
| return new NumericTopNColumnSelectorStrategy.OfDouble(); | ||
| if (ValueType.isNumeric(dimensionType)) { | ||
| // Return strategy that aggregates using the _output_ type, because this allows us to collapse values | ||
| // properly (numeric types cannot represent all values of other numeric types). |
FWIW double represents all values of float
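To make the representability point concrete, here is a small standalone check (a sketch with hypothetical names, unrelated to Druid's classes). It tests whether a value survives a round trip through another numeric type: every float survives a trip through double (NaN aside, since NaN never equals itself), but not every long survives double, and not every double survives float.

```java
// Hypothetical illustration of which numeric casts are lossless.
final class NumericRepresentability
{
  // True if the long is exactly representable as a double.
  static boolean longSurvivesDouble(long v)
  {
    return (long) (double) v == v;
  }

  // True for every non-NaN float: widening float to double is exact.
  static boolean floatSurvivesDouble(float v)
  {
    return (float) (double) v == v;
  }

  // True only if the double is exactly representable as a float.
  static boolean doubleSurvivesFloat(double v)
  {
    return (double) (float) v == v;
  }

  public static void main(String[] args)
  {
    System.out.println(longSurvivesDouble((1L << 53) + 1));
    System.out.println(floatSurvivesDouble(0.1f));
    System.out.println(doubleSurvivesFloat(0.1));
  }
}
```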
| // properly (numeric types cannot represent all values of other numeric types). | ||
| return NumericTopNColumnSelectorStrategy.ofType(dimensionType, dimensionType); | ||
| } else { | ||
| // Return strategy that aggregates using the _input_ type. Here we are assuming that the output type can |
Despite those comments, it's still hard to understand what is going on here. Maybe you could try to clarify it more?
| @Nullable | ||
| public static Comparable<?> convertObjectToType( | ||
| @Nullable final Object obj, |
* Fix four bugs with numeric dimension output types.
* Remove unused imports.
@fjy please don't merge PRs with unanswered comments (despite the approval); otherwise, why are they left?
Fwiw, I had already started addressing some of those comments, so I plan to raise a follow-up PR for that. Btw, @leventov what did you intend by doing an approval + also leaving comments? I'd interpret that as "please consider these comments, but if you don't want to do them, I am ok with that." I thought they were all reasonable comments so that's why I'm doing a follow-up.
Adjustments to comments and usage of generics.
Follow-up is in #6231.
Thanks. Yes, it means non-blocking comments; however, answering them before merging is appreciated.
Similar to other bugs fixed in apache#6220, but this one was missed. This bug could cause "extraction" dimensionSpecs on the "__time" column with non-STRING outputTypes to sometimes be output as STRING instead of LONG, causing incompletely merged results.
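Why a mix of STRING and LONG outputs leads to incompletely merged results can be seen in a small sketch. This is an illustration under assumed names, not Druid's merge logic: a hash map of counts stands in for result merging, and the same timestamp arrives once as a String and once as a Long.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; none of these names exist in Druid.
final class MixedKeyMerge
{
  // Merges on the key as received. A String and a Long never compare equal,
  // so the same logical value ends up in two separate entries.
  static Map<Object, Long> countByRawKey(Object[] keys)
  {
    Map<Object, Long> counts = new HashMap<>();
    for (Object key : keys) {
      counts.merge(key, 1L, Long::sum);
    }
    return counts;
  }

  // Converts every key to the long output type before merging.
  static Map<Long, Long> countByLongKey(Object[] keys)
  {
    Map<Long, Long> counts = new HashMap<>();
    for (Object key : keys) {
      long asLong = key instanceof Number
                    ? ((Number) key).longValue()
                    : Long.parseLong(key.toString());
      counts.merge(asLong, 1L, Long::sum);
    }
    return counts;
  }

  public static void main(String[] args)
  {
    Object[] keys = {"1530000000000", 1530000000000L};
    System.out.println(countByRawKey(keys).size());
    System.out.println(countByLongKey(keys).size());
  }
}
```

With the two sample keys, the raw merge keeps two entries, while the converted merge keeps one entry with a count of 2.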
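This patch includes the following bug fixes:
- TopNColumnSelectorStrategyFactory: Cast dimension values to the output type during dimExtractionScanAndAggregate instead of updateDimExtractionResults. This fixes a bug where, for example, grouping on doubles-cast-to-longs would fail to merge two doubles that should have been combined into the same long value.
- TopNQueryEngine: Use DimExtractionTopNAlgorithm when treating string columns as numeric dimensions. This fixes a similar bug: grouping on string-cast-to-long would fail to merge two strings that should have been combined.
- GroupByQuery: Cast numeric types to the expected output type before comparing them in compareDimsForLimitPushDown. This fixes #6123 (ClassCastException in groupBy when sorting on numeric columns containing nulls).
- GroupByQueryQueryToolChest: Convert Jackson-deserialized dimension values into the proper output type. This fixes an inconsistency between results that came from cache vs. not-cache: for example, Jackson sometimes deserializes integers as Integers and sometimes as Longs.
And the following code-cleanup changes, related to the fixes above:
- DimensionHandlerUtils: Introduce convertObjectToType, compareObjectsAsType, and converterFromTypeToType to make it easier to handle casting operations.
- TopN in general: Rename various "dimName" variables to "dimValue" where they actually represent dimension values. The old names were confusing.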
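The cache inconsistency comes down to boxed types: an Integer 5 and a Long 5 are not equal as objects, so the same value can look different depending on how it was deserialized. The sketch below shows one way a conversion helper could normalize such values. It is an assumption-laden illustration, not the actual convertObjectToType from DimensionHandlerUtils: the `OutputType` enum, the `coerce` name, and the parsing rules are all hypothetical, and no Jackson call is made.

```java
// Hypothetical sketch; not the signature or behavior of Druid's helpers.
final class OutputTypeCoercion
{
  enum OutputType { STRING, LONG, FLOAT, DOUBLE }

  // Converts a loosely typed value (for example an Integer or Long produced
  // by deserialization) into the boxed type matching the output type.
  static Comparable<?> coerce(Object value, OutputType type)
  {
    if (value == null) {
      return null;
    }
    switch (type) {
      case STRING:
        return String.valueOf(value);
      case LONG:
        return value instanceof Number
               ? Long.valueOf(((Number) value).longValue())
               : Long.valueOf(Long.parseLong(value.toString()));
      case FLOAT:
        return value instanceof Number
               ? Float.valueOf(((Number) value).floatValue())
               : Float.valueOf(Float.parseFloat(value.toString()));
      case DOUBLE:
        return value instanceof Number
               ? Double.valueOf(((Number) value).doubleValue())
               : Double.valueOf(Double.parseDouble(value.toString()));
      default:
        throw new IllegalArgumentException("Unknown type: " + type);
    }
  }

  public static void main(String[] args)
  {
    Object fromCache = Integer.valueOf(5); // as if deserialized as Integer
    Object fresh = Long.valueOf(5L);       // as if deserialized as Long
    System.out.println(fromCache.equals(fresh));
    System.out.println(coerce(fromCache, OutputType.LONG).equals(coerce(fresh, OutputType.LONG)));
  }
}
```

Before coercion the two values compare as unequal; after coercing both to LONG they compare as equal, which is the property result merging and comparison rely on.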