Skip to content

Fix four bugs with numeric dimension output types.#6220

Merged
fjy merged 2 commits intoapache:masterfrom
gianm:fix-some-numeric-dim-stuff
Aug 25, 2018
Merged

Fix four bugs with numeric dimension output types.#6220
fjy merged 2 commits intoapache:masterfrom
gianm:fix-some-numeric-dim-stuff

Conversation

@gianm
Copy link
Copy Markdown
Contributor

@gianm gianm commented Aug 22, 2018

This patch includes the following bug fixes:

  • TopNColumnSelectorStrategyFactory: Cast dimension values to the output type
    during dimExtractionScanAndAggregate instead of updateDimExtractionResults.
    This fixes a bug where, for example, grouping on doubles-cast-to-longs would
    fail to merge two doubles that should have been combined into the same long value.
  • TopNQueryEngine: Use DimExtractionTopNAlgorithm when treating string columns
    as numeric dimensions. This fixes a similar bug: grouping on string-cast-to-long
    would fail to merge two strings that should have been combined.
  • GroupByQuery: Cast numeric types to the expected output type before comparing them
    in compareDimsForLimitPushDown. This fixes ClassCastException in groupBy when sorting on numeric columns containing nulls #6123.
  • GroupByQueryQueryToolChest: Convert Jackson-deserialized dimension values into
    the proper output type. This fixes an inconsistency between results that came
    from cache vs. not-cache: for example, Jackson sometimes deserializes integers
    as Integers and sometimes as Longs.

And the following code-cleanup changes, related to the fixes above:

  • DimensionHandlerUtils: Introduce convertObjectToType, compareObjectsAsType,
    and converterFromTypeToType to make it easier to handle casting operations.
  • TopN in general: Rename various "dimName" variables to "dimValue" where they
    actually represent dimension values. The old names were confusing.

This patch includes the following bug fixes:

- TopNColumnSelectorStrategyFactory: Cast dimension values to the output type
  during dimExtractionScanAndAggregate instead of updateDimExtractionResults.
  This fixes a bug where, for example, grouping on doubles-cast-to-longs would
  fail to merge two doubles that should have been combined into the same long value.
- TopNQueryEngine: Use DimExtractionTopNAlgorithm when treating string columns
  as numeric dimensions. This fixes a similar bug: grouping on string-cast-to-long
  would fail to merge two strings that should have been combined.
- GroupByQuery: Cast numeric types to the expected output type before comparing them
  in compareDimsForLimitPushDown. This fixes apache#6123.
- GroupByQueryQueryToolChest: Convert Jackson-deserialized dimension values into
  the proper output type. This fixes an inconsistency between results that came
  from cache vs. not-cache: for example, Jackson sometimes deserializes integers
  as Integers and sometimes as Longs.

And the following code-cleanup changes, related to the fixes above:

- DimensionHandlerUtils: Introduce convertObjectToType, compareObjectsAsType,
  and converterFromTypeToType to make it easier to handle casting operations.
- TopN in general: Rename various "dimName" variables to "dimValue" where they
  actually represent dimension values. The old names were confusing.
@gianm gianm added this to the 0.12.3 milestone Aug 22, 2018
Copy link
Copy Markdown
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice 🤘

}


// Output type must be STRING in order for PooledTopNAlgorithm to make sense; so no need to convert value.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add an assert about the output type here.

{
@Override
public int getCardinality(DimensionSelector selector)
private final Function<Object, Comparable<?>> dimensionValueConverter;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Function from String?

DimensionSelector selector,
Aggregator[][] rowSelector,
Map<String, Aggregator[]> aggregatesStore
Map<Comparable, Aggregator[]> aggregatesStore
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comparable and Comparable<?> are used inconsistently in this class

return new NumericTopNColumnSelectorStrategy.OfDouble();
if (ValueType.isNumeric(dimensionType)) {
// Return strategy that aggregates using the _output_ type, because this allows us to collapse values
// properly (numeric types cannot represent all values of other numeric types).
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW double represents all values of float

// properly (numeric types cannot represent all values of other numeric types).
return NumericTopNColumnSelectorStrategy.ofType(dimensionType, dimensionType);
} else {
// Return strategy that aggregates using the _input_ type. Here we are assuming that the output type can
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Despite those comments, it's still hard to understand what is going on here. Maybe you could try to clarify it more?


@Nullable
public static Comparable<?> convertObjectToType(
@Nullable final Object obj,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary breakdown

@fjy fjy merged commit 23ba6f7 into apache:master Aug 25, 2018
gianm added a commit to gianm/druid that referenced this pull request Aug 25, 2018
* Fix four bugs with numeric dimension output types.

This patch includes the following bug fixes:

- TopNColumnSelectorStrategyFactory: Cast dimension values to the output type
  during dimExtractionScanAndAggregate instead of updateDimExtractionResults.
  This fixes a bug where, for example, grouping on doubles-cast-to-longs would
  fail to merge two doubles that should have been combined into the same long value.
- TopNQueryEngine: Use DimExtractionTopNAlgorithm when treating string columns
  as numeric dimensions. This fixes a similar bug: grouping on string-cast-to-long
  would fail to merge two strings that should have been combined.
- GroupByQuery: Cast numeric types to the expected output type before comparing them
  in compareDimsForLimitPushDown. This fixes apache#6123.
- GroupByQueryQueryToolChest: Convert Jackson-deserialized dimension values into
  the proper output type. This fixes an inconsistency between results that came
  from cache vs. not-cache: for example, Jackson sometimes deserializes integers
  as Integers and sometimes as Longs.

And the following code-cleanup changes, related to the fixes above:

- DimensionHandlerUtils: Introduce convertObjectToType, compareObjectsAsType,
  and converterFromTypeToType to make it easier to handle casting operations.
- TopN in general: Rename various "dimName" variables to "dimValue" where they
  actually represent dimension values. The old names were confusing.

* Remove unused imports.
@gianm gianm deleted the fix-some-numeric-dim-stuff branch August 25, 2018 23:31
@leventov
Copy link
Copy Markdown
Member

@fjy please don't merge PRs with unanswered comments (despite the approval), otherwise why they are left?

@gianm
Copy link
Copy Markdown
Contributor Author

gianm commented Aug 25, 2018

Fwiw, I had already started addressing some of those comments, so I plan to raise a follow up PR for that.

Btw, @leventov what did you intend by doing an approval + also leaving comments? I'd interpret that as "please consider these comments, but if you don't want to do them, I am ok with that." I thought they were all reasonable comments so that's why I'm doing a follow up.

gianm added a commit to gianm/druid that referenced this pull request Aug 25, 2018
Adjustments to comments and usage of generics.
@gianm
Copy link
Copy Markdown
Contributor Author

gianm commented Aug 25, 2018

Follow-up is in #6231.

@leventov
Copy link
Copy Markdown
Member

Thanks.

Yes it means non-blocking comments, however answering them before merging is appreciated.

gianm added a commit to implydata/druid-public that referenced this pull request Aug 26, 2018
* Fix four bugs with numeric dimension output types.

This patch includes the following bug fixes:

- TopNColumnSelectorStrategyFactory: Cast dimension values to the output type
  during dimExtractionScanAndAggregate instead of updateDimExtractionResults.
  This fixes a bug where, for example, grouping on doubles-cast-to-longs would
  fail to merge two doubles that should have been combined into the same long value.
- TopNQueryEngine: Use DimExtractionTopNAlgorithm when treating string columns
  as numeric dimensions. This fixes a similar bug: grouping on string-cast-to-long
  would fail to merge two strings that should have been combined.
- GroupByQuery: Cast numeric types to the expected output type before comparing them
  in compareDimsForLimitPushDown. This fixes apache#6123.
- GroupByQueryQueryToolChest: Convert Jackson-deserialized dimension values into
  the proper output type. This fixes an inconsistency between results that came
  from cache vs. not-cache: for example, Jackson sometimes deserializes integers
  as Integers and sometimes as Longs.

And the following code-cleanup changes, related to the fixes above:

- DimensionHandlerUtils: Introduce convertObjectToType, compareObjectsAsType,
  and converterFromTypeToType to make it easier to handle casting operations.
- TopN in general: Rename various "dimName" variables to "dimValue" where they
  actually represent dimension values. The old names were confusing.

* Remove unused imports.
fjy pushed a commit that referenced this pull request Aug 26, 2018
* Fix four bugs with numeric dimension output types.

This patch includes the following bug fixes:

- TopNColumnSelectorStrategyFactory: Cast dimension values to the output type
  during dimExtractionScanAndAggregate instead of updateDimExtractionResults.
  This fixes a bug where, for example, grouping on doubles-cast-to-longs would
  fail to merge two doubles that should have been combined into the same long value.
- TopNQueryEngine: Use DimExtractionTopNAlgorithm when treating string columns
  as numeric dimensions. This fixes a similar bug: grouping on string-cast-to-long
  would fail to merge two strings that should have been combined.
- GroupByQuery: Cast numeric types to the expected output type before comparing them
  in compareDimsForLimitPushDown. This fixes #6123.
- GroupByQueryQueryToolChest: Convert Jackson-deserialized dimension values into
  the proper output type. This fixes an inconsistency between results that came
  from cache vs. not-cache: for example, Jackson sometimes deserializes integers
  as Integers and sometimes as Longs.

And the following code-cleanup changes, related to the fixes above:

- DimensionHandlerUtils: Introduce convertObjectToType, compareObjectsAsType,
  and converterFromTypeToType to make it easier to handle casting operations.
- TopN in general: Rename various "dimName" variables to "dimValue" where they
  actually represent dimension values. The old names were confusing.

* Remove unused imports.
leventov pushed a commit that referenced this pull request Aug 27, 2018
Adjustments to comments and usage of generics.
leventov pushed a commit to metamx/druid that referenced this pull request Aug 31, 2018
…che#6230)

* Fix four bugs with numeric dimension output types.

This patch includes the following bug fixes:

- TopNColumnSelectorStrategyFactory: Cast dimension values to the output type
  during dimExtractionScanAndAggregate instead of updateDimExtractionResults.
  This fixes a bug where, for example, grouping on doubles-cast-to-longs would
  fail to merge two doubles that should have been combined into the same long value.
- TopNQueryEngine: Use DimExtractionTopNAlgorithm when treating string columns
  as numeric dimensions. This fixes a similar bug: grouping on string-cast-to-long
  would fail to merge two strings that should have been combined.
- GroupByQuery: Cast numeric types to the expected output type before comparing them
  in compareDimsForLimitPushDown. This fixes apache#6123.
- GroupByQueryQueryToolChest: Convert Jackson-deserialized dimension values into
  the proper output type. This fixes an inconsistency between results that came
  from cache vs. not-cache: for example, Jackson sometimes deserializes integers
  as Integers and sometimes as Longs.

And the following code-cleanup changes, related to the fixes above:

- DimensionHandlerUtils: Introduce convertObjectToType, compareObjectsAsType,
  and converterFromTypeToType to make it easier to handle casting operations.
- TopN in general: Rename various "dimName" variables to "dimValue" where they
  actually represent dimension values. The old names were confusing.

* Remove unused imports.
gianm added a commit to gianm/druid that referenced this pull request Mar 13, 2019
Similar to other bugs fixed in apache#6220, but this one was missed. This bug would
cause "extraction" dimensionSpecs on the "__time" column with non-STRING
outputTypes to potentially be output as STRING sometimes instead of LONG,
causing incompletely merged results.
fjy pushed a commit that referenced this pull request Mar 13, 2019
Similar to other bugs fixed in #6220, but this one was missed. This bug would
cause "extraction" dimensionSpecs on the "__time" column with non-STRING
outputTypes to potentially be output as STRING sometimes instead of LONG,
causing incompletely merged results.
gianm added a commit to implydata/druid-public that referenced this pull request Mar 14, 2019
Similar to other bugs fixed in apache#6220, but this one was missed. This bug would
cause "extraction" dimensionSpecs on the "__time" column with non-STRING
outputTypes to potentially be output as STRING sometimes instead of LONG,
causing incompletely merged results.
gianm added a commit to implydata/druid-public that referenced this pull request Apr 8, 2019
Similar to other bugs fixed in apache#6220, but this one was missed. This bug would
cause "extraction" dimensionSpecs on the "__time" column with non-STRING
outputTypes to potentially be output as STRING sometimes instead of LONG,
causing incompletely merged results.
gianm added a commit to implydata/druid-public that referenced this pull request Apr 14, 2019
Similar to other bugs fixed in apache#6220, but this one was missed. This bug would
cause "extraction" dimensionSpecs on the "__time" column with non-STRING
outputTypes to potentially be output as STRING sometimes instead of LONG,
causing incompletely merged results.
clintropolis pushed a commit that referenced this pull request Apr 24, 2019
Similar to other bugs fixed in #6220, but this one was missed. This bug would
cause "extraction" dimensionSpecs on the "__time" column with non-STRING
outputTypes to potentially be output as STRING sometimes instead of LONG,
causing incompletely merged results.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ClassCastException in groupBy when sorting on numeric columns containing nulls

4 participants