fix topn aggregation on numeric columns with null values by clintropolis · Pull Request #9183 · apache/druid

clintropolis · 2020-01-14T10:15:56Z

Description

This PR refactors handling of TopN queries performing aggregations on numeric columns to handle null valued rows, instead of treating them as zero.

This is done by modifying TopNColumnSelectorStrategy to be both aggregate processor and aggregate store for DimExtractionTopNAlgorithm, and for numeric selectors will keep an extra Aggregator[] around for aggregating for the null value in addition to the type appropriate fastutils map that is currently in place.

To reflect the behavior changes, TopNColumnSelectorStrategy has been renamed TopNColumnAggregatesProcessor . It still does provide value cardinality to the other algorithms, and the changes should have no footprint change for things not using it.

Additionally, i renamed DimExtractionTopNAlgorithm to the more generic sounding HeapBasedTopNAlgorithm since it is also used for numeric selectors. I don't love this name, so if anyone has any better ideas I'll happily change it.

This PR has:

been self-reviewed.
- using the concurrency checklist (Remove this item if the PR doesn't have any relation to concurrency.)
added documentation for new or modified features or behaviors.
added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
added or updated version, license, or notice information in licenses.yaml
added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
added unit tests or modified existing tests to cover new code paths.
added integration tests.
been tested in a test Druid cluster.

…null-columns

jon-wei · 2020-01-17T02:56:05Z

+      Aggregator[][] rowSelector
+  )
+  {
+    initAggregateStore();


I think the initAggregateStore call could be moved into HeapBasedTopNAlgorithm.scanAndAggregate since both impls call it as the first step

jon-wei · 2020-01-17T02:57:49Z

+    this.converter = converter;
+  }
+
+  abstract Aggregator[] getValueAggregators(TopNQuery query, Selector selector, Cursor cursor);


Can you add javadocs for the abstract methods?

jon-wei

LGTM after CI

jihoonson · 2020-01-17T19:39:51Z

  )
  {
-    if (params.getCardinality() != TopNColumnSelectorStrategy.CARDINALITY_UNKNOWN) {
+    if (params.getCardinality() != TopNParams.CARDINALITY_UNKNOWN) {


nit: I wish someday we could remove duplicate static values for this..

I agree but do not want to settle this up in this PR, part of the problem is that different parts of the code use different values to indicate unknown cardinality...

jihoonson · 2020-01-17T19:48:38Z

-public class TopNColumnSelectorStrategyFactory implements ColumnSelectorStrategyFactory<TopNColumnSelectorStrategy>
+import java.util.function.Function;
+
+public class TopNColumnSelectorStrategyFactory


nit: TopNColumnAggregatesProcessorFactory?

jihoonson · 2020-01-17T19:54:19Z

+  Aggregator[] getValueAggregators(TopNQuery query, BaseLongColumnValueSelector selector, Cursor cursor)
+  {
+    long key = selector.getLong();
+    Aggregator[] aggs = aggregatesStore.get(key);


Probably better to use computeIfAbsent() since it computes hash less time.

Same for double and float types.

Yeah, i thought about that but was too lazy and just copied the old codes, will change.

jihoonson · 2020-01-18T02:11:26Z

TC failure doesn't seem legit.

* fix topn issue with aggregating on numeric columns with null values * adjustments * rename * add more tests * fix comments * more javadocs * computeIfAbsent

* add middle manager and indexer worker category to tier column of services view (apache#9158) (apache#9167) * Graduation update for ASF release process guide and download links (apache#9126) (apache#9160) * Graduation update for ASF release process guide and download links * Fix release vote thread typo * Fix pom.xml * Add numeric nulls to sample data, fix some numeric null handling issues (apache#9154) (apache#9175) * Fix LongSumAggregator comparator null handling * Remove unneeded GroupBy test change * Checkstyle * Update other processing tests for new sample data * Remove unused code * Fix SearchQueryRunner column selectors * Fix DimensionIndexer null handling and ScanQueryRunnerTest * Fix TeamCity errors * Add jackson-mapper-asl for hdfs-storage extension (apache#9178) (apache#9185) Previously jackson-mapper-asl was excluded to remove a security vulnerability; however, it is required for functionality (e.g., org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator). * Suppress CVE-2019-20330 for htrace-core-4.0.1 (apache#9189) (apache#9191) CVE-2019-20330 was updated on 14 Jan 2020, which now gets flagged by the security vulnerability scan. Since the CVE is for jackson-databind, via htrace-core-4.0.1, it can be added to the existing list of security vulnerability suppressions for that dependency. * Fix deserialization of maxBytesInMemory (apache#9092) (apache#9170) * Fix deserialization of maxBytesInMemory * Add maxBytes check Co-authored-by: Atul Mohan <atulmohan.mec@gmail.com> * Update Kinesis resharding information about task failures (apache#9104) (apache#9201) * fix refresh button (apache#9195) (apache#9203) Co-authored-by: Vadim Ogievetsky <vadimon@gmail.com> * allow empty values to be set in the auto form (apache#9198) (apache#9206) Co-authored-by: Vadim Ogievetsky <vadimon@gmail.com> * fix null handling for arithmetic post aggregator comparator (apache#9159) (apache#9202) * fix null handling for arithmetic postagg comparator, add test for comparator for min/max/quantile postaggs in histogram ext * fix * Link javaOpts to middlemanager runtime.properties docs (apache#9101) (apache#9204) * Link javaOpts to middlemanager runtime.properties docs * fix broken link * reword config links * Tutorials use new ingestion spec where possible (apache#9155) (apache#9205) * Tutorials use new ingestion spec where possible There are 2 main changes * Use task type index_parallel instead of index * Remove the use of parser + firehose in favor of inputFormat + inputSource index_parallel is the preferred method starting in 0.17. Setting the job to index_parallel with the default maxNumConcurrentSubTasks(1) is the equivalent of an index task Instead of using a parserSpec, dimensionSpec and timestampSpec have been promoted to the dataSchema. The format is described in the ioConfig as the inputFormat. There are a few cases where the new format is not supported * Hadoop must use firehoses instead of the inputSource and inputFormat * There is no equivalent of a combining firehose as an inputSource * A Combining firehose does not support index_parallel * fix typo * Fix TSV bugs (apache#9199) (apache#9213) * working * - support multi-char delimiter for tsv - respect "delimiter" property for tsv * default value check for findColumnsFromHeader * remove CSVParser to have a true and only CSVParser * fix tests * fix another test * Fix LATEST / EARLIEST Buffer Aggregator does not work on String column (apache#9197) (apache#9210) * fix buff limit bug * add tests * add test * add tests * fix checkstyle * Doc update for the new input source and the new input format (apache#9171) (apache#9214) * Doc update for new input source and input format. - The input source and input format are promoted in all docs under docs/ingestion - All input sources including core extension ones are located in docs/ingestion/native-batch.md - All input formats and parsers including core extension ones are localted in docs/ingestion/data-formats.md - New behavior of the parallel task with different partitionsSpecs are documented in docs/ingestion/native-batch.md * parquet * add warning for range partitioning with sequential mode * hdfs + s3, gs * add fs impl for gs * address comments * address comments * gcs * [0.17.0] Speed up String first/last aggregators when folding isn't needed. (apache#9181) (apache#9215) * Speed up String first/last aggregators when folding isn't needed. (apache#9181) * Speed up String first/last aggregators when folding isn't needed. Examines the value column, and disables fold checking via a needsFoldCheck flag if that column can't possibly contain SerializableLongStringPairs. This is helpful because it avoids calling getObject on the value selector when unnecessary; say, because the time selector didn't yield an earlier or later value. * PR comments. * Move fastLooseChop to StringUtils. * actually fix conflict correctly * remove unused import Co-authored-by: Gian Merlino <gianmerlino@gmail.com> * fix topn aggregation on numeric columns with null values (apache#9183) (apache#9219) * fix topn issue with aggregating on numeric columns with null values * adjustments * rename * add more tests * fix comments * more javadocs * computeIfAbsent * first/last aggregators and nulls (apache#9161) (apache#9233) * null handling for numeric first/last aggregators, refactor to not extend nullable numeric agg since they are complex typed aggs * initially null or not based on config * review stuff, make string first/last consistent with null handling of numeric columns, more tests * docs * handle nil selectors, revert to primitive first/last types so groupby v1 works... * Minor doc updates (apache#9217) (apache#9230) * update string first last aggs * update kafka ingestion specs in docs * remove unnecessary parser spec * [Backport] Update docs for extensions (apache#9218) (apache#9228) Backport of apache#9218 to 0.17.0. * More tests for range partition parallel indexing (apache#9232) (apache#9236) Add more unit tests for range partition native batch parallel indexing. Also, fix a bug where ParallelIndexPhaseRunner incorrectly thinks that identical collected DimensionDistributionReports are not equal due to not overriding equals() in DimensionDistributionReport. * Support both IndexTuningConfig and ParallelIndexTuningConfig for compaction task (apache#9222) (apache#9237) * Support both IndexTuningConfig and ParallelIndexTuningConfig for compaction task * tuningConfig module * fix tests Co-authored-by: Clint Wylie <cjwylie@gmail.com> Co-authored-by: Chi Cao Minh <chi.caominh@gmail.com> Co-authored-by: Atul Mohan <atulmohan.mec@gmail.com> Co-authored-by: Vadim Ogievetsky <vadimon@gmail.com> Co-authored-by: Suneet Saldanha <44787917+suneet-s@users.noreply.github.com> Co-authored-by: Jihoon Son <jihoonson@apache.org> Co-authored-by: Maytas Monsereenusorn <52679095+maytasm3@users.noreply.github.com> Co-authored-by: Gian Merlino <gianmerlino@gmail.com>

fix topn issue with aggregating on numeric columns with null values

ae47598

clintropolis added Bug Area - Querying WIP Area - Null Handling labels Jan 14, 2020

adjustments

c692751

jon-wei added this to the 0.17.0 milestone Jan 16, 2020

clintropolis added 2 commits January 16, 2020 14:44

Merge remote-tracking branch 'upstream/master' into topn-agg-numeric-…

7f91915

…null-columns

rename

515af98

clintropolis removed the WIP label Jan 17, 2020

clintropolis added 2 commits January 16, 2020 17:47

add more tests

777bf78

fix comments

ac2d66e

jon-wei reviewed Jan 17, 2020

View reviewed changes

more javadocs

6130e6a

jon-wei approved these changes Jan 17, 2020

View reviewed changes

jihoonson reviewed Jan 17, 2020

View reviewed changes

computeIfAbsent

3c9778f

jihoonson approved these changes Jan 18, 2020

View reviewed changes

jihoonson merged commit f0dddaa into apache:master Jan 18, 2020

clintropolis mentioned this pull request Jan 18, 2020

[Backport] fix topn aggregation on numeric columns with null values (#9183) #9219

Merged

clintropolis deleted the topn-agg-numeric-null-columns branch January 18, 2020 11:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix topn aggregation on numeric columns with null values#9183

fix topn aggregation on numeric columns with null values#9183
jihoonson merged 8 commits intoapache:masterfrom
clintropolis:topn-agg-numeric-null-columns

clintropolis commented Jan 14, 2020 •

edited

Loading

Uh oh!

jon-wei Jan 17, 2020

Uh oh!

jon-wei Jan 17, 2020

Uh oh!

jon-wei left a comment

Uh oh!

jihoonson Jan 17, 2020

Uh oh!

clintropolis Jan 17, 2020

Uh oh!

jihoonson Jan 17, 2020

Uh oh!

jihoonson Jan 17, 2020

Uh oh!

jihoonson Jan 17, 2020

Uh oh!

clintropolis Jan 17, 2020

Uh oh!

jihoonson commented Jan 18, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

clintropolis commented Jan 14, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jon-wei left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jihoonson commented Jan 18, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

clintropolis commented Jan 14, 2020 •

edited

Loading