
first/last aggregators and nulls #9161

Merged
jon-wei merged 7 commits into apache:master from clintropolis:first-last-agg-null-handling
Jan 20, 2020

Conversation

@clintropolis
Member

@clintropolis clintropolis commented Jan 10, 2020

Description

This PR changes the behavior of first/last aggregators to respect null values, making them more consistent with SQL. The existing behavior can be obtained by filtering numeric columns to exclude null values.

I did some additional digging related to the findings of #9154 and #9159, and managed to hit an NPE while sorting a top-n query by a 'long first' aggregator. Checking out the comparators, I assumed it had similar issues to #9159 and would be a quick/simple change, but how wrong I was.

It turns out the 'first'/'last' family of aggregators was building on top of NullableNumericAggregatorFactory, following #8834, though the lack of correct SQL-compatible null handling definitely predates that change. The problem with using NullableNumericAggregatorFactory is that the 'first'/'last' aggregators aren't aggregating numbers, but rather a complex type: a SerializablePair containing the timestamp and the number column value. This made the aggregator result for a row itself be null, rather than a pair containing the timestamp and a null right-hand-side value, which caused all sorts of funny business down the line when sorting or combining values.

For a solution, I did some heavy refactoring, pulling out NumericFirstAggregator, NumericFirstBufferAggregator, NumericLastAggregator, and NumericLastBufferAggregator. Further refactoring could likely pull out an additional base type shared between the first and last aggregators and buffer aggregators, because the only differences are the time comparison and the initial time values, but it gets kind of messy because they aren't in the same package...

Common code for tracking whether the current value is null, for both on-heap and buffer aggregators, now lives in these types; the comparators are fixed up and tests added, so I think these aggregators now handle null numbers correctly.

Note: group by v1 queries will incorrectly report 0 for null numeric values in SQL-compatible mode, since I haven't implemented a ComplexMetricSerde for long/long, long/float, and long/double pairs.
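The buffer layout discussed in the review below (an 8-byte timestamp, a 1-byte null marker, then the value) can be sketched with plain JDK `ByteBuffer` calls. This is an illustrative sketch only: `NullableFirstSlot` and its offsets are assumptions for demonstration, not Druid's actual classes.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of a nullable "first" buffer slot:
// [ 8-byte timestamp ][ 1-byte null marker ][ 8-byte double value ].
public class NullableFirstSlot {
    static final int NULL_OFFSET = Long.BYTES;        // null marker follows the timestamp
    static final int VALUE_OFFSET = NULL_OFFSET + 1;  // value follows the null marker
    static final byte IS_NULL = 1;
    static final byte NOT_NULL = 0;

    static int maxIntermediateSize() {
        // time + null marker + value, matching the "+ 1" discussed in the review
        return Long.BYTES + Byte.BYTES + Double.BYTES;
    }

    static void init(ByteBuffer buf, int position) {
        buf.putLong(position, Long.MAX_VALUE);        // "first" starts at the latest possible time
        buf.put(position + NULL_OFFSET, IS_NULL);     // initially null (in SQL-compatible mode)
        buf.putDouble(position + VALUE_OFFSET, 0.0);
    }

    static void updateTimeWithValue(ByteBuffer buf, int position, long time, double value) {
        buf.putLong(position, time);
        buf.put(position + NULL_OFFSET, NOT_NULL);    // clear the null marker before the value
        buf.putDouble(position + VALUE_OFFSET, value);
    }

    static boolean isValueNull(ByteBuffer buf, int position) {
        return buf.get(position + NULL_OFFSET) == IS_NULL;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(maxIntermediateSize());
        init(buf, 0);
        System.out.println(isValueNull(buf, 0));      // true before any row is seen
        updateTimeWithValue(buf, 0, 1579478400000L, 3.5);
        System.out.println(isValueNull(buf, 0) + " " + buf.getDouble(VALUE_OFFSET));
    }
}
```

Because the null state lives in the buffer itself rather than in a wrapping NullableNumericAggregatorFactory, the aggregator can still return a (timestamp, null) pair instead of collapsing the whole pair to null.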


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths.
  • added integration tests.
  • been tested in a test Druid cluster.

…end nullable numeric agg since they are complex typed aggs
@clintropolis clintropolis removed the WIP label Jan 10, 2020
this.valueSelector = valueSelector;
}

abstract void initValue(ByteBuffer buf, int position);
Contributor

Can you add javadocs for these buffer methods?

void updateTimeWithValue(ByteBuffer buf, int position, long time)
{
buf.putLong(position, time);
putValue(buf, position);
Contributor

nit: to follow the ordering in the buffer, maybe move putValue call after the null marker setting

Contributor

@jihoonson jihoonson left a comment

Thanks, +1. Left trivial comments.


boolean isValueNull(ByteBuffer buf, int position)
{
return buf.get(position + NULL_OFFSET) == 1;
Contributor

== RHS_NULL?

Member Author

switching these to use NullHandling.IS_NULL_BYTE and NullHandling.IS_NOT_NULL_BYTE

public int getMaxIntermediateSize()
{
return Long.BYTES + Double.BYTES;
return Long.BYTES + Double.BYTES + 1;
Contributor

Byte.BYTES. Or perhaps adding a new variable NULL_SIZE?

Member Author

@clintropolis clintropolis Jan 17, 2020

it seems unlikely to me that the number of bytes in a byte is going to change, but ok :p

@jon-wei jon-wei added this to the 0.17.0 milestone Jan 18, 2020
@clintropolis clintropolis added this to the 0.17.0 milestone Jan 18, 2020
@clintropolis clintropolis changed the title refactor numeric first/last aggregators to fix null handling bugs first/last aggregators and nulls Jan 18, 2020
@jon-wei jon-wei merged commit 8011211 into apache:master Jan 20, 2020
Contributor

@suneet-s suneet-s left a comment

Very nice! Mostly comments to help with my understanding.

return rhs;
}

public static <T1, T2> Comparator<SerializablePair<T1, T2>> createNullHandlingComparator(
Contributor

looks like this is missing unit tests? Also javadocs since this is a utility that would be used by many other classes

Member Author

This is covered by calcite query tests that order by the first/last aggregators; afaik SerializablePair is only used by first/last despite its generic name, to store a timestamp and value.
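A null-handling pair comparator like the one discussed here can be sketched with plain JDK comparators. The `Pair` record and `createNullHandlingComparator` signature below are illustrative stand-ins, not Druid's actual `SerializablePair` API:

```java
import java.util.Comparator;

// Sketch of a comparator over (timestamp, value) pairs that orders by the
// right-hand value while tolerating null values, instead of throwing an NPE.
public class PairComparators {
    record Pair<T1, T2>(T1 lhs, T2 rhs) {}

    static <T1, T2> Comparator<Pair<T1, T2>> createNullHandlingComparator(
            Comparator<T2> valueComparator,
            boolean nullsFirst
    ) {
        // nullsFirst/nullsLast make the value comparison total over null values
        Comparator<T2> nullAware = nullsFirst
                ? Comparator.nullsFirst(valueComparator)
                : Comparator.nullsLast(valueComparator);
        return Comparator.comparing((Pair<T1, T2> p) -> p.rhs(), nullAware);
    }

    public static void main(String[] args) {
        Comparator<Pair<Long, Double>> cmp =
                createNullHandlingComparator(Double::compare, true);
        Pair<Long, Double> nullPair = new Pair<>(1L, null);
        Pair<Long, Double> valued = new Pair<>(2L, 3.5);
        // Ordering by rhs no longer dereferences null, fixing the top-n sort NPE
        System.out.println(cmp.compare(nullPair, valued) < 0); // true: nulls sort first
    }
}
```

The old `Comparator.comparingDouble(o -> o.rhs)` unboxed the right-hand value unconditionally, which is exactly where a null rhs would throw.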

#### `doubleFirst` aggregator

`doubleFirst` computes the metric value with the minimum timestamp or 0 if no row exist
`doubleFirst` computes the metric value with the minimum timestamp or 0 in default mode or `null` in SQL compatible mode if no row exist
Contributor

Are you referring to this property - https://github.com/apache/druid/blob/master/docs/configuration/index.md#sql-compatible-null-handling ? Would be nice to link to the configuration here.

nit: I'd re-phrase slightly

computes the metric value with the minimum timestamp. If no row exists, it will return 0 or `null` if [SQL compatible mode](../configuration/index.md#sql-compatible-null-handling) is enabled

import org.apache.druid.segment.BaseLongColumnValueSelector;

public class DoubleFirstAggregator implements Aggregator
public class DoubleFirstAggregator extends NumericFirstAggregator<BaseDoubleColumnValueSelector>
Contributor

Nice abstraction! 🎉

note to self: can the get call be abstracted into the base class?

Member Author

Not without boxing the primitive

Member Author

Well, I guess that is going to happen anyway in making the pair.. so I guess maybe the on heap version of get could be shared, but not really possible for the buffer aggregator.


public static final Comparator<SerializablePair<Long, Double>> VALUE_COMPARATOR =
Comparator.comparingDouble(o -> o.rhs);
SerializablePair.createNullHandlingComparator(Double::compare, true);
Contributor

It took me a long time to figure out what the comparator was used for. I got wrapped up in the fact that the aggregator was meant to compare timestamps, so I didn't realize this was for ordering. I think a javadoc on AggregatorFactory#getComparator would have cleared up my confusion pretty quickly

public Aggregator factorize(ColumnSelectorFactory metricFactory, ColumnValueSelector selector)
public Aggregator factorize(ColumnSelectorFactory metricFactory)
{
final ColumnValueSelector<SerializablePair<Long, Double>> selector =
Contributor

Based on javadocs in makeColumnValueSelector this selector can be NilColumnValueSelector

in which case selector.getObject() on line 170 would return null and line 171 would throw an NPE?
similar comment for the factorizeBuffered method

Member Author

Since this is for combining, the selector will be of the serialized pair complex objects, which will not be null.
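The combine-time invariant described here (the intermediate pairs are never null, only their right-hand value may be) can be sketched as picking the pair with the earlier timestamp. `SerializablePair` and `combineFirst` below are hypothetical stand-ins for illustration, not Druid's factory code:

```java
// Sketch of "first" combining: given two (timestamp, value) intermediate pairs,
// keep the one with the earlier timestamp; the value itself may be null in
// SQL-compatible mode, but the pair objects themselves are never null here.
public class FirstCombiner {
    record SerializablePair<T1, T2>(T1 lhs, T2 rhs) {}

    static SerializablePair<Long, Double> combineFirst(
            SerializablePair<Long, Double> lhs,
            SerializablePair<Long, Double> rhs
    ) {
        // Comparing timestamps directly is safe because only rhs() may be null
        return lhs.lhs() <= rhs.lhs() ? lhs : rhs;
    }

    public static void main(String[] args) {
        SerializablePair<Long, Double> earlierButNull = new SerializablePair<>(1L, null);
        SerializablePair<Long, Double> laterWithValue = new SerializablePair<>(2L, 3.5);
        // The earlier row wins even though its value is null, which is the
        // behavior this PR fixes the aggregators to preserve.
        System.out.println(combineFirst(earlierButNull, laterWithValue).rhs()); // null
    }
}
```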

import java.util.Comparator;

public class DoubleFirstAggregationTest
public class DoubleFirstAggregationTest extends InitializedNullHandlingTest
Contributor

it looks like these tests will only check useDefaultValuesForNull = true or is there some config I'm not seeing that sets it to false as well in another run?

Member Author

Travis runs tests in both modes; this is to let tests run in IntelliJ, which doesn't get initialized correctly

clintropolis added a commit to clintropolis/druid that referenced this pull request Jan 21, 2020
* null handling for numeric first/last aggregators, refactor to not extend nullable numeric agg since they are complex typed aggs

* initially null or not based on config

* review stuff, make string first/last consistent with null handling of numeric columns, more tests

* docs

* handle nil selectors, revert to primitive first/last types so groupby v1 works...
@clintropolis clintropolis deleted the first-last-agg-null-handling branch January 21, 2020 02:38
fjy pushed a commit that referenced this pull request Jan 21, 2020
* null handling for numeric first/last aggregators, refactor to not extend nullable numeric agg since they are complex typed aggs

* initially null or not based on config

* review stuff, make string first/last consistent with null handling of numeric columns, more tests

* docs

* handle nil selectors, revert to primitive first/last types so groupby v1 works...
@jon-wei jon-wei mentioned this pull request Jan 21, 2020
jon-wei added a commit to implydata/druid-public that referenced this pull request Jan 23, 2020
* add middle manager and indexer worker category to tier column of services view (apache#9158) (apache#9167)

* Graduation update for ASF release process guide and download links (apache#9126) (apache#9160)

* Graduation update for ASF release process guide and download links

* Fix release vote thread typo

* Fix pom.xml

* Add numeric nulls to sample data, fix some numeric null handling issues (apache#9154) (apache#9175)

* Fix LongSumAggregator comparator null handling

* Remove unneeded GroupBy test change

* Checkstyle

* Update other processing tests for new sample data

* Remove unused code

* Fix SearchQueryRunner column selectors

* Fix DimensionIndexer null handling and ScanQueryRunnerTest

* Fix TeamCity errors

* Add jackson-mapper-asl for hdfs-storage extension (apache#9178) (apache#9185)

Previously jackson-mapper-asl was excluded to remove a security
vulnerability; however, it is required for functionality (e.g.,
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator).

* Suppress CVE-2019-20330 for htrace-core-4.0.1 (apache#9189) (apache#9191)

CVE-2019-20330 was updated on 14 Jan 2020, which now gets flagged by the
security vulnerability scan. Since the CVE is for jackson-databind, via
htrace-core-4.0.1, it can be added to the existing list of security
vulnerability suppressions for that dependency.

* Fix deserialization of maxBytesInMemory (apache#9092) (apache#9170)

* Fix deserialization of maxBytesInMemory

* Add maxBytes check

Co-authored-by: Atul Mohan <atulmohan.mec@gmail.com>

* Update Kinesis resharding information about task failures (apache#9104) (apache#9201)

* fix refresh button (apache#9195) (apache#9203)

Co-authored-by: Vadim Ogievetsky <vadimon@gmail.com>

* allow empty values to be set in the auto form (apache#9198) (apache#9206)

Co-authored-by: Vadim Ogievetsky <vadimon@gmail.com>

* fix null handling for arithmetic post aggregator comparator (apache#9159) (apache#9202)

* fix null handling for arithmetic postagg comparator, add test for comparator for min/max/quantile postaggs in histogram ext

* fix

* Link javaOpts to middlemanager runtime.properties docs (apache#9101) (apache#9204)

* Link javaOpts to middlemanager runtime.properties docs

* fix broken link

* reword config links

* Tutorials use new ingestion spec where possible (apache#9155) (apache#9205)

* Tutorials use new ingestion spec where possible

There are 2 main changes
  * Use task type index_parallel instead of index
  * Remove the use of parser + firehose in favor of inputFormat + inputSource

index_parallel is the preferred method starting in 0.17. Setting the job to
index_parallel with the default maxNumConcurrentSubTasks(1) is the equivalent
of an index task

Instead of using a parserSpec, dimensionSpec and timestampSpec have been
promoted to the dataSchema. The format is described in the ioConfig as the
inputFormat.

There are a few cases where the new format is not supported
 * Hadoop must use firehoses instead of the inputSource and inputFormat
 * There is no equivalent of a combining firehose as an inputSource
 * A Combining firehose does not support index_parallel

* fix typo

* Fix TSV bugs (apache#9199) (apache#9213)

* working

* - support multi-char delimiter for tsv
- respect "delimiter" property for tsv

* default value check for findColumnsFromHeader

* remove CSVParser to have a true and only CSVParser

* fix tests

* fix another test

* Fix LATEST / EARLIEST Buffer Aggregator does not work on String column  (apache#9197) (apache#9210)

* fix buff limit bug

* add tests

* add test

* add tests

* fix checkstyle

* Doc update for the new input source and the new input format (apache#9171) (apache#9214)

* Doc update for new input source and input format.

- The input source and input format are promoted in all docs under docs/ingestion
- All input sources including core extension ones are located in docs/ingestion/native-batch.md
- All input formats and parsers including core extension ones are localted in docs/ingestion/data-formats.md
- New behavior of the parallel task with different partitionsSpecs are documented in docs/ingestion/native-batch.md

* parquet

* add warning for range partitioning with sequential mode

* hdfs + s3, gs

* add fs impl for gs

* address comments

* address comments

* gcs

* [0.17.0] Speed up String first/last aggregators when folding isn't needed. (apache#9181) (apache#9215)

* Speed up String first/last aggregators when folding isn't needed. (apache#9181)

* Speed up String first/last aggregators when folding isn't needed.

Examines the value column, and disables fold checking via a needsFoldCheck
flag if that column can't possibly contain SerializableLongStringPairs. This
is helpful because it avoids calling getObject on the value selector when
unnecessary; say, because the time selector didn't yield an earlier or later
value.

* PR comments.

* Move fastLooseChop to StringUtils.

* actually fix conflict correctly

* remove unused import

Co-authored-by: Gian Merlino <gianmerlino@gmail.com>

* fix topn aggregation on numeric columns with null values (apache#9183) (apache#9219)

* fix topn issue with aggregating on numeric columns with null values

* adjustments

* rename

* add more tests

* fix comments

* more javadocs

* computeIfAbsent

* first/last aggregators and nulls (apache#9161) (apache#9233)

* null handling for numeric first/last aggregators, refactor to not extend nullable numeric agg since they are complex typed aggs

* initially null or not based on config

* review stuff, make string first/last consistent with null handling of numeric columns, more tests

* docs

* handle nil selectors, revert to primitive first/last types so groupby v1 works...

* Minor doc updates (apache#9217) (apache#9230)

* update string first last aggs

* update kafka ingestion specs in docs

* remove unnecessary parser spec

* [Backport] Update docs for extensions (apache#9218) (apache#9228)

Backport of apache#9218 to 0.17.0.

* More tests for range partition parallel indexing (apache#9232) (apache#9236)

Add more unit tests for range partition native batch parallel indexing.

Also, fix a bug where ParallelIndexPhaseRunner incorrectly thinks that
identical collected DimensionDistributionReports are not equal due to
not overriding equals() in DimensionDistributionReport.

* Support both IndexTuningConfig and ParallelIndexTuningConfig for compaction task (apache#9222) (apache#9237)

* Support both IndexTuningConfig and ParallelIndexTuningConfig for compaction task

* tuningConfig module

* fix tests

Co-authored-by: Clint Wylie <cjwylie@gmail.com>
Co-authored-by: Chi Cao Minh <chi.caominh@gmail.com>
Co-authored-by: Atul Mohan <atulmohan.mec@gmail.com>
Co-authored-by: Vadim Ogievetsky <vadimon@gmail.com>
Co-authored-by: Suneet Saldanha <44787917+suneet-s@users.noreply.github.com>
Co-authored-by: Jihoon Son <jihoonson@apache.org>
Co-authored-by: Maytas Monsereenusorn <52679095+maytasm3@users.noreply.github.com>
Co-authored-by: Gian Merlino <gianmerlino@gmail.com>