refactor numeric primitive aggregators in sql compatible mode #10666

Closed
clintropolis wants to merge 4 commits into apache:master from clintropolis:null-numeric-aggregators-refactor

@clintropolis clintropolis commented Dec 10, 2020

Description

#10219 added hasNulls to ColumnCapabilities, which lets callers know whether a column has any null values in SQL compatible mode (druid.generic.useDefaultValueForNull=false). With this information available, NullableNumericAggregatorFactory can be modified to use a version of the null-aware wrapper aggregators that does not need to check isNull on each aggregate call. This should reduce the overhead of enabling this mode for columns which do not have null values (I haven't actually measured this yet, so I'm unsure of the difference).

To achieve this, I have pulled most of the logic of NullableNumericAggregator, NullableNumericBufferAggregator, and NullableNumericVectorAggregator into a new set of abstract classes, NullAwareNumericAggregator, NullAwareNumericBufferAggregator, and NullAwareNumericVectorAggregator respectively, which the former now extend. For processing columns which do not have null values, a new set of 'non-null' aggregator wrappers has been introduced: NonnullNumericAggregator, NonnullNumericBufferAggregator, and NonnullNumericVectorAggregator. These also extend the 'null-aware' base classes, so they initialize to a null value and remain compatible with the expectations of aggregator behavior with filtering (callers reasonably expect an aggregator 'get' to produce the correct result even if no values were aggregated), but they can skip the 'is null' check.
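As a rough illustration of the wrapper hierarchy described above, here is a minimal sketch. It does not use the real Druid interfaces: `LongSelector`, `LongAggregator`, `ArraySelector`, `LongSumSketch`, and the `*Sketch` wrappers are all hypothetical, simplified stand-ins, and the real classes in this PR may differ in detail.

```java
// Hypothetical, simplified stand-ins for Druid's selector and aggregator interfaces.
interface LongSelector {
  boolean isNull();
  long getLong();
}

interface LongAggregator {
  void aggregate();
  long getLong();
}

// Test-style selector backed by an array; 'pos' is the current row.
final class ArraySelector implements LongSelector {
  private final Long[] values;
  int pos;
  ArraySelector(Long[] values) { this.values = values; }
  @Override public boolean isNull() { return values[pos] == null; }
  @Override public long getLong() { return values[pos]; }
}

// Delegate that sums whatever the selector currently points at.
final class LongSumSketch implements LongAggregator {
  private final LongSelector selector;
  private long sum;
  LongSumSketch(LongSelector selector) { this.selector = selector; }
  @Override public void aggregate() { sum += selector.getLong(); }
  @Override public long getLong() { return sum; }
}

// Shared base: starts out null and becomes non-null once a value is aggregated,
// so get() is correct even if a filter meant nothing was ever aggregated.
abstract class NullAwareSketch {
  protected final LongAggregator delegate;
  protected boolean isNull = true;
  NullAwareSketch(LongAggregator delegate) { this.delegate = delegate; }
  abstract void aggregate();
  Long get() { return isNull ? null : delegate.getLong(); }
}

// For columns that may contain nulls: checks the selector on every row.
final class NullableSketch extends NullAwareSketch {
  private final LongSelector selector;
  NullableSketch(LongAggregator delegate, LongSelector selector) {
    super(delegate);
    this.selector = selector;
  }
  @Override void aggregate() {
    if (!selector.isNull()) {
      isNull = false;
      delegate.aggregate();
    }
  }
}

// For columns known to have no nulls: no per-row isNull() call.
final class NonnullSketch extends NullAwareSketch {
  NonnullSketch(LongAggregator delegate) { super(delegate); }
  @Override void aggregate() {
    isNull = false;
    delegate.aggregate();
  }
}

final class WrapperDemo {
  public static void main(String[] args) {
    ArraySelector rows = new ArraySelector(new Long[] {1L, 2L, 3L});
    NonnullSketch agg = new NonnullSketch(new LongSumSketch(rows));
    System.out.println("before any rows: " + agg.get());
    for (int i = 0; i < 3; i++) {
      rows.pos = i;
      agg.aggregate();
    }
    System.out.println("after all rows: " + agg.get());
  }
}
```

Both wrappers report null until the first value is aggregated; only the nullable variant pays for the `isNull()` call on every row.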

NullableNumericAggregatorFactory (which should probably more correctly be renamed NullAwareNumericAggregatorFactory, but it's marked @ExtensionPoint so I tried not to mess with it too much) has also been expanded to include a new method to check if the aggregator input has null values, with a default implementation of:

  /**
   * Returns true if the aggregator will actually produce null values given its input selectors, e.g. if
   * the inputs to the aggregator have any nulls.
   */
  protected boolean hasNulls(ColumnInspector inspector)
  {
    return sqlCompatible;
  }

with implementations that override in the 'simple' aggregator factories. Since there was a lot of duplicated code between the 'simple' aggregator factories (SimpleLongAggregatorFactory, SimpleFloatAggregatorFactory, SimpleDoubleAggregatorFactory), I have introduced yet another base class, SimpleNumericAggregatorFactory to consolidate all of the 'fieldName'/'expression' handling stuff.

public abstract class SimpleNumericAggregatorFactory<TValueSelector extends BaseNullableColumnValueSelector>
    extends NullableNumericAggregatorFactory<ColumnValueSelector>

which also handles the 'hasNulls' check override of NullableNumericAggregatorFactory. This class could probably be consolidated with NullableNumericAggregatorFactory, but since that class is @ExtensionPoint, I avoided making this change at this time.
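To make the intended selection logic concrete, here is a hedged sketch of how a 'simple' factory might override the check and pick a wrapper. Everything in it is an assumption for illustration: `FactorySketch`, `WrapperKind`, and the `Map<String, Boolean>` standing in for column capabilities are hypothetical, and treating unknown columns and expression inputs as "may have nulls" is a conservative guess rather than a description of the actual code in this PR.

```java
import java.util.HashMap;
import java.util.Map;

// Which wrapper (if any) goes around the delegate aggregator.
enum WrapperKind { NONE, NULLABLE, NONNULL }

// Hypothetical stand-in for a 'simple' numeric aggregator factory.
final class FactorySketch {
  private final boolean sqlCompatible;
  private final String fieldName; // null when the input is an expression

  FactorySketch(boolean sqlCompatible, String fieldName) {
    this.sqlCompatible = sqlCompatible;
    this.fieldName = fieldName;
  }

  // columnHasNulls maps column name -> whether it has nulls; a missing entry means unknown.
  boolean hasNulls(Map<String, Boolean> columnHasNulls) {
    if (!sqlCompatible) {
      return false; // default-value mode never produces nulls
    }
    if (fieldName == null) {
      return true; // expression input: assume nulls are possible
    }
    Boolean known = columnHasNulls.get(fieldName);
    return known == null || known; // unknown columns are treated as nullable
  }

  WrapperKind chooseWrapper(Map<String, Boolean> columnHasNulls) {
    if (!sqlCompatible) {
      return WrapperKind.NONE;
    }
    return hasNulls(columnHasNulls) ? WrapperKind.NULLABLE : WrapperKind.NONNULL;
  }
}

final class FactoryDemo {
  public static void main(String[] args) {
    Map<String, Boolean> columns = new HashMap<>();
    columns.put("clean", false);
    columns.put("sparse", true);
    System.out.println(new FactorySketch(true, "clean").chooseWrapper(columns));
    System.out.println(new FactorySketch(true, "sparse").chooseWrapper(columns));
    System.out.println(new FactorySketch(false, "sparse").chooseWrapper(columns));
  }
}
```

The point of the sketch is only that the cheaper 'non-null' wrapper is chosen when the column is positively known to have no nulls; every uncertain case falls back to the per-row check.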

I'll try to find some time to measure the difference, but naively it seems obvious that using the 'non-null' family of aggregators should be better, since it involves fewer method calls and fewer branches.

There is a potential further optimization: callers which "know" that there are values to aggregate (e.g. no filter) could avoid the extra byte of overhead used for null tracking on columns which don't have any null values. This would require modifying the 'factorize' methods to let callers communicate this information (aggregators don't know such things). I haven't made this modification yet since this PR was already starting to get a bit big, so it can be done as a follow-up.
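For context on the "extra byte", here is a small sketch of a buffer layout in which a null-tracking byte precedes the aggregated long. The class `NullByteBufferSketch`, its method names, and the marker values are assumptions chosen for illustration, not the actual buffer aggregator code.

```java
import java.nio.ByteBuffer;

// Hypothetical layout: [1 null-tracking byte][8-byte long sum].
final class NullByteBufferSketch {
  static final byte IS_NULL = 1;     // illustrative marker values
  static final byte IS_NOT_NULL = 0;

  // Space needed per aggregator slot when null tracking is kept.
  static int sizeWithNullByte() {
    return Byte.BYTES + Long.BYTES;
  }

  // Start out null so get() is correct even if nothing is aggregated.
  static void init(ByteBuffer buf, int position) {
    buf.put(position, IS_NULL);
    buf.putLong(position + Byte.BYTES, 0L);
  }

  // Mark the slot non-null and add the value to the running sum.
  static void aggregate(ByteBuffer buf, int position, long value) {
    buf.put(position, IS_NOT_NULL);
    int valuePosition = position + Byte.BYTES;
    buf.putLong(valuePosition, buf.getLong(valuePosition) + value);
  }

  static Long get(ByteBuffer buf, int position) {
    return buf.get(position) == IS_NULL ? null : buf.getLong(position + Byte.BYTES);
  }

  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.allocate(sizeWithNullByte());
    init(buf, 0);
    System.out.println("empty: " + get(buf, 0));
    aggregate(buf, 0, 5L);
    aggregate(buf, 0, 7L);
    System.out.println("sum: " + get(buf, 0));
  }
}
```

If a caller could guarantee that at least one row will be aggregated and the column has no nulls, the slot could in principle shrink to `Long.BYTES`, which is the follow-up described above.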


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Key changed/added classes in this PR
  • NullableNumericAggregatorFactory
  • NullAwareNumericAggregator
  • NullAwareNumericBufferAggregator
  • NullAwareNumericVectorAggregator
  • NullableNumericAggregator
  • NullableNumericBufferAggregator
  • NullableNumericVectorAggregator
  • NonnullNumericAggregator
  • NonnullNumericBufferAggregator
  • NonnullNumericVectorAggregator
  • SimpleNumericAggregatorFactory
  • SimpleLongAggregatorFactory
  • SimpleFloatAggregatorFactory
  • SimpleDoubleAggregatorFactory

@clintropolis
Member Author

This PR is going to fail the coverage check in CI on the 'processing' unit tests, because most of the code here is only used in SQL compatible mode.

stale Bot commented Apr 28, 2022

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

github-actions Bot commented Oct 4, 2023

This pull request has been marked as stale due to 60 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If you think
that's incorrect or this pull request should instead be reviewed, please simply
write any comment. Even if closed, you can still revive the PR at any time or
discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

github-actions Bot commented Dec 4, 2023

This pull request has been marked as stale due to 60 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If you think
that's incorrect or this pull request should instead be reviewed, please simply
write any comment. Even if closed, you can still revive the PR at any time or
discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

@github-actions github-actions Bot added the stale label Dec 4, 2023
github-actions Bot commented Jan 1, 2024

This pull request/issue has been closed due to lack of activity. If you think that
is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions

This pull request has been marked as stale due to 60 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If you think
that's incorrect or this pull request should instead be reviewed, please simply
write any comment. Even if closed, you can still revive the PR at any time or
discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

@github-actions github-actions Bot added the stale label May 12, 2024
github-actions Bot commented Jun 9, 2024

This pull request/issue has been closed due to lack of activity. If you think that
is incorrect, or the pull request requires review, you can revive the PR at any time.
