Skip to content

Count distinct returned incorrect results without useApproximateCountDistinct#14748

Merged
clintropolis merged 101 commits intoapache:masterfrom
kgyrtkirk:count-agg-missing0
Sep 12, 2023
Merged

Count distinct returned incorrect results without useApproximateCountDistinct#14748
clintropolis merged 101 commits intoapache:masterfrom
kgyrtkirk:count-agg-missing0

Conversation

@kgyrtkirk
Copy link
Copy Markdown
Member

@kgyrtkirk kgyrtkirk commented Aug 3, 2023

With useApproximateCountDistinct=false queries like:

select count(distinct m1) from druid.foo where m1 < -1.0

may have returned incorrected results.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Comment thread sql/src/main/java/org/apache/druid/sql/calcite/rel/DruidQuery.java Fixed
for (int i = 0; i < aggSpec.size(); i++) {
values[i] = aggSpec.get(i).factorize(new AllNullColumnSelectorFactory()).get();
}
return Collections.singleton(ResultRow.of(values)).iterator();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a Collections.singletonIterator that you can use instead. It's a nit, but will save on an object allocation.

Comment on lines +178 to +184
Sequence<ResultRow> process;
if (isNestedQueryPushDown(query)) {
return mergeResultsWithNestedQueryPushDown(query, resource, runner, context);
process = mergeResultsWithNestedQueryPushDown(query, resource, runner, context);
} else {
process = mergeGroupByResultsWithoutPushDown(query, resource, runner, context);
}
return mergeGroupByResultsWithoutPushDown(query, resource, runner, context);
return GroupByQueryRunnerFactory.wrapSummaryRowIfNeeded(query, process);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised that this was required, which test caused you to need this change? I say this because the only way you should be able to get a completely empty sequence here is if the "leaf nodes" are producing completely empty sequences. The change in the other place should ensure that no leaf node ever produces a completely empty sequence, meaning that this change shouldn't be necessary...

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for taking a look!
unfortunately its needed - I've linked the test(s) checking this.

The leaf nodes are not necessarily aggregating (in case of distinct) so an empty sequence may be produced - the merger supposed to aggregate them - that's why this is needed.

For nested query stuff the merge runner becomes this lambda (note: I don't know why I didn't placed this call there - just moved it)

example tests

@kgyrtkirk
Copy link
Copy Markdown
Member Author

The last test results have uncovered that HAVING clauses were not able to filter the summary row - because it was added after those were processed.

To avoid that issue I've moved the insertion of the optional summary row to be right before postprocessing is applied

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants