Skip to content

Preemptive restriction for queries with approximate count distinct on complex columns of unsupported type#16682

Merged
LakshSingla merged 20 commits intoapache:masterfrom
Akshat-Jain:complex-column-approx-distinct-count
Jul 22, 2024
Merged

Preemptive restriction for queries with approximate count distinct on complex columns of unsupported type#16682
LakshSingla merged 20 commits intoapache:masterfrom
Akshat-Jain:complex-column-approx-distinct-count

Conversation

@Akshat-Jain
Copy link
Copy Markdown
Contributor

@Akshat-Jain Akshat-Jain commented Jul 2, 2024

Description

Currently, we don't support count (distinct complexColumn) + approximation for unsupported types, and the query execution fails with a ClassCastException:

Caused by: java.lang.ClassCastException: class org.apache.druid.segment.nested.StructuredData cannot be cast to class org.apache.druid.hll.HyperLogLogCollector (org.apache.druid.segment.nested.StructuredData and org.apache.druid.hll.HyperLogLogCollector are in unnamed module of loader 'app')

This PR aims to check if the complex column being queried aligns with the supported types in the aggregator and aggregator factories, and throws a user-friendly error message if they don't.

The validation is added in 3 layers:

  1. In the SQL operand type checker - This helps with the case of using APPROX_COUNT_DISTINCT() and other functions.
  2. In the SQL layer - This helps with the case of doing select count (distinct column) type queries.
  3. In the native layer - This helps with the case of doing native queries.

Test plan

  1. Manually verified the functionality of APPROX_COUNT_DISTINCT, APPROX_COUNT_DISTINCT_DS_HLL, APPROX_COUNT_DISTINCT_DS_HLL_UTF8, APPROX_COUNT_DISTINCT_DS_THETA
  2. Manually verified the functionality of select count (distinct column)
  3. Manually verified the functionality of native queries.

Sample error message after this PR's changes in sql layer:
image

Sample error message after this PR's changes in native layer:
image

Release Note

Queries like SELECT COUNT(DISTINCT columnName) FROM tableName WHERE columnValue = 'non-existing-value' will also stop working with these changes.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@Akshat-Jain Akshat-Jain marked this pull request as ready for review July 4, 2024 13:56
@LakshSingla
Copy link
Copy Markdown
Contributor

Thanks for the PR @Akshat-Jain 💯 I have made a few suggestions - a few of which are applicable across the aggregators.

We should also add in the release notes that, according to your investigation, the queries like the following would fail if they were passing earlier:

SELECT COUNT(DISTINCT x) FROM table WHERE x = 'non-existing'

@Akshat-Jain Akshat-Jain requested a review from LakshSingla July 8, 2024 15:28
Copy link
Copy Markdown
Contributor

@LakshSingla LakshSingla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it that a HllSketchMergeAggregatorFactory doesn't have a check for arrays but Theta ones do?

@Akshat-Jain
Copy link
Copy Markdown
Contributor Author

Why is it that a HllSketchMergeAggregatorFactory doesn't have a check for arrays but Theta ones do?

@LakshSingla I'm not sure about this, lacking context here. @clintropolis Can you chime in? Thanks!

@Akshat-Jain Akshat-Jain requested a review from LakshSingla July 10, 2024 17:40
Comment on lines +183 to +184
return Objects.equals(columnType.getComplexTypeName(), HyperUniquesAggregatorFactory.TYPE.getComplexTypeName()) ||
Objects.equals(columnType.getComplexTypeName(), HyperUniquesAggregatorFactory.PRECOMPUTED_TYPE.getComplexTypeName());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't allow UNKNOWN_COMPLEX types. Should it support them as well? It looks inconsistent given that other aggregators allow UNKNOWN_COMPLEX

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure on this. Will check with Clint / Zoltan on all the pending review comments where I'm lacking context to make a decision.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a big fan of only allowing what's absolutely necessary.

{
if (capabilities != null) {
final ColumnType type = capabilities.toColumnType();
if (!(ColumnType.UNKNOWN_COMPLEX.equals(type) || TYPE.equals(type) || PRECOMPUTED_TYPE.equals(type))) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you push the negation into this conditional...or return early if they are ok...

but why allow everything in case capabilities == null ? does that cause any trouble? if its not I think its better to throw an error in that case as well....

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but why allow everything in case capabilities == null ?

What should be the expectation when capabilities is null?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if capabilities is null an exception should be raised

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kgyrtkirk if (capabilities != null) {} seems to be used in a bunch of other layers as well, so seems like should be tackled separate from this PR?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense to me

@Akshat-Jain Akshat-Jain requested a review from kgyrtkirk July 15, 2024 09:08
@LakshSingla
Copy link
Copy Markdown
Contributor

Had taken a few passes over the changes. Need some validation on the Calcite and SQL side modifications that @kgyrtkirk has given.

Comment on lines +183 to +184
return Objects.equals(columnType.getComplexTypeName(), HyperUniquesAggregatorFactory.TYPE.getComplexTypeName()) ||
Objects.equals(columnType.getComplexTypeName(), HyperUniquesAggregatorFactory.PRECOMPUTED_TYPE.getComplexTypeName());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a big fan of only allowing what's absolutely necessary.

@Akshat-Jain Akshat-Jain requested a review from kgyrtkirk July 17, 2024 12:23
{
if (capabilities != null) {
final ColumnType type = capabilities.toColumnType();
if (!(ColumnType.UNKNOWN_COMPLEX.equals(type) || TYPE.equals(type) || PRECOMPUTED_TYPE.equals(type))) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense to me

@LakshSingla LakshSingla merged commit 6a2348b into apache:master Jul 22, 2024
@LakshSingla
Copy link
Copy Markdown
Contributor

Thanks for the patch @Akshat-Jain

@Akshat-Jain
Copy link
Copy Markdown
Contributor Author

Thanks for the exhaustive review cycles @LakshSingla and @kgyrtkirk! 🙌

sreemanamala pushed a commit to sreemanamala/druid that referenced this pull request Aug 6, 2024
… complex columns of unsupported type (apache#16682)

This PR aims to check if the complex column being queried aligns with the supported types in the aggregator and aggregator factories, and throws a user-friendly error message if they don't.
@kfaraz kfaraz added this to the 31.0.0 milestone Oct 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants