Preemptive restriction for queries with approximate count distinct on complex columns of unsupported type #16682
Thanks for the PR @Akshat-Jain 💯 I have made a few suggestions, some of which apply across the aggregators. We should also mention in the release notes that, according to your investigation, queries like the following will now fail even if they were passing earlier: SELECT COUNT(DISTINCT x) FROM table WHERE x = 'non-existing'
LakshSingla left a comment
Why is it that a HllSketchMergeAggregatorFactory doesn't have a check for arrays but Theta ones do?
@LakshSingla I'm not sure about this; I lack context here. @clintropolis Can you chime in? Thanks!
return Objects.equals(columnType.getComplexTypeName(), HyperUniquesAggregatorFactory.TYPE.getComplexTypeName()) ||
       Objects.equals(columnType.getComplexTypeName(), HyperUniquesAggregatorFactory.PRECOMPUTED_TYPE.getComplexTypeName());
This doesn't allow UNKNOWN_COMPLEX types. Should it support them as well? It looks inconsistent given that other aggregators allow UNKNOWN_COMPLEX
I'm not sure about this. I'll check with Clint / Zoltan on all the pending review comments where I lack the context to make a decision.
I'm a big fan of only allowing what's absolutely necessary.
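To make the trade-off in this thread concrete, here is a minimal Java sketch of the two options: a strict check that accepts only the two named complex types, and a lenient one that also lets an unknown complex type through. `SketchColumnType`, `ComplexTypeCheck`, and the literal type names are hypothetical stand-ins rather than Druid classes, and modelling UNKNOWN_COMPLEX as a null type name is an assumption made for illustration.

```java
import java.util.Objects;

// Hypothetical stand-in for a column type; this is not Druid's ColumnType class.
final class SketchColumnType {
    // A null name models an "unknown complex" type (an assumption for this sketch).
    private final String complexTypeName;

    SketchColumnType(String complexTypeName) {
        this.complexTypeName = complexTypeName;
    }

    String getComplexTypeName() {
        return complexTypeName;
    }
}

final class ComplexTypeCheck {
    // Assumed type names, used only for illustration.
    static final String TYPE_NAME = "hyperUnique";
    static final String PRECOMPUTED_TYPE_NAME = "preComputedHyperUnique";

    // Strict variant: only the two explicitly supported names pass.
    static boolean isSupportedStrict(SketchColumnType type) {
        String name = type.getComplexTypeName();
        return Objects.equals(name, TYPE_NAME) || Objects.equals(name, PRECOMPUTED_TYPE_NAME);
    }

    // Lenient variant: additionally lets an unknown complex type (null name) through.
    static boolean isSupportedLenient(SketchColumnType type) {
        return type.getComplexTypeName() == null || isSupportedStrict(type);
    }
}
```

With the strict variant, a column whose complex type is unknown is rejected up front; with the lenient variant it passes validation and any mismatch would only surface later.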
{
  if (capabilities != null) {
    final ColumnType type = capabilities.toColumnType();
    if (!(ColumnType.UNKNOWN_COMPLEX.equals(type) || TYPE.equals(type) || PRECOMPUTED_TYPE.equals(type))) {
You could push the negation into this conditional, or return early if the types are ok.
But why allow everything when capabilities == null? Does that cause any trouble? If not, I think it's better to throw an error in that case as well.
but why allow everything in case capabilities == null ?
What should be the expectation when capabilities is null?
I think if capabilities is null an exception should be raised
@kgyrtkirk if (capabilities != null) {} seems to be used in a bunch of other layers as well, so it seems like this should be tackled separately from this PR?
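The two suggestions in this thread (return early when the type is acceptable, and optionally fail when capabilities is null) could be sketched roughly as follows. This is an illustration only: `SketchCapabilities`, `InputTypeValidator`, the `failOnMissingCapabilities` flag, the type-name strings, and the error messages are all hypothetical and are not taken from the patch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for column capabilities; not Druid's ColumnCapabilities interface.
final class SketchCapabilities {
    private final String typeName;

    SketchCapabilities(String typeName) {
        this.typeName = typeName;
    }

    String getTypeName() {
        return typeName;
    }
}

final class InputTypeValidator {
    // Assumed set of acceptable type names, for illustration only.
    private static final Set<String> SUPPORTED = new HashSet<>(
        Arrays.asList("COMPLEX", "COMPLEX<hyperUnique>", "COMPLEX<preComputedHyperUnique>")
    );

    static void validate(String column, SketchCapabilities capabilities, boolean failOnMissingCapabilities) {
        if (capabilities == null) {
            if (failOnMissingCapabilities) {
                // Stricter option raised in review: treat missing capabilities as an error.
                throw new IllegalStateException("No capabilities available for column [" + column + "]");
            }
            return; // Lenient option: nothing to validate against, so allow the column.
        }
        if (SUPPORTED.contains(capabilities.getTypeName())) {
            return; // Supported type: return early instead of negating a compound condition.
        }
        throw new IllegalArgumentException(
            "Approximate count distinct on column [" + column + "] of type ["
            + capabilities.getTypeName() + "] is not supported"
        );
    }
}
```

Keeping the null handling behind a flag here is only a way to show both behaviours side by side; whether null capabilities should be rejected is exactly the open question in the thread.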
Had taken a few passes over the changes. Need some validation on the Calcite and SQL side modifications that @kgyrtkirk has given.
Thanks for the patch @Akshat-Jain
Thanks for the exhaustive review cycles @LakshSingla and @kgyrtkirk! 🙌
Description
Currently, we don't support count (distinct complexColumn) + approximation for unsupported types, and the query execution fails with a ClassCastException.
This PR checks whether the complex column being queried aligns with the types supported by the aggregator and aggregator factories, and throws a user-friendly error message if it doesn't.
The validation is added in 3 layers:
select count (distinct column) type queries.
Test plan
APPROX_COUNT_DISTINCT, APPROX_COUNT_DISTINCT_DS_HLL, APPROX_COUNT_DISTINCT_DS_HLL_UTF8, APPROX_COUNT_DISTINCT_DS_THETA
select count (distinct column)
Sample error message after this PR's changes in SQL layer:

Sample error message after this PR's changes in native layer:

Release Note
Queries like SELECT COUNT(DISTINCT columnName) FROM tableName WHERE columnValue = 'non-existing-value' will also stop working with these changes.
This PR has: