fix(sql): preserve multi-arg DISTINCT in sanitize_clause and format#39340
fix(sql): preserve multi-arg DISTINCT in sanitize_clause and format#39340sitelight wants to merge 2 commits into
Conversation
6fbf764 to
ec0ff1c
Compare
ec0ff1c to
4808b03
Compare
Code Review Agent Run #1bccc9Actionable Suggestions - 0Additional Suggestions - 1
Review Details
Bito Usage GuideCommands Type the following command in the pull request comment and save the comment.
Refer to the documentation for additional commands. Configuration This repository uses Documentation & Help |
Dialects such as Postgres, Presto, Trino, DuckDB and Dremio set `MULTI_ARG_DISTINCT = False` on their sqlglot generator, which rewrites `FUNC(DISTINCT a, b)` into a row-expression null guard of the form `FUNC(DISTINCT CASE WHEN a IS NULL ... THEN NULL ELSE (a, b) END)`. That emulation is intended for the unsupported `COUNT(DISTINCT a, b)` idiom, but it silently corrupts user-defined aggregates that natively accept multiple arguments — at query time Postgres reports `function distinct_avg(record) does not exist` because the generated tuple has no matching function signature. Two code paths normalize user SQL via a sqlglot round-trip today: - `sanitize_clause` (introduced in apache#35419), used by adhoc metric expressions and cache-key generation. - `SQLStatement.format`, used by the SQL Lab executor, celery workers, and query rendering — reproducible via `SQLScript("SELECT DISTINCT_AVG(DISTINCT a, b) FROM t", "postgresql").format()`. Both paths must preserve user SQL verbatim for the DISTINCT rewrite — Superset is not transpiling here, it is normalizing whitespace / stripping comments for stability. Extract a shared `_normalized_generator` helper at module level in `superset/sql/parse.py` that builds a dialect generator with `MULTI_ARG_DISTINCT = True` set, and use it from both call sites. Comment stripping, whitespace normalization and every other dialect-specific behavior (JSON operators, regex, array literals, casts, single-arg DISTINCT) are untouched. Side effect: `COUNT(DISTINCT a, b)` on these dialects used to be silently rewritten by Superset into a working `CASE WHEN ... ELSE (a, b) END` emulation. It now round-trips verbatim and will fail at query time on Postgres (`function count(record) does not exist`). The emulation also silently corrupted every user-defined multi-arg aggregate, and sanitize / format are normalization passes, not transpilation — so preserving the user's SQL is the right tradeoff. Users relying on the emulation should switch to the engine-native idiom (e.g. `COUNT(DISTINCT (a, b))` on Postgres). Thanks to @jorickdefraine for the detailed bug report. cc @betodealmeida who introduced the sanitize_clause round-trip in apache#35419 — this keeps the cache-key normalization intent and only opts out of the multi-arg DISTINCT rewrite. Fixes apache#39223
4808b03 to
2a3321b
Compare
Code Review Agent Run #355dafActionable Suggestions - 0Review Details
Bito Usage GuideCommands Type the following command in the pull request comment and save the comment.
Refer to the documentation for additional commands. Configuration This repository uses Documentation & Help |
|
🎪 Showtime deployed environment on GHA for c951de0 • Environment: http://52.89.166.223:8080 (admin/admin) |
|
Tested locally on a custom Superset instance, the fix works perfectly. |
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR fixes a sqlglot normalization regression where multi-argument DISTINCT aggregates (eg FUNC(DISTINCT a, b)) were being rewritten into a CASE WHEN ... THEN NULL ELSE (a, b) END row-expression guard for certain dialects, corrupting user-defined aggregates. It introduces a shared normalization generator that forces MULTI_ARG_DISTINCT=True and adds regression tests for both sanitize_clause and SQLStatement.format() paths.
Changes:
- Add
_normalized_generatorhelper to preserve multi-argDISTINCTduring sqlglot generation. - Update
SQLStatement.format()andsanitize_clause()to use the shared generator. - Add unit tests covering preservation of multi-arg
DISTINCTacross affected dialects.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
superset/sql/parse.py |
Centralizes sqlglot generator configuration to prevent multi-arg DISTINCT rewrites in sanitize/format normalization paths. |
tests/unit_tests/sql/parse_tests.py |
Adds regression tests for multi-arg DISTINCT preservation in both clause sanitization and SQL formatting flows. |
| def _normalized_generator( | ||
| dialect_name: DialectType, | ||
| *, | ||
| pretty: bool, | ||
| comments: bool, | ||
| ) -> Generator: |
| assert "DISTINCT_AVG(DISTINCT a, b)" in formatted | ||
| assert "CASE WHEN" not in formatted | ||
|
|
||
|
|
Codecov Report✅ All modified and coverable lines are covered by tests. ❌ Your project check has failed because the head coverage (99.81%) is below the target coverage (100.00%). You can increase the head coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## master #39340 +/- ##
=======================================
Coverage 64.45% 64.45%
=======================================
Files 2555 2555
Lines 132721 132724 +3
Branches 30802 30802
=======================================
+ Hits 85539 85542 +3
Misses 45696 45696
Partials 1486 1486
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
SUMMARY
Fixes #39223
Thanks to @jorickdefraine for the very detailed bug report, including the v5-vs-v6 SQL diff and root-cause analysis that made this trivial to pin down.
cc @betodealmeida who introduced the
sanitize_clausesqlglot round-trip in #35419 — this PR keeps the cache-key normalization intent you landed there and only opts out of the multi-arg DISTINCT rewrite.Dialects such as Postgres, Presto, Trino, DuckDB and Dremio set
MULTI_ARG_DISTINCT = Falseon their sqlglot generator, which rewritesFUNC(DISTINCT a, b)into a row-expression null guard of the form:That emulation is intended for the unsupported
COUNT(DISTINCT a, b)idiom, but it silently corrupts user-defined aggregates that natively accept multiple arguments. At query time Postgres reportsfunction distinct_avg(record) does not existbecause the generated tuple has no matching signature.Scope
Two normalization paths round-trip user SQL through the target dialect generator today and both exhibit the bug:
sanitize_clause— introduced for cache-key stability in fix(cache): ensure SQL is sanitized before cache key generation #35419; used by adhoc metric expressions on the chart/explore path. This is the pathDISTINCT_AVG/DISTINCT_SUMbroken in v6 — wrong argument type generated #39223 was reported via.SQLStatement.format— used by the SQL Lab executor, celery workers and query rendering. Reproducible viaSQLScript("SELECT DISTINCT_AVG(DISTINCT a, b) FROM t", "postgresql").format().This PR fixes both via a shared
_normalized_generatorhelper at module level insuperset/sql/parse.pythat builds a generator withMULTI_ARG_DISTINCT = Trueset. Comment stripping, whitespace normalization, and all other dialect-specific behavior (JSON operators, regex, array literals, casts, single-arg DISTINCT) are untouched.BEFORE / AFTER
Input:
Before (sqlglot 28.10, postgres dialect):
After:
Side effects
On Postgres, Presto, Trino, DuckDB, and Dremio, expressions of the form
COUNT(DISTINCT a, b)were previously rewritten by Superset into a workingCASE WHEN a IS NULL OR b IS NULL THEN NULL ELSE (a, b) ENDemulation. After this PR they round-trip verbatim and will now fail at query time (function count(record) does not existon Postgres).This is a deliberate tradeoff: the emulation also silently corrupted every user-defined multi-argument aggregate, and Superset's sanitize / format paths are for normalization, not transpilation. Users who were relying on the emulation should switch to the engine-native idiom (e.g.
COUNT(DISTINCT (a, b))on Postgres).TESTING INSTRUCTIONS
New parametrized cases in
tests/unit_tests/sql/parse_tests.py:test_sanitize_clause— covers the metric/cache path:DISTINCT_AVG(DISTINCT report_id, time_to_accept / 86400)on postgresqlDISTINCT_SUM(DISTINCT report_id, total_bounty_reward_amount)on postgresqlDISTINCT_AVG(DISTINCT k, v)on presto, trino, and duckdbCOUNT(DISTINCT x)on postgresql (single-arg regression guard)test_sqlstatement_format_preserves_multi_arg_distinct— new test, covers the SQL Lab / executor path:SELECT DISTINCT_AVG(DISTINCT a, b) FROM ton postgresql, presto, trino, duckdbpytest tests/unit_tests/sql/parse_tests.py -k "sanitize_clause or sqlstatement_format_preserves_multi_arg_distinct" -vAll 23 new/affected variants pass locally. Broader suites (
parse_tests.py,transpile_to_dialect_test.py,models/helpers_test.py,common/test_query_context_processor.py) remain green — 655 tests total.ADDITIONAL INFORMATION
DISTINCT_AVG/DISTINCT_SUMbroken in v6 — wrong argument type generated #39223