fix(sql): preserve multi-arg DISTINCT in sanitize_clause and format #39340

Open
sitelight wants to merge 2 commits into apache:master from sitelight:fix/sanitize-clause-preserve-user-sql

@sitelight sitelight commented Apr 14, 2026

SUMMARY

Fixes #39223

Thanks to @jorickdefraine for the very detailed bug report, including the v5-vs-v6 SQL diff and root-cause analysis that made this trivial to pin down.

cc @betodealmeida who introduced the sanitize_clause sqlglot round-trip in #35419 — this PR keeps the cache-key normalization intent you landed there and only opts out of the multi-arg DISTINCT rewrite.

Dialects such as Postgres, Presto, Trino, DuckDB and Dremio set MULTI_ARG_DISTINCT = False on their sqlglot generator, which rewrites FUNC(DISTINCT a, b) into a row-expression null guard of the form:

FUNC(DISTINCT CASE WHEN a IS NULL OR b IS NULL THEN NULL ELSE (a, b) END)

That emulation is intended for the unsupported COUNT(DISTINCT a, b) idiom, but it silently corrupts user-defined aggregates that natively accept multiple arguments. At query time Postgres reports function distinct_avg(record) does not exist because the generated tuple has no matching signature.
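
To make the shape of that rewrite concrete, here is a minimal pure-Python sketch that mimics the null guard as a string transformation. It is not sqlglot's implementation: `emulate_multi_arg_distinct` is a hypothetical name used only for illustration, and the one-WHEN-per-argument layout is assumed from the Before output quoted in this PR.

```python
def emulate_multi_arg_distinct(func: str, args: list[str]) -> str:
    """Mimic the null-guard rewrite applied when MULTI_ARG_DISTINCT is False.

    Illustrative only: a string-level model, not sqlglot's implementation.
    """
    if len(args) < 2:
        # Single-argument DISTINCT is left untouched.
        return f"{func}(DISTINCT {', '.join(args)})"
    # One WHEN branch per argument, then the arguments packed into a row.
    guards = " ".join(f"WHEN {arg} IS NULL THEN NULL" for arg in args)
    row = ", ".join(args)
    return f"{func}(DISTINCT CASE {guards} ELSE ({row}) END)"


rewritten = emulate_multi_arg_distinct("DISTINCT_AVG", ["a", "b"])
print(rewritten)
```

The rewritten call passes a single row-valued argument instead of two scalar ones, which is why Postgres looks for a `distinct_avg(record)` signature and fails to find it.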

Scope

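Two normalization paths round-trip user SQL through the target dialect generator today and both exhibit the bug:

  • sanitize_clause (introduced in #35419), used by adhoc metric expressions and cache-key generation.
  • SQLStatement.format, used by the SQL Lab executor, Celery workers, and query rendering.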

This PR fixes both via a shared _normalized_generator helper at module level in superset/sql/parse.py that builds a generator with MULTI_ARG_DISTINCT = True set. Comment stripping, whitespace normalization, and all other dialect-specific behavior (JSON operators, regex, array literals, casts, single-arg DISTINCT) are untouched.

BEFORE / AFTER

Input:

DISTINCT_AVG(DISTINCT report_id, time_to_accept/86400)

Before (sqlglot 28.10, postgres dialect):

DISTINCT_AVG(DISTINCT CASE WHEN report_id IS NULL THEN NULL WHEN (time_to_accept / 86400) IS NULL THEN NULL ELSE (report_id, CAST(time_to_accept AS DOUBLE PRECISION) / 86400) END)
-- runtime error: function distinct_avg(record) does not exist

After:

DISTINCT_AVG(DISTINCT report_id, time_to_accept / 86400)
-- executes as the user intended

Side effects

On Postgres, Presto, Trino, DuckDB, and Dremio, expressions of the form COUNT(DISTINCT a, b) were previously rewritten by Superset into a working CASE WHEN a IS NULL OR b IS NULL THEN NULL ELSE (a, b) END emulation. After this PR they round-trip verbatim and will now fail at query time (function count(record) does not exist on Postgres).

This is a deliberate tradeoff: the emulation also silently corrupted every user-defined multi-argument aggregate, and Superset's sanitize / format paths are for normalization, not transpilation. Users who were relying on the emulation should switch to the engine-native idiom (e.g. COUNT(DISTINCT (a, b)) on Postgres).
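
For anyone auditing saved metrics before upgrading, the following is a rough sketch of how affected expressions might be spotted. It is an assumption-laden helper, not part of this PR: `has_multi_arg_count_distinct` is a hypothetical name, and the scan only tracks parenthesis depth, so commas inside string literals or SQL comments could produce false positives.

```python
import re

# Matches the opening of a COUNT(DISTINCT ... call, case-insensitively.
_COUNT_DISTINCT = re.compile(r"\bCOUNT\s*\(\s*DISTINCT\b", re.IGNORECASE)


def has_multi_arg_count_distinct(expression: str) -> bool:
    """Return True if any COUNT(DISTINCT ...) call has a top-level comma."""
    for match in _COUNT_DISTINCT.finditer(expression):
        depth = 0  # parenthesis depth relative to the COUNT( call
        for char in expression[match.end():]:
            if char == "(":
                depth += 1
            elif char == ")":
                if depth == 0:
                    break  # end of this COUNT(...) call
                depth -= 1
            elif char == "," and depth == 0:
                # A comma directly inside COUNT(DISTINCT ...) means
                # the call passes more than one argument.
                return True
    return False


print(has_multi_arg_count_distinct("COUNT(DISTINCT a, b)"))
print(has_multi_arg_count_distinct("COUNT(DISTINCT (a, b))"))
```

Expressions flagged by such a check are the ones that would need to move to the engine-native idiom; the parenthesized row form `COUNT(DISTINCT (a, b))` is not flagged because its comma sits one level deeper.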

TESTING INSTRUCTIONS

New parametrized cases in tests/unit_tests/sql/parse_tests.py:

test_sanitize_clause — covers the metric/cache path:

  • DISTINCT_AVG(DISTINCT report_id, time_to_accept / 86400) on postgresql
  • DISTINCT_SUM(DISTINCT report_id, total_bounty_reward_amount) on postgresql
  • DISTINCT_AVG(DISTINCT k, v) on presto, trino, and duckdb
  • COUNT(DISTINCT x) on postgresql (single-arg regression guard)

test_sqlstatement_format_preserves_multi_arg_distinct — new test, covers the SQL Lab / executor path:

  • SELECT DISTINCT_AVG(DISTINCT a, b) FROM t on postgresql, presto, trino, duckdb

pytest tests/unit_tests/sql/parse_tests.py -k "sanitize_clause or sqlstatement_format_preserves_multi_arg_distinct" -v

All 23 new/affected variants pass locally. Broader suites (parse_tests.py, transpile_to_dialect_test.py, models/helpers_test.py, common/test_query_context_processor.py) remain green — 655 tests total.

ADDITIONAL INFORMATION

@sitelight sitelight force-pushed the fix/sanitize-clause-preserve-user-sql branch 2 times, most recently from 6fbf764 to ec0ff1c Compare April 14, 2026 11:37
@pull-request-size pull-request-size Bot added size/L and removed size/M labels Apr 14, 2026
@sitelight sitelight changed the title fix(sql): preserve multi-arg DISTINCT in sanitize_clause fix(sql): preserve multi-arg DISTINCT in sanitize_clause and format Apr 14, 2026
@sitelight sitelight force-pushed the fix/sanitize-clause-preserve-user-sql branch from ec0ff1c to 4808b03 Compare April 14, 2026 11:47
@sitelight sitelight marked this pull request as ready for review April 14, 2026 11:48
@dosubot dosubot Bot added change:backend Requires changing the backend sqllab Namespace | Anything related to the SQL Lab labels Apr 14, 2026
@bito-code-review
Contributor

bito-code-review Bot commented Apr 14, 2026

Code Review Agent Run #1bccc9

Actionable Suggestions - 0
Additional Suggestions - 1
  • tests/unit_tests/sql/parse_tests.py - 1
    • Missing test coverage for Dremio · Line 2708-2709
      The comment lists Dremio as a dialect affected by the MULTI_ARG_DISTINCT=False issue, but the test cases and parametrize lists do not include Dremio. This creates an inconsistency and potential gap in regression test coverage for Dremio.
Review Details
  • Files reviewed - 2 · Commit Range: 4808b03..4808b03
    • superset/sql/parse.py
    • tests/unit_tests/sql/parse_tests.py
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful

Dialects such as Postgres, Presto, Trino, DuckDB and Dremio set
`MULTI_ARG_DISTINCT = False` on their sqlglot generator, which rewrites
`FUNC(DISTINCT a, b)` into a row-expression null guard of the form
`FUNC(DISTINCT CASE WHEN a IS NULL ... THEN NULL ELSE (a, b) END)`.
That emulation is intended for the unsupported `COUNT(DISTINCT a, b)`
idiom, but it silently corrupts user-defined aggregates that natively
accept multiple arguments — at query time Postgres reports
`function distinct_avg(record) does not exist` because the generated
tuple has no matching function signature.

Two code paths normalize user SQL via a sqlglot round-trip today:

- `sanitize_clause` (introduced in apache#35419), used by adhoc metric
  expressions and cache-key generation.
- `SQLStatement.format`, used by the SQL Lab executor, celery workers,
  and query rendering — reproducible via
  `SQLScript("SELECT DISTINCT_AVG(DISTINCT a, b) FROM t", "postgresql").format()`.

Both paths must preserve user SQL verbatim for the DISTINCT rewrite —
Superset is not transpiling here, it is normalizing whitespace /
stripping comments for stability. Extract a shared
`_normalized_generator` helper at module level in `superset/sql/parse.py`
that builds a dialect generator with `MULTI_ARG_DISTINCT = True` set,
and use it from both call sites. Comment stripping, whitespace
normalization and every other dialect-specific behavior (JSON operators,
regex, array literals, casts, single-arg DISTINCT) are untouched.

Side effect: `COUNT(DISTINCT a, b)` on these dialects used to be
silently rewritten by Superset into a working `CASE WHEN ... ELSE (a,
b) END` emulation. It now round-trips verbatim and will fail at query
time on Postgres (`function count(record) does not exist`). The
emulation also silently corrupted every user-defined multi-arg
aggregate, and sanitize / format are normalization passes, not
transpilation — so preserving the user's SQL is the right tradeoff.
Users relying on the emulation should switch to the engine-native
idiom (e.g. `COUNT(DISTINCT (a, b))` on Postgres).

Thanks to @jorickdefraine for the detailed bug report.
cc @betodealmeida who introduced the sanitize_clause round-trip in
apache#35419 — this keeps the cache-key normalization intent and only
opts out of the multi-arg DISTINCT rewrite.

Fixes apache#39223
@sitelight sitelight force-pushed the fix/sanitize-clause-preserve-user-sql branch from 4808b03 to 2a3321b Compare April 14, 2026 13:02
@bito-code-review
Contributor

bito-code-review Bot commented Apr 14, 2026

Code Review Agent Run #355daf

Actionable Suggestions - 0
Review Details
  • Files reviewed - 2 · Commit Range: 2a3321b..c951de0
    • superset/sql/parse.py
    • tests/unit_tests/sql/parse_tests.py
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful

@sadpandajoe sadpandajoe added the 🎪 ⚡ showtime-trigger-start Create new ephemeral environment for this PR label Apr 15, 2026
@github-actions github-actions Bot added 🎪 c951de0 🚦 building Environment c951de0 status: building 🎪 c951de0 📅 2026-04-15T09-52 Environment c951de0 created at 2026-04-15T09-52 🎪 c951de0 🤡 sadpandajoe Environment c951de0 requested by sadpandajoe 🎪 ⌛ 48h Environment expires after 48 hours (default) and removed 🎪 ⚡ showtime-trigger-start Create new ephemeral environment for this PR labels Apr 15, 2026
@github-actions
Contributor

🎪 Showtime is building environment on GHA for c951de0

@github-actions github-actions Bot added 🎪 c951de0 🚦 deploying Environment c951de0 status: deploying 🎪 c951de0 🚦 running Environment c951de0 status: running 🎪 🎯 c951de0 Active environment pointer - c951de0 is receiving traffic 🎪 c951de0 🌐 52.89.166.223:8080 Environment c951de0 URL: http://52.89.166.223:8080 (click to visit) and removed 🎪 c951de0 🚦 building Environment c951de0 status: building 🎪 c951de0 🚦 deploying Environment c951de0 status: deploying 🎪 c951de0 🚦 running Environment c951de0 status: running labels Apr 15, 2026
@github-actions github-actions Bot removed the 🎪 🎯 c951de0 Active environment pointer - c951de0 is receiving traffic label Apr 15, 2026
@github-actions
Contributor

🎪 Showtime deployed environment on GHA for c951de0

Environment: http://52.89.166.223:8080 (admin/admin)
Lifetime: 48h auto-cleanup
Updates: New commits create fresh environments automatically

@github-actions github-actions Bot removed 🎪 ⚡ showtime-trigger-stop 🎪 ⌛ 48h Environment expires after 48 hours (default) 🎪 c951de0 🤡 sadpandajoe Environment c951de0 requested by sadpandajoe 🎪 c951de0 📅 2026-04-15T09-52 Environment c951de0 created at 2026-04-15T09-52 🎪 c951de0 🚦 running Environment c951de0 status: running 🎪 c951de0 🌐 52.89.166.223:8080 Environment c951de0 URL: http://52.89.166.223:8080 (click to visit) labels Apr 15, 2026
@jorickdefraine

Tested locally on a custom Superset instance, the fix works perfectly.
DISTINCT_AVG(DISTINCT report_id, time_to_accept/86400) now generates the correct SQL and executes without error on PostgreSQL. Thanks for the quick turnaround!

Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR fixes a sqlglot normalization regression where multi-argument DISTINCT aggregates (e.g. FUNC(DISTINCT a, b)) were being rewritten into a CASE WHEN ... THEN NULL ELSE (a, b) END row-expression guard for certain dialects, corrupting user-defined aggregates. It introduces a shared normalization generator that forces MULTI_ARG_DISTINCT=True and adds regression tests for both the sanitize_clause and SQLStatement.format() paths.

Changes:

  • Add _normalized_generator helper to preserve multi-arg DISTINCT during sqlglot generation.
  • Update SQLStatement.format() and sanitize_clause() to use the shared generator.
  • Add unit tests covering preservation of multi-arg DISTINCT across affected dialects.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
superset/sql/parse.py Centralizes sqlglot generator configuration to prevent multi-arg DISTINCT rewrites in sanitize/format normalization paths.
tests/unit_tests/sql/parse_tests.py Adds regression tests for multi-arg DISTINCT preservation in both clause sanitization and SQL formatting flows.

Comment thread superset/sql/parse.py
Comment on lines +140 to +145
def _normalized_generator(
    dialect_name: DialectType,
    *,
    pretty: bool,
    comments: bool,
) -> Generator:

Comment thread tests/unit_tests/sql/parse_tests.py
Comment on lines +2780 to +2783
    assert "DISTINCT_AVG(DISTINCT a, b)" in formatted
    assert "CASE WHEN" not in formatted


@codecov

codecov Bot commented May 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 64.45%. Comparing base (499e27e) to head (c951de0).
⚠️ Report is 484 commits behind head on master.

❌ Your project check has failed because the head coverage (99.81%) is below the target coverage (100.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #39340   +/-   ##
=======================================
  Coverage   64.45%   64.45%           
=======================================
  Files        2555     2555           
  Lines      132721   132724    +3     
  Branches    30802    30802           
=======================================
+ Hits        85539    85542    +3     
  Misses      45696    45696           
  Partials     1486     1486           
Flag Coverage Δ
hive 39.96% <85.71%> (+<0.01%) ⬆️
mysql 60.60% <100.00%> (+<0.01%) ⬆️
postgres 60.68% <100.00%> (+<0.01%) ⬆️
presto 41.76% <85.71%> (+<0.01%) ⬆️
python 62.27% <100.00%> (+<0.01%) ⬆️
sqlite 60.31% <100.00%> (+<0.01%) ⬆️
unit 100.00% <100.00%> (ø)


Labels

change:backend Requires changing the backend size/L sqllab Namespace | Anything related to the SQL Lab

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DISTINCT_AVG / DISTINCT_SUM broken in v6 — wrong argument type generated

4 participants