fix(sql): preserve multi-arg DISTINCT in sanitize_clause and format by sitelight · Pull Request #39340 · apache/superset

sitelight · 2026-04-14T11:23:09Z

SUMMARY

Thanks to @jorickdefraine for the very detailed bug report, including the v5-vs-v6 SQL diff and root-cause analysis that made this trivial to pin down.

cc @betodealmeida who introduced the sanitize_clause sqlglot round-trip in #35419 — this PR keeps the cache-key normalization intent you landed there and only opts out of the multi-arg DISTINCT rewrite.

Dialects such as Postgres, Presto, Trino, DuckDB and Dremio set MULTI_ARG_DISTINCT = False on their sqlglot generator, which rewrites FUNC(DISTINCT a, b) into a row-expression null guard of the form:

FUNC(DISTINCT CASE WHEN a IS NULL OR b IS NULL THEN NULL ELSE (a, b) END)

That emulation is intended for the unsupported COUNT(DISTINCT a, b) idiom, but it silently corrupts user-defined aggregates that natively accept multiple arguments. At query time Postgres reports function distinct_avg(record) does not exist because the generated tuple has no matching signature.
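To make the shape of that rewrite concrete, here is a toy string-level sketch. It is not sqlglot's implementation; the function name `null_guard_distinct` is hypothetical and exists only to illustrate why the guard is harmless for a count but breaks an aggregate that expects two separate arguments.

```python
def null_guard_distinct(func: str, args: list[str]) -> str:
    """Mimic the shape of the multi-arg DISTINCT emulation as a plain string.

    Illustrative only: the real rewrite happens on sqlglot's expression tree.
    """
    if len(args) < 2:
        # Single-argument DISTINCT is left alone.
        return f"{func}(DISTINCT {', '.join(args)})"
    any_null = " OR ".join(f"{arg} IS NULL" for arg in args)
    row = ", ".join(args)
    # All arguments are collapsed into ONE row-valued argument.
    return f"{func}(DISTINCT CASE WHEN {any_null} THEN NULL ELSE ({row}) END)"


print(null_guard_distinct("COUNT", ["a", "b"]))
print(null_guard_distinct("DISTINCT_AVG", ["a", "b"]))
```

The rewrite does not look at the function name, so `DISTINCT_AVG` ends up being called with a single record argument instead of the two arguments the user wrote.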

Scope

Two normalization paths round-trip user SQL through the target dialect generator today and both exhibit the bug:

sanitize_clause: introduced for cache-key stability in #35419 ("fix(cache): ensure SQL is sanitized before cache key generation"); used by adhoc metric expressions on the chart/explore path. Issue #39223 ("DISTINCT_AVG / DISTINCT_SUM broken in v6: wrong argument type generated") was reported via this path.
SQLStatement.format — used by the SQL Lab executor, celery workers and query rendering. Reproducible via SQLScript("SELECT DISTINCT_AVG(DISTINCT a, b) FROM t", "postgresql").format().

This PR fixes both via a shared _normalized_generator helper at module level in superset/sql/parse.py that builds a generator with MULTI_ARG_DISTINCT = True set. Comment stripping, whitespace normalization, and all other dialect-specific behavior (JSON operators, regex, array literals, casts, single-arg DISTINCT) are untouched.
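A minimal sketch of what such a helper could look like, mirroring the keyword-only signature visible in the diff. It assumes sqlglot's `Dialect.get_or_raise`, `Dialect.generator(**opts)` and the `MULTI_ARG_DISTINCT` generator attribute; the name `normalized_generator`, the instance-level override, and the `roundtrip` wrapper are illustrative assumptions and not necessarily how the PR implements it.

```python
from typing import Any


def normalized_generator(dialect_name: str, *, pretty: bool, comments: bool) -> Any:
    """Build a dialect generator that leaves FUNC(DISTINCT a, b) untouched."""
    # sqlglot is a third-party dependency; it is imported where it is used.
    from sqlglot.dialects.dialect import Dialect

    dialect = Dialect.get_or_raise(dialect_name)
    generator = dialect.generator(pretty=pretty, comments=comments)
    # Assumption: overriding the flag on this one instance is enough to skip
    # the null-guard emulation without touching other dialect behavior.
    generator.MULTI_ARG_DISTINCT = True
    return generator


def roundtrip(sql: str, dialect_name: str) -> str:
    """Parse and regenerate a single expression (hypothetical helper)."""
    from sqlglot import parse_one

    expression = parse_one(sql, dialect=dialect_name)
    generator = normalized_generator(dialect_name, pretty=False, comments=False)
    return generator.generate(expression)
```

Under these assumptions, `roundtrip("DISTINCT_AVG(DISTINCT a, b)", "postgres")` would be expected to keep both arguments rather than emit the `CASE WHEN` guard. Note that sqlglot names the dialect `postgres`, whereas the Superset examples in this PR use the engine name `postgresql`.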

BEFORE / AFTER

Input:

DISTINCT_AVG(DISTINCT report_id, time_to_accept/86400)

Before (sqlglot 28.10, postgres dialect):

DISTINCT_AVG(DISTINCT CASE WHEN report_id IS NULL THEN NULL WHEN (time_to_accept / 86400) IS NULL THEN NULL ELSE (report_id, CAST(time_to_accept AS DOUBLE PRECISION) / 86400) END)
-- runtime error: function distinct_avg(record) does not exist

After:

DISTINCT_AVG(DISTINCT report_id, time_to_accept / 86400)
-- executes as the user intended

Side effects

On Postgres, Presto, Trino, DuckDB, and Dremio, expressions of the form COUNT(DISTINCT a, b) were previously rewritten by Superset into a working CASE WHEN a IS NULL OR b IS NULL THEN NULL ELSE (a, b) END emulation. After this PR they round-trip verbatim and will now fail at query time (function count(record) does not exist on Postgres).

This is a deliberate tradeoff: the emulation also silently corrupted every user-defined multi-argument aggregate, and Superset's sanitize / format paths are for normalization, not transpilation. Users who were relying on the emulation should switch to the engine-native idiom (e.g. COUNT(DISTINCT (a, b)) on Postgres).
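For deployments that want to find saved metrics affected by this side effect, a rough regex-based check along these lines may help. It is a sketch, not part of this PR: the function name `find_multi_arg_count_distinct` is hypothetical, and the pattern deliberately ignores arguments that themselves contain parentheses (function calls, casts), so it can miss cases.

```python
import re

# Matches COUNT(DISTINCT a, b) but not COUNT(DISTINCT (a, b)) or COUNT(DISTINCT x).
# Limitation: arguments containing parentheses are not matched.
_MULTI_ARG_COUNT_DISTINCT = re.compile(
    r"\bCOUNT\s*\(\s*DISTINCT\s+([^()]*,[^()]*)\)",
    re.IGNORECASE,
)


def find_multi_arg_count_distinct(sql: str) -> list[str]:
    """Return every COUNT(DISTINCT a, b, ...) call found in the SQL text."""
    return [match.group(0) for match in _MULTI_ARG_COUNT_DISTINCT.finditer(sql)]


print(find_multi_arg_count_distinct("SELECT COUNT(DISTINCT a, b) FROM t"))
print(find_multi_arg_count_distinct("SELECT COUNT(DISTINCT (a, b)) FROM t"))
```

Expressions it flags are candidates for the engine-native rewrite mentioned above.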

TESTING INSTRUCTIONS

New parametrized cases in tests/unit_tests/sql/parse_tests.py:

test_sanitize_clause — covers the metric/cache path:

DISTINCT_AVG(DISTINCT report_id, time_to_accept / 86400) on postgresql
DISTINCT_SUM(DISTINCT report_id, total_bounty_reward_amount) on postgresql
DISTINCT_AVG(DISTINCT k, v) on presto, trino, and duckdb
COUNT(DISTINCT x) on postgresql (single-arg regression guard)

test_sqlstatement_format_preserves_multi_arg_distinct — new test, covers the SQL Lab / executor path:

SELECT DISTINCT_AVG(DISTINCT a, b) FROM t on postgresql, presto, trino, duckdb

pytest tests/unit_tests/sql/parse_tests.py -k "sanitize_clause or sqlstatement_format_preserves_multi_arg_distinct" -v

All 23 new/affected variants pass locally. Broader suites (parse_tests.py, transpile_to_dialect_test.py, models/helpers_test.py, common/test_query_context_processor.py) remain green — 655 tests total.

ADDITIONAL INFORMATION

Has associated issue: DISTINCT_AVG / DISTINCT_SUM broken in v6 — wrong argument type generated #39223
Required feature flags:
Changes UI
Includes DB Migration (follow approval process in SIP-59)
- Migration is atomic, supports rollback & is backwards-compatible
- Confirm DB migration upgrade and downgrade tested
- Runtime estimates and downtime expectations provided
Introduces new feature or API
Removes existing feature or API

bito-code-review · 2026-04-14T12:26:42Z

Code Review Agent Run #1bccc9

Actionable Suggestions - 0

Additional Suggestions - 1

tests/unit_tests/sql/parse_tests.py - 1
- Missing test coverage for Dremio · Line 2708-2709
  
  The comment lists Dremio as a dialect affected by the MULTI_ARG_DISTINCT=False issue, but the test cases and parametrize lists do not include Dremio. This creates an inconsistency and potential gap in regression test coverage for Dremio.

Review Details

Files reviewed - 2 · Commit Range: 4808b03..4808b03
- superset/sql/parse.py
- tests/unit_tests/sql/parse_tests.py
Files skipped - 0
Tools
- Whispers (Secret Scanner) - ✔︎ Successful
- Detect-secrets (Secret Scanner) - ✔︎ Successful
- MyPy (Static Code Analysis) - ✔︎ Successful
- Astral Ruff (Static Code Analysis) - ✔︎ Successful

Bito Usage Guide

Commands

Type the following command in the pull request comment and save the comment.

/review - Manually triggers a full AI review.
/pause - Pauses automatic reviews on this pull request.
/resume - Resumes automatic reviews.
/resolve - Marks all Bito-posted review comments as resolved.
/abort - Cancels all in-progress reviews.

Refer to the documentation for additional commands.

Configuration

This repository uses Superset. You can customize the agent settings here or contact your Bito workspace admin at evan@preset.io.


@jorickdefraine

Dialects such as Postgres, Presto, Trino, DuckDB and Dremio set `MULTI_ARG_DISTINCT = False` on their sqlglot generator, which rewrites `FUNC(DISTINCT a, b)` into a row-expression null guard of the form `FUNC(DISTINCT CASE WHEN a IS NULL ... THEN NULL ELSE (a, b) END)`. That emulation is intended for the unsupported `COUNT(DISTINCT a, b)` idiom, but it silently corrupts user-defined aggregates that natively accept multiple arguments: at query time Postgres reports `function distinct_avg(record) does not exist` because the generated tuple has no matching function signature.

Two code paths normalize user SQL via a sqlglot round-trip today:

- `sanitize_clause` (introduced in apache#35419), used by adhoc metric expressions and cache-key generation.
- `SQLStatement.format`, used by the SQL Lab executor, celery workers, and query rendering; reproducible via `SQLScript("SELECT DISTINCT_AVG(DISTINCT a, b) FROM t", "postgresql").format()`.

Both paths must preserve user SQL verbatim for the DISTINCT rewrite. Superset is not transpiling here; it is normalizing whitespace / stripping comments for stability.

Extract a shared `_normalized_generator` helper at module level in `superset/sql/parse.py` that builds a dialect generator with `MULTI_ARG_DISTINCT = True` set, and use it from both call sites. Comment stripping, whitespace normalization and every other dialect-specific behavior (JSON operators, regex, array literals, casts, single-arg DISTINCT) are untouched.

Side effect: `COUNT(DISTINCT a, b)` on these dialects used to be silently rewritten by Superset into a working `CASE WHEN ... ELSE (a, b) END` emulation. It now round-trips verbatim and will fail at query time on Postgres (`function count(record) does not exist`). The emulation also silently corrupted every user-defined multi-arg aggregate, and sanitize / format are normalization passes, not transpilation, so preserving the user's SQL is the right tradeoff.

Users relying on the emulation should switch to the engine-native idiom (e.g. `COUNT(DISTINCT (a, b))` on Postgres).

Thanks to @jorickdefraine for the detailed bug report. cc @betodealmeida who introduced the sanitize_clause round-trip in apache#35419; this keeps the cache-key normalization intent and only opts out of the multi-arg DISTINCT rewrite.

Fixes apache#39223

bito-code-review · 2026-04-14T14:04:37Z

Code Review Agent Run #355daf

Actionable Suggestions - 0

Review Details

Files reviewed - 2 · Commit Range: 2a3321b..c951de0
- superset/sql/parse.py
- tests/unit_tests/sql/parse_tests.py
Files skipped - 0
Tools
- Whispers (Secret Scanner) - ✔︎ Successful
- Detect-secrets (Secret Scanner) - ✔︎ Successful
- MyPy (Static Code Analysis) - ✔︎ Successful
- Astral Ruff (Static Code Analysis) - ✔︎ Successful


github-actions · 2026-04-15T09:52:07Z

🎪 Showtime is building environment on GHA for c951de0

github-actions · 2026-04-15T10:12:34Z

🎪 Showtime deployed environment on GHA for c951de0

• Environment: http://52.89.166.223:8080 (admin/admin)
• Lifetime: 48h auto-cleanup
• Updates: New commits create fresh environments automatically

jorickdefraine · 2026-04-23T08:17:15Z

Tested locally on a custom Superset instance; the fix works perfectly.
DISTINCT_AVG(DISTINCT report_id, time_to_accept/86400) now generates the correct SQL and executes without error on PostgreSQL. Thanks for the quick turnaround!

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR fixes a sqlglot normalization regression where multi-argument DISTINCT aggregates (e.g. FUNC(DISTINCT a, b)) were being rewritten into a CASE WHEN ... THEN NULL ELSE (a, b) END row-expression guard for certain dialects, corrupting user-defined aggregates. It introduces a shared normalization generator that forces MULTI_ARG_DISTINCT=True and adds regression tests for both the sanitize_clause and SQLStatement.format() paths.

Changes:

Add _normalized_generator helper to preserve multi-arg DISTINCT during sqlglot generation.
Update SQLStatement.format() and sanitize_clause() to use the shared generator.
Add unit tests covering preservation of multi-arg DISTINCT across affected dialects.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `superset/sql/parse.py` | Centralizes sqlglot generator configuration to prevent multi-arg `DISTINCT` rewrites in sanitize/format normalization paths. |
| `tests/unit_tests/sql/parse_tests.py` | Adds regression tests for multi-arg `DISTINCT` preservation in both clause sanitization and SQL formatting flows. |

+def _normalized_generator(
+    dialect_name: DialectType,
+    *,
+    pretty: bool,
+    comments: bool,
+) -> Generator:


+    assert "DISTINCT_AVG(DISTINCT a, b)" in formatted
+    assert "CASE WHEN" not in formatted
+
+


codecov · 2026-05-05T18:01:42Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 64.45%. Comparing base (499e27e) to head (c951de0).
⚠️ Report is 484 commits behind head on master.

❌ Your project check has failed because the head coverage (99.81%) is below the target coverage (100.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files

@@           Coverage Diff           @@
##           master   #39340   +/-   ##
=======================================
  Coverage   64.45%   64.45%           
=======================================
  Files        2555     2555           
  Lines      132721   132724    +3     
  Branches    30802    30802           
=======================================
+ Hits        85539    85542    +3     
  Misses      45696    45696           
  Partials     1486     1486

| Flag | Coverage Δ | |
| --- | --- | --- |
| hive | `39.96% <85.71%> (+<0.01%)` | ⬆️ |
| mysql | `60.60% <100.00%> (+<0.01%)` | ⬆️ |
| postgres | `60.68% <100.00%> (+<0.01%)` | ⬆️ |
| presto | `41.76% <85.71%> (+<0.01%)` | ⬆️ |
| python | `62.27% <100.00%> (+<0.01%)` | ⬆️ |
| sqlite | `60.31% <100.00%> (+<0.01%)` | ⬆️ |
| unit | `100.00% <100.00%> (ø)` | |


pull-request-size Bot added the size/M label Apr 14, 2026

sitelight force-pushed the fix/sanitize-clause-preserve-user-sql branch 2 times, most recently from 6fbf764 to ec0ff1c Compare April 14, 2026 11:37

pull-request-size Bot added size/L and removed size/M labels Apr 14, 2026

sitelight changed the title ~~fix(sql): preserve multi-arg DISTINCT in sanitize_clause~~ fix(sql): preserve multi-arg DISTINCT in sanitize_clause and format Apr 14, 2026

sitelight force-pushed the fix/sanitize-clause-preserve-user-sql branch from ec0ff1c to 4808b03 Compare April 14, 2026 11:47

sitelight marked this pull request as ready for review April 14, 2026 11:48

dosubot Bot added the change:backend (Requires changing the backend) and sqllab (Namespace | Anything related to the SQL Lab) labels Apr 14, 2026

sitelight force-pushed the fix/sanitize-clause-preserve-user-sql branch from 4808b03 to 2a3321b Compare April 14, 2026 13:02

Merge branch 'master' into fix/sanitize-clause-preserve-user-sql

c951de0

sadpandajoe added the 🎪 ⚡ showtime-trigger-start (Create new ephemeral environment for this PR) label Apr 15, 2026

github-actions Bot removed the 🎪 🎯 c951de0 (Active environment pointer - c951de0 is receiving traffic) label Apr 15, 2026

sadpandajoe added the 🎪 ⚡ showtime-trigger-stop label Apr 15, 2026

sadpandajoe requested review from betodealmeida, Copilot and msyavuz May 5, 2026 17:29

Copilot AI reviewed May 5, 2026

sitelight wants to merge 2 commits into apache:master from sitelight:fix/sanitize-clause-preserve-user-sql
