Skip to content

SELECT ... FROM (tbl1 UNION tbl2) wrongly works like SELECT DISTINCT ... FROM (tbl1 UNION tbl2) #4667

@albel727

Description

@albel727

The title says it all. But in other words:
if a table constructed with UNION [DISTINCT] is used as a subtable for a SELECT statement,
and not all columns of the table are mentioned in the select, the output is incorrectly
deduplicated while considering only the mentioned columns, instead of all columns of the subtable.

To Reproduce

Consider a simple table, constructed with a UNION statement:

+----+------+
| id | data |
+----+------+
| 1  | 10   |
| 1  | 20   |
+----+------+

Let's select the id column from it:

let ctx = SessionContext::new();
let df = ctx.sql("SELECT id FROM (SELECT 1 as id, 10 as data UNION SELECT 1, 20)").await?;
df.show().await?;

Expected output

+----+
| id |
+----+
| 1  |
| 1  |
+----+

Actual output

+----+
| id |
+----+
| 1  |
+----+

Additional context

Of course, in this particular example this wrong behavior could be worked around by using UNION ALL, but this isn't always possible.
E.g. the following query, which detects rows that have same ids but differing data in other columns, can't be (easily) rewritten in terms of UNION ALL:

let df = ctx.sql("SELECT id FROM (SELECT * FROM tbl1 UNION SELECT * FROM tbl2) GROUP BY id HAVING count(*) = 2").await?;

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions