fix duplicated schema name error from count wildcard by jayzhan211 · Pull Request #14824 · apache/datafusion

jayzhan211 · 2025-02-22T11:53:30Z

Which issue does this PR close?

A previous PR converts count(constant), e.g. count(2), to count(*),
so select count(1) * count(2) produces a duplicated schema name error, because both expressions are named count(*) in the schema.

Rationale for this change

Instead of converting count() and count(*) to count(1), we make count() itself a valid replacement for the count wildcard. count(1) can then be treated as a normal case (although it is equivalent to the wildcard); without this we would need to handle many complex variants of count(1), such as count(cast(1 as i32)). The schema name is also much more consistent with DuckDB.

What changes are included in this PR?

Implement count with zero arguments at the aggregate function level.

count() -> count()
count(*) -> count()
count(1) -> count(1)
count(2) -> count(2)
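
The mapping above can be sketched with a toy model. This is not DataFusion's code: `CountArg`, `schema_name`, and `has_duplicate_names` are hypothetical names, and the names follow the shorthand of this description (the display in a real plan may differ). It only illustrates why keeping literal arguments distinct avoids the collision.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the argument of a `count` call as written in SQL.
#[derive(Debug, Clone, PartialEq)]
pub enum CountArg {
    /// `count()`
    Empty,
    /// `count(*)`
    Wildcard,
    /// `count(<integer literal>)`, e.g. `count(1)`
    Literal(i64),
}

/// Schema name under the mapping listed in this description:
/// the wildcard collapses to `count()`, literals keep their own name.
pub fn schema_name(arg: &CountArg) -> String {
    match arg {
        CountArg::Empty | CountArg::Wildcard => "count()".to_string(),
        CountArg::Literal(v) => format!("count({})", v),
    }
}

/// Returns true if any name occurs more than once,
/// which is the condition behind a "duplicated schema name" error.
pub fn has_duplicate_names(names: &[String]) -> bool {
    let mut seen = HashSet::new();
    // `insert` returns false when the value was already present.
    names.iter().any(|n| !seen.insert(n.as_str()))
}
```

With this mapping, count(1) and count(2) yield two different names, whereas naming both count(*) would trip the duplicate check.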

Are these changes tested?

Are there any user-facing changes?

jayzhan211 · 2025-02-22T13:50:11Z

datafusion/physical-plan/src/aggregates/mod.rs

+    // handle count() case
+    if expr.is_empty() {
+        return Ok(vec![
+            Arc::new(Int64Array::from(vec![1; batch.num_rows()])) as ArrayRef


This is equivalent to the count(1) case.

jonahgao · 2025-02-24

It seems that this function is not only used by count. I'm not quite sure about the impact of this change.
Ideally, this function should not involve the logic of any specific aggregation function.

jayzhan211 · 2025-02-22T13:50:19Z

datafusion/physical-plan/src/aggregates/no_grouping.rs

-                .collect::<Result<Vec<_>>>()?;
+            // Handle count(*) case
+            let values = if expr.is_empty() {
+                vec![Arc::new(Int64Array::from(vec![1; n_rows])) as ArrayRef]


This is equivalent to the count(1) case.
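
As a rough illustration of why an empty argument list is equivalent to count(1), here is a toy sketch that uses plain vectors instead of Arrow arrays. `Column`, `evaluate_args`, and `count_non_null` are hypothetical names for this example, not functions from the physical-plan crate.

```rust
/// Toy column: `None` models a SQL NULL.
pub type Column = Vec<Option<i64>>;

/// If the aggregate has no arguments (the `count()` case), synthesize a
/// single column of ones with one entry per input row; otherwise pass
/// the evaluated argument columns through unchanged.
pub fn evaluate_args(args: &[Column], n_rows: usize) -> Vec<Column> {
    if args.is_empty() {
        vec![vec![Some(1); n_rows]]
    } else {
        args.to_vec()
    }
}

/// `count` semantics: the number of non-null values in the column.
pub fn count_non_null(column: &[Option<i64>]) -> usize {
    column.iter().filter(|v| v.is_some()).count()
}
```

Because the synthesized column never contains a null, counting it yields the number of rows, which is exactly what count(1) returns.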

jayzhan211 · 2025-02-24T01:21:01Z

fix the extended test in main branch

alamb · 2025-02-25T22:41:58Z

I think the issue is that the runner in https://github.com/Omega359/sqllogictest-rs is based on an older version of sqllogictest than the one we use in datafusion.
I have an idea for a workaround, but longer term we probably need to make the update easier to maintain.

That is exactly what I was thinking and hopefully will fix tonight. I think a decent short-term fix is to 'lock' the sqllogictest-rs dependency version and add a comment that any update to it will require a full run of the regenerate script before committing.

Long term ideally would be to improve my changes to my fork of sqllogictest-rs such that they would be suitable to submit a PR to that project. That is not an insignificant amount of work to be honest and I'm a bit thin on time for the next month or two.

Makes sense -- thank you

BTW I have another interim workaround here:

Fix regenerate_sqlite_files.sh due to changes in sqllogictests #14881

I think we can use that to regenerate the output for this PR

jayzhan211 · 2025-02-26T11:48:35Z

After merging apache/datafusion-testing#7 and updating the commit, I think this is good to go.

alamb · 2025-02-26T12:01:32Z

I just merged apache/datafusion-testing#7

alamb

Thank you @jayzhan211

alamb · 2025-02-26T12:06:33Z

datafusion/core/tests/dataframe/mod.rs


    let sql_results = ctx
-        .sql("select b,count(*) from t1 group by b order by count(*)")
+        .sql("select b,count(1) from t1 group by b order by count(1)")


I had to double check -- the reason this needs to change is that the test is comparing against a dataframe built with count_all(), which now uses count(1)

Though maybe we could change count_all() to return count(1) as "count(*)" so it would be consistent with older versions?

alamb · 2025-02-26T12:06:46Z

datafusion/expr/src/expr_rewriter/mod.rs


 /// If the qualified name of an expression is remembered, it will be preserved
 /// when rewriting the expression
+#[derive(Debug)]


alamb · 2025-02-26T12:08:15Z

I think we need to update the datafusion-testing pin -- closing/reopening this PR to rerun the tests to make sure

alamb · 2025-02-26T12:31:53Z

NM I think things are clean now

alamb

Thanks @jayzhan211 -- since this PR fixes a bunch of tests and gets the main branch back to green, I am going to merge it. We can then address the count_all() function name as a follow on PR

alamb · 2025-02-26T13:39:10Z

datafusion/core/tests/dataframe/mod.rs


    let sql_results = ctx
-        .sql("select b,count(*) from t1 group by b order by count(*)")
+        .sql("select b,count(1) from t1 group by b order by count(1)")


I found I could avoid the double alias by adding a check in Expr::alias:

diff --git a/datafusion/expr/src/expr.rs b/datafusion/expr/src/expr.rs
index f8baf9c94..2f3c2c575 100644
--- a/datafusion/expr/src/expr.rs
+++ b/datafusion/expr/src/expr.rs
@@ -1276,7 +1276,14 @@ impl Expr {
     /// Return `self AS name` alias expression
     pub fn alias(self, name: impl Into<String>) -> Expr {
-        Expr::Alias(Alias::new(self, None::<&str>, name.into()))
+        let name = name.into();
+        // don't realias the same thing
+        if matches!(&self, Expr::Alias(Alias {name: existing_name, ..} ) if existing_name == &name)
+        {
+            self
+        } else {
+            Expr::Alias(Alias::new(self, None::<&str>, name))
+        }
     }
     /// Return `self AS name` alias expression with a specific qualifier
@@ -1285,7 +1292,15 @@ impl Expr {
         relation: Option<impl Into<TableReference>>,
         name: impl Into<String>,
     ) -> Expr {
-        Expr::Alias(Alias::new(self, relation, name.into()))
+        let relation = relation.map(|r| r.into());
+        let name = name.into();
+        // don't realias the same thing
+        if matches!(&self, Expr::Alias(Alias {name: existing_name, relation: existing_relation, ..} ) if existing_name == &name && relation.as_ref()==existing_relation.as_ref() )
+        {
+            self
+        } else {
+            Expr::Alias(Alias::new(self, relation, name))
+        }
     }
     /// Remove an alias from an expression if one exists.
diff --git a/datafusion/functions-aggregate/src/count.rs b/datafusion/functions-aggregate/src/count.rs
index a3339f0fc..1faf1968b 100644
--- a/datafusion/functions-aggregate/src/count.rs
+++ b/datafusion/functions-aggregate/src/count.rs
@@ -81,7 +81,7 @@ pub fn count_distinct(expr: Expr) -> Expr {
 /// Creates aggregation to count all rows, equivalent to `COUNT(*)`, `COUNT()`, `COUNT(1)`
 pub fn count_all() -> Expr {
-    count(Expr::Literal(COUNT_STAR_EXPANSION))
+    count(Expr::Literal(COUNT_STAR_EXPANSION)).alias("count(*)")
 }
 #[user_doc(
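
The "don't realias the same thing" check can be tried in isolation with a simplified expression type. This is only a sketch of the idea: `ToyExpr` is a hypothetical enum invented for the example, not DataFusion's `Expr`, and it ignores qualifiers (the `relation` part of the second hunk).

```rust
/// Hypothetical, simplified expression type; not DataFusion's `Expr`.
#[derive(Debug, Clone, PartialEq)]
pub enum ToyExpr {
    /// A column or function call, identified by its display name.
    Named(String),
    /// `expr AS name`
    Alias { expr: Box<ToyExpr>, name: String },
}

impl ToyExpr {
    /// Return `self AS name`, unless `self` is already aliased to `name`,
    /// in which case `self` is returned unchanged.
    pub fn alias(self, name: impl Into<String>) -> ToyExpr {
        let name = name.into();
        // The guard only borrows `self`, so it can still be moved afterwards.
        let already_aliased =
            matches!(&self, ToyExpr::Alias { name: existing, .. } if existing == &name);
        if already_aliased {
            self
        } else {
            ToyExpr::Alias {
                expr: Box::new(self),
                name,
            }
        }
    }
}
```

Under this sketch, aliasing to the same name twice is a no-op, while aliasing to a different name still wraps the existing alias.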

alamb · 2025-02-26T13:40:12Z

datafusion/optimizer/tests/optimizer_integration.rs

    let plan = test_sql(sql)?;
    let expected =
-        "Aggregate: groupBy=[[]], aggr=[[count(*)]]\
+        "Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]]\


this certainly seems an improvement

alamb · 2025-02-26T13:40:56Z

Let's get the tests clean

jayzhan211 · 2025-02-26T13:56:11Z

Thanks @alamb. I will file related issue as follow-up

alamb · 2025-02-26T14:20:35Z

Thanks! Note I did file

Change in 46: count_all() expr_fn function now displayed as count(1) rather than count(*) #14894

alamb · 2025-02-26T15:44:55Z

The tests are green again on main!
https://github.com/apache/datafusion/actions/runs/13545248421/job/37855153112

fix name

0af4ab9

github-actions bot added sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Feb 22, 2025

jayzhan211 changed the title ~~Fix duplicated schema name of count wildcard issue~~ Fix duplicated schema name error from count wildcard Feb 22, 2025

upd doc

7f18e05

jayzhan211 mentioned this pull request Feb 22, 2025

Remove CountWildcardRule in Analyzer and move the functionality in ExprPlanner, add plan_aggregate and plan_window to planner #14689

Merged

jayzhan211 requested a review from jonahgao February 22, 2025 12:08

jayzhan211 marked this pull request as draft February 22, 2025 12:20

jayzhan211 added 3 commits February 22, 2025 20:24

drop table

3ef7ddd

real count()

40385aa

clippy

a456792

github-actions bot added logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates labels Feb 22, 2025

jayzhan211 commented Feb 22, 2025

jayzhan211 marked this pull request as ready for review February 22, 2025 13:51

jayzhan211 changed the title ~~Fix duplicated schema name error from count wildcard~~ Implement actual count wildcard in physical layer and fix duplicated schema name error from count wildcard Feb 22, 2025

jayzhan211 marked this pull request as draft February 22, 2025 14:14

jayzhan211 added 3 commits February 22, 2025 22:22

fix tests

3497965

fix test

d956307

fix other tests

e24cf29

github-actions bot added sql SQL Planner optimizer Optimizer rules substrait Changes to the substrait crate labels Feb 22, 2025

jayzhan211 added 3 commits February 23, 2025 07:57

fix proto test

e54d4b8

fix substrait test

6ee5a35

fnt

2a2d0d3

jayzhan211 marked this pull request as ready for review February 23, 2025 03:15

jayzhan211 requested a review from alamb February 24, 2025 01:20

jayzhan211 added 3 commits February 26, 2025 09:05

fix

610c9a3

fix tests

cb6c975

Merge branch 'main' of github.com:apache/datafusion into count-schema-name

90a7b0a

jayzhan211 mentioned this pull request Feb 26, 2025

Update test for datafusion #14824 apache/datafusion-testing#7

Merged

jayzhan211 added 3 commits February 26, 2025 10:21

avro

6550841

upd testing

5f55161

tpch

2a8f4f4

alamb mentioned this pull request Feb 26, 2025

Release DataFusion 46.0.0 #14123

Closed

26 tasks

alamb marked this pull request as ready for review February 26, 2025 12:01

alamb reviewed Feb 26, 2025

alamb approved these changes Feb 26, 2025

upd test

70280b6

alamb closed this Feb 26, 2025

alamb reopened this Feb 26, 2025

alamb approved these changes Feb 26, 2025

alamb merged commit 9278233 into apache:main Feb 26, 2025
24 of 47 checks passed

alamb mentioned this pull request Feb 26, 2025

Change in 46: count_all() expr_fn function now displayed as count(1) rather than count(*) #14894

Closed

jayzhan211 deleted the count-schema-name branch February 26, 2025 13:56

Omega359 mentioned this pull request Feb 26, 2025

Update regenerate sql dep, revert runner changes. #14901

Merged

alamb mentioned this pull request Feb 26, 2025

Improve regeneration of sqlite expected test suite #14906

Open

gabotechs mentioned this pull request Mar 21, 2025

Fix empty aggregation function count() in Substrait #15345

Merged

5 participants
