fix duplicated schema name error from count wildcard#14824
fix duplicated schema name error from count wildcard#14824alamb merged 28 commits intoapache:mainfrom
Conversation
| // handle count() case | ||
| if expr.is_empty() { | ||
| return Ok(vec![ | ||
| Arc::new(Int64Array::from(vec![1; batch.num_rows()])) as ArrayRef |
There was a problem hiding this comment.
This is equivalent to count(1) case
There was a problem hiding this comment.
It seems that this function is not only used by count. I'm not quite sure about the impact of this change.
Ideally, this function should not involve the logic of any specific aggregation function.
| .collect::<Result<Vec<_>>>()?; | ||
| // Handle count(*) case | ||
| let values = if expr.is_empty() { | ||
| vec![Arc::new(Int64Array::from(vec![1; n_rows])) as ArrayRef] |
There was a problem hiding this comment.
This is equivalent to count(1) case
|
fix the extended test in main branch |
Makes sesne -- thank you BTW I have another interim workaround here: I think we can use that to regenerate the output for this PR |
|
After merging apache/datafusion-testing#7 and update commit, I guess is good to go |
|
I just merged apache/datafusion-testing#7 |
|
|
||
| let sql_results = ctx | ||
| .sql("select b,count(*) from t1 group by b order by count(*)") | ||
| .sql("select b,count(1) from t1 group by b order by count(1)") |
There was a problem hiding this comment.
I had to double check -- the reason this needs to change is that the test is comparing again a dataframe built with count_all() which now uses count(1)
Though maybe we could change count_all() to return count(1) as "count(*)" so it would be consistent with older versions?
|
|
||
| /// If the qualified name of an expression is remembered, it will be preserved | ||
| /// when rewriting the expression | ||
| #[derive(Debug)] |
|
I think we need to update the datafusion-testing pin -- closing/reopening this PR to rerun the tests to make sure |
|
NM I think things are clean now |
alamb
left a comment
There was a problem hiding this comment.
Thanks @jayzhan211 -- since this PR fixes a bunch of tests and gets the main branch back to green, I am going to merge it. We can then address the count_all() function name as a follow on PR
|
|
||
| let sql_results = ctx | ||
| .sql("select b,count(*) from t1 group by b order by count(*)") | ||
| .sql("select b,count(1) from t1 group by b order by count(1)") |
There was a problem hiding this comment.
I found I could avoid the double alias by adding a check in Expr::alias:
diff --git a/datafusion/expr/src/expr.rs b/datafusion/expr/src/expr.rs
index f8baf9c94..2f3c2c575 100644
--- a/datafusion/expr/src/expr.rs
+++ b/datafusion/expr/src/expr.rs
@@ -1276,7 +1276,14 @@ impl Expr {
/// Return `self AS name` alias expression
pub fn alias(self, name: impl Into<String>) -> Expr {
- Expr::Alias(Alias::new(self, None::<&str>, name.into()))
+ let name = name.into();
+ // don't realias the same thing
+ if matches!(&self, Expr::Alias(Alias {name: existing_name, ..} ) if existing_name == &name)
+ {
+ self
+ } else {
+ Expr::Alias(Alias::new(self, None::<&str>, name))
+ }
}
/// Return `self AS name` alias expression with a specific qualifier
@@ -1285,7 +1292,15 @@ impl Expr {
relation: Option<impl Into<TableReference>>,
name: impl Into<String>,
) -> Expr {
- Expr::Alias(Alias::new(self, relation, name.into()))
+ let relation = relation.map(|r| r.into());
+ let name = name.into();
+ // don't realias the same thing
+ if matches!(&self, Expr::Alias(Alias {name: existing_name, relation: existing_relation, ..} ) if existing_name == &name && relation.as_ref()==existing_relation.as_ref() )
+ {
+ self
+ } else {
+ Expr::Alias(Alias::new(self, relation, name))
+ }
}
/// Remove an alias from an expression if one exists.
diff --git a/datafusion/functions-aggregate/src/count.rs b/datafusion/functions-aggregate/src/count.rs
index a3339f0fc..1faf1968b 100644
--- a/datafusion/functions-aggregate/src/count.rs
+++ b/datafusion/functions-aggregate/src/count.rs
@@ -81,7 +81,7 @@ pub fn count_distinct(expr: Expr) -> Expr {
/// Creates aggregation to count all rows, equivalent to `COUNT(*)`, `COUNT()`, `COUNT(1)`
pub fn count_all() -> Expr {
- count(Expr::Literal(COUNT_STAR_EXPANSION))
+ count(Expr::Literal(COUNT_STAR_EXPANSION)).alias("count(*)")
}
#[user_doc(| let plan = test_sql(sql)?; | ||
| let expected = | ||
| "Aggregate: groupBy=[[]], aggr=[[count(*)]]\ | ||
| "Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]]\ |
There was a problem hiding this comment.
this certainly seems an improvement
|
Let's get the tests clean |
|
Thanks @alamb. I will file related issue as follow-up |
Thansk! Note I did file |
|
The tests are green again on main! |
Which issue does this PR close?
We convert count(constant) i.e. count(2) to count(*) in previous PR
so
select count(1) * count(2)produces duplicated schema name error given both arecount(*)in schema name.select count(), count(*)does not work #14855Rationale for this change
Instead of converting
count()andcount(*)tocount(1). We makescount()possible as a replacement of count wildcard. In this case,count(1)can be treated as the normal case (although it is equivalent to wildcard), without this we need to handle many different complex case forcount(1)such ascount(cast(1 as i32)). The schema name is much more consistent with DuckDB too.What changes are included in this PR?
Implement count with zero arg in aggregate function level.
count()-> count()count(*)-> count()count(1)-> count(1)count(2)-> count(2)Are these changes tested?
Are there any user-facing changes?