
Conversation

@Jefffrey (Contributor) commented Aug 20, 2025

Which issue does this PR close?

Relates to #2408

  • Decimal support is not included, so the issue can't be closed yet

Rationale for this change

Builds on the old PR #15413 to get it over the line. Opened as a new PR because the other one was too old and I wasn't sure whether I should push to the original author's branch; for now I've pushed to my own branch but preserved the original commits.

What changes are included in this PR?

From original PR:

  1. Move DistinctSumAccumulator to common so that it can be used in Float64DistinctAvgAccumulator
  2. Implement Float64DistinctAvgAccumulator using DistinctSumAccumulator (a simplified sketch of the idea is shown just after this list)
  3. Add tests in aggregate.slt
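
To make the idea concrete, here is a simplified, hypothetical sketch (not the PR's actual code; the real implementation wraps DistinctSumAccumulator and implements DataFusion's Accumulator trait): a distinct average deduplicates the input values, sums the distinct values, and divides by the distinct count.

    use std::collections::HashSet;

    /// Hypothetical stand-in for the real accumulator: the PR delegates the
    /// distinct bookkeeping to DistinctSumAccumulator, while this sketch just
    /// keeps a HashSet of f64 bit patterns so equal values are counted once.
    #[derive(Default)]
    struct DistinctAvgSketch {
        distinct: HashSet<u64>, // f64::to_bits makes the values hashable
    }

    impl DistinctAvgSketch {
        fn update(&mut self, value: f64) {
            self.distinct.insert(value.to_bits());
        }

        fn evaluate(&self) -> Option<f64> {
            if self.distinct.is_empty() {
                return None; // AVG over zero rows is NULL
            }
            let sum: f64 = self.distinct.iter().map(|b| f64::from_bits(*b)).sum();
            Some(sum / self.distinct.len() as f64)
        }
    }

    fn main() {
        let mut acc = DistinctAvgSketch::default();
        for v in [1.0_f64, 2.0, 2.0, 4.0] {
            acc.update(v);
        }
        // distinct values are {1.0, 2.0, 4.0}, so AVG(DISTINCT) is 7.0 / 3.0
        assert_eq!(acc.evaluate(), Some(7.0 / 3.0));
    }

The real accumulator additionally has to serialize its distinct set into the state fields discussed in the review below, which is where the field-count error came from.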

Additional changes made by me:

  • Fixed the error "query error DataFusion error: Arrow error: Invalid argument error: number of columns\(1\) must match number of fields\(2\) in schema" (Support Avg distinct for float64 type #15413 (comment)), which was caused by wrong state fields (and also by not disabling group accumulator support when distinct)
  • Updated the SLT test cases (removed the decimal test cases since those aren't relevant for this float-only PR)

Are these changes tested?

Added SLT tests, and also regenerated the extended tests: apache/datafusion-testing#11

Are there any user-facing changes?

Not sure if doc changes are required. This is mainly a SQL-level change, and since the docs don't seem to explicitly say avg(DISTINCT) is disallowed, there may be nothing to update to say that it now works.

The github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) and functions (Changes to functions implementation) labels on Aug 20, 2025
Comment on lines 168 to 199
    -    Ok(vec![
    -        Field::new(
    -            format_state_name(args.name, "count"),
    -            DataType::UInt64,
    -            true,
    -        ),
    -        Field::new(
    -            format_state_name(args.name, "sum"),
    -            args.input_fields[0].data_type().clone(),
    -            true,
    -        ),
    -    ]
    -    .into_iter()
    -    .map(Arc::new)
    -    .collect())
    +    if args.is_distinct {
    +        // Copied from datafusion_functions_aggregate::sum::Sum::state_fields
    +        // since the accumulator uses DistinctSumAccumulator internally.
    +        Ok(vec![Field::new_list(
    +            format_state_name(args.name, "sum distinct"),
    +            Field::new_list_field(args.return_type().clone(), true),
    +            false,
    +        )
    +        .into()])
    +    } else {
    +        Ok(vec![
    +            Field::new(
    +                format_state_name(args.name, "count"),
    +                DataType::UInt64,
    +                true,
    +            ),
    +            Field::new(
    +                format_state_name(args.name, "sum"),
    +                args.input_fields[0].data_type().clone(),
    +                true,
    +            ),
    +        ]
    +        .into_iter()
    +        .map(Arc::new)
    +        .collect())
    +    }
Contributor Author

This is my main contribution on top of the changes from the original PR: fixing that error about the differing field counts. I wonder if there's a better way to architect this, since it wasn't obvious that the error was related to the accumulator's state.

Contributor

I don't really understand this question -- the PR's code looks good to me

Contributor Author

Sorry, what I meant was that in accumulators we can return the state() like so:

    fn state(&mut self) -> Result<Vec<ScalarValue>> {
        // 1. Stores aggregate state in `ScalarValue::List`
        // 2. Constructs `ScalarValue::List` state from distinct numeric stored in hash set
        let state_out = {
            let distinct_values = self
                .values
                .iter()
                .map(|value| {
                    ScalarValue::new_primitive::<T>(Some(value.0), &self.data_type)
                })
                .collect::<Result<Vec<_>>>()?;
            vec![ScalarValue::List(ScalarValue::new_list_nullable(
                &distinct_values,
                &self.data_type,
            ))]
        };
        Ok(state_out)
    }

However this must align with state_fields() of the parent aggregate UDF:

    fn state_fields(&self, args: StateFieldsArgs) -> Result<Vec<FieldRef>> {
        if args.is_distinct {
            Ok(vec![Field::new_list(
                format_state_name(args.name, "sum distinct"),
                // See COMMENTS.md to understand why nullable is set to true
                Field::new_list_field(args.return_type().clone(), true),
                false,
            )
            .into()])
        } else {
            Ok(vec![Field::new(
                format_state_name(args.name, "sum"),
                args.return_type().clone(),
                true,
            )
            .into()])
        }
    }

But this alignment isn't obvious at compile time, and at runtime we only hit the mismatch for certain test cases (for this distinct avg PR). So I was wondering if there was a better way to enforce it at compile time. Hope that clears it up.
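
For what it's worth, one low-tech way to surface the mismatch earlier than "only for certain test cases" would be a unit test or debug assertion that compares the arity of state() against state_fields() for the same arguments. This is purely a hypothetical sketch (check_state_arity is not a DataFusion API), with plain strings standing in for ScalarValue and FieldRef:

    /// Hypothetical helper, not part of this PR or of DataFusion: given the values an
    /// accumulator returned from `state()` and the fields its UDF declared in
    /// `state_fields()`, fail loudly when the counts disagree.
    fn check_state_arity<S, F>(states: &[S], fields: &[F]) -> Result<(), String> {
        if states.len() == fields.len() {
            Ok(())
        } else {
            Err(format!(
                "accumulator returned {} state value(s) but the UDF declared {} state field(s)",
                states.len(),
                fields.len()
            ))
        }
    }

    fn main() {
        // The shape of the bug this PR fixed: one distinct-sum list in state(), but
        // two declared fields ("count" and "sum") before the is_distinct branch existed.
        let states = vec!["sum distinct list"];
        let fields = vec!["count", "sum"];
        assert!(check_state_arity(&states, &fields).is_err());
    }

A true compile-time guarantee would need the state layout encoded in the types, which would be a much bigger change than this PR.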

@Jefffrey Jefffrey marked this pull request as ready for review August 20, 2025 06:41
@Jefffrey (Contributor Author) commented Aug 20, 2025

Looks like I'm still getting the same error for some of the extended tests:

    # Datafusion - Datafusion expected results:
    query error DataFusion error: Arrow error: Invalid argument error: number of columns\(5\) must match number of fields\(4\) in schema
    SELECT ALL - AVG ( ALL + col0 ) AS col1 FROM tab0 GROUP BY col0 HAVING + AVG ( DISTINCT - col2 ) IS NULL

Will look into this

Edit: seems to be related to group accumulator support

@Jefffrey Jefffrey marked this pull request as draft August 20, 2025 08:44
            args.return_field.data_type(),
            DataType::Float64 | DataType::Decimal128(_, _) | DataType::Duration(_)
        )
    ) && !args.is_distinct
Contributor Author

Similar to how sum handles it:

    fn groups_accumulator_supported(&self, args: AccumulatorArgs) -> bool {
        !args.is_distinct
    }

@Jefffrey (Contributor Author)

> Looks like I'm still getting the same error for some of the extended tests:
>
>     # Datafusion - Datafusion expected results:
>     query error DataFusion error: Arrow error: Invalid argument error: number of columns\(5\) must match number of fields\(4\) in schema
>     SELECT ALL - AVG ( ALL + col0 ) AS col1 FROM tab0 GROUP BY col0 HAVING + AVG ( DISTINCT - col2 ) IS NULL
>
> Will look into this
>
> Edit: seems to be related to group accumulator support

Fixed by bc121fb

@Jefffrey Jefffrey marked this pull request as ready for review August 20, 2025 09:44
@Omega359 (Contributor)

This will, I assume, require regenerating the extended slt files in datafusion-testing?

@alamb (Contributor) left a comment

Thank you @Jefffrey -- I think the PR looks quite good. I think it should just have a few more tests, but otherwise 👌

    (1, 1),
    (2, 2),
    (3, 3),
    (4, 4),
Contributor

Could you update this test so:

  1. The input isn't in order
  2. Add a test for floating point values
  3. Test for an input that includes at least one null value
  4. The values in b are different than the values in a

Contributor Author

Will work on adding these cases 👍

Contributor Author

Addressed those points in this commit: 3abb4b7


Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@Jefffrey (Contributor Author)

> This will, I assume, require regenerating the extended slt files in datafusion-testing?

Yep, refer to apache/datafusion-testing#11

@alamb (Contributor) commented Aug 23, 2025

I think all we need now is to merge apache/datafusion-testing#11 and then update the datafusion-testing pin on this PR.

I am testing locally with the changes from apache/datafusion-testing#11 using:

    INCLUDE_SQLITE=true cargo test --profile release-nonlto --test sqllogictests

    SELECT array_agg(a_varchar order by a_varchar) WITHIN GROUP (ORDER BY a_varchar)
    FROM (VALUES ('a'), ('d'), ('c'), ('a')) t(a_varchar);

    # distinct average
Contributor

❤️

@alamb (Contributor) commented Aug 23, 2025

        Finished `release-nonlto` profile [optimized] target(s) in 2m 09s
          Running bin/sqllogictests.rs (target/release-nonlto/deps/sqllogictests-78c77c9b80ab3916)
    Completed 942 test files in 4 minutes
    andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$

👍

@alamb (Contributor) commented Aug 23, 2025

I updated the pin and merged up from main

@Jefffrey (Contributor Author)

> I updated the pin and merged up from main

Cheers 👍

@Jefffrey Jefffrey merged commit 241e47d into apache:main Aug 24, 2025
27 checks passed
@Jefffrey Jefffrey deleted the pr_15413 branch August 24, 2025 01:03
@alamb (Contributor) commented Aug 24, 2025

🚀
