Skip to content

Conversation

@rok
Copy link
Member

@rok rok commented Mar 20, 2021

ARROW-9054.

This adds ScalarAggregateOptions to control null behavior of mean and sum kernels.

@github-actions
Copy link

@rok
Copy link
Member Author

rok commented Mar 21, 2021

As per comments in Jiras this should introduce the following:

  • sum - don't promote type (e.g. int32 to int64) if possible
  • sum / mean - should have min_count and return NA if the count of non-nulls is lower than min_count.
  • sum / mean - should short circuit when it becomes impossible for non-null count to be below min_count.
  • mean - should have skipna and keepna options.

@rok rok force-pushed the ARROW-9054 branch 2 times, most recently from d210140 to 97c78d6 Compare March 25, 2021 16:17
@rok rok force-pushed the ARROW-9054 branch 2 times, most recently from c71b1f8 to 6e89d74 Compare April 12, 2021 16:02
@nealrichardson
Copy link
Member

Should MinMax be updated to use ScalarAggregateOptions?

You should also be able to remove https://github.com/apache/arrow/blob/master/r/R/compute.R#L113-L117 and search for any other references to that jira number in the R code/tests.

@rok
Copy link
Member Author

rok commented Apr 13, 2021

Should MinMax be updated to use ScalarAggregateOptions?

I'll look into it. Count would be a candidate as well.

You should also be able to remove https://github.com/apache/arrow/blob/master/r/R/compute.R#L113-L117 and search for any other references to that jira number in the R code/tests.

Will do.

@rok
Copy link
Member Author

rok commented Apr 15, 2021

This is slowly coming together. Remaining todo:

  • min/max must cover R edge cases
  • sum - don't promote type (e.g. int32 to int64) if possible
  • sum / mean - should short circuit when it becomes impossible for non-null count to be below min_count.

@jorisvandenbossche
Copy link
Member

sum - don't promote type (e.g. int32 to int64) if possible

Is this desired behaviour? I think in general the resulting data type should be known based on the input types and operation (and not depend on the values)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make sure I am reading this one correctly, is it basically testing mean([], min_count=0) == 0 ?

I am not sure that should give 0 as result? I would have expected either NaN or Null.
Basically mean of empty is dividing 0/0, which is NaN.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I think I've put it there to have it documented while I was changing other things.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems you changed it to return null now? Should it rather return NaN? (I assume it could be consistent with a manual mean from sum/count with min_count=0)

@jorisvandenbossche
Copy link
Member

Looking forward to this! Added a few comments on naming and exact expected semantics.

@rok rok force-pushed the ARROW-9054 branch 2 times, most recently from 85e8bf0 to 1478720 Compare April 16, 2021 02:07
@rok rok marked this pull request as ready for review April 16, 2021 02:20
@rok
Copy link
Member Author

rok commented Apr 16, 2021

This requires some more work on C Glib but is ready for review otherwise.

@rok
Copy link
Member Author

rok commented Apr 16, 2021

@jorisvandenbossche & @nealrichardson could you take a look at this proposal?
I'm not very confident about the c_glib and Ruby changes, who could review that?

@rok
Copy link
Member Author

rok commented Apr 16, 2021

Should we introduce ScalarAggregateOptions in All and Any as well?

@rok rok force-pushed the ARROW-9054 branch 5 times, most recently from 40f5e49 to ce22f1a Compare April 17, 2021 18:43
@rok
Copy link
Member Author

rok commented May 20, 2021

Some checks timed out. Rebased to restart the tasks.

@kou
Copy link
Member

kou commented May 21, 2021

The "R / AMD64 Windows RTools 35" failure was occurred on master too: https://github.com/apache/arrow/runs/2634288057

It's unrelated to this. I'll merge this.

@kou kou closed this in 0a82432 May 21, 2021
@rok
Copy link
Member Author

rok commented May 21, 2021

Thanks for the help everyone!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants