ARROW-14608: [Python] Provide access to hash_aggregate functions through a Table.group_by method #11624
    self._set_options(q, delta, buffer_size, skip_nulls, min_count)


def _group_by(args, keys, aggregations):
We can also make this a public function in the compute module?
Not sure, should we? I made it internal because we plan to replace this with the exec engine in the long term, so I guess that the Table.group_by implementation will switch to use something different in the future.
> I made it internal because we plan to replace this with the exec engine in the long term, so I guess that the Table.group_by implementation will switch to use something different in the future.
The same could be done for a pyarrow.compute function? (it doesn't map 1:1 to a C++ kernel anyway)
For me, one reason to put it in the compute functions as pc.group_by(table, keys, ...) is to sidestep the one-step vs. two-step API discussion for the method a bit. For a function in compute, I think it's totally fine for it to be a one-step function.
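As a rough illustration of the one-step function form suggested above, here is a minimal pure-Python sketch. It is not pyarrow's implementation: the `group_by` helper, its signature, the `AGGREGATIONS` mapping, and the dict-of-lists table are all hypothetical stand-ins chosen to keep the example self-contained.

```python
from collections import defaultdict

# Hypothetical mapping of aggregation names to plain Python callables.
AGGREGATIONS = {"sum": sum, "min": min, "max": max, "count": len}


def group_by(table, key, value, func):
    """One-step grouping: bucket `value` by `key`, then aggregate each bucket.

    `table` is a dict mapping column names to equal-length lists
    (an assumption made for this sketch only).
    """
    groups = defaultdict(list)
    for k, v in zip(table[key], table[value]):
        groups[k].append(v)
    aggregate = AGGREGATIONS[func]
    # Keys are emitted in first-seen order, followed by the aggregated column.
    return {
        key: list(groups),
        f"{value}_{func}": [aggregate(vals) for vals in groups.values()],
    }


table = {"keys": ["a", "b", "a", "b", "c"], "values": [1, 2, 3, 4, 5]}
result = group_by(table, "keys", "values", "sum")
```

A single call does both the grouping and the aggregation, so there is no intermediate "grouped" object whose API needs to be designed.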
Bikeshedding on the method name: In other packages, the

"grouped_aggregate" perhaps?

Another possibility is to have a two-step API, e.g. replace:
    table.group_by("keys", ["values"], "sum")
with:
    table.group_by("keys", ["values"]).aggregate("sum")
or perhaps even some shortcuts:
    table.group_by("keys", ["values"]).sum()

+1 on the two-step approach if it is feasible and doesn't add too much complexity to the implementation. Ideally the values would be passed to the aggregate function, not to the grouping function. That's how it works in Ibis, dplyr, and pandas (at least since named aggregation in pandas 0.25.0+)

I think that in such case the aggregated values shouldn't go into I think having a single

I was proposing shortcut methods for the simple cases where you compute only one aggregate. But perhaps that's not useful. (and, yes, you're right, the value columns should go into the aggregate call, not the group_by call. My bad)

Given the small number of aggregate functions and the popularity of that style in pandas, I think that is practical and useful
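The two-step shape discussed above, with the value columns passed to the aggregate call and an optional shortcut for the single-aggregate case, could look roughly like the following pure-Python sketch. `ToyTable`, `ToyGroupBy`, and their methods are hypothetical names invented for illustration; this is not the code in this PR and makes no claim about pyarrow's final signatures.

```python
from collections import defaultdict

# Hypothetical mapping of aggregation names to plain Python callables.
_FUNCS = {"sum": sum, "min": min, "max": max, "count": len}


class ToyTable:
    """Toy column store: a dict of column name -> list of values."""

    def __init__(self, columns):
        self.columns = columns

    def group_by(self, key):
        # Step 1: only record the grouping key; nothing is computed yet.
        return ToyGroupBy(self, key)


class ToyGroupBy:
    def __init__(self, table, key):
        self._table = table
        self._key = key

    def aggregate(self, aggregations):
        # Step 2: `aggregations` is a list of (column, function name) pairs.
        groups = defaultdict(list)  # key value -> row indices, first-seen order
        for i, k in enumerate(self._table.columns[self._key]):
            groups[k].append(i)
        out = {self._key: list(groups)}
        for column, func in aggregations:
            data = self._table.columns[column]
            out[f"{column}_{func}"] = [
                _FUNCS[func]([data[i] for i in rows]) for rows in groups.values()
            ]
        return ToyTable(out)

    def sum(self, column):
        # Shortcut for the common single-aggregate case.
        return self.aggregate([(column, "sum")])


t = ToyTable({"keys": ["a", "b", "a"], "values": [1, 2, 3]})
grouped = t.group_by("keys")
multi = grouped.aggregate([("values", "sum"), ("values", "max")])
short = grouped.sum("values")
```

Keeping the values out of `group_by` means the same grouped object can be reused for several different aggregations.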

I am a bit hesitant to add such a two-step interface to pyarrow. It is indeed how this is done in other packages, but the ones that @ianmcook mentions (ibis, pandas, dplyr) also all have slightly different APIs for specifying this. And then pyarrow would add yet another slightly different interface. (But I also agree that groupby is not a great name for a method on the table, for this reason.) Playing a bit with this branch, some other observations:

I don't have a strong opinion about the single-step or multistep API. I have personally rarely had the need to do a grouping without an associated aggregation, so I feel that the value of the multistep approach isn't huge, even though it might be easier to evolve in the future.
I implemented support for the first two points in dfecba1 that might be convenient for other use cases too (for example when willing to rename only a subset of columns) and would expose the ability to do

@jorisvandenbossche @pitrou I was working toward the multistep API refactoring and I was wondering about the grouping alone case ( At the moment it seems that the

IMHO we should just leave out the plain grouping for the moment.

@pitrou moved to multistep api or at the moment it only has
chungg
left a comment
not sure what other functionality we plan on supporting but syntax makes sense to me. it's similar to other dataframe libraries.

A slightly related silly question: how does the performance compare to pandas at this stage?

Not a real benchmark, but in the current very rough form it seems to be mostly comparable. VS
pitrou
left a comment
Just a nit. @jorisvandenbossche Can you give this a final review?
    function_registry,
    get_function,
    list_functions,
    _group_by
Is there a reason for exposing this publicly? Is this just a leftover from previous attempts?
It's to make it available when _pc() is used to access compute functions from other modules/files (in this case from table.pxi). I adhered to that practice instead of injecting an import of pyarrow._compute.
Given that there are many more internal functions in the pyarrow.compute module I thought it wasn't concerning.
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
jorisvandenbossche
left a comment
Sorry, a few more (mostly docstring) nits.
@jorisvandenbossche I should have addressed your most recent comments, anything else you feel is pending?
jorisvandenbossche
left a comment
Thanks for the updates!

Benchmark runs are scheduled for baseline = 4cfd1d9 and contender = 999d97a. 999d97a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.