
Conversation

@amol- (Member) commented Nov 5, 2021

No description provided.


amol- marked this pull request as ready for review November 5, 2021 16:26
self._set_options(q, delta, buffer_size, skip_nulls, min_count)


def _group_by(args, keys, aggregations):
Member:

We can also make this a public function in the compute module?

@amol- (Member Author):

Not sure, should we? I made it internal because we plan to replace this with the exec engine in the long term, so I guess that the Table.group_by implementation will switch to use something different in the future.

Member:

> I made it internal because we plan to replace this with the exec engine in the long term, so I guess that the Table.group_by implementation will switch to use something different in the future.

The same could be done for a pyarrow.compute function? (it doesn't map 1:1 to a C++ kernel anyway)

For me, one reason to put it in the compute functions as a pc.group_by(table, keys, ...) is to sidestep the 1-step vs. 2-step API discussion for the method a bit. For a function in compute, I think it's totally fine to be a one-step function.

@ianmcook (Member) commented:

Bikeshedding on the method name: In other packages, the group_by method/function does not actually do any aggregation. Instead it serves as a helper function that tells a separate aggregate method/function what groups to aggregate over. Examples of this include Ibis (group_by --> aggregate), pandas (groupby --> agg), and dplyr (group_by --> summarise). Because of this I think we should pick a different name than group_by for this function, since it both groups and aggregates.

@pitrou (Member) commented Nov 12, 2021

"grouped_aggregate" perhaps?

@pitrou (Member) commented Nov 12, 2021

Another possibility is to have a two-step API, e.g. replace:

table.group_by("keys", ["values"], "sum")

with:

table.group_by("keys", ["values"]).aggregate("sum")

or perhaps even some shortcuts:

table.group_by("keys", ["values"]).sum()

Table.group_by would return an intermediate object with several methods, including one for doing the actual grouping ("collect"?) and other(s) to compute aggregates.
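
A minimal sketch of what such an intermediate object could look like (the TableGroupBy name, the collect method, and the elided bodies are placeholders for discussion, not a settled API):

    class TableGroupBy:
        """Hypothetical intermediate object returned by Table.group_by."""

        def __init__(self, table, keys):
            self.table = table
            self.keys = keys

        def aggregate(self, aggregations):
            # Dispatch the keys plus the requested aggregations to the
            # C++ grouped-aggregation machinery.
            ...

        def sum(self, column):
            # Shortcut for the common single-aggregation case.
            return self.aggregate([("sum", column)])

        def collect(self):
            # Materialize the plain grouping, without any aggregation.
            ...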

@ianmcook (Member) commented Nov 12, 2021

+1 on the two-step approach if it is feasible and doesn't add too much complexity to the implementation.

Ideally the values would be passed to the aggregate function, not to the grouping function. That's how it works in Ibis, dplyr, and pandas (at least since named aggregation in pandas 0.25.0+).

@amol- (Member Author) commented Nov 12, 2021

> Another possibility is to have a two-step API, e.g. replace:
>
> table.group_by("keys", ["values"], "sum")
>
> with:
>
> table.group_by("keys", ["values"]).aggregate("sum")
>
> or perhaps even some shortcuts:
>
> table.group_by("keys", ["values"]).sum()

I think that in such a case the aggregated values shouldn't go into group_by; you probably would want something like table.group_by("keys").sum("values").max("othervalues", HashMaxOptions()).
I'm not too fond of that solution, by the way, as it would require an explicit point where you collect results to allow chaining multiple aggregations.

I think having a single aggregate method where you can provide multiple aggregations would be more usable:

t.group_by("key").aggregate([
   ("sum", "values"),
   ("max", "othervalues", HashMaxOptions())
])

@pitrou (Member) commented Nov 12, 2021

I was proposing shortcut methods for the simple cases where you compute only one aggregate. But perhaps that's not useful.

(And, yes, you're right: the value columns should go into the aggregate call, not the group_by call. My bad.)

@ianmcook (Member) commented:

> I was proposing shortcut methods for the simple cases where you compute only one aggregate. But perhaps that's not useful.

Given the small number of aggregate functions and the popularity of that style in pandas, I think that is practical and useful.

@jorisvandenbossche (Member) commented:

I am a bit hesitant to add such a two-step interface to pyarrow. It's indeed the way it is done in other packages, but the ones that @ianmcook mentions (Ibis, pandas, dplyr) all have slightly different APIs for how to specify this. And then pyarrow would add yet another slightly different interface.

(but I also agree that group_by is not a great name as a method on the table for this reason)


Playing a bit with this branch, some other observations:

  • I find it unexpected that the resulting table always has a "key" column instead of reusing the original name that was specified as the key column
  • Is it possible to group by multiple columns? Not in the current bindings in this PR, but I suppose in C++ / R this is already possible?
  • I think users will very quickly request the ability to specify the resulting column name (to not have things like "column_count_distinct")

@amol- (Member Author) commented Nov 16, 2021

> pyarrow would add yet another slightly different interface.
> (but I also agree that group_by is not a great name as a method on the table for this reason)

I don't have a strong opinion about the single-step or multi-step API. I personally rarely ever had the need to do a grouping without an associated aggregation, so I feel that the value of the multistep approach isn't huge, even though it might be easier to evolve in the future.

> Playing a bit with this branch, some other observations:
>
>   • I find it unexpected that the resulting table always has a "key" column instead of reusing the original name that was specified as the key column
>   • Is it possible to group by multiple columns? Not in the current bindings in this PR, but I suppose in C++ / R this is already possible?
>   • I think users will very quickly request the ability to specify the resulting column name (to not have things like "column_count_distinct")

I implemented support for the first two points in dfecba1.
Regarding the third one, I wonder if that would be best satisfied by extending the Table.rename_columns API to support a mapping of column names, i.e.:

t.rename_columns({"oldcolname": "newcolname"})

That might be convenient for other use cases too (for example, when wanting to rename only a subset of columns) and would expose the ability to do

t.group_by("keycol", ["value1"], ["sum"]).rename_column({"value1_sum": "total"})

@amol- (Member Author) commented Nov 16, 2021

@jorisvandenbossche @pitrou I was working toward the multistep API refactoring and was wondering about the grouping-alone case (group_by(["keys"]).collect()?).

At the moment it seems that the GroupBy C++ function doesn't support grouping without any provided aggregation. Do you think it would be a reasonable workaround to run a count aggregation just to drop it, or should we just leave out the plain grouping for the moment?
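
For reference, the workaround described would look roughly like this (a hypothetical sketch; the "_count" suffix assumes the "{column}_{function}" result naming discussed above):

    def plain_group_by(table, keys):
        # The C++ GroupBy needs at least one aggregation, so compute a
        # throwaway count over the first key and drop it afterwards.
        result = table.group_by(keys).aggregate([("count", keys[0])])
        return result.drop([keys[0] + "_count"])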

@pitrou (Member) commented Nov 16, 2021

IMHO we should just leave out the plain grouping for the moment.

@amol- (Member Author) commented Nov 17, 2021

@pitrou Moved to the multistep API:

    table.group_by("keys").aggregate([
        ("sum", "values"),
        ("count", "values")
    ])

or

    table.group_by("keys").aggregate([
        ("sum", "values", FunctionOptions),
        ("count", "values", FunctionOptions)
    ])

At the moment it only has an aggregate method, but we can grow more helpers in the future.
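
For concreteness, a small end-to-end example of the API as it stands in this PR (the function-first tuple order and the "{column}_{function}" result column names follow the examples in this thread and may differ in later releases):

    import pyarrow as pa

    table = pa.table({
        "keys": ["a", "a", "b"],
        "values": [1, 2, 3],
    })

    result = table.group_by("keys").aggregate([
        ("sum", "values"),
        ("count", "values"),
    ])
    # Expect one row per key, with columns like
    # "values_sum" and "values_count".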

@chungg left a comment:

Not sure what other functionality we plan on supporting, but the syntax makes sense to me. It's similar to other dataframe libraries.

@alippai (Contributor) commented Nov 17, 2021

A slightly related silly question: how does the performance compare to pandas at this stage?

@amol- (Member Author) commented Nov 18, 2021

> A slightly related silly question: how does the performance compare to pandas at this stage?

Not a real benchmark, but in the current very rough form it seems to be mostly comparable.

>>> table = pyarrow.csv.read_csv("yellowtaxi.csv")
>>> timeit.timeit(lambda: table.group_by("VendorID").aggregate([("sum", "trip_distance")]), number=1)
2.626802896999993

vs.

>>> df = pandas.read_csv("yellowtaxi.csv")
>>> timeit.timeit(lambda: df.groupby("VendorID").aggregate({"trip_distance": ["sum"]}), number=1)
2.3642018030000003

@pitrou (Member) left a comment:

Just a nit. @jorisvandenbossche Can you give this a final review?

function_registry,
get_function,
list_functions,
_group_by
Member:

Is there a reason for exposing this publicly? Is this just a leftover from previous attempts?

@amol- (Member Author) commented Nov 18, 2021

It's to make it available when _pc() is used to access compute functions from other modules/files (in this case from table.pxi). I adhered to that practice instead of injecting an import of pyarrow._compute.

Given that there are many more internal functions in the pyarrow.compute module, I thought it wasn't concerning.
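
For context, the _pc() indirection amounts to a lazy import (a sketch; used so table.pxi can reach compute functions without a circular module-level import):

    def _pc():
        # Lazily import pyarrow.compute so table.pxi can call
        # e.g. _pc()._group_by(...) at runtime.
        import pyarrow.compute as pc
        return pc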

amol- and others added 3 commits November 19, 2021 10:25
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jorisvandenbossche (Member) left a comment:

Sorry, a few more (mostly docstring) nits.

amol- and others added 2 commits November 23, 2021 12:39
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@amol- (Member Author) commented Nov 25, 2021

@jorisvandenbossche I should have addressed your most recent comments; anything else you feel is pending?

@jorisvandenbossche (Member) left a comment:

Thanks for the updates!

ursabot commented Nov 25, 2021

Benchmark runs are scheduled for baseline = 4cfd1d9 and contender = 999d97a. 999d97a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed] ursa-i9-9960x
[Finished ⬇️0.18% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True
