
Conversation

@amol- (Member) commented Nov 5, 2021

No description provided.


amol- marked this pull request as ready for review November 5, 2021 16:26
self._set_options(q, delta, buffer_size, skip_nulls, min_count)


def _group_by(args, keys, aggregations):
Member:

We can also make this a public function in the compute module?

@amol- (Member Author):

Not sure, should we? I made it internal because we plan to replace this with the exec engine in the long term, so I guess that the Table.group_by implementation will switch to use something different in the future.

Member:

> I made it internal because we plan to replace this with the exec engine in the long term, so I guess that the Table.group_by implementation will switch to use something different in the future.

The same could be done for a pyarrow.compute function? (it doesn't map 1:1 to a C++ kernel anyway)

For me, one reason to put it in the compute functions as a pc.group_by(table, keys, ...) is to sidestep the 1-step vs. 2-step API discussion for the method a bit. For a function in compute, I think it's totally fine to be a one-step function.

@ianmcook (Member) commented:

Bikeshedding on the method name: In other packages, the group_by method/function does not actually do any aggregation. Instead it serves as a helper function that tells a separate aggregate method/function what groups to aggregate over. Examples of this include Ibis (group_by --> aggregate), pandas (groupby --> agg), and dplyr (group_by --> summarise). Because of this I think we should pick a different name than group_by for this function, since it both groups and aggregates.

@pitrou (Member) commented Nov 12, 2021

"grouped_aggregate" perhaps?

@pitrou (Member) commented Nov 12, 2021

Another possibility is to have a two-step API, e.g. replace:

table.group_by("keys", ["values"], "sum")

with:

table.group_by("keys", ["values"]).aggregate("sum")

or perhaps even some shortcuts:

table.group_by("keys", ["values"]).sum()

Table.group_by would return an intermediate object with several methods, including one for doing the actual grouping ("collect"?) and other(s) to compute aggregates.
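
A minimal sketch of what such an intermediate object could look like (the TableGroupBy name, the collect method, and the elided bodies are placeholders for discussion, not a settled API):

    class TableGroupBy:
        """Hypothetical intermediate object returned by Table.group_by."""

        def __init__(self, table, keys):
            self.table = table
            self.keys = keys

        def aggregate(self, aggregations):
            # Dispatch the keys plus the requested aggregations to the
            # C++ grouped-aggregation machinery.
            ...

        def sum(self, column):
            # Shortcut for the common single-aggregation case.
            return self.aggregate([("sum", column)])

        def collect(self):
            # Materialize the plain grouping, without any aggregation.
            ...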

@ianmcook (Member) commented Nov 12, 2021

+1 on the two-step approach if it is feasible and doesn't add too much complexity to the implementation.

Ideally the values would be passed to the aggregate function, not to the grouping function. That's how it works in Ibis, dplyr, and pandas (at least since named aggregation in pandas 0.25.0+).

@amol- (Member Author) commented Nov 12, 2021

> Another possibility is to have a two-step API, e.g. replace:
>
> table.group_by("keys", ["values"], "sum")
>
> with:
>
> table.group_by("keys", ["values"]).aggregate("sum")
>
> or perhaps even some shortcuts:
>
> table.group_by("keys", ["values"]).sum()

I think that in such a case the aggregated values shouldn't go into group_by; you probably would want something like table.group_by("keys").sum("values").max("othervalues", HashMaxOptions()).
I'm not too fond of that solution, by the way, as it would require an explicit point where you collect results to allow chaining multiple aggregations.

I think having a single aggregate method where you can provide multiple aggregations would be more usable:

t.group_by("key").aggregate([
   ("sum", "values"),
   ("max", "othervalues", HashMaxOptions())
])

@pitrou (Member) commented Nov 12, 2021

I was proposing shortcut methods for the simple cases where you compute only one aggregate. But perhaps that's not useful.

(And, yes, you're right: the value columns should go into the aggregate call, not the group_by call. My bad.)

@ianmcook (Member) commented:

> I was proposing shortcut methods for the simple cases where you compute only one aggregate. But perhaps that's not useful.

Given the small number of aggregate functions and the popularity of that style in pandas, I think that is practical and useful.

@jorisvandenbossche (Member) commented:

I am a bit hesitant to add such a two-step interface to pyarrow. It's indeed the way it is done in other packages, but the ones that @ianmcook mentions (Ibis, pandas, dplyr) all have slightly different APIs for how to specify this. And then pyarrow would add yet another slightly different interface.

(but I also agree that group_by is not a great name as a method on the table for this reason)


Playing a bit with this branch, some other observations:

  • I find it unexpected that the resulting table always has a "key" column instead of reusing the original name that was specified as the key column
  • Is it possible to group by multiple columns? Not in the current bindings in this PR, but I suppose in C++ / R this is already possible?
  • I think users will very quickly request the ability to specify the resulting column name (to not have things like "column_count_distinct")

@amol- (Member Author) commented Nov 16, 2021

> pyarrow would add yet another slightly different interface.
> (but I also agree that group_by is not a great name as a method on the table for this reason)

I don't have a strong opinion about the single-step or multi-step API. I personally rarely ever had the need to do a grouping without an associated aggregation, so I feel that the value of the multistep approach isn't huge, even though it might be easier to evolve in the future.

> Playing a bit with this branch, some other observations:
>
>   • I find it unexpected that the resulting table always has a "key" column instead of reusing the original name that was specified as the key column
>   • Is it possible to group by multiple columns? Not in the current bindings in this PR, but I suppose in C++ / R this is already possible?
>   • I think users will very quickly request the ability to specify the resulting column name (to not have things like "column_count_distinct")

I implemented support for the first two points in dfecba1.
Regarding the third one, I wonder if that would be best satisfied by extending the Table.rename_columns API to support a mapping of column names, i.e.:

t.rename_columns({"oldcolname": "newcolname"})

That might be convenient for other use cases too (for example, when wanting to rename only a subset of columns) and would expose the ability to do

t.group_by("keycol", ["value1"], ["sum"]).rename_column({"value1_sum": "total"})

@amol- (Member Author) commented Nov 16, 2021

@jorisvandenbossche @pitrou I was working toward the multistep API refactoring and was wondering about the grouping-alone case (group_by(["keys"]).collect()?).

At the moment it seems that the GroupBy C++ function doesn't support grouping without any provided aggregation. Do you think it would be a reasonable workaround to run a count aggregation just to drop it, or should we just leave out the plain grouping for the moment?
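
For reference, the workaround described would look roughly like this (a hypothetical sketch; the "_count" suffix assumes the "{column}_{function}" result naming discussed above):

    def plain_group_by(table, keys):
        # The C++ GroupBy needs at least one aggregation, so compute a
        # throwaway count over the first key and drop it afterwards.
        result = table.group_by(keys).aggregate([("count", keys[0])])
        return result.drop([keys[0] + "_count"])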

@pitrou (Member) commented Nov 16, 2021

IMHO we should just leave out the plain grouping for the moment.

@amol- (Member Author) commented Nov 17, 2021

@pitrou Moved to the multistep API:

    table.group_by("keys").aggregate([
        ("sum", "values"),
        ("count", "values")
    ])

or

    table.group_by("keys").aggregate([
        ("sum", "values", FunctionOptions),
        ("count", "values", FunctionOptions)
    ])

At the moment it only has an aggregate method, but we can grow more helpers in the future.
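
For concreteness, a small end-to-end example of the API as it stands in this PR (the function-first tuple order and the "{column}_{function}" result column names follow the examples in this thread and may differ in later releases):

    import pyarrow as pa

    table = pa.table({
        "keys": ["a", "a", "b"],
        "values": [1, 2, 3],
    })

    result = table.group_by("keys").aggregate([
        ("sum", "values"),
        ("count", "values"),
    ])
    # Expect one row per key, with columns like
    # "values_sum" and "values_count".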

@chungg left a comment:

Not sure what other functionality we plan on supporting, but the syntax makes sense to me. It's similar to other dataframe libraries.

@alippai (Contributor) commented Nov 17, 2021

A slightly related silly question: how does the performance compare to pandas at this stage?

@amol- (Member Author) commented Nov 18, 2021

> A slightly related silly question: how does the performance compare to pandas at this stage?

Not a real benchmark, but in the current very rough form it seems to be mostly comparable.

>>> table = pyarrow.csv.read_csv("yellowtaxi.csv")
>>> timeit.timeit(lambda: table.group_by("VendorID").aggregate([("sum", "trip_distance")]), number=1)
2.626802896999993

vs.

>>> df = pandas.read_csv("yellowtaxi.csv")
>>> timeit.timeit(lambda: df.groupby("VendorID").aggregate({"trip_distance": ["sum"]}), number=1)
2.3642018030000003

@pitrou (Member) left a comment:

Just a nit. @jorisvandenbossche Can you give this a final review?

function_registry,
get_function,
list_functions,
_group_by
Member:

Is there a reason for exposing this publicly? Is this just a leftover from previous attempts?

@amol- (Member Author) commented Nov 18, 2021

It's to make it available when _pc() is used to access compute functions from other modules/files (in this case from table.pxi). I adhered to that practice instead of injecting an import of pyarrow._compute.

Given that there are many more internal functions in the pyarrow.compute module, I thought it wasn't concerning.
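
For context, the _pc() indirection amounts to a lazy import (a sketch; used so table.pxi can reach compute functions without a circular module-level import):

    def _pc():
        # Lazily import pyarrow.compute so table.pxi can call
        # e.g. _pc()._group_by(...) at runtime.
        import pyarrow.compute as pc
        return pc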

amol- and others added 3 commits November 19, 2021 10:25
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jorisvandenbossche (Member) left a comment:

Sorry, a few more (mostly docstring) nits.

amol- and others added 2 commits November 23, 2021 12:39
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@amol- (Member Author) commented Nov 25, 2021

@jorisvandenbossche I should have addressed your most recent comments; anything else you feel is pending?

@jorisvandenbossche (Member) left a comment:

Thanks for the updates!

ursabot commented Nov 25, 2021

Benchmark runs are scheduled for baseline = 4cfd1d9 and contender = 999d97a. 999d97a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed] ursa-i9-9960x
[Finished ⬇️0.18% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True
