[pyspark][group]pyspark GroupedData can't apply agg functions on all left numeric columns. #15135
… change group.agg to support df.groupby(name).agg(max,min) like pandas
Can one of the admins verify this patch?
Isn't it as simple as
@petermaxlee
I understand the reasons why you want to add this, but I feel this is too esoteric. If we add this one, there are a lot of other cases that could be added too, and I don't know where we would stop.
OK. Because pandas DataFrames support this way of calling agg, I assumed Spark DataFrames should support it too, but they do not, so I wrote this patch. If you think the patch is not necessary, I will close this request later. @rxin
Pandas doesn't support this, does it?
With pandas, the parameter to agg is the function itself, not a str (function name). @rxin, will you consider adding this patch? If not, maybe I should close it.
@davies, what do you think about this patch? Can you give me some advice? Thanks.
Closes apache#15303 Closes apache#15078 Closes apache#15080 Closes apache#15135 Closes apache#14565 Closes apache#12355 Closes apache#15404 Author: Sean Owen <sowen@cloudera.com> Closes apache#15451 from srowen/CloseStalePRs.
What changes were proposed in this pull request?
With a PySpark DataFrame, the agg method supports only two calling styles: passing a dict that maps column names to aggregate function names, or passing aggregate expressions built with pyspark.sql.functions on specific columns. Both styles require naming each column explicitly, so to apply the same aggregates to all the other numeric columns I have to list every function-column combination. For example, suppose df has three columns, province, age and income, and I want to group by province and compute the min, max and average of age and income. Before this change there are two approaches:
df.groupby('province').agg({'age':'max','age':'min','age':'avg','income':'max','income':'min','income':'avg'})
df.groupby('province').agg(F.min('age'),F.max('age'),F.avg('age'),F.min('income'),F.max('income'),F.avg('income'))
Both are redundant.
With this change, the code can simply be replaced with df.groupby('province').agg('max','min','avg')
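For comparison, here is a minimal sketch of how the same result might be obtained with the existing API, without this patch. The helper names `agg_pairs` and `agg_all_numeric` are hypothetical and are not part of PySpark or of this pull request; the sketch assumes the grouping key is a single column name. It also illustrates one caveat about the dict form quoted above: a Python dict literal keeps only the last value for a repeated key, so that form would compute just one aggregate per column.

```python
from itertools import product


def agg_pairs(columns, funcs):
    """Return every (function name, column) combination, grouped by column."""
    return [(fn, col) for col, fn in product(columns, funcs)]


def agg_all_numeric(df, key, funcs):
    """Group `df` by `key` and apply each named function to every other numeric column.

    Hypothetical helper: `df` is assumed to be a pyspark.sql.DataFrame.
    """
    # Imported lazily so agg_pairs stays usable without PySpark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import NumericType

    numeric = [
        field.name
        for field in df.schema.fields
        if isinstance(field.dataType, NumericType) and field.name != key
    ]
    # Look up F.max, F.min, F.avg, ... by name and apply each to each column.
    exprs = [getattr(F, fn)(col) for fn, col in agg_pairs(numeric, funcs)]
    return df.groupBy(key).agg(*exprs)


# Repeated keys in a dict literal collapse: only the last value survives.
collapsed = {'age': 'max', 'age': 'min', 'age': 'avg'}
```

Under these assumptions, `agg_all_numeric(df, 'province', ['max', 'min', 'avg'])` would stand in for the proposed `df.groupby('province').agg('max','min','avg')`.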
How was this patch tested?
manual tests