[pyspark][group]pyspark GroupedData can't apply agg functions on all left numeric columns. #15135

citoubest · 2016-09-18T06:01:12Z

What changes were proposed in this pull request?

With a PySpark DataFrame, the agg method supports only two forms: a dict that maps column names to aggregate function names, or aggregate functions from pyspark.sql.functions applied to specific columns. Both forms require naming every column explicitly. If I want to apply several aggregate functions to all remaining numeric columns, I have to list every function-column combination. For example, suppose df has three columns, province, age and income, and I want to group by province and compute the min, max and average of age and income. Before this change I have two options: df.groupby('province').agg({'age':'max','age':'min','age':'avg','income':'max','income':'min','income':'avg'}) or df.groupby('province').agg(F.min('age'),F.max('age'),F.avg('age'),F.min('income'),F.max('income'),F.avg('income')), both of which are verbose.

With this change, the code can be replaced with df.groupby('province').agg('max','min','avg').
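One caveat about the dict form quoted in the description: a Python dict literal cannot hold the same key twice, so the repeated 'age' and 'income' entries collapse before agg ever sees them. The sketch below is plain Python (no Spark needed) and only illustrates that language behaviour; the variable name spec is chosen here for illustration.

```python
# A dict literal with repeated keys keeps only the last value per key.
spec = {'age': 'max', 'age': 'min', 'age': 'avg',
        'income': 'max', 'income': 'min', 'income': 'avg'}

# Only one aggregate per column survives.
print(spec)       # {'age': 'avg', 'income': 'avg'}
print(len(spec))  # 2
```

So the dict form can express at most one aggregate per column, which leaves the functions-based form as the one that can carry several aggregates for the same column.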

How was this patch tested?

manual tests

… change group.agg to support df.groupby(name).agg(max,min) like pandas

AmplabJenkins · 2016-09-18T06:02:15Z

Can one of the admins verify this patch?

petermaxlee · 2016-09-18T06:07:31Z

Isn't it as simple as

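from pyspark.sql import functions as F

cols = [x for x in df.columns if x != "key"]
# agg takes Column arguments, so unpack the combined list
df.groupby("key").agg(*([F.min(x) for x in cols] + [F.max(x) for x in cols]))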
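The workaround above can be wrapped in a small helper. This is a minimal sketch, not part of the patch or of PySpark: build_aggs is a hypothetical name, and it only assumes that each entry in funcs is a callable taking a column name (as F.min and F.max do). It does not filter for numeric columns.

```python
def build_aggs(columns, key, funcs):
    """Apply every function in funcs to every column except the grouping key."""
    cols = [c for c in columns if c != key]
    return [f(c) for c in cols for f in funcs]


# Stand-in callables so the sketch runs without Spark.
fake_min = lambda c: "min(%s)" % c
fake_max = lambda c: "max(%s)" % c
exprs = build_aggs(["province", "age", "income"], "province", [fake_min, fake_max])
print(exprs)  # ['min(age)', 'max(age)', 'min(income)', 'max(income)']

# With PySpark (assumed import: from pyspark.sql import functions as F):
#   df.groupby("province").agg(*build_aggs(df.columns, "province", [F.min, F.max]))
```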

citoubest · 2016-09-18T09:39:25Z

@petermaxlee
In my opinion, a list comprehension only reduces the code length to some extent. It would be better if the agg method supported this shortcut at the API level.

citoubest · 2016-09-20T05:51:57Z

@rxin @davies @srowen

rxin · 2016-09-20T05:57:09Z

I understand the reasons why you want to add this -- but I feel this is too esoteric and if we add this one, there are also a lot of other cases that can be added and I don't know where we would stop.

citoubest · 2016-09-20T06:07:38Z

OK. Because pandas DataFrame supports this form of agg, I assumed Spark DataFrame should support it too, but it does not, so I wrote this patch. If you think this patch is not necessary, I will close this request later. @rxin

rxin · 2016-09-20T06:15:29Z

Pandas doesn't support this, does it?

>>> pd.read_csv('test.csv').groupby('a').agg('sum', 'avg')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/pandas/core/groupby.py", line 3597, in aggregate
    return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
  File "/Library/Python/2.7/site-packages/pandas/core/groupby.py", line 3114, in aggregate
    result, how = self._aggregate(arg, _level=_level, *args, **kwargs)
  File "/Library/Python/2.7/site-packages/pandas/core/base.py", line 428, in _aggregate
    return getattr(self, arg)(*args, **kwargs), None
TypeError: f() takes exactly 1 argument (2 given)
>>> pd.read_csv('test.csv').groupby('a').agg(['sum', 'avg'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/pandas/core/groupby.py", line 3597, in aggregate
    return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
  File "/Library/Python/2.7/site-packages/pandas/core/groupby.py", line 3114, in aggregate
    result, how = self._aggregate(arg, _level=_level, *args, **kwargs)
  File "/Library/Python/2.7/site-packages/pandas/core/base.py", line 564, in _aggregate
    return self._aggregate_multiple_funcs(arg, _level=_level), None
  File "/Library/Python/2.7/site-packages/pandas/core/base.py", line 609, in _aggregate_multiple_funcs
    results.append(colg.aggregate(arg))
  File "/Library/Python/2.7/site-packages/pandas/core/groupby.py", line 2574, in aggregate
    (_level or 0) + 1)
  File "/Library/Python/2.7/site-packages/pandas/core/groupby.py", line 2636, in _aggregate_multiple_funcs
    results[name] = obj.aggregate(func)
  File "/Library/Python/2.7/site-packages/pandas/core/groupby.py", line 2570, in aggregate
    return getattr(self, func_or_funcs)(*args, **kwargs)
  File "/Library/Python/2.7/site-packages/pandas/core/groupby.py", line 498, in __getattr__
    (type(self).__name__, attr))
AttributeError: 'SeriesGroupBy' object has no attribute 'avg'

citoubest · 2016-09-20T06:23:21Z

With pandas, the argument to agg is the function itself, not a str (function name). @rxin

Will you consider adding this patch? If not, maybe I should close it.

In [13]: df
Out[13]: 
          a         b         c  d
0  0.068300  0.263883  0.237335  1
1  0.226992  0.573966  0.954791  2
2  0.907550  0.930591  0.886454  1
3  0.178581  0.440734  0.414763  2

In [14]: df.groupby('d').agg([max,min])
Out[14]: 
          a                   b                   c          
        max       min       max       min       max       min
d                                                            
1  0.907550  0.068300  0.930591  0.263883  0.886454  0.237335
2  0.226992  0.178581  0.573966  0.440734  0.954791  0.414763

citoubest · 2016-09-25T08:06:21Z

@davies, what do you think about this patch? Can you give me some advice? Thanks

Closes apache#15303 Closes apache#15078 Closes apache#15080 Closes apache#15135 Closes apache#14565 Closes apache#12355 Closes apache#15404 Author: Sean Owen <sowen@cloudera.com> Closes apache#15451 from srowen/CloseStalePRs.

citoubest added 2 commits September 18, 2016 13:37

pyspark dataframe agg not support multiple functions for all columns,…

67e75a2

… change group.agg to support df.groupby(name).agg(max,min) like pandas

add comment for last change

7407bc8

srowen added a commit to srowen/spark that referenced this pull request Oct 12, 2016

Closing stale PRs.

4d40636

srowen mentioned this pull request Oct 12, 2016

[BUILD] Closing stale PRs #15451

Closed

asfgit closed this in eb69335 Oct 12, 2016
