@citoubest

What changes were proposed in this pull request?

With a PySpark DataFrame, the agg method supports only two forms: one takes a map from column name to aggregate method, and the other takes aggregate functions from the functions package applied to specific column names. Both forms require assigning a method to each specific column name. But if I want to apply the aggregate methods to all the other numeric columns, I have to list every method-column combination. For example, suppose df has the columns province, age, and income, and I want to group by province and calculate the min, max, and average of age and income. Before this change I have two approaches:
df.groupby('province').agg({'age':'max','age':'min','age':'avg','income':'max','income':'min','income':'avg'})
df.groupby('province').agg(F.min('age'),F.max('age'),F.avg('age'),F.min('income'),F.max('income'),F.avg('income'))
Both are redundant.
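
One caveat on the dictionary form above, shown as a quick check in plain Python rather than anything specific to Spark: a dict literal keeps only the last value for a repeated key, so the mapping that reaches agg would carry a single method per column.

```python
# A dict literal with repeated keys silently keeps the last value per key.
spec = {'age': 'max', 'age': 'min', 'age': 'avg',
        'income': 'max', 'income': 'min', 'income': 'avg'}

print(spec)  # {'age': 'avg', 'income': 'avg'}
```

This suggests the dictionary form can express at most one aggregate per column, which makes the listing problem described here somewhat worse for that form.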

With this change, we can simply replace the code with df.groupby('province').agg('max','min','avg')
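
A minimal sketch, in plain Python, of the expansion that agg('max','min','avg') implies. The helper name `expand_aggregations` is hypothetical; it is not part of PySpark and is not necessarily how this patch implements the feature. It also ignores column types, whereas the description restricts the shorthand to numeric columns.

```python
def expand_aggregations(columns, func_names, group_key):
    """Pair every function name with every column except the grouping key."""
    targets = [c for c in columns if c != group_key]
    return [(func, col) for col in targets for func in func_names]


# Columns and function names taken from the example in the description.
pairs = expand_aggregations(
    ["province", "age", "income"], ["max", "min", "avg"], "province"
)
for func, col in pairs:
    print(func, col)
```

With PySpark available, each pair could presumably be turned into a Column with getattr(F, func)(col), where F is pyspark.sql.functions, and the resulting Columns passed to agg as positional arguments.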

How was this patch tested?

manual tests

… change group.agg to support df.groupby(name).agg(max,min) like pandas
@AmplabJenkins

Can one of the admins verify this patch?

@petermaxlee
Contributor

Isn't it as simple as

from pyspark.sql import functions as F
cols = [x for x in df.columns if x != "key"]
df.groupby("key").agg(*([F.min(x) for x in cols] + [F.max(x) for x in cols]))
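
The comprehension above picks up every non-key column. If only numeric columns should be aggregated, as the description asks, the column list could be filtered first. The sketch below is plain Python and assumes input shaped like PySpark's df.dtypes, a list of (name, type string) pairs; `numeric_columns`, `NUMERIC_TYPES`, and the sample data are hypothetical names introduced for illustration.

```python
# Type strings assumed to match what df.dtypes reports for numeric columns.
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def numeric_columns(dtypes, exclude=()):
    """Return names of numeric columns, skipping any listed in `exclude`."""
    return [
        name
        for name, dtype in dtypes
        if name not in exclude
        # Decimal types carry precision and scale, e.g. "decimal(10,2)".
        and (dtype in NUMERIC_TYPES or dtype.startswith("decimal"))
    ]


# Hypothetical schema for the province/age/income example.
sample_dtypes = [("province", "string"), ("age", "int"), ("income", "double")]
cols = numeric_columns(sample_dtypes, exclude=["province"])
print(cols)
```

The resulting list can then feed the same min/max comprehension in place of the unfiltered column list.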

@citoubest
Author

@petermaxlee
In my opinion, a list comprehension shortens the code only to some extent. It would be better if the agg method supported this shorthand at the API level.

@citoubest
Author

@rxin @davies @srowen

@rxin
Contributor

rxin commented Sep 20, 2016

I understand the reasons why you want to add this -- but I feel this is too esoteric and if we add this one, there are also a lot of other cases that can be added and I don't know where we would stop.

@citoubest
Author

OK. Because the pandas DataFrame supports this approach to agg, I thought the Spark DataFrame should support it too, but it does not, so I wrote this patch. If you think this patch is not necessary, I will close this request later. @rxin

@rxin
Contributor

rxin commented Sep 20, 2016

Pandas doesn't support this, does it?

>>> pd.read_csv('test.csv').groupby('a').agg('sum', 'avg')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/pandas/core/groupby.py", line 3597, in aggregate
    return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
  File "/Library/Python/2.7/site-packages/pandas/core/groupby.py", line 3114, in aggregate
    result, how = self._aggregate(arg, _level=_level, *args, **kwargs)
  File "/Library/Python/2.7/site-packages/pandas/core/base.py", line 428, in _aggregate
    return getattr(self, arg)(*args, **kwargs), None
TypeError: f() takes exactly 1 argument (2 given)
>>> pd.read_csv('test.csv').groupby('a').agg(['sum', 'avg'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/pandas/core/groupby.py", line 3597, in aggregate
    return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
  File "/Library/Python/2.7/site-packages/pandas/core/groupby.py", line 3114, in aggregate
    result, how = self._aggregate(arg, _level=_level, *args, **kwargs)
  File "/Library/Python/2.7/site-packages/pandas/core/base.py", line 564, in _aggregate
    return self._aggregate_multiple_funcs(arg, _level=_level), None
  File "/Library/Python/2.7/site-packages/pandas/core/base.py", line 609, in _aggregate_multiple_funcs
    results.append(colg.aggregate(arg))
  File "/Library/Python/2.7/site-packages/pandas/core/groupby.py", line 2574, in aggregate
    (_level or 0) + 1)
  File "/Library/Python/2.7/site-packages/pandas/core/groupby.py", line 2636, in _aggregate_multiple_funcs
    results[name] = obj.aggregate(func)
  File "/Library/Python/2.7/site-packages/pandas/core/groupby.py", line 2570, in aggregate
    return getattr(self, func_or_funcs)(*args, **kwargs)
  File "/Library/Python/2.7/site-packages/pandas/core/groupby.py", line 498, in __getattr__
    (type(self).__name__, attr))
AttributeError: 'SeriesGroupBy' object has no attribute 'avg'

@citoubest
Author

citoubest commented Sep 20, 2016

With pandas, the parameter for agg is the function itself, not a str (function name). @rxin

Will you consider adding this patch? If not, maybe I should close it.

In [13]: df
Out[13]: 
          a         b         c  d
0  0.068300  0.263883  0.237335  1
1  0.226992  0.573966  0.954791  2
2  0.907550  0.930591  0.886454  1
3  0.178581  0.440734  0.414763  2

In [14]: df.groupby('d').agg([max,min])
Out[14]: 
          a                   b                   c          
        max       min       max       min       max       min
d                                                            
1  0.907550  0.068300  0.930591  0.263883  0.886454  0.237335
2  0.226992  0.178581  0.573966  0.440734  0.954791  0.414763
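
For readers without pandas at hand, the grouped result above can be reproduced in plain Python. This is only a sketch of the semantics (every function applied to every non-key column, per group), not of how pandas or the patch implements it; `group_agg` is a hypothetical helper, and the rows are copied from the In [13] output.

```python
from collections import defaultdict

# Rows copied from the DataFrame printed above (columns a, b, c, d).
rows = [
    {"a": 0.068300, "b": 0.263883, "c": 0.237335, "d": 1},
    {"a": 0.226992, "b": 0.573966, "c": 0.954791, "d": 2},
    {"a": 0.907550, "b": 0.930591, "c": 0.886454, "d": 1},
    {"a": 0.178581, "b": 0.440734, "c": 0.414763, "d": 2},
]


def group_agg(rows, key, funcs):
    """Apply every function in `funcs` to every non-key column, per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    result = {}
    for group_key, members in groups.items():
        value_columns = [c for c in members[0] if c != key]
        result[group_key] = {
            (col, func.__name__): func(m[col] for m in members)
            for col in value_columns
            for func in funcs
        }
    return result


summary = group_agg(rows, "d", [max, min])
for group_key in sorted(summary):
    print(group_key, summary[group_key])
```

Each group maps (column, function name) pairs to values, mirroring the two-level column header in the pandas output.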

@citoubest
Author

@davies, what do you think about this patch? Can you give me some advice? Thanks

srowen added a commit to srowen/spark that referenced this pull request Oct 12, 2016
@asfgit asfgit closed this in eb69335 Oct 12, 2016
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
Closes apache#15303
Closes apache#15078
Closes apache#15080
Closes apache#15135
Closes apache#14565
Closes apache#12355
Closes apache#15404

Author: Sean Owen <sowen@cloudera.com>

Closes apache#15451 from srowen/CloseStalePRs.