-
-
Notifications
You must be signed in to change notification settings - Fork 19.4k
Closed
Labels
AlgosNon-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diffNon-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diffCompatpandas objects compatability with Numpy or Python functionspandas objects compatability with Numpy or Python functions
Milestone
Description
Using the built-in sum function gives correct result in df.agg(sum, axis=0), but wrong result in df.agg(sum, axis=1).
>>> df = pd.DataFrame([[np.nan, 2], [3, 4]])
>>> df.agg(sum, axis=0)
0 3.0
1 6.0
dtype: float64
>>> df.agg(sum, axis=1)
0 NaN
1 7.0
dtype: float64The NaN in the last example should be 2.0.
Also, operation using the builtin sum in agg with axis=1 are very slow:
>>> n = 1_000
>>> df = pd.DataFrame({'a': range(n), 'b': range(n)})
>>> %timeit df.agg(sum, axis=1)
16.5 ms ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit df.T.agg(sum) # correct result *and* faster
312 µs ± 3.17 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)Problem description
Currently, df.agg(func, axis=1) defers to df.apply(func, axis=1). This is not done for axis=0, and the operation may therefore give unexpected results and slow the operation down (because df.apply can be very slow).
Expected Output
The expected output is:
>>> df.agg(sum, axis=1)
0 2.0
1 7.0
dtype: float64Solution proposal
I'm thinking about putting in df.T.agg(func, axis=0) rather than df.apply(func, axis=1) in a few strategic places. This should ensure both getting correct results and faster operations. will report back if this succeeds.
Metadata
Metadata
Assignees
Labels
AlgosNon-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diffNon-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diffCompatpandas objects compatability with Numpy or Python functionspandas objects compatability with Numpy or Python functions