Closed
Labels: Categorical, Groupby, Needs Tests, good first issue
- Series groupby excluding NaN groups with Categorical (DataFrame DOES include)
- sorting via a returned Interval-like-Index (string based)
Hello,
When grouping a DataFrame over more than one column including a categorical, the empty groups are kept in the aggregation result. A test for this behaviour was introduced in #8138.
However, when performing aggregation on only one column of the DataFrame, the empty groups are dropped. This seems inconsistent to me and I guess that it's an edge case that wasn't thought of at the time.
```python
import numpy as np
import pandas as pd

d = {'foo': [10, 8, 4, 1], 'bar': [10, 20, 30, 40],
     'baz': ['d', 'c', 'd', 'c']}
df = pd.DataFrame(d)
# Bin 'foo' into four intervals: (0, 5], (5, 10], (10, 15], (15, 20]
cat = pd.cut(df['foo'], np.linspace(0, 20, 5))
df['range'] = cat
groups = df.groupby(['range', 'baz'], as_index=True, sort=True)
# Expected result, fixed as part of #8138
fixed = groups.agg('mean')
# Inconsistent behaviour with series
inconsistent = groups['foo'].agg('mean')
# Expected result
expected = fixed['foo']
```

fixed

| range | baz | foo | bar |
|---|---|---|---|
| (0, 5] | c | 1 | 40 |
| | d | 4 | 30 |
| (10, 15] | c | NaN | NaN |
| | d | NaN | NaN |
| (15, 20] | c | NaN | NaN |
| | d | NaN | NaN |
| (5, 10] | c | 8 | 20 |
| | d | 10 | 10 |

inconsistent

| range | baz | foo |
|---|---|---|
| (0, 5] | c | 1 |
| | d | 4 |
| (5, 10] | c | 8 |
| | d | 10 |

expected

| range | baz | foo |
|---|---|---|
| (0, 5] | c | 1 |
| | d | 4 |
| (10, 15] | c | NaN |
| | d | NaN |
| (15, 20] | c | NaN |
| | d | NaN |
| (5, 10] | c | 8 |
| | d | 10 |

Note the strange ordering of the categorical index: (5, 10] comes last. I would expect sort=True to sort by categorical level, not by lexical order.
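For illustration, here is a minimal sketch of how the ordering above could arise, assuming the interval labels end up being compared as plain strings. The names `by_category` and `by_string` are made up for this example, and the labels print with float edges (e.g. `(5.0, 10.0]`) because the bins come from `np.linspace`.

```python
import numpy as np
import pandas as pd

foo = pd.Series([10, 8, 4, 1], name="foo")
cat = pd.cut(foo, np.linspace(0, 20, 5))

# Category order as defined by pd.cut: increasing bin edges.
by_category = [str(c) for c in cat.cat.categories]

# Order obtained when the same labels are sorted as plain strings
# (assumption: this is what a string-based index would do).
by_string = sorted(by_category)

print("category order:", by_category)
print("lexical order: ", by_string)
```

In lexical order the label starting with "(5" sorts after the ones starting with "(10" and "(15", which matches the row order in the tables above.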
Also note that using as_index=False fails due to #8869.
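Since this is tagged as needing tests, a regression check for the consistency described above might look roughly like the sketch below. It assumes a pandas version where `groupby` accepts the `observed` keyword (not part of the original report), passed explicitly here so the result does not depend on the version's default. The names `frame_result` and `series_result` are only for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "foo": [10, 8, 4, 1],
    "bar": [10, 20, 30, 40],
    "baz": ["d", "c", "d", "c"],
})
df["range"] = pd.cut(df["foo"], np.linspace(0, 20, 5))

# Assumption: observed=False asks groupby to keep unobserved categories,
# so empty (range, baz) combinations should appear with NaN.
groups = df.groupby(["range", "baz"], as_index=True, sort=True, observed=False)

frame_result = groups.agg("mean")          # DataFrameGroupBy aggregation
series_result = groups["foo"].agg("mean")  # SeriesGroupBy aggregation

# Both paths should yield the same index, including the empty groups.
pd.testing.assert_series_equal(series_result, frame_result["foo"])
```

If the Series path dropped the empty groups as reported, `assert_series_equal` would raise because the two indexes would differ in length.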