Closed
Description
Currently, Categorical.unique and CategoricalIndex.unique drop unused categories:
>>> categories = ['very good', 'good', 'neutral', 'bad', 'very bad']
>>> cat = pd.Categorical(['good','good', 'bad', 'bad'], categories=categories, ordered=True)
>>> cat
[good, good, bad, bad]
Categories (5, object): [very good < good < neutral < bad < very bad]
>>> cat.unique()
[good, bad]
Categories (2, object): [good < bad] # unused categories dropped
So, .unique() both uniquifies the values and drops unused categories (it does two things in one operation).
Often, even when you want unique values, you still want to control whether unused categories are dropped. So Categorical/CategoricalIndex.unique should IMO keep all categories, and dropping them should be a separate action. This would be a better API:
>>> cat.unique()
[good, bad]
Categories (5, object): [very good < good < neutral < bad < very bad] # unused not dropped
If you want to drop unused categories, you should do it explicitly, like so: cat.unique().remove_unused_categories().
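Until the behavior changes, both outcomes can be requested explicitly. The following is a minimal sketch, not an official recipe: `unique_keep_categories` is a hypothetical helper name introduced here for illustration, and it relies only on the existing `set_categories` and `remove_unused_categories` methods, so it should give the same result whether or not `.unique()` itself drops unused categories.

```python
import pandas as pd

categories = ["very good", "good", "neutral", "bad", "very bad"]
cat = pd.Categorical(["good", "good", "bad", "bad"],
                     categories=categories, ordered=True)


def unique_keep_categories(c):
    """Hypothetical helper: unique values that keep the input's full category set."""
    # Re-apply the original categories in case .unique() dropped the unused ones.
    return c.unique().set_categories(c.categories, ordered=c.ordered)


# Unique values, all five categories retained.
kept = unique_keep_categories(cat)

# Unique values, unused categories dropped as an explicit second step.
dropped = cat.unique().remove_unused_categories()

print(list(kept), list(kept.categories))
print(list(dropped), list(dropped.categories))
```

Keeping the two steps separate makes the intent visible at the call site, which is the point of the proposal.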
The proposed API is also faster, as dropping unused categories requires recoding the categories/codes, which is potentially expensive.
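To illustrate what "recoding" means here, this small sketch compares the integer codes before and after removing unused categories. It only shows that the codes have to be rewritten when the category set shrinks; it makes no claim about how large the cost is in practice.

```python
import pandas as pd

categories = ["very good", "good", "neutral", "bad", "very bad"]
cat = pd.Categorical(["good", "good", "bad", "bad"],
                     categories=categories, ordered=True)

# Codes are positions in the category list: 'good' is 1 and 'bad' is 3.
original_codes = cat.codes.tolist()

# Dropping unused categories shrinks the list to ['good', 'bad'],
# so every code must be remapped to a position in the new list.
trimmed = cat.remove_unused_categories()
trimmed_codes = trimmed.codes.tolist()

print(original_codes)  # [1, 1, 3, 3]
print(trimmed_codes)   # [0, 0, 1, 1]
```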