Description
Problem description
For long Series with many categories, 'Series.isin()' is slower on categorical data than on int64 data. If the categories are built from strings, the performance degradation is even larger.
import pandas as pd
import numpy as np
N = 3000000
Ncats = 100
cats = pd.Series(['abcdef%d'%_ for _ in range(Ncats)])
df = pd.DataFrame({'A': np.random.randn(N),
                   'B': np.random.randn(N),
                   'C': np.random.randint(0, Ncats, N),
                   })
df['D'] = cats.loc[df['C'].values].values
df['E'] = df['C'].astype('category')
df['F'] = df['D'].astype('category')
sel_codes = [1,2]
sel_cats = cats.loc[sel_codes].values
%timeit inds = df.C.isin(sel_codes) # int64
%timeit inds = df.E.isin(sel_codes) # category based on int64
%timeit inds = df.D.isin(sel_cats) # object / string
%timeit inds = df.F.isin(sel_cats) # category based on string
On my machine:
6.25 ms ± 412 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
28.7 ms ± 2.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
104 ms ± 4.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
142 ms ± 6.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Interestingly, if there are many categories to compare with, categorical data is faster, e.g. for
sel_codes = range(90)
sel_cats = cats.loc[sel_codes].values
%timeit inds = df.C.isin(sel_codes) # int64
%timeit inds = df.E.isin(sel_codes) # category based on int64
%timeit inds = df.D.isin(sel_cats) # object / string
%timeit inds = df.F.isin(sel_cats) # category based on string
the timings are:
441 ms ± 61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
422 ms ± 68.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
147 ms ± 7.95 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
171 ms ± 2.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
p.s. I'm not sure if such performance issues are worth filing.
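For reference, here is a minimal sketch of a possible workaround: test membership on the integer codes of the categorical instead of on the category values. The helper name `isin_via_codes` is hypothetical and not part of pandas, and I have not checked how pandas implements `isin` for categoricals internally. The sketch also assumes the values to look up contain no NaN, so missing entries (code -1) never match.

```python
import numpy as np
import pandas as pd


def isin_via_codes(s, values):
    """Membership test on a categorical Series via its integer codes.

    Hypothetical helper for illustration; not a pandas API.
    """
    # Position of each requested value among the categories; -1 = not a category.
    wanted = s.cat.categories.get_indexer(list(values))
    wanted = wanted[wanted >= 0]
    # Compare the compact integer codes rather than the category values.
    mask = np.isin(s.cat.codes.values, wanted)
    return pd.Series(mask, index=s.index)


# Small stand-in for column 'F' above (category based on string).
rng = np.random.RandomState(0)
cats = np.array(['abcdef%d' % i for i in range(100)])
s = pd.Series(cats[rng.randint(0, 100, 10000)]).astype('category')

sel_cats = ['abcdef1', 'abcdef2']
result = isin_via_codes(s, sel_cats)
expected = s.isin(sel_cats)
```

On the frame above this would be called as `isin_via_codes(df.F, sel_cats)`. Whether it is actually faster depends on the pandas and NumPy versions, so it should be timed alongside the calls above rather than assumed to help.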
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Darwin
OS-release: 17.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 28.8.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: 0.7.1
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.1.0
openpyxl: None
xlrd: None
xlwt: 1.3.0
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0b10
sqlalchemy: 1.1.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None