Description
Code Sample, a copy-pastable example if possible
import pandas as pd
import numpy as np
d = {'l': ['left', 'right', 'left', 'right', 'left', 'right'],
'r': ['right', 'left', 'right', 'left', 'right', 'left'],
'v': [-1, 1, -1, 1, -1, np.nan]}
df = pd.DataFrame(d)
Problem description
When a grouped DataFrame contains np.nan, the group sum is not consistent with numpy.sum or pandas.Series.sum.
The expected result is NaN, which is what pd.Series.sum and pd.DataFrame.sum return when called with skipna=False:
In [235]: df.v.sum(skipna=False)
Out[235]: nan
However, the object returned by pandas.DataFrame.groupby does not reproduce this behavior; the NaN is silently skipped:
In [237]: df.groupby('l')['v'].sum()['right']
Out[237]: 2.0
and NaN propagation cannot be forced by applying np.sum directly:
In [238]: df.groupby('l')['v'].apply(np.sum)['right']
Out[238]: 2.0
See this StackOverflow post for a workaround.
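The linked post is not reproduced here. As a minimal sketch of one possible workaround (not necessarily the one from that post), the reduction can be wrapped in a lambda so that skipna=False reaches Series.sum for each group. The second variant passes the raw NumPy array to np.sum, on the assumption that np.sum applied to a Series defers to Series.sum and therefore skips NaN. The names nan_aware and raw_numpy are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'l': ['left', 'right', 'left', 'right', 'left', 'right'],
    'r': ['right', 'left', 'right', 'left', 'right', 'left'],
    'v': [-1, 1, -1, 1, -1, np.nan],
})

# Call Series.sum with skipna=False inside each group so NaN propagates.
nan_aware = df.groupby('l')['v'].apply(lambda s: s.sum(skipna=False))

# Alternative: hand NumPy the underlying array instead of the Series,
# so the reduction is done by NumPy itself (assumption: np.sum on a
# Series would otherwise be routed through Series.sum).
raw_numpy = df.groupby('l')['v'].apply(lambda s: np.sum(s.values))

print(nan_aware)
print(raw_numpy)
```

Both variants run a Python-level function per group, so they may be noticeably slower than the built-in groupby sum on large frames.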
Expected Output
In [238]: df.groupby('l')['v'].apply(np.sum)['right']
Out[238]: nan
and
In [237]: df.groupby('l')['v'].sum(skipna=False)['right']
Out[237]: nan
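Until something like the skipna=False call above is supported, the expected result can be approximated without a per-group Python function. The following is a sketch under the assumption that masking after the fact is acceptable: compute the ordinary group sums, then blank out every group that contained at least one NaN. The names sums, has_nan and masked are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'l': ['left', 'right', 'left', 'right', 'left', 'right'],
    'r': ['right', 'left', 'right', 'left', 'right', 'left'],
    'v': [-1, 1, -1, 1, -1, np.nan],
})

# Ordinary group sums; NaN values are skipped here.
sums = df.groupby('l')['v'].sum()

# True for each group that contains at least one missing value.
has_nan = df['v'].isnull().groupby(df['l']).any()

# Keep the sum only where the group had no NaN; other groups become NaN.
masked = sums.where(~has_nan)

print(masked)
```

Both intermediate Series are indexed by the group key, so the where call aligns them label by label.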
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 26 Stepping 5, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 32.3.1
Cython: 0.25.2
numpy: 1.12.0
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.4
lxml: 3.7.0
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.4
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.43.0
pandas_datareader: None