-
-
Notifications
You must be signed in to change notification settings - Fork 19.4k
ENH: DataFrame.append preserves columns dtype if possible #19021
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: DataFrame.append preserves columns dtype if possible #19021
Conversation
72214ca to
6bb6860
Compare
|
Can't get Travis-ci to complete because it uses too much time, somehow. I assume this has nothing to do with this PR, but is about travis settings? |
|
rebase and push again, fixed some hanging by Travis CI |
6bb6860 to
9cfa3d3
Compare
Codecov Report
@@ Coverage Diff @@
## master #19021 +/- ##
==========================================
- Coverage 91.84% 91.82% -0.03%
==========================================
Files 153 153
Lines 49295 49299 +4
==========================================
- Hits 45276 45268 -8
- Misses 4019 4031 +12
Continue to review full report at Codecov.
|
|
Ok, rebased. |
| idx_diff = other.index.difference(self.columns) | ||
| try: | ||
| combined_columns = self.columns.append(idx_diff) | ||
| except TypeError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this should not be done here, rather if a TypeError happens, then Index.append handles.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand you here, could you expand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In [1]: pd.Index([1, 2, 3]).append(pd.Index(list('abc')))
Out[1]: Index([1, 2, 3, 'a', 'b', 'c'], dtype='object')
In [2]: pd.Index([1, 2, 3]).append(pd.CategoricalIndex(list('abc')))
Out[2]: Index([1, 2, 3, 'a', 'b', 'c'], dtype='object')
In [3]: pd.Index([1, 2, 3]).append(pd.MultiIndex.from_product([[1], ['a']]))
Out[3]: Index([1, 2, 3, (1, 'a')], dtype='object')
In [4]: pd.Index([1, 2, 3]).append(pd.IntervalIndex.from_breaks([1, 2, 3]))
Out[4]: Index([1, 2, 3, (1, 2], (2, 3]], dtype='object')
.append`` should never raise
so you are catching something that does not happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The below fails, though:
>>> pd.CategoricalIndex('A B C'.split()).append(pd.Index([1])
TypeError: cannot append a non-category item to a CategoricalIndexThis is what is tested in the if not isinstance(df_columns, (pd.IntervalIndex, pd.MultiIndex)): part of the tests. In addition, the reason why pd.IntervalIndex and pd.MultiIndex are excluded for testing there is because of reindexing issues:
>>> ii = pd.IntervalIndex.from_breaks([0,1,2])
>>> df = pd.DataFrame([[1, 2], [4, 5]], columns=ii)
>>> ser = pd.Series([7], index=['a'])
>>> combined = pd.Index(ii.tolist() + ser.index.tolist())
>>> df.reindex(columns=combined)
TypeError: unorderable types: Interval() > int()There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, the MI case we did on purpose. However you cannot convert to lists and expect things to work. for the TypeError, do .astype(object) on each left and right then .append again. we never should coerce to lists.
pandas/tests/reshape/test_concat.py
Outdated
| def test_append_same_columns_type(self, df_columns): | ||
| # GH18359 | ||
|
|
||
| # ser.index is a normal pd.Index, result from df.append(ser) should be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this as a separate test (the if part), though I suspect you are actually hiding a bug (see above).
pandas/tests/reshape/test_concat.py
Outdated
| # GH 4993 | ||
| # appending with datetime will incorrectly convert datetime64 | ||
| import datetime as dt | ||
| from pandas import NaT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can move other imports to top as well
9cfa3d3 to
0d55da6
Compare
|
Changed and green (except for travis errors that seem unrelated). A different subject, but shouldn't |
pandas/tests/reshape/test_concat.py
Outdated
|
|
||
| # MultiIndex and IntervalIndex will raise if appended or appended to | ||
| # a different type | ||
| if (isinstance(df_columns, (pd.IntervalIndex, pd.MultiIndex)) or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't do this in a test. I want a separate test for MI/II (or 2 is ok).
8f8b140 to
bd400df
Compare
|
Hello @topper-123! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on April 17, 2018 at 17:28 Hours UTC |
730ac3a to
26e7419
Compare
464fdc2 to
48f406b
Compare
|
All is green again, now that #19112 has been pulled in. |
|
@topper-123 added some ids to the parametrization. also I added a test case in the combo test or an ordered and unordered category which is failing. (its commented out). can you have a look? |
|
also pls rebase |
|
@jreback , wrt. the ordered vs unordered combo in I think the current implementation is reasonable behaviour, as |
pandas/tests/reshape/test_concat.py
Outdated
|
|
||
| df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=mi) | ||
| ser = pd.Series([7, 8, 9], index=other_type, name=2) | ||
| if isinstance(other_type, pd.IntervalIndex): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this is wrong, append should raise a TypeError (like all the others). can you fix this in II?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the ValueError is coming from here:
pandas/pandas/core/indexes/interval.py
Lines 1155 to 1161 in 1245f06
| def _as_like_interval_index(self, other, error_msg): | |
| self._assert_can_do_setop(other) | |
| other = _ensure_index(other) | |
| if (not isinstance(other, IntervalIndex) or | |
| self.closed != other.closed): | |
| raise ValueError(error_msg) | |
| return other |
Seems reasonable to split the if condition, and raise a different exception in each case, along the lines of:
if not isinstance(other, IntervalIndex):
raise TypeError(error_msg)
elif self.closed != other.closed:
raise ValueError(error_msg)I think this change will only break the test below, at least within the II tests, but all that needs to be changed is ValueError --> TypeError:
pandas/pandas/tests/indexes/interval/test_interval.py
Lines 936 to 940 in 1245f06
| # non-IntervalIndex | |
| msg = ('can only do set operations between two IntervalIndex objects ' | |
| 'that are closed on the same side') | |
| with tm.assert_raises_regex(ValueError, msg): | |
| set_op(Index([1, 2, 3])) |
pandas/tests/reshape/test_concat.py
Outdated
| dt.datetime(2013, 1, 3, 7, 12)]), | ||
| pd.Index([4, 5, 6]), | ||
| ], ids=lambda x: str(x.dtype)) | ||
| def test_append_interval_index_raises(self, other_type): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see above.
cc @jschendel
|
cc @jschendel |
pandas/tests/reshape/test_concat.py
Outdated
|
|
||
| df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=other_type) | ||
| ser = pd.Series([7, 8, 9], index=ii, name=2) | ||
| with pytest.raises(ValueError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks to be the same as above
a5deefb to
860a5d2
Compare
doc/source/whatsnew/v0.23.0.txt
Outdated
|
|
||
|
|
||
| - ``IntervalIndex.astype`` now supports conversions between subtypes when passed an ``IntervalDtype`` (:issue:`19197`) | ||
| - :meth:`DataFrame.append` now preserves the type of the calling dataframe's columns, when possible (:issue:`18359`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
duplicated the whatsnew, don't remove the whitespace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, thanks.
pandas/tests/reshape/test_concat.py
Outdated
| # different index type | ||
|
|
||
| ii = pd.IntervalIndex.from_breaks([0, 1, 2, 3]) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this tests now looks superfluous, yes? and you just need to add it to the above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Appending IntervalIndex with other index types still fails, only with a TypeError rather than a ValueError, so not sure I understand.
Oh, you mean combine test_append_multi_index_raises and test_append_interval_index_raises? I think you're right. I'll get on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. None of these should fail, you are catching the TypeError already (and then coercing to object).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, combining IntervalIndex and other index types will stilll fail, because Interval and other dtypes are not orderable:
>>> ii = pd.IntervalIndex.from_breaks([0,1,2,3])
>>> i = pd.RangeIndex(3)
>>> df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=ii)
>>> ser = pd.Series([7, 8, 9], index=i, name=2)
>>> df.append(ser)
TypeError: unorderable types: Interval() > int()test_append_multi_index_raises and test_append_interval_index_raises are very similar and I've made a changes to combine them into a single test, see test_append_different_columns_types_raises.
EDIT: I'm wondering if trying to order here is a bug, as you can append to the index itself:
>>> ii.append(i)
Index([(0, 1], (1, 2], (2, 3], 0, 1, 2], dtype='object')If the above is possible, surely you can use DataFrame.append as well. I'll take a look tonight.
|
I've addressed the comments, except one where I've commented specifically. If #19353 (exact matches for CategoricalIndex) or a version thereof could be accepted, it would be possible to append frames/series with CategoricalIndex as well, i.e. move IntervalIndex checks from |
pandas/tests/reshape/test_concat.py
Outdated
| columns=combined_columns) | ||
| assert_frame_equal(result, expected) | ||
|
|
||
| @pytest.mark.parametrize("this_type", [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indexes = [pd.RangeIndex(3),
pd.Index([4, 5, 6]),
pd.Index(list("abc")),
pd.CategoricalIndex('A B C'.split()),
pd.CategoricalIndex('A B C'.split(), ordered=True),
pd.IntervalIndex.from_breaks([0, 1, 2, 3]),
pd.MultiIndex.from_arrays(['A B C'.split(), 'D E F'.split()]),
pd.DatetimeIndex([dt.datetime(2013, 1, 3, 0, 0),
dt.datetime(2013, 1, 3, 6, 10),
dt.datetime(2013, 1, 3, 7, 12)])]
@pytest.fixture(params=indexes)
def this_type(request):
return request.param
@pytest.fixture(params=indexes)
def that_type(params=indexes):
return request.param
.....
def append_different_coumns_types_raises(self, this_type, that_type):
....
944cf96 to
f2fbdf4
Compare
|
I've uploaded a version with fixture, but honestly I consider this less readable than the previous version... |
pandas/tests/reshape/test_concat.py
Outdated
| pd.MultiIndex.from_arrays(['A B C'.split(), 'D E F'.split()]), | ||
| ] | ||
|
|
||
| @pytest.fixture(params=indexes_II) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, rather than use the fixtures, let's leave the lists where they are (though you don't need to have indexes_I at all, and pls name them something else more descriptive. and you can use parametrize directrly.
| columns=ser_index) | ||
| assert_frame_equal(result, expected) | ||
|
|
||
| @pytest.mark.parametrize("df_columns, series_index", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can just use 2 parameterize here, one for df_columns and one for series_index; you don't need combinations, pytest already does this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't get this approach to work, AFAIKS, parametrize doesn't do this for you, and I've googled it without results. Can you give an example?
|
can you update |
|
can you rebase |
|
@topper-123 can you rebase |
f2fbdf4 to
7e57ff6
Compare
|
Rebased. I've also simplified the tests a bit. I dropped the fixture-based approach, as I don't think it makes things clearer in this case. |
7e57ff6 to
fe4bc5a
Compare
|
do you disagree wit the approach, @jreback? |
jreback
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. can you just make the whatsnew a bit more
doc/source/whatsnew/v0.23.0.txt
Outdated
| - :class:`IntervalIndex` now supports time zone aware ``Interval`` objects (:issue:`18537`, :issue:`18538`) | ||
| - :func:`Series` / :func:`DataFrame` tab completion also returns identifiers in the first level of a :func:`MultiIndex`. (:issue:`16326`) | ||
| - :func:`read_excel()` has gained the ``nrows`` parameter (:issue:`16645`) | ||
| - :meth:`DataFrame.append` now preserves the type of the calling dataframe's columns, when possible (:issue:`18359`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add the case this is actually addressing (e.g. categorical)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed.
fe4bc5a to
13537ae
Compare
|
@jreback, could you pull this in? |
|
thanks @topper-123 |
|
A related question. I was expecting making a column an index , or level in a MultiIndex, to not change the dtype. df = pd.DataFrame(dict(x=np.array([1,2],dtype = 'int32'),
y =['a', 'b']))
df.set_index('x').reset_index().info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
x 2 non-null int64
y 2 non-null object
dtypes: int64(1), object(1)
memory usage: 104.0+ bytesIs this exepected ,and will this enhancement change this? Thanks |
|
The index will in this case be a Int64Index: >>> df.set_index('x').index
Int64Index([1, 2], dtype='int64', name='x')Pandas doesn't have a Int32Index, so this behaviour is as expected and you shouldn't expect an int32 when resetting (or you'll have to reset manually). |
| self, index_can_append, index_cannot_append_with_other): | ||
| # GH18359 | ||
| # Dataframe.append will raise if IntervalIndex/MultiIndex appends | ||
| # or is appended to a different index type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@topper-123 can you help me understand why raising is the correct thing to do here? i have a branch that fixes a Series construction bug:
def test_constructor_dict_multiindex_reindex_flat(self):
# construction involves reindexing with a MultiIndex corner case
data = {('i', 'i'): 0, ('i', 'j'): 1, ('j', 'i'): 2, "j": np.nan}
expected = Series(data)
result = Series(expected[:-1].to_dict(), index=expected.index)
tm.assert_series_equal(result, expected)
but the fix for that (in MultiIndex.reindex) causes this test to fail
git diff upstream/master -u -- "*.py" | flake8 --diffThis PR makes
DataFrame.appendpreserve columns dtype, e.g.:Previously, the above returned
Index(['a', 'b'], dtype='object'), i.e. the index type information was lost when usingappend.