ARROW-1132: [Python] Unable to write pandas DataFrame w/MultiIndex containing duplicate values to parquet #768
Conversation
python/pyarrow/tests/test_parquet.py
Please revert this. See 8f2b44b#diff-3a10a971f558573678baea521d62790a
The problem is that if the extension is built but pyarrow.parquet fails to import due to a dynamic linking error, the whole module gets ignored.
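For context, a minimal sketch of the kind of guard that avoids this (hypothetical names, not the exact code in this PR): catch the ImportError explicitly and skip individual tests, instead of letting a module-level importorskip hide a linking failure.

```python
import pytest

# A broad module-level pytest.importorskip would silently skip this whole
# file even when the extension *was* built but failed to load (e.g. a
# dynamic linking error). Catching the failure explicitly keeps that
# distinction visible.
try:
    import pyarrow.parquet as pq  # noqa: F401
    HAVE_PARQUET = True
except ImportError:
    HAVE_PARQUET = False

# Mark individual tests instead of ignoring the module wholesale.
parquet = pytest.mark.skipif(not HAVE_PARQUET,
                             reason='Parquet support not available')
```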
Done.
(force-pushed from 48ccbfd to 95aa36e)
I'll remove this after #774 is merged.
Added another test to confirm that writing works. The parquet writing ended up being a red herring for a deeper problem: we couldn't handle duplicate MultiIndex levels at all. That problem is fixed in this patch.
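For readers landing here from the JIRA, a minimal sketch of the round trip this patch fixes (hypothetical file name; the final assert reflects the intended post-fix behavior):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# A MultiIndex whose entries contain duplicate values -- the case that
# previously broke the pandas -> Arrow conversion.
index = pd.MultiIndex.from_tuples([('a', 1), ('a', 1), ('b', 2)])
df = pd.DataFrame({'value': [1.0, 2.0, 3.0]}, index=index)

# Convert, write to parquet, and read back.
table = pa.Table.from_pandas(df)
pq.write_table(table, 'example.parquet')
result = pq.read_table('example.parquet').to_pandas()

# With this patch the duplicated index survives the round trip.
assert result.equals(df)
```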
@wesm passing on AppVeyor: https://ci.appveyor.com/project/cpcloud/arrow/build/1.0.184
wesm left a comment:
+1. Thank you!
Thanks @cpcloud for fixing this so quickly! When would you expect this fix to make it into a release/conda build?
The conda-forge packages can be updated whenever; help would be appreciated with that. We have a bunch of patches in flight, so it might be worth waiting a couple of days before starting the update process on conda-forge (for arrow-cpp/parquet-cpp/pyarrow).
Are you suggesting a pre-release version number? Is the release procedure simply opening a PR for the feedstock after updating the recipe? If so, I'm happy to do that. I'm okay waiting till next week to catch a few more patches.
Yeah, the compliant version number would be 0.5.0.pre (with a conda package build number that we can increment after that). We'll have to be careful with version pinning so that people can still install 0.4.1 without getting the newer packages (the upgrade from 0.3.0 to 0.4.0 was not as graceful as planned).
@wesm is the most recent commit a good option for cutting a release? This is how I've edited the feedstock so far; does it look good? bmabey/pyarrow-feedstock@53b5e20
It's a bit tricky because pyarrow depends on arrow-cpp and parquet-cpp, both of which have just undergone some changes to their build systems, so those recipes also need to be updated. Once apache/parquet-cpp#364 is merged I can start updating the arrow-cpp and parquet-cpp feedstocks; I will keep you posted.
@bmabey sorry I didn't finish the updates -- we are close to cutting a 0.5.0 release candidate, so I will try to make a 0.5.0.pre release on conda-forge before the final release goes out, also to make sure nothing broke in our packaging since 0.4.1 (since we moved a bunch of code around in the package toolchain).
Should do this conda-forge update in the next few days if all goes well.