ARROW-1132: [Python] Unable to write pandas DataFrame w/MultiIndex containing duplicate values to parquet #768
Conversation
python/pyarrow/tests/test_parquet.py
Please revert this. See 8f2b44b#diff-3a10a971f558573678baea521d62790a
The problem is that if the extension is built but pyarrow.parquet fails to import due to a dynamic linking error, the whole module gets ignored.
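For context, a minimal sketch of the kind of guard that avoids this (hypothetical names, not the exact code in this PR): catch the ImportError explicitly and skip individual tests, instead of letting a module-level importorskip hide a linking failure.

```python
import pytest

# A broad module-level pytest.importorskip would silently skip this whole
# file even when the extension *was* built but failed to load (e.g. a
# dynamic linking error). Catching the failure explicitly keeps that
# distinction visible.
try:
    import pyarrow.parquet as pq  # noqa: F401
    HAVE_PARQUET = True
except ImportError:
    HAVE_PARQUET = False

# Mark individual tests instead of ignoring the module wholesale.
parquet = pytest.mark.skipif(not HAVE_PARQUET,
                             reason='Parquet support not available')
```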
Done.
(force-pushed from 48ccbfd to 95aa36e)
I'll remove this after #774 is merged.
Added another test to confirm that writing works. The parquet writing ended up being a red herring for a deeper problem: we couldn't handle duplicate MultiIndex levels at all. That problem is fixed in this patch.
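For readers landing here from the JIRA, a minimal sketch of the round trip this patch fixes (hypothetical file name; the final assert reflects the intended post-fix behavior):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# A MultiIndex whose entries contain duplicate values -- the case that
# previously broke the pandas -> Arrow conversion.
index = pd.MultiIndex.from_tuples([('a', 1), ('a', 1), ('b', 2)])
df = pd.DataFrame({'value': [1.0, 2.0, 3.0]}, index=index)

# Convert, write to parquet, and read back.
table = pa.Table.from_pandas(df)
pq.write_table(table, 'example.parquet')
result = pq.read_table('example.parquet').to_pandas()

# With this patch the duplicated index survives the round trip.
assert result.equals(df)
```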
@wesm passing on AppVeyor: https://ci.appveyor.com/project/cpcloud/arrow/build/1.0.184
wesm left a comment:
+1. Thank you!
Thanks @cpcloud for fixing this so quickly! When would you expect this fix to make it into a release/conda build?
The conda-forge packages can be updated whenever; help would be appreciated with that. We have a bunch of patches in flight, so it might be worth waiting a couple of days before starting the update process on conda-forge (for arrow-cpp/parquet-cpp/pyarrow).
Are you suggesting a pre-release version number? Is the release procedure simply opening a PR for the feedstock after updating the recipe? If so, I'm happy to do that. I'm okay waiting till next week to catch a few more patches.
Yeah, the compliant version number would be 0.5.0.pre (with a conda package build number that we can increment after that). We'll have to be careful with version pinning so that people can still install 0.4.1 without getting the newer packages (the upgrade from 0.3.0 to 0.4.0 was not as graceful as planned).
@wesm is the most recent commit a good option for cutting a release? This is how I've edited the feedstock so far; does it look good? bmabey/pyarrow-feedstock@53b5e20
It's a bit tricky because pyarrow depends on arrow-cpp and parquet-cpp, both of which have just undergone some changes to their build systems, so those recipes also need to be updated. Once apache/parquet-cpp#364 is merged I can start updating the arrow-cpp and parquet-cpp feedstocks; I will keep you posted.
@bmabey sorry I didn't finish the updates -- we are close to cutting a 0.5.0 release candidate, so I will try to make a 0.5.0.pre release on conda-forge before the final release goes out, also to make sure nothing broke in our packaging since 0.4.1 (since we moved a bunch of code around in the package toolchain).
Should do this conda-forge update in the next few days if all goes well.