ARROW-9718: [Python] ParquetWriter to work with new FileSystem API #7991
pitrou
left a comment
Thank you @jorisvandenbossche. A couple of comments below.
python/pyarrow/fs.py
Outdated
"where"?
If it's the name of an argument, then put backquotes around it.
Yeah, this was copy-pasted from the implementation in pyarrow.filesystem, but I agree it can be improved. Will update.
The keyword name itself may vary depending on where this helper function is called, so I will keep the wording general, e.g. "the specified path".
python/pyarrow/parquet.py
Outdated
These are legacy filesystem imports? Do we still need them?
Perhaps import pyarrow.filesystem as legacyfs would make the code easier to read below.
Yes, we still need them because the full ParquetDataset/ParquetManifest (python) implementation here is based on the legacy filesystems.
But I switched to using the legacyfs. prefix for the old ones and plain imports for the new ones.
python/pyarrow/tests/test_parquet.py
Outdated
Also from pyarrow import filesystem as legacyfs?
Here I am going to leave it as is for now, because the old ones are still used a lot (changing them would make the diff much larger; I will keep that for a follow-up PR, e.g. when actually deprecating them).
python/pyarrow/tests/test_parquet.py
Outdated
This must be lifted out of the with block.
Note that it may be simpler to use pytest.raises(ValueError, match="...")
Yes, indeed, it was copied from another test, but I updated it to use match.
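For readers following this thread, the two points above (moving the assertion out of the with block, and using match) can be sketched as follows. This is only an illustration: `open_writer` and its error message are hypothetical stand-ins, not the code or the test from this PR.

```python
import pytest


def open_writer(where, filesystem=None):
    # Hypothetical stand-in for the code under test: it rejects a
    # filesystem argument that does not look like a filesystem object.
    if filesystem is not None and not hasattr(filesystem, "open_output_stream"):
        raise ValueError("filesystem passed but it is not a valid FileSystem")
    return where


def test_message_checked_after_block():
    with pytest.raises(ValueError) as err_info:
        open_writer("data.parquet", filesystem=object())
    # The assertion sits outside the with block. Placed inside, after the
    # raising call, it would never run, because the exception leaves the block.
    assert "filesystem passed" in str(err_info.value)


def test_message_checked_with_match():
    # match is applied as a regular-expression search on the message,
    # so no separate assertion is needed.
    with pytest.raises(ValueError, match="filesystem passed"):
        open_writer("data.parquet", filesystem=object())
```

The second form is shorter and cannot silently skip the message check, which is presumably why it was suggested here.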
python/pyarrow/tests/test_parquet.py
Outdated
Should we also test ParquetWriter(path=uri)?
I added one, but it is segfaulting locally (maybe similar to ARROW-9814).
If you pass -s to pytest, you should be able to see the C++ crash message (if any).
OK to merge it with the test commented out for now? (I opened an issue for it at https://issues.apache.org/jira/browse/ARROW-9906)
Yes!
jorisvandenbossche
left a comment
Thanks for the review! Pushed some updates
@jorisvandenbossche Do you want to merge this?
Yep
No description provided.