ARROW-5494: [Python] Create FileSystem bindings #5258
cpp/src/arrow/python/datetime.h
Outdated
@pitrou apply ARROW_PYTHON_EXPORT for all?
Well, the functions which were inlined should be kept inline IMO.
Then I need to move all of the implementation to the header, including the static functions (practically the whole implementation).
That's how it was already, right?
If we keep everything internal, then that could work, although I prefer readable header files.
Actually, we should benchmark whether it has a performance impact or not.
Could you recommend specific benchmarks, if we have any?
We don't seem to have any datetime-heavy benchmarks in the ASV benchmarks directory.
However, you can probably devise a timeit-based microbenchmark easily. No need to be sophisticated :-)
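A timeit-based microbenchmark along these lines could look like the sketch below. It is only an illustration: `convert`, `bench`, and the 100,000-value workload are hypothetical names and sizes, and the pure-Python conversion is a stand-in so the snippet runs without pyarrow. To measure the C++ datetime helpers, the body of `convert` would be swapped for `pyarrow.array(values)`.

```python
import timeit
from datetime import datetime, timedelta

# Hypothetical workload: 100,000 datetime objects, one second apart.
EPOCH = datetime(1970, 1, 1)
BASE = datetime(2019, 1, 1)
VALUES = [BASE + timedelta(seconds=i) for i in range(100_000)]


def convert(values):
    """Stand-in for the conversion under test.

    Assumption: with pyarrow installed, replace the body with
    `return pyarrow.array(values)` to exercise the C++ datetime code.
    """
    # Microseconds since the Unix epoch, computed in pure Python.
    return [(v - EPOCH) // timedelta(microseconds=1) for v in values]


def bench(func, values, repeat=5, number=3):
    """Return the best per-call time in seconds over `repeat` runs."""
    timings = timeit.repeat(lambda: func(values), repeat=repeat, number=number)
    return min(timings) / number


if __name__ == "__main__":
    print(f"best of 5: {bench(convert, VALUES) * 1e3:.2f} ms per call")
```

Running it once with the functions inlined and once with them exported, and comparing the best-of-N times, should be enough to show whether the difference is measurable.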
@wesm how can I properly set the visibility for https://github.com/apache/arrow/pull/5258/files#diff-6de8e16eea663cbb2694f9e66e71bd75R55 ?
@kszucs it's a typedef... visibility only has to do with exported compiled symbols
Ahh, indeed! Thanks!
@pitrou I have a "not a subpath" error, which is probably Windows-related? https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/27224120/job/dx9qmxdj5tnenu94#L2057
Right now the filesystem classes expect forward slashes, and it seems you're passing backslashes. You should try |
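The suggestion above is cut off in this copy of the thread, so the following is not necessarily what was proposed. As a sketch of one way to turn a backslash-separated Windows path into the forward-slash form the filesystem classes expect, the standard library's `pathlib.PureWindowsPath` can be used; `to_forward_slashes` is a hypothetical helper name.

```python
from pathlib import PureWindowsPath


def to_forward_slashes(path: str) -> str:
    """Rewrite a Windows-style path so that it uses forward slashes.

    PureWindowsPath only manipulates the string; it never touches the
    filesystem, so this behaves the same on every platform.
    """
    return PureWindowsPath(path).as_posix()


if __name__ == "__main__":
    print(to_forward_slashes(r"C:\Users\me\data\file.parquet"))
```

Paths that already use forward slashes pass through unchanged.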
pitrou
left a comment
Thanks for doing this. It's gonna be very useful. I posted some comments below.
Looks like the AppVeyor failure needs fixing, can you take a look @kszucs ?
jorisvandenbossche
left a comment
I quickly tried this out, and added a few comments. But looks good!
For the pathlib comment, the other alternative would be to add it as a dependency (for Python 2 only, just as enum34 is a Python 2 only dependency providing the enum backport).
After transitioning to Python 3 only, it would indeed be reasonable to use pathlib.Path objects as return types (at least that's my preference). I suggest creating a JIRA where we can discuss this further (it possibly involves other parts of pyarrow); meanwhile I've set the return types to string.
You mean from |
Actually, no. pathlib paths are only for the local filesystem, not for arbitrary resources (e.g. S3). Instead you should return a plain string. |
Ah, that solves the python 2 / should we depend on pathlib question then. |
We could have a pathlib-compatible Path implementation as well. Whatever, the API we're providing is not stable yet.
For now I'm not doing a full review again, feel free to ping me when you think you're ready (and CI is fixed :-)).
@pitrou this should be green now. I'd like to move forward with bindings for S3 once it is merged.
pitrou
left a comment
Just two more comments, sorry ;-)
python/pyarrow/_fs.pyx
Outdated
since the C++ side then decodes from utf-8. On Unix, os.fsencode may be
better.
"""
return tobytes(_stringify_path(path))
I still think this shouldn't accept Path objects, only str and bytes. Path objects are only valid for local paths, but here we are dealing with paths in arbitrary systems.
@pitrou done. Actually, we don't accept bytes either, only strings (we discussed this previously).
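The outcome of this exchange (strings only, no Path objects and no bytes) can be illustrated with a small sketch. This is not the code from `_fs.pyx`; `path_as_utf8` is a hypothetical helper, and the UTF-8 encoding is taken from the docstring fragment quoted above, which says the C++ side decodes from utf-8.

```python
def path_as_utf8(path):
    """Encode a filesystem path for the C++ layer, accepting only str.

    Path objects are rejected because they describe local paths only,
    while the filesystem may be remote (for example S3); bytes are
    rejected so that the encoding is always decided in one place.
    """
    if not isinstance(path, str):
        raise TypeError(
            "expected a str path, got {}".format(type(path).__name__)
        )
    return path.encode("utf-8")


if __name__ == "__main__":
    print(path_as_utf8("bucket/key/data.parquet"))
```

Rejecting other types up front keeps the error close to the caller, rather than surfacing later as a confusing failure on the C++ side.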
@pitrou done, builds are green, the travis error is about codecov upload failure.
This is a very nice start, thank you :-)
arrow/python/util/datetime.h moved one level up, so that it can be used by the bindings