-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-8074: [C++][Dataset][Python] FileFragments from buffers and NativeFiles #7156
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@bkietz Cool, I am testing this out. Something like currently does not yet work. This is because you are checking for I also noticed that it is easy to segfault Fragment, because we don't forbid the |
|
I am testing the other parquet tests that are also skipped, and that turned up already one issue: https://issues.apache.org/jira/browse/ARROW-8799 With a few small edits, all other tests pass now. Do I push that here? (can also keep for follow-up PR) |
|
@jorisvandenbossche please push that here, thanks! |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some questions below.
python/pyarrow/_dataset.pyx
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, it doesn't use a local filesystem as default? (or doesn't accept a URI?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm following the convention that _dataset.pyx APIs require filesystem while dataset.py APIs can default to a LocalFileSystem. I provide FileSource.from_uri to produce a FileSource from a uri but we could certainly remove that and accept a URI in the constructor instead. This feeds back into the "what's optimal python API" discussion
python/pyarrow/dataset.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the MockFileSystem fallback here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a dirty hack to get the ParquetDataset tests passing; it's ignored later except that fs must be a non-None FileSystem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also make the filesystem keyword in FileSystemDatasetFactory init optional, and manually raise an error when it is required (eg when passing a selector), so we can use None here, instead of the MockFilesystem hack ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious, did you report a bug to Cython?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. I can try to assemble a minimal reproducer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even without a minimal reproducer, if you just post the relevant snippets in a bug report (declarations then point of use), it may be enough for the Cython team to understand and fix the issue.
ac2a2df to
c9fcc99
Compare
41f6a03 to
e26c89a
Compare
jorisvandenbossche
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did another pass here!
One thing I am noticing is a segfault in ParquetFileFragment when accessing the filesytem:
In [1]: import pyarrow.dataset as ds
...:
...: with open("test.parquet", 'rb') as f:
...: dataset = ds.dataset(f)
In [2]: fragments = list(dataset.get_fragments())
In [3]: fr = fragments[0]
In [5]: fr.buffer
In [6]: fr.filesystem
Segmentation fault (core dumped)
The FileFragment.filesystem property needs to check for null pointers I think, similarly as the FileFragment.buffer already does (can't comment inline there)
The code in dataset.py is a bit in a messy state .. (already before this PR to be clear), we should think about how we can improve this, but after this PR, I would say
python/pyarrow/_dataset.pyx
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to expose FileSource publicly, so it also shouldn't matter too much
(we still need to make a choice for internal usage of course)
python/pyarrow/dataset.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would remove this here, I don't think there is a need right now for the user to create this manually?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see it is used below ..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could import it privately at the point of use.
python/pyarrow/_dataset.pyx
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But for our own usage, for me it is fine to move this into the main constructor (since that is handling a lot already)
python/pyarrow/dataset.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also make the filesystem keyword in FileSystemDatasetFactory init optional, and manually raise an error when it is required (eg when passing a selector), so we can use None here, instead of the MockFilesystem hack ?
e26c89a to
bd998b6
Compare
|
Regarding my comments on |
cpp/src/arrow/util/checked_cast.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we stick with the signature from the standard?
https://en.cppreference.com/w/cpp/memory/shared_ptr/pointer_cast
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can add an rvalue overload of checked_pointer_cast as the standard does, but my understanding is that since we're always taking ownership of r then the copy may as well take place in argument initialization, as with modernize-pass-by-value
cpp/src/arrow/dataset/file_base.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not satisfied how this class is transforming into a Franken-class. The need of static fake properties and the semi-broken default constructor.
I'd say make FileSource an interface and use inheritance, make Open() virtual, the path() and filesystem() will be specific to one implementation (maybe name them Source, FileSource, BufferSource, ...). We can make an accept visitor for classes who wants to touch properties like the path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that FileSource would benefit from refactoring but I think that doing so further in this PR (including resorbing WritableFileSource) is not necessary. I'll make a follow up JIRA
81cdf6f to
87d68c0
Compare
|
I feel 0.5 on this PR in general, the functionality it adds is initially for testing and it introduces debt. I'm not keen on the change on FileSystemFactory since this class is used for 3 purposes, and only one of them is used by Native handles:
This patch retrofits FileSystemFactory for no other purpose than making buffer sources accessible in python while ignoring the actual discovery. In reality, this class deals heavily and almost exclusively with paths. |
|
Taking a step back: wouldn't it be possible to eg "just" allow to create a Fragment from a buffer instead from a file? In practice, I think we only need to support dealing with buffers when there is a single buffer (so not like paths, where you can have multiple paths or a directory etc). And then do we need discovery at all? If we can construct a Fragment backed by a buffer instead of a file path, then you can create a Dataset from that, either with the physical schema of the fragment (no unification is needed if there is only one) or either with a user-specified schema. |
|
Note that it is not only for testing. We for sure use it for testing in pyarrow, but in pandas 1.0.4, we accidentally broke reading parquet files from file-like objects, and we directly got a some bug reports about it. So actual users also do that, to a certain extent. |
|
Yes, we can already create FileFragment from any FileSource. You make a valid point that, usually, this is meant for a single buffer. If this is only scoped to FileSource and the python bindings, not touching any Factory, then this is fine as is. |
|
Okay, I'll start trimming |
Co-authored-by: François Saint-Jacques <fsaintjacques@gmail.com>
87d68c0 to
93bd9f4
Compare
| shared_ptr[CRandomAccessFile] c_file | ||
| shared_ptr[CBuffer] c_buffer | ||
|
|
||
| if isinstance(file, FileSource): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is odd. Should you have _ensure_file_source like other methods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we've decided not to export FileSource I think we can probably delete this class in a follow up (in favor of cdef CFileSource _make_file_source(...) or so)
jorisvandenbossche
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates!
| if not isinstance(path_or_paths, list): | ||
| if not _is_path_like(path_or_paths): | ||
| self._fragment = parquet_format.make_fragment(path_or_paths) | ||
| self._dataset = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it will be cleaner to make a Dataset from this single fragment. That's not yet possible right now from python (have a PR for it), so can do as a follow-up
|
The travis failure is an unrelated Flight failure |
|
@bkietz did you already open some follow-up JIRAs? (eg for #7156 (comment)) I will handle my comment at #7156 (comment) in #7468 |
Adds
ds.FileSource, which represents an openable file and may be initialized from apath, filesystem, aBuffer, or any python object which can be wrapped byNativeFile.test_parquet.pynow usesBytesIOas the roundtrip medium for non legacyParquetDatasetinstead of resorting to a mock filesystem. Other than that the integration with Python is somewhat haphazard; I'm thinking we need to rewrite some of the APIs to be less magical about figuring out what is a selector, path, list(paths), etc since we will be adding buffers andNativeFiles to the mix.