ARROW-8074: [C++][Dataset][Python] FileFragments from buffers and NativeFiles #7156

bkietz · 2020-05-12T14:46:11Z

Adds ds.FileSource, which represents an openable file and may be initialized from a path and filesystem, a Buffer, or any Python object that can be wrapped by NativeFile.

test_parquet.py now uses BytesIO as the roundtrip medium for the non-legacy ParquetDataset instead of resorting to a mock filesystem. Other than that, the integration with Python is somewhat haphazard; I'm thinking we need to rewrite some of the APIs to be less magical about figuring out what is a selector, path, list(paths), etc., since we will be adding buffers and NativeFiles to the mix.

github-actions · 2020-05-12T15:25:33Z

https://issues.apache.org/jira/browse/ARROW-8074

jorisvandenbossche · 2020-05-14T12:37:05Z

@bkietz Cool, I am testing this out.

Something like

import pyarrow.dataset as ds

with open("test.parquet", 'rb') as f: 
    dataset = ds.dataset(f)

does not work yet. This is because you are checking for io.BytesIO in the dataset() constructor, while the above open(..) gives an io.BufferedReader, which is apparently not a subclass of BytesIO.
Now, I am not fully familiar with the class hierarchy of the Python io module, so I will need to look into that a bit. Their common base class might be BufferedIOBase (https://docs.python.org/3/library/io.html#binary-i-o).
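
The class relationship described above can be checked with the standard library alone. This is a minimal sketch, not the check used in pyarrow; the helper name `is_binary_file_like` is hypothetical and exists only for illustration.

```python
import io
import os
import tempfile


def is_binary_file_like(obj):
    """Return True for buffered binary streams (BytesIO, open(..., 'rb')).

    Hypothetical helper for illustration; not a pyarrow function.
    """
    return isinstance(obj, io.BufferedIOBase)


# Write a small temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example bytes")
    path = tmp.name

try:
    with open(path, "rb") as f:
        opened_type = type(f)                          # io.BufferedReader
        opened_is_bytesio = isinstance(f, io.BytesIO)  # the check that fails
        opened_is_file_like = is_binary_file_like(f)   # the broader check
finally:
    os.remove(path)

in_memory_is_file_like = is_binary_file_like(io.BytesIO(b"example bytes"))
```

Note that a file opened with `buffering=0` is an `io.FileIO`, which derives from `io.RawIOBase` rather than `io.BufferedIOBase`, so a real check would probably need to cover that case as well.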

I also noticed that it is easy to segfault Fragment, because we don't forbid the __init__ constructor, but that's not related to the changes in this PR.

python/pyarrow/dataset.py

jorisvandenbossche · 2020-05-14T14:02:35Z

I am testing the other parquet tests that are also skipped, and that already turned up one issue: https://issues.apache.org/jira/browse/ARROW-8799

With a few small edits, all other tests pass now. Do I push that here? (can also keep for follow-up PR)

bkietz · 2020-05-14T14:16:59Z

@jorisvandenbossche please push that here, thanks!

python/pyarrow/_dataset.pyx

pitrou

Some questions below.

cpp/src/arrow/python/common.cc

cpp/src/arrow/python/common.h

python/pyarrow/_dataset.pyx

pitrou · 2020-05-18T14:52:30Z

python/pyarrow/_dataset.pyx

Interesting, it doesn't use a local filesystem as default? (or doesn't accept a URI?)

I'm following the convention that _dataset.pyx APIs require a filesystem while dataset.py APIs can default to a LocalFileSystem. I provide FileSource.from_uri to produce a FileSource from a URI, but we could certainly remove that and accept a URI in the constructor instead. This feeds back into the "what's the optimal Python API" discussion.

python/pyarrow/_dataset.pyx

pitrou · 2020-05-18T15:01:34Z

python/pyarrow/dataset.py

Why the MockFileSystem fallback here?

This is a dirty hack to get the ParquetDataset tests passing; it's ignored later except that fs must be a non-None FileSystem.

We could also make the filesystem keyword in the FileSystemDatasetFactory init optional, and manually raise an error when it is required (e.g. when passing a selector), so we can use None here instead of the MockFileSystem hack?
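
The suggestion above (an optional filesystem, with an explicit error only when one is actually needed) could look roughly like the sketch below. It is plain Python under stated assumptions: `Selector` and `resolve_filesystem` are hypothetical stand-ins, not pyarrow classes or functions, and the sketch follows the lower-level convention that paths and selectors need an explicit filesystem.

```python
class Selector:
    """Hypothetical stand-in for a file selector rooted at a base directory."""

    def __init__(self, base_dir):
        self.base_dir = base_dir


def resolve_filesystem(source, filesystem=None):
    """Return the filesystem to use for `source`, or None if none is needed.

    Hypothetical helper: in-memory bytes are read directly, whereas a
    selector or a path cannot be resolved without a filesystem.
    """
    if isinstance(source, (bytes, bytearray, memoryview)):
        return None  # buffers need no filesystem; any passed value is ignored
    if filesystem is None:
        kind = type(source).__name__
        raise ValueError(f"a filesystem is required when passing a {kind}")
    return filesystem
```

With this shape, a buffer source can carry `None` through the call chain, and the error surfaces only for inputs that genuinely need a filesystem.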

pitrou · 2020-05-18T15:03:10Z

python/pyarrow/includes/libarrow_dataset.pxd

I'm curious, did you report a bug to Cython?

No. I can try to assemble a minimal reproducer

Even without a minimal reproducer, if you just post the relevant snippets in a bug report (declarations then point of use), it may be enough for the Cython team to understand and fix the issue.

python/pyarrow/io.pxi

python/pyarrow/tests/test_dataset.py

jorisvandenbossche

Did another pass here!

One thing I am noticing is a segfault in ParquetFileFragment when accessing the filesystem:

In [1]: import pyarrow.dataset as ds 
   ...:  
   ...: with open("test.parquet", 'rb') as f:  
   ...:     dataset = ds.dataset(f) 

In [2]: fragments = list(dataset.get_fragments()) 

In [3]: fr = fragments[0]   

In [5]: fr.buffer 

In [6]: fr.filesystem 
Segmentation fault (core dumped)

I think the FileFragment.filesystem property needs to check for null pointers, similar to what FileFragment.buffer already does (I can't comment inline there).

The code in dataset.py is in a bit of a messy state (this was already the case before this PR, to be clear). We should think about how we can improve this, but after this PR, I would say

jorisvandenbossche · 2020-06-02T14:17:28Z

python/pyarrow/_dataset.pyx

I don't think we need to expose FileSource publicly, so it also shouldn't matter too much
(we still need to make a choice for internal usage of course)

jorisvandenbossche · 2020-06-02T14:18:06Z

python/pyarrow/dataset.py

I would remove this here, I don't think there is a need right now for the user to create this manually?

Ah, I see it is used below ..

I could import it privately at the point of use.

cpp/src/arrow/dataset/discovery.cc

jorisvandenbossche · 2020-06-02T14:44:57Z

python/pyarrow/_dataset.pyx

But for our own usage, for me it is fine to move this into the main constructor (since that is handling a lot already)

jorisvandenbossche · 2020-06-02T14:53:48Z

python/pyarrow/dataset.py

We could also make the filesystem keyword in the FileSystemDatasetFactory init optional, and manually raise an error when it is required (e.g. when passing a selector), so we can use None here instead of the MockFileSystem hack?

jorisvandenbossche · 2020-06-02T18:56:37Z

Regarding my comments on dataset.py: since that file is in need of a general clean-up of its "input" handling (single file path / directory path / list of paths / ...), it's certainly fine for me to deal with those comments only then.

cpp/src/arrow/dataset/test_util.h

python/pyarrow/_dataset.pyx

cpp/src/arrow/result.h

cpp/src/arrow/status.h

fsaintjacques · 2020-05-19T01:22:37Z

cpp/src/arrow/util/checked_cast.h

Shouldn't we stick with the signature from the standard?

https://en.cppreference.com/w/cpp/memory/shared_ptr/pointer_cast

I can add an rvalue overload of checked_pointer_cast as the standard does, but my understanding is that since we're always taking ownership of r then the copy may as well take place in argument initialization, as with modernize-pass-by-value

cpp/src/arrow/dataset/file_base.h

fsaintjacques · 2020-06-05T13:29:35Z

cpp/src/arrow/dataset/file_base.h

I'm not satisfied with how this class is turning into a Franken-class, with its need for static fake properties and its semi-broken default constructor.

I'd say make FileSource an interface and use inheritance: make Open() virtual, and path() and filesystem() will be specific to one implementation (maybe name them Source, FileSource, BufferSource, ...). We can add an accept-visitor method for classes that want to touch properties like the path.

I agree that FileSource would benefit from refactoring, but I don't think doing so further in this PR (including resorbing WritableFileSource) is necessary. I'll make a follow-up JIRA.
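
To make the shape of the proposed refactoring concrete, here is a rough Python analogue of an interface with one implementation per kind of source. It only illustrates the design idea: the actual class is C++, and every name below (`Source`, `PathSource`, `BufferSource`, `read_all`, the `opener` parameter) is hypothetical.

```python
import io
from abc import ABC, abstractmethod


class Source(ABC):
    """Anything that can be opened for reading; subclasses hold the details."""

    @abstractmethod
    def open(self):
        """Return a readable binary file object."""


class PathSource(Source):
    """Source backed by a path; only this subclass exposes `path`."""

    def __init__(self, path, opener=open):
        self.path = path
        self._opener = opener  # injectable so the sketch can run without files

    def open(self):
        return self._opener(self.path, "rb")


class BufferSource(Source):
    """Source backed by in-memory bytes; it has no path or filesystem."""

    def __init__(self, data):
        self.buffer = bytes(data)

    def open(self):
        return io.BytesIO(self.buffer)


def read_all(source):
    """Consumers depend only on the common open() method."""
    with source.open() as f:
        return f.read()
```

The point of the split is that a buffer-backed source no longer has to fake a path or filesystem; code that needs those properties must ask for the path-backed type explicitly.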

cpp/src/arrow/flight/client.cc

fsaintjacques · 2020-06-16T18:30:43Z

I feel 0.5 on this PR in general, the functionality it adds is initially for testing and it introduces debt. I'm not keen on the change on FileSystemFactory since this class is used for 3 purposes, and only one of them is used by Native handles:

Unify schema
Crawl a file system to explore it (requires a filesystem instance)
Discover partition information (requires a path)

This patch retrofits FileSystemFactory for no other purpose than making buffer sources accessible in python while ignoring the actual discovery. In reality, this class deals heavily and almost exclusively with paths.

jorisvandenbossche · 2020-06-16T19:24:35Z

Taking a step back: wouldn't it be possible to, e.g., "just" allow creating a Fragment from a buffer instead of from a file?

In practice, I think we only need to support buffers when there is a single buffer (so unlike paths, where you can have multiple paths or a directory, etc.). And then do we need discovery at all? If we can construct a Fragment backed by a buffer instead of a file path, then you can create a Dataset from that, either with the physical schema of the fragment (no unification is needed if there is only one) or with a user-specified schema.
In such a case, the factory can focus on file paths only.

jorisvandenbossche · 2020-06-16T19:26:01Z

Note that it is not only for testing. We certainly use it for testing in pyarrow, but in pandas 1.0.4 we accidentally broke reading parquet files from file-like objects, and we immediately got some bug reports about it. So actual users also do that, to a certain extent.

fsaintjacques · 2020-06-16T19:31:54Z

Yes, we can already create FileFragment from any FileSource. You make a valid point that, usually, this is meant for a single buffer.

If this is only scoped to FileSource and the python bindings, not touching any Factory, then this is fine as is.

bkietz · 2020-06-16T19:41:19Z

Okay, I'll start trimming

fsaintjacques · 2020-06-17T18:20:13Z

python/pyarrow/_dataset.pyx

+            shared_ptr[CRandomAccessFile] c_file
+            shared_ptr[CBuffer] c_buffer
+
+        if isinstance(file, FileSource):


This is odd. Should you have _ensure_file_source like other methods?

Since we've decided not to export FileSource, I think we can probably delete this class in a follow-up (in favor of cdef CFileSource _make_file_source(...) or so).

jorisvandenbossche

Thanks for the updates!

jorisvandenbossche · 2020-06-18T12:00:05Z

python/pyarrow/parquet.py

+        if not isinstance(path_or_paths, list):
+            if not _is_path_like(path_or_paths):
+                self._fragment = parquet_format.make_fragment(path_or_paths)
+                self._dataset = None


I think it will be cleaner to make a Dataset from this single fragment. That's not yet possible from Python (I have a PR for it), so it can be done as a follow-up.

jorisvandenbossche · 2020-06-18T12:01:42Z

The Travis failure is an unrelated Flight failure.

jorisvandenbossche · 2020-06-18T12:11:01Z

@bkietz did you already open some follow-up JIRAs? (e.g. for #7156 (comment))

I will handle my comment at #7156 (comment) in #7468

bkietz requested a review from jorisvandenbossche May 12, 2020 14:46

bkietz requested a review from fsaintjacques May 13, 2020 15:19

jorisvandenbossche reviewed May 14, 2020

python/pyarrow/dataset.py

jorisvandenbossche changed the title ~~ARROW-8047: [C++][Dataset][Python] FileFragments from buffers and NativeFiles~~ ARROW-8074: [C++][Dataset][Python] FileFragments from buffers and NativeFiles May 14, 2020

bkietz commented May 14, 2020

python/pyarrow/_dataset.pyx

pitrou reviewed May 18, 2020

bkietz force-pushed the 8047-FileFragments-from-NativeFile branch from ac2a2df to c9fcc99 Compare May 18, 2020 18:19

bkietz force-pushed the 8047-FileFragments-from-NativeFile branch 3 times, most recently from 41f6a03 to e26c89a Compare June 1, 2020 13:46

jorisvandenbossche reviewed Jun 2, 2020

bkietz force-pushed the 8047-FileFragments-from-NativeFile branch from e26c89a to bd998b6 Compare June 2, 2020 18:14

fsaintjacques reviewed Jun 5, 2020

fsaintjacques mentioned this pull request Jun 12, 2020

ARROW-8510: [C++][Datasets] Do not use variant in WritePlan to fix compiler error with VS 2017 #7419

Closed

bkietz force-pushed the 8047-FileFragments-from-NativeFile branch 2 times, most recently from 81cdf6f to 87d68c0 Compare June 12, 2020 16:42

bkietz added 4 commits June 16, 2020 15:42

ARROW-8047: [C++][Dataset] Support creation of Datasets with buffer f…

96d3c40

…ragments

add FileSource to python

68072f0

add support for arbitrary CustomOpen functors

b7e1115

refactor refcount handling

fc4f94e

bkietz and others added 20 commits June 16, 2020 15:42

add explicit default constructor for OwnedRefNoGIL

8d0e970

use std::forward when calling unbound method

7fedc05

try manual parameter packing for GCC 4.8

65f6d20

nullary parameter pack case

b7e864a

enable more python tests

1e662c5

correct wrapping of Fragment

5b2aaff

allow construction of Dataset from list(FileSource)

83f21e0

use __cinit__ for FileSource

1fc87db

review comments, add PyError RAII helper

5334256

iwyu: <utility>

d987c3c

revert rvalue accessors for result/status

6434656

revert CustomOpen

8a4447c

lint fixes

96814da

MSVC fix

12ef385

add absent FileFragment.filesystem -> None

6f4dbab

Update cpp/src/arrow/status.h

ad369db

Co-authored-by: François Saint-Jacques <fsaintjacques@gmail.com>

fix broken suggestion

50fb210

FileSource::FromPaths

e8bce6d

rebase fix: FileSource

9c4d95a

revert modifications to Factory

93bd9f4

bkietz force-pushed the 8047-FileFragments-from-NativeFile branch from 87d68c0 to 93bd9f4 Compare June 16, 2020 20:40

fsaintjacques approved these changes Jun 17, 2020

fsaintjacques reviewed Jun 17, 2020

jorisvandenbossche approved these changes Jun 18, 2020

jorisvandenbossche closed this in 9d7dca6 Jun 18, 2020

This was referenced Jun 18, 2020

[C++][Dataset] Support for file-like objects (buffers) in FileSystemDataset? #24286

Closed

[Python][Dataset] Clean-up internal FileSource class #17238

Closed
