@bkietz
Member

@bkietz bkietz commented May 12, 2020

Adds ds.FileSource, which represents an openable file and may be initialized from a path and filesystem, a Buffer, or any Python object that can be wrapped by NativeFile.

test_parquet.py now uses BytesIO as the roundtrip medium for the non-legacy ParquetDataset instead of resorting to a mock filesystem. Other than that, the integration with Python is somewhat haphazard; I'm thinking we need to rewrite some of the APIs to be less magical about figuring out what is a selector, path, list(paths), etc., since we will be adding buffers and NativeFiles to the mix.

@bkietz bkietz requested a review from jorisvandenbossche May 12, 2020 14:46
@github-actions

github-actions bot commented May 12, 2020

@bkietz bkietz requested a review from fsaintjacques May 13, 2020 15:19
@jorisvandenbossche
Member

@bkietz Cool, I am testing this out.

Something like

import pyarrow.dataset as ds

with open("test.parquet", 'rb') as f: 
    dataset = ds.dataset(f) 

does not work yet. This is because you are checking for io.BytesIO in the dataset() constructor, while the above open(..) gives an io.BufferedReader, which is apparently not a subclass of BytesIO.
Now, I am not fully familiar with the class hierarchy of the Python io module, so will need to look into that a bit. Their common base class might be BufferedIOBase (https://docs.python.org/3/library/io.html#binary-i-o).
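
The hierarchy can be checked with the standard library alone. The sketch below is only an illustration: the helper name `is_binary_file_like` is hypothetical and is not part of pyarrow.

```python
import io
import os
import tempfile


def is_binary_file_like(obj):
    """Hypothetical helper: accept any buffered binary stream, not just BytesIO."""
    return isinstance(obj, io.BufferedIOBase)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.bin")
    with open(path, "wb") as f:
        f.write(b"data")

    # A file opened in 'rb' mode with default buffering is an io.BufferedReader.
    with open(path, "rb") as f:
        reader_type = type(f)
        reader_is_bytesio = isinstance(f, io.BytesIO)
        reader_ok = is_binary_file_like(f)

# BytesIO shares the BufferedIOBase base class with BufferedReader.
bytesio_ok = is_binary_file_like(io.BytesIO(b"data"))
```

Note that a file opened with `buffering=0` is an `io.FileIO`, which derives from `io.RawIOBase` rather than `io.BufferedIOBase`, so a check against `io.IOBase` would be broader still.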

I also noticed that it is easy to segfault Fragment, because we don't forbid the __init__ constructor, but that's not related to the changes in this PR.

@jorisvandenbossche jorisvandenbossche changed the title ARROW-8047: [C++][Dataset][Python] FileFragments from buffers and NativeFiles ARROW-8074: [C++][Dataset][Python] FileFragments from buffers and NativeFiles May 14, 2020
@jorisvandenbossche
Member

I am testing the other parquet tests that are also skipped, and that already turned up one issue: https://issues.apache.org/jira/browse/ARROW-8799

With a few small edits, all other tests pass now. Shall I push those here? (I can also keep them for a follow-up PR.)

@bkietz
Member Author

bkietz commented May 14, 2020

@jorisvandenbossche please push that here, thanks!

Member

@pitrou pitrou left a comment

Some questions below.

Member

Interesting, it doesn't use a local filesystem as default? (or doesn't accept a URI?)

Member Author

I'm following the convention that _dataset.pyx APIs require a filesystem while dataset.py APIs can default to a LocalFileSystem. I provide FileSource.from_uri to produce a FileSource from a URI, but we could certainly remove that and accept a URI in the constructor instead. This feeds back into the "what's the optimal Python API" discussion.

Member

Why the MockFileSystem fallback here?

Member Author

This is a dirty hack to get the ParquetDataset tests passing; it's ignored later except that fs must be a non-None FileSystem.

Member

We could also make the filesystem keyword in the FileSystemDatasetFactory init optional, and manually raise an error when it is required (e.g. when passing a selector), so we can use None here instead of the MockFileSystem hack?
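
One way to read this suggestion, as a rough sketch only: `Selector` and `validate_factory_inputs` are hypothetical stand-ins written for illustration, not pyarrow's actual Cython signature.

```python
from dataclasses import dataclass


@dataclass
class Selector:
    """Hypothetical stand-in for a directory selector, which needs a filesystem."""
    base_dir: str


def validate_factory_inputs(source, filesystem=None):
    """Return (source, filesystem); raise only when a filesystem is really required."""
    is_path_list = isinstance(source, list) and all(
        isinstance(p, str) for p in source
    )
    needs_filesystem = isinstance(source, (Selector, str)) or is_path_list
    if needs_filesystem and filesystem is None:
        raise ValueError("a filesystem is required for paths and selectors")
    # Buffers and file objects pass through with filesystem=None.
    return source, filesystem
```

With this shape, an in-memory buffer needs no placeholder filesystem, while a selector without one fails early with a clear error.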

Member

I'm curious, did you report a bug to Cython?

Member Author

No. I can try to assemble a minimal reproducer.

Member

Even without a minimal reproducer, if you just post the relevant snippets in a bug report (declarations then point of use), it may be enough for the Cython team to understand and fix the issue.

@bkietz bkietz force-pushed the 8047-FileFragments-from-NativeFile branch from ac2a2df to c9fcc99 Compare May 18, 2020 18:19
@bkietz bkietz force-pushed the 8047-FileFragments-from-NativeFile branch 3 times, most recently from 41f6a03 to e26c89a Compare June 1, 2020 13:46
Member

@jorisvandenbossche jorisvandenbossche left a comment

Did another pass here!

One thing I am noticing is a segfault in ParquetFileFragment when accessing the filesystem:

In [1]: import pyarrow.dataset as ds 
   ...:  
   ...: with open("test.parquet", 'rb') as f:  
   ...:     dataset = ds.dataset(f) 

In [2]: fragments = list(dataset.get_fragments()) 

In [3]: fr = fragments[0]   

In [5]: fr.buffer 

In [6]: fr.filesystem 
Segmentation fault (core dumped)

I think the FileFragment.filesystem property needs to check for null pointers, as FileFragment.buffer already does (I can't comment inline there).


The code in dataset.py is in a somewhat messy state (this predates this PR, to be clear). We should think about how we can improve this, but after this PR, I would say.

Member

I don't think we need to expose FileSource publicly, so it also shouldn't matter too much
(we still need to make a choice for internal usage of course)

Member

I would remove this here, I don't think there is a need right now for the user to create this manually?

Member

Ah, I see it is used below.

Member Author

I could import it privately at the point of use.

Member

But for our own usage, for me it is fine to move this into the main constructor (since that is handling a lot already)

@bkietz bkietz force-pushed the 8047-FileFragments-from-NativeFile branch from e26c89a to bd998b6 Compare June 2, 2020 18:14
@jorisvandenbossche
Member

Regarding my comments on dataset.py: since that file is in need of a general clean-up of its "input" handling (single file path / directory path / list of paths / ...), it's certainly fine for me to only deal with those comments then.
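
As a sketch of what more explicit input handling could look like, the function below names the kind of input once, up front. `classify_source` and the category strings are hypothetical and exist only for illustration; this is not the code in dataset.py.

```python
import io
import os


def classify_source(source):
    """Hypothetical helper: decide the input kind once instead of guessing later."""
    if isinstance(source, (str, os.PathLike)):
        return "path"
    if isinstance(source, (list, tuple)):
        if all(isinstance(p, (str, os.PathLike)) for p in source):
            return "paths"
        raise TypeError("lists must contain only paths")
    if isinstance(source, (bytes, bytearray, memoryview)):
        return "buffer"
    if isinstance(source, io.IOBase):
        return "file"
    raise TypeError(f"unsupported source type: {type(source).__name__}")
```

Each branch can then dispatch to its own code path, and unsupported inputs fail with a TypeError instead of falling through to a later, less obvious error.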

Contributor

Shouldn't we stick with the signature from the standard?

https://en.cppreference.com/w/cpp/memory/shared_ptr/pointer_cast

Member Author

I can add an rvalue overload of checked_pointer_cast as the standard does, but my understanding is that, since we're always taking ownership of r, the copy may as well take place in argument initialization, as with modernize-pass-by-value.

Contributor

I'm not satisfied with how this class is turning into a Franken-class: it needs static fake properties and has a semi-broken default constructor.

I'd say make FileSource an interface and use inheritance: make Open() virtual, and let path() and filesystem() be specific to one implementation (maybe name them Source, FileSource, BufferSource, ...). We can add an accept/visitor method for classes that want to touch properties like the path.
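
The shape of that proposal, transposed to Python purely for illustration (the real classes are C++; `Source`, `PathSource`, `BufferSource` and `describe` are hypothetical names, and `describe` is only a simple stand-in for a visitor):

```python
import io
from abc import ABC, abstractmethod


class Source(ABC):
    """Hypothetical interface: anything that can be opened for reading."""

    @abstractmethod
    def open(self):
        """Return a readable binary file object."""


class PathSource(Source):
    """Path-backed source; only this implementation has a path."""

    def __init__(self, path):
        self.path = path

    def open(self):
        return open(self.path, "rb")


class BufferSource(Source):
    """Buffer-backed source; it has no path or filesystem at all."""

    def __init__(self, buffer):
        self.buffer = buffer

    def open(self):
        return io.BytesIO(self.buffer)


def describe(source):
    """Stand-in for a visitor: touch path only where it exists."""
    if isinstance(source, PathSource):
        return source.path
    return "<buffer>"
```

The point of the split is that no implementation has to fake a property it does not have.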

Member Author

I agree that FileSource would benefit from refactoring, but I think that doing so further in this PR (including resorbing WritableFileSource) is not necessary. I'll make a follow-up JIRA.

@fsaintjacques
Contributor

fsaintjacques commented Jun 16, 2020

I feel 0.5 on this PR in general: the functionality it adds is initially for testing, and it introduces debt. I'm not keen on the change to FileSystemFactory, since this class is used for 3 purposes and only one of them is used by Native handles:

  • Unify schema
  • Crawl a file system to explore (requires a filesystem instance)
  • Discover partition information (requires a path)

This patch retrofits FileSystemFactory for no other purpose than making buffer sources accessible in Python, while ignoring the actual discovery. In reality, this class deals heavily and almost exclusively with paths.

@jorisvandenbossche
Member

jorisvandenbossche commented Jun 16, 2020

Taking a step back: wouldn't it be possible to, e.g., "just" allow creating a Fragment from a buffer instead of from a file?

In practice, I think we only need to support buffers when there is a single buffer (so not like paths, where you can have multiple paths or a directory, etc.). And then do we need discovery at all? If we can construct a Fragment backed by a buffer instead of a file path, then you can create a Dataset from that, either with the physical schema of the fragment (no unification is needed if there is only one) or with a user-specified schema.
And in such a case, the factory can focus on file paths only.

@jorisvandenbossche
Member

Note that it is not only for testing. We certainly use it for testing in pyarrow, but in pandas 1.0.4 we accidentally broke reading parquet files from file-like objects, and we immediately got some bug reports about it. So actual users also do that, to a certain extent.

@fsaintjacques
Contributor

Yes, we can already create FileFragment from any FileSource. You make a valid point that, usually, this is meant for a single buffer.

If this is scoped only to FileSource and the Python bindings, without touching any Factory, then this is fine as is.

@bkietz
Member Author

bkietz commented Jun 16, 2020

Okay, I'll start trimming

@bkietz bkietz force-pushed the 8047-FileFragments-from-NativeFile branch from 87d68c0 to 93bd9f4 Compare June 16, 2020 20:40
shared_ptr[CRandomAccessFile] c_file
shared_ptr[CBuffer] c_buffer

if isinstance(file, FileSource):
Contributor

This is odd. Should you have _ensure_file_source like other methods?

Member Author

Since we've decided not to export FileSource, I think we can probably delete this class in a follow-up (in favor of cdef CFileSource _make_file_source(...) or so).

Member

@jorisvandenbossche jorisvandenbossche left a comment

Thanks for the updates!

if not isinstance(path_or_paths, list):
    if not _is_path_like(path_or_paths):
        self._fragment = parquet_format.make_fragment(path_or_paths)
        self._dataset = None
Member

I think it will be cleaner to make a Dataset from this single fragment. That's not yet possible from Python (I have a PR for it), so it can be done as a follow-up.

@jorisvandenbossche
Member

The Travis failure is an unrelated Flight failure.

@jorisvandenbossche
Member

@bkietz did you already open some follow-up JIRAs? (eg for #7156 (comment))

I will handle my comment at #7156 (comment) in #7468
