@bkietz
Member

@bkietz bkietz commented Mar 17, 2020

No description provided.

@jorisvandenbossche
Member

Cool, thanks for working on this!

Will take a closer look tomorrow, but just one comment for now. From the test, I see the usage looks like this:

fragment = ds.FileFragment(ds.ParquetFileFormat(), parquetbuffer)
dataset = ds.InMemoryDataset(fragment=fragment)

which lets one build a fragment and then a dataset from it. This provides everything needed to, e.g., allow ds.dataset(parquetbuffer) in the higher-level API.

But, naively, I would not have expected this to be an InMemoryDataset (for me, an in-memory dataset was about in-memory record batches, while here the data is in memory but still consists of Parquet files).
A more practical concern: assuming at some point we want to add a ParquetDataset(FileSystemDataset) subclass that exposes some more Parquet-specific properties, this makes it more difficult for such a ParquetDataset to support both file paths and buffers (although I assume this can be solved by using composition instead of inheritance).

@fsaintjacques
Contributor

fsaintjacques commented Apr 27, 2020

I'll close this for now. ARROW-8318 will remove this limitation: FileSystemDataset will be created from a list of FileFragments, which can themselves be created from a Buffer-backed FileSource. You'll be able to create a Dataset from a mix of files on disk and buffers in memory.
