ARROW-8074: [Dataset][Python] FileFragments from Buffers #6645

bkietz · 2020-03-17T20:19:15Z

No description provided.

github-actions · 2020-03-17T20:31:39Z

https://issues.apache.org/jira/browse/ARROW-8074

jorisvandenbossche · 2020-03-17T20:50:36Z

Cool, thanks for working on this!

Will take a closer look tomorrow, but just one comment. From the test I see the usage is like:

fragment = ds.FileFragment(ds.ParquetFileFormat(), parquetbuffer)
dataset = ds.InMemoryDataset(fragment=fragment)

which allows one to make a fragment and then a dataset from it. This gives everything needed to, e.g., allow ds.dataset(parquetbuffer) in the higher-level API.

But, naively, I would not have expected this to be an InMemoryDataset (for me, an in-memory dataset was about in-memory record batches, while here the data is in memory but still in Parquet file format).
A more practical concern: assuming at some point we want to make a ParquetDataset(FileSystemDataset) subclass that exposes some more parquet-specific properties, this makes it more difficult to have such a ParquetDataset that supports both file paths and buffers (although I assume this can be solved by using composition instead of inheritance).

fsaintjacques · 2020-04-27T12:33:51Z

I'll close this for now. ARROW-8318 will remove this limitation: FileSystemDataset will be created from a list of FileFragments, each of which can be created from a Buffer-backed FileSource. You'll be able to create a Dataset from a mix of files on disk and buffers in memory.

bkietz requested a review from jorisvandenbossche March 17, 2020 20:19

bkietz added 2 commits March 18, 2020 10:21

ARROW-8074: [Dataset][Python] Allow construction of FileFragments fro…

a7e22be

…m in memory buffers

lint fixes

453d74d

bkietz force-pushed the 8074-Support-for-file-like-obj branch from 3bfb3f9 to 453d74d Compare March 18, 2020 14:40

wesm force-pushed the master branch from 5fe5b88 to aa55967 Compare April 19, 2020 22:46

kszucs force-pushed the master branch from 1b71ca7 to 5093b80 Compare April 20, 2020 19:21

fsaintjacques closed this Apr 27, 2020

bkietz deleted the 8074-Support-for-file-like-obj branch February 25, 2021 16:34

asfimport mentioned this pull request Jun 18, 2020

[C++][Dataset] Support for file-like objects (buffers) in FileSystemDataset? #24286

Closed
