
Conversation

@lidavidm (Member) commented Apr 16, 2021

This isolates the one-shot portion of InMemoryDataset to Scanner, so that it is more clearly used only for writing data from a source that cannot be re-read.

@lidavidm (Member Author)

CC @westonpace

@westonpace (Member)

It's a pity; if we were not crossing languages, I think we could require an iterable as input (something that can be invoked to generate a (potentially one-shot) iterator) instead of an iterator, and then we wouldn't need this distinction. This looks good, but do we want to open a second JIRA for the post-ARROW-12289 follow-up, or just wait and do that work in this JIRA later?
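
For concreteness, a minimal standalone sketch of that distinction (toy types, not Arrow's actual classes): an iterable can be invoked repeatedly to produce fresh one-shot iterators, while a bare iterator can be consumed only once.

```cpp
#include <functional>
#include <iostream>
#include <memory>
#include <optional>
#include <vector>

using Batch = int;  // stand-in for a record batch
using BatchIterator = std::function<std::optional<Batch>()>;  // consumable once
using BatchIterable = std::function<BatchIterator()>;         // re-invokable

// An iterable backed by in-memory data: every invocation produces a fresh
// iterator, so the data can be scanned any number of times.
BatchIterable MakeVectorIterable(std::vector<Batch> batches) {
  return [batches]() -> BatchIterator {
    auto index = std::make_shared<std::size_t>(0);
    return [batches, index]() -> std::optional<Batch> {
      if (*index >= batches.size()) return std::nullopt;
      return batches[(*index)++];
    };
  };
}

int main() {
  BatchIterable iterable = MakeVectorIterable({1, 2, 3});
  for (int pass = 0; pass < 2; ++pass) {  // two full scans both succeed
    BatchIterator it = iterable();
    while (auto batch = it()) std::cout << *batch << ' ';
    std::cout << '\n';
  }
}
```

A one-shot source (e.g. a network stream) can only supply the bare BatchIterator, never the BatchIterable, which is exactly the distinction being discussed.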

@lidavidm (Member Author)

If the other side is something like a Flight stream, then I think we'd still have the distinction, unfortunately.

I am happy to wait until we have AsyncScanner merged and then I can update this.

@westonpace (Member)

How would a Flight stream even work with the datasets API? What would the fragments be? How would it know when the file has ended? I think I need to fit this into my mental model.

@lidavidm (Member Author)

Sorry, so this was originally added to support writing data from a generator, which could be something like a Flight stream (i.e. a record batch reader). But writing data in Datasets consumes a scanner, so you end up having to support one-shot datasets. I agree that supporting reading data from Flight is an entirely different matter and would be modeled differently (presumably as an iterable, as you suggest, corresponding to an RPC with a fixed set of parameters).

@westonpace (Member)

Ah, that's right. And for the writing-from-memory case, we want to free up the memory after we write it, so an iterable would be out of the question.
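
A toy sketch of that constraint (hypothetical types, not the actual Arrow implementation): a one-shot generator transfers ownership of each batch as it is consumed, so memory is freed while the write proceeds, and there is nothing left to hand to a second iterator.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

struct Batch {
  std::vector<int> values;
};

// A one-shot generator over in-memory batches: ownership of each batch is
// handed to the caller as it is consumed, so its memory can be released
// during the write. Afterwards there is nothing left to rewind to.
class OneShotGenerator {
 public:
  explicit OneShotGenerator(std::vector<std::unique_ptr<Batch>> batches)
      : batches_(std::move(batches)) {}

  // Returns nullptr once exhausted.
  std::unique_ptr<Batch> Next() {
    if (index_ >= batches_.size()) return nullptr;
    return std::move(batches_[index_++]);
  }

 private:
  std::vector<std::unique_ptr<Batch>> batches_;
  std::size_t index_ = 0;
};
```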

@westonpace (Member)

Hmm... maybe "streaming" isn't the most intuitive name, then. Technically all the file-based datasets are "streaming": if a user were copying a dataset, for example, we would stream the data a few batches at a time. Should we just use SingleShotDataset?

@lidavidm (Member Author)

Sounds good to me.

@lidavidm (Member Author)

Rebased to pick up ARROW-12289; OneShotDataset now uses its own Fragment implementation so that ScanBatchesAsync uses a background thread, to avoid blocking in the async scanner.
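
Roughly the shape of that change, as a standalone sketch (toy code, not the actual Arrow machinery): the blocking pull from the one-shot source runs on a background thread, so the async scanner only ever waits on a future.

```cpp
#include <future>
#include <optional>

struct RecordBatch {};

// Stand-in for a blocking pull from a one-shot source such as a
// RecordBatchReader; here it simply reports end-of-stream.
std::optional<RecordBatch> BlockingRead() { return std::nullopt; }

// Each call schedules the blocking read on a background thread and returns
// a future, so the caller's own threads are never blocked on the read.
std::future<std::optional<RecordBatch>> ReadBatchAsync() {
  return std::async(std::launch::async, [] { return BlockingRead(); });
}

int main() {
  auto future = ReadBatchAsync();
  // ... other work can proceed while the read runs in the background ...
  auto batch = future.get();
  (void)batch;
  return 0;
}
```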

@lidavidm (Member Author)

@westonpace do you want to look over the ScanBatchesAsync implementation here?

@lidavidm force-pushed the arrow-12231 branch 2 times, most recently from b0a97ec to f1106c2 (May 4, 2021)
@bkietz (Member) commented May 4, 2021

To me, this seems less like a subclass of dataset and more like a subclass of Scanner: IMHO it's not intuitive that a dataset would ever be single-shot. Instead, I think it'd make more sense to add Scanner::MakeFromRecordBatchReader or so, and (probably) add single-shot-ness to the contract of Scanner.
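
A sketch of what such a factory could look like (the name follows the "or so" suggestion above; the types are stand-ins, not Arrow's actual declarations):

```cpp
#include <memory>
#include <utility>

struct RecordBatchReader {};  // stand-in for arrow::RecordBatchReader

class Scanner {
 public:
  // Hypothetical factory, per the suggestion above: the scanner takes
  // ownership of the reader, and since the reader is consumed as it is
  // read, "can be scanned exactly once" becomes part of the contract.
  static std::unique_ptr<Scanner> MakeFromRecordBatchReader(
      std::shared_ptr<RecordBatchReader> reader) {
    return std::unique_ptr<Scanner>(new Scanner(std::move(reader)));
  }

 private:
  explicit Scanner(std::shared_ptr<RecordBatchReader> reader)
      : reader_(std::move(reader)) {}

  std::shared_ptr<RecordBatchReader> reader_;
};
```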

@westonpace (Member)

> To me, this seems less like a subclass of dataset and more like a subclass of Scanner: IMHO it's not intuitive that a dataset would ever be single-shot. Instead, I think it'd make more sense to add Scanner::MakeFromRecordBatchReader or so, and (probably) add single-shot-ness to the contract of Scanner.

I'm not sure I agree. I agree with "it's not intuitive that a dataset would ever be single-shot". I don't agree that it makes any more sense for Scanner to be single-shot. I think the core non-intuitive piece is the concept of a "one-shot iterable".

In my mental model:

Dataset -> Iterable<Fragment>
Fragment -> Iterable<RecordBatch>
Scanner -> Map<Dataset, Iterable<RecordBatch>>

So Scanner is just a "map" function which is generally (Python being the exception) reusable.

Perhaps I will revisit my original suggestion of having the input to a dataset be an iterable (InMemoryDataset::RecordBatchGenerator is already sort of an "iterable" interface), where the in-memory variants are one-shot iterables. The user-facing Python API could remain as-is: a list of batches or tables, or an iterable of batches or tables, would be converted into a RecordBatchReader, and a one-shot implementation of InMemoryDataset::RecordBatchGenerator would consume the reader and then return an invalid status the next time InMemoryDataset::RecordBatchGenerator::Get is called.

Although that takes us back pretty close to where we started 😬
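
That model, rendered as toy interfaces (illustrative only, not Arrow's real signatures):

```cpp
#include <memory>
#include <vector>

struct RecordBatch {};

// Fragment -> Iterable<RecordBatch>: each ReadBatches() call starts a fresh
// read (a one-shot fragment would be the exception).
class Fragment {
 public:
  virtual ~Fragment() = default;
  virtual std::vector<RecordBatch> ReadBatches() = 0;
};

// Dataset -> Iterable<Fragment>
class Dataset {
 public:
  virtual ~Dataset() = default;
  virtual std::vector<std::shared_ptr<Fragment>> GetFragments() = 0;
};

// Scanner -> Map<Dataset, Iterable<RecordBatch>>: conceptually just the
// function that flattens a dataset's fragments into batches.
std::vector<RecordBatch> Scan(Dataset& dataset) {
  std::vector<RecordBatch> batches;
  for (const auto& fragment : dataset.GetFragments())
    for (const auto& batch : fragment->ReadBatches()) batches.push_back(batch);
  return batches;
}
```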

@lidavidm (Member Author) commented May 5, 2021

I think you could argue that Fragment is just Iterable<RecordBatch> while Scanner is Iterator<RecordBatch>. While a Scanner is usually a rewindable (but not random-access) iterator, it only guarantees ForwardIterator. Furthermore, it's pretty simple to implement a one-shot scanner by having it wrap a (non-public-API) one-shot fragment (like the one implemented here).

Or put another way, if we limit the one-shotness to the Scanner, then we can hide the odd nonconforming Dataset/Fragment from the public API.
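
A hedged sketch of that wrapping (toy types; the real implementation would presumably return an invalid Status rather than throw):

```cpp
#include <stdexcept>
#include <utility>
#include <vector>

struct RecordBatch {};

// A toy one-shot fragment: the first scan hands out the batches, any later
// scan fails. Keeping this type out of the public API lets Scanner expose a
// plain forward-iteration contract while Dataset/Fragment stay rewindable.
class OneShotFragment {
 public:
  explicit OneShotFragment(std::vector<RecordBatch> batches)
      : batches_(std::move(batches)) {}

  std::vector<RecordBatch> Scan() {
    if (consumed_) throw std::runtime_error("fragment was already scanned");
    consumed_ = true;
    return std::move(batches_);
  }

 private:
  std::vector<RecordBatch> batches_;
  bool consumed_ = false;
};
```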

@westonpace (Member)

So this would be a scanner that doesn't use fragments or datasets at all? Then would the Python API change? Right now users pass batches/tables to the dataset function and then later get a scanner with scan.

So this change would be creating a scanner directly from a table/batches and bypassing the creation of a "dataset" entirely?

I think that makes a lot of sense.

@lidavidm (Member Author) commented May 5, 2021

Right (though the implementation would just be a SyncScanner wrapping a OneShotFragment).

In fact, Joris already refactored the Python side to have write take a Scanner. So in all cases, a Scanner gets passed to C++ (batches/etc. get turned into an InMemoryDataset and scanned; iterators get turned into a scanner directly). I'll rebase again to make sure everything still fits together.

@bkietz (Member) left a review:

LGTM

DCHECK_OK(Filter(scan_options_->filter));
}

class ARROW_DS_EXPORT OneShotScanTask : public ScanTask {
A reviewer (Member) commented on the diff above:

Instead of exporting these, I think we can keep them in an anonymous namespace.
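
For context, the suggested shape (illustrative only):

```cpp
// No export macro needed: the anonymous namespace gives the helper internal
// linkage, so it is visible only within this translation unit and stays out
// of the library's public ABI.
namespace {

class OneShotScanTask /* : public ScanTask */ {
  // ... implementation details ...
};

}  // namespace
```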

@lidavidm changed the title from "ARROW-12231: [C++][Python][Dataset] Differentiate one-shot datasets" to "ARROW-12231: [C++][Python][Dataset] Isolate one-shot data to scanner" (May 6, 2021)
@westonpace (Member) left a review:

Ok, I like this latest approach. Right now there is a fallback in AsyncScanner::Finish to always use the sync scanner when creating a scanner from a fragment, because I wasn't sure if we wanted to keep that API. This PR uses that API, so I created ARROW-12664, which I'll address soon.

@jorisvandenbossche (Member)

LGTM; the Scanner.from_batches approach instead of a single-shot dataset seems like a nice solution.
