ARROW-12231: [C++][Python][Dataset] Isolate one-shot data to scanner #10070
Conversation
CC @westonpace
It's a pity; if we were not crossing languages, I think we could require as input an iterable (something that can be invoked to generate a (potentially one-shot) iterator) instead of an iterator, and then wouldn't have to have this distinction. This looks good, but do we want to open a second JIRA for the post-12289 follow-up, or just wait and do that work in this JIRA later?
If the other side is something like a Flight stream, then I think we'd still have the distinction, unfortunately. I am happy to wait until we have AsyncScanner merged and then I can update this.
How would a Flight stream even work with the datasets API? What would fragments be? How would it know when the file is ended? I think I need to fit this into my mental model.
Sorry, so this was originally added to support writing data from a generator - which could be something like a Flight stream (=record batch reader). But writing data in Datasets consumes a scanner, so you end up having to support one-shot datasets. I agree supporting reading data from Flight is an entirely different matter and would be modeled differently (presumably, as an iterable, as you suggest, corresponding to an RPC with a fixed set of parameters).
Ah, that's right. And for the writing from memory case we want to free up the memory after we write it, so an iterable would be out of the question.
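The iterable-versus-iterator distinction being discussed can be sketched in plain Python. This is a conceptual illustration only, not the Arrow API; the names `batches`, `reusable`, and `one_shot` are made up for the sketch:

```python
def batches():
    # A generator *function* is an iterable factory: each call returns a
    # fresh iterator, so the data can be scanned more than once.
    for i in range(3):
        yield i

reusable = batches    # can be invoked again to rescan (an "iterable")
one_shot = batches()  # a single iterator: consumed once, then exhausted

assert list(reusable()) == [0, 1, 2]
assert list(reusable()) == [0, 1, 2]  # a second scan still works

assert list(one_shot) == [0, 1, 2]
assert list(one_shot) == []  # exhausted: this is the "one-shot" case
```

A Flight stream or an in-memory source whose batches are freed after writing behaves like `one_shot`: once consumed, there is nothing left to rescan.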
Hmm...maybe streaming isn't the most intuitive name then. Technically all the file-based datasets are "streaming". If a user was copying a dataset, for example, we would stream the data a few batches at a time. Should we just use SingleShotDataset?
Sounds good to me.
Rebased to pick up ARROW-12289; now OneShotDataset uses its own Fragment implementation so that ScanBatchesAsync uses a background thread, to avoid blocking in the async scanner.
@westonpace do you want to look over the ScanBatchesAsync implementation here?
(force-pushed from b0a97ec to f1106c2)
To me, this seems less like a subclass of dataset and more like a subclass of Scanner: IMHO it's not intuitive that a dataset would ever be single-shot. Instead, I think it'd make more sense to add …
I'm not sure I agree. I agree with "it's not intuitive that a dataset would ever be single-shot". I don't agree that it makes any more sense for Scanner to be single-shot. I think the core non-intuitive piece is the concept of a "one-shot iterable". In my mental model: … So Scanner is just a "map" function which is generally (Python being the exception) reusable. Perhaps I will revisit my original suggestion of having the input to dataset be an iterable (…), although that takes us back pretty close to where we started 😬
I think you could argue that Fragment is just … Or put another way: if we limit the one-shotness to the Scanner, then we can hide the odd nonconforming Dataset/Fragment from the public API.
So this would be a scanner that doesn't use fragments or datasets at all? Then would the Python API change? Right now they pass batches/tables to the … So this change would be creating a scanner directly from a table/batches and bypassing the creation of a "dataset" entirely? I think that makes a lot of sense.
Right (though the implementation would just be a SyncScanner wrapping a OneShotFragment). In fact, Joris already refactored the Python side to have …
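A minimal plain-Python sketch of the shape being converged on here: the one-shotness lives in the scanner rather than in a Dataset subclass, and a second scan fails loudly instead of silently yielding nothing. All names (`OneShotScanner`, `scan_batches`) are illustrative, not Arrow's actual API:

```python
class OneShotScanner:
    """Illustrative only: wraps a one-shot batch iterator so that the
    single-use behaviour is confined to the scanner, not the dataset."""

    def __init__(self, batch_iter):
        self._batch_iter = batch_iter
        self._consumed = False

    def scan_batches(self):
        # Refuse a second scan explicitly rather than yielding nothing.
        if self._consumed:
            raise RuntimeError("one-shot scanner already consumed")
        self._consumed = True
        yield from self._batch_iter

scanner = OneShotScanner(iter([{"x": 1}, {"x": 2}]))
assert [b["x"] for b in scanner.scan_batches()] == [1, 2]

raised = False
try:
    list(scanner.scan_batches())
except RuntimeError:
    raised = True
assert raised  # the second scan is rejected, not silently empty
```

This mirrors the proposal above: callers construct a scanner directly from in-memory batches, and no nonconforming Dataset/Fragment appears in the public API.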
bkietz left a comment:
LGTM
cpp/src/arrow/dataset/scanner.cc (outdated):

      DCHECK_OK(Filter(scan_options_->filter));
    }
  ...
  class ARROW_DS_EXPORT OneShotScanTask : public ScanTask {
Instead of exporting these, I think we can keep them in an anonymous namespace
westonpace left a comment:
Ok, I like this latest approach. Right now there is a fallback in AsyncScanner::Finish that always uses the sync scanner when creating a scanner from a fragment, because I wasn't sure if we wanted to keep the API. This uses that API, so I created ARROW-12664, which I'll address soon.
LGTM, the …
This isolates the one-shot portion of InMemoryDataset to Scanner, so that it is more clearly used only for writing data from a source that cannot be re-read.
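The write path described above can be sketched in plain Python (conceptual only, not the Arrow API; `write_batches` and `sink` are made-up names). The point is that writing consumes each batch exactly once, so the source is never required to be re-readable:

```python
def write_batches(batch_stream, sink):
    """Consume a one-shot batch stream, appending each batch to `sink`.
    In Arrow this would stream batches out to dataset files; each batch
    can be dropped (freeing its memory) as soon as it is written."""
    for batch in batch_stream:
        sink.append(batch)

sink = []
# e.g. a Flight stream or record batch reader: not re-readable
one_shot = iter([b"batch0", b"batch1"])
write_batches(one_shot, sink)

assert sink == [b"batch0", b"batch1"]
assert list(one_shot) == []  # the source is exhausted; a rescan is impossible
```

This is why the one-shot behaviour belongs on the scanner used for writing, rather than on a Dataset that users might reasonably expect to scan repeatedly.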