ARROW-11797: [C++][Dataset] Provide batch stream Scanner methods #9589
westonpace
left a comment
Here's a quick look. My main concern is the potential race condition in Pop. The other comments are more along the lines of "I'm about to replace this with something else, so let's not worry".
cpp/src/arrow/dataset/scanner.cc
Outdated
I'll be replacing this with something better, so I don't know how much we need to worry, but this is not ideal. For example, with Parquet, this would fetch metadata for every file in the scan before starting to read any individual file, which introduces more latency than necessary.
Also, I'm not sure how this will interact with Parquet preloading.
cpp/src/arrow/dataset/scanner.cc
Outdated
Again, what I'm working on will work around this, so maybe don't stress at the moment, but there's no back-pressure here. If the batch consumer is not fast enough and the dataset is larger than RAM, the system will run out of memory.
It's a bit odd because you are fixing ARROW-11800 here (though without back-pressure) and then I'll be breaking it again with my implementation (the first pass of my implementation will have back-pressure and will parse loaded buffers a bit more serially).
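To illustrate the back-pressure concern, here is a minimal sketch, assuming C++17. `BoundedQueue` and `ProduceAndConsume` are hypothetical names, not Arrow classes, and this is not the implementation in this PR or in the planned replacement. `Push` blocks while the queue is full, so a slow consumer throttles the producers instead of letting batches accumulate in memory.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>
#include <thread>
#include <utility>

// Hypothetical bounded queue (not an Arrow class).
template <typename T>
class BoundedQueue {
 public:
  explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

  // Blocks while the queue already holds `capacity_` items.
  void Push(T value) {
    std::unique_lock<std::mutex> lock(mutex_);
    not_full_.wait(lock, [this] { return items_.size() < capacity_; });
    items_.push_back(std::move(value));
    high_water_mark_ = std::max(high_water_mark_, items_.size());
    lock.unlock();
    not_empty_.notify_one();
  }

  // Signals that no more items will be pushed.
  void Close() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      closed_ = true;
    }
    not_empty_.notify_all();
  }

  // Returns std::nullopt once the queue is closed and drained.
  std::optional<T> Pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    not_empty_.wait(lock, [this] { return !items_.empty() || closed_; });
    if (items_.empty()) return std::nullopt;
    T value = std::move(items_.front());
    items_.pop_front();
    lock.unlock();
    not_full_.notify_one();
    return value;
  }

  // Largest number of items that were ever queued at once.
  std::size_t high_water_mark() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return high_water_mark_;
  }

 private:
  const std::size_t capacity_;
  mutable std::mutex mutex_;
  std::condition_variable not_full_;
  std::condition_variable not_empty_;
  std::deque<T> items_;
  std::size_t high_water_mark_ = 0;
  bool closed_ = false;
};

// Pushes `count` integers from a producer thread and drains them on the
// calling thread. Returns the number of items consumed.
std::size_t ProduceAndConsume(BoundedQueue<int>* queue, int count) {
  std::thread producer([queue, count] {
    for (int i = 0; i < count; ++i) queue->Push(i);
    queue->Close();
  });
  std::size_t consumed = 0;
  while (queue->Pop().has_value()) ++consumed;
  producer.join();
  return consumed;
}
```

With a capacity of 4, the producer can never be more than four items ahead of the consumer, which is the property an unbounded queue lacks.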
cpp/src/arrow/dataset/scanner.h
Outdated
Why both Scan(visitor) and ToBatches? Couldn't you just do ToBatches().Visit(visitor)?
The difference is that ToBatches().Visit(visitor) would invoke visitor exclusively on the thread which called Visit(), whereas Scan(visitor) invokes visitor in the scan's thread pool. This method is speculative; I'm not sure we'd want to provide it, but I included it as an example.
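The threading difference can be sketched as follows. This is only an illustration with stand-in types: `Batch`, `Visitor`, `VisitOnCaller`, and `VisitOnWorkers` are hypothetical names, and plain `std::thread` workers stand in for the scan's thread pool. The practical consequence is that a visitor passed to the second form must be thread-safe.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <set>
#include <thread>
#include <vector>

using Batch = int;  // stand-in for a record batch
using Visitor = std::function<void(const Batch&)>;

// Analogue of ToBatches().Visit(visitor): every call to `visitor` happens on
// the thread that called this function.
void VisitOnCaller(const std::vector<Batch>& batches, const Visitor& visitor) {
  for (const Batch& batch : batches) visitor(batch);
}

// Analogue of Scan(visitor): `visitor` is called from worker threads, possibly
// concurrently, so it must be thread-safe.
void VisitOnWorkers(const std::vector<Batch>& batches, const Visitor& visitor,
                    std::size_t num_threads) {
  std::atomic<std::size_t> next{0};
  std::vector<std::thread> workers;
  for (std::size_t t = 0; t < num_threads; ++t) {
    workers.emplace_back([&] {
      // Each worker claims the next unvisited batch until none remain.
      for (std::size_t i = next.fetch_add(1); i < batches.size();
           i = next.fetch_add(1)) {
        visitor(batches[i]);
      }
    });
  }
  for (std::thread& worker : workers) worker.join();
}
```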
522aca4 to 08aec2f
90818b2 to ba8c952
I've updated this in case we change how to proceed with ARROW-7001:
westonpace
left a comment
This is very cool. Now I just need to provide the V2 :)
cpp/src/arrow/dataset/scanner.cc
Outdated
Style nit: member variables should be at the bottom of a struct:
https://google.github.io/styleguide/cppguide.html#Declaration_Order
cpp/src/arrow/dataset/scanner.cc
Outdated
I'm not entirely certain it is safe to modify iteration_error outside the mutex. What happens if Pop is accessing it at the same time?
cpp/src/arrow/dataset/scanner.cc
Outdated
This is also not safe to do outside the mutex.
cpp/src/arrow/dataset/scanner.cc
Outdated
Nit: it's best to release the mutex before calling notify.
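The locking comments above (write the error only while holding the mutex, and release the mutex before notifying) can be sketched roughly like this. `BatchChannel` is a hypothetical stand-in, not the struct under review; the only name taken from the discussion is `Pop`. Every write to shared state happens under the lock, and the lock goes out of scope before the condition variable is signalled.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for the state shared between scan tasks and Pop.
class BatchChannel {
 public:
  void Push(int batch) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      batches_.push_back(batch);
    }  // mutex released here, before waking the consumer
    ready_.notify_one();
  }

  // Records the first error and marks the channel finished, under the lock.
  void SetError(std::string message) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      if (error_.empty()) error_ = std::move(message);
      finished_ = true;
    }
    ready_.notify_all();
  }

  // Blocks until a batch is available or the channel is finished. Returns true
  // and fills `*batch` if a batch was popped. Otherwise returns false and
  // copies any recorded error into `*error` (empty if there was none).
  bool Pop(int* batch, std::string* error) {
    std::unique_lock<std::mutex> lock(mutex_);
    ready_.wait(lock, [this] { return !batches_.empty() || finished_; });
    if (!batches_.empty()) {
      *batch = batches_.front();
      batches_.pop_front();
      return true;
    }
    *error = error_;  // read under the same mutex that guards the write
    return false;
  }

 private:
  std::mutex mutex_;
  std::condition_variable ready_;
  std::deque<int> batches_;
  std::string error_;
  bool finished_ = false;
};
```

In this sketch, batches queued before the error are still delivered, and the error surfaces only once the queue has drained. That ordering is a design choice made for the example.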
cpp/src/arrow/dataset/scanner.cc
Outdated
I may just be missing it, but what causes the loop above to exit if there is an error while scanning?
Eventually we'll run out of scan tasks/batches, since an error in getting the next scan task won't stop the current scan tasks from eventually completing. So really it's "all scan tasks drained (and maybe we didn't start all of them due to a failure)"
cpp/src/arrow/dataset/scanner.h
Outdated
Nit: If I see ToBatches I expect RecordBatchVector. I had named it ScanBatches but I don't feel too strongly on this point.
You have a point. I didn't want to take ScanBatches since your method has a different signature. But maybe that could be ScanIndexedBatches or something.
Hmm, the issue I ran into was that ScanBatches was used by FileSystemDataset::Write and it needed the fragment info in order to have access to the fragment's partition expression. So at a minimum I needed to return "record batch & partition it came from".
I think there was some discussion (either on the ML or some JIRA/PR) about the benefit of keeping the fragment available as the user might want to know where the batch came from.
Can you modify the ScanBatches here to return a RecordBatch/Fragment pair? I could align my ScanBatches with that (PositionedRecordBatch is overkill).
Consider making a parameterized test instead so it is easier to trace failures if needed?
Is this just testing a scan of one scan task & one batch? It seems we would want to test more than that.
MakeScanner generates a union dataset of 2 InMemoryDatasets, each of which repeats the batch 16 times, so we should have 32 scan tasks in total.
python/pyarrow/tests/test_dataset.py
Outdated
At the very least we should create a JIRA to migrate these over to the new scan. Although I'm wondering if we want to just do that now because that would give us a lot more coverage of the non-deprecated path.
I'll do it now.
r/R/dataset-scan.R
Outdated
Nit: Does this comment still make sense?
Nope! :)
r/src/dataset.cpp
Outdated
Maybe this is a naive question, but why are there two versions of TakeRows?
Hmm, Ben split it into two functions (a pure-Arrow implementation, and the R binding) presumably for convenience. But maybe we can just port the implementation into the C++ library and expose it to Python as well?
r/R/dataset-scan.R
Outdated
This is nice but probably not necessary; this was a (de facto) internal method. If it were me I'd just delete it, but since you've already done this, we might as well keep it. Can you please make a JIRA to delete this so we don't forget to clean it up after the release?
We've been using ARROW-11782 so I'll note that in this PR and update the issue.
8e494c9 to 52f146e
@westonpace @bkietz this should be ready now, if either of you wants to take another look.
70c7bb8 to 5e885cd
I restored the |
2f09670 to 67ed5a8
Rebased (unfortunately, I had to squash all commits or else the rebase would've been a pain).
67ed5a8 to dc97bb1
cpp/src/arrow/dataset/scanner.cc
Outdated
Is there a JIRA for pushing down the index predicate into the scan?
I filed ARROW-12369.
cpp/src/arrow/dataset/scanner.cc
Outdated
Would you not want to skip empty arrays (where length == 0)?
Thanks, that also exposed a bug (or rather, a poor error message) if all indices were out-of-bounds.
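For readers without the diff at hand, here is a rough sketch of the two points in this exchange. `SplitIndices` is a hypothetical helper, not the code under review, and it assumes the requested row indices are non-negative and sorted in ascending order. Batches that contribute no rows (including zero-length ones) are skipped, and an index beyond the last row produces an explicit error.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// For each batch containing at least one requested row, returns the batch's
// position and the requested rows as offsets local to that batch.
// Assumes `indices` is sorted ascending and non-negative.
std::vector<std::pair<std::size_t, std::vector<int64_t>>> SplitIndices(
    const std::vector<int64_t>& batch_lengths,
    const std::vector<int64_t>& indices) {
  std::vector<std::pair<std::size_t, std::vector<int64_t>>> result;
  int64_t offset = 0;    // global row number of the current batch's first row
  std::size_t next = 0;  // position of the next index not yet assigned
  for (std::size_t b = 0; b < batch_lengths.size() && next < indices.size();
       ++b) {
    const int64_t end = offset + batch_lengths[b];
    std::vector<int64_t> local;
    while (next < indices.size() && indices[next] < end) {
      local.push_back(indices[next] - offset);
      ++next;
    }
    // Skip batches with nothing to take, including empty batches.
    if (!local.empty()) result.emplace_back(b, std::move(local));
    offset = end;
  }
  if (next < indices.size()) {
    // All batches were visited here, so `offset` is the total row count.
    throw std::out_of_range("row index " + std::to_string(indices[next]) +
                            " is out of bounds for " + std::to_string(offset) +
                            " rows");
  }
  return result;
}
```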
65f72a2 to ad14ea5
I am not sure why the JNI test is having so much trouble, but it passes locally under Docker.
FYI, there are probably some things we could do to improve the JNI build (ARROW-11633).