Skip to content

Conversation

@westonpace
Copy link
Member

@westonpace westonpace commented Apr 8, 2021

To prepare for the AsyncScanner this PR creates a Scanner interface and, along the way, simplifies the current Scanner API so that the new scanner won't need to match.

What is removed:

  • Scanner::GetFragments was only used in FileSystemDataset::Write. The correct source of truth for fragments is the Dataset. Note: The python implementation exposed this method but it was not documented or used in any unit test. I think it can be safely removed and we need not worry about deprecation.
  • Scanner::schema is redundant and ambiguous. There are two schemas at the scan level. The dataset schema (the unified master schema that we expect all fragment schemas to be a subset of) and the projection schema (a combination of the dataset schema and the projection expression). Both of these are available on the scan options object and there is an accessor for these options so the caller might as well get them from there. This schema function was exposed via R and used internally there but I think any uses can be easily changed to using the options.
  • FileFormat::splittable and Fragment::splittable. These were intended to advertise that batch readahead was available on the given fragment/format. However, there is no need to advertise this. They are not used by the SyncScanner and the AsyncScanner will just assume that the format/fragment's will utilize readahead if they can (respecting the readahead options in ScanOptions)
  • Direct instantiation of Scanner. All Scanner creation should go through ScannerBuilder now. This allows the ScannerBuilder to determine what implementation to use. This was mostly the way things were implemented already. Only a few tests instantiated a Scanner directly.

What is deprecated

  • Scanner::Scan is going to be deprecated (ARROW-11797). It will not be implemented by AsyncScanner. I do not actually deprecate it in this PR as I reserve that for ARROW-11797. Unfortunately, this method was exposed via python & R and likely was used so deprecation is recommended over outright removal.

What is new

  • Scanner::ScanBatches and Scanner::ScanBatchesUnordered have been added. These functions will be the new preferred "scan" method going forward. This allows the parallelization (batch readahead, file readahead, etc.) to be handled by C++ and simplifies the user's life.
  • ScanOptions::batch_readahead and ScanOptions::fragment_readahead options allow more fine grained control over how to perform readahead. One technicality is that these options will not be respected well by the SyncScanner (although I think the current ARROW-11797 utilizes batch readahead) so they are more placeholders for when we implement AsyncScanner.
  • ScanOptions::cpu_executor and ScanOptions::io_context are added and should be fairly self explanatory.
  • ScanOptions::use_async will toggle which scanner to use.

@github-actions
Copy link

github-actions bot commented Apr 8, 2021

@westonpace
Copy link
Member Author

@ursabot please benchmark

@ursabot
Copy link

ursabot commented Apr 9, 2021

Benchmark runs are scheduled for baseline = 95ca4f5 and contender = bc1b8f0a7b6d36273ce815c288f59d96000a14cb. Results will be available as each benchmark for each run completes:
[Failed] ursa-i9-9960x: https://conbench.ursa.dev/compare/runs/e547252b-adca-4ac3-a547-6a480839d7a0...d3ba5dd8-1399-4556-93ef-19bbb0c7f76f/
[Failed] ursa-thinkcentre-m75q: https://conbench.ursa.dev/compare/runs/9494d4da-1a74-42ce-87f0-b05cc5bd0771...c1c2c894-19df-46fa-ba5b-6d667523e5a5/
[Failed] ec2-t3-large-us-east-2: https://conbench.ursa.dev/compare/runs/84ee2b27-9ce2-4872-8b69-6203ae82a032...0139f16e-8064-4e3e-87a1-d1602a138ba9/
[Failed] ec2-t3-xlarge-us-east-2: https://conbench.ursa.dev/compare/runs/251a600f-87d7-41e0-8954-9adbdbed9875...32e1a644-f27c-4c3c-9955-0ea4a497a02b/

@ursabot
Copy link

ursabot commented Apr 9, 2021

Benchmark runs are scheduled for baseline = 95ca4f5 and contender = 09e132a0c6378d80a1fd33cb82266449420e2397. Results will be available as each benchmark for each run completes:
[Failed] ursa-i9-9960x: https://conbench.ursa.dev/compare/runs/e547252b-adca-4ac3-a547-6a480839d7a0...c1afca64-00d3-4fb5-bbbc-a865a3e71057/
[Failed] ursa-thinkcentre-m75q: https://conbench.ursa.dev/compare/runs/9494d4da-1a74-42ce-87f0-b05cc5bd0771...ef9eea2d-ce34-43fc-964a-bcf993f5b8c4/
[Failed] ec2-t3-large-us-east-2: https://conbench.ursa.dev/compare/runs/84ee2b27-9ce2-4872-8b69-6203ae82a032...5cf39b2f-8724-43b6-8245-ab1b0bed3246/
[Failed] ec2-t3-xlarge-us-east-2: https://conbench.ursa.dev/compare/runs/251a600f-87d7-41e0-8954-9adbdbed9875...bcf09612-e9c9-4e48-94ce-bd184192fdec/

@westonpace westonpace force-pushed the feature/arrow-12288 branch from 09e132a to 90f86d2 Compare April 9, 2021 10:38
@ursabot
Copy link

ursabot commented Apr 9, 2021

Benchmark runs are scheduled for baseline = 6ddaaa8 and contender = 90f86d284b0640257f18910a26a95bbab9f18cf1. Results will be available as each benchmark for each run completes:
[Failed] ursa-i9-9960x: https://conbench.ursa.dev/compare/runs/d9b7107b-62f2-4dfe-b5c3-a27d18fd1f75...3e3e1b94-ae11-4e46-b63d-33aaa7eb224c/
[Failed] ursa-thinkcentre-m75q: https://conbench.ursa.dev/compare/runs/cb88b1a5-ef5d-468c-9c7f-60c0c2d79a8c...c77a16eb-6fea-4782-99df-2cd08777ed0e/
[Failed] ec2-t3-large-us-east-2: https://conbench.ursa.dev/compare/runs/bb28b466-d28e-4a8a-bb3c-dfa2c1725145...bfdc976c-9348-4ea3-bc97-e42ea1bfa2f9/
[Failed] ec2-t3-xlarge-us-east-2: https://conbench.ursa.dev/compare/runs/b4b45f4c-4bc4-4ea7-b7ba-5eb52540da4e...32960d64-1910-49c0-8558-97379cabc8a2/

Copy link
Contributor

@fsaintjacques fsaintjacques left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice cleanup.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll backport this into #9589 since they need to be synced up at some point.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about that. If you want to land #9589 first I can rebase this on top of that. I'm pretty sure the two are fairly compatible although there is some work to adjust to the new structs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've already backported it so whichever one gets reviewed first should get merged first :) (though I'm about to go back and remove operator== as per your point below.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this state pattern is common enough for iterators/async generators that I feel like we should encapsulate it instead of defining it ad-hoc every time. (Not for this PR, but as a future task.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already done for generators (#9945). It's a bit trickier here with the iterator here since it's enumerating both fragments and record batches at the same time. In the async version we enumerate them separately so it made more sense to have a general utility. Although this comment made me realize this should be named EnumeratingIterator and not TaggingIterator (I'm using "tagging" to refer to attaching the fragment to the record batch which is not what I'm doing here).

@lidavidm
Copy link
Member

lidavidm commented Apr 9, 2021

Should we actually merge this before ARROW-11797? Since that PR adds some more methods to the interface. Plus then we can kick things off in parallel.

@ursabot
Copy link

ursabot commented Apr 9, 2021

Benchmark runs are scheduled for baseline = 6ddaaa8 and contender = 451e238e4f05cbbae1f11cbb5a8dd2e49ff68267. Results will be available as each benchmark for each run completes:
[Failed] ursa-i9-9960x: https://conbench.ursa.dev/compare/runs/d9b7107b-62f2-4dfe-b5c3-a27d18fd1f75...dd3d7ce9-7a31-4e68-a7f6-ae032c65724e/
[Failed] ursa-thinkcentre-m75q: https://conbench.ursa.dev/compare/runs/cb88b1a5-ef5d-468c-9c7f-60c0c2d79a8c...97db6b5f-bc27-40c3-a376-aa3bc7642b31/
[Failed] ec2-t3-large-us-east-2: https://conbench.ursa.dev/compare/runs/bb28b466-d28e-4a8a-bb3c-dfa2c1725145...ac1839e9-ccf2-4c38-b48e-19b59c2f17ff/
[Failed] ec2-t3-xlarge-us-east-2: https://conbench.ursa.dev/compare/runs/b4b45f4c-4bc4-4ea7-b7ba-5eb52540da4e...c89daefd-8a87-4c08-824f-c62d9f6862d2/

@ursabot
Copy link

ursabot commented Apr 9, 2021

Benchmark runs are scheduled for baseline = 6ddaaa8 and contender = 4500bc1afbcd06312c65375b5b16ae7a1f4bc879. Results will be available as each benchmark for each run completes:
[Failed] ursa-i9-9960x: https://conbench.ursa.dev/compare/runs/d9b7107b-62f2-4dfe-b5c3-a27d18fd1f75...e83384a6-81b1-4445-996b-0747f3b30e6d/
[Failed] ursa-thinkcentre-m75q: https://conbench.ursa.dev/compare/runs/cb88b1a5-ef5d-468c-9c7f-60c0c2d79a8c...30aa6a27-38f9-481a-9faf-4a6a4c8aa4cd/
[Failed] ec2-t3-large-us-east-2: https://conbench.ursa.dev/compare/runs/bb28b466-d28e-4a8a-bb3c-dfa2c1725145...b940b64d-b5b5-41f5-b42f-6bf1fd0eb49c/
[Failed] ec2-t3-xlarge-us-east-2: https://conbench.ursa.dev/compare/runs/b4b45f4c-4bc4-4ea7-b7ba-5eb52540da4e...7a43a5b3-a087-4386-a9ec-b22e221164c7/

@westonpace westonpace force-pushed the feature/arrow-12288 branch from 4500bc1 to f88a917 Compare April 9, 2021 19:30
@ursabot
Copy link

ursabot commented Apr 9, 2021

Benchmark runs are scheduled for baseline = c0ce2b1 and contender = f88a9172bb7ace47efc83a2eb53c895d7fa5d958. Results will be available as each benchmark for each run completes:
[Finished] ursa-i9-9960x: https://conbench.ursa.dev/compare/runs/ae62d394-72d3-4748-97e1-803b871167ec...40e7e6bc-9177-422c-baa4-9e45a96509ea/
[Finished] ursa-thinkcentre-m75q: https://conbench.ursa.dev/compare/runs/dd805af5-4077-47ef-acc1-4b49532ae41e...dc7601de-9ac2-44a3-8d68-766fa45f9d98/
[Finished] ec2-t3-large-us-east-2: https://conbench.ursa.dev/compare/runs/0b51231e-fc1d-4888-834f-c3bccad1e026...83d518fe-63c5-437a-bf47-4141b9e6ae8e/
[Finished] ec2-t3-xlarge-us-east-2: https://conbench.ursa.dev/compare/runs/c4009461-4c7d-4f82-b535-9b5dd71e101c...c9451681-a295-4306-bd59-83eb135c8567/

@lidavidm
Copy link
Member

lidavidm commented Apr 9, 2021

This needs rebasing & I'll take a final look but otherwise I think we can get this at least into 4.0? And then we can hopefully also get the deprecation of Scan() in, which should set us up for 5.0.0.

@jorisvandenbossche
Copy link
Member

[about Scanner::GetFragments being removed] Note: The python implementation exposed this method but it was not documented or used in any unit test. I think it can be safely removed and we need not worry about deprecation.

Agreed, I don't think we need to worry about deprecating this one.

Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments, thanks for doing this:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

Suggested change
void AssertBatchEquals(RecordBatchReader* expected, RecordBatch* batch) {
void AssertBatchEquals(RecordBatchReader* expected, const RecordBatch& batch) {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Although why isn't RecordBatchReader passed by reference? Is it simply the fact that RecordBatch is const?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We generally favor pointers over non-const reference parameters, IIRC to make it clear what is used mutably, but I can't find the reference for this (I had thought it was in the Google C++ Style Guide).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to remove parallel execution of batch partitioning. If that's intended then please call it out and add a comment with the follow up JIRA noted

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had taken out the equivalent from #9589/ARROW-11797 but I can restore it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I rebased I restored the old version which relies on Scan (@lidavidm maybe the easiest thing to do is just add the pragma avoiding warning for deprecation. I'll add some kind of switch for the async version to use ScanBatches).

However, the old version created one scan task per fragment which relied on Scanner::GetFragments (which is actually the only reason I modified this function in this PR at all).

So when restoring the older version it just calls Scan and collects the iterator into a vector. My guess is the older version was a hangover from the older version of dataset write which didn't use the scanner? @bkietz do you mind thinking this through?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this wasn't purely to put the parallel partitioning back in. The ScanBatches implementation introduced the ever-annoying nested parallelism back in. I had avoided it in #9892 by converting synchronous scan task execution into a future (using TaskGroup::FinishAsync) and then combining that future with any asynchronous scan task executions. I bring this up because @lidavidm is probably going to run into the same problem with ToTable when rebasing ARROW-12208.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In ARROW-11797 I renamed Scan to ScanInternal and made it private for this purpose so I'll fix that once merged. (I think this can/should merge first since it's gone through review.)

@westonpace westonpace force-pushed the feature/arrow-12288 branch from f88a917 to 9fbeb3c Compare April 12, 2021 20:11
@ursabot
Copy link

ursabot commented Apr 12, 2021

@ursabot
Copy link

ursabot commented Apr 12, 2021

@ursabot
Copy link

ursabot commented Apr 12, 2021

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two test failures look unrelated (MinIO failing to initialize on Windows and Maven encountering a network hiccup).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants