ARROW-12288: [C++] Create Scanner interface #9947

westonpace · 2021-04-08T10:06:53Z

To prepare for the AsyncScanner this PR creates a Scanner interface and, along the way, simplifies the current Scanner API so that the new scanner won't need to match.

What is removed:

Scanner::GetFragments was only used in FileSystemDataset::Write. The correct source of truth for fragments is the Dataset. Note: The python implementation exposed this method but it was not documented or used in any unit test. I think it can be safely removed and we need not worry about deprecation.
Scanner::schema is redundant and ambiguous. There are two schemas at the scan level. The dataset schema (the unified master schema that we expect all fragment schemas to be a subset of) and the projection schema (a combination of the dataset schema and the projection expression). Both of these are available on the scan options object and there is an accessor for these options so the caller might as well get them from there. This schema function was exposed via R and used internally there but I think any uses can be easily changed to using the options.
FileFormat::splittable and Fragment::splittable. These were intended to advertise that batch readahead was available on the given fragment/format. However, there is no need to advertise this. They are not used by the SyncScanner and the AsyncScanner will just assume that the format/fragment's will utilize readahead if they can (respecting the readahead options in ScanOptions)
Direct instantiation of Scanner. All Scanner creation should go through ScannerBuilder now. This allows the ScannerBuilder to determine what implementation to use. This was mostly the way things were implemented already. Only a few tests instantiated a Scanner directly.

What is deprecated

Scanner::Scan is going to be deprecated (ARROW-11797). It will not be implemented by AsyncScanner. I do not actually deprecate it in this PR as I reserve that for ARROW-11797. Unfortunately, this method was exposed via python & R and likely was used so deprecation is recommended over outright removal.

What is new

Scanner::ScanBatches and Scanner::ScanBatchesUnordered have been added. These functions will be the new preferred "scan" method going forward. This allows the parallelization (batch readahead, file readahead, etc.) to be handled by C++ and simplifies the user's life.
ScanOptions::batch_readahead and ScanOptions::fragment_readahead options allow more fine grained control over how to perform readahead. One technicality is that these options will not be respected well by the SyncScanner (although I think the current ARROW-11797 utilizes batch readahead) so they are more placeholders for when we implement AsyncScanner.
ScanOptions::cpu_executor and ScanOptions::io_context are added and should be fairly self explanatory.
ScanOptions::use_async will toggle which scanner to use.

github-actions · 2021-04-08T10:50:42Z

https://issues.apache.org/jira/browse/ARROW-12288

westonpace · 2021-04-09T08:43:42Z

@ursabot please benchmark

ursabot · 2021-04-09T08:43:46Z

Benchmark runs are scheduled for baseline = 95ca4f5 and contender = bc1b8f0a7b6d36273ce815c288f59d96000a14cb. Results will be available as each benchmark for each run completes:
[Failed] ursa-i9-9960x: https://conbench.ursa.dev/compare/runs/e547252b-adca-4ac3-a547-6a480839d7a0...d3ba5dd8-1399-4556-93ef-19bbb0c7f76f/
[Failed] ursa-thinkcentre-m75q: https://conbench.ursa.dev/compare/runs/9494d4da-1a74-42ce-87f0-b05cc5bd0771...c1c2c894-19df-46fa-ba5b-6d667523e5a5/
[Failed] ec2-t3-large-us-east-2: https://conbench.ursa.dev/compare/runs/84ee2b27-9ce2-4872-8b69-6203ae82a032...0139f16e-8064-4e3e-87a1-d1602a138ba9/
[Failed] ec2-t3-xlarge-us-east-2: https://conbench.ursa.dev/compare/runs/251a600f-87d7-41e0-8954-9adbdbed9875...32e1a644-f27c-4c3c-9955-0ea4a497a02b/

ursabot · 2021-04-09T10:30:15Z

Benchmark runs are scheduled for baseline = 95ca4f5 and contender = 09e132a0c6378d80a1fd33cb82266449420e2397. Results will be available as each benchmark for each run completes:
[Failed] ursa-i9-9960x: https://conbench.ursa.dev/compare/runs/e547252b-adca-4ac3-a547-6a480839d7a0...c1afca64-00d3-4fb5-bbbc-a865a3e71057/
[Failed] ursa-thinkcentre-m75q: https://conbench.ursa.dev/compare/runs/9494d4da-1a74-42ce-87f0-b05cc5bd0771...ef9eea2d-ce34-43fc-964a-bcf993f5b8c4/
[Failed] ec2-t3-large-us-east-2: https://conbench.ursa.dev/compare/runs/84ee2b27-9ce2-4872-8b69-6203ae82a032...5cf39b2f-8724-43b6-8245-ab1b0bed3246/
[Failed] ec2-t3-xlarge-us-east-2: https://conbench.ursa.dev/compare/runs/251a600f-87d7-41e0-8954-9adbdbed9875...bcf09612-e9c9-4e48-94ce-bd184192fdec/

ursabot · 2021-04-09T10:45:09Z

Benchmark runs are scheduled for baseline = 6ddaaa8 and contender = 90f86d284b0640257f18910a26a95bbab9f18cf1. Results will be available as each benchmark for each run completes:
[Failed] ursa-i9-9960x: https://conbench.ursa.dev/compare/runs/d9b7107b-62f2-4dfe-b5c3-a27d18fd1f75...3e3e1b94-ae11-4e46-b63d-33aaa7eb224c/
[Failed] ursa-thinkcentre-m75q: https://conbench.ursa.dev/compare/runs/cb88b1a5-ef5d-468c-9c7f-60c0c2d79a8c...c77a16eb-6fea-4782-99df-2cd08777ed0e/
[Failed] ec2-t3-large-us-east-2: https://conbench.ursa.dev/compare/runs/bb28b466-d28e-4a8a-bb3c-dfa2c1725145...bfdc976c-9348-4ea3-bc97-e42ea1bfa2f9/
[Failed] ec2-t3-xlarge-us-east-2: https://conbench.ursa.dev/compare/runs/b4b45f4c-4bc4-4ea7-b7ba-5eb52540da4e...32960d64-1910-49c0-8558-97379cabc8a2/

fsaintjacques

Very nice cleanup.

fsaintjacques · 2021-04-09T12:28:43Z

cpp/src/arrow/dataset/scanner.h

I'll backport this into #9589 since they need to be synced up at some point.

Sorry about that. If you want to land #9589 first I can rebase this on top of that. I'm pretty sure the two are fairly compatible although there is some work to adjust to the new structs.

I've already backported it so whichever one gets reviewed first should get merged first :) (though I'm about to go back and remove operator== as per your point below.)

cpp/src/arrow/dataset/scanner.h

cpp/src/arrow/dataset/scanner.cc

lidavidm · 2021-04-09T14:05:51Z

cpp/src/arrow/dataset/scanner.cc

Overall this state pattern is common enough for iterators/async generators that I feel like we should encapsulate it instead of defining it ad-hoc every time. (Not for this PR, but as a future task.)

Already done for generators (#9945). It's a bit trickier here with the iterator here since it's enumerating both fragments and record batches at the same time. In the async version we enumerate them separately so it made more sense to have a general utility. Although this comment made me realize this should be named EnumeratingIterator and not TaggingIterator (I'm using "tagging" to refer to attaching the fragment to the record batch which is not what I'm doing here).

lidavidm · 2021-04-09T14:22:19Z

Should we actually merge this before ARROW-11797? Since that PR adds some more methods to the interface. Plus then we can kick things off in parallel.

ursabot · 2021-04-09T18:15:13Z

Benchmark runs are scheduled for baseline = 6ddaaa8 and contender = 451e238e4f05cbbae1f11cbb5a8dd2e49ff68267. Results will be available as each benchmark for each run completes:
[Failed] ursa-i9-9960x: https://conbench.ursa.dev/compare/runs/d9b7107b-62f2-4dfe-b5c3-a27d18fd1f75...dd3d7ce9-7a31-4e68-a7f6-ae032c65724e/
[Failed] ursa-thinkcentre-m75q: https://conbench.ursa.dev/compare/runs/cb88b1a5-ef5d-468c-9c7f-60c0c2d79a8c...97db6b5f-bc27-40c3-a376-aa3bc7642b31/
[Failed] ec2-t3-large-us-east-2: https://conbench.ursa.dev/compare/runs/bb28b466-d28e-4a8a-bb3c-dfa2c1725145...ac1839e9-ccf2-4c38-b48e-19b59c2f17ff/
[Failed] ec2-t3-xlarge-us-east-2: https://conbench.ursa.dev/compare/runs/b4b45f4c-4bc4-4ea7-b7ba-5eb52540da4e...c89daefd-8a87-4c08-824f-c62d9f6862d2/

ursabot · 2021-04-09T18:45:08Z

Benchmark runs are scheduled for baseline = 6ddaaa8 and contender = 4500bc1afbcd06312c65375b5b16ae7a1f4bc879. Results will be available as each benchmark for each run completes:
[Failed] ursa-i9-9960x: https://conbench.ursa.dev/compare/runs/d9b7107b-62f2-4dfe-b5c3-a27d18fd1f75...e83384a6-81b1-4445-996b-0747f3b30e6d/
[Failed] ursa-thinkcentre-m75q: https://conbench.ursa.dev/compare/runs/cb88b1a5-ef5d-468c-9c7f-60c0c2d79a8c...30aa6a27-38f9-481a-9faf-4a6a4c8aa4cd/
[Failed] ec2-t3-large-us-east-2: https://conbench.ursa.dev/compare/runs/bb28b466-d28e-4a8a-bb3c-dfa2c1725145...b940b64d-b5b5-41f5-b42f-6bf1fd0eb49c/
[Failed] ec2-t3-xlarge-us-east-2: https://conbench.ursa.dev/compare/runs/b4b45f4c-4bc4-4ea7-b7ba-5eb52540da4e...7a43a5b3-a087-4386-a9ec-b22e221164c7/

ursabot · 2021-04-09T19:45:16Z

Benchmark runs are scheduled for baseline = c0ce2b1 and contender = f88a9172bb7ace47efc83a2eb53c895d7fa5d958. Results will be available as each benchmark for each run completes:
[Finished] ursa-i9-9960x: https://conbench.ursa.dev/compare/runs/ae62d394-72d3-4748-97e1-803b871167ec...40e7e6bc-9177-422c-baa4-9e45a96509ea/
[Finished] ursa-thinkcentre-m75q: https://conbench.ursa.dev/compare/runs/dd805af5-4077-47ef-acc1-4b49532ae41e...dc7601de-9ac2-44a3-8d68-766fa45f9d98/
[Finished] ec2-t3-large-us-east-2: https://conbench.ursa.dev/compare/runs/0b51231e-fc1d-4888-834f-c3bccad1e026...83d518fe-63c5-437a-bf47-4141b9e6ae8e/
[Finished] ec2-t3-xlarge-us-east-2: https://conbench.ursa.dev/compare/runs/c4009461-4c7d-4f82-b535-9b5dd71e101c...c9451681-a295-4306-bd59-83eb135c8567/

lidavidm · 2021-04-09T22:00:11Z

This needs rebasing & I'll take a final look but otherwise I think we can get this at least into 4.0? And then we can hopefully also get the deprecation of Scan() in, which should set us up for 5.0.0.

jorisvandenbossche · 2021-04-12T17:20:34Z

[about Scanner::GetFragments being removed] Note: The python implementation exposed this method but it was not documented or used in any unit test. I think it can be safely removed and we need not worry about deprecation.

Agreed, I don't think we need to worry about deprecating this one.

bkietz

Some minor comments, thanks for doing this:

cpp/src/arrow/dataset/scanner.h

cpp/src/arrow/dataset/test_util.h

bkietz · 2021-04-12T15:20:00Z

cpp/src/arrow/dataset/test_util.h

Nit:

Suggested change

void AssertBatchEquals(RecordBatchReader* expected, RecordBatch* batch) {

void AssertBatchEquals(RecordBatchReader* expected, const RecordBatch& batch) {

Done. Although why isn't RecordBatchReader passed by reference? Is it simply the fact that RecordBatch is const?

We generally favor pointers over non-const reference parameters, IIRC to make it clear what is used mutably, but I can't find the reference for this (I had thought it was in the Google C++ Style Guide).

cpp/src/arrow/dataset/test_util.h

bkietz · 2021-04-12T15:49:12Z

cpp/src/arrow/dataset/file_base.cc

This seems to remove parallel execution of batch partitioning. If that's intended then please call it out and add a comment with the follow up JIRA noted

I had taken out the equivalent from #9589/ARROW-11797 but I can restore it.

When I rebased I restored the old version which relies on Scan (@lidavidm maybe the easiest thing to do is just add the pragma avoiding warning for deprecation. I'll add some kind of switch for the async version to use ScanBatches).

However, the old version created one scan task per fragment which relied on Scanner::GetFragments (which is actually the only reason I modified this function in this PR at all).

So when restoring the older version it just calls Scan and collects the iterator into a vector. My guess is the older version was a hangover from the older version of dataset write which didn't use the scanner? @bkietz do you mind thinking this through?

Also, this wasn't purely to put the parallel partitioning back in. The ScanBatches implementation introduced the ever-annoying nested parallelism back in. I had avoided it in #9892 by converting synchronous scan task execution into a future (using TaskGroup::FinishAsync) and then combining that future with any asynchronous scan task executions. I bring this up because @lidavidm is probably going to run into the same problem with ToTable when rebasing ARROW-12208.

In ARROW-11797 I renamed Scan to ScanInternal and made it private for this purpose so I'll fix that once merged. (I think this can/should merge first since it's gone through review.)

cpp/src/arrow/dataset/scanner.h

…f warnings and ARROW-11797 is already dealing with that

…s and not directly from scanner

…s more accurate

…unit test

ursabot · 2021-04-12T20:15:15Z

Benchmark runs are scheduled for baseline = 745cdb6 and contender = 9fbeb3c. Results will be available as each benchmark for each run completes:
[Finished] ursa-i9-9960x: https://conbench.ursa.dev/compare/runs/be153764-4c1d-4b73-aa0c-67923009ee2d...360f8d90-b736-4081-bd56-84c3e083ce5e/
[Finished] ursa-thinkcentre-m75q: https://conbench.ursa.dev/compare/runs/d385cf33-21d1-44eb-9f8b-0a83309b8ec6...aec1cc95-0909-48a9-8d7c-885b612874bd/
[Finished] ec2-t3-large-us-east-2: https://conbench.ursa.dev/compare/runs/3deed7a8-9484-44a0-a2da-25338cc3ee8b...d83385bb-ea0b-4d12-a68f-c601f29745a2/
[Finished] ec2-t3-xlarge-us-east-2: https://conbench.ursa.dev/compare/runs/8af1e9ef-4582-4447-ba90-008ccd53af21...ddbc0863-d056-4554-8bca-b1cf4d8b0d5d/

ursabot · 2021-04-12T20:30:15Z

Benchmark runs are scheduled for baseline = 745cdb6 and contender = 7c42b63. Results will be available as each benchmark for each run completes:
[Finished] ursa-i9-9960x: https://conbench.ursa.dev/compare/runs/be153764-4c1d-4b73-aa0c-67923009ee2d...cefe43fe-6e25-4859-91c3-166f5e599107/
[Finished] ursa-thinkcentre-m75q: https://conbench.ursa.dev/compare/runs/d385cf33-21d1-44eb-9f8b-0a83309b8ec6...d6dba6ec-f4f9-46dd-87f4-05460f66f73c/
[Finished] ec2-t3-large-us-east-2: https://conbench.ursa.dev/compare/runs/3deed7a8-9484-44a0-a2da-25338cc3ee8b...40d9765a-83d7-484e-ae63-7fd792fe5224/
[Finished] ec2-t3-xlarge-us-east-2: https://conbench.ursa.dev/compare/runs/8af1e9ef-4582-4447-ba90-008ccd53af21...1e46679c-5f04-477b-ac9b-710be1d304f3/

ursabot · 2021-04-12T20:45:11Z

Benchmark runs are scheduled for baseline = 745cdb6 and contender = 3690029. Results will be available as each benchmark for each run completes:
[Finished] ursa-i9-9960x: https://conbench.ursa.dev/compare/runs/be153764-4c1d-4b73-aa0c-67923009ee2d...0536ade9-a6ba-48ab-bf34-8c820886bd6a/
[Finished] ursa-thinkcentre-m75q: https://conbench.ursa.dev/compare/runs/d385cf33-21d1-44eb-9f8b-0a83309b8ec6...4397974b-0a0d-4db7-ab20-d18037a5ed46/
[Finished] ec2-t3-large-us-east-2: https://conbench.ursa.dev/compare/runs/3deed7a8-9484-44a0-a2da-25338cc3ee8b...36c94aa3-de4d-42fc-a819-bdcbcf9e3e92/
[Finished] ec2-t3-xlarge-us-east-2: https://conbench.ursa.dev/compare/runs/8af1e9ef-4582-4447-ba90-008ccd53af21...30b62c3b-94e5-4b28-ae3d-cef38ec3e9ea/

lidavidm

The two test failures look unrelated (MinIO failing to initialize on Windows and Maven encountering a network hiccup).

github-actions bot added Component: C++ Component: Python Component: R labels Apr 8, 2021

westonpace marked this pull request as ready for review April 9, 2021 08:33

westonpace force-pushed the feature/arrow-12288 branch from 09e132a to 90f86d2 Compare April 9, 2021 10:38

fsaintjacques reviewed Apr 9, 2021

View reviewed changes

lidavidm reviewed Apr 9, 2021

View reviewed changes

westonpace force-pushed the feature/arrow-12288 branch from 4500bc1 to f88a917 Compare April 9, 2021 19:30

bkietz requested changes Apr 12, 2021

View reviewed changes

westonpace added 10 commits April 12, 2021 10:02

ARROW-12288: First pass at defining Scanner interface

704f753

ARROW-12288: A little refinement and got python to compile

9182a78

ARROW-12288: Further refinement

7ef8c9e

ARROW-12288: Removing the actual deprecate since it generates a lot o…

e093426

…f warnings and ARROW-11797 is already dealing with that

ARROW-12288: Patching in support for R

d2aea7e

ARROW-12288: Lint

15b5e1a

ARROW-12288: Typos

d66a04a

ARROW-12288: More lint

aa9dc43

ARROW-12288: Tweaked Java to get scanner projected schema from option…

55b8f92

…s and not directly from scanner

ARROW-12288: Renamed TaggingIterator to EnumeratingIterator as that i…

f2aca73

…s more accurate

ARROW-12288: Fixed up Scanner::AddPositioningToInOrderScan and added …

9fbeb3c

…unit test

westonpace force-pushed the feature/arrow-12288 branch from f88a917 to 9fbeb3c Compare April 12, 2021 20:11

ARROW-12288: Cleanup after rebase

7c42b63

ARROW-12288: Addressing PR comments

3690029

lidavidm approved these changes Apr 13, 2021

View reviewed changes

lidavidm closed this in a102ba2 Apr 13, 2021

westonpace mentioned this pull request Apr 13, 2021

ARROW-11797: [C++][Dataset] Provide batch stream Scanner methods #9589

Closed

westonpace deleted the feature/arrow-12288 branch January 6, 2022 08:17

asfimport mentioned this pull request Apr 13, 2021

[C++] Create Scanner interface #28094

Closed

	void AssertBatchEquals(RecordBatchReader* expected, RecordBatch* batch) {
	void AssertBatchEquals(RecordBatchReader* expected, const RecordBatch& batch) {

ARROW-12288: [C++] Create Scanner interface #9947

ARROW-12288: [C++] Create Scanner interface #9947

Uh oh!

Conversation

westonpace commented Apr 8, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What is removed:

What is deprecated

What is new

Uh oh!

github-actions bot commented Apr 8, 2021

Uh oh!

westonpace commented Apr 9, 2021

Uh oh!

ursabot commented Apr 9, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ursabot commented Apr 9, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ursabot commented Apr 9, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fsaintjacques left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lidavidm commented Apr 9, 2021

Uh oh!

ursabot commented Apr 9, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ursabot commented Apr 9, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ursabot commented Apr 9, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lidavidm commented Apr 9, 2021

Uh oh!

jorisvandenbossche commented Apr 12, 2021

Uh oh!

bkietz left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

westonpace commented Apr 8, 2021 •

edited

Loading

ursabot commented Apr 9, 2021 •

edited

Loading

ursabot commented Apr 9, 2021 •

edited

Loading

ursabot commented Apr 9, 2021 •

edited

Loading

ursabot commented Apr 9, 2021 •

edited

Loading

ursabot commented Apr 9, 2021 •

edited

Loading

ursabot commented Apr 9, 2021 •

edited

Loading

ursabot commented Apr 12, 2021 •

edited

Loading

ursabot commented Apr 12, 2021 •

edited

Loading

ursabot commented Apr 12, 2021 •

edited

Loading