ARROW-12289: [C++] Create basic AsyncScanner implementation #10008
Conversation
ARROW-11797 uses the param to toggle UseThreads, so this will have to become a std::pair<bool, bool> (or really, just a custom struct) in the end.
Ah, yes. I was planning on adding whether to scan with Scan (to ensure we still test the legacy), ScanBatches, or ScanBatchesUnordered as a parameter as well.
I'll tackle that when I rebase ARROW-11797
I've absorbed UseThreads into the matrix.
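As an illustration of the "custom struct" idea, a value-parameterized test matrix could look roughly like the following (a GoogleTest sketch; the struct and field names are hypothetical, not the ones actually used):

```cpp
// Hypothetical sketch of a value-parameterized scanner test matrix.
#include <gtest/gtest.h>

struct TestScannerParams {
  bool use_async;    // SyncScanner vs. AsyncScanner
  bool use_threads;  // toggle UseThreads
};

class TestScanner : public ::testing::TestWithParam<TestScannerParams> {};

TEST_P(TestScanner, ScansAllBatches) {
  const TestScannerParams& params = GetParam();
  // ... construct a scanner according to params and verify the scanned
  // batches match the expected data ...
  (void)params;
}

INSTANTIATE_TEST_SUITE_P(
    ScannerMatrix, TestScanner,
    ::testing::Values(TestScannerParams{false, false},
                      TestScannerParams{false, true},
                      TestScannerParams{true, false},
                      TestScannerParams{true, true}));
```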
Force-pushed from dd525b3 to a040abf
cpp/src/arrow/dataset/file_base.cc (outdated)
Don't we need to check again if NextSync/NextAsync return the end marker? Otherwise, operator() will return a Future that resolves to the end marker and the consumer will stop early.
There is a silent precondition here that every fragment scan should return scan tasks that return at least 1 record batch (unless the entire fragment is empty in which case either 0 scan tasks or 1 scan task with 0 batches should both be ok).
I know this precondition holds for IPC and CSV (by virtue of there being only one scan task) but I wasn't sure about Parquet (i.e. can a pushed-down filter cause a batch-less scan task to be emitted in the middle of a set of scan tasks?)
I may be reading this wrong, but when we finish one scan task and move on to the next, since we just fall through here, we'll return a completed Future which contains a nullptr, which gets returned as the result of operator(). So the generator's consumer will think that the generator has ended, even though we still have more scan tasks.
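To make the concern concrete, here is a minimal, self-contained sketch (plain synchronous C++ with placeholder types, not the actual code in this PR) of a flattening generator that advances to the next scan task instead of falling through to the end marker:

```cpp
#include <cstddef>
#include <iostream>
#include <optional>
#include <vector>

using Batch = int;                    // stand-in for a record batch
using ScanTask = std::vector<Batch>;  // stand-in for one scan task's batches

// A flattening "generator": yields batches across scan tasks and must only
// return the end marker (std::nullopt) once every task is exhausted.
class FlatteningGenerator {
 public:
  explicit FlatteningGenerator(std::vector<ScanTask> tasks)
      : tasks_(std::move(tasks)) {}

  std::optional<Batch> operator()() {
    while (task_index_ < tasks_.size()) {
      if (batch_index_ < tasks_[task_index_].size()) {
        return tasks_[task_index_][batch_index_++];
      }
      // Current scan task exhausted: advance to the next one instead of
      // falling through and emitting the end marker prematurely.
      ++task_index_;
      batch_index_ = 0;
    }
    return std::nullopt;  // genuine end of the fragment
  }

 private:
  std::vector<ScanTask> tasks_;
  std::size_t task_index_ = 0;
  std::size_t batch_index_ = 0;
};

int main() {
  FlatteningGenerator gen({{1, 2}, {}, {3}});             // empty middle task
  while (auto batch = gen()) std::cout << *batch << " ";  // prints: 1 2 3
}
```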
Oh. I think you are right. I should probably add a scanner unit test that generates more than one scan task. I'll work on that.
I see that there's now a parameter to generate multiple scan tasks per fragment in InMemoryDataset - however, is that necessary? For one, it doesn't affect this code path, since this only affects file fragments. For another, it doesn't affect the scanner, which doesn't use scan tasks (directly); it'll use ScanBatchesAsync on the Fragment, which flattens all the scan tasks itself anyways.
So I think the issue pointed out here doesn't show up in tests purely because only Parquet fragments expose multiple scan tasks per fragment right now.
Fair point. I think it's still necessary but could be renamed as it is a bit vague. In the async case it is "batches per fragment" and in the sync case it is "scan tasks per fragment". It was enough to break the async scanner (it currently fails these tests). I also agree it doesn't expose this issue, so I'll add some more tests.
Well, technically speaking the # of batches per fragment depends both on this and max batch size. So I suppose we could have gotten sufficient testing by setting the max batch size small enough. At the moment these tests help exercise the SyncScanner if nothing else.
Ah good point. It is mostly just a nit as it's really a testing parameter that's unfortunately getting exposed in the public API.
This isn't a very strong precedent, but TableBatchReader handles batch_size by letting you set it after construction and that feels like an analogue of this.
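For reference, that precedent looks roughly like this (a sketch assuming the arrow::TableBatchReader API, where the chunk size is adjusted after construction via set_chunksize):

```cpp
// Sketch of the TableBatchReader precedent: the batch size is configured
// after construction rather than passed to the constructor.
#include <arrow/record_batch.h>
#include <arrow/status.h>
#include <arrow/table.h>

arrow::Status ConsumeInBatches(const std::shared_ptr<arrow::Table>& table) {
  arrow::TableBatchReader reader(*table);
  reader.set_chunksize(1024);  // set the maximum batch size post-construction
  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(reader.ReadNext(&batch));
    if (batch == nullptr) break;  // end of the table
    // ... consume batch ...
  }
  return arrow::Status::OK();
}
```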
Once SyncScanner goes away we could probably change InMemoryFragment::record_batches_ to InMemoryFragment::record_batch_. This reflects the spirit of "getting rid of scan tasks" better anyways.
I capitulated and removed the argument. Your comment about it just being a testing parameter is accurate. I created a test in arrow-dataset-file-test that does not rely on InMemoryDataset to test this logic here. I might in the future add some tests to ScannerTest that set a limit on scan options batch size to get coverage of the multiple batches per fragment case.
cpp/src/arrow/dataset/scanner.cc (outdated)
Could this be MakeMergedGenerator?
It will need to be. The problem is that MakeMergedGenerator is immediately consuming EnumeratingGenerator which is not async-reentrant. MakeMergedGenerator (erroneously) pulls from the outer (the gen_gen) generator in an async-reentrant fashion. I'll make a follow-up JIRA just to keep this one simple.
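For context, a rough sketch of the shape being discussed follows; the exact signature and reentrancy requirements of MakeMergedGenerator should be checked against arrow/util/async_generator.h for the Arrow version in question, and the helper name below is made up:

```cpp
// Rough sketch only: flattening a "generator of batch generators" (one inner
// generator per fragment) into a single stream of batches. MakeMergedGenerator
// interleaves results from several inner generators, which is why it pulls
// from the outer generator in an async-reentrant fashion.
#include <arrow/record_batch.h>
#include <arrow/util/async_generator.h>

using RecordBatchGenerator =
    arrow::AsyncGenerator<std::shared_ptr<arrow::RecordBatch>>;

RecordBatchGenerator FlattenFragmentBatches(
    arrow::AsyncGenerator<RecordBatchGenerator> batch_gen_gen) {
  // max_subscriptions controls how many inner generators are consumed at once.
  return arrow::MakeMergedGenerator(std::move(batch_gen_gen),
                                    /*max_subscriptions=*/4);
}
```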
Force-pushed from 7c93e1e to d372f45
After this lands I can rebase and implement ScanBatchesAsync for IPC/Parquet and give that another test.
…. This ends up creating multiple scan tasks per fragment for the sync case.
Force-pushed from d4669c1 to 42ba11f
I rebased @lidavidm's latest changes. At this point I don't think there are any more outstanding dataset PRs to rebase, so this one is probably ready to merge if it passes review.
lidavidm left a comment
I think this is ready, just a couple more minor things.
cpp/src/arrow/dataset/scanner.h (outdated)
constexpr int32_t kDefaultBatchReadahead = 32;
constexpr int32_t kDefaultFragmentReadahead = 8;

using FragmentGenerator = std::function<Future<std::shared_ptr<Fragment>>()>;
I think moving the subclass definitions means you can also move this alias and get rid of the async_generator.h include.
I was able to move FragmentGenerator but async_generator.h was still needed for Enumerated which is a pity since Enumerated is small and self-contained. Should I place it in its own file?
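For reference, Enumerated is roughly a wrapper of this shape (paraphrased, not the exact definition from async_generator.h):

```cpp
// Approximate shape of the Enumerated wrapper referred to above: it tags each
// generated value with its position and whether it is the last one.
template <typename T>
struct Enumerated {
  T value;
  int index;
  bool last;
};
```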
Ah, that's alright then.
…an executor I don't think it's really something we want the user specifying after all
I'll merge on green unless somebody else wants to take a look.
Integration build: usual issues
Adds a naive implementation of AsyncScanner, which is different from SyncScanner in a few ways:
- It does not use ScanTask and instead relies on Fragment::ScanBatchesAsync, which returns a RecordBatchGenerator.
- ToTable.

It is "naive" because this PR does not add a complete implementation for FileFragment::ScanBatchesAsync. That method relies on FileFormat::ScanBatchesAsync (in the same way that FileFragment::Scan relies on FileFormat::ScanFile). FileFormat::ScanBatchesAsync should be overridden in each of the formats (to rely on an async reader), but it is not (yet).

As a result, the performance of AsyncScanner is poor since it does not do any "per-file" parallelism, nor does it do any "per-batch" parallelism. Follow-up tasks are ARROW-12355 (CSV), ARROW-11772 (IPC), and ARROW-11843 (Parquet).

In addition, this PR is built on top of ARROW-12287, so that will need to be merged first. It will also need to rebase changes from ARROW-12161 and ARROW-11797.
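To illustrate what relying on Fragment::ScanBatchesAsync (which returns a RecordBatchGenerator) means in practice, here is a simplified, hedged sketch of draining such a generator; it blocks on each future for brevity, which a real asynchronous consumer would avoid, and CollectBatches is a made-up helper:

```cpp
// Simplified sketch: a RecordBatchGenerator is a callable that yields a
// Future for each batch; a null batch marks the end of the stream.
#include <memory>
#include <vector>

#include <arrow/record_batch.h>
#include <arrow/result.h>
#include <arrow/status.h>
#include <arrow/util/async_generator.h>

using RecordBatchGenerator =
    arrow::AsyncGenerator<std::shared_ptr<arrow::RecordBatch>>;

arrow::Status CollectBatches(
    RecordBatchGenerator gen,
    std::vector<std::shared_ptr<arrow::RecordBatch>>* out) {
  while (true) {
    // Request the next batch and (for brevity in this sketch) block on it.
    arrow::Result<std::shared_ptr<arrow::RecordBatch>> maybe_batch =
        gen().result();
    ARROW_RETURN_NOT_OK(maybe_batch.status());
    std::shared_ptr<arrow::RecordBatch> batch = *maybe_batch;
    if (batch == nullptr) break;  // end-of-stream marker
    out->push_back(std::move(batch));
  }
  return arrow::Status::OK();
}
```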