ARROW-7001: [C++] Develop threading APIs to accommodate nested parallelism #9607

westonpace · 2021-03-01T16:30:18Z

This PR ports the dataset/scanner logic to async. It does not actually make any readers (e.g. the Parquet reader or the IPC reader) async. Once the readers are switched to async, they can make use of nested parallelism.

This PR should improve CSV performance in high-latency situations (and performance of the other readers once they are converted).

The scanning order and the scan tasks delivered by Scanner::Scan will change as part of this PR. To be clear, the order in which the files are scanned will change, but the order in which scan tasks are delivered will not. The scan tasks themselves will also change; see the next note.
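To illustrate how files can be scanned in one order while results are still delivered in another, here is a minimal sketch of a reorder buffer. It is only an illustration under stated assumptions: `Sequencer` and `Push` are hypothetical names, not Arrow APIs, and nothing in this description says the PR implements ordering this way.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical reorder buffer (not an Arrow class). Results may arrive in any
// order, each tagged with its position; they are released strictly in order.
class Sequencer {
 public:
  // Accepts one result and returns every result that is now deliverable.
  std::vector<std::string> Push(std::size_t index, std::string value) {
    assert(index >= next_ && "this position was already delivered");
    buffered_.emplace(index, std::move(value));
    std::vector<std::string> ready;
    auto it = buffered_.find(next_);
    while (it != buffered_.end()) {
      ready.push_back(std::move(it->second));
      buffered_.erase(it);
      ++next_;
      it = buffered_.find(next_);
    }
    return ready;
  }

 private:
  std::size_t next_ = 0;                         // next position to deliver
  std::map<std::size_t, std::string> buffered_;  // early arrivals, held back
};
```

If the third file finishes first, `Push(2, ...)` returns nothing; once files 0 and 1 arrive, the held-back result is released after them.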

This PR makes Scanner::Scan a legacy operation. To keep backwards compatibility we still return scan tasks, but they are in-memory scan tasks wrapping record batches. This means that Parquet will generate more scan tasks than it did before (one per record batch instead of one per row group). CSV and IPC will generate many more scan tasks than before. However, the Execute method should return immediately for these new scan tasks. The new, preferred approach is to use Scanner::ScanBatches.
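A rough sketch of the "in-memory scan task around a record batch" idea follows. The types below are toy stand-ins that only borrow the names used above; they are not the Arrow classes, and `WrapBatches` is a hypothetical helper introduced for illustration.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Toy stand-in for a record batch; NOT arrow::RecordBatch.
struct RecordBatch {
  std::vector<int> values;
};

// Toy stand-in for an in-memory scan task; NOT the Arrow class.
class InMemoryScanTask {
 public:
  explicit InMemoryScanTask(std::shared_ptr<RecordBatch> batch)
      : batch_(std::move(batch)) {
    assert(batch_ != nullptr);
  }

  // No I/O happens here: the batch was already read before the task existed.
  std::vector<std::shared_ptr<RecordBatch>> Execute() const { return {batch_}; }

 private:
  std::shared_ptr<RecordBatch> batch_;
};

// Hypothetical legacy adapter: one scan task per already-decoded batch.
std::vector<InMemoryScanTask> WrapBatches(
    const std::vector<std::shared_ptr<RecordBatch>>& batches) {
  std::vector<InMemoryScanTask> tasks;
  tasks.reserve(batches.size());
  for (const auto& batch : batches) {
    tasks.emplace_back(batch);
  }
  return tasks;
}
```

With this shape the number of tasks tracks the number of batches, which is why formats that emit many small batches end up with many more tasks than before.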

This PR also adds Scanner::ScanAsync, Scanner::ScanUnorderedAsync, and Scanner::ToTableAsync. However, there is no need to externally expose those yet.

github-actions · 2021-03-01T17:31:25Z

https://issues.apache.org/jira/browse/ARROW-7001

cpp/src/arrow/dataset/file_parquet.cc

lidavidm

Frankly I probably need to give this a second pass but overall this looks reasonable. I left mostly minor comments.

cpp/src/arrow/dataset/dataset.h

lidavidm · 2021-03-31T14:52:28Z

cpp/src/arrow/dataset/dataset.h


 protected:
-  Result<FragmentIterator> GetFragmentsImpl(Expression predicate) override;
+  Future<FragmentVector> GetFragmentsImpl(Expression predicate) override;


This may be suboptimal for the case of writing datasets. ARROW-10882/#9802 backs an InMemoryDataset with a RecordBatchReader to support passing in Python generators; if we have to materialize all fragments up front that would defeat the purpose.

That said, we could implement a lazy InMemoryFragment rather than the current approach, which materializes an InMemoryFragment for each batch in the source. So this doesn't block this PR per se; we just need to rework the other PR a bit once this lands.

westonpace · 2021-03-31

I was going to let this be addressed by ARROW-8163.

westonpace · 2021-03-31

However, if we're getting ARROW-10882 done first, then rather than go with an InMemoryFragment approach, I'd recommend switching to AsyncGenerator at that point, since we're going to do so sooner or later anyway.

Also, it hopefully won't be too much work. The scanner already consumes the FragmentVector as a generator so the only change needed would be to push the MakeVectorGenerator down into the file formats.
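The difference between materializing everything up front and pulling lazily can be sketched as follows. This is a simplified, synchronous toy: the `Generator` alias here returns `std::optional<T>` directly, whereas Arrow's AsyncGenerator returns futures. `MakeCountingGenerator` is a hypothetical helper, and the `MakeVectorGenerator` below is a toy written in the spirit of the function mentioned above, not its actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <optional>
#include <utility>
#include <vector>

// Toy pull-based generator: each call yields the next item, or nullopt at end.
template <typename T>
using Generator = std::function<std::optional<T>()>;

// Adapts an already-materialized vector into a generator (toy version).
template <typename T>
Generator<T> MakeVectorGenerator(std::vector<T> items) {
  auto state = std::make_shared<std::vector<T>>(std::move(items));
  auto pos = std::make_shared<std::size_t>(0);
  return [state, pos]() -> std::optional<T> {
    if (*pos >= state->size()) return std::nullopt;
    return (*state)[(*pos)++];
  };
}

// Hypothetical lazy source: nothing is produced until the consumer asks.
// `produced` counts how many items have actually been generated so far.
Generator<int> MakeCountingGenerator(int limit, int* produced) {
  assert(produced != nullptr);
  auto next = std::make_shared<int>(0);
  return [next, limit, produced]() -> std::optional<int> {
    if (*next >= limit) return std::nullopt;
    ++*produced;
    return (*next)++;
  };
}
```

A consumer written against `Generator<T>` works with either source, so a vector-backed format can keep its vector while a reader-backed dataset stays lazy.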

cpp/src/arrow/dataset/dataset.cc

cpp/src/arrow/dataset/dataset_internal.h

cpp/src/arrow/dataset/scanner.h

cpp/src/arrow/dataset/scanner.cc

ARROW-7001: First stab at converting datasets logic to async
ARROW-7001: Fixed a bunch of .result()'s in unit tests that weren't really valid (returning a reference to something then deleted)
ARROW-7001: Missed on change during rebase
ARROW-7001: Renamed ScanSync to Scan and ExecuteSync to Execute to preserve the old mirror APIs until the public bindings can be removed
ARROW-7001: Added a few more mirror APIs to get python build working
ARROW-7001: WIP
ARROW-7001: Various WIP
ARROW-7001: WIP
ARROW-7001: First stab at converting datasets logic to async
ARROW-7001: Fixed a bunch of .result()'s in unit tests that weren't really valid (returning a reference to something then deleted)
ARROW-7001: Renamed ScanSync to Scan and ExecuteSync to Execute to preserve the old mirror APIs until the public bindings can be removed
ARROW-7001: Added a few more mirror APIs to get python build working
ARROW-7001: WIP
ARROW-7001: Various WIP
ARROW-7001: Minor fixes to get semantics right
ARROW-7001: Cleanup
ARROW-7001: Fixing some compile errors after rebase
ARROW-7001: Fixing errors from rebase
ARROW-7001: Added a test for reordering datasets. Removed old concept of splittable. Fixed bug where file errors may not pass through
ARROW-7001: Somewhere in the rebasing I lost the 1-arg ScannerBuilder constructor. Added it back in and created a unit test for it for good measure.
ARROW-7001: Removing a ... to see if it removes illegal instruction on mac
ARROW-7001: Fixed a potential memory issue in the preserve ordering test
ARROW-7001: lint
ARROW-7001: Changed from using optional<bool> which isn't allowed to just returning the scan task in Scanner::ToTableAsync::table_building_task
ARROW-7001: Removed the forced transfer as it was not truly doing anything
ARROW-7001: The CSV scan task was doing a read on the CPU thread pool and it was preventing the async chain from getting setup immediately slowing things down. In addition, the later readahead buffers need to be larger to prevent the CPU thread from idling when things arrive out of order.
ARROW-7001: Need to put the impl for Scanner::ToTable in the cc file so it ends up in the so
ARROW-7001: Added a reordering test
ARROW-7001: Added ordering to scanner
ARROW-7001: Converted Future<Generator> to Generator
ARROW-7001: File readahead was not working correctly and to fix it required quite an overhaul of the scanner but, on the bright side, performance is better on I/O bound tasks
ARROW-7001: Fix failing unit test
ARROW-7001: Cleaned up lint. Deprecated the old Scan method. Reworked existing logic to adapt
ARROW-7001: Removing unused code detected by build
ARROW-7001: Moved some code around between header/impl to make MSVC happy. Fixed up a memory leak in a unit test caused by a circular shared_ptr reference

…Batches

…R code. To address in ARROW-11782
ARROW-7001: Removed incorrect comment from MakeMappedGenerator
ARROW-7001: Fixed a regression present when reading IPC fully buffered in memory
ARROW-7001: Made the InMemoryDataset creation methods consistent.
ARROW-7001: Adding back in (hopefully legacy) constructor for InMemoryScanTask needed by cglib

westonpace · 2021-04-07T18:44:03Z

@ursabot please benchmark

ursabot · 2021-04-07T18:44:10Z

Benchmark runs are scheduled for baseline = 5554c54 and contender = 5d48227. Results will be available as each benchmark for each run completes:
[Finished] ursa-i9-9960x: https://conbench.ursa.dev/compare/runs/c9a4e9a6-9290-4f4e-b382-fe43557215ee...6e3be913-c973-407e-a0de-5e3f0a6a6b10/
[Finished] ursa-thinkcentre-m75q: https://conbench.ursa.dev/compare/runs/1899a09d-e07c-423f-bdc5-13ae3fa1cd1c...7eb02524-600d-4af9-914a-621c4990bdef/
[Failed] ec2-t3-large-us-east-2: https://conbench.ursa.dev/compare/runs/ee98cea1-b3ef-4664-bb4b-8038fe8b9aed...5bc86590-a57a-4549-b267-5905f08a8414/
[Finished] ec2-t3-xlarge-us-east-2: https://conbench.ursa.dev/compare/runs/0e49b970-a3ee-478f-b42a-f1d15d2c8096...f00a81b5-2883-42eb-8559-95536fe2f432/

westonpace · 2021-05-04T19:21:21Z

The work reflected in this PR has been captured in ARROW-12289 (and related JIRAs). This PR is no longer needed.

github-actions bot added the Component: C++ label Mar 1, 2021

westonpace force-pushed the feature/arrow-7001 branch from a875631 to 9044db2 March 2, 2021 00:11

lidavidm reviewed Mar 2, 2021

cpp/src/arrow/dataset/file_parquet.cc (outdated)

cpp/src/arrow/dataset/file_parquet.cc (outdated)

lidavidm mentioned this pull request Mar 2, 2021

ARROW-11843: [C++] Provide async Parquet reader #9620

Closed

westonpace force-pushed the feature/arrow-7001 branch 4 times, most recently from ce01a15 to 7f525e8 March 9, 2021 23:18

westonpace force-pushed the feature/arrow-7001 branch from 7f525e8 to df34e3a March 15, 2021 15:22

westonpace marked this pull request as ready for review March 15, 2021 15:23

westonpace force-pushed the feature/arrow-7001 branch 5 times, most recently from e399e7d to 8a30438 March 20, 2021 01:45

westonpace force-pushed the feature/arrow-7001 branch 6 times, most recently from 4b51568 to b1cfa66 March 30, 2021 14:13

lidavidm mentioned this pull request Mar 30, 2021

ARROW-9731: [C++][Python][R][Dataset] WIP: Port "head" into C++ #9854

Closed

westonpace force-pushed the feature/arrow-7001 branch 4 times, most recently from 5082cab to 4581b6a March 31, 2021 13:00

github-actions bot added Component: Python Component: R labels Mar 31, 2021

lidavidm reviewed Mar 31, 2021

westonpace force-pushed the feature/arrow-7001 branch from 35903de to 857f349 April 6, 2021 21:16

westonpace and others added 4 commits April 6, 2021 19:31

ARROW-9731: [R][Dataset] Add warning for Scanner$Scan, bind ScanBatches

4d71415

ARROW-9731: [Python][Dataset] Add warning for Scanner.scan, bind Scan…

c44c19d

…Batches

westonpace force-pushed the feature/arrow-7001 branch from 4d69e88 to 5d48227 April 7, 2021 05:33

westonpace closed this May 4, 2021

westonpace deleted the feature/arrow-7001 branch January 6, 2022 08:17

asfimport mentioned this pull request Jun 22, 2021

[C++] Develop threading APIs to accommodate nested parallelism #23315

Closed

6 tasks
