ARROW-11889: [C++] Add parallelism to streaming CSV reader #10568
Conversation
Force-pushed from acde4e5 to 7cd6aed
@pitrou @n3world This should probably wait until after ARROW-12996 is merged in. I think it'd be easier to rebase ARROW-12996 into this PR than the other way around.
Force-pushed from 8b8c812 to 9d8dc07
I've rebased in the changes from #10509. The behavior is only slightly different: opening the streaming CSV reader reads in the first record batch, so bytes_read will reflect that before any batch has been consumed. After that, each time a batch is returned the next batch is read in, so the read call itself does not increment bytes_read. If reading in parallel, bytes_read could be even further ahead of the consumer, since decoding happens in the readahead. It should still match the spirit of the feature, which is to report how many bytes have been decoded. @n3world @pitrou review is welcome. The CI failure is unrelated.
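For illustration, here is a minimal sketch of how this looks from user code, assuming a local file at a placeholder path `data.csv`; with `use_threads` enabled, `bytes_read()` may run ahead of the batch just returned:

```cpp
#include <iostream>
#include <memory>

#include <arrow/csv/api.h>
#include <arrow/io/api.h>
#include <arrow/record_batch.h>
#include <arrow/result.h>
#include <arrow/status.h>

// Sketch only: "data.csv" is a placeholder path.
arrow::Status ReadCsvStreaming() {
  ARROW_ASSIGN_OR_RAISE(auto input, arrow::io::ReadableFile::Open("data.csv"));

  auto read_options = arrow::csv::ReadOptions::Defaults();
  read_options.use_threads = true;  // enable the parallel readahead

  ARROW_ASSIGN_OR_RAISE(
      auto reader,
      arrow::csv::StreamingReader::Make(arrow::io::default_io_context(), input,
                                        read_options,
                                        arrow::csv::ParseOptions::Defaults(),
                                        arrow::csv::ConvertOptions::Defaults()));

  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
    if (batch == nullptr) break;  // end of stream
    // Because of readahead, this may already include bytes decoded for
    // batches beyond the one just returned.
    std::cout << "bytes decoded so far: " << reader->bytes_read() << std::endl;
  }
  return arrow::Status::OK();
}
```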
cpp/src/arrow/csv/reader_test.cc (outdated)
I would say this changes the intent I had for bytes_read() when threads are used. The goal was to be able to report progress along with the batch, so that after a batch was retrieved with ReadNext(), bytes_read() could be used to calculate the progress of that batch. In this example the second-to-last batch would be calculated as 100% complete, and this can become more skewed with more readahead and parallel processing. However, with the futures you never know when the record batch is retrieved from the future, making it impossible for bytes_read() to work that way.
My only thought on how to solve this would be to have ReadNextAsync() or a similar new method return a Future of a pair where one of the values is the bytes read, so that anybody who actually wants to associate progress with a batch can use that API.
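Something like the following shape, as a purely hypothetical sketch; `BatchWithBytes` and `ReadNextWithProgress` are illustrative names, not existing Arrow API:

```cpp
#include <cstdint>
#include <memory>

#include <arrow/record_batch.h>
#include <arrow/util/future.h>

// Hypothetical sketch of the proposal above; none of these names exist in
// Arrow. The idea: resolve each future to the batch together with the byte
// count that produced it, so progress stays tied to the batch even when
// readahead has moved bytes_read() further along.
struct BatchWithBytes {
  std::shared_ptr<arrow::RecordBatch> batch;
  int64_t bytes_read;  // bytes decoded to produce this batch
};

class ProgressStreamingReader {
 public:
  virtual ~ProgressStreamingReader() = default;
  // Like ReadNextAsync(), but pairs the batch with its progress counter.
  virtual arrow::Future<BatchWithBytes> ReadNextWithProgress() = 0;
};
```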
I moved the increment of bytes_decoded_ to be after the readahead. So now:
- bytes_decoded_ will not be incremented until the reader asks for the batch
- The header bytes and skip before header are still marked read after Make (I think this is fair as they have been "consumed" by this point)
- Bytes skipped after the header are marked consumed after the first batch is delivered
I think this is close enough to what you are after.
python/pyarrow/tests/test_csv.py (outdated)
Is this still to test SerialStreamingCSV? Should there be two classes so that all tests get run for serial and non-serial?
Ah, good point (and clumsy of me to leave those comments in there). I've changed up the test so now there is no base class and every test is parameterized on use_threads=True/False.
Force-pushed from 86b9649 to cd899de
Thanks for the feedback @n3world. I think I was able to update …
@ursabot please benchmark
Benchmark runs are scheduled for baseline = cf6a7ff and contender = cd899de2debcd7ccb0c8d1e3f7840a3cebf77742. Results will be available as each benchmark for each run completes.
Force-pushed from cd899de to 6d6505a
bkietz left a comment
Just a few comments. Overall this is a nice clarification of the streaming reader.
…ed implementation. The parser and decoder are now operator functions and sequencing logic has been removed from them. Parallel readahead has been added to the streaming reader to allow for parallel streaming CSV reads.
…e commented out column decoder tests
…ad a reference to self which was causing a circular reference. Moved the reference to bytes_decoded itself.
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Force-pushed from 6347b70 to ab9b932
bkietz left a comment
LGTM, thanks for doing this!
This converts the parser & decoder into map functions and then creates the streaming CSV reader as an async generator. Parallel readahead is then added on top of the parser/decoder to allow for parallel reads.
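As a rough illustration of that shape (a conceptual sketch only, not the actual reader internals; `CsvBlock`, `ParsedBlock`, `ParseBlock`, and `DecodeBlock` are hypothetical stand-ins for the blocks and operator functions):

```cpp
#include <memory>

#include <arrow/record_batch.h>
#include <arrow/result.h>
#include <arrow/util/async_generator.h>

// Conceptual sketch of the pipeline shape described above, not the actual
// reader internals.
struct CsvBlock { /* a slice of raw CSV bytes */ };
struct ParsedBlock { /* rows/fields located within the block */ };

// Pure map functions: no sequencing logic inside, so they can run on any
// thread as the readahead generator pulls blocks.
arrow::Result<ParsedBlock> ParseBlock(const CsvBlock& block) {
  return ParsedBlock{};  // placeholder for the real parser
}
arrow::Result<std::shared_ptr<arrow::RecordBatch>> DecodeBlock(
    const ParsedBlock& parsed) {
  return std::shared_ptr<arrow::RecordBatch>{};  // placeholder for the decoder
}

arrow::AsyncGenerator<std::shared_ptr<arrow::RecordBatch>> MakeBatchGenerator(
    arrow::AsyncGenerator<CsvBlock> blocks, int max_readahead) {
  auto parsed = arrow::MakeMappedGenerator(std::move(blocks), ParseBlock);
  auto decoded = arrow::MakeMappedGenerator(std::move(parsed), DecodeBlock);
  // Readahead eagerly pulls up to max_readahead decoded batches, letting
  // parse/decode of later blocks overlap with the consumer.
  return arrow::MakeReadaheadGenerator(std::move(decoded), max_readahead);
}
```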
One thing that is lost at the moment is the ability to encounter a parsing error and then continue. There was a Python test that read in the first block, failed to convert the second block, and then successfully read in a third block. I'm not sure if that restart behavior is important, but if it is I can look into adding it.
Another thing that could be investigated in the future is combining the file readers and table readers more. They already share some components, but the parsing and decoding logic, while basically the same, is handled very differently. The only real difference is that the table reader saves all the parsed blocks for re-parsing and the streaming reader does not.