
Conversation


@westonpace westonpace commented Jun 22, 2021

This converts the parser & decoder into map functions and then creates the streaming CSV reader as an async generator. Parallel readahead is then added on top of the parser/decoder to allow for parallel reads.
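For illustration, here is a minimal sketch of that composition using Arrow's async generator utilities (`arrow/util/async_generator.h`). The helper name and the block types are placeholders, not the PR's actual code:

```cpp
#include <functional>
#include <utility>

#include "arrow/util/async_generator.h"

// Hypothetical helper illustrating the pipeline shape described above:
// raw blocks -> parse (map) -> decode (map) -> parallel readahead.
template <typename RawBlock, typename ParsedBlock, typename DecodedBlock>
arrow::AsyncGenerator<DecodedBlock> MakeStreamingCsvPipeline(
    arrow::AsyncGenerator<RawBlock> blocks,
    std::function<ParsedBlock(const RawBlock&)> parse,
    std::function<DecodedBlock(const ParsedBlock&)> decode, int readahead) {
  // The parser and decoder are plain map functions with no sequencing logic.
  auto parsed = arrow::MakeMappedGenerator(std::move(blocks), std::move(parse));
  auto decoded =
      arrow::MakeMappedGenerator(std::move(parsed), std::move(decode));
  // Readahead pulls up to `readahead` blocks eagerly so later blocks can be
  // parsed/decoded while earlier ones are still being consumed.
  return arrow::MakeReadaheadGenerator(std::move(decoded), readahead);
}
```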

One thing that is lost at the moment is the ability to encounter a parsing error and then continue. There was a Python test that read in the first block, failed to convert the second block, and then successfully read in a third block. I'm not sure if that restart behavior is important, but if it is, I can look into adding it.

Another thing that could be investigated in the future is combining the file readers and table readers more. They already share some components, but the parsing and decoding logic, while essentially the same, is handled very differently. The only real difference is that the table reader saves all the parsed blocks for re-parsing and the streaming reader does not.


@westonpace westonpace force-pushed the feature/ARROW-11889--c-add-parallelism-to-streaming-csv-reader branch from acde4e5 to 7cd6aed on June 24, 2021 09:33
@westonpace westonpace marked this pull request as ready for review June 24, 2021 18:45
@westonpace
Member Author

@pitrou @n3world This should probably wait until after ARROW-12996 is merged in. I think it'd be easier to rebase ARROW-12996 into this PR than the other way around.

@westonpace westonpace force-pushed the feature/ARROW-11889--c-add-parallelism-to-streaming-csv-reader branch from 8b8c812 to 9d8dc07 on July 1, 2021 00:57
@westonpace
Member Author

I've rebased in the changes from #10509. The behavior is only slightly different. Opening the streaming CSV reader reads in the first record batch, so bytes_read will reflect that before any batch has been delivered. After that, each time a batch is delivered the next batch is read in, so delivering a batch does not itself increment bytes_read. If reading in parallel then bytes_read could potentially be even further ahead of the consumer, since decoding happens during readahead. It should still match the spirit of the feature, which is to report how many bytes have been decoded.
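As a usage sketch (assuming the `StreamingReader::Make` overload taking an `IOContext`, and the `bytes_read()` accessor from ARROW-12996), the progression looks roughly like this:

```cpp
#include <iostream>
#include <memory>

#include "arrow/csv/reader.h"
#include "arrow/io/interfaces.h"
#include "arrow/record_batch.h"
#include "arrow/result.h"
#include "arrow/status.h"

arrow::Status ReportProgress(arrow::io::IOContext io_context,
                             std::shared_ptr<arrow::io::InputStream> input) {
  ARROW_ASSIGN_OR_RAISE(
      auto reader, arrow::csv::StreamingReader::Make(
                       io_context, std::move(input),
                       arrow::csv::ReadOptions::Defaults(),
                       arrow::csv::ParseOptions::Defaults(),
                       arrow::csv::ConvertOptions::Defaults()));
  // The first record batch is decoded while opening the reader, so
  // bytes_read() is already non-zero before any batch is delivered.
  std::cout << "after open: " << reader->bytes_read() << " bytes decoded\n";
  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
    if (!batch) break;
    // With use_threads enabled, readahead may have decoded blocks beyond the
    // batch just delivered, so bytes_read() can run ahead of the caller.
    std::cout << "got a batch: " << reader->bytes_read() << " bytes decoded\n";
  }
  return arrow::Status::OK();
}
```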

@n3world @pitrou review is welcome. The CI failure is unrelated.

Comment on lines 323 to 330
Contributor


I would say this changes the intent I had for bytes_read() when threads are used. The goal was to be able to report progress along with the batch, so that after a batch was retrieved with ReadNext(), bytes_read() could be used to calculate the progress of that batch. In this example the second-to-last batch would be calculated as 100% complete, and this can become more skewed with more readahead and parallel processing. However, with the futures you never know when the record batch is retrieved from the future, making it impossible for bytes_read() to work that way.

My only thought on how to solve this would be to have ReadNextAsync(), or a new similar method, return a Future of a pair where one of the values is the bytes read, so that anybody who actually wants to associate progress with a batch can use that API.
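For concreteness, the suggestion might look something like the following. This is purely a hypothetical shape, not an API that exists in the PR:

```cpp
#include <cstdint>
#include <memory>

#include "arrow/record_batch.h"

// Hypothetical: pair each delivered batch with the byte count observed when
// it was decoded, so progress can be attributed to that specific batch.
struct BatchWithBytesRead {
  std::shared_ptr<arrow::RecordBatch> batch;
  int64_t bytes_read;  // bytes decoded up to and including this batch
};

// Hypothetical addition to StreamingReader:
// arrow::Future<BatchWithBytesRead> ReadNextWithProgressAsync();
```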

Member Author


I moved the increment of bytes_decoded_ to be after the readahead. So now...

  • bytes_decoded_ will not be incremented until the caller asks for the batch
  • The header bytes and the bytes skipped before the header are still marked read after Make (I think this is fair as they have been "consumed" by this point)
  • Bytes skipped after the header are marked consumed after the first batch is delivered

I think this is close enough to what you are after.
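A minimal sketch of that ordering, with assumed type and member names (the real PR's DecodedBlock and bookkeeping may differ): the counting step wraps the generator downstream of the readahead, so the counter only advances when the caller actually pulls a block:

```cpp
#include <atomic>
#include <cstdint>
#include <memory>

#include "arrow/record_batch.h"
#include "arrow/util/async_generator.h"

// Assumed shape: a decoded batch plus the bytes it consumed from the input.
struct DecodedBlock {
  std::shared_ptr<arrow::RecordBatch> record_batch;
  int64_t bytes_processed;
};

arrow::AsyncGenerator<DecodedBlock> CountBytesOnDelivery(
    arrow::AsyncGenerator<DecodedBlock> source,
    std::shared_ptr<std::atomic<int64_t>> bytes_decoded) {
  // `source` is assumed to already include readahead; because this wrapper
  // sits downstream of it, the counter advances only when a block is pulled,
  // not when readahead decodes it in the background. Capturing only the
  // counter (not the reader itself) also avoids a circular reference.
  return [source = std::move(source),
          bytes_decoded = std::move(bytes_decoded)]() {
    return source().Then([bytes_decoded](const DecodedBlock& block) {
      bytes_decoded->fetch_add(block.bytes_processed);
      return block;
    });
  };
}
```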

Comment on lines 1488 to 1492
Contributor


Is this still meant to test SerialStreamingCSV? Should there be two classes so that all tests get run for serial and non-serial?

Member Author


Ah, good point (and clumsy of me to leave those comments in there). I've changed up the tests so now there is no base class and every test is parameterized on use_threads=true/false.
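A sketch of that parameterization in GoogleTest style (the fixture and test names are illustrative, not the PR's actual tests):

```cpp
#include <gtest/gtest.h>

#include "arrow/csv/options.h"

// Illustrative fixture: each TEST_P body runs twice, once serial and once
// threaded, instead of duplicating tests across two classes.
class StreamingReaderTest : public ::testing::TestWithParam<bool> {
 protected:
  arrow::csv::ReadOptions MakeReadOptions() const {
    auto read_options = arrow::csv::ReadOptions::Defaults();
    read_options.use_threads = GetParam();
    return read_options;
  }
};

TEST_P(StreamingReaderTest, ReadsAllBatches) {
  auto read_options = MakeReadOptions();
  // ... construct a StreamingReader with read_options and assert on its
  // batches; the body is identical for both parameter values ...
}

INSTANTIATE_TEST_SUITE_P(SerialAndThreaded, StreamingReaderTest,
                         ::testing::Values(false, true));
```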

@westonpace westonpace force-pushed the feature/ARROW-11889--c-add-parallelism-to-streaming-csv-reader branch from 86b9649 to cd899de on July 8, 2021 02:48
@westonpace
Member Author

Thanks for the feedback @n3world. I think I was able to update bytes_read to align with your use case a little better.

@westonpace
Member Author

@ursabot please benchmark


ursabot commented Jul 8, 2021

Benchmark runs are scheduled for baseline = cf6a7ff and contender = cd899de2debcd7ccb0c8d1e3f7840a3cebf77742. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2 (mimalloc)
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x (mimalloc)
[Finished ⬇️0.0% ⬆️0.0%] ursa-thinkcentre-m75q (mimalloc)
Supported benchmarks:
ursa-i9-9960x: langs = Python, R
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

@bkietz bkietz self-requested a review July 12, 2021 20:32
@westonpace westonpace force-pushed the feature/ARROW-11889--c-add-parallelism-to-streaming-csv-reader branch from cd899de to 6d6505a on July 13, 2021 21:12
Member

@bkietz bkietz left a comment


Just a few comments. Overall this is a nice clarification of the streaming reader.

westonpace and others added 10 commits July 15, 2021 11:36
…ed implementation. The parser and decoder are now operator functions and sequencing logic has been removed from them. Parallel readahead has been added to the streaming reader to allow for parallel streaming CSV reads.
…ad a reference to self which was causing a circular reference. Moved the reference to bytes_decoded itself.
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
@westonpace westonpace force-pushed the feature/ARROW-11889--c-add-parallelism-to-streaming-csv-reader branch from 6347b70 to ab9b932 on July 15, 2021 21:40
Member

@bkietz bkietz left a comment


LGTM, thanks for doing this!

@westonpace westonpace deleted the feature/ARROW-11889--c-add-parallelism-to-streaming-csv-reader branch January 6, 2022 08:16