ARROW-12996: Add bytes_read() to StreamingReader #10509

n3world · 2021-06-10T20:46:36Z

Add a bytes_read() to the StreamingReader interface so the progress of the stream can be determined easily and accurately by a user.
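As a rough illustration of the intended use, the counter could be turned into a progress figure as sketched below. This assumes the caller already knows the total input size (for example, the file size). `ProgressFraction` and `total_bytes` are hypothetical names, not part of Arrow; only `bytes_read()` comes from this PR.

```cpp
#include <cstdint>

// Hypothetical helper: convert a bytes_read() value into a fraction in [0, 1].
// total_bytes is assumed to be known to the caller (e.g. the size of the file).
double ProgressFraction(int64_t bytes_read, int64_t total_bytes) {
  if (total_bytes <= 0) return 0.0;           // unknown or empty input
  if (bytes_read <= 0) return 0.0;            // nothing consumed yet
  if (bytes_read >= total_bytes) return 1.0;  // clamp at completion
  return static_cast<double>(bytes_read) / static_cast<double>(total_bytes);
}
```

With a real reader, the first argument would presumably be the value returned by `bytes_read()` after each batch is received.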

github-actions · 2021-06-10T20:46:56Z

https://issues.apache.org/jira/browse/ARROW-12996

pitrou

This sounds useful in principle, thank you @n3world.

pitrou · 2021-06-15T13:03:46Z

cpp/src/arrow/csv/reader.h

The docstring may be a bit imprecise here. If there is some readahead going on, is it included in the result? Or is it the number of bytes corresponding to the batches already consumed by the caller?

I think bytes_read means "bytes the CSV reader is completely finished with". So serial readahead (e.g. the readahead happening on the I/O context or "data read but not parsed or decoded") should not be included. Caller consumption should be irrelevant.

For parallel readahead (e.g. the CSV reader reading/parsing/decoding multiple batches of data at the same time) then my opinion is that bytes_read should be incremented as soon as a batch is ready to be delivered (even if there are other batches in front of it that aren't ready).

Perhaps bytes_processed or bytes_finished would remove the ambiguity? Or maybe just a clearer docstring.
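The parallel-readahead semantics proposed above (count a batch as soon as it is ready, regardless of delivery order) might look roughly like the following. This is a toy sketch, not the Arrow implementation: `ToyParallelCounter` and `OnBatchReady` are hypothetical names, and the atomic counter is an assumption about how concurrent workers could share the total.

```cpp
#include <atomic>
#include <cstdint>

// Toy model only. Each worker reports a batch when it has finished decoding it,
// even if earlier batches are still in flight.
class ToyParallelCounter {
 public:
  // May be called from any worker, in any order.
  void OnBatchReady(int64_t batch_bytes) {
    bytes_read_.fetch_add(batch_bytes, std::memory_order_relaxed);
  }

  // Total bytes of all batches that are ready so far.
  int64_t bytes_read() const {
    return bytes_read_.load(std::memory_order_relaxed);
  }

 private:
  std::atomic<int64_t> bytes_read_{0};
};
```

Under this scheme the counter can briefly run ahead of what the caller has actually received, which is consistent with the view that caller consumption is irrelevant.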

A clearer docstring would be fine with me.

I updated it to:

/// - bytes skipped by `ReadOptions.skip_rows` will be counted as being read before
/// any records are returned.
/// - bytes read while parsing the header will be counted as being read before any
/// records are returned.
/// - bytes skipped by `ReadOptions.skip_rows_after_names` will be counted after the
/// first batch is returned.
///
/// \return the number of bytes which have been read from the CSV stream and returned to
/// caller
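To make the accounting in this docstring concrete, here is a toy model of it. It is not the Arrow code: `ToyStreamingReader`, its constructor parameters, and the line-based input are all hypothetical, chosen only to show when each group of bytes gets counted.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Toy model only: NOT the Arrow implementation. Input is a list of
// newline-terminated lines; the first non-skipped line is treated as the header.
class ToyStreamingReader {
 public:
  ToyStreamingReader(std::vector<std::string> lines, size_t skip_rows,
                     size_t skip_rows_after_names, size_t rows_per_batch)
      : lines_(std::move(lines)),
        skip_rows_after_names_(skip_rows_after_names),
        rows_per_batch_(rows_per_batch) {
    // Skipped leading rows and the header row are counted as read before any
    // records are returned.
    for (size_t i = 0; i < skip_rows + 1 && pos_ < lines_.size(); ++i) {
      bytes_read_ += static_cast<int64_t>(lines_[pos_++].size());
    }
  }

  // Returns the next batch of raw lines; an empty vector signals end of stream.
  std::vector<std::string> ReadNext() {
    // Rows skipped after the header are counted together with the first batch.
    while (skip_rows_after_names_ > 0 && pos_ < lines_.size()) {
      bytes_read_ += static_cast<int64_t>(lines_[pos_++].size());
      --skip_rows_after_names_;
    }
    skip_rows_after_names_ = 0;

    std::vector<std::string> batch;
    while (batch.size() < rows_per_batch_ && pos_ < lines_.size()) {
      bytes_read_ += static_cast<int64_t>(lines_[pos_].size());
      batch.push_back(lines_[pos_++]);
    }
    return batch;
  }

  int64_t bytes_read() const { return bytes_read_; }

 private:
  std::vector<std::string> lines_;
  size_t skip_rows_after_names_;
  size_t rows_per_batch_;
  size_t pos_ = 0;
  int64_t bytes_read_ = 0;
};
```

In this model, `bytes_read()` is already non-zero before the first `ReadNext()` call whenever a header or `skip_rows` is present, and the `skip_rows_after_names` bytes only show up once the first batch has been returned.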

pitrou · 2021-06-15T13:06:04Z

cpp/src/arrow/csv/reader_test.cc

Can you also add a test where the skip_rows and/or skip_rows_after_names options are set? What should be the semantics there?

pitrou · 2021-06-15T13:06:57Z

@westonpace You might be curious about this.

westonpace

Looks good, one minor nit: If there is just one property then bytes_read makes sense. So the API is fine. Internally there are two properties bytes_read_ and bytes_parsed_ which is a little confusing to me because I immediately thought bytes_read_ meant "read but not parsed" since the order is "read->parse->decode". Maybe change bytes_read_ to bytes_decoded_ but leave it as bytes_read at the API level?

n3world · 2021-06-15T21:12:48Z

Looks good, one minor nit: If there is just one property then bytes_read makes sense. So the API is fine. Internally there are two properties bytes_read_ and bytes_parsed_ which is a little confusing to me because I immediately thought bytes_read_ meant "read but not parsed" since the order is "read->parse->decode". Maybe change bytes_read_ to bytes_decoded_ but leave it as bytes_read at the API level?

I only named the variable bytes_read_ to match the method name, so if you are fine with bytes_read() returning the value of bytes_decoded_, I'll make that change.
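The naming being agreed on here could be sketched as follows. Only `bytes_parsed_`, `bytes_decoded_`, and `bytes_read()` are taken from the discussion; the class `ToyProgressCounters` and the `OnBlockParsed` / `OnBlockDecoded` hooks are hypothetical and stand in for wherever the real reader updates its counters.

```cpp
#include <cstdint>

// Toy model of the read -> parse -> decode pipeline counters.
class ToyProgressCounters {
 public:
  // Hypothetical hook: a block has been parsed but not yet decoded.
  void OnBlockParsed(int64_t n) { bytes_parsed_ += n; }

  // Hypothetical hook: a block has been fully decoded into a batch.
  void OnBlockDecoded(int64_t n) { bytes_decoded_ += n; }

  // The public name stays bytes_read(), but it reports decoded bytes.
  int64_t bytes_read() const { return bytes_decoded_; }

  int64_t bytes_parsed() const { return bytes_parsed_; }

 private:
  int64_t bytes_parsed_ = 0;
  int64_t bytes_decoded_ = 0;
};
```

Keeping the internal name tied to the pipeline stage avoids the "read but not parsed" reading that the original member name suggested.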

pitrou · 2021-06-16T09:11:33Z

Calling it bytes_read() is fine with me.

pitrou · 2021-06-16T09:25:19Z

This PR will make it a bit more complicated to add a parallel streaming reader. @westonpace Are you ok with this?

westonpace · 2021-06-16T17:40:47Z

Yes, I'll move the counting logic a little when I do it but I considered this when looking through the PR and I should be able to hook the counter update into the new logic pretty easily.

pitrou

+1, thank you for the updates @n3world !

github-actions bot added the Component: C++ label Jun 10, 2021

n3world force-pushed the ARROW-12996-stream_progress branch 2 times, most recently from ef3f092 to c33fc49 Compare June 10, 2021 21:25

pitrou reviewed Jun 15, 2021

westonpace reviewed Jun 15, 2021

n3world force-pushed the ARROW-12996-stream_progress branch from c33fc49 to 05f175f Compare June 15, 2021 20:52

n3world force-pushed the ARROW-12996-stream_progress branch 2 times, most recently from bfb2ef3 to 803759a Compare June 15, 2021 21:36

ARROW-12996: Add bytes_read() to StreamingReader

6e869ac

Add a bytes_read() to the StreamingReader interface so the progress of the stream can be determined easily and accurately by a user.

pitrou force-pushed the ARROW-12996-stream_progress branch from 803759a to 6e869ac Compare June 30, 2021 09:31

Update docstring

294f0b0

pitrou approved these changes Jun 30, 2021

pitrou closed this in a308f2c Jun 30, 2021

westonpace mentioned this pull request Jul 1, 2021

ARROW-11889: [C++] Add parallelism to streaming CSV reader #10568

Closed

n3world deleted the ARROW-12996-stream_progress branch July 21, 2021 23:04

asfimport mentioned this pull request Jun 30, 2021

[C++] CSV stream reader has no progress indication #28713

Closed
