Skip to content

Conversation

@n3world
Copy link
Contributor

@n3world n3world commented Jun 10, 2021

Add a bytes_read() to the StreamingReader interface so the progress of the stream can be determined easily and accurately by a user.

@github-actions
Copy link

@n3world n3world force-pushed the ARROW-12996-stream_progress branch 2 times, most recently from ef3f092 to c33fc49 Compare June 10, 2021 21:25
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds useful on the principle, thank you @n3world .

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring may be a bit imprecise here. If there is some readahead going on, is it included in the result? Or is it the number of bytes corresponding to the batches already consumed by the caller?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think bytes_read means "bytes the CSV reader is completely finished with". So serial readahead (e.g. the readahead happening on the I/O context or "data read but not parsed or decoded") should not be included. Caller consumption should be irrelevant.

For parallel readahead (e.g. the CSV reader reading/parsing/decoding multiple batches of data at the same time) then my opinion is that bytes_read should be incremented as soon as a batch is ready to be delivered (even if there are other batches in front of it that aren't ready).

Perhaps bytes_processed or bytes_finished would remove the ambiguity? Or maybe just a clearer docstring.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A clearer docstring would be fine with me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated it to

  /// - bytes skipped by `ReadOptions.skip_rows` will be counted as being read before
  /// any records are returned.
  /// - bytes read while parsing the header will be counted as being read before any
  /// records are returned.
  /// - bytes skipped by `ReadOptions.skip_rows_after_names` will be counted after the
  /// first batch is returned.
  ///
  /// \return the number of bytes which have been read from the CSV stream and returned to
  /// caller

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add a test where the skip_rows and/or skip_rows_after_names options are set? What should be the semantics there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@pitrou
Copy link
Member

pitrou commented Jun 15, 2021

@westonpace You might be curious about this.

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, one minor nit: If there is just one property then bytes_read makes sense. So the API is fine. Internally there are two properties bytes_read_ and bytes_parsed_ which is a little confusing to me because I immediately thought bytes_read_ meant "read but not parsed" since the order is "read->parse->decode". Maybe change bytes_read_ to bytes_decoded_ but leave it as bytes_read at the API level?

@n3world n3world force-pushed the ARROW-12996-stream_progress branch from c33fc49 to 05f175f Compare June 15, 2021 20:52
@n3world
Copy link
Contributor Author

n3world commented Jun 15, 2021

Looks good, one minor nit: If there is just one property then bytes_read makes sense. So the API is fine. Internally there are two properties bytes_read_ and bytes_parsed_ which is a little confusing to me because I immediately thought bytes_read_ meant "read but not parsed" since the order is "read->parse->decode". Maybe change bytes_read_ to bytes_decoded_ but leave it as bytes_read at the API level?

I only named the variable bytes_read_ to match the method name so if you are fine with the bytes_read() returning the value of bytes_decoded_, I'll make that change.

@n3world n3world force-pushed the ARROW-12996-stream_progress branch 2 times, most recently from bfb2ef3 to 803759a Compare June 15, 2021 21:36
@pitrou
Copy link
Member

pitrou commented Jun 16, 2021

Calling it bytes_read() is fine with me.

@pitrou
Copy link
Member

pitrou commented Jun 16, 2021

This PR will make it a bit more complicated to add a parallel streaming reader. @westonpace Are you ok with this?

@westonpace
Copy link
Member

Yes, I'll move the counting logic a little when I do it but I considered this when looking through the PR and I should be able to hook the counter update into the new logic pretty easily.

Add a bytes_read() to the StreamingReader interface so the progress of
the stream can be determined easily and accurately by a user.
@pitrou pitrou force-pushed the ARROW-12996-stream_progress branch from 803759a to 6e869ac Compare June 30, 2021 09:31
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thank you for the updates @n3world !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants