ARROW-12996: Add bytes_read() to StreamingReader #10509
ef3f092 to c33fc49
pitrou
left a comment
This sounds useful in principle, thank you @n3world.
cpp/src/arrow/csv/reader.h
Outdated
The docstring may be a bit imprecise here. If there is some readahead going on, is it included in the result? Or is it the number of bytes corresponding to the batches already consumed by the caller?
I think bytes_read means "bytes the CSV reader is completely finished with". So serial readahead (e.g. the readahead happening on the I/O context or "data read but not parsed or decoded") should not be included. Caller consumption should be irrelevant.
For parallel readahead (e.g. the CSV reader reading/parsing/decoding multiple batches of data at the same time), my opinion is that bytes_read should be incremented as soon as a batch is ready to be delivered (even if there are other batches in front of it that aren't ready).
Perhaps bytes_processed or bytes_finished would remove the ambiguity? Or maybe just a clearer docstring.
A clearer docstring would be fine with me.
I updated it to
/// - bytes skipped by `ReadOptions.skip_rows` will be counted as being read before
/// any records are returned.
/// - bytes read while parsing the header will be counted as being read before any
/// records are returned.
/// - bytes skipped by `ReadOptions.skip_rows_after_names` will be counted after the
/// first batch is returned.
///
/// \return the number of bytes which have been read from the CSV stream and returned to
/// caller
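To make the three rules in this docstring concrete, here is a minimal sketch of the accounting they seem to imply. It is only one reading of the docstring, not the reader's actual code: `CsvLayout` and `ExpectedBytesRead` are hypothetical names invented for illustration, and the byte sizes are assumed to be known up front.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical byte layout of a CSV stream. All names here are illustrative
// and are not part of Arrow.
struct CsvLayout {
  int64_t skipped_rows_bytes;         // rows dropped by ReadOptions.skip_rows
  int64_t header_bytes;               // the column-name row
  int64_t skipped_after_names_bytes;  // rows dropped by skip_rows_after_names
  std::vector<int64_t> batch_bytes;   // bytes backing each returned batch
};

// Value bytes_read() would be expected to report once `batches_returned`
// batches have been handed to the caller, under this reading of the docstring.
int64_t ExpectedBytesRead(const CsvLayout& layout, std::size_t batches_returned) {
  assert(batches_returned <= layout.batch_bytes.size());
  // skip_rows and the header are counted before any records are returned.
  int64_t total = layout.skipped_rows_bytes + layout.header_bytes;
  // skip_rows_after_names is only counted once the first batch is returned.
  if (batches_returned > 0) {
    total += layout.skipped_after_names_bytes;
  }
  for (std::size_t i = 0; i < batches_returned; ++i) {
    total += layout.batch_bytes[i];
  }
  return total;
}
```

With a layout of 10 skipped bytes, a 6-byte header, 8 bytes skipped after the names and two 12-byte batches, this model gives 16 before the first batch, 36 after it and 48 after the second.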
cpp/src/arrow/csv/reader_test.cc
Outdated
Can you also add a test where the skip_rows and/or skip_rows_after_names options are set? What should be the semantics there?
Done
@westonpace You might be curious about this.
westonpace
left a comment
Looks good, one minor nit: if there is just one property, then bytes_read makes sense, so the API is fine. Internally there are two properties, bytes_read_ and bytes_parsed_, which is a little confusing to me because I immediately thought bytes_read_ meant "read but not parsed", since the order is "read->parse->decode". Maybe change bytes_read_ to bytes_decoded_ but leave it as bytes_read at the API level?
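A rough sketch of what this naming split could look like, assuming a toy reader with only the parse and decode stages modelled. `ToyStreamingReader`, `OnBlockParsed` and `OnBlockDecoded` are hypothetical names, not the PR's actual code:

```cpp
#include <cassert>
#include <cstdint>

// Toy stand-in for a streaming reader; this is not Arrow's implementation.
class ToyStreamingReader {
 public:
  // Parse stage of read->parse->decode: a block of `n` bytes has been parsed.
  void OnBlockParsed(int64_t n) { bytes_parsed_ += n; }

  // Decode stage: a previously parsed block of `n` bytes has been decoded.
  void OnBlockDecoded(int64_t n) {
    bytes_decoded_ += n;
    // Decoding can never get ahead of parsing.
    assert(bytes_decoded_ <= bytes_parsed_);
  }

  // The public name stays bytes_read(); it reports fully decoded bytes only.
  int64_t bytes_read() const { return bytes_decoded_; }

 private:
  int64_t bytes_parsed_ = 0;   // parsed, possibly not yet decoded
  int64_t bytes_decoded_ = 0;  // completely finished
};
```

In this sketch a block that has been parsed but not yet decoded does not show up in `bytes_read()`, which matches the "completely finished with" meaning discussed earlier in the thread.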
c33fc49 to 05f175f
I only named the variable bytes_read_ to match the method name, so if you are fine with bytes_read() returning the value of bytes_decoded_, I'll make that change.
bfb2ef3 to 803759a
Calling it
This PR will make it a bit more complicated to add a parallel streaming reader. @westonpace Are you ok with this?
Yes, I'll move the counting logic a little when I do it but I considered this when looking through the PR and I should be able to hook the counter update into the new logic pretty easily.
Add a bytes_read() to the StreamingReader interface so the progress of the stream can be determined easily and accurately by a user.
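As an illustration of the progress use case, a caller could turn the counter into a completion fraction. This is a sketch only: `ProgressFraction` is a hypothetical helper, and it assumes the caller obtains the total stream size separately (for example from the file system), since the reader is not stated here to provide it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: fraction of the stream processed so far, in [0, 1].
// `bytes_read` is the value reported by the reader; `total_bytes` is the
// stream size obtained by the caller through some other means.
double ProgressFraction(int64_t bytes_read, int64_t total_bytes) {
  assert(bytes_read >= 0);
  if (total_bytes <= 0) {
    return 0.0;  // unknown or empty stream: report no progress
  }
  const double fraction =
      static_cast<double>(bytes_read) / static_cast<double>(total_bytes);
  // Clamp in case the size estimate is stale (e.g. the file grew or shrank).
  return std::max(0.0, std::min(1.0, fraction));
}
```

A caller would typically re-evaluate this after each batch is read, passing the reader's current `bytes_read()` value.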
803759a to 6e869ac
pitrou
left a comment
+1, thank you for the updates @n3world!