-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-41116: [C++] IO: enhance boundary checking in CompressedInputStream #41117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-41116: [C++] IO: enhance boundary checking in CompressedInputStream #41117
Conversation
|
|
cpp/src/arrow/io/compressed.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know whether previous check would be too strict. If compressed_->size() == compressedPos_, it would decompress from 0, would that be ok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh damn, this would trigger a zero-sized compression and set freshed = false... ( At least in brotli testing )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This branch also protect original:
if (!decompressed_ || decompressed_->size() == 0)
If decompress_ != nullptr, then, compressed_ is always not nullptr. So, this branch is always hit. And during DecompressData(), decompressed_ will be reset. So the original code never hit decompressed_->size() != 0 && decompressed_->size() == decompressed_pos_.
82b1a93 to
9f631d7
Compare
9f631d7 to
cd87ec5
Compare
cpp/src/arrow/io/compressed.cc
Outdated
| // First try to read data from the decompressor | ||
| // First try to read data from the decompressor. | ||
| // This doesn't use `CompressedBufferAvailable()` because when compressed_ | ||
| // exists and available == 0, it might trigger an empty decompress and set |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not necessarily an empty decompress if the decompressor has its own internal buffer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trivially, if the decompressed data is 1MB of zeros, and we decompress kDecompressSize bytes at a time, DecompressData will still yield data even after the compressed data is finished.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not necessarily an empty decompress if the decompressor has its own internal buffer?
Aha, yes. I didn't know this before, let me change this description
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, if compressed_->size() == 0 here, would it has internal buffer to be decompressed? Feel a little consumed here. If the compressed data length is exactly 64KB, the first read finished, and when calling EnsureCompressedData() again, compressed_->size() == 0. Would it require more decompressing here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the compressed data length is 64 kB, but the decompressed data length is 1 MB, then you'll get several chunks of decompressed data after the compressed data has ended.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After re-reading this part of code I finally understand this. Seems CompressedInputStream is far complex than I first thought that... Thanks!
|
I've updated the comment here, would you mind check again? |
cpp/src/arrow/io/compressed.cc
Outdated
| } | ||
|
|
||
| // Try to feed more data into the decompressed_ buffer. | ||
| Status RefillDecompressed(bool* has_data) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are making changes in this function, can you make it return a Result<bool>?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't fully understand, seems that syntax of this function is not changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can just take the opportunity to use the newer style of returning Result<bool> instead of taking a bool* output argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, done
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
8887644 to
a740df3
Compare
|
Will merge in monday if no negative comments |
|
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 271c878. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 7 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…tStream (apache#41117) ### Rationale for this change Enhance the boundary checking code style in `io::CompressedInputStream`. ### What changes are included in this PR? * Add `compressed_buffer_available` and `decompressed_buffer_available` in the class, and uses them for checking the boundary * Change `Status(bool*)` to `Result<bool>` ### Are these changes tested? Already has testing. I don't know how to hacking into internal ### Are there any user-facing changes? No * GitHub Issue: apache#41116 Lead-authored-by: mwish <maplewish117@gmail.com> Co-authored-by: mwish <1506118561@qq.com> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Signed-off-by: mwish <maplewish117@gmail.com>
Rationale for this change
Enhance the boundary checking code style in
io::CompressedInputStream.What changes are included in this PR?
compressed_buffer_availableanddecompressed_buffer_availablein the class, and uses them for checking the boundaryStatus(bool*)toResult<bool>Are these changes tested?
Already has testing. I don't know how to hacking into internal
Are there any user-facing changes?
No