@mapleFU mapleFU commented Apr 10, 2024

Rationale for this change

Enhance the boundary checking code style in io::CompressedInputStream.

What changes are included in this PR?

  • Add compressed_buffer_available and decompressed_buffer_available to the class, and use them for boundary checks
  • Change Status(bool*) to Result<bool>
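
As a rough illustration of the first bullet, the helpers could look something like the sketch below. This is not the Arrow implementation: `StreamState` and its members are hypothetical stand-ins (a `std::vector` instead of an Arrow buffer), and the only point is that "bytes still available" is computed in one place and returns 0 when the buffer is null or fully consumed.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for the stream's internal state; not Arrow's class.
struct StreamState {
  std::shared_ptr<std::vector<uint8_t>> compressed;    // may be null
  int64_t compressed_pos = 0;                          // compressed bytes already consumed
  std::shared_ptr<std::vector<uint8_t>> decompressed;  // may be null
  int64_t decompressed_pos = 0;                        // decompressed bytes already returned

  // Compressed bytes not yet fed to the decompressor (0 if there is no buffer).
  int64_t compressed_buffer_available() const {
    return compressed ? static_cast<int64_t>(compressed->size()) - compressed_pos : 0;
  }

  // Decompressed bytes not yet handed to the caller (0 if there is no buffer).
  int64_t decompressed_buffer_available() const {
    return decompressed ? static_cast<int64_t>(decompressed->size()) - decompressed_pos
                        : 0;
  }
};
```

With helpers of this shape, a check such as "buffer is null, or its size equals the read position" collapses into a single comparison against 0.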

Are these changes tested?

Covered by existing tests. I don't know how to hook into the internals to test them directly.

Are there any user-facing changes?

No

@github-actions

⚠️ GitHub issue #41116 has been automatically assigned in GitHub to PR creator.

Member Author

I'm not sure whether the previous check was too strict. If compressed_->size() == compressed_pos_, it would decompress with zero bytes of input; would that be OK?

Member Author

@mapleFU mapleFU Apr 10, 2024

Oh damn, this would trigger a zero-sized decompression and set the "fresh" flag to false... (at least in the brotli tests)

Member Author

This branch also covers the original check:

if (!decompressed_ || decompressed_->size() == 0)

If decompressed_ != nullptr, then compressed_ is never nullptr either, so this branch is always hit. And during DecompressData(), decompressed_ is reset. So the original code never hit the case decompressed_->size() != 0 && decompressed_->size() == decompressed_pos_.

@mapleFU mapleFU requested review from felipecrv and pitrou April 10, 2024 05:51
@github-actions github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label Apr 10, 2024
@mapleFU mapleFU force-pushed the minor/enhance-style-for-compressed branch 2 times, most recently from 82b1a93 to 9f631d7 Compare April 10, 2024 06:11
@mapleFU mapleFU force-pushed the minor/enhance-style-for-compressed branch from 9f631d7 to cd87ec5 Compare April 10, 2024 06:11
// First try to read data from the decompressor
// First try to read data from the decompressor.
// This doesn't use `CompressedBufferAvailable()` because when compressed_
// exists and available == 0, it might trigger an empty decompress and set
Member

It's not necessarily an empty decompress if the decompressor has its own internal buffer?

Member

Trivially, if the decompressed data is 1MB of zeros, and we decompress kDecompressSize bytes at a time, DecompressData will still yield data even after the compressed data is finished.

Member Author

@mapleFU mapleFU Apr 10, 2024

> It's not necessarily an empty decompress if the decompressor has its own internal buffer?

Aha, yes. I didn't know this before; let me update the description.

Member Author

@mapleFU mapleFU Apr 10, 2024

By the way, if compressed_->size() == 0 here, could the decompressor still have internally buffered data to decompress? I'm a little confused here. If the compressed data length is exactly 64 kB, then after the first read finishes and EnsureCompressedData() is called again, compressed_->size() == 0. Would more decompression be required at that point?

Member

If the compressed data length is 64 kB, but the decompressed data length is 1 MB, then you'll get several chunks of decompressed data after the compressed data has ended.
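
To make the point above concrete, here is a toy sketch of a decoder that keeps producing output after its input is exhausted. It assumes a made-up run-length format (pairs of count and byte); `ToyRleDecoder` and its `Decompress` signature are hypothetical and only mimic the general shape of a streaming decompressor, not Arrow's codec API.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Toy run-length decoder: each input pair (count, byte) expands to `count`
// copies of `byte`. Hypothetical; for illustration only.
class ToyRleDecoder {
 public:
  struct Result {
    size_t bytes_read;
    size_t bytes_written;
  };

  Result Decompress(const uint8_t* in, size_t in_len, uint8_t* out, size_t out_len) {
    Result r{0, 0};
    while (r.bytes_written < out_len) {
      if (pending_ == 0) {
        // No buffered run left: a new (count, byte) pair is needed from the input.
        if (in_len - r.bytes_read < 2) break;
        pending_ = in[r.bytes_read];
        value_ = in[r.bytes_read + 1];
        r.bytes_read += 2;
        continue;
      }
      // Emit as much of the buffered run as fits in the output buffer.
      const size_t n = std::min<size_t>(pending_, out_len - r.bytes_written);
      std::fill_n(out + r.bytes_written, n, value_);
      r.bytes_written += n;
      pending_ -= n;
    }
    return r;
  }

  // True while the decoder still holds output that needs no further input.
  bool HasPendingOutput() const { return pending_ > 0; }

 private:
  size_t pending_ = 0;  // copies of value_ still to emit
  uint8_t value_ = 0;
};
```

Feeding the two input bytes {200, 0} with a 64-byte output buffer consumes all input on the first call, yet three more calls with zero input bytes still produce output. A "no compressed bytes available" check alone therefore cannot tell whether decompression is finished.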

Member Author

@mapleFU mapleFU Apr 10, 2024

After re-reading this part of the code, I finally understand it. It seems CompressedInputStream is far more complex than I first thought... Thanks!

Member Author

mapleFU commented Apr 10, 2024

I've updated the comment here; would you mind checking again?

}

// Try to feed more data into the decompressed_ buffer.
Status RefillDecompressed(bool* has_data) {
Member

Since we are making changes in this function, can you make it return a Result<bool>?

Member Author

I don't fully understand; it seems the semantics of this function haven't changed?

Member Author

Before #39807, decompressed_ would be set to nullptr once it was fully consumed. Even though #39807 changed the semantics, it is still guaranteed to be OK, since the handling of compressed_ and decompressor_ in DecompressData() would protect it. So I think this change doesn't alter the semantics.

Member

We can just take the opportunity to use the newer style of returning Result<bool> instead of taking a bool* output argument.
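
For readers unfamiliar with the two styles, the sketch below contrasts them. It deliberately does not use Arrow's headers: `Status` and `Result<T>` here are minimal stand-ins written for this example (Arrow's real types are richer), and `RefillOld` / `RefillNew` with their `remaining` parameter are hypothetical names, not functions from the PR.

```cpp
#include <optional>
#include <string>
#include <utility>

// Minimal stand-ins for illustration only; not Arrow's Status / Result<T>.
struct Status {
  std::string error;  // empty means OK
  bool ok() const { return error.empty(); }
  static Status OK() { return {}; }
};

template <typename T>
class Result {
 public:
  Result(T value) : value_(std::move(value)) {}
  Result(Status status) : status_(std::move(status)) {}
  bool ok() const { return value_.has_value(); }
  const T& operator*() const { return *value_; }
  const Status& status() const { return status_; }

 private:
  std::optional<T> value_;
  Status status_;
};

// Older style: the answer comes back through an output pointer.
Status RefillOld(int remaining, bool* has_data) {
  if (remaining < 0) return Status{"negative remaining"};
  *has_data = remaining > 0;
  return Status::OK();
}

// Newer style: the value and any error travel together in the return value.
Result<bool> RefillNew(int remaining) {
  if (remaining < 0) return Status{"negative remaining"};
  return remaining > 0;
}
```

The second form removes the output argument, so a caller cannot read `has_data` without first going through the returned object that also carries the error.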

Member Author

Okay, done

Co-authored-by: Antoine Pitrou <pitrou@free.fr>
@mapleFU mapleFU force-pushed the minor/enhance-style-for-compressed branch from 8887644 to a740df3 Compare April 11, 2024 15:57
@mapleFU mapleFU removed the request for review from wgtmac April 11, 2024 15:57
Member Author

mapleFU commented Apr 12, 2024

Will merge on Monday if there are no objections.

@mapleFU mapleFU merged commit 271c878 into apache:main Apr 15, 2024
@mapleFU mapleFU removed the "awaiting committer review" label Apr 15, 2024
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 271c878.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 7 possible false positives for unstable benchmarks that are known to sometimes produce them.

vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…tStream (apache#41117)

* GitHub Issue: apache#41116

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <1506118561@qq.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: mwish <maplewish117@gmail.com>