Conversation
- Add `ensure_bits()` and `get_bits_unchecked()` to `BitReaderReversed` for batched unchecked bit reads after a single refill check
- Add `update_state_fast()` to `FSEDecoder` using unchecked reads
- Restructure both sequence decode loops (with/without RLE) to use one `ensure_bits()` call covering all three FSE state updates per iteration, replacing three individual per-update refill checks
- Fix pre-existing bench compile error (rand 0.10 `Rng` -> `RngExt`)

Closes #11
No actionable comments were generated in the recent review. 🎉
📝 Walkthrough: Added unchecked bit-read APIs and a fast FSE state-update; refactored sequence decoding to compute a single batched bit-refill budget and call fast updates for non-RLE decoders; adjusted a benchmark RNG trait import.
🚥 Pre-merge checks: ✅ 5 passed
Pull request overview
This PR introduces a decoding hot-loop optimization for FSE sequence decoding by batching bitstream refill checks, reducing per-iteration branch overhead in the sequence state-update path.
Changes:
- Added `BitReaderReversed::ensure_bits()` and `BitReaderReversed::get_bits_unchecked()` to enable batched unchecked bit reads after a single refill check.
- Added `FSEDecoder::update_state_fast()` and updated sequence decoding loops (with/without RLE) to perform a single `ensure_bits()` call per iteration for all relevant state updates.
- Updated bench support imports for rand 0.10 (`Rng` → `RngExt`).
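The batched-refill pattern described above can be sketched with a simplified reader. `MiniReader` and its fields are illustrative stand-ins under assumed semantics, not the crate's actual `BitReaderReversed` internals:

```rust
// Minimal sketch of "one refill check, then several unchecked reads".
// MiniReader is a hypothetical stand-in for BitReaderReversed.
struct MiniReader {
    bits: u64,  // bit container (low bits are consumed first here)
    avail: u32, // bits currently buffered
}

impl MiniReader {
    fn new(bits: u64, avail: u32) -> Self {
        Self { bits, avail }
    }

    // One refill check up front: guarantee at least `n` bits are buffered.
    fn ensure_bits(&mut self, n: u32) {
        debug_assert!(n <= 56, "a refill can only guarantee up to 56 bits");
        if self.avail < n {
            // A real reader would shift fresh source bytes in here; the
            // sketch just tops the counter up to the post-refill guarantee.
            self.avail = 56;
        }
    }

    // No availability check: the caller already called ensure_bits().
    fn get_bits_unchecked(&mut self, n: u32) -> u64 {
        debug_assert!(self.avail >= n);
        let v = self.bits & ((1u64 << n) - 1);
        self.bits >>= n;
        self.avail -= n;
        v
    }
}

fn main() {
    let mut r = MiniReader::new(0b1101_0110, 8);
    // One branch covers both reads below.
    r.ensure_bits(3 + 5);
    let low = r.get_bits_unchecked(3); // lowest 3 bits: 0b110 = 6
    let high = r.get_bits_unchecked(5); // next 5 bits: 0b11010 = 26
    assert_eq!(low, 6);
    assert_eq!(high, 26);
}
```

The point of the split API is that the branch in `ensure_bits()` is paid once per iteration instead of once per read.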
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| zstd/src/fse/fse_decoder.rs | Adds a fast-path state update that uses unchecked bit reads under a caller-enforced bit budget. |
| zstd/src/decoding/sequence_section_decoder.rs | Restructures sequence decode loops to replace multiple per-update refill checks with a single batched ensure_bits() call. |
| zstd/src/bit_io/bit_reader_reverse.rs | Adds the batched refill precheck (ensure_bits) and unchecked read helper (get_bits_unchecked). |
| zstd/benches/support/mod.rs | Fixes bench compilation against rand 0.10 by switching to RngExt. |
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/benches/support/mod.rs`:
- Line 3: Reorder the rand import so single identifiers come before nested
module paths and each group is alphabetically ordered: change the use statement
that currently lists `rngs::SmallRng` first so the import reads `RngExt`,
`SeedableRng`, then `rngs::SmallRng`, satisfying cargo fmt ordering rules.
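The ordering the comment asks for, shown with stub modules so the snippet compiles without the real rand crate; only the shape of the `use` line matters here, and the stub trait/struct bodies are invented for the sketch:

```rust
// Stub of the rand 0.10 items named in the review comment.
mod rand {
    pub trait RngExt {
        fn label(&self) -> &'static str {
            "rng"
        }
    }
    pub trait SeedableRng {}
    pub mod rngs {
        pub struct SmallRng;
    }
    impl RngExt for rngs::SmallRng {}
    impl SeedableRng for rngs::SmallRng {}
}

// rustfmt ordering: single identifiers first, then nested paths,
// each group alphabetized.
use rand::{RngExt, SeedableRng, rngs::SmallRng};

fn demo() -> &'static str {
    let r: SmallRng = SmallRng;
    let _seedable: &dyn SeedableRng = &r;
    r.label()
}

fn main() {
    assert_eq!(demo(), "rng");
}
```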
In `@zstd/src/bit_io/bit_reader_reverse.rs`:
- Around line 115-123: Add a debug-only precondition to get_bits_unchecked: use
debug_assert! to verify the requested n is within bounds (e.g. n <= 64) and that
the reader currently has at least n bits available before calling
peek_bits/consume (for example debug_assert!(self.available_bits() >= n as
usize)). If there is no available_bits()/bits_remaining() helper, add one based
on the reader's buffer state and use it in the debug_assert so misuse is caught
in debug builds while release behavior remains unchanged; reference
get_bits_unchecked, peek_bits, consume and ensure_bits in the check.
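A sketch of the debug-only precondition this comment suggests, on a simplified reader. The `available_bits` helper and field names are illustrative, and the `n <= 56` bound follows the budget the PR's later commit settled on rather than the 64 mentioned here:

```rust
// Hypothetical simplified reader demonstrating debug-build precondition
// checks on an unchecked read path.
struct Reader {
    buf: u64,
    bits_consumed: u32,
}

impl Reader {
    // Helper the comment proposes if none exists yet.
    fn available_bits(&self) -> u32 {
        64 - self.bits_consumed
    }

    fn get_bits_unchecked(&mut self, n: u32) -> u64 {
        // Catch misuse in debug builds; release builds compile these away.
        debug_assert!(n <= 56, "request exceeds the post-refill guarantee");
        debug_assert!(
            self.available_bits() >= n,
            "caller must ensure_bits() before unchecked reads"
        );
        let v = self.buf & ((1u64 << n) - 1);
        self.buf >>= n;
        self.bits_consumed += n;
        v
    }
}

fn main() {
    let mut r = Reader { buf: 0xAB, bits_consumed: 0 };
    assert_eq!(r.get_bits_unchecked(4), 0xB); // low nibble of 0xAB
    assert_eq!(r.available_bits(), 60);
}
```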
In `@zstd/src/decoding/sequence_section_decoder.rs`:
- Around line 72-85: Add a debug assertion enforcing the 56-bit budget invariant
immediately after computing max_update_bits: insert
debug_assert!(max_update_bits <= 56, "sequence section update bits exceed 56-bit
budget"); this uses the existing local max_update_bits derived from
scratch.literal_lengths.accuracy_log, scratch.match_lengths.accuracy_log and
scratch.offsets.accuracy_log and their corresponding RLE checks (scratch.ll_rle,
scratch.ml_rle, scratch.of_rle). Repeat the same debug_assert at the other
similar computation later in this file (the second max_update_bits calculation
around the 174-182 region) so both places validate the 56-bit table-log budget.
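The budget assertion the comment requests can be sketched as follows. The `Scratch` struct here is a hypothetical stand-in modeled on the field names the comment mentions, not the crate's actual type:

```rust
// Stand-in types for the decoder scratch state named in the comment.
struct Table {
    accuracy_log: u8,
}
struct Scratch {
    literal_lengths: Table,
    match_lengths: Table,
    offsets: Table,
    ll_rle: Option<u8>,
    ml_rle: Option<u8>,
    of_rle: Option<u8>,
}

fn max_update_bits(s: &Scratch) -> u8 {
    // RLE streams consume no update bits; the others need up to
    // accuracy_log bits per state update.
    let ll = if s.ll_rle.is_some() { 0 } else { s.literal_lengths.accuracy_log };
    let ml = if s.ml_rle.is_some() { 0 } else { s.match_lengths.accuracy_log };
    let of = if s.of_rle.is_some() { 0 } else { s.offsets.accuracy_log };
    let total = ll + ml + of;
    debug_assert!(
        total <= 56,
        "sequence section update bits exceed 56-bit budget"
    );
    total
}

fn main() {
    let s = Scratch {
        literal_lengths: Table { accuracy_log: 9 },
        match_lengths: Table { accuracy_log: 9 },
        offsets: Table { accuracy_log: 8 },
        ll_rle: None,
        ml_rle: None,
        of_rle: None,
    };
    // LL(9) + ML(9) + OF(8) = 26, well under the 56-bit budget.
    assert_eq!(max_update_bits(&s), 26);
}
```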
ℹ️ Review info: configuration path .coderabbit.yaml · review profile ASSERTIVE · plan Pro · Run ID e4016e20-76bb-41b1-ba93-fc6be8f12740

📒 Files selected for processing (4):
- zstd/benches/support/mod.rs
- zstd/src/bit_io/bit_reader_reverse.rs
- zstd/src/decoding/sequence_section_decoder.rs
- zstd/src/fse/fse_decoder.rs
- Add `debug_assert!(n <= 56)` to `get_bits_unchecked`
- Add `debug_assert!(max_update_bits <= 56)` in both sequence decode loops
- Add `ensure_and_unchecked_match_get_bits` test covering fast-path equivalence with `get_bits` across refill boundaries and the n=0 edge case
- Update bench rand 0.10 doc: `Rng::fill()` → `RngExt::fill()`
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Remove misleading "dual-state interleaving" wording from comments. The optimization is a batched refill check covering three single-state FSE decoders, not a dual-state interleaving pattern.
@coderabbitai re-review
✅ Actions performed: full review triggered.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/bit_io/bit_reader_reverse.rs`:
- Around line 209-255: The test currently calls fast_br.ensure_bits(26) after
only 37 bits consumed, so the ensure stays on the no-refill path; to force a
real refill make fast_br consume the same extra two 9-bit reads (the r6 and r7
reads done on ref_br via get_bits(9) twice) before calling ensure_bits(26). In
other words, call fast_br.ensure_bits(9); fast_br.get_bits_unchecked(9) (or
fast_br.get_bits(9) to advance) twice to advance fast_br to 55 bits consumed,
then call fast_br.ensure_bits(26) and assert the subsequent
get_bits_unchecked(9/9/8) match ref_br; reference the BitReaderReversed methods
ensure_bits, get_bits_unchecked and get_bits.
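The reviewer's point can be illustrated with a simplified consumed-bits counter: 37 bits consumed keeps `ensure_bits(26)` on the no-refill path, while advancing past it forces a real refill (the follow-up commit used 39 + 26 = 65 > 64). `MiniReader` below is a counting model, not the crate's `BitReaderReversed`:

```rust
// Counting model of the refill decision inside ensure_bits().
struct MiniReader {
    consumed: u32, // bits consumed from the 64-bit container
    refills: u32,  // how many real refills were taken
}

impl MiniReader {
    fn ensure_bits(&mut self, n: u32) {
        if self.consumed + n > 64 {
            // Real refill path: shift fresh bytes in, reset the deficit.
            self.refills += 1;
            self.consumed = 0;
        }
    }
    fn get_bits(&mut self, n: u32) {
        self.ensure_bits(n);
        self.consumed += n;
    }
}

fn main() {
    // 37 bits consumed: ensure_bits(26) stays on the no-refill path.
    let mut r = MiniReader { consumed: 0, refills: 0 };
    r.get_bits(37);
    r.ensure_bits(26); // 37 + 26 = 63 <= 64
    assert_eq!(r.refills, 0);

    // 39 bits consumed: ensure_bits(26) must take the real refill path.
    let mut r = MiniReader { consumed: 0, refills: 0 };
    r.get_bits(39);
    r.ensure_bits(26); // 39 + 26 = 65 > 64
    assert_eq!(r.refills, 1);
}
```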
ℹ️ Review info: configuration path .coderabbit.yaml · review profile ASSERTIVE · plan Pro · Run ID b910c873-9f53-49b4-b062-f7339a700b20

📒 Files selected for processing (4):
- zstd/benches/support/mod.rs
- zstd/src/bit_io/bit_reader_reverse.rs
- zstd/src/decoding/sequence_section_decoder.rs
- zstd/src/fse/fse_decoder.rs
⚠️ Performance Alert ⚠️
A possible performance regression was detected for benchmark 'structured-zstd vs C FFI'.
This commit's benchmark result is worse than the previous result, exceeding the 1.15 threshold.
| Benchmark suite | Current: 6193385 | Previous: d402315 | Ratio |
|---|---|---|---|
| compress/best/small-1k-random/matrix/c_ffi | 0.369 ms | 0.315 ms | 1.17 |
| compress/better/small-4k-log-lines/matrix/pure_rust | 0.155 ms | 0.124 ms | 1.25 |
| compress/better/small-4k-log-lines/matrix/c_ffi | 0.099 ms | 0.077 ms | 1.29 |
| compress/best/small-4k-log-lines/matrix/pure_rust | 0.262 ms | 0.202 ms | 1.30 |
| compress/best/small-4k-log-lines/matrix/c_ffi | 0.373 ms | 0.273 ms | 1.37 |
| compress/best/decodecorpus-z000033/matrix/pure_rust | 75.248 ms | 60.668 ms | 1.24 |
| compress/best/high-entropy-1m/matrix/c_ffi | 1.143 ms | 0.904 ms | 1.26 |
| compress/best/low-entropy-1m/matrix/c_ffi | 1.246 ms | 1.048 ms | 1.19 |
| compress/default/large-log-stream/matrix/c_ffi | 3.946 ms | 3.203 ms | 1.23 |
| decompress/better/high-entropy-1m/c_stream/matrix/c_ffi | 0.035 ms | 0.027 ms | 1.30 |
| decompress/best/high-entropy-1m/c_stream/matrix/c_ffi | 0.035 ms | 0.027 ms | 1.30 |
| compress-dict/better/small-4k-log-lines/matrix/c_ffi_with_dict | 0.004 ms | 0.003 ms | 1.33 |
This comment was automatically generated by workflow using github-action-benchmark.
CC: @polaz
…overage
- Add `debug_assert!(bits_consumed + n <= 64)` to `get_bits_unchecked` to catch caller violations in debug builds
- Force real refill boundary in test: consume 39 bits before batched `ensure_bits(26)`, triggering actual refill (39 + 26 = 65 > 64)
Summary
Implements the dual-state parallel decoding optimization from the C reference (`fse_decompress.c`) for the FSE sequence decoder, reducing per-iteration branch overhead in the hot decode loop.

- Add `ensure_bits()` / `get_bits_unchecked()` to `BitReaderReversed` for batched unchecked bit reads after a single refill check
- Add `update_state_fast()` to `FSEDecoder`: uses unchecked reads; the caller guarantees bit availability
- Restructure both sequence decode loops (`with_rle` and `without_rle`) to perform one `ensure_bits()` call covering all three FSE state updates per iteration, replacing three individual per-update conditional refill checks
- Add `debug_assert!` guards for the 56-bit budget invariant and unchecked read bounds
- Fix bench compile against rand 0.10 (`Rng` → `RngExt`)

Mechanism: max bits for 3 state updates = LL(9) + ML(9) + OF(8) = 26, well within the 56-bit post-refill guarantee. One branch replaces three.
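The restructured hot loop can be sketched as follows. The types and method bodies are simplified stand-ins for the crate's `FSEDecoder` and `BitReaderReversed`, with `get_bits_unchecked` returning a dummy value so the sketch stays self-contained:

```rust
// One ensure_bits() per iteration covers all three FSE state updates.
struct Reader {
    avail: u32,
}
impl Reader {
    fn ensure_bits(&mut self, n: u32) {
        if self.avail < n {
            self.avail = 56; // single refill check per iteration
        }
    }
    fn get_bits_unchecked(&mut self, n: u32) -> u64 {
        self.avail -= n; // caller guaranteed availability
        n as u64 // dummy value for the sketch
    }
}

struct Decoder {
    accuracy_log: u32,
    state: u64,
}
impl Decoder {
    // Fast path: the caller has already budgeted accuracy_log bits.
    fn update_state_fast(&mut self, br: &mut Reader) {
        self.state = br.get_bits_unchecked(self.accuracy_log);
    }
}

fn main() {
    let mut ll = Decoder { accuracy_log: 9, state: 0 };
    let mut ml = Decoder { accuracy_log: 9, state: 0 };
    let mut of = Decoder { accuracy_log: 8, state: 0 };
    let mut br = Reader { avail: 0 };

    // Per-iteration budget: LL(9) + ML(9) + OF(8) = 26 <= 56.
    let max_update_bits = ll.accuracy_log + ml.accuracy_log + of.accuracy_log;
    for _ in 0..3 {
        br.ensure_bits(max_update_bits); // one branch...
        ll.update_state_fast(&mut br); // ...then three unchecked reads
        ml.update_state_fast(&mut br);
        of.update_state_fast(&mut br);
    }
    assert_eq!((ll.state, ml.state, of.state), (9, 9, 8));
    assert!(br.avail <= 56);
}
```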
Not implemented: bulk table spreading (`u64` symbol writes from the C reference). Our `Entry` struct is 8 bytes, not byte-packed like C's `FSE_DTable`, so the `MEM_write64()` pattern does not apply.

Test Plan

- New unit test (`ensure_and_unchecked_match_get_bits`)
- Builds with `-D warnings`
- `cargo fmt --all -- --check` passes
- `decode_all` benchmark: ~3% improvement (4.93 ms → 4.79 ms median)

Closes #11