
perf(decoding): dual-state interleaved FSE sequence decoding #55

Merged
polaz merged 5 commits into main from feat/#11-perf-fse-decoder--dual-state-parallel-decoding
Apr 3, 2026

Conversation


@polaz polaz commented Apr 3, 2026

Summary

Implements the dual-state parallel decoding optimization from the C reference (fse_decompress.c) for the FSE sequence decoder, reducing per-iteration branch overhead in the hot decode loop.

  • Add ensure_bits() / get_bits_unchecked() to BitReaderReversed for batched unchecked bit reads after a single refill check
  • Add update_state_fast() to FSEDecoder — uses unchecked reads, caller guarantees bit availability
  • Restructure both sequence decode loops (with_rle and without_rle) to perform one ensure_bits() call covering all three FSE state updates per iteration, replacing three individual per-update conditional refill checks
  • Add debug_assert! guards for the 56-bit budget invariant and unchecked read bounds
  • Fix pre-existing bench compile error (rand 0.10: Rng → RngExt)

Mechanism: Max bits for 3 state updates = LL(9) + ML(9) + OF(8) = 26, well within the 56-bit post-refill guarantee. One branch replaces three.

Not implemented: Bulk table spreading (u64 symbol writes from the C reference) — our Entry struct is 8 bytes, not byte-packed like C's FSE_DTable, so the MEM_write64() pattern does not apply.
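The mechanism above can be sketched with a toy model. `BitBuf`, its field names, and its byte-at-a-time refill are illustrative assumptions, not the crate's actual `BitReaderReversed` (which refills from the end of the frame); only the branch structure is the point: one `ensure_bits` check, then three unchecked reads.

```rust
// Minimal model of the batched-refill pattern. `BitBuf` is a simplified
// stand-in for BitReaderReversed; the real reader refills from the end
// of the zstd frame, but the branch structure is the same.
struct BitBuf {
    bits: u64,    // bit container; unread bits live in the low `avail` bits
    avail: u8,    // number of valid unread bits currently in `bits`
    src: Vec<u8>, // remaining input bytes, consumed from the back
}

impl BitBuf {
    /// One refill check covering up to `n` bits of later unchecked reads.
    fn ensure_bits(&mut self, n: u8) {
        debug_assert!(n <= 56, "56-bit post-refill budget");
        while self.avail < n {
            let byte = self.src.pop().unwrap_or(0);
            self.bits = (self.bits << 8) | u64::from(byte);
            self.avail += 8;
        }
    }

    /// Read `n` bits with no refill branch; the caller guarantees availability.
    fn get_bits_unchecked(&mut self, n: u8) -> u64 {
        debug_assert!(self.avail >= n, "caller violated the bit budget");
        self.avail -= n;
        (self.bits >> self.avail) & ((1u64 << n) - 1)
    }
}

/// Fast path vs. checked path: one branch replaces three per-read checks.
fn run_demo() -> ([u64; 3], [u64; 3]) {
    let src = vec![0xAC, 0x53, 0xF0, 0x0F];

    // Fast path: one ensure covering the worst case LL(9) + ML(9) + OF(8).
    let mut fast = BitBuf { bits: 0, avail: 0, src: src.clone() };
    fast.ensure_bits(9 + 9 + 8);
    let f = [
        fast.get_bits_unchecked(9),
        fast.get_bits_unchecked(9),
        fast.get_bits_unchecked(8),
    ];

    // Checked reference path: one refill branch per read.
    let mut slow = BitBuf { bits: 0, avail: 0, src };
    let mut s = [0u64; 3];
    for (i, n) in [9u8, 9, 8].into_iter().enumerate() {
        slow.ensure_bits(n);
        s[i] = slow.get_bits_unchecked(n);
    }
    (f, s)
}

fn main() {
    let (f, s) = run_demo();
    assert_eq!(f, s, "fast path must match the checked path");
    println!("fast = {f:?}, slow = {s:?}");
}
```

The design point is that the per-read `debug_assert!` compiles away in release builds, so the fast path carries no refill branch at all.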

Test Plan

  • All 173 unit tests pass (includes new ensure_and_unchecked_match_get_bits)
  • All 15 cross-validation tests pass (Rust ↔ C FFI)
  • All 8 doc-tests pass
  • Clippy clean on all targets (-D warnings)
  • cargo fmt --all -- --check passes
  • decode_all benchmark: ~3% improvement (4.93ms → 4.79ms median)

Closes #11

Summary by CodeRabbit

  • Performance
    • Faster decompression by batching bit-buffer checks and adding a fast unchecked bit-read path, reducing redundant refills and speeding up state updates during sequence decoding.
  • Tests
    • Expanded tests to validate behavior across refill boundaries and ensure fast-path reads match existing checked reads.

- Add ensure_bits() and get_bits_unchecked() to BitReaderReversed for
  batched unchecked bit reads after a single refill check
- Add update_state_fast() to FSEDecoder using unchecked reads
- Restructure both sequence decode loops (with/without RLE) to use one
  ensure_bits() call covering all three FSE state updates per iteration,
  replacing three individual per-update refill checks
- Fix pre-existing bench compile error (rand 0.10 Rng -> RngExt)

Closes #11
Copilot AI review requested due to automatic review settings April 3, 2026 07:47

coderabbitai Bot commented Apr 3, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4be1b155-2918-4abe-816a-2e7c1def0fc3

📥 Commits

Reviewing files that changed from the base of the PR and between 8826cdf and 6193385.

📒 Files selected for processing (1)
  • zstd/src/bit_io/bit_reader_reverse.rs

📝 Walkthrough


Added unchecked bit-read APIs and a fast FSE state-update; refactored sequence decoding to compute a single batched bit-refill budget and call fast updates for non-RLE decoders; adjusted a benchmark RNG trait import.

Changes

  • Benchmark RNG (zstd/benches/support/mod.rs): Swapped the rand::Rng import for rand::RngExt so fill(&mut bytes[..]) resolves to the correct trait method in benchmarks.
  • Bit Reader (zstd/src/bit_io/bit_reader_reverse.rs): Added pub fn ensure_bits(&mut self, n: u8) (batched pre-refill) and pub fn get_bits_unchecked(&mut self, n: u8) -> u64 (unchecked read + consume); added tests comparing checked vs. ensure+unchecked reads across refill boundaries.
  • FSE Decoder (zstd/src/fse/fse_decoder.rs): Added pub fn update_state_fast(&mut self, bits: &mut BitReaderReversed<'_>) that uses get_bits_unchecked to update state without per-call refill checks; the existing update_state remains.
  • Sequence Section Decoder (zstd/src/decoding/sequence_section_decoder.rs): Computes a combined max_update_bits for the non-RLE LL/ML/OF decoders, asserts it is at most 56 bits, calls br.ensure_bits(max_update_bits) once (when nonzero), then invokes update_state_fast for each decoder; removed commented-out debug prints.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant SeqDecoder as SequenceSectionDecoder
    participant BR as BitReaderReversed
    participant FSE as FSEDecoder
    SeqDecoder->>BR: compute max_update_bits; call ensure_bits(max_update_bits)
    Note right of BR: refill buffer if needed
    SeqDecoder->>FSE: call update_state_fast(bits) for LL
    FSE->>BR: get_bits_unchecked(num_bits) — no refill
    SeqDecoder->>FSE: call update_state_fast(bits) for ML
    FSE->>BR: get_bits_unchecked(num_bits) — no refill
    SeqDecoder->>FSE: call update_state_fast(bits) for OF
    FSE->>BR: get_bits_unchecked(num_bits) — no refill
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • structured-world/structured-zstd#38 — Modifies the same benchmark RNG helper (zstd/benches/support/mod.rs) affecting the RNG trait/import.
  • structured-world/structured-zstd#54 — Also alters the benchmark RNG helper and the byte-fill usage in zstd/benches/support/mod.rs.

Poem

🐰
I nibble bits in single hops,
One refill then the reading stops,
Fast states tumble, unchecked and spry,
Sequences sing as bytes go by,
A carrot-coded victory, hi! 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check: ✅ Passed. The title 'perf(decoding): dual-state interleaved FSE sequence decoding' directly describes the main optimization in this PR: batched refill with fast unchecked reads across the LL/ML/OF decoders.
  • Linked Issues Check: ✅ Passed. All key objectives from issue #11 are met: batched refill (ensure_bits + update_state_fast), a bounds-check-free fast path (get_bits_unchecked), 56-bit invariant assertions, a ~3% benchmark improvement, and passing roundtrip/corpus tests.
  • Out of Scope Changes Check: ✅ Passed. All changes align with issue #11: the BitReaderReversed enhancements, FSEDecoder::update_state_fast, the sequence-loop restructuring, and the bench dependency fix are all in scope.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, above the required 80.00% threshold.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Copilot AI left a comment


Pull request overview

This PR introduces a decoding hot-loop optimization for FSE sequence decoding by batching bitstream refill checks, reducing per-iteration branch overhead in the sequence state-update path.

Changes:

  • Added BitReaderReversed::ensure_bits() and BitReaderReversed::get_bits_unchecked() to enable batched unchecked bit reads after a single refill check.
  • Added FSEDecoder::update_state_fast() and updated sequence decoding loops (with/without RLE) to perform a single ensure_bits() call per iteration for all relevant state updates.
  • Updated bench support imports for rand 0.10 (Rng → RngExt).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

  • zstd/src/fse/fse_decoder.rs: Adds a fast-path state update that uses unchecked bit reads under a caller-enforced bit budget.
  • zstd/src/decoding/sequence_section_decoder.rs: Restructures the sequence decode loops to replace multiple per-update refill checks with a single batched ensure_bits() call.
  • zstd/src/bit_io/bit_reader_reverse.rs: Adds the batched refill precheck (ensure_bits) and the unchecked read helper (get_bits_unchecked).
  • zstd/benches/support/mod.rs: Fixes bench compilation against rand 0.10 by switching to RngExt.

Comment thread zstd/src/bit_io/bit_reader_reverse.rs
Comment thread zstd/benches/support/mod.rs Outdated

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/benches/support/mod.rs`:
- Line 3: Reorder the rand import so single identifiers come before nested
module paths and each group is alphabetically ordered: change the use statement
that currently lists rngs::SmallRng first to list RngExt and SeedableRng before
rngs::SmallRng (i.e., ensure the import uses RngExt, SeedableRng, then
rngs::SmallRng) so it satisfies cargo fmt ordering rules.

In `@zstd/src/bit_io/bit_reader_reverse.rs`:
- Around line 115-123: Add a debug-only precondition to get_bits_unchecked: use
debug_assert! to verify the requested n is within bounds (e.g. n <= 64) and that
the reader currently has at least n bits available before calling
peek_bits/consume (for example debug_assert!(self.available_bits() >= n as
usize)). If there is no available_bits()/bits_remaining() helper, add one based
on the reader's buffer state and use it in the debug_assert so misuse is caught
in debug builds while release behavior remains unchanged; reference
get_bits_unchecked, peek_bits, consume and ensure_bits in the check.

In `@zstd/src/decoding/sequence_section_decoder.rs`:
- Around line 72-85: Add a debug assertion enforcing the 56-bit budget invariant
immediately after computing max_update_bits: insert
debug_assert!(max_update_bits <= 56, "sequence section update bits exceed 56-bit
budget"); this uses the existing local max_update_bits derived from
scratch.literal_lengths.accuracy_log, scratch.match_lengths.accuracy_log and
scratch.offsets.accuracy_log and their corresponding RLE checks (scratch.ll_rle,
scratch.ml_rle, scratch.of_rle). Repeat the same debug_assert at the other
similar computation later in this file (the second max_update_bits calculation
around the 174-182 region) so both places validate the 56-bit table-log budget.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: e4016e20-76bb-41b1-ba93-fc6be8f12740

📥 Commits

Reviewing files that changed from the base of the PR and between a7a6ae8 and 2f2d608.

📒 Files selected for processing (4)
  • zstd/benches/support/mod.rs
  • zstd/src/bit_io/bit_reader_reverse.rs
  • zstd/src/decoding/sequence_section_decoder.rs
  • zstd/src/fse/fse_decoder.rs

Comment thread zstd/benches/support/mod.rs Outdated
Comment thread zstd/src/bit_io/bit_reader_reverse.rs
Comment thread zstd/src/decoding/sequence_section_decoder.rs
polaz added 2 commits April 3, 2026 10:55
- Add debug_assert!(n <= 56) to get_bits_unchecked
- Add debug_assert!(max_update_bits <= 56) in both sequence decode loops
- Add ensure_and_unchecked_match_get_bits test covering fast-path
  equivalence with get_bits across refill boundaries and n=0 edge case
- Update bench rand 0.10 doc: Rng::fill() → RngExt::fill()

codecov Bot commented Apr 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.



Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comment thread zstd/src/decoding/sequence_section_decoder.rs Outdated
Remove misleading "dual-state interleaving" wording from comments.
The optimization is a batched refill check covering three single-state
FSE decoders, not a dual-state interleaving pattern.

polaz commented Apr 3, 2026

@coderabbitai re-review


coderabbitai Bot commented Apr 3, 2026

@polaz Sure! I'll kick off a full re-review of PR #55 now.


🧠 Learnings used
Learnt from: polaz
Repo: structured-world/structured-zstd PR: 53
File: zstd/src/tests/roundtrip_integrity.rs:498-509
Timestamp: 2026-04-02T22:26:07.979Z
Learning: In `structured-zstd` (`zstd/src/tests/roundtrip_integrity.rs`), `best_level_does_not_regress_vs_better` uses a `<=` (not strict `<`) assertion because the `repeat_offset_fixture(b"HelloWorld", ...)` input is simple enough that HC saturates at both Better (16 candidates) and Best (32 candidates) search depths, producing identical compressed sizes (~30243 bytes). Strict `<` would be a false positive on this fixture. The strict `Best < Better` quality assertion lives in `cross_validation::best_level_beats_better_on_corpus_proxy` on the decodecorpus sample. Do not re-flag the `<=` as a weakened guard.
✅ Actions performed

Full review triggered.


Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comment thread zstd/src/bit_io/bit_reader_reverse.rs

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/bit_io/bit_reader_reverse.rs`:
- Around line 209-255: The test currently calls fast_br.ensure_bits(26) after
only 37 bits consumed, so the ensure stays on the no-refill path; to force a
real refill make fast_br consume the same extra two 9-bit reads (the r6 and r7
reads done on ref_br via get_bits(9) twice) before calling ensure_bits(26). In
other words, call fast_br.ensure_bits(9); fast_br.get_bits_unchecked(9) (or
fast_br.get_bits(9) to advance) twice to advance fast_br to 55 bits consumed,
then call fast_br.ensure_bits(26) and assert the subsequent
get_bits_unchecked(9/9/8) match ref_br; reference the BitReaderReversed methods
ensure_bits, get_bits_unchecked and get_bits.
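The follow-up commit on this branch resolves the finding above by consuming 39 bits before the batched ensure (39 + 26 = 65 > 64 forces a real refill). That boundary can be sketched with a toy model; `BitBuf`, its field names, and the `refills` counter (added so the refill path is observable) are hypothetical stand-ins, not the crate's `BitReaderReversed`.

```rust
// Toy reversed bit reader with a counter so a test can see whether
// ensure_bits actually took the refill path.
struct BitBuf {
    bits: u64,    // unread bits live in the low `avail` bits
    avail: u8,
    src: Vec<u8>, // consumed from the back
    refills: u32, // number of byte loads performed so far
}

impl BitBuf {
    fn ensure_bits(&mut self, n: u8) {
        debug_assert!(n <= 56);
        while self.avail < n {
            let byte = self.src.pop().unwrap_or(0);
            self.bits = (self.bits << 8) | u64::from(byte);
            self.avail += 8;
            self.refills += 1;
        }
    }

    fn get_bits_unchecked(&mut self, n: u8) -> u64 {
        debug_assert!(self.avail >= n, "bits consumed + n exceeded the container");
        self.avail -= n;
        (self.bits >> self.avail) & ((1u64 << n) - 1)
    }
}

/// Returns (byte loads for the first ensure, byte loads forced by the second).
fn demo() -> (u32, u32) {
    let mut br = BitBuf { bits: 0, avail: 0, src: vec![0xAB; 9], refills: 0 };
    // Consume 39 bits so only 40 - 39 = 1 unread bit remains buffered.
    br.ensure_bits(39);
    br.get_bits_unchecked(39);
    let before = br.refills;
    // 39 consumed + 26 requested = 65 > 64: this ensure MUST refill.
    br.ensure_bits(26);
    let during = br.refills - before;
    // The three unchecked reads after the batched ensure still succeed.
    br.get_bits_unchecked(9);
    br.get_bits_unchecked(9);
    br.get_bits_unchecked(8);
    (before, during)
}

fn main() {
    let (before, during) = demo();
    assert!(during > 0, "ensure_bits(26) after 39 consumed bits really refilled");
    println!("byte loads: first ensure = {before}, second ensure = {during}");
}
```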

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: b910c873-9f53-49b4-b062-f7339a700b20

📥 Commits

Reviewing files that changed from the base of the PR and between a7a6ae8 and 8826cdf.

📒 Files selected for processing (4)
  • zstd/benches/support/mod.rs
  • zstd/src/bit_io/bit_reader_reverse.rs
  • zstd/src/decoding/sequence_section_decoder.rs
  • zstd/src/fse/fse_decoder.rs

Comment thread zstd/src/bit_io/bit_reader_reverse.rs

@sw-release-bot sw-release-bot Bot left a comment


⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'structured-zstd vs C FFI'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.15.

| Benchmark suite | Current: 6193385 | Previous: d402315 | Ratio |
|---|---|---|---|
| compress/best/small-1k-random/matrix/c_ffi | 0.369 ms | 0.315 ms | 1.17 |
| compress/better/small-4k-log-lines/matrix/pure_rust | 0.155 ms | 0.124 ms | 1.25 |
| compress/better/small-4k-log-lines/matrix/c_ffi | 0.099 ms | 0.077 ms | 1.29 |
| compress/best/small-4k-log-lines/matrix/pure_rust | 0.262 ms | 0.202 ms | 1.30 |
| compress/best/small-4k-log-lines/matrix/c_ffi | 0.373 ms | 0.273 ms | 1.37 |
| compress/best/decodecorpus-z000033/matrix/pure_rust | 75.248 ms | 60.668 ms | 1.24 |
| compress/best/high-entropy-1m/matrix/c_ffi | 1.143 ms | 0.904 ms | 1.26 |
| compress/best/low-entropy-1m/matrix/c_ffi | 1.246 ms | 1.048 ms | 1.19 |
| compress/default/large-log-stream/matrix/c_ffi | 3.946 ms | 3.203 ms | 1.23 |
| decompress/better/high-entropy-1m/c_stream/matrix/c_ffi | 0.035 ms | 0.027 ms | 1.30 |
| decompress/best/high-entropy-1m/c_stream/matrix/c_ffi | 0.035 ms | 0.027 ms | 1.30 |
| compress-dict/better/small-4k-log-lines/matrix/c_ffi_with_dict | 0.004 ms | 0.003 ms | 1.33 |

This comment was automatically generated by workflow using github-action-benchmark.

CC: @polaz

…overage

- Add debug_assert!(bits_consumed + n <= 64) to get_bits_unchecked
  to catch caller violations in debug builds
- Force real refill boundary in test: consume 39 bits before batched
  ensure_bits(26), triggering actual refill (39+26=65 > 64)

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

@polaz polaz merged commit 7115945 into main Apr 3, 2026
14 of 15 checks passed
@polaz polaz deleted the feat/#11-perf-fse-decoder--dual-state-parallel-decoding branch April 3, 2026 09:11
@sw-release-bot sw-release-bot Bot mentioned this pull request Apr 3, 2026


Development

Successfully merging this pull request may close these issues.

perf: FSE decoder — dual-state parallel decoding

2 participants