Skip to content

perf(decoding): SIMD HUF kernels with runtime dispatch#92

Merged
polaz merged 31 commits intomainfrom
perf/#66-simd-huf-decode
Apr 9, 2026
Merged

perf(decoding): SIMD HUF kernels with runtime dispatch#92
polaz merged 31 commits intomainfrom
perf/#66-simd-huf-decode

Conversation

@polaz
Copy link
Copy Markdown
Member

@polaz polaz commented Apr 9, 2026

Summary

  • add runtime-dispatched HUF decode kernels: scalar, BMI2, AVX2, VBMI2, NEON, and SVE
  • fuse symbol decode + state advance in literals decode hot path
  • add packed decode table and 4-stream fast paths (AVX2 gather, VBMI2 unpack, NEON/SVE unpack)
  • add literals-heavy benchmark huf_decode_kernels

Validation

  • cargo build --manifest-path zstd/Cargo.toml
  • cargo clippy --manifest-path zstd/Cargo.toml --all-targets -- -D warnings
  • cargo nextest run --manifest-path zstd/Cargo.toml
  • cargo bench --manifest-path zstd/Cargo.toml --bench huf_decode_kernels -- --sample-size 10

Closes #66

Summary by CodeRabbit

  • Tests

    • Added a Criterion benchmark measuring decode throughput using a reference compressed sample and a deterministic 16 MiB literals‑heavy payload.
  • Refactor

    • Consolidated decode-and-advance logic and introduced runtime‑selectable, architecture‑aware Huffman decode kernels with safe scalar fallbacks.
  • Performance

    • Improved decoding throughput, CPU efficiency, and cross‑platform consistency.

Scope note

  • decode_symbol and next_state are intentionally gated behind feature = "fuzz_exports".
  • This is an intentional restriction of direct low-level API surface for fuzz harnesses; production decode paths are unchanged (decode_symbol_and_advance and batched decode helpers are always available).
  • So this is not a breaking change for normal/default production builds; it only narrows direct fuzz-facing entrypoints.

Copilot AI review requested due to automatic review settings April 9, 2026 13:00
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 9, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: dc027e5c-9066-44ec-b763-43adcece7105

📥 Commits

Reviewing files that changed from the base of the PR and between 209cebb and 5a63cbe.

📒 Files selected for processing (5)
  • zstd/Cargo.toml
  • zstd/benches/huf_decode_kernels.rs
  • zstd/src/decoding/literals_section_decoder.rs
  • zstd/src/huff0/huff0_decoder.rs
  • zstd/src/huff0/mod.rs

📝 Walkthrough

Walkthrough

Adds runtime-dispatched Huffman decode kernels (scalar and architecture-specific), packed table/state fields and batched decode APIs, updates literals decoder loops to use the new APIs with explicit bitstream validation, and adds a Criterion benchmark huf_decode_kernels for literals‑heavy workloads.

Changes

Cohort / File(s) Summary
Benchmarks
zstd/Cargo.toml, zstd/benches/huf_decode_kernels.rs
Added a new benchmark target huf_decode_kernels and a Criterion benchmark that measures decode throughput on an embedded reference .zst frame and a generated 16 MiB literals‑heavy compressed payload.
Literals decoder integration
zstd/src/decoding/literals_section_decoder.rs
Replaced per-stream decode_symbol() + next_state() sequences with kernel-dispatched APIs (decode_symbol_and_advance, decode_symbol_and_num_bits, advance_state_by_bits); updated 4‑stream and 1‑stream loops and added post-loop bitstream validation.
Huffman decode core & kernels
zstd/src/huff0/huff0_decoder.rs
Introduced HuffmanDecodeKernel and runtime kernel detection, stored kernel in HuffmanDecoder, added packed_decode and state_mask to HuffmanTable, restricted some helpers to fuzz exports, and added combined/batched APIs plus SIMD/BMI2/NEON/SVE kernel implementations and unit tests.
Tests / helpers
zstd/src/huff0/mod.rs
Updated round_trip test helper to use decode_symbol_and_advance(&mut br) in place of separate decode + state-advance calls.

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant HuffmanDecoder
    participant CPUDetector as CPU Detector
    participant Kernel as Selected Kernel

    Caller->>HuffmanDecoder: HuffmanDecoder::new(table)
    activate HuffmanDecoder
    HuffmanDecoder->>CPUDetector: detect_huffman_decode_kernel()
    activate CPUDetector
    alt runtime detection
        CPUDetector->>CPUDetector: is_x86_feature_detected / is_aarch64_feature_detected
    else compile-time fallback
        CPUDetector->>CPUDetector: cfg! selection
    end
    CPUDetector-->>HuffmanDecoder: selected Kernel
    deactivate CPUDetector
    HuffmanDecoder->>HuffmanDecoder: store kernel, packed_decode, state_mask
    deactivate HuffmanDecoder

    loop decode loop
        Caller->>HuffmanDecoder: decode_symbol_and_advance() / decode4_symbols_and_num_bits()
        activate HuffmanDecoder
        HuffmanDecoder->>Kernel: dispatch to kernel impl
        activate Kernel
        alt SIMD Kernel (AVX2/VBMI2/NEON/SVE)
            Kernel->>Kernel: vectorized extraction & packed_decode lookup
        else BMI2 Kernel
            Kernel->>Kernel: bit-extract (bzhi/pext) & derive symbols/bits
        else Scalar Kernel
            Kernel->>Kernel: table-driven decode & bits
        end
        Kernel-->>HuffmanDecoder: (symbols, bits) or (symbol, bits)
        deactivate Kernel
        HuffmanDecoder->>HuffmanDecoder: advance_state_by_bits(bits)
        HuffmanDecoder-->>Caller: decoded symbols
        deactivate HuffmanDecoder
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

🐰 I nibble bits where kernels spring,
Four lanes hop and SIMD sings,
Packed states snug, masks tucked in tight,
Symbols bound out, dancing light,
Hooray — decode zooms into the night!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title 'perf(decoding): SIMD HUF kernels with runtime dispatch' directly and concisely summarizes the main change: adding SIMD Huffman decode kernels with runtime CPU dispatch, which is the primary objective of this changeset.
Linked Issues check ✅ Passed The pull request successfully implements all core coding requirements from issue #66: runtime-dispatched HUF decode kernels (scalar, BMI2, AVX2, VBMI2, NEON, SVE), fused symbol decode/state advance, packed decode tables, 4-stream fast paths, and benchmarks, achieving byte-exact parity with error behavior matching.
Out of Scope Changes check ✅ Passed All code changes are directly aligned with issue #66 objectives: Huffman kernel implementations, literals decode hot-path optimization, benchmark additions, and feature-gated fuzz API exports are all explicitly planned in the linked issue.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch perf/#66-simd-huf-decode

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds runtime-selected Huffman (HUF) literal decode kernels and refactors the literals decode hot path to reduce per-symbol overhead, with a new Criterion benchmark to measure the impact.

Changes:

  • Introduces HuffmanDecodeKernel runtime dispatch and fused “decode + state-advance” APIs in the HUF decoder.
  • Adds a packed decode table and SIMD-oriented 4-stream lookup helpers used by the literals section decoder.
  • Adds a new huf_decode_kernels Criterion benchmark and wires it into zstd/Cargo.toml.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
zstd/src/huff0/huff0_decoder.rs Adds kernel detection/dispatch, packed decode table (packed_decode), state_mask, and SIMD/arch-specific 4-stream helpers (incl. SVE asm).
zstd/src/decoding/literals_section_decoder.rs Switches literals decode loop to fused symbol decode + state advance, and uses 4-wide symbol/bit extraction helper.
zstd/Cargo.toml Registers the new huf_decode_kernels benchmark target.
zstd/benches/huf_decode_kernels.rs Adds a literals-heavy benchmark plus a corpus reference benchmark case.

Comment thread zstd/src/huff0/huff0_decoder.rs Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/benches/huf_decode_kernels.rs`:
- Around line 28-45: Summary: The two benchmarks use different throughput bases
(corpus_src compressed size vs. literals_heavy uncompressed size), making their
reported Bytes/s incomparable; unify the basis. Fix: choose a single throughput
basis (either compressed input bytes or uncompressed output bytes) and apply it
to both benchmarks by calling group.throughput(...) with that chosen length; for
example, use the compressed size (compressed.len() or corpus_src.len()) for both
corpus and literals_heavy cases or use the uncompressed size
(literals_heavy.len() and decoded size for corpus) consistently. Locate and
update the throughput calls around the corpus_src, literals_heavy and compressed
variables (symbols: corpus_src, literals_heavy, compressed, group.throughput) so
both benches report Bytes/s on the same basis.

In `@zstd/src/huff0/huff0_decoder.rs`:
- Around line 405-417: The call to the unsafe function advance_state_x86_bmi2
inside the unsafe fn decode_symbol_and_advance_x86_bmi2 must be wrapped in an
explicit unsafe block to satisfy the unsafe_op_in_unsafe_fn lint; update
decode_symbol_and_advance_x86_bmi2 so the invocation of
self.advance_state_x86_bmi2(entry.num_bits, new_bits) is executed inside unsafe
{ … } and leave all other logic (reading entry, getting new_bits, assigning
self.state and returning entry.symbol) unchanged.
- Around line 252-310: The calls to SIMD intrinsics are made directly inside
unsafe functions which still require explicit unsafe blocks for each intrinsic;
wrap each intrinsic invocation (_mm_maskz_compress_epi8, _mm_storeu_si128,
_mm_i32gather_epi32) in explicit unsafe { ... } blocks to satisfy the
unsafe_op_in_unsafe_fn lint. Concretely, update
decode4_symbols_and_num_bits_vbmi2 to call _mm_maskz_compress_epi8 and both
_mm_storeu_si128 calls inside their own unsafe { } and update
decode4_symbols_and_num_bits_avx2 to call _mm_i32gather_epi32 and the
_mm_storeu_si128(packed) inside explicit unsafe { } blocks, leaving surrounding
logic intact. Ensure you reference the same local variables (packed,
symbols_bytes, bits_bytes, gathered) when moving the intrinsic calls into those
blocks.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 54887e6a-6d09-4232-9388-ee682f960c9b

📥 Commits

Reviewing files that changed from the base of the PR and between e101e88 and 9611de6.

📒 Files selected for processing (4)
  • zstd/Cargo.toml
  • zstd/benches/huf_decode_kernels.rs
  • zstd/src/decoding/literals_section_decoder.rs
  • zstd/src/huff0/huff0_decoder.rs

Comment thread zstd/benches/huf_decode_kernels.rs
Comment thread zstd/src/huff0/huff0_decoder.rs
Comment thread zstd/src/huff0/huff0_decoder.rs Outdated
@polaz polaz force-pushed the perf/#66-simd-huf-decode branch from 9611de6 to b2220ed Compare April 9, 2026 13:16
@polaz polaz requested a review from Copilot April 9, 2026 13:40
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 9, 2026

Codecov Report

❌ Patch coverage is 83.13492% with 85 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
zstd/src/huff0/huff0_decoder.rs 83.71% 78 Missing ⚠️
zstd/src/decoding/literals_section_decoder.rs 70.83% 7 Missing ⚠️

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

Comment thread zstd/src/huff0/huff0_decoder.rs Outdated
Comment thread zstd/src/huff0/huff0_decoder.rs Outdated
Comment thread zstd/src/huff0/huff0_decoder.rs
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4e86f7c7-c331-4c00-9483-54554d3c9db6

📥 Commits

Reviewing files that changed from the base of the PR and between b2220ed and 3f6ce20.

📒 Files selected for processing (1)
  • zstd/src/huff0/huff0_decoder.rs

Comment thread zstd/src/huff0/huff0_decoder.rs
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/huff0/huff0_decoder.rs`:
- Around line 431-443: The BMI2 intrinsics calls in advance_state_x86_bmi2 are
themselves unsafe and must be invoked inside explicit unsafe blocks to satisfy
the unsafe_op_in_unsafe_fn Clippy gate; wrap the _bzhi_u64 call and the
_bzhi_u32 call each in their own unsafe { ... } expression and keep the
surrounding logic the same (i.e., replace direct calls to _bzhi_u64(...) and
_bzhi_u32(...) with unsafe { _bzhi_u64(...) } and unsafe { _bzhi_u32(...) }
respectively within the advance_state_x86_bmi2 function).
- Around line 382-402: The inline SVE assembly block in the unsafe asm! call
that loads and splits packed_scalar into symbols_u32/bits_u32 incorrectly
declares options(preserves_flags) even though the whilelt instruction modifies
NZCV; remove preserves_flags from the options list so the asm options read
options(nostack) (or only the remaining valid options) in the asm! invocation
that references inptr, symptr, bitptr, lanes and uses z0/z1/z2/p0.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1bac5e24-7223-4234-80f7-bfa89d731abe

📥 Commits

Reviewing files that changed from the base of the PR and between 3f6ce20 and a4e22fd.

📒 Files selected for processing (1)
  • zstd/src/huff0/huff0_decoder.rs

Comment thread zstd/src/huff0/huff0_decoder.rs
Comment thread zstd/src/huff0/huff0_decoder.rs
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
zstd/src/huff0/huff0_decoder.rs (1)

214-243: ⚠️ Potential issue | 🔴 Critical

Enforce the shared-table invariant in release builds.

decode4_symbols_and_num_bits() is a safe API, but the AVX2 branch gathers from decoders[0].table.packed_decode for all four states. The current debug_assert! disappears in release, so passing mixed tables can turn this into an out-of-bounds gather instead of a clean fallback.

🛡️ Proposed fix
     pub(crate) fn decode4_symbols_and_num_bits(
         decoders: &[HuffmanDecoder<'_>; 4],
     ) -> ([u8; 4], [u8; 4]) {
         let kernel = decoders[0].kernel;
-        debug_assert!(decoders.iter().all(|d| d.kernel == kernel));
-        debug_assert!(
-            decoders
-                .iter()
-                .all(|d| core::ptr::eq(d.table, decoders[0].table))
-        );
+        let same_kernel = decoders.iter().all(|d| d.kernel == kernel);
+        let same_table = decoders
+            .iter()
+            .all(|d| core::ptr::eq(d.table, decoders[0].table));
+        debug_assert!(same_kernel);
+        debug_assert!(same_table);
+        if !(same_kernel && same_table) {
+            let mut symbols = [0_u8; 4];
+            let mut num_bits = [0_u8; 4];
+            let mut i = 0;
+            while i < 4 {
+                let (sym, bits) = decoders[i].decode_symbol_and_num_bits();
+                symbols[i] = sym;
+                num_bits[i] = bits;
+                i += 1;
+            }
+            return (symbols, num_bits);
+        }
         match kernel {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@zstd/src/huff0/huff0_decoder.rs` around lines 214 - 243, The function
decode4_symbols_and_num_bits currently uses debug_assert! to ensure all decoders
share the same table, but that disappears in release builds and can lead the
AVX2 path (decode4_symbols_and_num_bits_avx2) to perform an out-of-bounds gather
from decoders[0].table.packed_decode when tables differ; replace the debug-only
check with a runtime check at the start of decode4_symbols_and_num_bits that
verifies decoders.iter().all(|d| core::ptr::eq(d.table, decoders[0].table)) and
if it fails call a safe, non-SIMD fallback (e.g.,
Self::decode4_symbols_and_num_bits_scalar or another existing scalar/fallback
decoder) instead of proceeding into SIMD branches so the AVX2/NEON/VBMI2
branches never run on mixed tables. Ensure the check happens before the match on
kernel so decode4_symbols_and_num_bits_avx2 / _vbmi2 / _neon / _sve are only
reached when the invariant holds.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/huff0/huff0_decoder.rs`:
- Around line 949-1026: Add parity tests for the remaining SIMD kernels by
mirroring the existing BMI2 and AVX2 tests: create tests that are cfg-gated and
runtime-gated by feature detection (e.g., is_x86_feature_detected!("avx2") style
on x86, and the appropriate ARM/SVE checks), construct parallel HuffmanDecoder
arrays using the same table but set kernel to the VBMI2/NEON/SVE enum variants
from HuffmanDecodeKernel, then call HuffmanDecoder::decode4_symbols_and_num_bits
on both scalar and the SIMD kernel arrays and assert equality; place these
alongside bmi2_advance_matches_scalar_formula_when_available and
decode4_avx2_matches_scalar_when_available to ensure unsafe
architecture-specific kernels get the same byte-parity coverage.

---

Duplicate comments:
In `@zstd/src/huff0/huff0_decoder.rs`:
- Around line 214-243: The function decode4_symbols_and_num_bits currently uses
debug_assert! to ensure all decoders share the same table, but that disappears
in release builds and can lead the AVX2 path (decode4_symbols_and_num_bits_avx2)
to perform an out-of-bounds gather from decoders[0].table.packed_decode when
tables differ; replace the debug-only check with a runtime check at the start of
decode4_symbols_and_num_bits that verifies decoders.iter().all(|d|
core::ptr::eq(d.table, decoders[0].table)) and if it fails call a safe, non-SIMD
fallback (e.g., Self::decode4_symbols_and_num_bits_scalar or another existing
scalar/fallback decoder) instead of proceeding into SIMD branches so the
AVX2/NEON/VBMI2 branches never run on mixed tables. Ensure the check happens
before the match on kernel so decode4_symbols_and_num_bits_avx2 / _vbmi2 / _neon
/ _sve are only reached when the invariant holds.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: f4970499-738a-4278-88cc-0a127bae1e27

📥 Commits

Reviewing files that changed from the base of the PR and between a4e22fd and 1e3a769.

📒 Files selected for processing (1)
  • zstd/src/huff0/huff0_decoder.rs

Comment thread zstd/src/huff0/huff0_decoder.rs Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
zstd/src/huff0/huff0_decoder.rs (1)

131-152: ⚠️ Potential issue | 🟡 Minor

Confirm: This is a breaking API change for the fuzz_exports feature build, but not for normal production code.

HuffmanDecoder is publicly re-exported via pub use huff0_decoder::*; (huff0/mod.rs:6) and is part of the public surface when fuzz_exports is enabled (lib.rs:48). The gating of decode_symbol() and next_state() with #[cfg(any(test, feature = "fuzz_exports"))] means these methods will not exist in normal builds, creating a breaking change for any fuzz code that was directly calling them.

However, production code is unaffected: the decoding path in literals_section_decoder.rs uses decode4_symbols_and_num_bits(), which is not gated. The direct method calls appear only in test code and internally within private read_weights(). This is an intentional API restriction for fuzzing harnesses, not an accidental breakage of production APIs.

If the intent is to prevent external fuzz code from directly calling these methods (forcing use of the four-symbol batch interface), document this explicitly or provide shims. If external fuzz users may depend on direct access, restore pub visibility for the fuzz_exports build.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@zstd/src/huff0/huff0_decoder.rs` around lines 131 - 152, The methods
decode_symbol and next_state on HuffmanDecoder are currently gated with
#[cfg(any(test, feature = "fuzz_exports"))], which makes them unavailable in
normal builds and breaks any external fuzz harnesses that expect direct access;
either restore their availability for fuzz builds or provide documented shims:
Option A — remove the cfg gating so pub fn decode_symbol(&mut self) -> u8 and
pub fn next_state(&mut self, br: &mut BitReaderReversed<'_>) -> u8 are always
compiled (keeping #[inline(always)]) so callers can use them, or Option B — keep
them private for production but add explicit pub wrapper functions behind the
feature flag (e.g., pub fn fuzz_decode_symbol(...) and pub fn
fuzz_next_state(...)) gated with feature = "fuzz_exports" and document their
intended use; update the re-export surface (HuffmanDecoder usage) accordingly so
the fuzzing API is stable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/huff0/huff0_decoder.rs`:
- Around line 1098-1166: The two tests
(decode4_mixed_tables_falls_back_in_release and
decode4_mixed_kernels_falls_back_in_release) violate the debug_assert!
invariants in HuffmanDecoder::decode4_symbols_and_num_bits (the same_kernel /
same_table checks) and therefore panic under debug test builds; update the tests
so they only run in release-mode fallback scenarios: either split each into a
debug-only test that asserts the function panics (use #[cfg(debug_assertions)]
and #[should_panic] referencing the same test names) and a separate release-only
test that verifies the fallback behavior (use #[cfg(not(debug_assertions))] and
keep the current assertions), or simply gate the existing fallback tests with
#[cfg(not(debug_assertions))] so they do not run when debug assertions are
enabled. Ensure you reference HuffmanDecoder::decode4_symbols_and_num_bits and
the two test functions when making these changes.

---

Outside diff comments:
In `@zstd/src/huff0/huff0_decoder.rs`:
- Around line 131-152: The methods decode_symbol and next_state on
HuffmanDecoder are currently gated with #[cfg(any(test, feature =
"fuzz_exports"))], which makes them unavailable in normal builds and breaks any
external fuzz harnesses that expect direct access; either restore their
availability for fuzz builds or provide documented shims: Option A — remove the
cfg gating so pub fn decode_symbol(&mut self) -> u8 and pub fn next_state(&mut
self, br: &mut BitReaderReversed<'_>) -> u8 are always compiled (keeping
#[inline(always)]) so callers can use them, or Option B — keep them private for
production but add explicit pub wrapper functions behind the feature flag (e.g.,
pub fn fuzz_decode_symbol(...) and pub fn fuzz_next_state(...)) gated with
feature = "fuzz_exports" and document their intended use; update the re-export
surface (HuffmanDecoder usage) accordingly so the fuzzing API is stable.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 29623d07-d897-431c-b4c6-8d2b738ccfe7

📥 Commits

Reviewing files that changed from the base of the PR and between a4e22fd and 88e55a6.

📒 Files selected for processing (1)
  • zstd/src/huff0/huff0_decoder.rs

Comment thread zstd/src/huff0/huff0_decoder.rs
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comment thread zstd/src/huff0/huff0_decoder.rs
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/huff0/huff0_decoder.rs`:
- Around line 522-544: Remove the unnecessary #[target_feature(enable = "neon")]
and #[target_feature(enable = "sve")] annotations from
decode_symbol_and_advance_aarch64_neon and decode_symbol_and_advance_aarch64_sve
since they are scalar-equivalent; either (a) make these functions plain unsafe
fn (no target_feature) with the same body, or (b) consolidate by having
decode_symbol_and_advance dispatch to the scalar implementation (e.g., call the
existing scalar decode_symbol_and_advance method) and delete these duplicates;
if you intend to add SIMD later, replace the attribute with a brief "// TODO:
add SIMD optimization" comment inside the function to document intent.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: f182ef47-4b12-403c-a318-28c42d298230

📥 Commits

Reviewing files that changed from the base of the PR and between 6f5f374 and 58ce933.

📒 Files selected for processing (1)
  • zstd/src/huff0/huff0_decoder.rs

Comment thread zstd/src/huff0/huff0_decoder.rs
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Comment thread zstd/src/huff0/huff0_decoder.rs
Comment thread zstd/src/huff0/huff0_decoder.rs
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

@polaz
Copy link
Copy Markdown
Member Author

polaz commented Apr 9, 2026

@coderabbitai full review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 9, 2026

✅ Actions performed

Full review triggered.

@polaz
Copy link
Copy Markdown
Member Author

polaz commented Apr 9, 2026

@coderabbitai full review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 9, 2026

✅ Actions performed

Full review triggered.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/decoding/literals_section_decoder.rs`:
- Around line 240-243: The single-stream termination check currently allows
br.bits_remaining() <= - (scratch.table.max_num_bits as isize), which lets
malformed single-stream literals succeed while the 4-stream path rejects
non-exact termination with BitstreamReadMismatch; change the logic to require
exact termination (i.e., bits_remaining() == - (scratch.table.max_num_bits as
isize)) after decoding so the single-stream path mirrors the 4-stream behavior.
Locate the loop using br.bits_remaining() and scratch.table.max_num_bits and
adjust the loop/termination condition and any subsequent validation (the code
around decoder.decode_symbol_and_advance(&mut br) and the bytes_read update) to
return BitstreamReadMismatch unless the final bits_remaining equals the exact
negative max_num_bits.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 034b3820-de48-44a9-96d2-48574d2016a1

📥 Commits

Reviewing files that changed from the base of the PR and between 209cebb and 53ad4a2.

📒 Files selected for processing (5)
  • zstd/Cargo.toml
  • zstd/benches/huf_decode_kernels.rs
  • zstd/src/decoding/literals_section_decoder.rs
  • zstd/src/huff0/huff0_decoder.rs
  • zstd/src/huff0/mod.rs

Comment thread zstd/src/decoding/literals_section_decoder.rs
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
zstd/src/decoding/literals_section_decoder.rs (1)

152-175: ⚠️ Potential issue | 🔴 Critical

Make Decode4Mode Copy before matching it in the loop.

Line 170 matches decode4_mode by value, so the enum is moved on the first iteration and this loop will not compile. Derive Copy/Clone, or match on a reference instead.

Suggested fix
-        enum Decode4Mode {
+        #[derive(Clone, Copy)]
+        enum Decode4Mode {
             Unchecked,
             Checked,
         }
In Rust, are fieldless enums `Copy` by default? If a non-`Copy` enum is matched by value inside a loop, does that move the enum and prevent reuse on later iterations?

As per coding guidelines, **/*.rs: Code must pass cargo clippy -p structured-zstd --features hash,std,dict_builder -- -D warnings (excluding rustc-dep-of-std and fuzz_exports features).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@zstd/src/decoding/literals_section_decoder.rs` around lines 152 - 175, The
enum Decode4Mode is moved when matched by value inside the while loop (variable
decode4_mode), causing a move on first iteration; fix by making Decode4Mode Copy
(e.g., add #[derive(Copy, Clone)] to the Decode4Mode definition) so it can be
copied each iteration, or alternatively change the match to use a reference
(match &decode4_mode { Decode4Mode::Unchecked => ... Decode4Mode::Checked => ...
}) so the enum is not moved; update the Decode4Mode declaration or the match in
literals_section_decoder.rs around the decode4_symbols_and_num_bits call
accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@zstd/src/decoding/literals_section_decoder.rs`:
- Around line 152-175: The enum Decode4Mode is moved when matched by value
inside the while loop (variable decode4_mode), causing a move on first
iteration; fix by making Decode4Mode Copy (e.g., add #[derive(Copy, Clone)] to
the Decode4Mode definition) so it can be copied each iteration, or alternatively
change the match to use a reference (match &decode4_mode {
Decode4Mode::Unchecked => ... Decode4Mode::Checked => ... }) so the enum is not
moved; update the Decode4Mode declaration or the match in
literals_section_decoder.rs around the decode4_symbols_and_num_bits call
accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: b9d5fb3e-92e6-42b8-99c3-8baa4b9f3498

📥 Commits

Reviewing files that changed from the base of the PR and between 53ad4a2 and 5a63cbe.

📒 Files selected for processing (1)
  • zstd/src/decoding/literals_section_decoder.rs

@polaz polaz requested a review from Copilot April 9, 2026 19:32
@polaz
Copy link
Copy Markdown
Member Author

polaz commented Apr 9, 2026

@coderabbitai Re outside-diff note on Decode4Mode: fixed for clarity by deriving Clone, Copy in zstd/src/decoding/literals_section_decoder.rs (no semantic change). This should address the repeated move concern in the loop match.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 9, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify that Decode4Mode derives Clone, Copy in literals_section_decoder.rs
rg -n "Decode4Mode" zstd/src/decoding/literals_section_decoder.rs -A3 -B2

Length of output: 1086


🏁 Script executed:

#!/bin/bash
# Check for derive attribute just above the Decode4Mode enum definition
sed -n '148,160p' zstd/src/decoding/literals_section_decoder.rs

Length of output: 583


@polaz: The fix doesn't appear to be reflected at branch HEAD yet. Decode4Mode is currently defined without any derive attribute at Lines 152–155:

enum Decode4Mode {
    Unchecked,
    Checked,
}

The expected fix should look like:

#[derive(Clone, Copy)]
enum Decode4Mode {
    Unchecked,
    Checked,
}

Could you double-check that the commit was pushed? Happy to re-verify once it lands.


🧠 Learnings used
Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-08T16:00:30.438Z
Learning: In `zstd/src/encoding/match_generator.rs`, the `RowMatchGenerator` (used for `CompressionLevel::Level(4)` / `MatcherBackend::Row`) uses a stable 4-byte hash key, consistent with the 4-byte lookahead constraint shared by `HcMatchGenerator`. Previous-block tail positions are backfilled into the row tables before matching/skip begins (analogous to `backfill_boundary_positions` for the HC backend), and a regression test covers cross-boundary tail reuse. Do not flag missing backfill or key-width issues for the Row backend.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 48
File: zstd/src/encoding/match_generator.rs:1554-1559
Timestamp: 2026-04-02T12:03:04.619Z
Learning: In `zstd/src/encoding/match_generator.rs`, `HcMatchGenerator::insert_position` has an early-return guard at `abs_pos >= u32::MAX` that disables all further HC insertions for streams >4 GiB, degrading `CompressionLevel::Better` to repcodes-only. This is a known limitation tracked in issue `#51`; the proper fix (table rebasing) is deferred. A TODO comment and a limitation note on `CompressionLevel::Better` doc were added in PR `#48`. Do not re-flag this as a new bug.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 53
File: zstd/src/tests/roundtrip_integrity.rs:498-509
Timestamp: 2026-04-02T22:26:07.979Z
Learning: In `structured-zstd` (`zstd/src/tests/roundtrip_integrity.rs`), `best_level_does_not_regress_vs_better` uses a `<=` (not strict `<`) assertion because the `repeat_offset_fixture(b"HelloWorld", ...)` input is simple enough that HC saturates at both Better (16 candidates) and Best (32 candidates) search depths, producing identical compressed sizes (~30243 bytes). Strict `<` would be a false positive on this fixture. The strict `Best < Better` quality assertion lives in `cross_validation::best_level_beats_better_on_corpus_proxy` on the decodecorpus sample. Do not re-flag the `<=` as a weakened guard.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-08T21:51:56.626Z
Learning: In `zstd/src/decoding/ringbuffer.rs` (structured-world/structured-zstd), the test `copy_bytes_overshooting_preserves_prefix_for_runtime_chunk_lengths` (or equivalent) includes a misaligned extra-capacity case that uses `add(1)`-offset pointers and a `chunk_plus_one` length to exercise the rounded-up wildcopy overshoot path (where `copy_multiple > copy_at_least`) and validates prefix correctness. Do not flag the absence of a misaligned-pointer case in this test — it is present by design.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-08T21:51:51.926Z
Learning: In `zstd/src/decoding/ringbuffer.rs` (structured-world/structured-zstd), the test suite for `copy_bytes_overshooting` includes a misaligned extra-capacity case that exercises the rounded-up wildcopy path (where `copy_multiple > copy_at_least`) and validates prefix correctness for that scenario. Do not flag the absence of a misaligned/overshooting extra-capacity sub-case as missing test coverage.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-08T20:26:15.902Z
Learning: In `zstd/src/decoding/ringbuffer.rs` (structured-world/structured-zstd), the test `copy_bytes_overshooting_covers_all_copy_strategies` derives copy lengths from a helper `active_chunk_size_for_tests()` (not hardcoded values) so that single-chunk, multi-chunk, and sub-chunk (fallback) cases adapt to whichever SIMD backend width is active at runtime (AVX-512/AVX2/SSE2/NEON/scalar). Do not flag hardcoded test lengths as missing coverage for other backends — the test is backend-adaptive by design.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-01T06:49:34.233Z
Learning: In the `structured-zstd` codebase (`zstd/src/encoding/match_generator.rs`), `CompressionLevel::Default` uses the `DfastMatchGenerator` backend (with `pick_lazy_match` lazy evaluation), while `CompressionLevel::Better` uses the `HcMatchGenerator` (hash chain backend) which inserts positions before lazy evaluation inside its own `start_matching` loop. These two backends are independent; review comments about Dfast internals (e.g. position seeding order in `pick_lazy_match`) do not apply to the Better level's implementation path.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 48
File: zstd/src/encoding/match_generator.rs:1531-1535
Timestamp: 2026-04-01T15:33:17.229Z
Learning: In `zstd/src/encoding/match_generator.rs`, `HcMatchGenerator::insert_position` intentionally casts `abs_pos: usize` to `u32` (wrapping on streams > 4 GiB). This is safe by design: (1) the chain table slot is computed as `(pos_u32 as usize) & chain_mask`, so only the low bits matter and wrapping leaves the addressed slot unchanged; (2) `chain_candidates` filters any recovered `cur as usize` value against `history_abs_start` — a wrapped (small) value is always less than the current sliding-window start and is discarded. Do not flag this cast as a bug.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-06T01:40:24.378Z
Learning: In `zstd/benches/compare_ffi.rs` (structured-world/structured-zstd), Rust FastCOVER trains with the post-finalization content budget in both the `REPORT_DICT_TRAIN` emission path (around lines 208-225) and the Criterion benchmark path (around lines 266-280). Both paths were aligned in commit 8622344. Do not flag these ranges as using inconsistent budget values.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 48
File: zstd/src/encoding/match_generator.rs:1470-1490
Timestamp: 2026-04-02T07:27:15.870Z
Learning: In `zstd/src/encoding/match_generator.rs`, `HcMatchGenerator::pick_lazy_match` intentionally queries `find_best_match(abs_pos + 1, ...)` and `find_best_match(abs_pos + 2, ...)` *before* those positions are inserted into the hash-chain tables. This matches C zstd lazy evaluation semantics: pre-seeding those positions would allow a position to match against itself, producing invalid self-matches. Do not flag the absent pre-seeding as a bug.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 58
File: .github/workflows/ci.yml:31-32
Timestamp: 2026-04-03T11:24:17.021Z
Learning: In `structured-world/structured-zstd`, the CI Clippy lint job (`.github/workflows/ci.yml`) uses two separate steps: (1) the required gate: `cargo clippy -p structured-zstd --features hash,std,dict_builder -- -D warnings`, and (2) a separate bench-internals step: `cargo clippy -p structured-zstd --features hash,std,dict_builder,bench_internals -- -D warnings`. The `bench_internals` feature must NOT replace the primary gate — it runs as an additional step only.

Learnt from: CR
Repo: structured-world/structured-zstd PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-03-26T15:15:32.318Z
Learning: Applies to **/*.rs : Code must pass `cargo clippy -p structured-zstd --features hash,std,dict_builder -- -D warnings` (excluding `rustc-dep-of-std` and `fuzz_exports` features)

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 33
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-03-26T00:05:14.600Z
Learning: In structured-world/structured-zstd, the Clippy requirement is: `cargo clippy -p structured-zstd --features hash,std,dict_builder -- -D warnings`. The `rustc-dep-of-std` feature is excluded (internal feature for Rust stdlib builds only) and `fuzz_exports` is excluded (fuzzing-specific entry points validated separately from the regular lint gate). The `--all-features` flag should NOT be used.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-09T15:16:10.603Z
Learning: In structured-world/structured-zstd, the `fuzz_exports` API concern was resolved via "Option B": internal decode/encode helpers remain `pub(crate)` by default, and explicit public shims are exposed only behind a dedicated feature flag (e.g., `fuzz`) in a separate file. Do not flag these internal symbols as needing a wider public API — the feature-gated shim layer is the intentional design.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 73
File: zstd/src/dictionary/mod.rs:337-345
Timestamp: 2026-04-05T21:31:56.797Z
Learning: In `zstd/src/decoding/dictionary.rs` (structured-world/structured-zstd), `Dictionary::decode_dict()` parses the full zstd dictionary wire format: magic (4 bytes) → dict_id (4 bytes) → Huffman table (variable, consumed via `build_decoder`) → OF FSE table → ML FSE table → LL FSE table → 3 × u32 repeat offsets → raw content. This matches the layout produced by `finalize_raw_dict()` in `zstd/src/dictionary/mod.rs`. Do not flag these as mismatched layouts.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-09T06:17:40.408Z
Learning: In `zstd/src/decoding/ringbuffer.rs` (structured-world/structured-zstd), both `copy_with_checks` and `copy_with_nobranch_check` thread aggregate source/destination capacities (`m1_src_cap`, `m2_src_cap`, `f1_dst_cap`, `f2_dst_cap`) into `simd_copy::copy_bytes_overshooting` so the SIMD eligibility check (`min(src_len, dst_len) >= copy_multiple`) is applied uniformly. Do not flag the capacity parameters in `copy_with_nobranch_check` as unnecessary or inconsistent with `copy_with_checks`.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-09T15:16:17.570Z
Learning: In `zstd/src/huff0/huff0_decoder.rs` (structured-world/structured-zstd), `decode_symbol` and `next_state` are intentionally private (not `pub`). Public fuzz-only shims `fuzz_decode_symbol` and `fuzz_next_state` are exposed exclusively behind `#[cfg(feature = "fuzz_exports")]`. Do not flag `decode_symbol` or `next_state` as needing public visibility — the fuzz shim pattern is the intended API contract.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 44
File: zstd/src/encoding/frame_compressor.rs:0-0
Timestamp: 2026-03-28T22:55:01.577Z
Learning: In `zstd/src/encoding/frame_compressor.rs`, `FrameCompressor::set_dictionary` returns `Result<Option<Dictionary>, DictionaryDecodeError>` and enforces two fail-fast programmer-error contracts via `Err(...)` rather than panics: (1) `dictionary.id == 0` returns `DictionaryDecodeError::ZeroDictionaryId`; (2) any zero entry in `dictionary.offset_hist` returns `DictionaryDecodeError::ZeroRepeatOffsetInDictionary`. The fallible parsing paths (`Dictionary::decode_dict`, `Dictionary::from_raw_content`) apply the same rejections at input boundaries. Do not flag these as inconsistencies — the uniform `Result`-based contract is intentional.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-03T16:36:03.248Z
Learning: In `zstd/src/decoding/frame_decoder.rs`, `FrameDecoderState::reset()` appears asymmetric to `new()` in its pre-allocation, but the asymmetry is intentional: `decoder_scratch.reset(window_size)` internally calls `DecodeBuffer::reset()` which already reserves `window_size`. Therefore `reset()` only needs to call `scratch.buffer.reserve(frame_content_size)` on top of that for the FCS delta, rather than explicitly reserving `window_size` again as `new()` does. Do not re-flag this as a bug or asymmetry.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

@polaz
Copy link
Copy Markdown
Member Author

polaz commented Apr 9, 2026

@coderabbitai re-review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 9, 2026

@polaz: Sure! Kicking off a fresh re-review now. I'll verify the Decode4Mode derive fix and the unsafe { _bzhi_u64(...) } wrapper as part of this pass.


🧠 Learnings used
Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-08T16:00:30.438Z
Learning: In `zstd/src/encoding/match_generator.rs`, the `RowMatchGenerator` (used for `CompressionLevel::Level(4)` / `MatcherBackend::Row`) uses a stable 4-byte hash key, consistent with the 4-byte lookahead constraint shared by `HcMatchGenerator`. Previous-block tail positions are backfilled into the row tables before matching/skip begins (analogous to `backfill_boundary_positions` for the HC backend), and a regression test covers cross-boundary tail reuse. Do not flag missing backfill or key-width issues for the Row backend.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 48
File: zstd/src/encoding/match_generator.rs:1554-1559
Timestamp: 2026-04-02T12:03:04.619Z
Learning: In `zstd/src/encoding/match_generator.rs`, `HcMatchGenerator::insert_position` has an early-return guard at `abs_pos >= u32::MAX` that disables all further HC insertions for streams >4 GiB, degrading `CompressionLevel::Better` to repcodes-only. This is a known limitation tracked in issue `#51`; the proper fix (table rebasing) is deferred. A TODO comment and a limitation note on `CompressionLevel::Better` doc were added in PR `#48`. Do not re-flag this as a new bug.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 53
File: zstd/src/tests/roundtrip_integrity.rs:498-509
Timestamp: 2026-04-02T22:26:07.979Z
Learning: In `structured-zstd` (`zstd/src/tests/roundtrip_integrity.rs`), `best_level_does_not_regress_vs_better` uses a `<=` (not strict `<`) assertion because the `repeat_offset_fixture(b"HelloWorld", ...)` input is simple enough that HC saturates at both Better (16 candidates) and Best (32 candidates) search depths, producing identical compressed sizes (~30243 bytes). Strict `<` would be a false positive on this fixture. The strict `Best < Better` quality assertion lives in `cross_validation::best_level_beats_better_on_corpus_proxy` on the decodecorpus sample. Do not re-flag the `<=` as a weakened guard.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-08T21:51:56.626Z
Learning: In `zstd/src/decoding/ringbuffer.rs` (structured-world/structured-zstd), the test `copy_bytes_overshooting_preserves_prefix_for_runtime_chunk_lengths` (or equivalent) includes a misaligned extra-capacity case that uses `add(1)`-offset pointers and a `chunk_plus_one` length to exercise the rounded-up wildcopy overshoot path (where `copy_multiple > copy_at_least`) and validates prefix correctness. Do not flag the absence of a misaligned-pointer case in this test — it is present by design.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-08T21:51:51.926Z
Learning: In `zstd/src/decoding/ringbuffer.rs` (structured-world/structured-zstd), the test suite for `copy_bytes_overshooting` includes a misaligned extra-capacity case that exercises the rounded-up wildcopy path (where `copy_multiple > copy_at_least`) and validates prefix correctness for that scenario. Do not flag the absence of a misaligned/overshooting extra-capacity sub-case as missing test coverage.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-08T20:26:15.902Z
Learning: In `zstd/src/decoding/ringbuffer.rs` (structured-world/structured-zstd), the test `copy_bytes_overshooting_covers_all_copy_strategies` derives copy lengths from a helper `active_chunk_size_for_tests()` (not hardcoded values) so that single-chunk, multi-chunk, and sub-chunk (fallback) cases adapt to whichever SIMD backend width is active at runtime (AVX-512/AVX2/SSE2/NEON/scalar). Do not flag hardcoded test lengths as missing coverage for other backends — the test is backend-adaptive by design.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-01T06:49:34.233Z
Learning: In the `structured-zstd` codebase (`zstd/src/encoding/match_generator.rs`), `CompressionLevel::Default` uses the `DfastMatchGenerator` backend (with `pick_lazy_match` lazy evaluation), while `CompressionLevel::Better` uses the `HcMatchGenerator` (hash chain backend) which inserts positions before lazy evaluation inside its own `start_matching` loop. These two backends are independent; review comments about Dfast internals (e.g. position seeding order in `pick_lazy_match`) do not apply to the Better level's implementation path.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 48
File: zstd/src/encoding/match_generator.rs:1531-1535
Timestamp: 2026-04-01T15:33:17.229Z
Learning: In `zstd/src/encoding/match_generator.rs`, `HcMatchGenerator::insert_position` intentionally casts `abs_pos: usize` to `u32` (wrapping on streams > 4 GiB). This is safe by design: (1) the chain table slot is computed as `(pos_u32 as usize) & chain_mask`, so only the low bits matter and wrapping leaves the addressed slot unchanged; (2) `chain_candidates` filters any recovered `cur as usize` value against `history_abs_start` — a wrapped (small) value is always less than the current sliding-window start and is discarded. Do not flag this cast as a bug.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-06T01:40:24.378Z
Learning: In `zstd/benches/compare_ffi.rs` (structured-world/structured-zstd), Rust FastCOVER trains with the post-finalization content budget in both the `REPORT_DICT_TRAIN` emission path (around lines 208-225) and the Criterion benchmark path (around lines 266-280). Both paths were aligned in commit 8622344. Do not flag these ranges as using inconsistent budget values.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 48
File: zstd/src/encoding/match_generator.rs:1470-1490
Timestamp: 2026-04-02T07:27:15.870Z
Learning: In `zstd/src/encoding/match_generator.rs`, `HcMatchGenerator::pick_lazy_match` intentionally queries `find_best_match(abs_pos + 1, ...)` and `find_best_match(abs_pos + 2, ...)` *before* those positions are inserted into the hash-chain tables. This matches C zstd lazy evaluation semantics: pre-seeding those positions would allow a position to match against itself, producing invalid self-matches. Do not flag the absent pre-seeding as a bug.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 58
File: .github/workflows/ci.yml:31-32
Timestamp: 2026-04-03T11:24:17.021Z
Learning: In `structured-world/structured-zstd`, the CI Clippy lint job (`.github/workflows/ci.yml`) uses two separate steps: (1) the required gate: `cargo clippy -p structured-zstd --features hash,std,dict_builder -- -D warnings`, and (2) a separate bench-internals step: `cargo clippy -p structured-zstd --features hash,std,dict_builder,bench_internals -- -D warnings`. The `bench_internals` feature must NOT replace the primary gate — it runs as an additional step only.

Learnt from: CR
Repo: structured-world/structured-zstd PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-03-26T15:15:32.318Z
Learning: Applies to **/*.rs : Code must pass `cargo clippy -p structured-zstd --features hash,std,dict_builder -- -D warnings` (excluding `rustc-dep-of-std` and `fuzz_exports` features)

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-09T15:16:10.603Z
Learning: In structured-world/structured-zstd, the `fuzz_exports` API concern was resolved via "Option B": internal decode/encode helpers remain `pub(crate)` by default, and explicit public shims are exposed only behind a dedicated feature flag (e.g., `fuzz`) in a separate file. Do not flag these internal symbols as needing a wider public API — the feature-gated shim layer is the intentional design.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 33
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-03-26T00:05:14.600Z
Learning: In structured-world/structured-zstd, the Clippy requirement is: `cargo clippy -p structured-zstd --features hash,std,dict_builder -- -D warnings`. The `rustc-dep-of-std` feature is excluded (internal feature for Rust stdlib builds only) and `fuzz_exports` is excluded (fuzzing-specific entry points validated separately from the regular lint gate). The `--all-features` flag should NOT be used.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 73
File: zstd/src/dictionary/mod.rs:337-345
Timestamp: 2026-04-05T21:31:56.797Z
Learning: In `zstd/src/decoding/dictionary.rs` (structured-world/structured-zstd), `Dictionary::decode_dict()` parses the full zstd dictionary wire format: magic (4 bytes) → dict_id (4 bytes) → Huffman table (variable, consumed via `build_decoder`) → OF FSE table → ML FSE table → LL FSE table → 3 × u32 repeat offsets → raw content. This matches the layout produced by `finalize_raw_dict()` in `zstd/src/dictionary/mod.rs`. Do not flag these as mismatched layouts.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-09T06:17:40.408Z
Learning: In `zstd/src/decoding/ringbuffer.rs` (structured-world/structured-zstd), both `copy_with_checks` and `copy_with_nobranch_check` thread aggregate source/destination capacities (`m1_src_cap`, `m2_src_cap`, `f1_dst_cap`, `f2_dst_cap`) into `simd_copy::copy_bytes_overshooting` so the SIMD eligibility check (`min(src_len, dst_len) >= copy_multiple`) is applied uniformly. Do not flag the capacity parameters in `copy_with_nobranch_check` as unnecessary or inconsistent with `copy_with_checks`.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-09T15:16:17.570Z
Learning: In `zstd/src/huff0/huff0_decoder.rs` (structured-world/structured-zstd), `decode_symbol` and `next_state` are intentionally private (not `pub`). Public fuzz-only shims `fuzz_decode_symbol` and `fuzz_next_state` are exposed exclusively behind `#[cfg(feature = "fuzz_exports")]`. Do not flag `decode_symbol` or `next_state` as needing public visibility — the fuzz shim pattern is the intended API contract.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-03T16:36:03.248Z
Learning: In `zstd/src/decoding/frame_decoder.rs`, `FrameDecoderState::reset()` appears asymmetric to `new()` in its pre-allocation, but the asymmetry is intentional: `decoder_scratch.reset(window_size)` internally calls `DecodeBuffer::reset()` which already reserves `window_size`. Therefore `reset()` only needs to call `scratch.buffer.reserve(frame_content_size)` on top of that for the FCS delta, rather than explicitly reserving `window_size` again as `new()` does. Do not re-flag this as a bug or asymmetry.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 44
File: zstd/src/encoding/frame_compressor.rs:0-0
Timestamp: 2026-03-28T22:55:01.577Z
Learning: In `zstd/src/encoding/frame_compressor.rs`, `FrameCompressor::set_dictionary` returns `Result<Option<Dictionary>, DictionaryDecodeError>` and enforces two fail-fast programmer-error contracts via `Err(...)` rather than panics: (1) `dictionary.id == 0` returns `DictionaryDecodeError::ZeroDictionaryId`; (2) any zero entry in `dictionary.offset_hist` returns `DictionaryDecodeError::ZeroRepeatOffsetInDictionary`. The fallible parsing paths (`Dictionary::decode_dict`, `Dictionary::from_raw_content`) apply the same rejections at input boundaries. Do not flag these as inconsistencies — the uniform `Result`-based contract is intentional.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-03T17:39:58.627Z
Learning: In `zstd/src/decoding/frame_decoder.rs`, `FrameDecoderState::reset()` intentionally does NOT reserve the frame-content-size (FCS) delta beyond `window_size`. Two cases cover the design space: (1) multi-segment frames where FCS >> window_size — the delta can reach ~99 MiB, and eagerly reserving it would penalize streaming callers; (2) single-segment frames where window_size == FCS — the delta is always 0 so there is nothing to reserve. Incremental growth during decoding is handled by the per-block `MAX_BLOCK_SIZE` (128 KiB) reserve in `execute_sequences`. Do not flag the absence of an FCS-delta reserve in `reset()` as a missing pre-allocation.
✅ Actions performed

Full review triggered.

@polaz polaz merged commit 786b00b into main Apr 9, 2026
17 of 18 checks passed
@polaz polaz deleted the perf/#66-simd-huf-decode branch April 9, 2026 20:12
@sw-release-bot sw-release-bot Bot mentioned this pull request Apr 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

perf(decoding): SIMD HUF decode kernels with runtime CPU dispatch

2 participants