perf(decoding): SIMD HUF kernels with runtime dispatch #92
Conversation
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID:
📒 Files selected for processing (5)
📝 Walkthrough
Adds runtime-dispatched Huffman decode kernels (scalar and architecture-specific), packed table/state fields and batched decode APIs, updates literals decoder loops to use the new APIs with explicit bitstream validation, and adds a Criterion benchmark.

Changes
Sequence Diagram(s)

sequenceDiagram
participant Caller
participant HuffmanDecoder
participant CPUDetector as CPU Detector
participant Kernel as Selected Kernel
Caller->>HuffmanDecoder: HuffmanDecoder::new(table)
activate HuffmanDecoder
HuffmanDecoder->>CPUDetector: detect_huffman_decode_kernel()
activate CPUDetector
alt runtime detection
CPUDetector->>CPUDetector: is_x86_feature_detected / is_aarch64_feature_detected
else compile-time fallback
CPUDetector->>CPUDetector: cfg! selection
end
CPUDetector-->>HuffmanDecoder: selected Kernel
deactivate CPUDetector
HuffmanDecoder->>HuffmanDecoder: store kernel, packed_decode, state_mask
deactivate HuffmanDecoder
loop decode loop
Caller->>HuffmanDecoder: decode_symbol_and_advance() / decode4_symbols_and_num_bits()
activate HuffmanDecoder
HuffmanDecoder->>Kernel: dispatch to kernel impl
activate Kernel
alt SIMD Kernel (AVX2/VBMI2/NEON/SVE)
Kernel->>Kernel: vectorized extraction & packed_decode lookup
else BMI2 Kernel
Kernel->>Kernel: bit-extract (bzhi/pext) & derive symbols/bits
else Scalar Kernel
Kernel->>Kernel: table-driven decode & bits
end
Kernel-->>HuffmanDecoder: (symbols, bits) or (symbol, bits)
deactivate Kernel
HuffmanDecoder->>HuffmanDecoder: advance_state_by_bits(bits)
HuffmanDecoder-->>Caller: decoded symbols
deactivate HuffmanDecoder
end
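The detection flow in the diagram above can be sketched with a hypothetical detector. The enum variants and function name here are illustrative stand-ins, not the crate's actual API:

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum HuffmanDecodeKernel {
    Scalar,
    #[cfg(target_arch = "x86_64")]
    Bmi2,
    #[cfg(target_arch = "x86_64")]
    Avx2,
    #[cfg(target_arch = "aarch64")]
    Neon,
}

// Runtime probe on x86_64/aarch64, compile-time fallback elsewhere; the result
// is stored once at decoder construction so the hot loop never re-probes.
fn detect_huffman_decode_kernel() -> HuffmanDecodeKernel {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            return HuffmanDecodeKernel::Avx2;
        }
        if std::arch::is_x86_feature_detected!("bmi2") {
            return HuffmanDecodeKernel::Bmi2;
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return HuffmanDecodeKernel::Neon;
        }
    }
    HuffmanDecodeKernel::Scalar
}

fn main() {
    println!("selected kernel: {:?}", detect_huffman_decode_kernel());
}
```

Paying for feature detection once in `new()` rather than per symbol is the key design point: the decode loop then dispatches on a stored enum value.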
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Pull request overview
This PR adds runtime-selected Huffman (HUF) literal decode kernels and refactors the literals decode hot path to reduce per-symbol overhead, with a new Criterion benchmark to measure the impact.
Changes:
- Introduces HuffmanDecodeKernel runtime dispatch and fused “decode + state-advance” APIs in the HUF decoder.
- Adds a packed decode table and SIMD-oriented 4-stream lookup helpers used by the literals section decoder.
- Adds a new huf_decode_kernels Criterion benchmark and wires it into zstd/Cargo.toml.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| zstd/src/huff0/huff0_decoder.rs | Adds kernel detection/dispatch, packed decode table (packed_decode), state_mask, and SIMD/arch-specific 4-stream helpers (incl. SVE asm). |
| zstd/src/decoding/literals_section_decoder.rs | Switches literals decode loop to fused symbol decode + state advance, and uses 4-wide symbol/bit extraction helper. |
| zstd/Cargo.toml | Registers the new huf_decode_kernels benchmark target. |
| zstd/benches/huf_decode_kernels.rs | Adds a literals-heavy benchmark plus a corpus reference benchmark case. |
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/benches/huf_decode_kernels.rs`:
- Around line 28-45: Summary: The two benchmarks use different throughput bases
(corpus_src compressed size vs. literals_heavy uncompressed size), making their
reported Bytes/s incomparable; unify the basis. Fix: choose a single throughput
basis (either compressed input bytes or uncompressed output bytes) and apply it
to both benchmarks by calling group.throughput(...) with that chosen length; for
example, use the compressed size (compressed.len() or corpus_src.len()) for both
corpus and literals_heavy cases or use the uncompressed size
(literals_heavy.len() and decoded size for corpus) consistently. Locate and
update the throughput calls around the corpus_src, literals_heavy and compressed
variables (symbols: corpus_src, literals_heavy, compressed, group.throughput) so
both benches report Bytes/s on the same basis.
In `@zstd/src/huff0/huff0_decoder.rs`:
- Around line 405-417: The call to the unsafe function advance_state_x86_bmi2
inside the unsafe fn decode_symbol_and_advance_x86_bmi2 must be wrapped in an
explicit unsafe block to satisfy the unsafe_op_in_unsafe_fn lint; update
decode_symbol_and_advance_x86_bmi2 so the invocation of
self.advance_state_x86_bmi2(entry.num_bits, new_bits) is executed inside unsafe
{ … } and leave all other logic (reading entry, getting new_bits, assigning
self.state and returning entry.symbol) unchanged.
- Around line 252-310: The calls to SIMD intrinsics are made directly inside
unsafe functions which still require explicit unsafe blocks for each intrinsic;
wrap each intrinsic invocation (_mm_maskz_compress_epi8, _mm_storeu_si128,
_mm_i32gather_epi32) in explicit unsafe { ... } blocks to satisfy the
unsafe_op_in_unsafe_fn lint. Concretely, update
decode4_symbols_and_num_bits_vbmi2 to call _mm_maskz_compress_epi8 and both
_mm_storeu_si128 calls inside their own unsafe { } and update
decode4_symbols_and_num_bits_avx2 to call _mm_i32gather_epi32 and the
_mm_storeu_si128(packed) inside explicit unsafe { } blocks, leaving surrounding
logic intact. Ensure you reference the same local variables (packed,
symbols_bytes, bits_bytes, gathered) when moving the intrinsic calls into those
blocks.
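The `unsafe_op_in_unsafe_fn` pattern both comments above ask for can be shown with a generic example; the function and pointer here are illustrative, not the PR's intrinsics:

```rust
#![deny(unsafe_op_in_unsafe_fn)]

// Under this lint, the body of an `unsafe fn` is no longer an implicit unsafe
// block: every unsafe operation needs its own explicit `unsafe { ... }`, which
// is where a targeted SAFETY comment belongs.
unsafe fn read_first(ptr: *const u8) -> u8 {
    // SAFETY: caller guarantees `ptr` is valid for a one-byte read.
    unsafe { *ptr }
}

fn main() {
    let data = [7u8, 8, 9];
    // SAFETY: the pointer comes from a live slice of length 3.
    let v = unsafe { read_first(data.as_ptr()) };
    assert_eq!(v, 7);
    println!("{v}");
}
```

The same mechanical fix applies to each intrinsic call (`_mm_maskz_compress_epi8`, `_mm_storeu_si128`, `_mm_i32gather_epi32`): wrap the call site, leave the surrounding logic untouched.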
9611de6 to b2220ed (Compare)
Codecov Report
❌ Patch coverage is
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/huff0/huff0_decoder.rs`:
- Around line 431-443: The BMI2 intrinsics calls in advance_state_x86_bmi2 are
themselves unsafe and must be invoked inside explicit unsafe blocks to satisfy
the unsafe_op_in_unsafe_fn Clippy gate; wrap the _bzhi_u64 call and the
_bzhi_u32 call each in their own unsafe { ... } expression and keep the
surrounding logic the same (i.e., replace direct calls to _bzhi_u64(...) and
_bzhi_u32(...) with unsafe { _bzhi_u64(...) } and unsafe { _bzhi_u32(...) }
respectively within the advance_state_x86_bmi2 function).
- Around line 382-402: The inline SVE assembly block in the unsafe asm! call
that loads and splits packed_scalar into symbols_u32/bits_u32 incorrectly
declares options(preserves_flags) even though the whilelt instruction modifies
NZCV; remove preserves_flags from the options list so the asm options read
options(nostack) (or only the remaining valid options) in the asm! invocation
that references inptr, symptr, bitptr, lanes and uses z0/z1/z2/p0.
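The BMI2 comment above references `_bzhi_u64`/`_bzhi_u32`; a portable scalar sketch of what BZHI computes (the real kernel calls the intrinsic inside an explicit `unsafe` block, as requested):

```rust
// _bzhi_u64(x, n) zeroes all bits of x at positions >= n; an index of 64 or
// more leaves x unchanged, matching the hardware semantics.
fn bzhi_u64(x: u64, n: u32) -> u64 {
    if n >= 64 { x } else { x & ((1u64 << n) - 1) }
}

fn main() {
    // Keep only the low 4 bits of 0b1111_0110.
    assert_eq!(bzhi_u64(0b1111_0110, 4), 0b0110);
    // n = 0 clears everything; n >= 64 is a no-op.
    assert_eq!(bzhi_u64(u64::MAX, 0), 0);
    assert_eq!(bzhi_u64(u64::MAX, 64), u64::MAX);
    println!("ok");
}
```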
Actionable comments posted: 1
♻️ Duplicate comments (1)
zstd/src/huff0/huff0_decoder.rs (1)
214-243: ⚠️ Potential issue | 🔴 Critical: Enforce the shared-table invariant in release builds.
decode4_symbols_and_num_bits() is a safe API, but the AVX2 branch gathers from decoders[0].table.packed_decode for all four states. The current debug_assert! disappears in release, so passing mixed tables can turn this into an out-of-bounds gather instead of a clean fallback.
🛡️ Proposed fix
 pub(crate) fn decode4_symbols_and_num_bits(
     decoders: &[HuffmanDecoder<'_>; 4],
 ) -> ([u8; 4], [u8; 4]) {
     let kernel = decoders[0].kernel;
-    debug_assert!(decoders.iter().all(|d| d.kernel == kernel));
-    debug_assert!(
-        decoders
-            .iter()
-            .all(|d| core::ptr::eq(d.table, decoders[0].table))
-    );
+    let same_kernel = decoders.iter().all(|d| d.kernel == kernel);
+    let same_table = decoders
+        .iter()
+        .all(|d| core::ptr::eq(d.table, decoders[0].table));
+    debug_assert!(same_kernel);
+    debug_assert!(same_table);
+    if !(same_kernel && same_table) {
+        let mut symbols = [0_u8; 4];
+        let mut num_bits = [0_u8; 4];
+        let mut i = 0;
+        while i < 4 {
+            let (sym, bits) = decoders[i].decode_symbol_and_num_bits();
+            symbols[i] = sym;
+            num_bits[i] = bits;
+            i += 1;
+        }
+        return (symbols, num_bits);
+    }
     match kernel {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@zstd/src/huff0/huff0_decoder.rs` around lines 214 - 243, The function decode4_symbols_and_num_bits currently uses debug_assert! to ensure all decoders share the same table, but that disappears in release builds and can lead the AVX2 path (decode4_symbols_and_num_bits_avx2) to perform an out-of-bounds gather from decoders[0].table.packed_decode when tables differ; replace the debug-only check with a runtime check at the start of decode4_symbols_and_num_bits that verifies decoders.iter().all(|d| core::ptr::eq(d.table, decoders[0].table)) and if it fails call a safe, non-SIMD fallback (e.g., Self::decode4_symbols_and_num_bits_scalar or another existing scalar/fallback decoder) instead of proceeding into SIMD branches so the AVX2/NEON/VBMI2 branches never run on mixed tables. Ensure the check happens before the match on kernel so decode4_symbols_and_num_bits_avx2 / _vbmi2 / _neon / _sve are only reached when the invariant holds.
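The `core::ptr::eq` check suggested above can be exercised in isolation. This sketch uses hypothetical Table/Decoder stand-ins, not the crate's real types:

```rust
#[allow(dead_code)]
struct Table {
    packed_decode: Vec<u32>,
}

struct Decoder<'t> {
    table: &'t Table,
}

// Release-safe invariant check: all four decoders must reference the same
// table object before any shared-table SIMD gather is attempted.
fn same_table(decoders: &[Decoder<'_>; 4]) -> bool {
    decoders
        .iter()
        .all(|d| core::ptr::eq(d.table, decoders[0].table))
}

fn main() {
    let t1 = Table { packed_decode: vec![0; 16] };
    let t2 = Table { packed_decode: vec![0; 16] };
    let shared = [
        Decoder { table: &t1 },
        Decoder { table: &t1 },
        Decoder { table: &t1 },
        Decoder { table: &t1 },
    ];
    let mixed = [
        Decoder { table: &t1 },
        Decoder { table: &t1 },
        Decoder { table: &t2 },
        Decoder { table: &t1 },
    ];
    assert!(same_table(&shared));
    assert!(!same_table(&mixed)); // mixed tables must take the scalar fallback
    println!("shared={} mixed={}", same_table(&shared), same_table(&mixed));
}
```

Pointer identity (not structural equality) is the right comparison here: two distinct tables with identical contents would still be safe, but the cheap address check is sufficient and never false-positive for the out-of-bounds concern.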
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/huff0/huff0_decoder.rs`:
- Around line 949-1026: Add parity tests for the remaining SIMD kernels by
mirroring the existing BMI2 and AVX2 tests: create tests that are cfg-gated and
runtime-gated by feature detection (e.g., is_x86_feature_detected!("avx2") style
on x86, and the appropriate ARM/SVE checks), construct parallel HuffmanDecoder
arrays using the same table but set kernel to the VBMI2/NEON/SVE enum variants
from HuffmanDecodeKernel, then call HuffmanDecoder::decode4_symbols_and_num_bits
on both scalar and the SIMD kernel arrays and assert equality; place these
alongside bmi2_advance_matches_scalar_formula_when_available and
decode4_avx2_matches_scalar_when_available to ensure unsafe
architecture-specific kernels get the same byte-parity coverage.
---
Duplicate comments:
In `@zstd/src/huff0/huff0_decoder.rs`:
- Around line 214-243: The function decode4_symbols_and_num_bits currently uses
debug_assert! to ensure all decoders share the same table, but that disappears
in release builds and can lead the AVX2 path (decode4_symbols_and_num_bits_avx2)
to perform an out-of-bounds gather from decoders[0].table.packed_decode when
tables differ; replace the debug-only check with a runtime check at the start of
decode4_symbols_and_num_bits that verifies decoders.iter().all(|d|
core::ptr::eq(d.table, decoders[0].table)) and if it fails call a safe, non-SIMD
fallback (e.g., Self::decode4_symbols_and_num_bits_scalar or another existing
scalar/fallback decoder) instead of proceeding into SIMD branches so the
AVX2/NEON/VBMI2 branches never run on mixed tables. Ensure the check happens
before the match on kernel so decode4_symbols_and_num_bits_avx2 / _vbmi2 / _neon
/ _sve are only reached when the invariant holds.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
zstd/src/huff0/huff0_decoder.rs (1)
131-152: ⚠️ Potential issue | 🟡 Minor: Confirm: This is a breaking API change for the fuzz_exports feature build, but not for normal production code.
HuffmanDecoder is publicly re-exported via pub use huff0_decoder::*; (huff0/mod.rs:6) and is part of the public surface when fuzz_exports is enabled (lib.rs:48). The gating of decode_symbol() and next_state() with #[cfg(any(test, feature = "fuzz_exports"))] means these methods will not exist in normal builds, creating a breaking change for any fuzz code that was directly calling them.
However, production code is unaffected: the decoding path in literals_section_decoder.rs uses decode4_symbols_and_num_bits(), which is not gated. The direct method calls appear only in test code and internally within the private read_weights(). This is an intentional API restriction for fuzzing harnesses, not an accidental breakage of production APIs.
If the intent is to prevent external fuzz code from directly calling these methods (forcing use of the four-symbol batch interface), document this explicitly or provide shims. If external fuzz users may depend on direct access, restore pub visibility for the fuzz_exports build.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@zstd/src/huff0/huff0_decoder.rs` around lines 131 - 152, The methods decode_symbol and next_state on HuffmanDecoder are currently gated with #[cfg(any(test, feature = "fuzz_exports"))], which makes them unavailable in normal builds and breaks any external fuzz harnesses that expect direct access; either restore their availability for fuzz builds or provide documented shims: Option A — remove the cfg gating so pub fn decode_symbol(&mut self) -> u8 and pub fn next_state(&mut self, br: &mut BitReaderReversed<'_>) -> u8 are always compiled (keeping #[inline(always)]) so callers can use them, or Option B — keep them private for production but add explicit pub wrapper functions behind the feature flag (e.g., pub fn fuzz_decode_symbol(...) and pub fn fuzz_next_state(...)) gated with feature = "fuzz_exports" and document their intended use; update the re-export surface (HuffmanDecoder usage) accordingly so the fuzzing API is stable.
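The "Option B" shim pattern described above can be sketched as follows; the struct, field, wrapper name, and stand-in body are all hypothetical, only the feature-gating pattern is the point:

```rust
struct HuffmanDecoder {
    state: u64,
}

impl HuffmanDecoder {
    // Private in production builds; stand-in body for illustration only.
    fn decode_symbol(&mut self) -> u8 {
        (self.state & 0xff) as u8
    }

    /// Documented, stable entry point for fuzz harnesses; this wrapper is
    /// compiled only when the fuzz_exports feature is enabled.
    #[cfg(feature = "fuzz_exports")]
    pub fn fuzz_decode_symbol(&mut self) -> u8 {
        self.decode_symbol()
    }
}

fn main() {
    let mut d = HuffmanDecoder { state: 0x41 };
    assert_eq!(d.decode_symbol(), 0x41);
    println!("{:#x}", d.decode_symbol());
}
```

The advantage over ungating the raw methods is that the fuzz surface stays explicit: production callers cannot accidentally depend on the per-symbol path, while fuzz builds keep a named, documented hook.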
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/huff0/huff0_decoder.rs`:
- Around line 1098-1166: The two tests
(decode4_mixed_tables_falls_back_in_release and
decode4_mixed_kernels_falls_back_in_release) violate the debug_assert!
invariants in HuffmanDecoder::decode4_symbols_and_num_bits (the same_kernel /
same_table checks) and therefore panic under debug test builds; update the tests
so they only run in release-mode fallback scenarios: either split each into a
debug-only test that asserts the function panics (use #[cfg(debug_assertions)]
and #[should_panic] referencing the same test names) and a separate release-only
test that verifies the fallback behavior (use #[cfg(not(debug_assertions))] and
keep the current assertions), or simply gate the existing fallback tests with
#[cfg(not(debug_assertions))] so they do not run when debug assertions are
enabled. Ensure you reference HuffmanDecoder::decode4_symbols_and_num_bits and
the two test functions when making these changes.
---
Outside diff comments:
In `@zstd/src/huff0/huff0_decoder.rs`:
- Around line 131-152: The methods decode_symbol and next_state on
HuffmanDecoder are currently gated with #[cfg(any(test, feature =
"fuzz_exports"))], which makes them unavailable in normal builds and breaks any
external fuzz harnesses that expect direct access; either restore their
availability for fuzz builds or provide documented shims: Option A — remove the
cfg gating so pub fn decode_symbol(&mut self) -> u8 and pub fn next_state(&mut
self, br: &mut BitReaderReversed<'_>) -> u8 are always compiled (keeping
#[inline(always)]) so callers can use them, or Option B — keep them private for
production but add explicit pub wrapper functions behind the feature flag (e.g.,
pub fn fuzz_decode_symbol(...) and pub fn fuzz_next_state(...)) gated with
feature = "fuzz_exports" and document their intended use; update the re-export
surface (HuffmanDecoder usage) accordingly so the fuzzing API is stable.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/huff0/huff0_decoder.rs`:
- Around line 522-544: Remove the unnecessary #[target_feature(enable = "neon")]
and #[target_feature(enable = "sve")] annotations from
decode_symbol_and_advance_aarch64_neon and decode_symbol_and_advance_aarch64_sve
since they are scalar-equivalent; either (a) make these functions plain unsafe
fn (no target_feature) with the same body, or (b) consolidate by having
decode_symbol_and_advance dispatch to the scalar implementation (e.g., call the
existing scalar decode_symbol_and_advance method) and delete these duplicates;
if you intend to add SIMD later, replace the attribute with a brief "// TODO:
add SIMD optimization" comment inside the function to document intent.
@coderabbitai full review

✅ Actions performed: Full review triggered.

@coderabbitai full review

✅ Actions performed: Full review triggered.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/decoding/literals_section_decoder.rs`:
- Around line 240-243: The single-stream termination check currently allows
br.bits_remaining() <= - (scratch.table.max_num_bits as isize), which lets
malformed single-stream literals succeed while the 4-stream path rejects
non-exact termination with BitstreamReadMismatch; change the logic to require
exact termination (i.e., bits_remaining() == - (scratch.table.max_num_bits as
isize)) after decoding so the single-stream path mirrors the 4-stream behavior.
Locate the loop using br.bits_remaining() and scratch.table.max_num_bits and
adjust the loop/termination condition and any subsequent validation (the code
around decoder.decode_symbol_and_advance(&mut br) and the bytes_read update) to
return BitstreamReadMismatch unless the final bits_remaining equals the exact
negative max_num_bits.
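The exact-termination rule described above reduces to a small predicate. Names here are illustrative; the real check lives in the single-stream loop of literals_section_decoder.rs:

```rust
// After the final symbol, a reversed bitstream must sit exactly max_num_bits
// past the end of the data: anything else signals a malformed stream.
fn stream_terminated_exactly(bits_remaining: isize, max_num_bits: u8) -> bool {
    bits_remaining == -(max_num_bits as isize)
}

fn main() {
    let max_num_bits = 11_u8;
    assert!(stream_terminated_exactly(-11, max_num_bits));
    // Over-read (-13 would pass the old `<=` check) now fails, as does a
    // stream that stops short (-7); both should map to BitstreamReadMismatch.
    assert!(!stream_terminated_exactly(-13, max_num_bits));
    assert!(!stream_terminated_exactly(-7, max_num_bits));
    println!("exact termination enforced");
}
```

With `==` instead of `<=`, the single-stream path rejects the same malformed inputs the 4-stream path already rejects.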
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
zstd/src/decoding/literals_section_decoder.rs (1)
152-175: ⚠️ Potential issue | 🔴 Critical: Make Decode4Mode Copy before matching it in the loop.
Line 170 matches decode4_mode by value, so the enum is moved on the first iteration and this loop will not compile. Derive Copy/Clone, or match on a reference instead.
Suggested fix

- enum Decode4Mode {
+ #[derive(Clone, Copy)]
+ enum Decode4Mode {
      Unchecked,
      Checked,
  }

As per coding guidelines, **/*.rs: Code must pass cargo clippy -p structured-zstd --features hash,std,dict_builder -- -D warnings (excluding rustc-dep-of-std and fuzz_exports features).
Verify each finding against the current code and only fix it if needed. In `@zstd/src/decoding/literals_section_decoder.rs` around lines 152 - 175, The enum Decode4Mode is moved when matched by value inside the while loop (variable decode4_mode), causing a move on first iteration; fix by making Decode4Mode Copy (e.g., add #[derive(Copy, Clone)] to the Decode4Mode definition) so it can be copied each iteration, or alternatively change the match to use a reference (match &decode4_mode { Decode4Mode::Unchecked => ... Decode4Mode::Checked => ... }) so the enum is not moved; update the Decode4Mode declaration or the match in literals_section_decoder.rs around the decode4_symbols_and_num_bits call accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@zstd/src/decoding/literals_section_decoder.rs`:
- Around line 152-175: The enum Decode4Mode is moved when matched by value
inside the while loop (variable decode4_mode), causing a move on first
iteration; fix by making Decode4Mode Copy (e.g., add #[derive(Copy, Clone)] to
the Decode4Mode definition) so it can be copied each iteration, or alternatively
change the match to use a reference (match &decode4_mode {
Decode4Mode::Unchecked => ... Decode4Mode::Checked => ... }) so the enum is not
moved; update the Decode4Mode declaration or the match in
literals_section_decoder.rs around the decode4_symbols_and_num_bits call
accordingly.
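The suggested derive can be demonstrated in isolation: a fieldless Rust enum is not Copy unless derived, and with the derive in place the mode is trivially reusable across loop iterations (names mirror the review comment):

```rust
#[derive(Clone, Copy)]
enum Decode4Mode {
    Unchecked,
    Checked,
}

fn main() {
    let decode4_mode = Decode4Mode::Checked;
    let mut checked_iters = 0;
    for _ in 0..3 {
        // With derive(Copy), each match operates on a copy of the mode, so the
        // binding remains usable on every iteration.
        match decode4_mode {
            Decode4Mode::Unchecked => {}
            Decode4Mode::Checked => checked_iters += 1,
        }
    }
    assert_eq!(checked_iters, 3);
    println!("checked_iters={checked_iters}");
}
```

Deriving Copy on a two-variant fieldless enum costs nothing and also future-proofs call sites that pass the mode by value to helper functions.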
@coderabbitai Re outside-diff note on
The current definition:

enum Decode4Mode {
    Unchecked,
    Checked,
}

The expected fix should look like:

#[derive(Clone, Copy)]
enum Decode4Mode {
    Unchecked,
    Checked,
}

Could you double-check that the commit was pushed? Happy to re-verify once it lands.
@coderabbitai re-review

✅ Actions performed: Full review triggered.
Summary
huf_decode_kernels

Validation
Closes #66
Summary by CodeRabbit
Tests
Refactor
Performance
Scope note
decode_symbol and next_state are intentionally gated behind feature = "fuzz_exports" (decode_symbol_and_advance and batched decode helpers are always available).