perf(fse): pack decoder entries and align decode tables by polaz · Pull Request #76 · structured-world/structured-zstd

polaz · 2026-04-06T16:10:18Z

Summary

- switch FSE decoder entry to a packed 4-byte layout (new_state: u16, symbol: u8, num_bits: u8) with explicit layout assertions
- align LL/ML/OF FSE table containers in decoder scratch to cache-line boundaries (64B, 128B on aarch64) to reduce cross-table placement effects in the hot loop
- refactor table build to spread symbols via a reusable scratch buffer and bulk-write two entries per 64-bit store on little-endian targets
- add explicit accuracy_log <= 16 validation and wide next-state arithmetic in decode updates to avoid truncation/panic paths
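
As a rough illustration of the packed layout in the first bullet, the sketch below declares a 4-byte entry and checks its size and field offsets at compile time. The field names come from the summary; the derives, the constant assertions, and the `entries_per_line` helper are assumptions made for this example and are not taken from the PR's code.

```rust
/// Hypothetical stand-in for the decoder entry described in the summary.
/// `repr(C)` fixes the field order, so the layout is u16 + u8 + u8 = 4 bytes.
#[repr(C)]
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct Entry {
    pub new_state: u16,
    pub symbol: u8,
    pub num_bits: u8,
}

// Compile-time layout checks: the build fails if the layout ever drifts.
const _: () = assert!(std::mem::size_of::<Entry>() == 4);
const _: () = assert!(std::mem::offset_of!(Entry, new_state) == 0);
const _: () = assert!(std::mem::offset_of!(Entry, symbol) == 2);
const _: () = assert!(std::mem::offset_of!(Entry, num_bits) == 3);

/// Illustrative helper (not in the PR): how many entries fit in one cache line.
pub const fn entries_per_line(line_bytes: usize) -> usize {
    line_bytes / std::mem::size_of::<Entry>()
}
```

With a 4-byte entry, a 64-byte cache line holds 16 entries, compared with 8 for an 8-byte entry.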

Validation

cargo nextest run --workspace
cargo build --workspace
cargo clippy -p structured-zstd --all-targets --features hash,std,dict_builder -- -D warnings

Closes #56

Summary by CodeRabbit

Bug Fixes
- Enforced a maximum accuracy parameter and surfaced explicit errors for oversized or invalid decompression settings.
Refactor / Performance
- Improved memory alignment and compacted decoder entry layout for safer, more portable behavior.
- Optimized symbol placement and bulk-copying to accelerate decoding on little-endian platforms.
Tests
- Added and updated tests validating decoder layout, entry size, and accuracy-limit enforcement.

coderabbitai · 2026-04-06T16:10:33Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: c08086b4-8f8c-47f1-836f-a28e66a928c5

📥 Commits

Reviewing files that changed from the base of the PR and between 71708e5 and 61e4ac1.

📒 Files selected for processing (4)

zstd/src/decoding/scratch.rs
zstd/src/decoding/sequence_section_decoder.rs
zstd/src/fse/fse_decoder.rs
zstd/src/fse/mod.rs

📝 Walkthrough

Replaces FSE decode entries with a packed 4-byte layout (new_state: u16, symbol: u8, num_bits: u8), adds an aligned FSE table wrapper, changes table build to a two-phase symbol spread with endian-aware bulk writes on little-endian, clamps accuracy log to 16, and updates tests and scratch initialization to use the new layout.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Core FSE entry & decoder `zstd/src/fse/fse_decoder.rs` | Replaced `Entry.base_line: u32` with `Entry.new_state: u16` and added `#[repr(C)]`. Refactored decode state math to index via `new_state`, added `ENTRY_MAX_ACCURACY_LOG = 16`, introduced `symbol_spread_buffer` and `copy_symbols_into_decode`, implemented endian-aware bulk symbol writes (little-endian uses unaligned `u64` writes; fallback per-entry otherwise), and adjusted num_bits / new_state assignment timing and error checks. |
| Aligned table wrapper & scratch changes `zstd/src/decoding/scratch.rs` | Added `pub struct AlignedFSETable(FSETable)` with architecture-dependent alignment (`repr(align(128))` on aarch64, else `repr(align(64))`), plus `Deref`/`DerefMut` and a `new` constructor. Switched `FSEScratch` fields (`offsets`, `literal_lengths`, `match_lengths`) to `AlignedFSETable` and updated constructors/init paths accordingly. |
| Tests & call sites `zstd/src/decoding/sequence_section_decoder.rs`, `zstd/src/fse/mod.rs` | Updated tests to reference `table.decode[idx].new_state` instead of `base_line`. Added unit tests verifying `Entry` layout/size/field offsets, enforcing `AccLogTooBig` when accuracy log > 16, and ensuring oversized build errors propagate. Minor test helper updates to compare encoder baseline against `dec_state.new_state`. |
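
A minimal sketch of the aligned wrapper described in the second row, assuming a placeholder table type. `FseTable` and its fields are hypothetical here; only the `repr(align)` values, the `Deref`/`DerefMut` impls, and the `new` constructor follow the summary.

```rust
use std::ops::{Deref, DerefMut};

/// Hypothetical placeholder for the real table type.
#[derive(Default)]
pub struct FseTable {
    pub decode: Vec<u32>,
    pub accuracy_log: u8,
}

/// Newtype whose alignment is 128 bytes on aarch64 and 64 bytes elsewhere.
/// Note: this aligns the wrapper struct itself (the table "container"),
/// not the heap buffer owned by the inner `Vec`.
#[cfg_attr(target_arch = "aarch64", repr(align(128)))]
#[cfg_attr(not(target_arch = "aarch64"), repr(align(64)))]
#[derive(Default)]
pub struct AlignedFseTable(FseTable);

impl AlignedFseTable {
    pub fn new(inner: FseTable) -> Self {
        Self(inner)
    }
}

impl Deref for AlignedFseTable {
    type Target = FseTable;
    fn deref(&self) -> &FseTable {
        &self.0
    }
}

impl DerefMut for AlignedFseTable {
    fn deref_mut(&mut self) -> &mut FseTable {
        &mut self.0
    }
}
```

Because of `Deref`, call sites can keep using the wrapper as if it were the inner table, which is presumably why the scratch fields could be switched without touching most users.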

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Builder as Builder (table builder)
    participant FSETable as AlignedFSETable / FSETable
    participant Memory as Memory (bulk write)
    participant Decoder as FSEDecoder

    Builder->>Builder: compute symbol distribution & table_symbols
    Builder->>Memory: bulk-copy symbols (little-endian: u64 writes)
    Memory-->>FSETable: symbols written into decode[] region
    Builder->>FSETable: fill new_state and num_bits fields
    Decoder->>FSETable: read entry = decode[state_index]
    Decoder->>Decoder: next_state = entry.new_state + bits_read
    Decoder->>Decoder: emit entry.symbol
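
The last three steps of the diagram can be sketched as a single function. This is an illustration under assumptions: the `Entry` type is a local stand-in, `decode_step` is a hypothetical name, and the caller is assumed to have already read `num_bits` bits from the bitstream.

```rust
/// Local stand-in for the packed decoder entry.
#[derive(Clone, Copy, Debug)]
pub struct Entry {
    pub new_state: u16,
    pub symbol: u8,
    pub num_bits: u8,
}

/// Reads the entry for `state`, returns its symbol and the next state index.
/// `low_bits` is the value of the `entry.num_bits` bits read from the stream.
pub fn decode_step(table: &[Entry], state: usize, low_bits: u32) -> (u8, usize) {
    let entry = table[state];
    debug_assert!(low_bits < (1u32 << entry.num_bits));
    // Widen before adding so the sum cannot wrap in u16 arithmetic.
    let next_state = entry.new_state as usize + low_bits as usize;
    (entry.symbol, next_state)
}
```

Doing the addition in `usize` rather than `u16` mirrors the "wide next-state arithmetic" mentioned in the PR summary.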

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

perf(decoding): dual-state interleaved FSE sequence decoding #55 — Overlaps FSE decoder/table and state-update paths; likely touches the same Entry/layout and decode-state changes.
feat(encoding): add dictionary compression support #44 — Modifies FSE decoder internals; may conflict with new packed entry/table alignment changes.

Poem

🐰 I packed the entries tight and small,
Aligned their hops to cache-line wall,
I spread the symbols, wrote them fast,
New_state leaps — the bits are cast,
I nibble speed, and bounce with joy 🥕

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'perf(fse): pack decoder entries and align decode tables' directly and accurately summarizes the main changes: packing FSE decoder entries from 8 to 4 bytes and implementing cache-line alignment for decode tables. |
| Linked Issues check | ✅ Passed | The pull request implements all coding requirements from issue `#56`: packed 4-byte Entry struct with new_state/symbol/num_bits fields, updated decode hot paths, cache-line alignment (64/128 bytes), accuracy_log validation (<=16), and bulk table spreading with symbol buffer. Tests and safety invariants are updated accordingly. |
| Out of Scope Changes check | ✅ Passed | All changes are directly aligned with issue `#56` requirements: Entry field changes, FSETable alignment wrapper, build_decoding_table refactoring with symbol spreading, accuracy_log validation, and test updates. No unrelated modifications detected. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

codecov · 2026-04-06T16:12:40Z

Codecov Report

❌ Patch coverage is 94.26230% with 7 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| zstd/src/fse/fse_decoder.rs | 91.25% | 7 Missing ⚠️ |

Copilot

Pull request overview

This PR optimizes the Zstd FSE decoder’s table representation and construction to improve cache efficiency and speed up decode-table building in hot paths.

Changes:

- Refactors fse_decoder::Entry to a packed 4-byte layout and updates state-advance logic accordingly.
- Refactors decode table building to spread symbols into a temporary byte buffer and bulk-write into the decode table (little-endian).
- Introduces a cache-line-aligned wrapper type for the three sequence FSE tables (LL/ML/OF) in decoder scratch space.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| zstd/src/fse/mod.rs | Adds layout assertions for packed `Entry` and updates table equivalence test to use `new_state`. |
| zstd/src/fse/fse_decoder.rs | Implements packed `Entry`, updates state transitions, and refactors table build + symbol copy using bulk stores. |
| zstd/src/decoding/sequence_section_decoder.rs | Updates tests/diagnostics to reference `new_state` instead of `base_line`. |
| zstd/src/decoding/scratch.rs | Wraps LL/ML/OF `FSETable`s in an aligned newtype to influence placement in scratch. |

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/fse/fse_decoder.rs`:
- Around line 48-49: The decoder table can overflow Entry::new_state (u16) when
accuracy_log > 16; update the public builders build_from_probabilities and
build_decoder to validate accuracy_log (or max_log) is ≤ 16 up front and return
FSETableError::AccLogTooBig instead of allowing later panics, and in
build_decoding_table replace the assert that new_state fits u16 with an explicit
check that returns FSETableError::AccLogTooBig if exceeded; ensure callers
cannot construct a table with entries whose new_state + add would index out of
bounds in the decode paths that use self.state.new_state + add (the decode logic
that indexes self.table.decode must only run when accuracy_log ≤ 16 and the
table was built with that constraint).
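
The up-front check requested in this comment might look roughly like the sketch below. The error enum, its fields, and the `checked_table_size` function are hypothetical names for illustration; only the limit of 16 and the `AccLogTooBig` variant name come from the review text.

```rust
/// Largest accuracy log whose state indices still fit in a `u16`.
pub const ENTRY_MAX_ACCURACY_LOG: u8 = 16;

/// Hypothetical error type; the real error enum may carry different data.
#[derive(Debug, PartialEq, Eq)]
pub enum SketchError {
    AccLogTooBig { got: u8, max: u8 },
}

/// Validates the accuracy log before any table memory is touched and
/// returns the table size (2^accuracy_log) on success.
pub fn checked_table_size(accuracy_log: u8) -> Result<usize, SketchError> {
    if accuracy_log > ENTRY_MAX_ACCURACY_LOG {
        return Err(SketchError::AccLogTooBig {
            got: accuracy_log,
            max: ENTRY_MAX_ACCURACY_LOG,
        });
    }
    Ok(1usize << accuracy_log)
}
```

With an accuracy log of 16 the table has 65536 states, so the largest state index is 65535, which is exactly `u16::MAX`; one more bit would no longer fit in `new_state`.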

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7193e4e9-3146-45bc-83cb-b29dcf2ec11f

📥 Commits

Reviewing files that changed from the base of the PR and between 71708e5 and 1a69aab.

📒 Files selected for processing (4)

zstd/src/decoding/scratch.rs
zstd/src/decoding/sequence_section_decoder.rs
zstd/src/fse/fse_decoder.rs
zstd/src/fse/mod.rs

sw-release-bot

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'structured-zstd vs C FFI'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.15.

| Benchmark suite | Current: `25be69c` | Previous: `71708e5` | Ratio |
|---|---|---|---|
| `compress/better/small-1k-random/matrix/pure_rust` | `0.182` ms | `0.157` ms | `1.16` |
| `compress/better/small-1k-random/matrix/c_ffi` | `0.108` ms | `0.092` ms | `1.17` |
| `compress/best/small-1k-random/matrix/pure_rust` | `0.3` ms | `0.256` ms | `1.17` |
| `compress/better/small-10k-random/matrix/c_ffi` | `0.126` ms | `0.096` ms | `1.31` |
| `compress/best/small-10k-random/matrix/pure_rust` | `0.775` ms | `0.64` ms | `1.21` |
| `compress/best/small-10k-random/matrix/c_ffi` | `0.384` ms | `0.293` ms | `1.31` |
| `compress/better/small-4k-log-lines/matrix/pure_rust` | `0.153` ms | `0.125` ms | `1.22` |
| `compress/better/small-4k-log-lines/matrix/c_ffi` | `0.098` ms | `0.076` ms | `1.29` |
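
For reference, the Ratio column is the current time divided by the previous time. The sketch below recomputes it for rows of the table; the function names are hypothetical, and the strict `>` comparison against the 1.15 threshold is an assumption about how the alert is triggered.

```rust
/// Threshold quoted in the alert text.
pub const ALERT_THRESHOLD: f64 = 1.15;

/// Ratio of current to previous time; values above 1.0 mean "slower".
pub fn ratio(current_ms: f64, previous_ms: f64) -> f64 {
    current_ms / previous_ms
}

/// Assumed alert rule: flag when the ratio exceeds the threshold.
pub fn is_flagged(current_ms: f64, previous_ms: f64) -> bool {
    ratio(current_ms, previous_ms) > ALERT_THRESHOLD
}
```

For the first row, 0.182 / 0.157 is about 1.159, which rounds to the reported 1.16 and sits just above the threshold.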

This comment was automatically generated by workflow using github-action-benchmark.

CC: @polaz

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/fse/fse_decoder.rs`:
- Around line 278-306: Update the safety comment inside copy_symbols_into_decode
to precisely state the invariants that make the unaligned u64 write safe:
mention the debug_assert_eq!(table_symbols.len(), self.decode.len()) and that
the loop condition idx + 1 < table_symbols.len() implies there are at least two
remaining decode entries (each 4 bytes), therefore at least 8 bytes available at
self.decode.as_mut_ptr().add(idx).cast::<u64>() for ptr::write_unaligned; also
note that unaligned writes are intentional and that the condition ensures idx is
within bounds of self.decode. Reference the symbols: copy_symbols_into_decode,
idx, table_symbols.len(), self.decode, and ptr::write_unaligned.
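
The invariants listed in this comment can be illustrated with a standalone sketch of a two-entries-per-store copy. It is written as a free function over slices and uses a local `Entry` stand-in, so it is an approximation of the idea and not the PR's `copy_symbols_into_decode` method.

```rust
/// Local stand-in: 4 bytes, no padding, every bit pattern is valid.
#[repr(C)]
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct Entry {
    pub new_state: u16,
    pub symbol: u8,
    pub num_bits: u8,
}

/// Writes `symbols[i]` into `decode[i].symbol`, zeroing the other fields,
/// which are expected to be filled in by a later pass.
pub fn copy_symbols_into_decode(decode: &mut [Entry], symbols: &[u8]) {
    assert_eq!(decode.len(), symbols.len());
    let mut idx = 0;
    #[cfg(target_endian = "little")]
    {
        while idx + 1 < symbols.len() {
            // Little-endian byte order: symbol is byte 2 of entry idx and
            // byte 6 of the 8-byte pair, i.e. bit offsets 16 and 48.
            let packed =
                ((symbols[idx] as u64) << 16) | ((symbols[idx + 1] as u64) << 48);
            // SAFETY: idx + 1 < decode.len(), so entries idx and idx + 1 both
            // exist and 8 bytes starting at decode[idx] are in bounds.
            // Entry is only 2-byte aligned, hence the unaligned write.
            unsafe {
                std::ptr::write_unaligned(
                    decode.as_mut_ptr().add(idx).cast::<u64>(),
                    packed,
                );
            }
            idx += 2;
        }
    }
    // Tail entry on little-endian, or every entry on big-endian targets.
    while idx < symbols.len() {
        decode[idx] = Entry { new_state: 0, symbol: symbols[idx], num_bits: 0 };
        idx += 1;
    }
}
```

The length assertion plus the `idx + 1 < len` loop condition are what make the 8-byte store sound; without either one the write could run past the end of the slice.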

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7a34162f-fb33-4ab6-8ec9-a8be8a31e620

📥 Commits

Reviewing files that changed from the base of the PR and between 1a69aab and e748900.

📒 Files selected for processing (3)

zstd/src/decoding/scratch.rs
zstd/src/fse/fse_decoder.rs
zstd/src/fse/mod.rs

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/fse/mod.rs`:
- Around line 41-49: The test build_decoder_clamps_max_log_over_entry_limit
doesn't exercise the clamp because it passes 16; change the test to call
dec_table.build_decoder(&[], ENTRY_MAX_ACCURACY_LOG + 1) (or another literal >16
like 20) so the max_log.min(ENTRY_MAX_ACCURACY_LOG) branch in fse_decoder.rs is
executed; alternatively, if you intend to keep the 16-case, rename the test to
reflect it verifies truncated-input/GetBitsError behavior rather than clamping.
Ensure you reference FSETable::new and the build_decoder call when making the
change.
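
The point of this comment is that an input equal to the limit cannot tell a clamped implementation apart from an unclamped one. A small sketch, using hypothetical helper names and only the `max_log.min(ENTRY_MAX_ACCURACY_LOG)` expression from the review text:

```rust
pub const ENTRY_MAX_ACCURACY_LOG: u8 = 16;

/// The clamp expression quoted in the review comment.
pub fn clamp_max_log(max_log: u8) -> u8 {
    max_log.min(ENTRY_MAX_ACCURACY_LOG)
}

/// Illustrative helper: true only when the clamp actually changes the input,
/// i.e. when a test using this input can observe the clamp at all.
pub fn input_exercises_clamp(max_log: u8) -> bool {
    clamp_max_log(max_log) != max_log
}
```

An input of 16 leaves the value unchanged, so only inputs such as `ENTRY_MAX_ACCURACY_LOG + 1` or 20 make the clamp observable in a test.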

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8a642ece-a116-43e7-8fa9-c08da4d9f22d

📥 Commits

Reviewing files that changed from the base of the PR and between e748900 and cb6f0c7.

📒 Files selected for processing (2)

zstd/src/fse/fse_decoder.rs
zstd/src/fse/mod.rs

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

polaz · 2026-04-06T21:49:13Z

@coderabbitai full review

coderabbitai · 2026-04-06T21:49:33Z

✅ Actions performed

Full review triggered.

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

polaz · 2026-04-06T23:20:18Z

@coderabbitai full review

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

polaz · 2026-04-06T23:33:04Z

@coderabbitai full review

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

polaz added 2 commits April 6, 2026 19:09

perf(fse): pack decoder entry to 4-byte layout

45db0d7

- replace Entry.base_line(u32) with Entry.new_state(u16)
- keep decode transition semantics (new_state + low bits)
- update FSE/sequence tests and add size assertion for packed entry

perf(fse): align decode tables and bulk spread symbols

1a69aab

Copilot AI review requested due to automatic review settings April 6, 2026 16:10

Copilot started reviewing on behalf of polaz April 6, 2026 16:11

Copilot AI reviewed Apr 6, 2026

Comment thread zstd/src/fse/fse_decoder.rs

Comment thread zstd/src/fse/fse_decoder.rs Outdated

Comment thread zstd/src/fse/fse_decoder.rs Outdated

Comment thread zstd/src/decoding/scratch.rs Outdated

coderabbitai Bot reviewed Apr 6, 2026

Comment thread zstd/src/fse/fse_decoder.rs Outdated

sw-release-bot Bot reviewed Apr 6, 2026

fix(fse): validate acc log bounds and reuse spread buffer

e748900

coderabbitai Bot reviewed Apr 6, 2026

Comment thread zstd/src/fse/fse_decoder.rs

polaz requested a review from Copilot April 6, 2026 20:47

Copilot started reviewing on behalf of polaz April 6, 2026 20:48

Copilot AI reviewed Apr 6, 2026

Comment thread zstd/src/fse/fse_decoder.rs Outdated

Comment thread zstd/src/fse/fse_decoder.rs

fix(fse): address review feedback for packed decode path

cb6f0c7

- Clamp build_decoder max_log to entry layout limit instead of early reject
- Add explicit layout assertions and tighten unsafe write safety invariants
- Update regression test to validate decoder path behavior

polaz requested a review from Copilot April 6, 2026 21:14

coderabbitai Bot reviewed Apr 6, 2026

Comment thread zstd/src/fse/mod.rs

Copilot AI reviewed Apr 6, 2026

Comment thread zstd/src/fse/fse_decoder.rs

test(fse): cover clamp path and enforce entry layout

61e4ac1

- exercise build_decoder clamp branch with max_log > 16
- add compile-time size and field-offset assertions for Entry on little-endian

polaz requested a review from Copilot April 6, 2026 21:44

Copilot started reviewing on behalf of polaz April 6, 2026 21:45

Copilot AI reviewed Apr 6, 2026

Comment thread zstd/src/fse/fse_decoder.rs

polaz requested review from Copilot and removed request for Copilot April 6, 2026 23:14

Copilot started reviewing on behalf of polaz April 6, 2026 23:15

perf(fse): preserve spread-buffer reuse on table reinit

ee56e95

- copy symbol_spread_buffer in reinit_from to retain allocated capacity

polaz requested a review from Copilot April 6, 2026 23:18

Copilot started reviewing on behalf of polaz April 6, 2026 23:18

Copilot AI reviewed Apr 6, 2026

Comment thread zstd/src/fse/fse_decoder.rs Outdated

Comment thread zstd/src/fse/mod.rs

perf(fse): avoid spread-buffer copy on table reinit

0e022ec

- preserve only symbol_spread_buffer capacity via reserve
- rename empty-input test to match asserted behavior
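
One way to read "preserve only capacity via reserve" is sketched below: the scratch buffer's contents are not copied, but the destination reserves as much room as the source had. The `Table` type and the body of `reinit_from` are assumptions for illustration; the PR's actual implementation is not shown on this page.

```rust
/// Hypothetical table with a reusable scratch buffer.
#[derive(Default)]
pub struct Table {
    pub decode: Vec<u32>,
    pub symbol_spread_buffer: Vec<u8>,
}

impl Table {
    /// Copies the decode data from `other` but only the *capacity* of its
    /// scratch buffer, since the scratch contents are rebuilt on each build.
    pub fn reinit_from(&mut self, other: &Table) {
        self.decode.clear();
        self.decode.extend_from_slice(&other.decode);
        self.symbol_spread_buffer.clear();
        // len is 0 here, so this guarantees capacity >= other's capacity.
        self.symbol_spread_buffer
            .reserve(other.symbol_spread_buffer.capacity());
    }
}
```

Compared with cloning the buffer, this avoids copying bytes that will be overwritten anyway while still sparing the next table build a reallocation.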

polaz requested a review from Copilot April 6, 2026 23:32

Copilot started reviewing on behalf of polaz April 6, 2026 23:33

Copilot AI reviewed Apr 6, 2026

Comment thread zstd/src/fse/fse_decoder.rs

polaz requested review from Copilot and removed request for Copilot April 6, 2026 23:41

refactor(fse): remove stale debug print in update_state

25be69c

- drop commented println that logged post-update state and could mislead debugging

polaz requested a review from Copilot April 6, 2026 23:52

Copilot started reviewing on behalf of polaz April 6, 2026 23:53

Copilot AI reviewed Apr 6, 2026

polaz merged commit baf2ffd into main Apr 7, 2026
13 of 14 checks passed

polaz deleted the perf/#56-packed-fse-entry-pr branch April 7, 2026 00:13

sw-release-bot Bot mentioned this pull request Apr 6, 2026

chore: release v0.0.8 #74

Merged

polaz mentioned this pull request Apr 7, 2026

Roadmap: structured-zstd feature parity with C zstd #28

Open

This was referenced Apr 9, 2026

perf(decoding): branchless offset history, stride prefetch, BMI2 pext #69

Closed

perf(decoding): SIMD HUF kernels with runtime dispatch #92

Merged

perf(encoding): complete ARM histogram path for #71 #104

Merged
