perf(fse): pack decoder entries and align decode tables #76

Merged
polaz merged 8 commits into main from perf/#56-packed-fse-entry-pr
Apr 7, 2026

Conversation

@polaz
Member

@polaz polaz commented Apr 6, 2026

Summary

  • switch FSE decoder entry to a packed 4-byte layout (new_state: u16, symbol: u8, num_bits: u8) with explicit layout assertions
  • align LL/ML/OF FSE table containers in decoder scratch to cache-line boundaries (64B, 128B on aarch64) to reduce cross-table placement effects in the hot loop
  • refactor table build to spread symbols via a reusable scratch buffer and bulk-write two entries per 64-bit store on little-endian targets
  • add explicit accuracy_log <= 16 validation and wide next-state arithmetic in decode updates to avoid truncation/panic paths
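
As a rough illustration of the first bullet, the packed entry and its layout assertions could look something like the sketch below. The field names and order come from the summary; the derives, the `const _` assertion style, and the exact offsets being checked are assumptions, not the code that was merged.

```rust
use core::mem::{align_of, offset_of, size_of};

/// Sketch of the packed decode entry described in the summary.
/// `#[repr(C)]` fixes the field order so the offsets below are guaranteed.
#[repr(C)]
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct Entry {
    /// Base of the next state; the low bits read from the stream are added to it.
    pub new_state: u16,
    /// Symbol emitted when the decoder is in this state.
    pub symbol: u8,
    /// Number of bits to read for the state transition.
    pub num_bits: u8,
}

// Compile-time layout checks (assumed form): the build fails if the
// struct ever grows or its fields move.
const _: () = assert!(size_of::<Entry>() == 4);
const _: () = assert!(align_of::<Entry>() == 2);
const _: () = assert!(offset_of!(Entry, new_state) == 0);
const _: () = assert!(offset_of!(Entry, symbol) == 2);
const _: () = assert!(offset_of!(Entry, num_bits) == 3);
```

`offset_of!` requires Rust 1.77 or newer. With four bytes per entry, a 64-byte cache line holds 16 entries instead of the 8 that fit with the previous 8-byte layout.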

Validation

  • cargo nextest run --workspace
  • cargo build --workspace
  • cargo clippy -p structured-zstd --all-targets --features hash,std,dict_builder -- -D warnings

Closes #56

Summary by CodeRabbit

  • Bug Fixes

    • Enforced a maximum accuracy parameter and surfaced explicit errors for oversized/invalid decompression settings.
  • Refactor / Performance

    • Improved memory alignment and compacted decoder entry layout for safer, more portable behavior.
    • Optimized symbol placement and bulk-copying to accelerate decoding on little-endian platforms.
  • Tests

    • Added and updated tests validating decoder layout, entry size, and accuracy-limit enforcement.

polaz added 2 commits April 6, 2026 19:09
- replace Entry.base_line(u32) with Entry.new_state(u16)
- keep decode transition semantics (new_state + low bits)
- update FSE/sequence tests and add size assertion for packed entry
Copilot AI review requested due to automatic review settings April 6, 2026 16:10
@coderabbitai

coderabbitai Bot commented Apr 6, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: c08086b4-8f8c-47f1-836f-a28e66a928c5

📥 Commits

Reviewing files that changed from the base of the PR and between 71708e5 and 61e4ac1.

📒 Files selected for processing (4)
  • zstd/src/decoding/scratch.rs
  • zstd/src/decoding/sequence_section_decoder.rs
  • zstd/src/fse/fse_decoder.rs
  • zstd/src/fse/mod.rs

📝 Walkthrough

Walkthrough

Replaces FSE decode entries with a packed 4-byte layout (new_state: u16, symbol: u8, num_bits: u8), adds an aligned FSE table wrapper, changes table build to a two-phase symbol spread with endian-aware bulk writes on little-endian, clamps accuracy log to 16, and updates tests and scratch initialization to use the new layout.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Core FSE entry & decoder: zstd/src/fse/fse_decoder.rs | Replaced Entry.base_line: u32 with Entry.new_state: u16 and added #[repr(C)]. Refactored decode state math to index via new_state, added ENTRY_MAX_ACCURACY_LOG = 16, introduced symbol_spread_buffer and copy_symbols_into_decode, implemented endian-aware bulk symbol writes (little-endian uses unaligned u64 writes; fallback per-entry otherwise), and adjusted num_bits / new_state assignment timing and error checks. |
| Aligned table wrapper & scratch changes: zstd/src/decoding/scratch.rs | Added pub struct AlignedFSETable(FSETable) with architecture-dependent alignment (repr(align(128)) on aarch64, else repr(align(64))), plus Deref/DerefMut and a new constructor. Switched FSEScratch fields (offsets, literal_lengths, match_lengths) to AlignedFSETable and updated constructors/init paths accordingly. |
| Tests & call sites: zstd/src/decoding/sequence_section_decoder.rs, zstd/src/fse/mod.rs | Updated tests to reference table.decode[idx].new_state instead of base_line. Added unit tests verifying Entry layout/size/field offsets, enforcing AccLogTooBig when accuracy log > 16, and ensuring oversized build errors propagate. Minor test helper updates to compare encoder baseline against dec_state.new_state. |
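
The aligned wrapper described for scratch.rs can be sketched roughly as follows. `AlignedFSETable`, the 64/128-byte alignments, and the `Deref`/`DerefMut` impls are named in the walkthrough; the `FSETable` placeholder fields and the `TABLE_ALIGN` constant are assumptions made so the sketch compiles on its own.

```rust
use core::ops::{Deref, DerefMut};

/// Hypothetical placeholder for the real table type; the fields are illustrative only.
#[derive(Default)]
pub struct FSETable {
    pub decode: Vec<u32>,
    pub accuracy_log: u8,
}

/// Newtype that forces cache-line alignment of the table container:
/// 128 bytes on aarch64, 64 bytes elsewhere.
#[cfg_attr(target_arch = "aarch64", repr(align(128)))]
#[cfg_attr(not(target_arch = "aarch64"), repr(align(64)))]
#[derive(Default)]
pub struct AlignedFSETable(FSETable);

/// Expected alignment for the current target (assumed helper for checks).
pub const TABLE_ALIGN: usize = if cfg!(target_arch = "aarch64") { 128 } else { 64 };

impl AlignedFSETable {
    pub fn new(table: FSETable) -> Self {
        Self(table)
    }
}

impl Deref for AlignedFSETable {
    type Target = FSETable;
    fn deref(&self) -> &FSETable {
        &self.0
    }
}

impl DerefMut for AlignedFSETable {
    fn deref_mut(&mut self) -> &mut FSETable {
        &mut self.0
    }
}
```

In this sketch the alignment applies to the wrapper struct itself, so each table's header fields start on their own cache line inside the scratch struct. It does not change the alignment of heap storage owned by a `Vec` inside the table.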

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Builder as Builder (table builder)
    participant FSETable as AlignedFSETable / FSETable
    participant Memory as Memory (bulk write)
    participant Decoder as FSEDecoder

    Builder->>Builder: compute symbol distribution & table_symbols
    Builder->>Memory: bulk-copy symbols (little-endian: u64 writes)
    Memory-->>FSETable: symbols written into decode[] region
    Builder->>FSETable: fill new_state and num_bits fields
    Decoder->>FSETable: read entry = decode[state_index]
    Decoder->>Decoder: next_state = entry.new_state + bits_read
    Decoder->>Decoder: emit entry.symbol
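
The last three steps of the diagram can be sketched as a single decode step. This is a minimal illustration under assumptions: `decode_step` and its signature are hypothetical, and the bounds check is added here for safety in the sketch rather than taken from the PR.

```rust
/// Packed entry as described in the walkthrough.
#[derive(Clone, Copy, Debug)]
pub struct Entry {
    pub new_state: u16,
    pub symbol: u8,
    pub num_bits: u8,
}

/// Hypothetical single decode step: look up the current entry, emit its
/// symbol, and compute the next state index.
///
/// `low_bits` is the value of the `entry.num_bits` bits the caller just read
/// from the bitstream. The addition is done in `usize` so a `u16` base plus
/// the extra bits can never wrap around.
pub fn decode_step(table: &[Entry], state: usize, low_bits: u32) -> Option<(u8, usize)> {
    let entry = *table.get(state)?;
    let next = usize::from(entry.new_state) + low_bits as usize;
    // Reject transitions that would index past the table instead of panicking.
    (next < table.len()).then_some((entry.symbol, next))
}
```

Widening before the addition matters because `new_state` alone can be as large as 65535 when the accuracy log is 16; doing the sum in `u16` could overflow for a corrupt table.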

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Poem

🐰 I packed the entries tight and small,
Aligned their hops to cache-line wall,
I spread the symbols, wrote them fast,
New_state leaps — the bits are cast,
I nibble speed, and bounce with joy 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'perf(fse): pack decoder entries and align decode tables' directly and accurately summarizes the main changes: packing FSE decoder entries from 8 to 4 bytes and implementing cache-line alignment for decode tables. |
| Linked Issues check | ✅ Passed | The pull request implements all coding requirements from issue #56: packed 4-byte Entry struct with new_state/symbol/num_bits fields, updated decode hot paths, cache-line alignment (64/128 bytes), accuracy_log validation (<=16), and bulk table spreading with symbol buffer. Tests and safety invariants are updated accordingly. |
| Out of Scope Changes check | ✅ Passed | All changes are directly aligned with issue #56 requirements: Entry field changes, FSETable alignment wrapper, build_decoding_table refactoring with symbol spreading, accuracy_log validation, and test updates. No unrelated modifications detected. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov

codecov Bot commented Apr 6, 2026

Codecov Report

❌ Patch coverage is 94.26230% with 7 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| zstd/src/fse/fse_decoder.rs | 91.25% | 7 Missing ⚠️ |

Copilot AI left a comment

Pull request overview

This PR optimizes the Zstd FSE decoder’s table representation and construction to improve cache efficiency and speed up decode-table building in hot paths.

Changes:

  • Refactors fse_decoder::Entry to a packed 4-byte layout and updates state-advance logic accordingly.
  • Refactors decode table building to spread symbols into a temporary byte buffer and bulk-write into the decode table (little-endian).
  • Introduces a cache-line-aligned wrapper type for the three sequence FSE tables (LL/ML/OF) in decoder scratch space.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| zstd/src/fse/mod.rs | Adds layout assertions for packed Entry and updates table equivalence test to use new_state. |
| zstd/src/fse/fse_decoder.rs | Implements packed Entry, updates state transitions, and refactors table build + symbol copy using bulk stores. |
| zstd/src/decoding/sequence_section_decoder.rs | Updates tests/diagnostics to reference new_state instead of base_line. |
| zstd/src/decoding/scratch.rs | Wraps LL/ML/OF FSETables in an aligned newtype to influence placement in scratch. |

Comment thread zstd/src/fse/fse_decoder.rs
Comment thread zstd/src/fse/fse_decoder.rs Outdated
Comment thread zstd/src/fse/fse_decoder.rs Outdated
Comment thread zstd/src/decoding/scratch.rs Outdated
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/fse/fse_decoder.rs`:
- Around line 48-49: The decoder table can overflow Entry::new_state (u16) when
accuracy_log > 16; update the public builders build_from_probabilities and
build_decoder to validate accuracy_log (or max_log) is ≤ 16 up front and return
FSETableError::AccLogTooBig instead of allowing later panics, and in
build_decoding_table replace the assert that new_state fits u16 with an explicit
check that returns FSETableError::AccLogTooBig if exceeded; ensure callers
cannot construct a table with entries whose new_state + add would index out of
bounds in the decode paths that use self.state.new_state + add (the decode logic
that indexes self.table.decode must only run when accuracy_log ≤ 16 and the
table was built with that constraint).
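
The up-front validation requested in this comment might look like the sketch below. `FSETableError::AccLogTooBig` and `ENTRY_MAX_ACCURACY_LOG` are names from the thread; the variant's fields and the `check_accuracy_log` helper are assumptions for illustration.

```rust
/// Largest accuracy log whose state indices still fit in `Entry::new_state` (u16).
pub const ENTRY_MAX_ACCURACY_LOG: u8 = 16;

/// Error type named in the review; the payload fields are assumed.
#[derive(Debug, PartialEq, Eq)]
pub enum FSETableError {
    AccLogTooBig { got: u8, max: u8 },
}

/// Hypothetical helper: validate the accuracy log before any table is built
/// and return the table size (number of decode entries) on success.
pub fn check_accuracy_log(accuracy_log: u8) -> Result<usize, FSETableError> {
    if accuracy_log > ENTRY_MAX_ACCURACY_LOG {
        return Err(FSETableError::AccLogTooBig {
            got: accuracy_log,
            max: ENTRY_MAX_ACCURACY_LOG,
        });
    }
    // With accuracy_log <= 16 the largest state index is 2^16 - 1 = u16::MAX.
    Ok(1usize << accuracy_log)
}
```

Returning an error at the entry point means the later decode path never sees a table whose state indices cannot be represented in a `u16`, so no assert or panic is needed there.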

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7193e4e9-3146-45bc-83cb-b29dcf2ec11f

📥 Commits

Reviewing files that changed from the base of the PR and between 71708e5 and 1a69aab.

📒 Files selected for processing (4)
  • zstd/src/decoding/scratch.rs
  • zstd/src/decoding/sequence_section_decoder.rs
  • zstd/src/fse/fse_decoder.rs
  • zstd/src/fse/mod.rs

Comment thread zstd/src/fse/fse_decoder.rs Outdated
@sw-release-bot sw-release-bot Bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'structured-zstd vs C FFI'.
The benchmark result of this commit is worse than the previous benchmark result by a ratio exceeding the 1.15 threshold.

| Benchmark suite | Current: 25be69c | Previous: 71708e5 | Ratio |
|---|---|---|---|
| compress/better/small-1k-random/matrix/pure_rust | 0.182 ms | 0.157 ms | 1.16 |
| compress/better/small-1k-random/matrix/c_ffi | 0.108 ms | 0.092 ms | 1.17 |
| compress/best/small-1k-random/matrix/pure_rust | 0.3 ms | 0.256 ms | 1.17 |
| compress/better/small-10k-random/matrix/c_ffi | 0.126 ms | 0.096 ms | 1.31 |
| compress/best/small-10k-random/matrix/pure_rust | 0.775 ms | 0.64 ms | 1.21 |
| compress/best/small-10k-random/matrix/c_ffi | 0.384 ms | 0.293 ms | 1.31 |
| compress/better/small-4k-log-lines/matrix/pure_rust | 0.153 ms | 0.125 ms | 1.22 |
| compress/better/small-4k-log-lines/matrix/c_ffi | 0.098 ms | 0.076 ms | 1.29 |

This comment was automatically generated by workflow using github-action-benchmark.

CC: @polaz

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/fse/fse_decoder.rs`:
- Around line 278-306: Update the safety comment inside copy_symbols_into_decode
to precisely state the invariants that make the unaligned u64 write safe:
mention the debug_assert_eq!(table_symbols.len(), self.decode.len()) and that
the loop condition idx + 1 < table_symbols.len() implies there are at least two
remaining decode entries (each 4 bytes), therefore at least 8 bytes available at
self.decode.as_mut_ptr().add(idx).cast::<u64>() for ptr::write_unaligned; also
note that unaligned writes are intentional and that the condition ensures idx is
within bounds of self.decode. Reference the symbols: copy_symbols_into_decode,
idx, table_symbols.len(), self.decode, and ptr::write_unaligned.
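
To make the invariants in this comment concrete, here is a standalone sketch of a two-entries-per-store symbol copy. It is written as a free function over slices rather than a method on the table, and it uses a hard `assert_eq!` where the review mentions `debug_assert_eq!`; both choices are assumptions made so the sketch is sound on its own.

```rust
use core::ptr;

#[repr(C)]
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct Entry {
    pub new_state: u16,
    pub symbol: u8,
    pub num_bits: u8,
}

const _: () = assert!(core::mem::size_of::<Entry>() == 4);

/// Writes `table_symbols[i]` into `decode[i].symbol`, zeroing the other
/// fields, which a later pass is expected to fill in.
pub fn copy_symbols_into_decode(decode: &mut [Entry], table_symbols: &[u8]) {
    assert_eq!(table_symbols.len(), decode.len());
    let mut idx = 0;

    #[cfg(target_endian = "little")]
    {
        while idx + 1 < table_symbols.len() {
            // `symbol` is byte 2 of each 4-byte entry, so in a little-endian
            // u64 the two symbols sit at bits 16..24 and 48..56.
            let packed = (u64::from(table_symbols[idx]) << 16)
                | (u64::from(table_symbols[idx + 1]) << 48);
            // SAFETY: the lengths are equal (asserted above) and
            // `idx + 1 < len`, so entries `idx` and `idx + 1` both exist,
            // giving 8 writable bytes starting at `decode[idx]`. `Entry` is
            // only 2-byte aligned, so the store is deliberately unaligned.
            // Every bit pattern is a valid pair of `Entry` values.
            unsafe {
                ptr::write_unaligned(decode.as_mut_ptr().add(idx).cast::<u64>(), packed);
            }
            idx += 2;
        }
    }

    // Tail entry on little-endian targets, or every entry on other targets.
    while idx < table_symbols.len() {
        decode[idx] = Entry { new_state: 0, symbol: table_symbols[idx], num_bits: 0 };
        idx += 1;
    }
}
```

The odd-length case is handled by the scalar tail loop, so the 8-byte store never reaches past the last entry.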

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7a34162f-fb33-4ab6-8ec9-a8be8a31e620

📥 Commits

Reviewing files that changed from the base of the PR and between 1a69aab and e748900.

📒 Files selected for processing (3)
  • zstd/src/decoding/scratch.rs
  • zstd/src/fse/fse_decoder.rs
  • zstd/src/fse/mod.rs

Comment thread zstd/src/fse/fse_decoder.rs
Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Comment thread zstd/src/fse/fse_decoder.rs Outdated
Comment thread zstd/src/fse/fse_decoder.rs
- Clamp build_decoder max_log to entry layout limit instead of early reject

- Add explicit layout assertions and tighten unsafe write safety invariants

- Update regression test to validate decoder path behavior
@polaz polaz requested a review from Copilot April 6, 2026 21:14
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/fse/mod.rs`:
- Around line 41-49: The test build_decoder_clamps_max_log_over_entry_limit
doesn't exercise the clamp because it passes 16; change the test to call
dec_table.build_decoder(&[], ENTRY_MAX_ACCURACY_LOG + 1) (or another literal >16
like 20) so the max_log.min(ENTRY_MAX_ACCURACY_LOG) branch in fse_decoder.rs is
executed; alternatively, if you intend to keep the 16-case, rename the test to
reflect it verifies truncated-input/GetBitsError behavior rather than clamping.
Ensure you reference FSETable::new and the build_decoder call when making the
change.
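
The point of this comment can be shown in isolation: an input of exactly 16 passes through `max_log.min(ENTRY_MAX_ACCURACY_LOG)` unchanged, so a test using it cannot tell a clamp from a no-op. The helpers below are hypothetical and exist only to demonstrate that; they are not the test from the PR.

```rust
pub const ENTRY_MAX_ACCURACY_LOG: u8 = 16;

/// Hypothetical helper isolating the clamp expression quoted in the review.
pub fn clamp_max_log(max_log: u8) -> u8 {
    max_log.min(ENTRY_MAX_ACCURACY_LOG)
}

/// True only when the clamp actually changes the input, i.e. when a test
/// using this `max_log` would exercise the clamping branch.
pub fn clamp_is_observable(max_log: u8) -> bool {
    clamp_max_log(max_log) != max_log
}
```

Only inputs above the limit, such as `ENTRY_MAX_ACCURACY_LOG + 1` or 20, make the clamp observable.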

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8a642ece-a116-43e7-8fa9-c08da4d9f22d

📥 Commits

Reviewing files that changed from the base of the PR and between e748900 and cb6f0c7.

📒 Files selected for processing (2)
  • zstd/src/fse/fse_decoder.rs
  • zstd/src/fse/mod.rs

Comment thread zstd/src/fse/mod.rs
Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comment thread zstd/src/fse/fse_decoder.rs
- exercise build_decoder clamp branch with max_log > 16

- add compile-time size and field-offset assertions for Entry on little-endian
@polaz
Member Author

polaz commented Apr 6, 2026

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented Apr 6, 2026

✅ Actions performed

Full review triggered.

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comment thread zstd/src/fse/fse_decoder.rs
@polaz polaz requested review from Copilot and removed request for Copilot April 6, 2026 23:14
- copy symbol_spread_buffer in reinit_from to retain allocated capacity
@polaz
Member Author

polaz commented Apr 6, 2026

@coderabbitai full review

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Comment thread zstd/src/fse/fse_decoder.rs Outdated
Comment thread zstd/src/fse/mod.rs
- preserve only symbol_spread_buffer capacity via reserve

- rename empty-input test to match asserted behavior
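
One way to read the "preserve only symbol_spread_buffer capacity via reserve" commit is sketched below. `reinit_from` and `symbol_spread_buffer` are names from the commit messages; the struct fields and the exact body are assumptions, since the thread does not show the code.

```rust
/// Hypothetical, simplified table: real decode entries are replaced by `u32`.
#[derive(Default)]
pub struct FSETable {
    pub decode: Vec<u32>,
    /// Scratch space used only while building the decode table.
    pub symbol_spread_buffer: Vec<u8>,
}

impl FSETable {
    /// Re-initialise `self` from `other`, reusing existing allocations.
    pub fn reinit_from(&mut self, other: &Self) {
        self.decode.clear();
        self.decode.extend_from_slice(&other.decode);

        // The scratch contents are rebuilt on every table build, so only the
        // capacity is carried over, not the bytes themselves.
        self.symbol_spread_buffer.clear();
        self.symbol_spread_buffer
            .reserve(other.symbol_spread_buffer.capacity());
    }
}
```

Because the buffer is empty after `clear`, `reserve(n)` guarantees a capacity of at least `n`, which avoids a reallocation on the next build without copying stale scratch data.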
@polaz polaz requested a review from Copilot April 6, 2026 23:32
@polaz
Member Author

polaz commented Apr 6, 2026

@coderabbitai full review

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comment thread zstd/src/fse/fse_decoder.rs
@polaz polaz requested review from Copilot and removed request for Copilot April 6, 2026 23:41
- drop commented println that logged post-update state and could mislead debugging
Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

@polaz polaz merged commit baf2ffd into main Apr 7, 2026
13 of 14 checks passed
@polaz polaz deleted the perf/#56-packed-fse-entry-pr branch April 7, 2026 00:13
@sw-release-bot sw-release-bot Bot mentioned this pull request Apr 6, 2026
Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

perf: packed FSE Entry layout (4-byte entries + bulk table spreading) #56

2 participants