perf(fse): pack decoder entries and align decode tables#76
Conversation
- replace Entry.base_line(u32) with Entry.new_state(u16) - keep decode transition semantics (new_state + low bits) - update FSE/sequence tests and add size assertion for packed entry
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (4)
📝 WalkthroughWalkthroughReplaces FSE decode entries with a packed 4-byte layout ( Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Builder as Builder (table builder)
participant FSETable as AlignedFSETable / FSETable
participant Memory as Memory (bulk write)
participant Decoder as FSEDecoder
Builder->>Builder: compute symbol distribution & table_symbols
Builder->>Memory: bulk-copy symbols (little-endian: u64 writes)
Memory-->>FSETable: symbols written into decode[] region
Builder->>FSETable: fill new_state and num_bits fields
Decoder->>FSETable: read entry = decode[state_index]
Decoder->>Decoder: next_state = entry.new_state + bits_read
Decoder->>Decoder: emit entry.symbol
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Pull request overview
This PR optimizes the Zstd FSE decoder’s table representation and construction to improve cache efficiency and speed up decode-table building in hot paths.
Changes:
- Refactors
fse_decoder::Entryto a packed 4-byte layout and updates state-advance logic accordingly. - Refactors decode table building to spread symbols into a temporary byte buffer and bulk-write into the decode table (little-endian).
- Introduces a cache-line-aligned wrapper type for the three sequence FSE tables (LL/ML/OF) in decoder scratch space.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| zstd/src/fse/mod.rs | Adds layout assertions for packed Entry and updates table equivalence test to use new_state. |
| zstd/src/fse/fse_decoder.rs | Implements packed Entry, updates state transitions, and refactors table build + symbol copy using bulk stores. |
| zstd/src/decoding/sequence_section_decoder.rs | Updates tests/diagnostics to reference new_state instead of base_line. |
| zstd/src/decoding/scratch.rs | Wraps LL/ML/OF FSETables in an aligned newtype to influence placement in scratch. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/fse/fse_decoder.rs`:
- Around line 48-49: The decoder table can overflow Entry::new_state (u16) when
accuracy_log > 16; update the public builders build_from_probabilities and
build_decoder to validate accuracy_log (or max_log) is ≤ 16 up front and return
FSETableError::AccLogTooBig instead of allowing later panics, and in
build_decoding_table replace the assert that new_state fits u16 with an explicit
check that returns FSETableError::AccLogTooBig if exceeded; ensure callers
cannot construct a table with entries whose new_state + add would index out of
bounds in the decode paths that use self.state.new_state + add (the decode logic
that indexes self.table.decode must only run when accuracy_log ≤ 16 and the
table was built with that constraint).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 7193e4e9-3146-45bc-83cb-b29dcf2ec11f
📒 Files selected for processing (4)
zstd/src/decoding/scratch.rszstd/src/decoding/sequence_section_decoder.rszstd/src/fse/fse_decoder.rszstd/src/fse/mod.rs
There was a problem hiding this comment.
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'structured-zstd vs C FFI'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.15.
| Benchmark suite | Current: 25be69c | Previous: 71708e5 | Ratio |
|---|---|---|---|
compress/better/small-1k-random/matrix/pure_rust |
0.182 ms |
0.157 ms |
1.16 |
compress/better/small-1k-random/matrix/c_ffi |
0.108 ms |
0.092 ms |
1.17 |
compress/best/small-1k-random/matrix/pure_rust |
0.3 ms |
0.256 ms |
1.17 |
compress/better/small-10k-random/matrix/c_ffi |
0.126 ms |
0.096 ms |
1.31 |
compress/best/small-10k-random/matrix/pure_rust |
0.775 ms |
0.64 ms |
1.21 |
compress/best/small-10k-random/matrix/c_ffi |
0.384 ms |
0.293 ms |
1.31 |
compress/better/small-4k-log-lines/matrix/pure_rust |
0.153 ms |
0.125 ms |
1.22 |
compress/better/small-4k-log-lines/matrix/c_ffi |
0.098 ms |
0.076 ms |
1.29 |
This comment was automatically generated by workflow using github-action-benchmark.
CC: @polaz
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/fse/fse_decoder.rs`:
- Around line 278-306: Update the safety comment inside copy_symbols_into_decode
to precisely state the invariants that make the unaligned u64 write safe:
mention the debug_assert_eq!(table_symbols.len(), self.decode.len()) and that
the loop condition idx + 1 < table_symbols.len() implies there are at least two
remaining decode entries (each 4 bytes), therefore at least 8 bytes available at
self.decode.as_mut_ptr().add(idx).cast::<u64>() for ptr::write_unaligned; also
note that unaligned writes are intentional and that the condition ensures idx is
within bounds of self.decode. Reference the symbols: copy_symbols_into_decode,
idx, table_symbols.len(), self.decode, and ptr::write_unaligned.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 7a34162f-fb33-4ab6-8ec9-a8be8a31e620
📒 Files selected for processing (3)
zstd/src/decoding/scratch.rszstd/src/fse/fse_decoder.rszstd/src/fse/mod.rs
- Clamp build_decoder max_log to entry layout limit instead of early reject - Add explicit layout assertions and tighten unsafe write safety invariants - Update regression test to validate decoder path behavior
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/fse/mod.rs`:
- Around line 41-49: The test build_decoder_clamps_max_log_over_entry_limit
doesn't exercise the clamp because it passes 16; change the test to call
dec_table.build_decoder(&[], ENTRY_MAX_ACCURACY_LOG + 1) (or another literal >16
like 20) so the max_log.min(ENTRY_MAX_ACCURACY_LOG) branch in fse_decoder.rs is
executed; alternatively, if you intend to keep the 16-case, rename the test to
reflect it verifies truncated-input/GetBitsError behavior rather than clamping.
Ensure you reference FSETable::new and the build_decoder call when making the
change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 8a642ece-a116-43e7-8fa9-c08da4d9f22d
📒 Files selected for processing (2)
zstd/src/fse/fse_decoder.rszstd/src/fse/mod.rs
- exercise build_decoder clamp branch with max_log > 16 - add compile-time size and field-offset assertions for Entry on little-endian
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
- copy symbol_spread_buffer in reinit_from to retain allocated capacity
|
@coderabbitai full review |
- preserve only symbol_spread_buffer capacity via reserve - rename empty-input test to match asserted behavior
|
@coderabbitai full review |
- drop commented println that logged post-update state and could mislead debugging
Summary
accuracy_log <= 16validation and wide next-state arithmetic in decode updates to avoid truncation/panic pathsValidation
Closes #56
Summary by CodeRabbit
Bug Fixes
Refactor / Performance
Tests