perf(decoding): optimize sequence execution with overlap fast paths#42
perf(decoding): optimize sequence execution with overlap fast paths#42
Conversation
- split repeat into overlap-aware paths (>=16, >=8, <8) - add short-offset replication path with chunked pattern writes - add match/literal/dictionary prefetch hints for x86/x86_64 - add regression test covering overlap fast-path correctness Refs #12
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughDecodeBuffer repeat now rejects zero offsets, early-returns on zero-length matches, prefetches match and dictionary sources, and dispatches overlap-aware vs non-overlapping repeat strategies with branchless copy selection on some targets. execute_sequences prefetches literal slices. A new prefetch module and tests were added. Changes
Sequence Diagram(s)sequenceDiagram
participant Executor as Executor (execute_sequences)
participant Prefetch as Prefetch
participant Decode as DecodeBuffer::repeat
participant Ring as RingBuffer / Output
participant Dict as Dictionary
Executor->>Prefetch: prefetch_literals(literals_slice)
Executor->>Ring: copy literals -> output
Executor->>Decode: request repeat(offset, match_len)
Decode->>Prefetch: prefetch_match_source(start_idx)
alt repeating from dictionary
Decode->>Prefetch: prefetch_slice(dict_slice)
Prefetch->>Dict: warm dictionary data
Decode->>Ring: extend from dictionary
else repeating from ringbuffer
Prefetch->>Ring: warm source region
Decode->>Ring: choose repeat_overlapping / repeat_in_chunks / repeat_short_offset or branchless non-overlap
end
Ring-->>Decode: write complete
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/decoding/decode_buffer.rs`:
- Around line 150-167: Add a defensive debug assertion at the start of fn
repeat_short_offset(&mut self, offset: usize, match_length: usize, start_idx:
usize) to ensure offset > 0, documenting the invariant and preventing a
division-by-zero panic in the modular indexing (base[(copied + j) % offset]);
place a debug_assert!(offset > 0) (or equivalent) as the first statement in
repeat_short_offset so the check runs in debug builds and clarifies the
precondition assumed by execute_sequences/ZeroOffset.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: d31ea7bd-a538-4823-b096-9b01b622b426
📒 Files selected for processing (2)
zstd/src/decoding/decode_buffer.rszstd/src/decoding/sequence_execution.rs
There was a problem hiding this comment.
Pull request overview
This PR targets the decompression hot path by adding overlap-aware match-copy fast paths and lightweight prefetching to reduce branch overhead and improve cache behavior during sequence execution.
Changes:
- Add x86/x86_64 literal prefetching in
execute_sequences. - Rework
DecodeBuffer::repeatto use overlap-classified copy paths (≥16, ≥8, <8), add dictionary/match-source prefetching, and route non-overlapping copies to an optional branchless copy routine. - Add a regression test validating output correctness across several overlap boundary cases.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| zstd/src/decoding/sequence_execution.rs | Adds x86/x86_64 prefetch for literal slices before copying into the output buffer. |
| zstd/src/decoding/decode_buffer.rs | Introduces overlap-aware match copy paths, match/dict prefetching, and adds regression tests for overlap correctness. |
- return DecodeBufferError::ZeroOffset for repeat(offset=0) - add wrapped-ringbuffer overlap regression coverage - move prefetch helper into shared decoding::prefetch module Refs #12
There was a problem hiding this comment.
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'structured-zstd vs C FFI'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.15.
| Benchmark suite | Current: 1d7f06e | Previous: 6fc1fc3 | Ratio |
|---|---|---|---|
compress/fastest/small-1k-random/matrix/c_ffi |
0.007 ms |
0.005 ms |
1.40 |
compress/default/small-1k-random/matrix/c_ffi |
0.022 ms |
0.018 ms |
1.22 |
compress-dict/fastest/small-4k-log-lines/matrix/c_ffi_without_dict |
0.005 ms |
0.004 ms |
1.25 |
compress-dict/default/small-4k-log-lines/matrix/c_ffi_without_dict |
0.005 ms |
0.004 ms |
1.25 |
This comment was automatically generated by workflow using github-action-benchmark.
CC: @polaz
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/decoding/decode_buffer.rs`:
- Around line 155-176: repeat_short_offset currently builds scratch per output
byte using a modulo and scalar stores; instead, after filling base in
repeat_short_offset, construct a single 8-byte pattern (u64 or [u8;8]) that
contains the offset-length repeat of base and then emit into self.buffer in
word-sized (8-byte) chunks by copying that 8-byte pattern repeatedly (using bulk
extend/append operations) and only handle the final <8 tail bytes with a small
scalar copy; update repeat_short_offset to avoid per-byte modulo and use the
8-byte replication trick for the main loop, falling back to copying the
remaining bytes at the end.
- Around line 68-75: Add a focused regression test that calls the repeat method
with offset == 0 (e.g., repeat(0, n)) and asserts it returns
Err(DecodeBufferError::ZeroOffset) instead of succeeding or panicking; locate
the test near other decode_buffer tests and target the repeat(&mut self, offset:
usize, match_length: usize) function so the new contract for
DecodeBufferError::ZeroOffset is exercised and prevented from being accidentally
removed.
In `@zstd/src/decoding/prefetch.rs`:
- Around line 6-17: The prefetch implementation should be split into two
arch-specific versions to avoid the unnecessary unsafe and to gate 32-bit x86 on
SSE: create one #[cfg(target_arch = "x86_64")] #[inline(always)] fn
prefetch_slice_impl(slice: &[u8]) that imports core::arch::x86_64::{_MM_HINT_T0,
_mm_prefetch} and calls _mm_prefetch directly (no unsafe), and create a
#[cfg(all(target_arch = "x86", target_feature = "sse"))] #[inline(always)] fn
prefetch_slice_impl(slice: &[u8]) that imports core::arch::x86::{_MM_HINT_T0,
_mm_prefetch} and calls _mm_prefetch (no unsafe); also add a fallback
#[cfg(not(any(target_arch = "x86_64", all(target_arch = "x86", target_feature =
"sse"))))] #[inline(always)] fn prefetch_slice_impl(_slice: &[u8]) { /* no-op */
} so code compiles without SSE on 32-bit x86 and removes unsafe usage while
keeping the same API.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 8d163dc1-f1b5-4657-83ac-1ce8e96c3c54
📒 Files selected for processing (5)
zstd/src/decoding/decode_buffer.rszstd/src/decoding/errors.rszstd/src/decoding/mod.rszstd/src/decoding/prefetch.rszstd/src/decoding/sequence_execution.rs
- add regression for repeat(offset=0) -> DecodeBufferError::ZeroOffset - switch short-offset repeat hot loop to 8-byte phase patterns - gate x86 prefetch by SSE and remove unnecessary unsafe Refs #12
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
zstd/src/decoding/decode_buffer.rs (1)
228-242:⚠️ Potential issue | 🔴 CriticalUpdate
total_output_counterin the full-dictionary branch.When
bytes_from_dict >= match_length, this arm extendsdict_sliceand returns without bumpingself.total_output_counter. That leaves theself.total_output_counter <= self.window_size as u64gate stale, so later repeats can keep treating dictionary history as in-window after it should have expired.♻️ Proposed fix
} else { let low = self.dict_content.len() - bytes_from_dict; let high = low + match_length; let dict_slice = &self.dict_content[low..high]; prefetch::prefetch_slice(dict_slice); self.buffer.extend(dict_slice); + self.total_output_counter += match_length as u64; } Ok(())🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@zstd/src/decoding/decode_buffer.rs` around lines 228 - 242, The else branch that handles the full-dictionary copy (when bytes_from_dict >= match_length) extends self.buffer from self.dict_content but fails to update self.total_output_counter; update self.total_output_counter by adding match_length (cast to u64) after the buffer.extend(dict_slice) in that branch so the total_output_counter remains correct for window expiration checks used by methods like repeat and the window_size gate.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@zstd/src/decoding/decode_buffer.rs`:
- Around line 228-242: The else branch that handles the full-dictionary copy
(when bytes_from_dict >= match_length) extends self.buffer from
self.dict_content but fails to update self.total_output_counter; update
self.total_output_counter by adding match_length (cast to u64) after the
buffer.extend(dict_slice) in that branch so the total_output_counter remains
correct for window expiration checks used by methods like repeat and the
window_size gate.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 2c50dca5-5e1a-4a6c-8976-f5ec39c2be47
📒 Files selected for processing (2)
zstd/src/decoding/decode_buffer.rszstd/src/decoding/prefetch.rs
Summary
Validation
Closes #12
Summary by CodeRabbit
Refactor
Bug Fixes
Tests