perf(decoding): optimize sequence execution with overlap fast paths by polaz · Pull Request #42 · structured-world/structured-zstd

polaz · 2026-03-28T16:55:23Z

Summary

add overlap-aware match copy paths in decode hot path (offset >= 16, offset >= 8, offset < 8)
add short-offset replication path for offset < 8
add x86/x86_64 prefetch hints for literals, match source, and dictionary slices
add regression coverage for overlap-path correctness
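
The overlap handling listed above can be illustrated with a minimal sketch. It assumes a plain `Vec<u8>` output buffer rather than the crate's ring buffer, and `repeat_match` is a hypothetical free function, not the PR's `DecodeBuffer::repeat`. The PR further splits the overlapping case by offset (>= 16, >= 8, < 8); this sketch collapses those into a single chunked loop.

```rust
/// Append `match_length` bytes to `buf`, reading from `offset` bytes behind
/// the current end. Hypothetical helper for illustration only.
fn repeat_match(buf: &mut Vec<u8>, offset: usize, match_length: usize) {
    // Preconditions the caller is assumed to have validated already.
    debug_assert!(offset > 0 && offset <= buf.len());
    let start = buf.len() - offset;
    buf.reserve(match_length);

    if offset >= match_length {
        // The source range ends at or before the write position,
        // so a single bulk copy is enough.
        buf.extend_from_within(start..start + match_length);
        return;
    }

    // Overlapping match: copy at most `offset` bytes per step, so every
    // byte that is read has already been written by an earlier step.
    let mut copied = 0;
    while copied < match_length {
        let n = offset.min(match_length - copied);
        let src = start + copied;
        buf.extend_from_within(src..src + n);
        copied += n;
    }
}
```

With a buffer holding `abcdefgh`, calling `repeat_match(&mut buf, 3, 7)` appends `fghfghf`. For very small offsets the loop degrades into many tiny copies, which is the motivation for a dedicated short-offset path.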

Validation

cargo fmt
cargo clippy --all-targets -- -D warnings
cargo nextest run
cargo build

Closes #12

Summary by CodeRabbit

Refactor
- Improved decoding performance with memory prefetching, CPU-targeted branchless fast-copy paths (x86/x86_64), and refined handling of overlapping and short-offset repeats; added a prefetch helper used for literals and dictionary data.
Bug Fixes
- New explicit error for illegal zero offsets; zero-length repeats are now immediate no-ops.
Tests
- Added unit tests validating overlap expansion, ringbuffer-wrapped draining, zero-offset behavior, and dictionary repeat accounting.
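
A minimal sketch of the zero-offset and zero-length behavior described above. The `ZeroOffset` variant name and its message come from this PR; the `DecodeBuffer` struct here, the `OffsetTooBig` variant, the order of the two early checks, and the byte-by-byte copy are assumptions made for illustration.

```rust
use std::fmt;

#[derive(Debug, PartialEq, Eq)]
enum DecodeBufferError {
    ZeroOffset,
    // Hypothetical variant for offsets that reach before the buffer start.
    OffsetTooBig,
}

impl fmt::Display for DecodeBufferError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DecodeBufferError::ZeroOffset => write!(f, "Illegal offset: 0 found"),
            DecodeBufferError::OffsetTooBig => write!(f, "Offset exceeds buffer length"),
        }
    }
}

// Simplified stand-in: a flat Vec instead of the crate's ring buffer.
struct DecodeBuffer {
    buffer: Vec<u8>,
}

impl DecodeBuffer {
    fn repeat(&mut self, offset: usize, match_length: usize) -> Result<(), DecodeBufferError> {
        if offset == 0 {
            // An offset of zero can never refer to earlier output.
            return Err(DecodeBufferError::ZeroOffset);
        }
        if match_length == 0 {
            // Nothing to copy: immediate no-op.
            return Ok(());
        }
        let start = self
            .buffer
            .len()
            .checked_sub(offset)
            .ok_or(DecodeBufferError::OffsetTooBig)?;
        // Naive copy standing in for the fast paths.
        for i in 0..match_length {
            let byte = self.buffer[start + i];
            self.buffer.push(byte);
        }
        Ok(())
    }
}
```

Returning an error instead of panicking lets callers treat a zero offset as corrupt input.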

- split repeat into overlap-aware paths (>=16, >=8, <8)
- add short-offset replication path with chunked pattern writes
- add match/literal/dictionary prefetch hints for x86/x86_64
- add regression test covering overlap fast-path correctness

Refs #12

coderabbitai · 2026-03-28T16:55:39Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2f9b9903-4d24-4a83-b0d1-7b0a4941c948

📥 Commits

Reviewing files that changed from the base of the PR and between bbd1657 and 1d7f06e.

📒 Files selected for processing (2)

zstd/src/decoding/decode_buffer.rs
zstd/src/decoding/prefetch.rs

Walkthrough

DecodeBuffer repeat now rejects zero offsets, early-returns on zero-length matches, prefetches match and dictionary sources, and dispatches overlap-aware vs non-overlapping repeat strategies with branchless copy selection on some targets. execute_sequences prefetches literal slices. A new prefetch module and tests were added.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Decode buffer core `zstd/src/decoding/decode_buffer.rs` | Validate zero offset (`ZeroOffset`), early-return for zero-length matches, prefetch match source, replace prior overlap path with `repeat_overlapping` and helpers (`repeat_in_chunks`, `repeat_short_offset`), add `byte_at`, `prefetch_match_source`, and branchless/non-branchless non-overlap branches. |
| Sequence execution `zstd/src/decoding/sequence_execution.rs` | Call `prefetch_literals(slice)` for each non-empty literals slice before advancing the copy counter; added private `prefetch_literals` helper. |
| Errors `zstd/src/decoding/errors.rs` | Added `DecodeBufferError::ZeroOffset` and updated `Display` to emit "Illegal offset: 0 found". |
| Prefetch module `zstd/src/decoding/mod.rs`, `zstd/src/decoding/prefetch.rs` | Exported `pub(crate) mod prefetch`; implemented `prefetch_slice(slice)` with arch-specific `prefetch_slice_impl` (x86/x86_64 use `_mm_prefetch`, others no-op). |
| Dictionary/ringbuffer interactions `zstd/src/decoding/decode_buffer.rs` (repeat_from_dict path) | Prefetch dictionary slice before extending buffer; ensure total output accounting updated in tests. |
| Tests `zstd/src/decoding/...tests` | Added tests covering overlap expansion correctness vs reference model for many `(offset, match_length)` cases (including ringbuffer wrap/drain), zero-offset error, and dictionary repeat counter behavior. |
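
The "reference model" testing approach mentioned in the table can be sketched as follows. This is an assumption about the general shape of such a test, not the PR's actual test code: `reference_repeat` and `first_mismatch` are hypothetical names, the buffer is a flat `Vec<u8>`, and ring-buffer wrap and drain cases are not modeled.

```rust
/// Byte-by-byte model of a match copy. Obviously correct, deliberately slow.
fn reference_repeat(buf: &mut Vec<u8>, offset: usize, match_length: usize) {
    let start = buf.len() - offset;
    for i in 0..match_length {
        // Reads may hit bytes pushed earlier in this same loop,
        // which is exactly the overlap semantics being tested.
        let byte = buf[start + i];
        buf.push(byte);
    }
}

/// Run `fast` against the reference model over a grid of cases and
/// return the first `(offset, match_length)` pair that disagrees.
fn first_mismatch<F>(fast: F) -> Option<(usize, usize)>
where
    F: Fn(&mut Vec<u8>, usize, usize),
{
    // Distinct seed bytes make phase errors visible.
    let seed: Vec<u8> = (0u8..32).collect();
    for offset in 1..=seed.len() {
        for match_length in 0..=64 {
            let mut expected = seed.clone();
            reference_repeat(&mut expected, offset, match_length);

            let mut actual = seed.clone();
            fast(&mut actual, offset, match_length);

            if actual != expected {
                return Some((offset, match_length));
            }
        }
    }
    None
}
```

A test would then assert that `first_mismatch(candidate)` is `None` for each fast path. Reporting the failing pair makes boundary bugs (for example at offsets 7, 8, 15, 16) easier to localize.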

Sequence Diagram(s)

sequenceDiagram
  participant Executor as Executor (execute_sequences)
  participant Prefetch as Prefetch
  participant Decode as DecodeBuffer::repeat
  participant Ring as RingBuffer / Output
  participant Dict as Dictionary

  Executor->>Prefetch: prefetch_literals(literals_slice)
  Executor->>Ring: copy literals -> output
  Executor->>Decode: request repeat(offset, match_len)
  Decode->>Prefetch: prefetch_match_source(start_idx)
  alt repeating from dictionary
    Decode->>Prefetch: prefetch_slice(dict_slice)
    Prefetch->>Dict: warm dictionary data
    Decode->>Ring: extend from dictionary
  else repeating from ringbuffer
    Prefetch->>Ring: warm source region
    Decode->>Ring: choose repeat_overlapping / repeat_in_chunks / repeat_short_offset or branchless non-overlap
  end
  Ring-->>Decode: write complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

I nibble bytes and prefetch dreams, 🐇
Offsets checked, no zero schemes,
Overlap hops in careful beats,
Branchless leaps where cache warmth meets,
A rabbit hums — the stream completes.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Linked Issues check | ⚠️ Warning | The PR implements overlap-aware fast paths, prefetch infrastructure, short-offset replication, and improved error handling. However, wildcopy overlength allocation (overlapping reads/writes past logical buffer bounds) is not present in the changeset; the implementation uses safe unchecked access instead. | Implement overlength buffer allocation and wildcopy patterns as specified in `#12`, allowing 8-16 byte-at-a-time copies without per-iteration bounds checks. Verify that the current 'branchless' approach fully meets the performance goals. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | Title clearly summarizes the main optimization focus: sequence execution with overlap fast paths, which aligns with the changeset's core improvements to decode buffer repeat operations. |
| Out of Scope Changes check | ✅ Passed | All changes directly support the linked issue objectives: overlap handling, prefetch hints, error handling enhancements, and architecture-specific optimizations. No unrelated modifications detected. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 86.96% which is sufficient. The required threshold is 80.00%. |

codecov · 2026-03-28T16:57:06Z

Codecov Report

❌ Patch coverage is 97.20670% with 5 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| zstd/src/decoding/decode_buffer.rs | 97.59% | 4 Missing ⚠️ |
| zstd/src/decoding/errors.rs | 0.00% | 1 Missing ⚠️ |

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/decoding/decode_buffer.rs`:
- Around line 150-167: Add a defensive debug assertion at the start of fn
repeat_short_offset(&mut self, offset: usize, match_length: usize, start_idx:
usize) to ensure offset > 0, documenting the invariant and preventing a
division-by-zero panic in the modular indexing (base[(copied + j) % offset]);
place a debug_assert!(offset > 0) (or equivalent) as the first statement in
repeat_short_offset so the check runs in debug builds and clarifies the
precondition assumed by execute_sequences/ZeroOffset.
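
To make the concern concrete, here is a minimal sketch of a short-offset repeat that uses modular indexing, with the suggested assertion in place. It is an assumption about the shape of the code under review: the real `repeat_short_offset` is a method that also takes `start_idx`, while this hypothetical free function works on a flat `Vec<u8>`.

```rust
/// Repeat the last `offset` bytes of `buf` (1 <= offset < 8) until
/// `match_length` bytes have been appended. Illustrative sketch only.
fn repeat_short_offset(buf: &mut Vec<u8>, offset: usize, match_length: usize) {
    // Documents the invariant: a zero offset must be rejected earlier.
    // Without it, `i % offset` below panics with a remainder-by-zero error.
    debug_assert!(offset > 0, "offset must be validated before this path");
    debug_assert!(offset < 8 && offset <= buf.len());

    // Snapshot the pattern so reads never alias the growing buffer.
    let start = buf.len() - offset;
    let mut base = [0u8; 8];
    base[..offset].copy_from_slice(&buf[start..]);

    buf.reserve(match_length);
    for i in 0..match_length {
        // One modulo per output byte: simple, but slow in a hot loop.
        buf.push(base[i % offset]);
    }
}
```

With `xyzab` in the buffer, `repeat_short_offset(&mut buf, 2, 5)` appends `ababa`. The assertion costs nothing in release builds and states the precondition next to the code that depends on it.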

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: d31ea7bd-a538-4823-b096-9b01b622b426

📥 Commits

Reviewing files that changed from the base of the PR and between 6fc1fc3 and ce987d7.

📒 Files selected for processing (2)

zstd/src/decoding/decode_buffer.rs
zstd/src/decoding/sequence_execution.rs

Copilot

Pull request overview

This PR targets the decompression hot path by adding overlap-aware match-copy fast paths and lightweight prefetching to reduce branch overhead and improve cache behavior during sequence execution.

Changes:

Add x86/x86_64 literal prefetching in execute_sequences.
Rework DecodeBuffer::repeat to use overlap-classified copy paths (≥16, ≥8, <8), add dictionary/match-source prefetching, and route non-overlapping copies to an optional branchless copy routine.
Add a regression test validating output correctness across several overlap boundary cases.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| zstd/src/decoding/sequence_execution.rs | Adds x86/x86_64 prefetch for literal slices before copying into the output buffer. |
| zstd/src/decoding/decode_buffer.rs | Introduces overlap-aware match copy paths, match/dict prefetching, and adds regression tests for overlap correctness. |

- return DecodeBufferError::ZeroOffset for repeat(offset=0)
- add wrapped-ringbuffer overlap regression coverage
- move prefetch helper into shared decoding::prefetch module

Refs #12

sw-release-bot

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'structured-zstd vs C FFI'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.15.

| Benchmark suite | Current: `1d7f06e` | Previous: `6fc1fc3` | Ratio |
|---|---|---|---|
| `compress/fastest/small-1k-random/matrix/c_ffi` | `0.007` ms | `0.005` ms | `1.40` |
| `compress/default/small-1k-random/matrix/c_ffi` | `0.022` ms | `0.018` ms | `1.22` |
| `compress-dict/fastest/small-4k-log-lines/matrix/c_ffi_without_dict` | `0.005` ms | `0.004` ms | `1.25` |
| `compress-dict/default/small-4k-log-lines/matrix/c_ffi_without_dict` | `0.005` ms | `0.004` ms | `1.25` |
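
The Ratio column appears to be the current time divided by the previous time, compared against the 1.15 threshold. A small sketch of that arithmetic, using hypothetical helper names that are not part of github-action-benchmark:

```rust
/// Ratio of the current timing to the previous one; above 1.0 means slower.
fn slowdown_ratio(current_ms: f64, previous_ms: f64) -> f64 {
    current_ms / previous_ms
}

/// True when the slowdown exceeds the alert threshold.
fn is_flagged(current_ms: f64, previous_ms: f64, threshold: f64) -> bool {
    slowdown_ratio(current_ms, previous_ms) > threshold
}
```

For the first row, 0.007 / 0.005 = 1.40, which exceeds 1.15. The timings are shown to three decimal places, so at this scale a difference of 0.001 ms already moves the ratio by 20% or more.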

This comment was automatically generated by workflow using github-action-benchmark.

CC: @polaz

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/decoding/decode_buffer.rs`:
- Around line 155-176: repeat_short_offset currently builds scratch per output
byte using a modulo and scalar stores; instead, after filling base in
repeat_short_offset, construct a single 8-byte pattern (u64 or [u8;8]) that
contains the offset-length repeat of base and then emit into self.buffer in
word-sized (8-byte) chunks by copying that 8-byte pattern repeatedly (using bulk
extend/append operations) and only handle the final <8 tail bytes with a small
scalar copy; update repeat_short_offset to avoid per-byte modulo and use the
8-byte replication trick for the main loop, falling back to copying the
remaining bytes at the end.
- Around line 68-75: Add a focused regression test that calls the repeat method
with offset == 0 (e.g., repeat(0, n)) and asserts it returns
Err(DecodeBufferError::ZeroOffset) instead of succeeding or panicking; locate
the test near other decode_buffer tests and target the repeat(&mut self, offset:
usize, match_length: usize) function so the new contract for
DecodeBufferError::ZeroOffset is exercised and prevented from being accidentally
removed.
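
One way to realize the 8-byte replication idea from the first comment is sketched below. It is an assumption about how such a loop could look, not the code that landed: the pattern is expanded once into a 15-byte table, and each 8-byte chunk is then copied from the table at the current phase, so the modulo runs once up front instead of once per output byte. The function name and flat `Vec<u8>` buffer are hypothetical.

```rust
/// Repeat the last `offset` bytes of `buf` (1 <= offset < 8) for
/// `match_length` bytes, writing 8 bytes per hot-loop iteration.
fn repeat_short_offset_chunked(buf: &mut Vec<u8>, offset: usize, match_length: usize) {
    assert!((1..8).contains(&offset) && offset <= buf.len());
    let start = buf.len() - offset;

    // ext[i] holds pattern[i % offset]; 15 bytes cover any 8-byte window
    // starting at a phase in 0..offset (phase <= 6, so the end is <= 14).
    let mut ext = [0u8; 15];
    for (i, slot) in ext.iter_mut().enumerate() {
        *slot = buf[start + i % offset];
    }

    // After each 8-byte chunk the pattern phase advances by 8 mod offset.
    let step = 8 % offset;
    let mut phase = 0;
    let mut remaining = match_length;
    buf.reserve(match_length);

    while remaining >= 8 {
        buf.extend_from_slice(&ext[phase..phase + 8]);
        phase += step;
        if phase >= offset {
            phase -= offset; // one subtraction suffices: phase + step < 2 * offset
        }
        remaining -= 8;
    }
    // Tail of fewer than 8 bytes, continuing at the current phase.
    buf.extend_from_slice(&ext[phase..phase + remaining]);
}
```

Comparing this against a byte-by-byte copy for every offset in 1..8 and a range of lengths is a cheap way to check the phase bookkeeping.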

In `@zstd/src/decoding/prefetch.rs`:
- Around line 6-17: The prefetch implementation should be split into two
arch-specific versions to avoid the unnecessary unsafe and to gate 32-bit x86 on
SSE: create one #[cfg(target_arch = "x86_64")] #[inline(always)] fn
prefetch_slice_impl(slice: &[u8]) that imports core::arch::x86_64::{_MM_HINT_T0,
_mm_prefetch} and calls _mm_prefetch directly (no unsafe), and create a
#[cfg(all(target_arch = "x86", target_feature = "sse"))] #[inline(always)] fn
prefetch_slice_impl(slice: &[u8]) that imports core::arch::x86::{_MM_HINT_T0,
_mm_prefetch} and calls _mm_prefetch (no unsafe); also add a fallback
#[cfg(not(any(target_arch = "x86_64", all(target_arch = "x86", target_feature =
"sse"))))] #[inline(always)] fn prefetch_slice_impl(_slice: &[u8]) { /* no-op */
} so code compiles without SSE on 32-bit x86 and removes unsafe usage while
keeping the same API.
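
A sketch of the three-way `cfg` split described above. It keeps the `_mm_prefetch` calls inside `unsafe` blocks, since the intrinsic takes a raw pointer and is declared `unsafe` in `core::arch`; a later commit in this PR restores those blocks. The empty-slice guard is an assumption added here, not something the source confirms.

```rust
#[cfg(target_arch = "x86_64")]
#[inline(always)]
fn prefetch_slice_impl(slice: &[u8]) {
    use core::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
    // SAFETY: prefetch is only a hint and the pointer comes from a live
    // slice; SSE is part of the x86_64 baseline.
    unsafe { _mm_prefetch::<_MM_HINT_T0>(slice.as_ptr().cast::<i8>()) };
}

#[cfg(all(target_arch = "x86", target_feature = "sse"))]
#[inline(always)]
fn prefetch_slice_impl(slice: &[u8]) {
    use core::arch::x86::{_mm_prefetch, _MM_HINT_T0};
    // SAFETY: as above; this build is gated on the `sse` target feature.
    unsafe { _mm_prefetch::<_MM_HINT_T0>(slice.as_ptr().cast::<i8>()) };
}

#[cfg(not(any(
    target_arch = "x86_64",
    all(target_arch = "x86", target_feature = "sse")
)))]
#[inline(always)]
fn prefetch_slice_impl(_slice: &[u8]) {
    // No-op on targets without a supported prefetch intrinsic.
}

/// Hint that the start of `slice` will be read soon.
#[inline(always)]
fn prefetch_slice(slice: &[u8]) {
    // Assumed guard: skip empty slices, whose pointer is dangling.
    if !slice.is_empty() {
        prefetch_slice_impl(slice);
    }
}
```

Exactly one `prefetch_slice_impl` is compiled per target, so callers use `prefetch_slice` without any `cfg` of their own.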

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8d163dc1-f1b5-4657-83ac-1ce8e96c3c54

📥 Commits

Reviewing files that changed from the base of the PR and between ce987d7 and 02b20b7.

📒 Files selected for processing (5)

zstd/src/decoding/decode_buffer.rs
zstd/src/decoding/errors.rs
zstd/src/decoding/mod.rs
zstd/src/decoding/prefetch.rs
zstd/src/decoding/sequence_execution.rs

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

- add regression for repeat(offset=0) -> DecodeBufferError::ZeroOffset
- switch short-offset repeat hot loop to 8-byte phase patterns
- gate x86 prefetch by SSE and remove unnecessary unsafe

Refs #12

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

zstd/src/decoding/decode_buffer.rs (1)
228-242: ⚠️ Potential issue | 🔴 Critical

Update total_output_counter in the full-dictionary branch.

When bytes_from_dict >= match_length, this arm extends dict_slice and returns without bumping self.total_output_counter. That leaves the self.total_output_counter <= self.window_size as u64 gate stale, so later repeats can keep treating dictionary history as in-window after it should have expired.
♻️ Proposed fix
             } else {
                 let low = self.dict_content.len() - bytes_from_dict;
                 let high = low + match_length;
                 let dict_slice = &self.dict_content[low..high];
                 prefetch::prefetch_slice(dict_slice);
                 self.buffer.extend(dict_slice);
+                self.total_output_counter += match_length as u64;
             }
             Ok(())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@zstd/src/decoding/decode_buffer.rs` around lines 228 - 242, The else branch
that handles the full-dictionary copy (when bytes_from_dict >= match_length)
extends self.buffer from self.dict_content but fails to update
self.total_output_counter; update self.total_output_counter by adding
match_length (cast to u64) after the buffer.extend(dict_slice) in that branch so
the total_output_counter remains correct for window expiration checks used by
methods like repeat and the window_size gate.
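
A simplified model of why the counter update matters. The field names and the `total_output_counter <= window_size` gate follow the review text; the struct itself, the `repeat_from_dict_only` method, and its error strings are hypothetical and cover only the full-dictionary branch.

```rust
struct DecodeBuffer {
    buffer: Vec<u8>,
    dict_content: Vec<u8>,
    window_size: usize,
    total_output_counter: u64,
}

impl DecodeBuffer {
    /// Dictionary history is only reachable while total output fits the window.
    fn dict_in_window(&self) -> bool {
        self.total_output_counter <= self.window_size as u64
    }

    /// Copy a match that lies entirely inside the dictionary.
    fn repeat_from_dict_only(
        &mut self,
        bytes_from_dict: usize,
        match_length: usize,
    ) -> Result<(), &'static str> {
        if !self.dict_in_window() {
            return Err("dictionary is no longer inside the window");
        }
        if bytes_from_dict > self.dict_content.len() || match_length > bytes_from_dict {
            return Err("match does not fit inside the dictionary");
        }
        let low = self.dict_content.len() - bytes_from_dict;
        let high = low + match_length;
        self.buffer.extend_from_slice(&self.dict_content[low..high]);
        // The fix under discussion: without this line the gate above
        // would never close for dictionary-only matches.
        self.total_output_counter += match_length as u64;
        Ok(())
    }
}
```

With `window_size = 4`, two 3-byte dictionary copies succeed (counter 3, then 6) and the third is rejected because 6 exceeds the window. Without the increment, all three would succeed.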

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2c50dca5-5e1a-4a6c-8976-f5ec39c2be47

📥 Commits

Reviewing files that changed from the base of the PR and between 02b20b7 and bbd1657.

📒 Files selected for processing (2)

zstd/src/decoding/decode_buffer.rs
zstd/src/decoding/prefetch.rs

- wrap _mm_prefetch calls in unsafe blocks for target-feature safety
- keep x86 SSE gating and no-op fallback intact

Refs #12

- update total_output_counter when bytes_from_dict >= match_length
- add regression test for stale-window guard after full dict branch

Refs #12

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

Copilot AI review requested due to automatic review settings March 28, 2026 16:55

Copilot started reviewing on behalf of polaz March 28, 2026 16:55

coderabbitai Bot reviewed Mar 28, 2026

Comment thread zstd/src/decoding/decode_buffer.rs

Copilot AI reviewed Mar 28, 2026

Comment thread zstd/src/decoding/decode_buffer.rs

Comment thread zstd/src/decoding/decode_buffer.rs

Comment thread zstd/src/decoding/sequence_execution.rs Outdated

fix(decoding): harden offset handling and wraparound coverage

02b20b7

polaz commented Mar 28, 2026

Comment thread zstd/src/decoding/decode_buffer.rs

sw-release-bot Bot reviewed Mar 28, 2026

polaz requested a review from Copilot March 28, 2026 17:38

Copilot started reviewing on behalf of polaz March 28, 2026 17:39

coderabbitai Bot reviewed Mar 28, 2026

Comment thread zstd/src/decoding/decode_buffer.rs

Comment thread zstd/src/decoding/decode_buffer.rs

Comment thread zstd/src/decoding/prefetch.rs Outdated

Copilot AI reviewed Mar 28, 2026

Comment thread zstd/src/decoding/prefetch.rs Outdated

test(decoding): cover zero-offset and tune short-repeat path

bbd1657

coderabbitai Bot reviewed Mar 28, 2026

polaz added 2 commits March 28, 2026 20:29

fix(decoding): restore safe prefetch intrinsic usage

fe13f5a

fix(decoding): advance output counter on full dict copy

1d7f06e

polaz requested a review from Copilot March 28, 2026 18:31

Copilot started reviewing on behalf of polaz March 28, 2026 18:32

Copilot AI reviewed Mar 28, 2026

polaz merged commit 4857f1e into main Mar 28, 2026
14 of 15 checks passed

polaz deleted the perf/#12-sequence-exec-wildcopy branch March 28, 2026 18:36

sw-release-bot Bot mentioned this pull request Mar 28, 2026

chore: release v0.0.4 #40

Merged
