
perf(decoding): optimize sequence execution with overlap fast paths #42

Merged
polaz merged 5 commits into main from perf/#12-sequence-exec-wildcopy
Mar 28, 2026
Conversation

@polaz
Member

@polaz polaz commented Mar 28, 2026

Summary

  • add overlap-aware match copy paths in decode hot path (offset >= 16, offset >= 8, offset < 8)
  • add short-offset replication path for offset < 8
  • add x86/x86_64 prefetch hints for literals, match source, and dictionary slices
  • add regression coverage for overlap-path correctness
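
To make the three offset classes concrete, here is a minimal sketch over a plain `Vec<u8>`. It is not the PR's code, which operates on a ring buffer; the names `repeat_reference`, `repeat_in_chunks`, and `repeat_classified` are hypothetical, and the short-offset case is simplified to chunks of `offset` bytes rather than the pattern replication the PR describes.

```rust
/// Byte-by-byte reference model of an LZ77-style match copy: every output
/// byte is read `offset` bytes behind the current end, so an overlapping
/// match (offset < match_length) replicates the trailing pattern.
fn repeat_reference(out: &mut Vec<u8>, offset: usize, match_length: usize) {
    assert!(offset > 0 && offset <= out.len());
    for _ in 0..match_length {
        let b = out[out.len() - offset];
        out.push(b);
    }
}

/// Copy in chunks of at most `chunk` bytes. Because `chunk <= offset`, the
/// source of each chunk is fully written before it is read.
fn repeat_in_chunks(out: &mut Vec<u8>, offset: usize, match_length: usize, chunk: usize) {
    debug_assert!(chunk > 0 && chunk <= offset);
    let mut remaining = match_length;
    while remaining > 0 {
        let n = remaining.min(chunk);
        let start = out.len() - offset;
        out.extend_from_within(start..start + n);
        remaining -= n;
    }
}

/// Hypothetical dispatcher mirroring the classes named in the summary.
fn repeat_classified(out: &mut Vec<u8>, offset: usize, match_length: usize) {
    assert!(offset > 0 && offset <= out.len());
    if match_length == 0 {
        return;
    }
    if offset >= match_length {
        // No overlap: the whole source already exists, one bulk copy suffices.
        let start = out.len() - offset;
        out.extend_from_within(start..start + match_length);
    } else if offset >= 16 {
        repeat_in_chunks(out, offset, match_length, 16);
    } else if offset >= 8 {
        repeat_in_chunks(out, offset, match_length, 8);
    } else {
        // Short offset: simplified here to one pattern period per chunk.
        repeat_in_chunks(out, offset, match_length, offset);
    }
}
```

Comparing `repeat_classified` against `repeat_reference` over a grid of `(offset, match_length)` pairs is the same style of check the regression coverage bullet refers to.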

Validation

  • cargo fmt
  • cargo clippy --all-targets -- -D warnings
  • cargo nextest run
  • cargo build

Closes #12

Summary by CodeRabbit

  • Refactor

    • Improved decoding performance with memory prefetching, CPU-targeted branchless fast-copy paths (x86/x86_64), and refined handling of overlapping and short-offset repeats; added a prefetch helper used for literals and dictionary data.
  • Bug Fixes

    • New explicit error for illegal zero offsets; zero-length repeats are now immediate no-ops.
  • Tests

    • Added unit tests validating overlap expansion, ringbuffer-wrapped draining, zero-offset behavior, and dictionary repeat accounting.

- split repeat into overlap-aware paths (>=16, >=8, <8)

- add short-offset replication path with chunked pattern writes

- add match/literal/dictionary prefetch hints for x86/x86_64

- add regression test covering overlap fast-path correctness

Refs #12
Copilot AI review requested due to automatic review settings March 28, 2026 16:55
@coderabbitai

coderabbitai Bot commented Mar 28, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2f9b9903-4d24-4a83-b0d1-7b0a4941c948

📥 Commits

Reviewing files that changed from the base of the PR and between bbd1657 and 1d7f06e.

📒 Files selected for processing (2)
  • zstd/src/decoding/decode_buffer.rs
  • zstd/src/decoding/prefetch.rs

Walkthrough

DecodeBuffer repeat now rejects zero offsets, early-returns on zero-length matches, prefetches match and dictionary sources, and dispatches overlap-aware vs non-overlapping repeat strategies with branchless copy selection on some targets. execute_sequences prefetches literal slices. A new prefetch module and tests were added.
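
The guard behavior described above can be sketched as follows. This is an illustration on a plain `Vec<u8>`, not the PR's implementation: `repeat_checked` and the `OffsetTooBig` variant are hypothetical, and checking the zero offset before the zero-length early return is an assumed ordering. Only `ZeroOffset` and its message come from the thread.

```rust
use std::fmt;

#[derive(Debug, PartialEq, Eq)]
enum DecodeBufferError {
    /// Named in the PR: an offset of 0 is illegal.
    ZeroOffset,
    /// Hypothetical variant, added only so the sketch is total.
    OffsetTooBig { offset: usize, buf_len: usize },
}

impl fmt::Display for DecodeBufferError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // Message text as quoted in the walkthrough table.
            Self::ZeroOffset => write!(f, "Illegal offset: 0 found"),
            Self::OffsetTooBig { offset, buf_len } => {
                write!(f, "offset {} exceeds buffer length {}", offset, buf_len)
            }
        }
    }
}

/// Hypothetical stand-in for `DecodeBuffer::repeat`.
fn repeat_checked(
    out: &mut Vec<u8>,
    offset: usize,
    match_length: usize,
) -> Result<(), DecodeBufferError> {
    // Assumed ordering: reject the illegal offset first.
    if offset == 0 {
        return Err(DecodeBufferError::ZeroOffset);
    }
    // Zero-length matches are immediate no-ops.
    if match_length == 0 {
        return Ok(());
    }
    if offset > out.len() {
        return Err(DecodeBufferError::OffsetTooBig { offset, buf_len: out.len() });
    }
    // Plain byte-wise copy; the fast paths are out of scope for this sketch.
    for _ in 0..match_length {
        let b = out[out.len() - offset];
        out.push(b);
    }
    Ok(())
}
```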

Changes

| Cohort / File(s) | Summary |
|---|---|
| Decode buffer core: zstd/src/decoding/decode_buffer.rs | Validate zero offset (ZeroOffset), early-return for zero-length matches, prefetch match source, replace prior overlap path with repeat_overlapping and helpers (repeat_in_chunks, repeat_short_offset), add byte_at, prefetch_match_source, and branchless/non-branchless non-overlap branches. |
| Sequence execution: zstd/src/decoding/sequence_execution.rs | Call prefetch_literals(slice) for each non-empty literals slice before advancing the copy counter; added private prefetch_literals helper. |
| Errors: zstd/src/decoding/errors.rs | Added DecodeBufferError::ZeroOffset and updated Display to emit "Illegal offset: 0 found". |
| Prefetch module: zstd/src/decoding/mod.rs, zstd/src/decoding/prefetch.rs | Exported pub(crate) mod prefetch; implemented prefetch_slice(slice) with arch-specific prefetch_slice_impl (x86/x86_64 use _mm_prefetch, others no-op). |
| Dictionary/ringbuffer interactions: zstd/src/decoding/decode_buffer.rs (repeat_from_dict path) | Prefetch dictionary slice before extending buffer; ensure total output accounting updated in tests. |
| Tests: zstd/src/decoding/...tests | Added tests covering overlap expansion correctness vs reference model for many (offset, match_length) cases (including ringbuffer wrap/drain), zero-offset error, and dictionary repeat counter behavior. |

Sequence Diagram(s)

sequenceDiagram
  participant Executor as Executor (execute_sequences)
  participant Prefetch as Prefetch
  participant Decode as DecodeBuffer::repeat
  participant Ring as RingBuffer / Output
  participant Dict as Dictionary

  Executor->>Prefetch: prefetch_literals(literals_slice)
  Executor->>Ring: copy literals -> output
  Executor->>Decode: request repeat(offset, match_len)
  Decode->>Prefetch: prefetch_match_source(start_idx)
  alt repeating from dictionary
    Decode->>Prefetch: prefetch_slice(dict_slice)
    Prefetch->>Dict: warm dictionary data
    Decode->>Ring: extend from dictionary
  else repeating from ringbuffer
    Prefetch->>Ring: warm source region
    Decode->>Ring: choose repeat_overlapping / repeat_in_chunks / repeat_short_offset or branchless non-overlap
  end
  Ring-->>Decode: write complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

I nibble bytes and prefetch dreams, 🐇
Offsets checked, no zero schemes,
Overlap hops in careful beats,
Branchless leaps where cache warmth meets,
A rabbit hums — the stream completes.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Linked Issues check | ⚠️ Warning | The PR implements overlap-aware fast paths, prefetch infrastructure, short-offset replication, and improved error handling. However, wildcopy overlength allocation (overlapping reads/writes past logical buffer bounds) is not present in the changeset; the implementation uses safe unchecked access instead. | Implement overlength buffer allocation and wildcopy patterns as specified in #12, allowing 8-16 byte-at-a-time copies without per-iteration bounds checks. Verify that the current 'branchless' approach fully meets the performance goals. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | Title clearly summarizes the main optimization focus: sequence execution with overlap fast paths, which aligns with the changeset's core improvements to decode buffer repeat operations. |
| Out of Scope Changes check | ✅ Passed | All changes directly support the linked issue objectives: overlap handling, prefetch hints, error handling enhancements, and architecture-specific optimizations. No unrelated modifications detected. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 86.96% which is sufficient. The required threshold is 80.00%. |


@codecov

codecov Bot commented Mar 28, 2026

Codecov Report

❌ Patch coverage is 97.20670% with 5 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| zstd/src/decoding/decode_buffer.rs | 97.59% | 4 Missing ⚠️ |
| zstd/src/decoding/errors.rs | 0.00% | 1 Missing ⚠️ |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/decoding/decode_buffer.rs`:
- Around line 150-167: Add a defensive debug assertion at the start of fn
repeat_short_offset(&mut self, offset: usize, match_length: usize, start_idx:
usize) to ensure offset > 0, documenting the invariant and preventing a
division-by-zero panic in the modular indexing (base[(copied + j) % offset]);
place a debug_assert!(offset > 0) (or equivalent) as the first statement in
repeat_short_offset so the check runs in debug builds and clarifies the
precondition assumed by execute_sequences/ZeroOffset.
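
A minimal sketch of the invariant this comment asks for, assuming a plain `Vec<u8>` in place of the ring buffer. The function name `repeat_short_offset_modulo` and its signature are hypothetical; only the `base[... % offset]` indexing and the `debug_assert!(offset > 0)` come from the comment.

```rust
/// Hypothetical per-byte version of the short-offset path.
/// The `% offset` below would panic with a division by zero if `offset`
/// were 0, so the precondition is asserted first in debug builds.
fn repeat_short_offset_modulo(out: &mut Vec<u8>, offset: usize, match_length: usize) {
    debug_assert!(offset > 0, "offset must be non-zero (caller rejects ZeroOffset)");
    debug_assert!(offset < 8 && offset <= out.len());

    // Snapshot the last `offset` bytes: this is the repeating pattern.
    let start = out.len() - offset;
    let mut base = [0u8; 8];
    base[..offset].copy_from_slice(&out[start..]);

    // Emit the pattern cyclically, one byte at a time.
    for i in 0..match_length {
        out.push(base[i % offset]);
    }
}
```

`debug_assert!` compiles to nothing in release builds, so it documents the precondition without adding work to the hot path.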

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: d31ea7bd-a538-4823-b096-9b01b622b426

📥 Commits

Reviewing files that changed from the base of the PR and between 6fc1fc3 and ce987d7.

📒 Files selected for processing (2)
  • zstd/src/decoding/decode_buffer.rs
  • zstd/src/decoding/sequence_execution.rs

Comment thread zstd/src/decoding/decode_buffer.rs

Copilot AI left a comment


Pull request overview

This PR targets the decompression hot path by adding overlap-aware match-copy fast paths and lightweight prefetching to reduce branch overhead and improve cache behavior during sequence execution.

Changes:

  • Add x86/x86_64 literal prefetching in execute_sequences.
  • Rework DecodeBuffer::repeat to use overlap-classified copy paths (≥16, ≥8, <8), add dictionary/match-source prefetching, and route non-overlapping copies to an optional branchless copy routine.
  • Add a regression test validating output correctness across several overlap boundary cases.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| zstd/src/decoding/sequence_execution.rs | Adds x86/x86_64 prefetch for literal slices before copying into the output buffer. |
| zstd/src/decoding/decode_buffer.rs | Introduces overlap-aware match copy paths, match/dict prefetching, and adds regression tests for overlap correctness. |

Comment thread zstd/src/decoding/decode_buffer.rs
Comment thread zstd/src/decoding/decode_buffer.rs
Comment thread zstd/src/decoding/sequence_execution.rs Outdated
- return DecodeBufferError::ZeroOffset for repeat(offset=0)

- add wrapped-ringbuffer overlap regression coverage

- move prefetch helper into shared decoding::prefetch module

Refs #12
Comment thread zstd/src/decoding/decode_buffer.rs

@sw-release-bot sw-release-bot Bot left a comment


⚠️ Performance Alert ⚠️

A possible performance regression was detected for the benchmark 'structured-zstd vs C FFI'.
The result for this commit is worse than the previous result by more than the threshold ratio of 1.15.

| Benchmark suite | Current: 1d7f06e | Previous: 6fc1fc3 | Ratio |
|---|---|---|---|
| compress/fastest/small-1k-random/matrix/c_ffi | 0.007 ms | 0.005 ms | 1.40 |
| compress/default/small-1k-random/matrix/c_ffi | 0.022 ms | 0.018 ms | 1.22 |
| compress-dict/fastest/small-4k-log-lines/matrix/c_ffi_without_dict | 0.005 ms | 0.004 ms | 1.25 |
| compress-dict/default/small-4k-log-lines/matrix/c_ffi_without_dict | 0.005 ms | 0.004 ms | 1.25 |

This comment was automatically generated by workflow using github-action-benchmark.

CC: @polaz


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/decoding/decode_buffer.rs`:
- Around line 155-176: repeat_short_offset currently builds scratch per output
byte using a modulo and scalar stores; instead, after filling base in
repeat_short_offset, construct a single 8-byte pattern (u64 or [u8;8]) that
contains the offset-length repeat of base and then emit into self.buffer in
word-sized (8-byte) chunks by copying that 8-byte pattern repeatedly (using bulk
extend/append operations) and only handle the final <8 tail bytes with a small
scalar copy; update repeat_short_offset to avoid per-byte modulo and use the
8-byte replication trick for the main loop, falling back to copying the
remaining bytes at the end.
- Around line 68-75: Add a focused regression test that calls the repeat method
with offset == 0 (e.g., repeat(0, n)) and asserts it returns
Err(DecodeBufferError::ZeroOffset) instead of succeeding or panicking; locate
the test near other decode_buffer tests and target the repeat(&mut self, offset:
usize, match_length: usize) function so the new contract for
DecodeBufferError::ZeroOffset is exercised and prevented from being accidentally
removed.
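
One way to read the 8-byte suggestion, sketched on a plain `Vec<u8>`. A single 8-byte pattern only repeats cleanly when `offset` divides 8, so this sketch precomputes one pattern per starting phase and advances the phase once per chunk. The function name and the phase table are assumptions made for illustration, not the PR's code.

```rust
/// Hypothetical short-offset replication using 8-byte patterns.
/// The modulo runs while building the table and once for the step size,
/// not once per output byte.
fn repeat_short_offset_patterns(out: &mut Vec<u8>, offset: usize, match_length: usize) {
    debug_assert!(offset > 0 && offset < 8 && offset <= out.len());

    // Snapshot the repeating unit (the last `offset` bytes).
    let start = out.len() - offset;
    let mut base = [0u8; 8];
    base[..offset].copy_from_slice(&out[start..]);

    // patterns[p] holds 8 bytes of the cyclic sequence starting at phase p.
    let mut patterns = [[0u8; 8]; 8];
    for (phase, pattern) in patterns.iter_mut().enumerate().take(offset) {
        for (j, slot) in pattern.iter_mut().enumerate() {
            *slot = base[(phase + j) % offset];
        }
    }

    // After writing 8 bytes, the phase advances by 8 mod offset.
    let step = 8 % offset;
    let mut phase = 0;
    let mut remaining = match_length;
    while remaining >= 8 {
        out.extend_from_slice(&patterns[phase]);
        phase += step;
        if phase >= offset {
            phase -= offset;
        }
        remaining -= 8;
    }
    // Tail of fewer than 8 bytes, taken from the current phase's pattern.
    out.extend_from_slice(&patterns[phase][..remaining]);
}
```

For `offset = 3` the first chunk is `abcabcab`, so the next chunk has to start at phase 2 (`c`); tracking the phase is what keeps the output identical to the byte-by-byte model.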

In `@zstd/src/decoding/prefetch.rs`:
- Around line 6-17: The prefetch implementation should be split into two
arch-specific versions to avoid the unnecessary unsafe and to gate 32-bit x86 on
SSE: create one #[cfg(target_arch = "x86_64")] #[inline(always)] fn
prefetch_slice_impl(slice: &[u8]) that imports core::arch::x86_64::{_MM_HINT_T0,
_mm_prefetch} and calls _mm_prefetch directly (no unsafe), and create a
#[cfg(all(target_arch = "x86", target_feature = "sse"))] #[inline(always)] fn
prefetch_slice_impl(slice: &[u8]) that imports core::arch::x86::{_MM_HINT_T0,
_mm_prefetch} and calls _mm_prefetch (no unsafe); also add a fallback
#[cfg(not(any(target_arch = "x86_64", all(target_arch = "x86", target_feature =
"sse"))))] #[inline(always)] fn prefetch_slice_impl(_slice: &[u8]) { /* no-op */
} so code compiles without SSE on 32-bit x86 and removes unsafe usage while
keeping the same API.
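
The cfg split described here might look roughly like the sketch below. It keeps the calls inside `unsafe` blocks, as a later commit in this thread does, because `_mm_prefetch` takes a raw pointer. The empty-slice check in `prefetch_slice` is an assumption, and the real module's visibility and layout may differ.

```rust
// x86_64: SSE is part of the baseline, so no feature gate is needed.
#[cfg(target_arch = "x86_64")]
#[inline(always)]
fn prefetch_slice_impl(slice: &[u8]) {
    use core::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
    // SAFETY: prefetch is only a cache hint; the pointer comes from a live slice.
    unsafe { _mm_prefetch::<_MM_HINT_T0>(slice.as_ptr() as *const i8) };
}

// 32-bit x86: only compile the intrinsic when SSE is statically enabled.
#[cfg(all(target_arch = "x86", target_feature = "sse"))]
#[inline(always)]
fn prefetch_slice_impl(slice: &[u8]) {
    use core::arch::x86::{_mm_prefetch, _MM_HINT_T0};
    // SAFETY: same reasoning as the x86_64 version.
    unsafe { _mm_prefetch::<_MM_HINT_T0>(slice.as_ptr() as *const i8) };
}

// Every other target (including x86 without SSE): no-op fallback.
#[cfg(not(any(
    target_arch = "x86_64",
    all(target_arch = "x86", target_feature = "sse")
)))]
#[inline(always)]
fn prefetch_slice_impl(_slice: &[u8]) {}

/// Hint that the start of `slice` will be read soon.
#[inline(always)]
fn prefetch_slice(slice: &[u8]) {
    // Assumption: skip empty slices, since there is nothing to warm.
    if !slice.is_empty() {
        prefetch_slice_impl(slice);
    }
}
```

Exactly one `prefetch_slice_impl` is compiled for any target, so callers see the same function on every platform.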

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8d163dc1-f1b5-4657-83ac-1ce8e96c3c54

📥 Commits

Reviewing files that changed from the base of the PR and between ce987d7 and 02b20b7.

📒 Files selected for processing (5)
  • zstd/src/decoding/decode_buffer.rs
  • zstd/src/decoding/errors.rs
  • zstd/src/decoding/mod.rs
  • zstd/src/decoding/prefetch.rs
  • zstd/src/decoding/sequence_execution.rs

Comment thread zstd/src/decoding/decode_buffer.rs
Comment thread zstd/src/decoding/decode_buffer.rs
Comment thread zstd/src/decoding/prefetch.rs Outdated

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comment thread zstd/src/decoding/prefetch.rs Outdated
- add regression for repeat(offset=0) -> DecodeBufferError::ZeroOffset

- switch short-offset repeat hot loop to 8-byte phase patterns

- gate x86 prefetch by SSE and remove unnecessary unsafe

Refs #12

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
zstd/src/decoding/decode_buffer.rs (1)

228-242: ⚠️ Potential issue | 🔴 Critical

Update total_output_counter in the full-dictionary branch.

When bytes_from_dict >= match_length, this arm extends dict_slice and returns without bumping self.total_output_counter. That leaves the self.total_output_counter <= self.window_size as u64 gate stale, so later repeats can keep treating dictionary history as in-window after it should have expired.

♻️ Proposed fix
             } else {
                 let low = self.dict_content.len() - bytes_from_dict;
                 let high = low + match_length;
                 let dict_slice = &self.dict_content[low..high];
                 prefetch::prefetch_slice(dict_slice);
                 self.buffer.extend(dict_slice);
+                self.total_output_counter += match_length as u64;
             }
             Ok(())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@zstd/src/decoding/decode_buffer.rs` around lines 228 - 242, The else branch
that handles the full-dictionary copy (when bytes_from_dict >= match_length)
extends self.buffer from self.dict_content but fails to update
self.total_output_counter; update self.total_output_counter by adding
match_length (cast to u64) after the buffer.extend(dict_slice) in that branch so
the total_output_counter remains correct for window expiration checks used by
methods like repeat and the window_size gate.
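
To show why the missing increment matters, here is a toy model of the full-dictionary branch only. The `DictWindow` struct, its `dict_in_window` helper, and the string errors are hypothetical; the field names and the `total_output_counter <= window_size` gate follow the comment above.

```rust
/// Hypothetical, simplified stand-in for the decode buffer's dictionary state.
struct DictWindow {
    buffer: Vec<u8>,
    dict_content: Vec<u8>,
    window_size: usize,
    total_output_counter: u64,
}

impl DictWindow {
    /// Dictionary history is reachable only while total output fits the window.
    fn dict_in_window(&self) -> bool {
        self.total_output_counter <= self.window_size as u64
    }

    /// Models only the branch where the whole match comes from the dictionary
    /// (bytes_from_dict >= match_length).
    fn repeat_from_dict(
        &mut self,
        bytes_from_dict: usize,
        match_length: usize,
    ) -> Result<(), &'static str> {
        if !self.dict_in_window() {
            return Err("dictionary is no longer inside the window");
        }
        if bytes_from_dict > self.dict_content.len() || match_length > bytes_from_dict {
            return Err("match does not fit inside the dictionary");
        }
        let low = self.dict_content.len() - bytes_from_dict;
        let high = low + match_length;
        self.buffer.extend_from_slice(&self.dict_content[low..high]);
        // The fix: count these bytes as output so the gate above can expire.
        self.total_output_counter += match_length as u64;
        Ok(())
    }
}
```

With `window_size = 4`, a first 5-byte dictionary match pushes the counter past the window and a second dictionary match is rejected. Without the increment the counter would stay at 0 and the second match would still be served from the dictionary.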

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2c50dca5-5e1a-4a6c-8976-f5ec39c2be47

📥 Commits

Reviewing files that changed from the base of the PR and between 02b20b7 and bbd1657.

📒 Files selected for processing (2)
  • zstd/src/decoding/decode_buffer.rs
  • zstd/src/decoding/prefetch.rs

polaz added 2 commits March 28, 2026 20:29
- wrap _mm_prefetch calls in unsafe blocks for target-feature safety

- keep x86 SSE gating and no-op fallback intact

Refs #12
- update total_output_counter when bytes_from_dict >= match_length

- add regression test for stale-window guard after full dict branch

Refs #12

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.


Development

Successfully merging this pull request may close these issues.

perf: sequence execution — wildcopy, overlap handling, and prefetch

2 participants