perf(encoding): SIMD-dispatch common_prefix_len #96

Merged
polaz merged 3 commits into main from perf/#70-simd-match-length
Apr 9, 2026

@polaz
Member

@polaz polaz commented Apr 9, 2026

Summary

  • add runtime SIMD dispatch for common_prefix_len with AVX2, SSE2, and AArch64 NEON kernels (little-endian AArch64 only)
  • keep scalar fallback for unsupported targets and tail handling to preserve bounds safety
  • add scalar-parity regression test covering varying lengths and mismatch offsets
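
For orientation, here is a minimal sketch of what a bounds-safe scalar fallback can look like, assuming a word-at-a-time comparison followed by a byte-wise tail. The function name and body are illustrative assumptions, not the code merged in this PR.

```rust
/// Hypothetical scalar fallback (not the PR's code): length of the common
/// prefix of `a` and `b`, compared 8 bytes at a time with a byte-wise tail.
fn scalar_prefix_len_sketch(a: &[u8], b: &[u8]) -> usize {
    let max = a.len().min(b.len());
    let mut off = 0usize;
    // Whole 8-byte words: the loop condition keeps every slice access in bounds.
    while off + 8 <= max {
        let mut wa = [0u8; 8];
        let mut wb = [0u8; 8];
        wa.copy_from_slice(&a[off..off + 8]);
        wb.copy_from_slice(&b[off..off + 8]);
        // from_le_bytes places the first byte in memory in the lowest bits,
        // so the lowest set bit of the XOR marks the first differing byte.
        let diff = u64::from_le_bytes(wa) ^ u64::from_le_bytes(wb);
        if diff != 0 {
            return off + (diff.trailing_zeros() / 8) as usize;
        }
        off += 8;
    }
    // Tail: fewer than 8 bytes remain, so finish one byte at a time.
    while off < max && a[off] == b[off] {
        off += 1;
    }
    off
}
```

Because the word loop only runs while `off + 8 <= max`, the tail loop is the only place that touches the last few bytes, which is the property the bounds-safety bullet refers to.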

Validation

  • cargo build -p structured-zstd --features hash,std,dict_builder
  • cargo fmt --all -- --check
  • cargo clippy -p structured-zstd --all-targets --features hash,std,dict_builder -- -D warnings
  • cargo nextest run -p structured-zstd --features hash,std,dict_builder

Closes #70

Summary by CodeRabbit

Release Notes

  • Performance Improvements

    • Optimized prefix-length computation with architecture-specific SIMD implementations for x86 (SSE2, AVX2) and ARM64 (NEON), improving encoding performance across supported architectures.
  • Tests

    • Added comprehensive unit tests for prefix-length computation validation across multiple slice lengths, offsets, and edge cases.

Copilot AI review requested due to automatic review settings April 9, 2026 22:11
@coderabbitai

coderabbitai Bot commented Apr 9, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 95ac7525-90e3-41a5-ad19-8e82f4c8a94f

📥 Commits

Reviewing files that changed from the base of the PR and between 362caf1 and da54c27.

📒 Files selected for processing (1)
  • zstd/src/encoding/match_generator.rs

📝 Walkthrough

Dispatches common-prefix length computation at runtime to architecture-specific SIMD kernels (x86 SSE2/AVX2, aarch64 NEON) via a new PrefixKernel detector, then continues with a refactored scalar tail routine; adds a unit test verifying parity with the scalar reference.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Match generator: SIMD dispatch & scalar refactor (zstd/src/encoding/match_generator.rs) | Introduced PrefixKernel and detect_prefix_kernel() (cached via OnceLock when std), added SIMD implementations (SSE2 16-byte, AVX2 32-byte, NEON 16-byte) used by common_prefix_len() at runtime, extracted scalar continuation into common_prefix_len_scalar(a, b, off, max), and added common_prefix_len_matches_scalar_reference_across_offsets unit test. |
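
As a rough illustration of the cached detection described above, the sketch below uses `std::sync::OnceLock` with the standard feature-detection macros. The names `PrefixKernel` and `detect_prefix_kernel` come from the walkthrough; the variant names, the cfg layout, and the function body are assumptions and may differ from the merged code.

```rust
use std::sync::OnceLock;

/// Kernel choices; variant names are assumed for this sketch.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum PrefixKernel {
    Scalar,
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    Sse2,
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    Avx2,
    #[cfg(all(target_arch = "aarch64", target_endian = "little"))]
    Neon,
}

/// Probe CPU features once and reuse the cached answer on later calls.
fn detect_prefix_kernel() -> PrefixKernel {
    static KERNEL: OnceLock<PrefixKernel> = OnceLock::new();
    *KERNEL.get_or_init(|| {
        #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
        {
            // Prefer the widest kernel the running CPU supports.
            if std::arch::is_x86_feature_detected!("avx2") {
                return PrefixKernel::Avx2;
            }
            if std::arch::is_x86_feature_detected!("sse2") {
                return PrefixKernel::Sse2;
            }
        }
        #[cfg(all(target_arch = "aarch64", target_endian = "little"))]
        {
            if std::arch::is_aarch64_feature_detected!("neon") {
                return PrefixKernel::Neon;
            }
        }
        // Unsupported target or no usable feature: stay on the scalar path.
        PrefixKernel::Scalar
    })
}
```

Caching matters here because the prefix-length routine sits on a hot path, so the feature probe should run once per process rather than once per call.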

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant Detector
    participant Kernel as SIMD Kernel
    participant Scalar as Scalar Tail

    Caller->>Detector: request prefix_length(a, b, max)
    Detector-->>Caller: selected PrefixKernel
    Caller->>Kernel: run SIMD loop(a, b, off, max)
    Kernel-->>Caller: mismatch offset or chunk-complete
    alt remaining bytes > 0
        Caller->>Scalar: common_prefix_len_scalar(a, b, off, max)
        Scalar-->>Caller: final match length
    else
        Kernel-->>Caller: final match length
    end

    rect rgba(0,128,255,0.5)
    note over Detector,Kernel: runtime feature detection + kernel dispatch
    end
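
The flow in the diagram (SIMD loop over full chunks, then a scalar tail) can be sketched as follows for the SSE2 case. This is an assumption-laden illustration: `prefix_len_sketch` and `scalar_tail` are hypothetical names, the sketch skips runtime dispatch because SSE2 is part of the x86_64 baseline, and other targets simply run the scalar loop.

```rust
/// Hypothetical scalar tail: continue byte by byte from `off` up to `max`.
fn scalar_tail(a: &[u8], b: &[u8], mut off: usize, max: usize) -> usize {
    while off < max && a[off] == b[off] {
        off += 1;
    }
    off
}

/// Hypothetical driver: 16-byte SSE2 chunks on x86_64, then the scalar tail.
fn prefix_len_sketch(a: &[u8], b: &[u8]) -> usize {
    let max = a.len().min(b.len());
    #[allow(unused_mut)]
    let mut off = 0usize;
    #[cfg(target_arch = "x86_64")]
    {
        use std::arch::x86_64::{__m128i, _mm_cmpeq_epi8, _mm_loadu_si128, _mm_movemask_epi8};
        while off + 16 <= max {
            // SAFETY: `off + 16 <= max` keeps both unaligned 16-byte loads
            // inside the slices, and SSE2 is always available on x86_64.
            let mask = unsafe {
                let va = _mm_loadu_si128(a.as_ptr().add(off) as *const __m128i);
                let vb = _mm_loadu_si128(b.as_ptr().add(off) as *const __m128i);
                _mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) as u32
            };
            // Bit i of `mask` is set when byte i matched in both inputs.
            if mask != 0xFFFF {
                // The lowest clear bit is the first mismatching byte.
                return off + (!mask).trailing_zeros() as usize;
            }
            off += 16;
        }
    }
    // Fewer than 16 bytes remain (or no SIMD path): finish with scalar code.
    scalar_tail(a, b, off, max)
}
```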

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐇 I hopped through memory, nibbling bytes with care,
SSE, AVX, NEON — I checked them everywhere.
I picked the fastest kernel, then finished with a tuck,
tails and tails — a match-length munching luck!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'perf(encoding): SIMD-dispatch common_prefix_len' accurately summarizes the main change: adding SIMD dispatch to the common_prefix_len function for performance improvement. |
| Linked Issues check | ✅ Passed | The PR implements all core objectives from issue #70: runtime SIMD dispatch for SSE2, AVX2, and NEON kernels with scalar fallback, proper tail handling, and regression test validation. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to match_generator.rs and directly support the SIMD dispatch implementation; no unrelated modifications detected. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |

@codecov

codecov Bot commented Apr 9, 2026

Codecov Report

❌ Patch coverage is 77.35849% with 24 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| zstd/src/encoding/match_generator.rs | 77.35% | 24 Missing ⚠️ |

Copilot AI left a comment

Pull request overview

Adds runtime SIMD dispatch to accelerate MatchGenerator::common_prefix_len (hot match-length path in the encoder), with AVX2/SSE/NEON implementations plus a scalar fallback and a regression test to ensure SIMD/scalar parity.

Changes:

  • Introduces a PrefixKernel enum and detect_prefix_kernel() runtime dispatch (cached via OnceLock under std).
  • Adds x86 AVX2 + “SSE4.2” and AArch64 NEON prefix-length kernels, falling back to the existing scalar logic for tails/unsupported targets.
  • Adds a scalar-parity regression test covering multiple lengths and mismatch offsets.

Comment thread zstd/src/encoding/match_generator.rs
Comment thread zstd/src/encoding/match_generator.rs

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 3624-3652: Extend the parity test for
MatchGenerator::common_prefix_len / scalar_reference to include misaligned
subslices and larger mismatch offsets: add tests that call common_prefix_len on
slices starting at non-zero offsets (e.g., &buf[1..], &buf[3..]) so
unaligned/unsafe loads are exercised, and include mismatch positions beyond 127
(e.g., 128, 129, len - 1) as well as mismatches near the end of buffers to
exercise AVX2/loop tails; update the existing loops to iterate these additional
start offsets and mismatch indexes against both equal and single-byte-altered
clones to assert parity with scalar_reference.
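
A parity matrix along the lines requested above might look like the sketch below. It is an assumption, not the test that landed: `candidate` is a stand-in for the function under test (a plain 16-byte chunked compare), and `scalar_reference` and `parity_matrix` are hypothetical names. The loops cover non-zero start offsets and mismatch positions at 128, 129, and `len - 1`.

```rust
/// Byte-wise reference implementation.
fn scalar_reference(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b.iter()).take_while(|(x, y)| x == y).count()
}

/// Stand-in for the function under test: 16-byte chunks, then a byte tail.
fn candidate(a: &[u8], b: &[u8]) -> usize {
    let max = a.len().min(b.len());
    let mut off = 0usize;
    while off + 16 <= max && a[off..off + 16] == b[off..off + 16] {
        off += 16;
    }
    while off < max && a[off] == b[off] {
        off += 1;
    }
    off
}

/// Runs the matrix and returns the number of cases checked; panics on drift.
fn parity_matrix() -> usize {
    let base: Vec<u8> = (0..300usize).map(|i| (i * 31 % 251) as u8).collect();
    let mut cases = 0usize;
    // Non-zero starts make the borrowed subslice begin at an odd address.
    for &start in &[0usize, 1, 3] {
        for &len in &[0usize, 1, 15, 16, 17, 31, 32, 33, 127, 128, 129, 200] {
            let a = &base[start..start + len];
            // Identical inputs: the whole slice should match.
            assert_eq!(candidate(a, a), scalar_reference(a, a));
            cases += 1;
            let last = len.saturating_sub(1);
            for &pos in &[0usize, 15, 16, 31, 32, 127, 128, 129, last] {
                if pos >= len {
                    continue;
                }
                // Flip one byte in a copy and expect the match to stop there.
                let mut altered = a.to_vec();
                altered[pos] ^= 0xFF;
                assert_eq!(candidate(a, &altered), scalar_reference(a, &altered));
                assert_eq!(candidate(a, &altered), pos);
                cases += 1;
            }
        }
    }
    cases
}
```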

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9d91c598-6a0b-47cc-99c1-9ea24db3ded1

📥 Commits

Reviewing files that changed from the base of the PR and between b0a3e17 and 20b7b5d.

📒 Files selected for processing (1)
  • zstd/src/encoding/match_generator.rs

Comment thread zstd/src/encoding/match_generator.rs Outdated
- Retarget x86 fallback kernel from SSE4.2 to SSE2 to match used intrinsics
- Remove NEON stack roundtrip in mismatch scan via lane extraction
- Extend common_prefix_len parity matrix for misaligned and >127 mismatch cases

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Comment thread zstd/src/encoding/match_generator.rs
Comment thread zstd/src/encoding/match_generator.rs Outdated

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 10-11: The NEON path is incorrectly enabled on big-endian AArch64
because the import and NEON-specific code are only gated by #[cfg(target_arch =
"aarch64")]; update all NEON-related cfgs to require little-endian as well
(#[cfg(all(target_arch = "aarch64", target_endian = "little"))]). Specifically,
change the import line that brings in uint8x16_t, vceqq_u8, vgetq_lane_u64,
vld1q_u8, vreinterpretq_u64_u8 to use #[cfg(all(target_arch = "aarch64",
target_endian = "little"))], and apply the same cfg adjustment to the NEON
kernel variant definitions, the NEON detection logic, and the NEON dispatch
points (references: the symbol vreinterpretq_u64_u8 and the NEON kernel/variant
names and detection/dispatch code in match_generator.rs).
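
To make the suggested gating concrete, here is a hedged sketch of a NEON helper that exists only on little-endian AArch64, with a portable fallback everywhere else. The intrinsics are the ones named in the comment above; `first_mismatch_16` is a hypothetical helper, and the sketch assumes NEON is enabled at compile time, as it is on standard AArch64 std targets.

```rust
#[cfg(all(target_arch = "aarch64", target_endian = "little"))]
use std::arch::aarch64::{vceqq_u8, vgetq_lane_u64, vld1q_u8, vreinterpretq_u64_u8};

/// Index of the first differing byte in two 16-byte blocks, or None if equal.
#[cfg(all(target_arch = "aarch64", target_endian = "little"))]
fn first_mismatch_16(a: &[u8; 16], b: &[u8; 16]) -> Option<usize> {
    // SAFETY: both pointers reference 16 readable bytes.
    let (lo, hi) = unsafe {
        let eq = vreinterpretq_u64_u8(vceqq_u8(vld1q_u8(a.as_ptr()), vld1q_u8(b.as_ptr())));
        // Extract both 64-bit lanes directly instead of storing to the stack.
        (vgetq_lane_u64::<0>(eq), vgetq_lane_u64::<1>(eq))
    };
    // Equal bytes compare to 0xFF, so invert to mark the differing bytes.
    let (lo, hi) = (!lo, !hi);
    if lo != 0 {
        // Little-endian lanes: the lowest set bit is the earliest byte.
        Some((lo.trailing_zeros() / 8) as usize)
    } else if hi != 0 {
        Some(8 + (hi.trailing_zeros() / 8) as usize)
    } else {
        None
    }
}

/// Portable fallback for every other target, including big-endian AArch64.
#[cfg(not(all(target_arch = "aarch64", target_endian = "little")))]
fn first_mismatch_16(a: &[u8; 16], b: &[u8; 16]) -> Option<usize> {
    a.iter().zip(b.iter()).position(|(x, y)| x != y)
}
```

The `trailing_zeros` step is what ties this variant to little-endian byte order, which is why the import, the kernel, and any dispatch arm would all need the same two-part cfg.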

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 74590dfe-85bc-4370-84c8-23bb51baeae2

📥 Commits

Reviewing files that changed from the base of the PR and between 20b7b5d and 362caf1.

📒 Files selected for processing (1)
  • zstd/src/encoding/match_generator.rs

Comment thread zstd/src/encoding/match_generator.rs Outdated
- Restrict AArch64 NEON prefix path to little-endian targets
- Reuse endian-aware mismatch byte index helper in NEON lane extraction
- Keep scalar/SSE2/AVX2 dispatch behavior unchanged
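
An endian-aware mismatch index helper of the kind mentioned in this commit message could look roughly like this. The name `mismatch_byte_index` and its signature are assumptions for illustration; the helper in the repository may be shaped differently.

```rust
/// Hypothetical helper: given the non-zero XOR of two words loaded in native
/// byte order, return the index of the first differing byte in memory order.
fn mismatch_byte_index(diff: u64) -> usize {
    debug_assert!(diff != 0, "caller must check for equality first");
    if cfg!(target_endian = "little") {
        // First byte in memory sits in the lowest bits.
        (diff.trailing_zeros() / 8) as usize
    } else {
        // First byte in memory sits in the highest bits.
        (diff.leading_zeros() / 8) as usize
    }
}
```

For example, XOR-ing `u64::from_ne_bytes(*b"abcdefgh")` with `u64::from_ne_bytes(*b"abcdeXgh")` and passing the result to the helper yields 5 on either byte order.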

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

@polaz polaz merged commit 002ae0e into main Apr 9, 2026
16 of 17 checks passed
@polaz polaz deleted the perf/#70-simd-match-length branch April 9, 2026 22:44
@sw-release-bot sw-release-bot Bot mentioned this pull request Apr 9, 2026

@sw-release-bot sw-release-bot Bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'structured-zstd vs C FFI (x86_64-gnu)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.30.

| Benchmark suite | Current: da54c27 | Previous: 544abd5 | Ratio |
| --- | --- | --- | --- |
| compress/better/decodecorpus-z000033/matrix/pure_rust | 130.577 ms | 57.734 ms | 2.26 |
| compress/level4-row/decodecorpus-z000033/matrix/pure_rust | 97.473 ms | 47.686 ms | 2.04 |
| compress/best/decodecorpus-z000033/matrix/pure_rust | 138.751 ms | 70.642 ms | 1.96 |
| compress/best/low-entropy-1m/matrix/c_ffi | 1.575 ms | 1.185 ms | 1.33 |
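
The listed ratios are consistent with dividing the current time by the previous one and flagging anything above 1.30. The sketch below reproduces that arithmetic under this assumption; the function names are hypothetical and this is not the benchmark action's own code.

```rust
/// Assumed definition: slowdown factor of the current run versus the previous.
fn regression_ratio(current_ms: f64, previous_ms: f64) -> f64 {
    current_ms / previous_ms
}

/// True when the slowdown factor exceeds the alert threshold.
fn exceeds_threshold(current_ms: f64, previous_ms: f64, threshold: f64) -> bool {
    regression_ratio(current_ms, previous_ms) > threshold
}
```

With the first row, `regression_ratio(130.577, 57.734)` rounds to 2.26, and `exceeds_threshold(1.575, 1.185, 1.30)` is true for the last row.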

This comment was automatically generated by workflow using github-action-benchmark.

CC: @polaz

Development

Successfully merging this pull request may close these issues.

perf(encoding): SIMD match-length comparison (SSE4.2/AVX2/NEON)

2 participants