perf(encoding): SIMD-dispatch common_prefix_len #96
Conversation
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info — Configuration used: Path: .coderabbit.yaml · Review profile: ASSERTIVE · Plan: Pro
📝 Walkthrough

Dispatches common-prefix length computation at runtime to architecture-specific SIMD kernels (x86 SSE2/AVX2, AArch64 NEON) via a new `PrefixKernel` dispatch layer.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant Detector
    participant Kernel as SIMD Kernel
    participant Scalar as Scalar Tail
    Caller->>Detector: request prefix_length(a, b, max)
    Detector-->>Caller: selected PrefixKernel
    Caller->>Kernel: run SIMD loop(a, b, off, max)
    Kernel-->>Caller: mismatch offset or chunk-complete
    alt remaining bytes > 0
        Caller->>Scalar: common_prefix_len_scalar(a, b, off, max)
        Scalar-->>Caller: final match length
    else
        Kernel-->>Caller: final match length
    end
    rect rgba(0,128,255,0.5)
        note over Detector,Kernel: runtime feature detection + kernel dispatch
    end
```
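The dispatch flow in the diagram can be sketched in Rust. This is a minimal sketch, not the PR's actual code: the SIMD kernels are stubbed out so every variant routes to the scalar loop, while the `PrefixKernel` enum and `OnceLock` caching mirror the names the walkthrough mentions.

```rust
use std::sync::OnceLock;

// Kernel selection result; variants mirror the PR's PrefixKernel enum.
#[allow(dead_code)]
#[derive(Clone, Copy)]
enum PrefixKernel {
    Scalar,
    #[cfg(target_arch = "x86_64")]
    Sse2,
    #[cfg(target_arch = "x86_64")]
    Avx2,
}

// Detect the best kernel once and cache it, as the PR does with OnceLock.
fn prefix_kernel() -> PrefixKernel {
    static KERNEL: OnceLock<PrefixKernel> = OnceLock::new();
    *KERNEL.get_or_init(|| {
        #[cfg(target_arch = "x86_64")]
        {
            if is_x86_feature_detected!("avx2") {
                return PrefixKernel::Avx2;
            }
            if is_x86_feature_detected!("sse2") {
                return PrefixKernel::Sse2;
            }
        }
        PrefixKernel::Scalar
    })
}

// Scalar tail / fallback: count leading bytes equal in both slices.
fn common_prefix_len_scalar(a: &[u8], b: &[u8], max: usize) -> usize {
    let max = max.min(a.len()).min(b.len());
    let mut i = 0;
    while i < max && a[i] == b[i] {
        i += 1;
    }
    i
}

fn common_prefix_len(a: &[u8], b: &[u8], max: usize) -> usize {
    match prefix_kernel() {
        // A real implementation would call the SIMD kernels here;
        // this sketch routes every variant to the scalar loop.
        _ => common_prefix_len_scalar(a, b, max),
    }
}

fn main() {
    let a = b"hello world";
    let b = b"hello there";
    println!("{}", common_prefix_len(a, b, a.len())); // prints 6 ("hello ")
}
```

Caching the detection result matters because `common_prefix_len` sits on the encoder's hot match-length path; the feature check runs once per process, not once per call.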
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 5 passed
Codecov Report: ❌ Patch coverage is
Pull request overview
Adds runtime SIMD dispatch to accelerate MatchGenerator::common_prefix_len (hot match-length path in the encoder), with AVX2/SSE/NEON implementations plus a scalar fallback and a regression test to ensure SIMD/scalar parity.
Changes:
- Introduces a `PrefixKernel` enum and `detect_prefix_kernel()` runtime dispatch (cached via `OnceLock` under `std`).
- Adds x86 AVX2 + "SSE4.2" and AArch64 NEON prefix-length kernels, falling back to the existing scalar logic for tails/unsupported targets.
- Adds a scalar-parity regression test covering multiple lengths and mismatch offsets.
Actionable comments posted: 1
Inline comments:
In `zstd/src/encoding/match_generator.rs`:
- Around line 3624-3652: Extend the parity test for
MatchGenerator::common_prefix_len / scalar_reference to include misaligned
subslices and larger mismatch offsets: add tests that call common_prefix_len on
slices starting at non-zero offsets (e.g., &buf[1..], &buf[3..]) so
unaligned/unsafe loads are exercised, and include mismatch positions beyond 127
(e.g., 128, 129, len - 1) as well as mismatches near the end of buffers to
exercise AVX2/loop tails; update the existing loops to iterate these additional
start offsets and mismatch indexes against both equal and single-byte-altered
clones to assert parity with scalar_reference.
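The parity matrix this comment asks for can be sketched as nested loops over start offsets and mismatch positions. Here a hypothetical word-at-a-time `common_prefix_len` stands in for the optimized routine; the real test would call the SIMD-dispatched method on `MatchGenerator` instead.

```rust
// Stand-in for the optimized routine: compares 8-byte chunks, then a
// byte-by-byte tail. The real test would call the SIMD-dispatched path.
fn common_prefix_len(a: &[u8], b: &[u8], max: usize) -> usize {
    let max = max.min(a.len()).min(b.len());
    let mut i = 0;
    while i + 8 <= max {
        let x = u64::from_le_bytes(a[i..i + 8].try_into().unwrap())
            ^ u64::from_le_bytes(b[i..i + 8].try_into().unwrap());
        if x != 0 {
            // from_le_bytes puts slice byte i into bits 8*i..,
            // so trailing_zeros / 8 is the first differing byte.
            return i + (x.trailing_zeros() / 8) as usize;
        }
        i += 8;
    }
    while i < max && a[i] == b[i] {
        i += 1;
    }
    i
}

// Naive reference the review calls scalar_reference.
fn scalar_reference(a: &[u8], b: &[u8], max: usize) -> usize {
    let max = max.min(a.len()).min(b.len());
    (0..max).take_while(|&i| a[i] == b[i]).count()
}

fn main() {
    let base: Vec<u8> = (0..256u16).map(|i| (i % 251) as u8).collect();
    // Non-zero start offsets exercise unaligned loads; mismatch indexes
    // past 127 exercise wide-vector loop tails.
    for start in [0usize, 1, 3, 7] {
        for mismatch in [0usize, 1, 31, 127, 128, 129, 200] {
            let a = &base[start..];
            let mut altered = a.to_vec();
            if mismatch < altered.len() {
                altered[mismatch] ^= 0xFF; // force a single-byte difference
            }
            let max = a.len();
            assert_eq!(
                common_prefix_len(a, &altered, max),
                scalar_reference(a, &altered, max)
            );
            // Fully-equal clones must also agree.
            assert_eq!(common_prefix_len(a, a, max), scalar_reference(a, a, max));
        }
    }
    println!("parity ok");
}
```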
ℹ️ Review info — Configuration: .coderabbit.yaml · Profile: ASSERTIVE · Plan: Pro · Run ID: 9d91c598-6a0b-47cc-99c1-9ea24db3ded1
📒 Files selected for processing (1): zstd/src/encoding/match_generator.rs
- Retarget x86 fallback kernel from SSE4.2 to SSE2 to match used intrinsics
- Remove NEON stack roundtrip in mismatch scan via lane extraction
- Extend common_prefix_len parity matrix for misaligned and >127 mismatch cases
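On the first bullet: the retarget is sound because a 16-byte prefix kernel needs only SSE2-era intrinsics (`_mm_loadu_si128`, `_mm_cmpeq_epi8`, `_mm_movemask_epi8`), so gating it on SSE4.2 was stricter than necessary. A sketch of such a kernel (not the PR's actual code):

```rust
// SSE2 prefix kernel sketch: all three intrinsics used here are SSE2,
// and SSE2 is baseline on x86_64, so no runtime feature check is needed.
#[cfg(target_arch = "x86_64")]
fn common_prefix_len(a: &[u8], b: &[u8], max: usize) -> usize {
    use std::arch::x86_64::*;
    let max = max.min(a.len()).min(b.len());
    let mut i = 0;
    unsafe {
        while i + 16 <= max {
            let va = _mm_loadu_si128(a.as_ptr().add(i) as *const __m128i);
            let vb = _mm_loadu_si128(b.as_ptr().add(i) as *const __m128i);
            // 0xFFFF means all 16 bytes matched.
            let mask = _mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) as u32;
            if mask != 0xFFFF {
                // First zero bit of the match mask = first differing byte.
                return i + (!mask).trailing_zeros() as usize;
            }
            i += 16;
        }
    }
    // Scalar tail for the last <16 bytes.
    while i < max && a[i] == b[i] {
        i += 1;
    }
    i
}

// Portable fallback so the sketch compiles on other architectures.
#[cfg(not(target_arch = "x86_64"))]
fn common_prefix_len(a: &[u8], b: &[u8], max: usize) -> usize {
    let max = max.min(a.len()).min(b.len());
    (0..max).take_while(|&i| a[i] == b[i]).count()
}

fn main() {
    let a = b"0123456789abcdefXYZ";
    let b = b"0123456789abcdefXYQ";
    assert_eq!(common_prefix_len(a, b, a.len()), 18);
    println!("ok");
}
```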
Actionable comments posted: 1
Inline comments:
In `zstd/src/encoding/match_generator.rs`:
- Around line 10-11: The NEON path is incorrectly enabled on big-endian AArch64
because the import and NEON-specific code are only gated by #[cfg(target_arch =
"aarch64")]; update all NEON-related cfgs to require little-endian as well
(#[cfg(all(target_arch = "aarch64", target_endian = "little"))]). Specifically,
change the import line that brings in uint8x16_t, vceqq_u8, vgetq_lane_u64,
vld1q_u8, vreinterpretq_u64_u8 to use #[cfg(all(target_arch = "aarch64",
target_endian = "little"))], and apply the same cfg adjustment to the NEON
kernel variant definitions, the NEON detection logic, and the NEON dispatch
points (references: the symbol vreinterpretq_u64_u8 and the NEON kernel/variant
names and detection/dispatch code in match_generator.rs).
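The endianness concern can be made concrete with a portable sketch of the mismatch-index helper the follow-up commit mentions (`mismatch_byte_index` is a hypothetical name): `vceqq_u8` yields 0xFF per equal byte, and after `vreinterpretq_u64_u8` the first differing byte of a u64 lane sits in the low bits on little-endian but in the high bits on big-endian.

```rust
// Hypothetical endian-aware helper: given a 64-bit equality mask where
// each equal byte is 0xFF and each differing byte is 0x00 (the shape a
// NEON vceqq_u8 lane takes after vreinterpretq_u64_u8), return the index
// of the first differing byte within the 8-byte lane.
fn mismatch_byte_index(eq_mask: u64) -> usize {
    debug_assert_ne!(eq_mask, u64::MAX, "caller checks the lane has a mismatch");
    let diff = !eq_mask;
    if cfg!(target_endian = "little") {
        // Byte 0 of the lane lives in the low bits on little-endian.
        (diff.trailing_zeros() / 8) as usize
    } else {
        // Byte 0 lives in the high bits on big-endian.
        (diff.leading_zeros() / 8) as usize
    }
}

fn main() {
    // Build lanes the way a vector load would: from bytes in memory order.
    let lane = |bytes: [u8; 8]| u64::from_ne_bytes(bytes);
    // All bytes equal except byte 0.
    assert_eq!(mismatch_byte_index(lane([0, 255, 255, 255, 255, 255, 255, 255])), 0);
    // All bytes equal except byte 3.
    assert_eq!(mismatch_byte_index(lane([255, 255, 255, 0, 255, 255, 255, 255])), 3);
    println!("ok");
}
```

Gating the whole NEON path to `target_endian = "little"`, as the comment asks, sidesteps the big-endian branch entirely; the helper above just illustrates why untested big-endian lane math is risky to leave enabled.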
ℹ️ Review info — Configuration: .coderabbit.yaml · Profile: ASSERTIVE · Plan: Pro · Run ID: 74590dfe-85bc-4370-84c8-23bb51baeae2
📒 Files selected for processing (1): zstd/src/encoding/match_generator.rs
- Restrict AArch64 NEON prefix path to little-endian targets
- Reuse endian-aware mismatch byte index helper in NEON lane extraction
- Keep scalar/SSE2/AVX2 dispatch behavior unchanged
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'structured-zstd vs C FFI (x86_64-gnu)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.30.
| Benchmark suite | Current: da54c27 | Previous: 544abd5 | Ratio |
|---|---|---|---|
| compress/better/decodecorpus-z000033/matrix/pure_rust | 130.577 ms | 57.734 ms | 2.26 |
| compress/level4-row/decodecorpus-z000033/matrix/pure_rust | 97.473 ms | 47.686 ms | 2.04 |
| compress/best/decodecorpus-z000033/matrix/pure_rust | 138.751 ms | 70.642 ms | 1.96 |
| compress/best/low-entropy-1m/matrix/c_ffi | 1.575 ms | 1.185 ms | 1.33 |
This comment was automatically generated by workflow using github-action-benchmark.
CC: @polaz
Summary
Validation
Closes #70