perf(encoding): SIMD-dispatch common_prefix_len #96
Conversation
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info — Configuration used: Path: .coderabbit.yaml · Review profile: ASSERTIVE · Plan: Pro
📝 Walkthrough

Dispatches common-prefix length computation at runtime to architecture-specific SIMD kernels (x86 SSE2/AVX2, AArch64 NEON) via a new `PrefixKernel` dispatch layer.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant Detector
    participant Kernel as SIMD Kernel
    participant Scalar as Scalar Tail
    Caller->>Detector: request prefix_length(a, b, max)
    Detector-->>Caller: selected PrefixKernel
    Caller->>Kernel: run SIMD loop(a, b, off, max)
    Kernel-->>Caller: mismatch offset or chunk-complete
    alt remaining bytes > 0
        Caller->>Scalar: common_prefix_len_scalar(a, b, off, max)
        Scalar-->>Caller: final match length
    else
        Kernel-->>Caller: final match length
    end
    rect rgba(0,128,255,0.5)
        note over Detector,Kernel: runtime feature detection + kernel dispatch
    end
```
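The dispatch flow in the diagram can be sketched in Rust. This is a minimal sketch, not the PR's actual code: the SIMD kernels are stubbed out so every variant routes to the scalar loop, while the `PrefixKernel` enum and `OnceLock` caching mirror the names the walkthrough mentions.

```rust
use std::sync::OnceLock;

// Kernel selection result; variants mirror the PR's PrefixKernel enum.
#[allow(dead_code)]
#[derive(Clone, Copy)]
enum PrefixKernel {
    Scalar,
    #[cfg(target_arch = "x86_64")]
    Sse2,
    #[cfg(target_arch = "x86_64")]
    Avx2,
}

// Detect the best kernel once and cache it, as the PR does with OnceLock.
fn prefix_kernel() -> PrefixKernel {
    static KERNEL: OnceLock<PrefixKernel> = OnceLock::new();
    *KERNEL.get_or_init(|| {
        #[cfg(target_arch = "x86_64")]
        {
            if is_x86_feature_detected!("avx2") {
                return PrefixKernel::Avx2;
            }
            if is_x86_feature_detected!("sse2") {
                return PrefixKernel::Sse2;
            }
        }
        PrefixKernel::Scalar
    })
}

// Scalar tail / fallback: count leading bytes equal in both slices.
fn common_prefix_len_scalar(a: &[u8], b: &[u8], max: usize) -> usize {
    let max = max.min(a.len()).min(b.len());
    let mut i = 0;
    while i < max && a[i] == b[i] {
        i += 1;
    }
    i
}

fn common_prefix_len(a: &[u8], b: &[u8], max: usize) -> usize {
    match prefix_kernel() {
        // A real implementation would call the SIMD kernels here;
        // this sketch routes every variant to the scalar loop.
        _ => common_prefix_len_scalar(a, b, max),
    }
}

fn main() {
    let a = b"hello world";
    let b = b"hello there";
    println!("{}", common_prefix_len(a, b, a.len())); // prints 6 ("hello ")
}
```

Caching the detection result matters because `common_prefix_len` sits on the encoder's hot match-length path; the feature check runs once per process, not once per call.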
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 5 passed
Codecov Report: ❌ Patch coverage is
Pull request overview
Adds runtime SIMD dispatch to accelerate MatchGenerator::common_prefix_len (hot match-length path in the encoder), with AVX2/SSE/NEON implementations plus a scalar fallback and a regression test to ensure SIMD/scalar parity.
Changes:
- Introduces a `PrefixKernel` enum and `detect_prefix_kernel()` runtime dispatch (cached via `OnceLock` under `std`).
- Adds x86 AVX2 + "SSE4.2" and AArch64 NEON prefix-length kernels, falling back to the existing scalar logic for tails/unsupported targets.
- Adds a scalar-parity regression test covering multiple lengths and mismatch offsets.
Actionable comments posted: 1
Inline comments:
In `zstd/src/encoding/match_generator.rs`:
- Around line 3624-3652: Extend the parity test for
MatchGenerator::common_prefix_len / scalar_reference to include misaligned
subslices and larger mismatch offsets: add tests that call common_prefix_len on
slices starting at non-zero offsets (e.g., &buf[1..], &buf[3..]) so
unaligned/unsafe loads are exercised, and include mismatch positions beyond 127
(e.g., 128, 129, len - 1) as well as mismatches near the end of buffers to
exercise AVX2/loop tails; update the existing loops to iterate these additional
start offsets and mismatch indexes against both equal and single-byte-altered
clones to assert parity with scalar_reference.
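The parity matrix this comment asks for can be sketched as nested loops over start offsets and mismatch positions. Here a hypothetical word-at-a-time `common_prefix_len` stands in for the optimized routine; the real test would call the SIMD-dispatched method on `MatchGenerator` instead.

```rust
// Stand-in for the optimized routine: compares 8-byte chunks, then a
// byte-by-byte tail. The real test would call the SIMD-dispatched path.
fn common_prefix_len(a: &[u8], b: &[u8], max: usize) -> usize {
    let max = max.min(a.len()).min(b.len());
    let mut i = 0;
    while i + 8 <= max {
        let x = u64::from_le_bytes(a[i..i + 8].try_into().unwrap())
            ^ u64::from_le_bytes(b[i..i + 8].try_into().unwrap());
        if x != 0 {
            // from_le_bytes puts slice byte i into bits 8*i..,
            // so trailing_zeros / 8 is the first differing byte.
            return i + (x.trailing_zeros() / 8) as usize;
        }
        i += 8;
    }
    while i < max && a[i] == b[i] {
        i += 1;
    }
    i
}

// Naive reference the review calls scalar_reference.
fn scalar_reference(a: &[u8], b: &[u8], max: usize) -> usize {
    let max = max.min(a.len()).min(b.len());
    (0..max).take_while(|&i| a[i] == b[i]).count()
}

fn main() {
    let base: Vec<u8> = (0..256u16).map(|i| (i % 251) as u8).collect();
    // Non-zero start offsets exercise unaligned loads; mismatch indexes
    // past 127 exercise wide-vector loop tails.
    for start in [0usize, 1, 3, 7] {
        for mismatch in [0usize, 1, 31, 127, 128, 129, 200] {
            let a = &base[start..];
            let mut altered = a.to_vec();
            if mismatch < altered.len() {
                altered[mismatch] ^= 0xFF; // force a single-byte difference
            }
            let max = a.len();
            assert_eq!(
                common_prefix_len(a, &altered, max),
                scalar_reference(a, &altered, max)
            );
            // Fully-equal clones must also agree.
            assert_eq!(common_prefix_len(a, a, max), scalar_reference(a, a, max));
        }
    }
    println!("parity ok");
}
```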
ℹ️ Review info — Configuration: .coderabbit.yaml · Profile: ASSERTIVE · Plan: Pro · Run ID: 9d91c598-6a0b-47cc-99c1-9ea24db3ded1
📒 Files selected for processing (1): zstd/src/encoding/match_generator.rs
- Retarget x86 fallback kernel from SSE4.2 to SSE2 to match used intrinsics
- Remove NEON stack roundtrip in mismatch scan via lane extraction
- Extend common_prefix_len parity matrix for misaligned and >127 mismatch cases
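On the first bullet: the retarget is sound because a 16-byte prefix kernel needs only SSE2-era intrinsics (`_mm_loadu_si128`, `_mm_cmpeq_epi8`, `_mm_movemask_epi8`), so gating it on SSE4.2 was stricter than necessary. A sketch of such a kernel (not the PR's actual code):

```rust
// SSE2 prefix kernel sketch: all three intrinsics used here are SSE2,
// and SSE2 is baseline on x86_64, so no runtime feature check is needed.
#[cfg(target_arch = "x86_64")]
fn common_prefix_len(a: &[u8], b: &[u8], max: usize) -> usize {
    use std::arch::x86_64::*;
    let max = max.min(a.len()).min(b.len());
    let mut i = 0;
    unsafe {
        while i + 16 <= max {
            let va = _mm_loadu_si128(a.as_ptr().add(i) as *const __m128i);
            let vb = _mm_loadu_si128(b.as_ptr().add(i) as *const __m128i);
            // 0xFFFF means all 16 bytes matched.
            let mask = _mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) as u32;
            if mask != 0xFFFF {
                // First zero bit of the match mask = first differing byte.
                return i + (!mask).trailing_zeros() as usize;
            }
            i += 16;
        }
    }
    // Scalar tail for the last <16 bytes.
    while i < max && a[i] == b[i] {
        i += 1;
    }
    i
}

// Portable fallback so the sketch compiles on other architectures.
#[cfg(not(target_arch = "x86_64"))]
fn common_prefix_len(a: &[u8], b: &[u8], max: usize) -> usize {
    let max = max.min(a.len()).min(b.len());
    (0..max).take_while(|&i| a[i] == b[i]).count()
}

fn main() {
    let a = b"0123456789abcdefXYZ";
    let b = b"0123456789abcdefXYQ";
    assert_eq!(common_prefix_len(a, b, a.len()), 18);
    println!("ok");
}
```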
Actionable comments posted: 1
Inline comments:
In `zstd/src/encoding/match_generator.rs`:
- Around line 10-11: The NEON path is incorrectly enabled on big-endian AArch64
because the import and NEON-specific code are only gated by #[cfg(target_arch =
"aarch64")]; update all NEON-related cfgs to require little-endian as well
(#[cfg(all(target_arch = "aarch64", target_endian = "little"))]). Specifically,
change the import line that brings in uint8x16_t, vceqq_u8, vgetq_lane_u64,
vld1q_u8, vreinterpretq_u64_u8 to use #[cfg(all(target_arch = "aarch64",
target_endian = "little"))], and apply the same cfg adjustment to the NEON
kernel variant definitions, the NEON detection logic, and the NEON dispatch
points (references: the symbol vreinterpretq_u64_u8 and the NEON kernel/variant
names and detection/dispatch code in match_generator.rs).
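The endianness concern can be made concrete with a portable sketch of the mismatch-index helper the follow-up commit mentions (`mismatch_byte_index` is a hypothetical name): `vceqq_u8` yields 0xFF per equal byte, and after `vreinterpretq_u64_u8` the first differing byte of a u64 lane sits in the low bits on little-endian but in the high bits on big-endian.

```rust
// Hypothetical endian-aware helper: given a 64-bit equality mask where
// each equal byte is 0xFF and each differing byte is 0x00 (the shape a
// NEON vceqq_u8 lane takes after vreinterpretq_u64_u8), return the index
// of the first differing byte within the 8-byte lane.
fn mismatch_byte_index(eq_mask: u64) -> usize {
    debug_assert_ne!(eq_mask, u64::MAX, "caller checks the lane has a mismatch");
    let diff = !eq_mask;
    if cfg!(target_endian = "little") {
        // Byte 0 of the lane lives in the low bits on little-endian.
        (diff.trailing_zeros() / 8) as usize
    } else {
        // Byte 0 lives in the high bits on big-endian.
        (diff.leading_zeros() / 8) as usize
    }
}

fn main() {
    // Build lanes the way a vector load would: from bytes in memory order.
    let lane = |bytes: [u8; 8]| u64::from_ne_bytes(bytes);
    // All bytes equal except byte 0.
    assert_eq!(mismatch_byte_index(lane([0, 255, 255, 255, 255, 255, 255, 255])), 0);
    // All bytes equal except byte 3.
    assert_eq!(mismatch_byte_index(lane([255, 255, 255, 0, 255, 255, 255, 255])), 3);
    println!("ok");
}
```

Gating the whole NEON path to `target_endian = "little"`, as the comment asks, sidesteps the big-endian branch entirely; the helper above just illustrates why untested big-endian lane math is risky to leave enabled.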
ℹ️ Review info — Configuration: .coderabbit.yaml · Profile: ASSERTIVE · Plan: Pro · Run ID: 74590dfe-85bc-4370-84c8-23bb51baeae2
📒 Files selected for processing (1): zstd/src/encoding/match_generator.rs
- Restrict AArch64 NEON prefix path to little-endian targets
- Reuse endian-aware mismatch byte index helper in NEON lane extraction
- Keep scalar/SSE2/AVX2 dispatch behavior unchanged
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'structured-zstd vs C FFI (x86_64-gnu)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.30.
| Benchmark suite | Current: da54c27 | Previous: 544abd5 | Ratio |
|---|---|---|---|
| compress/better/decodecorpus-z000033/matrix/pure_rust | 130.577 ms | 57.734 ms | 2.26 |
| compress/level4-row/decodecorpus-z000033/matrix/pure_rust | 97.473 ms | 47.686 ms | 2.04 |
| compress/best/decodecorpus-z000033/matrix/pure_rust | 138.751 ms | 70.642 ms | 1.96 |
| compress/best/low-entropy-1m/matrix/c_ffi | 1.575 ms | 1.185 ms | 1.33 |
This comment was automatically generated by workflow using github-action-benchmark.
CC: @polaz
Summary
Validation
Closes #70