perf(decoding): add runtime-dispatched simd wildcopy#85
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID:
📒 Files selected for processing (1)
📝 Walkthrough
Added a new internal SIMD-aware copy module.
Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant RingBuffer as "RingBuffer (caller)"
    participant SIMD as "decoding::simd_copy"
    participant Memory as "src/dst memory"
    RingBuffer->>SIMD: copy_bytes_overshooting((src_ptr, src_cap), (dst_ptr, dst_cap), copy_at_least)
    SIMD->>SIMD: select strategy & chunk size (cfg/runtime)
    alt min(src_cap, dst_cap) >= rounded_chunk
        SIMD->>Memory: perform chunked loads/stores (SIMD or scalar chunked loops)
    else
        SIMD->>Memory: call copy_from_nonoverlapping for exact copy_at_least
    end
    SIMD-->>RingBuffer: return (void), debug-assert checked
```
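The branch in the diagram above can be sketched in safe, scalar Rust. This is an illustrative model of the overshooting-copy contract only, not the crate's actual implementation: `CHUNK = 16` is an assumed width, whereas the real module selects the chunk size at runtime per backend.

```rust
// Assumed chunk width for this scalar sketch (the real code picks it at runtime).
const CHUNK: usize = 16;

/// Copies at least `copy_at_least` bytes from `src` to `dst`. The chunked
/// path may write past `copy_at_least` ("overshoot"), but never past either
/// buffer's capacity; when there is no room to overshoot, it falls back to
/// an exact-length copy.
fn copy_bytes_overshooting(src: &[u8], dst: &mut [u8], copy_at_least: usize) {
    let rounded = copy_at_least.div_ceil(CHUNK) * CHUNK;
    if src.len().min(dst.len()) >= rounded {
        // Chunked path: whole chunks only; the final chunk may overshoot.
        for i in (0..rounded).step_by(CHUNK) {
            dst[i..i + CHUNK].copy_from_slice(&src[i..i + CHUNK]);
        }
    } else {
        // Fallback: exact copy, mirroring copy_from_nonoverlapping.
        dst[..copy_at_least].copy_from_slice(&src[..copy_at_least]);
    }
    debug_assert_eq!(&dst[..copy_at_least], &src[..copy_at_least]);
}
```

The real SIMD variants replace the inner loop with vector loads/stores; the capacity check and fallback structure are the same.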
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related issues
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Pull request overview
This PR introduces a new runtime-dispatched “overshooting” copy helper to speed up decoder hot-path copies across multiple SIMD backends, and routes ringbuffer match-copy logic through that helper.
Changes:
- Added decoding::simd_copy with runtime-selected AVX-512/AVX2/SSE2 (x86/x86_64), NEON (aarch64), and scalar fallback overshooting copy paths.
- Updated ringbuffer copy hot paths to use the new helper instead of the local implementation / direct copy_from_nonoverlapping in several cases.
- Wired the new module into decoding::mod.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| zstd/src/decoding/simd_copy.rs | New runtime-dispatched SIMD/scalar overshooting copy implementation. |
| zstd/src/decoding/ringbuffer.rs | Switched multiple internal copy sites to use simd_copy::copy_bytes_overshooting; removed the old local helper. |
| zstd/src/decoding/mod.rs | Added the new simd_copy module. |
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
zstd/src/decoding/ringbuffer.rs (1)
1030-1064: ⚠️ Potential issue | 🟡 Minor
Make this test derive its lengths from the active chunk size.
The fixed 16/32/17 cases no longer cover "all copy strategies" on wider backends: with AVX2, 32 only exercises a single chunk, and with AVX-512 both 16 and 32 fall back to copy_from_nonoverlapping. Please key these sizes off the runtime-selected chunk size, or move the test next to simd_copy.rs so the SIMD loop stays exercised on every supported backend.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@zstd/src/decoding/ringbuffer.rs` around lines 1030 - 1064, The test copy_bytes_overshooting_covers_all_copy_strategies uses hardcoded lengths (16/32/17) that don't exercise all SIMD paths on wider backends; change it to derive lengths from the runtime chunk size provided by simd_copy so the single-chunk, multi-chunk, and fallback cases always map to 1*chunk, 2*chunk, and chunk+1 respectively. Update the test to query the chunk size (use the public symbol in simd_copy such as a CHUNK/CHUNK_SIZE constant or a chunk_size() accessor) and compute single_len = chunk, multi_len = chunk * 2, fallback_len = chunk + 1, then use those lengths in the copy_bytes_overshooting calls in copy_bytes_overshooting_covers_all_copy_strategies.
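The suggested fix can be sketched as below. `chunk_size()` is a hypothetical stand-in for whatever symbol `simd_copy` actually exposes (a `CHUNK` constant or an accessor); the value 16 assumes the scalar/SSE2 width, where AVX2 would report 32 and AVX-512 64.

```rust
// Hypothetical accessor mirroring the runtime-selected chunk width.
fn chunk_size() -> usize {
    16 // scalar/SSE2 width assumed; AVX2 -> 32, AVX-512 -> 64
}

// Derive the three test lengths from the active chunk size so the
// single-chunk, multi-chunk, and fallback cases hold on every backend.
fn strategy_test_lengths() -> (usize, usize, usize) {
    let chunk = chunk_size();
    (chunk, chunk * 2, chunk + 1)
}
```

With the scalar/SSE2 assumption these evaluate to the original 16/32/17, but on wider backends they scale so the SIMD loop stays exercised.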
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/decoding/simd_copy.rs`:
- Around line 1-2: The import of the is_x86_feature_detected macro is from the
wrong module; replace the use of core::arch::is_x86_feature_detected with
std::arch::is_x86_feature_detected in the cfg block (the line currently under
#[cfg(all(feature = "std", any(target_arch = "x86", target_arch = "x86_64")))]
that brings in is_x86_feature_detected) so the code uses the macro provided by
std::arch.
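A minimal sketch of the corrected usage: `is_x86_feature_detected!` lives in `std::arch`, not `core::arch`, so any detection helper has to be gated on x86 targets (and, in the crate, on the `std` feature). The wrapper name below is illustrative.

```rust
// Illustrative helper: reports AVX2 availability, false on non-x86 targets.
fn has_avx2() -> bool {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    // The macro is provided by std::arch (runtime CPU-feature detection).
    return std::arch::is_x86_feature_detected!("avx2");
    #[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
    return false;
}
```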
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: aef01b78-5497-483b-a6de-767f87f1a793
📒 Files selected for processing (3)
zstd/src/decoding/mod.rs, zstd/src/decoding/ringbuffer.rs, zstd/src/decoding/simd_copy.rs
@coderabbitai Fixed outside-diff finding in zstd/src/decoding/ringbuffer.rs: copy_bytes_overshooting_covers_all_copy_strategies now derives lengths from runtime chunk size via active_chunk_size_for_tests(), so single/multi/fallback cases adapt to active backend width.
Force-pushed: 11bd090 to 7b46f00 (Compare)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/decoding/ringbuffer.rs`:
- Around line 587-614: The SIMD copy calls in the unsafe block
(simd_copy::copy_bytes_overshooting invocations using m1_in_f1, m2_in_f1,
m1_in_f2, m2_in_f2) pass the source/destination lengths equal to the requested
copy amounts, which prevents the routine from using safe overshoot
optimizations; compute and pass the actual remaining buffer capacities for each
tuple instead (e.g., remaining bytes from m1_ptr to its buffer end, from m2_ptr
to its buffer end, from f1_ptr/f2_ptr to their ring-buffer chunk ends) so the
tuples become (ptr, src_capacity) and (ptr, dst_capacity) while keeping the
third parameter as the copy length, and update the arguments for all four calls
in this block to enable SIMD overshooting.
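The capacity-threading suggestion boils down to passing the remaining distance from the copy offset to the end of each buffer, rather than the requested copy length. A slice-based sketch (the helper name is illustrative, not the crate's API):

```rust
// Remaining bytes from `offset` to the end of `buf`: this is the capacity
// an overshooting copier is allowed to use, as opposed to the copy length.
fn remaining_capacity(buf: &[u8], offset: usize) -> usize {
    debug_assert!(offset <= buf.len());
    buf.len() - offset
}
```

A call site would then build tuples like `(m1_ptr, remaining_capacity(...))` for source and destination while keeping the copy length as the third argument, giving the copier room to overshoot when the length is not chunk-aligned.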
In `@zstd/src/decoding/simd_copy.rs`:
- Line 99: Remove the needless explicit return when returning the result of
scalar_strategy(): replace the line that says `return scalar_strategy();` with a
bare final expression `scalar_strategy()` so the function returns by expression
rather than using the `return` keyword; this fixes the clippy::needless_return
lint in simd_copy.rs for the code path that returns the scalar_strategy() value.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 7104f346-6cee-41e5-ab6c-05c263c7e896
📒 Files selected for processing (3)
zstd/src/decoding/mod.rs, zstd/src/decoding/ringbuffer.rs, zstd/src/decoding/simd_copy.rs
@coderabbitai re-review
✅ Actions performed: Full review triggered.
Codecov Report
❌ Patch coverage is
@coderabbitai full review
✅ Actions performed: Full review triggered.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/decoding/simd_copy.rs`:
- Around line 282-316: Add a #[cfg(feature = "std")] guard to the x86 SIMD tests
so they don't reference std::arch::is_x86_feature_detected! in no_std builds;
specifically wrap the test functions copy_avx2_copies_full_chunk_when_available
and copy_avx512_copies_full_chunk_when_available (which call
is_x86_feature_detected!("avx2") and is_x86_feature_detected!("avx512f")) with
the same #[cfg(feature = "std")] used on
copy_sse2_copies_full_chunk_when_available so the tests only compile when the
standard library is available.
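The gating the review asks for looks roughly like this skeleton (test names are taken from the review; the body is elided, so this is a structural sketch rather than runnable test code):

```rust
#[cfg(all(test, any(target_arch = "x86", target_arch = "x86_64")))]
mod x86_tests {
    // Same `std` gate as copy_sse2_copies_full_chunk_when_available, so the
    // test only compiles when std::arch::is_x86_feature_detected! exists.
    #[test]
    #[cfg(feature = "std")]
    fn copy_avx2_copies_full_chunk_when_available() {
        if !std::arch::is_x86_feature_detected!("avx2") {
            return; // skip on hardware without AVX2
        }
        // ... exercise the AVX2 copy path here ...
    }
}
```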
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 71683fee-119b-4a36-8497-5de63d4144a3
📒 Files selected for processing (3)
zstd/src/decoding/mod.rs, zstd/src/decoding/ringbuffer.rs, zstd/src/decoding/simd_copy.rs
@coderabbitai re-review
✅ Actions performed: Review triggered.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/decoding/simd_copy.rs`:
- Around line 97-102: The SSE2 branch currently returns a CopyStrategy { chunk:
16, copy: copy_sse2 } without checking copy_at_least; add a minimum-threshold
guard (e.g., require copy_at_least >= 16) before selecting the SSE2 strategy so
small copies fall through to the generic/fallback handling
(copy_bytes_overshooting), mirroring the AVX2/AVX512/NEON checks; update the
conditional that references caps.sse2 to also test copy_at_least and only return
the SSE2 CopyStrategy when both caps.sse2 and copy_at_least >= 16 are true.
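The guarded dispatch the review describes can be sketched as a decision table. The `Kind`/`Caps` names and per-backend thresholds are illustrative (one chunk width per backend), not the crate's actual `CopyStrategy` API; note the scalar fallback is a bare tail expression, which also satisfies the clippy::needless_return fix from the earlier review round.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind { Avx512, Avx2, Sse2, Scalar }

// Illustrative capability flags, as runtime detection would report them.
struct Caps { avx512f: bool, avx2: bool, sse2: bool }

fn select_strategy(caps: &Caps, copy_at_least: usize) -> Kind {
    // Each SIMD strategy requires at least one full chunk to be worthwhile.
    if caps.avx512f && copy_at_least >= 64 {
        return Kind::Avx512;
    }
    if caps.avx2 && copy_at_least >= 32 {
        return Kind::Avx2;
    }
    if caps.sse2 && copy_at_least >= 16 {
        return Kind::Sse2;
    }
    // Small copies fall through to the scalar path (bare tail expression).
    Kind::Scalar
}
```

With this shape, an SSE2-capable machine asked to copy 8 bytes takes the scalar path instead of paying SIMD setup cost for a sub-chunk copy.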
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: cb7564dd-e4f7-4841-b09e-bd167a1b820a
📒 Files selected for processing (1)
zstd/src/decoding/simd_copy.rs
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'structured-zstd vs C FFI (x86_64-gnu)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.30.
| Benchmark suite | Current: 01b3e8d | Previous: e877c44 | Ratio |
|---|---|---|---|
| compress/level4-row/decodecorpus-z000033/matrix/pure_rust | 63.537 ms | 44.778 ms | 1.42 |
| compress/best/decodecorpus-z000033/matrix/pure_rust | 98.124 ms | 60.807 ms | 1.61 |
| compress/best/low-entropy-1m/matrix/c_ffi | 1.598 ms | 1.217 ms | 1.31 |
This comment was automatically generated by workflow using github-action-benchmark.
CC: @polaz
♻️ Duplicate comments (1)
zstd/src/decoding/ringbuffer.rs (1)
659-663: 🧹 Nitpick | 🔵 Trivial
Consider threading proper capacities for SIMD consistency (optional).
Unlike copy_with_checks (which now passes aggregate capacities), copy_with_nobranch_check passes the copy length as the capacity in each case, e.g., (m1_ptr, m1_in_f1) for both source capacity and copy amount. This prevents SIMD overshoot when the length isn't chunk-aligned.
Since this function is marked #[allow(dead_code)] and the branchless design prioritizes branch elimination over SIMD throughput, this is a low-priority consistency improvement. If this path is ever enabled for production, consider threading f1_len/f2_len through to enable SIMD benefits.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@zstd/src/decoding/ringbuffer.rs` around lines 659 - 663, The call in copy_with_nobranch_check uses m1_in_f1 as both the source capacity and copy length which can prevent SIMD overshoot optimization; update the simd_copy::copy_bytes_overshooting calls in copy_with_nobranch_check to pass the actual source capacities (e.g., f1_len/f2_len or the aggregate capacity used by copy_with_checks) instead of the copy length m1_in_f1, so the tuple arguments reflect (ptr, source_capacity) like the non-branchless path and enable proper SIMD behavior if this path is used in production.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 3f86b8f0-ab67-41b2-85a6-d543e6eb20c1
📒 Files selected for processing (2)
zstd/src/decoding/ringbuffer.rszstd/src/decoding/simd_copy.rs
@coderabbitai Fixed duplicate note from review body (ringbuffer.rs 659-663) in commit 01b3e8d:
Summary
Validation
Benchmark note
On this branch, decode benchmarks did not show statistically significant gains yet (mostly noise / no-change on tested scenarios). This PR focuses on architectural cleanup and hot-path dispatch tightening; x86_64 AVX-512 behavior should be validated on hardware/runners that expose AVX-512.
Summary by CodeRabbit
Refactor
Tests