perf(bench): add wildcopy candidate research bench and dashboard range filter#107
perf(bench): add wildcopy candidate research bench and dashboard range filter#107
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughAdds a bench-only Criterion benchmark and feature-gated wrappers to exercise simd_copy wildcopy candidates; re-exports a bench entrypoint; and enhances the bench-dashboard with From/To time-window selectors and Chart.js sizing/layout adjustments. Changes
Sequence Diagram(s)sequenceDiagram
participant Browser as Browser (bench-dashboard)
participant Runner as Bench Runner (cargo + Criterion)
participant BenchLib as zstd::testing (bench_internals)
participant Simd as simd_copy impl
participant CPU as CPU (ISA/features)
Browser->>Runner: request benchmark data / trigger run (filtered by From/To)
Runner->>BenchLib: execute bench_wildcopy_candidates
BenchLib->>Simd: call copy_bytes_overshooting_for_bench(src,dst,len)
Simd->>CPU: invoke ISA-specific kernel (AVX-512/AVX2/SSE2/scalar)
CPU-->>Simd: return copied bytes
Simd-->>BenchLib: provide output for verification
BenchLib-->>Runner: report throughput & correctness
Runner-->>Browser: update dashboard charts (respecting time window)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~40 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Pull request overview
Adds benchmarking tooling for researching alternative decode wildcopy kernels (issue #87) and improves the benchmark dashboard UX by making the chart more readable and allowing users to filter to a snapshot range.
Changes:
- Add a feature-gated Criterion bench (
wildcopy_candidates) and a bench-only entrypoint to call the production overshooting wildcopy implementation. - Register the new bench in
zstd/Cargo.tomlbehindbench_internals. - Improve the gh-pages benchmark dashboard with a taller chart and
From/Tosnapshot selectors to focus on a time window.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
zstd/src/lib.rs |
Re-export bench-only wildcopy entrypoint via structured_zstd::testing for benchmark use. |
zstd/src/decoding/simd_copy.rs |
Add copy_bytes_overshooting_for_bench wrapper behind bench_internals. |
zstd/src/decoding/mod.rs |
Re-export the bench-only entrypoint from decoding under bench_internals. |
zstd/Cargo.toml |
Register new wildcopy_candidates Criterion bench gated by bench_internals. |
zstd/benches/wildcopy_candidates.rs |
New microbench comparing production wildcopy vs an AVX2 unroll candidate. |
.github/bench-dashboard/index.html |
Dashboard layout improvements + snapshot range filtering (From/To). |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/benches/wildcopy_candidates.rs`:
- Around line 68-126: The benchmark is comparing different call paths: baseline
uses copy_bytes_overshooting_for_bench (which does strategy selection, capacity
checks, and rounding) while the candidate calls candidate_copy_kernel directly
and all tested lengths are 64-byte aligned—this produces an optimistic and
non-equivalent comparison. Fix by making the two measured paths equivalent:
either (A) call the selected kernel directly for both sides (invoke the same
kernel function used by copy_bytes_overshooting_for_bench for the baseline
instead of copy_bytes_overshooting_for_bench), or (B) wrap the candidate
invocation in the same dispatch/rounding logic (call
copy_bytes_overshooting_for_bench for both) so both pay the same
selection/rounding costs; also update the lengths array to include at least one
non-multiple-of-64 value (e.g., 65) so tail/rounded paths are exercised. Ensure
you modify the benchmark code that constructs lengths, the baseline invocation
using copy_bytes_overshooting_for_bench, and the candidate invocation using
candidate_copy_kernel so both paths are symmetric.
In `@zstd/src/decoding/mod.rs`:
- Line 24: Replace the broad module export with a private module and expose only
the bench shim: change the declaration `pub(crate) mod simd_copy;` to `mod
simd_copy;` inside `decoding`, then add a targeted re-export like `pub(crate)
use self::simd_copy::copy_bytes_overshooting_for_bench;` so only that function
is visible across the crate; finally update zstd/src/lib.rs to reference
`crate::decoding::copy_bytes_overshooting_for_bench` instead of importing the
entire `simd_copy` module.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: fcd3516a-4962-40fe-98c2-9a2fe3bc9523
📒 Files selected for processing (5)
BENCHMARKS.mdzstd/benches/wildcopy_candidates.rszstd/src/decoding/mod.rszstd/src/decoding/simd_copy.rszstd/src/lib.rs
There was a problem hiding this comment.
Pull request overview
This PR improves the benchmarking workflow for structured-zstd by adding a new research-focused Criterion benchmark for SIMD “wildcopy” candidate evaluation and enhancing the benchmark dashboard UI with a selectable snapshot window.
Changes:
- Added a new
wildcopy_candidatesCriterion benchmark (gated behindbench_internals) to compare copy kernels against the production overshooting wildcopy. - Exposed a bench-only internal entrypoint for the production wildcopy implementation and adjusted module visibility to support it.
- Improved the benchmark dashboard with a larger chart area and
From/Tosnapshot selectors to filter the displayed time window.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
zstd/src/lib.rs |
Re-exports the bench-only wildcopy entrypoint under the existing bench_internals-gated testing module. |
zstd/src/decoding/simd_copy.rs |
Adds a bench_internals-gated wrapper around the production overshooting copy routine for benchmarking. |
zstd/src/decoding/mod.rs |
Makes simd_copy visible within the crate to enable the bench-only re-export path. |
zstd/Cargo.toml |
Registers the new wildcopy_candidates bench and gates it behind bench_internals. |
zstd/benches/wildcopy_candidates.rs |
Implements the new Criterion benchmark comparing baseline vs a candidate AVX2 unroll2 kernel (with scalar fallback). |
BENCHMARKS.md |
Documents the Issue #87 research mapping, local sample measurements, and decision/follow-up linkage. |
.github/bench-dashboard/index.html |
Adds snapshot range selectors and adjusts layout/chart sizing to improve dashboard usability. |
There was a problem hiding this comment.
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'structured-zstd vs C FFI (x86_64-gnu)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.30.
| Benchmark suite | Current: 70e7a5c | Previous: 051fed2 | Ratio |
|---|---|---|---|
compress/fastest/decodecorpus-z000033/matrix/c_ffi |
3.736 ms |
2.766 ms |
1.35 |
This comment was automatically generated by workflow using github-action-benchmark.
CC: @polaz
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (1)
zstd/src/decoding/mod.rs (1)
25-26: 🛠️ Refactor suggestion | 🟠 MajorKeep the bench shim off the public
decodingsurface.
pub usehere makes the helper reachable asstructured_zstd::decoding::copy_bytes_overshooting_for_bench, so the hiddentestingmodule is no longer the sole bench-only escape hatch. Keep this re-export non-public fromdecodingand lettestingremain the only public facade for bench builds.Based on learnings, internal decode/encode helpers remain
pub(crate)by default, and explicit public shims are exposed only behind a dedicated feature flag in a separate file.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@zstd/src/decoding/mod.rs` around lines 25 - 26, The public re-export exposes the bench helper on the decoding public API; change the visibility of the re-export so it is not public (e.g., replace `pub use self::simd_copy::copy_bytes_overshooting_for_bench;` with a non-public re-export such as `pub(crate) use` or remove the re-export and let the `testing` module remain the only bench-facing facade), ensuring the symbol copy_bytes_overshooting_for_bench from module simd_copy is only reachable via the bench-only testing shim.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/benches/wildcopy_candidates.rs`:
- Around line 91-115: select_baseline_copy_kernel currently only picks AVX2 vs
scalar and therefore diverges from the production dispatcher (copy_strategy)
which may choose AVX-512 or SSE2 based on features and length; update
select_baseline_copy_kernel to use the same dispatch policy as production by
delegating to or replicating copy_strategy logic so that when AVX-512 or SSE2
are selected in production the bench uses the same baseline kernel name, chunk
and kernel function (e.g., return the same BenchPath values used for AVX-512,
AVX2, SSE2 and scalar in copy_strategy), honoring the same feature checks and
length thresholds (len >= 64 for AVX-512, len >= 16 for SSE2, etc.) so bench
timings match production behavior.
- Line 151: The benchmark's lengths array (variable names: lengths in
wildcopy_candidates.rs) starts at 64 and 65 so it omits small-block cases;
update the lengths array to include at least a couple of sub-64 sizes (for
example 1, 8, 32 or similar small values) before 64 so the bench measures the
small-block regime called out in issue `#87` and can catch chunk-threshold
performance changes; ensure the new small sizes are inserted into the existing
lengths array literal used by the benchmark loop so all tests run as before but
now include those sub-64 cases.
---
Duplicate comments:
In `@zstd/src/decoding/mod.rs`:
- Around line 25-26: The public re-export exposes the bench helper on the
decoding public API; change the visibility of the re-export so it is not public
(e.g., replace `pub use self::simd_copy::copy_bytes_overshooting_for_bench;`
with a non-public re-export such as `pub(crate) use` or remove the re-export and
let the `testing` module remain the only bench-facing facade), ensuring the
symbol copy_bytes_overshooting_for_bench from module simd_copy is only reachable
via the bench-only testing shim.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 6bbbdb5f-3697-4c07-9250-db86c02a248e
📒 Files selected for processing (3)
zstd/benches/wildcopy_candidates.rszstd/src/decoding/mod.rszstd/src/lib.rs
There was a problem hiding this comment.
Pull request overview
Adds bench-only tooling and UI improvements to support research/iteration on decode wildcopy performance (issue #87), while improving the benchmark dashboard’s usability when inspecting recent runs.
Changes:
- Adds a feature-gated internal entrypoint exposing the production overshooting copy for benchmarks (
bench_internals). - Introduces a new Criterion benchmark (
wildcopy_candidates) to compare copy kernels across payload sizes. - Updates the GitHub Pages benchmark dashboard to improve chart sizing and add From/To snapshot range filtering.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
zstd/src/lib.rs |
Re-exports the new bench-only copy entrypoint via the testing module. |
zstd/src/decoding/simd_copy.rs |
Adds a bench_internals-gated wrapper around the production overshooting copy routine. |
zstd/src/decoding/mod.rs |
Re-exports the bench-only copy wrapper from the decoding module. |
zstd/Cargo.toml |
Registers the new wildcopy_candidates Criterion bench (gated by bench_internals). |
zstd/benches/wildcopy_candidates.rs |
Implements the baseline vs candidate copy-kernel benchmark with correctness checks vs production output. |
BENCHMARKS.md |
Documents the issue #87 research mapping, sample measurements, and follow-up link. |
.github/bench-dashboard/index.html |
Improves chart layout and adds From/To selectors with inclusive time-window filtering. |
|
@coderabbitai Duplicate body item about |
|
Tip For best results, initiate chat on the files or code changes.
🧠 Learnings used |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/benches/wildcopy_candidates.rs`:
- Around line 191-199: Extract the candidate construction logic into
select_candidate_copy_kernel by returning a BenchPath (instead of just kernel)
so the chunk size and name are chosen in one place; update
select_candidate_copy_kernel to return BenchPath (containing name, chunk,
kernel) and change bench_wildcopy_candidates to use that returned BenchPath
rather than computing chunk via string comparison on candidate_name (remove the
special-case for "candidate_avx2_unroll2" there); ensure you update all call
sites to use the new BenchPath return type and keep symbols like BenchPath,
select_candidate_copy_kernel, bench_wildcopy_candidates, candidate_avx2_unroll2,
candidate_name, and candidate_copy_kernel in mind when locating the changes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 2ab1af44-9f8e-40ab-873d-f62c7450d8f8
📒 Files selected for processing (3)
zstd/benches/wildcopy_candidates.rszstd/src/decoding/mod.rszstd/src/lib.rs
There was a problem hiding this comment.
Pull request overview
This PR improves performance research and benchmarking ergonomics by (1) adding a focused Criterion bench to compare wildcopy kernels and (2) enhancing the GitHub Pages benchmark dashboard with a selectable snapshot window.
Changes:
- Added a
wildcopy_candidatesCriterion benchmark (feature-gated bybench_internals) and a bench-only internal entrypoint to call the production overshooting copy routine. - Updated the benchmark dashboard UI to use a taller chart area and added
From/Tosnapshot selectors with inclusive time-window filtering. - Documented Issue #87 research mapping and local sample measurements in
BENCHMARKS.md.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| zstd/src/lib.rs | Re-exports a bench-only wildcopy entrypoint via structured_zstd::testing behind bench_internals. |
| zstd/src/decoding/simd_copy.rs | Adds copy_bytes_overshooting_for_bench wrapper (cfg-gated) to expose the production wildcopy for benchmarks. |
| zstd/src/decoding/mod.rs | Makes simd_copy visible within the crate (pub(crate)) to support the bench-only re-export path. |
| zstd/Cargo.toml | Registers the new wildcopy_candidates bench gated by bench_internals. |
| zstd/benches/wildcopy_candidates.rs | New Criterion bench comparing baseline vs candidate copy kernels across sizes with correctness checks vs production output. |
| BENCHMARKS.md | Adds Issue #87 research mapping + local sample results + follow-up linkage. |
| .github/bench-dashboard/index.html | Improves chart layout and introduces snapshot window filtering (From/To) for relative dashboard points. |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/benches/wildcopy_candidates.rs`:
- Around line 116-129: select_candidate_copy_kernel currently chooses the AVX2
64-byte chunk path solely by CPU feature, which forces small lengths to be
rounded up to 64 and prevents benchmarking the scalar fallback; update
select_candidate_copy_kernel (and the duplicate at the other location) to take
the requested length (or a max_chunk parameter) and choose the candidate based
on both CPU feature and requested length (e.g., if avx2 is available but
requested length < 64 select the scalar/smaller-chunk BenchPath or a smaller
unrolled kernel), referencing the existing symbols select_candidate_copy_kernel,
copy_candidate_unroll2_avx2, and copy_candidate_scalar so the function returns
the appropriate chunk and kernel for small lengths instead of always returning
chunk: 64 on AVX2.
- Around line 170-176: The bench currently hardcodes baseline_scalar for all
non-x86 targets in select_baseline_copy_kernel (returning a BenchPath with name
"baseline_scalar" and kernel copy_candidate_scalar), which misrepresents the
production baseline; either (A) change select_baseline_copy_kernel to mirror the
decoder's runtime strategy selection by calling the same production dispatcher
(the runtime CPU/feature probe used elsewhere) and return the matching BenchPath
so non-x86 targets use their actual backend (NEON/scalar/etc.), or (B) restrict
this benchmark to only the architectures the bench models by cfg-gating the
entire bench to the same target_arch set as the production dispatcher; update
the implementation of select_baseline_copy_kernel and its returned BenchPath
accordingly so results compare candidates against the real runtime baseline
rather than always "baseline_scalar".
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: cb87a0f9-19de-45c7-a93f-c7dfbf4747ab
📒 Files selected for processing (1)
zstd/benches/wildcopy_candidates.rs
There was a problem hiding this comment.
♻️ Duplicate comments (2)
zstd/benches/wildcopy_candidates.rs (2)
170-176:⚠️ Potential issue | 🟠 MajorNon-x86 runs still benchmark
scalar, not the production baseline.Line 171 hardwires
baseline_scalarfor every non-x86 target. On ARM/NEON runners this bench would compare against a path the decoder does not actually use, so those results are not representative for issue#87. Either mirror the non-x86 dispatcher here orcfg-gate the benchmark to the architectures it actually models. Based on learnings,copy_bytes_overshootingstrategy coverage adapts acrossAVX-512/AVX2/SSE2/NEON/scalar, so the production dispatcher is not scalar-only off x86.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@zstd/benches/wildcopy_candidates.rs` around lines 170 - 176, The benchmark currently forces select_baseline_copy_kernel to return the scalar BenchPath (name "baseline_scalar", kernel copy_candidate_scalar) for all non-x86 targets; update it to mirror the production dispatcher used by copy_bytes_overshooting so ARM/NEON runs pick the NEON baseline instead of scalar, or alternatively cfg-gate the entire benchmark to only run on the architectures modeled (x86/x86_64/arm/arm64) to match production; modify the select_baseline_copy_kernel implementation (or add matching #[cfg] arms) so the returned BenchPath name and kernel match the real dispatcher choices (e.g., NEON when NEON is available) rather than always returning baseline_scalar.
116-129:⚠️ Potential issue | 🟠 MajorDrive candidate selection from
len, not only CPU caps.Line 202 caches a single
candidate_pathbefore the loop, so on AVX2 hosts the new17/33/63cases are all rounded up to64bycopy_with_overshoot_policy(). That meanscopy_candidate_unroll2_avx2()never exercises its< 64scalar branch, and the small-block points are still reporting the wide AVX2 path rather than candidate small-block behavior. Select the candidate inside thefor len in lengthsloop and key it offlenas well as runtime features.Also applies to: 202-203
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@zstd/benches/wildcopy_candidates.rs` around lines 116 - 129, The candidate path is chosen once by select_candidate_copy_kernel so AVX2 hosts always pick the 64-byte path; change the logic to select the BenchPath inside the loop over lengths (where `len` is available) and pick based on both runtime CPU features (std::arch::is_x86_feature_detected!("avx2")) and the current `len` so small lengths choose `copy_candidate_scalar` (or a smaller chunk) while larger lengths use `copy_candidate_unroll2_avx2`; update any use of `candidate_path`/BenchPath so selection occurs per-iteration before calling `copy_with_overshoot_policy()` to ensure the scalar branch of `copy_candidate_unroll2_avx2` and small-block candidates are exercised.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@zstd/benches/wildcopy_candidates.rs`:
- Around line 170-176: The benchmark currently forces
select_baseline_copy_kernel to return the scalar BenchPath (name
"baseline_scalar", kernel copy_candidate_scalar) for all non-x86 targets; update
it to mirror the production dispatcher used by copy_bytes_overshooting so
ARM/NEON runs pick the NEON baseline instead of scalar, or alternatively
cfg-gate the entire benchmark to only run on the architectures modeled
(x86/x86_64/arm/arm64) to match production; modify the
select_baseline_copy_kernel implementation (or add matching #[cfg] arms) so the
returned BenchPath name and kernel match the real dispatcher choices (e.g., NEON
when NEON is available) rather than always returning baseline_scalar.
- Around line 116-129: The candidate path is chosen once by
select_candidate_copy_kernel so AVX2 hosts always pick the 64-byte path; change
the logic to select the BenchPath inside the loop over lengths (where `len` is
available) and pick based on both runtime CPU features
(std::arch::is_x86_feature_detected!("avx2")) and the current `len` so small
lengths choose `copy_candidate_scalar` (or a smaller chunk) while larger lengths
use `copy_candidate_unroll2_avx2`; update any use of `candidate_path`/BenchPath
so selection occurs per-iteration before calling `copy_with_overshoot_policy()`
to ensure the scalar branch of `copy_candidate_unroll2_avx2` and small-block
candidates are exercised.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 21277d7e-7f8c-49b7-aaf6-a0c90abe2e7f
📒 Files selected for processing (1)
zstd/benches/wildcopy_candidates.rs
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/benches/wildcopy_candidates.rs`:
- Around line 123-136: When an AVX2 host is detected but len < 64, the scalar
fallback uses chunk = size_of::<usize>() which changes overshoot behavior; in
select_candidate_copy_kernel ensure the fallback candidate uses the same
production chunk as the AVX2 path so both paths pay the same overshoot (i.e.,
return BenchPath with chunk = 64 even when kernel = copy_candidate_scalar) or
alternatively skip returning a candidate when no drop-in candidate exists;
update the BenchPath construction in select_candidate_copy_kernel (and analogous
spots) so chunk matches the production copy size used by
copy_with_overshoot_policy.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 97b84742-3f90-4903-a723-eca1d3f077db
📒 Files selected for processing (1)
zstd/benches/wildcopy_candidates.rs
There was a problem hiding this comment.
Pull request overview
This PR improves benchmark research and visualization tooling for structured-zstd, adding a focused wildcopy candidate benchmark (for issue #87 research) and enhancing the bench dashboard UX with a selectable snapshot range.
Changes:
- Add
wildcopy_candidatesCriterion benchmark (feature-gated bybench_internals) to compare copy-kernel candidates vs the production wildcopy behavior. - Expose a bench-only entrypoint via
structured_zstd::testingto call the production overshooting copy function from benches. - Improve the benchmark dashboard layout and add
From/Tosnapshot selectors to filter the displayed time window.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
zstd/src/lib.rs |
Re-export bench-only internal entrypoint under testing (feature-gated). |
zstd/src/decoding/simd_copy.rs |
Add bench-only wrapper for the production overshooting wildcopy helper. |
zstd/src/decoding/mod.rs |
Make simd_copy accessible within the crate for bench re-exports. |
zstd/Cargo.toml |
Register the new wildcopy_candidates benchmark (requires bench_internals). |
zstd/benches/wildcopy_candidates.rs |
New benchmark suite comparing baseline/candidate copy kernels and validating output vs production. |
BENCHMARKS.md |
Document issue #87 mapping, sample results, and follow-up linkage. |
.github/bench-dashboard/index.html |
Increase chart height and add inclusive From/To snapshot range filtering. |
There was a problem hiding this comment.
Pull request overview
This PR improves benchmarking support for structured-zstd by (1) adding a Criterion research bench for evaluating alternative overshooting copy kernels (wildcopy candidates) and (2) improving the GitHub Pages benchmark dashboard usability with a selectable snapshot range and better chart sizing.
Changes:
- Add a
bench_internals-gatedwildcopy_candidatesCriterion benchmark to compare baseline vs candidate copy kernels. - Expose a
bench_internals-gated bench-only entrypoint for the production overshooting copy routine viastructured_zstd::testing. - Enhance the benchmark dashboard with From/To snapshot selectors and a taller, responsive chart area.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
zstd/src/lib.rs |
Re-export the bench-only wildcopy entrypoint from structured_zstd::testing (feature-gated). |
zstd/src/decoding/simd_copy.rs |
Add copy_bytes_overshooting_for_bench behind bench_internals to call the production implementation. |
zstd/src/decoding/mod.rs |
Make simd_copy pub(crate) so it can be re-exported for benches. |
zstd/Cargo.toml |
Register the new wildcopy_candidates bench behind bench_internals. |
zstd/benches/wildcopy_candidates.rs |
New x86/x86_64-only benchmark comparing baseline vs candidate copy kernels against production output. |
BENCHMARKS.md |
Document issue #87 research mapping, sample numbers, and follow-up link. |
.github/bench-dashboard/index.html |
Add From/To filters and update chart sizing / aspect ratio behavior. |
- benchmark baseline and candidate with symmetric overshoot policy - add non-aligned length coverage and tighten bench shim visibility
- baseline selector now mirrors avx512/avx2/sse2/scalar thresholds - include sub-64 lengths and keep bench shim only under testing facade
17207ec to
2b871a7
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (2)
zstd/benches/wildcopy_candidates.rs (1)
240-267:⚠️ Potential issue | 🟠 MajorValidate the rounded wildcopy span, not only the requested prefix.
These assertions stop at
[..len], so a candidate that preserves the firstlenbytes but diverges in the rounded-up overshoot region would still pass here. Since issue#87explicitly calls out preserving the wildcopy overshoot contract, the check should cover the fullcopy_multiplespan whenever the buffers are sized for it.🧪 Minimal tightening
+ let check_len = len.next_multiple_of(baseline_path.chunk); assert_eq!( - &dst_baseline[..len], - &dst_production[..len], + &dst_baseline[..check_len], + &dst_production[..check_len], "baseline path must match production wildcopy for len={len}" ); assert_eq!( - &dst_candidate[..len], - &dst_production[..len], + &dst_candidate[..check_len], + &dst_production[..check_len], "candidate path must match production wildcopy for len={len}" );🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@zstd/benches/wildcopy_candidates.rs` around lines 240 - 267, The assertions only compare the first len bytes, missing differences in the rounded/overshoot region; compute the full copied span (e.g. let rounded = copy_multiple(len) or the function used to round wildcopy lengths) and assert equality over &dst_baseline[..rounded], &dst_candidate[..rounded] against &dst_production[..rounded] after the unsafe calls to copy_bytes_overshooting_for_bench and copy_with_overshoot_policy so the test validates the entire wildcopy/overshoot contract rather than just the requested prefix (use the existing symbols copy_bytes_overshooting_for_bench, copy_with_overshoot_policy, dst_production, dst_baseline, dst_candidate, len, and copy_multiple).zstd/src/decoding/mod.rs (1)
24-24: 🧹 Nitpick | 🔵 TrivialNarrow the visibility back to the bench shim.
pub(crate) mod simd_copy;exposes the whole decoder-internal module across the crate again, but this PR still only needs one bench-only entrypoint. Re-exporting just the shim keeps the surface tighter.♻️ Possible tightening
- pub(crate) mod simd_copy; + mod simd_copy; + #[cfg(feature = "bench_internals")] + pub(crate) use self::simd_copy::copy_bytes_overshooting_for_bench;Based on learnings, internal decode/encode helpers remain
pub(crate)by default, and explicit public shims are exposed only behind a dedicated feature flag.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@zstd/src/decoding/mod.rs` at line 24, Change the module from being crate-visible to private and only re-export the bench entrypoint: replace "pub(crate) mod simd_copy;" with "mod simd_copy;" and add a focused re-export for the bench shim (e.g. "pub(crate) use simd_copy::bench_shim;") placed behind the dedicated bench feature gate (cfg(feature = "bench")) so only the shim is exposed rather than the entire simd_copy module.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@BENCHMARKS.md`:
- Around line 72-74: Update the decision label to clarify scope by changing the
line starting with "Decision: **GO** for production integration of AVX2 unroll2
candidate." to a microbench-specific, provisional recommendation—for example,
prefix with "Microbench recommendation / provisional GO:" or "Provisional GO
(microbench only):" and mention that this conclusion is based on local
`wildcopy_candidates` microbench data so it does not close issue `#87`; keep the
follow-up implementation issue reference [`#108`] as-is.
In `@zstd/benches/wildcopy_candidates.rs`:
- Around line 142-156: The benchmark incorrectly compares an AVX2 candidate
against an AVX-512 baseline: update select_candidate_copy_kernel to detect when
the chosen baseline is the AVX-512 path (the same detection used by
select_baseline_copy_kernel which returns baseline_avx512 for len >= 64 on
AVX-512 hosts) and either skip returning candidate_avx2_unroll2 or instead
return an AVX-512-equivalent candidate; specifically, when baseline_avx512 is
active do not select candidate_avx2_unroll2 (or add a candidate_avx512_unroll2
and return it), otherwise keep the existing AVX2 and scalar fallback choices
(refer to select_candidate_copy_kernel, select_baseline_copy_kernel,
baseline_avx512, and candidate_avx2_unroll2).
---
Duplicate comments:
In `@zstd/benches/wildcopy_candidates.rs`:
- Around line 240-267: The assertions only compare the first len bytes, missing
differences in the rounded/overshoot region; compute the full copied span (e.g.
let rounded = copy_multiple(len) or the function used to round wildcopy lengths)
and assert equality over &dst_baseline[..rounded], &dst_candidate[..rounded]
against &dst_production[..rounded] after the unsafe calls to
copy_bytes_overshooting_for_bench and copy_with_overshoot_policy so the test
validates the entire wildcopy/overshoot contract rather than just the requested
prefix (use the existing symbols copy_bytes_overshooting_for_bench,
copy_with_overshoot_policy, dst_production, dst_baseline, dst_candidate, len,
and copy_multiple).
In `@zstd/src/decoding/mod.rs`:
- Line 24: Change the module from being crate-visible to private and only
re-export the bench entrypoint: replace "pub(crate) mod simd_copy;" with "mod
simd_copy;" and add a focused re-export for the bench shim (e.g. "pub(crate) use
simd_copy::bench_shim;") placed behind the dedicated bench feature gate
(cfg(feature = "bench")) so only the shim is exposed rather than the entire
simd_copy module.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: d18239bf-9656-4ab7-9943-cf9e592f94b3
📒 Files selected for processing (7)
.github/bench-dashboard/index.htmlBENCHMARKS.mdzstd/Cargo.tomlzstd/benches/wildcopy_candidates.rszstd/src/decoding/mod.rszstd/src/decoding/simd_copy.rszstd/src/lib.rs
There was a problem hiding this comment.
Pull request overview
Adds bench-focused infrastructure for researching alternative “wildcopy” kernels (issue #87), plus usability improvements to the benchmark dashboard used for visualizing relative performance trends.
Changes:
- Added a new
wildcopy_candidatesCriterion benchmark (feature-gated bybench_internals) to compare candidate copy kernels vs the production overshooting copy. - Exposed a bench-only entrypoint to call the production overshooting copy implementation for parity checking in benchmarks.
- Improved the benchmark dashboard with a taller chart area and
From/Tosnapshot range filtering.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
zstd/src/lib.rs |
Re-exports the new bench-only overshooting copy entrypoint under structured_zstd::testing (feature-gated). |
zstd/src/decoding/simd_copy.rs |
Adds copy_bytes_overshooting_for_bench behind bench_internals to reach production wildcopy from benches. |
zstd/src/decoding/mod.rs |
Makes simd_copy visible within the crate (pub(crate)) so the bench re-export can reference it. |
zstd/Cargo.toml |
Registers the new wildcopy_candidates bench target and gates it with bench_internals. |
zstd/benches/wildcopy_candidates.rs |
New benchmark comparing baseline vs candidate kernels with correctness parity checks against production wildcopy. |
BENCHMARKS.md |
Documents the issue #87 research mapping, sample measurements, and follow-up link. |
.github/bench-dashboard/index.html |
Adds snapshot From/To selectors and adjusts chart sizing to keep legend/toggles visible. |
|
@coderabbitai Duplicate review-body findings were addressed in the latest commit as well: |
|
Tip For best results, initiate chat on the files or code changes.
Both are good hardening steps. 🧠 Learnings used |
There was a problem hiding this comment.
Pull request overview
Adds bench-only infrastructure and a focused Criterion microbench to evaluate alternative “wildcopy” kernels for issue #87, plus improves the benchmark dashboard UX by resizing the chart and adding snapshot range filters.
Changes:
- Expose a bench-only entrypoint for the production
copy_bytes_overshootingimplementation (feature-gated bybench_internals). - Add a new Criterion benchmark target (
wildcopy_candidates) to compare baseline vs candidate copy kernels and validate byte-exact parity. - Enhance the GH Pages benchmark dashboard with a taller chart and
From/Tosnapshot selectors to filter the visible time window.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
zstd/src/lib.rs |
Re-export the new bench-only copy entrypoint from structured_zstd::testing behind bench_internals. |
zstd/src/decoding/simd_copy.rs |
Add a bench-only wrapper entrypoint for the production overshooting copy routine. |
zstd/src/decoding/mod.rs |
Re-export the bench-only entrypoint within the decoding module (crate-visible, feature-gated). |
zstd/Cargo.toml |
Register the new wildcopy_candidates Criterion bench (requires bench_internals). |
zstd/benches/wildcopy_candidates.rs |
New microbench comparing baseline vs candidate kernels with correctness assertions and ISA dispatch. |
BENCHMARKS.md |
Document the issue #87 research mapping, sample results, and provisional recommendation. |
.github/bench-dashboard/index.html |
UI/layout updates plus From/To range filtering for benchmark snapshots. |
There was a problem hiding this comment.
Pull request overview
This PR enhances the project’s benchmarking and performance-research workflow by (1) adding a new Criterion microbench to evaluate decoder wildcopy copy-kernel candidates for issue #87 research, and (2) improving the benchmark dashboard UX with a selectable snapshot range window.
Changes:
- Expose a bench-only shim (
structured_zstd::testing::copy_bytes_overshooting_for_bench) for benchmarking the production wildcopy implementation underbench_internals. - Add a new Criterion benchmark target
wildcopy_candidates(feature-gated) to compare candidate copy kernels vs the production overshooting wildcopy behavior. - Improve the gh-pages benchmark dashboard layout and add From/To snapshot selectors to filter the displayed time window.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
zstd/src/lib.rs |
Adds a bench_internals-gated public testing facade exposing a wildcopy bench shim. |
zstd/src/decoding/simd_copy.rs |
Adds a bench_internals entrypoint delegating to the production overshooting copy implementation. |
zstd/src/decoding/mod.rs |
Re-exports the bench-only copy entrypoint (crate-visible) from the decoding module. |
zstd/Cargo.toml |
Registers the new wildcopy_candidates Criterion bench gated behind bench_internals. |
zstd/benches/wildcopy_candidates.rs |
Introduces the x86/x86_64 microbench comparing production wildcopy vs baseline/candidate kernels. |
BENCHMARKS.md |
Documents issue #87 research mapping, local microbench results, and a provisional recommendation. |
.github/bench-dashboard/index.html |
Updates chart sizing and adds From/To snapshot selectors + filtering logic for the dashboard. |
Summary
gh-pagessource:From/Tosnapshot selectors to focus on recent measurement pointswildcopy_candidates(feature-gated bybench_internals) for issue research(decoding): evaluate advanced SIMD wildcopy paths beyond current baseline #87 researchIssue linkage
Bench findings (local)
cargo bench --bench wildcopy_candidates -p structured-zstd --features bench_internals -- --output-format bencher3 ns2 ns7 ns4 ns28 ns14 ns94 ns58 ns347 ns268 ns1368 ns1121 nsValidation
cargo fmt -- --checkcargo check --workspacecargo clippy --all-targets -- -D warningscargo check -p structured-zstd --features bench_internals --benchescargo nextest run --workspacecargo test --doc --workspaceSummary by CodeRabbit
New Features
UI
Tests
Documentation
Refactor