Skip to content

perf(bench): add wildcopy candidate research bench and dashboard range filter#107

Merged
polaz merged 12 commits intomainfrom
perf/#87-wildcopy-research
Apr 12, 2026
Merged

perf(bench): add wildcopy candidate research bench and dashboard range filter#107
polaz merged 12 commits intomainfrom
perf/#87-wildcopy-research

Conversation

@polaz
Copy link
Copy Markdown
Member

@polaz polaz commented Apr 12, 2026

Summary

  • add benchmark dashboard usability fixes in gh-pages source:
    • taller chart canvas so legend/series toggles remain visible
    • From/To snapshot selectors to focus on recent measurement points
  • add a dedicated Criterion bench wildcopy_candidates (feature-gated by bench_internals) for issue research(decoding): evaluate advanced SIMD wildcopy paths beyond current baseline #87 research
  • expose a bench-only internal entrypoint for production wildcopy kernel comparison

Issue linkage

Bench findings (local)

cargo bench --bench wildcopy_candidates -p structured-zstd --features bench_internals -- --output-format bencher

  • baseline/64: 3 ns
  • candidate/64: 2 ns
  • baseline/256: 7 ns
  • candidate/256: 4 ns
  • baseline/1024: 28 ns
  • candidate/1024: 14 ns
  • baseline/4096: 94 ns
  • candidate/4096: 58 ns
  • baseline/16384: 347 ns
  • candidate/16384: 268 ns
  • baseline/65536: 1368 ns
  • candidate/65536: 1121 ns

Validation

  • cargo fmt -- --check
  • cargo check --workspace
  • cargo clippy --all-targets -- -D warnings
  • cargo check -p structured-zstd --features bench_internals --benches
  • cargo nextest run --workspace
  • cargo test --doc --workspace

Summary by CodeRabbit

  • New Features

    • Added date range (From/To) controls and inclusive time-window filtering to the dashboard.
  • UI

    • Chart now fills its container with fixed visual height; time-window select lists populate dynamically, retain/validate selections, and update with filter changes.
  • Tests

    • Added an x86/x86_64 benchmark comparing baseline and candidate copy strategies and registered a new benchmark target.
  • Documentation

    • Added benchmark research notes with measurements and a provisional GO recommendation.
  • Refactor

    • Exposed bench-only internals behind a feature gate to support benchmarking.

Copilot AI review requested due to automatic review settings April 12, 2026 10:33
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 12, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2d982636-19cc-4751-a9b2-54f49cef79c8

📥 Commits

Reviewing files that changed from the base of the PR and between 7fb9592 and 70e7a5c.

📒 Files selected for processing (1)
  • zstd/src/decoding/simd_copy.rs

📝 Walkthrough

Walkthrough

Adds a bench-only Criterion benchmark and feature-gated wrappers to exercise simd_copy wildcopy candidates; re-exports a bench entrypoint; and enhances the bench-dashboard with From/To time-window selectors and Chart.js sizing/layout adjustments.

Changes

Cohort / File(s) Summary
Bench dashboard
\.github/bench-dashboard/index.html
Added From/To time-point selects; split filtering into core vs time-window (matchesCoreFilters, applyTimeWindow, updateTimeWindowOptions); added renderPointOptions; adjusted chart sizing and maintainAspectRatio: false.
Bench target & new benchmark
zstd/Cargo.toml, zstd/benches/wildcopy_candidates.rs
Added feature-gated wildcopy_candidates bench target and new Criterion benchmark comparing SIMD baselines (AVX-512/AVX2/SSE2/scalar) vs an AVX2 unroll2 candidate; x86/x86_64-only kernels with noop main on other targets.
Decoder simd visibility & bench wrappers
zstd/src/decoding/mod.rs, zstd/src/decoding/simd_copy.rs, zstd/src/lib.rs
Exposed a bench-gated crate-scoped re-export and added bench-only wrapper functions copy_bytes_overshooting_for_bench to call existing copy_bytes_overshooting(...) for benchmarks (feature = "bench_internals").
Docs
BENCHMARKS.md
Appended "Wildcopy Candidates" research section with links, measurement table, checklist, and provisional GO decision for the AVX2 unroll2 candidate.

Sequence Diagram(s)

sequenceDiagram
  participant Browser as Browser (bench-dashboard)
  participant Runner as Bench Runner (cargo + Criterion)
  participant BenchLib as zstd::testing (bench_internals)
  participant Simd as simd_copy impl
  participant CPU as CPU (ISA/features)

  Browser->>Runner: request benchmark data / trigger run (filtered by From/To)
  Runner->>BenchLib: execute bench_wildcopy_candidates
  BenchLib->>Simd: call copy_bytes_overshooting_for_bench(src,dst,len)
  Simd->>CPU: invoke ISA-specific kernel (AVX-512/AVX2/SSE2/scalar)
  CPU-->>Simd: return copied bytes
  Simd-->>BenchLib: provide output for verification
  BenchLib-->>Runner: report throughput & correctness
  Runner-->>Browser: update dashboard charts (respecting time window)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Possibly related issues

  • perf(decoding): integrate AVX2 unroll2 wildcopy candidate #108 — Benchmarks and exposes the AVX2 unroll2 candidate and bench-only wrapper required to evaluate integration into the production wildcopy path.
  • #87: research(decoding): evaluate advanced SIMD wildcopy paths beyond current baseline — This PR provides the benchmark harness and candidate kernels needed to fulfill the research objectives.

Possibly related PRs

Poem

🐇
I hopped through lanes of AVX and SSE,
unrolled the bytes so wild and free,
From and To keep charts in line,
benches hum, the kernels shine,
rabbit claps — the bytes agree!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely summarizes the two main changes: adding a wildcopy candidate research benchmark and a dashboard range filter feature.
Linked Issues check ✅ Passed The PR implements all coding requirements from #87: benchmarks a wildco candidate (AVX2 unroll2) vs baseline, validates correctness with existing tests, provides reproducible performance gains (2-5x improvement), and documents go/no-go decision.
Out of Scope Changes check ✅ Passed All changes directly support #87 research objectives: benchmark code/infrastructure, dashboard improvements for visualizing results, and documentation. No unrelated modifications detected.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch perf/#87-wildcopy-research

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds benchmarking tooling for researching alternative decode wildcopy kernels (issue #87) and improves the benchmark dashboard UX by making the chart more readable and allowing users to filter to a snapshot range.

Changes:

  • Add a feature-gated Criterion bench (wildcopy_candidates) and a bench-only entrypoint to call the production overshooting wildcopy implementation.
  • Register the new bench in zstd/Cargo.toml behind bench_internals.
  • Improve the gh-pages benchmark dashboard with a taller chart and From/To snapshot selectors to focus on a time window.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
zstd/src/lib.rs Re-export bench-only wildcopy entrypoint via structured_zstd::testing for benchmark use.
zstd/src/decoding/simd_copy.rs Add copy_bytes_overshooting_for_bench wrapper behind bench_internals.
zstd/src/decoding/mod.rs Re-export the bench-only entrypoint from decoding under bench_internals.
zstd/Cargo.toml Register new wildcopy_candidates Criterion bench gated by bench_internals.
zstd/benches/wildcopy_candidates.rs New microbench comparing production wildcopy vs an AVX2 unroll candidate.
.github/bench-dashboard/index.html Dashboard layout improvements + snapshot range filtering (From/To).

Comment thread zstd/src/decoding/mod.rs
Comment thread zstd/benches/wildcopy_candidates.rs
Comment thread zstd/benches/wildcopy_candidates.rs Outdated
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/benches/wildcopy_candidates.rs`:
- Around line 68-126: The benchmark is comparing different call paths: baseline
uses copy_bytes_overshooting_for_bench (which does strategy selection, capacity
checks, and rounding) while the candidate calls candidate_copy_kernel directly
and all tested lengths are 64-byte aligned—this produces an optimistic and
non-equivalent comparison. Fix by making the two measured paths equivalent:
either (A) call the selected kernel directly for both sides (invoke the same
kernel function used by copy_bytes_overshooting_for_bench for the baseline
instead of copy_bytes_overshooting_for_bench), or (B) wrap the candidate
invocation in the same dispatch/rounding logic (call
copy_bytes_overshooting_for_bench for both) so both pay the same
selection/rounding costs; also update the lengths array to include at least one
non-multiple-of-64 value (e.g., 65) so tail/rounded paths are exercised. Ensure
you modify the benchmark code that constructs lengths, the baseline invocation
using copy_bytes_overshooting_for_bench, and the candidate invocation using
candidate_copy_kernel so both paths are symmetric.

In `@zstd/src/decoding/mod.rs`:
- Line 24: Replace the broad module export with a private module and expose only
the bench shim: change the declaration `pub(crate) mod simd_copy;` to `mod
simd_copy;` inside `decoding`, then add a targeted re-export like `pub(crate)
use self::simd_copy::copy_bytes_overshooting_for_bench;` so only that function
is visible across the crate; finally update zstd/src/lib.rs to reference
`crate::decoding::copy_bytes_overshooting_for_bench` instead of importing the
entire `simd_copy` module.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: fcd3516a-4962-40fe-98c2-9a2fe3bc9523

📥 Commits

Reviewing files that changed from the base of the PR and between d405446 and 3d5a9c8.

📒 Files selected for processing (5)
  • BENCHMARKS.md
  • zstd/benches/wildcopy_candidates.rs
  • zstd/src/decoding/mod.rs
  • zstd/src/decoding/simd_copy.rs
  • zstd/src/lib.rs

Comment thread zstd/benches/wildcopy_candidates.rs Outdated
Comment thread zstd/src/decoding/mod.rs Outdated
@polaz polaz requested a review from Copilot April 12, 2026 11:49
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves the benchmarking workflow for structured-zstd by adding a new research-focused Criterion benchmark for SIMD “wildcopy” candidate evaluation and enhancing the benchmark dashboard UI with a selectable snapshot window.

Changes:

  • Added a new wildcopy_candidates Criterion benchmark (gated behind bench_internals) to compare copy kernels against the production overshooting wildcopy.
  • Exposed a bench-only internal entrypoint for the production wildcopy implementation and adjusted module visibility to support it.
  • Improved the benchmark dashboard with a larger chart area and From/To snapshot selectors to filter the displayed time window.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

Show a summary per file
File Description
zstd/src/lib.rs Re-exports the bench-only wildcopy entrypoint under the existing bench_internals-gated testing module.
zstd/src/decoding/simd_copy.rs Adds a bench_internals-gated wrapper around the production overshooting copy routine for benchmarking.
zstd/src/decoding/mod.rs Makes simd_copy visible within the crate to enable the bench-only re-export path.
zstd/Cargo.toml Registers the new wildcopy_candidates bench and gates it behind bench_internals.
zstd/benches/wildcopy_candidates.rs Implements the new Criterion benchmark comparing baseline vs a candidate AVX2 unroll2 kernel (with scalar fallback).
BENCHMARKS.md Documents the Issue #87 research mapping, local sample measurements, and decision/follow-up linkage.
.github/bench-dashboard/index.html Adds snapshot range selectors and adjusts layout/chart sizing to improve dashboard usability.

Copy link
Copy Markdown

@sw-release-bot sw-release-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'structured-zstd vs C FFI (x86_64-gnu)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.30.

Benchmark suite Current: 70e7a5c Previous: 051fed2 Ratio
compress/fastest/decodecorpus-z000033/matrix/c_ffi 3.736 ms 2.766 ms 1.35

This comment was automatically generated by workflow using github-action-benchmark.

CC: @polaz

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
zstd/src/decoding/mod.rs (1)

25-26: 🛠️ Refactor suggestion | 🟠 Major

Keep the bench shim off the public decoding surface.

pub use here makes the helper reachable as structured_zstd::decoding::copy_bytes_overshooting_for_bench, so the hidden testing module is no longer the sole bench-only escape hatch. Keep this re-export non-public from decoding and let testing remain the only public facade for bench builds.

Based on learnings, internal decode/encode helpers remain pub(crate) by default, and explicit public shims are exposed only behind a dedicated feature flag in a separate file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@zstd/src/decoding/mod.rs` around lines 25 - 26, The public re-export exposes
the bench helper on the decoding public API; change the visibility of the
re-export so it is not public (e.g., replace `pub use
self::simd_copy::copy_bytes_overshooting_for_bench;` with a non-public re-export
such as `pub(crate) use` or remove the re-export and let the `testing` module
remain the only bench-facing facade), ensuring the symbol
copy_bytes_overshooting_for_bench from module simd_copy is only reachable via
the bench-only testing shim.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/benches/wildcopy_candidates.rs`:
- Around line 91-115: select_baseline_copy_kernel currently only picks AVX2 vs
scalar and therefore diverges from the production dispatcher (copy_strategy)
which may choose AVX-512 or SSE2 based on features and length; update
select_baseline_copy_kernel to use the same dispatch policy as production by
delegating to or replicating copy_strategy logic so that when AVX-512 or SSE2
are selected in production the bench uses the same baseline kernel name, chunk
and kernel function (e.g., return the same BenchPath values used for AVX-512,
AVX2, SSE2 and scalar in copy_strategy), honoring the same feature checks and
length thresholds (len >= 64 for AVX-512, len >= 16 for SSE2, etc.) so bench
timings match production behavior.
- Line 151: The benchmark's lengths array (variable names: lengths in
wildcopy_candidates.rs) starts at 64 and 65 so it omits small-block cases;
update the lengths array to include at least a couple of sub-64 sizes (for
example 1, 8, 32 or similar small values) before 64 so the bench measures the
small-block regime called out in issue `#87` and can catch chunk-threshold
performance changes; ensure the new small sizes are inserted into the existing
lengths array literal used by the benchmark loop so all tests run as before but
now include those sub-64 cases.

---

Duplicate comments:
In `@zstd/src/decoding/mod.rs`:
- Around line 25-26: The public re-export exposes the bench helper on the
decoding public API; change the visibility of the re-export so it is not public
(e.g., replace `pub use self::simd_copy::copy_bytes_overshooting_for_bench;`
with a non-public re-export such as `pub(crate) use` or remove the re-export and
let the `testing` module remain the only bench-facing facade), ensuring the
symbol copy_bytes_overshooting_for_bench from module simd_copy is only reachable
via the bench-only testing shim.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6bbbdb5f-3697-4c07-9250-db86c02a248e

📥 Commits

Reviewing files that changed from the base of the PR and between 3d5a9c8 and e9e073e.

📒 Files selected for processing (3)
  • zstd/benches/wildcopy_candidates.rs
  • zstd/src/decoding/mod.rs
  • zstd/src/lib.rs

Comment thread zstd/benches/wildcopy_candidates.rs Outdated
Comment thread zstd/benches/wildcopy_candidates.rs Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds bench-only tooling and UI improvements to support research/iteration on decode wildcopy performance (issue #87), while improving the benchmark dashboard’s usability when inspecting recent runs.

Changes:

  • Adds a feature-gated internal entrypoint exposing the production overshooting copy for benchmarks (bench_internals).
  • Introduces a new Criterion benchmark (wildcopy_candidates) to compare copy kernels across payload sizes.
  • Updates the GitHub Pages benchmark dashboard to improve chart sizing and add From/To snapshot range filtering.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
zstd/src/lib.rs Re-exports the new bench-only copy entrypoint via the testing module.
zstd/src/decoding/simd_copy.rs Adds a bench_internals-gated wrapper around the production overshooting copy routine.
zstd/src/decoding/mod.rs Re-exports the bench-only copy wrapper from the decoding module.
zstd/Cargo.toml Registers the new wildcopy_candidates Criterion bench (gated by bench_internals).
zstd/benches/wildcopy_candidates.rs Implements the baseline vs candidate copy-kernel benchmark with correctness checks vs production output.
BENCHMARKS.md Documents the issue #87 research mapping, sample measurements, and follow-up link.
.github/bench-dashboard/index.html Improves chart layout and adds From/To selectors with inclusive time-window filtering.

Comment thread zstd/src/decoding/mod.rs Outdated
Comment thread zstd/src/lib.rs Outdated
@polaz
Copy link
Copy Markdown
Member Author

polaz commented Apr 12, 2026

@coderabbitai Duplicate body item about pub use self::simd_copy::copy_bytes_overshooting_for_bench in decoding is now fixed in commit f507c7e: decoding no longer publicly re-exports the bench shim, and testing re-exports directly from decoding::simd_copy.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 12, 2026

Tip

For best results, initiate chat on the files or code changes.

@polaz: Thanks for the update! The fix in commit f507c7e looks correct — removing the intermediate pub use from decoding/mod.rs and having testing re-export directly from decoding::simd_copy eliminates the duplicate path cleanly. The bench shim is now accessible via a single, unambiguous route.


🧠 Learnings used
Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-08T20:26:15.902Z
Learning: In `zstd/src/decoding/ringbuffer.rs` (structured-world/structured-zstd), the test `copy_bytes_overshooting_covers_all_copy_strategies` derives copy lengths from a helper `active_chunk_size_for_tests()` (not hardcoded values) so that single-chunk, multi-chunk, and sub-chunk (fallback) cases adapt to whichever SIMD backend width is active at runtime (AVX-512/AVX2/SSE2/NEON/scalar). Do not flag hardcoded test lengths as missing coverage for other backends — the test is backend-adaptive by design.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-09T06:17:40.408Z
Learning: In `zstd/src/decoding/ringbuffer.rs` (structured-world/structured-zstd), both `copy_with_checks` and `copy_with_nobranch_check` thread aggregate source/destination capacities (`m1_src_cap`, `m2_src_cap`, `f1_dst_cap`, `f2_dst_cap`) into `simd_copy::copy_bytes_overshooting` so the SIMD eligibility check (`min(src_len, dst_len) >= copy_multiple`) is applied uniformly. Do not flag the capacity parameters in `copy_with_nobranch_check` as unnecessary or inconsistent with `copy_with_checks`.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-08T21:51:56.626Z
Learning: In `zstd/src/decoding/ringbuffer.rs` (structured-world/structured-zstd), the test `copy_bytes_overshooting_preserves_prefix_for_runtime_chunk_lengths` (or equivalent) includes a misaligned extra-capacity case that uses `add(1)`-offset pointers and a `chunk_plus_one` length to exercise the rounded-up wildcopy overshoot path (where `copy_multiple > copy_at_least`) and validates prefix correctness. Do not flag the absence of a misaligned-pointer case in this test — it is present by design.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-08T21:51:51.926Z
Learning: In `zstd/src/decoding/ringbuffer.rs` (structured-world/structured-zstd), the test suite for `copy_bytes_overshooting` includes a misaligned extra-capacity case that exercises the rounded-up wildcopy path (where `copy_multiple > copy_at_least`) and validates prefix correctness for that scenario. Do not flag the absence of a misaligned/overshooting extra-capacity sub-case as missing test coverage.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-11T11:56:17.795Z
Learning: In `zstd/src/encoding/match_generator.rs` (structured-world/structured-zstd, PR `#102`), after commit 46927db, hash-mix kernel dispatch was moved from the global `HASH_MIX_KERNEL` AtomicU8 hot-path read into a per-matcher instance field `hash_mix_kernel` stored directly in `DfastMatchGenerator` and `RowMatchGenerator`. The hot path no longer performs an atomic load or global branch; kernel selection happens once at matcher initialisation and is subsequently a plain struct-field read. Do not flag the hot path as doing a per-call atomic load — that pattern was superseded by commit 46927db.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-06T01:40:24.378Z
Learning: In `zstd/benches/compare_ffi.rs` (structured-world/structured-zstd), Rust FastCOVER trains with the post-finalization content budget in both the `REPORT_DICT_TRAIN` emission path (around lines 208-225) and the Criterion benchmark path (around lines 266-280). Both paths were aligned in commit 8622344. Do not flag these ranges as using inconsistent budget values.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 99
File: zstd/src/encoding/frame_compressor.rs:779-805
Timestamp: 2026-04-11T07:48:58.281Z
Learning: In `zstd/src/encoding/frame_compressor.rs` (structured-world/structured-zstd, PR `#99`), a temporary compatibility guard exists for sub-256-byte compressed payloads: the 1-byte-FCS single-segment form is not yet fully C-FFI compatible for payloads below 256 bytes, and a compat guard was added to handle this case while the path is hardened. Do not flag this guard as unnecessary.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-09T15:16:10.603Z
Learning: In structured-world/structured-zstd, the `fuzz_exports` API concern was resolved via "Option B": internal decode/encode helpers remain `pub(crate)` by default, and explicit public shims are exposed only behind a dedicated feature flag (e.g., `fuzz`) in a separate file. Do not flag these internal symbols as needing a wider public API — the feature-gated shim layer is the intentional design.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-11T11:31:46.064Z
Learning: In `zstd/src/encoding/match_generator.rs` (structured-world/structured-zstd, PR `#102`), the test-only `with_forced_hash_mix_kernel(...)` helper forces the global `HASH_MIX_KERNEL` atomic to a specific kernel for coverage. Tests that call this helper must acquire a shared mutex lock *before* the `with_forced_hash_mix_kernel` call (not only around the assertion) to prevent cross-test kernel races when `cargo nextest` runs tests in parallel. This pattern was locked in commit 904d2b0. Do not flag the lock acquisition as unnecessarily early.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-09T15:16:17.570Z
Learning: In `zstd/src/huff0/huff0_decoder.rs` (structured-world/structured-zstd), `decode_symbol` and `next_state` are intentionally private (not `pub`). Public fuzz-only shims `fuzz_decode_symbol` and `fuzz_next_state` are exposed exclusively behind `#[cfg(feature = "fuzz_exports")]`. Do not flag `decode_symbol` or `next_state` as needing public visibility — the fuzz shim pattern is the intended API contract.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 58
File: .github/workflows/ci.yml:31-32
Timestamp: 2026-04-03T11:24:17.021Z
Learning: In `structured-world/structured-zstd`, the CI Clippy lint job (`.github/workflows/ci.yml`) uses two separate steps: (1) the required gate: `cargo clippy -p structured-zstd --features hash,std,dict_builder -- -D warnings`, and (2) a separate bench-internals step: `cargo clippy -p structured-zstd --features hash,std,dict_builder,bench_internals -- -D warnings`. The `bench_internals` feature must NOT replace the primary gate — it runs as an additional step only.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/benches/wildcopy_candidates.rs`:
- Around line 191-199: Extract the candidate construction logic into
select_candidate_copy_kernel by returning a BenchPath (instead of just kernel)
so the chunk size and name are chosen in one place; update
select_candidate_copy_kernel to return BenchPath (containing name, chunk,
kernel) and change bench_wildcopy_candidates to use that returned BenchPath
rather than computing chunk via string comparison on candidate_name (remove the
special-case for "candidate_avx2_unroll2" there); ensure you update all call
sites to use the new BenchPath return type and keep symbols like BenchPath,
select_candidate_copy_kernel, bench_wildcopy_candidates, candidate_avx2_unroll2,
candidate_name, and candidate_copy_kernel in mind when locating the changes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2ab1af44-9f8e-40ab-873d-f62c7450d8f8

📥 Commits

Reviewing files that changed from the base of the PR and between e9e073e and f507c7e.

📒 Files selected for processing (3)
  • zstd/benches/wildcopy_candidates.rs
  • zstd/src/decoding/mod.rs
  • zstd/src/lib.rs

Comment thread zstd/benches/wildcopy_candidates.rs Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves performance research and benchmarking ergonomics by (1) adding a focused Criterion bench to compare wildcopy kernels and (2) enhancing the GitHub Pages benchmark dashboard with a selectable snapshot window.

Changes:

  • Added a wildcopy_candidates Criterion benchmark (feature-gated by bench_internals) and a bench-only internal entrypoint to call the production overshooting copy routine.
  • Updated the benchmark dashboard UI to use a taller chart area and added From / To snapshot selectors with inclusive time-window filtering.
  • Documented Issue #87 research mapping and local sample measurements in BENCHMARKS.md.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

Show a summary per file
File Description
zstd/src/lib.rs Re-exports a bench-only wildcopy entrypoint via structured_zstd::testing behind bench_internals.
zstd/src/decoding/simd_copy.rs Adds copy_bytes_overshooting_for_bench wrapper (cfg-gated) to expose the production wildcopy for benchmarks.
zstd/src/decoding/mod.rs Makes simd_copy visible within the crate (pub(crate)) to support the bench-only re-export path.
zstd/Cargo.toml Registers the new wildcopy_candidates bench gated by bench_internals.
zstd/benches/wildcopy_candidates.rs New Criterion bench comparing baseline vs candidate copy kernels across sizes with correctness checks vs production output.
BENCHMARKS.md Adds Issue #87 research mapping + local sample results + follow-up linkage.
.github/bench-dashboard/index.html Improves chart layout and introduces snapshot window filtering (From/To) for relative dashboard points.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/benches/wildcopy_candidates.rs`:
- Around line 116-129: select_candidate_copy_kernel currently chooses the AVX2
64-byte chunk path solely by CPU feature, which forces small lengths to be
rounded up to 64 and prevents benchmarking the scalar fallback; update
select_candidate_copy_kernel (and the duplicate at the other location) to take
the requested length (or a max_chunk parameter) and choose the candidate based
on both CPU feature and requested length (e.g., if avx2 is available but
requested length < 64 select the scalar/smaller-chunk BenchPath or a smaller
unrolled kernel), referencing the existing symbols select_candidate_copy_kernel,
copy_candidate_unroll2_avx2, and copy_candidate_scalar so the function returns
the appropriate chunk and kernel for small lengths instead of always returning
chunk: 64 on AVX2.
- Around line 170-176: The bench currently hardcodes baseline_scalar for all
non-x86 targets in select_baseline_copy_kernel (returning a BenchPath with name
"baseline_scalar" and kernel copy_candidate_scalar), which misrepresents the
production baseline; either (A) change select_baseline_copy_kernel to mirror the
decoder's runtime strategy selection by calling the same production dispatcher
(the runtime CPU/feature probe used elsewhere) and return the matching BenchPath
so non-x86 targets use their actual backend (NEON/scalar/etc.), or (B) restrict
this benchmark to only the architectures the bench models by cfg-gating the
entire bench to the same target_arch set as the production dispatcher; update
the implementation of select_baseline_copy_kernel and its returned BenchPath
accordingly so results compare candidates against the real runtime baseline
rather than always "baseline_scalar".
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: cb87a0f9-19de-45c7-a93f-c7dfbf4747ab

📥 Commits

Reviewing files that changed from the base of the PR and between f507c7e and 9aa0c69.

📒 Files selected for processing (1)
  • zstd/benches/wildcopy_candidates.rs

Comment thread zstd/benches/wildcopy_candidates.rs Outdated
Comment thread zstd/benches/wildcopy_candidates.rs Outdated
@polaz polaz requested a review from Copilot April 12, 2026 13:29
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
zstd/benches/wildcopy_candidates.rs (2)

170-176: ⚠️ Potential issue | 🟠 Major

Non-x86 runs still benchmark scalar, not the production baseline.

Line 171 hardwires baseline_scalar for every non-x86 target. On ARM/NEON runners this bench would compare against a path the decoder does not actually use, so those results are not representative for issue #87. Either mirror the non-x86 dispatcher here or cfg-gate the benchmark to the architectures it actually models. Based on learnings, copy_bytes_overshooting strategy coverage adapts across AVX-512/AVX2/SSE2/NEON/scalar, so the production dispatcher is not scalar-only off x86.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@zstd/benches/wildcopy_candidates.rs` around lines 170 - 176, The benchmark
currently forces select_baseline_copy_kernel to return the scalar BenchPath
(name "baseline_scalar", kernel copy_candidate_scalar) for all non-x86 targets;
update it to mirror the production dispatcher used by copy_bytes_overshooting so
ARM/NEON runs pick the NEON baseline instead of scalar, or alternatively
cfg-gate the entire benchmark to only run on the architectures modeled
(x86/x86_64/arm/arm64) to match production; modify the
select_baseline_copy_kernel implementation (or add matching #[cfg] arms) so the
returned BenchPath name and kernel match the real dispatcher choices (e.g., NEON
when NEON is available) rather than always returning baseline_scalar.

116-129: ⚠️ Potential issue | 🟠 Major

Drive candidate selection from len, not only CPU caps.

Line 202 caches a single candidate_path before the loop, so on AVX2 hosts the new 17/33/63 cases are all rounded up to 64 by copy_with_overshoot_policy(). That means copy_candidate_unroll2_avx2() never exercises its < 64 scalar branch, and the small-block points are still reporting the wide AVX2 path rather than candidate small-block behavior. Select the candidate inside the for len in lengths loop and key it off len as well as runtime features.

Also applies to: 202-203

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@zstd/benches/wildcopy_candidates.rs` around lines 116 - 129, The candidate
path is chosen once by select_candidate_copy_kernel so AVX2 hosts always pick
the 64-byte path; change the logic to select the BenchPath inside the loop over
lengths (where `len` is available) and pick based on both runtime CPU features
(std::arch::is_x86_feature_detected!("avx2")) and the current `len` so small
lengths choose `copy_candidate_scalar` (or a smaller chunk) while larger lengths
use `copy_candidate_unroll2_avx2`; update any use of `candidate_path`/BenchPath
so selection occurs per-iteration before calling `copy_with_overshoot_policy()`
to ensure the scalar branch of `copy_candidate_unroll2_avx2` and small-block
candidates are exercised.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@zstd/benches/wildcopy_candidates.rs`:
- Around line 170-176: The benchmark currently forces
select_baseline_copy_kernel to return the scalar BenchPath (name
"baseline_scalar", kernel copy_candidate_scalar) for all non-x86 targets; update
it to mirror the production dispatcher used by copy_bytes_overshooting so
ARM/NEON runs pick the NEON baseline instead of scalar, or alternatively
cfg-gate the entire benchmark to only run on the architectures modeled
(x86/x86_64/arm/arm64) to match production; modify the
select_baseline_copy_kernel implementation (or add matching #[cfg] arms) so the
returned BenchPath name and kernel match the real dispatcher choices (e.g., NEON
when NEON is available) rather than always returning baseline_scalar.
- Around line 116-129: The candidate path is chosen once by
select_candidate_copy_kernel so AVX2 hosts always pick the 64-byte path; change
the logic to select the BenchPath inside the loop over lengths (where `len` is
available) and pick based on both runtime CPU features
(std::arch::is_x86_feature_detected!("avx2")) and the current `len` so small
lengths choose `copy_candidate_scalar` (or a smaller chunk) while larger lengths
use `copy_candidate_unroll2_avx2`; update any use of `candidate_path`/BenchPath
so selection occurs per-iteration before calling `copy_with_overshoot_policy()`
to ensure the scalar branch of `copy_candidate_unroll2_avx2` and small-block
candidates are exercised.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 21277d7e-7f8c-49b7-aaf6-a0c90abe2e7f

📥 Commits

Reviewing files that changed from the base of the PR and between f507c7e and 9aa0c69.

📒 Files selected for processing (1)
  • zstd/benches/wildcopy_candidates.rs

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/benches/wildcopy_candidates.rs`:
- Around line 123-136: When an AVX2 host is detected but len < 64, the scalar
fallback uses chunk = size_of::<usize>() which changes overshoot behavior; in
select_candidate_copy_kernel ensure the fallback candidate uses the same
production chunk as the AVX2 path so both paths pay the same overshoot (i.e.,
return BenchPath with chunk = 64 even when kernel = copy_candidate_scalar) or
alternatively skip returning a candidate when no drop-in candidate exists;
update the BenchPath construction in select_candidate_copy_kernel (and analogous
spots) so chunk matches the production copy size used by
copy_with_overshoot_policy.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 97b84742-3f90-4903-a723-eca1d3f077db

📥 Commits

Reviewing files that changed from the base of the PR and between 9aa0c69 and c6e741b.

📒 Files selected for processing (1)
  • zstd/benches/wildcopy_candidates.rs

Comment thread zstd/benches/wildcopy_candidates.rs Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves benchmark research and visualization tooling for structured-zstd, adding a focused wildcopy candidate benchmark (for issue #87 research) and enhancing the bench dashboard UX with a selectable snapshot range.

Changes:

  • Add wildcopy_candidates Criterion benchmark (feature-gated by bench_internals) to compare copy-kernel candidates vs the production wildcopy behavior.
  • Expose a bench-only entrypoint via structured_zstd::testing to call the production overshooting copy function from benches.
  • Improve the benchmark dashboard layout and add From/To snapshot selectors to filter the displayed time window.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
zstd/src/lib.rs Re-export bench-only internal entrypoint under testing (feature-gated).
zstd/src/decoding/simd_copy.rs Add bench-only wrapper for the production overshooting wildcopy helper.
zstd/src/decoding/mod.rs Make simd_copy accessible within the crate for bench re-exports.
zstd/Cargo.toml Register the new wildcopy_candidates benchmark (requires bench_internals).
zstd/benches/wildcopy_candidates.rs New benchmark suite comparing baseline/candidate copy kernels and validating output vs production.
BENCHMARKS.md Document issue #87 mapping, sample results, and follow-up linkage.
.github/bench-dashboard/index.html Increase chart height and add inclusive From/To snapshot range filtering.

Comment thread zstd/src/decoding/simd_copy.rs
Comment thread zstd/benches/wildcopy_candidates.rs
Comment thread zstd/benches/wildcopy_candidates.rs
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves benchmarking support for structured-zstd by (1) adding a Criterion research bench for evaluating alternative overshooting copy kernels (wildcopy candidates) and (2) improving the GitHub Pages benchmark dashboard usability with a selectable snapshot range and better chart sizing.

Changes:

  • Add a bench_internals-gated wildcopy_candidates Criterion benchmark to compare baseline vs candidate copy kernels.
  • Expose a bench_internals-gated bench-only entrypoint for the production overshooting copy routine via structured_zstd::testing.
  • Enhance the benchmark dashboard with From/To snapshot selectors and a taller, responsive chart area.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
zstd/src/lib.rs Re-export the bench-only wildcopy entrypoint from structured_zstd::testing (feature-gated).
zstd/src/decoding/simd_copy.rs Add copy_bytes_overshooting_for_bench behind bench_internals to call the production implementation.
zstd/src/decoding/mod.rs Make simd_copy pub(crate) so it can be re-exported for benches.
zstd/Cargo.toml Register the new wildcopy_candidates bench behind bench_internals.
zstd/benches/wildcopy_candidates.rs New x86/x86_64-only benchmark comparing baseline vs candidate copy kernels against production output.
BENCHMARKS.md Document issue #87 research mapping, sample numbers, and follow-up link.
.github/bench-dashboard/index.html Add From/To filters and update chart sizing / aspect ratio behavior.

Comment thread zstd/benches/wildcopy_candidates.rs
Comment thread zstd/benches/wildcopy_candidates.rs Outdated
polaz added 8 commits April 12, 2026 17:54
- keep bench internals exposed only via structured_zstd::testing

- select candidate kernel once and harden benchmark against copy elision

- document #87 decision mapping and link follow-up issue #108
- benchmark baseline and candidate with symmetric overshoot policy

- add non-aligned length coverage and tighten bench shim visibility
- baseline selector now mirrors avx512/avx2/sse2/scalar thresholds

- include sub-64 lengths and keep bench shim only under testing facade
@polaz polaz force-pushed the perf/#87-wildcopy-research branch from 17207ec to 2b871a7 Compare April 12, 2026 15:05
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (2)
zstd/benches/wildcopy_candidates.rs (1)

240-267: ⚠️ Potential issue | 🟠 Major

Validate the rounded wildcopy span, not only the requested prefix.

These assertions stop at [..len], so a candidate that preserves the first len bytes but diverges in the rounded-up overshoot region would still pass here. Since issue #87 explicitly calls out preserving the wildcopy overshoot contract, the check should cover the full copy_multiple span whenever the buffers are sized for it.

🧪 Minimal tightening
+        let check_len = len.next_multiple_of(baseline_path.chunk);
         assert_eq!(
-            &dst_baseline[..len],
-            &dst_production[..len],
+            &dst_baseline[..check_len],
+            &dst_production[..check_len],
             "baseline path must match production wildcopy for len={len}"
         );
         assert_eq!(
-            &dst_candidate[..len],
-            &dst_production[..len],
+            &dst_candidate[..check_len],
+            &dst_production[..check_len],
             "candidate path must match production wildcopy for len={len}"
         );
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@zstd/benches/wildcopy_candidates.rs` around lines 240 - 267, The assertions
only compare the first len bytes, missing differences in the rounded/overshoot
region; compute the full copied span (e.g. let rounded = copy_multiple(len) or
the function used to round wildcopy lengths) and assert equality over
&dst_baseline[..rounded], &dst_candidate[..rounded] against
&dst_production[..rounded] after the unsafe calls to
copy_bytes_overshooting_for_bench and copy_with_overshoot_policy so the test
validates the entire wildcopy/overshoot contract rather than just the requested
prefix (use the existing symbols copy_bytes_overshooting_for_bench,
copy_with_overshoot_policy, dst_production, dst_baseline, dst_candidate, len,
and copy_multiple).
zstd/src/decoding/mod.rs (1)

24-24: 🧹 Nitpick | 🔵 Trivial

Narrow the visibility back to the bench shim.

pub(crate) mod simd_copy; exposes the whole decoder-internal module across the crate again, but this PR still only needs one bench-only entrypoint. Re-exporting just the shim keeps the surface tighter.

♻️ Possible tightening
- pub(crate) mod simd_copy;
+ mod simd_copy;
+ #[cfg(feature = "bench_internals")]
+ pub(crate) use self::simd_copy::copy_bytes_overshooting_for_bench;

Based on learnings, internal decode/encode helpers remain pub(crate) by default, and explicit public shims are exposed only behind a dedicated feature flag.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@zstd/src/decoding/mod.rs` at line 24, Change the module from being
crate-visible to private and only re-export the bench entrypoint: replace
"pub(crate) mod simd_copy;" with "mod simd_copy;" and add a focused re-export
for the bench shim (e.g. "pub(crate) use simd_copy::bench_shim;") placed behind
the dedicated bench feature gate (cfg(feature = "bench")) so only the shim is
exposed rather than the entire simd_copy module.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@BENCHMARKS.md`:
- Around line 72-74: Update the decision label to clarify scope by changing the
line starting with "Decision: **GO** for production integration of AVX2 unroll2
candidate." to a microbench-specific, provisional recommendation—for example,
prefix with "Microbench recommendation / provisional GO:" or "Provisional GO
(microbench only):" and mention that this conclusion is based on local
`wildcopy_candidates` microbench data so it does not close issue `#87`; keep the
follow-up implementation issue reference [`#108`] as-is.

In `@zstd/benches/wildcopy_candidates.rs`:
- Around line 142-156: The benchmark incorrectly compares an AVX2 candidate
against an AVX-512 baseline: update select_candidate_copy_kernel to detect when
the chosen baseline is the AVX-512 path (the same detection used by
select_baseline_copy_kernel which returns baseline_avx512 for len >= 64 on
AVX-512 hosts) and either skip returning candidate_avx2_unroll2 or instead
return an AVX-512-equivalent candidate; specifically, when baseline_avx512 is
active do not select candidate_avx2_unroll2 (or add a candidate_avx512_unroll2
and return it), otherwise keep the existing AVX2 and scalar fallback choices
(refer to select_candidate_copy_kernel, select_baseline_copy_kernel,
baseline_avx512, and candidate_avx2_unroll2).

---

Duplicate comments:
In `@zstd/benches/wildcopy_candidates.rs`:
- Around line 240-267: The assertions only compare the first len bytes, missing
differences in the rounded/overshoot region; compute the full copied span (e.g.
let rounded = copy_multiple(len) or the function used to round wildcopy lengths)
and assert equality over &dst_baseline[..rounded], &dst_candidate[..rounded]
against &dst_production[..rounded] after the unsafe calls to
copy_bytes_overshooting_for_bench and copy_with_overshoot_policy so the test
validates the entire wildcopy/overshoot contract rather than just the requested
prefix (use the existing symbols copy_bytes_overshooting_for_bench,
copy_with_overshoot_policy, dst_production, dst_baseline, dst_candidate, len,
and copy_multiple).

In `@zstd/src/decoding/mod.rs`:
- Line 24: Change the module from being crate-visible to private and only
re-export the bench entrypoint: replace "pub(crate) mod simd_copy;" with "mod
simd_copy;" and add a focused re-export for the bench shim (e.g. "pub(crate) use
simd_copy::bench_shim;") placed behind the dedicated bench feature gate
(cfg(feature = "bench")) so only the shim is exposed rather than the entire
simd_copy module.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: d18239bf-9656-4ab7-9943-cf9e592f94b3

📥 Commits

Reviewing files that changed from the base of the PR and between 17207ec and 2b871a7.

📒 Files selected for processing (7)
  • .github/bench-dashboard/index.html
  • BENCHMARKS.md
  • zstd/Cargo.toml
  • zstd/benches/wildcopy_candidates.rs
  • zstd/src/decoding/mod.rs
  • zstd/src/decoding/simd_copy.rs
  • zstd/src/lib.rs

Comment thread BENCHMARKS.md Outdated
Comment thread zstd/benches/wildcopy_candidates.rs Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds bench-focused infrastructure for researching alternative “wildcopy” kernels (issue #87), plus usability improvements to the benchmark dashboard used for visualizing relative performance trends.

Changes:

  • Added a new wildcopy_candidates Criterion benchmark (feature-gated by bench_internals) to compare candidate copy kernels vs the production overshooting copy.
  • Exposed a bench-only entrypoint to call the production overshooting copy implementation for parity checking in benchmarks.
  • Improved the benchmark dashboard with a taller chart area and From/To snapshot range filtering.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

Show a summary per file
File Description
zstd/src/lib.rs Re-exports the new bench-only overshooting copy entrypoint under structured_zstd::testing (feature-gated).
zstd/src/decoding/simd_copy.rs Adds copy_bytes_overshooting_for_bench behind bench_internals to reach production wildcopy from benches.
zstd/src/decoding/mod.rs Makes simd_copy visible within the crate (pub(crate)) so the bench re-export can reference it.
zstd/Cargo.toml Registers the new wildcopy_candidates bench target and gates it with bench_internals.
zstd/benches/wildcopy_candidates.rs New benchmark comparing baseline vs candidate kernels with correctness parity checks against production wildcopy.
BENCHMARKS.md Documents the issue #87 research mapping, sample measurements, and follow-up link.
.github/bench-dashboard/index.html Adds snapshot From/To selectors and adjusts chart sizing to keep legend/toggles visible.

@polaz
Copy link
Copy Markdown
Member Author

polaz commented Apr 12, 2026

@coderabbitai Duplicate review-body findings were addressed in the latest commit as well: zstd/src/decoding/mod.rs now keeps simd_copy private with a bench-gated focused re-export, and wildcopy validation now checks the rounded copy span.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 12, 2026

Tip

For best results, initiate chat on the files or code changes.

@polaz: Thanks for the follow-up! Both fixes look right:

  1. simd_copy visibility — keeping the module private in decoding/mod.rs and routing the bench-only symbol through a narrower, feature-gated re-export is the cleaner approach; it avoids widening the internal module surface just for bench access.
  2. Rounded-span validation — checking the rounded copy span (i.e. the actual bytes written, including the overshoot) rather than only the first len bytes correctly captures the full invariant that copy_bytes_overshooting guarantees.

Both are good hardening steps.


🧠 Learnings used
Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-08T21:51:51.957Z
Learning: In `zstd/src/decoding/ringbuffer.rs` (structured-world/structured-zstd), the test suite for `copy_bytes_overshooting` includes a misaligned extra-capacity case that exercises the rounded-up wildcopy path (where `copy_multiple > copy_at_least`) and validates prefix correctness for that scenario. Do not flag the absence of a misaligned/overshooting extra-capacity sub-case as missing test coverage.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-08T20:26:15.924Z
Learning: In `zstd/src/decoding/ringbuffer.rs` (structured-world/structured-zstd), the test `copy_bytes_overshooting_covers_all_copy_strategies` derives copy lengths from a helper `active_chunk_size_for_tests()` (not hardcoded values) so that single-chunk, multi-chunk, and sub-chunk (fallback) cases adapt to whichever SIMD backend width is active at runtime (AVX-512/AVX2/SSE2/NEON/scalar). Do not flag hardcoded test lengths as missing coverage for other backends — the test is backend-adaptive by design.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-08T21:51:56.652Z
Learning: In `zstd/src/decoding/ringbuffer.rs` (structured-world/structured-zstd), the test `copy_bytes_overshooting_preserves_prefix_for_runtime_chunk_lengths` (or equivalent) includes a misaligned extra-capacity case that uses `add(1)`-offset pointers and a `chunk_plus_one` length to exercise the rounded-up wildcopy overshoot path (where `copy_multiple > copy_at_least`) and validates prefix correctness. Do not flag the absence of a misaligned-pointer case in this test — it is present by design.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-09T06:17:40.433Z
Learning: In `zstd/src/decoding/ringbuffer.rs` (structured-world/structured-zstd), both `copy_with_checks` and `copy_with_nobranch_check` thread aggregate source/destination capacities (`m1_src_cap`, `m2_src_cap`, `f1_dst_cap`, `f2_dst_cap`) into `simd_copy::copy_bytes_overshooting` so the SIMD eligibility check (`min(src_len, dst_len) >= copy_multiple`) is applied uniformly. Do not flag the capacity parameters in `copy_with_nobranch_check` as unnecessary or inconsistent with `copy_with_checks`.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-06T01:40:24.378Z
Learning: In `zstd/benches/compare_ffi.rs` (structured-world/structured-zstd), Rust FastCOVER trains with the post-finalization content budget in both the `REPORT_DICT_TRAIN` emission path (around lines 208-225) and the Criterion benchmark path (around lines 266-280). Both paths were aligned in commit 8622344. Do not flag these ranges as using inconsistent budget values.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 99
File: zstd/src/encoding/frame_compressor.rs:779-805
Timestamp: 2026-04-11T07:49:03.699Z
Learning: In `zstd/src/encoding/frame_compressor.rs` (structured-world/structured-zstd, PR `#99`), a temporary compatibility guard exists for sub-256-byte compressed payloads: the 1-byte-FCS single-segment form is not yet fully C-FFI compatible for payloads below 256 bytes, and a compat guard was added to handle this case while the path is hardened. Do not flag this guard as unnecessary.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-11T11:56:17.814Z
Learning: In `zstd/src/encoding/match_generator.rs` (structured-world/structured-zstd, PR `#102`), after commit 46927db, hash-mix kernel dispatch was moved from the global `HASH_MIX_KERNEL` AtomicU8 hot-path read into a per-matcher instance field `hash_mix_kernel` stored directly in `DfastMatchGenerator` and `RowMatchGenerator`. The hot path no longer performs an atomic load or global branch; kernel selection happens once at matcher initialisation and is subsequently a plain struct-field read. Do not flag the hot path as doing a per-call atomic load — that pattern was superseded by commit 46927db.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-09T15:16:10.624Z
Learning: In structured-world/structured-zstd, the `fuzz_exports` API concern was resolved via "Option B": internal decode/encode helpers remain `pub(crate)` by default, and explicit public shims are exposed only behind a dedicated feature flag (e.g., `fuzz`) in a separate file. Do not flag these internal symbols as needing a wider public API — the feature-gated shim layer is the intentional design.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-11T11:31:46.105Z
Learning: In `zstd/src/encoding/match_generator.rs` (structured-world/structured-zstd, PR `#102`), the test-only `with_forced_hash_mix_kernel(...)` helper forces the global `HASH_MIX_KERNEL` atomic to a specific kernel for coverage. Tests that call this helper must acquire a shared mutex lock *before* the `with_forced_hash_mix_kernel` call (not only around the assertion) to prevent cross-test kernel races when `cargo nextest` runs tests in parallel. This pattern was locked in commit 904d2b0. Do not flag the lock acquisition as unnecessarily early.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-11T11:31:54.292Z
Learning: In `zstd/src/encoding/match_generator.rs` (structured-world/structured-zstd, PR `#102`), the test `row_hash_and_row_extracts_high_bits` acquires `HASH_MIX_KERNEL_TEST_LOCK` before any `hash_mix_u64` call (fixed in commit 904d2b0). This lock guards all tests that force a specific `HashMixKernel` via `with_forced_hash_mix_kernel(...)` to prevent cross-test kernel races caused by the global `HASH_MIX_KERNEL` atomic. Do not flag the absence of this lock as missing — it is acquired at the top of each affected test.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 58
File: .github/workflows/ci.yml:31-32
Timestamp: 2026-04-03T11:24:17.021Z
Learning: In `structured-world/structured-zstd`, the CI Clippy lint job (`.github/workflows/ci.yml`) uses two separate steps: (1) the required gate: `cargo clippy -p structured-zstd --features hash,std,dict_builder -- -D warnings`, and (2) a separate bench-internals step: `cargo clippy -p structured-zstd --features hash,std,dict_builder,bench_internals -- -D warnings`. The `bench_internals` feature must NOT replace the primary gate — it runs as an additional step only.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds bench-only infrastructure and a focused Criterion microbench to evaluate alternative “wildcopy” kernels for issue #87, plus improves the benchmark dashboard UX by resizing the chart and adding snapshot range filters.

Changes:

  • Expose a bench-only entrypoint for the production copy_bytes_overshooting implementation (feature-gated by bench_internals).
  • Add a new Criterion benchmark target (wildcopy_candidates) to compare baseline vs candidate copy kernels and validate byte-exact parity.
  • Enhance the GH Pages benchmark dashboard with a taller chart and From/To snapshot selectors to filter the visible time window.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

Show a summary per file
File Description
zstd/src/lib.rs Re-export the new bench-only copy entrypoint from structured_zstd::testing behind bench_internals.
zstd/src/decoding/simd_copy.rs Add a bench-only wrapper entrypoint for the production overshooting copy routine.
zstd/src/decoding/mod.rs Re-export the bench-only entrypoint within the decoding module (crate-visible, feature-gated).
zstd/Cargo.toml Register the new wildcopy_candidates Criterion bench (requires bench_internals).
zstd/benches/wildcopy_candidates.rs New microbench comparing baseline vs candidate kernels with correctness assertions and ISA dispatch.
BENCHMARKS.md Document the issue #87 research mapping, sample results, and provisional recommendation.
.github/bench-dashboard/index.html UI/layout updates plus From/To range filtering for benchmark snapshots.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR enhances the project’s benchmarking and performance-research workflow by (1) adding a new Criterion microbench to evaluate decoder wildcopy copy-kernel candidates for issue #87 research, and (2) improving the benchmark dashboard UX with a selectable snapshot range window.

Changes:

  • Expose a bench-only shim (structured_zstd::testing::copy_bytes_overshooting_for_bench) for benchmarking the production wildcopy implementation under bench_internals.
  • Add a new Criterion benchmark target wildcopy_candidates (feature-gated) to compare candidate copy kernels vs the production overshooting wildcopy behavior.
  • Improve the gh-pages benchmark dashboard layout and add From/To snapshot selectors to filter the displayed time window.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
zstd/src/lib.rs Adds a bench_internals-gated public testing facade exposing a wildcopy bench shim.
zstd/src/decoding/simd_copy.rs Adds a bench_internals entrypoint delegating to the production overshooting copy implementation.
zstd/src/decoding/mod.rs Re-exports the bench-only copy entrypoint (crate-visible) from the decoding module.
zstd/Cargo.toml Registers the new wildcopy_candidates Criterion bench gated behind bench_internals.
zstd/benches/wildcopy_candidates.rs Introduces the x86/x86_64 microbench comparing production wildcopy vs baseline/candidate kernels.
BENCHMARKS.md Documents issue #87 research mapping, local microbench results, and a provisional recommendation.
.github/bench-dashboard/index.html Updates chart sizing and adds From/To snapshot selectors + filtering logic for the dashboard.

Comment thread zstd/src/decoding/simd_copy.rs
@polaz polaz merged commit fe27fc6 into main Apr 12, 2026
14 checks passed
@polaz polaz deleted the perf/#87-wildcopy-research branch April 12, 2026 18:11
@sw-release-bot sw-release-bot Bot mentioned this pull request Apr 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

research(decoding): evaluate advanced SIMD wildcopy paths beyond current baseline

2 participants