perf(encoding): complete ARM histogram path for #71 (#104)
Conversation
No actionable comments were generated in the recent review. 🎉
📝 Walkthrough

Adds a private histogram module with scalar and striped-parallel byte counters and a dispatcher.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Huff as HuffmanEncoder
    participant FSE as FSE encoder
    participant Hist as histogram::count_bytes
    participant Counts as Counts[256]
    Huff->>Hist: count_bytes(weights, Counts)
    Hist-->>Huff: (max_symbol, largest_count)
    Huff->>FSE: build_table_from_bytes(weights, max_log, avoid_0_numbit)
    FSE->>Hist: count_bytes(weights, Counts)
    Hist-->>FSE: (max_symbol, largest_count)
    FSE->>FSE: build_table_from_counts(&counts[..=max_symbol], max_log, avoid_0_numbit)
    FSE-->>Huff: FSETable
```
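The flow above implies a small contract for `count_bytes`: fill a 256-entry count table and return `(max_symbol, largest_count)` so callers can slice `counts[..=max_symbol]` when building an FSE table. A minimal scalar sketch of that contract (illustrative, not the crate's actual implementation):

```rust
// Scalar sketch of the count_bytes contract implied by the diagram:
// fills the 256-entry table and returns (max_symbol, largest_count).
fn count_bytes(data: &[u8], counts: &mut [usize; 256]) -> (usize, usize) {
    for &byte in data {
        counts[byte as usize] += 1;
    }
    let max_symbol = data.iter().copied().max().unwrap_or(0) as usize;
    let largest_count = counts.iter().copied().max().unwrap_or(0);
    (max_symbol, largest_count)
}
```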
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 5 passed
Codecov Report: ❌ Patch coverage is …
Pull request overview
This PR introduces a shared byte-frequency histogram implementation and wires it into entropy-table construction paths (Huffman and FSE) to enable ARM/AArch64-optimized counting with scalar fallback, as part of the broader ARM optimization work for #71.
Changes:
- Add a new `histogram` module providing `count_bytes()` with scalar + striped ("donor-style") counting and an AArch64 SVE2-gated variant.
- Use the shared histogram counter when building Huffman symbol counts and when building FSE tables from byte slices.
- Replace iterator-based `build_table_from_data(...)` usage in non-test code with a new slice-based `build_table_from_bytes(...)` helper.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `zstd/src/lib.rs` | Registers the new `histogram` module in the crate. |
| `zstd/src/histogram.rs` | Implements shared histogram counting + tests; adds SVE2-gated variant. |
| `zstd/src/huff0/huff0_encoder.rs` | Switches Huffman counting to `histogram::count_bytes` and updates the FSE weight-table build to the slice-based API. |
| `zstd/src/fse/fse_encoder.rs` | Adds `build_table_from_bytes()` using the shared histogram and gates the iterator-based builder to tests/fuzz. |
| `zstd/src/dictionary/mod.rs` | Updates dictionary FSE table serialization to use `build_table_from_bytes()`. |
Actionable comments posted: 2
Inline comments:
In `@zstd/src/fse/fse_encoder.rs`:
- Around lines 316-320: `build_table_from_bytes` currently calls `histogram::count_bytes` on empty input, which yields (0, 0) and later causes a deep panic. Add an explicit precondition check at the start of `build_table_from_bytes` that rejects empty slices (e.g. `assert!` or a panic with a clear message) before calling `histogram::count_bytes`, so callers get an immediate, descriptive failure. Keep the rest of the logic (the call to `histogram::count_bytes` and the subsequent `build_table_from_counts(&counts[..=max_symbol], max_log, avoid_0_numbit)`) unchanged.
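The suggested precondition can be sketched as follows. Hedged: `count_bytes` here is a scalar stand-in for `histogram::count_bytes`, and the `_checked` name is illustrative, not the crate's:

```rust
// Scalar stand-in for histogram::count_bytes: fills the table and returns
// (max_symbol, largest_count).
fn count_bytes(data: &[u8], counts: &mut [usize; 256]) -> (usize, usize) {
    for &b in data {
        counts[b as usize] += 1;
    }
    let max_symbol = data.iter().copied().max().unwrap_or(0) as usize;
    let largest_count = counts.iter().copied().max().unwrap_or(0);
    (max_symbol, largest_count)
}

fn build_table_from_bytes_checked(data: &[u8]) -> (usize, usize) {
    // Fail fast with a descriptive message instead of letting empty input
    // produce (0, 0) and panic deep inside table construction.
    assert!(
        !data.is_empty(),
        "build_table_from_bytes: input slice must not be empty"
    );
    let mut counts = [0usize; 256];
    count_bytes(data, &mut counts)
    // The real function would continue with
    // build_table_from_counts(&counts[..=max_symbol], max_log, avoid_0_numbit).
}
```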
In `@zstd/src/histogram.rs`:
- Around lines 137-148: The test `count_bytes_handles_small_input_with_tail` never exercises the parallel path. Update the test so it triggers `count_bytes()`'s parallel branch by using a data length greater than `PARALLEL_COUNT_THRESHOLD` that is not a multiple of 16, or alternatively call `count_bytes_parallel(&data, &mut fast)` directly from the test; either way, still compare the results against `count_bytes_scalar(&data, &mut scalar)` and assert that both the histogram arrays and the returned metadata match.
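The comparison the comment asks for can be sketched self-contained, using simplified stand-ins for the module's counters. The threshold value and the 4-byte stripe width here are assumptions, not taken from the actual `zstd/src/histogram.rs`:

```rust
// Assumed threshold; the real PARALLEL_COUNT_THRESHOLD lives in histogram.rs.
const PARALLEL_COUNT_THRESHOLD: usize = 64;

// Simple scalar counter, standing in for count_bytes_scalar.
fn count_bytes_scalar(data: &[u8], counts: &mut [u32; 256]) {
    for &b in data {
        counts[b as usize] += 1;
    }
}

// Striped counter, standing in for count_bytes_parallel: four independent
// bucket arrays are updated round-robin, then merged per symbol.
fn count_bytes_parallel(data: &[u8], counts: &mut [u32; 256]) {
    let mut lanes = [[0u32; 256]; 4];
    let mut chunks = data.chunks_exact(4);
    for chunk in &mut chunks {
        for (lane, &b) in lanes.iter_mut().zip(chunk) {
            lane[b as usize] += 1;
        }
    }
    // Scalar tail for the bytes that don't fill a full stripe.
    for &b in chunks.remainder() {
        lanes[0][b as usize] += 1;
    }
    for symbol in 0..256 {
        counts[symbol] = lanes.iter().map(|l| l[symbol]).sum();
    }
}

// Dispatcher mirroring count_bytes(): parallel only above the threshold.
fn count_bytes_dispatch(data: &[u8], counts: &mut [u32; 256]) {
    if data.len() > PARALLEL_COUNT_THRESHOLD {
        count_bytes_parallel(data, counts);
    } else {
        count_bytes_scalar(data, counts);
    }
}

// The property the review wants tested: scalar and parallel histograms agree
// on inputs long enough to actually take the parallel branch.
fn histograms_match(data: &[u8]) -> bool {
    let mut scalar = [0u32; 256];
    let mut fast = [0u32; 256];
    count_bytes_scalar(data, &mut scalar);
    count_bytes_parallel(data, &mut fast);
    scalar == fast
}
```

A length like 1003 satisfies both conditions: above the threshold and not a multiple of 16, so the stripe loop and the tail are both exercised.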
ℹ️ Review info
⚙️ Run configuration: Path: `.coderabbit.yaml` | Review profile: ASSERTIVE | Plan: Pro | Run ID: 7f80c42a-b0f5-47a6-8608-0dfdb190e956
📒 Files selected for processing (5): `zstd/src/dictionary/mod.rs`, `zstd/src/fse/fse_encoder.rs`, `zstd/src/histogram.rs`, `zstd/src/huff0/huff0_encoder.rs`, `zstd/src/lib.rs`
Actionable comments posted: 1
Inline comments:
In `@zstd/src/histogram.rs`:
- Around lines 29-74: `count_bytes_parallel` can overflow when merging the four `u32` bucket arrays, because the code sums them as `u32` and then casts. Change the merge to widen each term first (e.g. `let value = counting1[symbol] as usize + counting2[symbol] as usize + counting3[symbol] as usize + counting4[symbol] as usize; counts[symbol] = value;`). Also add a large-input fallback: if `data.len() > u32::MAX as usize`, or any `countingN[symbol]` reaches `u32::MAX` (for `counting1`..`counting4`), switch to a safe accumulation path that uses a single `usize` (or `u64`) accumulator per symbol and recomputes the counts to avoid wraparound.
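A minimal sketch of the widened merge this comment describes; the lane names follow the review, while the surrounding loop structure is assumed:

```rust
// Merge four u32 bucket arrays into usize counts, widening each term before
// the addition so the sum cannot wrap in u32 arithmetic.
fn merge_lanes_widened(
    counting1: &[u32; 256],
    counting2: &[u32; 256],
    counting3: &[u32; 256],
    counting4: &[u32; 256],
    counts: &mut [usize; 256],
) {
    for symbol in 0..256 {
        let value = counting1[symbol] as usize
            + counting2[symbol] as usize
            + counting3[symbol] as usize
            + counting4[symbol] as usize;
        counts[symbol] = value;
    }
}
```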
ℹ️ Review info — Run ID: e266b924-9ee6-4cbc-8fbe-838873e6abb5 | 📒 Files selected for processing (2): `zstd/src/fse/fse_encoder.rs`, `zstd/src/histogram.rs`
Actionable comments posted: 2
Inline comments:
In `@zstd/src/histogram.rs`:
- Around lines 205-209: The test `merge_lane_counts_widens_before_sum` assumes `usize` is 64-bit and overflows on 32-bit targets. Either gate the test with `#[cfg(target_pointer_width = "64")]` placed before `#[test]`, or change the test to use width-aware values (e.g. cast to `u64` before summing, or assert via `u64` arithmetic) so it doesn't rely on `usize` widening.
- Around lines 149-153: The test fails on 32-bit targets because the per-lane accumulators use `usize` and can overflow. Update `merge_lane_counts` to widen its accumulator to `u64` (use `u64` temporaries when summing the per-lane `u32` counts before aggregating into the final counts), or alternatively gate the test with `#[cfg(target_pointer_width = "64")]`. Also ensure the SVE2 specialization is preserved by guaranteeing that the parallel implementation inlines into the `#[target_feature(enable = "sve2")]` function: either mark `count_bytes_parallel` with `#[inline(always)]`, or move the loop body of `count_bytes_parallel` directly into `unsafe fn count_bytes_sve2` so the SVE2-compiled code contains the specialized loop. Target symbols: `merge_lane_counts`, `count_bytes_sve2`, `count_bytes_parallel`.
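A sketch combining both suggestions, with names taken from the review but a simplified signature (the real functions in `zstd/src/histogram.rs` differ): the merge widens through `u64` so it is correct even where `usize` is 32 bits, and `#[inline(always)]` is what lets a `#[target_feature(enable = "sve2")]` wrapper inline the loop and compile it with the SVE2 feature enabled.

```rust
// Merge four per-lane u32 bucket arrays into u64 totals. Widening each term
// before summing means four lanes near u32::MAX cannot wrap, regardless of
// the target's pointer width.
//
// #[inline(always)] matters for the SVE2 point in the review: a gated caller
// such as
//     #[target_feature(enable = "sve2")]
//     unsafe fn count_bytes_sve2(...) { ... merge_lane_counts(...) ... }
// only gets a specialized body if this function is inlined into it.
#[inline(always)]
fn merge_lane_counts(lanes: &[[u32; 256]; 4]) -> [u64; 256] {
    let mut counts = [0u64; 256];
    for symbol in 0..256 {
        counts[symbol] = lanes.iter().map(|l| u64::from(l[symbol])).sum();
    }
    counts
}
```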
ℹ️ Review info — Run ID: 70bac832-713b-493a-8c8f-2872ec5b7acf | 📒 Files selected for processing (1): `zstd/src/histogram.rs`
Actionable comments posted: 1
Inline comments:
In `@zstd/src/histogram.rs`:
- Around lines 43-47: In `count_bytes_parallel`, replace the loop condition based on `data.len().saturating_sub(16)` (i.e. `index <= data.len().saturating_sub(16)`) with an explicit bounds check (`index + 16 <= data.len()`), so the unsafe `read_unaligned` at `ptr` is directly guarded by the loop condition. This makes the 16-byte read self-contained instead of relying on an external length invariant, and keeps the unsafe block sound if the surrounding logic changes.
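The guard pattern the comment asks for can be sketched in isolation. This toy function merely sums 16-byte chunks rather than building a histogram; the bounds-check shape is the point, not the loop body:

```rust
// Process data in 16-byte unaligned reads. The loop condition
// `index + 16 <= data.len()` directly guards the read, so the unsafe block
// is sound on its own, including for inputs shorter than 16 bytes.
fn sum_16_byte_chunks(data: &[u8]) -> (u128, usize) {
    let mut acc: u128 = 0;
    let mut index = 0;
    while index + 16 <= data.len() {
        // SAFETY: the loop condition guarantees data[index..index + 16]
        // is in bounds for this 16-byte read.
        let chunk = unsafe {
            (data.as_ptr().add(index) as *const u128).read_unaligned()
        };
        acc = acc.wrapping_add(chunk);
        index += 16;
    }
    (acc, index) // `index` marks where the scalar tail begins
}
```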
ℹ️ Review info — Run ID: 429b5a62-1291-4a8a-bab0-39dcf10c288d | 📒 Files selected for processing (2): `zstd/src/fse/fse_encoder.rs`, `zstd/src/histogram.rs`
Summary
This PR finalizes the remaining #71 work by adding the shared histogram-count path used by the Huffman/FSE/dictionary entropy builders.

What this PR changes
- Shared histogram module (`zstd/src/histogram.rs`) with scalar fallback
- AArch64 SVE2-gated variant (`#[target_feature(enable = "sve2")]`)

#71 Objective Coverage
- Prefetch (`prfm`) path: completed in PR "perf(decoding): branchless offset history, prefetch pipeline, and BMI2 triple extract" (#90)

Validation
Closes #71