perf(encoding): rebase hc positions past u32 boundary #82

Merged
polaz merged 5 commits into main from perf/#51-hc-rebase-positions
Apr 8, 2026

@polaz polaz commented Apr 8, 2026

Summary

  • reworked HcMatchGenerator to store chain/hash positions relative to a moving position_base
  • added overflow-safe rebase + live-table rebuild to prevent the post-4GiB hash-chain cutoff
  • added regression coverage for boundary crossing and removed outdated 4GiB limitation docs

Validation

  • cargo check --workspace
  • cargo clippy --workspace --all-targets -- -D warnings
  • cargo nextest run --workspace
  • cargo test --doc --workspace

Closes #51

Summary by CodeRabbit

  • Bug Fixes

    • Match tables now remain usable across 32-bit position boundaries, ensuring valid match candidates for very large single-frame inputs (avoids degradation on >4 GiB data).
  • Documentation

    • Removed an outdated note about 32-bit position limits in hash-chain processing.

- store HC table indexes relative to a moving position base

- rebase and rebuild live chain/hash entries before u32 overflow

- add regression test for post-u32 boundary insertion path

- remove obsolete 4 GiB limitation notes for HC levels

Closes #51
codecov Bot commented Apr 8, 2026

Codecov Report

❌ Patch coverage is 97.33333% with 2 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
zstd/src/encoding/match_generator.rs | 97.33% | 2 Missing ⚠️


Copilot AI left a comment


Pull request overview

This PR removes the ~4 GiB single-frame degradation for the hash-chain (HC) matcher by rebasing stored hash/chain positions relative to a moving position_base, rebuilding the tables when the relative position space approaches u32 limits.

Changes:

  • Reworked HcMatchGenerator to store HC table positions as (relative_pos + 1) using a position_base, with an overflow-safe rebase + live-table rebuild.
  • Updated docs to remove the previous “>4 GiB limitation” notes for HC-based compression levels.
  • Added a regression test intended to cover boundary crossing behavior.
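The `(relative_pos + 1)` storage scheme described above can be sketched as follows. This is a minimal illustration, not the PR's code: the helper names `encode_slot` and `decode_slot` and the `EMPTY_SLOT` constant are assumptions, and the sketch assumes that a stored value of 0 marks an empty slot, which is the usual reason for the `+ 1` offset.

```rust
/// Assumed sentinel: a slot value of 0 means "no entry".
const EMPTY_SLOT: u32 = 0;

/// Encode an absolute position as `relative_pos + 1`.
/// Returns `None` when the position lies before the base or does not fit
/// in the u32 slot, which is the situation that would call for a rebase.
fn encode_slot(position_base: usize, abs_pos: usize) -> Option<u32> {
    let rel = abs_pos.checked_sub(position_base)?;
    let rel = u32::try_from(rel).ok()?;
    // A relative position of u32::MAX cannot be stored: the +1 would overflow.
    rel.checked_add(1)
}

/// Decode a stored slot back into an absolute position.
/// Returns `None` for the empty sentinel.
fn decode_slot(position_base: usize, slot: u32) -> Option<usize> {
    if slot == EMPTY_SLOT {
        return None;
    }
    Some(position_base + (slot - 1) as usize)
}
```

With this layout, clearing a table to all zeros is enough to mark every slot empty, and the largest storable relative position is `u32::MAX - 1`.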

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File | Description
zstd/src/encoding/mod.rs | Removes outdated public docs describing the old 4 GiB HC cutoff limitation.
zstd/src/encoding/match_generator.rs | Implements HC position rebasing and adds a regression test for u32-boundary crossing.

Comment thread zstd/src/encoding/match_generator.rs Outdated
Comment thread zstd/src/encoding/match_generator.rs Outdated
Comment thread zstd/src/encoding/match_generator.rs

coderabbitai Bot commented Apr 8, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9662d825-dac6-401c-a57f-882389ce937d

📥 Commits

Reviewing files that changed from the base of the PR and between a9a14b0 and cd89f8d.

📒 Files selected for processing (1)
  • zstd/src/encoding/match_generator.rs

Walkthrough

HcMatchGenerator now encodes stored positions as offsets relative to a new position_base, rebases stored tables when the u32 range would overflow, updates insert/traversal to use relative values, adds regression tests for rebasing at the u32 boundary, and removes ~4 GiB limitation docs.

Changes

Cohort / File(s) | Summary
Position Rebasing & HC logic (zstd/src/encoding/match_generator.rs) | Switch stored entries from absolute to relative_pos + 1; add position_base and relative_position(abs_pos) -> Option<u32>; introduce maybe_rebase_positions invoked from insert_position; clear/repopulate tables on rebase; update insert_position_no_rebase, hash/chain indexing, and chain_candidates to decode relative entries and reconstruct absolute candidates; add regression tests for rebasing behavior around u32 boundary.
Docs Cleanup (zstd/src/encoding/mod.rs) | Remove enum variant doc comments that described the prior 32-bit/~4 GiB single-frame limitation for CompressionLevel::{Better, Best, Level}.

Sequence Diagram(s)

sequenceDiagram
    participant Test as RegressionTest
    participant Hc as HcMatchGenerator
    participant HT as hash_table
    participant CT as chain_table

    Test->>Hc: insert_position(abs_pos)
    Hc->>Hc: relative = relative_position(abs_pos)
    alt relative fits in u32
        Hc->>HT: insert_position_no_rebase(relative)
        Hc->>CT: update_chain_no_rebase(relative)
    else needs rebase
        Hc->>Hc: maybe_rebase_positions(abs_pos)
        Hc->>HT: clear
        Hc->>CT: clear
        Hc->>Hc: repopulate by reinserting live history (no_rebase)
        Hc->>HT: insert_position_no_rebase(relative)
        Hc->>CT: update_chain_no_rebase(relative)
    end

    Test->>Hc: chain_candidates(query_pos)
    Hc->>CT: read candidate_rel
    Hc->>Hc: candidate_abs = position_base + candidate_rel
    Hc->>Test: return absolute candidates
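The flow in the diagram can be approximated with a toy model. The method names below follow the walkthrough, but the bodies are illustrative assumptions rather than the PR's implementation: the hash is just the byte value at each position, `max_rel` is a small stand-in for the real u32 limit so a rebase can be triggered cheaply, and each position is assumed to be inserted once, in increasing order.

```rust
/// Toy hash-chain index. Slots hold `relative_pos + 1`; 0 means empty.
struct ToyHc {
    data: Vec<u8>,            // live history; data[0] sits at history_abs_start
    history_abs_start: usize, // absolute position of the first live byte
    position_base: usize,     // stored positions are relative to this base
    max_rel: u32,             // largest storable relative position (kept small here)
    hash_table: Vec<u32>,     // newest slot per bucket
    chain_table: Vec<u32>,    // previous slot in the same bucket, indexed by relative position
    rebases: usize,           // counts rebases, for demonstration only
}

impl ToyHc {
    fn new(data: &[u8], history_abs_start: usize, max_rel: u32) -> Self {
        Self {
            data: data.to_vec(),
            history_abs_start,
            position_base: 0,
            max_rel,
            hash_table: vec![0; 256],
            chain_table: vec![0; max_rel as usize + 1],
            rebases: 0,
        }
    }

    fn relative_position(&self, abs_pos: usize) -> Option<u32> {
        let rel = u32::try_from(abs_pos.checked_sub(self.position_base)?).ok()?;
        (rel <= self.max_rel).then_some(rel)
    }

    /// Toy hash: the byte value at `abs_pos`.
    fn bucket(&self, abs_pos: usize) -> usize {
        self.data[abs_pos - self.history_abs_start] as usize
    }

    fn insert_position_no_rebase(&mut self, abs_pos: usize) {
        let Some(rel) = self.relative_position(abs_pos) else { return };
        let bucket = self.bucket(abs_pos);
        self.chain_table[rel as usize] = self.hash_table[bucket];
        self.hash_table[bucket] = rel + 1;
    }

    /// If `abs_pos` no longer fits, move the base to the start of live history,
    /// clear both tables, and reinsert only the already-seen prefix.
    fn maybe_rebase_positions(&mut self, abs_pos: usize) {
        if self.relative_position(abs_pos).is_some() {
            return;
        }
        self.position_base = self.history_abs_start;
        self.hash_table.fill(0);
        self.chain_table.fill(0);
        self.rebases += 1;
        for pos in self.history_abs_start..abs_pos {
            self.insert_position_no_rebase(pos);
        }
    }

    fn insert_position(&mut self, abs_pos: usize) {
        self.maybe_rebase_positions(abs_pos);
        self.insert_position_no_rebase(abs_pos);
    }

    /// Walk the chain for the bucket of `abs_pos`, newest first, and return
    /// earlier positions as absolute values.
    fn chain_candidates(&self, abs_pos: usize) -> Vec<usize> {
        let mut out = Vec::new();
        let mut slot = self.hash_table[self.bucket(abs_pos)];
        while slot != 0 {
            let rel = slot - 1;
            let candidate = self.position_base + rel as usize;
            if candidate < abs_pos {
                out.push(candidate);
            }
            slot = self.chain_table[rel as usize];
        }
        out
    }
}
```

For example, with `ToyHc::new(b"abcabcab", 10, 12)`, inserting positions 10 through 12 needs no rebase, while inserting 13 exceeds `max_rel`, moves `position_base` to 10, and rebuilds the tables; `chain_candidates(13)` then still finds the earlier `a` at absolute position 10.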

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Poem

🐰 I hop along the position base, nibbling offsets true,
When u32 fills the meadow, I shift the view.
I clear old burrows, replant each chain,
Matches spring forward across the stream again. 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The PR title 'perf(encoding): rebase hc positions past u32 boundary' accurately describes the main change: implementing position rebasing to overcome the u32 boundary limitation in hash-chain matching.
Linked Issues check | ✅ Passed | The PR fully implements the suggested approach from #51: adds position rebasing with position_base tracking, implements rebase logic when u32 range nears exhaustion, rebuilds hash/chain tables on rebase, and removes outdated 4 GiB limitation documentation.
Out of Scope Changes check | ✅ Passed | All changes are directly scoped to addressing issue #51: match_generator.rs implements the rebasing mechanism, and mod.rs removes now-obsolete documentation about the 4 GiB limitation. No unrelated changes detected.


polaz added 2 commits April 8, 2026 12:24
- make hc_rebases_positions_after_u32_boundary platform-safe

- keep 64-bit >u32 scenario, use boundary path on 32-bit

- prevent const arithmetic overflow on i686 CI
- avoid const overflow in i686 by using fallible conversion

- clarify relative-position and chain-candidate storage comments
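The "fallible conversion" mentioned in these commits can be illustrated with a short sketch. The helper name `position_past_u32_boundary` is hypothetical and not taken from the PR. The underlying issue is that a constant such as `u32::MAX as usize + 1` overflows during const evaluation on 32-bit targets like i686, whereas doing the arithmetic in `u64` and converting with `usize::try_from` lets the test pick a different path at runtime.

```rust
/// Hypothetical test helper: an absolute position `extra` bytes past the
/// u32 boundary, or `None` when `usize` cannot represent it (32-bit targets).
fn position_past_u32_boundary(extra: u64) -> Option<usize> {
    // Do the arithmetic in u64 so nothing overflows at compile time,
    // even when usize is only 32 bits wide.
    let target = u64::from(u32::MAX).checked_add(1)?.checked_add(extra)?;
    usize::try_from(target).ok()
}
```

On a 64-bit target `position_past_u32_boundary(5)` yields a position above 4 GiB; on a 32-bit target it yields `None`, and a test could then fall back to exercising the boundary path directly.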

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 2698-2719: Extend the test hc_rebases_positions_after_u32_boundary
to also verify that chain_candidates returns valid matches after the rebase:
after calling matcher.skip_matching() (which triggers the rebase), invoke
matcher.chain_candidates(current_pos) for one or more positions within the added
data and assert that the returned candidate list contains at least one valid
match (e.g., non-empty and within expected bounds), using HcMatchGenerator
methods add_data, ensure_tables, skip_matching, and chain_candidates to locate
and validate matches; this confirms that rebasing not only updates
position_base/hash_table but preserves matchability.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: f4965f88-60da-4939-b338-7e237c831b85

📥 Commits

Reviewing files that changed from the base of the PR and between 1f379be and 3eb4cce.

📒 Files selected for processing (2)
  • zstd/src/encoding/match_generator.rs
  • zstd/src/encoding/mod.rs
💤 Files with no reviewable changes (1)
  • zstd/src/encoding/mod.rs

Comment thread zstd/src/encoding/match_generator.rs
- extend u32-boundary regression to assert chain_candidates returns valid entries

- addresses CodeRabbit nitpick on post-rebase matchability

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 1749-1766: The rebase currently repopulates HC state up to
history_abs_end(), which accidentally inserts not-yet-processed positions and
can change lazy-match behavior; update HcMatchGenerator::maybe_rebase_positions
so after clearing position_base, hash_table, and chain_table it only re-inserts
the already-seen prefix range history_abs_start..abs_pos (use abs_pos as the
upper bound) by calling insert_position_no_rebase for pos in that range, leaving
abs_pos and later positions to be inserted by the caller in normal order; keep
the existing checks (relative_position, position_base assignment, table fills)
intact.
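The difference between the two upper bounds can be shown on a toy table. This sketch is not the matcher's code: `rebuild_heads` is a hypothetical helper that records only the newest position per byte value, and history is assumed to start at position 0.

```rust
use std::ops::Range;

/// Rebuild a toy "newest position per byte value" table from `range`.
fn rebuild_heads(data: &[u8], range: Range<usize>) -> [Option<usize>; 256] {
    let mut heads = [None; 256];
    for pos in range {
        // Later positions overwrite earlier ones, as a hash head would.
        heads[data[pos] as usize] = Some(pos);
    }
    heads
}
```

With `data = b"abab"` and a cursor at `abs_pos = 2`, rebuilding over `0..4` (the whole history) leaves the head for `b` at position 3, which the matcher has not processed yet. Rebuilding over `0..2` leaves every head strictly before the cursor, so positions from `abs_pos` onward are inserted later by the caller in normal order.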

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: d06f78a1-aef2-48b2-98be-767462d9e62f

📥 Commits

Reviewing files that changed from the base of the PR and between c1fe8c7 and a9a14b0.

📒 Files selected for processing (1)
  • zstd/src/encoding/match_generator.rs

Comment thread zstd/src/encoding/match_generator.rs

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.


@sw-release-bot sw-release-bot Bot left a comment


⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'structured-zstd vs C FFI (x86_64-gnu)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.30.

Benchmark suite | Current: cd89f8d | Previous: 1f379be | Ratio
compress/default/decodecorpus-z000033/matrix/pure_rust | 139.172 ms | 104.909 ms | 1.33
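The reported ratio can be checked by hand: 139.172 / 104.909 is roughly 1.3266, which rounds to 1.33 and sits just above the 1.30 alert threshold. A small sketch of that arithmetic, with hypothetical function names and assuming the alert compares current time divided by previous time against the threshold:

```rust
/// Ratio of current to previous benchmark time (larger means slower).
fn regression_ratio(current_ms: f64, previous_ms: f64) -> f64 {
    current_ms / previous_ms
}

/// True when the slowdown ratio is strictly above the alert threshold.
fn exceeds_threshold(current_ms: f64, previous_ms: f64, threshold: f64) -> bool {
    regression_ratio(current_ms, previous_ms) > threshold
}
```

Because the margin over the threshold is under three percentage points, a single run like this one may reflect benchmark noise as much as a real slowdown; the thread itself does not settle which.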

This comment was automatically generated by workflow using github-action-benchmark.

CC: @polaz

@polaz polaz merged commit c9ab3d3 into main Apr 8, 2026
18 checks passed
@polaz polaz deleted the perf/#51-hc-rebase-positions branch April 8, 2026 14:35
@sw-release-bot sw-release-bot Bot mentioned this pull request Apr 8, 2026

Development

Successfully merging this pull request may close these issues.

perf(encoding): rebase HC table positions to remove 4 GiB Better level cutoff

2 participants