
feat(encoding): add Better compression level (zstd level 7, lazy2 strategy)#48

Merged
polaz merged 8 commits into main from feat/#6-feat-implement-better-compression-level-zstd-level on Apr 2, 2026

Conversation

@polaz (Member) commented Apr 1, 2026

Summary

  • Implement CompressionLevel::Better — roughly equivalent to zstd level 7
  • Uses a hash chain matcher (hash table + chain table) with lazy2 strategy and gain-based match cost model
  • Explores up to 16 match candidates per position (vs 4 in Default's Dfast buckets)
  • 8 MB window, 6 MB hash/chain tables, ~22 MB peak memory

Technical Details

Hash chain architecture (HcMatchGenerator):

  • Hash table (1M u32 entries, 4 MB) maps hash → most recent position
  • Chain table (512K u32 entries, 2 MB) links positions with the same hash
  • Chain traversal skips stale entries and explores up to 16 valid candidates per position (matching C zstd level 7's searchLog=4)
  • Gain-based lazy2 matching: match_len × 4 − log2(offset) cost model with +4 literal bonus
  • Boundary backfill: last 3 positions of each slice are seeded when the next slice arrives
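The architecture above can be sketched in a few dozen lines. This is an illustrative model, not the crate's actual code: the constant names, the `HashChain` struct, and the omission of stale-entry skipping and boundary backfill are all simplifications, though the table sizes, search depth, and gain formula follow the numbers stated above.

```rust
// Illustrative hash-chain matcher skeleton (names and layout are assumptions,
// only the sizes and the gain formula come from the PR description).
const HASH_LOG: usize = 20; // 1 << 20 = 1,048,576 entries of u32 = 4 MB
const CHAIN_LOG: usize = 19; // 1 << 19 = 524,288 entries of u32 = 2 MB
const SEARCH_DEPTH: usize = 16; // 2^searchLog with searchLog = 4
const EMPTY: u32 = u32::MAX;

struct HashChain {
    hash_table: Vec<u32>,  // hash -> most recent position with that hash
    chain_table: Vec<u32>, // position -> previous position with the same hash
}

impl HashChain {
    fn new() -> Self {
        Self {
            hash_table: vec![EMPTY; 1 << HASH_LOG],
            chain_table: vec![EMPTY; 1 << CHAIN_LOG],
        }
    }

    /// Insert `pos` for `hash`, linking it to the previous chain head.
    fn insert(&mut self, hash: usize, pos: u32) {
        let chain_idx = (pos as usize) & ((1 << CHAIN_LOG) - 1);
        self.chain_table[chain_idx] = self.hash_table[hash];
        self.hash_table[hash] = pos;
    }

    /// Walk the chain, collecting up to SEARCH_DEPTH candidate positions
    /// (most recent first). Stale-window skipping is omitted in this sketch.
    fn candidates(&self, hash: usize) -> Vec<u32> {
        let mut out = Vec::with_capacity(SEARCH_DEPTH);
        let mut cur = self.hash_table[hash];
        while cur != EMPTY && out.len() < SEARCH_DEPTH {
            out.push(cur);
            cur = self.chain_table[(cur as usize) & ((1 << CHAIN_LOG) - 1)];
        }
        out
    }
}

/// Gain model from the summary: match_len * 4 - floor(log2(offset)).
/// During lazy2 evaluation, a deferred match must beat the current match's
/// gain plus a +4 bonus for the literal emitted by deferring.
fn match_gain(match_len: usize, offset: usize) -> i64 {
    assert!(offset != 0, "zstd offsets are 1-indexed");
    let offset_bits = 63 - i64::from((offset as u64).leading_zeros());
    match_len as i64 * 4 - offset_bits
}
```

In the real matcher the chain table is smaller than the position space, so chain indices wrap and entries must be validated against the live window during traversal; that bookkeeping is left out here.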

Compression ratio (corpus z000033, 1 MB):

  • Default (Dfast): 578,142 bytes
  • Better (hash chain): 541,592 bytes (6.3% smaller than Default)
  • C zstd level 7: 510,030 bytes (our Better output is within 6.2% of this)
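The two percentages follow directly from the byte counts above; recomputing them:

```rust
// Raw byte counts from the corpus z000033 (1 MB) measurement above.
const DEFAULT_BYTES: f64 = 578_142.0;
const BETTER_BYTES: f64 = 541_592.0;
const C_ZSTD7_BYTES: f64 = 510_030.0;

/// Percentage by which `smaller` undercuts `bigger`.
fn pct_smaller(bigger: f64, smaller: f64) -> f64 {
    (bigger - smaller) / bigger * 100.0
}

/// Percentage by which `ours` exceeds `reference`.
fn pct_larger(ours: f64, reference: f64) -> f64 {
    (ours - reference) / reference * 100.0
}
```

`pct_smaller(DEFAULT_BYTES, BETTER_BYTES)` evaluates to about 6.3, and `pct_larger(BETTER_BYTES, C_ZSTD7_BYTES)` to about 6.2, matching the figures quoted.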

Known Limitations

  • ~6% ratio gap vs C zstd level 7 remains; C uses deeper chain traversal and row-based SIMD matching
  • Small/highly-repetitive data shows minimal improvement over Default (already near-optimal)

Test Plan

  • 10 roundtrip integrity tests (compressible, random, multi-block, streaming, edge cases, repeat offsets, large literals, 8 MiB window boundary)
  • Compression ratio tests: Better < Default on corpus, Better within 1% of Default on repeat-offset fixture
  • Cross-validation: Rust Better compress → C FFI decompress
  • Streaming encoder roundtrip + buffer reuse test
  • All 182 workspace tests pass

Closes #6

Summary by CodeRabbit

  • New Features

    • "Better" compression (CLI level 3) is now supported for one-shot and streaming use, routing through the shared block-encoded pipeline and accepted by the CLI.
  • Documentation

    • README and benchmarks updated to mark Better as implemented, mapped to approximately zstd level 7; Best remains pending. Added tradeoff and limitation guidance.
  • Tests

    • New and expanded tests verify roundtrip integrity, streaming roundtrips, cross-language decoding, and size/regression expectations for Better.

Copilot AI review requested due to automatic review settings April 1, 2026 06:17

coderabbitai Bot commented Apr 1, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 72359af4-7a85-4331-90fc-54640c221c87

📥 Commits

Reviewing files that changed from the base of the PR and between 59cb45e and af9dadb.

📒 Files selected for processing (2)
  • zstd/src/encoding/match_generator.rs
  • zstd/src/encoding/mod.rs

📝 Walkthrough


Implements and wires CompressionLevel::Better (zstd-level 7, hash‑chain matcher with lazy2) across CLI, streaming encoder, frame compressor, match generator, benchmarks, docs, and tests; adds unit/integration/cross‑validation tests and benchmark config.

Changes

Cohort / File(s) Summary
Docs
BENCHMARKS.md, README.md
Advertise Better (≈level 7) as implemented; remove from excluded list and mark status/checklist items accordingly.
CLI
cli/src/main.rs
Accept --level up to 3, map 3 → CompressionLevel::Better, update usage text and tests to reject 4.
Benchmarks
zstd/benches/support/mod.rs
Expand supported_levels() from 2→3 entries and add "better" with rust_level: Better, ffi_level: 7.
Encoding — match generation
zstd/src/encoding/match_generator.rs
Add MatcherBackend::HashChain and HcMatchGenerator; route Better to HashChain; implement hash-chain tables, lazy2 lookahead, lazy_depth adjustments, and integrate with driver control flow and tests.
Encoding — pipeline & reexports
zstd/src/encoding/levels/fastest.rs, zstd/src/encoding/levels/mod.rs, zstd/src/encoding/frame_compressor.rs
Rename compress_fastest → compress_block_encoded, update the re-export, and route Better into the shared block-encoded compression path used by Fastest/Default.
Streaming encoder
zstd/src/encoding/streaming_encoder.rs
Enable Better for streaming: allocate matcher pending space, route to block-encoded path, update support checks/messages, and replace unsupported-level test with streaming roundtrip test for Better.
Tests — unit & integration
zstd/src/tests/roundtrip_integrity.rs, zstd/tests/cross_validation.rs
Generalize roundtrip helpers to support arbitrary levels; add many Better roundtrip/streaming tests, size-regression assertions vs Default, and cross-validation tests verifying C-side decode parity.
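The CLI change in the table above can be sketched as follows. Only the 3 → Better mapping and the rejection of level 4 are stated in this PR; the 1 → Fastest and 2 → Default arms are assumptions for illustration, and parse_level is a hypothetical helper name, not the actual function in cli/src/main.rs.

```rust
#[derive(Debug, PartialEq)]
enum CompressionLevel {
    Fastest,
    Default,
    Better,
}

/// Map the CLI --level flag to a compression level; levels above 3 are rejected.
fn parse_level(level: u8) -> Result<CompressionLevel, String> {
    match level {
        1 => Ok(CompressionLevel::Fastest), // assumed mapping
        2 => Ok(CompressionLevel::Default), // assumed mapping
        3 => Ok(CompressionLevel::Better),  // added by this PR
        n => Err(format!("unsupported --level {n}: maximum is 3")),
    }
}
```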

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI Parser
    participant SE as StreamingEncoder
    participant FC as FrameCompressor
    participant MG as MatchGenerator
    participant HC as HashChainMatcher
    participant DF as DfastMatcher

    CLI->>SE: parse --level 3 (Better)
    SE->>SE: ensure_level_supported(Better)
    SE->>FC: FrameCompressor::new(level=Better)
    FC->>MG: reset/prepare matcher for Better
    alt Better → HashChain
        MG->>HC: init hc_match_generator (lazy_depth=2, window=8 MiB)
        loop per position
            HC->>HC: find best match at N
            HC->>HC: lookahead N+1 and N+2 (lazy2)
            HC->>FC: emit match/literal
        end
    else Fastest/Default → Dfast
        MG->>DF: init dfast (lazy_depth=1, window=4 MiB)
        DF->>DF: matching pipeline (lazy1)
        DF->>FC: emit match/literal
    end
    FC->>SE: produce compressed frame
    SE->>CLI: finish()/output

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~70 minutes


Poem

🐰 I hopped the chains and peeked two more,
N+1 and N+2 — I sniffed behind the door,
Gathered matches, tucked literals tight,
Better burrowed smaller, snug and light,
A rabbit’s nibble — leaner bytes tonight.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 75.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (4 passed)
  • Description Check: ✅ Passed (check skipped; CodeRabbit's high-level summary is enabled).
  • Title Check: ✅ Passed. The title accurately summarizes the main change: implementing the Better compression level with zstd level 7 and the lazy2 strategy.
  • Linked Issues Check: ✅ Passed. The PR implements all core requirements from issue #6 (hash-chain matcher, lazy2 strategy, window/memory configuration, roundtrip/interop tests) and meets the ~10% compression ratio target.
  • Out of Scope Changes Check: ✅ Passed. All changes are scoped to the Better compression level: matcher backend, encoding pipeline integration, CLI argument expansion, benchmarks, documentation, and tests. No unrelated modifications detected.




Copilot AI left a comment


Pull request overview

Adds a new mid-tier encoder setting, CompressionLevel::Better, intended to approximate C zstd level 7 by reusing the existing fast block pipeline while upgrading the match-finding strategy (lazy2) and window size.

Changes:

  • Implement CompressionLevel::Better using the Dfast backend with configurable lazy lookahead depth (lazy2) and an 8 MiB window.
  • Extend streaming and frame compressors to accept Better, plus add roundtrip/cross-validation tests.
  • Update CLI, benchmarks, and docs to expose/measure the new level.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
zstd/tests/cross_validation.rs Adds Better → C-FFI decompress regression test and ratio proximity check.
zstd/src/tests/roundtrip_integrity.rs Adds Better roundtrip coverage, including streaming and edge cases.
zstd/src/encoding/streaming_encoder.rs Enables Better in streaming encoder and adds streaming roundtrip test.
zstd/src/encoding/mod.rs Documents Better as implemented (lazy2 strategy).
zstd/src/encoding/match_generator.rs Routes Better to Dfast, increases window, adds lazy_depth, and implements lazy2 lookahead.
zstd/src/encoding/frame_compressor.rs Treats Better as a supported compressed level (same block pipeline).
zstd/benches/support/mod.rs Adds Better to benchmark level matrix (mapped to C level 7).
README.md Marks Better as implemented in the feature checklist.
cli/src/main.rs Maps CLI level 3 to Better and updates validation/docs for streaming support.
BENCHMARKS.md Updates benchmark documentation to include Better and clarify exclusions.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
zstd/src/encoding/match_generator.rs (1)

1051-1087: ⚠️ Potential issue | 🟠 Major

Probe lazy lookahead after seeding the deferred positions.

Line 1065 and Line 1078 call best_match() for abs_pos + 1 / abs_pos + 2 before abs_pos and abs_pos + 1 are inserted into short_hash / long_hash. That means the lazy comparison can't see matches that only become reachable after emitting one or two literals—especially fresh offset-1/offset-2 candidates—so pick_lazy_match() can choose against a different state than the parser will actually have on the next iterations. Please seed those lookahead positions, or simulate and roll back the touched buckets, before comparing lazy candidates.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@zstd/src/encoding/match_generator.rs` around lines 1051 - 1087,
pick_lazy_match currently calls best_match(abs_pos + 1 / +2) before the parser
state (short_hash/long_hash) has been updated with the deferred positions for
abs_pos and abs_pos+1, so the lazy lookahead misses matches that only appear
after emitting one or two literals; update pick_lazy_match to seed the hash
tables (or simulate the same bucket inserts and remember which buckets were
modified) for abs_pos (and abs_pos+1 when checking abs_pos+2) before invoking
best_match, then roll back the seeded buckets (or undo the simulated inserts)
after the comparisons; refer to pick_lazy_match, best_match, short_hash,
long_hash, lazy_depth and history_abs_end when implementing the seeding/rollback
so the lazy checks see the same state the parser will have on the next
iterations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cli/src/main.rs`:
- Line 12: The import ordering in cli/src/main.rs (specifically the use
color_eyre::eyre::{eyre, ContextCompat, WrapErr}; line and related imports in
the main/test modules) does not match rustfmt expectations and is failing CI;
run cargo fmt --all to automatically reorder and format imports, or manually
reorder the use statements to match rustfmt output (ensuring grouped/sorted
paths and braces order) so the imports around the eyre::{eyre, ContextCompat,
WrapErr} line and the corresponding imports at lines 382-383 and 389 are
formatted consistently.

In `@README.md`:
- Line 47: The README contains conflicting fork-goals text: the checklist now
marks "Better" as implemented while the description still says "currently only
Fastest is implemented"; update the descriptive sentence that references
"Fastest" so it reflects that "Better" is implemented (e.g., change "currently
only Fastest is implemented" to indicate both Fastest and Better are implemented
or remove the exclusivity), and ensure the checklist items and any other
mentions of "Better" and "Fastest" are consistent.

In `@zstd/benches/support/mod.rs`:
- Line 1: The import line use rand::{rngs::SmallRng, RngCore, SeedableRng}; is
mis-ordered for rustfmt; run cargo fmt --all (or apply rustfmt ordering) to
reorder the items in the brace list to rustfmt's expected order so the CI fmt
check passes, e.g., update the use statement for rand (the rngs::SmallRng,
RngCore, SeedableRng import) to match cargo fmt's output.

In `@zstd/src/encoding/streaming_encoder.rs`:
- Around line 265-266: The allocate_pending_space() logic should treat
CompressionLevel::Better the same as Fastest/Default so matcher-backed reusable
pending buffers are reused instead of reallocating; update the match arm(s) in
allocate_pending_space() (and any related buffer-reuse conditionals) to include
CompressionLevel::Better alongside CompressionLevel::Fastest and
CompressionLevel::Default when deciding to reuse the matcher-backed pending
buffer, ensuring pending-buffer reuse occurs for Better and avoids extra
allocations on large streams.

In `@zstd/src/tests/roundtrip_integrity.rs`:
- Around line 76-96: The two level-specific helpers roundtrip_better and
roundtrip_better_streaming duplicate existing scaffolding; replace them with
generic functions roundtrip_at_level(data: &[u8], level: CompressionLevel) ->
Vec<u8> and roundtrip_streaming_at_level(data: &[u8], level: CompressionLevel)
-> Vec<u8> that perform the same compress->decode flow but accept a
CompressionLevel parameter, update any callers that used
roundtrip_better/roundtrip_better_streaming to call the new generic helpers, and
remove the old specific functions; reference existing helpers roundtrip_simple,
roundtrip_streaming, and roundtrip_default as the pattern to follow for naming
and behavior.
- Around line 384-446: Add a new test that constructs a repeated-pattern fixture
whose second match occurs between 4 MiB and 8 MiB (use repeat_offset_fixture to
place the repeating pattern at an offset > 4*1024*1024 and < 8*1024*1024),
compress it with compress_to_vec using CompressionLevel::Default and
CompressionLevel::Better, and assert that the Better output preserves the
long-distance match (e.g., compressed_better.len() < compressed_default.len() or
at least strictly <= and noticeably smaller) to ensure Better's larger window is
exercised; place this alongside the other better-level tests (or add a new
function better_level_crosses_default_history) so the suite fails if Better
falls back to a 4 MiB effective window.
- Line 13: The import group in zstd/src/tests/roundtrip_integrity.rs is out of
rustfmt order; reorder the items inside the use crate::encoding::{...} import so
they read CompressionLevel, FrameCompressor, compress_to_vec (i.e., change the
list to CompressionLevel, FrameCompressor, compress_to_vec) and run cargo fmt
--check to verify formatting passes.

---

Outside diff comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 1051-1087: pick_lazy_match currently calls best_match(abs_pos + 1
/ +2) before the parser state (short_hash/long_hash) has been updated with the
deferred positions for abs_pos and abs_pos+1, so the lazy lookahead misses
matches that only appear after emitting one or two literals; update
pick_lazy_match to seed the hash tables (or simulate the same bucket inserts and
remember which buckets were modified) for abs_pos (and abs_pos+1 when checking
abs_pos+2) before invoking best_match, then roll back the seeded buckets (or
undo the simulated inserts) after the comparisons; refer to pick_lazy_match,
best_match, short_hash, long_hash, lazy_depth and history_abs_end when
implementing the seeding/rollback so the lazy checks see the same state the
parser will have on the next iterations.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1120a5c3-5d47-4d93-ba3e-ff278f344ebd

📥 Commits

Reviewing files that changed from the base of the PR and between db372df and cd5ca5a.

📒 Files selected for processing (10)
  • BENCHMARKS.md
  • README.md
  • cli/src/main.rs
  • zstd/benches/support/mod.rs
  • zstd/src/encoding/frame_compressor.rs
  • zstd/src/encoding/match_generator.rs
  • zstd/src/encoding/mod.rs
  • zstd/src/encoding/streaming_encoder.rs
  • zstd/src/tests/roundtrip_integrity.rs
  • zstd/tests/cross_validation.rs

@polaz (Member, Author) commented Apr 1, 2026

@coderabbitai Re: outside-diff pick_lazy_match position seeding (match_generator.rs:1051-1087) — this code is in the Dfast backend which is only used by Default level. Better now uses the new HcMatchGenerator (hash chain backend) which inserts positions before lazy evaluation in its own start_matching loop. The Dfast lazy seeding order is a valid improvement for Default but out of scope for this PR.


coderabbitai Bot commented Apr 1, 2026


@polaz, thanks for the clarification. You're right — since Better routes through HcMatchGenerator (which manages position insertion within its own start_matching loop) rather than the Dfast backend, the seeding order concern in pick_lazy_match doesn't apply to this PR's scope. I'll note the architectural split for future reviews.


✏️ Learnings added
Learnt from: polaz
URL: https://github.com/structured-world/structured-zstd/pull/48

Timestamp: 2026-04-01T06:49:34.213Z
Learning: In the `structured-zstd` codebase (`zstd/src/encoding/match_generator.rs`), `CompressionLevel::Default` uses the `DfastMatchGenerator` backend (with `pick_lazy_match` lazy evaluation), while `CompressionLevel::Better` uses the `HcMatchGenerator` (hash chain backend) which inserts positions before lazy evaluation inside its own `start_matching` loop. These two backends are independent; review comments about Dfast internals (e.g. position seeding order in `pick_lazy_match`) do not apply to the Better level's implementation path.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
zstd/src/tests/roundtrip_integrity.rs (1)

13-13: ⚠️ Potential issue | 🟡 Minor

Fix import ordering to pass cargo fmt --check.

The pipeline failure indicates the import order doesn't match rustfmt expectations. The current order is compress_to_vec, CompressionLevel, FrameCompressor but rustfmt expects a different order.

Run cargo fmt --all to fix:

cargo fmt --all
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@zstd/src/tests/roundtrip_integrity.rs` at line 13, The import list in
roundtrip_integrity.rs is not ordered to rustfmt's expectations: reorder the
items in the use crate::encoding import so the symbols compress_to_vec,
CompressionLevel, FrameCompressor are arranged according to rustfmt's ordering
(usually alphabetical/grouped) and then run cargo fmt --all to apply formatting;
update the use line that references compress_to_vec, CompressionLevel,
FrameCompressor so it matches the formatted output.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 1685-1692: The match_gain function currently treats offset==0 as
valid and yields a positive gain; add a defensive check in match_gain to handle
the invalid offset==0 case (offsets are 1-indexed in zstd) — for example, add a
debug assertion referencing match_gain that offset != 0 and return a
sentinel/very-low value when offset == 0 so callers won’t treat it as a useful
match; update the logic inside match_gain (and mention the symbol match_gain) to
assert or short-circuit on offset==0 and produce a clearly invalid gain instead
of computing with offset_bits == 0.
- Around line 1547-1570: chain_candidates currently allocates a new Vec<usize>
on each call which causes allocation churn; change it to avoid heap allocation
by either (A) changing its signature to accept a reusable &mut Vec<usize>
scratch buffer (clear and push candidate_abs into that buffer) and return a
slice or unit so callers reuse the buffer, or (B) return a lazy iterator that
walks hash_table and chain_table (using HC_SEARCH_DEPTH, HC_EMPTY,
history_abs_start, live_history(), hash_position(), chain_table and hash_table)
without collecting into a Vec; update all callers to use the new API so
allocations are eliminated.
- Around line 277-280: The assignment to dfast.lazy_depth for
CompressionLevel::Better is unreachable because level_config() maps Better to
MatcherBackend::HashChain, so update the Dfast branch in match_generator.rs by
either removing the CompressionLevel::Better arm from the match or replacing it
with a brief defensive comment; specifically, edit the match on level where
dfast.lazy_depth is set (referencing dfast.lazy_depth, CompressionLevel::Better,
level_config(), and MatcherBackend::HashChain) so the code no longer contains a
misleading unreachable case or explicitly notes it’s defensive.

---

Duplicate comments:
In `@zstd/src/tests/roundtrip_integrity.rs`:
- Line 13: The import list in roundtrip_integrity.rs is not ordered to rustfmt's
expectations: reorder the items in the use crate::encoding import so the symbols
compress_to_vec, CompressionLevel, FrameCompressor are arranged according to
rustfmt's ordering (usually alphabetical/grouped) and then run cargo fmt --all
to apply formatting; update the use line that references compress_to_vec,
CompressionLevel, FrameCompressor so it matches the formatted output.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: a60c61f4-41db-49fb-9fa5-f8d69d7e8527

📥 Commits

Reviewing files that changed from the base of the PR and between cd5ca5a and 3969e38.

📒 Files selected for processing (7)
  • README.md
  • cli/src/main.rs
  • zstd/benches/support/mod.rs
  • zstd/src/encoding/match_generator.rs
  • zstd/src/encoding/streaming_encoder.rs
  • zstd/src/tests/roundtrip_integrity.rs
  • zstd/tests/cross_validation.rs

@polaz force-pushed the feat/#6-feat-implement-better-compression-level-zstd-level branch from 3969e38 to 0402707 on April 1, 2026 15:16

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
zstd/src/tests/roundtrip_integrity.rs (1)

13-13: ⚠️ Potential issue | 🟡 Minor

Fix import order to pass cargo fmt --check.

Pipeline is failing because rustfmt expects the items sorted differently (types before functions).

🔧 Proposed fix
-use crate::encoding::{compress_to_vec, CompressionLevel, FrameCompressor};
+use crate::encoding::{CompressionLevel, FrameCompressor, compress_to_vec};
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@zstd/src/tests/roundtrip_integrity.rs` at line 13, The import items in the
use statement are not ordered as rustfmt expects (types before functions);
update the use crate::encoding::{...} to list the type names first
(CompressionLevel, FrameCompressor) and then the function compress_to_vec so
rustfmt/cargo fmt --check passes; adjust the import in the file where the use
appears (referencing CompressionLevel, FrameCompressor, compress_to_vec).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 1531-1535: The code casts abs_pos to u32 (pos_u32) and stores it
in hash_table/chain_table, which silently wraps when abs_pos > u32::MAX and
corrupts chain traversal; update MatchGenerator (the logic around pos_u32,
abs_pos, hash_table, chain_table, HC_CHAIN_LOG, chain_idx) to either (A) store a
rebased position relative to the current sliding window (e.g., compute
relative_pos = (abs_pos - window_base) and convert that to u32) so values never
exceed u32, or (B) perform a checked conversion and return an error for
oversized streams (rejecting Better streams) instead of wrapping; implement the
chosen approach consistently where pos_u32 is produced and read to avoid
mismatch in chain traversal.

In `@zstd/src/encoding/mod.rs`:
- Around line 64-66: The doc comment for CompressionLevel::Better is out of date
— it's no longer using the dfast matcher; update the public docs to state that
CompressionLevel::Better uses the hash-chain matcher (HcMatchGenerator) instead
of DfastMatchGenerator, and adjust the memory/ratio tradeoff description to
reflect higher memory use and better compression ratio compared to
CompressionLevel::Default. Locate the comment near CompressionLevel::Better in
mod.rs (referencing CompressionLevel::Better, DfastMatchGenerator,
HcMatchGenerator) and replace the wording to mention the hash-chain backend, the
HcMatchGenerator symbol, and the expected memory vs ratio characteristics.

In `@zstd/tests/cross_validation.rs`:
- Around line 261-264: The current test only checks a 1% regression and can pass
when CompressionLevel::Better is slightly larger than CompressionLevel::Default,
so update the assertion to require an actual Better win: add an assertion that
better.len() < default.len() (using the existing better and default variables
which compare CompressionLevel::Better and CompressionLevel::Default) before or
alongside the percentage check and include a clear failure message referencing
better and default.

---

Duplicate comments:
In `@zstd/src/tests/roundtrip_integrity.rs`:
- Line 13: The import items in the use statement are not ordered as rustfmt
expects (types before functions); update the use crate::encoding::{...} to list
the type names first (CompressionLevel, FrameCompressor) and then the function
compress_to_vec so rustfmt/cargo fmt --check passes; adjust the import in the
file where the use appears (referencing CompressionLevel, FrameCompressor,
compress_to_vec).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8619c951-f47f-4ef3-a54d-c18000bd956e

📥 Commits

Reviewing files that changed from the base of the PR and between 3969e38 and 0402707.

📒 Files selected for processing (10)
  • BENCHMARKS.md
  • README.md
  • cli/src/main.rs
  • zstd/benches/support/mod.rs
  • zstd/src/encoding/frame_compressor.rs
  • zstd/src/encoding/match_generator.rs
  • zstd/src/encoding/mod.rs
  • zstd/src/encoding/streaming_encoder.rs
  • zstd/src/tests/roundtrip_integrity.rs
  • zstd/tests/cross_validation.rs


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 1148-1170: The nested `if let Some(next) = next { if ... }` in the
lazy-check should be collapsed into a single if using let-chains to satisfy
clippy::collapsible_if; replace the two-level check around the `next` variable
(the call to best_match used in the lazy check) with a single `if let Some(next)
= next && (next.match_len > best.match_len || (next.match_len == best.match_len
&& next.offset < best.offset))` and keep the early return None as-is; apply the
same style consistency for analogous checks (e.g., the lazy2 block uses `next2`)
and ensure references to best_match, lazy_depth, DFAST_MIN_MATCH_LEN,
history_abs_end, match_len, and offset remain unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: a0c1f32e-fd54-4523-babe-3577b227e6e6

📥 Commits

Reviewing files that changed from the base of the PR and between 0402707 and d78cee6.

📒 Files selected for processing (2)
  • zstd/src/encoding/match_generator.rs
  • zstd/src/tests/roundtrip_integrity.rs

Comment thread zstd/src/encoding/match_generator.rs
@codecov

codecov Bot commented Apr 1, 2026

Codecov Report

❌ Patch coverage is 91.68399% with 40 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
zstd/src/encoding/match_generator.rs 91.59% 39 Missing ⚠️
zstd/src/encoding/streaming_encoder.rs 93.33% 1 Missing ⚠️



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
zstd/src/encoding/match_generator.rs (1)

1160-1165: 🧹 Nitpick | 🔵 Trivial

Consider collapsing the second nested if for consistency.

The first lazy check at lines 1150-1155 uses let-chains, but this lazy2 check still has a nested if. While Clippy may not flag this currently (possibly due to the comment), collapsing it would maintain consistency.

♻️ Suggested refactor for consistency
         if self.lazy_depth >= 2 && abs_pos + 2 + DFAST_MIN_MATCH_LEN <= self.history_abs_end() {
             let next2 = self.best_match(abs_pos + 2, lit_len + 2);
-            if let Some(next2) = next2 {
-                // Require strictly longer match to justify skipping two literal bytes
-                if next2.match_len > best.match_len + 1 {
-                    return None;
-                }
+            // Require strictly longer match to justify skipping two literal bytes
+            if let Some(next2) = next2 && next2.match_len > best.match_len + 1 {
+                return None;
             }
         }
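For readers outside the thread: the collapse suggested above folds an `Option` pattern match and a length guard into one condition. A minimal, self-contained sketch of the same logic is below; `Match` and `should_defer` are simplified hypothetical stand-ins, not the crate's real types. The in-repo fix uses a let-chain (which needs edition 2024); the equivalent check on older stable Rust can be written with `Option::is_some_and`:

```rust
// Hypothetical simplified stand-in for the crate's match type.
#[derive(Clone, Copy)]
struct Match {
    match_len: usize,
}

// Nested shape clippy::collapsible_if flags:
//   if let Some(next2) = next2 {
//       if next2.match_len > best_len + 1 { return true; }
//   }
// Collapsed shape (let-chain in-repo; is_some_and here for portability):
fn should_defer(next2: Option<Match>, best_len: usize) -> bool {
    // Require a strictly longer match to justify skipping two literal bytes.
    next2.is_some_and(|n| n.match_len > best_len + 1)
}

fn main() {
    assert!(should_defer(Some(Match { match_len: 10 }), 4)); // 10 > 4 + 1: defer
    assert!(!should_defer(Some(Match { match_len: 5 }), 5)); // not strictly longer
    assert!(!should_defer(None, 4)); // no look-ahead match at all
    println!("ok");
}
```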
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@zstd/src/encoding/match_generator.rs` around lines 1160 - 1165, Refactor the
nested check in the lazy2 block to use a let-chain like the earlier lazy check:
combine the Option pattern and the length comparison into a single if condition
(referencing next2 and next2.match_len against best.match_len + 1) and return
None when that combined condition holds, removing the nested if for consistency
with the prior pattern.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/tests/cross_validation.rs`:
- Around line 252-254: The doc comment above the cross-validation test currently
says "within a small margin" but the test asserts that Better strictly
outperforms Default; update the comment to reflect that strict requirement
(e.g., say that Better is expected to be strictly better than Default on the
corpus proxy) so the prose matches the assertion involving Better and Default in
this test.

---

Duplicate comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 1160-1165: Refactor the nested check in the lazy2 block to use a
let-chain like the earlier lazy check: combine the Option pattern and the length
comparison into a single if condition (referencing next2 and next2.match_len
against best.match_len + 1) and return None when that combined condition holds,
removing the nested if for consistency with the prior pattern.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: c976b1c4-cadd-49bf-aad3-534799be04c6

📥 Commits

Reviewing files that changed from the base of the PR and between d78cee6 and f2ca81e.

📒 Files selected for processing (3)
  • zstd/src/encoding/match_generator.rs
  • zstd/src/encoding/mod.rs
  • zstd/tests/cross_validation.rs

Comment thread zstd/tests/cross_validation.rs Outdated

@sw-release-bot sw-release-bot Bot left a comment


⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'structured-zstd vs C FFI'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.15.

Benchmark suite Current: af9dadb Previous: db372df Ratio
compress/default/large-log-stream/matrix/c_ffi 3.895 ms 3.383 ms 1.15

This comment was automatically generated by workflow using github-action-benchmark.

CC: @polaz

@polaz
Member Author

polaz commented Apr 1, 2026

@coderabbitai Re: cross_validation.rs:254 doc comment nitpick — Fixed, updated doc to match the strict assertion.

@coderabbitai

coderabbitai Bot commented Apr 1, 2026


@polaz, thanks for the update — glad the doc comment now accurately reflects the strict assertion. ✅


🧠 Learnings used
Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-01T06:49:34.213Z
Learning: In the `structured-zstd` codebase (`zstd/src/encoding/match_generator.rs`), `CompressionLevel::Default` uses the `DfastMatchGenerator` backend (with `pick_lazy_match` lazy evaluation), while `CompressionLevel::Better` uses the `HcMatchGenerator` (hash chain backend) which inserts positions before lazy evaluation inside its own `start_matching` loop. These two backends are independent; review comments about Dfast internals (e.g. position seeding order in `pick_lazy_match`) do not apply to the Better level's implementation path.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 0
File: :0-0
Timestamp: 2026-04-01T06:49:39.991Z
Learning: In the structured-zstd codebase (`zstd/src/encoding/match_generator.rs`), `CompressionLevel::Default` uses the `DfastMatchGenerator` backend (with `pick_lazy_match` for lazy evaluation), while `CompressionLevel::Better` uses the `HcMatchGenerator` (hash chain backend) which inserts positions before lazy evaluation inside its own `start_matching` loop. These are distinct backends with separate lazy-seeding paths.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 48
File: zstd/src/encoding/match_generator.rs:1531-1535
Timestamp: 2026-04-01T15:33:13.169Z
Learning: In `zstd/src/encoding/match_generator.rs`, `HcMatchGenerator::insert_position` intentionally casts `abs_pos: usize` to `u32` (wrapping on streams > 4 GiB). This is safe by design: (1) the chain table slot is computed as `(pos_u32 as usize) & chain_mask`, so only the low bits matter and wrapping leaves the addressed slot unchanged; (2) `chain_candidates` filters any recovered `cur as usize` value against `history_abs_start` — a wrapped (small) value is always less than the current sliding-window start and is discarded. Do not flag this cast as a bug.

Learnt from: CR
Repo: structured-world/structured-zstd PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-03-26T15:15:32.318Z
Learning: Applies to **/*.rs : Code must pass `cargo clippy -p structured-zstd --features hash,std,dict_builder -- -D warnings` (excluding `rustc-dep-of-std` and `fuzz_exports` features)

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 44
File: zstd/src/encoding/frame_compressor.rs:0-0
Timestamp: 2026-03-28T22:55:01.577Z
Learning: In `zstd/src/encoding/frame_compressor.rs`, `FrameCompressor::set_dictionary` returns `Result<Option<Dictionary>, DictionaryDecodeError>` and enforces two fail-fast programmer-error contracts via `Err(...)` rather than panics: (1) `dictionary.id == 0` returns `DictionaryDecodeError::ZeroDictionaryId`; (2) any zero entry in `dictionary.offset_hist` returns `DictionaryDecodeError::ZeroRepeatOffsetInDictionary`. The fallible parsing paths (`Dictionary::decode_dict`, `Dictionary::from_raw_content`) apply the same rejections at input boundaries. Do not flag these as inconsistencies — the uniform `Result`-based contract is intentional.

Learnt from: polaz
Repo: structured-world/structured-zstd PR: 45
File: cli/src/main.rs:195-202
Timestamp: 2026-03-31T12:34:08.051Z
Learning: In `cli/src/main.rs` (structured-world/structured-zstd), the Windows `std::os::windows::fs::MetadataExt` trait exposes `file_index() -> Option<u64>` (a single 64-bit value stabilized in Rust 1.83) — NOT `file_index_high()` / `file_index_low()` split methods, which do not exist. The correct same-file check on Windows uses `volume_serial_number()` and `file_index()`.

Learnt from: CR
Repo: structured-world/structured-zstd PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-03-26T15:15:32.318Z
Learning: Avoid suggesting architectural changes that diverge too far from upstream, as this is a fork of ruzstd (KillingSpark/zstd-rs)


Copilot AI left a comment


Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Comment thread zstd/src/encoding/match_generator.rs Outdated
Comment thread zstd/src/encoding/match_generator.rs Outdated

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
zstd/src/encoding/match_generator.rs (1)

1157-1165: ⚠️ Potential issue | 🟡 Minor

Collapse the remaining lazy2 if let nest.

The first lazy check was fixed, but the next2 block still uses the if let Some(next2) = next2 { if ... } shape, so clippy::collapsible_if can still keep the Rust lint gate red.

🔧 Suggested fix
         if self.lazy_depth >= 2 && abs_pos + 2 + DFAST_MIN_MATCH_LEN <= self.history_abs_end() {
             let next2 = self.best_match(abs_pos + 2, lit_len + 2);
-            if let Some(next2) = next2 {
-                // Require strictly longer match to justify skipping two literal bytes
-                if next2.match_len > best.match_len + 1 {
-                    return None;
-                }
+            if let Some(next2) = next2 && next2.match_len > best.match_len + 1 {
+                // Require strictly longer match to justify skipping two literal bytes
+                return None;
             }
         }

As per coding guidelines, "Code must pass cargo clippy -p structured-zstd --features hash,std,dict_builder -- -D warnings (excluding rustc-dep-of-std and fuzz_exports features)".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@zstd/src/encoding/match_generator.rs` around lines 1157 - 1165, Collapse the
nested `if let` by combining the match check and the length guard into a single
conditional: replace the current two-step `let next2 = self.best_match(...)`
followed by `if let Some(next2) = next2 { if next2.match_len > best.match_len +
1 { return None; } }` with a single `if let Some(next2) =
self.best_match(abs_pos + 2, lit_len + 2) && next2.match_len > best.match_len +
1 { return None; }` inside the existing `if self.lazy_depth >= 2 && abs_pos + 2
+ DFAST_MIN_MATCH_LEN <= self.history_abs_end()` block so the `next2` binding
and the length check are collapsed and the clippy `collapsible_if` warning is
resolved.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 1435-1440: skip_matching currently only inserts positions starting
at current_abs_start so the three 4-byte HC starts at current_abs_start - 3 ..
current_abs_start never become hashable until the next slice, hiding cross-slice
matches from Better; fix by backfilling those three starts when advancing
slices: in skip_matching (and the analogous place around lines ~1450-1480) call
insert_position/insert_positions for the range that includes
current_abs_start.saturating_sub(3) up through current_abs_start + current_len
(or explicitly insert the three indices current_abs_start -3, -2, -1 if
in-range), making sure to guard by bounds derived from
history_abs_start/window_size/window to avoid inserting negative or
out-of-window positions so those HC starts become hashable for Better.

---

Duplicate comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 1157-1165: Collapse the nested `if let` by combining the match
check and the length guard into a single conditional: replace the current
two-step `let next2 = self.best_match(...)` followed by `if let Some(next2) =
next2 { if next2.match_len > best.match_len + 1 { return None; } }` with a
single `if let Some(next2) = self.best_match(abs_pos + 2, lit_len + 2) &&
next2.match_len > best.match_len + 1 { return None; }` inside the existing `if
self.lazy_depth >= 2 && abs_pos + 2 + DFAST_MIN_MATCH_LEN <=
self.history_abs_end()` block so the `next2` binding and the length check are
collapsed and the clippy `collapsible_if` warning is resolved.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 084f9adc-51b1-4c3b-8a11-3f6a4b4e7c06

📥 Commits

Reviewing files that changed from the base of the PR and between d78cee6 and 3ed9ac1.

📒 Files selected for processing (3)
  • zstd/src/encoding/match_generator.rs
  • zstd/src/encoding/mod.rs
  • zstd/tests/cross_validation.rs

Comment thread zstd/src/encoding/match_generator.rs

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 1160-1167: Collapse the nested optional check into a single
let-chain conditional: inside the block guarded by self.lazy_depth and
history_abs_end() replace the separate let next2 = self.best_match(...) followed
by if let Some(next2) = next2 { if next2.match_len > best.match_len + 1 { ... }
} with a single if let Some(next2) = self.best_match(abs_pos + 2, lit_len + 2)
&& next2.match_len > best.match_len + 1 { return None; } so you use a let-chain
and remove the intermediate variable while keeping the same logic (symbols:
lazy_depth, abs_pos, DFAST_MIN_MATCH_LEN, history_abs_end(), best_match, next2,
match_len, best.match_len).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 80c5a4e0-d3e5-4a7b-8ad6-d8308f45220d

📥 Commits

Reviewing files that changed from the base of the PR and between 3ed9ac1 and b4b95f3.

📒 Files selected for processing (1)
  • zstd/src/encoding/match_generator.rs

Comment thread zstd/src/encoding/match_generator.rs

Copilot AI left a comment


Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Comment thread zstd/src/encoding/match_generator.rs Outdated
Comment thread zstd/src/encoding/match_generator.rs
Comment thread zstd/src/encoding/match_generator.rs

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 1470-1490: The lazy look-ahead is calling find_best_match(abs_pos
+ 1/2, ...) before the skipped bytes have been inserted into the hash-chain
tables, so the matcher sees stale state; fix by seeding those skipped positions
into the matcher before any look-ahead comparisons. Concretely, in the lazy path
(where pick_lazy_match is used from match_generator.rs) ensure you call
insert_positions or insert_position for the bytes between abs_pos and the
look-ahead start (i.e., the positions that will be skipped) before invoking
find_best_match for abs_pos+1 and abs_pos+2 (and likewise in the similar block
around the 1722–1740 region). This ensures find_best_match, pick_lazy_match and
subsequent Better comparisons see up-to-date HC state and capture self-overlap /
offset-1 winners.
- Around line 1438-1442: The backfill logic in backfill_boundary_positions
currently bails out when current_abs_start.saturating_sub(3) is below
history_abs_start, dropping anchors; change it to clamp the backfill start to
the live-history start instead of early-returning: compute a clamped start =
max(self.history_abs_start, current_abs_start.saturating_sub(3)) and call
insert_positions(clamped_start, current_abs_start) (only if clamped_start <
current_abs_start), using the existing backfill_boundary_positions and
insert_positions symbols and the history_abs_start/current_abs_start values to
locate the fix.
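The clamping described above can be sketched in isolation. This is a hypothetical simplification (the real `backfill_boundary_positions` seeds hash-chain tables; here the function only computes the range to seed), showing why clamping beats an early return when the 3-byte backfill range dips below the live-history start:

```rust
// Compute the boundary-backfill range, clamped to live history instead of
// bailing out when part of it has slid out of the window.
fn backfill_range(history_abs_start: usize, current_abs_start: usize) -> Option<(usize, usize)> {
    let clamped_start = history_abs_start.max(current_abs_start.saturating_sub(3));
    // Only seed when at least one position survives the clamp.
    (clamped_start < current_abs_start).then_some((clamped_start, current_abs_start))
}

fn main() {
    // Normal case: all three boundary positions are inside live history.
    assert_eq!(backfill_range(100, 200), Some((197, 200)));
    // Partial case: only the last position survives the window; an early
    // return here would have dropped it entirely.
    assert_eq!(backfill_range(199, 200), Some((199, 200)));
    // Degenerate case: nothing left to seed.
    assert_eq!(backfill_range(200, 200), None);
    println!("ok");
}
```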

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 768a486f-3ba1-43d5-9815-c84e55a70695

📥 Commits

Reviewing files that changed from the base of the PR and between b4b95f3 and b47b901.

📒 Files selected for processing (1)
  • zstd/src/encoding/match_generator.rs

Comment thread zstd/src/encoding/match_generator.rs
Comment thread zstd/src/encoding/match_generator.rs

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 1471-1504: Before emitting the trailing literals, seed the
boundary anchor by calling the same insert logic for the start at current_len -
4 so the final 4-byte tail is indexed; specifically, in the block that handles
remaining literals (after the while loop that uses insert_position and
insert_positions), compute the boundary_start = current_abs_start +
(current_len.saturating_sub(4)) and call insert_position (or insert_positions if
you need to insert up to boundary_start + 4) for that start so
backfill_boundary_positions and cross-slice matching (used by start_matching)
can find matches whose source begins exactly 4 bytes before the slice boundary;
update both places noted (the trailing-literals return and the similar section
around lines 1540-1545) to perform this seeding.
- Around line 2210-2214: Add a cross-slice regression for the HashChain path
immediately after the existing routing check: after calling
driver.reset(CompressionLevel::Better) and asserting driver.active_backend ==
MatcherBackend::HashChain and driver.hc_matcher().lazy_depth == 2, add a test
analogous to dfast_inserts_tail_positions_for_next_block_matching() but targeted
at the HashChain matcher (e.g.,
hashchain_inserts_tail_positions_for_next_block_matching()). The new test should
seed two adjacent slices with overlapping content, run the HashChain match
generation across the boundary using driver.hc_matcher()/driver APIs, and assert
that tail positions from the previous slice are preserved/used for matching in
the next slice to catch boundary-anchor losses. Ensure it lives next to the
routing check so it runs for CompressionLevel::Better.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: ac9c3208-f79b-4211-8531-4057aa092090

📥 Commits

Reviewing files that changed from the base of the PR and between b47b901 and fdf783d.

📒 Files selected for processing (1)
  • zstd/src/encoding/match_generator.rs

Comment thread zstd/src/encoding/match_generator.rs
Comment thread zstd/src/encoding/match_generator.rs
…reaming helper

- Check abs_pos >= u32::MAX as usize before casting (prevents silent truncation)
- Delegate roundtrip_streaming to roundtrip_streaming_at_level for consistency
- Fix import ordering (cargo fmt)

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 336-339: The match arm that sets min_primed_tail currently treats
MatcherBackend::HashChain like Dfast (value 4), which drops 1–3 byte dictionary
tails even though HcMatchGenerator::backfill_boundary_positions() can make those
tails hashable; update the MatcherBackend::HashChain branch in
match_generator.rs so min_primed_tail uses MIN_MATCH_LEN (like
MatcherBackend::Simple) instead of 4, ensuring CompressionLevel::Better can
detect the first cross-boundary match after priming.
- Around line 1554-1559: The early return when abs_pos >= u32::MAX disables all
further HC insertions for long streams; instead preserve the original wrap-safe
behavior used by HcMatchGenerator::insert_position by casting abs_pos to u32 and
continuing to insert (store (abs_pos as u32).wrapping_add(1) or equivalent) so
HC_EMPTY (0) still never collides, and rely on existing filtering against
history_abs_start to discard wrapped/stale candidates. Replace the return with a
safe cast+insert path that stores (abs_pos+1) as a u32 and keeps the rest of the
insertion logic unchanged (respecting HC_EMPTY and history_abs_start).
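The wrap-safe behavior described above rests on one invariant: a wrapped `u32` position decodes to a small absolute value, and the sliding-window filter against `history_abs_start` always discards values below the live window. A minimal hypothetical sketch (simplified from the real `insert_position`/`chain_candidates` pair; assumes a 64-bit `usize`):

```rust
// Recover a stored u32 candidate and filter it against the live window.
// Wrapped (stale) entries decode to small values and are rejected here
// rather than being matched against the wrong bytes.
fn decode_candidate(stored: u32, history_abs_start: usize) -> Option<usize> {
    let candidate = stored as usize;
    (candidate >= history_abs_start).then_some(candidate)
}

fn main() {
    // Stream that has advanced past the u32 range (requires 64-bit usize).
    let history_abs_start: usize = (u32::MAX as usize) + 100;
    let abs_pos: usize = history_abs_start + 5;
    let wrapped = abs_pos as u32; // silently wraps to a small value
    assert_eq!(wrapped, 104);
    // The wrapped value falls below the window start and is discarded.
    assert_eq!(decode_candidate(wrapped, history_abs_start), None);
    // A genuinely in-window candidate passes through unchanged.
    assert_eq!(decode_candidate(500, 100), Some(500));
    println!("ok");
}
```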

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 16298c35-ffb2-4ddb-a1e3-875953d6e0f4

📥 Commits

Reviewing files that changed from the base of the PR and between b21c7c9 and e1828c1.

📒 Files selected for processing (2)
  • zstd/src/encoding/match_generator.rs
  • zstd/src/tests/roundtrip_integrity.rs

Comment thread zstd/src/encoding/match_generator.rs
Comment thread zstd/src/encoding/match_generator.rs
- CompressionLevel::Better doc warns about >4 GiB single-frame degradation
- TODO(#51) at insert_position guard for future table rebasing
- Clarify min_primed_tail 4-byte requirement for HC backend

Copilot AI left a comment


Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

Comment thread zstd/src/encoding/match_generator.rs

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 301-305: The HashChain-specific priming/retention branches in
prime_with_dictionary (and the analogous spots around the other occurrences) are
not directly exercised for CompressionLevel::Better; update the existing
priming/retention tests (the "preserves history for first full block" and the
retention-budget tests) to include a dedicated CompressionLevel::Better case
that drives the HashChain branch so its offset_hist/retention behavior is
asserted explicitly; locate the matcher backend switch in prime_with_dictionary
and add test variants that instantiate the HashChain backend (via the same setup
used to reach MatcherBackend::HashChain) and assert the expected offset_hist and
retention results, repeating this approach for the similar blocks at the other
mentioned locations.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9a3b3883-992c-4073-97f9-dab103ddecce

📥 Commits

Reviewing files that changed from the base of the PR and between b21c7c9 and 8cffeec.

📒 Files selected for processing (3)
  • zstd/src/encoding/match_generator.rs
  • zstd/src/encoding/mod.rs
  • zstd/src/tests/roundtrip_integrity.rs

Comment thread zstd/src/encoding/match_generator.rs

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 1566-1568: The chain_table stores absolute predecessor positions
but is indexed by a masked chain_idx, which lets reused masked slots form
forward links/cycles; instead of only checking next == cur, ensure you never
store a predecessor that is not strictly older than the current position: when
computing chain_idx and prev (from hash_table[hash]), only assign
self.chain_table[chain_idx] = prev if prev < pos_u32 (otherwise set it to 0 or
skip assignment); apply the same guard wherever chain_table is updated (see
usages around chain_idx, chain_table and hash_table in match_generator.rs and
the chain_candidates()-related updates at ~1602-1613) to prevent
forward/duplicate links and break cycles.
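The write-side guard proposed above can be sketched on its own. This is a hypothetical simplification (`store_link` stands in for the real masked `chain_table` update): a predecessor is only published if it is strictly older than the position being inserted, so a reused masked slot can never emit a forward link.

```rust
// Only store a predecessor strictly older than the inserted position;
// otherwise write the empty sentinel (0, standing in for HC_EMPTY).
fn store_link(chain_table: &mut [u32], chain_idx: usize, prev: u32, pos: u32) {
    chain_table[chain_idx] = if prev < pos { prev } else { 0 };
}

fn main() {
    let mut chain_table = vec![0u32; 8];
    // Older predecessor: a valid backward link, kept.
    store_link(&mut chain_table, 5, 3, 10);
    assert_eq!(chain_table[5], 3);
    // Newer "predecessor" from a reused masked slot: dropped, breaking the
    // forward link before it can form a cycle.
    store_link(&mut chain_table, 5, 12, 10);
    assert_eq!(chain_table[5], 0);
    println!("ok");
}
```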

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8bc43989-beb8-4097-aaec-038158e33547

📥 Commits

Reviewing files that changed from the base of the PR and between 8cffeec and 59cb45e.

📒 Files selected for processing (1)
  • zstd/src/encoding/match_generator.rs

Comment thread zstd/src/encoding/match_generator.rs

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (2)
zstd/src/encoding/match_generator.rs (2)

2240-2245: 🧹 Nitpick | 🔵 Trivial

Add one CompressionLevel::Better priming regression here too.

These assertions pin backend selection and window size, but the new MatcherBackend::HashChain branches in prime_with_dictionary() still only have backend-specific regressions for Simple and Dfast. A small Better case would keep offset_hist, retained-budget retirement, and the 4-byte primed-tail behavior from being covered only indirectly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@zstd/src/encoding/match_generator.rs` around lines 2240 - 2245, Add a direct
CompressionLevel::Better priming regression in prime_with_dictionary() test
coverage: after the existing driver.reset(CompressionLevel::Better)/asserts
block for MatcherBackend::HashChain, add a small prime_with_dictionary()
invocation that uses CompressionLevel::Better and verifies HashChain-specific
behaviors (check driver.hc_matcher().offset_hist state, retained-budget
retirement behavior, and the 4-byte primed-tail handling) so Better is exercised
explicitly rather than only via Simple/Dfast paths; locate the
prime_with_dictionary() calls and MatcherBackend::HashChain branches to insert
this Better-case regression.

1603-1613: ⚠️ Potential issue | 🟠 Major

Guard non-decreasing chain links, not just next == cur.

Line 1605 only catches 1-node self-loops. Reused masked slots can also create 2+ node cycles (older -> newer -> older), so this walk can still burn the full 16-candidate budget on duplicates and miss older valid matches.

🛠️ Minimal guard
-            if next == cur {
-                // Self-loop: two positions share chain_idx, stop to avoid
-                // spinning on the same candidate forever.
+            if next >= cur {
+                // Reused chain slots can overwrite an older node's link with a
+                // newer/equal one. Any non-decreasing edge means traversal is
+                // no longer moving backward through history.
                 if candidate_abs >= self.history_abs_start && candidate_abs < abs_pos {
                     buf[filled] = candidate_abs;
+                    filled += 1;
                 }
                 break;
             }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@zstd/src/encoding/match_generator.rs` around lines 1603 - 1613, The loop
currently only breaks on exact self-loops (if next == cur) but can still
traverse cycles or non-decreasing links; change the guard to stop when chain
links stop decreasing (e.g., if next >= cur) so you don't revisit newer-or-equal
positions and burn the candidate budget on duplicates. Update the check around
chain_table lookup in the walk (the variables: chain_table, candidate_abs,
chain_mask, cur, next, steps, buf, filled, history_abs_start, abs_pos) to break
when next >= cur and keep the existing logic that records candidate_abs into buf
only when candidate_abs is within [history_abs_start, abs_pos). Ensure cur is
updated only after this guard so the walk never advances into non-decreasing or
cyclic links.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 2240-2245: Add a direct CompressionLevel::Better priming
regression in prime_with_dictionary() test coverage: after the existing
driver.reset(CompressionLevel::Better)/asserts block for
MatcherBackend::HashChain, add a small prime_with_dictionary() invocation that
uses CompressionLevel::Better and verifies HashChain-specific behaviors (check
driver.hc_matcher().offset_hist state, retained-budget retirement behavior, and
the 4-byte primed-tail handling) so Better is exercised explicitly rather than
only via Simple/Dfast paths; locate the prime_with_dictionary() calls and
MatcherBackend::HashChain branches to insert this Better-case regression.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 94f01483-94cc-4d8b-b3d2-6be8629c3cf8

📥 Commits

Reviewing files that changed from the base of the PR and between b21c7c9 and 59cb45e.

📒 Files selected for processing (3)
  • zstd/src/encoding/match_generator.rs
  • zstd/src/encoding/mod.rs
  • zstd/src/tests/roundtrip_integrity.rs


Copilot AI left a comment


Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

Comment thread: zstd/src/encoding/mod.rs (outdated)
- Add hash-chain dictionary priming test (mirrors Dfast variant)
- Add HC retention budget eviction test
- Refine Better doc comment: u32 limit is ~4 GiB + window, not instant

Copilot AI left a comment


Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

zstd/src/encoding/levels/fastest.rs:29

  • compress_block_encoded mutates compression state (offset history, last Huffman table, FSE table history) while building a compressed representation, but it can later fall back to emitting a Raw block when the compressed payload exceeds MAX_BLOCK_SIZE. If that fallback happens, the frame stream encodes a Raw block while the encoder state has advanced as if a Compressed block had been emitted, which can desync repcode history and/or cause later blocks to incorrectly select RepeatLast tables the decoder won't have, risking decode failures or corruption. Consider snapshotting/restoring the relevant fields (e.g., state.offset_hist, state.last_huff_table, state.fse_tables) when falling back to Raw, or refactoring so state updates only occur after committing to a Compressed block.
pub fn compress_block_encoded<M: Matcher>(
    state: &mut CompressState<M>,
    last_block: bool,
    uncompressed_data: Vec<u8>,
    output: &mut Vec<u8>,
) {
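A minimal sketch of the snapshot/restore idea the comment suggests, under loose assumptions: the field names (offset_hist, last_huff_table, fse_tables) are taken from the comment, while `try_compress`, the types, and all values are toy stand-ins, not the crate's real API.

```rust
// Hypothetical sketch: capture the rollback-able encoder state before
// attempting a compressed block, and restore it when falling back to Raw,
// so the decoder-visible stream and the encoder state stay in sync.
#[derive(Clone, PartialEq, Debug)]
struct EncoderState {
    offset_hist: [u32; 3],            // repcode history
    last_huff_table: Option<Vec<u8>>, // last Huffman table (for RepeatLast)
    fse_tables: Vec<u8>,              // FSE table history
}

// Toy stand-in: "encoding" advances the repcode history and always emits
// one extra header byte, so incompressible input grows.
fn try_compress(state: &mut EncoderState, data: &[u8]) -> Vec<u8> {
    state.offset_hist.rotate_right(1);
    state.offset_hist[0] = data.len() as u32;
    let mut out = Vec::with_capacity(data.len() + 1);
    out.push(0); // pretend block header
    out.extend_from_slice(data);
    out
}

fn compress_block(state: &mut EncoderState, data: &[u8], max_block: usize) -> Vec<u8> {
    let snapshot = state.clone();
    let compressed = try_compress(state, data); // mutates state as it encodes
    if compressed.len() > max_block {
        *state = snapshot; // roll back: the frame carries a Raw block instead
        data.to_vec()
    } else {
        compressed
    }
}
```

The alternative refactor (deferring all state mutation until a Compressed block is committed) avoids the clone but requires threading pending updates through the encoding path.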



Development

Successfully merging this pull request may close these issues.

feat: implement Better compression level (zstd level 7, lazy2 strategy)
