fix(encoding): implement default compression level #34
Conversation
📝 Walkthrough

Added a DFAST-style matcher and wired it into the match generator driver for `CompressionLevel::Default`.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Driver as MatchGeneratorDriver
    participant Simple as MatchGenerator
    participant Dfast as DfastMatchGenerator
    Client->>Driver: reset(level)
    alt level selects Dfast
        Driver->>Dfast: init_if_needed(), set slice_size/max_window_size
        Client->>Driver: start_matching(data)
        Driver->>Dfast: start_matching(...)
        Dfast->>Driver: emit Sequence(s)
    else level selects Simple
        Driver->>Simple: set slice_size/max_window_size
        Client->>Driver: start_matching(data)
        Driver->>Simple: start_matching(...)
        Simple->>Driver: emit Sequence(s)
    end
    Driver->>Client: sequences / matches
```
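The level-to-backend dispatch in the diagram can be sketched as follows; `MatcherBackend` and the exact variant mapping are assumptions read off the diagram, not the crate's real types.

```rust
// Hypothetical sketch of the dispatch in the sequence diagram. Names
// (MatcherBackend, MatchGeneratorDriver) mirror the diagram, not
// necessarily the real items in match_generator.rs.
#[derive(Debug, PartialEq)]
enum MatcherBackend {
    Simple,
    Dfast,
}

enum CompressionLevel {
    Uncompressed,
    Fastest,
    Default,
}

struct MatchGeneratorDriver {
    backend: MatcherBackend,
}

impl MatchGeneratorDriver {
    /// reset(level): pick the matcher backend before any matching starts.
    fn reset(level: CompressionLevel) -> Self {
        let backend = match level {
            // Default gets the dfast matcher; everything else stays simple.
            CompressionLevel::Default => MatcherBackend::Dfast,
            CompressionLevel::Uncompressed | CompressionLevel::Fastest => MatcherBackend::Simple,
        };
        MatchGeneratorDriver { backend }
    }
}

fn main() {
    assert_eq!(MatchGeneratorDriver::reset(CompressionLevel::Default).backend, MatcherBackend::Dfast);
    assert_eq!(MatchGeneratorDriver::reset(CompressionLevel::Fastest).backend, MatcherBackend::Simple);
}
```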
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed

❌ Failed checks (1 warning)

✅ Passed checks (4 passed)
Codecov Report❌ Patch coverage is
Pull request overview
Implements support for CompressionLevel::Default by routing it through the existing encoder (instead of panicking), configuring the matcher/window for Default-level frames, and adding regression tests for Rust↔C zstd roundtrip/interop.
Changes:
- Route `CompressionLevel::Default` through the existing `compress_fastest` path.
- Configure the match generator to use a larger window for Default-level frames.
- Add regression tests for Default-level roundtrip and Rust-compress→C-decompress interop.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| `zstd/src/encoding/frame_compressor.rs` | Treat Default the same as Fastest during block compression. |
| `zstd/src/encoding/match_generator.rs` | Adjust matcher reset behavior to configure slice/window sizing for Default. |
| `zstd/src/encoding/mod.rs` | Remove “UNIMPLEMENTED” marker from `CompressionLevel::Default` docs. |
| `zstd/src/tests/roundtrip_integrity.rs` | Add a Default-level roundtrip regression test/helper. |
| `zstd/tests/cross_validation.rs` | Add Rust Default-compress → C zstd decompress regression coverage. |
Comments suppressed due to low confidence (1)
`zstd/src/encoding/frame_compressor.rs:204`

`CompressionLevel::Default` is now handled by `compress_fastest`, which is documented elsewhere as level-1-ish. This makes the `CompressionLevel::Default` docs (“roughly equivalent to Zstd level 3”) and the PR/issue intent potentially misleading. Either update the documentation/PR scope to reflect that Default currently reuses the Fastest encoder with a larger window, or introduce a distinct Default implementation so the semantic contract matches the docs.
```rust
}
CompressionLevel::Fastest | CompressionLevel::Default => {
    compress_fastest(&mut self.state, last_block, uncompressed_data, output)
}
_ => {
    unimplemented!();
}
```
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 39-45: The match in configure_for_level sets
match_generator.max_window_size using a wildcard arm which silently applies the
slice_size to Uncompressed, Fastest, and any future CompressionLevel variants;
update configure_for_level to match CompressionLevel explicitly (e.g., handle
Default, Uncompressed, Fastest) and remove the broad `_` arm so the compiler
will force you to decide window sizes for new variants—either assign the
intended window or add an explicit unreachable/panic with a comment if a variant
should never reach the matcher—referencing configure_for_level and
match_generator.max_window_size to locate the change.
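A minimal sketch of the suggested fix, assuming illustrative types and sizes (the real `configure_for_level` and `MatchGenerator` live in `match_generator.rs`; the 4 MB figure comes from the PR summary):

```rust
// Sketch: replace the wildcard `_` arm with explicit arms so that adding
// a new CompressionLevel variant becomes a compile error here instead of
// silently inheriting the slice_size window.
#[derive(Clone, Copy)]
enum CompressionLevel {
    Uncompressed,
    Fastest,
    Default,
}

struct MatchGenerator {
    slice_size: usize,
    max_window_size: usize,
}

fn configure_for_level(mg: &mut MatchGenerator, level: CompressionLevel) {
    mg.max_window_size = match level {
        // Default gets a larger window (illustrative 4 MB, per the PR summary).
        CompressionLevel::Default => 4 * 1024 * 1024,
        // Explicit arms instead of `_`: the window stays at slice_size here.
        CompressionLevel::Uncompressed | CompressionLevel::Fastest => mg.slice_size,
    };
}

fn main() {
    let mut g = MatchGenerator { slice_size: 128 * 1024, max_window_size: 0 };
    configure_for_level(&mut g, CompressionLevel::Default);
    assert_eq!(g.max_window_size, 4 * 1024 * 1024);
    configure_for_level(&mut g, CompressionLevel::Fastest);
    assert_eq!(g.max_window_size, 128 * 1024);
}
```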
In `@zstd/src/tests/roundtrip_integrity.rs`:
- Around line 200-204: Add a new test that exercises multi-block compression for
CompressionLevel::Default by generating compressible data larger than 128KB
(e.g., 256KB or 512KB) using generate_compressible(...) and asserting roundtrip
equality with roundtrip_default(&data); name it something like
roundtrip_default_multi_block to complement roundtrip_default_level_regression
so window/block-boundary behavior is covered.
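The suggested test could look roughly like this; `generate_compressible` and `roundtrip_default` are stand-ins here (the real helpers live in `roundtrip_integrity.rs`), with the roundtrip mocked as identity so the sketch runs standalone:

```rust
// Sketch of the suggested multi-block regression test. The real
// generate_compressible / roundtrip_default helpers live in
// zstd/src/tests/roundtrip_integrity.rs; roundtrip_default is mocked
// as identity here so the sketch compiles on its own.
fn generate_compressible(len: usize) -> Vec<u8> {
    // Repeating pattern with a slow drift: highly compressible, not trivial.
    (0..len).map(|i| ((i / 64) % 251) as u8).collect()
}

fn roundtrip_default(data: &[u8]) -> Vec<u8> {
    // Stand-in: the real helper compresses at CompressionLevel::Default
    // and decompresses the result.
    data.to_vec()
}

fn roundtrip_default_multi_block() {
    // 256 KB > the 128 KB block size, so compression spans multiple
    // blocks and exercises window/block-boundary behavior.
    let data = generate_compressible(256 * 1024);
    assert_eq!(roundtrip_default(&data), data);
}

fn main() {
    roundtrip_default_multi_block();
}
```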
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 1eabad60-a5ad-48d5-a434-8ac6ec83e75c
📒 Files selected for processing (5)
- zstd/src/encoding/frame_compressor.rs
- zstd/src/encoding/match_generator.rs
- zstd/src/encoding/mod.rs
- zstd/src/tests/roundtrip_integrity.rs
- zstd/tests/cross_validation.rs
💤 Files with no reviewable changes (1)
- zstd/src/encoding/mod.rs
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 42-50: The match in level_config currently returns identical
(self.base_slice_size, self.base_window_size) for every CompressionLevel
variant; leave the explicit arms but add a concise inline comment above or
beside the match (or above the repeated arm block) stating that all arms
intentionally return the same values for now and that explicit arms are kept to
force future reviewers to make an intentional choice when new CompressionLevel
variants are added; reference the level_config method and the CompressionLevel
variants (Uncompressed, Fastest, Default, Better, Best) so the intent is clear
to future contributors.
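A sketch of the requested comment placement, with illustrative sizes and the `Matcher`/`level_config` names taken from the review comment:

```rust
// Sketch: keep explicit per-variant arms in level_config and document
// why they all currently return the same values.
#[derive(Clone, Copy)]
enum CompressionLevel {
    Uncompressed,
    Fastest,
    Default,
    Better,
    Best,
}

struct Matcher {
    base_slice_size: usize,
    base_window_size: usize,
}

impl Matcher {
    fn level_config(&self, level: CompressionLevel) -> (usize, usize) {
        // All arms intentionally return the same values for now. The
        // explicit (non-wildcard) arms are kept so that adding a new
        // CompressionLevel variant forces an intentional choice here.
        match level {
            CompressionLevel::Uncompressed => (self.base_slice_size, self.base_window_size),
            CompressionLevel::Fastest => (self.base_slice_size, self.base_window_size),
            CompressionLevel::Default => (self.base_slice_size, self.base_window_size),
            CompressionLevel::Better => (self.base_slice_size, self.base_window_size),
            CompressionLevel::Best => (self.base_slice_size, self.base_window_size),
        }
    }
}

fn main() {
    let m = Matcher { base_slice_size: 1 << 17, base_window_size: 1 << 22 };
    assert_eq!(m.level_config(CompressionLevel::Default), (1 << 17, 1 << 22));
}
```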
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 696d85d0-deb6-4cab-963b-f5825d8048be
📒 Files selected for processing (3)
- zstd/src/encoding/match_generator.rs
- zstd/src/encoding/mod.rs
- zstd/src/tests/roundtrip_integrity.rs
Force-pushed 06bd5d2 to 6c50fea (Compare)
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'structured-zstd vs C FFI'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.15.
| Benchmark suite | Current: 6c50fea | Previous: 98894d0 | Ratio |
|---|---|---|---|
| compress/pure_rust/fastest | 38.135 ms | 18.51 ms | 2.06 |
This comment was automatically generated by workflow using github-action-benchmark.
CC: @polaz
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 546-601: start_matching currently clears short_hash and long_hash
then calls seed_history_tables over the entire concatenated window each time a
block is processed, which causes repeated reseeding and CPU work; change it to
maintain hash tables incrementally by removing the full clear/seed workflow and
instead only insert positions for newly appended bytes (use insert_position /
insert_positions) and update seed_history_tables or an incremental seeding
helper for the new block range so you avoid reprocessing the whole window; keep
start_matching, seed_history_tables, short_hash, long_hash, insert_position, and
insert_positions as anchors when refactoring so the logic that previously
cleared and reseeded can be replaced with incremental updates for the appended
block.
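A toy illustration of the incremental seeding being proposed, using a simplified 4-byte hash and a single table in place of the real `short_hash`/`long_hash`:

```rust
// Sketch: instead of clearing the hash tables and reseeding over the
// whole concatenated window on every start_matching call, insert only
// the positions of newly appended bytes. The 4-byte toy hash stands in
// for the matcher's real hash functions.
use std::collections::HashMap;

struct Tables {
    short_hash: HashMap<u32, Vec<usize>>,
    seeded_up_to: usize, // absolute position already inserted
}

fn toy_hash(window: &[u8], pos: usize) -> u32 {
    u32::from_le_bytes(window[pos..pos + 4].try_into().unwrap())
}

fn insert_positions(t: &mut Tables, window: &[u8], from: usize, to: usize) {
    // Only positions with 4 hashable bytes available are inserted.
    for pos in from..to.min(window.len().saturating_sub(3)) {
        t.short_hash.entry(toy_hash(window, pos)).or_default().push(pos);
    }
}

fn start_matching(t: &mut Tables, window: &[u8]) {
    // Incremental: seed only the range appended since the last call,
    // rather than clear() + full reseed of the window.
    let from = t.seeded_up_to;
    insert_positions(t, window, from, window.len());
    t.seeded_up_to = window.len();
}

fn main() {
    let mut t = Tables { short_hash: HashMap::new(), seeded_up_to: 0 };
    let mut window = b"abcdabcd".to_vec();
    start_matching(&mut t, &window);
    let keys_before = t.short_hash.len();
    window.extend_from_slice(b"abcd");
    start_matching(&mut t, &window); // only the new tail is hashed
    assert!(t.short_hash.len() >= keys_before);
    assert_eq!(t.seeded_up_to, 12);
}
```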
- Around line 698-727: The long_candidates loop lacks the same guard as
short_candidates against candidate_pos >= abs_pos; add an explicit skip in the
long_candidates iteration (before computing common_prefix_len) similar to the
short_candidates loop so candidate_pos >= abs_pos continues to the next
candidate, then proceed to compute match_len, call extend_backwards, update best
via Self::better_candidate, and keep the early-return when best.match_len >=
DFAST_TARGET_LEN; this ensures long_candidates behaves safely like
short_candidates and prevents invalid comparisons even if seeding produces a
non-decreasing position.
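One way the guard could look, with the matcher state stripped down to a plain window and candidate list (names are illustrative):

```rust
// Sketch of the requested guard in the long_candidates loop: skip any
// candidate at or beyond the current absolute position before computing
// a prefix length, mirroring the existing short_candidates guard.
fn common_prefix_len(window: &[u8], a: usize, b: usize) -> usize {
    window[a..].iter().zip(&window[b..]).take_while(|(x, y)| x == y).count()
}

fn best_long_match(window: &[u8], abs_pos: usize, long_candidates: &[usize]) -> Option<(usize, usize)> {
    let mut best: Option<(usize, usize)> = None; // (candidate_pos, match_len)
    for &candidate_pos in long_candidates {
        // Same guard as the short_candidates loop: a candidate at or
        // after abs_pos cannot be a valid back-reference.
        if candidate_pos >= abs_pos {
            continue;
        }
        let match_len = common_prefix_len(window, candidate_pos, abs_pos);
        if best.map_or(true, |(_, l)| match_len > l) {
            best = Some((candidate_pos, match_len));
        }
    }
    best
}

fn main() {
    let window = b"abcabcabc";
    // Candidate 6 == abs_pos 6 must be skipped; 0 and 3 are valid.
    assert_eq!(best_long_match(window, 6, &[6, 0, 3]), Some((0, 3)));
}
```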
In `@zstd/tests/cross_validation.rs`:
- Around line 208-220: Update the documentation in mod.rs to correct the
statement that Default uses the same encoder path and matcher configuration as
Fastest: change the wording to state that CompressionLevel::Default uses
MatcherBackend::Dfast (see match_generator.rs) while CompressionLevel::Fastest
uses MatcherBackend::Simple, so they do not share the same matcher backend; keep
the rest of the doc intact and reference the two backends by name
(MatcherBackend::Dfast and MatcherBackend::Simple) for clarity.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 154cbc1c-09a4-4ca6-9e34-6b540e6a1a40
📒 Files selected for processing (6)
- zstd/src/encoding/blocks/compressed.rs
- zstd/src/encoding/frame_compressor.rs
- zstd/src/encoding/match_generator.rs
- zstd/src/encoding/mod.rs
- zstd/src/tests/roundtrip_integrity.rs
- zstd/tests/cross_validation.rs
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 149-156: The call to dfast_match_generator.as_ref().unwrap() in
window_size (and similar unwraps elsewhere) relies on the Matcher trait
invariant that reset() has been called before using Dfast backend; update the
code to make this invariant explicit: either replace unwrap() with
expect("dfast_match_generator must be Some: reset() must have been called per
Matcher trait contract") to fail with a clear message if violated, and/or add a
brief doc comment on the Matcher trait and on the reset() method (and reference
frame_compressor.rs usage) stating the required call order so the unwraps are
documented and safer to reason about.
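A sketch of the suggested `expect()` message; `DfastMatchGenerator` and `window_size` here are simplified stand-ins for the real driver:

```rust
// Sketch: state the Matcher trait invariant in an expect() message
// instead of a bare unwrap(), so a violated call order fails loudly
// with a diagnosable message.
struct DfastMatchGenerator {
    max_window_size: usize,
}

struct MatchGeneratorDriver {
    dfast_match_generator: Option<DfastMatchGenerator>,
}

impl MatchGeneratorDriver {
    /// Per the Matcher trait contract, reset() must be called before any
    /// other method when the Dfast backend is selected.
    fn reset(&mut self) {
        self.dfast_match_generator = Some(DfastMatchGenerator { max_window_size: 4 << 20 });
    }

    fn window_size(&self) -> usize {
        self.dfast_match_generator
            .as_ref()
            .expect("dfast_match_generator must be Some: reset() must have been called per Matcher trait contract")
            .max_window_size
    }
}

fn main() {
    let mut d = MatchGeneratorDriver { dfast_match_generator: None };
    d.reset();
    assert_eq!(d.window_size(), 4 << 20);
}
```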
- Around line 639-643: The call to encode_offset_with_history(candidate.offset
as u32, literals.len() as u32, &mut self.offset_hist) intentionally discards its
return value and is used solely to update self.offset_hist as a side effect; add
a brief inline comment directly above or beside this call stating that the
return value is intentionally ignored and the call exists only for offset
history tracking (mention candidate.offset and literals.len() for context) so
future readers won't mistake the let _ = ... for an oversight.
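The commenting pattern being requested, with `encode_offset_with_history` reduced to a hypothetical stand-in signature:

```rust
// Sketch of the requested comment on an intentionally discarded return
// value; the stand-in function only models "returns a value, mutates
// the history as a side effect".
fn encode_offset_with_history(offset: u32, _lit_len: u32, hist: &mut Vec<u32>) -> u32 {
    hist.push(offset); // the side effect we care about: offset history tracking
    offset + 3 // stand-in for the encoded offset value
}

fn main() {
    let mut offset_hist = Vec::new();
    let (offset, literals_len) = (42u32, 7u32);
    // Return value intentionally ignored: this call exists only to push
    // `offset` into `offset_hist`; the encoded value is not needed here.
    let _ = encode_offset_with_history(offset, literals_len, &mut offset_hist);
    assert_eq!(offset_hist, vec![42]);
}
```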
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 365a448d-975e-4a69-b769-683e46735843
📒 Files selected for processing (3)
- zstd/src/encoding/match_generator.rs
- zstd/src/encoding/mod.rs
- zstd/tests/cross_validation.rs
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/encoding/match_generator.rs`:
- Around line 609-614: The boundary tail of the previous block isn't being
hashed enough, so before inserting positions for the new block you must backfill
the previous block's tail: in skip_matching (and similarly in start_matching)
call insert_positions covering the last (MAX_LOOKAHEAD-1) bytes of the previous
block (i.e. start from current_abs_start - (LOOKAHEAD-1) up to current_abs_start
+ current_len) so starts in the final 3 bytes (and up to the last 7 bytes for
long hashes) get hashed; use the existing ensure_hash_tables(), window,
history_abs_start, window_size and insert_positions/insert_position helpers to
perform this backfill.
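A toy sketch of the backfill, assuming a single table and a 4-byte toy hash in place of the real short/long hashes (`MAX_LOOKAHEAD = 4` is illustrative):

```rust
// Sketch of the boundary backfill: before hashing a newly appended
// block, re-insert positions covering the last (MAX_LOOKAHEAD - 1)
// bytes of the previous block, since those starts could not be hashed
// until the next block's bytes arrived.
use std::collections::HashMap;

const MAX_LOOKAHEAD: usize = 4; // toy: 4-byte hash stands in for short/long hashes

fn toy_hash(window: &[u8], pos: usize) -> u32 {
    u32::from_le_bytes(window[pos..pos + MAX_LOOKAHEAD].try_into().unwrap())
}

fn insert_positions(table: &mut HashMap<u32, Vec<usize>>, window: &[u8], from: usize, to: usize) {
    // Only positions with MAX_LOOKAHEAD hashable bytes are inserted.
    let end = to.min(window.len().saturating_sub(MAX_LOOKAHEAD - 1));
    for pos in from..end {
        table.entry(toy_hash(window, pos)).or_default().push(pos);
    }
}

fn start_matching(table: &mut HashMap<u32, Vec<usize>>, window: &[u8], current_abs_start: usize) {
    // Backfill: starts in the final MAX_LOOKAHEAD - 1 bytes of the
    // previous block are hashed now that the new block's bytes exist.
    let from = current_abs_start.saturating_sub(MAX_LOOKAHEAD - 1);
    insert_positions(table, window, from, window.len());
}

fn main() {
    let mut table = HashMap::new();
    let mut window = b"abcdefgh".to_vec();
    insert_positions(&mut table, &window, 0, window.len()); // seed first block: positions 0..=4
    window.extend_from_slice(b"ijkl");
    start_matching(&mut table, &window, 8); // backfills 5..=7, then hashes 8
    let total: usize = table.values().map(Vec::len).sum();
    assert_eq!(total, 9); // positions 0..=8 each hashed exactly once
}
```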
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 83f7340c-1dff-41e6-a40d-69a2101f1417
📒 Files selected for processing (3)
- zstd/src/encoding/match_generator.rs
- zstd/src/encoding/mod.rs
- zstd/tests/cross_validation.rs
Force-pushed 8c0737b to 3595e8f (Compare)
Summary
- Adds a dfast-style matcher for `CompressionLevel::Default` with a 4 MB window, dual hash tables, repeat-offset checks, and lazy match selection
- Keeps `CompressionLevel::Fastest` on the original simple matcher, so `Default` now meaningfully improves compression ratio instead of aliasing `Fastest`
- Fixes an `encode_match_len()` base value bug that surfaced once the new matcher started emitting long matches

Testing

- `cargo nextest run --workspace`
- `cargo build --workspace`
- `cargo nextest run -p structured-zstd -E 'test(default_level_beats_fastest_on_corpus_proxy) or test(default_level_stays_within_ten_percent_of_ffi_level3_on_corpus_proxy)'`

Notes

- Validated on a corpus proxy (`zstd/decodecorpus_files/z000033`) and keeps `Default` within 10% of system `zstd -3` on that input.

Closes #5
Summary by CodeRabbit
New Features

- `Default` compression level now uses the new matcher for improved speed/ratio.

Bug Fixes

Documentation

- Updated `Default` compression level docs to describe its behavior.

Tests