Conversation
No actionable comments were generated in the recent review. 🎉
📝 Walkthrough
Adds a pure‑Rust FastCOVER trainer, dictionary finalization, and public APIs; fixes COVER segment scoring; introduces Rust vs C-FFI dict‑training benchmarks and a new bench target; and extends the CI benchmark runner/parser to emit/parse REPORT_DICT_TRAIN with a throughput→ms/iter fallback and stricter presence checks.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Runner as Bench Runner (script)
    participant Bench as Bench Binary
    participant Rust as Rust FastCOVER
    participant CFFI as C FFI Trainer
    participant Parser as Report Parser
    participant Report as Generated Report
    Runner->>Bench: run benches (--features dict_builder)
    Bench->>Rust: train_fastcover_raw_from_slice / optimize
    Bench->>CFFI: call C trainer (FFI)
    Rust-->>Bench: rust_train_ms, rust_dict_bytes, score
    CFFI-->>Bench: ffi_train_ms, ffi_dict_bytes
    Bench-->>Runner: emit REPORT_DICT_TRAIN and other REPORT_* lines
    Runner->>Parser: parse raw output (REPORT_* including REPORT_DICT_TRAIN)
    Parser-->>Report: populate dictionary_training_rows and delta tables (throughput or ms/iter fallback)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 5 passed
Pull request overview
Adds a FastCOVER-based (pure Rust) dictionary-training path and integrates it into the Criterion compare_ffi benchmarks + CI benchmark reporting, so the repo can track Rust-vs-C dictionary training deltas alongside the existing compression/decompression matrix.
Changes:
- Implement FastCOVER raw dictionary training (plus tuning/optimization scaffolding) and expose new public APIs for training and finalizing dictionaries.
- Extend `compare_ffi` to benchmark/report dictionary training (dict-train) and update benchmark parsing/report generation to include `REPORT_DICT_TRAIN`.
- Require `dict_builder` for relevant benches and update docs to reflect the new benchmark stage and dictionary capabilities.
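The FastCOVER approach summarized above replaces COVER's exact segment scoring with approximate d-mer counting in a hashed table of 2^f counters. A minimal sketch of that idea follows; every name here (`build_frequency_table`, `score_segment`, the hash choice) is illustrative, not the PR's actual API:

```rust
// Sketch of the FastCOVER frequency-counting idea: hash each d-mer into a
// 2^f-entry counter table, then score candidate segments by the summed
// counts of the d-mers they contain. All names are illustrative.

fn hash_dmer(dmer: &[u8], f: u32) -> usize {
    // FNV-1a-style hash truncated to f bits; any mixing hash works for a sketch.
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in dmer {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    (h & ((1u64 << f) - 1)) as usize
}

fn build_frequency_table(sample: &[u8], d: usize, f: u32) -> Vec<u32> {
    let mut freqs = vec![0u32; 1 << f];
    for dmer in sample.windows(d) {
        freqs[hash_dmer(dmer, f)] += 1;
    }
    freqs
}

fn score_segment(segment: &[u8], d: usize, f: u32, freqs: &[u32]) -> u64 {
    // A segment's score is the summed frequency of the d-mers it contains.
    segment.windows(d).map(|m| freqs[hash_dmer(m, f)] as u64).sum()
}

fn main() {
    let sample = b"abcabcabcxyz".repeat(8);
    let freqs = build_frequency_table(&sample, 4, 10);
    let score = score_segment(b"abcabcab", 4, 10, &freqs);
    assert!(score > 0);
    println!("segment score: {score}");
}
```

Because collisions in the 2^f table only inflate counts approximately, this trades exactness for speed, which is the usual FastCOVER-vs-COVER tradeoff.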
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| zstd/src/dictionary/mod.rs | Exposes FastCOVER options/APIs, updates COVER trainer to operate over in-memory input, and adds dictionary finalization logic + tests. |
| zstd/src/dictionary/fastcover.rs | Adds FastCOVER raw trainer + optimizer and unit tests. |
| zstd/src/dictionary/cover.rs | Updates epoch-based segment selection and fixes k-mer scoring bounds for short segments. |
| zstd/Cargo.toml | Gates compare_ffi and the new bench behind dict_builder and adds the dict_builder_fastcover bench target. |
| zstd/benches/dict_builder_fastcover.rs | Adds a dedicated Criterion bench comparing COVER vs FastCOVER raw training. |
| zstd/benches/compare_ffi.rs | Adds dict-train stage, emits REPORT_DICT_TRAIN, and benchmarks Rust FastCOVER vs C trainer. |
| README.md | Updates dictionary generation documentation to include FastCOVER + finalization APIs. |
| BENCHMARKS.md | Documents the new dict-train stage and updates benchmark invocation flags. |
| .github/scripts/run-benchmarks.sh | Parses/emits dictionary training tables and adds a speed-delta fallback when throughput is unavailable. |
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/dictionary/cover.rs`:
- Around line 70-75: The chained method call creating `segments` is misformatted
and failing `cargo fmt --check`; update the expression that builds `segments`
(the `epoch.chunks(params.segment_size as usize).peekable()` chain) to follow
rustfmt style — either place each method on its own indented line with the dot
leading (e.g. `epoch.chunks(...).peekable()` reformatted across lines) or simply
run `cargo fmt` to apply the canonical formatting so the `segments` binding
compiles with rustfmt rules; ensure you adjust the `params.segment_size as
usize` cast placement if needed to satisfy formatting.
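For reference, a rustfmt-clean version of the kind of chain this comment flags might look as follows. The surrounding data and skip-short-tail logic are stand-ins; only the leading-dot chain style mirrors the review comment:

```rust
// Illustrative rustfmt-style chain: each method of a long chain goes on its
// own indented line with the dot leading. Logic here is a stand-in.
fn collect_segments(epoch: &[u8], segment_size: u32) -> Vec<Vec<u8>> {
    let mut segments = Vec::new();
    let mut chunks = epoch
        .chunks(segment_size as usize)
        .peekable();
    while let Some(chunk) = chunks.next() {
        // Skip a trailing short chunk, as some segment selectors do.
        if chunks.peek().is_none() && chunk.len() < segment_size as usize {
            break;
        }
        segments.push(chunk.to_vec());
    }
    segments
}

fn main() {
    let epoch = [0u8; 10];
    // Two full 4-byte segments; the trailing 2-byte chunk is dropped.
    assert_eq!(collect_segments(&epoch, 4).len(), 2);
}
```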
In `@zstd/src/dictionary/fastcover.rs`:
- Around line 103-130: The function signature for train_fastcover_raw is
misformatted for rustfmt; reformat it to follow Rust style (wrap parameters
across lines and place the return type and opening brace on their own line) so
cargo fmt passes — e.g. format the pub fn train_fastcover_raw(sample: &[u8],
dict_size: usize, params: FastCoverParams) -> Vec<u8> declaration to a
rustfmt-friendly multi-line signature and keep the body calling
build_raw_dict(sample, dict_size, params) unchanged; running cargo fmt/rustfmt
after the change should verify the fix.
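The multi-line signature shape the comment describes looks like this. The signature matches the one quoted above; `FastCoverParams` and the body are stubs for illustration only:

```rust
// The declaration wrapped the way rustfmt emits long signatures: one
// parameter per line, return type and opening brace on the closing line.
#[derive(Default)]
pub struct FastCoverParams {
    pub k: u32,
    pub d: u32,
    pub f: u32,
}

pub fn train_fastcover_raw(
    sample: &[u8],
    dict_size: usize,
    params: FastCoverParams,
) -> Vec<u8> {
    // Stub body: a real trainer selects high-scoring segments; taking the
    // sample tail is a common raw-content fallback for tiny inputs.
    let _ = params;
    let start = sample.len().saturating_sub(dict_size);
    sample[start..].to_vec()
}

fn main() {
    let dict = train_fastcover_raw(b"hello world", 5, FastCoverParams::default());
    assert_eq!(dict, b"world".to_vec());
}
```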
In `@zstd/src/dictionary/mod.rs`:
- Around line 181-184: The code now allocates Vec::with_capacity(source_size)
and calls source.read_to_end(&mut all), which forces the entire training corpus
into memory; revert to a streaming approach or explicitly document the memory
constraint. Either replace read_to_end with a BufReader-based streaming read
(using BufReader around source and processing chunks iteratively instead of
collecting into all) or add a comment/Function-level documentation next to the
Vec/all/read_to_end code that states large corpora will be fully buffered and
recommend using a streaming API; reference the symbols Vec::with_capacity, all,
source.read_to_end, source_size and BufReader when making the change.
- Around line 390-439: Both create_fastcover_raw_dict_from_source and
create_fastcover_dict_from_source duplicate reading the source, empty-check, and
branching on fastcover.optimize; extract the shared training logic into a helper
(e.g., train_fastcover_internal(sample: &[u8], dict_size: usize, options:
&FastCoverOptions) -> (Vec<u8>, FastCoverTuned)) that calls
fastcover::optimize_fastcover_raw or fastcover::train_fastcover_raw and returns
(raw_dict, tuned), then have both public functions read and validate the source,
call train_fastcover_internal, call finalize_raw_dict(final_raw.as_slice(),
sample.as_slice(), dict_size, finalize) as before, write output, and return the
tuned struct.
- Around line 30-50: Reorder the use/import block so crate-internal modules are
grouped and sorted consistently (e.g., BitWriter,
decoding::dictionary::MAGIC_NUM, fse::fse_encoder, huff0::HuffmanTable,
huff0::huff0_encoder::{HuffmanEncoder, HuffmanTable},
dictionary::reservoir::create_sample, cover::*, fastcover::... ) and place std
imports (boxed::Box, collections::..., fs::..., io::..., path::..., vec::Vec) in
their own group; move the create_sample import into the crate-internal group,
run cargo fmt to apply rustfmt ordering rules, and ensure symbols like
BitWriter, DICT_MAGIC_NUM, fse_encoder, HuffmanDecoderTable, HuffmanEncoder,
create_sample, and DEFAULT_D_CANDIDATES remain correctly referenced after
reordering.
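The BufReader-based streaming alternative suggested for the mod.rs `read_to_end` comment can be sketched like this; `for_each_chunk` and the 4096-byte chunk size are assumptions, not the crate's API:

```rust
use std::io::{BufReader, Read};

// Sketch of the streaming approach: process the corpus in fixed-size chunks
// through a BufReader instead of buffering the whole source with read_to_end.
// `process_chunk` stands in for whatever the trainer does per block.
fn for_each_chunk<R: Read>(
    source: R,
    chunk_size: usize,
    mut process_chunk: impl FnMut(&[u8]),
) -> std::io::Result<u64> {
    let mut reader = BufReader::new(source);
    let mut buf = vec![0u8; chunk_size];
    let mut total = 0u64;
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break; // EOF
        }
        process_chunk(&buf[..n]);
        total += n as u64;
    }
    Ok(total)
}

fn main() -> std::io::Result<()> {
    // An in-memory source stands in for a file; only `chunk_size` bytes are
    // resident per iteration, regardless of corpus size.
    let corpus = vec![7u8; 10_000];
    let mut seen = 0usize;
    let total = for_each_chunk(&corpus[..], 4096, |chunk| seen += chunk.len())?;
    assert_eq!(total, 10_000);
    assert_eq!(seen, 10_000);
    Ok(())
}
```

If the trainer fundamentally needs the whole corpus at once, the comment's other option (documenting the full buffering) is the honest fallback.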
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 37e2cb36-e61f-41fa-aa4c-1e5637716601
📒 Files selected for processing (9)
.github/scripts/run-benchmarks.sh, BENCHMARKS.md, README.md, zstd/Cargo.toml, zstd/benches/compare_ffi.rs, zstd/benches/dict_builder_fastcover.rs, zstd/src/dictionary/cover.rs, zstd/src/dictionary/fastcover.rs, zstd/src/dictionary/mod.rs
Codecov Report: ❌ Patch coverage is
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/scripts/run-benchmarks.sh:
- Around line 244-273: The dict-train parsing block (dict_train_match ->
dictionary_training_rows) must not inherit compression's scenario_input_bytes
for throughput math; stop populating or referencing scenario_input_bytes for
dict-train rows and instead add a dedicated "training_bytes" field (set to None
if the report doesn't include it) so downstream throughput calculations
skip/avoid computing rust_bytes_per_sec/ffi_bytes_per_sec for dict-train. Update
the downstream throughput/printing logic to check for "training_bytes" (or its
absence) before computing bytes/sec and only use REPORT_DICT_TRAIN-provided
training_bytes if you add that field; if you need absolute throughput later, add
training_bytes to REPORT_DICT_TRAIN and consume it here. Ensure changes
reference dict_train_match, dictionary_training_rows, and the throughput
computation code that reads scenario_input_bytes.
In `@zstd/src/dictionary/fastcover.rs`:
- Around line 128-179: optimize_fastcover_raw currently panics for 1-byte
samples because split_idx.clamp(1, sample.len().saturating_sub(1)) can produce
an invalid range and it may return an empty/zeroed winner when all candidates
score 0; modify optimize_fastcover_raw to first handle tiny samples (e.g., if
sample.len() < 2 return early or set split_idx = 1 only when sample.len() > 1)
so the clamp bounds are valid, and seed the "best" and "best_dict" with the
first evaluated candidate inside the nested loops (compute
params/build_raw_dict/coverage_score for the first f/d/k triple and assign
regardless of score) so there is always a concrete dict and params returned; add
regression tests calling train_fastcover_raw_from_slice()/optimize_fastcover_raw
with a 1-byte sample and a short sample where eval split is smaller than the
minimum d to ensure no panic and a non-empty dict is produced.
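The tiny-sample guard this comment asks for reduces to validating the split before clamping. A sketch under the assumption of a 3/4 train/eval split (the ratio is illustrative):

```rust
// Compute a train/eval split only when the sample is big enough, and never
// produce an inverted range. Returning None tells the caller to train on the
// whole sample (or fall back) instead of panicking.
fn split_point(sample_len: usize) -> Option<usize> {
    if sample_len < 2 {
        // The 1-byte case that used to panic: too small to split.
        return None;
    }
    let split_idx = (sample_len * 3) / 4;
    // clamp(1, len - 1) is now safe: len >= 2 guarantees 1 <= len - 1.
    Some(split_idx.clamp(1, sample_len - 1))
}

fn main() {
    assert_eq!(split_point(0), None);
    assert_eq!(split_point(1), None);
    assert_eq!(split_point(2), Some(1));
    assert_eq!(split_point(100), Some(75));
}
```

Seeding `best`/`best_dict` with the first evaluated candidate (as the comment suggests) then guarantees a concrete result even when every candidate scores 0.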
In `@zstd/src/dictionary/mod.rs`:
- Around line 327-330: Reject dict_id == 0 before serializing: after computing
dict_id (from options.dict_id or derive_dict_id(raw_content)) check if dict_id
== 0 and return an early error instead of writing to out; mirror the existing
behavior used by FrameCompressor::set_dictionary which returns
DictionaryDecodeError::ZeroDictionaryId so propagate the same error on
encode/serialization paths (check both when options.dict_id is provided and when
derived via derive_dict_id).
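The zero-ID rejection described above can be sketched as a single resolution step that runs before any bytes are written. The error type and the checksum-style `derive_dict_id` below are stand-ins, not the crate's real symbols:

```rust
// Reject a zero dictionary ID before serialization, whether the ID was
// supplied explicitly or derived from content.
#[derive(Debug, PartialEq)]
enum DictError {
    ZeroDictionaryId,
}

fn derive_dict_id(raw_content: &[u8]) -> u32 {
    // Stand-in derivation: a trivial checksum; the real crate hashes content.
    raw_content
        .iter()
        .fold(0u32, |acc, &b| acc.wrapping_mul(31).wrapping_add(b as u32))
}

fn resolve_dict_id(explicit: Option<u32>, raw_content: &[u8]) -> Result<u32, DictError> {
    let dict_id = explicit.unwrap_or_else(|| derive_dict_id(raw_content));
    if dict_id == 0 {
        // Mirrors the ZeroDictionaryId behavior the comment cites from
        // FrameCompressor::set_dictionary.
        return Err(DictError::ZeroDictionaryId);
    }
    Ok(dict_id)
}

fn main() {
    assert_eq!(resolve_dict_id(Some(0), b"abc"), Err(DictError::ZeroDictionaryId));
    assert_eq!(resolve_dict_id(Some(7), b"abc"), Ok(7));
    // An empty corpus derives 0 here, so the derived path is rejected too.
    assert_eq!(resolve_dict_id(None, b""), Err(DictError::ZeroDictionaryId));
}
```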
ℹ️ Review info (Run ID: 868b26a4-86cf-42d7-b1d2-eab09d92da77)
📒 Files selected for processing (5)
.github/scripts/run-benchmarks.sh, zstd/benches/compare_ffi.rs, zstd/src/dictionary/cover.rs, zstd/src/dictionary/fastcover.rs, zstd/src/dictionary/mod.rs
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'structured-zstd vs C FFI'.
The benchmark result of this commit is worse than the previous result, exceeding the 1.15 threshold.
| Benchmark suite | Current: c30e9ba | Previous: b280757 | Ratio |
|---|---|---|---|
| compress/better/small-1k-random/matrix/c_ffi | 0.11 ms | 0.092 ms | 1.20 |
| compress/best/small-1k-random/matrix/pure_rust | 0.296 ms | 0.249 ms | 1.19 |
| compress/default/small-4k-log-lines/matrix/c_ffi | 0.021 ms | 0.018 ms | 1.17 |
| compress/better/small-4k-log-lines/matrix/pure_rust | 0.152 ms | 0.124 ms | 1.23 |
| compress/better/small-4k-log-lines/matrix/c_ffi | 0.099 ms | 0.077 ms | 1.29 |
| compress/best/small-4k-log-lines/matrix/pure_rust | 0.258 ms | 0.203 ms | 1.27 |
| compress/best/small-4k-log-lines/matrix/c_ffi | 0.368 ms | 0.28 ms | 1.31 |
| compress/fastest/low-entropy-1m/matrix/c_ffi | 0.189 ms | 0.142 ms | 1.33 |
| compress/fastest/large-log-stream/matrix/c_ffi | 2.726 ms | 2.149 ms | 1.27 |
| compress/best/large-log-stream/matrix/c_ffi | 15.06 ms | 12.903 ms | 1.17 |
| decompress/fastest/small-10k-random/rust_stream/matrix/c_ffi | 0.002 ms | 0.001 ms | 2 |
| decompress/fastest/high-entropy-1m/c_stream/matrix/c_ffi | 0.035 ms | 0.027 ms | 1.30 |
| decompress/default/high-entropy-1m/c_stream/matrix/c_ffi | 0.037 ms | 0.027 ms | 1.37 |
| decompress/best/high-entropy-1m/c_stream/matrix/c_ffi | 0.172 ms | 0.038 ms | 4.53 |
This comment was automatically generated by workflow using github-action-benchmark.
CC: @polaz
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/scripts/run-benchmarks.sh:
- Around line 293-298: The current check for missing dictionary_training_rows
only prints a warning; change it to fail fast by replacing the print-only
behavior in the if not dictionary_training_rows: block so the script exits with
a non-zero status (e.g., call sys.exit(1)) after writing the same error message
to stderr; locate and update the if not dictionary_training_rows: conditional in
.github/scripts/run-benchmarks.sh to preserve the stderr output and then exit to
ensure CI fails when REPORT_DICT_TRAIN rows are missing.
ℹ️ Review info (Run ID: ca663b82-0133-4187-8e96-f731d78116f3)
📒 Files selected for processing (3)
.github/scripts/run-benchmarks.sh, zstd/src/dictionary/fastcover.rs, zstd/src/dictionary/mod.rs
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/dictionary/fastcover.rs`:
- Around line 35-41: The code currently clamps table bits inside
clamp_table_bits and build_frequency_table but trains, scores, and records using
the original raw params (e.g., build_raw_dict, params.d, k, FastCoverTuned), so
normalize/replace params once up front: in the entrypoints (including the
non-optimized wrapper in mod.rs) compute normalized params.d (clamped to 4..32),
params.f (clamped via clamp_table_bits to 8..20) and adjust k to match the
normalized d, then pass these normalized values into build_frequency_table,
build_raw_dict, the optimizer scoring path, and when constructing FastCoverTuned
so that training, evaluation, and reporting all use the same clamped/adjusted
parameters.
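The single up-front normalization this comment asks for can be sketched as one pure function applied at the entrypoints, with its result used everywhere afterward. The ranges follow the comment (d clamped to 4..=32, f to 8..=20); the struct and the k-adjustment rule are illustrative:

```rust
// Normalize once, then use the result for training, scoring, and reporting,
// so the tuned params always match what actually ran.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Params {
    k: u32,
    d: u32,
    f: u32,
}

fn normalize(params: Params) -> Params {
    let d = params.d.clamp(4, 32);
    let f = params.f.clamp(8, 20);
    // A segment must hold at least one d-mer, so keep k >= d
    // (an assumed adjustment rule for this sketch).
    let k = params.k.max(d);
    Params { k, d, f }
}

fn main() {
    let raw = Params { k: 2, d: 40, f: 25 };
    assert_eq!(normalize(raw), Params { k: 32, d: 32, f: 20 });
    // Already-valid params pass through unchanged.
    let ok = Params { k: 256, d: 8, f: 17 };
    assert_eq!(normalize(ok), ok);
}
```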
In `@zstd/src/dictionary/mod.rs`:
- Around line 405-416: train_fastcover_raw_from_slice currently only rejects
empty sample slices but can return Ok((Vec::new(), tuned)) for tiny samples;
update this function to treat a produced empty dictionary as invalid when
dict_size > 0: after calling train_fastcover_internal (or by checking the dict
produced there), return an Err(io::ErrorKind::InvalidInput, ...) if
dict.is_empty() && dict_size > 0, or alternatively implement the same
tiny-source fallback logic used by create_raw_dict_from_source so small samples
fall back to the tiny-source path; reference train_fastcover_raw_from_slice,
train_fastcover_internal, create_fastcover_raw_dict_from_source,
create_fastcover_dict_from_source, and finalize_raw_dict when adding the check
or fallback.
- Around line 337-345: finalize_raw_dict() now writes Huffman/FSE tables before
the repeat offsets, but Dictionary::decode_dict() still assumes offsets are at
raw[8..20] and content at raw[20..], causing mis-decoding; update
Dictionary::decode_dict() (and any caller like
FrameCompressor::set_dictionary()) to parse the new wire format by first
reading/consuming the serialized Huffman and three FSE tables (as produced by
serialize_huffman_table and serialize_fse_table) before reading the three repeat
offsets and the remaining content, or alternatively gate finalize_raw_dict()
behind a feature/API flag until the in-crate parser is updated so local
consumers do not receive finalized dictionaries with mismatched layout.
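The empty-dictionary validation requested in the mod.rs comment above boils down to one post-training check. A sketch, where the trainer closure stands in for `train_fastcover_internal` and the error choice follows the comment's `InvalidInput` suggestion:

```rust
use std::io;

// If the trainer yields an empty dictionary while the caller asked for a
// non-empty one, surface an error instead of Ok with an empty Vec.
fn train_checked(
    dict_size: usize,
    trainer: impl Fn(usize) -> Vec<u8>,
) -> io::Result<Vec<u8>> {
    let dict = trainer(dict_size);
    if dict.is_empty() && dict_size > 0 {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "training produced an empty dictionary for a non-zero dict_size",
        ));
    }
    Ok(dict)
}

fn main() {
    // A trainer that fails on tiny inputs by returning nothing.
    let empty_trainer = |_size: usize| Vec::new();
    assert!(train_checked(1024, empty_trainer).is_err());

    let ok_trainer = |size: usize| vec![0u8; size.min(16)];
    assert_eq!(train_checked(1024, ok_trainer).unwrap().len(), 16);
}
```

The alternative the comment mentions, reusing the tiny-source fallback from `create_raw_dict_from_source`, would replace the error with a degraded-but-valid dictionary.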
ℹ️ Review info (Run ID: ce3585f0-341a-4ce8-a310-d180922a6582)
📒 Files selected for processing (2)
zstd/src/dictionary/fastcover.rs, zstd/src/dictionary/mod.rs
This reverts commit 5f15353.
- extend compare_ffi with dict-train stage and REPORT_DICT_TRAIN lines
- parse and render dictionary training section in benchmark reports
- require dict_builder feature for compare_ffi benchmark and pipeline command
- document new benchmark stage and speed-delta fallback behavior
- treat source_size as hint and document in-memory training behavior
- fix epoch sizing units, bound huffman stats sampling, and share fastcover training helper
- add k-candidate fallback + regression tests
- rename dict-train delta field to match C/Rust ratio semantics
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/dictionary/mod.rs`:
- Around line 427-435: Add a short doc comment to clearly state that these
public Read-based helpers fully buffer the training corpus into memory before
training: update the doc comment for create_fastcover_raw_dict_from_source
(which calls source.read_to_end and then train_fastcover_raw_from_slice) and the
other public Read-based helper that likewise calls read_to_end (the sibling
function around lines 441-449) to say something like "This function fully
buffers the entire training corpus into memory (uses read_to_end) and may
consume large amounts of RAM for large inputs; callers should stream or use
slice-based APIs for very large corpora." Ensure both functions' public docs
contain that exact memory-buffering warning.
- Around line 450-453: The trainer currently calls
train_fastcover_raw_from_slice(..., dict_size, ...) but finalize_raw_dict(...,
dict_size, ...) only allows dict_size - header_bytes for raw content, so
training targets the wrong budget; change the training budget to use the actual
post-finalization content budget by computing the finalized content capacity the
same way finalize_raw_dict does (reuse its header/table sizing logic), e.g.,
add/implement finalized_content_budget(...) and pass that result into
train_fastcover_raw_from_slice instead of dict_size so the raw trainer only
optimizes the bytes that will survive finalization.
ℹ️ Review info (Run ID: 61ab7123-92f2-44a7-b8cb-9ad833e2561f)
📒 Files selected for processing (9)
.github/scripts/run-benchmarks.sh, BENCHMARKS.md, README.md, zstd/Cargo.toml, zstd/benches/compare_ffi.rs, zstd/benches/dict_builder_fastcover.rs, zstd/src/dictionary/cover.rs, zstd/src/dictionary/fastcover.rs, zstd/src/dictionary/mod.rs
Actionable comments posted: 3
♻️ Duplicate comments (3)
zstd/src/dictionary/mod.rs (3)
450-453: ⚠️ Potential issue | 🟠 Major: Train against the post-finalization content budget.
This path tunes raw content with the full `dict_size`, then `finalize_raw_dict()` removes header/table bytes and may reject the request anyway. Successful calls ship a truncated dictionary that FastCOVER never optimized, and impossible budgets still pay the full optimization cost.
Possible direction:

```diff
-let (raw_dict, tuned) =
-    train_fastcover_raw_from_slice(sample.as_slice(), dict_size, fastcover)?;
+let content_budget = finalized_content_budget(sample.as_slice(), dict_size, finalize)?;
+let (raw_dict, tuned) =
+    train_fastcover_raw_from_slice(sample.as_slice(), content_budget, fastcover)?;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@zstd/src/dictionary/mod.rs` around lines 450 - 453, The code currently calls train_fastcover_raw_from_slice with the full dict_size then finalizes, which wastes work and can exceed the post-finalization content budget; change the flow to compute the actual content budget to be used for training (the maximum number of payload bytes allowed after finalize_raw_dict removes headers/tables) before training, check that this content_budget is positive and return the appropriate error if it is impossible, then call train_fastcover_raw_from_slice(sample.as_slice(), content_budget, fastcover) to produce raw_dict tuned to the post-finalize budget and finally call finalize_raw_dict(raw_dict.as_slice(), sample.as_slice(), dict_size, finalize) as before; reference the functions train_fastcover_raw_from_slice and finalize_raw_dict and add the pre-check for an impossible content budget to avoid wasted work.
337-340: ⚠️ Potential issue | 🟠 Major: Sequence entropy tables are still hard-coded.
`finalize_raw_dict()` always emits `default_of_table()`, `default_ml_table()`, and `default_ll_table()`, so only the Huffman side is corpus-trained. Finalized dictionaries will systematically trail C finalization quality because they never learn sample-specific sequence stats.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@zstd/src/dictionary/mod.rs` around lines 337 - 340, finalize_raw_dict() currently emits hard-coded sequence FSE tables (fse_encoder::default_of_table(), default_ml_table(), default_ll_table()), so sequence statistics from sample_data are never used; replace those default tables with FSE tables built from the sample sequence statistics before calling serialize_fse_table. Locate the block where out.extend_from_slice(serialize_fse_table(...)) is called and: compute sequence counts/stats from sample_data (the same sample source used for serialize_huffman_table), build FSE encoder tables from those counts (instead of using default_of_table/default_ml_table/default_ll_table), and pass the generated tables into serialize_fse_table so finalized dictionaries include sample-trained OF/ML/LL sequence tables.
188-199: ⚠️ Potential issue | 🟠 Major: Don't panic on I/O failures in this public builder.
`create_raw_dict_from_source` crashes on read/write errors (lines 193, 202, 257) via `.expect()`, while the newer `create_fastcover_raw_dict_from_source` and `create_fastcover_dict_from_source` properly propagate errors as `io::Result`. Change the return type to `io::Result<()>` and use `?` instead of `.expect()` for consistent error handling.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@zstd/src/dictionary/mod.rs` around lines 188 - 199, The public builder create_raw_dict_from_source currently panics on I/O errors via .expect() calls (e.g., the read_to_end and write_all sites) — change its signature to return io::Result<()> and replace those .expect(...) calls with the ? operator so I/O errors are propagated; also update any other remaining .expect() in create_raw_dict_from_source (including the later write at the tiny-dictionary path) to use ? and ensure callers handle the propagated Result.
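The expect-to-`?` conversion described in that comment can be sketched like this; the signature is illustrative (the real builder also takes training options), and only the error-propagation pattern is the point:

```rust
use std::io::{self, Read, Write};

// The builder returns io::Result<()> and uses `?` so read/write failures
// reach the caller instead of panicking.
fn create_raw_dict_from_source(
    mut source: impl Read,
    mut output: impl Write,
    dict_size: usize,
) -> io::Result<()> {
    let mut all = Vec::new();
    source.read_to_end(&mut all)?; // was: .expect("reading failed")
    // Stand-in "training": keep the tail of the corpus as raw content.
    let start = all.len().saturating_sub(dict_size);
    output.write_all(&all[start..])?; // was: .expect("writing failed")
    Ok(())
}

fn main() -> io::Result<()> {
    let mut out = Vec::new();
    create_raw_dict_from_source(&b"0123456789"[..], &mut out, 4)?;
    assert_eq!(out, b"6789".to_vec());
    Ok(())
}
```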
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/benches/compare_ffi.rs`:
- Around line 197-200: The FFI and Rust trainers are being fed different
corpora: Rust uses a flattened training_blob while the FFI calls still use
sample_refs (preserving original boundaries), so make them identical by choosing
one canonical representation and using it for both trainers; e.g., keep
training_blob (Vec<u8>) and compute sample_offsets/lengths or create slices that
point into training_blob (replace sample_refs with slices derived from
training_blob) and pass those same slices/offsets to the FFI dict-train calls
(the variables to change are training_samples, sample_refs, training_blob,
total_training_bytes and the FFI dict-train invocations near where sample_refs
are passed). Ensure total_training_bytes is computed from the unified
representation and update both trainer call sites to use that unified data.
In `@zstd/src/dictionary/fastcover.rs`:
- Around line 39-43: normalize_fastcover_params currently clamps d/k/f but
leaves accel unchecked so callers and the optimizer can use out-of-range accel
values; update normalize_fastcover_params(FastCoverParams) to clamp params.accel
to 1..=10 (same pattern as params.d and params.f using clamp_table_bits) and
return the clamped struct, then ensure any other normalization or tuning paths
that construct FastCoverTuned or score using accel (the optimizer/scoring code
and the code that builds FastCoverTuned) use that clamped accel value instead of
the raw input so the tuned result never contains accel outside 1..=10.
In `@zstd/src/dictionary/mod.rs`:
- Around line 385-398: The FastCoverParams are normalized into params via
normalize_fastcover_params(FastCoverParams { ... }) but the FastCoverTuned
struct still uses options.accel; change it to use the normalized params.accel
instead. Locate the block that creates params and returns
(fastcover::train_fastcover_raw(...), FastCoverTuned { ... }) and replace the
accel field value from options.accel to params.accel so FastCoverTuned
consistently reflects the normalized parameters (refer to
normalize_fastcover_params, FastCoverParams, params, options.accel, and
FastCoverTuned).
---
Duplicate comments:
In `@zstd/src/dictionary/mod.rs`:
- Around line 450-453: The code currently calls train_fastcover_raw_from_slice
with the full dict_size then finalizes, which wastes work and can exceed the
post-finalization content budget; change the flow to compute the actual content
budget to be used for training (the maximum number of payload bytes allowed
after finalize_raw_dict removes headers/tables) before training, check that this
content_budget is positive and return the appropriate error if it is impossible,
then call train_fastcover_raw_from_slice(sample.as_slice(), content_budget,
fastcover) to produce raw_dict tuned to the post-finalize budget and finally
call finalize_raw_dict(raw_dict.as_slice(), sample.as_slice(), dict_size,
finalize) as before; reference the functions train_fastcover_raw_from_slice and
finalize_raw_dict and add the pre-check for an impossible content budget to
avoid wasted work.
- Around line 337-340: finalize_raw_dict() currently emits hard-coded sequence
FSE tables (fse_encoder::default_of_table(), default_ml_table(),
default_ll_table()), so sequence statistics from sample_data are never used;
replace those default tables with FSE tables built from the sample sequence
statistics before calling serialize_fse_table. Locate the block where
out.extend_from_slice(serialize_fse_table(...)) is called and: compute sequence
counts/stats from sample_data (the same sample source used for
serialize_huffman_table), build FSE encoder tables from those counts (instead of
using default_of_table/default_ml_table/default_ll_table), and pass the
generated tables into serialize_fse_table so finalized dictionaries include
sample-trained OF/ML/LL sequence tables.
- Around line 188-199: The public builder create_raw_dict_from_source currently
panics on I/O errors via .expect() calls (e.g., the read_to_end and write_all
sites) — change its signature to return io::Result<()> and replace those
.expect(...) calls with the ? operator so I/O errors are propagated; also update
any other remaining .expect() in create_raw_dict_from_source (including the
later write at the tiny-dictionary path) to use ? and ensure callers handle the
propagated Result.
ℹ️ Review info (Run ID: 076babeb-74dc-441d-a1f2-a9c33c1d779a)
📒 Files selected for processing (9)
.github/scripts/run-benchmarks.sh, BENCHMARKS.md, README.md, zstd/Cargo.toml, zstd/benches/compare_ffi.rs, zstd/benches/dict_builder_fastcover.rs, zstd/src/dictionary/cover.rs, zstd/src/dictionary/fastcover.rs, zstd/src/dictionary/mod.rs
Actionable comments posted: 1
♻️ Duplicate comments (2)
zstd/src/dictionary/mod.rs (1)
479-492: 🧹 Nitpick | 🔵 Trivial: Use the normalized `accel` in `FastCoverTuned`.
`k`, `d`, and `f` come from `normalize_fastcover_params(...)`, but `accel` still comes from the raw option. That makes the public `FastCoverTuned` result inconsistent with the parameters actually passed to `fastcover::train_fastcover_raw(...)`.
♻️ Proposed fix:

```diff
 (
     fastcover::train_fastcover_raw(sample, dict_size, params),
     FastCoverTuned {
         k: params.k,
         d: params.d,
         f: params.f,
-        accel: options.accel,
+        accel: params.accel,
         score: 0,
     },
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@zstd/src/dictionary/mod.rs` around lines 479 - 492, The FastCoverTuned returned struct uses options.accel instead of the normalized accel from normalize_fastcover_params, causing a mismatch with the params passed to fastcover::train_fastcover_raw; update the FastCoverTuned initializer to set accel: params.accel (and ensure FastCoverParams/normalize_fastcover_params provide accel) so k, d, f, and accel in FastCoverTuned reflect the actual normalized params used for training.
zstd/benches/compare_ffi.rs (1)
197-200: ⚠️ Potential issue | 🟠 Major: Benchmark the same corpus model on both sides.
The Rust path trains on one contiguous `training_blob`, while the C path still trains on `sample_refs` with explicit sample boundaries. That changes both the work performed and the resulting dictionary, so the reported `dict-train` delta is not comparing like-for-like inputs.
Also applies to: 206-208, 221-223, 251-267
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/benches/compare_ffi.rs`:
- Around line 235-243: The REPORT_DICT_TRAIN reported training_bytes is
incorrect because emit_dictionary_training_report is being passed scenario.len()
even though split_training_samples() may truncate samples; update the calls to
compute the actual byte count from the samples returned by
split_training_samples (e.g., sum of lengths of the training sample slices) and
pass that sum as the training_bytes argument to emit_dictionary_training_report
(change the calls around emit_dictionary_training_report where
rust_dictionary/ffi_dictionary are computed; also apply the same fix to the
other occurrence around lines 419-431). Ensure you reference
split_training_samples, emit_dictionary_training_report, and REPORT_DICT_TRAIN
when locating and fixing the code.
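The training_bytes fix above is just "sum the lengths of the samples that actually survive splitting". A sketch, where `split_training_samples` is a simplified stand-in that drops a trailing short sample to mimic the truncation the comment describes:

```rust
// Derive training_bytes from the retained samples, not from the scenario's
// total size, so REPORT_DICT_TRAIN reports what both trainers really saw.
fn split_training_samples(scenario: &[u8], sample_len: usize) -> Vec<&[u8]> {
    scenario
        .chunks(sample_len)
        .filter(|chunk| chunk.len() == sample_len)
        .collect()
}

fn main() {
    let scenario = vec![1u8; 10_000];
    let samples = split_training_samples(&scenario, 4096);
    // Actual bytes fed to both trainers:
    let training_bytes: usize = samples.iter().map(|s| s.len()).sum();
    assert_eq!(training_bytes, 8192); // two full 4096-byte samples survive
    assert_ne!(training_bytes, scenario.len()); // scenario.len() would over-report
}
```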
---
Duplicate comments:
In `@zstd/src/dictionary/mod.rs`:
- Around line 479-492: The FastCoverTuned returned struct uses options.accel
instead of the normalized accel from normalize_fastcover_params, causing a
mismatch with the params passed to fastcover::train_fastcover_raw; update the
FastCoverTuned initializer to set accel: params.accel (and ensure
FastCoverParams/normalize_fastcover_params provide accel) so k, d, f, and accel
in FastCoverTuned reflect the actual normalized params used for training.
ℹ️ Review info (Run ID: 7e2ed0ba-b8ff-4cba-91d5-6cf52e42bc1f)
📒 Files selected for processing (3)
.github/scripts/run-benchmarks.sh, zstd/benches/compare_ffi.rs, zstd/src/dictionary/mod.rs
|
@coderabbitai Re duplicate about |
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/benches/compare_ffi.rs`:
- Around line 206-246: The benchmark compares non-equivalent outputs:
zstd::dict::from_samples returns a finalized dictionary while
train_fastcover_raw_from_slice returns only raw content, so either finalize the
Rust result before measuring and reporting (e.g., pass
rust_dictionary/rust_tuned through the same finalization used by
create_fastcover_dict_from_source or a slice-based wrapper so
rust_dictionary.len() and rust_train_ms reflect the full finalized dict), or
change the emitted metrics in emit_dictionary_training_report to clearly label
these as "raw-content training" and report raw vs finalized sizes separately
(update the fields/labels for rust_dict_bytes vs ffi_dict_bytes and any timing
fields accordingly). Ensure you update the code paths around
train_fastcover_raw_from_slice, rust_dictionary, rust_tuned,
zstd::dict::from_samples and emit_dictionary_training_report so the compared
artifacts are equivalent or the report explicitly notes raw vs finalized.
In `@zstd/benches/dict_builder_fastcover.rs`:
- Around line 39-71: Hoist the FastCoverOptions construction out of the b.iter()
closures so the benches don't include allocation/setup cost: create one
FastCoverOptions instance for the optimized bench (e.g., fastcover_opts =
FastCoverOptions::default()) and one for the fixed bench (e.g.,
fastcover_fixed_opts = FastCoverOptions { optimize:false, accel:4, k:256, d:8,
f:20, ..FastCoverOptions::default() }) before calling c.bench_function, then
pass &fastcover_opts / &fastcover_fixed_opts into
create_fastcover_raw_dict_from_source inside the closures (references or clones
as appropriate) instead of building them inside b.iter; reference symbols:
FastCoverOptions, create_fastcover_raw_dict_from_source, the bench names
"dict_builder/fastcover_raw_opt" and "dict_builder/fastcover_raw_fixed".
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: a8629e61-02e4-4fd1-b930-f0275077209a
📒 Files selected for processing (5)
zstd/benches/compare_ffi.rs
zstd/benches/dict_builder_fastcover.rs
zstd/src/dictionary/fastcover.rs
zstd/src/dictionary/mod.rs
zstd/src/encoding/frame_compressor.rs
Actionable comments posted: 1
♻️ Duplicate comments (2)
zstd/benches/compare_ffi.rs (1)
208-225: ⚠️ Potential issue | 🟠 Major

Benchmark against the post-finalization content budget, not the full `dict_size`.

Lines 209-224 and Lines 268-279 still train FastCOVER with the full `dict_size`, then `finalize_raw_dict()` trims that raw dictionary down to `dict_size - header_bytes`. That means `rust_train_ms` includes work spent optimizing bytes that can never survive into the emitted artifact, so the new dict-train delta is still skewed. Please reuse the same post-finalization budget logic as `create_fastcover_dict_from_source()` before calling `train_fastcover_raw_from_slice()` in both the report path and the Criterion loop.

Also applies to: 266-280
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@zstd/benches/compare_ffi.rs` around lines 208 - 225: the benchmark is training FastCOVER with the pre-finalization dict_size but then finalize_raw_dict trims header bytes, so training work includes bytes that cannot appear in output; compute the post-finalization budget the same way as create_fastcover_dict_from_source (i.e., subtract header/metadata overhead to get effective_dict_size) and pass that effective size into train_fastcover_raw_from_slice instead of dict_size in both the report path (where rust_train_started is measured) and the Criterion loop; ensure the same effective size is used when calling finalize_raw_dict to keep training and finalization budgets consistent (refer to train_fastcover_raw_from_slice, finalize_raw_dict, and create_fastcover_dict_from_source to mirror its budget computation).

zstd/src/dictionary/mod.rs (1)
301-319: ⚠️ Potential issue | 🟠 Major

LL/ML/OF tables are still trained from the wrong symbol space.

Lines 307-316 collapse raw corpus bytes into the FSE code space with `byte % (max_symbol + 1)`, and Line 333 builds the tables from that synthetic stream. That makes the tables data-dependent, but not sequence-dependent: the resulting histograms still have no relationship to the compressor's actual literal-length, match-length, or offset-code frequencies. The dictionaries remain parseable, but this finalized path will still miss the intended compression-quality parity. Please derive these histograms from real emitted LL/ML/OF codes before calling `build_table_from_data`.

Also applies to: 321-335
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@zstd/src/dictionary/mod.rs` around lines 301 - 319, The current bounded_fse_symbols function and its use to create FSE tables (via build_table_from_data) wrongly map raw corpus bytes into the FSE symbol space (byte % (max_symbol+1)), producing histograms that do not reflect actual emitted LL/ML/OF codes; instead, compute the histograms from real emitted literal-length, match-length, and offset-code sequences as produced by the compressor/parser and pass those symbol streams into build_table_from_data. Modify the code paths that call bounded_fse_symbols (and any place that feeds build_table_from_data for LL/ML/OF tables) to generate symbol streams by running the corpus through the same parsing/emission logic that outputs LL/ML/OF codes, then feed those emitted code sequences into build_table_from_data so tables are trained on true sequence-dependent frequencies.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/benches/compare_ffi.rs`:
- Around line 208-219: The benchmark removed the Rust FastCOVER score: keep the
returned rust_tuned from train_fastcover_raw_from_slice (the
Ok((rust_raw_dictionary, _rust_tuned)) pattern) and extract its score
(rust_fastcover_score) to store in DictTrainingMetrics and include it when
emitting REPORT_DICT_TRAIN; specifically, update the pattern that currently
discards rust_tuned to bind it (e.g., rust_tuned), add the rust_fastcover_score
field into the DictTrainingMetrics construction (where metrics are created
around lines ~247-258), and ensure the REPORT_DICT_TRAIN output (around lines
~441-463) appends the rust_fastcover_score field so the external script parsing
REPORT_DICT_TRAIN sees the trailing score value.
---
Duplicate comments:
In `@zstd/benches/compare_ffi.rs`:
- Around line 208-225: The benchmark is training FastCOVER with the
pre-finalization dict_size but then finalize_raw_dict trims header bytes, so
training work includes bytes that cannot appear in output; compute the
post-finalization budget the same way as create_fastcover_dict_from_source
(i.e., subtract header/metadata overhead to get effective_dict_size) and pass
that effective size into train_fastcover_raw_from_slice instead of dict_size in
both the report path (where rust_train_started is measured) and the Criterion
loop; ensure the same effective size is used when calling finalize_raw_dict to
keep training and finalization budgets consistent (refer to
train_fastcover_raw_from_slice, finalize_raw_dict, and
create_fastcover_dict_from_source to mirror its budget computation).
In `@zstd/src/dictionary/mod.rs`:
- Around line 301-319: The current bounded_fse_symbols function and its use to
create FSE tables (via build_table_from_data) wrongly map raw corpus bytes into
the FSE symbol space (byte % (max_symbol+1)), producing histograms that do not
reflect actual emitted LL/ML/OF codes; instead, compute the histograms from real
emitted literal-length, match-length, and offset-code sequences as produced by
the compressor/parser and pass those symbol streams into build_table_from_data.
Modify the code paths that call bounded_fse_symbols (and any place that feeds
build_table_from_data for LL/ML/OF tables) to generate symbol streams by running
the corpus through the same parsing/emission logic that outputs LL/ML/OF codes,
then feed those emitted code sequences into build_table_from_data so tables are
trained on true sequence-dependent frequencies.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: ddbef425-cf43-4e2f-aa78-7f7b152262e2
📒 Files selected for processing (2)
zstd/benches/compare_ffi.rs
zstd/src/dictionary/mod.rs
@coderabbitai Re: duplicate item about post-finalization content budget in zstd/benches/compare_ffi.rs (ranges 208-225 and 266-280) — fixed in commit 8622344. Rust FastCOVER now trains with finalized content budget in both report and Criterion paths.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@zstd/src/dictionary/mod.rs`:
- Around line 589-607: The test finalize_raw_dict_roundtrips_with_ffi_decoder
currently trains raw at the full 4096 target then finalizes into a 4096 budget,
which skips the real-world path; update the test to compute the
post-finalization content budget (use
finalized_content_budget(sample.as_slice(), 4096) or simply call
create_fastcover_dict_from_source(sample.as_slice(),
finalized_content_budget(...)) to obtain a raw trained at the smaller target)
and train the raw with that smaller target via fastcover::train_fastcover_raw
(or call create_fastcover_dict_from_source directly) before calling
finalize_raw_dict so the budgeted finalize path is exercised.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 2efb5ad9-7b5b-4a69-bc20-3602df8da914
📒 Files selected for processing (2)
zstd/benches/compare_ffi.rs
zstd/src/dictionary/mod.rs
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
zstd/src/dictionary/cover.rs:106
- Comment typo: "recieves" should be "receives".
```rust
let kmer: &KMer = (&segment[i..i + K])
    .try_into()
    .expect("Failed to make kmer");
// if the kmer is already in the pool, it recieves a score of zero
if ctx.frequencies.contains_key(kmer) {
    continue;
```
Summary
- `dict-train` benchmark stage added to `compare_ffi` for Rust FastCOVER vs C dictionary training
- `REPORT_DICT_TRAIN` lines emitted and parsed in `.github/scripts/run-benchmarks.sh`
- `accel` used consistently in tuning/scoring/output params
- `io::Result<()>` returned instead of panicking
- `benchmark-report.md` and canonical deltas in `benchmark-delta.md`/`json`
- `dict_builder` feature for the `compare_ffi` bench, and benchmark docs aligned

Validation
- `cargo fmt --all -- --check`
- `cargo clippy --all-targets --features hash,std,dict_builder -- -D warnings`
- `cargo nextest run --workspace --features hash,std,dict_builder`
- `cargo build --workspace --features hash,std,dict_builder`
- `cargo test --doc --workspace --features hash,std,dict_builder`

Same-run numbers (Rust FastCOVER vs C)
- `decodecorpus-z000033`: 2.703 ms vs 23.342 ms (8.6371x)
- `small-10k-random`: 0.079 ms vs 0.702 ms (8.8451x)
- `small-4k-log-lines`: 0.057 ms vs 0.217 ms (3.7868x)

Closes #25
Summary by CodeRabbit
New Features
Documentation
Bug Fixes
Tests