Skip to content

feat: FSE table reuse and offset history optimization in encoder #17

@polaz

Description

@polaz

Summary

The encoder rebuilds FSE tables from scratch for every block and doesn't properly track offset history. The C reference reuses tables across blocks and maintains offset history — both critical for compression ratio.

This work also raises the crate to Rust 2024 edition syntax so the new encoder logic can use current language constructs without carrying edition-2018 compatibility workarounds.

Current Rust state

  • encoding/blocks/compressed.rs:54 — FSE table reuse not implemented (TODO comment)
  • encoding/blocks/compressed.rs:27 — offset history always [1, 4, 8], never updated
  • frame_compressor.rs:47-53FseTables struct has ll_previous/ml_previous/of_previous fields but they're never populated
  • zstd/Cargo.toml / cli/Cargo.toml — crates are still on edition 2018, which blocks Rust 2024 syntax used by the ongoing cleanup around the new FSE logic

C reference behavior

FSE table reuse (ZSTD_selectEncodingType())

Decision heuristic per block:

  1. RLE: All symbols identical → single byte
  2. Repeat: Previous table valid AND nbSeq < 1000 → zero-cost reuse
  3. Basic: Small blocks → use predefined default tables
  4. Compressed: Build new dynamic table → full entropy encoding

Offset history

  • rep[3] maintained across blocks within a frame
  • Updated after every sequence: rep[2]=rep[1], rep[1]=rep[0], rep[0]=current_offset
  • Rep matches encoded as offset 1/2/3 → saves significant bits (0 extra bits vs log2(offset) bits)

What needs to be implemented

  1. Populate *_previous FSE table fields after each block encoding
  2. Table reuse heuristic — check if previous table is close enough to reuse
  3. Offset history tracking — maintain [rep0, rep1, rep2] across blocks
  4. Rep match encoding — encode repeat offsets as 1/2/3 instead of full offset
  5. Exact FSE table header cost accounting — use the exact serialized header bit size instead of a conservative upper bound, so table selection remains correct and interop-safe on multi-block inputs
  6. Raise the crate to Rust 2024 edition syntax so the FSE cleanup can use current language features directly and stop carrying edition-2018 syntax constraints in touched code paths

Performance impact

  • Table reuse: ~2-5% compression ratio improvement (fewer table bytes in output)
  • Offset history: ~5-15% ratio improvement (rep matches are nearly free)

Acceptance criteria

  • FSE tables reused when beneficial
  • Offset history correctly maintained across blocks
  • Rep matches encoded with offset 1/2/3
  • Exact FSE table header cost is accounted for in table selection
  • Crates build cleanly on Rust 2024 edition syntax after the manifest bump
  • Compression ratio improves measurably
  • All roundtrip and interop tests pass

Time estimate

2d 4h

Metadata

Metadata

Assignees

No one assigned

    Labels

    P0-criticalMust do first — blocks everythingP2-mediumMedium priority — important improvementenhancementNew feature or requestperformancePerformance optimization

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions