Skip to content

perf: Huffman decoder — 4-stream parallel decoding and table initialization #10

@polaz

Description

@polaz

Summary

Huffman decoding is one of the two hottest paths in zstd decompression. The C reference uses several critical optimizations that the Rust implementation lacks.

C reference optimizations (huf_decompress.c)

1. 4-stream parallel decoding

  • HUF_DecompressFastArgs struct maintains 4 independent bit streams
  • Interleaved decode operations hide memory latency: while one stream waits for table lookup, others are being decoded
  • x86-64 assembly version (huf_decompress_amd64.S) uses 14 registers for all 4 streams simultaneously

2. Bulk table initialization

  • HUF_DEltX1_set4() packs 4 identical table entries using single MEM_write64() call
  • For weights producing 16+ entries: unrolled loop with 4x 64-bit writes per iteration
  • Table build runs ~4x faster than entry-by-entry initialization

3. Single-symbol vs double-symbol decoding

  • X1 (single): 1 symbol per table lookup — compact table, good for high-entropy data
  • X2 (double): 2 symbols per table lookup — larger table but 2x fewer lookups for low-entropy data
  • Selection heuristic based on maximum weight

Current Rust implementation

  • huff0/huff0_decoder.rs — single-stream, single-symbol decoding only
  • Table initialization is entry-by-entry
  • No parallel stream processing

What needs to be implemented

  1. 4-stream decode loop — process all 4 Huffman streams with interleaved operations
  2. Bulk table init — use u64 writes to fill multiple identical entries at once
  3. Double-symbol decode variant — for high compression ratio data
  4. Stream selection heuristic — choose X1 vs X2 based on symbol distribution

Performance impact estimate

  • 4-stream parallel: ~30-40% speedup on Huffman-heavy data
  • Bulk table init: ~10% speedup on table rebuild
  • Combined: major contributor to closing the 1.4-3.5x gap

Acceptance criteria

  • 4-stream parallel Huffman decode implemented
  • Bulk table initialization via u64 writes
  • Benchmark shows measurable improvement on silesia corpus
  • All existing decode tests continue passing
  • Fuzz targets updated

Time estimate

3d

Metadata

Metadata

Assignees

No one assigned

    Labels

    P2-mediumMedium priority — important improvementenhancementNew feature or requestperformancePerformance optimization

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions