Summary
Huffman decoding is one of the two hottest paths in zstd decompression. The C reference uses several critical optimizations that the Rust implementation lacks.
C reference optimizations (huf_decompress.c)
1. 4-stream parallel decoding
HUF_DecompressFastArgs struct maintains 4 independent bit streams
- Interleaved decode operations hide memory latency: while one stream waits for table lookup, others are being decoded
- x86-64 assembly version (
huf_decompress_amd64.S) uses 14 registers for all 4 streams simultaneously
2. Bulk table initialization
HUF_DEltX1_set4() packs 4 identical table entries using single MEM_write64() call
- For weights producing 16+ entries: unrolled loop with 4x 64-bit writes per iteration
- Table build runs ~4x faster than entry-by-entry initialization
3. Single-symbol vs double-symbol decoding
- X1 (single): 1 symbol per table lookup — compact table, good for high-entropy data
- X2 (double): 2 symbols per table lookup — larger table but 2x fewer lookups for low-entropy data
- Selection heuristic based on maximum weight
Current Rust implementation
huff0/huff0_decoder.rs — single-stream, single-symbol decoding only
- Table initialization is entry-by-entry
- No parallel stream processing
What needs to be implemented
- 4-stream decode loop — process all 4 Huffman streams with interleaved operations
- Bulk table init — use
u64 writes to fill multiple identical entries at once
- Double-symbol decode variant — for high compression ratio data
- Stream selection heuristic — choose X1 vs X2 based on symbol distribution
Performance impact estimate
- 4-stream parallel: ~30-40% speedup on Huffman-heavy data
- Bulk table init: ~10% speedup on table rebuild
- Combined: major contributor to closing the 1.4-3.5x gap
Acceptance criteria
Time estimate
3d
Summary
Huffman decoding is one of the two hottest paths in zstd decompression. The C reference uses several critical optimizations that the Rust implementation lacks.
C reference optimizations (huf_decompress.c)
1. 4-stream parallel decoding
HUF_DecompressFastArgsstruct maintains 4 independent bit streamshuf_decompress_amd64.S) uses 14 registers for all 4 streams simultaneously2. Bulk table initialization
HUF_DEltX1_set4()packs 4 identical table entries using singleMEM_write64()call3. Single-symbol vs double-symbol decoding
Current Rust implementation
huff0/huff0_decoder.rs— single-stream, single-symbol decoding onlyWhat needs to be implemented
u64writes to fill multiple identical entries at oncePerformance impact estimate
Acceptance criteria
Time estimate
3d