
perf(decoding): SIMD wildcopy for literal and match memcpy #68

@polaz

Description


Problem

The decode hot path copies literals and match back-references through scalar `ptr::copy_nonoverlapping` in `ringbuffer.rs` and `sequence_execution.rs`. C zstd uses `ZSTD_wildcopy` — an overwrite-safe SIMD copy that moves 16/32/64 bytes per iteration regardless of actual copy length, relying on the decode buffer having sufficient headroom.

This scalar-vs-SIMD copy gap is the single largest contributor to the 1.4-3.5x decompression speed difference versus C zstd: literal and match copying dominates decode time on most corpora (60-80% of cycles in the sequence-execution loop).

Goal

Replace scalar memcpy in the decode buffer with architecture-dispatched SIMD wildcopy, matching C zstd's `ZSTD_wildcopy8`/`ZSTD_overlapCopy8` approach.

Implementation plan

1. Wildcopy primitive

Add `wildcopy(dst, src, length)` in a new `zstd/src/decoding/simd_copy.rs`:

  • x86-64 SSE2 (baseline): `_mm_storeu_si128` / `_mm_loadu_si128` — 16 bytes/iter
  • x86-64 AVX2: `_mm256_storeu_si256` — 32 bytes/iter
  • x86-64 AVX-512: `_mm512_storeu_si512` — 64 bytes/iter (1 cache line)
  • AArch64 NEON (baseline): `vst1q_u8` / `vld1q_u8` — 16 bytes/iter
  • Scalar fallback: current `ptr::copy_nonoverlapping` with 8-byte stride

Runtime dispatch via `#[target_feature]` + `is_x86_feature_detected!`.
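A minimal sketch of that primitive and its dispatch, with an SSE2 path plus the scalar fallback (names, the `WILDCOPY_OVERSHOOT` value, and the exact signature are illustrative, not the final API):

```rust
use std::ptr;

/// Bytes of headroom the caller must guarantee past `len` on both pointers
/// (illustrative constant; the real value depends on the widest vector path).
const WILDCOPY_OVERSHOOT: usize = 16;

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse2")]
unsafe fn wildcopy_sse2(dst: *mut u8, src: *const u8, len: usize) {
    use std::arch::x86_64::*;
    // 16 bytes per iteration; may overwrite up to 15 bytes past `len`,
    // which is exactly the wildcopy contract.
    let mut i = 0;
    while i < len {
        let v = _mm_loadu_si128(src.add(i) as *const __m128i);
        _mm_storeu_si128(dst.add(i) as *mut __m128i, v);
        i += 16;
    }
}

/// # Safety
/// `src` and `dst` must each be valid for `len + WILDCOPY_OVERSHOOT` bytes
/// and the two ranges must not overlap.
unsafe fn wildcopy(dst: *mut u8, src: *const u8, len: usize) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse2") {
            return wildcopy_sse2(dst, src, len);
        }
    }
    // Scalar fallback: 8-byte stride, same overshoot contract.
    let mut i = 0;
    while i < len {
        ptr::copy_nonoverlapping(src.add(i), dst.add(i), 8);
        i += 8;
    }
}

fn main() {
    let len = 100;
    // Both buffers carry the overshoot headroom the contract requires.
    let src: Vec<u8> = (0..(len + WILDCOPY_OVERSHOOT))
        .map(|i| (i % 251) as u8)
        .collect();
    let mut dst = vec![0u8; len + WILDCOPY_OVERSHOOT];
    unsafe { wildcopy(dst.as_mut_ptr(), src.as_ptr(), len) };
    assert_eq!(&dst[..len], &src[..len]);
    println!("copied {} bytes", len);
}
```

The AVX2/AVX-512/NEON paths would be further `#[target_feature]` variants behind the same dispatch; only the chunk width and intrinsics change.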

2. Overlap copy for short-offset matches

For match offset < copy width (overlapping back-reference), C zstd uses pattern-repeat via shuffle:

  • Offset 1 (RLE): broadcast single byte → vector store
  • Offset 2-7: shuffle mask table → repeat pattern via `_mm_shuffle_epi8` / `vqtbl1q_u8`
  • Offset 8-15: 8-byte load + broadcast
  • Offset ≥ copy width: standard wildcopy
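The pattern-repeat idea can be sketched portably in scalar Rust; a SIMD version would replace the `copy_within` calls with the broadcast/shuffle table described above. Function name and signature here are illustrative:

```rust
/// Copy `len` bytes into `buf[dst..]` from the overlapping back-reference
/// `offset` bytes behind it. Scalar sketch of the pattern-repeat trick:
/// each pass copies the whole already-materialized pattern, so the copied
/// span doubles per iteration instead of advancing one byte at a time.
fn overlap_copy(buf: &mut [u8], dst: usize, offset: usize, len: usize) {
    assert!(offset >= 1 && dst >= offset && dst + len <= buf.len());
    let start = dst - offset; // first byte of the repeating pattern
    let end = dst + len;
    let mut d = dst;
    while d < end {
        // [start, d) already holds whole repetitions of the pattern and
        // (d - start) stays a multiple of `offset`, so copying any prefix
        // of it to `d` extends the repetition correctly.
        let chunk = (d - start).min(end - d);
        buf.copy_within(start..start + chunk, d);
        d += chunk;
    }
}

fn main() {
    // offset 3: repeat the pattern [1, 2, 3] over the next ten bytes
    let mut buf = vec![0u8; 13];
    buf[..3].copy_from_slice(&[1, 2, 3]);
    overlap_copy(&mut buf, 3, 3, 10);
    assert_eq!(buf, [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1]);

    // offset 1 is the RLE case: a single-byte broadcast
    let mut rle = vec![0u8; 6];
    rle[0] = 7;
    overlap_copy(&mut rle, 1, 1, 5);
    assert_eq!(rle, [7, 7, 7, 7, 7, 7]);
}
```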

3. Ring buffer headroom

Ensure `RingBuffer` allocates `capacity + WILDCOPY_OVERSHOOT` (32 or 64 bytes) so wildcopy can safely overwrite past the logical end. This is how C zstd avoids per-copy length checks.
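A hypothetical shape of the headroom change (field names are illustrative, not the crate's actual `RingBuffer` layout):

```rust
const WILDCOPY_OVERSHOOT: usize = 32; // widest store (AVX2) in this sketch

struct RingBuffer {
    buf: Vec<u8>,
    cap: usize, // logical capacity visible to callers
}

impl RingBuffer {
    fn with_capacity(cap: usize) -> Self {
        // Physical allocation exceeds the logical capacity, so a wildcopy
        // that ends at the logical boundary can still overshoot safely.
        RingBuffer {
            buf: vec![0; cap + WILDCOPY_OVERSHOOT],
            cap,
        }
    }

    fn capacity(&self) -> usize {
        self.cap
    }
}

fn main() {
    let rb = RingBuffer::with_capacity(1 << 16);
    assert_eq!(rb.capacity(), 1 << 16);
    assert_eq!(rb.buf.len(), (1 << 16) + WILDCOPY_OVERSHOOT);
    println!("headroom reserved");
}
```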

4. Integration into sequence execution loop

  • Replace `buffer.push_literals()` in `sequence_execution.rs:29-33` with `wildcopy`
  • Replace `buffer.repeat()` in `sequence_execution.rs:40-44` with overlap-aware wildcopy
  • Keep dispatch to a single predictable branch: offset < 16 → overlap path, offset ≥ 16 → standard wildcopy
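A self-contained sketch of that dispatch (plain slice copies stand in for the SIMD primitives so it runs anywhere; names are illustrative):

```rust
const VEC_WIDTH: usize = 16; // SSE2/NEON register width

/// Match copy as it would look in the sequence-execution loop.
fn copy_match(buf: &mut [u8], dst: usize, offset: usize, len: usize) {
    assert!(offset >= 1 && dst >= offset && dst + len <= buf.len());
    if offset < VEC_WIDTH {
        // Overlapping back-reference: byte-serial here, pattern-repeat
        // shuffles in the SIMD version.
        for i in 0..len {
            buf[dst + i] = buf[dst + i - offset];
        }
    } else {
        // offset >= chunk width: each 16-byte chunk reads only bytes
        // written by earlier chunks, so repetition falls out naturally.
        // (The real wildcopy drops the `min` and relies on overshoot.)
        let mut i = 0;
        while i < len {
            let n = VEC_WIDTH.min(len - i);
            buf.copy_within(dst + i - offset..dst + i - offset + n, dst + i);
            i += VEC_WIDTH;
        }
    }
}

fn main() {
    // offset 16 with len 40: output repeats the 16-byte pattern 2.5 times
    let mut buf: Vec<u8> = (0..16u8).chain(std::iter::repeat(0).take(40)).collect();
    copy_match(&mut buf, 16, 16, 40);
    for i in 16..56 {
        assert_eq!(buf[i], buf[i - 16]);
    }

    // offset 2 takes the overlap path
    let mut small = vec![9u8, 8, 0, 0, 0, 0];
    copy_match(&mut small, 2, 2, 4);
    assert_eq!(small, [9, 8, 9, 8, 9, 8]);
    println!("both paths verified");
}
```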

5. Benchmarks

  • Add microbench for wildcopy at various sizes (8B, 64B, 256B, 4KB, 64KB)
  • Compare Rust SIMD vs Rust scalar vs C zstd on existing bench matrix
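A minimal throughput-harness sketch for the size sweep; the real microbench would live in `zstd/benches/` next to the existing bench matrix, and `copy_from_slice` here is a stand-in for the wildcopy and scalar variants under test:

```rust
use std::hint::black_box;
use std::time::Instant;

/// Returns throughput in GB/s for copying `size` bytes, `iters` times.
fn bench_copy(size: usize, iters: u32) -> f64 {
    let src = vec![0xA5u8; size];
    let mut dst = vec![0u8; size];
    let start = Instant::now();
    for _ in 0..iters {
        // stand-in for the copy variant under test
        dst.copy_from_slice(black_box(&src));
        black_box(&dst);
    }
    let secs = start.elapsed().as_secs_f64();
    (size as f64 * iters as f64) / secs / 1e9
}

fn main() {
    // the size points named in the plan: 8B, 64B, 256B, 4KB, 64KB
    for &size in &[8usize, 64, 256, 4096, 65536] {
        println!("{:>6} B: {:6.2} GB/s", size, bench_copy(size, 20_000));
    }
}
```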

Acceptance criteria

  • SIMD wildcopy used on x86-64 (SSE2 baseline, AVX2 when available) and AArch64 (NEON baseline).
  • Overlap copy handles offsets 1-15 correctly (pattern repeat, not UB).
  • Ring buffer has wildcopy headroom — no bounds checks in inner copy loop.
  • Byte-exact output parity with scalar path on all existing tests.
  • Measurable decode throughput improvement on bench matrix.
  • Scalar fallback for non-SIMD targets compiles and passes tests.

Performance expectations

  • Literal-heavy corpora (logs, JSON): +40-60% decode throughput
  • Match-heavy corpora (binary, compressed): +20-30% decode throughput
  • Small blocks (1-4KB, CoordiNode workload): +15-25% (fixed dispatch cost amortizes less well over short copies)

This single optimization should close ~40-50% of the remaining gap with C zstd.

Files involved

  • `zstd/src/decoding/simd_copy.rs` (new)
  • `zstd/src/decoding/ringbuffer.rs` (headroom allocation, integrate wildcopy)
  • `zstd/src/decoding/sequence_execution.rs` (replace scalar copy calls)
  • `zstd/src/decoding/mod.rs` (module registration)
  • `zstd/benches/` (microbench)

Dependencies

Estimate

3d

Labels

  • P1-high — High priority, core functionality
  • enhancement — New feature or request
  • performance — Performance optimization
