
perf(decoding): SIMD wildcopy for literal and match memcpy #68

@polaz

Description


Problem

The decode hot path copies literals and match back-references through scalar `ptr::copy_nonoverlapping` in `ringbuffer.rs` and `sequence_execution.rs`. C zstd uses `ZSTD_wildcopy` — an overwrite-safe SIMD copy that moves 16/32/64 bytes per iteration regardless of actual copy length, relying on the decode buffer having sufficient headroom.

This scalar-vs-SIMD copy gap is the single largest contributor to the 1.4-3.5x decompression speed difference versus C zstd: literal and match copying dominates decode time on most corpora (60-80% of cycles in the sequence-execution loop).

Goal

Replace scalar memcpy in the decode buffer with architecture-dispatched SIMD wildcopy, matching C zstd's `ZSTD_wildcopy8`/`ZSTD_overlapCopy8` approach.

Implementation plan

1. Wildcopy primitive

Add `wildcopy(dst, src, length)` in a new `zstd/src/decoding/simd_copy.rs`:

  • x86-64 SSE2 (baseline): `_mm_storeu_si128` / `_mm_loadu_si128` — 16 bytes/iter
  • x86-64 AVX2: `_mm256_storeu_si256` — 32 bytes/iter
  • x86-64 AVX-512: `_mm512_storeu_si512` — 64 bytes/iter (1 cache line)
  • AArch64 NEON (baseline): `vst1q_u8` / `vld1q_u8` — 16 bytes/iter
  • Scalar fallback: current `ptr::copy_nonoverlapping` with 8-byte stride

Runtime dispatch via `#[target_feature]` + `is_x86_feature_detected!`.
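A minimal sketch of that primitive and its dispatch, with an SSE2 path plus the scalar fallback (names, the `WILDCOPY_OVERSHOOT` value, and the exact signature are illustrative, not the final API):

```rust
use std::ptr;

/// Bytes of headroom the caller must guarantee past `len` on both pointers
/// (illustrative constant; the real value depends on the widest vector path).
const WILDCOPY_OVERSHOOT: usize = 16;

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse2")]
unsafe fn wildcopy_sse2(dst: *mut u8, src: *const u8, len: usize) {
    use std::arch::x86_64::*;
    // 16 bytes per iteration; may overwrite up to 15 bytes past `len`,
    // which is exactly the wildcopy contract.
    let mut i = 0;
    while i < len {
        let v = _mm_loadu_si128(src.add(i) as *const __m128i);
        _mm_storeu_si128(dst.add(i) as *mut __m128i, v);
        i += 16;
    }
}

/// # Safety
/// `src` and `dst` must each be valid for `len + WILDCOPY_OVERSHOOT` bytes
/// and the two ranges must not overlap.
unsafe fn wildcopy(dst: *mut u8, src: *const u8, len: usize) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse2") {
            return wildcopy_sse2(dst, src, len);
        }
    }
    // Scalar fallback: 8-byte stride, same overshoot contract.
    let mut i = 0;
    while i < len {
        ptr::copy_nonoverlapping(src.add(i), dst.add(i), 8);
        i += 8;
    }
}

fn main() {
    let len = 100;
    // Both buffers carry the overshoot headroom the contract requires.
    let src: Vec<u8> = (0..(len + WILDCOPY_OVERSHOOT))
        .map(|i| (i % 251) as u8)
        .collect();
    let mut dst = vec![0u8; len + WILDCOPY_OVERSHOOT];
    unsafe { wildcopy(dst.as_mut_ptr(), src.as_ptr(), len) };
    assert_eq!(&dst[..len], &src[..len]);
    println!("copied {} bytes", len);
}
```

The AVX2/AVX-512/NEON paths would be further `#[target_feature]` variants behind the same dispatch; only the chunk width and intrinsics change.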

2. Overlap copy for short-offset matches

For match offset < copy width (overlapping back-reference), C zstd uses pattern-repeat via shuffle:

  • Offset 1 (RLE): broadcast single byte → vector store
  • Offset 2-7: shuffle mask table → repeat pattern via `_mm_shuffle_epi8` / `vqtbl1q_u8`
  • Offset 8-15: 8-byte load + broadcast
  • Offset ≥ copy width: standard wildcopy
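The pattern-repeat idea can be sketched portably in scalar Rust; a SIMD version would replace the `copy_within` calls with the broadcast/shuffle table described above. Function name and signature here are illustrative:

```rust
/// Copy `len` bytes into `buf[dst..]` from the overlapping back-reference
/// `offset` bytes behind it. Scalar sketch of the pattern-repeat trick:
/// each pass copies the whole already-materialized pattern, so the copied
/// span doubles per iteration instead of advancing one byte at a time.
fn overlap_copy(buf: &mut [u8], dst: usize, offset: usize, len: usize) {
    assert!(offset >= 1 && dst >= offset && dst + len <= buf.len());
    let start = dst - offset; // first byte of the repeating pattern
    let end = dst + len;
    let mut d = dst;
    while d < end {
        // [start, d) already holds whole repetitions of the pattern and
        // (d - start) stays a multiple of `offset`, so copying any prefix
        // of it to `d` extends the repetition correctly.
        let chunk = (d - start).min(end - d);
        buf.copy_within(start..start + chunk, d);
        d += chunk;
    }
}

fn main() {
    // offset 3: repeat the pattern [1, 2, 3] over the next ten bytes
    let mut buf = vec![0u8; 13];
    buf[..3].copy_from_slice(&[1, 2, 3]);
    overlap_copy(&mut buf, 3, 3, 10);
    assert_eq!(buf, [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1]);

    // offset 1 is the RLE case: a single-byte broadcast
    let mut rle = vec![0u8; 6];
    rle[0] = 7;
    overlap_copy(&mut rle, 1, 1, 5);
    assert_eq!(rle, [7, 7, 7, 7, 7, 7]);
}
```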

3. Ring buffer headroom

Ensure `RingBuffer` allocates `capacity + WILDCOPY_OVERSHOOT` (32 or 64 bytes) so wildcopy can safely overwrite past the logical end. This is how C zstd avoids per-copy length checks.
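A hypothetical shape of the headroom change (field names are illustrative, not the crate's actual `RingBuffer` layout):

```rust
const WILDCOPY_OVERSHOOT: usize = 32; // widest store (AVX2) in this sketch

struct RingBuffer {
    buf: Vec<u8>,
    cap: usize, // logical capacity visible to callers
}

impl RingBuffer {
    fn with_capacity(cap: usize) -> Self {
        // Physical allocation exceeds the logical capacity, so a wildcopy
        // that ends at the logical boundary can still overshoot safely.
        RingBuffer {
            buf: vec![0; cap + WILDCOPY_OVERSHOOT],
            cap,
        }
    }

    fn capacity(&self) -> usize {
        self.cap
    }
}

fn main() {
    let rb = RingBuffer::with_capacity(1 << 16);
    assert_eq!(rb.capacity(), 1 << 16);
    assert_eq!(rb.buf.len(), (1 << 16) + WILDCOPY_OVERSHOOT);
    println!("headroom reserved");
}
```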

4. Integration into sequence execution loop

  • Replace `buffer.push_literals()` in `sequence_execution.rs:29-33` with `wildcopy`
  • Replace `buffer.repeat()` in `sequence_execution.rs:40-44` with overlap-aware wildcopy
  • Keep dispatch to a single predictable branch: offset < 16 → overlap path, offset ≥ 16 → standard wildcopy
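A self-contained sketch of that dispatch (plain slice copies stand in for the SIMD primitives so it runs anywhere; names are illustrative):

```rust
const VEC_WIDTH: usize = 16; // SSE2/NEON register width

/// Match copy as it would look in the sequence-execution loop.
fn copy_match(buf: &mut [u8], dst: usize, offset: usize, len: usize) {
    assert!(offset >= 1 && dst >= offset && dst + len <= buf.len());
    if offset < VEC_WIDTH {
        // Overlapping back-reference: byte-serial here, pattern-repeat
        // shuffles in the SIMD version.
        for i in 0..len {
            buf[dst + i] = buf[dst + i - offset];
        }
    } else {
        // offset >= chunk width: each 16-byte chunk reads only bytes
        // written by earlier chunks, so repetition falls out naturally.
        // (The real wildcopy drops the `min` and relies on overshoot.)
        let mut i = 0;
        while i < len {
            let n = VEC_WIDTH.min(len - i);
            buf.copy_within(dst + i - offset..dst + i - offset + n, dst + i);
            i += VEC_WIDTH;
        }
    }
}

fn main() {
    // offset 16 with len 40: output repeats the 16-byte pattern 2.5 times
    let mut buf: Vec<u8> = (0..16u8).chain(std::iter::repeat(0).take(40)).collect();
    copy_match(&mut buf, 16, 16, 40);
    for i in 16..56 {
        assert_eq!(buf[i], buf[i - 16]);
    }

    // offset 2 takes the overlap path
    let mut small = vec![9u8, 8, 0, 0, 0, 0];
    copy_match(&mut small, 2, 2, 4);
    assert_eq!(small, [9, 8, 9, 8, 9, 8]);
    println!("both paths verified");
}
```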

5. Benchmarks

  • Add microbench for wildcopy at various sizes (8B, 64B, 256B, 4KB, 64KB)
  • Compare Rust SIMD vs Rust scalar vs C zstd on existing bench matrix
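A minimal throughput-harness sketch for the size sweep; the real microbench would live in `zstd/benches/` next to the existing bench matrix, and `copy_from_slice` here is a stand-in for the wildcopy and scalar variants under test:

```rust
use std::hint::black_box;
use std::time::Instant;

/// Returns throughput in GB/s for copying `size` bytes, `iters` times.
fn bench_copy(size: usize, iters: u32) -> f64 {
    let src = vec![0xA5u8; size];
    let mut dst = vec![0u8; size];
    let start = Instant::now();
    for _ in 0..iters {
        // stand-in for the copy variant under test
        dst.copy_from_slice(black_box(&src));
        black_box(&dst);
    }
    let secs = start.elapsed().as_secs_f64();
    (size as f64 * iters as f64) / secs / 1e9
}

fn main() {
    // the size points named in the plan: 8B, 64B, 256B, 4KB, 64KB
    for &size in &[8usize, 64, 256, 4096, 65536] {
        println!("{:>6} B: {:6.2} GB/s", size, bench_copy(size, 20_000));
    }
}
```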

Acceptance criteria

  • SIMD wildcopy used on x86-64 (SSE2 baseline, AVX2 when available) and AArch64 (NEON baseline).
  • Overlap copy handles offsets 1-15 correctly (pattern repeat, not UB).
  • Ring buffer has wildcopy headroom — no bounds checks in inner copy loop.
  • Byte-exact output parity with scalar path on all existing tests.
  • Measurable decode throughput improvement on bench matrix.
  • Scalar fallback for non-SIMD targets compiles and passes tests.

Performance expectations

  • Literal-heavy corpora (logs, JSON): +40-60% decode throughput
  • Match-heavy corpora (binary, compressed): +20-30% decode throughput
  • Small blocks (1-4KB, CoordiNode workload): +15-25% (fixed dispatch cost amortizes less well over short copies)

This single optimization should close ~40-50% of the remaining gap with C zstd.

Files involved

  • `zstd/src/decoding/simd_copy.rs` (new)
  • `zstd/src/decoding/ringbuffer.rs` (headroom allocation, integrate wildcopy)
  • `zstd/src/decoding/sequence_execution.rs` (replace scalar copy calls)
  • `zstd/src/decoding/mod.rs` (module registration)
  • `zstd/benches/` (microbench)

Dependencies

Estimate

3d

Labels

  • P1-high — High priority, core functionality
  • enhancement — New feature or request
  • performance — Performance optimization
