perf(encoding): SIMD match-length comparison (SSE4.2/AVX2/NEON)

## Problem

\`common_prefix_len()\` in \`match_generator.rs:875-895\` determines match length by comparing 8 bytes per iteration (word-sized). C zstd uses SIMD to compare 16-32 bytes per iteration, reducing match-length determination time on long matches (common at levels 5+).

This is the primary encoding bottleneck at higher compression levels where match lengths frequently exceed 32 bytes and search depth is 8-32 candidates.

## Goal

Vectorize match-length comparison to process 16-32 bytes per iteration using platform SIMD.

## Implementation plan

### 1. SSE2 path (x86-64 baseline)
\`\`\`rust
// Compare 16 bytes, find first mismatch
let a = _mm_loadu_si128(ptr_a.cast());
let b = _mm_loadu_si128(ptr_b.cast());
let cmp = _mm_cmpeq_epi8(a, b);
let mask = _mm_movemask_epi8(cmp);
if mask != 0xFFFF {
    return offset + mask.trailing_ones() as usize;
}
\`\`\`

### 2. AVX2 path
Same pattern with \`_mm256_cmpeq_epi8\` — 32 bytes per iteration. Beneficial when average match length > 32 bytes (levels 7+).

### 3. NEON path (AArch64)
\`\`\`rust
let a = vld1q_u8(ptr_a);
let b = vld1q_u8(ptr_b);
let cmp = vceqq_u8(a, b);
// Extract mismatch position via vshrn + clz
\`\`\`

### 4. Integration
- Runtime dispatch in \`common_prefix_len()\` via \`#[target_feature]\` + detect macros.
- Scalar 8-byte fallback for non-SIMD targets (current code).
- Ensure comparison doesn't read past buffer end: check remaining length before SIMD iteration, fall back to scalar for tail bytes.

## Performance expectations
- Levels 5-11 (HashChain, search depth 4-32): **+15-25%** encoding throughput
- Levels 1-4 (Simple/Dfast, short matches): **+5-10%** (matches shorter, less benefit)
- Levels 12-22 (long matches common): **+20-30%** encoding throughput

## Acceptance criteria
- [ ] SSE2 and NEON paths implemented with runtime dispatch.
- [ ] AVX2 path for high compression levels.
- [ ] No reads past buffer bounds (tail handled by scalar fallback).
- [ ] Encoding output byte-exact with scalar path (match length is deterministic).
- [ ] Benchmark improvement on Better/Best levels with mid-entropy corpus.

## Files involved
- \`zstd/src/encoding/match_generator.rs\` (\`common_prefix_len()\`)

## Dependencies
- Independent. Can be done in parallel with any other issue.

## Estimate
1d

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(encoding): SIMD match-length comparison (SSE4.2/AVX2/NEON) #70

Problem

Goal

Implementation plan

1. SSE2 path (x86-64 baseline)

2. AVX2 path

3. NEON path (AArch64)

4. Integration

Performance expectations

Acceptance criteria

Files involved

Dependencies

Estimate

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

perf(encoding): SIMD match-length comparison (SSE4.2/AVX2/NEON) #70

Description

Problem

Goal

Implementation plan

1. SSE2 path (x86-64 baseline)

2. AVX2 path

3. NEON path (AArch64)

4. Integration

Performance expectations

Acceptance criteria

Files involved

Dependencies

Estimate

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions