Problem
`common_prefix_len()` in `match_generator.rs:875-895` determines match length by comparing 8 bytes per iteration (word-sized). C zstd uses SIMD to compare 16-32 bytes per iteration, reducing match-length determination time on long matches (common at levels 5+).
This is the primary encoding bottleneck at higher compression levels where match lengths frequently exceed 32 bytes and search depth is 8-32 candidates.
Goal
Vectorize match-length comparison to process 16-32 bytes per iteration using platform SIMD.
Implementation plan
1. SSE2 path (x86-64 baseline)
```rust
// Compare 16 bytes, find first mismatch
let a = _mm_loadu_si128(ptr_a.cast());
let b = _mm_loadu_si128(ptr_b.cast());
let cmp = _mm_cmpeq_epi8(a, b);
let mask = _mm_movemask_epi8(cmp);
if mask != 0xFFFF {
return offset + mask.trailing_ones() as usize;
}
```
2. AVX2 path
Same pattern with `_mm256_cmpeq_epi8` — 32 bytes per iteration. Beneficial when average match length > 32 bytes (levels 7+).
3. NEON path (AArch64)
```rust
let a = vld1q_u8(ptr_a);
let b = vld1q_u8(ptr_b);
let cmp = vceqq_u8(a, b);
// Extract mismatch position via vshrn + clz
```
4. Integration
- Runtime dispatch in `common_prefix_len()` via `#[target_feature]` + detect macros.
- Scalar 8-byte fallback for non-SIMD targets (current code).
- Ensure comparison doesn't read past buffer end: check remaining length before SIMD iteration, fall back to scalar for tail bytes.
Performance expectations
- Levels 5-11 (HashChain, search depth 4-32): +15-25% encoding throughput
- Levels 1-4 (Simple/Dfast, short matches): +5-10% (matches shorter, less benefit)
- Levels 12-22 (long matches common): +20-30% encoding throughput
Acceptance criteria
Files involved
- `zstd/src/encoding/match_generator.rs` (`common_prefix_len()`)
Dependencies
- Independent. Can be done in parallel with any other issue.
Estimate
1d
Problem
`common_prefix_len()` in `match_generator.rs:875-895` determines match length by comparing 8 bytes per iteration (word-sized). C zstd uses SIMD to compare 16-32 bytes per iteration, reducing match-length determination time on long matches (common at levels 5+).
This is the primary encoding bottleneck at higher compression levels where match lengths frequently exceed 32 bytes and search depth is 8-32 candidates.
Goal
Vectorize match-length comparison to process 16-32 bytes per iteration using platform SIMD.
Implementation plan
1. SSE2 path (x86-64 baseline)
```rust
// Compare 16 bytes, find first mismatch
let a = _mm_loadu_si128(ptr_a.cast());
let b = _mm_loadu_si128(ptr_b.cast());
let cmp = _mm_cmpeq_epi8(a, b);
let mask = _mm_movemask_epi8(cmp);
if mask != 0xFFFF {
return offset + mask.trailing_ones() as usize;
}
```
2. AVX2 path
Same pattern with `_mm256_cmpeq_epi8` — 32 bytes per iteration. Beneficial when average match length > 32 bytes (levels 7+).
3. NEON path (AArch64)
```rust
let a = vld1q_u8(ptr_a);
let b = vld1q_u8(ptr_b);
let cmp = vceqq_u8(a, b);
// Extract mismatch position via vshrn + clz
```
4. Integration
Performance expectations
Acceptance criteria
Files involved
Dependencies
Estimate
1d