Skip to content

perf(encoding): SIMD match-length comparison (SSE4.2/AVX2/NEON) #70

@polaz

Description

@polaz

Problem

`common_prefix_len()` in `match_generator.rs:875-895` determines match length by comparing 8 bytes per iteration (word-sized). C zstd uses SIMD to compare 16-32 bytes per iteration, reducing match-length determination time on long matches (common at levels 5+).

This is the primary encoding bottleneck at higher compression levels where match lengths frequently exceed 32 bytes and search depth is 8-32 candidates.

Goal

Vectorize match-length comparison to process 16-32 bytes per iteration using platform SIMD.

Implementation plan

1. SSE2 path (x86-64 baseline)

```rust
// Compare 16 bytes, find first mismatch
let a = _mm_loadu_si128(ptr_a.cast());
let b = _mm_loadu_si128(ptr_b.cast());
let cmp = _mm_cmpeq_epi8(a, b);
let mask = _mm_movemask_epi8(cmp);
if mask != 0xFFFF {
return offset + mask.trailing_ones() as usize;
}
```

2. AVX2 path

Same pattern with `_mm256_cmpeq_epi8` — 32 bytes per iteration. Beneficial when average match length > 32 bytes (levels 7+).

3. NEON path (AArch64)

```rust
let a = vld1q_u8(ptr_a);
let b = vld1q_u8(ptr_b);
let cmp = vceqq_u8(a, b);
// Extract mismatch position via vshrn + clz
```

4. Integration

  • Runtime dispatch in `common_prefix_len()` via `#[target_feature]` + detect macros.
  • Scalar 8-byte fallback for non-SIMD targets (current code).
  • Ensure comparison doesn't read past buffer end: check remaining length before SIMD iteration, fall back to scalar for tail bytes.

Performance expectations

  • Levels 5-11 (HashChain, search depth 4-32): +15-25% encoding throughput
  • Levels 1-4 (Simple/Dfast, short matches): +5-10% (matches shorter, less benefit)
  • Levels 12-22 (long matches common): +20-30% encoding throughput

Acceptance criteria

  • SSE2 and NEON paths implemented with runtime dispatch.
  • AVX2 path for high compression levels.
  • No reads past buffer bounds (tail handled by scalar fallback).
  • Encoding output byte-exact with scalar path (match length is deterministic).
  • Benchmark improvement on Better/Best levels with mid-entropy corpus.

Files involved

  • `zstd/src/encoding/match_generator.rs` (`common_prefix_len()`)

Dependencies

  • Independent. Can be done in parallel with any other issue.

Estimate

1d

Metadata

Metadata

Assignees

No one assigned

    Labels

    P2-mediumMedium priority — important improvementenhancementNew feature or requestperformancePerformance optimization

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions