Vectorize RegexInterpreter opcode loops for Oneloop, Onerep, Notonerep, and MatchString #124628
danmoseley wants to merge 4 commits into dotnet:main
Replace the per-character loop in the Oneloop/Oneloopatomic opcode handler
with a vectorized IndexOfAnyExcept call for left-to-right matching. This
mirrors the existing optimization already applied to Notoneloop (which uses
IndexOf), enabling SIMD-accelerated scanning when matching repeated
occurrences of a single character (e.g. a+ or a{3,}).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace the per-character loop in the Onerep opcode handler with a
vectorized ContainsAnyExcept call for left-to-right matching. This enables
SIMD-accelerated verification when matching a fixed number of occurrences
of a single character (e.g. the minimum repetitions of a{5,}).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace the per-character loop in the Notonerep opcode handler with a
vectorized Contains call for left-to-right matching. This enables
SIMD-accelerated verification when matching a fixed number of characters
that must not be a specific character (e.g. [^a]{5}).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace the per-character backwards comparison loop in MatchString with a
vectorized SequenceEqual call for left-to-right matching. This enables
SIMD-accelerated string comparison when matching literal multi-character
strings within regex patterns.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR optimizes hot-path opcode handling in RegexInterpreter by replacing per-character loops with SIMD-accelerated span operations for left-to-right matching, extending the existing vectorization precedent in the interpreter.
Changes:
- Vectorize literal string matching (`Multi`/`MatchString`) using `ReadOnlySpan<char>.SequenceEqual`.
- Vectorize fixed-count opcodes `Onerep` and `Notonerep` using `ContainsAnyExcept`/`Contains` for left-to-right paths.
- Vectorize greedy single-char loops `Oneloop`/`Oneloopatomic` using `IndexOfAnyExcept` for left-to-right paths.
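The span calls named in the list above can be sketched in isolation as follows. This is only an illustration of the technique, not the interpreter's code: the class and method names (`SpanOpcodeSketch`, `CountRun`, `MatchesRepeat`, `MatchesNotRepeat`) are hypothetical, and `pos` is assumed to lie in `[0, input.Length]`.

```csharp
using System;

// Hypothetical helpers for illustration; these are not RegexInterpreter members.
static class SpanOpcodeSketch
{
    // Oneloop-style: length of the run of `ch` starting at `pos`, capped at `max`.
    // Assumes 0 <= pos <= input.Length.
    public static int CountRun(ReadOnlySpan<char> input, int pos, int max, char ch)
    {
        int len = Math.Min(max, input.Length - pos);
        int i = input.Slice(pos, len).IndexOfAnyExcept(ch);
        return i < 0 ? len : i; // -1 means every character in the window matched
    }

    // Onerep-style: exactly `count` copies of `ch` must start at `pos`.
    public static bool MatchesRepeat(ReadOnlySpan<char> input, int pos, int count, char ch) =>
        input.Length - pos >= count &&
        !input.Slice(pos, count).ContainsAnyExcept(ch);

    // Notonerep-style: the next `count` characters must all differ from `ch`.
    public static bool MatchesNotRepeat(ReadOnlySpan<char> input, int pos, int count, char ch) =>
        input.Length - pos >= count &&
        !input.Slice(pos, count).Contains(ch);
}
```

For example, `SpanOpcodeSketch.CountRun("aaab", 0, int.MaxValue, 'a')` scans the window once and reports a run of three characters, which is the value a greedy `a+` loop would otherwise reach one character at a time.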
Real-world impact estimate: Analyzing the 15,817 unique patterns in the regex test corpus (assuming interpreter engine):
Follow-up PR #124630 adds SearchValues-based vectorization for
@MihuBot benchmark Regex
See benchmark results at https://gist.github.com/MihuBot/2004009ab9c5dbd509e407cc62b2d5a5
MihuBot Benchmark Analysis
Compiled and NonBacktracking paths are entirely unaffected (ratios 0.98–1.02 across all suites), as expected since the PR only modifies interpreter opcodes.
Interpreter regressions flagged by MihuBot
Investigation: do these hit modified opcodes?
I mapped each regressed benchmark's pattern to the interpreter opcodes it exercises:
5 of 8 regressions don't exercise any modified opcode. The 3 that marginally touch modified code are dominated by other costs (backtracking, alternation, cache behavior).
Root cause: JIT code layout effects
These are interpreter-only, sub-microsecond-scale, on shared cloud VMs, affecting unmodified code paths: classic JIT layout noise.
Build analysis is green; test failures are unrelated. Ready for review?
MihuBot Results vs. Local Benchmarks
The MihuBot standard benchmark suites (Sherlock, Leipzig, BoostDocs, etc.) don't directly validate the 2x-7x local speedups because they use complex real-world patterns where the hot paths are mostly Setloop/Setrep (character classes) rather than the Oneloop/Onerep/Notonerep/MatchString opcodes modified here, and literal strings in the patterns are short.
What MihuBot does confirm:
The local microbenchmarks are the right tool for validating these specific codepaths since they isolate the modified opcodes with long enough inputs to show the SIMD gains.
if (inputSpan.Length - runtextpos < c)
{
    return false;
}

pos = runtextpos + c;
if (!inputSpan.Slice(runtextpos, c).SequenceEqual(str.AsSpan()))
Can this be simplified to:
- if (!inputSpan.Slice(runtextpos, c).SequenceEqual(str.AsSpan()))
+ if (runtextpos > inputSpan.Length ||
+     !inputSpan.Slice(runtextpos).StartsWith(str.AsSpan()))
? Then you also wouldn't need the earlier int c = str.Length for this code path.
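To make the suggestion concrete, here is a minimal sketch comparing the two shapes on a standalone span. The names `LiteralMatchSketch`, `MatchWithSequenceEqual`, and `MatchWithStartsWith` are hypothetical and `pos` is assumed to be non-negative; the point is only that `StartsWith` performs the length check that the `Slice(pos, c)` version has to do by hand.

```csharp
using System;

// Hypothetical helpers for illustration; these are not RegexInterpreter members.
static class LiteralMatchSketch
{
    // Shape from the diff: explicit length check, then compare exactly str.Length chars.
    public static bool MatchWithSequenceEqual(ReadOnlySpan<char> input, int pos, string str)
    {
        if (input.Length - pos < str.Length)
        {
            return false;
        }

        return input.Slice(pos, str.Length).SequenceEqual(str.AsSpan());
    }

    // Shape from the suggestion: StartsWith returns false when the remainder is too short,
    // so only the slice start itself needs guarding.
    public static bool MatchWithStartsWith(ReadOnlySpan<char> input, int pos, string str) =>
        pos <= input.Length && input.Slice(pos).StartsWith(str.AsSpan());
}
```

Both helpers should agree for any non-negative `pos`; the second form simply drops the separate length variable.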
while (c != 0)
{
    if (str[--c] != inputSpan[--pos])
    {
        return false;
    }
}
Can/should this also be a SequenceEqual or EndsWith or equivalent?
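One way to picture the question is the sketch below, which places the backwards loop next to an `EndsWith` form for the right-to-left case, where the literal must occupy the characters immediately before the current position. The names `RightToLeftLiteralSketch`, `MatchBackwardLoop`, and `MatchBackwardEndsWith` are hypothetical, `pos` is assumed to lie in `[0, input.Length]`, and nothing in the thread establishes whether this form would be faster in practice.

```csharp
using System;

// Hypothetical helpers for illustration; these are not RegexInterpreter members.
static class RightToLeftLiteralSketch
{
    // Per-character form: walk both the literal and the input backwards from `pos`.
    public static bool MatchBackwardLoop(ReadOnlySpan<char> input, int pos, string str)
    {
        int c = str.Length;
        if (pos < c)
        {
            return false;
        }

        while (c != 0)
        {
            if (str[--c] != input[--pos])
            {
                return false;
            }
        }

        return true;
    }

    // Span form: the text before `pos` must end with the literal.
    // EndsWith returns false when fewer than str.Length characters precede `pos`.
    public static bool MatchBackwardEndsWith(ReadOnlySpan<char> input, int pos, string str) =>
        input.Slice(0, pos).EndsWith(str.AsSpan());
}
```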
if (!_rightToLeft)
{
    if (Forwardcharnext(inputSpan) != ch)
    if (inputSpan.Slice(runtextpos, c).ContainsAnyExcept(ch))
Is this one actually beneficial? It's pretty rare to have long runs of a single specific character.
if (Forwardcharnext(inputSpan) != ch)
// We're left-to-right, so we can employ the vectorized IndexOfAnyExcept
// to search for any character that isn't the target.
i = inputSpan.Slice(runtextpos, len).IndexOfAnyExcept(ch);
Same question.
If these onerep/loop/loopatomic actually improve performance on real regexes rather than microbenchmarks targeting these cases, great. Otherwise, though, I'd rather avoid adding more specialized code paths for things that only help with fake scenarios.
The `RegexInterpreter` already had a precedent for vectorizing per-character loops: the `Notoneloop`/`Notoneloopatomic` opcode used `IndexOf` for left-to-right matching. This PR extends that pattern to four more opcodes:
- Oneloop/Oneloopatomic (`a+`, `a*`): Use `IndexOfAnyExcept(ch)` instead of a per-char loop
- Onerep (`a{N}`): Use `ContainsAnyExcept(ch)` instead of a per-char equality loop
- Notonerep (`[^x]{N}`): Use `Contains(ch)` instead of a per-char inequality loop
- MatchString (literal strings): Use `SequenceEqual` instead of a per-char comparison loop
All optimizations apply only to left-to-right matching paths. Right-to-left paths (rare) are left unchanged as they can't benefit from forward-scanning vectorization.
These methods (`IndexOfAnyExcept`, `ContainsAnyExcept`, `Contains`, `SequenceEqual`) are SIMD-accelerated in .NET and process 16–32 chars at a time vs 1-at-a-time in the original loops.
Benchmark Results
Tested on Intel Core i9-14900K, .NET 11.0.0-dev, using BenchmarkDotNet with `--corerun` comparing before and after builds:
Patterns benchmarked: `a+` (64 chars), `a+` (256 chars), `a+` (1024 chars), `a*` (256 chars), `a{64}`, `a{256}`, `[^x]{64}`, `[^x]{256}`
Zero regressions. Zero allocation changes. Improvements scale with input length as expected from SIMD vectorization.