Fix perf bug in RegexCompiler when handling .*? #118373
Merged
stephentoub merged 1 commit into dotnet:main on Aug 5, 2025
We have a special code path that exists to optimize a singleline `.*?`, in which case we can just search for what comes after the loop in the pattern, because the loop itself will lazily match everything. Unfortunately, we're passing the wrong node to the `EmitIndexOf` helper that emits that search. We should be passing the node that represents the subsequent literal, but we're accidentally passing the set loop itself. We're only here if that set loop matches everything, so we're emitting an `IndexOfAnyInRange(0, \uFFFF)` call. This is functionally OK, but perf tanks because we end up doing non-trivial work for every character that matches the loop.
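To make the cost difference concrete, here is a minimal sketch of the two search strategies described above. It is not the IL that RegexCompiler emits: the class and method names are hypothetical, and the per-character loop is only an assumed model of how a search for "any char in `\0`..`\uFFFF`" degenerates. It assumes .NET 8 or later, where `MemoryExtensions.IndexOfAnyInRange` is available.

```csharp
using System;

// Hypothetical illustration only; not the code RegexCompiler generates.
public static class LazyDotStarSketch
{
    // Intended strategy: a singleline lazy .*? can consume any character,
    // so the earliest place the following literal can match is simply the
    // first occurrence of that literal. One vectorized search finds it.
    public static int SkipToLiteral(ReadOnlySpan<char> input, ReadOnlySpan<char> literal)
        => input.IndexOf(literal);

    // Assumed model of the buggy strategy: the search target is the loop's
    // own set (every char), so each search returns index 0 and the matcher
    // advances one character at a time, retrying the literal at each step.
    public static int SkipOneCharAtATime(ReadOnlySpan<char> input, ReadOnlySpan<char> literal)
    {
        int pos = 0;
        while (true)
        {
            ReadOnlySpan<char> rest = input.Slice(pos);
            if (rest.StartsWith(literal))
            {
                return pos;
            }

            // Matches the first remaining char whenever rest is non-empty.
            int i = rest.IndexOfAnyInRange('\0', '\uFFFF');
            if (i < 0)
            {
                return -1; // nothing left to consume
            }

            pos += i + 1;
        }
    }
}
```

Both methods return the same index for the same input, which is consistent with the bug being a performance problem rather than a correctness problem: the second simply pays a search call and a retry for every character the loop consumes.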
Pull Request Overview
This PR fixes a performance bug in the regex compiler's optimization for lazy quantifiers followed by literals. The bug occurred when handling patterns like `.*?` (singleline lazy dot-star) followed by a literal, where the compiler was passing the wrong node to the IndexOf emission helper.
- Corrects the node parameter passed to `EmitIndexOf` from the loop node to the literal node
- Fixes performance degradation caused by inefficient `IndexOfAnyInRange(0, \uFFFF)` calls
- Maintains functional correctness while improving performance for this optimization path
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
MihaZupan (Member) approved these changes on Aug 5, 2025 and left a comment:
Nice. Did you spot this in some benchmark given it's compiler-only?
Member
Author
Yup, the numbers I was getting out made no sense.
radekdoulik pushed a commit to radekdoulik/runtime that referenced this pull request on Aug 5, 2025