Fix NonBacktracking correctness issues and exponential blowup on nested loops#125457
danmoseley wants to merge 6 commits into dotnet:main
Conversation
The NonBacktracking regex engine suffered exponential time complexity when
processing deeply nested loop patterns like ((a)*)*. Even depth 20 could
take >60 seconds, and depth 2000 with a single-character input took ~11s.
Two changes fix this:
1. Loop flattening in CreateLoop: simplify (R*)* to R* (and variants like
(R+)*, (R*)+) when both loops have unbounded upper, compatible lower
bounds, same laziness, and no capture effects. This eliminates the
exponential state space in derivative computation.
2. StripEffects caching: cache results of StripEffects to avoid repeatedly
traversing shared sub-trees in the derivative DAG, which otherwise
causes exponential work even after loop flattening for patterns with
capture groups.
The loop flattening condition is carefully restricted to preserve
correctness:
- body._lower must be 0 or 1 (with outer lower 0): (R{2,})* cannot
match a single R, so it must not be simplified to R*.
- Both loops must have the same laziness: (?:R*)+? (greedy inner, lazy
outer) has different match semantics than R*?.
- No capture effects: collapsing loops would change capture group
bindings.
Re-enables StressTestDeepNestingOfLoops for NonBacktracking (depth 120).
Fixes dotnet#84188
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Change the nested loop flattening in CreateLoop from a recursive call (if + return CreateLoop) to a while loop that mutates lower/body in place. This avoids consuming stack proportional to the nesting depth, which could cause a stack overflow for deeply nested patterns.

Also note in comments that the exponential blowup occurs during lazy DFA construction within matching, not just pattern compilation, so it violates the engine's linear-time-in-input-length guarantee.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
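The flattening rule and its iterative form can be sketched as follows. This is an illustrative Python model, not the actual C# `SymbolicRegexNode.CreateLoop` implementation; the `Loop`/`create_loop` names are hypothetical, and the sketch omits the no-capture-effects check described above.

```python
# Illustrative model of the nested-loop flattening described above; the
# real implementation is in SymbolicRegexNode.CreateLoop (C#) and also
# refuses to flatten when capture effects are present.
UNBOUNDED = float("inf")

class Loop:
    def __init__(self, body, lower, upper, lazy):
        self.body, self.lower, self.upper, self.lazy = body, lower, upper, lazy

def create_loop(body, lower, upper, lazy):
    # A while loop (not recursion) mutating lower/body in place, so stack
    # usage stays constant regardless of nesting depth.
    while (isinstance(body, Loop)
           and upper == UNBOUNDED and body.upper == UNBOUNDED  # both unbounded
           and body.lazy == lazy                               # same laziness
           and lower <= 1 and body.lower <= 1):                # compatible lower bounds
        lower = min(lower, body.lower)  # (R+)* and (R*)+ both collapse to R*
        body = body.body
    return Loop(body, lower, upper, lazy)

# Depth-2000 nesting of star loops collapses to a single R* without
# consuming stack proportional to the depth:
node = "a"
for _ in range(2000):
    node = create_loop(node, 0, UNBOUNDED, lazy=False)
assert isinstance(node, Loop) and node.body == "a" and node.lower == 0

# Mixed laziness is not flattened, preserving (?:R*)+? semantics:
mixed = create_loop(Loop("a", 0, UNBOUNDED, lazy=False), 1, UNBOUNDED, lazy=True)
assert isinstance(mixed.body, Loop)
```

The same guard also leaves `(R{2,})*` alone, since its body's lower bound exceeds 1.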
The engine remains O(input_length) for any given pattern. The blowup is in the per-character cost (pattern-dependent constant), not in input-length scaling. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR improves the NonBacktracking (symbolic) regex engine’s behavior on deeply nested quantifiers by reducing derivative-state explosion and redundant effect-stripping work, and updates tests to validate both correctness and stress behavior.
Changes:
- Add loop-flattening in symbolic AST construction to simplify redundant nested unbounded loops when semantics are preserved.
- Add caching for `StripEffects` to avoid repeated traversal of shared derivative subtrees.
- Extend/adjust regex tests to cover nested-loop simplification correctness and re-enable NonBacktracking stress coverage with a safer nesting depth.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/Regex.Match.Tests.cs | Adds match-count correctness cases for nested-loop simplification and updates deep loop stress data to include NonBacktracking with reduced depth. |
| src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexNode.cs | Implements nested-loop flattening in CreateLoop and adds StripEffects caching (plus improved alternation handling) to prevent exponential blowups. |
| src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexBuilder.cs | Introduces _stripEffectsCache to support the new cached StripEffects behavior. |
@olsaarik are you comfortable signing off? I responded to your comment above; no code changes, just a comment tweak.
Improve CountSingletons to multiply the estimate by 2 when an unbounded
loop's body is itself an unbounded loop. This causes the NFA size
estimate to grow exponentially with nesting depth, so patterns with
alternating lazy/greedy nesting at 15+ levels are rejected upfront.
Add tests:
- Alternating-laziness depth-20 pattern exceeds unsafe threshold
- Same-laziness depth-100 stays safe (loop flattening collapses it)
- (a{2})* correctness: odd-length strings must not match
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
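A toy version of this estimator change (illustrative Python, not the actual C# `CountSingletons` code; the node shapes here are hypothetical) shows how doubling per nested unbounded loop rejects deep nesting upfront, consistent with the 15+ level threshold mentioned above (2^14 = 16384 already exceeds 10,000):

```python
# Toy version of the CountSingletons change: an unbounded loop whose
# body is itself an unbounded loop doubles the NFA size estimate, so
# nesting depth d contributes a 2^(d-1) factor and deep nesting trips
# the 10,000 safe-size threshold before matching starts.
UNBOUNDED = float("inf")

class Singleton:            # a single character/character-class node
    pass

class Loop:
    def __init__(self, body, upper):
        self.body, self.upper = body, upper

def count_singletons(node):
    if isinstance(node, Singleton):
        return 1
    est = count_singletons(node.body)
    if (node.upper == UNBOUNDED and isinstance(node.body, Loop)
            and node.body.upper == UNBOUNDED):
        est *= 2            # nested unbounded loop: derivative branches double
    return est

def nested(depth):
    node = Singleton()
    for _ in range(depth):
        node = Loop(node, UNBOUNDED)
    return node

assert count_singletons(nested(20)) == 2 ** 19   # 524288, far above 10,000
assert count_singletons(nested(5)) <= 10_000     # shallow nesting stays safe
```

In the real engine this estimate only fires for nesting that survives loop flattening (alternating laziness), since same-laziness nesting is collapsed before the estimate runs.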
Added defense-in-depth in the last commit: with loop flattening, same-laziness nesting (e.g. `((a)*)*`) is collapsed before the size estimate runs, so it stays safe. I verified that with loop flattening disabled, the same-laziness nesting would also be caught by the `CountSingletons` estimate.
```csharp
// (which requires matching laziness) and cause exponential derivative blowup.
// CountSingletons accounts for this by multiplying by 2 per nesting level, so
// at depth 15+ the estimate exceeds the default 10,000 threshold.
// Build: (?:(?:(?:a)*?)*)*? ... with alternating greedy/lazy at each level
```
The comment showing the example pattern doesn't match the actual pattern constructed by the loops: the closing loop runs from depth-1 down to 0, making the outermost quantifier greedy (`*`) rather than lazy (`*?`) as implied by `...)*? ...`. Please update the comment to reflect the real generated pattern (or describe the alternation without suggesting a specific outermost suffix).
```diff
- // Build: (?:(?:(?:a)*?)*)*? ... with alternating greedy/lazy at each level
+ // Build: (?:(?:(?:a)*)*?)* ... with alternating greedy/lazy quantifiers at each level
```
Fix NonBacktracking correctness issues and exponential blowup on nested loops
Fixes #84188
Problem
The NonBacktracking regex engine has exponential time complexity on deeply nested loop patterns like `((a)*)*`. The root cause is in the Brzozowski derivative computation: when the left side of a `Concat` is nullable (as with `*` loops), the derivative creates a 2-branch `Alternate`. With N levels of nesting, this produces O(2^N) intermediate nodes per derivative step. Even matching a single character takes 10.7s at depth 2000: the bottleneck is the first derivative computation, not cumulative per-character cost.
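The branching can be reproduced with a toy Brzozowski derivative (a hypothetical mini-AST in Python, not the engine's `SymbolicRegexNode`): the rule for a concatenation with a nullable left side, d(RS) = d(R)S | d(S), is what introduces the 2-branch alternation, and without simplification each derivative step strictly grows the tree, because the previous derivative reappears as the second branch of the new alternation.

```python
# Toy Brzozowski derivative illustrating the nullable-left Concat rule
# d(RS) = d(R)S | d(S), which creates the 2-branch Alternate.
from dataclasses import dataclass

class Eps: pass          # matches only the empty string
class Empty: pass        # matches nothing

@dataclass(frozen=True)
class Char:
    c: str

@dataclass(frozen=True)
class Star:
    r: object

@dataclass(frozen=True)
class Concat:
    l: object
    r: object

@dataclass(frozen=True)
class Alt:
    l: object
    r: object

def nullable(n):
    if isinstance(n, (Eps, Star)): return True
    if isinstance(n, (Char, Empty)): return False
    if isinstance(n, Concat): return nullable(n.l) and nullable(n.r)
    return nullable(n.l) or nullable(n.r)   # Alt

def deriv(n, ch):
    if isinstance(n, Char): return Eps() if n.c == ch else Empty()
    if isinstance(n, (Eps, Empty)): return Empty()
    if isinstance(n, Star): return Concat(deriv(n.r, ch), n)
    if isinstance(n, Alt): return Alt(deriv(n.l, ch), deriv(n.r, ch))
    if nullable(n.l):                        # the 2-branch Alternate case
        return Alt(Concat(deriv(n.l, ch), n.r), deriv(n.r, ch))
    return Concat(deriv(n.l, ch), n.r)

def size(n):
    if isinstance(n, (Char, Eps, Empty)): return 1
    if isinstance(n, Star): return 1 + size(n.r)
    return 1 + size(n.l) + size(n.r)

# ((a)*)* nested 6 deep: without simplification the derivative tree
# strictly grows on every step.
node = Char("a")
for _ in range(6):
    node = Star(node)
sizes, d = [], node
for _ in range(5):
    d = deriv(d, "a")
    sizes.append(size(d))
assert all(a < b for a, b in zip(sizes, sizes[1:]))
assert isinstance(d, Alt)
```

The engine's actual derivative operates over a simplified, deduplicated node space, but the same branching shape is what loop flattening removes.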
Fix
Three complementary changes:
1. Loop flattening in `CreateLoop`: simplify `(R*)*` to `R*` (and variants `(R+)*`, `(R*)+`) during symbolic AST construction. This directly eliminates the exponential state space. The simplification is restricted to cases where it preserves semantics: `(R{2,})*` cannot match a single `R`, so it must not become `R*`, and `(?:R*)+?` (greedy inner, lazy outer) has different match behavior than `R*?`.
2. `StripEffects` caching: the derivative of deeply nested loop patterns with captures produces a DAG with shared sub-trees wrapped in different `Effect` nodes. Without caching, `StripEffects` traverses each shared sub-tree once per reference rather than once per unique node, leading to exponential work. Adding a cache (following the existing pattern of `_pruneLowerPriorityThanNullabilityCache`) fixes this.
3. `CountSingletons` defense-in-depth: improve the NFA size estimate so that nested unbounded loops multiply the estimate by 2 per nesting level, matching the actual derivative branching cost. This causes the existing safe-size threshold (10,000) to reject patterns with deeply nested unbounded loops that survive loop flattening (e.g., alternating lazy/greedy nesting like `(?:(?:a*)*?)*` at 15+ levels). With loop flattening, same-laziness nesting is already collapsed before the estimate runs, so the multiplier only fires for patterns that would actually cause exponential blowup.

Note on linear-time guarantee
The NonBacktracking engine guarantees linear time in the length of the input, and this guarantee was never violated by the nested loop issue. The engine is O(C * N), where N is the input length and C is a pattern-dependent constant representing the per-character cost. For deeply nested patterns like `((a)*)*`, the constant C grows exponentially with the nesting depth of the pattern, but for any given pattern the matching time remains linear in the input length. In other words, this was an exponential blowup in the pattern-complexity constant, not in input-length scaling. In practice, triggering the blowup requires deliberately nesting redundant quantifiers 15+ levels deep, which does not arise naturally.
Tests
- `StressTestDeepNestingOfLoops` re-enabled for NonBacktracking at depth 120 (previously disabled citing #84188, "NonBacktracking engine is pathologically slow on patterns like `((((((((a)*)*)*)*)*)*)*` and input like `aaaaaaa` which other engines handle fine"; even depth 20 was intractable before the fix)
- `^(?:(?:a){2,})*$` on `"a"`: 0 matches (would be 1 if the `body._lower >= 2` check were missing)
- `^(?:(?:a){2})*$` on `"a"` and `"aaa"`: 0 matches (would be 1 if incorrectly simplified to `a*`)
- `(?:(?:a*)+?)` on `"aaa"`: 2 matches (would be 4 if the laziness guard were missing)
- `CountSingletons` defense-in-depth: …