Skip to content

Conversation

@github-actions
Copy link
Contributor

@github-actions github-actions bot commented Sep 17, 2020

Backport of #42408 to release/5.0-rc2

/cc @stephentoub

Fixes #42390
Fixes #42392

Customer Impact

Regex patterns that begin with .* may fail to match correctly when the developer requests to start the match in middle of the input string, e.g.

new Regex(".*").Match("abc", 1).Success

returns true on .NET 3.1 and .NET Framework 4.8, but erroneously returns false on .NET 5.

The fix is to delete a specific optimization that was added in .NET 5 which automatically adds an anchor to patterns that begin with “.*”, e.g. “.*abc” becomes “^.*abc”. This can have a huge impact on execution time of certain patterns + inputs. Imagine the pattern .*a and the pattern bcdefghijklmnopqrstuvwxyz. This is going to start matching at b, find the next newline, and then backtrack from there looking for the a; it won't find it and will backtrack all the way, failing the match at that position. At that point it'll bump to the next position, starting at c, and do it all over. It'll fail, backtrack all the way, and bump again, starting at d, and doing it all over. Etc. The optimization recognizes that since . will match anything other than newline, after it fails to match at the first position, we can just skip all subsequent positions until the next newline, as they're all going to fail. However, the optimization failed to take into account that someone can explicitly start a match in the middle of the provided text. In that case, the implicitly added anchor will fail the match in the actual matching logic. There are safe ways to do this optimization, but they’re all too involved to do at this point for .NET 5 and we can revisit for .NET 6.

Testing

New unit tests added, including the specific ones provided as part of the supplied repros in the filed GitHub issues. All tests run on both .NET 5 and .NET Framework 4.8.

Risk

Relatively low. The optimization is well-isolated and was just adding an anchor node to the internal pattern tree; deleting it just no longer adds that node, as was the case prior to .NET 5. A developer that wants the optimization can also explicitly add the anchor themselves (the optimization was “helping” in the case where the developer didn’t or didn’t know they could or should).

In .NET 5 we added a bunch of optimizations to Regex.  One of them was a transform that optimized for the case where the pattern begins with `.*`.  If it does, then we insert an implicit anchor at the beginning in order to avoid unnecessary backtracking.  Imagine the pattern `.*a` and the pattern `bcdefghijklmnopqrstuvwxyz`.  This is going to start matching at `b`, find the next newline, and then backtrack from there looking for the `a`; it won't find it and will backtrack all the way, failing the match at that position.  At that point it'll bump to the next position, starting at `c`, and do it all over.  It'll fail, backtrack all the way, and bump again, starting at `d`, and doing it all over.  Etc.  The optimization recognizes that since `.` will match anything other than newline, after it fails to match at the first position, we can just skip all subsequent positions until the next newline, as they're all going to fail.

However, the optimization failed to take into account that someone can explicitly start a match in the middle of the provided text.  In that case, the implicitly added anchor will fail the match in the actual "Go" matching logic.

There are safe ways to do this optimization, e.g. introducing a variant of these anchors that let FindFirstChar skip ahead but that aren't considered for Go's matching purposes, but we can look at employing those for .NET 6.  For now for .NET 5, this commit just deletes the faulty optimization and adds a few tests that were failing it.
@Dotnet-GitSync-Bot
Copy link
Collaborator

I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.

@ghost
Copy link

ghost commented Sep 17, 2020

Tagging subscribers to this area: @eerhardt, @pgovind, @jeffhandley
See info in area-owners.md if you want to be subscribed.

@danmoseley
Copy link
Member

@maryamariyan given the title, I'm a bit surprised the bot couldn't figure out this label..

@jeffhandley jeffhandley added the Servicing-approved Approved for servicing release label Sep 19, 2020
@jeffhandley
Copy link
Member

Remaining check timeouts are unrelated. This was approved via email 2020-09-17.

@jeffhandley jeffhandley merged commit eeb307c into release/5.0-rc2 Sep 19, 2020
@jeffhandley jeffhandley deleted the backport/pr-42408-to-release/5.0-rc2 branch September 19, 2020 00:59
@ghost ghost locked as resolved and limited conversation to collaborators Dec 7, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants