[release/5.0-rc2] Remove implicit anchoring optimization from Regex #42409
+55
−57
Backport of #42408 to release/5.0-rc2
/cc @stephentoub
Fixes #42390
Fixes #42392
Customer Impact
Regex patterns that begin with .* may fail to match correctly when the developer requests that the match start in the middle of the input string. In such a case, a call that returns true on .NET Core 3.1 and .NET Framework 4.8 erroneously returns false on .NET 5.
The fix is to delete a specific optimization that was added in .NET 5 which automatically adds an anchor to patterns that begin with “.*”, e.g. “.*abc” becomes “^.*abc”. The optimization can have a huge impact on the execution time of certain patterns and inputs. Imagine the pattern .*a and the input bcdefghijklmnopqrstuvwxyz. The engine starts matching at b, consumes up to the next newline (or the end of the input), and then backtracks from there looking for the a; it won't find it and will backtrack all the way, failing the match at that position. At that point it bumps to the next position, starting at c, and does it all over: it fails, backtracks all the way, and bumps again, starting at d, and so on. The optimization recognizes that since . matches anything other than a newline, once the match fails at the first position we can skip all subsequent positions until the next newline, as they're all going to fail.

However, the optimization failed to take into account that someone can explicitly start a match in the middle of the provided text. In that case, the implicitly added anchor causes the actual matching logic to fail the match. There are safe ways to do this optimization, but they’re all too involved to do at this point for .NET 5, and we can revisit for .NET 6.
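The cost difference described above can be made concrete with a toy model. The sketch below is not the .NET Regex engine: it is a hand-written, hypothetical matcher for the single pattern .*a, written in Python purely for illustration. The function name `count_comparisons` and the `skip_to_newline` flag are assumptions introduced here. It counts how many character comparisons a naive bump-along search makes, with and without skipping to the next newline after a failed attempt.

```python
from typing import Tuple


def count_comparisons(text: str, skip_to_newline: bool) -> Tuple[bool, int]:
    """Toy search for the pattern `.*a` in `text` (hypothetical helper).

    Returns (matched, number_of_character_comparisons).
    """
    comparisons = 0
    start = 0
    n = len(text)
    while start <= n:
        # `.*` greedily consumes everything up to the next newline
        # (or the end of the input if there is none).
        end = text.find("\n", start)
        if end == -1:
            end = n
        # Backtrack from the end of the line toward `start`, looking for 'a'.
        for k in range(end - 1, start - 1, -1):
            comparisons += 1
            if text[k] == "a":
                return True, comparisons
        if skip_to_newline:
            # Every later start on this line would re-check a suffix of the
            # same characters, so jump to just past the newline.
            start = end + 1
        else:
            # Naive bump-along: retry from the very next character.
            start += 1
    return False, comparisons


text = "bcdefghijklmnopqrstuvwxyz"
print(count_comparisons(text, skip_to_newline=False))  # (False, 325)
print(count_comparisons(text, skip_to_newline=True))   # (False, 25)
```

For this 25-character input the naive search makes 25 + 24 + ... + 1 = 325 comparisons before giving up, while the skipping variant makes 25. The gap grows quadratically with the length of the line.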
Testing
New unit tests added, including the specific ones provided as part of the supplied repros in the filed GitHub issues. All tests run on both .NET 5 and .NET Framework 4.8.
Risk
Relatively low. The optimization is well-isolated and was just adding an anchor node to the internal pattern tree; deleting it simply means that node is no longer added, as was the case prior to .NET 5. A developer who wants the optimization can also explicitly add the anchor themselves (the optimization was “helping” in the case where the developer didn’t, or didn’t know they could or should).
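Writing the anchor by hand is an explicit choice because it changes what the pattern means when a search starts partway through the input. As a rough analogy only, the sketch below uses Python's `re` module rather than .NET's Regex, and the pattern and input string are illustrative rather than taken from the filed issues. In Python, `^` refers to the real beginning of the string and not to the requested start position, so the anchored pattern stops matching once a start offset is supplied.

```python
import re

text = "12345abc"  # illustrative input, not the repro from the issues

unanchored = re.compile(r".*abc")
anchored = re.compile(r"^.*abc")

# Searching from the beginning: both patterns match.
print(unanchored.search(text) is not None)     # True
print(anchored.search(text) is not None)       # True

# Searching from index 3: only the unanchored pattern matches, because
# `^` still refers to index 0, which lies before the start position.
print(unanchored.search(text, 3) is not None)  # True
print(anchored.search(text, 3) is not None)    # False
```

A developer who adds the anchor deliberately accepts this behavior; an anchor added implicitly by the engine imposes it on callers who never asked for it.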