Skip to content

Handle single-node branches in ExtractCommonPrefixNode#124881

Merged
danmoseley merged 3 commits intodotnet:mainfrom
danmoseley:regex-redux/fix-alternation-prefix
Feb 26, 2026
Merged

Handle single-node branches in ExtractCommonPrefixNode#124881
danmoseley merged 3 commits intodotnet:mainfrom
danmoseley:regex-redux/fix-alternation-prefix

Conversation

@danmoseley
Copy link
Member

When alternation branches are reduced to single nodes (e.g., Set[Pp] from a single-child Concatenation after prior prefix extraction), ExtractCommonPrefixNode previously bailed because it required all branches to be Concatenations. This caused IgnoreCase alternation prefix extraction to stop one character short (e.g., htt instead of http for (http|https)).

Fix: Remove the upfront gate check and handle both Concatenation and single-node branches throughout the extraction loop. When a single-node branch matches the common prefix, it is replaced with Empty.

Fixes #124871

Tests: 10 new test cases covering http/https IC, shorter-branch-is-prefix, multi-char difference, 3-branch variants, case-sensitive regression guard, behavioral match correctness, 4-branch mixed single-node/Concat, and non-IgnoreCase Set-node branches.

danmoseley and others added 2 commits February 25, 2026 15:22
When alternation branches are reduced to single nodes (e.g., Set[Pp]
from a single-child Concatenation after prior prefix extraction),
ExtractCommonPrefixNode previously bailed because it required all
branches to be Concatenations. This caused IgnoreCase alternation
prefix extraction to stop one character short (e.g., 'htt' instead
of 'http' for (http|https)).

Fix: remove the upfront gate check, and handle both Concatenation and
single-node branches throughout the extraction loop.

Fixes dotnet#124871

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Behavioral correctness test for (?:http|https)://foo with IgnoreCase
- 4-branch regression test exercising single-node branch after recursive prefix extraction
- Non-IgnoreCase Set-node branch test via character class alternation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@danmoseley
Copy link
Member Author

I believe I follow this, so opening a PR. At least one test fails before the change - the others are just completeness.

-- Dan

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a regex optimization issue where alternation prefix extraction stopped prematurely when branches were reduced to single nodes (e.g., Set[Pp]) after prior IgnoreCase prefix extraction. Previously, ExtractCommonPrefixNode required all branches to be Concatenations with at least 2 children, causing patterns like (http|https) with IgnoreCase to extract only "htt" instead of "http".

Changes:

  • Removed the upfront gate check that required all branches to be multi-child Concatenations
  • Modified extraction logic to handle both Concatenation and single-node branches by checking branch type before accessing children
  • When a single-node branch matches the common prefix and is fully extracted, it's replaced with an Empty node

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
RegexNode.cs Core logic changes to handle single-node branches in ExtractCommonPrefixNode, removing gate check and adding branch type checks at lines 1205, 1226, and 1248-1257
RegexFindOptimizationsTests.cs 10 new test cases covering http/https IC, various alternation scenarios with shared prefixes, case-sensitive regression guard, and Set-node branches
Regex.Match.Tests.cs 3 functional tests verifying correctness of http/https pattern matching with IgnoreCase after the optimization

@danmoseley
Copy link
Member Author

@MihuBot benchmark Regex

@danmoseley
Copy link
Member Author

@MihuBot regexdiff

@MihuBot
Copy link

MihuBot commented Feb 26, 2026

396 out of 18857 patterns have generated source code changes.

Examples of GeneratedRegex source diffs
"[^'\",]+'[^^']+'|[^'\",]+\"[^\"]+\"|[^,]+" (21563 uses)
[GeneratedRegex("[^'\",]+'[^^']+'|[^'\",]+\"[^\"]+\"|[^,]+")]
  /// <code>[^'",]+'[^^']+'|[^'",]+"[^"]+"|[^,]+</code><br/>
  /// Explanation:<br/>
  /// <code>
-   /// ○ Match with 3 alternative expressions, atomically.<br/>
+   /// ○ Match with 2 alternative expressions, atomically.<br/>
  ///     ○ Match a sequence of expressions.<br/>
  ///         ○ Match a character in the set [^"',] atomically at least once.<br/>
-   ///         ○ Match '\''.<br/>
-   ///         ○ Match a character in the set [^'^] atomically at least once.<br/>
-   ///         ○ Match '\''.<br/>
-   ///     ○ Match a sequence of expressions.<br/>
-   ///         ○ Match a character in the set [^"',] atomically at least once.<br/>
-   ///         ○ Match '"'.<br/>
-   ///         ○ Match a character other than '"' atomically at least once.<br/>
-   ///         ○ Match '"'.<br/>
+   ///         ○ Match with 2 alternative expressions, atomically.<br/>
+   ///             ○ Match a sequence of expressions.<br/>
+   ///                 ○ Match '\''.<br/>
+   ///                 ○ Match a character in the set [^'^] atomically at least once.<br/>
+   ///                 ○ Match '\''.<br/>
+   ///             ○ Match a sequence of expressions.<br/>
+   ///                 ○ Match '"'.<br/>
+   ///                 ○ Match a character other than '"' atomically at least once.<br/>
+   ///                 ○ Match '"'.<br/>
  ///     ○ Match a character other than ',' atomically at least once.<br/>
  /// </code>
  /// </remarks>
                  int matchStart = pos;
                  ReadOnlySpan<char> slice = inputSpan.Slice(pos);
                  
-                   // Match with 3 alternative expressions, atomically.
+                   // Match with 2 alternative expressions, atomically.
                  {
                      int alternation_starting_pos = pos;
                      
                              pos += iteration;
                          }
                          
-                           // Match '\''.
-                           if (slice.IsEmpty || slice[0] != '\'')
+                           // Match with 2 alternative expressions, atomically.
                          {
-                               goto AlternationBranch;
-                           }
-                           
-                           // Match a character in the set [^'^] atomically at least once.
-                           {
-                               int iteration1 = slice.Slice(1).IndexOfAny('\'', '^');
-                               if (iteration1 < 0)
-                               {
-                                   iteration1 = slice.Length - 1;
-                               }
-                               
-                               if (iteration1 == 0)
+                               if (slice.IsEmpty)
                              {
                                  goto AlternationBranch;
                              }
                              
-                               slice = slice.Slice(iteration1);
-                               pos += iteration1;
+                               switch (slice[0])
+                               {
+                                   case '\'':
+                                       
+                                       // Match a character in the set [^'^] atomically at least once.
+                                       {
+                                           int iteration1 = slice.Slice(1).IndexOfAny('\'', '^');
+                                           if (iteration1 < 0)
+                                           {
+                                               iteration1 = slice.Length - 1;
+                                           }
+                                           
+                                           if (iteration1 == 0)
+                                           {
+                                               goto AlternationBranch;
+                                           }
+                                           
+                                           slice = slice.Slice(iteration1);
+                                           pos += iteration1;
+                                       }
+                                       
+                                       // Match '\''.
+                                       if ((uint)slice.Length < 2 || slice[1] != '\'')
+                                       {
+                                           goto AlternationBranch;
+                                       }
+                                       
+                                       pos += 2;
+                                       slice = inputSpan.Slice(pos);
+                                       break;
+                                       
+                                   case '"':
+                                       
+                                       // Match a character other than '"' atomically at least once.
+                                       {
+                                           int iteration2 = slice.Slice(1).IndexOf('"');
+                                           if (iteration2 < 0)
+                                           {
+                                               iteration2 = slice.Length - 1;
+                                           }
+                                           
+                                           if (iteration2 == 0)
+                                           {
+                                               goto AlternationBranch;
+                                           }
+                                           
+                                           slice = slice.Slice(iteration2);
+                                           pos += iteration2;
+                                       }
+                                       
+                                       // Match '"'.
+                                       if ((uint)slice.Length < 2 || slice[1] != '"')
+                                       {
+                                           goto AlternationBranch;
+                                       }
+                                       
+                                       pos += 2;
+                                       slice = inputSpan.Slice(pos);
+                                       break;
+                                       
+                                   default:
+                                       goto AlternationBranch;
+                               }
                          }
                          
-                           // Match '\''.
-                           if ((uint)slice.Length < 2 || slice[1] != '\'')
-                           {
-                               goto AlternationBranch;
-                           }
-                           
-                           pos += 2;
-                           slice = inputSpan.Slice(pos);
                          goto AlternationMatch;
                          
                          AlternationBranch:
                      
                      // Branch 1
                      {
-                           // Match a character in the set [^"',] atomically at least once.
+                           // Match a character other than ',' atomically at least once.
                          {
-                               int iteration2 = slice.IndexOfAny('"', '\'', ',');
-                               if (iteration2 < 0)
-                               {
-                                   iteration2 = slice.Length;
-                               }
-                               
-                               if (iteration2 == 0)
-                               {
-                                   goto AlternationBranch1;
-                               }
-                               
-                               slice = slice.Slice(iteration2);
-                               pos += iteration2;
-                           }
-                           
-                           // Match '"'.
-                           if (slice.IsEmpty || slice[0] != '"')
-                           {
-                               goto AlternationBranch1;
-                           }
-                           
-                           // Match a character other than '"' atomically at least once.
-                           {
-                               int iteration3 = slice.Slice(1).IndexOf('"');
+                               int iteration3 = slice.IndexOf(',');
                              if (iteration3 < 0)
                              {
-                                   iteration3 = slice.Length - 1;
+                                   iteration3 = slice.Length;
                              }
                              
                              if (iteration3 == 0)
                              {
-                                   goto AlternationBranch1;
+                                   return false; // The input didn't match.
                              }
                              
                              slice = slice.Slice(iteration3);
                              pos += iteration3;
                          }
                          
-                           // Match '"'.
-                           if ((uint)slice.Length < 2 || slice[1] != '"')
-                           {
-                               goto AlternationBranch1;
-                           }
-                           
-                           pos += 2;
-                           slice = inputSpan.Slice(pos);
-                           goto AlternationMatch;
-                           
-                           AlternationBranch1:
-                           pos = alternation_starting_pos;
-                           slice = inputSpan.Slice(pos);
-                       }
-                       
-                       // Branch 2
-                       {
-                           // Match a character other than ',' atomically at least once.
-                           {
-                               int iteration4 = slice.IndexOf(',');
-                               if (iteration4 < 0)
-                               {
-                                   iteration4 = slice.Length;
-                               }
-                               
-                               if (iteration4 == 0)
-                               {
-                                   return false; // The input didn't match.
-                               }
-                               
-                               slice = slice.Slice(iteration4);
-                               pos += iteration4;
-                           }
-                           
                      }
                      
                      AlternationMatch:;
"^(http|https)\\://[a-zA-Z0-9\\-\\.]+(:[a-zA- ..." (821 uses)
[GeneratedRegex("^(http|https)\\://[a-zA-Z0-9\\-\\.]+(:[a-zA-Z0-9]*)?(/[a-zA-Z0-9\\-\\._]*)*$", RegexOptions.IgnoreCase)]
  /// ○ 1st capture group.<br/>
  ///     ○ Match a character in the set [Hh].<br/>
  ///     ○ Match a character in the set [Tt] exactly 2 times.<br/>
-   ///     ○ Match with 2 alternative expressions.<br/>
-   ///         ○ Match a character in the set [Pp].<br/>
-   ///         ○ Match a sequence of expressions.<br/>
-   ///             ○ Match a character in the set [Pp].<br/>
-   ///             ○ Match a character in the set [Ss].<br/>
+   ///     ○ Match a character in the set [Pp].<br/>
+   ///     ○ Match a character in the set [Ss] atomically, optionally.<br/>
  /// ○ Match the string "://".<br/>
  /// ○ Match a character in the set [\-.0-9A-Za-z\u212A] greedily at least once.<br/>
  /// ○ Optional (greedy).<br/>
              {
                  int pos = base.runtextpos;
                  int matchStart = pos;
-                   int alternation_branch = 0;
-                   int alternation_starting_capturepos = 0;
-                   int alternation_starting_pos = 0;
                  int capture_starting_pos = 0;
                  int charloop_capture_pos = 0;
                  int charloop_starting_pos = 0, charloop_ending_pos = 0;
                  }
                  
                  // 1st capture group.
-                   //{
+                   {
                      capture_starting_pos = pos;
                      
-                       if ((uint)slice.Length < 3 ||
-                           !slice.StartsWith("htt", StringComparison.OrdinalIgnoreCase)) // Match the string "htt" (ordinal case-insensitive)
+                       if ((uint)slice.Length < 4 ||
+                           !slice.StartsWith("http", StringComparison.OrdinalIgnoreCase)) // Match the string "http" (ordinal case-insensitive)
                      {
                          UncaptureUntil(0);
                          return false; // The input didn't match.
                      }
                      
-                       // Match with 2 alternative expressions.
-                       //{
-                           alternation_starting_pos = pos;
-                           alternation_starting_capturepos = base.Crawlpos();
-                           
-                           // Branch 0
-                           //{
-                               // Match a character in the set [Pp].
-                               if ((uint)slice.Length < 4 || ((slice[3] | 0x20) != 'p'))
-                               {
-                                   goto AlternationBranch;
-                               }
-                               
-                               alternation_branch = 0;
-                               pos += 4;
-                               slice = inputSpan.Slice(pos);
-                               goto AlternationMatch;
-                               
-                               AlternationBranch:
-                               pos = alternation_starting_pos;
-                               slice = inputSpan.Slice(pos);
-                               UncaptureUntil(alternation_starting_capturepos);
-                           //}
-                           
-                           // Branch 1
-                           //{
-                               if ((uint)slice.Length < 5 ||
-                                   !slice.Slice(3).StartsWith("ps", StringComparison.OrdinalIgnoreCase)) // Match the string "ps" (ordinal case-insensitive)
-                               {
-                                   UncaptureUntil(0);
-                                   return false; // The input didn't match.
-                               }
-                               
-                               alternation_branch = 1;
-                               pos += 5;
-                               slice = inputSpan.Slice(pos);
-                               goto AlternationMatch;
-                           //}
-                           
-                           AlternationBacktrack:
-                           if (Utilities.s_hasTimeout)
+                       // Match a character in the set [Ss] atomically, optionally.
+                       {
+                           if ((uint)slice.Length > (uint)4 && ((slice[4] | 0x20) == 's'))
                          {
-                               base.CheckTimeout();
+                               slice = slice.Slice(1);
+                               pos++;
                          }
-                           
-                           switch (alternation_branch)
-                           {
-                               case 0:
-                                   goto AlternationBranch;
-                               case 1:
-                                   UncaptureUntil(0);
-                                   return false; // The input didn't match.
-                           }
-                           
-                           AlternationMatch:;
-                       //}
+                       }
                      
+                       pos += 4;
+                       slice = inputSpan.Slice(pos);
                      base.Capture(1, capture_starting_pos, pos);
-                       
-                       goto CaptureSkipBacktrack;
-                       
-                       CaptureBacktrack:
-                       goto AlternationBacktrack;
-                       
-                       CaptureSkipBacktrack:;
-                   //}
+                   }
                  
                  // Match the string "://".
                  if (!slice.StartsWith("://"))
                  {
-                       goto CaptureBacktrack;
+                       UncaptureUntil(0);
+                       return false; // The input didn't match.
                  }
                  
                  // Match a character in the set [\-.0-9A-Za-z\u212A] greedily at least once.
                      
                      if (iteration == 0)
                      {
-                           goto CaptureBacktrack;
+                           UncaptureUntil(0);
+                           return false; // The input didn't match.
                      }
                      
                      slice = slice.Slice(iteration);
                      
                      if (charloop_starting_pos >= charloop_ending_pos)
                      {
-                           goto CaptureBacktrack;
+                           UncaptureUntil(0);
+                           return false; // The input didn't match.
                      }
                      pos = --charloop_ending_pos;
                      slice = inputSpan.Slice(pos);
                          base.Capture(2, capture_starting_pos1, pos);
                          
                          Utilities.StackPush(ref base.runstack!, ref stackpos, capture_starting_pos1);
-                           goto CaptureSkipBacktrack1;
+                           goto CaptureSkipBacktrack;
                          
-                           CaptureBacktrack1:
+                           CaptureBacktrack:
                          capture_starting_pos1 = base.runstack![--stackpos];
                          goto CharLoopBacktrack1;
                          
-                           CaptureSkipBacktrack1:;
+                           CaptureSkipBacktrack:;
                      //}
                      
                      
                          // No iterations of the loop remain to backtrack into. Fail the loop.
                          goto CharLoopBacktrack;
                      }
-                       goto CaptureBacktrack1;
+                       goto CaptureBacktrack;
                      LoopEnd:;
                  //}
"^[\\s\\S]+?(?=[\\\\<!\\[*`~\\:]|\\b_|\\bhttp ..." (774 uses)
[GeneratedRegex("^[\\s\\S]+?(?=[\\\\<!\\[*`~\\:]|\\b_|\\bhttps?:\\/\\/| {2,}\\n|$)")]
  /// ○ Match if at the beginning of the string.<br/>
  /// ○ Match any character lazily at least once.<br/>
  /// ○ Zero-width positive lookahead.<br/>
-   ///     ○ Match with 5 alternative expressions, atomically.<br/>
+   ///     ○ Match with 4 alternative expressions, atomically.<br/>
  ///         ○ Match a character in the set [!*:&lt;[\\`~].<br/>
  ///         ○ Match a sequence of expressions.<br/>
  ///             ○ Match if at a word boundary.<br/>
-   ///             ○ Match '_'.<br/>
-   ///         ○ Match a sequence of expressions.<br/>
-   ///             ○ Match if at a word boundary.<br/>
-   ///             ○ Match the string "http".<br/>
-   ///             ○ Match 's' atomically, optionally.<br/>
-   ///             ○ Match the string "://".<br/>
+   ///             ○ Match with 2 alternative expressions, atomically.<br/>
+   ///                 ○ Match '_'.<br/>
+   ///                 ○ Match a sequence of expressions.<br/>
+   ///                     ○ Match the string "http".<br/>
+   ///                     ○ Match 's' atomically, optionally.<br/>
+   ///                     ○ Match the string "://".<br/>
  ///         ○ Match a sequence of expressions.<br/>
  ///             ○ Match ' ' atomically at least twice.<br/>
  ///             ○ Match '\n'.<br/>
                      
                      int atomic_stackpos = stackpos;
                      
-                       // Match with 5 alternative expressions, atomically.
+                       // Match with 4 alternative expressions, atomically.
                      {
                          int alternation_starting_pos = pos;
                          
                                  goto AlternationBranch1;
                              }
                              
-                               // Match '_'.
-                               if (slice.IsEmpty || slice[0] != '_')
+                               // Match with 2 alternative expressions, atomically.
                              {
-                                   goto AlternationBranch1;
+                                   if (slice.IsEmpty)
+                                   {
+                                       goto AlternationBranch1;
+                                   }
+                                   
+                                   switch (slice[0])
+                                   {
+                                       case '_':
+                                           pos++;
+                                           slice = inputSpan.Slice(pos);
+                                           break;
+                                           
+                                       case 'h':
+                                           // Match the string "ttp".
+                                           if (!slice.Slice(1).StartsWith("ttp"))
+                                           {
+                                               goto AlternationBranch1;
+                                           }
+                                           
+                                           // Match 's' atomically, optionally.
+                                           {
+                                               if ((uint)slice.Length > (uint)4 && slice[4] == 's')
+                                               {
+                                                   slice = slice.Slice(1);
+                                                   pos++;
+                                               }
+                                           }
+                                           
+                                           // Match the string "://".
+                                           if (!slice.Slice(4).StartsWith("://"))
+                                           {
+                                               goto AlternationBranch1;
+                                           }
+                                           
+                                           pos += 7;
+                                           slice = inputSpan.Slice(pos);
+                                           break;
+                                           
+                                       default:
+                                           goto AlternationBranch1;
+                                   }
                              }
                              
-                               pos++;
-                               slice = inputSpan.Slice(pos);
                              goto AlternationMatch;
                              
                              AlternationBranch1:
                          }
                          
                          // Branch 2
-                           {
-                               // Match if at a word boundary.
-                               if (!Utilities.IsPreWordCharBoundary(inputSpan, pos))
-                               {
-                                   goto AlternationBranch2;
-                               }
-                               
-                               // Match the string "http".
-                               if (!slice.StartsWith("http"))
-                               {
-                                   goto AlternationBranch2;
-                               }
-                               
-                               // Match 's' atomically, optionally.
-                               {
-                                   if ((uint)slice.Length > (uint)4 && slice[4] == 's')
-                                   {
-                                       slice = slice.Slice(1);
-                                       pos++;
-                                   }
-                               }
-                               
-                               // Match the string "://".
-                               if (!slice.Slice(4).StartsWith("://"))
-                               {
-                                   goto AlternationBranch2;
-                               }
-                               
-                               pos += 7;
-                               slice = inputSpan.Slice(pos);
-                               goto AlternationMatch;
-                               
-                               AlternationBranch2:
-                               pos = alternation_starting_pos;
-                               slice = inputSpan.Slice(pos);
-                           }
-                           
-                           // Branch 3
                          {
                              // Match ' ' atomically at least twice.
                              {
                                  
                                  if (iteration < 2)
                                  {
-                                       goto AlternationBranch3;
+                                       goto AlternationBranch2;
                                  }
                                  
                                  slice = slice.Slice(iteration);
                              // Match '\n'.
                              if (slice.IsEmpty || slice[0] != '\n')
                              {
-                                   goto AlternationBranch3;
+                                   goto AlternationBranch2;
                              }
                              
                              pos++;
                              slice = inputSpan.Slice(pos);
                              goto AlternationMatch;
                              
-                               AlternationBranch3:
+                               AlternationBranch2:
                              pos = alternation_starting_pos;
                              slice = inputSpan.Slice(pos);
                          }
                          
-                           // Branch 4
+                           // Branch 3
                          {
                              // Match if at the end of the string or if before an ending newline.
                              if (pos < inputSpan.Length - 1 || ((uint)pos < (uint)inputSpan.Length && inputSpan[pos] != '\n'))
"\\G(?:[\"“”]|\\s|\\\\[@#*]|\\\\[@#*bfmv])" (33 uses)
[GeneratedRegex("\\G(?:[\"“”]|\\s|\\\\[@#*]|\\\\[@#*bfmv])")]
  /// Explanation:<br/>
  /// <code>
  /// ○ Match if at the start position.<br/>
-   /// ○ Match with 3 alternative expressions, atomically.<br/>
+   /// ○ Match with 2 alternative expressions, atomically.<br/>
  ///     ○ Match a character in the set ["\u201C\u201D\s].<br/>
  ///     ○ Match a sequence of expressions.<br/>
  ///         ○ Match '\\'.<br/>
-   ///         ○ Match a character in the set [#*@].<br/>
-   ///     ○ Match a sequence of expressions.<br/>
-   ///         ○ Match '\\'.<br/>
  ///         ○ Match a character in the set [#*@bfmv].<br/>
  /// </code>
  /// </remarks>
                      return false; // The input didn't match.
                  }
                  
-                   // Match with 3 alternative expressions, atomically.
+                   // Match with 2 alternative expressions, atomically.
                  {
                      int alternation_starting_pos = pos;
                      
                      }
                      
                      // Branch 1
-                       {
-                           if ((uint)slice.Length < 2 ||
-                               slice[0] != '\\' || // Match '\\'.
-                               (((ch = slice[1]) != '#') & (ch != '*') & (ch != '@'))) // Match a character in the set [#*@].
-                           {
-                               goto AlternationBranch1;
-                           }
-                           
-                           pos += 2;
-                           slice = inputSpan.Slice(pos);
-                           goto AlternationMatch;
-                           
-                           AlternationBranch1:
-                           pos = alternation_starting_pos;
-                           slice = inputSpan.Slice(pos);
-                       }
-                       
-                       // Branch 2
                      {
                          if ((uint)slice.Length < 2 ||
                              slice[0] != '\\' || // Match '\\'.

For more diff examples, see https://gist.github.com/MihuBot/8b933494bb1466554d325dc4ca9fc8d4

JIT assembly changes
Total bytes of base: 54284087
Total bytes of diff: 54418439
Total bytes of delta: 134352 (0.25 % of base)
Total relative delta: 50.97
    diff is a regression.
    relative diff is a regression.

For a list of JIT diff regressions, see Regressions.md
For a list of JIT diff improvements, see Improvements.md

Sample source code for further analysis
const string JsonPath = "RegexResults-1798.json";
if (!File.Exists(JsonPath))
{
    await using var archiveStream = await new HttpClient().GetStreamAsync("https://mihubot.xyz/r/FH1OjmHA");
    using var archive = new ZipArchive(archiveStream, ZipArchiveMode.Read);
    archive.Entries.First(e => e.Name == "Results.json").ExtractToFile(JsonPath);
}

using FileStream jsonFileStream = File.OpenRead(JsonPath);
RegexEntry[] entries = JsonSerializer.Deserialize<RegexEntry[]>(jsonFileStream, new JsonSerializerOptions { IncludeFields = true })!;
Console.WriteLine($"Working with {entries.Length} patterns");



record KnownPattern(string Pattern, RegexOptions Options, int Count);

sealed class RegexEntry
{
    public required KnownPattern Regex { get; set; }
    public required string MainSource { get; set; }
    public required string PrSource { get; set; }
    public string? FullDiff { get; set; }
    public string? ShortDiff { get; set; }
    public (string Name, string Values)[]? SearchValuesOfChar { get; set; }
    public (string[] Values, StringComparison ComparisonType)[]? SearchValuesOfString { get; set; }
}

@MihuBot
Copy link

MihuBot commented Feb 26, 2026

@danmoseley
Copy link
Member Author

danmoseley commented Feb 26, 2026

Diffs seem to be expected as far as I can tell. Some of them show cascading improvements.

eg [GeneratedRegex("\\G(?:[\"“”]|\\s|\\\\[@#*]|\\\\[@#*bfmv])")] above goes from 3 branches to 2 because it now sees the last two branches start with \ and that allows it to see the subsequent parts collapse into [@#*bfmw] instead of their own alternation.

-- Dan

@danmoseley
Copy link
Member Author

danmoseley commented Feb 26, 2026

AI digestion of mihubot perf run. As expected, none of our benchmarks really benefit from this - most have unchanged code generated for a start. The reason to take this change would be to get the diffs on more real world patterns shown by the mihubot diffs. Almost by definition, they are more relevant to customers than many of our perf tests scenarios.

-- Dan

======

Suite Test Pattern/Options Ratio Notes
Sherlock Count .* / None 1.05 not changed by PR
Sherlock Count (?i)Sher[a-z]+|Hol[a-z]+ / Compiled 1.17 not changed by PR (no shared prefix)
Sherlock Count (?i)Sherlock / NonBacktracking 0.77 not changed by PR
Sherlock Count (?i)Sherlock|Holmes|Watson / Compiled 1.04 not changed by PR (no shared prefix)
Sherlock Count (?i)Sherlock|Holmes|Watson / NonBacktracking 1.09 not changed by PR (no shared prefix)
Sherlock Count (?i)Sherlock|...|Baker / Compiled 1.10 not changed by PR (no shared prefix)
Sherlock Count (?i)the / None 1.08 not changed by PR
Sherlock Count \p{L} / NonBacktracking 1.08 not changed by PR
Sherlock Count \p{Lu} / Compiled 1.04 not changed by PR
Sherlock Count \w+ / NonBacktracking 1.04 not changed by PR
Sherlock Count \w+\s+Holmes\s+\w+ / Compiled 0.95 not changed by PR
Sherlock Count (?s).* / Compiled 0.97 not changed by PR
Russian Count Шерлок Холмс / None 0.84 not changed by PR
Russian Count Шерлок Холмс / NonBacktracking 0.96 not changed by PR
Mariomkas Ctor IP pattern / None 0.97 not changed by PR
Mariomkas Count IP pattern / NonBacktracking 1.04 not changed by PR
Mariomkas Ctor email / NonBacktracking 0.93 not changed by PR
BoostDocs IsMatch #2 / NonBacktracking 1.09 not changed by PR (unknown pattern)
BoostDocs IsMatch #6 / None 1.11 not changed by PR (unknown pattern)
BoostDocs IsMatch #12 / NonBacktracking 1.04 not changed by PR (unknown pattern)
Common Email_IsMatch / None 1.08 not changed by PR
Common Uri_IsNotMatch / Compiled 0.94 not changed by PR
Common IP_IsNotMatch / Compiled 0.97 not changed by PR
Common ReplaceWords / Compiled 0.91 not changed by PR
Common SplitWords / Compiled 0.94 not changed by PR
Common MatchesWords / IC, Compiled 1.21 not changed by PR
Common ReplaceWords / IC, Compiled 0.82 not changed by PR
Common SplitWords / IC, Compiled 0.66 not changed by PR
Common Date_IsMatch / IC, Compiled 0.90 not changed by PR
Common MatchWord / IC, Compiled 1.11 not changed by PR

Bottom line: Every significant benchmark deviation is noise. None of the benchmark patterns exercise the single-node branch fix. The wildly swinging Common results (SplitWords IC Compiled at 0.66, MatchesWords
IC Compiled at 1.21) in opposite directions confirm these are measurement artifacts, not real effects. The real impact of this PR is shown by the 396 source code diffs, not this benchmark suite.

Copy link
Member

@stephentoub stephentoub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. The improvement looks good. I'd like to see some more tests as commented.

Address review feedback:
- Add PatternsReduceIdentically tests validating reduced tree shapes
  for IgnoreCase and non-IgnoreCase node prefix extraction
- Add PatternsReduceDifferently tests confirming optional vs required
  patterns produce distinct trees
- Add non-IgnoreCase variants of LeadingPrefix test cases
- Add LeadingSet tests for reversed branch order and IgnoreCase sets

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@danmoseley
Copy link
Member Author

tests added. I'll merge on green.

@danmoseley danmoseley enabled auto-merge (squash) February 26, 2026 16:58
@danmoseley danmoseley merged commit 371aca3 into dotnet:main Feb 26, 2026
87 of 90 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Patterns with literal string in ASCII alternation of different length do not fast-search for longest string

4 participants