<regex>: Remove capture validity vectors from stack frames
#5918
This removes the capturing group validity vectors in stack frames that were used to snapshot the validity/matched status of capturing groups whenever a stack frame was pushed. The restoration of this status is now achieved by processing new or modified opcodes during unwinding plus some changed logic when the pattern of a lookahead assertion matched:
- `_Capture_restore_end` is split into `_Capture_restore_matched_end` and `_Capture_restore_unmatched_end`, which are pushed depending on whether a capturing group is already matched while the `_N_end_capture` node is processed. `_Capture_restore_matched_end` keeps the capturing group matched (so it does nothing about the matched status); `_Capture_restore_unmatched_end` resets it to unmatched.
- Capturing groups are now reset by `_Matcher3::_Reset_capture_groups()`, replacing the prior calls to `std::fill()`. This function pushes a new stack frame with opcode `_Capture_restore_matched` for every capture group whose status is changed to unmatched.
- When the pattern of a positive lookahead assertion matched, the stack frames that are kept are those with opcode `_Capture_restore_unmatched_end` and any with opcode `_Capture_restore_begin` before them on the stack. We don't have to keep those with opcode `_Capture_restore_matched_end` because ECMAScript rules guarantee that the capturing groups inside the lookahead assertion are unmatched when processing the lookahead assertion starts, so the stack frame pushed for the first modification of a capturing group's end pointer in this lookahead assertion must have opcode `_Capture_restore_unmatched_end`.
- For a negative lookahead assertion, `std::fill()` is applied to the capturing groups to reset their status to unmatched. (We don't have to worry about restoring the begin and end pointers of the capturing groups because the capturing groups are always unmatched when leaving a negative lookahead assertion, so the pointers are meaningless.)

With this PR, the worst-case number of allocations is logarithmic in the size of the input (pattern + searched string) and no longer linear. But even for patterns like "a*", where the capture extent and validity vectors in the stack frames did not actually allocate, we still see some major performance improvement, probably because the overhead of managing these vectors is gone.
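For readers unfamiliar with the approach, the following is a minimal standalone sketch of the general idea: instead of snapshotting the matched status of every group in each frame, each change pushes a small frame that knows how to undo it. All names here (`Opcode`, `Frame`, `Group`, `Matcher`, `demo_round_trip`) are hypothetical simplifications chosen for illustration and are not the actual `<regex>` internals; positions are plain indices rather than iterators.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical, simplified stand-ins; not the real <regex> implementation.
enum class Opcode {
    restore_begin,         // restore a group's begin position
    restore_matched_end,   // restore end position, group stays matched
    restore_unmatched_end, // restore end position, group becomes unmatched
    restore_matched        // group was reset; mark it matched again
};

struct Frame {
    Opcode op;
    std::size_t capture_idx; // capturing group this frame refers to
    std::size_t pos;         // saved position (unused for restore_matched)
};

// Frames carry plain data only, no per-frame vectors.
static_assert(std::is_trivially_copyable<Frame>::value, "Frame should be plain data");

struct Group {
    std::size_t begin = 0;
    std::size_t end = 0;
    bool matched = false;
};

struct Matcher {
    std::vector<Group> groups;
    std::vector<Frame> stack;

    explicit Matcher(std::size_t count) : groups(count) {}

    void begin_capture(std::size_t idx, std::size_t pos) {
        stack.push_back({Opcode::restore_begin, idx, groups[idx].begin});
        groups[idx].begin = pos;
    }

    // The opcode pushed depends on whether the group is already matched.
    void end_capture(std::size_t idx, std::size_t pos) {
        Group& g = groups[idx];
        const Opcode op = g.matched ? Opcode::restore_matched_end : Opcode::restore_unmatched_end;
        stack.push_back({op, idx, g.end});
        g.end = pos;
        g.matched = true;
    }

    // Push one frame per group whose status actually changes to unmatched.
    void reset_groups(std::size_t first, std::size_t last) {
        for (std::size_t idx = first; idx < last; ++idx) {
            if (groups[idx].matched) {
                stack.push_back({Opcode::restore_matched, idx, 0});
                groups[idx].matched = false;
            }
        }
    }

    // Undo all changes recorded above the given stack depth, newest first.
    void unwind_to(std::size_t depth) {
        while (stack.size() > depth) {
            const Frame f = stack.back();
            stack.pop_back();
            Group& g = groups[f.capture_idx];
            switch (f.op) {
            case Opcode::restore_begin:         g.begin = f.pos; break;
            case Opcode::restore_matched_end:   g.end = f.pos; break;
            case Opcode::restore_unmatched_end: g.end = f.pos; g.matched = false; break;
            case Opcode::restore_matched:       g.matched = true; break;
            }
        }
    }
};

// Returns true if unwinding restores the state recorded before a reset.
inline bool demo_round_trip() {
    Matcher m(2);
    m.begin_capture(0, 1);
    m.end_capture(0, 3);
    const std::size_t depth = m.stack.size();
    m.reset_groups(0, 2);
    m.begin_capture(0, 4);
    m.end_capture(0, 6);
    m.unwind_to(depth);
    return m.groups[0].matched && m.groups[0].begin == 1 && m.groups[0].end == 3;
}
```

In this sketch the cost of a reset is proportional to the number of groups that were actually matched, and a frame never owns heap memory, which is the property that lets the per-frame validity vectors go away.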
This change also makes the structure `_Rx_state_frame` trivially copyable and destructible iff the unwrapped iterator type is trivially copyable or destructible, usually simplifying destruction of the stack frame vector.

Drive-by change: Since the stack frames have a new member `_Capture_idx`, this member is now used to store the relevant index of the capturing group for all `_Capture` opcodes, so they no longer have to access the contents of the `_Node_capture` NFA node.

Benchmark
Improvement beginning with #5865 (capture extent vector removal)