DoS fix / regex performance tuning#368
Merged
tsellers-r7 merged 6 commits intorapid7:masterfrom Aug 2, 2021
Merged
Conversation
Contributor
Author
Contributor
Author
|
RE: DoS of the pattern Single pass of a very long string The impact would only be at scale in a tight loop and so unlikely to be seen in a production use. In my testing processing a string with 10,000 spaces 100,000 times didn't finish in less than an hour. 100,000 passes against strings of varying lengths |
rkirk-nos
approved these changes
Aug 2, 2021
3 tasks
3 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation and Context
TL:DR - Fixed an unlikely DoS condition in a pattern in
xml/ssh_banners.xmland improved performance of certain patterns. Metrics included here. This PR doesn't address all of the slower patterns but does get the worst of them.In a PR ( rapid7/recog-java#7 ) for
recog-java@hudclark provided some data ( rapid7/recog-java#7 (comment) ) that would cause one of our regexes to hang indefinitely. That specific problem has already been fixed #353.Since
recogis generally used to process data from untrustworthy sources I wanted to make sure that this specific data and certain other patterns wouldn't cause indefinite hangs or otherwise have unacceptable performance characteristics when processed using any ofrecog's other fingerprint patterns. In Project Sonar we process 100's of millions of banners on a regular basis so small performance improvements can add up.Findings:
Testing methodology
Script that processes a specific test string with every fingerprint pattern currently in
recog. Since the process can be incredibly fast the script timed how long it took to perform this process 100,000 times per pattern. The script then emitted results sorted by duration with the highest duration first.There are 3,879 patterns currently in
recog. Using @hudclark's test string I found that 3,338 took less that 0.1 seconds to process. The longest duration was nearly 9 seconds. Using that as my benchmark I decided that anything over 2 seconds should probably be reviewed. There were 151 of these in this dataset alone. I did not address all of them because later tests revealed patterns that took up to 62.2 seconds and thus took priority.First test case - hudclark's
The data that hudclark provided was string ~1,500 characters long that consisted solely of digits. It pointed out some patterns that took 2x to 4x longer than I'd like. These were quickly overshadowed by later tests.
Original results
Click to expand results!
Results after changes
Click to expand results!
Second test case - High numbers of spaces
The second test was high numbers of spaces followed by various characters.
" " * 10000 + "1!(&(HFUH*&GEGG#((@*#(@&#H@H 37H7293H423H4H H&#(@$&H#$H@#$"This found some really poorly performing patterns.
Original results
Click to expand results!
Results after changes
Click to expand results!
It also found an indefinite hang in the following fingerprint.
The problem here, besides the fact that it doesn't actually match what it should match, is that there are two back to back regex patterns,
([\s]*), and\s*, that match arbitrary length strings consisting solely of whitespace. When processing data that starts with a very long string of whitespace this causes catastrophic backtracking to occur ultimately resulting in a DoS.This has been fixed so as to actually correctly capture a version (
([\s]*)->([\d.]{0,8})) and I have bounded both the capture and the number of spaces that follow. I've also located and added an example so that it can be tested.Third test case - High numbers of digits
The third test was high numbers of digits followed by various characters.
"1" * 10000 + " 1 1 1 1 1 1 1!(&(HFUH*&GEGG#((@*#(@&#H@H 37H7293H423H4H H&#(@$&H#$H@#$"This found some really poorly performing patterns.
Original results
Click to expand results!
Results after changes
Click to expand results!
Changes
Most of the performance issues are due to unbounded matches at the beginning of a regex pattern (^(.+)
). These are compounded when followed by another unbounded match or optional string^([\s])\s`.In general I have modified these matches so as to limit the number of characters that it can match. The count limit is based on the maximum string that I think is plausible in that location + some padding. Note, I have only updated specific cases where performance is particularly poor. I don't currently have time to change all of them.
Changes:
' '*or' '+) have typically been bounded to 8 spaces. We typically see 1 to 3 in those cases generally.How Has This Been Tested?
test script, rspec + the build in banner examples.
Types of changes
Checklist: