-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Python: port ReDoS queries from Javascript #6175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- Added naive implementation of `charRange` so the test can run. - Made predicates public as needed.
repeats. This in preparation for ReDoS
This contains several contributions from @erik-krogh and also some fixes from @nickrolfe
This work is due to @erik-krogh who also - made corresponding fixes to `RegexTreeView.qll` - implemented `toUnicode` so it is available on `String`s
the extra ordering conditions in ReDoSUtil will be needed for the Python implementation.
The library specific file is `RegExpTreeView`. The files are recorded as identical via the mapping in `identical-files.json`.
- `KnownCVEs` contain the currently triaged Python CVEs - `unittest.py` contains some tests constructed by @erik-krogh - `redos.py` contains a port of `tst.js` from javascript The expected file has been ported as well with some fixups by @tausbn
so we can see the light go green. But we should perhaps do something about those duplicate results.
|
DCA run here |
| * a suffix `x` (possible empty) that is most likely __not__ accepted. | ||
| */ | ||
|
|
||
| import ReDoSUtil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Driveby comment: Make as much as possible private. The import plus most of the classes and predicates all ought to be private.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this wasn't actually done?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought I had done this, but you are right; PR here.
RasmusWL
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked through the first commit and tried to understand the improvements you made to our regex library... I was taken by surprise by how complicated was for me to understand. Must have been a huge effort to actually make these things 💪
I made a few very detailed comments on that first commit, and have glanced through the other commits and made a few comments along the way.
I think it would have been ideal to have such detailed explanatory comments on the whole regex.qll, but I think we need to evaluate whether that's actually going to be worth the effort (even just for the new parts added). I suggest we talk this over "in-person", once you've had a chance to look through my comments.
I think a few of the non-private member-predicates of the RegexString class have been changed (like qualifiedItem/qualifiedPart in 21007d2), so technically that means if someone is using these, their code will break if we push this update 😞 So I guess we should be following our normal deprecation policy on this as well, although it is a bit burdensome. We should consider whether new versions should be marked with INTERNAL: Do not use., which can allow us some more freedom (if for example, the predicates are only exposed to be able to test things).
| result = this.(Unicode).getText() | ||
| } | ||
|
|
||
| /** result is true for those start chars that actually mark a start of a char set. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we only care about the cases where result = true, could we just rewrite this predicate to only hold when this is the case? (and then alter usage from this.char_set_start(pos) = true ⇒ this.char_set_start(pos))
Hmm, I see this becomes difficult with the if/then/else, since you need to duplicate some of the logic then :| OH wait, can't you just do
if cond()
then your_stuff()
else any()to get the same effect?
OH, since we want to properly handle [[][A-Z] we actually need this predicate to only hold for the positions which has the non-escaped char [, and want to make use of the booleanNot ... although this is subtle point 😰
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another problem is on line 154, where we recursively invoke result = char_set_start(p2).booleanNot(). Rewriting this to not use booleans would mean rewriting this line as not char_set_start(p2), but then we have introduced negative recursion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clearly I forgot to write proper comments here. This is following the same pattern as predicate escapingChar(int pos) { this.escaping(pos) = true } and private boolean escaping(int pos) to avoid negative recursion. The pattern is used again in scanning through char sets to find character ranges. Both of these instances are a bit more elaborate than escaping and therefor the pattern is likely harder to detect without documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added more elaborate comments to this effect now.
tausbn
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still chewing through this PR, but I thought I would give some interim comments. (I am happy to see that RasmusWL has addressed some of them already. 🙂)
| */ | ||
| predicate charRange(int charset_start, int start, int lower_end, int upper_start, int end) { | ||
| // mirror logic from `simpleCharacter` | ||
| exists(int x, int y | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we rename x and y to something a bit more... apt? I had to stare at this for a while to figure out what they were supposed to mean. Perhaps contents_start and contents_end would be appropriate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This particular implementation is only for illustrative purposes (there was no such predicate to test). A better implementation is introduced later.
| * with lower bound found between `start` and `lower_end` | ||
| * and upper bound found between `upper_start` and `end`. | ||
| */ | ||
| predicate charRange(int charset_start, int start, int lower_end, int upper_start, int end) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this would be more intuitive if start was lower_start and end was upper_end. I realise these are the start and end for the entire range, but somehow I have to think extra hard to see why it makes sense to link together start with lower_end, etc.
| override string getARelevantTag() { result = "charSet" } | ||
|
|
||
| override predicate hasActualResult(Location location, string element, string tag, string value) { | ||
| exists(location.getFile().getRelativePath()) and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really minor, but it seems odd to me to both require that the test file is not a dependency, and that it also has a specific name. As far as I can tell, these tests shouldn't even require the re module to be imported (during extraction, that is), since we no longer rely on points-to for this.
| ) | ||
| } | ||
|
|
||
| /** result denotes if the index is a left bracket */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This QLDoc seems underspecified. My reading is that it holds if position pos contains the indexth char set delimiter, and the result is true iff the bracket is a left bracket.
|
|
||
| /** result denotes if the index is a left bracket */ | ||
| boolean char_set_delimiter(int index, int pos) { | ||
| pos = rank[index](int p | this.nonEscapedCharAt(p) = "[" or this.nonEscapedCharAt(p) = "]") and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless I'm grossly mistaken, rank indexes from 1, so the first bracket (if any) will be at index 1 as well. Is this intended?
(You quite often see rank[index + 1](...) for this reason.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intended in the sense that I did not see a reason to change the default indexing...
| index = 1 and result = true // if a '[' is first in the string (among brackets), it starts a char set | ||
| or | ||
| index > 1 and | ||
| not char_set_delimiter(index - 1, _) = false and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit curious about this line. Why not just char_set_delimiter(index - 1, _) = true? As far as I can tell, char_set_delimiter cannot fail for index - 1 (edit: assuming index > 1) if it has already succeeded for index (though I may have missed something).
Also, it would be great if each disjunct had a brief comment about which case it is handling.
| result = this.(Unicode).getText() | ||
| } | ||
|
|
||
| /** result is true for those start chars that actually mark a start of a char set. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another problem is on line 154, where we recursively invoke result = char_set_start(p2).booleanNot(). Rewriting this to not use booleans would mean rewriting this line as not char_set_start(p2), but then we have introduced negative recursion.
|
DCA run without attempting to use special CLI here. |
Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>
| import python | ||
| import semmle.python.regex.SuperlinearBackTracking | ||
|
|
||
| from PolynomialBackTrackingTerm t | ||
| where t.getLocation().getFile().getBaseName() = "KnownCVEs.py" | ||
| select t.getRegex(), t, t.getReason() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one looks more like a file that should be under test/ and not src/?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does, yes..
until supporting CLI is released
|
DCA run actually running ReDoS here. Performance looks fine. I do not know if the new results are due to duplication, they look fine. |
asgerf
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JS changes lgtm
|
🚢 🚀 💥 |
This PR ports the very succesful ReDoS queries from Javascript. In order to do so it first:
Some tests are included here, more can be found in the scratch branch.
This work was greatly helped by @erik-krogh who contributed numerous fixes along the way as well as making parsed regexes viewable in the AST-viewer and making the
toUnicodepredicate available onStrings. And he enthusiastically triaged several results from the run-on-alls.Some things that might still need to be tweaked: