Skip to content

Conversation

@yoff
Copy link
Contributor

@yoff yoff commented Jun 28, 2021

This PR ports the very succesful ReDoS queries from Javascript. In order to do so it first:

  • exposes a few predicates in our regex parser (and fixes a few issues)
  • builds a parse-tree-view on top
  • refactors the js queries to admit identical files

Some tests are included here, more can be found in the scratch branch.

This work was greatly helped by @erik-krogh who contributed numerous fixes along the way as well as making parsed regexes viewable in the AST-viewer and making the toUnicode predicate available on Strings. And he enthusiastically triaged several results from the run-on-alls.

Some things that might still need to be tweaked:

  • Python (and Ruby) have extra anchors that should probably be accounted for in the queries, specifically during prefix generation.
  • The value of constants which are anchors should also be considered so they do not get confused with character classes.
  • It would be nice to add a test file including all the currently triaged results.

yoff added 8 commits June 28, 2021 17:04
- Added naive implementation of `charRange` so the test can run.
- Made predicates public as needed.
repeats. This in preparation for ReDoS
This contains several contributions from @erik-krogh
and also some fixes from @nickrolfe
This work is due to @erik-krogh who also
 - made corresponding fixes to `RegexTreeView.qll`
 - implemented `toUnicode` so it is available on `String`s
the extra ordering conditions in ReDoSUtil will be needed
for the Python implementation.
The library specific file is `RegExpTreeView`.
The files are recorded as identical via the mapping
in `identical-files.json`.
- `KnownCVEs` contain the currently triaged Python CVEs
- `unittest.py` contains some tests constructed by @erik-krogh
- `redos.py` contains a port of `tst.js` from javascript
The expected file has been ported as well with some fixups by @tausbn
@yoff yoff requested review from a team as code owners June 28, 2021 15:17
@yoff
Copy link
Contributor Author

yoff commented Jun 29, 2021

DCA run here

* a suffix `x` (possible empty) that is most likely __not__ accepted.
*/

import ReDoSUtil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Driveby comment: Make as much as possible private. The import plus most of the classes and predicates all ought to be private.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this wasn't actually done?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought I had done this, but you are right; PR here.

Copy link
Member

@RasmusWL RasmusWL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked through the first commit and tried to understand the improvements you made to our regex library... I was taken by surprise by how complicated was for me to understand. Must have been a huge effort to actually make these things 💪

I made a few very detailed comments on that first commit, and have glanced through the other commits and made a few comments along the way.

I think it would have been ideal to have such detailed explanatory comments on the whole regex.qll, but I think we need to evaluate whether that's actually going to be worth the effort (even just for the new parts added). I suggest we talk this over "in-person", once you've had a chance to look through my comments.

I think a few of the non-private member-predicates of the RegexString class have been changed (like qualifiedItem/qualifiedPart in 21007d2), so technically that means if someone is using these, their code will break if we push this update 😞 So I guess we should be following our normal deprecation policy on this as well, although it is a bit burdensome. We should consider whether new versions should be marked with INTERNAL: Do not use., which can allow us some more freedom (if for example, the predicates are only exposed to be able to test things).

result = this.(Unicode).getText()
}

/** result is true for those start chars that actually mark a start of a char set. */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we only care about the cases where result = true, could we just rewrite this predicate to only hold when this is the case? (and then alter usage from this.char_set_start(pos) = truethis.char_set_start(pos))

Hmm, I see this becomes difficult with the if/then/else, since you need to duplicate some of the logic then :| OH wait, can't you just do

if cond()
then your_stuff()
else any()

to get the same effect?

OH, since we want to properly handle [[][A-Z] we actually need this predicate to only hold for the positions which has the non-escaped char [, and want to make use of the booleanNot ... although this is subtle point 😰

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another problem is on line 154, where we recursively invoke result = char_set_start(p2).booleanNot(). Rewriting this to not use booleans would mean rewriting this line as not char_set_start(p2), but then we have introduced negative recursion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clearly I forgot to write proper comments here. This is following the same pattern as predicate escapingChar(int pos) { this.escaping(pos) = true } and private boolean escaping(int pos) to avoid negative recursion. The pattern is used again in scanning through char sets to find character ranges. Both of these instances are a bit more elaborate than escaping and therefor the pattern is likely harder to detect without documentation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added more elaborate comments to this effect now.

Copy link
Contributor

@tausbn tausbn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still chewing through this PR, but I thought I would give some interim comments. (I am happy to see that RasmusWL has addressed some of them already. 🙂)

*/
predicate charRange(int charset_start, int start, int lower_end, int upper_start, int end) {
// mirror logic from `simpleCharacter`
exists(int x, int y |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we rename x and y to something a bit more... apt? I had to stare at this for a while to figure out what they were supposed to mean. Perhaps contents_start and contents_end would be appropriate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This particular implementation is only for illustrative purposes (there was no such predicate to test). A better implementation is introduced later.

* with lower bound found between `start` and `lower_end`
* and upper bound found between `upper_start` and `end`.
*/
predicate charRange(int charset_start, int start, int lower_end, int upper_start, int end) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this would be more intuitive if start was lower_start and end was upper_end. I realise these are the start and end for the entire range, but somehow I have to think extra hard to see why it makes sense to link together start with lower_end, etc.

override string getARelevantTag() { result = "charSet" }

override predicate hasActualResult(Location location, string element, string tag, string value) {
exists(location.getFile().getRelativePath()) and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really minor, but it seems odd to me to both require that the test file is not a dependency, and that it also has a specific name. As far as I can tell, these tests shouldn't even require the re module to be imported (during extraction, that is), since we no longer rely on points-to for this.

)
}

/** result denotes if the index is a left bracket */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This QLDoc seems underspecified. My reading is that it holds if position pos contains the indexth char set delimiter, and the result is true iff the bracket is a left bracket.


/** result denotes if the index is a left bracket */
boolean char_set_delimiter(int index, int pos) {
pos = rank[index](int p | this.nonEscapedCharAt(p) = "[" or this.nonEscapedCharAt(p) = "]") and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I'm grossly mistaken, rank indexes from 1, so the first bracket (if any) will be at index 1 as well. Is this intended?

(You quite often see rank[index + 1](...) for this reason.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intended in the sense that I did not see a reason to change the default indexing...

index = 1 and result = true // if a '[' is first in the string (among brackets), it starts a char set
or
index > 1 and
not char_set_delimiter(index - 1, _) = false and
Copy link
Contributor

@tausbn tausbn Jun 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit curious about this line. Why not just char_set_delimiter(index - 1, _) = true? As far as I can tell, char_set_delimiter cannot fail for index - 1 (edit: assuming index > 1) if it has already succeeded for index (though I may have missed something).

Also, it would be great if each disjunct had a brief comment about which case it is handling.

result = this.(Unicode).getText()
}

/** result is true for those start chars that actually mark a start of a char set. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another problem is on line 154, where we recursively invoke result = char_set_start(p2).booleanNot(). Rewriting this to not use booleans would mean rewriting this line as not char_set_start(p2), but then we have introduced negative recursion.

@yoff
Copy link
Contributor Author

yoff commented Jun 30, 2021

DCA run without attempting to use special CLI here.

Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>
Comment on lines 1 to 6
import python
import semmle.python.regex.SuperlinearBackTracking

from PolynomialBackTrackingTerm t
where t.getLocation().getFile().getBaseName() = "KnownCVEs.py"
select t.getRegex(), t, t.getReason()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one looks more like a file that should be under test/ and not src/?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does, yes..

@yoff
Copy link
Contributor Author

yoff commented Jun 30, 2021

DCA run actually running ReDoS here. Performance looks fine. I do not know if the new results are due to duplication, they look fine.

Copy link
Contributor

@asgerf asgerf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JS changes lgtm

@tausbn
Copy link
Contributor

tausbn commented Jun 30, 2021

🚢 🚀 💥

@tausbn tausbn merged commit e4af146 into github:main Jun 30, 2021
This was referenced Sep 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants