Python: port ReDoS queries from Javascript #6175

yoff · 2021-06-28T15:17:36Z

This PR ports the very successful ReDoS queries from JavaScript. In order to do so, it first:

- exposes a few predicates in our regex parser (and fixes a few issues)
- builds a parse-tree-view on top
- refactors the JS queries to admit identical files

Some tests are included here, more can be found in the scratch branch.

This work was greatly helped by @erik-krogh who contributed numerous fixes along the way as well as making parsed regexes viewable in the AST-viewer and making the toUnicode predicate available on Strings. And he enthusiastically triaged several results from the run-on-alls.

Some things that might still need to be tweaked:

- Python (and Ruby) have extra anchors that should probably be accounted for in the queries, specifically during prefix generation.
- The value of constants which are anchors should also be considered so they do not get confused with character classes.
- It would be nice to add a test file including all the currently triaged results.
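
To illustrate the first point about extra anchors: Python's `re` module has `\A` and `\Z` in addition to `^` and `$`, and they behave differently around newlines. The check below uses only the standard `re` module; the sample strings are made up for illustration and are not taken from the PR's tests.

```python
import re

text = "a\n"

# `$` also matches just before a trailing newline...
dollar_match = re.search(r"a$", text)
# ...whereas `\Z` matches only at the very end of the string.
end_match = re.search(r"a\Z", text)

# Under re.MULTILINE, `^` matches at every line start; `\A` still
# matches only at the start of the whole string.
caret_hits = re.findall(r"^b", "a\nb", flags=re.MULTILINE)
start_hits = re.findall(r"\Ab", "a\nb", flags=re.MULTILINE)
```

A query that treats `\Z` like `$`, or `\A` like `^`, could therefore generate a prefix or suffix that the real engine handles differently.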

- `KnownCVEs` contains the currently triaged Python CVEs
- `unittest.py` contains some tests constructed by @erik-krogh
- `redos.py` contains a port of `tst.js` from JavaScript

The expected file has been ported as well, with some fixups by @tausbn.

yoff · 2021-06-29T11:03:35Z

DCA run here

aschackmull · 2021-06-29T12:14:46Z

javascript/ql/src/semmle/javascript/security/performance/ExponentialBackTracking.qll

+ *     a suffix `x` (possible empty) that is most likely __not__ accepted.
+ */
+
+import ReDoSUtil


Driveby comment: Make as much as possible private. The import plus most of the classes and predicates all ought to be private.

Good point, thanks.

So this wasn't actually done?

I thought I had done this, but you are right; PR here.
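
For context on the doc comment quoted in the diff above: it describes an attack string in terms of a suffix that is most likely not accepted, which forces the engine to backtrack. The sketch below shows the usual shape of such a string (a prefix, a repeated pump, a rejected suffix). The pattern `^(a+)+$` is a textbook example and `attack_string` is a hypothetical helper; neither is taken from this PR.

```python
import re


def attack_string(prefix: str, pump: str, suffix: str, repetitions: int) -> str:
    """Build prefix + pump * repetitions + suffix (hypothetical helper)."""
    return prefix + pump * repetitions + suffix


# Nested quantifiers: the same run of "a"s can be split up in many ways.
pattern = re.compile(r"^(a+)+$")

# Kept deliberately small; the work grows quickly with the repetition count.
candidate = attack_string("", "a", "!", 12)

rejected = pattern.match(candidate) is None
accepted_without_suffix = pattern.match("a" * 12) is not None
```

The suffix `!` is what makes the match fail, so a backtracking engine tries the alternative ways of splitting the pumped part before giving up.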

RasmusWL

I looked through the first commit and tried to understand the improvements you made to our regex library... I was taken by surprise by how complicated it was for me to understand. Must have been a huge effort to actually make these things 💪

I made a few very detailed comments on that first commit, and have glanced through the other commits and made a few comments along the way.

I think it would have been ideal to have such detailed explanatory comments on the whole regex.qll, but I think we need to evaluate whether that's actually going to be worth the effort (even just for the new parts added). I suggest we talk this over "in-person", once you've had a chance to look through my comments.

I think a few of the non-private member-predicates of the RegexString class have been changed (like qualifiedItem/qualifiedPart in 21007d2), so technically that means if someone is using these, their code will break if we push this update 😞 So I guess we should be following our normal deprecation policy on this as well, although it is a bit burdensome. We should consider whether new versions should be marked with INTERNAL: Do not use., which can allow us some more freedom (if for example, the predicates are only exposed to be able to test things).

config/identical-files.json

python/ql/src/semmle/python/regex.qll

RasmusWL · 2021-06-29T11:44:36Z

python/ql/src/semmle/python/regex.qll

+    result = this.(Unicode).getText()
+  }
+
+  /** result is true for those start chars that actually mark a start of a char set. */


If we only care about the cases where `result = true`, could we just rewrite this predicate to only hold when this is the case? (and then alter usage from `this.char_set_start(pos) = true` ⇒ `this.char_set_start(pos)`)

Hmm, I see this becomes difficult with the if/then/else, since you need to duplicate some of the logic then :| OH wait, can't you just do

if cond() then your_stuff() else any()

to get the same effect?

OH, since we want to properly handle `[[][A-Z]`, we actually need this predicate to hold only for the positions that have the non-escaped char `[`, and want to make use of the `booleanNot` ... although this is a subtle point 😰

Another problem is on line 154, where we recursively invoke result = char_set_start(p2).booleanNot(). Rewriting this to not use booleans would mean rewriting this line as not char_set_start(p2), but then we have introduced negative recursion.

Clearly I forgot to write proper comments here. This follows the same pattern as `predicate escapingChar(int pos) { this.escaping(pos) = true }` and `private boolean escaping(int pos)` to avoid negative recursion. The pattern is used again when scanning through char sets to find character ranges. Both of these instances are a bit more elaborate than `escaping`, and therefore the pattern is likely harder to detect without documentation.

I have added more elaborate comments to this effect now.
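
The boolean sweep discussed in this thread can be modelled outside QL. The sketch below is a Python analogue, not the code from `regex.qll`: it assumes the value at each position is defined for every position and is computed from the value at the previous position by flipping it (the role `booleanNot()` plays in QL), so the recursion never has to ask whether the predicate fails to hold. The function names are hypothetical.

```python
from typing import List


def escaping(text: str) -> List[bool]:
    """For each position, True iff the char there is a backslash that
    escapes the following char (i.e. it is not itself escaped)."""
    flags = []
    previous = False  # value at the virtual position -1
    for ch in text:
        # A backslash escapes the next char unless it was itself escaped.
        current = ch == "\\" and not previous
        flags.append(current)
        previous = current
    return flags


def escaping_chars(text: str) -> List[int]:
    """Positions where escaping(...) is True, mirroring the
    `escaping(pos) = true` wrapper described above."""
    return [pos for pos, flag in enumerate(escaping(text)) if flag]
```

For example, in `r"\[x\\]"` the backslashes at positions 0 and 3 are escaping, while the one at position 4 is itself escaped.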

python/ql/src/semmle/python/regex.qll

python/ql/test/library-tests/regex/charRangeTest.py

python/ql/test/library-tests/regex/charSetTest.py

python/change-notes/2021-07-28-port-RoDoS-queries.md

tausbn

Still chewing through this PR, but I thought I would give some interim comments. (I am happy to see that RasmusWL has addressed some of them already. 🙂)

tausbn · 2021-06-29T12:40:25Z

python/ql/src/semmle/python/regex.qll

+   */
+  predicate charRange(int charset_start, int start, int lower_end, int upper_start, int end) {
+    // mirror logic from `simpleCharacter`
+    exists(int x, int y |


Could we rename `x` and `y` to something a bit more... apt? I had to stare at this for a while to figure out what they were supposed to mean. Perhaps `contents_start` and `contents_end` would be appropriate?

This particular implementation is only for illustrative purposes (there was no such predicate to test). A better implementation is introduced later.

tausbn · 2021-06-29T12:42:45Z

python/ql/src/semmle/python/regex.qll

+   * with lower bound found between `start` and `lower_end`
+   * and upper bound found between `upper_start` and `end`.
+   */
+  predicate charRange(int charset_start, int start, int lower_end, int upper_start, int end) {


I feel like this would be more intuitive if `start` was `lower_start` and `end` was `upper_end`. I realise these are the start and end for the entire range, but somehow I have to think extra hard to see why it makes sense to link together `start` with `lower_end`, etc.

tausbn · 2021-06-29T12:45:21Z

python/ql/test/library-tests/regex/SubstructureTests.ql

+  override string getARelevantTag() { result = "charSet" }
+
+  override predicate hasActualResult(Location location, string element, string tag, string value) {
+    exists(location.getFile().getRelativePath()) and


This is really minor, but it seems odd to me to both require that the test file is not a dependency, and that it also has a specific name. As far as I can tell, these tests shouldn't even require the re module to be imported (during extraction, that is), since we no longer rely on points-to for this.

tausbn · 2021-06-29T12:49:45Z

python/ql/src/semmle/python/regex.qll

+    )
+  }
+
+  /** result denotes if the index is a left bracket */


This QLDoc seems underspecified. My reading is that it holds if position `pos` contains the `index`th char set delimiter, and the result is `true` iff the bracket is a left bracket.

tausbn · 2021-06-29T12:50:58Z

python/ql/src/semmle/python/regex.qll

+
+  /** result denotes if the index is a left bracket */
+  boolean char_set_delimiter(int index, int pos) {
+    pos = rank[index](int p | this.nonEscapedCharAt(p) = "[" or this.nonEscapedCharAt(p) = "]") and


Unless I'm grossly mistaken, rank indexes from 1, so the first bracket (if any) will be at index 1 as well. Is this intended?

(You quite often see rank[index + 1](...) for this reason.)

Intended in the sense that I did not see a reason to change the default indexing...
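
The indexing question can be made concrete with a small Python model of the predicate's intent. It is a sketch only: `bracket_delimiters` is a hypothetical name, and the escape handling is a simplification of what the QL library does. It numbers the non-escaped `[` and `]` characters from 1, matching the default `rank` indexing.

```python
from typing import Dict, Tuple


def bracket_delimiters(text: str) -> Dict[int, Tuple[int, bool]]:
    """Map a 1-based index to (position, is_left_bracket) for every
    non-escaped '[' or ']' in text, in order of position."""
    delimiters = {}
    escaped = False
    index = 0
    for pos, ch in enumerate(text):
        if escaped:
            # This char is escaped by the previous backslash; skip it.
            escaped = False
            continue
        if ch == "\\":
            escaped = True
            continue
        if ch in "[]":
            index += 1  # first delimiter gets index 1, like rank[1]
            delimiters[index] = (pos, ch == "[")
    return delimiters
```

With 1-based numbering, the base case "first bracket in the string" corresponds to index 1, and the predecessor of any later delimiter is `index - 1`.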

tausbn · 2021-06-29T13:00:17Z

python/ql/src/semmle/python/regex.qll

+        index = 1 and result = true // if a '[' is first in the string (among brackets), it starts a char set
+        or
+        index > 1 and
+        not char_set_delimiter(index - 1, _) = false and


I'm a bit curious about this line. Why not just `char_set_delimiter(index - 1, _) = true`? As far as I can tell, `char_set_delimiter` cannot fail for `index - 1` (edit: assuming `index > 1`) if it has already succeeded for `index` (though I may have missed something).

Also, it would be great if each disjunct had a brief comment about which case it is handling.

tausbn · 2021-06-29T13:23:10Z

python/ql/src/semmle/python/regex.qll

+    result = this.(Unicode).getText()
+  }
+
+  /** result is true for those start chars that actually mark a start of a char set. */


Another problem is on line 154, where we recursively invoke result = char_set_start(p2).booleanNot(). Rewriting this to not use booleans would mean rewriting this line as not char_set_start(p2), but then we have introduced negative recursion.

yoff · 2021-06-30T08:11:02Z

DCA run without attempting to use special CLI here.

RasmusWL · 2021-06-30T10:12:21Z

python/ql/src/Security/CWE-730/PolynomialBackTracking.ql

+import python
+import semmle.python.regex.SuperlinearBackTracking
+
+from PolynomialBackTrackingTerm t
+where t.getLocation().getFile().getBaseName() = "KnownCVEs.py"
+select t.getRegex(), t, t.getReason()


This one looks more like a file that should be under test/ and not src/?

It does, yes..

yoff · 2021-06-30T12:52:12Z

DCA run actually running ReDoS here. Performance looks fine. I do not know if the new results are due to duplication, but they look fine.

asgerf

JS changes lgtm

tausbn · 2021-06-30T14:26:03Z

🚢 🚀 💥

yoff added 8 commits June 28, 2021 17:04

Python: inline test of regex components

e5f07cc

- Added naive implementation of `charRange` so the test can run. - Made predicates public as needed.

Python: More precise regex parsing

74ca1d0

Python: track if qualifiers allow unbounded

21007d2

repeats. This in preparation for ReDoS

Python: A parse-tree-view of regular expressions

d953ba8

This contains several contributions from @erik-krogh and also some fixes from @nickrolfe

Python: Make ast viewer see regexes

2c27ce7

This work is due to @erik-krogh who also - made corresponding fixes to `RegexTreeView.qll` - implemented `toUnicode` so it is available on `String`s

JS: Refactor ReDoS to make files sharable

d2eeaff

the extra ordering conditions in ReDoSUtil will be needed for the Python implementation.

Python: Add ReDoS as identical files from JS

591b6ef

The library specific file is `RegExpTreeView`. The files are recorded as identical via the mapping in `identical-files.json`.

yoff requested review from a team as code owners June 28, 2021 15:17

github-actions bot added JS Python labels Jun 28, 2021

Python: add change note

c7992f6

github-actions bot added the documentation label Jun 28, 2021

yoff added 2 commits June 29, 2021 11:01

Python: Apply performance fix by @hvitved

135b71b

Python: undo autoformat character mangling

ffb8938

yoff commented Jun 29, 2021

python/ql/src/semmle/python/RegexTreeView.qll

yoff added 4 commits June 29, 2021 11:14

Python: Give up on providing values for form feeds

6f2cdbf

Python: Limit test files

fbfe415

Python: Adjust test expectations

e778a65

so we can see the light go green. But we should perhaps do something about those duplicate results.

Merge branch 'main' of github.com:github/codeql into python-port-ReDoS

b684434

aschackmull reviewed Jun 29, 2021

RasmusWL requested changes Jun 29, 2021

tausbn requested changes Jun 29, 2021

Apply suggestions from code review

c19522e

Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>

RasmusWL reviewed Jun 30, 2021

yoff added 2 commits June 30, 2021 12:21

Python: Disable use of toUnicode

6dfbf80

until supporting CLI is released

Python: update test expectations

09e71cf

yoff added 4 commits June 30, 2021 12:25

Merge branch 'python-port-ReDoS' of github.com:yoff/codeql into python-port-ReDoS

52d9191

Merge branch 'main' of github.com:github/codeql into python-port-ReDoS

4ca0ee8

Python: Add some comments on the booelan sweep

72986e1

pattern

Python: Avoid multiple results for toString

651f8ab

yoff added 3 commits June 30, 2021 15:03

Python: mimic JS file hierarchy

c306cee

Python: comment out temporarily unused predicate

45e30b0

Python: comment out temporarily unused predicate

a176e6a

tausbn approved these changes Jun 30, 2021

RasmusWL approved these changes Jun 30, 2021

asgerf approved these changes Jun 30, 2021

tausbn merged commit e4af146 into github:main Jun 30, 2021

This was referenced Sep 10, 2021

Python: ReDoS conservative #6038

Closed

Python: add regex parser #5866

Closed
