Python: ReDoS conservative #6038

yoff · 2021-06-08T09:43:29Z

Just enough to get going...

- expose several predicates
- better detection of character sets
- predicate to detect character ranges
- better detection of non-escaped characters
- better detection of group end and group start
- individual predicates for negative lookahead and lookbehind
- add boolean `may_repeat_forever` to `qualifier`
- detect upper and lower bounds in repetition ranges

Ideally this will be broken up into individual commits, all illustrated with simple tests...

- exclude verbose mode regexes
- correct value for (common) escaped characters
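To make the `qualifier` items in the list concrete, here is a minimal Python sketch of reading a repetition suffix and deriving bounds plus two booleans. It is not the QL from this PR: the names `Quantifier` and `parse_quantifier` are hypothetical, and reading `maybe_empty` as "lower bound is 0" and `may_repeat_forever` as "no upper bound" is an assumption based only on the names above.

```python
import re
from typing import NamedTuple, Optional


class Quantifier(NamedTuple):
    lower: int
    upper: Optional[int]       # None stands for "no upper bound"
    maybe_empty: bool          # assumed meaning: zero repetitions allowed
    may_repeat_forever: bool   # assumed meaning: no upper bound


# Matches {n}, {n,}, {n,m} and {,m}; a simplification of what `re` accepts.
_RANGE = re.compile(r"\{(\d*)(,?)(\d*)\}")

_SIMPLE = {"*": (0, None), "+": (1, None), "?": (0, 1)}


def parse_quantifier(text: str) -> Quantifier:
    """Parse a quantifier such as '*', '+', '?', '{2,5}' into its bounds."""
    if text in _SIMPLE:
        lower, upper = _SIMPLE[text]
    else:
        m = _RANGE.fullmatch(text)
        if m is None or not (m.group(1) or m.group(2)):
            raise ValueError(f"not a quantifier: {text!r}")
        lo, comma, hi = m.groups()
        lower = int(lo) if lo else 0
        if comma:
            upper = int(hi) if hi else None
        else:
            upper = lower  # {n} means exactly n repetitions
    return Quantifier(lower, upper, lower == 0, upper is None)
```

Under these assumptions `parse_quantifier("{2,}")` reports an unbounded repetition that cannot be empty, which is the kind of quantifier a ReDoS analysis has to look at.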

nickrolfe · 2021-06-11T10:40:05Z

I have a suggestion for parsing backreferences correctly and not as normal characters.

diff --git a/python/ql/src/semmle/python/regex.qll b/python/ql/src/semmle/python/regex.qll
index e35a373016..36abe17424 100644
--- a/python/ql/src/semmle/python/regex.qll
+++ b/python/ql/src/semmle/python/regex.qll
@@ -368,7 +368,8 @@ abstract class RegexString extends Expr {
       or
       this.escapedCharacter(start, end)
     ) and
-    not exists(int x, int y | this.group_start(x, y) and x <= start and y >= end)
+    not exists(int x, int y | this.group_start(x, y) and x <= start and y >= end) and
+    not exists(int x, int y | this.backreference(x, y) and x <= start and y >= end)
   }
 
   predicate normalCharacter(int start, int end) {
@@ -650,6 +651,8 @@ abstract class RegexString extends Expr {
     this.group(start, end)
     or
     this.charSet(start, end)
+    or
+    this.backreference(start, end)
   }
 
   private predicate qualifier(int start, int end, boolean maybe_empty, boolean may_repeat_forever) {
@@ -748,7 +751,8 @@ abstract class RegexString extends Expr {
   private predicate item_start(int start) {
     this.character(start, _) or
     this.isGroupStart(start) or
-    this.charSet(start, _)
+    this.charSet(start, _) or
+    this.backreference(start, _)
   }
 
   private predicate item_end(int end) {
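For context on why the diff treats a backreference as its own item: in Python's `re`, `\1` after a capturing group repeats whatever the group matched and is not the literal character `1`. A small sketch (the helper name `fully_matches` is made up for this example):

```python
import re


def fully_matches(pattern: str, text: str) -> bool:
    """Return True if the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None


# \1 repeats the text captured by group 1, so the input must be "X" + "X".
print(fully_matches(r"(a+)\1", "aaaa"))   # group "aa" followed by "aa"
print(fully_matches(r"(a+)\1", "a1"))     # the digit 1 is not what \1 means

# Without the backslash, "1" is an ordinary character.
print(fully_matches(r"(a+)1", "a1"))

# Named groups have their own backreference syntax.
print(fully_matches(r"(?P<x>ab)(?P=x)", "abab"))
```

A parser that reads `\1` as an escaped normal character would model the second pattern above instead of the first.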

nickrolfe · 2021-06-11T11:03:28Z

Do I understand correctly that this doesn't attempt to strip whitespace and comments in VERBOSE regexes? (Ruby has a similar mechanism with the x flag).

yoff · 2021-06-14T08:45:23Z

> Do I understand correctly that this doesn't attempt to strip whitespace and comments in VERBOSE regexes? (Ruby has a similar mechanism with the x flag).

Yes, I believe no such attempt is made. In fact we currently exclude regexes in verbose mode, which is of course not desirable.
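For reference, this is roughly what stripping would involve. The sketch below is hypothetical and not part of this PR: `strip_verbose` is an invented helper that drops unescaped whitespace and `#` comments outside character classes, which is a simplification of what `re.VERBOSE` does.

```python
import re


def strip_verbose(pattern: str) -> str:
    """Simplified removal of verbose-mode whitespace and comments."""
    out = []
    i = 0
    in_class = False
    n = len(pattern)
    while i < n:
        ch = pattern[i]
        if ch == "\\" and i + 1 < n:
            out.append(pattern[i:i + 2])   # keep escapes, including "\ "
            i += 2
        elif in_class:
            out.append(ch)                 # everything in [...] is literal
            in_class = ch != "]"
            i += 1
        elif ch == "[":
            in_class = True
            out.append(ch)
            i += 1
            # A "^" and then a "]" directly after "[" belong to the class.
            for special in "^]":
                if i < n and pattern[i] == special:
                    out.append(special)
                    i += 1
        elif ch == "#":
            while i < n and pattern[i] != "\n":
                i += 1                     # skip the comment up to end of line
        elif ch.isspace():
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


VERBOSE_SOURCE = r"""
    \d+   # integer part
    \.    # literal dot
    \d*   # fraction
"""

print(strip_verbose(VERBOSE_SOURCE))
```

Both `re.compile(VERBOSE_SOURCE, re.VERBOSE)` and `re.compile(strip_verbose(VERBOSE_SOURCE))` accept `"3.14"`, whereas a parser that reads the verbose source character by character would treat the spaces and comment text as literals.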

yoff · 2021-06-14T08:47:40Z

Your suggestion regarding back references looks reasonable. In fact I found the whole character/normalchar code a little fuzzy...

erik-krogh · 2021-06-14T10:10:51Z

If we get a toUnicode() method, then here is how that could be used to implement getValue() for escaped unicode chars.

erik-krogh · 2021-06-22T09:26:45Z

> If we get a toUnicode() method, then here is how that could be used to implement getValue() for escaped unicode chars.

I've pushed the unicode parsing to this PR.
So you now need a freshly built CLI to run the code in this PR.
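As an illustration of the intended result (not the QL that was pushed), the value of an escaped code point is the character with that hexadecimal number. A hedged Python sketch, where `unicode_escape_value` is a hypothetical stand-in for a `getValue()` restricted to `\xhh`, `\uhhhh` and `\Uhhhhhhhh`:

```python
import re
import string

# Number of hex digits each escape kind takes in a Python str pattern.
_WIDTHS = {"x": 2, "u": 4, "U": 8}


def unicode_escape_value(escape: str) -> str:
    """Return the single character denoted by a hex or unicode escape."""
    if len(escape) < 2 or escape[0] != "\\" or escape[1] not in _WIDTHS:
        raise ValueError(f"not a hex/unicode escape: {escape!r}")
    digits = escape[2:]
    if len(digits) != _WIDTHS[escape[1]] or not all(
        c in string.hexdigits for c in digits
    ):
        raise ValueError(f"malformed escape: {escape!r}")
    return chr(int(digits, 16))


# The regex engine agrees: the escape matches exactly that character.
value = unicode_escape_value(r"\u00e9")
print(value, re.fullmatch(r"\u00e9", value) is not None)
```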

nickrolfe · 2021-06-23T14:09:10Z

Here's what I believe is a false positive:

r"\A(?:\w|\w-\w|\n|\t)+\z"
    ^-----------------^

This part of the regular expression may cause exponential backtracking on strings starting with 'A' and containing many repetitions of 't'.

I guess what's happening is that it thinks t will match \t as well as \w. It's also suggesting the prefix A, which I think means it's confused about \A.
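The claim can be checked directly against Python's `re`. In the sketch below the end anchor is written `\Z` rather than `\z`; that substitution is an assumption made so the example uses the end anchor documented for `re`. The helper `matching_escapes` is invented for this check.

```python
import re

# Same shape as the reported regex, with \Z swapped in for \z (assumption).
PATTERN = re.compile(r"\A(?:\w|\w-\w|\n|\t)+\Z")


def matching_escapes(ch: str) -> list:
    """List which of the escapes in the alternation match the character."""
    candidates = [r"\w", r"\n", r"\t"]
    return [esc for esc in candidates if re.fullmatch(esc, ch)]


print(matching_escapes("t"))    # only \w: the letter t is not a tab
print(matching_escapes("\t"))   # only \t
print(re.match(r"\A", "xyz").end())  # \A consumes nothing

# The suggested attack string is rejected without ambiguity between
# alternatives, since each 't' can only be consumed by \w.
print(PATTERN.fullmatch("A" + "t" * 40 + "!"))
```

So for the letter `t` only one alternative applies, and `\A` is a zero-width anchor rather than the character `A`.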

yoff · 2021-06-25T10:45:52Z

> Here's what I believe is a false positive:
>
> r"\A(?:\w|\w-\w|\n|\t)+\z"
>     ^-----------------^
>
> This part of the regular expression may cause exponential backtracking on strings starting with 'A' and containing many repetitions of 't'.
>
> I guess what's happening is that it thinks t will match \t as well as \w. It's also suggesting the prefix A, which I think means it's confused about \A.

I think you are right. Some escapes stand for themselves while others do not. It seems we currently put too many in the first group (everything except \n and \r):

  override string getValue() {
    this.isIdentityEscape() and result = this.getUnescaped()
    or
    this.getUnescaped() = "n" and result = "\n"
    or
    this.getUnescaped() = "r" and result = "\r"
    or
    isUnicode() and
    result = getUnicode()
  }

  predicate isIdentityEscape() { not this.getUnescaped() in ["n", "r"] }
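One possible narrower split, sketched in Python rather than QL and not taken from this PR: control escapes map to their control character, escapes made of a backslash plus an ASCII letter or digit are not treated as a single literal character (classes, anchors, backreferences, hex/unicode escapes), and only the remaining escapes stand for themselves. `escape_value` is a hypothetical counterpart of `getValue()`, with its argument playing the role of `getUnescaped()`.

```python
import re
import string

# Escapes that denote one specific control character.
CONTROL_ESCAPES = {
    "a": "\a", "f": "\f", "n": "\n", "r": "\r", "t": "\t", "v": "\v",
}

# Backslash + ASCII letter/digit is never an identity escape in this sketch:
# \w, \d, \s are classes, \A, \Z, \b are assertions, \1 is a backreference.
_ASCII_ALNUM = set(string.ascii_letters + string.digits)


def escape_value(unescaped: str):
    """Return the character an escape stands for, or None if it has none."""
    if unescaped in CONTROL_ESCAPES:
        return CONTROL_ESCAPES[unescaped]
    if unescaped in _ASCII_ALNUM:
        return None
    return unescaped  # e.g. \. \* \( stand for themselves


for letter in ["t", ".", "A", "w"]:
    print(letter, repr(escape_value(letter)))
```

With this split `\t` yields a tab and `\A` yields no character at all, which would avoid both confusions in the false positive discussed here. Hex and unicode escapes would still need separate handling.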

yoff · 2021-09-10T12:12:42Z

Superseded by #6175.

yoff added 3 commits June 8, 2021 11:14

Python: Add tree view

7ee63d1

Python: Add ReDoS query and utility library

f8764df

github-actions bot added JS Python labels Jun 8, 2021

yoff and others added 7 commits June 8, 2021 11:47

Python: Add the correct utility library

afe216e

implement printAst for python regexps

2e4000d

Python: fixes suggested by lgtm run

e9549eb

- exclude verbose mode regexes
- correct value for (common) escaped characters

Merge branch 'python-ReDoS-conservative' of github.com:yoff/codeql in…

451e330

…to python-ReDoS-conservative

Python: Restore javascript file

0a6880d

that should have never been touched

Python: fix rawValue (thanks @erik-krogh)

696b808

Python: remove \b and \B

8875a17

from `RegExpCharacterClassEscape`

erik-krogh added 2 commits June 13, 2021 23:54

fix printAst for unicode/strconst strings

0d15ca7

remove RegExpLiteral::toString to avoid RegExpTerm having multiple …

cfcf429

…results for `toString`

erik-krogh and others added 11 commits June 15, 2021 09:22

remove debug code

dd75567

Python: add test with known CVEs

fc9a709

Python: deduplicate tests

fd851a3

Python: supply extra argument to qualifiedItem

3d17fa6

Python: remove test predicate

5d49101

Python: also test may_repeat_forever

cea0bf8

explicitly keep track on the number of steps in the prefix computation

3660a58

Merge branch 'python-ReDoS-conservative' of github.com:yoff/codeql in…

cd63099

…to python-ReDoS-conservative

better fix for prefix generation

44b9ad0

sync prefix fix to JavaScript

020a608

Merge branch 'python-ReDoS-conservative' of github.com:yoff/codeql in…

af38539

…to python-ReDoS-conservative

Python: Add regex we used to misparse

24be556

erik-krogh mentioned this pull request Jun 22, 2021

JS: add CWE-1333 to the JS ReDoS queries #6129

Merged

implement getValue() for escaped unicode chars

abcb2c3

yoff added 2 commits June 22, 2021 15:57

Python: add character range tests and fix bugs

29c15e1

Merge branch 'python-ReDoS-conservative' of github.com:yoff/codeql in…

ed2ebb2

…to python-ReDoS-conservative

yoff added 11 commits June 24, 2021 13:39

Python: last batch of unittest for now

673b1e1

Python: Test for conformity with js implementation

324f80a

The .expected-files are generated by running the same queries against `tst.js` and converting the results. I am not sure if we want to keep these. The tests for ReDoS results could at least be expressed inline.

Python: Do not use of String::toUnicode

a506b3c

until it is shipped.

Python: Rename directory

6cf0a9f

Python: rename directory

4b3e667

Python: Rename file

8fb86d7

Python: remove test predicate

7c31bc9

Python: remember to update refs

955f1b2

Python: refactor to allow sharing

e1465aa

Python/JS: Sync ReDoS files

70b93ab

Python: remember to update refs

7c7b199

yoff added 4 commits June 25, 2021 13:14

Python/JS: fix qldoc and ref

810e426

Python/JS: Fix qldoc

8293bbf

Python: Add missing qldocs

208268a

and note apparent bugs found while reading through the code..

Python: Try turning on toUnicode

59e0f8b

see if the tests pass..

Marcono1234 mentioned this pull request Jun 26, 2021

Java: Add CharacterLiteral.getIntValue #3635

Closed

yoff added 3 commits June 26, 2021 10:06

Python: fix qldoc

12a92ef

Python: handle a few more escapes

3b0bed7

Python: adjust test expectations.

42611c7

This commit now records the differences between the Python and the JavaScript parsing of regular expressions. There might be a better way to test conformity than this...

yoff mentioned this pull request Jun 28, 2021

Python: port ReDoS queries from Javascript #6175

Merged

yoff closed this Sep 10, 2021

3 participants
