Skip to content

Conversation

@yoff
Copy link
Contributor

@yoff yoff commented Jun 8, 2021

Just enough to get going...

yoff added 3 commits June 8, 2021 11:14
- expose everal predicates
- better detection of character sets
- predicate to detect character ranges
- better detection of non-escaped characters
- better detection of group end and group start
- individual predicates for negative lookahead and looknehind
- add boolean `may_repeat_forever` to `qualifier`
- detect upper and lower bounds in repetition ranges
Ideally this willbe broken up into individual commits,
all illustrated with simple tests...
@nickrolfe
Copy link
Contributor

nickrolfe commented Jun 11, 2021

I have a suggestion for parsing backreferences correctly and not as normal characters.

diff --git a/python/ql/src/semmle/python/regex.qll b/python/ql/src/semmle/python/regex.qll
index e35a373016..36abe17424 100644
--- a/python/ql/src/semmle/python/regex.qll
+++ b/python/ql/src/semmle/python/regex.qll
@@ -368,7 +368,8 @@ abstract class RegexString extends Expr {
       or
       this.escapedCharacter(start, end)
     ) and
-    not exists(int x, int y | this.group_start(x, y) and x <= start and y >= end)
+    not exists(int x, int y | this.group_start(x, y) and x <= start and y >= end) and
+    not exists(int x, int y | this.backreference(x, y) and x <= start and y >= end)
   }
 
   predicate normalCharacter(int start, int end) {
@@ -650,6 +651,8 @@ abstract class RegexString extends Expr {
     this.group(start, end)
     or
     this.charSet(start, end)
+    or
+    this.backreference(start, end)
   }
 
   private predicate qualifier(int start, int end, boolean maybe_empty, boolean may_repeat_forever) {
@@ -748,7 +751,8 @@ abstract class RegexString extends Expr {
   private predicate item_start(int start) {
     this.character(start, _) or
     this.isGroupStart(start) or
-    this.charSet(start, _)
+    this.charSet(start, _) or
+    this.backreference(start, _)
   }
 
   private predicate item_end(int end) {

@nickrolfe
Copy link
Contributor

Do I understand correctly that this doesn't attempt to strip whitespace and comments in VERBOSE regexes? (Ruby has a similar mechanism with the x flag).

@yoff
Copy link
Contributor Author

yoff commented Jun 14, 2021

Do I understand correctly that this doesn't attempt to strip whitespace and comments in VERBOSE regexes? (Ruby has a similar mechanism with the x flag).

Yes, I believe no such attempt is made. In fact we currently exclude regexes in verbose mode, which is of course not desirable.

@yoff
Copy link
Contributor Author

yoff commented Jun 14, 2021

Your suggestion regarding back references looks reasonable. In fact I found the whole character/normalchar code a little fuzzy...

@erik-krogh
Copy link
Contributor

erik-krogh commented Jun 14, 2021

If we get a toUnicode() method, then here how that could be used to implement getValue() for escaped unicode chars.

@erik-krogh
Copy link
Contributor

If we get a toUnicode() method, then here how that could be used to implement getValue() for escaped unicode chars.

I've pushed the unicode parsing to this PR.
So you now need a freshly build CLI to run the code in this PR.

@nickrolfe
Copy link
Contributor

nickrolfe commented Jun 23, 2021

Here's what I believe is a false positive:

r"\A(?:\w|\w-\w|\n|\t)+\z"
    ^-----------------^

This part of the regular expression may cause exponential backtracking on strings starting with 'A' and containing many repetitions of 't'.

I guess what's happening is that it thinks t will match \t as well as \w. It's also suggesting the prefix A, which I think means it's confused about \A.

@yoff
Copy link
Contributor Author

yoff commented Jun 25, 2021

Here's what I believe is a false positive:

r"\A(?:\w|\w-\w|\n|\t)+\z"
    ^-----------------^

This part of the regular expression may cause exponential backtracking on strings starting with 'A' and containing many repetitions of 't'.

I guess what's happening is that it thinks t will match \t as well as \w. It's also suggesting the prefix A, which I think means it's confused about \A.

I think you are right. Some escapes code for themselves while others do not. We currently have too many in the first group, it seems (everyone except \n and \r):

  override string getValue() {
    this.isIdentityEscape() and result = this.getUnescaped()
    or
    this.getUnescaped() = "n" and result = "\n"
    or
    this.getUnescaped() = "r" and result = "\r"
    or
    isUnicode() and
    result = getUnicode()
  }

  predicate isIdentityEscape() { not this.getUnescaped() in ["n", "r"] }

yoff added 4 commits June 25, 2021 13:14
and note aparent bugs found while reading through the code..
see if the tests pass..
yoff added 3 commits June 26, 2021 10:06
This commit now records the differences between the Python and
the Javascript parsing of regular expressions.
There might be a better way to test conformity than this...
@yoff
Copy link
Contributor Author

yoff commented Sep 10, 2021

Superseded by #6175.

@yoff yoff closed this Sep 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants