JS: add query js/regex/missing-regexp-anchor #1387

ghost · 2019-05-31T10:24:59Z

This adds another query for a common regular expression mistake, this time for missing anchors. Similar to js/incomplete-hostname-regexp, the query identifies suspicious regular expression patterns, but it does not check that the patterns are used in a security setting. Both queries use the refactored RegExpPatternSource, and a bunch of string heuristics.
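To make the mistake concrete, here is a minimal sketch of the kind of pattern the query is meant to catch. The host names and the allow-list scenario are made up for illustration and are not taken from the query's tests.

```javascript
// Hypothetical allow-list check: only URLs on www.example.com should pass.
const unanchored = /https?:\/\/www\.example\.com/;    // no ^ anchor
const anchored = /^https?:\/\/www\.example\.com\//;   // anchored at the start

// The trusted host appears only inside the query string of another host.
const attack = "http://evil.test/?redirect=http://www.example.com/";
const benign = "https://www.example.com/index.html";

const results = {
  unanchoredAcceptsAttack: unanchored.test(attack), // true: matches anywhere in the string
  anchoredAcceptsAttack: anchored.test(attack),     // false: must match from the start
  anchoredAcceptsBenign: anchored.test(benign),     // true
};
console.log(results);
```

Without the anchor, the pattern only requires the trusted URL to occur somewhere in the input, which is rarely what a check like this intends.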

An earlier iteration of the query used the RegExp AST classes, but I am happier with the string heuristic implementation as it is reasonably simple, requires less QL code, and flags more results (due to support for string literal results). For the record, using the RegExp AST classes during prototyping is very useful for homing in on the syntactic cases we really care about.

The results have been a mixed bag, and required a bunch of iterations, but I think they are quite good now. Unfortunately, most results are not security relevant. If we split the query into one that flags URL patterns, and one for non-URL patterns, then the flagged URL patterns will be mostly security relevant. I think we should run the query on LGTM, but not show it by default for a while, and then decide if we want to split, sharpen or display the query.

Running js/incomplete-hostname-regexp with the refactored RegExpPatternSource shows no performance change.

The performance for the new query is excellent. Running the query next to js/incomplete-hostname-regexp appears to have no measurable overhead. (In fact there seemed to be a speedup, and I tried to reproduce the numbers. Update: the speedup is not reproducible, but there is still no overhead.)

I would like this to make it to 1.21, but it will probably not make it through the doc review today.

ghost · 2019-05-31T11:06:17Z

Some results for the curious.

asger-semmle

Looking good! Just have a few nit picks. Agree that we should get this into 1.21.

Do you have a link to an evaluation?

asger-semmle · 2019-05-31T11:29:07Z

javascript/ql/src/semmle/javascript/Regexp.qll

+  string pattern;
+
+  RegExpLiteralPatternSource() {
+    exists(string raw | raw = asExpr().(RegExpLiteral).getRoot().toString() |


As a drive-by change, could you update the ql doc for RegExpTerm.toString to indicate that this is always the full text (not truncated) or make a separate predicate for getting the raw source text?

As @xiemaisi has pointed out a few times, toString doesn't generally guarantee anything about its output.

asger-semmle · 2019-05-31T11:37:03Z

javascript/ql/src/Security/CWE-020/MissingRegExpAnchor.ql

+predicate isAnInterestingSemiAnchoredRegExpString(RegExpPatternSource src, string msg) {
+  exists(string str, string maybeGroupedStr, string regex, string anchorPart, string posString, string escapedDot |
+    // a dot that might be escaped in a regular expression, for example `/\./` or new `RegExp('\\.')`
+    escapedDot = "\\\\\\\\?[.]" and


Oh dear, that's a lot of backslashes 😕

This matches a dot optionally preceded by a backslash? In the str below, there is also a case for . on the right-hand side, though, which seems to suggest the question mark isn't needed here?

No, this matches a dot preceded by one or two backslashes. The programmer will have used one backslash if this pattern is in a regular expression literal, and two backslashes if this pattern is in a string literal (see examples in the comment). I could have dispatched on the subclass of RegExpPatternSource, but that seems to be overkill.

It turns out that \\\\[.] is the right number of backslashes; this works for both /\./ and new RegExp('\\.'). I have added a bunch of tests.
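A small JavaScript sketch of why a single pattern can cover both forms: the regular expression literal and the string literal end up with the same pattern text, a backslash followed by a dot. The variable names are illustrative, and writing the heuristic as a JavaScript string is an assumption made for this example; the query itself is written in QL.

```javascript
// Both spellings produce the same pattern text: a backslash followed by a dot.
const fromLiteral = /\./.source;              // regular expression literal
const fromString = new RegExp('\\.').source;  // string literal, backslash doubled in source code

// The heuristic "\\\\[.]" written as a JavaScript string: after string
// escaping it is the regex \\[.], i.e. one literal backslash and then a dot.
const escapedDot = new RegExp('\\\\[.]');

console.log(fromLiteral === fromString);   // true
console.log(escapedDot.test(fromLiteral)); // true
console.log(escapedDot.test('.'));         // false: a bare dot is not an escaped dot
```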

javascript/ql/src/Security/CWE-020/MissingRegExpAnchor.ql

asger-semmle · 2019-05-31T12:02:24Z

javascript/ql/src/Security/CWE-020/MissingRegExpAnchor.ql

+    anchorPart = src.getPattern().regexpCapture(regex, 1) and
+    anchorPart.regexpMatch("(?i).*[a-z].*") and
+    msg = "The alternative '" + anchorPart + "' uses an anchor to match from the " + posString +
+        " of a string, but the other alternatives of this regular expression do not use anchors."


I found this message to be quite long. If I might suggest another way to phrase it:

Misleading operator precedence. The subexpression '^http' is anchored, but the other parts of this regular expression are not.
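The precedence issue behind this message can be sketched in JavaScript as follows. The patterns and inputs are hypothetical examples, not results from the evaluation.

```javascript
// `|` binds more loosely than `^`, so the anchor applies to the first alternative only.
const misleading = /^www\.example\.com|beta\.example\.com/;
// Grouping the alternatives makes the anchor apply to both.
const grouped = /^(www\.example\.com|beta\.example\.com)/;

const input = "evil.test/beta.example.com";

const precedence = {
  misleadingMatches: misleading.test(input),             // true: second alternative is unanchored
  groupedMatches: grouped.test(input),                   // false
  groupedMatchesBeta: grouped.test("beta.example.com/"), // true
};
console.log(precedence);
```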

asger-semmle · 2019-05-31T12:29:33Z

Opened https://jira.semmle.com/browse/ODASA-7961 with some ideas for future improvements that are out of scope for this round

ghost · 2019-05-31T13:03:47Z

Performance evaluation links.

js/incomplete-hostname-regexp vs js/incomplete-hostname-regexp+js/regex/missing-regexp-anchor: irreproducible speedups
- small reproduction attempt at big-apps.slugs, the performance is now just unchanged if comparisons are made for the big-apps.slugs entries

For results, I refer to https://lgtm.com/query/3732291658122574474/ which has the most up to date whitelist.

(I have started a full security evaluation)

This preserves the ad hoc message formatting in IncompleteHostnameRegExp.ql

ghost · 2019-06-03T06:35:32Z

Evaluation: https://git.semmle.com/esben/dist-compare-reports/tree/js/unanchored-url-regex_1559437972495

xiemaisi

LGTM on the whole, a few minor niggles.

xiemaisi · 2019-06-03T11:18:05Z

javascript/ql/src/semmle/javascript/Regexp.qll

+ * A node whose string value may flow to a position where it is interpreted
+ * as a part of a regular expression.
+ */
+class StringRegExpPatternSource extends RegExpPatternSource {


Could/should this class be private?

It can't be private due to the use of class checks and getAUse in two places for whitelisting and alert presentation:

js/regex/missing-regexp-anchor:

src.getARegExpObject().flowsTo(arg) or src.(StringRegExpPatternSource).getAUse() = arg

js/incomplete-hostname-regexp:

if re instanceof StringRegExpPatternSource
then (
  kind = "string, which is used as a regular expression $@," and
  aux = re.(StringRegExpPatternSource).getAUse()
) else (
  kind = "regular expression" and aux = re
)

xiemaisi · 2019-06-03T11:18:27Z

javascript/ql/src/semmle/javascript/Regexp.qll

+/**
+ * A regular expression literal, viewed as the pattern source for itself.
+ */
+class RegExpLiteralPatternSource extends RegExpPatternSource {


Could/should this class be private?

For consistency with the public StringRegExpPatternSource, I think this should remain public.

Hm, fine. I don't like having all these awkwardly named classes in the global namespace.

I can hide them in a module BuiltinRegExpPatternSources { class RegExpLiteralPatternSource ... }, that is half the awkwardness in the global namespace.

No, I don't think that would be better. Could the above two uses of StringRegExpPatternSource be encapsulated as member predicates of RegExpPatternSource?

I have introduced getAParse instead of getAUse; that predicate name also makes sense for regular expression literals, which would otherwise have to be flagged at all their usages. As an added bonus, the message in js/incomplete-hostname-regexp will no longer include a trivial link for cases where the string literal is already at a location where it is used as a regular expression:

That is, the message for:

new RegExp(`test.example.com$`); // NOT OK

is now:

This regular expression has an unescaped '.' before 'example.com', so it might match more hosts than expected.
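For readers unfamiliar with the underlying problem, here is a short sketch of what the unescaped '.' allows. The input host names are hypothetical.

```javascript
// Pattern from the example above: the dots are unescaped, so each matches any character.
const incomplete = new RegExp('test.example.com$');
// A stricter variant (illustrative): dots escaped and both ends anchored.
const strict = new RegExp('^test\\.example\\.com$');

const hosts = {
  incompleteAcceptsLookalike: incomplete.test('test-example.com'), // true: '.' matches '-'
  strictAcceptsLookalike: strict.test('test-example.com'),         // false
  strictAcceptsIntended: strict.test('test.example.com'),          // true
};
console.log(hosts);
```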

xiemaisi · 2019-06-03T11:19:30Z

javascript/ql/src/semmle/javascript/Regexp.qll

+    t2 = t.smallstep(result, succ)
+    or
+    any(TaintTracking::AdditionalTaintStep dts).step(result, succ) and
+    t = t2


t = t2.continue() would be slightly more correct, I believe (though I doubt there is any practical difference).

This is an unchanged move from #1211.

The tests in that PR show a semantic difference.
The following NOT OK line is not flagged if t = t2.continue() is used.

let domains = [ { hostname: 'test.example.com$' } ]; // NOT OK
function convert2(domain) {
  return new RegExp(domain.hostname);
}
domains.map(d => convert2(d));

Oh, right, forgot about this one.

xiemaisi · 2019-06-03T11:19:53Z

javascript/ql/src/semmle/javascript/Regexp.qll

+ */
+abstract class RegExpPatternSource extends DataFlow::Node {
+  /**
+   * Gets the pattern of this node.


It's not really clear what "the pattern" means here; could you please expand the comment?

xiemaisi · 2019-06-03T11:24:18Z

javascript/ql/src/semmle/javascript/Regexp.qll

+  StringRegExpPatternSource() { this = regExpSource(use) }
+
+  /**
+   * Gets a node that use this source as a regular expression pattern.


Suggested change

* Gets a node that use this source as a regular expression pattern.

* Gets a node that uses this source as a regular expression pattern.

xiemaisi · 2019-06-03T11:25:31Z

javascript/ql/src/semmle/javascript/Regexp.qll

+}
+
+/**
+ * Gets a node whose value may flow (inter-procedurally) to a position where it is interpreted


This doc comment should explain the role of re.

ghost · 2019-06-03T12:29:24Z

All comments addressed.

xiemaisi

Nice! LGTM now.

Ping @mc for doc review.

mc · 2019-06-03T13:03:52Z

Why me?

xiemaisi · 2019-06-03T13:12:13Z

Oops, sorry, muscle memory.

mchammer01

@esben-semmle - this LGTM
I've made a couple of minor inline comments for your consideration.

mchammer01 · 2019-06-03T13:57:03Z

javascript/ql/src/Security/CWE-020/MissingRegExpAnchor.qhelp

+		<p>
+
+			Sanitizing untrusted input with regular expressions is a
+			common technique.  However, it is error prone to match untrusted input


error-prone

mchammer01 · 2019-06-03T13:59:16Z

javascript/ql/src/Security/CWE-020/MissingRegExpAnchor.qhelp

+		<p>
+
+			Even if the matching is not done in a security-critical
+			context, it may still cause undesirable behaviors when the regular


Suggestion:
...undesirable behavior (I'd use the singular form here) when the regular expression accidentally matches (I'd put the adverb before the verb)...

ghost · 2019-06-03T14:40:32Z

All comments addressed

xiemaisi · 2019-06-03T14:48:51Z

PR checks are failing due to rogue tabs in the query examples.

mchammer01

Thanks for the doc updates @esben-semmle 👍

ghost added the JS label May 31, 2019

ghost added this to the 1.21.0 milestone May 31, 2019

ghost self-requested a review as a code owner May 31, 2019 10:24

asger-semmle self-assigned this May 31, 2019

asger-semmle reviewed May 31, 2019


Esben Sparre Andreasen added 7 commits June 3, 2019 08:27

JS: refactor IncompleteHostnameRegExp::regexp to RegExp.qll

98ae259

JS: refactor the predicate RegExp::regexp to three classes.

3358e49

This preserves the ad hoc message formatting in IncompleteHostnameRegExp.ql

JS: add anchors to js/incomplete-hostname-regexp examples

69db54a

JS: add query js/regex/missing-regexp-anchor

0fa73b8

JS: address minor review comments

3289c62

JS: improve tests and regexp for js/regex/missing-regexp-anchor

7018a38

JS: fix comment typo

1464427

xiemaisi suggested changes Jun 3, 2019


Esben Sparre Andreasen added 2 commits June 3, 2019 13:59

JS: address docstring comments

7b65221

JS: add RegExpPatternSource::getAParse to hide the subclasses

bf51c54

xiemaisi previously approved these changes Jun 3, 2019


xiemaisi assigned mchammer01 Jun 3, 2019

mchammer01 reviewed Jun 3, 2019


JS: address qhelp review comments

9e0a97e

ghost dismissed xiemaisi’s stale review via 9e0a97e June 3, 2019 14:39

JS: format qhelp examples

04868e5

mchammer01 approved these changes Jun 3, 2019


xiemaisi approved these changes Jun 3, 2019


semmle-qlci merged commit 80ff63a into github:master Jun 3, 2019

kamarcum unassigned mchammer01 Apr 28, 2020
