Skip to content

Conversation

@ghost
Copy link

@ghost ghost commented May 31, 2019

This adds another query for a common regular expression mistake, this time for missing anchors. Similar to js/incomplete-hostname-regexp, the query is identifies suspicious regular expression patterns, but it does not check that the patterns are used in a security setting. Both queries use the refactored RegExpPatternSource, and a bunch of string heuristics.

An earlier iteration of the query used the RegExp AST classes, but I am happier with the string heuristic implementation as it is reasonably simple, requires less QL code, and flags more results (due to support for string literal results). For the record, using the RegExp AST classes during prototyping is very useful for homing in on the syntactic cases we really care about.

The results have been a mixed bag, and required a bunch of iterations, but I think they are quite good now. Unfortunately, most results are not security relevant. If we split the query into one that flags URL patterns, and one for non-URL patterns, then the flagged URL patterns will be mostly security relevant. I think we should run the query on LGTM, but not show it by default for a while, and then decide if we want to split, sharpen or display the query.

Running js/incomplete-hostname-regexp with the refactored RegExpPatternSource shows no performance change.

The performance for the new query is excellent. Running the query next to js/incomplete-hostname-regexp appears to have no measurable overhead (in fact there seems to be a speedup, I am trying to reproduce the numbers. Update: the speedup is not reproducible, but there still is no overhead).

I would like this to make it to 1.21, but it will probably not make it through the doc review today.

@ghost ghost added the JS label May 31, 2019
@ghost ghost added this to the 1.21.0 milestone May 31, 2019
@ghost ghost self-requested a review as a code owner May 31, 2019 10:24
@ghost
Copy link
Author

ghost commented May 31, 2019

Some results for the curious.

@asger-semmle asger-semmle self-assigned this May 31, 2019
Copy link
Contributor

@asger-semmle asger-semmle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! Just have a few nit picks. Agree that we should get this into 1.21.

Do you have a link to an evaluation?

string pattern;

RegExpLiteralPatternSource() {
exists(string raw | raw = asExpr().(RegExpLiteral).getRoot().toString() |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a drive-by change, could you update the ql doc for RegExpTerm.toString to indicate that this is always the full text (not truncated) or make a separate predicate for getting the raw source text?

As @xiemaisi has pointed out a few times toString doesn't generally guarantee anything about its output.

predicate isAnInterestingSemiAnchoredRegExpString(RegExpPatternSource src, string msg) {
exists(string str, string maybeGroupedStr, string regex, string anchorPart, string posString, string escapedDot |
// a dot that might be escaped in a regular expression, for example `/\./` or new `RegExp('\\.')`
escapedDot = "\\\\\\\\?[.]" and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh dear, that's a lot of backslashes 😕

This matches a dot optionally preceeded by a backslash? In the str below, there is also a case for . on the right-hand side, though, which seems to suggest the question mark isn't needed here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this matches a dot preceeded by one or two backslashes. The programmer will have used one backslash if this pattern is in a regular expression literal, and two backslashes if this pattern is in a string literal (see examples in the comment). I could have dispatched on the subclass of RegExpPatternSource, but that seems to be overkill.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it turns out that \\\\[.] is the right amount of backslashes, this works for both /\./ and new RegExp('\\.'). I have added a bunch of tests.

anchorPart = src.getPattern().regexpCapture(regex, 1) and
anchorPart.regexpMatch("(?i).*[a-z].*") and
msg = "The alternative '" + anchorPart + "' uses an anchor to match from the " + posString +
" of a string, but the other alternatives of this regular expression do not use anchors."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this message to be quite long. If I might suggest another way to phrase it:

Misleading operator precedence. The subexpression '^http' is anchored, but the other parts of this regular expression are not.

@asger-semmle
Copy link
Contributor

Opened https://jira.semmle.com/browse/ODASA-7961 with some ideas for future improvements that are out of scope for this round

@ghost
Copy link
Author

ghost commented May 31, 2019

Performance evaluation links.

For results, I refer to https://lgtm.com/query/3732291658122574474/ which has the most up to date whitelist.

(I have started a full security evaluation)

@ghost
Copy link
Author

ghost commented Jun 3, 2019

Copy link

@xiemaisi xiemaisi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM on the whole, a few minor niggles.

* A node whose string value may flow to a position where it is interpreted
* as a part of a regular expression.
*/
class StringRegExpPatternSource extends RegExpPatternSource {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could/should this class be private?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can't be private due to the use of class checks and getAUse in two places for whitelisting and alert presentation:

js/regex/missing-regexp-anchor:

          src.getARegExpObject().flowsTo(arg) or
          src.(StringRegExpPatternSource).getAUse() = arg

js/incomplete-hostname-regexp:

    if re instanceof StringRegExpPatternSource
    then (
      kind = "string, which is used as a regular expression $@," and
      aux = re.(StringRegExpPatternSource).getAUse()
    ) else (
      kind = "regular expression" and aux = re
    )

/**
* A regular expression literal, viewed as the pattern source for itself.
*/
class RegExpLiteralPatternSource extends RegExpPatternSource {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could/should this class be private?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency witth the public StringRegExpPatternSource, I think this should remain public

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, fine. I don't like having all these awkwardly named classes in the global namespace.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can hide them in a module BuiltinRegExpPatternSources { class RegExpLiteralPatternSource ... }, that is half the awkwardness in the global namespace.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I don't think that would be better. Could the above two uses of StringRegExpPatternSource be encapsulated as member predicates of RegExpPatternSource?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have introduced getAParse instead of getAUse, then the predicate name makes sense for the regular expression literals which otherwise should be flagged at all their usages. As an added bonus, the message in js/incomplete-hostname-regexp will no longer include a trivial link for cases where the string literal already is at a location where it is used as a regular expression:

That is, the message for:

new RegExp(`test.example.com$`); // NOT OK

is now:

This regular expression has an unescaped '.' before 'example.com', so it might match more hosts than expected. 

t2 = t.smallstep(result, succ)
or
any(TaintTracking::AdditionalTaintStep dts).step(result, succ) and
t = t2
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t = t2.continue() would be slightly more correct, I believe (though I doubt there is any practical difference).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an unchanged move from #1211.

The tests in that PR shows a semantic difference.
The following NOT OK line is not flagged if t = t2.continue() is used.

	let domains = [ { hostname: 'test.example.com$' } ];  // NOT OK
	function convert2(domain) {
		return new RegExp(domain.hostname);
	}
	domains.map(d => convert2(d));

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, right, forgot about this one.

*/
abstract class RegExpPatternSource extends DataFlow::Node {
/**
* Gets the pattern of this node.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not really clear what "the pattern" means here; could you please expand the comment?

StringRegExpPatternSource() { this = regExpSource(use) }

/**
* Gets a node that use this source as a regular expression pattern.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Gets a node that use this source as a regular expression pattern.
* Gets a node that uses this source as a regular expression pattern.

}

/**
* Gets a node whose value may flow (inter-procedurally) to a position where it is interpreted
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doc comment should explain the role of re.

@ghost
Copy link
Author

ghost commented Jun 3, 2019

All comments addressed.

xiemaisi
xiemaisi previously approved these changes Jun 3, 2019
Copy link

@xiemaisi xiemaisi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! LGTM now.

Ping @mc for doc review.

@mc
Copy link

mc commented Jun 3, 2019

Why me?

@xiemaisi
Copy link

xiemaisi commented Jun 3, 2019

Oops, sorry, muscle memory.

Copy link
Contributor

@mchammer01 mchammer01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@esben-semmle - this LGTM
I've made a couple of minor inline comments for your consideration.

<p>

Sanitizing untrusted input with regular expressions is a
common technique. However, it is error prone to match untrusted input
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error-prone

<p>

Even if the matching is not done in a security-critical
context, it may still cause undesirable behaviors when the regular
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion:
...undesirable behavior (I'd use the singular form here) when the regular expression accidentally matches (I'd put the advert before the verb)...

@ghost
Copy link
Author

ghost commented Jun 3, 2019

All comments addressed

@xiemaisi
Copy link

xiemaisi commented Jun 3, 2019

PR checks are failing due to rogue tabs in the query examples.

Copy link
Contributor

@mchammer01 mchammer01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the doc updates @esben-semmle 👍

@semmle-qlci semmle-qlci merged commit 80ff63a into github:master Jun 3, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants