-
Notifications
You must be signed in to change notification settings - Fork 1.9k
JS: add query js/regex/missing-regexp-anchor #1387
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Some results for the curious. |
asger-semmle
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good! Just have a few nit picks. Agree that we should get this into 1.21.
Do you have a link to an evaluation?
| string pattern; | ||
|
|
||
| RegExpLiteralPatternSource() { | ||
| exists(string raw | raw = asExpr().(RegExpLiteral).getRoot().toString() | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a drive-by change, could you update the ql doc for RegExpTerm.toString to indicate that this is always the full text (not truncated) or make a separate predicate for getting the raw source text?
As @xiemaisi has pointed out a few times toString doesn't generally guarantee anything about its output.
| predicate isAnInterestingSemiAnchoredRegExpString(RegExpPatternSource src, string msg) { | ||
| exists(string str, string maybeGroupedStr, string regex, string anchorPart, string posString, string escapedDot | | ||
| // a dot that might be escaped in a regular expression, for example `/\./` or new `RegExp('\\.')` | ||
| escapedDot = "\\\\\\\\?[.]" and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh dear, that's a lot of backslashes 😕
This matches a dot optionally preceeded by a backslash? In the str below, there is also a case for . on the right-hand side, though, which seems to suggest the question mark isn't needed here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this matches a dot preceeded by one or two backslashes. The programmer will have used one backslash if this pattern is in a regular expression literal, and two backslashes if this pattern is in a string literal (see examples in the comment). I could have dispatched on the subclass of RegExpPatternSource, but that seems to be overkill.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it turns out that \\\\[.] is the right amount of backslashes, this works for both /\./ and new RegExp('\\.'). I have added a bunch of tests.
| anchorPart = src.getPattern().regexpCapture(regex, 1) and | ||
| anchorPart.regexpMatch("(?i).*[a-z].*") and | ||
| msg = "The alternative '" + anchorPart + "' uses an anchor to match from the " + posString + | ||
| " of a string, but the other alternatives of this regular expression do not use anchors." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found this message to be quite long. If I might suggest another way to phrase it:
Misleading operator precedence. The subexpression '^http' is anchored, but the other parts of this regular expression are not.
|
Opened https://jira.semmle.com/browse/ODASA-7961 with some ideas for future improvements that are out of scope for this round |
|
Performance evaluation links.
For results, I refer to https://lgtm.com/query/3732291658122574474/ which has the most up to date whitelist. (I have started a full security evaluation) |
This preserves the ad hoc message formatting in IncompleteHostnameRegExp.ql
xiemaisi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM on the whole, a few minor niggles.
| * A node whose string value may flow to a position where it is interpreted | ||
| * as a part of a regular expression. | ||
| */ | ||
| class StringRegExpPatternSource extends RegExpPatternSource { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could/should this class be private?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can't be private due to the use of class checks and getAUse in two places for whitelisting and alert presentation:
js/regex/missing-regexp-anchor:
src.getARegExpObject().flowsTo(arg) or
src.(StringRegExpPatternSource).getAUse() = arg
js/incomplete-hostname-regexp:
if re instanceof StringRegExpPatternSource
then (
kind = "string, which is used as a regular expression $@," and
aux = re.(StringRegExpPatternSource).getAUse()
) else (
kind = "regular expression" and aux = re
)
| /** | ||
| * A regular expression literal, viewed as the pattern source for itself. | ||
| */ | ||
| class RegExpLiteralPatternSource extends RegExpPatternSource { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could/should this class be private?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For consistency witth the public StringRegExpPatternSource, I think this should remain public
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, fine. I don't like having all these awkwardly named classes in the global namespace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can hide them in a module BuiltinRegExpPatternSources { class RegExpLiteralPatternSource ... }, that is half the awkwardness in the global namespace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I don't think that would be better. Could the above two uses of StringRegExpPatternSource be encapsulated as member predicates of RegExpPatternSource?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have introduced getAParse instead of getAUse, then the predicate name makes sense for the regular expression literals which otherwise should be flagged at all their usages. As an added bonus, the message in js/incomplete-hostname-regexp will no longer include a trivial link for cases where the string literal already is at a location where it is used as a regular expression:
That is, the message for:
new RegExp(`test.example.com$`); // NOT OK
is now:
This regular expression has an unescaped '.' before 'example.com', so it might match more hosts than expected.
| t2 = t.smallstep(result, succ) | ||
| or | ||
| any(TaintTracking::AdditionalTaintStep dts).step(result, succ) and | ||
| t = t2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
t = t2.continue() would be slightly more correct, I believe (though I doubt there is any practical difference).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an unchanged move from #1211.
The tests in that PR shows a semantic difference.
The following NOT OK line is not flagged if t = t2.continue() is used.
let domains = [ { hostname: 'test.example.com$' } ]; // NOT OK
function convert2(domain) {
return new RegExp(domain.hostname);
}
domains.map(d => convert2(d));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, right, forgot about this one.
| */ | ||
| abstract class RegExpPatternSource extends DataFlow::Node { | ||
| /** | ||
| * Gets the pattern of this node. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not really clear what "the pattern" means here; could you please expand the comment?
| StringRegExpPatternSource() { this = regExpSource(use) } | ||
|
|
||
| /** | ||
| * Gets a node that use this source as a regular expression pattern. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| * Gets a node that use this source as a regular expression pattern. | |
| * Gets a node that uses this source as a regular expression pattern. |
| } | ||
|
|
||
| /** | ||
| * Gets a node whose value may flow (inter-procedurally) to a position where it is interpreted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doc comment should explain the role of re.
|
All comments addressed. |
xiemaisi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! LGTM now.
Ping @mc for doc review.
|
Why me? |
|
Oops, sorry, muscle memory. |
mchammer01
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@esben-semmle - this LGTM
I've made a couple of minor inline comments for your consideration.
| <p> | ||
|
|
||
| Sanitizing untrusted input with regular expressions is a | ||
| common technique. However, it is error prone to match untrusted input |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
error-prone
| <p> | ||
|
|
||
| Even if the matching is not done in a security-critical | ||
| context, it may still cause undesirable behaviors when the regular |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion:
...undesirable behavior (I'd use the singular form here) when the regular expression accidentally matches (I'd put the advert before the verb)...
|
All comments addressed |
|
PR checks are failing due to rogue tabs in the query examples. |
mchammer01
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the doc updates @esben-semmle 👍
This adds another query for a common regular expression mistake, this time for missing anchors. Similar to
js/incomplete-hostname-regexp, the query is identifies suspicious regular expression patterns, but it does not check that the patterns are used in a security setting. Both queries use the refactoredRegExpPatternSource, and a bunch of string heuristics.An earlier iteration of the query used the RegExp AST classes, but I am happier with the string heuristic implementation as it is reasonably simple, requires less QL code, and flags more results (due to support for string literal results). For the record, using the RegExp AST classes during prototyping is very useful for homing in on the syntactic cases we really care about.
The results have been a mixed bag, and required a bunch of iterations, but I think they are quite good now. Unfortunately, most results are not security relevant. If we split the query into one that flags URL patterns, and one for non-URL patterns, then the flagged URL patterns will be mostly security relevant. I think we should run the query on LGTM, but not show it by default for a while, and then decide if we want to split, sharpen or display the query.
Running
js/incomplete-hostname-regexpwith the refactoredRegExpPatternSourceshows no performance change.The performance for the new query is excellent. Running the query next to
js/incomplete-hostname-regexpappears to have no measurable overhead (in fact there seems to be a speedup, I am trying to reproduce the numbers. Update: the speedup is not reproducible, but there still is no overhead).I would like this to make it to 1.21, but it will probably not make it through the doc review today.