Python: add regex parser #5866

yoff · 2021-05-10T14:34:44Z

This is a first stage for ReDoS, but we will also try to build AST viewing on it.


nickrolfe · 2021-05-26T11:03:30Z

python/ql/src/semmle/python/RegexParserExtended.qll

+
+private string escapableChars() { result = "AbBdDsSwWZafnNrtuUvx\\\\" }
+
+private string keywordChars() { result = "()|*+?\\-\\[\\]" }


I think you're missing some characters here, e.g. { and }. I tested the tokenization with the regex a*b{9}, and you can see that tokens treats the braces as both normalchar and as part of fixedrepeat:

But I've only just started to try and understand this code, so it's possible I've misunderstood.

erik-krogh · 2021-05-26

{ and } can be keywords or not, depending on context.
E.g. in a{b the { is a string constant, but in a{2,} the { is a keyword.
That might be why.

That is at least the case for JavaScript.
Even though I think it's in conflict with the specification.
But everyone seems to do it.
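For reference, Python's own `re` module appears to follow the same convention, which is what the parser in this PR ultimately has to model. A minimal sketch (the variable names are only for illustration):

```python
import re

# "{" that does not start a well-formed repeat is matched literally.
literal_brace = re.fullmatch(r"a{b", "a{b")

# "{2,}" is a well-formed quantifier: two or more "a".
quantifier = re.fullmatch(r"a{2,}", "aaa")
too_short = re.fullmatch(r"a{2,}", "a")

print(literal_brace is not None)  # the brace acted as a constant
print(quantifier is not None)     # the brace acted as a keyword
print(too_short is None)          # a single "a" does not satisfy {2,}
```

So the same character has to be tokenized differently depending on what follows it.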

nickrolfe · 2021-05-26

Ah, that makes sense.

But I assume we still need to maintain the invariant that each character in a regex literal is only part of one token. So in my example above, the last 3 rows should not be there.

erik-krogh · 2021-05-26

> But I assume we still need to maintain the invariant that each character in a regex literal is only part of one token. So in my example above, the last 3 rows should not be there.

We could do that. A regular expression should be able to capture exactly which { and } are keywords.
(I haven't dived deep into the implementation, so I might be wrong).

But I'm not sure we need to do that.
The way the parser works is that it spuriously creates everything that seems feasible when looking at the very local context.
And each layer/iteration of the parser then only uses the tokens/ast-nodes that can be used to create a valid ast-node.

So I don't think spurious tokens are an issue.
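To make the over-generate-then-filter idea concrete, here is a rough Python sketch of it on the a*b{9} example. It is not the QL implementation: the function names, the `quantifier` kind, and the longest-token-first selection rule are assumptions made for illustration, and only `normalchar` and `fixedrepeat` are names taken from the discussion above.

```python
import re

def candidate_tokens(pattern):
    """Over-generate: emit every token that looks plausible from local context."""
    tokens = []
    for i, ch in enumerate(pattern):
        kind = "quantifier" if ch in "*+?" else "normalchar"
        tokens.append((i, i + 1, kind))
    # A brace group such as {9} is additionally proposed as one repeat token.
    for m in re.finditer(r"\{\d+\}", pattern):
        tokens.append((m.start(), m.end(), "fixedrepeat"))
    return tokens

def select_tokens(pattern, tokens):
    """Later layer: walk left to right and use the longest candidate at each position."""
    by_start = {}
    for tok in tokens:
        by_start.setdefault(tok[0], []).append(tok)
    chosen, pos = [], 0
    while pos < len(pattern):
        tok = max(by_start[pos], key=lambda t: t[1])
        chosen.append(tok)
        pos = tok[1]
    return chosen

candidates = candidate_tokens("a*b{9}")
used = select_tokens("a*b{9}", candidates)
unused = [tok for tok in candidates if tok not in used]
print(used)
print(unused)
```

In this sketch the three `normalchar` candidates for {, 9 and } stay in the candidate set but are never picked up by the selection step, which is the sense in which spurious tokens are harmless.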

Spurious tokens also arise when parsing dashes inside character classes.
E.g. in /foo[A-Z-]bar/, where the first dash is a keyword, and the second is a constant.
And I'm not sure if a regular expression can correctly capture which dashes are keywords.
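The dash case can be checked against Python's `re` module as well, which treats the two dashes in that class differently. A small sketch (variable names are illustrative only):

```python
import re

pattern = re.compile(r"foo[A-Z-]bar")

# First dash is a keyword: A-Z is a range, so "Q" is accepted.
range_match = pattern.fullmatch("fooQbar")

# Second dash is a constant: a literal "-" is accepted too.
dash_match = pattern.fullmatch("foo-bar")

# Lowercase "q" is neither in the range nor a dash.
no_match = pattern.fullmatch("fooqbar")

print(range_match is not None, dash_match is not None, no_match is None)
```

Whether a dash is a keyword depends on its position inside the class, so a tokenizer that only looks at the character itself will emit both readings.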

If the C# extension is installed, then it reports 25k+ errors on the C# extractor until it is properly built. This is pure noise because the solution would be opened and built from the correct subdirectory. This commit disables the C# compilation altogether.

yoff · 2021-09-10T12:14:52Z

Superseded by #6175.

yoff added 3 commits May 10, 2021 15:04

Python: Add parser

bd199b7

Python: Add regex parser tests

93c5896

Python: Add tree view for ReDoS and AST viewer

e73cb06

github-actions bot added the Python label May 10, 2021

erik-krogh and others added 8 commits May 11, 2021 00:07

Python: add printAst support for regular expressions

8d63d34

Merge branch 'main' of github.com:github/codeql into python-regex-parser

50fa05d

Merge branch 'python-regex-parser' of https://github.com/yoff/codeql …

26f9691

…into python-regex-parser

Merge branch 'main' of github.com:github/codeql into python-regex-parser

50ed9ba

Get the JS regex AST viewer

Python: Limit strings to parse

690b055

Python: A number of parser tweaks

99cbb11

Python: use constants

fc5f2e6

Python: collecting constants part I

d0f2857

not having single char constants yet all redos results disappeared

nickrolfe reviewed May 26, 2021


yoff closed this Sep 10, 2021
