Add support for parsing f-string as per PEP 701 #7041

dhruvmanila · 2023-09-01T13:05:20Z

Summary

This PR adds PEP 701 support to the parser, which now uses the new tokens emitted by the lexer to construct the f-string node.

Grammar

Without an official grammar, f-strings were previously parsed manually. Now that a specification exists, the LALRPOP grammar uses it to parse f-strings.

`string.rs`

This file contains the logic for parsing string literals and joining implicitly concatenated strings. Since f-strings no longer need to be parsed manually, much of the related code has been removed.

Earlier, there were 2 entry points to this module:

parse_string: Used to parse a single string literal
parse_strings: Used to parse strings which were implicitly concatenated

Now, there are 3 entry points:

parse_string_literal: Renamed from parse_string
parse_fstring_middle: Used to parse a FStringMiddle token which is basically a string literal without the quotes
concatenate_strings: Renamed from parse_strings; it now takes the already-parsed nodes, so it only needs to concatenate them into a single node.
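
To illustrate the idea behind concatenating already-parsed nodes, here is a minimal Python sketch. It is not the Rust code in string.rs: the `Literal` and `Expression` classes and the `concatenate_parts` function are hypothetical names made up for this example. It only shows how adjacent literal parts can be merged while expression parts are kept as separate elements.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Literal:
    """Hypothetical stand-in for an already-parsed string literal part."""
    value: str


@dataclass
class Expression:
    """Hypothetical stand-in for an already-parsed f-string expression part."""
    source: str


Part = Union[Literal, Expression]


def concatenate_parts(parts: List[Part]) -> List[Part]:
    """Merge neighbouring literal parts; keep expressions as separators."""
    merged: List[Part] = []
    for part in parts:
        if isinstance(part, Literal) and merged and isinstance(merged[-1], Literal):
            # Two literals in a row: fold them into a single literal.
            merged[-1] = Literal(merged[-1].value + part.value)
        else:
            merged.append(part)
    return merged


# Models: "foo" f"{bar}" "baz" " some"
parts = [Literal("foo"), Expression("bar"), Literal("baz"), Literal(" some")]
merged = concatenate_parts(parts)
print(merged)
```

The real function also has to deal with node ranges, bytes literals and the kind field, which this sketch deliberately ignores.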

A short primer on the FStringMiddle token: it covers the text inside an f-string that is not part of an expression and is not an opening or closing brace. For example, in f"foo {bar:.3f{x}} bar", the contents of the FStringMiddle tokens are "foo ", ".3f" and " bar".
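
As a rough cross-check of the example above, the literal pieces can be recovered from CPython's own AST. This is only a sketch using the standard `ast` module, not Ruff's lexer: the helper name `literal_parts` is made up for this example, and the assumption is that the `Constant` nodes inside the `JoinedStr` (including those in nested format specs) roughly correspond to the FStringMiddle contents.

```python
import ast


def literal_parts(joined: ast.JoinedStr) -> list:
    """Collect the literal text of an f-string, descending into format specs."""
    parts = []
    for value in joined.values:
        if isinstance(value, ast.Constant):
            parts.append(value.value)
        elif isinstance(value, ast.FormattedValue) and value.format_spec is not None:
            # A format spec is itself a JoinedStr and may contain literal text.
            parts.extend(literal_parts(value.format_spec))
    return parts


tree = ast.parse('f"foo {bar:.3f{x}} bar"', mode="eval")
parts = literal_parts(tree.body)
print(parts)  # ['foo ', '.3f', ' bar']
```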

`Constant::kind` changed in the AST

Discussion in the official implementation: python/cpython#102855 (comment)

This AST change applies when unicode strings (prefixed with u) and f-strings are used in an implicitly concatenated string value. For example,

u"foo" f"{bar}" "baz" " some"

Before Python 3.12, the kind field was set only when the u prefix was on the first string, and it was then applied to every constant in the concatenation. So, taking the above example, both "foo" and "baz some" (implicit concatenation) would be given the u kind:

Pre 3.12 AST:

Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some', kind='u')

But from Python 3.12 onward, kind is set only on a constant that begins with a u-prefixed string:

3.12 AST:

Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some')
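
The difference can be observed with CPython's `ast` module. The snippet below is a small sketch that lists each constant together with its kind; the kind printed for 'baz some' depends on the interpreter version it runs on, as described above.

```python
import ast

source = 'u"foo" f"{bar}" "baz" " some"'
joined = ast.parse(source, mode="eval").body  # a JoinedStr node

# Pair each constant's value with its kind (None when no kind was assigned).
constants = [
    (part.value, part.kind)
    for part in joined.values
    if isinstance(part, ast.Constant)
]
print(constants)
```

On either side of the change the first entry is ('foo', 'u'); only the kind of the second entry differs.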

Here are some more variations of the change:

"foo" f"{bar}" u"baz" "no"

Pre 3.12

Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')

3.12

Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno', kind='u')

"foo" f"{bar}" "baz" u"no"

Pre 3.12

Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')

3.12

Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')

u"foo" f"bar {baz} realy" u"bar" "no"

Pre 3.12

Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno', kind='u')

3.12

Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno')

Errors

With the hand-written parser, we were able to provide better error messages for cases such as the following. These messages are now removed, and in those cases LALRPOP will raise an "unexpected token" error instead:

A closing delimiter was not opened properly
An opening delimiter was not closed properly
Empty expression not allowed

The "Too many nested expressions in an f-string" error was removed; we can create a lint rule for that instead.

And, "The f-string expression cannot include the given character" was removed because f-strings now support those characters: mainly quotes matching the outer ones, escape sequences, comments, etc.
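
For reference, here is a small sketch of the kind of input that is now accepted. It uses CPython's `ast.parse` as a stand-in for the Ruff parser, so the result depends on the interpreter: the samples are expected to parse on Python 3.12+ and to raise a SyntaxError on earlier versions. The sample names and the `parses` helper are made up for this example.

```python
import ast
import sys

samples = {
    # Same quote character reused inside the expression.
    "reused quotes": r'''f"{d["key"]}"''',
    # Escape sequence (backslash) inside the expression.
    "backslash in expression": r'''f"{'\n'.join(xs)}"''',
    # Comment inside a multi-line expression.
    "comment in expression": 'f"""{\n    x  # a comment\n}"""',
}


def parses(source: str) -> bool:
    """Return True if the running interpreter accepts the source."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


results = {name: parses(src) for name, src in samples.items()}
print(sys.version_info[:2], results)
```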

Test Plan

Refactor existing test cases to use parse_suite instead of parse_fstrings (which doesn't exist anymore)
Additional test cases are added as required

Updated the snapshots. The change from parse_fstrings to parse_suite means that the snapshot would produce the module node instead of just a list of f-string parts. I've manually verified that the parts are still the same along with the node ranges.

Benchmarks

#7263 (comment)

fixes: #7043
fixes: #6835

dhruvmanila · 2023-09-01T13:05:26Z

Current dependencies on/for this PR:

main
- PR Add a NotebookError type to avoid returning Diagnostics on error #7035
  - PR Make SourceKind a required parameter #7013
    - PR Add support for the new f-string tokens per PEP 701 #6659
      - PR Add support for parsing f-string as per PEP 701 #7041 👈
        PR Use narrow type for string parsing patterns #7211
        PR Disallow non-parenthesized lambda expr in f-string #7263
        PR Fix curly brace escape handling in f-strings #7331
        PR Update Indexer to use new f-string tokens #7325
        PR Detect noqa directives for multi-line f-strings #7326
        PR Update F541 to use new f-string tokens #7327
        PR Update Stylist quote detection with new f-string token #7328
        PR Update W605 to check in f-strings #7329

This comment was auto-generated by Graphite.

codspeed-hq · 2023-09-01T13:19:50Z

CodSpeed Performance Report

Merging #7041 will degrade performances by 9.7%

⚠️ No base runs were found

Falling back to comparing dhruv/fstring-parser (cd2e18b) with main (04183b0)

Summary

❌ 8 regressions
✅ 17 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

	Benchmark	`main`	`dhruv/fstring-parser`	Change
❌	`lexer[numpy/globals.py]`	233.2 µs	252.9 µs	-7.8%
❌	`lexer[unicode/pypinyin.py]`	620.2 µs	673.4 µs	-7.89%
❌	`lexer[large/dataset.py]`	9.8 ms	10.7 ms	-8.42%
❌	`parser[numpy/ctypeslib.py]`	12.4 ms	12.7 ms	-2.35%
❌	`lexer[numpy/ctypeslib.py]`	2 ms	2.1 ms	-7.65%
❌	`parser[large/dataset.py]`	68.4 ms	70.1 ms	-2.34%
❌	`lexer[pydantic/types.py]`	4.1 ms	4.6 ms	-9.7%
❌	`parser[unicode/pypinyin.py]`	4.3 ms	4.4 ms	-2.6%

MichaReiser

Nice work! Overall this is looking good to me. It will be interesting to get some ecosystem checks once ruff compiles.

Most of my comments are nits, but I think there's potential to clean up string.rs further (and hopefully improve performance at the same time). I'll leave it up to you whether to do this as part of this PR or in a follow-up.

crates/ruff/src/linter.rs

MichaReiser · 2023-09-05T07:13:43Z

crates/ruff_python_parser/src/parser.rs

 /// ```
 pub fn parse_tokens(
    lxr: impl IntoIterator<Item = LexResult>,
+    source: &str,


We probably want to start grouping source, mode and source_path behind an abstraction. We don't have to do this as part of this PR, but we should probably do it before introducing more arguments.

crates/ruff_python_parser/src/lexer.rs

crates/ruff_python_parser/src/python.lalrpop