Describe the bug
`is_fstring_start` using `builtins.any` for prefix matching is too slow and slows down f-string tokenization:
```python
def is_fstring_start(token: str) -> bool:
    return builtins.any(token.startswith(prefix) for prefix in fstring_prefix)  # using `any` with a <genexpr> is too slow
```
To Reproduce
Run this minimal reproducing script:
```python
import cProfile, pstats, io
from blib2to3.pgen2 import tokenize

profiler = cProfile.Profile()
example = io.StringIO(','.join(['f"X"'] * 10000)).readline
profiler.enable()
tokenize.tokenize(example, lambda *_: None)
profiler.disable()
pstats.Stats(profiler).sort_stats(pstats.SortKey.TIME).print_stats("black", "src", 10)
```
The profiling output looks like this:
```
720011 function calls in 0.133 seconds

ncalls  tottime percall cumtime percall filename:lineno(function)
 40001  0.040   0.000   0.124   0.000   black/src/blib2to3/pgen2/tokenize.py:559(generate_tokens)
190000  0.021   0.000   0.036   0.000   black/src/blib2to3/pgen2/tokenize.py:466(<genexpr>)
     1  0.008   0.008   0.133   0.133   black/src/blib2to3/pgen2/tokenize.py:280(tokenize_loop)
 10000  0.004   0.000   0.048   0.000   black/src/blib2to3/pgen2/tokenize.py:463(is_fstring_start)
 59997  0.003   0.000   0.003   0.000   black/src/blib2to3/pgen2/tokenize.py:528(current)
 10000  0.001   0.000   0.002   0.000   black/src/blib2to3/pgen2/tokenize.py:534(leave_fstring)
 10000  0.001   0.000   0.002   0.000   black/src/blib2to3/pgen2/tokenize.py:531(enter_fstring)
     1  0.000   0.000   0.133   0.133   black/src/blib2to3/pgen2/tokenize.py:260(tokenize)
     2  0.000   0.000   0.000   0.000   black/src/blib2to3/pgen2/tokenize.py:525(is_in_fstring_expression)
     1  0.000   0.000   0.000   0.000   black/src/blib2to3/pgen2/tokenize.py:522(__init__)
```
The `<genexpr>` in `is_fstring_start` typically accounts for around 15-20% of the total time, which is significant and easily optimizable.
Environment
- Black's version: [main]
- OS and Python version: [Mac/Python 3.12.6]
Proposed Solution
Change `fstring_prefix` to a tuple and call `token.startswith(fstring_prefix)` directly; `str.startswith` accepts a tuple of prefixes, which avoids the generator expression and the `builtins.any` call entirely.
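A minimal sketch of the change, using an illustrative subset of prefixes (an assumption: the real `fstring_prefix` in blib2to3 covers more case and `r`-prefix combinations):

```python
# Illustrative prefix set; the actual fstring_prefix in Black's tokenizer
# contains all case/order combinations of "f" with "r".
prefix_set = {'f"', "f'", 'F"', "F'", 'rf"', "rf'"}
prefix_tuple = tuple(prefix_set)

def is_fstring_start_slow(token: str) -> bool:
    # original approach: any() over a generator expression,
    # one startswith call per prefix until a match is found
    return any(token.startswith(prefix) for prefix in prefix_set)

def is_fstring_start_fast(token: str) -> bool:
    # proposed approach: str.startswith accepts a tuple of prefixes,
    # so the whole check happens in a single C-level call
    return token.startswith(prefix_tuple)

assert is_fstring_start_slow('f"X"') == is_fstring_start_fast('f"X"')
assert is_fstring_start_slow('"X"') == is_fstring_start_fast('"X"')
```

Both versions return the same results; the tuple form simply eliminates the per-call generator and iterator overhead that shows up in the profile.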
cc. @JelleZijlstra @tusharsadhwani