diff --git a/pep-0675.rst b/pep-0675.rst index 640452a3cde..52e1189d4a1 100644 --- a/pep-0675.rst +++ b/pep-0675.rst @@ -82,7 +82,7 @@ the AST or by other semantic pattern-matching. These tools, however, preclude common idioms like storing a large multi-line query in a variable before executing it, adding literal string modifiers to the query based on some conditions, or transforming the query string using -a function. (We survey existing tools in the "Rejected Alternatives" +a function. (We survey existing tools in the `Rejected Alternatives`_ section.) For example, many tools will detect a false positive issue in this benign snippet: @@ -112,7 +112,7 @@ generalization of the ``Literal["foo"]`` type from :pep:`586`. A string of type ``Literal[str]`` cannot contain user-controlled data. Thus, any API that only accepts ``Literal[str]`` will be immune to injection -vulnerabilities (with pragmatic `limitations `_). Since we want the ``sqlite3`` ``execute`` method to disallow strings @@ -202,9 +202,9 @@ heuristics, such as regex-filtering for obviously malicious payloads, there will always be a way to work around them (perfectly distinguishing good and bad queries reduces to the halting problem). -Static approaches like checking the AST to see if the query string is -a literal string expression cannot tell when a string is assigned to -an intermediate variable or when it is transformed by a benign +Static approaches, such as checking the AST to see if the query string +is a literal string expression, cannot tell when a string is assigned +to an intermediate variable or when it is transformed by a benign function. This makes them overly restrictive. The type checker, surprisingly, does better than both because it has @@ -300,6 +300,7 @@ if they evaluate to the same value (``str``), such as Type Inference ============== +.. 
_inferring_literal_str: Inferring ``Literal[str]`` -------------------------- @@ -327,6 +328,10 @@ following cases: has type ``Literal[str]`` if and only if ``s`` and the arguments have types compatible with ``Literal[str]``. ++ Literal-preserving methods: In `Appendix C `_, we have + provided an exhaustive list of ``str`` methods that preserve the + ``Literal[str]`` type. + In all other cases, if one or more of the composed values has a non-literal type ``str``, the composition of types will have type ``str``. For example, if ``s`` has type ``str``, then ``"hello" + s`` @@ -337,10 +342,6 @@ checkers. methods from ``str``. So, if we have a variable ``s`` of type ``Literal[str]``, it is safe to write ``s.startswith("hello")``. -Note that, beyond the few composition rules mentioned above, this PEP -doesn't change inference for other ``str`` methods such as -``literal_string.upper()``. - Some type checkers refine the type of a string when doing an equality check: @@ -366,7 +367,7 @@ See the examples below to help clarify the above rules: s: str = literal_string # OK literal_string: Literal[str] = s # Error: Expected Literal[str], got str. - literal_string: Literal[str] = "hello" # OK + literal_string: Literal[str] = "hello" # OK def expect_literal_str(s: Literal[str]) -> None: ... @@ -577,11 +578,10 @@ Rejected Alternatives Why not use tool X? ------------------- -Focusing solely on the example of preventing SQL injection, tooling to -catch this kind of issue seems to come in three flavors: AST based, -function level analysis, and taint flow analysis. +Tools to catch issues such as SQL injection seem to come in three +flavors: AST-based, function-level analysis, and taint flow analysis. -**AST based tools include Bandit**: `Bandit +**AST-based tools**: `Bandit `_ has a plugin to warn when SQL queries are not literal strings. 
The problem is that many perfectly safe SQL @@ -630,7 +630,7 @@ handles it with no burden on the programmer: # Example usage data_to_insert = { - "column_1": value_1, # Note: values are not literals + "column_1": value_1, # Note: values are not literals "column_2": value_2, "column_3": value_3, } @@ -650,6 +650,14 @@ on to library users instead of allowing the libraries themselves to specify precisely how their APIs must be called (as is possible with ``Literal[str]``). +One final reason to prefer using a new type over a dedicated tool is +that type checkers are more widely used than dedicated security +tooling; for example, MyPy was downloaded `over 7 million times +`_ in Jan 2022 vs `less than +2 million times `_ for +Bandit. Having security protections built right into type checkers +will mean that more developers benefit from them. + Why not use a ``NewType`` for ``str``? -------------------------------------- @@ -748,27 +756,8 @@ The implementation simply extends the type checker with ``Literal[str]`` as a supertype of literal string types. To support composition via addition, join, etc., it was sufficient to -overload the stubs for ``str`` in Pyre's copy of typeshed. For -example, we replaced ``str`` ``__add__``: - -:: - - # Before: - def __add__(self, s: str) -> str: ... - - # After: - @overload - def __add__(self: Literal[str], other: Literal[str]) -> Literal[str]: ... - @overload - def __add__(self, other: str) -> str: ... +overload the stubs for ``str`` in Pyre's copy of typeshed. -This means that addition of non-literal string types remains to have -type ``str``. The only change is that addition of literal string types -now produces ``Literal[str]``. - -One implementation strategy is to update the official Typeshed `stub -`_ -for ``str`` with these changes. Appendix A: Other Uses ====================== @@ -868,6 +857,40 @@ the ``Template`` API to only accept ``Literal[str]``: def __init__(self, source: Literal[str]): ... 
+Logging Format String Injection +------------------------------- + +Logging frameworks often allow their input strings to contain +formatting directives. At its worst, allowing users to control the +logged string has led to `CVE-2021-44228 +`_ (colloquially +known as ``log4shell``), which has been described as the `"most +critical vulnerability of the last decade" +`_. +While no Python frameworks are currently known to be vulnerable to a +similar attack, the built-in logging framework does provide formatting +options which are vulnerable to Denial of Service attacks from +externally controlled logging strings. The following example +illustrates a simple denial of service scenario: + +:: + + external_string = "%(foo)999999999s" + ... + # Tries to add > 1GB of whitespace to the logged string: + logger.info(f'Received: {external_string}', some_dict) + +This kind of attack could be prevented by requiring that the format +string passed to the logger be a ``Literal[str]`` and that all +externally controlled data be passed separately as arguments (as +proposed in `Issue 46200 `_): + +:: + + def info(msg: Literal[str], *args: object) -> None: + ... + + Appendix B: Limitations ======================= @@ -913,6 +936,275 @@ is documentation, which is easily ignored and often not seen. With ``Literal[str]``, API misuse requires conscious thought and artifacts in the code that reviewers and future developers can notice. +.. _appendix_C: + +Appendix C: ``str`` methods that preserve ``Literal[str]`` +========================================================== + +The ``str`` class has several methods that would benefit from +``Literal[str]``. For example, users might expect +``"hello".capitalize()`` to have the type ``Literal[str]`` similar to +the other examples we have seen in the `Inferring Literal[str] +`_ section. 
Inferring the type ``Literal[str]`` +is correct because the string is not an arbitrary user-supplied string +- we know that it has the type ``Literal["Hello"]``, which is +compatible with ``Literal[str]``. In other words, the ``capitalize`` +method preserves the ``Literal[str]`` type. There are several other +``str`` methods that preserve ``Literal[str]``. + +We propose updating the stub for ``str`` in typeshed so that the +methods are overloaded with the ``Literal[str]``-preserving +versions. This means type checkers do not have to hardcode +``Literal[str]`` behavior for each method. It also lets us easily +support new methods in the future by updating the typeshed stub. + +For example, to preserve literal types for the ``capitalize`` method, +we would change the stub as below: + +:: + + # before + def capitalize(self) -> str: ... + + # after + @overload + def capitalize(self: Literal[str]) -> Literal[str]: ... + @overload + def capitalize(self) -> str: ... + +The downside of changing the ``str`` stub is that the stub becomes +more complicated and can make error messages harder to +understand. Type checkers may need to special-case ``str`` to make +error messages understandable for users. + +Below is an exhaustive list of ``str`` methods which, when called as +indicated with arguments of type ``Literal[str]``, must be treated as +returning a ``Literal[str]``. If this PEP is accepted, we will update +these method signatures in typeshed: + +:: + + @overload + def capitalize(self: Literal[str]) -> Literal[str]: ... + @overload + def capitalize(self) -> str: ... + + @overload + def casefold(self: Literal[str]) -> Literal[str]: ... + @overload + def casefold(self) -> str: ... + + @overload + def center(self: Literal[str], __width: SupportsIndex, __fillchar: Literal[str] = ...) -> Literal[str]: ... + @overload + def center(self, __width: SupportsIndex, __fillchar: str = ...) -> str: ... 
+ + if sys.version_info >= (3, 8): + @overload + def expandtabs(self: Literal[str], tabsize: SupportsIndex = ...) -> Literal[str]: ... + @overload + def expandtabs(self, tabsize: SupportsIndex = ...) -> str: ... + + else: + @overload + def expandtabs(self: Literal[str], tabsize: int = ...) -> Literal[str]: ... + @overload + def expandtabs(self, tabsize: int = ...) -> str: ... + + @overload + def format(self: Literal[str], *args: Literal[str], **kwargs: Literal[str]) -> Literal[str]: ... + @overload + def format(self, *args: str, **kwargs: str) -> str: ... + + @overload + def join(self: Literal[str], __iterable: Iterable[Literal[str]]) -> Literal[str]: ... + @overload + def join(self, __iterable: Iterable[str]) -> str: ... + + @overload + def ljust(self: Literal[str], __width: SupportsIndex, __fillchar: Literal[str] = ...) -> Literal[str]: ... + @overload + def ljust(self, __width: SupportsIndex, __fillchar: str = ...) -> str: ... + + @overload + def lower(self: Literal[str]) -> Literal[str]: ... + @overload + def lower(self) -> str: ... + + @overload + def lstrip(self: Literal[str], __chars: Literal[str] | None = ...) -> Literal[str]: ... + @overload + def lstrip(self, __chars: str | None = ...) -> str: ... + + @overload + def partition(self: Literal[str], __sep: Literal[str]) -> tuple[Literal[str], Literal[str], Literal[str]]: ... + @overload + def partition(self, __sep: str) -> tuple[str, str, str]: ... + + @overload + def replace(self: Literal[str], __old: Literal[str], __new: Literal[str], __count: SupportsIndex = ...) -> Literal[str]: ... + @overload + def replace(self, __old: str, __new: str, __count: SupportsIndex = ...) -> str: ... + + if sys.version_info >= (3, 9): + @overload + def removeprefix(self: Literal[str], __prefix: Literal[str]) -> Literal[str]: ... + @overload + def removeprefix(self, __prefix: str) -> str: ... + + @overload + def removesuffix(self: Literal[str], __suffix: Literal[str]) -> Literal[str]: ... 
+ @overload + def removesuffix(self, __suffix: str) -> str: ... + + @overload + def rjust(self: Literal[str], __width: SupportsIndex, __fillchar: Literal[str] = ...) -> Literal[str]: ... + @overload + def rjust(self, __width: SupportsIndex, __fillchar: str = ...) -> str: ... + + @overload + def rpartition(self: Literal[str], __sep: Literal[str]) -> tuple[Literal[str], Literal[str], Literal[str]]: ... + @overload + def rpartition(self, __sep: str) -> tuple[str, str, str]: ... + + @overload + def rsplit(self: Literal[str], sep: Literal[str] | None = ..., maxsplit: SupportsIndex = ...) -> list[Literal[str]]: ... + @overload + def rsplit(self, sep: str | None = ..., maxsplit: SupportsIndex = ...) -> list[str]: ... + + @overload + def rstrip(self: Literal[str], __chars: Literal[str] | None = ...) -> Literal[str]: ... + @overload + def rstrip(self, __chars: str | None = ...) -> str: ... + + @overload + def split(self: Literal[str], sep: Literal[str] | None = ..., maxsplit: SupportsIndex = ...) -> list[Literal[str]]: ... + @overload + def split(self, sep: str | None = ..., maxsplit: SupportsIndex = ...) -> list[str]: ... + + @overload + def splitlines(self: Literal[str], keepends: bool = ...) -> list[Literal[str]]: ... + @overload + def splitlines(self, keepends: bool = ...) -> list[str]: ... + + @overload + def strip(self: Literal[str], __chars: Literal[str] | None = ...) -> Literal[str]: ... + @overload + def strip(self, __chars: str | None = ...) -> str: ... + + @overload + def swapcase(self: Literal[str]) -> Literal[str]: ... + @overload + def swapcase(self) -> str: ... + + @overload + def title(self: Literal[str]) -> Literal[str]: ... + @overload + def title(self) -> str: ... + + @overload + def upper(self: Literal[str]) -> Literal[str]: ... + @overload + def upper(self) -> str: ... + + @overload + def zfill(self: Literal[str], __width: SupportsIndex) -> Literal[str]: ... + @overload + def zfill(self, __width: SupportsIndex) -> str: ... 
+ + @overload + def __add__(self: Literal[str], __s: Literal[str]) -> Literal[str]: ... + @overload + def __add__(self, __s: str) -> str: ... + + @overload + def __iter__(self: Literal[str]) -> Iterator[str]: ... + @overload + def __iter__(self) -> Iterator[str]: ... + + @overload + def __mod__(self: Literal[str], __x: Union[Literal[str], Tuple[Literal[str], ...]]) -> str: ... + @overload + def __mod__(self, __x: Union[str, Tuple[str, ...]]) -> str: ... + + @overload + def __mul__(self: Literal[str], __n: SupportsIndex) -> Literal[str]: ... + @overload + def __mul__(self, __n: SupportsIndex) -> str: ... + + @overload + def __repr__(self: Literal[str]) -> Literal[str]: ... + @overload + def __repr__(self) -> str: ... + + @overload + def __rmul__(self: Literal[str], n: SupportsIndex) -> Literal[str]: ... + @overload + def __rmul__(self, n: SupportsIndex) -> str: ... + + @overload + def __str__(self: Literal[str]) -> Literal[str]: ... + @overload + def __str__(self) -> str: ... + + +Appendix D: Guidelines for using ``Literal[str]`` in Stubs +========================================================== + +Libraries that do not contain type annotations within their source may +specify type stubs in Typeshed. Libraries written in other languages, +such as those for machine learning, may also provide Python type +stubs. This means the type checker cannot verify that the type +annotations match the source code and must trust the type stub. Thus, +authors of type stubs need to be careful when using ``Literal[str]`` +since a function may falsely appear to be safe when it is not. + +We recommend the following guidelines for using ``Literal[str]`` in stubs: + ++ If the stub is for a function, we recommend using ``Literal[str]`` + in the return type of the function or of its overloads only if all + the corresponding arguments have literal types (i.e., + ``Literal[str]`` or ``Literal["a", "b"]``). 
+ + :: + + # OK + @overload + def my_transform(x: Literal[str], y: Literal["a", "b"]) -> Literal[str]: ... + @overload + def my_transform(x: str, y: str) -> str: ... + + # Not OK + @overload + def my_transform(x: Literal[str], y: str) -> Literal[str]: ... + @overload + def my_transform(x: str, y: str) -> str: ... + ++ If the stub is for a ``staticmethod``, we recommend the same + guideline as above. + ++ If the stub is for any other kind of method, we recommend against + using ``Literal[str]`` in the return type of the method or any of + its overloads. This is because, even if all the explicit arguments + have type ``Literal[str]``, the object itself may be created using + user data and thus the return type may be user-controlled. + ++ If the stub is for a class attribute or global variable, we also + recommend against using ``Literal[str]`` because the untyped code + may write arbitrary values to the attribute. + +However, we leave the final call to the library author. They may use +``Literal[str]`` if they feel confident that the string returned by +the method or function or the string stored in the attribute is +guaranteed to have a literal type - i.e., the string is created by +applying only literal-preserving ``str`` operations to a string +literal. + +Note that these guidelines do not apply to inline type annotations +since the type checker can verify that, say, a method returning +``Literal[str]`` does in fact return an expression of that type. + + Resources ========= @@ -936,7 +1228,8 @@ Thanks Thanks to the following people for their feedback on the PEP: -Edward Qiu, Jia Chen, Shannon Zhu, Gregory P. Smith, Никита Соболев, and Shengye Wan +Edward Qiu, Jia Chen, Shannon Zhu, Gregory P. Smith, Никита Соболев, +CAM Gerlach, and Shengye Wan Copyright =========