<regex>: Some escape sequences are mishandled #5244

@muellerj2

There are a number of escape sequences that the parser mistakenly accepts or miscompiles.

ECMAScript

  • Backreferences with leading zero digits (e.g., \01 for capture group 1) should be rejected. [ECMA-262 3rd ed., Section 15.10.2.11 "DecimalEscape"]
  • \00 and longer runs of zero digits should be rejected rather than interpreted as an escape for NUL. Only \0 is a valid escape sequence for NUL. [ECMA-262 3rd ed., Section 15.10.2.11 "DecimalEscape"]
  • When a custom traits implementation defines a new character class "z", [\z] incorrectly matches the characters in this class instead of the character z. (Meanwhile, \z outside brackets correctly matches the character z and not the characters in the class "z".) [ECMA-262 3rd ed., Sections 15.10.1 "Patterns" and 15.10.2.12 "CharacterClassEscape"]
  • [\b] should match U+0008 BACKSPACE, not b. [ECMA-262 3rd ed., Section 15.10.2.19 "ClassEscape"]
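
The ECMAScript cases above can be checked with a small probe. This is only a sketch: `Outcome`, `Case`, `probe`, `describe`, `ecmascript_cases` and `report` are hypothetical names made up for illustration, and the "expected" column simply restates what this issue asks for. The [\z] case is left out because it needs a custom traits class.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper types for illustration; not part of <regex>.
enum class Outcome { Rejected, NoMatch, Match };

struct Case {
    std::string pattern;   // regex source text
    std::string subject;   // input that must match in full
    std::string expected;  // behavior requested in this issue
};

// Compiles the pattern in ECMAScript mode and tries a full match.
// A std::regex_error during construction is reported as Rejected.
Outcome probe(const std::string& pattern, const std::string& subject) {
    try {
        const std::regex re(pattern, std::regex::ECMAScript);
        return std::regex_match(subject, re) ? Outcome::Match : Outcome::NoMatch;
    } catch (const std::regex_error&) {
        return Outcome::Rejected;
    }
}

const char* describe(Outcome o) {
    switch (o) {
    case Outcome::Rejected: return "rejected";
    case Outcome::NoMatch:  return "no match";
    default:                return "match";
    }
}

// Cases taken from the ECMAScript list in this issue.
std::vector<Case> ecmascript_cases() {
    return {
        {"(a)\\01", "aa", "rejected"},  // backreference with a leading zero
        {"\\00",    "",   "rejected"},  // only \0 is a valid NUL escape
        {"[\\b]",   "\b", "match"},     // U+0008 BACKSPACE
        {"[\\b]",   "b",  "no match"},  // must not match the letter b
    };
}

// Builds one line per case: actual outcome next to the requested one.
std::string report(const std::vector<Case>& cases) {
    std::string out;
    for (const Case& c : cases) {
        out += c.pattern + ": " + describe(probe(c.pattern, c.subject))
             + " (issue expects: " + c.expected + ")\n";
    }
    return out;
}
```

Writing `report(ecmascript_cases())` to `std::cout` shows which cases a given standard library still gets wrong. The actual column depends on the implementation and its version.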

awk

See Section "Regular expressions" in the awk specification.

  • Octal escape sequences are not parsed correctly in square-bracket character class definitions. (E.g., [\040] should match U+0020 SPACE.)
  • Similarly, [\"] and [\/] also match a backslash, even though they shouldn't.
  • While the awk specification says that using unspecified escape sequences results in undefined behavior, I think we should reject them. (I believe we should handle this differently from ECMAScript mode, where unrecognized escape sequences just yield the escaped character.)
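
A similar probe can cover the awk cases. Again this is a sketch under assumptions: `Outcome`, `AwkCase`, `probe_awk`, `describe`, `awk_cases` and `report_awk` are hypothetical names, and `\q` is just one example of an escape sequence the awk specification leaves unspecified. Rejecting it is the behavior proposed above, not something the specification requires.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper types for illustration; not part of <regex>.
enum class Outcome { Rejected, NoMatch, Match };

struct AwkCase {
    std::string pattern;   // regex source text
    std::string subject;   // input that must match in full
    std::string expected;  // behavior requested or proposed in this issue
};

// Compiles the pattern in awk mode and tries a full match.
// A std::regex_error during construction is reported as Rejected.
Outcome probe_awk(const std::string& pattern, const std::string& subject) {
    try {
        const std::regex re(pattern, std::regex::awk);
        return std::regex_match(subject, re) ? Outcome::Match : Outcome::NoMatch;
    } catch (const std::regex_error&) {
        return Outcome::Rejected;
    }
}

const char* describe(Outcome o) {
    switch (o) {
    case Outcome::Rejected: return "rejected";
    case Outcome::NoMatch:  return "no match";
    default:                return "match";
    }
}

// Cases taken from the awk list in this issue.
std::vector<AwkCase> awk_cases() {
    return {
        {"[\\040]", " ",  "match"},     // octal escape for U+0020 SPACE
        {"[\\\"]",  "\\", "no match"},  // [\"] must not match a backslash
        {"[\\/]",   "\\", "no match"},  // [\/] must not match a backslash
        {"\\q",     "q",  "rejected"},  // unspecified escape (proposed behavior)
    };
}

// Builds one line per case: actual outcome next to the requested one.
std::string report_awk(const std::vector<AwkCase>& cases) {
    std::string out;
    for (const AwkCase& c : cases) {
        out += c.pattern + ": " + describe(probe_awk(c.pattern, c.subject))
             + " (issue expects: " + c.expected + ")\n";
    }
    return out;
}
```

As with the ECMAScript probe, the actual column of `report_awk(awk_cases())` depends on the implementation, so no output is listed here.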

Labels: bug, fixed, regex
