Add Lexer layer to Parser #324

Merged
ldayton merged 48 commits into main from add-lexer-layer
Jan 17, 2026

@ldayton ldayton commented Jan 17, 2026

Summary

  • Add a Lexer layer to the Parser for tokenization
  • Migrate compound command parsers to use Lexer-based token detection
  • Improve error messages with token positions

This PR introduces a Lexer class that handles tokenization (operators, reserved words, word tokens) and migrates the Parser to use it for compound command parsing. The Lexer provides cleaner separation between tokenization and parsing logic.

Key changes:

  • New Lexer class with next_token() for token-based parsing
  • TokenType constants and Token class for structured token output
  • Parser state tracking via ParserStateFlags for context-sensitive parsing
  • Lexer helper methods: _lex_peek_token(), _lex_peek_reserved_word(), _lex_consume_word(), etc.
  • Migrated all compound commands (if, while, for, case, etc.) to Lexer-based reserved word detection
  • Improved error positions using token positions instead of raw parser position
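
A minimal sketch of what a `next_token()`-style lexer with `TokenType` constants and a `Token` class might look like. This is not the code from the PR: the constant values, the `RESERVED` token type, the operator list, the reserved-word subset, and the `tokenize()` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical subset of reserved words, for illustration only.
RESERVED_WORDS = {"if", "then", "elif", "else", "fi", "while", "until",
                  "for", "in", "do", "done", "case", "esac"}


class TokenType:
    # Illustrative constants; the actual values in the PR are not shown.
    EOF = "EOF"
    WORD = "WORD"
    RESERVED = "RESERVED"
    OPERATOR = "OPERATOR"


@dataclass
class Token:
    type: str
    value: str
    pos: int  # offset of the token's first character in the source


# Longer operators come first so "&&" is not read as "&" followed by "&".
OPERATORS = ["&&", "||", "|&", ";", "&", "|", "\n"]


class Lexer:
    def __init__(self, source: str) -> None:
        self.source = source
        self.pos = 0

    def next_token(self) -> Token:
        src = self.source
        # Skip spaces and tabs between tokens.
        while self.pos < len(src) and src[self.pos] in " \t":
            self.pos += 1
        start = self.pos
        if start >= len(src):
            return Token(TokenType.EOF, "", start)
        for op in OPERATORS:
            if src.startswith(op, start):
                self.pos = start + len(op)
                return Token(TokenType.OPERATOR, op, start)
        # Anything else is a word that runs until a blank or operator character.
        while self.pos < len(src) and src[self.pos] not in " \t\n;&|":
            self.pos += 1
        value = src[start:self.pos]
        kind = TokenType.RESERVED if value in RESERVED_WORDS else TokenType.WORD
        return Token(kind, value, start)


def tokenize(source: str) -> list[Token]:
    """Collect tokens up to and including EOF (hypothetical helper)."""
    lexer = Lexer(source)
    tokens = []
    while True:
        tok = lexer.next_token()
        tokens.append(tok)
        if tok.type == TokenType.EOF:
            return tokens
```

Because every token carries its own `pos`, a parser built on top of such a lexer can report errors at the offending token rather than at wherever the parser cursor happens to be.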

Comprehensive plan for adding a lexer layer to parable.py,
structured as 25+ incremental commits. Each increment passes
all tests before committing.
Add _sync_lexer() and _sync_parser() methods to Parser for synchronizing
position state between Parser and Lexer during incremental migration.
Add _lex_peek_token(), _lex_next_token(), and _lex_skip_blanks() methods
that handle position synchronization and token recording.
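
The synchronization idea can be sketched as follows, assuming the Parser and Lexer each keep their own `pos` cursor over the same source. The `TinyLexer` stand-in and the tuple it returns are hypothetical; only the method names `_sync_lexer()`, `_sync_parser()`, `_lex_peek_token()`, and `_lex_next_token()` come from the commit messages, and token recording is omitted.

```python
class TinyLexer:
    """Hypothetical stand-in lexer: splits on spaces and tabs only."""

    def __init__(self, source: str) -> None:
        self.source = source
        self.pos = 0

    def next_token(self) -> tuple[int, str]:
        n = len(self.source)
        while self.pos < n and self.source[self.pos] in " \t":
            self.pos += 1
        start = self.pos
        while self.pos < n and self.source[self.pos] not in " \t":
            self.pos += 1
        return (start, self.source[start:self.pos])


class Parser:
    def __init__(self, source: str) -> None:
        self.source = source
        self.pos = 0
        self._lexer = TinyLexer(source)

    def _sync_lexer(self) -> None:
        # Character-based code may have moved the parser; bring the lexer along.
        self._lexer.pos = self.pos

    def _sync_parser(self) -> None:
        # The lexer consumed input; move the parser cursor to match.
        self.pos = self._lexer.pos

    def _lex_peek_token(self) -> tuple[int, str]:
        self._sync_lexer()
        # No _sync_parser() call here, so the parser position is unchanged.
        return self._lexer.next_token()

    def _lex_next_token(self) -> tuple[int, str]:
        self._sync_lexer()
        tok = self._lexer.next_token()
        self._sync_parser()
        return tok
```

Syncing on every call keeps token-based and character-based code paths interchangeable during an incremental migration, at the cost of occasionally re-lexing a token that was only peeked.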
Delegate spaces/tabs skipping to Lexer while keeping comment and line
continuation handling in Parser.
Add _lex_skip_comment() wrapper and delegate comment skipping to
Lexer in skip_whitespace().
Returns (token_type, value) tuple for operator tokens, None otherwise.
Class-level assignments in Python become class attributes accessible via
ClassName.attr. In JavaScript, we need static fields for the same behavior.
Replace manual character-based detection of &&, ||, ;, & operators
with Lexer token checking via _lex_peek_operator() and _lex_next_token().
Replace manual | and |& character checking with Lexer token detection
via _lex_peek_operator() for PIPE and PIPE_AMP tokens.
Returns the word value if it's a reserved word, None otherwise.
Uses Lexer.classify_word() to check if a WORD token is actually a reserved word.
Returns the value of the next WORD token if available, None otherwise.
Keeps existing peek_word() for edge cases requiring character-based behavior.
Consumes a WORD token if it matches the expected value. Intended for simple
reserved-word consumption without edge cases such as process substitution handling.
Detects redirect operators: <, >, <<, >>, <&, >&, <>, >|, &>, &>>, <<-, <<<
via Lexer token type checking.
Replace peek_word() in stop_words checks with _lex_peek_reserved_word().
Also update _lex_peek_reserved_word() to:
- Strip trailing backslash-newline for classification (line continuation)
- Use module-level RESERVED_WORDS set for transpiler compatibility
Add efficient helper for checking if next token is a specific reserved word.
Replace consume_word()/peek_word() calls with _lex_consume_word() and
_lex_is_at_reserved_word() for keywords: if, then, elif, else, fi.
Replace consume_word() calls with _lex_consume_word() for keywords:
while, until, do, done.
Replace consume_word()/peek_word() calls with _lex_consume_word() and
_lex_is_at_reserved_word() for keywords: for, in, do, done.

Also update _lex_consume_word() to strip trailing backslash-newline
for proper line continuation handling.
Replace consume_word()/peek_word() calls with _lex_consume_word() and
_lex_is_at_reserved_word() for keywords: select, in, do.
Replace consume_word()/peek_word() calls with _lex_consume_word() and
_lex_is_at_reserved_word() for keywords: in, esac.

Keep consume_word() for initial 'case' keyword to handle leading } in
process substitutions (edge case).
Replace character-based { and } detection with _lex_consume_word().
Lexer already handles { vs {abc distinction correctly.
Replace peek_word()/consume_word() for 'time' with _lex_is_at_reserved_word()
and _lex_consume_word(). Keep ! negation character-based.
Replace peek_word() dispatch with _lex_peek_reserved_word().
Keep fallback for leading } in process substitutions (edge case).
Add helper to peek case terminators (;;, ;&, ;;&) using existing operator detection.
Replace character-based terminator consumption with Lexer-based
_lex_peek_case_terminator() and _lex_next_token().
Replace usages with _lex_peek_case_terminator() is not None.
Mark as kept for edge cases; most usage migrated to _lex_consume_word().
Update ParseError calls in migrated compound command parsers to use
self._lex_peek_token().pos instead of self.pos for more accurate error
positions.
Add helper function that creates ParseError with token context showing
the unexpected token value or "end of input" for EOF tokens.
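
An error helper with token context could be sketched like this. The `ParseError` constructor shown here, the message wording, and the function signature are assumptions, not the project's actual definitions.

```python
class ParseError(Exception):
    """Hypothetical stand-in for the project's ParseError."""

    def __init__(self, message: str, pos: int) -> None:
        super().__init__(f"{message} at position {pos}")
        self.pos = pos


def unexpected_token_error(
    token_type: str, value: str, pos: int, expected: str
) -> ParseError:
    # EOF tokens have no useful value, so describe them in words instead.
    found = "end of input" if token_type == "EOF" else f"'{value}'"
    return ParseError(f"Expected {expected}, found {found}", pos)
```

Passing the token's own `pos` rather than the parser cursor is what points the message at the start of the offending token.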
- Use _lex_is_at_reserved_word() and _lex_consume_word() for 'function'
- Use _lex_consume_word() for 'coproc'
- Use _lex_peek_reserved_word() for COMPOUND_KEYWORDS checks
Replace manual case terminator detection with Lexer helper and remove
the now-unused _is_semicolon_or_amp() function.
Remove:
- _lex_peek_word() - never used
- _lex_peek_redirect_op() - never used
- _unexpected_token_error() - added but never used
- match_keyword() - legacy unused code
- _is_semicolon_or_amp() - replaced by _lex_peek_case_terminator()
@ldayton ldayton merged commit 0ce2c9c into main Jan 17, 2026
1 check passed
@ldayton ldayton deleted the add-lexer-layer branch January 17, 2026 09:01
ldayton added a commit that referenced this pull request Mar 25, 2026
Includes fixes for JS backend: constructor defaults (#321), startswith
pos arg (#324), operator precedence (#333), regex escaping (#322),
template literal backticks (#323), destructuring discard (#326),
isinstance primitives (#325, #327), backtick-heredoc (#352), and
UTF-8 encoding (#334).