Add Lexer layer to Parser by ldayton · Pull Request #324 · ldayton/Parable

ldayton · 2026-01-17T08:46:02Z

Summary

- Add a Lexer layer to the Parser for tokenization
- Migrate compound command parsers to use Lexer-based token detection
- Improve error messages with token positions

This PR introduces a Lexer class that handles tokenization (operators, reserved words, word tokens) and migrates the Parser to use it for compound command parsing. The Lexer provides cleaner separation between tokenization and parsing logic.

Key changes:

- New Lexer class with next_token() for token-based parsing
- TokenType constants and Token class for structured token output
- Parser state tracking via ParserStateFlags for context-sensitive parsing
- Lexer helper methods: _lex_peek_token(), _lex_peek_reserved_word(), _lex_consume_word(), etc.
- Migrated all compound commands (if, while, for, case, etc.) to Lexer-based reserved word detection
- Improved error positions using token positions instead of raw parser position
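
As a rough illustration of the shape described above, here is a minimal sketch of a token type table, a token record, and a lexer with next_token(). The constant values, the Token field names, the OPERATORS table, and the tokenize() driver are assumptions made for this example; the real classes in parable.py handle quoting, comments, and many more operators.

```python
class TokenType:
    # Hypothetical constant values; the PR only states that these are constants.
    EOF = 0
    WORD = 1
    OPERATOR = 2


class Token:
    def __init__(self, type, value, pos):
        self.type = type
        self.value = value
        self.pos = pos  # offset of the token's first character in the source


class Lexer:
    # Hypothetical operator table, listed so longer operators are tried first.
    OPERATORS = ("&&", "||", ";;", ";", "&", "|")

    def __init__(self, source):
        self.source = source
        self.pos = 0

    def next_token(self):
        # Skip blanks (spaces and tabs) in front of the token.
        while self.pos < len(self.source) and self.source[self.pos] in " \t":
            self.pos += 1
        start = self.pos
        if start >= len(self.source):
            return Token(TokenType.EOF, "", start)
        # Operators win over words.
        for op in self.OPERATORS:
            if self.source.startswith(op, start):
                self.pos = start + len(op)
                return Token(TokenType.OPERATOR, op, start)
        # Otherwise read a word up to the next blank or operator character.
        while self.pos < len(self.source) and self.source[self.pos] not in " \t;&|":
            self.pos += 1
        return Token(TokenType.WORD, self.source[start:self.pos], start)


def tokenize(source):
    """Collect (type, value, pos) triples up to and including EOF."""
    lexer = Lexer(source)
    tokens = []
    while True:
        tok = lexer.next_token()
        tokens.append((tok.type, tok.value, tok.pos))
        if tok.type == TokenType.EOF:
            return tokens
```

Because every token carries its own start offset, an error raised while looking at a token can report where that token begins rather than wherever the parser cursor happens to be.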

Replace peek_word() in stop_words checks with _lex_peek_reserved_word(). Also update _lex_peek_reserved_word() to:
- Strip trailing backslash-newline for classification (line continuation)
- Use module-level RESERVED_WORDS set for transpiler compatibility

Replace consume_word()/peek_word() calls with _lex_consume_word() and _lex_is_at_reserved_word() for keywords: for, in, do, done. Also update _lex_consume_word() to strip trailing backslash-newline for proper line continuation handling.

Remove:
- _lex_peek_word() - never used
- _lex_peek_redirect_op() - never used
- _unexpected_token_error() - added but never used
- match_keyword() - legacy unused code
- _is_semicolon_or_amp() - replaced by _lex_peek_case_terminator()

Includes fixes for JS backend: constructor defaults (#321), startswith pos arg (#324), operator precedence (#333), regex escaping (#322), template literal backticks (#323), destructuring discard (#326), isinstance primitives (#325, #327), backtick-heredoc (#352), and UTF-8 encoding (#334).

ldayton added 30 commits January 17, 2026 09:35

docs: add incremental plan for lexer layer

46dcfe4

Comprehensive plan for adding a lexer layer to parable.py, structured as 25+ incremental commits. Each increment passes all tests before committing.

feat: add TokenType constants for lexer

851eb7a

feat: add Token class for lexer output

bf9408e

feat: add LexerState flags for lexer context

01c0369

feat: add Lexer class skeleton

b435338

feat: add character classification methods to Lexer

0613c99

feat: implement operator tokenization in Lexer

592b57e

feat: add skip_blanks and comment handling to Lexer

5266f2c

feat: implement word tokenization with quotes in Lexer

cbad863

feat: implement next_token() in Lexer

5d75f74

feat: add Lexer instance to Parser

4dbbd4e

feat: add token history tracking to Parser

87d0c52

feat: add reserved word detection to Lexer

aa305d1

feat: add ParserStateFlags for context tracking

3f9392d

feat: add parser_state bitmask to Parser

bbfc908

feat: set PST_CASEPAT during case pattern parsing

7d8fc76

feat: set PST_ARITH during arithmetic parsing

fdbcfac
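
The parser_state bitmask and the PST_CASEPAT and PST_ARITH flags from the commits above could work roughly as follows. This is a sketch only: the bit values and the set_state(), clear_state(), and in_state() helper names are hypothetical, and the parser class is a stand-in rather than the real Parser.

```python
class ParserStateFlags:
    NONE = 0
    PST_CASEPAT = 1 << 0  # hypothetical bit: inside a case pattern
    PST_ARITH = 1 << 1    # hypothetical bit: inside an arithmetic context


class SketchParser:
    """Stand-in for the real Parser; only the state bitmask is modelled."""

    def __init__(self):
        self.parser_state = ParserStateFlags.NONE

    def set_state(self, flag):
        self.parser_state |= flag

    def clear_state(self, flag):
        self.parser_state &= ~flag

    def in_state(self, flag):
        return (self.parser_state & flag) != 0
```

A bitmask lets several contexts be active at once, for example an arithmetic expansion nested inside a case pattern, without a separate boolean attribute for each one.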

feat: add position sync helpers for Lexer integration

5c5af4d

Add _sync_lexer() and _sync_parser() methods to Parser for synchronizing position state between Parser and Lexer during incremental migration.

feat: add Lexer wrapper methods to Parser

5542b59

Add _lex_peek_token(), _lex_next_token(), and _lex_skip_blanks() methods that handle position synchronization and token recording.
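
The sync-then-delegate pattern these two commits describe might look like the sketch below. The method names _sync_lexer(), _sync_parser(), and _lex_skip_blanks() come from the commit messages; the class names, the _lexer attribute, and the method bodies are assumptions, and token recording is left out.

```python
class SketchLexer:
    """Stand-in lexer that only knows how to skip spaces and tabs."""

    def __init__(self, source):
        self.source = source
        self.pos = 0

    def skip_blanks(self):
        while self.pos < len(self.source) and self.source[self.pos] in " \t":
            self.pos += 1


class SketchParser:
    """Stand-in parser that still owns its own position during migration."""

    def __init__(self, source):
        self.source = source
        self.pos = 0
        self._lexer = SketchLexer(source)  # hypothetical attribute name

    def _sync_lexer(self):
        # Push the parser's position into the lexer before delegating.
        self._lexer.pos = self.pos

    def _sync_parser(self):
        # Pull the lexer's position back once it has consumed input.
        self.pos = self._lexer.pos

    def _lex_skip_blanks(self):
        self._sync_lexer()
        self._lexer.skip_blanks()
        self._sync_parser()
```

Keeping two cursors and copying between them is redundant in a finished design, but it allows character-based and token-based code paths to coexist while functions are migrated one at a time.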

refactor: use Lexer.skip_blanks() in skip_whitespace()

80d7738

Delegate spaces/tabs skipping to Lexer while keeping comment and line continuation handling in Parser.

refactor: use Lexer._skip_comment() for comment handling

3fc253d

Add _lex_skip_comment() wrapper and delegate comment skipping to Lexer in skip_whitespace().

feat: add _lex_peek_operator() helper

3689f45

Returns (token_type, value) tuple for operator tokens, None otherwise.

fix(transpiler): emit static fields for class-level assignments

92906c7

Class-level assignments in Python become class attributes accessible via ClassName.attr. In JavaScript, we need static fields for the same behavior.
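
For the Python side of this fix, a short demonstration of how class-level assignments behave. The class and attribute names are made up for the example; the JavaScript equivalent appears only as a comment.

```python
class TokenKinds:
    # Class-level assignments: stored once on the class object,
    # not copied onto each instance.
    EOF = 0
    WORD = 1


# Accessible through the class name, with no instance needed.
assert TokenKinds.WORD == 1

# Instances see the value through attribute lookup on the class.
kinds = TokenKinds()
assert kinds.WORD == 1
assert "WORD" not in vars(kinds)
assert "WORD" in vars(TokenKinds)

# A JavaScript class needs a static field to allow TokenKinds.WORD:
#   class TokenKinds { static EOF = 0; static WORD = 1; }
```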

refactor: migrate parse_list_operator() to use Lexer tokens

d6e1aaf

Replace manual character-based detection of &&, ||, ;, & operators with Lexer token checking via _lex_peek_operator() and _lex_next_token().

refactor: migrate _parse_simple_pipeline() pipe detection to Lexer

53f9c66

Replace manual | and |& character checking with Lexer token detection via _lex_peek_operator() for PIPE and PIPE_AMP tokens.

feat: add _lex_peek_reserved_word() helper

95ba2be

Returns the word value if it's a reserved word, None otherwise. Uses Lexer.classify_word() to check if a WORD token is actually a reserved word.
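
A simplified sketch of the "word value or None" contract described here. It works directly on a string and an offset instead of going through Lexer tokens, and the contents of RESERVED_WORDS, the metacharacter set, and the function name peek_reserved_word() are assumptions for illustration.

```python
# Hypothetical subset of shell reserved words.
RESERVED_WORDS = {
    "if", "then", "elif", "else", "fi",
    "while", "until", "do", "done",
    "for", "in", "case", "esac", "select",
    "function", "time", "coproc",
}

# Characters assumed to end a word in this sketch.
WORD_BREAKS = " \t\n;&|()<>"


def peek_reserved_word(source, pos):
    """Return the next word if it is reserved, else None. Consumes nothing."""
    while pos < len(source) and source[pos] in " \t":
        pos += 1
    end = pos
    while end < len(source) and source[end] not in WORD_BREAKS:
        end += 1
    word = source[pos:end]
    return word if word in RESERVED_WORDS else None
```

Since the function takes the position as an argument and never writes it back, calling it repeatedly is safe, which is what makes it usable for dispatch decisions before any input is consumed.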

feat: add _lex_peek_word() helper for Lexer-based word peeking

bf37d79

Returns the value of the next WORD token if available, None otherwise. Keeps existing peek_word() for edge cases requiring character-based behavior.

feat: add _lex_consume_word() helper for Lexer-based word consumption

b4e5c5a

Consumes a WORD token if it matches the expected value. For simple reserved word consumption without edge cases like process sub handling.
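
The consume-if-matches behaviour could be sketched as below. The WordCursor class, its _peek_word() helper, and the boolean return value are assumptions made for the example; the commit message does not state what _lex_consume_word() returns.

```python
class WordCursor:
    """Stand-in for the parser state needed to consume words."""

    WORD_BREAKS = " \t\n;&|()<>"  # assumed word-ending characters

    def __init__(self, source):
        self.source = source
        self.pos = 0

    def _peek_word(self):
        # Return the next word and the offset just past it, without moving.
        start = self.pos
        while start < len(self.source) and self.source[start] in " \t":
            start += 1
        end = start
        while end < len(self.source) and self.source[end] not in self.WORD_BREAKS:
            end += 1
        return self.source[start:end], end

    def consume_word(self, expected):
        # Advance only when the next word is exactly the expected one.
        word, end = self._peek_word()
        if word != expected:
            return False
        self.pos = end
        return True
```

A failed match leaves the position untouched, so a caller can try several keywords in turn, for example elif, then else, then fi.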

feat: add _lex_peek_redirect_op() helper

3564e26

Detects redirect operators: <, >, <<, >>, <&, >&, <>, >|, &>, &>>, <<-, <<< via Lexer token type checking.
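
Several of the operators listed above are prefixes of others (< of <<, << of <<- and <<<), so any detection has to prefer the longest match. The sketch below shows that idea with plain string matching; the real helper checks Lexer token types instead, and the function name match_redirect_op() is hypothetical.

```python
# The operator set from the commit message.
REDIRECT_OPS = ["<", ">", "<<", ">>", "<&", ">&", "<>", ">|", "&>", "&>>", "<<-", "<<<"]

# Try longer operators first so "<<<" is not read as "<<" followed by "<".
_BY_LENGTH = sorted(REDIRECT_OPS, key=len, reverse=True)


def match_redirect_op(source, pos):
    """Return the longest redirect operator starting at pos, or None."""
    for op in _BY_LENGTH:
        if source.startswith(op, pos):
            return op
    return None
```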

ldayton added 18 commits January 17, 2026 10:23

feat: add _lex_is_at_reserved_word() helper

87a272c

Add efficient helper for checking if next token is a specific reserved word.

feat: migrate parse_if() and _parse_elif_chain() to Lexer

7a0fc5d

Replace consume_word()/peek_word() calls with _lex_consume_word() and _lex_is_at_reserved_word() for keywords: if, then, elif, else, fi.

feat: migrate parse_while() and parse_until() to Lexer

1825cf2

Replace consume_word() calls with _lex_consume_word() for keywords: while, until, do, done.

feat: migrate parse_select() to Lexer

9ba08fe

Replace consume_word()/peek_word() calls with _lex_consume_word() and _lex_is_at_reserved_word() for keywords: select, in, do.

feat: migrate parse_case() reserved words to Lexer

cee2db8

Replace consume_word()/peek_word() calls with _lex_consume_word() and _lex_is_at_reserved_word() for keywords: in, esac. Keep consume_word() for initial 'case' keyword to handle leading } in process substitutions (edge case).

feat: migrate parse_brace_group() to Lexer

4c16a15

Replace character-based { and } detection with _lex_consume_word(). Lexer already handles { vs {abc distinction correctly.

feat: migrate parse_pipeline() time keyword to Lexer

4c48ec2

Replace peek_word()/consume_word() for 'time' with _lex_is_at_reserved_word() and _lex_consume_word(). Keep ! negation character-based.

feat: migrate parse_compound_command() dispatch to Lexer

c84da6d

Replace peek_word() dispatch with _lex_peek_reserved_word(). Keep fallback for leading } in process subs (edge case).

feat: add _lex_peek_case_terminator() helper

f6e2009

Add helper to peek case terminators (;;, ;&, ;;&) using existing operator detection.

feat: migrate _consume_case_terminator() to Lexer

5f169fd

Replace character-based terminator consumption with Lexer-based _lex_peek_case_terminator() and _lex_next_token().

refactor: remove _is_case_terminator() helper

93c918a

Replace usages with _lex_peek_case_terminator() is not None.

docs: update consume_word() docstring

82f50d8

Mark as kept for edge cases; most usage migrated to _lex_consume_word().

refactor: use token positions in ParseError for compound commands

c973674

Update ParseError calls in migrated compound command parsers to use self._lex_peek_token().pos instead of self.pos for more accurate error positions.

feat: add _unexpected_token_error() helper for improved error messages

3509c2a

Add helper function that creates ParseError with token context showing the unexpected token value or "end of input" for EOF tokens.
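
One way such a helper could be shaped is sketched below. The ParseError class here is a stand-in with an assumed (message, pos) constructor, and the message wording, the string token types, and the parameter list are all hypothetical.

```python
class ParseError(Exception):
    """Stand-in error type; the real ParseError may have other fields."""

    def __init__(self, message, pos):
        super().__init__(message)
        self.message = message
        self.pos = pos


def unexpected_token_error(token_type, value, pos, expected):
    """Build a ParseError that names the offending token and its position."""
    # EOF tokens have no useful text, so describe them in words instead.
    found = "end of input" if token_type == "EOF" else repr(value)
    return ParseError("expected %s but found %s" % (expected, found), pos)
```

The helper returns the exception rather than raising it, so call sites keep an explicit raise statement and the control flow stays visible.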

refactor: migrate parse_function() and parse_coproc() to Lexer

38d264a

- Use _lex_is_at_reserved_word() and _lex_consume_word() for 'function'
- Use _lex_consume_word() for 'coproc'
- Use _lex_peek_reserved_word() for COMPOUND_KEYWORDS checks

refactor: use _lex_peek_case_terminator() in parse_list_until()

e1e6c8e

Replace manual case terminator detection with Lexer helper and remove the now-unused _is_semicolon_or_amp() function.

ldayton merged commit 0ce2c9c into main Jan 17, 2026
1 check passed

ldayton deleted the add-lexer-layer branch January 17, 2026 09:01

ldayton merged 48 commits into main from add-lexer-layer