Comprehensive plan for adding a lexer layer to parable.py, structured as 25+ incremental commits. Each increment passes all tests before it is committed.
Add _sync_lexer() and _sync_parser() methods to Parser for synchronizing position state between Parser and Lexer during incremental migration.
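A minimal sketch of the synchronization idea, assuming both objects track a plain integer `pos` into the same source string. The `Lexer` below is a hypothetical stand-in, and `_lexer` is an assumed attribute name; the real methods may carry more state (for example a cached token).

```python
class Lexer:
    """Hypothetical stand-in that only tracks a cursor into the source."""

    def __init__(self, source):
        self.source = source
        self.pos = 0


class Parser:
    def __init__(self, source):
        self.source = source
        self.pos = 0
        self._lexer = Lexer(source)  # assumed attribute name

    def _sync_lexer(self):
        # Character-based parser code has moved self.pos; push it to the lexer.
        self._lexer.pos = self.pos

    def _sync_parser(self):
        # The lexer has consumed input; pull its position back into the parser.
        self.pos = self._lexer.pos


parser = Parser("echo hi")
parser.pos = 5          # parser advanced by legacy character-based code
parser._sync_lexer()    # lexer now starts from the same offset
parser._lexer.pos = 7   # lexer consumed "hi"
parser._sync_parser()   # parser catches up
```

Keeping the two directions as separate methods makes each call site state which side is authoritative at that moment.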
Add _lex_peek_token(), _lex_next_token(), and _lex_skip_blanks() methods that handle position synchronization and token recording.
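The difference between peeking and consuming can be sketched as follows. This is an illustration under assumptions: the toy `Lexer` only splits on spaces and tabs, and the `_recorded` list is a hypothetical stand-in for whatever token recording the real Parser does.

```python
from typing import NamedTuple


class Token(NamedTuple):
    type: str
    value: str
    pos: int


class Lexer:
    """Toy lexer: words separated by spaces/tabs, then EOF."""

    def __init__(self, source):
        self.source = source
        self.pos = 0

    def next_token(self):
        src = self.source
        while self.pos < len(src) and src[self.pos] in " \t":
            self.pos += 1
        start = self.pos
        if start >= len(src):
            return Token("EOF", "", start)
        while self.pos < len(src) and src[self.pos] not in " \t":
            self.pos += 1
        return Token("WORD", src[start:self.pos], start)


class Parser:
    def __init__(self, source):
        self.pos = 0
        self._lexer = Lexer(source)
        self._recorded = []  # hypothetical token log

    def _lex_peek_token(self):
        # Sync the lexer to the parser, lex one token, leave self.pos untouched.
        self._lexer.pos = self.pos
        return self._lexer.next_token()

    def _lex_next_token(self):
        # Same as peek, but commit the new position and record the token.
        self._lexer.pos = self.pos
        token = self._lexer.next_token()
        self.pos = self._lexer.pos
        self._recorded.append(token)
        return token
```

Because the peek re-syncs from `self.pos` every time, legacy character-based code can keep moving `self.pos` between calls without leaving the lexer stale.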
Delegate spaces/tabs skipping to Lexer while keeping comment and line continuation handling in Parser.
Add _lex_skip_comment() wrapper and delegate comment skipping to Lexer in skip_whitespace().
Returns (token_type, value) tuple for operator tokens, None otherwise.
Class-level assignments in Python become class attributes accessible via ClassName.attr. In JavaScript, we need static fields for the same behavior.
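For reference, the Python half of this note looks like the sketch below. The numeric values are placeholders chosen for illustration, not the values used in parable.py.

```python
class TokenType:
    # Class-level assignments: each name becomes a class attribute.
    WORD = 1
    PIPE = 2
    PIPE_AMP = 3


kind = TokenType.PIPE    # accessed through the class, no instance needed
instance = TokenType()   # instances see the same attributes
```

A transpiled JavaScript class would need these declared as static fields for `TokenType.PIPE` to resolve the same way.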
Replace manual character-based detection of &&, ||, ;, & operators with Lexer token checking via _lex_peek_operator() and _lex_next_token().
Replace manual | and |& character checking with Lexer token detection via _lex_peek_operator() for PIPE and PIPE_AMP tokens.
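A rough sketch of an operator peek that returns a `(token_type, value)` tuple or `None`, as described above. It works directly on a string rather than on Lexer tokens, and apart from PIPE and PIPE_AMP the token type names are assumptions made for this example.

```python
# Hypothetical operator table; only PIPE and PIPE_AMP are named in the commits.
OPERATORS = {
    "&&": "AND_AND",
    "||": "OR_OR",
    "|&": "PIPE_AMP",
    ";": "SEMI",
    "&": "AMP",
    "|": "PIPE",
}


def peek_operator(source, pos):
    """Return (token_type, value) for the operator at pos, else None."""
    # Try two-character operators first so "&&" is not read as "&" twice.
    for width in (2, 1):
        text = source[pos:pos + width]
        if len(text) == width and text in OPERATORS:
            return (OPERATORS[text], text)
    return None
```

Checking the longer candidates first is what lets `|&` and `|` share a prefix without ambiguity.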
Returns the word value if it's a reserved word, None otherwise. Uses Lexer.classify_word() to check if a WORD token is actually a reserved word.
Returns the value of the next WORD token if available, None otherwise. Keeps existing peek_word() for edge cases requiring character-based behavior.
Consumes a WORD token if it matches the expected value. Intended for simple reserved-word consumption, without edge cases such as process substitution handling.
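The consume-if-matches behavior can be sketched over a pre-split list of words. `WordStream` and its method names are hypothetical and exist only for this illustration; the real helper operates on Lexer tokens.

```python
class WordStream:
    """Hypothetical stand-in for a stream of WORD tokens."""

    def __init__(self, words):
        self.words = list(words)
        self.index = 0

    def peek_word(self):
        # Value of the next word, or None at the end of input.
        if self.index < len(self.words):
            return self.words[self.index]
        return None

    def consume_word(self, expected):
        # Advance only when the next word matches; otherwise leave state alone.
        if self.peek_word() == expected:
            self.index += 1
            return True
        return False


stream = WordStream(["if", "true", "then"])
```

Returning a boolean instead of raising lets the caller decide whether a missing keyword is an error or just a different grammar branch.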
Detects redirect operators: <, >, <<, >>, <&, >&, <>, >|, &>, &>>, <<-, <<< via Lexer token type checking.
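The operator list above has several shared prefixes, so any matcher has to prefer the longest candidate. The sketch below shows that with plain string matching; note that this is a swapped-in technique for illustration, since the commit itself checks Lexer token types rather than raw characters.

```python
# The redirect operators listed in the commit description.
REDIRECT_OPERATORS = (
    "<", ">", "<<", ">>", "<&", ">&", "<>", ">|", "&>", "&>>", "<<-", "<<<",
)

# Longest first, so "<<-" wins over "<<" and "<".
_BY_LENGTH = sorted(REDIRECT_OPERATORS, key=len, reverse=True)


def match_redirect(source, pos=0):
    """Return the longest redirect operator starting at pos, else None."""
    for op in _BY_LENGTH:
        if source.startswith(op, pos):
            return op
    return None
```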
Replace peek_word() in stop_words checks with _lex_peek_reserved_word(). Also update _lex_peek_reserved_word() to:
- Strip trailing backslash-newline for classification (line continuation)
- Use module-level RESERVED_WORDS set for transpiler compatibility
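A small sketch of classification with line-continuation stripping, assuming the word arrives as a raw string that may still end in backslash-newline. The contents of `RESERVED_WORDS` here are a partial, illustrative set, and `classify_reserved()` is a hypothetical name.

```python
# Module-level set (illustrative subset of shell reserved words).
RESERVED_WORDS = {
    "if", "then", "elif", "else", "fi",
    "while", "until", "do", "done",
    "for", "in", "select", "case", "esac",
}


def classify_reserved(word):
    """Return the reserved word, or None if the word is not reserved."""
    # A trailing backslash-newline is a line continuation, not part of the word.
    while word.endswith("\\\n"):
        word = word[:-2]
    return word if word in RESERVED_WORDS else None
```

Keeping the set at module level, rather than as a class attribute, avoids depending on class-attribute semantics that differ in the transpiled output.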
Add efficient helper for checking if next token is a specific reserved word.
Replace consume_word()/peek_word() calls with _lex_consume_word() and _lex_is_at_reserved_word() for keywords: if, then, elif, else, fi.
Replace consume_word() calls with _lex_consume_word() for keywords: while, until, do, done.
Replace consume_word()/peek_word() calls with _lex_consume_word() and _lex_is_at_reserved_word() for keywords: for, in, do, done. Also update _lex_consume_word() to strip trailing backslash-newline for proper line continuation handling.
Replace consume_word()/peek_word() calls with _lex_consume_word() and _lex_is_at_reserved_word() for keywords: select, in, do.
Replace consume_word()/peek_word() calls with _lex_consume_word() and _lex_is_at_reserved_word() for keywords: in, esac. Keep consume_word() for initial 'case' keyword to handle leading } in process substitutions (edge case).
Replace character-based { and } detection with _lex_consume_word().
Lexer already handles { vs {abc distinction correctly.
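The distinction can be illustrated by reading a whole word before comparing: `{` opens a brace group only when it is a complete word on its own. The metacharacter set and function names below are assumptions for this sketch, not the Lexer's actual rules.

```python
# Characters assumed to end a word in this sketch.
METACHARS = " \t\n;&|()<>"


def read_word(source, pos=0):
    """Return the run of non-metacharacters starting at pos."""
    end = pos
    while end < len(source) and source[end] not in METACHARS:
        end += 1
    return source[pos:end]


def at_brace_open(source, pos=0):
    # "{" counts only as a full word; "{abc" is an ordinary word.
    return read_word(source, pos) == "{"
```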
Replace peek_word()/consume_word() for 'time' with _lex_is_at_reserved_word() and _lex_consume_word(). Keep ! negation character-based.
Replace peek_word() dispatch with _lex_peek_reserved_word(). Keep fallback for leading } in process subs (edge case).
Add helper to peek case terminators (;;, ;&, ;;&) using existing operator detection.
Replace character-based terminator consumption with Lexer-based _lex_peek_case_terminator() and _lex_next_token().
Replace usages with _lex_peek_case_terminator() is not None.
Mark as kept for edge cases; most usage migrated to _lex_consume_word().
Update ParseError calls in migrated compound command parsers to use self._lex_peek_token().pos instead of self.pos for more accurate error positions.
Add helper function that creates ParseError with token context showing the unexpected token value or "end of input" for EOF tokens.
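One way such a helper could look, as a sketch only: the `ParseError` constructor signature, the `Token` shape, and the message wording are all assumptions, since the original code is not shown here.

```python
from typing import NamedTuple


class Token(NamedTuple):
    type: str
    value: str
    pos: int


class ParseError(Exception):
    """Hypothetical error type carrying the offending position."""

    def __init__(self, message, pos):
        super().__init__(f"{message} at position {pos}")
        self.pos = pos


def unexpected_token_error(token, expected):
    # EOF tokens have no useful value, so describe them in words instead.
    found = "end of input" if token.type == "EOF" else repr(token.value)
    return ParseError(f"expected {expected}, found {found}", token.pos)


error = unexpected_token_error(Token("WORD", "done", 12), "'fi'")
```

Taking the position from the token rather than from the parser's cursor points the error at the start of the unexpected token, which matches the motivation given for using the peeked token's position.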
- Use _lex_is_at_reserved_word() and _lex_consume_word() for 'function'
- Use _lex_consume_word() for 'coproc'
- Use _lex_peek_reserved_word() for COMPOUND_KEYWORDS checks
Replace manual case terminator detection with Lexer helper and remove the now-unused _is_semicolon_or_amp() function.
Remove:
- _lex_peek_word(): never used
- _lex_peek_redirect_op(): never used
- _unexpected_token_error(): added but never used
- match_keyword(): legacy unused code
- _is_semicolon_or_amp(): replaced by _lex_peek_case_terminator()
Summary
This PR introduces a Lexer class that handles tokenization (operators, reserved words, word tokens) and migrates the Parser to use it for compound command parsing. The Lexer provides cleaner separation between tokenization and parsing logic.
Key changes:
- Lexer class with next_token() for token-based parsing
- TokenType constants and Token class for structured token output
- ParserStateFlags for context-sensitive parsing
- Parser helper methods: _lex_peek_token(), _lex_peek_reserved_word(), _lex_consume_word(), etc.
- Migration of compound commands (if, while, for, case, etc.) to Lexer-based reserved word detection