Conversation
Reviewer's Guide

This PR integrates a declarative, logos-based lexer by introducing a new
tokenizer module with a public `tokenize` API.

Sequence diagram for the new logos-based tokenization process

```mermaid
sequenceDiagram
participant Client
participant Tokenizer
participant LogosLexer
participant SyntaxKind
Client->>Tokenizer: tokenize(src: &str)
Tokenizer->>LogosLexer: Token::lexer(src)
loop for each token
LogosLexer-->>Tokenizer: next() -> Token + span
Tokenizer->>SyntaxKind: map Token to SyntaxKind
Tokenizer-->>Tokenizer: collect (SyntaxKind, Span)
end
Tokenizer-->>Client: Vec<(SyntaxKind, Span)>
```
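A minimal sketch of this flow in Rust, assuming logos 0.13 or later (where the lexer yields `Result`) and a hypothetical `syntax_kind_for` helper; the real `Token` enum and mapping live in `src/tokenizer.rs`:

```rust
use logos::Logos;

/// Byte range of a token within the source text.
pub type Span = std::ops::Range<usize>;

/// Sketch of the `tokenize` entry point shown in the diagram above.
pub fn tokenize(src: &str) -> Vec<(SyntaxKind, Span)> {
    let mut lexer = Token::lexer(src);
    let mut tokens = Vec::new();
    while let Some(result) = lexer.next() {
        let span = lexer.span();
        let kind = match result {
            // `syntax_kind_for` is a hypothetical helper that maps each
            // `Token` variant (and keyword identifiers) to a `SyntaxKind`.
            Ok(token) => syntax_kind_for(token, &src[span.clone()]),
            // Unrecognised input becomes the error kind from the diagram.
            Err(()) => SyntaxKind::N_ERROR,
        };
        tokens.push((kind, span));
    }
    tokens
}
```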
Class diagram for the new tokenizer module and public API

```mermaid
classDiagram
class Tokenizer {
+tokenize(src: &str) Vec<(SyntaxKind, Span)>
}
class Token {
<<enum>>
Whitespace
Comment
Ident
Number
String
LParen
RParen
LBrace
RBrace
LBracket
RBracket
Semi
Comma
Dot
ColonColon
Colon
Pipe
Amp
EqEq
Eq
Implies
Percent
Star
Slash
Plus
Minus
Arrow
FatArrow
Lte
Spaceship
Gte
Lt
Gt
Neq
Shr
Shl
Tilde
At
Hash
Apostrophe
}
class SyntaxKind {
<<enum>>
T_WHITESPACE
T_COMMENT
T_IDENT
T_NUMBER
T_STRING
T_LPAREN
T_RPAREN
T_LBRACE
T_RBRACE
T_LBRACKET
T_RBRACKET
T_SEMI
T_COMMA
T_DOT
T_COLON_COLON
T_COLON
T_PIPE
T_AMP
T_EQEQ
T_EQ
T_IMPLIES
T_PERCENT
T_STAR
T_SLASH
T_PLUS
T_MINUS
T_ARROW
T_FAT_ARROW
T_LTE
T_SPACESHIP
T_GTE
T_LT
T_GT
T_NEQ
T_SHR
T_SHL
T_TILDE
T_AT
T_HASH
T_APOSTROPHE
N_ERROR
...
}
class Span {
<<type>>
std::ops::Range<usize>
}
Tokenizer --> Token : uses
Tokenizer --> SyntaxKind : maps to
Tokenizer --> Span : returns
Token --> SyntaxKind : mapped to
```
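With logos, such variants are declared declaratively via `#[token]` and `#[regex]` attributes. A short excerpt of how the enum might be defined (the patterns shown are illustrative assumptions, not the crate's actual definitions):

```rust
use logos::Logos;

#[derive(Logos, Debug, Clone, Copy, PartialEq, Eq)]
enum Token {
    // Trivia: the exact patterns in src/tokenizer.rs may differ.
    #[regex(r"[ \t\r\n]+")]
    Whitespace,
    #[regex(r"[A-Za-z_][A-Za-z0-9_]*")]
    Ident,
    // Punctuation uses literal `#[token]` attributes.
    #[token("(")]
    LParen,
    #[token(")")]
    RParen,
    // Logos applies maximal munch, so `::` wins over `:` where both match.
    #[token("::")]
    ColonColon,
    #[token(":")]
    Colon,
    #[token("=>")]
    FatArrow,
    #[token("=")]
    Eq,
}
```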
Class diagram for keyword mapping in the tokenizer

```mermaid
classDiagram
class KEYWORDS {
<<static HashMap<&'static str, SyntaxKind>>
}
class Tokenizer {
+keyword_kind(ident: &str) Option<SyntaxKind>
}
KEYWORDS <.. Tokenizer : used by
```
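A plausible shape for this lookup, sketched with a lazily initialised map (requires Rust 1.80+ for `LazyLock`; the keyword strings and `SyntaxKind` variant names are assumptions for illustration):

```rust
use std::collections::HashMap;
use std::sync::LazyLock;

// Hypothetical entries; the real table covers DDlog's full keyword set.
static KEYWORDS: LazyLock<HashMap<&'static str, SyntaxKind>> = LazyLock::new(|| {
    HashMap::from([
        ("relation", SyntaxKind::K_RELATION),
        ("typedef", SyntaxKind::K_TYPEDEF),
        ("input", SyntaxKind::K_INPUT),
    ])
});

/// Returns the keyword kind for `ident`, or `None` for ordinary identifiers.
fn keyword_kind(ident: &str) -> Option<SyntaxKind> {
    // Assumes `SyntaxKind: Copy`, as is typical for rowan-style kind enums.
    KEYWORDS.get(ident).copied()
}
```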
File-Level Changes
Summary by CodeRabbit
Walkthrough

A new tokenizer module for DDlog source code was introduced, utilising the
logos crate for lexical analysis.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Client as Caller (e.g. Parser or Test)
participant Tokenizer as tokenizer::tokenize
participant Logos as logos::Lexer
Client->>Tokenizer: tokenize(src: &str)
Tokenizer->>Logos: create Lexer from src
loop For each token
Logos-->>Tokenizer: (Token, Span)
Tokenizer->>Tokenizer: Map Token to SyntaxKind (check for keywords)
end
Tokenizer-->>Client: Vec<(SyntaxKind, Span)>
```
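A caller might exercise this flow as follows (a sketch; the DDlog fragment is illustrative and `SyntaxKind` is assumed to implement `Debug`):

```rust
use ddlint::tokenize;

fn main() {
    let src = "relation R(x: u32)"; // hypothetical DDlog fragment
    for (kind, span) in tokenize(src) {
        // Slice the original source with each token's span.
        let text = &src[span.clone()];
        println!("{kind:?} {span:?} {text:?}");
    }
}
```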
Actionable comments posted: 7
🔭 Outside diff range comments (1)
src/lib.rs (1)
1-3: Update the crate documentation to reflect expanded scope.

The module comment states "Currently exposes only the parser language
definitions", but the crate now also exposes tokenizer functionality.

```diff
 //! Library crate for ddlint.
 //!
-//! Currently exposes only the parser language definitions.
+//! Exposes parser language definitions and lexical analysis functionality.
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (5)
- `Cargo.toml` (1 hunks)
- `docs/parser-plan.md` (1 hunks)
- `src/lib.rs` (1 hunks)
- `src/tokenizer.rs` (1 hunks)
- `tests/tokenizer.rs` (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
`docs/**/*.md`: Use the markdown files within the `docs/` directory as a knowledge base and source of truth for project requirements, dependency choices, and architectural decisions.

- Proactively update the relevant file(s) in the `docs/` directory to reflect the latest state when new decisions are made, requirements change, libraries are added/removed, or architectural patterns evolve.
- Documentation in `docs/` must use en-GB-oxendict spelling and grammar, except for the word 'license'.
- Validate Markdown files using `make markdownlint`.
- Run `make fmt` after any documentation changes to format all Markdown files and fix table markup.
- Validate Markdown Mermaid diagrams using `make nixie`.
- Markdown paragraphs and bullet points must be wrapped at 80 columns.
- Code blocks in Markdown must be wrapped at 120 columns.
- Tables and headings in Markdown must not be wrapped.

Files: `docs/parser-plan.md`
`Cargo.toml`: Use explicit version ranges in `Cargo.toml` and keep dependencies up-to-date.

Files: `Cargo.toml`
`**/*.rs`: Document public APIs using Rustdoc comments (`///`) so documentation can be generated with cargo doc.

- Every module must begin with a module-level (`//!`) comment explaining the module's purpose and utility.
- Place function attributes after doc comments.
- Do not use `return` in single-line functions.
- Use predicate functions for conditional criteria with more than two branches.
- Lints must not be silenced except as a last resort.
- Lint rule suppressions must be tightly scoped and include a clear reason.
- Prefer `expect` over `allow`.
- Prefer `.expect()` over `.unwrap()`.
- Prefer immutable data and avoid unnecessary `mut` bindings.
- Handle errors with the `Result` type instead of panicking where feasible.
- Avoid `unsafe` code unless absolutely necessary and document any usage clearly.
- Use explicit version ranges in `Cargo.toml` and keep dependencies up-to-date.
- Use `rstest` fixtures for shared setup.
- Replace duplicated tests with `#[rstest(...)]` parameterised cases.
- Prefer `mockall` for mocks/stubs.
- Clippy warnings MUST be disallowed.
- Fix any warnings emitted during tests in the code itself rather than silencing them.
- Where a function is too long, extract meaningfully named helper functions adhering to separation of concerns and CQRS.
- Where a function has too many parameters, group related parameters in meaningfully named structs.
- Where a function is returning a large error, consider using `Arc` to reduce the amount of data returned.
- Write unit and behavioural tests for new functionality. Run both before and after making any change.

Files: `src/lib.rs`, `tests/tokenizer.rs`, `src/tokenizer.rs`
🔇 Additional comments (7)
docs/parser-plan.md (1)
36-39: LGTM! Clear explanation of the lexer design choice.

The documentation effectively explains why the `logos` lexer was chosen over `chumsky`'s text utilities, highlighting the benefits of declarative token definitions whilst maintaining parser interoperability.

src/lib.rs (1)
8-8: LGTM! Clean module export structure.

The public module declaration and re-exports properly expose the tokenizer functionality at the crate root.

Also applies to: 11-11
tests/tokenizer.rs (1)

1-17: LGTM! Good use of rstest for parameterised testing.

The test structure follows the coding guidelines well, using rstest fixtures and parameterised cases as recommended.
src/tokenizer.rs (4)
1-6: LGTM! Excellent module documentation.

The module documentation clearly explains the purpose and approach, following the coding guidelines for module-level comments.
180-182: LGTM! Good use of #[must_use] attribute.

The function signature correctly uses the `#[must_use]` attribute and follows naming conventions.
187-187: Replace unwrap_or with expect per coding guidelines.

The coding guidelines prefer `.expect()` over `.unwrap()` methods for better error context.

```diff
-    let text = src.get(span.clone()).unwrap_or("");
+    let text = src.get(span.clone()).expect("lexer span should be valid for source");
```

⛔ Skipped due to learnings

Learnt from: CR
PR: leynos/wireframe#0
File: AGENTS.md:0-0
Timestamp: 2025-06-24T16:40:23.802Z
Learning: Prefer `.expect()` over `.unwrap()` in Rust.

Learnt from: CR
PR: leynos/ddlint#0
File: AGENTS.md:0-0
Timestamp: 2025-06-24T16:39:54.809Z
Learning: Prefer `.expect()` over `.unwrap()` in Rust.
25-26: Review string literal pattern for completeness.

The string regex `r#""([^"\\]|\\.)*""#` handles basic escaped characters but may not cover all DDlog string literal requirements. Verify it handles:

- Escaped quotes: `"\""`
- Escaped backslashes: `"\\"`
- Unicode escapes: `"\u{1234}"`
- Other escape sequences: `"\n"`, `"\t"`, etc.

```bash
#!/bin/bash
# Test the string regex against various cases
python3 << 'EOF'
import re

pattern = r'"([^"\\]|\\.)*"'
test_cases = [
    '""',               # empty string
    '"hello"',          # simple string
    '"don\'t"',         # apostrophe
    '"\""',             # escaped quote
    '"\\\\"',           # escaped backslash
    '"line1\\nline2"',  # escaped newline
    '"\\u{1234}"',      # unicode escape
]

for case in test_cases:
    match = re.fullmatch(pattern, case)
    print(f"{case:15} -> {'✓' if match else '✗'}")
EOF
```
Actionable comments posted: 2
🔭 Outside diff range comments (1)
tests/tokenizer.rs (1)
1-72: Add required module-level documentation.

The coding guidelines require that every module must begin with a module-level (`//!`) comment explaining the module's purpose and utility.

Add a module-level comment at the beginning of the file:

```diff
+//! Integration tests for the tokenizer module.
+//!
+//! Tests verify that the logos-based lexer correctly tokenises DDlog source code
+//! into (SyntaxKind, Span) pairs, covering keywords, literals, trivia, and error cases.
+
 use ddlint::{SyntaxKind, tokenize};
```
♻️ Duplicate comments (3)
tests/tokenizer.rs (3)
9-17: Expand test coverage for comprehensive tokenizer validation.

The current tests only cover a few keywords and identifiers. Consider adding test cases for all token types, operators, punctuation, edge cases, and complex expressions to ensure comprehensive validation of the tokenizer functionality.
23-23: Improve error handling by using `.expect()` instead of `.unwrap_or()`.

The coding guidelines prefer `.expect()` over `.unwrap()` methods for better error messages.

Apply this diff:

```diff
-    let text = simple_input.get(span.clone()).unwrap_or("");
+    let text = simple_input.get(span.clone()).expect("span should be valid for input");
```
53-57: Replace panic with `.expect()` for clearer error handling.

Same issue as the previous test function - use `.expect()` instead of the panic pattern.

Apply this diff:

```diff
-    let first = tokens
-        .first()
-        .cloned()
-        .unwrap_or_else(|| panic!("no token"));
+    let first = tokens
+        .first()
+        .cloned()
+        .expect("tokenizer should produce at least one token");
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
- `tests/tokenizer.rs` (1 hunks)
Docstrings generation was requested by @leynos.

* #7 (comment)

The following files were modified:

* `src/tokenizer.rs`
* `tests/tokenizer.rs`
Note: Generated docstrings for this pull request at #8
Actionable comments posted: 1
♻️ Duplicate comments (5)
Cargo.toml (1)
10-13: Dependency version ranges properly addressed.

The explicit version ranges for both `logos` and `rstest` dependencies now follow the coding guidelines and address the concerns raised in previous reviews.

tests/tokenizer.rs (4)
29-29: Replace `.unwrap_or("")` with `.expect()` for better error handling.

This unresolved issue from previous reviews violates the coding guidelines, which prefer `.expect()` over `.unwrap()` methods.

Apply this diff:

```diff
-    let text = simple_input.get(span.clone()).unwrap_or("");
+    let text = simple_input.get(span.clone()).expect("span should be valid for input");
```
44-47: Replace `.unwrap_or_else(|| panic!())` with `.expect()` for clearer error handling.

This unresolved issue from previous reviews should use `.expect()` as preferred by the coding guidelines.

Apply this diff:

```diff
-    let first = tokens
-        .first()
-        .cloned()
-        .unwrap_or_else(|| panic!("no token"));
+    let first = tokens
+        .first()
+        .cloned()
+        .expect("tokenizer should produce at least one token");
```
59-62: Replace `.unwrap_or_else(|| panic!())` with `.expect()` for clearer error handling.

Consistent with previous suggestions, use `.expect()` for better error messages.

Apply this diff:

```diff
-    let first = tokens
-        .first()
-        .cloned()
-        .unwrap_or_else(|| panic!("no token"));
+    let first = tokens
+        .first()
+        .cloned()
+        .expect("tokenizer should produce at least one token");
```
72-75: Replace `.unwrap_or_else(|| panic!())` with `.expect()` for clearer error handling.

This unresolved issue from previous reviews needs to be addressed for consistency with coding guidelines.

Apply this diff:

```diff
-    let first = tokens
-        .first()
-        .cloned()
-        .unwrap_or_else(|| panic!("no token"));
+    let first = tokens
+        .first()
+        .cloned()
+        .expect("tokenizer should produce at least one token");
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (4)
- `Cargo.toml` (1 hunks)
- `docs/parser-plan.md` (1 hunks)
- `src/lib.rs` (1 hunks)
- `tests/tokenizer.rs` (1 hunks)
🔇 Additional comments (5)
src/lib.rs (2)
3-3: Documentation update appropriately reflects expanded functionality.

The crate documentation accurately describes the addition of lexical analysis capabilities.
8-11: Clean integration of tokenizer module into public API.

The tokenizer module declaration and re-exports follow Rust conventions and provide a clean public interface for the lexical analysis functionality.
docs/parser-plan.md (1)
36-55: Excellent documentation of tokenizer design and workflow.

The addition clearly explains the rationale for choosing `logos` over `chumsky` text utilities, and the sequence diagram effectively illustrates the tokenization process from client request through to returning `(SyntaxKind, Span)` pairs.

tests/tokenizer.rs (2)
1-8: Excellent module documentation and imports.

The module-level documentation clearly explains the purpose and scope of the tokenizer tests, and the imports are appropriate for the testing framework used.
97-111: Comprehensive test coverage demonstrates robust tokenizer implementation.

The test suite effectively covers edge cases, error conditions, and complex expressions, providing good confidence in the tokenizer's correctness across various input scenarios.
@coderabbitai review
✅ Actions performed

Review triggered.
@sourcery-ai review
Hey @leynos - I've reviewed your changes - here's some feedback:
- The raw-string regex for number literals uses `\\.`, which matches a backslash plus a dot rather than the intended literal dot; switch to `r"[0-9]+(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?"` (or split the parts) to correctly match decimals. A sketch of the corrected pattern follows this list.
- Consider replacing the runtime `HashMap` in `KEYWORDS` with a compile-time `phf::Map` (or equivalent) to eliminate startup overhead and speed up keyword lookups; see the sketch after the prompt block below.
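A minimal sketch of the corrected number pattern, using the regex suggested above in a logos-derived enum (`NumToken` is illustrative, not the crate's actual `Token`):

```rust
use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
enum NumToken {
    // `\.` inside the raw string escapes the dot in the regex, so it
    // matches a literal decimal point rather than backslash + dot.
    #[regex(r"[0-9]+(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?")]
    Number,
}

#[test]
fn number_regex_matches_decimals() {
    for src in ["42", "3.14", "1e9", "6.02e+23"] {
        let mut lex = NumToken::lexer(src);
        assert_eq!(lex.next(), Some(Ok(NumToken::Number)), "failed on {src}");
        // Each literal should lex as one token spanning the whole input.
        assert_eq!(lex.slice(), src);
    }
}
```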
Prompt for AI Agents

```text
Please address the comments from this code review:

## Overall Comments
- The raw-string regex for number literals uses `\\.` which matches a backslash+dot rather than the intended literal dot; switch to `r"[0-9]+(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?"` (or split parts) to correctly match decimals.
- Consider replacing the runtime `HashMap` in `KEYWORDS` with a compile-time `phf::Map` (or equivalent) to eliminate startup overhead and speed up keyword lookups.
```

Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
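To illustrate the `phf` suggestion, a compile-time keyword table might look like this (a sketch assuming the `phf` crate with its `macros` feature; the keyword strings and `SyntaxKind` variant names are hypothetical):

```rust
use phf::phf_map;

// Built at compile time, so there is no lazy initialisation at startup.
static KEYWORDS: phf::Map<&'static str, SyntaxKind> = phf_map! {
    "relation" => SyntaxKind::K_RELATION,
    "typedef" => SyntaxKind::K_TYPEDEF,
    "input" => SyntaxKind::K_INPUT,
};

fn keyword_kind(ident: &str) -> Option<SyntaxKind> {
    // Assumes `SyntaxKind: Copy`.
    KEYWORDS.get(ident).copied()
}
```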
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
- `tests/tokenizer.rs` (1 hunks)
🔇 Additional comments (2)
tests/tokenizer.rs (2)
1-15: Excellent module setup and adherence to testing guidelines.

The module documentation clearly explains the purpose, and the use of `rstest` fixtures aligns perfectly with the coding guidelines. The `expect` attribute with a clear reason is appropriate for test code.
17-216: Comprehensive test coverage successfully addresses previous feedback.

The test suite now provides excellent coverage of the tokenizer functionality, addressing the expansion suggestions from previous reviews. The tests cover all major token types, error conditions, edge cases, and complex expressions effectively.
Summary
- Use the `logos` lexer for tokenisation
- Expose a `tokenize` API and add tests

Testing

- `make fmt`
- `make lint`
- `make test`
- `make markdownlint`

https://chatgpt.com/codex/tasks/task_e_685aced31d6883229a9c4a110e425f6b
Summary by Sourcery
Implement lexing via the `logos` crate, introduce a `tokenize` API producing `(SyntaxKind, Span)` streams, update documentation to reflect the new lexer design, and add end-to-end tests for the tokenizer.

New Features:

- Produce `(SyntaxKind, Span)` pairs via a `tokenize()` API
- Expose the `tokenize` function and `Span` type in the library public API

Enhancements:

- Use a `HashMap` for keyword lookup to map identifiers to `SyntaxKind` in O(1) time
- Update `parser-plan.md` to document the logos-based lexing flow with a sequence diagram

Build:

- Add `logos` as a dependency and `rstest` as a dev-dependency in `Cargo.toml`

Documentation:

Tests: