Conversation
Reviewer's Guide

This PR centralizes markdown tokenization by introducing a reusable tokenizer (`tokenize_markdown`) and refactoring `process_tokens` to use it.

Class diagram for the new reusable tokenizer and refactored `process_tokens`:

```mermaid
classDiagram
    class Token {
        <<enum>>
        +Text(&str)
        +Code(&str)
        +Fence(&str)
        +Newline
    }
    class tokenize_markdown {
        +tokenize_markdown(source: &str) Token[]
    }
    class process_tokens {
        +process_tokens(lines: &[String], f: FnMut(Token, &mut String)) Vec<String>
    }
    Token <.. tokenize_markdown : emits
    tokenize_markdown <.. process_tokens : used by
```
Flow diagram for centralized markdown tokenization:

```mermaid
flowchart TD
    Input[Markdown Source]
    Tokenizer[tokenize_markdown]
    Processor[process_tokens]
    Output[Processed Lines]
    Input --> Tokenizer --> Processor --> Output
```
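As a quick orientation, the sketch below exercises the new tokenizer based solely on the signatures shown in the diagrams above; the `crate::wrap` path and the `count_code_spans` helper are illustrative assumptions, not part of the PR.

```rust
use crate::wrap::{tokenize_markdown, Token};

/// Count inline code spans in a Markdown snippet (hypothetical helper, shown
/// only to illustrate the `tokenize_markdown` -> `Token` flow).
fn count_code_spans(source: &str) -> usize {
    tokenize_markdown(source)
        .into_iter()
        .filter(|token| matches!(token, Token::Code(_)))
        .count()
}
```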
File-Level Changes
Possibly linked issues
Summary by CodeRabbit
Walkthrough

Update the Markdown architecture documentation to clarify the role of tokenization, removing outdated references to internal state machines. Refactor `process_tokens` to delegate tokenization to the new `tokenize_markdown` function.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant textproc
    participant wrap
    participant tokenize
    Client->>textproc: process_tokens(lines, f)
    textproc->>wrap: tokenize_markdown(joined_lines)
    wrap->>tokenize: tokenize_markdown(source)
    tokenize-->>wrap: Vec<Token>
    wrap-->>textproc: Vec<Token>
    textproc->>textproc: Apply transformation f to each Token
    textproc-->>Client: Vec<String> (processed lines)
```
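A hedged caller-side sketch of `process_tokens`, assuming the `FnMut(Token, &mut String)` closure shape from the class diagram. The PR does not show whether the closure must re-emit code delimiters and newlines itself, so the sketch assumes the closure appends whatever output it wants to keep; the paths and the `uppercase_prose` helper are illustrative.

```rust
use crate::textproc::process_tokens;
use crate::wrap::Token;

// Illustrative transform: upper-case prose while passing code spans, fences,
// and newlines through unchanged (delimiter handling is assumed, not verified).
fn uppercase_prose(lines: &[String]) -> Vec<String> {
    process_tokens(lines, |token, out| match token {
        Token::Text(text) => out.push_str(&text.to_uppercase()),
        Token::Code(code) => {
            out.push('`');
            out.push_str(code);
            out.push('`');
        }
        Token::Fence(fence) => out.push_str(fence),
        Token::Newline => out.push('\n'),
    })
}
```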
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~15 minutes

Possibly related PRs
Hey @leynos - I've reviewed your changes and found some issues that need to be addressed.
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `src/textproc.rs:60` </location>
<code_context>
- for _ in out_blanks..trailing_blanks {
- result.push(String::new());
- }
+ let mut result: Vec<String> = out.split('\n').map(ToOwned::to_owned).collect();
+ result.extend(std::iter::repeat_n(
+ String::new(),
+ trailing_blanks.saturating_sub(result.len()),
</code_context>
<issue_to_address>
Using repeat_n with saturating_sub may not always match the original trailing blank line count.
The new logic may add extra blank lines when result.len() is less than trailing_blanks. Please verify this aligns with the intended behavior for all cases.
</issue_to_address>
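A minimal sketch of the divergence being flagged, using hypothetical helper names and values that are not taken from the diff:

```rust
// Original behaviour: pad by the number of missing *trailing blanks*.
fn pad_old(result: &mut Vec<String>, out_blanks: usize, trailing_blanks: usize) {
    for _ in out_blanks..trailing_blanks {
        result.push(String::new());
    }
}

// New behaviour: pad relative to the *total* line count instead.
fn pad_new(result: &mut Vec<String>, trailing_blanks: usize) {
    result.extend(std::iter::repeat_n(
        String::new(),
        trailing_blanks.saturating_sub(result.len()),
    ));
}

fn main() {
    // One non-blank line already present, two trailing blanks expected.
    let mut a = vec!["text".to_string()];
    let mut b = a.clone();
    pad_old(&mut a, 0, 2); // ["text", "", ""] - two blanks appended
    pad_new(&mut b, 2); // ["text", ""] - only one appended (2 minus len 1)
    assert_ne!(a, b);
}
```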
### Comment 2
<location> `src/wrap/tokenize.rs:150` </location>
<code_context>
+
+/// Tokenize a block of Markdown into [`Token`]s.
+#[must_use]
+pub fn tokenize_markdown(source: &str) -> Vec<Token<'_>> {
+ if source.is_empty() {
+ return Vec::new();
</code_context>
<issue_to_address>
Consider refactoring both `tokenize_markdown` and `tokenize_inline` to use streaming iterators and avoid intermediate allocations for improved clarity and efficiency.
You can collapse the two `for`-loops + `Vec<&str>` + `last_idx` logic in `tokenize_markdown` into a single streaming pass over `source.split_inclusive('\n')`. That removes one allocation and unifies your fence/newline handling:
```rust
pub fn tokenize_markdown(source: &str) -> Vec<Token<'_>> {
    let mut tokens = Vec::new();
    let mut in_fence = false;
    for segment in source.split_inclusive('\n') {
        let line = segment.trim_end_matches('\n');
        let is_fence = super::is_fence(line);
        if is_fence || in_fence {
            tokens.push(Token::Fence(line));
            if is_fence {
                in_fence = !in_fence;
            }
        } else {
            tokenize_inline(line, &mut |t| tokens.push(t));
        }
        tokens.push(Token::Newline);
    }
    // drop the trailing newline if source didn't end with '\n'
    if !source.ends_with('\n') {
        tokens.pop();
    }
    tokens
}
```
Similarly, you can simplify `tokenize_inline` by turning it into an explicit iterator, removing manual `find` + `.repeat(…)` calls and flattening its internal loops:
```rust
struct InlineTok<'a> {
    rest: &'a str,
}

impl<'a> InlineTok<'a> {
    fn new(s: &'a str) -> Self {
        InlineTok { rest: s }
    }
}

impl<'a> Iterator for InlineTok<'a> {
    type Item = Token<'a>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.rest.is_empty() {
            return None;
        }
        if let Some(pos) = self.rest.find('`') {
            if pos > 0 {
                let (head, tail) = self.rest.split_at(pos);
                self.rest = tail;
                return Some(Token::Text(head));
            }
            // count backticks
            let delim_len = self.rest.chars().take_while(|&c| c == '`').count();
            let (d, after) = self.rest.split_at(delim_len);
            if let Some(idx) = after.find(d) {
                let (code, rest) = after.split_at(idx);
                self.rest = &rest[d.len()..];
                return Some(Token::Code(code));
            }
        }
        // no more backticks, emit remainder
        let rem = self.rest;
        self.rest = "";
        Some(Token::Text(rem))
    }
}

fn tokenize_inline<'a, F>(text: &'a str, mut emit: F)
where
    F: FnMut(Token<'a>),
{
    for tok in InlineTok::new(text) {
        emit(tok);
    }
}
```
These changes preserve all current behavior but reduce nesting, eliminate the temporary `Vec<&str>`, and avoid manual string-repeat allocations.
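If the streaming rewrite is adopted, a small parameterised test could guard the behaviour. This sketch assumes `rstest` is available per the repository conventions and that plain lines tokenize to `Token::Text` while backtick spans become `Token::Code`, as in the suggestion above; it has not been verified against the merged code.

```rust
#[cfg(test)]
mod streaming_tokenizer_tests {
    use super::*;
    use rstest::rstest;

    #[rstest]
    #[case::plain_text("no code here")]
    #[case::trailing_newline("no code here\n")]
    fn plain_lines_produce_text_tokens(#[case] input: &str) {
        let tokens = tokenize_markdown(input);
        assert!(matches!(tokens.first(), Some(Token::Text(_))));
    }

    #[rstest]
    fn splits_inline_code_spans() {
        let tokens = tokenize_markdown("a `b` c");
        // Collect only the code spans emitted for the line.
        let code: Vec<&str> = tokens
            .iter()
            .filter_map(|t| match t {
                Token::Code(c) => Some(*c),
                _ => None,
            })
            .collect();
        assert_eq!(code, vec!["b"]);
    }
}
```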
</issue_to_address>
### Comment 3
<location> `src/wrap/tokenize.rs:122` </location>
<code_context>
-pub use crate::wrap::Token;
-use crate::wrap::is_fence;
-
-fn tokenize_inline<'a, F>(text: &'a str, emit: &mut F)
-where
- F: FnMut(Token<'a>),
</code_context>
<issue_to_address>
Module is missing a `//!` comment at the top.
Please add a `//!` module-level doc comment at the very top of this file to describe its purpose, as required by the review instructions.
</issue_to_address>
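For reference, a minimal sketch of the requested header; the wording is illustrative rather than taken from the repository:

```rust
//! Markdown tokenization utilities.
//!
//! Splits Markdown source into [`Token`]s (text, inline code spans, fence
//! lines, and newlines) for reuse by the wrapping and text-processing
//! pipelines.
```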
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (4)
- `docs/architecture.md` (1 hunks)
- `src/textproc.rs` (2 hunks)
- `src/wrap.rs` (1 hunks)
- `src/wrap/tokenize.rs` (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
docs/**/*.md
📄 CodeRabbit Inference Engine (AGENTS.md)
docs/**/*.md: Reference: Use the markdown files within the `docs/` directory as a knowledge base and source of truth for project requirements, dependency choices, and architectural decisions.
Update: When new decisions are made, requirements change, libraries are added/removed, or architectural patterns evolve, proactively update the relevant file(s) in the `docs/` directory to reflect the latest state. Ensure the documentation remains accurate and current.
Files:
docs/architecture.md
**/*.md
📄 CodeRabbit Inference Engine (AGENTS.md)
**/*.md: Documentation must use en-GB-oxendict spelling and grammar. (EXCEPTION: the naming of the "LICENSE" file, which is to be left unchanged for community consistency.)
Validate Markdown files using `make markdownlint`.
Run `make fmt` after any documentation changes to format all Markdown files and fix table markup.
Validate Mermaid diagrams in Markdown files by running `make nixie`.
Markdown paragraphs and bullet points must be wrapped at 80 columns.
Code blocks must be wrapped at 120 columns.
Tables and headings must not be wrapped.
Use dashes (-) for list bullets.
Use GitHub-flavoured Markdown footnotes ([^1]) for references and footnotes.
Files:
docs/architecture.md
⚙️ CodeRabbit Configuration File
**/*.md:
- Avoid 2nd person or 1st person pronouns ("I", "you", "we")
- Use en-GB-oxendict (-ize / -our) spelling and grammar
- Paragraphs and bullets must be wrapped to 80 columns, except where a long URL would prevent this (in which case, silence MD013 for that line)
- Code blocks should be wrapped to 120 columns.
- Headings must not be wrapped.
- Documents must start with a level 1 heading
- Headings must correctly increase or decrease by no more than one level at a time
- Use GitHub-flavoured Markdown style for footnotes and endnotes.
- Numbered footnotes must be numbered by order of appearance in the document.
Files:
docs/architecture.md
**/*.rs
📄 CodeRabbit Inference Engine (AGENTS.md)
**/*.rs: Clippy warnings MUST be disallowed.
Fix any warnings emitted during tests in the code itself rather than silencing them.
Where a function is too long, extract meaningfully named helper functions adhering to separation of concerns and CQRS.
Where a function has too many parameters, group related parameters in meaningfully named structs.
Where a function is returning a large error consider using `Arc` to reduce the amount of data returned.
Write unit and behavioural tests for new functionality. Run both before and after making any change.
Every module must begin with a module level (//!) comment explaining the module's purpose and utility.
Document public APIs using Rustdoc comments (///) so documentation can be generated with cargo doc.
Prefer immutable data and avoid unnecessary `mut` bindings.
Handle errors with the `Result` type instead of panicking where feasible.
Avoid `unsafe` code unless absolutely necessary and document any usage clearly.
Place function attributes after doc comments.
Do not use `return` in single-line functions.
Use predicate functions for conditional criteria with more than two branches.
Lints must not be silenced except as a last resort.
Lint rule suppressions must be tightly scoped and include a clear reason.
Prefer `expect` over `allow`.
Prefer `.expect()` over `.unwrap()`.
Use `concat!()` to combine long string literals rather than escaping newlines with a backslash.
Prefer semantic error enums: Derive `std::error::Error` (via the `thiserror` crate) for any condition the caller might inspect, retry, or map to an HTTP status.
Use an opaque error only at the app boundary: Use `eyre::Report` for human-readable logs; these should not be exposed in public APIs.
Never export the opaque type from a library: Convert to domain enums at API boundaries, and to `eyre` only in the main `main()` entrypoint or top-level async task.
Files:
`src/wrap.rs`, `src/textproc.rs`, `src/wrap/tokenize.rs`
⚙️ CodeRabbit Configuration File
**/*.rs:
- Seek to keep the cyclomatic complexity of functions no more than 12.
- Adhere to single responsibility and CQRS.
- Place function attributes after doc comments.
- Do not use `return` in single-line functions.
- Move conditionals with >2 branches into a predicate function.
- Avoid `unsafe` unless absolutely necessary.
- Every module must begin with a `//!` doc comment that explains the module's purpose and utility.
- Comments and docs must follow en-GB-oxendict (-ize / -our) spelling and grammar.
- Lints must not be silenced except as a last resort.
- `#[allow]` is forbidden.
- Only narrowly scoped `#[expect(lint, reason = "...")]` is allowed.
- No lint groups, no blanket or file-wide suppression.
- Include `FIXME:` with link if a fix is expected.
- Use `rstest` fixtures for shared setup and to avoid repetition between tests.
- Replace duplicated tests with `#[rstest(...)]` parameterised cases.
- Prefer `mockall` for mocks/stubs.
- Prefer `.expect()` over `.unwrap()`.
- Ensure that any API or behavioural changes are reflected in the documentation in `docs/`.
- Ensure that any completed roadmap steps are recorded in the appropriate roadmap in `docs/`.
- Files must not exceed 400 lines in length.
- Large modules must be decomposed.
- Long match statements or dispatch tables should be decomposed by domain and collocated with targets.
- Large blocks of inline data (e.g., test fixtures, constants or templates) must be moved to external files and inlined at compile-time or loaded at run-time.
Files:
`src/wrap.rs`, `src/textproc.rs`, `src/wrap/tokenize.rs`
🧬 Code Graph Analysis (3)
src/wrap.rs (1)
src/wrap/tokenize.rs (1)
`tokenize_markdown` (150-184)
src/textproc.rs (1)
src/wrap/tokenize.rs (1)
`tokenize_markdown` (150-184)
src/wrap/tokenize.rs (1)
src/wrap.rs (1)
`is_fence` (186-186)
🔇 Additional comments (5)
src/wrap/tokenize.rs (1)
122-146: Function correctly implements inline code span tokenization.

The `tokenize_inline` function properly handles backtick-delimited code spans with matching fence lengths. The logic correctly handles unmatched delimiters by treating them as text.

src/wrap.rs (1)
11-11: LGTM!

The re-export correctly makes `tokenize_markdown` available alongside `Token` for external use.

docs/architecture.md (1)
269-273: Documentation accurately reflects the architectural changes.

The update correctly describes that tokenization is now handled by `wrap::tokenize_markdown` instead of the previous state machine approach in `process_tokens`.

src/textproc.rs (2)
9-9: LGTM!

The import correctly uses the new `tokenize_markdown` function from the wrap module.
50-54: Refactoring correctly delegates tokenization logic.

The simplification properly uses `tokenize_markdown` to handle the entire source string instead of processing line by line. This aligns with the PR objective of centralising tokenization.
Summary
- Add `wrap::tokenize_markdown`
- Refactor `process_tokens` to use that tokenizer

Testing
- `make fmt`
- `make lint`
- `make test`
- `make markdownlint`
- `make nixie` (fails: too many arguments)

https://chatgpt.com/codex/tasks/task_e_688bb7f516308322b96f22d672c6d6b7
Summary by Sourcery
Introduce a reusable Markdown tokenizer via `tokenize_markdown`, refactor the text processing pipeline to leverage this new function, and update architecture notes accordingly.

New Features:

- Add a `tokenize_markdown` function for reusable Markdown tokenization

Enhancements:

- Export `tokenize_markdown` alongside `Token` in the wrap module
- Refactor `textproc.rs` to use `tokenize_markdown` and remove its inline/fence parsing logic
- Rebuild trailing blank lines with `repeat_n`

Documentation:

- Update the architecture notes to describe the centralized tokenizer