Refactor `tokenize_markdown` to iterate lazily #191
Reviewer's Guide

The PR refactors the `tokenize_markdown` function to use a peekable iterator over input lines instead of collecting them into a `Vec`, and updates newline insertion logic to rely on lookahead rather than index comparisons.

Class diagram for the refactored `tokenize_markdown` function:

```mermaid
classDiagram
    class TokenizeMarkdown {
        +tokenize_markdown(source: &str) Vec~Token~
    }
    class Token {
        <<enum>>
        Fence(line: &str)
        Newline
        %% ... other variants
    }
    TokenizeMarkdown --> Token
    class PeekableIterator {
        +next() &str
        +peek() Option~&str~
    }
    TokenizeMarkdown ..> PeekableIterator : uses
    class TokenizeInline {
        +tokenize_inline(line: &str, callback: Fn(Token))
    }
    TokenizeMarkdown ..> TokenizeInline : calls
```
Summary by CodeRabbit
Walkthrough

Introduce line-by-line `Peekable` iteration and a private newline helper in `src/wrap/tokenize.rs` to change when `Newline` tokens are emitted; add a public `process_text` helper in `src/textproc.rs` to split output while preserving specified trailing blank lines; include tests for newline and trailing-blank behaviour.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant Tokenizer
    participant Lines as Peekable<lines()>
    Caller->>Tokenizer: tokenize_markdown(source)
    Tokenizer->>Lines: create peekable iterator
    loop per line
        Lines->>Tokenizer: next() (line)
        alt fence start/end
            Tokenizer->>Tokenizer: push Fence/Code tokens
            Tokenizer->>Lines: peek() to decide newline emission
            Tokenizer->>Tokenizer: push_newline_if_needed()
            Tokenizer->>Tokenizer: continue (skip re-processing)
        else regular or inline
            Tokenizer->>Tokenizer: push Text/Code tokens
            Tokenizer->>Lines: peek()
            Tokenizer->>Tokenizer: push_newline_if_needed()
        end
    end
    Tokenizer-->>Caller: Vec<Token>
```

```mermaid
sequenceDiagram
    participant Renderer
    participant ProcessText
    Renderer->>Renderer: render tokens -> out (String)
    Renderer->>ProcessText: process_text(&out, trailing_blanks)
    ProcessText->>ProcessText: split lines, preserve trailing newline, append blanks as needed
    ProcessText-->>Renderer: Vec<String> (lines)
```
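The `process_text` flow in the second diagram can be sketched as follows. This is a minimal illustration of the behaviour described in the walkthrough, not the crate's actual implementation; `process_text_sketch` is a hypothetical stand-in for `mdtablefix::textproc::process_text`.

```rust
// Hedged sketch of the process_text behaviour: split rendered output into
// lines, keep a trailing empty line when the input ends with '\n', then top
// up to the requested number of trailing blank lines.
fn process_text_sketch(out: &str, trailing_blanks: usize) -> Vec<String> {
    if out.is_empty() {
        return Vec::new();
    }
    // `lines()` drops the final empty slice, so a trailing '\n' must be
    // restored as an explicit empty line.
    let mut result: Vec<String> = out.lines().map(str::to_string).collect();
    if out.ends_with('\n') {
        result.push(String::new());
    }
    // Count existing trailing blanks and top up if fewer than requested.
    let out_blanks = result.iter().rev().take_while(|l| l.is_empty()).count();
    if out_blanks < trailing_blanks {
        result.resize(result.len() + (trailing_blanks - out_blanks), String::new());
    }
    result
}

fn main() {
    assert_eq!(process_text_sketch("a\nb\n", 0), vec!["a", "b", ""]);
    assert_eq!(process_text_sketch("a", 2), vec!["a", "", ""]);
}
```

The sketch mirrors the sequence diagram's three steps (split lines, preserve trailing newline, append blanks as needed) under those assumptions.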
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Hey there - I've reviewed your changes and found some issues that need to be addressed.
- Extract the repeated peek-and-push-newline logic into a small helper function to reduce code duplication.
- Consider using the standard `lines()` iterator (or `split_terminator`) instead of `split('\n')` for more idiomatic handling of line boundaries and trailing newlines.
- Add a targeted test case for inputs ending with a newline to ensure the refactored function preserves the original trailing-newline behaviour.
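The difference targeted by the second suggestion can be seen in a short sketch using only standard-library behaviour, independent of this crate:

```rust
fn main() {
    let src = "foo\r\nbar\n";

    // split('\n') keeps carriage returns and yields a trailing empty slice.
    let split: Vec<&str> = src.split('\n').collect();
    assert_eq!(split, vec!["foo\r", "bar", ""]);

    // lines() trims "\r\n" and drops the final empty slice, so a trailing
    // newline must be tracked separately (e.g. with ends_with('\n')).
    let lines: Vec<&str> = src.lines().collect();
    assert_eq!(lines, vec!["foo", "bar"]);
    assert!(src.ends_with('\n'));
}
```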
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
💡 Knowledge Base configuration:
- Jira integration is disabled
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
`src/wrap/tokenize.rs` (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.rs
📄 CodeRabbit Inference Engine (AGENTS.md)
**/*.rs:
- Clippy warnings MUST be disallowed.
- Fix any warnings emitted during tests in the code itself rather than silencing them.
- Where a function is too long, extract meaningfully named helper functions adhering to separation of concerns and CQRS.
- Where a function has too many parameters, group related parameters in meaningfully named structs.
- Where a function is returning a large error, consider using `Arc` to reduce the amount of data returned.
- Every module must begin with a module-level (`//!`) comment explaining the module's purpose and utility.
- Document public APIs using Rustdoc comments (`///`) so documentation can be generated with `cargo doc`.
- Prefer immutable data and avoid unnecessary `mut` bindings.
- Handle errors with the `Result` type instead of panicking where feasible.
- Avoid `unsafe` code unless absolutely necessary and document any usage clearly.
- Place function attributes after doc comments.
- Do not use `return` in single-line functions.
- Use predicate functions for conditional criteria with more than two branches.
- Lints must not be silenced except as a last resort.
- Lint rule suppressions must be tightly scoped and include a clear reason.
- Prefer `expect` over `allow`.
- Prefer `.expect()` over `.unwrap()`.
- Use `concat!()` to combine long string literals rather than escaping newlines with a backslash.
- Prefer semantic error enums: derive `std::error::Error` (via the `thiserror` crate) for any condition the caller might inspect, retry, or map to an HTTP status.
- Use an opaque error only at the app boundary: use `eyre::Report` for human-readable logs; these should not be exposed in public APIs.
- Never export the opaque type from a library: convert to domain enums at API boundaries, and to `eyre` only in the `main()` entrypoint or top-level async task.
Files:
src/wrap/tokenize.rs
⚙️ CodeRabbit Configuration File
**/*.rs:
- Seek to keep the cyclomatic complexity of functions no more than 12.
- Adhere to single responsibility and CQRS.
- Place function attributes after doc comments.
- Do not use `return` in single-line functions.
- Move conditionals with >2 branches into a predicate function.
- Avoid `unsafe` unless absolutely necessary.
- Every module must begin with a `//!` doc comment that explains the module's purpose and utility.
- Comments and docs must follow en-GB-oxendict (-ize / -our) spelling and grammar.
- Lints must not be silenced except as a last resort:
  - `#[allow]` is forbidden.
  - Only narrowly scoped `#[expect(lint, reason = "...")]` is allowed.
  - No lint groups, no blanket or file-wide suppression.
  - Include `FIXME:` with link if a fix is expected.
- Where code is only used by specific features, it must be conditionally compiled or a conditional expectation for unused code applied.
- Use `rstest` fixtures for shared setup and to avoid repetition between tests.
- Replace duplicated tests with `#[rstest(...)]` parameterised cases.
- Prefer `mockall` for mocks/stubs.
- Prefer `.expect()` over `.unwrap()`.
- Ensure that any API or behavioural changes are reflected in the documentation in `docs/`.
- Ensure that any completed roadmap steps are recorded in the appropriate roadmap in `docs/`.
- Files must not exceed 400 lines in length:
  - Large modules must be decomposed.
  - Long match statements or dispatch tables should be decomposed by domain and collocated with targets.
  - Large blocks of inline data (e.g., test fixtures, constants or templates) must be moved to external files and inlined at compile-time or loaded at run-time.
Files:
src/wrap/tokenize.rs
🧬 Code Graph Analysis (1)
src/wrap/tokenize.rs (1)
src/wrap.rs (1)
`is_fence` (173-173)
🔇 Additional comments (1)
src/wrap/tokenize.rs (1)
287-287: No-op change. Skip reviewing this line; it introduces no functional change.
@coderabbitai Have the following now been resolved?
Tip: for best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed (length of output: 13539)
🏁 Script executed (length of output: 3686)

@leynos Yes, all three issues have been resolved in the current implementation:

✅ 1. Extract repeated peek-and-push-newline logic into a helper function — Resolved. A `push_newline_if_needed` helper eliminates the code duplication and centralises the newline emission logic.

✅ 2. Use the standard `lines()` iterator
Avoid collecting all lines up front to reduce memory usage when processing large Markdown sources.
Force-pushed from ac25c53 to 96ae005 (Compare)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/textproc.rs (1)
110-118: Drop MSRV concern; preserve or optionally simplify `repeat_n` use

The repository pins a nightly toolchain (nightly-2025-06-26) and defines no explicit MSRV in Cargo.toml, so `std::iter::repeat_n` is fully supported. Leave the code as is, or apply this optional refactor in src/textproc.rs around lines 112–114 to reduce iterator plumbing:

```diff
-        if out_blanks < trailing_blanks {
-            result.extend(std::iter::repeat_n(
-                String::new(),
-                trailing_blanks - out_blanks,
-            ));
-        }
+        if out_blanks < trailing_blanks {
+            let missing = trailing_blanks - out_blanks;
+            result.resize(result.len() + missing, String::new());
+        }
```
♻️ Duplicate comments (1)
src/wrap/tokenize.rs (1)
277-284: Adopt `lines().peekable()` with a trailing-newline sentinel — this fixes CRLF handling and silences Clippy

Switching to `source.lines()` and tracking `had_trailing_newline` matches the desired lazy iteration goal, handles `\r\n` correctly, and addresses the `manual_split_lines` lint.
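A minimal sketch of the pattern this comment endorses — lazy `lines().peekable()` iteration with a trailing-newline sentinel. The `Token` enum here is a reduced stand-in for the real one in src/wrap/tokenize.rs, which also handles fences and inline code:

```rust
// Reduced stand-in for the crate's Token enum (illustration only).
#[derive(Debug, PartialEq)]
enum Token<'a> {
    Text(&'a str),
    Newline,
}

fn tokenize_sketch(source: &str) -> Vec<Token<'_>> {
    // lines() strips "\r\n" correctly but discards the trailing newline,
    // so record it up front as a sentinel.
    let had_trailing_newline = source.ends_with('\n');
    let mut lines = source.lines().peekable();
    let mut tokens = Vec::new();
    while let Some(line) = lines.next() {
        tokens.push(Token::Text(line));
        // Emit a newline when another line follows, or when the source
        // ended with a newline and this was the final line.
        if lines.peek().is_some() || had_trailing_newline {
            tokens.push(Token::Newline);
        }
    }
    tokens
}

fn main() {
    assert_eq!(
        tokenize_sketch("foo\r\nbar"),
        vec![Token::Text("foo"), Token::Newline, Token::Text("bar")]
    );
    assert_eq!(tokenize_sketch("foo\n"), vec![Token::Text("foo"), Token::Newline]);
}
```

Note the sketch never collects the lines into an intermediate `Vec`; `peek()` alone decides whether a `Newline` follows each line.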
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
💡 Knowledge Base configuration:
- Jira integration is disabled
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
`src/textproc.rs` (2 hunks), `src/wrap/tokenize.rs` (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1): **/*.rs — as listed in the previous review (CodeRabbit Inference Engine, AGENTS.md)
⚙️ CodeRabbit Configuration File: **/*.rs — as listed in the previous review
Files: `src/textproc.rs`, `src/wrap/tokenize.rs`
🧬 Code Graph Analysis (2)
src/textproc.rs (1)
src/html.rs (1)
`lines` (179-182)
src/wrap/tokenize.rs (1)
src/wrap/fence.rs (1)
`is_fence` (31-38)
🔇 Additional comments (3)
src/wrap/tokenize.rs (1)
285-301: Centralise newline emission — helper usage is clean and correct

Route all paths (fence open/close, in-fence lines, and inline tokenisation) through `push_newline_if_needed`. This removes duplication and keeps trailing-newline semantics consistent.

src/textproc.rs (2)

85-86: Delegate line splitting to `process_text` — simplifies `process_tokens` and keeps concerns separated

This improves readability and makes trailing-blank behaviour explicit in one place.

104-109: Preserve trailing newline by appending an empty line — LGTM

This matches the tokeniser's semantics and avoids dropping the final empty line.
````rust
/// Split processed output into lines while preserving trailing blanks.
///
/// # Examples
///
/// ```rust
/// use mdtablefix::textproc::process_text;
///
/// let lines = process_text("a\nb\n", 0);
/// assert_eq!(lines, vec!["a".to_string(), "b".to_string(), String::new()]);
/// ```
#[must_use]
pub fn process_text(out: &str, trailing_blanks: usize) -> Vec<String> {
````
🧹 Nitpick (assertive)
Define behaviour for `out.is_empty()` with non-zero `trailing_blanks`

Either document the current behaviour (returns an empty `Vec` regardless of `trailing_blanks`), or return exactly `trailing_blanks` empty lines to mirror the `process_tokens` blanks-only branch. Implement the latter for consistency.
```diff
 #[must_use]
 pub fn process_text(out: &str, trailing_blanks: usize) -> Vec<String> {
-    if out.is_empty() {
-        return Vec::new();
-    }
+    if out.is_empty() {
+        return vec![String::new(); trailing_blanks];
+    }
```

Add a test:

```rust
#[test]
fn process_text_empty_out_preserves_requested_blanks() {
    let lines = process_text("", 3);
    assert_eq!(lines, vec![String::new(), String::new(), String::new()]);
}
```
```rust
fn push_newline_if_needed<I>(
    tokens: &mut Vec<Token<'_>>,
    lines: &mut std::iter::Peekable<I>,
    had_trailing_newline: bool,
) where
    I: Iterator,
{
    // Emit a newline token if another line follows or when the
    // original input ended with a trailing newline. The peek avoids
    // prematurely allocating for the final newline when it isn't
    // necessary.
    if lines.peek().is_some() || (had_trailing_newline && lines.peek().is_none()) {
        tokens.push(Token::Newline);
    }
}
```
🧹 Nitpick (assertive)
Remove the double peek to avoid redundant work and borrow churn
Call peek() once and branch on a local boolean. This is cheaper and makes the borrow scope obvious.
```diff
 fn push_newline_if_needed<I>(
     tokens: &mut Vec<Token<'_>>,
     lines: &mut std::iter::Peekable<I>,
     had_trailing_newline: bool,
 ) where
     I: Iterator,
 {
-    // Emit a newline token if another line follows or when the
-    // original input ended with a trailing newline. The peek avoids
-    // prematurely allocating for the final newline when it isn't
-    // necessary.
-    if lines.peek().is_some() || (had_trailing_newline && lines.peek().is_none()) {
-        tokens.push(Token::Newline);
-    }
+    // Emit a newline token if another line follows or when the original input
+    // ended with a trailing newline.
+    let at_end = lines.peek().is_none();
+    if !at_end || (had_trailing_newline && at_end) {
+        tokens.push(Token::Newline);
+    }
 }
```
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
fn push_newline_if_needed<I>(
    tokens: &mut Vec<Token<'_>>,
    lines: &mut std::iter::Peekable<I>,
    had_trailing_newline: bool,
) where
    I: Iterator,
{
    // Emit a newline token if another line follows or when the original input
    // ended with a trailing newline.
    let at_end = lines.peek().is_none();
    if !at_end || (had_trailing_newline && at_end) {
        tokens.push(Token::Newline);
    }
}
```
```rust
#[test]
fn tokenize_marks_trailing_newline() {
    let tokens = tokenize_markdown("foo\n");
    assert_eq!(tokens, vec![Token::Text("foo"), Token::Newline]);
}

#[test]
fn tokenize_handles_crlf() {
    let tokens = tokenize_markdown("foo\r\nbar");
    assert_eq!(
        tokens,
        vec![Token::Text("foo"), Token::Newline, Token::Text("bar")]
    );
}
```
🧹 Nitpick (assertive)
Augment tests with more edge cases for trailing blanks and fences
Good additions for trailing newline and CRLF. Add explicit coverage for:
- Multiple trailing blanks: "a\n\n" should yield Text("a"), Newline, Text(""), Newline
- Final newline inside a fenced block
Example tests to add:
````rust
#[test]
fn tokenize_multiple_trailing_blanks() {
    let tokens = tokenize_markdown("a\n\n");
    assert_eq!(
        tokens,
        vec![Token::Text("a"), Token::Newline, Token::Text(""), Token::Newline]
    );
}

#[test]
fn tokenize_fence_at_eof_without_trailing_newline() {
    let tokens = tokenize_markdown("```\ncode\n```");
    assert_eq!(
        tokens,
        vec![
            Token::Fence("```"),
            Token::Newline,
            Token::Fence("code"),
            Token::Newline,
            Token::Fence("```"),
        ]
    );
}
````
Summary
Refactor `tokenize_markdown` by iterating lazily.

Closes #63
Testing
- `make fmt`
- `make lint`
- `make test`

https://chatgpt.com/codex/tasks/task_e_68a3a4ba57c08322979aa9fe8a3af687
Summary by Sourcery
Refactor tokenize_markdown to iterate lazily over input lines using a Peekable iterator, removing the intermediate Vec and index tracking.
Enhancements: