Refactor segment_inline to use &str parsing (no Vec<char>) #255
Conversation
…iency Refactored the tokenize module to operate directly on string slices (&str) rather than allocating and indexing slices of chars (&[char]). This avoids unnecessary allocation and improves performance by iterating over string slices and handling UTF-8 boundaries correctly. Helper functions were updated accordingly to maintain consistent byte indexing and UTF-8 character handling throughout the tokenization logic. Co-authored-by: terragon-labs[bot] <terragon-labs[bot]@users.noreply.github.com>
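As an illustration of the approach (a minimal sketch, not the PR's actual code; `first_word_chars` and `first_word_str` are hypothetical names):

```rust
// Before (sketch): collect into Vec<char> and index by code point.
fn first_word_chars(text: &str) -> String {
    let chars: Vec<char> = text.chars().collect(); // extra allocation
    let mut i = 0;
    while i < chars.len() && !chars[i].is_whitespace() {
        i += 1;
    }
    chars[..i].iter().collect() // second allocation to rebuild the String
}

// After (sketch): walk byte indices on the &str directly.
fn first_word_str(text: &str) -> &str {
    let end = text
        .char_indices()
        .find(|&(_, c)| c.is_whitespace())
        .map_or(text.len(), |(i, _)| i);
    &text[..end] // byte index from char_indices is always a UTF-8 boundary
}

fn main() {
    assert_eq!(first_word_chars("héllo world"), "héllo");
    assert_eq!(first_word_str("héllo world"), "héllo");
}
```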
Note: Reviews paused. Use the following commands to manage reviews:
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.
Summary by CodeRabbit
Release Notes
Walkthrough
Add tokenizer documentation and refactor the inline tokenizer to operate on UTF‑8 string slices.

Changes

Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Input as Input Text (&str)
participant Segment as segment_inline()
participant Helpers as UTF‑8 Helpers\n(char_at, advance_char)
participant Scanners as Scanners\n(scan_while, collect_range)
participant Tokens as Token Vec
Input->>Segment: provide &str text
activate Segment
loop iterate by byte idx
Segment->>Helpers: query char_at / advance_char
activate Helpers
Helpers-->>Segment: char, next idx
deactivate Helpers
alt pattern match (code/link/punct)
Segment->>Scanners: call scan_while / collect_range / parse_link_or_image
activate Scanners
Scanners-->>Segment: matched slice, new idx
deactivate Scanners
Segment->>Tokens: push token (text slice)
else no special match
Segment->>Tokens: emit text token for single code point
end
end
deactivate Segment
Segment-->>Input: return Vec<String>
```

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches: ✅ 3 checks passed.
⚠️ Review ran into problems: errors were encountered while retrieving linked issues.
Rewrote advance_char to use map_or instead of map followed by unwrap_or_else for more concise and readable code. Co-authored-by: terragon-labs[bot] <terragon-labs[bot]@users.noreply.github.com>
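Illustratively, the simplification looks like this (a sketch; the "before" form is reconstructed from the commit message, and `char_at` here is a safe stand-in for the crate's helper):

```rust
fn char_at(text: &str, idx: usize) -> Option<char> {
    text.get(idx..).and_then(|s| s.chars().next())
}

// Before (reconstructed): two combinators plus a closure.
fn advance_char_old(text: &str, idx: usize) -> usize {
    char_at(text, idx)
        .map(|ch| idx + ch.len_utf8())
        .unwrap_or_else(|| text.len())
}

// After: a single map_or call.
fn advance_char(text: &str, idx: usize) -> usize {
    char_at(text, idx).map_or(text.len(), |ch| idx + ch.len_utf8())
}

fn main() {
    assert_eq!(advance_char("ßx", 0), 2); // ß occupies two bytes
    assert_eq!(advance_char("ßx", 2), 3);
    assert_eq!(advance_char_old("x", 1), advance_char("x", 1)); // both clamp to len()
}
```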
Reviewer's Guide

This PR refactors segment_inline and its core parsing utilities to eliminate Vec<char> allocations, instead leveraging &str slices and byte indices with UTF-8 boundary awareness. Key functions (scan_while, collect_range, parse_link_or_image) and the backslash-escape logic are rewritten to operate on &str and &[u8], while new helpers (char_at, advance_char) ensure safe navigation. All changes preserve existing Markdown tokenization behavior and the public API.

Class diagram for refactored tokenization helpers

```mermaid
classDiagram
class TokenizeHelpers {
+scan_while(text: &str, idx: usize, cond: FnMut(char) -> bool) -> usize
+collect_range(text: &str, start: usize, end: usize) -> String
+char_at(text: &str, idx: usize) -> Option<char>
+advance_char(text: &str, idx: usize) -> usize
+has_odd_backslash_escape_bytes(bytes: &[u8], idx: usize) -> bool
+parse_link_or_image(text: &str, idx: usize) -> (String, usize)
}
class segment_inline {
+segment_inline(text: &str) -> Vec<String>
}
TokenizeHelpers <.. segment_inline : uses
```
Flow diagram for segment_inline tokenization process

```mermaid
flowchart TD
A["Input text (&str)"] --> B["Initialize tokens Vec"]
B --> C["Iterate over text by byte index"]
C --> D{"Current char is whitespace?"}
D -- Yes --> E["scan_while for whitespace"]
E --> F["collect_range and push token"]
D -- No --> G{"Current char is '`'?"}
G -- Yes --> H["Check backslash escape (has_odd_backslash_escape_bytes)"]
H -- Escaped --> I["Push '`' as token"]
H -- Not escaped --> J["scan_while for code fence"]
J --> K["Find closing fence, collect_range and push token"]
G -- No --> L{"Current char is '[' or '!['?"}
L -- Yes --> M["parse_link_or_image"]
M --> N["Push link/image token"]
N --> O["scan_while for trailing punctuation"]
O --> P["collect_range and push punctuation token"]
L -- No --> Q["scan_while for non-whitespace/non-` chars"]
Q --> R["collect_range and push token"]
F & I & K & P & R --> S["Continue iteration"]
S --> C
C -->|End| T["Return tokens Vec"]
```
Hey there - I've reviewed your changes and found some issues that need to be addressed.
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `src/wrap/tokenize.rs:19` </location>
<code_context>
/// assert_eq!(end, 3);
/// ```
-fn scan_while<F>(chars: &[char], mut i: usize, mut cond: F) -> usize
+fn scan_while<F>(text: &str, mut idx: usize, mut cond: F) -> usize
where
F: FnMut(char) -> bool,
</code_context>
<issue_to_address>
**issue (complexity):** Consider replacing manual byte indexing and helper functions with iterator-based loops for character and escape handling.
Consider collapsing the byte‐slicing helpers and manual index juggling into straightforward `for c in text[start..].chars()` loops and iterator‐based escapes. You can then drop `char_at`/`advance_char` entirely. For example:
```rust
/// Advance idx while `cond` holds, returning the new byte index.
fn scan_while(text: &str, start: usize, cond: impl Fn(char) -> bool) -> usize {
let mut idx = start;
for c in text[start..].chars() {
if !cond(c) {
break;
}
idx += c.len_utf8();
}
idx
}
```
And simplify escape detection with a reverse‐take_while:
```rust
const BACKSLASH: u8 = b'\\';
fn has_odd_backslash_escape(bytes: &[u8], idx: usize) -> bool {
bytes[..idx]
.iter()
.rev()
.take_while(|&&b| b == BACKSLASH)
.count() % 2 == 1
}
```
Then in your main loop you can write:
```rust
let mut i = 0;
let bytes = text.as_bytes();
while i < text.len() {
let ch = text[i..].chars().next().unwrap();
if ch.is_whitespace() {
let start = i;
i = scan_while(text, i, char::is_whitespace);
tokens.push(text[start..i].to_string());
}
// …and so on…
}
```
This:
- Removes `char_at`, `advance_char`, and the old `&[char]` APIs
- Eliminates the repeating `text[idx..].chars().next().expect…` boilerplate
- Makes intent clearer and bounds/UTF-8 handling explicit
</issue_to_address>
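A quick sanity check of the suggested helper (hypothetical assertions; the byte index is assumed to point at the delimiter being tested):

```rust
const BACKSLASH: u8 = b'\\';

fn has_odd_backslash_escape(bytes: &[u8], idx: usize) -> bool {
    bytes[..idx]
        .iter()
        .rev()
        .take_while(|&&b| b == BACKSLASH)
        .count()
        % 2
        == 1
}

fn main() {
    assert!(has_odd_backslash_escape(br"\`", 1)); // one backslash: escaped
    assert!(!has_odd_backslash_escape(br"\\`", 2)); // two: the pair escapes itself
    assert!(has_odd_backslash_escape(br"\\\`", 3)); // three: escaped again
    assert!(!has_odd_backslash_escape(b"`", 0)); // nothing before the delimiter
}
```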
### Comment 2
<location> `src/wrap/tokenize.rs:183` </location>
<code_context>
/// ```
pub(super) fn segment_inline(text: &str) -> Vec<String> {
let mut tokens = Vec::new();
- let chars: Vec<char> = text.chars().collect();
</code_context>
<issue_to_address>
**issue (review_instructions):** Add behavioural and unit tests for the refactored segment_inline and related functions.
The refactor from char buffer to string slices in segment_inline and its helpers constitutes a significant change to the feature's implementation. You must provide both behavioural and unit tests to demonstrate correctness and prevent regressions.
<details>
<summary>Review instructions:</summary>
**Path patterns:** `**/*`
**Instructions:**
For any new feature or change to an existing feature, both behavioural *and* unit tests are required.
</details>
</issue_to_address>
### Comment 3
<location> `src/wrap/tokenize.rs:19` </location>
<code_context>
/// assert_eq!(end, 3);
/// ```
-fn scan_while<F>(chars: &[char], mut i: usize, mut cond: F) -> usize
+fn scan_while<F>(text: &str, mut idx: usize, mut cond: F) -> usize
where
F: FnMut(char) -> bool,
</code_context>
<issue_to_address>
**issue (review_instructions):** Add unit tests for scan_while to verify correct handling of UTF-8 boundaries and predicates.
The new scan_while implementation operates on string slices and byte indices, which can introduce subtle bugs with multi-byte characters. Provide unit tests to verify correct behaviour, especially for edge cases.
<details>
<summary>Review instructions:</summary>
**Path patterns:** `**/*`
**Instructions:**
For any new feature or change to an existing feature, both behavioural *and* unit tests are required.
</details>
</issue_to_address>
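Such a test could look like the following sketch (using `rstest` per the repository guidelines; it assumes `scan_while` takes and returns byte indices as in the diff above):

```rust
#[cfg(test)]
mod scan_while_tests {
    use super::scan_while; // assumed path to the helper under test
    use rstest::rstest;

    #[rstest]
    #[case("aaab", 0, 3)] // matches the doctest: stops at the first 'b'
    #[case("ααb", 0, 4)] // two-byte 'α': returns a byte index, not a char count
    #[case("b", 0, 0)] // predicate false immediately
    #[case("aaa", 0, 3)] // runs to the end of the string
    fn scans_to_expected_byte_index(
        #[case] text: &str,
        #[case] start: usize,
        #[case] expected: usize,
    ) {
        assert_eq!(scan_while(text, start, |c| c == 'a' || c == 'α'), expected);
    }
}
```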
### Comment 4
<location> `src/wrap/tokenize.rs:41` </location>
<code_context>
/// ```
-fn collect_range(chars: &[char], start: usize, end: usize) -> String {
- chars[start..end].iter().collect()
+fn collect_range(text: &str, start: usize, end: usize) -> String {
+ text[start..end].to_string()
}
</code_context>
<issue_to_address>
**issue (review_instructions):** Add unit tests for collect_range to ensure correct substring extraction for various UTF-8 inputs.
The collect_range function now works on string slices and byte indices. You must add unit tests to confirm it extracts substrings correctly, especially for multi-byte characters and edge cases.
<details>
<summary>Review instructions:</summary>
**Path patterns:** `**/*`
**Instructions:**
For any new feature or change to an existing feature, both behavioural *and* unit tests are required.
</details>
</issue_to_address>
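A possible shape for these tests (a sketch; indices are byte offsets, and 'é' occupies two bytes):

```rust
#[cfg(test)]
mod collect_range_tests {
    use super::collect_range; // assumed path to the helper under test
    use rstest::rstest;

    #[rstest]
    #[case("héllo", 0, 3, "hé")] // 'h' (1 byte) + 'é' (2 bytes)
    #[case("héllo", 3, 6, "llo")] // suffix after the multi-byte scalar
    #[case("abc", 1, 1, "")] // empty range is valid
    fn extracts_expected_substring(
        #[case] text: &str,
        #[case] start: usize,
        #[case] end: usize,
        #[case] expected: &str,
    ) {
        assert_eq!(collect_range(text, start, end), expected);
    }
}
```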
### Comment 5
<location> `src/wrap/tokenize.rs:107` </location>
<code_context>
- let start = i;
- if chars[i] == '!' {
- i += 1;
+fn parse_link_or_image(text: &str, mut idx: usize) -> (String, usize) {
+ let start = idx;
+ if char_at(text, idx) == Some('!') {
</code_context>
<issue_to_address>
**issue (review_instructions):** Add behavioural and unit tests for parse_link_or_image to verify correct parsing of links and images.
The refactored parse_link_or_image function changes how links and images are parsed from string slices. You must provide tests to demonstrate correct behaviour for valid, invalid, and edge-case inputs.
<details>
<summary>Review instructions:</summary>
**Path patterns:** `**/*`
**Instructions:**
For any new feature or change to an existing feature, both behavioural *and* unit tests are required.
</details>
</issue_to_address>
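A sketch of such a test (the expected return values assume `parse_link_or_image` returns the matched slice and the byte index just past it, as the reviewer's guide describes; actual expectations should be taken from the implementation):

```rust
#[cfg(test)]
mod parse_link_tests {
    use super::parse_link_or_image; // assumed path to the helper under test
    use rstest::rstest;

    #[rstest]
    #[case("[a](b) rest", 0, "[a](b)", 6)] // plain link
    #[case("![α](u) x", 0, "![α](u)", 8)] // image marker plus a two-byte 'α'
    #[case("[a](b(c)d) x", 0, "[a](b(c)d)", 10)] // nested parentheses in the URL
    fn parses_expected_span(
        #[case] text: &str,
        #[case] idx: usize,
        #[case] token: &str,
        #[case] end: usize,
    ) {
        assert_eq!(parse_link_or_image(text, idx), (token.to_string(), end));
    }
}
```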
### Comment 6
<location> `src/wrap/tokenize.rs:47` </location>
<code_context>
- } else {
- break;
- }
+fn char_at(text: &str, idx: usize) -> Option<char> {
+ if idx >= text.len() {
+ None
</code_context>
<issue_to_address>
**issue (review_instructions):** Add unit tests for char_at to verify correct character extraction at byte indices.
The char_at helper is new and should be covered by unit tests to ensure it correctly handles valid and invalid indices, especially with multi-byte characters.
<details>
<summary>Review instructions:</summary>
**Path patterns:** `**/*`
**Instructions:**
For any new feature or change to an existing feature, both behavioural *and* unit tests are required.
</details>
</issue_to_address>
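A sketch of such coverage (assuming the implementation shown above, where any index at or past the end of the string yields `None`):

```rust
#[cfg(test)]
mod char_at_tests {
    use super::char_at; // assumed path to the helper under test
    use rstest::rstest;

    #[rstest]
    #[case("aß", 0, Some('a'))]
    #[case("aß", 1, Some('ß'))] // boundary of the two-byte scalar
    #[case("aß", 3, None)] // past the end (the string is 3 bytes long)
    #[case("", 0, None)] // empty input
    fn returns_expected_char(
        #[case] text: &str,
        #[case] idx: usize,
        #[case] expected: Option<char>,
    ) {
        assert_eq!(char_at(text, idx), expected);
    }
}
```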
### Comment 7
<location> `src/wrap/tokenize.rs:55` </location>
<code_context>
}
-/// Check if a byte at the given index is preceded by an odd number of backslashes.
+fn advance_char(text: &str, idx: usize) -> usize {
+ char_at(text, idx).map_or(text.len(), |ch| idx + ch.len_utf8())
+}
</code_context>
<issue_to_address>
**issue (review_instructions):** Add unit tests for advance_char to confirm correct advancement over UTF-8 characters.
advance_char is a new utility for moving byte indices forward by one character. You must add unit tests to verify its correctness for ASCII and multi-byte characters.
<details>
<summary>Review instructions:</summary>
**Path patterns:** `**/*`
**Instructions:**
For any new feature or change to an existing feature, both behavioural *and* unit tests are required.
</details>
</issue_to_address>
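A parameterised sketch of such a test (case values assume the `map_or` implementation quoted above, which clamps to `text.len()` at end of input):

```rust
#[cfg(test)]
mod advance_char_tests {
    use super::advance_char; // assumed path to the helper under test
    use rstest::rstest;

    #[rstest]
    #[case("abc", 0, 1)] // ASCII advances one byte
    #[case("ßx", 0, 2)] // two-byte scalar
    #[case("日本", 0, 3)] // three-byte CJK scalar
    #[case("🦀!", 0, 4)] // four-byte emoji
    #[case("x", 1, 1)] // at end of string: clamps to len()
    fn advances_by_one_code_point(
        #[case] text: &str,
        #[case] idx: usize,
        #[case] expected: usize,
    ) {
        assert_eq!(advance_char(text, idx), expected);
    }
}
```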
…enizer flow Added a detailed section with a flowchart diagram explaining the control flow of the inline tokenizer. This includes handling of whitespace, code spans, and links, improving the documentation on how tokens are generated from input text. Co-authored-by: terragon-labs[bot] <terragon-labs[bot]@users.noreply.github.com>
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/wrap/tokenize.rs (1)
183-243: Eliminate `.expect()` call and reduce cognitive complexity.

Line 188 uses `.expect()`, which is forbidden outside tests per coding guidelines. Additionally, the function has high cognitive complexity due to the deeply nested backtick handling logic (lines 193-226).

Fix 1: Replace `expect()` with a safe pattern:

```diff
 while i < text.len() {
-    let c = char_at(text, i).expect("valid char boundary");
+    let c = match char_at(text, i) {
+        Some(ch) => ch,
+        None => break,
+    };
```

Fix 2: Extract backtick handling to reduce complexity. Extract lines 193-226 into a separate function:

```rust
fn handle_backtick_fence(
    text: &str,
    bytes: &[u8],
    mut i: usize,
) -> (String, usize) {
    let start = i;
    let fence_end = scan_while(text, i, |ch| ch == '`');
    let fence_len = fence_end - start;
    i = fence_end;

    let mut end = i;
    let mut closing = None;
    while end < text.len() {
        let j = scan_while(text, end, |ch| ch == '`');
        if j - end == fence_len && !has_odd_backslash_escape_bytes(bytes, end) {
            closing = Some(j);
            break;
        }
        end = advance_char(text, end);
    }

    if let Some(end_idx) = closing {
        (collect_range(text, start, end_idx), end_idx)
    } else {
        (collect_range(text, start, fence_end), fence_end)
    }
}
```

Then simplify `segment_inline`:

```diff
 } else if c == '`' {
     if has_odd_backslash_escape_bytes(bytes, i) {
         if let Some(last) = tokens.last_mut() {
             last.push('`');
         } else {
             tokens.push(String::from("`"));
         }
         i += c.len_utf8();
         continue;
     }
-    let start = i;
-    let fence_end = scan_while(text, i, |ch| ch == '`');
-    let fence_len = fence_end - start;
-    i = fence_end;
-
-    let mut end = i;
-    let mut closing = None;
-    while end < text.len() {
-        let j = scan_while(text, end, |ch| ch == '`');
-        if j - end == fence_len && !has_odd_backslash_escape_bytes(bytes, end) {
-            closing = Some(j);
-            break;
-        }
-        end = advance_char(text, end);
-    }
-
-    if let Some(end_idx) = closing {
-        tokens.push(collect_range(text, start, end_idx));
-        i = end_idx;
-    } else {
-        tokens.push(collect_range(text, start, fence_end));
-        i = fence_end;
-    }
+    let (token, new_i) = handle_backtick_fence(text, bytes, i);
+    tokens.push(token);
+    i = new_i;
```

This achieves:
- Removes forbidden `.expect()`
- Reduces cognitive complexity to ≤9
- Improves maintainability
- Adheres to single responsibility principle

As per coding guidelines.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
docs/architecture.md (1 hunks)
src/wrap/tokenize.rs (3 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
docs/**/*.{rs,md}
📄 CodeRabbit inference engine (docs/rust-doctest-dry-guide.md)
docs/**/*.{rs,md}: In doctest code fences, explicitly specify the language as rust (```rust)
Use assertion macros (assert!, assert_eq!, assert_ne!) to verify outputs in doctests
For fallible doctests, write an explicit fn main() -> Result<…> and hide the boilerplate
Alternatively, use the (()) shorthand to enable ? in doctests, written as a single contiguous token
Use hidden lines (# …) to hide setup, use statements, and main wrappers in doctests
Avoid using .unwrap()/.expect() in documentation examples; prefer propagating errors with ?
Prefer no_run for examples with side effects (network, filesystem, GUI) to compile but not execute
Use should_panic for examples that intentionally panic; optionally assert the panic message
Use compile_fail for examples that must not compile, while acknowledging brittleness across Rust versions
Avoid ignore except for pseudo-code or non-Rust examples; prefer more specific attributes or hidden-line fixes
Specify edition with edition20xx in doctests when demonstrating edition-specific syntax
Files:
docs/architecture.md
**/*.md
⚙️ CodeRabbit configuration file
**/*.md: * Avoid 2nd person or 1st person pronouns ("I", "you", "we")
- Use en-GB-oxendict (-ize / -yse / -our) spelling and grammar
- Headings must not be wrapped.
- Documents must start with a level 1 heading
- Headings must correctly increase or decrease by no more than one level at a time
- Use GitHub-flavoured Markdown style for footnotes and endnotes.
- Numbered footnotes must be numbered by order of appearance in the document.
Files:
docs/architecture.md
**/*.rs
⚙️ CodeRabbit configuration file
**/*.rs:
- Seek to keep the cognitive complexity of functions no more than 9.
- Adhere to single responsibility and CQRS.
- Place function attributes after doc comments.
- Do not use `return` in single-line functions.
- Move conditionals with >2 branches into a predicate function.
- Avoid `unsafe` unless absolutely necessary.
- Every module must begin with a `//!` doc comment that explains the module's purpose and utility.
- Comments and docs must follow en-GB-oxendict (-ize / -yse / -our) spelling and grammar.
- Lints must not be silenced except as a last resort. `#[allow]` is forbidden.
- Only narrowly scoped `#[expect(lint, reason = "...")]` is allowed.
- No lint groups, no blanket or file-wide suppression.
- Include `FIXME:` with link if a fix is expected.
- Where code is only used by specific features, it must be conditionally compiled or a conditional expectation for unused_code applied.
- Use `rstest` fixtures for shared setup and to avoid repetition between tests.
- Replace duplicated tests with `#[rstest(...)]` parameterised cases.
- Prefer `mockall` for mocks/stubs.
- Prefer `.expect()` over `.unwrap()` in tests. `.expect()` and `.unwrap()` are forbidden outside of tests. Errors must be propagated.
- Ensure that any API or behavioural changes are reflected in the documentation in `docs/`.
- Ensure that any completed roadmap steps are recorded in the appropriate roadmap in `docs/`.
- Files must not exceed 400 lines in length.
- Large modules must be decomposed.
- Long match statements or dispatch tables should be decomposed by domain and collocated with targets.
- Large blocks of inline data (e.g., test fixtures, constants or templates) must be moved to external files and inlined at compile-time or loaded at run-time.
- Environment access (env::set_var and env::remove_var) are always unsafe in Rust 2024 and MUST be marked as such.
- For testing of functionality depending upon environment variables, dependency injection and...
Files:
src/wrap/tokenize.rs
🧬 Code graph analysis (1)
src/wrap/tokenize.rs (1)
src/wrap/line_buffer.rs (1)
text(20-22)
🪛 LanguageTool
docs/architecture.md
[uncategorized] ~338-~338: Possible missing comma found.
Context: ...kenizer iterates over the source string lazily so no duplicate Vec<char> representat...
(AI_HYDRA_LEO_MISSING_COMMA)
[style] ~339-~339: Would you like to use the Oxford spelling “summarizes”? The spelling ‘summarises’ is also correct.
Context: ...tion is required. The following diagram summarises the control flow, highlighting the help...
(OXFORD_SPELLING_Z_NOT_S)
🔍 Remote MCP Deepwiki
Summary of additional, review-relevant context I found
- A new "Tokenizer flow" section was added to docs/architecture.md; tokenization is a central shared utility used by wrap_text, footnote/ellipsis processing, and other transforms — changes to tokenize.rs therefore affect many modules.
- Important runtime consumers to check while reviewing behavior/regressions:
  - src/wrap/* (wrap_text, tokenization, FenceTracker).
  - textproc/process_tokens, ellipsis, footnotes (token-based transforms).
  - table reflow (reflow_table) and other modules that rely on Unicode-width/byte/char indexing.
  Verify these modules still behave correctly with the &str + byte-index helpers (char_at / advance_char / has_odd_backslash_escape_bytes).
- Existing tests you should run / inspect (cover the prior behavior and important edge cases):
  - Unit tests in src/wrap/tokenize.rs and tests/wrap/tokenize_markdown.rs (inline code spans, escaped backticks, unmatched backticks).
  - wrap tests in src/wrap/tests.rs (determine_token_span, punctuation attachment, paragraph wrapping).
  - tests/fences.rs and tests/cli.rs (fence normalization, --fences behavior, --wrap/--ellipsis interactions).
  - tests/parallel.rs (multi-file/in-place behavior).
  These are where regressions from changing char-slicing → &str slicing most likely surface.
- Key behavior/edge cases to verify manually or add tests for (UTF‑8 / byte-index risks introduced by the refactor):
  - Multibyte characters (emoji, CJK, combining marks) inside inline code, links, and ordinary text (ensure char_at/advance_char compute code-point boundaries correctly); a self-contained illustration follows this list.
  - Escaped-backslash/backtick sequences (odd-number-of-backslashes logic) — ensure has_odd_backslash_escape_bytes preserves prior semantics.
  - parse_link_or_image and collect_range for nested parentheses and UTF‑8 boundaries.
  - That segment_inline public behavior is unchanged (it's used by wrapping logic).
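For reference, a self-contained illustration of the UTF-8 boundary hazard these checks guard against (a standalone example, not project code):

```rust
fn main() {
    let text = "a🦀b"; // '🦀' occupies bytes 1..5

    // Stepping by len_utf8() always lands on a char boundary.
    let mut i = 0;
    while let Some(c) = text[i..].chars().next() {
        println!("{i}: {c}");
        i += c.len_utf8();
    }

    // Whereas a naive byte-index slice mid-scalar panics:
    // let _ = &text[2..]; // byte 2 is inside '🦀' -> runtime panic

    // The checked API reports the bad boundary instead of panicking.
    assert!(text.get(2..).is_none());
}
```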
Files to inspect closely in the review
- src/wrap/tokenize.rs (the refactor itself), plus the new helpers char_at/advance_char and has_odd_backslash_escape_bytes.
- Call sites: src/wrap.rs, src/textproc.rs, src/ellipsis.rs, src/footnotes.rs, src/table.rs (reflow) — check any assumptions about indexing/slicing.
- Test files listed above to confirm coverage.
Recommended reviewer checklist (minimal)
- Run full test suite: cargo test (unit + integration + CLI tests).
- Run specific tokenization/wrap tests: src/wrap/tokenize.rs tests + tests/wrap/*.
- Add or run targeted tests with multibyte and escaped-backslash/backtick inputs if not covered.
- Confirm pipeline ordering (fence normalization → HTML table conversion → core loop → wrapping/ellipsis/footnotes) still yields same external behavior.
Sources / tools used
- Deepwiki read of repository wiki/pages for architecture, modules, and tests.
- Deepwiki page contents (Architecture, Tokenization, Tests, Processing pipeline).
- Deepwiki Q&A about tests and issue context (tests covering segment_inline/tokenize, fence/CLI tests).
⏰ Context from checks skipped due to timeout of 120000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Sourcery review
🔇 Additional comments (5)
docs/architecture.md (1)
343-365: LGTM: Flowchart accurately represents the refactored implementation.
The Mermaid diagram correctly documents the UTF-8 aware, &str-based tokenizer flow, including the new helpers (`scan_while`, `collect_range`, `has_odd_backslash_escape_bytes`, `char_at`, `advance_char`) and the control flow for whitespace, code spans, and links/images.

src/wrap/tokenize.rs (4)

41-43: LGTM: Simple substring extraction.
The function correctly extracts a substring. The implementation assumes `start` and `end` are valid UTF-8 byte boundaries, which is upheld by all callers using `scan_while` and `advance_char`.

47-57: LGTM: UTF-8 safe boundary helpers.
Both `char_at` and `advance_char` correctly handle UTF-8 code point boundaries. `char_at` safely returns `None` at end-of-string, and `advance_char` properly advances by `len_utf8()` bytes for multibyte characters.

63-74: LGTM: Efficient byte-based escape detection.
The function correctly counts consecutive backslashes before the given index and returns whether the count is odd. Operating on bytes is efficient and appropriate for ASCII backslash detection.

405-480: LGTM: Comprehensive test coverage.
The tests thoroughly cover critical edge cases:
- Links with trailing punctuation
- Nested parentheses in URLs
- Inline code fences
- Unmatched backticks
- Escaped backticks (including adjacent to multibyte characters)
- CRLF handling
- Trailing newline preservation

The multibyte character tests (lines 467-479) specifically verify UTF-8 safety of the refactored implementation.
…sing logic

- Replace index-based char access with slicing and char iterators to properly handle multibyte chars.
- Remove unsafe index manipulations and simplify advancing through UTF-8 strings.
- Improve parse_link_or_image to handle nested parentheses correctly (illustrated below).
- Refactor segment_inline to better handle multibyte tokens and special markdown chars.
- Add extensive tests for multibyte char handling and parsing correctness.

Co-authored-by: terragon-labs[bot] <terragon-labs[bot]@users.noreply.github.com>
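The nested-parentheses handling presumably amounts to depth counting over `char_indices`; an illustrative sketch (the `url_end` helper is hypothetical, not the repository's code):

```rust
/// Return the byte index one past the ')' that closes the '(' at `open`,
/// tracking nesting depth; None if the parentheses are unbalanced.
fn url_end(text: &str, open: usize) -> Option<usize> {
    let mut depth: i32 = 0;
    for (off, c) in text[open..].char_indices() {
        match c {
            '(' => depth += 1,
            ')' => {
                depth -= 1;
                if depth == 0 {
                    return Some(open + off + 1);
                }
            }
            _ => {}
        }
    }
    None
}

fn main() {
    let t = "[a](b(c)d) tail";
    assert_eq!(url_end(t, 3), Some(10)); // the '(' at byte 3 closes at byte 9
    assert_eq!(&t[..10], "[a](b(c)d)");
}
```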
…function Refactored the backtick fence parsing logic in segment_inline by moving it into a new helper function `handle_backtick_fence`. This improves code readability and reduces duplication by isolating the backtick fence scanning and matching logic. Co-authored-by: terragon-labs[bot] <terragon-labs[bot]@users.noreply.github.com>
Co-authored-by: terragon-labs[bot] <terragon-labs[bot]@users.noreply.github.com>
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/wrap/tokenize.rs (1)
1-543: Decompose this file to comply with the 400-line limit.
This file contains 543 lines, exceeding the coding guideline maximum of 400 lines by 143 lines. Extract the test module (lines 414-543) to a separate file such as tests/unit/wrap_tokenize.rs, or split the main module by moving private helpers into a submodule.
As per coding guidelines.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
docs/architecture.md (1 hunks)
src/wrap/tokenize.rs (5 hunks)
tests/wrap/tokenize_markdown.rs (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.rs
⚙️ CodeRabbit configuration file
**/*.rs:
- Seek to keep the cognitive complexity of functions no more than 9.
- Adhere to single responsibility and CQRS.
- Place function attributes after doc comments.
- Do not use `return` in single-line functions.
- Move conditionals with >2 branches into a predicate function.
- Avoid `unsafe` unless absolutely necessary.
- Every module must begin with a `//!` doc comment that explains the module's purpose and utility.
- Comments and docs must follow en-GB-oxendict (-ize / -yse / -our) spelling and grammar.
- Lints must not be silenced except as a last resort. `#[allow]` is forbidden.
- Only narrowly scoped `#[expect(lint, reason = "...")]` is allowed.
- No lint groups, no blanket or file-wide suppression.
- Include `FIXME:` with link if a fix is expected.
- Where code is only used by specific features, it must be conditionally compiled or a conditional expectation for unused_code applied.
- Use `rstest` fixtures for shared setup and to avoid repetition between tests.
- Replace duplicated tests with `#[rstest(...)]` parameterised cases.
- Prefer `mockall` for mocks/stubs.
- Prefer `.expect()` over `.unwrap()` in tests. `.expect()` and `.unwrap()` are forbidden outside of tests. Errors must be propagated.
- Ensure that any API or behavioural changes are reflected in the documentation in `docs/`.
- Ensure that any completed roadmap steps are recorded in the appropriate roadmap in `docs/`.
- Files must not exceed 400 lines in length.
- Large modules must be decomposed.
- Long match statements or dispatch tables should be decomposed by domain and collocated with targets.
- Large blocks of inline data (e.g., test fixtures, constants or templates) must be moved to external files and inlined at compile-time or loaded at run-time.
- Environment access (env::set_var and env::remove_var) are always unsafe in Rust 2024 and MUST be marked as such.
- For testing of functionality depending upon environment variables, dependency injection and...
Files:
tests/wrap/tokenize_markdown.rs
src/wrap/tokenize.rs
docs/**/*.{rs,md}
📄 CodeRabbit inference engine (docs/rust-doctest-dry-guide.md)
docs/**/*.{rs,md}: In doctest code fences, explicitly specify the language as rust (```rust)
Use assertion macros (assert!, assert_eq!, assert_ne!) to verify outputs in doctests
For fallible doctests, write an explicit fn main() -> Result<…> and hide the boilerplate
Alternatively, use the (()) shorthand to enable ? in doctests, written as a single contiguous token
Use hidden lines (# …) to hide setup, use statements, and main wrappers in doctests
Avoid using .unwrap()/.expect() in documentation examples; prefer propagating errors with ?
Prefer no_run for examples with side effects (network, filesystem, GUI) to compile but not execute
Use should_panic for examples that intentionally panic; optionally assert the panic message
Use compile_fail for examples that must not compile, while acknowledging brittleness across Rust versions
Avoid ignore except for pseudo-code or non-Rust examples; prefer more specific attributes or hidden-line fixes
Specify edition with edition20xx in doctests when demonstrating edition-specific syntax
Files:
docs/architecture.md
**/*.md
⚙️ CodeRabbit configuration file
**/*.md: * Avoid 2nd person or 1st person pronouns ("I", "you", "we")
- Use en-GB-oxendict (-ize / -yse / -our) spelling and grammar
- Headings must not be wrapped.
- Documents must start with a level 1 heading
- Headings must correctly increase or decrease by no more than one level at a time
- Use GitHub-flavoured Markdown style for footnotes and endnotes.
- Numbered footnotes must be numbered by order of appearance in the document.
Files:
docs/architecture.md
🧬 Code graph analysis (2)
tests/wrap/tokenize_markdown.rs (1)
src/wrap/tokenize.rs (1)
tokenize_markdown(380-412)
src/wrap/tokenize.rs (1)
src/wrap/line_buffer.rs (1)
text(20-22)
🪛 LanguageTool
docs/architecture.md
[style] ~339-~339: Would you like to use the Oxford spelling “summarizes”? The spelling ‘summarises’ is also correct.
Context: ...tion is required. The following diagram summarises the control flow, highlighting the help...
(OXFORD_SPELLING_Z_NOT_S)
🔍 Remote MCP Deepwiki, Ref
Summary — additional review-relevant facts (concise)
- High-level impact: tokenization/wrapping are central shared utilities used by many transforms; changes in src/wrap/tokenize.rs therefore affect wrap_text, footnotes, ellipsis, table reflow, and CLI behaviors.
- Key changed files to review (priority order):
  - src/wrap/tokenize.rs — UTF‑8 rework: &str + byte-index helpers (char_at, advance_char), scan_while, collect_range, parse_link_or_image, has_odd_backslash_escape_bytes, handle_backtick_fence — critical for correctness of multibyte and escape handling.
  - src/wrap.rs / wrap_text call sites — consumers of tokenize/segment_inline; verify wrapping logic uses Unicode widths and fence tracking as before.
  - src/process.rs — pipeline ordering (fences → html → core loop → post transforms); ensure the new &str helpers preserve behavior.
  - src/table.rs / src/reflow.rs — table reflow depends on tokenizer/wrapping; verify no regressions in reflow_table and split_cells.
  - src/footnotes.rs, src/ellipsis.rs, src/lists.rs — token-based transforms and list renumbering that must still skip code spans/fences.
- Main risks to verify:
  - UTF‑8 boundary correctness (char_at / advance_char): ensure no out-of-boundary slicing or panics on multibyte characters. Test with emoji, CJK, combining marks.
  - Backslash-escape and odd-backslash logic moved to a byte-based helper — confirm identical semantics for odd numbers of backslashes before delimiters (code spans, escapes).
  - parse_link_or_image correctness for nested parentheses and UTF‑8 indices.
  - No change to the public API surface: segment_inline remains pub(super) and tokenize_markdown public. Confirm callers compile.
  - Interaction with fence tracking: the new tokenization must preserve fence detection semantics (FenceTracker usage across process/wrap/lists).
- Tests to run / add (minimal actionable list):
  - Run the full test suite: cargo test (unit + integration + CLI) — CI relies on this.
  - Focused tests to run/inspect:
    - tests/wrap/tokenize_markdown.rs (new multibyte_characters_round_trip).
    - src/wrap/tokenize.rs unit tests covering backtick fences, escaped backticks, nested-parentheses links.
    - tests/wrap/* and tests/fences.rs (fence normalization + orphan specifiers).
    - tests/cli.rs in-place and multi-file parallel tests (ensure ordering and in-place newline behavior unchanged).
  - Add targeted failing-case tests if missing:
    - Multibyte inside inline code with surrounding punctuation and links (emoji + backticks).
    - Nested parentheses in link URLs with multibyte chars.
    - Escaped-backslash sequences at byte boundaries (odd/even counts) immediately before backticks or brackets.
- Local manual checks to run (fast):
  - Compile and run a small round-trip: tokenize_markdown("ßß\nλfin") and assert token raw/code preserved (matches the new test).
  - Fuzz a few UTF‑8 strings through segment_inline and wrap_text; check for panics.
  - Run the mdtablefix CLI on representative Markdown files containing tables, HTML tables, fences, and footnotes, and confirm outputs match the previous commit (if available).
- Notable behavioral guarantees to confirm:
  - Public behavior of segment_inline (tokenization) and tokenize_markdown preserved (no API change).
  - In-place write behavior: trailing newline semantics unchanged.
  - Parallel CLI processing (rayon) unaffected — ordering preserved.
Tools/sources consulted
- Repository architecture, modules, tests, CI and release docs: Deepwiki read of repository pages and contents.
- (Attempted additional doc search; some requests returned transient errors — nevertheless useful context above came from Deepwiki.)
⏰ Context from checks skipped due to timeout of 120000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Sourcery review
🔇 Additional comments (9)
src/wrap/tokenize.rs (7)

19-31: LGTM! The iterator-based approach correctly handles UTF-8 boundaries and eliminates the forbidden `.expect()` call. The logic is clear and concise.

41-43: LGTM! The implementation correctly extracts a substring slice. All call sites in this file use UTF-8 boundary-safe indices from `scan_while` or `chars().next()`, so the slice operation is sound.

51-62: LGTM! The byte-based backslash escape detection is correct and efficient. Since backslash (`\`) is a single-byte ASCII character (0x5C), scanning bytes backwards is safe and avoids UTF-8 decoding overhead.

95-138: LGTM! The link/image parser correctly handles nested parentheses in URLs and uses safe UTF-8 boundary-aware indexing throughout. The Option pattern matching eliminates the forbidden `.expect()` calls whilst maintaining clarity.

159-179: LGTM! The backtick fence scanner correctly matches opening and closing fences of equal length whilst respecting backslash escapes. The fallback to returning only the opening fence when no closing match is found preserves the original tokenisation behaviour.

203-252: LGTM! The refactored tokeniser correctly operates on `&str` with UTF-8 boundary-safe indexing. The logic flow is clear, and the use of helper functions keeps complexity manageable. Whitespace, code spans, links, and escape handling all appear correct.

418-542: LGTM! The expanded test coverage thoroughly validates UTF-8 boundary handling across the refactored helpers. The tests cover multibyte characters (including Greek, Cyrillic, and CJK), nested parentheses, and edge cases such as malformed input and escaped backticks adjacent to multibyte characters.
docs/architecture.md (1)

343-365: LGTM! The Mermaid flowchart accurately illustrates the tokenisation control flow, correctly referencing the helper functions and decision points in the implementation.
tests/wrap/tokenize_markdown.rs (1)

80-96: LGTM! The test comprehensively validates that multibyte UTF-8 characters (ß and λ) are correctly tokenised both inside and outside inline code spans, ensuring the refactored tokeniser preserves character boundaries and content.
@coderabbitai pause

✅ Actions performed
Reviews paused.
@coderabbitai please create a new GitHub issue for the following: |
@leynos: I've created GitHub issue #257 for refactoring
…in documentation Co-authored-by: terragon-labs[bot] <terragon-labs[bot]@users.noreply.github.com>
Summary
Changes
Core parsing
Segment tokenization
Documentation and safety
Tests and compatibility
Rationale
Compatibility
Testing
How to test manually
Documentation updates
🌿 Generated by Terry
ℹ️ Tag @terragon-labs to ask questions and address PR feedback
📎 Task: https://www.terragonlabs.com/task/a5e3eaee-d5c6-485d-b5ec-4f9631f54434