
Refactor segment_inline to use &str parsing (no Vec<char>) #255

Merged
leynos merged 7 commits into main from terragon/optimize-tokenize-markdown-x4e2p6 on Nov 12, 2025

Conversation

@leynos
Owner

@leynos leynos commented Nov 9, 2025

Summary

  • Refactor segment_inline to avoid allocating a Vec<char> by operating directly on &str and UTF-8 boundaries. This reduces allocations and improves safety and performance while preserving existing behavior for Markdown tokenization (whitespace, code spans, and links). Docs and tests were also updated to reflect &str-based parsing.

Changes

Core parsing

  • Replace Vec<char>-based scanning with string-slice utilities:
    • scan_while now operates on &str and byte indices, using UTF-8 boundary-aware traversal.
    • collect_range updated to extract substrings from &str.
    • parse_link_or_image now accepts &str and uses boundary-aware helpers to parse links/images without materialising chars.
    • Introduced helpers for UTF-8 safe navigation: char_at(text, idx) and advance_char(text, idx).
    • Backslash-escape checks now operate on bytes via has_odd_backslash_escape_bytes.
  • All internal logic that previously indexed into a Vec<char> now uses string indexing and boundary-aware char access (see the sketch below).
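
A minimal sketch of the two boundary helpers named above, based on the signatures quoted later in this thread; the merged implementation may differ in detail:

```rust
/// Return the character beginning at byte index `idx`, or `None` at or past the end.
fn char_at(text: &str, idx: usize) -> Option<char> {
    text.get(idx..).and_then(|rest| rest.chars().next())
}

/// Advance one character from `idx`, clamping to `text.len()` at the end of the string.
fn advance_char(text: &str, idx: usize) -> usize {
    char_at(text, idx).map_or(text.len(), |ch| idx + ch.len_utf8())
}
```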

Segment tokenization

  • segment_inline implemented to work directly with &str and bytes, using the new helpers:
    • Uses char_at/advance_char for boundary checks.
    • Replaced has_odd_backslash_escape(&chars, i) with has_odd_backslash_escape_bytes(bytes, i) (sketched after this list).
    • Updated fence scanning and trailing punctuation detection to operate on text slices.
    • Updated link/image detection to rely on text and boundary-aware checks.
  • Updated in-code documentation examples to reflect &str-based usage.
  • Added handle_backtick_fence to support UTF-8 safe fence scanning and proper closing detection.
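
A sketch of the byte-level escape check referred to above, consistent with the shape Sourcery suggests further down the thread; illustrative rather than the merged code:

```rust
/// True when the delimiter at byte index `idx` is preceded by an odd number of
/// backslashes, i.e. the delimiter is escaped.
fn has_odd_backslash_escape_bytes(bytes: &[u8], idx: usize) -> bool {
    bytes[..idx]
        .iter()
        .rev()
        .take_while(|&&b| b == b'\\')
        .count()
        % 2
        == 1
}
```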

Documentation and safety

  • Doc comments updated to reflect avoidance of materialising the source as a Vec<char>.
  • Updated architecture/docs to describe the new tokenizer flow and boundary-aware, &str-based parsing.
  • Maintains identical behavior for whitespace handling, code spans, and link parsing.

Tests and compatibility

  • Tests updated/added to cover multibyte characters and nested parentheses within links, and to ensure boundary-aware behavior.
  • Public API remains functionally the same (segment_inline(text: &str) -> Vec<String>); internal helper signatures changed to &str, but no external API changes.
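
An illustrative test for the nested-parenthesis case mentioned above, assuming segment_inline is visible to the test module; the expected token is an assumption about the tokenizer's output, not an assertion copied from the merged suite:

```rust
#[test]
fn keeps_nested_parentheses_inside_a_link() {
    let tokens = segment_inline("see [spec](https://example.com/path_(v2)).");
    // The whole link, including the nested parentheses, should stay in one token.
    assert!(tokens.contains(&"[spec](https://example.com/path_(v2))".to_string()));
}
```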

Rationale

  • Removing the Vec<char> allocation reduces memory usage and avoids extra copying. Operating directly on &str with UTF-8 boundary awareness ensures correctness for non-ASCII input and improves performance.

Compatibility

  • Public API remains functionally the same (segment_inline(text: &str) -> Vec<String>); internal helper signatures changed to &str, but no external API changes.

Testing

  • cargo test passes locally
  • Tokenization of whitespace, inline code spans (backticks), and links remains correct
  • Backslash escaping behavior remains correct across inputs
  • Added tests for multibyte characters and nested parentheses in links
  • Multibyte characters round-trip tests added to verify correct tokenization

How to test manually

  • Run the test suite: cargo test
  • Exercise sample markdowns including:
    • Inline code spans with and without escaping
    • Links and images with nested parentheses
    • Mixed whitespace and punctuation around tokens
    • Multibyte characters to ensure proper token boundaries
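
A hedged sketch of how those samples could be exercised from a unit test; the sample strings and the deliberately weak assertion are illustrative only:

```rust
#[test]
fn exercises_representative_inline_samples() {
    let samples = [
        r"escaped \` backtick versus `real code`",
        "[link](https://example.com/a_(nested)) trailing,",
        "mixed   spaces and `köde` with an emoji 🙂",
    ];
    for sample in samples {
        // The tokenizer should neither panic on multibyte input nor emit empty tokens.
        let tokens = segment_inline(sample);
        assert!(tokens.iter().all(|token| !token.is_empty()));
    }
}
```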

Documentation updates

  • docs/architecture.md updated to reflect tokenizer flow with &str-based streaming and UTF-8 boundary-aware traversal.
  • Doc comments in code updated to reflect avoidance of Vec<char> materialisation.

🌿 Generated by Terry


ℹ️ Tag @terragon-labs to ask questions and address PR feedback

📎 Task: https://www.terragonlabs.com/task/a5e3eaee-d5c6-485d-b5ec-4f9631f54434

…iency

Refactored the tokenize module to operate directly on string slices (&str) rather than allocating and indexing slices of chars (&[char]). This avoids unnecessary allocation and improves performance by iterating over string slices and handling UTF-8 boundaries correctly. Helper functions were updated accordingly to maintain consistent byte indexing and UTF-8 character handling throughout the tokenization logic.

Co-authored-by: terragon-labs[bot] <terragon-labs[bot]@users.noreply.github.com>
Contributor

@sourcery-ai sourcery-ai Bot left a comment


Sorry @leynos, you have reached your weekly rate limit of 2500000 diff characters.

Please try again later or upgrade to continue using Sourcery

@coderabbitai
Contributor

coderabbitai Bot commented Nov 9, 2025

Note

Reviews paused

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Summary by CodeRabbit

Release Notes

  • Documentation

    • Enhanced architecture documentation with detailed tokeniser flow diagrams and descriptions for improved system understanding
  • Improvements

    • Improved support for multibyte and international character handling in tokenisation
    • Expanded test coverage to verify proper edge case processing and character validation
  • Deprecations

    • Deprecated html_table_to_markdown; use convert_html_tables as the preferred replacement

Walkthrough

The changes add tokenizer documentation and refactor the inline tokenizer to operate on UTF‑8 &str with new helpers and updated scanner/parsing signatures, plus tests for multibyte handling. No behavioural changes are claimed beyond UTF‑8 safety and expanded test coverage.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation: docs/architecture.md | Add "Tokenizer flow" section with Mermaid diagram and narrative on lazy iteration, whitespace handling, code fences, links/images, punctuation; mark html_table_to_markdown deprecated in favour of convert_html_tables. |
| Tokeniser refactor: src/wrap/tokenize.rs | Replace &[char] APIs with &str-based APIs; change scan_while, collect_range, parse_link_or_image, segment_inline, and related paths to use UTF‑8 aware byte/char helpers; add char_at(), advance_char(), has_odd_backslash_escape_bytes(), and handle_backtick_fence(); update visibility for segment_inline. |
| Tests: tests/wrap/tokenize_markdown.rs | Add multibyte_characters_round_trip test asserting correct tokenisation of multibyte chars inside inline code spans. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Input as Input Text (&str)
    participant Segment as segment_inline()
    participant Helpers as UTF‑8 Helpers\n(char_at, advance_char)
    participant Scanners as Scanners\n(scan_while, collect_range)
    participant Tokens as Token Vec

    Input->>Segment: provide &str text
    activate Segment
    loop iterate by byte idx
        Segment->>Helpers: query char_at / advance_char
        activate Helpers
        Helpers-->>Segment: char, next idx
        deactivate Helpers
        alt pattern match (code/link/punct)
            Segment->>Scanners: call scan_while / collect_range / parse_link_or_image
            activate Scanners
            Scanners-->>Segment: matched slice, new idx
            deactivate Scanners
            Segment->>Tokens: push token (text slice)
        else no special match
            Segment->>Tokens: emit text token for single code point
        end
    end
    deactivate Segment
    Segment-->>Input: return Vec<Tokens>
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Inspect UTF‑8 boundary correctness in char_at() and advance_char() for off‑by‑one or slicing across code‑point boundaries.
  • Verify all idx updates and slice ranges use byte indices consistently and never split multibyte sequences.
  • Review has_odd_backslash_escape_bytes() semantics to ensure escape logic matches previous behaviour.
  • Check parse_link_or_image() nesting/parenthesis handling and test coverage for edge cases.
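
A possible unit test for the boundary checks listed above; the byte offsets assume the helper semantics described in this PR:

```rust
#[test]
fn boundary_helpers_never_split_multibyte_characters() {
    let text = "aßλ"; // 1-byte 'a', then two 2-byte characters: 5 bytes in total
    assert_eq!(char_at(text, 0), Some('a'));
    assert_eq!(advance_char(text, 0), 1);
    assert_eq!(char_at(text, 1), Some('ß'));
    assert_eq!(advance_char(text, 1), 3);
    assert_eq!(char_at(text, 3), Some('λ'));
    assert_eq!(advance_char(text, 3), 5);
    // Past the end: no character, and the index clamps to text.len().
    assert_eq!(char_at(text, 5), None);
    assert_eq!(advance_char(text, 5), 5);
}
```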

Possibly related issues

Poem

From char arrays to UTF‑8 strings aligned,
Scanners step safe, no bytes left behind.
Fences and links parse true and neat,
Tokens march on with a steady beat. ✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |
| Title check | ✅ Passed | The title accurately describes the main refactoring effort: replacing Vec<char>-based parsing with &str parsing in segment_inline. |
| Description check | ✅ Passed | The description comprehensively relates to the changeset, detailing the refactoring from Vec<char> to &str-based parsing, UTF-8 safety improvements, and updated tests. |

Warning

Review ran into problems

🔥 Problems

Errors were encountered while retrieving linked issues.

Errors (1)
  • UTF-8: Entity not found: Issue - Could not find referenced Issue.

Comment @coderabbitai help to get the list of available commands and usage tips.

Rewrote advance_char to use map_or instead of map followed by unwrap_or_else for more concise and readable code.

Co-authored-by: terragon-labs[bot] <terragon-labs[bot]@users.noreply.github.com>
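
The shape of that change, for reference (illustrative; char_at is the helper introduced earlier in this PR):

```rust
fn advance_char(text: &str, idx: usize) -> usize {
    // Before: char_at(text, idx).map(|ch| idx + ch.len_utf8()).unwrap_or_else(|| text.len())
    // After: a single map_or supplies the default and the mapping together.
    char_at(text, idx).map_or(text.len(), |ch| idx + ch.len_utf8())
}
```
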
@leynos leynos marked this pull request as ready for review November 11, 2025 01:39
@sourcery-ai
Contributor

sourcery-ai Bot commented Nov 11, 2025

Reviewer's Guide

This PR refactors segment_inline and its core parsing utilities to eliminate Vec allocations, instead leveraging &str slices and byte indices with UTF-8 boundary awareness. Key functions (scan_while, collect_range, parse_link_or_image) and backslash-escape logic are rewritten to operate on &str and &[u8], while new helpers (char_at, advance_char) ensure safe navigation. All changes preserve existing Markdown tokenization behavior and the public API.

Class diagram for refactored tokenization helpers

```mermaid
classDiagram
    class TokenizeHelpers {
        +scan_while(text: &str, idx: usize, cond: FnMut(char) -> bool) -> usize
        +collect_range(text: &str, start: usize, end: usize) -> String
        +char_at(text: &str, idx: usize) -> Option<char>
        +advance_char(text: &str, idx: usize) -> usize
        +has_odd_backslash_escape_bytes(bytes: &[u8], idx: usize) -> bool
        +parse_link_or_image(text: &str, idx: usize) -> (String, usize)
    }
    class segment_inline {
        +segment_inline(text: &str) -> Vec<String>
    }
    TokenizeHelpers <.. segment_inline : uses
```

Flow diagram for segment_inline tokenization process

```mermaid
flowchart TD
    A["Input text (&str)"] --> B["Initialize tokens Vec"]
    B --> C["Iterate over text by byte index"]
    C --> D{"Current char is whitespace?"}
    D -- Yes --> E["scan_while for whitespace"]
    E --> F["collect_range and push token"]
    D -- No --> G{"Current char is '`'?"}
    G -- Yes --> H["Check backslash escape (has_odd_backslash_escape_bytes)"]
    H -- Escaped --> I["Push '`' as token"]
    H -- Not escaped --> J["scan_while for code fence"]
    J --> K["Find closing fence, collect_range and push token"]
    G -- No --> L{"Current char is '[' or '!['?"}
    L -- Yes --> M["parse_link_or_image"]
    M --> N["Push link/image token"]
    N --> O["scan_while for trailing punctuation"]
    O --> P["collect_range and push punctuation token"]
    L -- No --> Q["scan_while for non-whitespace/non-` chars"]
    Q --> R["collect_range and push token"]
    F & I & K & P & R --> S["Continue iteration"]
    S --> C
    C -->|End| T["Return tokens Vec"]
```

File-Level Changes

  • Refactor parsing utilities to operate on &str slices (src/wrap/tokenize.rs)
    • Updated scan_while to use &str, byte indices, and UTF-8 traversal
    • Replaced collect_range to extract substrings directly from &str
    • Adapted parse_link_or_image signature and internal logic for &str parsing
  • Introduce UTF-8 safe navigation helpers (src/wrap/tokenize.rs)
    • Added char_at(text, idx) to retrieve the next char at a byte index
    • Added advance_char(text, idx) to advance to the next valid char boundary
  • Replace Vec<char>-based backslash escape checks (src/wrap/tokenize.rs)
    • Removed has_odd_backslash_escape on Vec<char>
    • Introduced has_odd_backslash_escape_bytes on &[u8] and updated code span logic
  • Refactor segment_inline to use &str and byte-level parsing (src/wrap/tokenize.rs)
    • Eliminated the chars Vec<char> in favor of text.as_bytes() for escape checks
    • Rewrote whitespace, backtick fence, link/image, and trailing punctuation scanning to use &str helpers
    • Updated loops to use char_at and advance_char for boundary-safe indexing
  • Update documentation and examples for &str-based parsing (src/wrap/tokenize.rs)
    • Rewrote doc comments to remove Vec<char> references
    • Updated code examples for scan_while, collect_range, and parse_link_or_image to use &str

Possibly linked issues

  • #None: The PR refactors segment_inline to avoid Vec<char> allocation, addressing the memory and performance issues in markdown tokenization described in the issue.
  • Memory allocation concern in tokenize_markdown function #63: The PR refactors segment_inline to avoid Vec<char> and use &str parsing, directly addressing the memory allocation concern.

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Contributor

@sourcery-ai sourcery-ai Bot left a comment


Hey there - I've reviewed your changes and found some issues that need to be addressed.

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `src/wrap/tokenize.rs:19` </location>
<code_context>
 /// assert_eq!(end, 3);
 /// ```
-fn scan_while<F>(chars: &[char], mut i: usize, mut cond: F) -> usize
+fn scan_while<F>(text: &str, mut idx: usize, mut cond: F) -> usize
 where
     F: FnMut(char) -> bool,
</code_context>

<issue_to_address>
**issue (complexity):** Consider replacing manual byte indexing and helper functions with iterator-based loops for character and escape handling.

Consider collapsing the byte‐slicing helpers and manual index juggling into straightforward `for c in text[start..].chars()` loops and iterator‐based escapes. You can then drop `char_at`/`advance_char` entirely. For example:

```rust
/// Advance idx while `cond` holds, returning the new byte index.
fn scan_while(text: &str, start: usize, cond: impl Fn(char) -> bool) -> usize {
    let mut idx = start;
    for c in text[start..].chars() {
        if !cond(c) {
            break;
        }
        idx += c.len_utf8();
    }
    idx
}
```

And simplify escape detection with a reverse‐take_while:

```rust
const BACKSLASH: u8 = b'\\';

fn has_odd_backslash_escape(bytes: &[u8], idx: usize) -> bool {
    bytes[..idx]
        .iter()
        .rev()
        .take_while(|&&b| b == BACKSLASH)
        .count() % 2 == 1
}
```

Then in your main loop you can write:

```rust
let mut i = 0;
let bytes = text.as_bytes();
while i < text.len() {
    let ch = text[i..].chars().next().unwrap();
    if ch.is_whitespace() {
        let start = i;
        i = scan_while(text, i, char::is_whitespace);
        tokens.push(text[start..i].to_string());
    }
    // …and so on…
}
```

This:

- Removes `char_at`, `advance_char`, and the old `&[char]` APIs
- Eliminates the repeating `text[idx..].chars().next().expect…` boilerplate
- Makes intent clearer and bounds/UTF-8 handling explicit
</issue_to_address>

### Comment 2
<location> `src/wrap/tokenize.rs:183` </location>
<code_context>
 /// ```
 pub(super) fn segment_inline(text: &str) -> Vec<String> {
     let mut tokens = Vec::new();
-    let chars: Vec<char> = text.chars().collect();
</code_context>

<issue_to_address>
**issue (review_instructions):** Add behavioural and unit tests for the refactored segment_inline and related functions.

The refactor from char buffer to string slices in segment_inline and its helpers constitutes a significant change to the feature's implementation. You must provide both behavioural and unit tests to demonstrate correctness and prevent regressions.

<details>
<summary>Review instructions:</summary>

**Path patterns:** `**/*`

**Instructions:**
For any new feature or change to an existing feature, both behavioural *and* unit tests are required.

</details>
</issue_to_address>

### Comment 3
<location> `src/wrap/tokenize.rs:19` </location>
<code_context>
 /// assert_eq!(end, 3);
 /// ```
-fn scan_while<F>(chars: &[char], mut i: usize, mut cond: F) -> usize
+fn scan_while<F>(text: &str, mut idx: usize, mut cond: F) -> usize
 where
     F: FnMut(char) -> bool,
</code_context>

<issue_to_address>
**issue (review_instructions):** Add unit tests for scan_while to verify correct handling of UTF-8 boundaries and predicates.

The new scan_while implementation operates on string slices and byte indices, which can introduce subtle bugs with multi-byte characters. Provide unit tests to verify correct behaviour, especially for edge cases.

<details>
<summary>Review instructions:</summary>

**Path patterns:** `**/*`

**Instructions:**
For any new feature or change to an existing feature, both behavioural *and* unit tests are required.

</details>
</issue_to_address>

### Comment 4
<location> `src/wrap/tokenize.rs:41` </location>
<code_context>
 /// ```
-fn collect_range(chars: &[char], start: usize, end: usize) -> String {
-    chars[start..end].iter().collect()
+fn collect_range(text: &str, start: usize, end: usize) -> String {
+    text[start..end].to_string()
 }
</code_context>

<issue_to_address>
**issue (review_instructions):** Add unit tests for collect_range to ensure correct substring extraction for various UTF-8 inputs.

The collect_range function now works on string slices and byte indices. You must add unit tests to confirm it extracts substrings correctly, especially for multi-byte characters and edge cases.

<details>
<summary>Review instructions:</summary>

**Path patterns:** `**/*`

**Instructions:**
For any new feature or change to an existing feature, both behavioural *and* unit tests are required.

</details>
</issue_to_address>

### Comment 5
<location> `src/wrap/tokenize.rs:107` </location>
<code_context>
-    let start = i;
-    if chars[i] == '!' {
-        i += 1;
+fn parse_link_or_image(text: &str, mut idx: usize) -> (String, usize) {
+    let start = idx;
+    if char_at(text, idx) == Some('!') {
</code_context>

<issue_to_address>
**issue (review_instructions):** Add behavioural and unit tests for parse_link_or_image to verify correct parsing of links and images.

The refactored parse_link_or_image function changes how links and images are parsed from string slices. You must provide tests to demonstrate correct behaviour for valid, invalid, and edge-case inputs.

<details>
<summary>Review instructions:</summary>

**Path patterns:** `**/*`

**Instructions:**
For any new feature or change to an existing feature, both behavioural *and* unit tests are required.

</details>
</issue_to_address>

### Comment 6
<location> `src/wrap/tokenize.rs:47` </location>
<code_context>
-        } else {
-            break;
-        }
+fn char_at(text: &str, idx: usize) -> Option<char> {
+    if idx >= text.len() {
+        None
</code_context>

<issue_to_address>
**issue (review_instructions):** Add unit tests for char_at to verify correct character extraction at byte indices.

The char_at helper is new and should be covered by unit tests to ensure it correctly handles valid and invalid indices, especially with multi-byte characters.

<details>
<summary>Review instructions:</summary>

**Path patterns:** `**/*`

**Instructions:**
For any new feature or change to an existing feature, both behavioural *and* unit tests are required.

</details>
</issue_to_address>

### Comment 7
<location> `src/wrap/tokenize.rs:55` </location>
<code_context>
 }

-/// Check if a byte at the given index is preceded by an odd number of backslashes.
+fn advance_char(text: &str, idx: usize) -> usize {
+    char_at(text, idx).map_or(text.len(), |ch| idx + ch.len_utf8())
+}
</code_context>

<issue_to_address>
**issue (review_instructions):** Add unit tests for advance_char to confirm correct advancement over UTF-8 characters.

advance_char is a new utility for moving byte indices forward by one character. You must add unit tests to verify its correctness for ASCII and multi-byte characters.

<details>
<summary>Review instructions:</summary>

**Path patterns:** `**/*`

**Instructions:**
For any new feature or change to an existing feature, both behavioural *and* unit tests are required.

</details>
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

…enizer flow

Added a detailed section with a flowchart diagram explaining the control flow of the inline tokenizer. This includes handling of whitespace, code spans, and links, improving the documentation on how tokens are generated from input text.

Co-authored-by: terragon-labs[bot] <terragon-labs[bot]@users.noreply.github.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/wrap/tokenize.rs (1)

183-243: Eliminate .expect() call and reduce cognitive complexity.

Line 188 uses .expect(), which is forbidden outside tests per coding guidelines. Additionally, the function has high cognitive complexity due to the deeply nested backtick handling logic (lines 193-226).

Fix 1: Replace expect() with safe unwrap_or pattern:

```diff
     while i < text.len() {
-        let c = char_at(text, i).expect("valid char boundary");
+        let c = match char_at(text, i) {
+            Some(ch) => ch,
+            None => break,
+        };
```

Fix 2: Extract backtick handling to reduce complexity:

Extract lines 193-226 into a separate function:

```rust
fn handle_backtick_fence(
    text: &str,
    bytes: &[u8],
    mut i: usize,
) -> (String, usize) {
    let start = i;
    let fence_end = scan_while(text, i, |ch| ch == '`');
    let fence_len = fence_end - start;
    i = fence_end;

    let mut end = i;
    let mut closing = None;
    while end < text.len() {
        let j = scan_while(text, end, |ch| ch == '`');
        if j - end == fence_len && !has_odd_backslash_escape_bytes(bytes, end) {
            closing = Some(j);
            break;
        }
        end = advance_char(text, end);
    }

    if let Some(end_idx) = closing {
        (collect_range(text, start, end_idx), end_idx)
    } else {
        (collect_range(text, start, fence_end), fence_end)
    }
}
```

Then simplify segment_inline:

```diff
         } else if c == '`' {
             if has_odd_backslash_escape_bytes(bytes, i) {
                 if let Some(last) = tokens.last_mut() {
                     last.push('`');
                 } else {
                     tokens.push(String::from("`"));
                 }
                 i += c.len_utf8();
                 continue;
             }

-            let start = i;
-            let fence_end = scan_while(text, i, |ch| ch == '`');
-            let fence_len = fence_end - start;
-            i = fence_end;
-
-            let mut end = i;
-            let mut closing = None;
-            while end < text.len() {
-                let j = scan_while(text, end, |ch| ch == '`');
-                if j - end == fence_len && !has_odd_backslash_escape_bytes(bytes, end) {
-                    closing = Some(j);
-                    break;
-                }
-                end = advance_char(text, end);
-            }
-
-            if let Some(end_idx) = closing {
-                tokens.push(collect_range(text, start, end_idx));
-                i = end_idx;
-            } else {
-                tokens.push(collect_range(text, start, fence_end));
-                i = fence_end;
-            }
+            let (token, new_i) = handle_backtick_fence(text, bytes, i);
+            tokens.push(token);
+            i = new_i;
```

This achieves:

  • Removes forbidden .expect()
  • Reduces cognitive complexity to ≤9
  • Improves maintainability
  • Adheres to single responsibility principle

As per coding guidelines.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2d87e55 and 80fafee.

📒 Files selected for processing (2)
  • docs/architecture.md (1 hunks)
  • src/wrap/tokenize.rs (3 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
docs/**/*.{rs,md}

📄 CodeRabbit inference engine (docs/rust-doctest-dry-guide.md)

docs/**/*.{rs,md}: In doctest code fences, explicitly specify the language as rust (```rust)
Use assertion macros (assert!, assert_eq!, assert_ne!) to verify outputs in doctests
For fallible doctests, write an explicit fn main() -> Result<…> and hide the boilerplate
Alternatively, use the (()) shorthand to enable ? in doctests, written as a single contiguous token
Use hidden lines (# …) to hide setup, use statements, and main wrappers in doctests
Avoid using .unwrap()/.expect() in documentation examples; prefer propagating errors with ?
Prefer no_run for examples with side effects (network, filesystem, GUI) to compile but not execute
Use should_panic for examples that intentionally panic; optionally assert the panic message
Use compile_fail for examples that must not compile, while acknowledging brittleness across Rust versions
Avoid ignore except for pseudo-code or non-Rust examples; prefer more specific attributes or hidden-line fixes
Specify edition with edition20xx in doctests when demonstrating edition-specific syntax

Files:

  • docs/architecture.md
**/*.md

⚙️ CodeRabbit configuration file

**/*.md: * Avoid 2nd person or 1st person pronouns ("I", "you", "we")

  • Use en-GB-oxendict (-ize / -yse / -our) spelling and grammar
  • Headings must not be wrapped.
  • Documents must start with a level 1 heading
  • Headings must correctly increase or decrease by no more than one level at a time
  • Use GitHub-flavoured Markdown style for footnotes and endnotes.
  • Numbered footnotes must be numbered by order of appearance in the document.

Files:

  • docs/architecture.md
**/*.rs

⚙️ CodeRabbit configuration file

**/*.rs: * Seek to keep the cognitive complexity of functions no more than 9.

  • Adhere to single responsibility and CQRS
  • Place function attributes after doc comments.
  • Do not use return in single-line functions.
  • Move conditionals with >2 branches into a predicate function.
  • Avoid unsafe unless absolutely necessary.
  • Every module must begin with a //! doc comment that explains the module's purpose and utility.
  • Comments and docs must follow en-GB-oxendict (-ize / -yse / -our) spelling and grammar
  • Lints must not be silenced except as a last resort.
    • #[allow] is forbidden.
    • Only narrowly scoped #[expect(lint, reason = "...")] is allowed.
    • No lint groups, no blanket or file-wide suppression.
    • Include FIXME: with link if a fix is expected.
  • Where code is only used by specific features, it must be conditionally compiled or a conditional expectation for unused_code applied.
  • Use rstest fixtures for shared setup and to avoid repetition between tests.
  • Replace duplicated tests with #[rstest(...)] parameterised cases.
  • Prefer mockall for mocks/stubs.
  • Prefer .expect() over .unwrap() in tests.
  • .expect() and .unwrap() are forbidden outside of tests. Errors must be propagated.
  • Ensure that any API or behavioural changes are reflected in the documentation in docs/
  • Ensure that any completed roadmap steps are recorded in the appropriate roadmap in docs/
  • Files must not exceed 400 lines in length
    • Large modules must be decomposed
    • Long match statements or dispatch tables should be decomposed by domain and collocated with targets
    • Large blocks of inline data (e.g., test fixtures, constants or templates) must be moved to external files and inlined at compile-time or loaded at run-time.
  • Environment access (env::set_var and env::remove_var) are always unsafe in Rust 2024 and MUST be marked as such
    • For testing of functionality depending upon environment variables, dependency injection and...

Files:

  • src/wrap/tokenize.rs
🧬 Code graph analysis (1)
src/wrap/tokenize.rs (1)
src/wrap/line_buffer.rs (1)
  • text (20-22)
🪛 LanguageTool
docs/architecture.md

[uncategorized] ~338-~338: Possible missing comma found.
Context: ...kenizer iterates over the source string lazily so no duplicate Vec<char> representat...

(AI_HYDRA_LEO_MISSING_COMMA)


[style] ~339-~339: Would you like to use the Oxford spelling “summarizes”? The spelling ‘summarises’ is also correct.
Context: ...tion is required. The following diagram summarises the control flow, highlighting the help...

(OXFORD_SPELLING_Z_NOT_S)

🔍 Remote MCP Deepwiki

Summary of additional, review-relevant context I found

  • A new "Tokenizer flow" section was added to docs/architecture.md; tokenization is a central shared utility used by wrap_text, footnote/ellipsis processing, and other transforms — changes to tokenize.rs therefore affect many modules.

  • Important runtime consumers to check while reviewing behavior/regressions:

    • src/wrap/* (wrap_text, tokenization, FenceTracker).
    • textproc/process_tokens, ellipsis, footnotes (token-based transforms).
    • table reflow (reflow_table) and other modules that rely on Unicode-width/byte/char indexing.
      Verify these modules still behave correctly with the &str + byte-index helpers (char_at / advance_char / has_odd_backslash_escape_bytes).
  • Existing tests you should run / inspect (cover the prior behavior and important edge cases):

    • Unit tests in src/wrap/tokenize.rs and tests/wrap/tokenize_markdown.rs (inline code spans, escaped backticks, unmatched backticks).
    • wrap tests in src/wrap/tests.rs (determine_token_span, punctuation attachment, paragraph wrapping).
    • tests/fences.rs and tests/cli.rs (fence normalization, --fences behavior, --wrap/--ellipsis interactions).
    • tests/parallel.rs (multi-file/in-place behavior).
      These are where regressions from changing char-slicing → &str slicing most likely surface.
  • Key behavior/edge cases to verify manually or add tests for (UTF‑8 / byte-index risks introduced by the refactor):

    • Multibyte characters (emoji, CJK, combining marks) inside inline code, links, and ordinary text (ensure char_at/advance_char compute code-point boundaries correctly).
    • Escaped-backslash/backtick sequences (odd-number-of-backslashes logic) — ensure has_odd_backslash_escape_bytes preserves prior semantics.
    • parse_link_or_image and collect_range for nested parentheses and UTF‑8 boundaries.
    • That segment_inline public behavior is unchanged (it’s used by wrapping logic).
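
One way to pin the escape-parity semantics mentioned above in a test (illustrative expectations, assuming the bytes-based helper described in this PR):

```rust
#[test]
fn escape_parity_matches_previous_semantics() {
    // One backslash escapes the backtick; two backslashes cancel out.
    let escaped = r"\`code";
    let not_escaped = r"\\`code";
    assert!(has_odd_backslash_escape_bytes(escaped.as_bytes(), 1));
    assert!(!has_odd_backslash_escape_bytes(not_escaped.as_bytes(), 2));
}
```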

Files to inspect closely in the review

  • src/wrap/tokenize.rs (the refactor itself), plus the new helpers char_at/advance_char and has_odd_backslash_escape_bytes.
  • Call sites: src/wrap.rs, src/textproc.rs, src/ellipsis.rs, src/footnotes.rs, src/table.rs (reflow) — check any assumptions about indexing/slicing.
  • Test files listed above to confirm coverage.

Recommended reviewer checklist (minimal)

  • Run full test suite: cargo test (unit + integration + CLI tests).
  • Run specific tokenization/wrap tests: src/wrap/tokenize.rs tests + tests/wrap/*.
  • Add or run targeted tests with multibyte and escaped-backslash/backtick inputs if not covered.
  • Confirm pipeline ordering (fence normalization → HTML table conversion → core loop → wrapping/ellipsis/footnotes) still yields same external behavior.

Sources / tools used

  • Deepwiki read of repository wiki/pages for architecture, modules, and tests.
  • Deepwiki page contents (Architecture, Tokenization, Tests, Processing pipeline).
  • Deepwiki Q&A about tests and issue context (tests covering segment_inline/tokenize, fence/CLI tests).
⏰ Context from checks skipped due to timeout of 120000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Sourcery review
🔇 Additional comments (5)
docs/architecture.md (1)

343-365: LGTM: Flowchart accurately represents the refactored implementation.

The Mermaid diagram correctly documents the UTF-8 aware, &str-based tokenizer flow, including the new helpers (scan_while, collect_range, has_odd_backslash_escape_bytes, char_at, advance_char) and the control flow for whitespace, code spans, and links/images.

src/wrap/tokenize.rs (4)

41-43: LGTM: Simple substring extraction.

The function correctly extracts a substring. The implementation assumes start and end are valid UTF-8 byte boundaries, which is upheld by all callers using scan_while and advance_char.


47-57: LGTM: UTF-8 safe boundary helpers.

Both char_at and advance_char correctly handle UTF-8 code point boundaries. char_at safely returns None at end-of-string, and advance_char properly advances by len_utf8() bytes for multibyte characters.


63-74: LGTM: Efficient byte-based escape detection.

The function correctly counts consecutive backslashes before the given index and returns whether the count is odd. Operating on bytes is efficient and appropriate for ASCII backslash detection.


405-480: LGTM: Comprehensive test coverage.

The tests thoroughly cover critical edge cases:

  • Links with trailing punctuation
  • Nested parentheses in URLs
  • Inline code fences
  • Unmatched backticks
  • Escaped backticks (including adjacent to multibyte characters)
  • CRLF handling
  • Trailing newline preservation

The multibyte character tests (lines 467-479) specifically verify UTF-8 safety of the refactored implementation.

leynos and others added 3 commits November 11, 2025 02:10
…sing logic

- Replace index-based char access with slicing and char iterators to properly handle multibyte chars.
- Remove unsafe index manipulations and simplify advancing through UTF-8 strings.
- Improve parse_link_or_image to handle nested parentheses correctly.
- Refactor segment_inline to better handle multibyte tokens and special markdown chars.
- Add extensive tests for multibyte char handling and parsing correctness.

Co-authored-by: terragon-labs[bot] <terragon-labs[bot]@users.noreply.github.com>
…function

Refactored the backtick fence parsing logic in segment_inline by moving it
into a new helper function `handle_backtick_fence`. This improves code
readability and reduces duplication by isolating the backtick fence
scanning and matching logic.

Co-authored-by: terragon-labs[bot] <terragon-labs[bot]@users.noreply.github.com>
Co-authored-by: terragon-labs[bot] <terragon-labs[bot]@users.noreply.github.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/wrap/tokenize.rs (1)

1-543: Decompose this file to comply with the 400-line limit.

This file contains 543 lines, exceeding the coding guideline maximum of 400 lines by 143 lines. Extract the test module (lines 414-543) to a separate file such as tests/unit/wrap_tokenize.rs or split the main module by moving private helpers into a submodule.

As per coding guidelines.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 80fafee and fa392b4.

📒 Files selected for processing (3)
  • docs/architecture.md (1 hunks)
  • src/wrap/tokenize.rs (5 hunks)
  • tests/wrap/tokenize_markdown.rs (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.rs

⚙️ CodeRabbit configuration file

**/*.rs: * Seek to keep the cognitive complexity of functions no more than 9.

  • Adhere to single responsibility and CQRS
  • Place function attributes after doc comments.
  • Do not use return in single-line functions.
  • Move conditionals with >2 branches into a predicate function.
  • Avoid unsafe unless absolutely necessary.
  • Every module must begin with a //! doc comment that explains the module's purpose and utility.
  • Comments and docs must follow en-GB-oxendict (-ize / -yse / -our) spelling and grammar
  • Lints must not be silenced except as a last resort.
    • #[allow] is forbidden.
    • Only narrowly scoped #[expect(lint, reason = "...")] is allowed.
    • No lint groups, no blanket or file-wide suppression.
    • Include FIXME: with link if a fix is expected.
  • Where code is only used by specific features, it must be conditionally compiled or a conditional expectation for unused_code applied.
  • Use rstest fixtures for shared setup and to avoid repetition between tests.
  • Replace duplicated tests with #[rstest(...)] parameterised cases.
  • Prefer mockall for mocks/stubs.
  • Prefer .expect() over .unwrap() in tests.
  • .expect() and .unwrap() are forbidden outside of tests. Errors must be propagated.
  • Ensure that any API or behavioural changes are reflected in the documentation in docs/
  • Ensure that any completed roadmap steps are recorded in the appropriate roadmap in docs/
  • Files must not exceed 400 lines in length
    • Large modules must be decomposed
    • Long match statements or dispatch tables should be decomposed by domain and collocated with targets
    • Large blocks of inline data (e.g., test fixtures, constants or templates) must be moved to external files and inlined at compile-time or loaded at run-time.
  • Environment access (env::set_var and env::remove_var) are always unsafe in Rust 2024 and MUST be marked as such
    • For testing of functionality depending upon environment variables, dependency injection and...

Files:

  • tests/wrap/tokenize_markdown.rs
  • src/wrap/tokenize.rs
docs/**/*.{rs,md}

📄 CodeRabbit inference engine (docs/rust-doctest-dry-guide.md)

docs/**/*.{rs,md}: In doctest code fences, explicitly specify the language as rust (```rust)
Use assertion macros (assert!, assert_eq!, assert_ne!) to verify outputs in doctests
For fallible doctests, write an explicit fn main() -> Result<…> and hide the boilerplate
Alternatively, use the (()) shorthand to enable ? in doctests, written as a single contiguous token
Use hidden lines (# …) to hide setup, use statements, and main wrappers in doctests
Avoid using .unwrap()/.expect() in documentation examples; prefer propagating errors with ?
Prefer no_run for examples with side effects (network, filesystem, GUI) to compile but not execute
Use should_panic for examples that intentionally panic; optionally assert the panic message
Use compile_fail for examples that must not compile, while acknowledging brittleness across Rust versions
Avoid ignore except for pseudo-code or non-Rust examples; prefer more specific attributes or hidden-line fixes
Specify edition with edition20xx in doctests when demonstrating edition-specific syntax

Files:

  • docs/architecture.md
**/*.md

⚙️ CodeRabbit configuration file

**/*.md: * Avoid 2nd person or 1st person pronouns ("I", "you", "we")

  • Use en-GB-oxendict (-ize / -yse / -our) spelling and grammar
  • Headings must not be wrapped.
  • Documents must start with a level 1 heading
  • Headings must correctly increase or decrease by no more than one level at a time
  • Use GitHub-flavoured Markdown style for footnotes and endnotes.
  • Numbered footnotes must be numbered by order of appearance in the document.

Files:

  • docs/architecture.md
🧬 Code graph analysis (2)
tests/wrap/tokenize_markdown.rs (1)
src/wrap/tokenize.rs (1)
  • tokenize_markdown (380-412)
src/wrap/tokenize.rs (1)
src/wrap/line_buffer.rs (1)
  • text (20-22)
🪛 LanguageTool
docs/architecture.md

[style] ~339-~339: Would you like to use the Oxford spelling “summarizes”? The spelling ‘summarises’ is also correct.
Context: ...tion is required. The following diagram summarises the control flow, highlighting the help...

(OXFORD_SPELLING_Z_NOT_S)

🔍 Remote MCP Deepwiki, Ref

Summary — additional review-relevant facts (concise)

  • High-level impact: tokenization/wrapping are central shared utilities used by many transforms; changes in src/wrap/tokenize.rs affect wrap_text, footnotes, ellipsis, table reflow, and CLI behaviors.

  • Key changed files to review (priority order)

    1. src/wrap/tokenize.rs — UTF‑8 rework: &str + byte-index helpers (char_at, advance_char), scan_while, collect_range, parse_link_or_image, has_odd_backslash_escape_bytes, handle_backtick_fence — critical for correctness on multibyte and escape handling.
    2. src/wrap.rs / wrap_text call sites — consumers of tokenize/segment_inline; verify wrapping logic uses Unicode widths and fence tracking as before.
    3. src/process.rs — pipeline ordering (fences → html → core loop → post transforms); ensure new &str helpers preserve behavior.
    4. src/table.rs / src/reflow.rs — table reflow depends on tokenizer/wrapping; verify no regressions in reflow_table and split_cells.
    5. src/footnotes.rs, src/ellipsis.rs, src/lists.rs — token-based transforms and list renumbering that must still skip code spans/fences.
  • Main risks to verify

    • UTF‑8 boundary correctness (char_at / advance_char): ensure no off‑bytes slicing or panics on multibyte characters. Test with emoji, CJK, combining marks.
    • Backslash-escape and odd-backslash logic moved to byte-based helper — confirm identical semantics for odd-number-of-backslashes before delimiters (code spans, escapes).
    • parse_link_or_image correctness for nested parentheses and UTF‑8 indices.
    • No change to public API surface: segment_inline remains pub(super) and tokenize_markdown public. Confirm callers compile.
    • Interaction with fence tracking: new tokenization must preserve fence detection semantics (FenceTracker usage across process/wrap/lists).
  • Tests to run / add (minimal actionable list)

    • Run full test suite: cargo test (unit + integration + CLI) — CI relies on this.
    • Focused tests to run/inspect:
      • tests/wrap/tokenize_markdown.rs (new multibyte_characters_round_trip).
      • src/wrap/tokenize.rs unit tests covering backtick fences, escaped backticks, nested parentheses links.
      • tests/wrap/* and tests/fences.rs (fence normalization + orphan specifiers).
      • tests/cli.rs in-place and multi-file parallel tests (ensure ordering and in-place newline behavior unchanged).
    • Add targeted failing-case tests if missing:
      • Multibyte inside inline code with surrounding punctuation and links (emoji + backticks).
      • Nested parentheses in link URLs with multibyte chars.
      • Escaped-backslash sequences at byte boundaries (odd/even counts) immediately before backticks or brackets.
  • Local manual checks to run (fast)

    • Compile and run small round‑trip: tokenize_markdown("ßß λ fin") and assert token raw/code preserved (matches new test).
    • Fuzz a few UTF‑8 strings through segment_inline and wrap_text; check for panics.
    • Run mdtablefix CLI on representative Markdown files containing tables, HTML tables, fences, footnotes, and confirm outputs match previous commit (if available).
  • Notable behavioral guarantees to confirm

    • Public behavior of segment_inline(tokenization) and tokenize_markdown preserved (no API change).
    • In-place write behavior: trailing newline semantics unchanged.
    • Parallel CLI processing (rayon) unaffected — ordering preserved.

Tools/sources consulted

  • Repository architecture, modules, tests, CI and release docs: Deepwiki read of repository pages and contents.
  • (Attempted additional doc search; some requests returned transient errors — nevertheless useful context above came from Deepwiki.)
⏰ Context from checks skipped due to timeout of 120000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Sourcery review
🔇 Additional comments (9)
src/wrap/tokenize.rs (7)

19-31: LGTM!

The iterator-based approach correctly handles UTF-8 boundaries and eliminates the forbidden .expect() call. The logic is clear and concise.


41-43: LGTM!

The implementation correctly extracts a substring slice. All call sites in this file use UTF-8 boundary-safe indices from scan_while or chars().next(), so the slice operation is sound.


51-62: LGTM!

The byte-based backslash escape detection is correct and efficient. Since backslash (\) is a single-byte ASCII character (0x5C), scanning bytes backwards is safe and avoids UTF-8 decoding overhead.


95-138: LGTM!

The link/image parser correctly handles nested parentheses in URLs and uses safe UTF-8 boundary-aware indexing throughout. The Option pattern matching eliminates the forbidden .expect() calls whilst maintaining clarity.


159-179: LGTM!

The backtick fence scanner correctly matches opening and closing fences of equal length whilst respecting backslash escapes. The fallback to returning only the opening fence when no closing match is found preserves the original tokenisation behaviour.


203-252: LGTM!

The refactored tokeniser correctly operates on &str with UTF-8 boundary-safe indexing. The logic flow is clear, and the use of helper functions keeps complexity manageable. Whitespace, code spans, links, and escape handling all appear correct.


418-542: LGTM!

The expanded test coverage thoroughly validates UTF-8 boundary handling across the refactored helpers. The tests cover multibyte characters (including Greek, Cyrillic, and CJK), nested parentheses, and edge cases such as malformed input and escaped backticks adjacent to multibyte characters.

docs/architecture.md (1)

343-365: LGTM!

The Mermaid flowchart accurately illustrates the tokenisation control flow, correctly referencing the helper functions and decision points in the implementation.

tests/wrap/tokenize_markdown.rs (1)

80-96: LGTM!

The test comprehensively validates that multibyte UTF-8 characters (ß and λ) are correctly tokenised both inside and outside inline code spans, ensuring the refactored tokeniser preserves character boundaries and content.

@leynos
Owner Author

leynos commented Nov 12, 2025

@coderabbitai pause

@coderabbitai
Contributor

coderabbitai Bot commented Nov 12, 2025

✅ Actions performed

Reviews paused.

@leynos
Owner Author

leynos commented Nov 12, 2025

@coderabbitai please create a new GitHub issue for the following:

⚠️ Outside diff range comments (1)

src/wrap/tokenize.rs (1)

    1-543: Decompose this file to comply with the 400-line limit.

    This file contains 543 lines, exceeding the coding guideline maximum of 400 lines by 143 lines. Extract the test module (lines 414-543) to a separate file such as tests/unit/wrap_tokenize.rs or split the main module by moving private helpers into a submodule.

    As per coding guidelines.

@coderabbitai
Contributor

coderabbitai Bot commented Nov 12, 2025

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Create GitHub issue for file length violation

gh issue create \
  --title "Refactor src/wrap/tokenize.rs to comply with 400-line limit" \
  --body "## Problem

The file \`src/wrap/tokenize.rs\` currently contains 543 lines, exceeding the coding guideline maximum of 400 lines by 143 lines.

## Suggested Solutions

1. Extract the test module (lines 414-543) to a separate file such as \`tests/unit/wrap_tokenize.rs\`
2. Split the main module by moving private helpers into a submodule

## Context

This issue was identified during review of PR #255.

**Related PR:** https://github.com/leynos/mdtablefix/pull/255

**Requested by:** @leynos" \
  --assignee leynos
```

Length of output: 47


@leynos: I've created GitHub issue #257 for refactoring src/wrap/tokenize.rs to comply with the 400-line limit. The issue has been assigned to you and includes the problem description, suggested solutions, and a reference back to this PR.

@leynos leynos linked an issue Nov 12, 2025 that may be closed by this pull request
…in documentation

Co-authored-by: terragon-labs[bot] <terragon-labs[bot]@users.noreply.github.com>
@leynos leynos changed the title from "Refactor segment_inline to avoid Vec<char> and use &str parsing" to "Refactor segment_inline to use &str parsing (no Vec<char>)" on Nov 12, 2025
@leynos leynos merged commit fae4ff8 into main Nov 12, 2025
3 checks passed
@leynos leynos deleted the terragon/optimize-tokenize-markdown-x4e2p6 branch November 12, 2025 00:51


Development

Successfully merging this pull request may close these issues.

Avoid collecting chars into Vec<char> in tokenize_markdown function

1 participant