
Add streaming token processing #150

Merged
leynos merged 3 commits into main from h9ndnv-codex/create-process_tokens-function-and-refactor
Jul 31, 2025

Conversation

Owner

@leynos leynos commented Jul 30, 2025

Summary

  • rewrite process_tokens to stream lines and preserve fence context
  • adjust documentation for the textproc and wrap modules
  • remove the unused tokenize_markdown helper
  • update architecture diagram
  • add a test for blank-only input

Testing

  • make fmt
  • make lint
  • make test
  • make markdownlint
  • make nixie (fails: too many arguments)

https://chatgpt.com/codex/tasks/task_e_6888f6ff67208322b8d20d3c1e042e70

Summary by Sourcery

Provide a shared streaming token-processing helper in a new textproc module, migrate existing Markdown transforms (ellipsis and footnotes) to use it, remove the old tokenizer, and update documentation accordingly.

New Features:

  • Introduce textproc::process_tokens API for streaming Markdown token transformations.

Enhancements:

  • Refactor ellipsis and footnote conversion to use process_tokens instead of tokenize_markdown.
  • Remove the unused tokenize_markdown function and adjust wrap module documentation.

Documentation:

  • Update architecture diagram and module docs to reference the new textproc module and streaming API.

Tests:

  • Add tests for process_tokens covering blank-only input, fence handling, and identity transformations.

Contributor

sourcery-ai Bot commented Jul 30, 2025

Reviewer's Guide

This PR refactors the Markdown tokenization workflow by introducing a streaming API in a new textproc module, replacing the legacy tokenize_markdown helper with process_tokens across modules, and updating documentation and architecture diagrams accordingly.
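
Based on the signature shown in the class diagram below and the token variants described in the review comments, a minimal sketch of how a caller inside the crate might drive the streaming API. The `Fence` and `Newline` payloads and the uppercase transform are illustrative assumptions, not code taken from the PR.

```rust
use crate::textproc::{Token, process_tokens};

// Illustrative only: upper-case prose while leaving code spans, fences and
// line structure untouched. Variant payloads are assumed from the review
// descriptions rather than copied from the source.
fn shout(lines: &[String]) -> Vec<String> {
    process_tokens(lines, |token, buf| match token {
        Token::Text(t) => buf.push_str(&t.to_uppercase()),
        Token::Code(c) => {
            buf.push('`');
            buf.push_str(c);
            buf.push('`');
        }
        Token::Fence(f) => buf.push_str(f),
        Token::Newline => buf.push('\n'),
    })
}
```

Under these assumptions, a line such as "Some `code` here" comes back as "SOME `code` HERE", with the code span left untouched.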

Class diagram for new textproc module and process_tokens API

```mermaid
classDiagram
    class textproc {
        <<module>>
        +process_tokens(lines: &[String], f: FnMut(Token, &mut String)) Vec<String>
    }
    class wrap {
        <<module>>
        Token
        is_fence()
    }
    textproc ..> wrap : uses Token, is_fence
```

Class diagram for updated ellipsis and footnotes modules using process_tokens

```mermaid
classDiagram
    class ellipsis {
        <<module>>
        +replace_ellipsis(lines: &[String]) Vec<String>
    }
    class footnotes {
        <<module>>
        +convert_footnotes(lines: &[String]) Vec<String>
    }
    class textproc {
        <<module>>
        +process_tokens()
    }
    ellipsis ..> textproc : uses process_tokens
    footnotes ..> textproc : uses process_tokens
```

Updated class diagram for architecture (excerpt)

```mermaid
classDiagram
    class ellipsis {
        <<module>>
        +replace_ellipsis()
    }
    class footnotes {
        <<module>>
        +convert_footnotes()
    }
    class textproc {
        <<module>>
        +process_tokens()
    }
    class process {
        <<module>>
        +process_stream()
    }
    class io {
        <<module>>
        +process_stream()
    }
    ellipsis ..> textproc : uses process_tokens
    footnotes ..> textproc : uses process_tokens
    process ..> ellipsis : uses replace_ellipsis
    process ..> footnotes : uses convert_footnotes
    io ..> process : uses process_stream
```

File-Level Changes

| Change | Details | Files |
| ------ | ------- | ----- |
| Introduce streaming token API via `textproc::process_tokens` | Add new `src/textproc.rs` with the `process_tokens` implementation. Handle fence context, inline code spans, and preserve trailing blanks. Add comprehensive unit tests including blank-only and fence handling. | `src/textproc.rs` |
| Refactor ellipsis and footnotes modules to use `process_tokens` | Replace manual join/tokenize/split logic with `process_tokens`. Update match arms to feed and reconstruct tokens via closure. Remove empty-input early returns and inline tokenizer usage. | `src/ellipsis.rs`, `src/footnotes.rs` |
| Remove legacy `tokenize_markdown` and update wrap module | Remove the `tokenize_markdown` helper. Update the `Token` doc comment to reference token-processing helpers. | `src/wrap.rs` |
| Update module exports and architecture documentation | Add `mod textproc` to `lib.rs`. Update `docs/architecture.md` to include `textproc` and adjust references. Remove old `tokenize_markdown` links in the diagram. | `src/lib.rs`, `docs/architecture.md` |

Possibly linked issues

  • #0: The PR removes the tokenize_markdown function and replaces it with a streaming process_tokens function, directly addressing the memory allocation concern.
  • Update CLI docs and concurrency section #123: The PR removes the problematic tokenize_markdown function and introduces a new process_tokens function in src/textproc.rs that correctly handles token streaming and preserves whitespace, directly addressing the issue.


Contributor

coderabbitai Bot commented Jul 30, 2025

Summary by CodeRabbit

  • New Features
    • Introduced a new module for shared token-based text processing, enabling more consistent handling of Markdown transformations.
  • Refactor
    • Updated existing text normalisation and footnote conversion to use the new token processing module for improved maintainability.
  • Documentation
    • Refreshed architecture documentation to reflect the updated module relationships and processing flow.
  • Style
    • Enhanced module visibility and documentation for improved clarity.

Walkthrough

Refactor the token-processing logic in the ellipsis and footnotes modules to use a new shared textproc module, which provides the process_tokens function. Remove the old tokenize_markdown function from wrap, update documentation to reflect module relationships, and introduce the textproc module with tests.

Changes

| Cohort / File(s) | Change Summary |
| ---------------- | -------------- |
| **Documentation Update**<br>`docs/architecture.md` | Update module relationship diagram and description to introduce `textproc` and clarify dependencies among modules. |
| **Ellipsis Refactor**<br>`src/ellipsis.rs` | Refactor `replace_ellipsis` to use `process_tokens` from `textproc` instead of manual tokenisation and joining. |
| **Footnotes Refactor**<br>`src/footnotes.rs` | Refactor `convert_footnotes` to use `process_tokens` from `textproc` for token handling. |
| **Module Declaration**<br>`src/lib.rs` | Add internal `textproc` module declaration. |
| **New Token Processing Module**<br>`src/textproc.rs` | Add new `textproc` module with `process_tokens` function and comprehensive tests for token-based Markdown transformations. |
| **Tokenisation Cleanup**<br>`src/wrap.rs`, `src/wrap/tokenize.rs` | Change `Token` visibility to public; remove `tokenize_markdown` function and update documentation accordingly. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller as ellipsis/footnotes
    participant TextProc as textproc::process_tokens
    participant Wrap as wrap::tokenize_inline
    Caller->>TextProc: process_tokens(lines, closure)
    loop for each line
        TextProc->>Wrap: tokenize_inline(line)
        Wrap-->>TextProc: tokens
        loop for each token
            TextProc->>Caller: closure(token, output)
        end
    end
    TextProc-->>Caller: Vec<String> (processed lines)
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~18 minutes

Poem

A token here, a token there,
Now textproc lends its flair.
Ellipses dance, footnotes sing,
Old code gone—a fresher spring!
Markdown flows through helpers new,
Refactored neat, concise, and true.
📝✨



Contributor

@sourcery-ai sourcery-ai Bot left a comment

Hey @leynos - I've reviewed your changes - here's some feedback:

  • process_tokens has become quite large and mixes fence detection, inline code parsing, and newline handling—consider extracting those concerns into smaller private helper functions to improve readability and maintainability.
  • The example in textproc.rs is marked ignore, but turning it into a real doctest (using rust instead of ignore) would ensure the snippet stays up‐to‐date and compiles in CI.
  • process_tokens currently builds a single String and then splits on newlines; for very large inputs you might avoid that intermediate join/split to reduce allocations and improve performance.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- process_tokens has become quite large and mixes fence detection, inline code parsing, and newline handling—consider extracting those concerns into smaller private helper functions to improve readability and maintainability.
- The example in textproc.rs is marked `ignore`, but turning it into a real doctest (using `rust` instead of `ignore`) would ensure the snippet stays up‐to‐date and compiles in CI.
- process_tokens currently builds a single String and then splits on newlines; for very large inputs you might avoid that intermediate join/split to reduce allocations and improve performance.

## Individual Comments

### Comment 1
<location> `src/textproc.rs:74` </location>
<code_context>
+            }
+            continue;
+        }
+        let mut rest = trimmed;
+        while let Some(pos) = rest.find('`') {
+            if pos > 0 {
+                f(Token::Text(&rest[..pos]), &mut out);
</code_context>

<issue_to_address>
The inline code span parsing does not handle nested or multi-backtick code spans.

The logic only supports single backtick code spans, so it may fail to correctly parse valid Markdown with nested or multi-backtick code spans.
</issue_to_address>
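
For reference, CommonMark allows a code span to be delimited by a run of two or more backticks, which is how a literal backtick can appear inside a span; an illustrative fragment:

```markdown
Single-backtick span: `code`
Double-backtick span containing a backtick: ``grave ` accent``
```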

### Comment 2
<location> `docs/architecture.md:267` </location>
<code_context>
-streaming helpers that combine the lower-level functions, including ellipsis
-replacement and footnote conversion. The `io` module handles filesystem
-operations, delegating the text processing to `process`.
+`ellipsis` module performs text normalization, while `footnotes` converts bare
+references. The `textproc` module contains shared token-processing helpers used
+by both. The `process` module provides streaming helpers that combine the
+lower-level functions. The `io` module handles filesystem operations,
+delegating the text processing to `process`.
</code_context>

<issue_to_address>
Clarify the antecedent of 'both' in the description of the `textproc` module.

Specify which modules 'both' refers to for clarity in the documentation.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
The `lib` module re-exports the public API from the other modules. The
`ellipsis` module performs text normalization, while `footnotes` converts bare
references. The `textproc` module contains shared token-processing helpers used
by both. The `process` module provides streaming helpers that combine the
lower-level functions. The `io` module handles filesystem operations,
delegating the text processing to `process`.
=======
The `lib` module re-exports the public API from the other modules. The
`ellipsis` module performs text normalization, while `footnotes` converts bare
references. The `textproc` module contains shared token-processing helpers used
by both the `ellipsis` and `footnotes` modules. The `process` module provides streaming helpers that combine the
lower-level functions. The `io` module handles filesystem operations,
delegating the text processing to `process`.
>>>>>>> REPLACE

</suggested_fix>

### Comment 3
<location> `src/textproc.rs:41` </location>
<code_context>
+/// assert_eq!(out, lines);
+/// ```
+#[must_use]
+pub(crate) fn process_tokens<F>(lines: &[String], mut f: F) -> Vec<String>
+where
+    F: FnMut(Token<'_>, &mut String),
</code_context>

<issue_to_address>
Consider replacing the manual line and token parsing logic with a single call to the existing markdown tokenizer for simplicity.

You can collapse almost all of that manual per‐line/fence/backtick logic by leaning on the existing markdown tokenizer in `wrap::tokenize_markdown`. e.g.  

```rust
use crate::wrap::{Token, tokenize_markdown};

pub(crate) fn process_tokens<F>(lines: &[String], mut f: F) -> Vec<String>
where
    F: FnMut(Token<'_>, &mut String),
{
    if lines.is_empty() {
        return Vec::new();
    }

    // preserve run of trailing blank lines
    let trailing_blanks = lines.iter().rev().take_while(String::is_empty).count();

    // join once, tokenize, run your closure, then split back out
    let source = lines.join("\n");
    let mut out = String::new();
    for token in tokenize_markdown(&source) {
        f(token, &mut out);
    }

    // if user removed everything, return empty
    if out.is_empty() {
        return Vec::new();
    }

    // split back into lines and restore trailing blanks
    let mut result: Vec<String> = out.split('\n').map(ToOwned::to_owned).collect();
    let existing_blanks = result.iter().rev().take_while(|s| s.is_empty()).count();
    result.extend(std::iter::repeat(String::new()).take(trailing_blanks.saturating_sub(existing_blanks)));
    result
}
```

1. Drop all manual `in_fence` / back‐tick scanning loops; the `tokenize_markdown` call already handles fences, code spans, text, and newlines.  
2. Join once, tokenize once, then split once—this is far simpler and preserves your existing tests for identity, fences, and trailing blanks.
</issue_to_address>

### Comment 4
<location> `docs/architecture.md:268` </location>
<code_context>
-replacement and footnote conversion. The `io` module handles filesystem
-operations, delegating the text processing to `process`.
+`ellipsis` module performs text normalization, while `footnotes` converts bare
+references. The `textproc` module contains shared token-processing helpers used
+by both. The `process` module provides streaming helpers that combine the
+lower-level functions. The `io` module handles filesystem operations,
</code_context>

<issue_to_address>
This line is part of a paragraph that exceeds 80 columns and should be wrapped.

Ensure that all lines in this paragraph are wrapped to a maximum of 80 columns.
</issue_to_address>

### Comment 5
<location> `docs/architecture.md:270` </location>
<code_context>
+`ellipsis` module performs text normalization, while `footnotes` converts bare
+references. The `textproc` module contains shared token-processing helpers used
+by both. The `process` module provides streaming helpers that combine the
+lower-level functions. The `io` module handles filesystem operations,
+delegating the text processing to `process`.

</code_context>

<issue_to_address>
This line is part of a paragraph that exceeds 80 columns and should be wrapped.

Wrap this paragraph so that each line does not exceed 80 columns.
</issue_to_address>


Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0866ee8 and f1194b9.

📒 Files selected for processing (6)
  • docs/architecture.md (2 hunks)
  • src/ellipsis.rs (1 hunks)
  • src/footnotes.rs (2 hunks)
  • src/lib.rs (1 hunks)
  • src/textproc.rs (1 hunks)
  • src/wrap.rs (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.rs

📄 CodeRabbit Inference Engine (AGENTS.md)

**/*.rs: Clippy warnings MUST be disallowed.
Fix any warnings emitted during tests in the code itself rather than silencing them.
Where a function is too long, extract meaningfully named helper functions adhering to separation of concerns and CQRS.
Where a function has too many parameters, group related parameters in meaningfully named structs.
Where a function is returning a large error consider using Arc to reduce the amount of data returned.
Write unit and behavioural tests for new functionality. Run both before and after making any change.
Every module must begin with a module level (//!) comment explaining the module's purpose and utility.
Document public APIs using Rustdoc comments (///) so documentation can be generated with cargo doc.
Prefer immutable data and avoid unnecessary mut bindings.
Handle errors with the Result type instead of panicking where feasible.
Avoid unsafe code unless absolutely necessary and document any usage clearly.
Place function attributes after doc comments.
Do not use return in single-line functions.
Use predicate functions for conditional criteria with more than two branches.
Lints must not be silenced except as a last resort.
Lint rule suppressions must be tightly scoped and include a clear reason.
Prefer expect over allow.
Prefer .expect() over .unwrap().
Use concat!() to combine long string literals rather than escaping newlines with a backslash.
Prefer semantic error enums: Derive std::error::Error (via the thiserror crate) for any condition the caller might inspect, retry, or map to an HTTP status.
Use an opaque error only at the app boundary: Use eyre::Report for human-readable logs; these should not be exposed in public APIs.
Never export the opaque type from a library: Convert to domain enums at API boundaries, and to eyre only in the main main() entrypoint or top-level async task.

Files:

  • src/lib.rs
  • src/footnotes.rs
  • src/textproc.rs
  • src/ellipsis.rs
  • src/wrap.rs

⚙️ CodeRabbit Configuration File

**/*.rs: * Seek to keep the cyclomatic complexity of functions no more than 12.

  • Adhere to single responsibility and CQRS

  • Place function attributes after doc comments.

  • Do not use return in single-line functions.

  • Move conditionals with >2 branches into a predicate function.

  • Avoid unsafe unless absolutely necessary.

  • Every module must begin with a //! doc comment that explains the module's purpose and utility.

  • Comments and docs must follow en-GB-oxendict (-ize / -our) spelling and grammar

  • Lints must not be silenced except as a last resort.

    • #[allow] is forbidden.
    • Only narrowly scoped #[expect(lint, reason = "...")] is allowed.
    • No lint groups, no blanket or file-wide suppression.
    • Include FIXME: with link if a fix is expected.
  • Use rstest fixtures for shared setup and to avoid repetition between tests.

  • Replace duplicated tests with #[rstest(...)] parameterised cases.

  • Prefer mockall for mocks/stubs.

  • Prefer .expect() over .unwrap()

  • Ensure that any API or behavioural changes are reflected in the documentation in docs/

  • Ensure that any completed roadmap steps are recorded in the appropriate roadmap in docs/

  • Files must not exceed 400 lines in length

    • Large modules must be decomposed
    • Long match statements or dispatch tables should be decomposed by domain and collocated with targets
    • Large blocks of inline data (e.g., test fixtures, constants or templates) must be moved to external files and inlined at compile-time or loaded at run-time.

Files:

  • src/lib.rs
  • src/footnotes.rs
  • src/textproc.rs
  • src/ellipsis.rs
  • src/wrap.rs
docs/**/*.md

📄 CodeRabbit Inference Engine (AGENTS.md)

docs/**/*.md: Reference: Use the markdown files within the docs/ directory as a knowledge base and source of truth for project requirements, dependency choices, and architectural decisions.
Update: When new decisions are made, requirements change, libraries are added/removed, or architectural patterns evolve, proactively update the relevant file(s) in the docs/ directory to reflect the latest state. Ensure the documentation remains accurate and current.

Files:

  • docs/architecture.md
**/*.md

📄 CodeRabbit Inference Engine (AGENTS.md)

**/*.md: Documentation must use en-GB-oxendict spelling and grammar. (EXCEPTION: the naming of the "LICENSE" file, which is to be left unchanged for community consistency.)
Validate Markdown files using make markdownlint.
Run make fmt after any documentation changes to format all Markdown files and fix table markup.
Validate Mermaid diagrams in Markdown files by running make nixie.
Markdown paragraphs and bullet points must be wrapped at 80 columns.
Code blocks must be wrapped at 120 columns.
Tables and headings must not be wrapped.
Use dashes (-) for list bullets.
Use GitHub-flavoured Markdown footnotes ([^1]) for references and footnotes.

Files:

  • docs/architecture.md

⚙️ CodeRabbit Configuration File

**/*.md: * Avoid 2nd person or 1st person pronouns ("I", "you", "we")

  • Use en-GB-oxendict (-ize / -our) spelling and grammar
  • Paragraphs and bullets must be wrapped to 80 columns, except where a long URL would prevent this (in which case, silence MD013 for that line)
  • Code blocks should be wrapped to 120 columns.
  • Headings must not be wrapped.
  • Documents must start with a level 1 heading
  • Headings must correctly increase or decrease by no more than one level at a time
  • Use GitHub-flavoured Markdown style for footnotes and endnotes.
  • Numbered footnotes must be numbered by order of appearance in the document.

Files:

  • docs/architecture.md
🧠 Learnings (1)
src/footnotes.rs (1)

Learnt from: CR
PR: leynos/mdtablefix#0
File: docs/architecture.md:0-0
Timestamp: 2025-07-29T12:45:00.466Z
Learning: Applies to docs/src/footnotes.rs : The convert_footnotes function should convert bare numeric references into GitHub-flavoured Markdown footnotes, ignoring numbers inside inline code or parentheses, and only process the final contiguous list of numeric references.

🧬 Code Graph Analysis (2)
src/footnotes.rs (1)
src/textproc.rs (1)
  • process_tokens (41-106)
src/textproc.rs (1)
src/wrap.rs (1)
  • is_fence (200-200)
🔇 Additional comments (12)
docs/architecture.md (1)

228-231: Excellent documentation update for the new architecture.

The documentation accurately reflects the introduction of the textproc module and its role in centralising token processing logic. The mermaid diagram correctly shows the new dependencies, and the textual description clearly explains the module relationships.

Also applies to: 255-255, 262-262, 267-271

src/wrap.rs (1)

21-21: Good documentation update to reflect the architectural changes.

The generalised wording "token-processing helpers" is more accurate now that the tokenize_markdown function has been removed and token processing has been moved to the textproc module.

src/footnotes.rs (2)

18-18: Appropriate import update for the architectural refactor.

The import correctly reflects the migration from wrap::tokenize_markdown to textproc::process_tokens.


96-105: Excellent refactor using the new streaming API.

The refactor to use process_tokens is well-implemented and maintains the same functionality while improving code organisation. The closure correctly handles each token type:

  • Text tokens are processed for footnote conversion
  • Code tokens are preserved with backticks
  • Fence tokens are passed through unchanged
  • Newlines maintain line structure

This change aligns with the centralised token processing architecture.

src/textproc.rs (3)

1-8: Excellent module documentation.

The module-level documentation clearly explains the purpose and utility of the textproc module, satisfying the coding guidelines requirement for module documentation.


41-106: Well-implemented streaming token processor.

The process_tokens function is well-architected and handles all the edge cases correctly:

  • Proper fence state management with in_fence toggle
  • Correct backtick matching for code spans, including nested and unmatched cases
  • Appropriate blank line preservation logic
  • Clean separation of concerns with the closure-based approach

The implementation centralises token processing logic effectively, supporting the architectural refactor described in the PR objectives.


108-199: Comprehensive test coverage.

The test suite thoroughly covers the key functionality:

  • Identity transformations and empty input handling
  • Content removal scenarios
  • Trailing blank line preservation
  • Fence block processing including malformed cases

This provides excellent confidence in the implementation's correctness.

src/ellipsis.rs (5)

10-10: Import change looks correct.

The switch from wrap::tokenize_markdown to textproc::{Token, process_tokens} aligns with the architectural refactoring described in the PR objectives.


17-40: Function rewrite implements streaming token processing correctly.

The refactored replace_ellipsis function properly delegates tokenisation to process_tokens and handles all token variants appropriately. The early exit optimisation on line 20-23 improves performance by avoiding regex replacement when no ellipsis patterns are present.

The pattern matching covers all token types:

  • Token::Text: Processes ellipsis replacement with early exit optimisation
  • Token::Code: Preserves code spans by wrapping in backticks
  • Token::Fence: Preserves fenced blocks verbatim
  • Token::Newline: Maintains line structure

20-23: Early exit optimisation is well-implemented.

The check !DOT_RE.is_match(t) before expensive regex replacement is a sound performance optimisation that maintains correctness whilst reducing unnecessary processing.


24-30: Ellipsis replacement logic preserves original behaviour.

The replacement logic correctly:

  • Converts groups of 3+ dots to ellipsis characters
  • Handles leftover dots that don't form complete triples
  • Processes matches left-to-right as documented

The closure implementation maintains the same mathematical logic: len / 3 ellipses plus len % 3 leftover dots.
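
A hedged illustration of that arithmetic (the helper name is hypothetical, not the crate's code): a run of five dots yields one ellipsis and two leftover dots, and a run of six yields two ellipses.

```rust
fn collapse_dots(run: &str) -> String {
    // Mirrors the documented logic: len / 3 ellipses followed by len % 3 leftover dots.
    let len = run.len();
    format!("{}{}", "…".repeat(len / 3), ".".repeat(len % 3))
}

#[test]
fn collapses_dot_runs() {
    assert_eq!(collapse_dots("..."), "…");
    assert_eq!(collapse_dots("....."), "….."); // five dots → one ellipsis, two dots
    assert_eq!(collapse_dots("......"), "……"); // six dots → two ellipses
}
```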


32-39: Token handling preserves fence context correctly.

All non-text tokens are handled appropriately:

  • Code spans wrapped in backticks to maintain markdown structure
  • Fences appended verbatim to preserve block boundaries
  • Newlines preserved to maintain line structure

This ensures the streaming approach maintains the same output format as the original implementation.

@leynos leynos force-pushed the h9ndnv-codex/create-process_tokens-function-and-refactor branch from 8b48dd5 to ec1102e on July 31, 2025 17:54
Owner Author

leynos commented Jul 31, 2025

@coderabbitai review

Contributor

coderabbitai Bot commented Jul 31, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Owner Author

leynos commented Jul 31, 2025

@coderabbitai review

Contributor

coderabbitai Bot commented Jul 31, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f1194b9 and ec1102e.

📒 Files selected for processing (6)
  • docs/architecture.md (2 hunks)
  • src/ellipsis.rs (1 hunks)
  • src/footnotes.rs (2 hunks)
  • src/lib.rs (2 hunks)
  • src/textproc.rs (1 hunks)
  • src/wrap.rs (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.rs

⚙️ CodeRabbit Configuration File

**/*.rs: * Seek to keep the cyclomatic complexity of functions no more than 12.

  • Adhere to single responsibility and CQRS

  • Place function attributes after doc comments.

  • Do not use return in single-line functions.

  • Move conditionals with >2 branches into a predicate function.

  • Avoid unsafe unless absolutely necessary.

  • Every module must begin with a //! doc comment that explains the module's purpose and utility.

  • Comments and docs must follow en-GB-oxendict (-ize / -our) spelling and grammar

  • Lints must not be silenced except as a last resort.

    • #[allow] is forbidden.
    • Only narrowly scoped #[expect(lint, reason = "...")] is allowed.
    • No lint groups, no blanket or file-wide suppression.
    • Include FIXME: with link if a fix is expected.
  • Use rstest fixtures for shared setup and to avoid repetition between tests.

  • Replace duplicated tests with #[rstest(...)] parameterised cases.

  • Prefer mockall for mocks/stubs.

  • Prefer .expect() over .unwrap()

  • Ensure that any API or behavioural changes are reflected in the documentation in docs/

  • Ensure that any completed roadmap steps are recorded in the appropriate roadmap in docs/

  • Files must not exceed 400 lines in length

    • Large modules must be decomposed
    • Long match statements or dispatch tables should be decomposed by domain and collocated with targets
    • Large blocks of inline data (e.g., test fixtures, constants or templates) must be moved to external files and inlined at compile-time or loaded at run-time.

Files:

  • src/lib.rs
  • src/footnotes.rs
  • src/wrap.rs
  • src/ellipsis.rs
  • src/textproc.rs
**/*.md

⚙️ CodeRabbit Configuration File

**/*.md: * Avoid 2nd person or 1st person pronouns ("I", "you", "we")

  • Use en-GB-oxendict (-ize / -our) spelling and grammar
  • Paragraphs and bullets must be wrapped to 80 columns, except where a long URL would prevent this (in which case, silence MD013 for that line)
  • Code blocks should be wrapped to 120 columns.
  • Headings must not be wrapped.
  • Documents must start with a level 1 heading
  • Headings must correctly increase or decrease by no more than one level at a time
  • Use GitHub-flavoured Markdown style for footnotes and endnotes.
  • Numbered footnotes must be numbered by order of appearance in the document.

Files:

  • docs/architecture.md
🧠 Learnings (1)
📚 Learning: applies to docs/src/footnotes.rs : the `convert_footnotes` function should convert bare numeric refe...
Learnt from: CR
PR: leynos/mdtablefix#0
File: docs/architecture.md:0-0
Timestamp: 2025-07-29T12:45:00.466Z
Learning: Applies to docs/src/footnotes.rs : The `convert_footnotes` function should convert bare numeric references into GitHub-flavoured Markdown footnotes, ignoring numbers inside inline code or parentheses, and only process the final contiguous list of numeric references.

Applied to files:

  • src/footnotes.rs
🪛 GitHub Actions: CI
src/wrap.rs

[error] 457-457: cargo fmt check failed: this file contains an unclosed delimiter

🔇 Additional comments (14)
src/wrap.rs (3)

40-51: Well-designed Token enum.

The enum variants clearly represent different Markdown elements with appropriate string slice types for efficient tokenization. Good documentation and standard derives support testing and debugging.


135-137: Clean utility function with good documentation.

The function correctly implements line-breaking logic with clear parameters and comprehensive documentation including examples.


457-457: Ignore false unclosed delimiter error in wrap.rs

Verified that all braces, parentheses, string literals and doc-comment code fences are balanced and the file ends correctly with mod tests;. No changes required.

Likely an incorrect or invalid review comment.

src/lib.rs (1)

12-12: Proper integration of the new textproc module.

The module documentation and public declaration correctly expose the new textproc module, maintaining consistency with the existing module structure.

Also applies to: 33-33

docs/architecture.md (1)

228-231: Accurate documentation of the new architecture.

The mermaid diagram and textual descriptions correctly reflect the introduction of the textproc module and the updated dependencies. The explanation clearly shows how ellipsis and footnotes now depend on textproc for shared token-processing functionality.

Also applies to: 255-255, 262-262, 267-271

src/footnotes.rs (2)

21-21: Correct import update for the new textproc module.

The import properly reflects the migration from wrap to textproc for token processing functionality.


99-108: Clean refactoring to use the streaming token API.

The process_tokens call correctly handles all token types while simplifying the implementation. The closure logic preserves the original footnote conversion behaviour whilst leveraging the new shared tokenization infrastructure.

src/textproc.rs (5)

1-8: Excellent module documentation.

The module doc comment clearly explains the purpose, utility, and key features including the streaming API and trailing blank line preservation. This follows the coding guideline requiring modules to begin with //! doc comments.


12-36: Robust inline code tokenization.

The function correctly handles multi-backtick code spans by matching delimiter lengths and gracefully handles unmatched backticks. The closure-based emission pattern provides flexibility for different use cases.
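
A standalone sketch of delimiter-length matching of the kind described here; the function name and shape are assumptions for illustration, not the crate's implementation, and some CommonMark edge cases are ignored.

```rust
/// Split a line into alternating text and code segments. An opening run of N
/// backticks is closed by the next run of N backticks; an unmatched opening
/// run falls back to plain text.
fn emit_inline<'a>(mut rest: &'a str, mut emit: impl FnMut(bool, &'a str)) {
    while let Some(start) = rest.find('`') {
        if start > 0 {
            emit(false, &rest[..start]); // text before the opening delimiter
        }
        let after = &rest[start..];
        let open = after.chars().take_while(|&c| c == '`').count();
        let (delim, body) = after.split_at(open);
        match body.find(delim) {
            Some(end) => {
                emit(true, &body[..end]); // code span content, backticks stripped
                rest = &body[end + open..];
            }
            None => {
                emit(false, after); // unmatched backticks are treated as text
                return;
            }
        }
    }
    if !rest.is_empty() {
        emit(false, rest);
    }
}

fn main() {
    let mut parts = Vec::new();
    emit_inline("a ``b ` c`` d", |code, s| parts.push((code, s.to_string())));
    assert_eq!(
        parts,
        vec![
            (false, "a ".to_string()),
            (true, "b ` c".to_string()),
            (false, " d".to_string()),
        ]
    );
}
```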


38-63: Well-structured line processing logic.

The function clearly separates fence handling from inline content processing, with proper state management and clean early returns that reduce complexity.
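
A small sketch of the fence-state handling described here; `handle_line` and the inline fence test are hypothetical simplifications, not the crate's code (the crate has its own `is_fence` helper).

```rust
// Returns true when the line was emitted verbatim (fence marker or fenced
// content) and the caller should skip inline tokenization for it.
fn handle_line<'a>(line: &'a str, in_fence: &mut bool, emit_verbatim: impl FnOnce(&'a str)) -> bool {
    let is_marker = line.trim_start().starts_with("```"); // simplified fence test
    if is_marker {
        *in_fence = !*in_fence; // an opening or closing marker toggles the state
    }
    if is_marker || *in_fence {
        emit_verbatim(line);
        true
    } else {
        false
    }
}
```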


91-121: Excellent public API implementation.

The function handles edge cases properly, preserves trailing blank lines as documented, and provides a clean streaming interface. The comprehensive documentation with examples makes it easy to use correctly.


123-227: Comprehensive test coverage.

The test suite thoroughly covers functionality including edge cases, trailing blank line preservation, fence handling, and multi-backtick code spans. The descriptive test names clearly indicate what behaviour is being verified.

src/ellipsis.rs (2)

12-12: LGTM! Import aligns with architectural refactor.

The import correctly brings in the new shared token processing functionality from the textproc module.


19-40: Excellent refactor using shared token processing.

The refactor successfully leverages the new textproc::process_tokens function whilst maintaining identical behavior. The token handling covers all necessary cases and preserves the original ellipsis replacement logic.

However, remove the return statement on line 23 as it violates the coding guideline for single-line functions.

Apply this diff to remove the unnecessary return:

```diff
         Token::Text(t) => {
             if !DOT_RE.is_match(t) {
                 out.push_str(t);
-                return;
-            }
+            } else {
             let replaced = DOT_RE.replace_all(t, |caps: &regex::Captures<'_>| {
                 let len = caps[0].len();
                 let ellipses = "…".repeat(len / 3);
                 let leftover = ".".repeat(len % 3);
                 format!("{ellipses}{leftover}")
             });
             out.push_str(&replaced);
+            }
         }
```

Likely an incorrect or invalid review comment.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🔭 Outside diff range comments (1)
src/wrap/tokenize.rs (1)

20-31: Fix incorrect doc comment and derive set for Token.

The docstring claims Token is “emitted by segment_inline”, yet segment_inline returns Vec<String>. This is misleading now that process_tokens in textproc owns the stream.
Derive Clone, Eq while touching the item to keep parity with PartialEq and aid downstream use.

```diff
-/// Markdown token emitted by [`segment_inline`].
-#[derive(Debug, PartialEq)]
+/// Markdown token yielded by the streaming [`crate::textproc::process_tokens`] API.
+#[derive(Debug, Clone, PartialEq, Eq)]
 pub enum Token<'a> {
```

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ec1102e and b44776b.

📒 Files selected for processing (2)
  • src/wrap.rs (1 hunks)
  • src/wrap/tokenize.rs (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.rs

📄 CodeRabbit Inference Engine (AGENTS.md)

**/*.rs: Clippy warnings MUST be disallowed.
Fix any warnings emitted during tests in the code itself rather than silencing them.
Where a function is too long, extract meaningfully named helper functions adhering to separation of concerns and CQRS.
Where a function has too many parameters, group related parameters in meaningfully named structs.
Where a function is returning a large error consider using Arc to reduce the amount of data returned.
Write unit and behavioural tests for new functionality. Run both before and after making any change.
Every module must begin with a module level (//!) comment explaining the module's purpose and utility.
Document public APIs using Rustdoc comments (///) so documentation can be generated with cargo doc.
Prefer immutable data and avoid unnecessary mut bindings.
Handle errors with the Result type instead of panicking where feasible.
Avoid unsafe code unless absolutely necessary and document any usage clearly.
Place function attributes after doc comments.
Do not use return in single-line functions.
Use predicate functions for conditional criteria with more than two branches.
Lints must not be silenced except as a last resort.
Lint rule suppressions must be tightly scoped and include a clear reason.
Prefer expect over allow.
Prefer .expect() over .unwrap().
Use concat!() to combine long string literals rather than escaping newlines with a backslash.
Prefer semantic error enums: Derive std::error::Error (via the thiserror crate) for any condition the caller might inspect, retry, or map to an HTTP status.
Use an opaque error only at the app boundary: Use eyre::Report for human-readable logs; these should not be exposed in public APIs.
Never export the opaque type from a library: Convert to domain enums at API boundaries, and to eyre only in the main main() entrypoint or top-level async task.

Files:

  • src/wrap/tokenize.rs
  • src/wrap.rs

⚙️ CodeRabbit Configuration File

**/*.rs: * Seek to keep the cyclomatic complexity of functions no more than 12.

  • Adhere to single responsibility and CQRS

  • Place function attributes after doc comments.

  • Do not use return in single-line functions.

  • Move conditionals with >2 branches into a predicate function.

  • Avoid unsafe unless absolutely necessary.

  • Every module must begin with a //! doc comment that explains the module's purpose and utility.

  • Comments and docs must follow en-GB-oxendict (-ize / -our) spelling and grammar

  • Lints must not be silenced except as a last resort.

    • #[allow] is forbidden.
    • Only narrowly scoped #[expect(lint, reason = "...")] is allowed.
    • No lint groups, no blanket or file-wide suppression.
    • Include FIXME: with link if a fix is expected.
  • Use rstest fixtures for shared setup and to avoid repetition between tests.

  • Replace duplicated tests with #[rstest(...)] parameterised cases.

  • Prefer mockall for mocks/stubs.

  • Prefer .expect() over .unwrap()

  • Ensure that any API or behavioural changes are reflected in the documentation in docs/

  • Ensure that any completed roadmap steps are recorded in the appropriate roadmap in docs/

  • Files must not exceed 400 lines in length

    • Large modules must be decomposed
    • Long match statements or dispatch tables should be decomposed by domain and collocated with targets
    • Large blocks of inline data (e.g., test fixtures, constants or templates) must be moved to external files and inlined at compile-time or loaded at run-time.

Files:

  • src/wrap/tokenize.rs
  • src/wrap.rs

@leynos leynos merged commit 05ce23e into main Jul 31, 2025
2 checks passed
@leynos leynos deleted the h9ndnv-codex/create-process_tokens-function-and-refactor branch July 31, 2025 18:32