Add HTML table conversion support by leynos · Pull Request #13 · leynos/mdtablefix

leynos · 2025-06-13T23:18:03Z

Summary

parse HTML <table> elements using html5ever and convert them to Markdown
reflow converted tables with existing logic
handle HTML tables in process_stream
add tests for HTML table conversion
add test for multiple tables in one document
document HTML table support and update README

Testing

cargo fmt --all
cargo clippy -- -D warnings
cargo test
markdownlint README.md docs/html-table-support.md docs/rust-testing-with-rstest-fixtures.md (fails: various MD013 and MD040 errors)

https://chatgpt.com/codex/tasks/task_e_684caa9f4f7c8322bdc5b35423c8eb94

Summary by Sourcery

Enable support for simple HTML tables by parsing them with html5ever, converting to Markdown, and then reflowing using existing logic; include tests and documentation for this feature.

New Features:
- Detect and convert HTML <table> elements into Markdown before reflowing

Enhancements:
- Integrate HTML table buffering and parsing into the main processing stream

Build:
- Add html5ever and markup5ever_rcdom dependencies for HTML parsing

Documentation:
- Update README and add dedicated HTML table support documentation

Tests:
- Add fixtures and tests for single and multiple HTML tables in document processing

Summary by CodeRabbit

New Features
- Added support for detecting and converting simple HTML <table> elements embedded in Markdown documents into Markdown table format.

Documentation
- Updated the README to describe the new HTML table conversion feature.
- Added a new documentation file detailing HTML table support and limitations.

Tests
- Introduced new tests to verify HTML table conversion and handling of multiple tables.

sourcery-ai · 2025-06-13T23:18:07Z

Reviewer's Guide

Implements support for detecting and converting HTML <table> elements to Markdown by parsing with html5ever, integrating this conversion into the main streaming logic before reflowing tables, and adding related dependencies, documentation, and tests.

Sequence Diagram for HTML Table Processing

sequenceDiagram
    participant UserInput
    participant P as process_stream
    participant PHL as push_html_line
    participant HTML_MD as html_table_to_markdown
    participant Parser as html5ever
    participant DOM_Utils as DOMTraversalUtils
    participant Node_MD as table_node_to_markdown

    UserInput->>P: Stream of lines (Markdown with HTML table)
    P->>P: Detects HTML table start (`<table...`)
    P->>PHL: push_html_line(current_line, html_buffer, html_depth, in_html, out_buffer)
    PHL->>PHL: Accumulate HTML table lines in html_buffer
    PHL->>PHL: Update html_depth based on <table> and </table> tags
    opt HTML table complete (html_depth is 0 after </table>)
        PHL->>HTML_MD: html_table_to_markdown(html_buffer)
        HTML_MD->>Parser: parse_document(html_content)
        Parser-->>HTML_MD: DOM (RcDom)
        HTML_MD->>DOM_Utils: collect_tables(DOM)
        DOM_Utils-->>HTML_MD: List of table_nodes
        loop for each table_node
            HTML_MD->>Node_MD: table_node_to_markdown(table_node)
            Node_MD->>DOM_Utils: collect_rows(table_node)
            DOM_Utils-->>Node_MD: row_handles
            loop for each row/cell in row_handles
                Node_MD->>DOM_Utils: node_text(cell_node)
                DOM_Utils-->>Node_MD: cell_text
            end
            Node_MD->>Node_MD: Converts to Markdown lines & calls reflow_table(markdown_lines)
            Node_MD-->>HTML_MD: Reflowed Markdown table lines
        end
        HTML_MD-->>PHL: Aggregated Markdown table lines
        PHL-->>P: Adds converted lines to main output stream
    end
    P->>P: Continues processing other lines

Class Diagram: New and Modified Components for HTML Table Conversion

classDiagram
    class process_stream {
      +process_stream(lines: &[String]) Vec~String~
      #html_buf: Vec~String~ (new state)
      #html_depth: usize (new state)
      #in_html: bool (new state)
      +push_html_line(line: &str, html_buf: &mut Vec~String~, html_depth: &mut usize, in_html: &mut bool, out: &mut Vec~String~) void (new method)
    }
    class html_table_to_markdown {
      +html_table_to_markdown(lines: &[String]) Vec~String~ (new function)
    }
    class table_node_to_markdown {
      +table_node_to_markdown(table_node: &Handle) Vec~String~ (new function)
    }
    class DOMTraversalUtils {
      <<Utility Module>>
      +collect_tables(handle: &Handle, tables: &mut Vec~Handle~) (new function)
      +collect_rows(handle: &Handle, rows: &mut Vec~Handle~) (new function)
      +node_text(handle: &Handle) String (new function)
      +collect_text(handle: &Handle, out: &mut Vec~String~) (new function)
    }
    class Html5everParser {
      <<External Library html5ever>>
      +parse_document(html: String) RcDom
    }
    class RcDomHandle {
      <<External DOM Node markup5ever_rcdom.Handle>>
    }
    class ExistingReflowLogic {
      <<Existing Module>>
      +reflow_table(markdown_lines: &[String]) Vec~String~
    }

    process_stream : uses internally > push_html_line
    push_html_line : calls > html_table_to_markdown
    html_table_to_markdown : uses > Html5everParser
    Html5everParser : returns > RcDomHandle
    html_table_to_markdown : uses > DOMTraversalUtils
    html_table_to_markdown : calls > table_node_to_markdown

    table_node_to_markdown : processes > RcDomHandle
    table_node_to_markdown : uses > DOMTraversalUtils
    table_node_to_markdown : calls > ExistingReflowLogic.reflow_table

    DOMTraversalUtils : operates_on > RcDomHandle

File-Level Changes

Change Details Files

Add HTML table parsing and conversion helpers

Introduce DOM traversal functions (collect_text, collect_tables, collect_rows) to extract table and cell data
Implement node_text to normalize text nodes and skip non-content elements
Create table_node_to_markdown to render table rows and generate header separators
Build html_table_to_markdown to parse raw HTML lines into Markdown with proper indentation
Implement push_html_line to buffer HTML lines and detect table boundaries

src/lib.rs

Integrate HTML table support into process_stream

Add html_buf, html_depth, and in_html state variables
Detect <table> start, buffer lines, and branch processing inside HTML context
Flush buffered HTML on </table>

src/lib.rs

Add HTML parsing dependencies

Include html5ever and markup5ever_rcdom in Cargo.toml

Cargo.toml

Document HTML table support

Create docs/html-table-support.md detailing feature and limitations
Update README.md to mention HTML conversion

docs/html-table-support.md
README.md

Add integration tests for HTML table conversion

Define html_table and multiple_tables fixtures
Add tests verifying single and multiple HTML table handling in process_stream

tests/integration.rs

coderabbitai · 2025-06-13T23:18:08Z

Walkthrough

Support for converting embedded HTML <table> elements into Markdown tables has been added. This includes new dependencies for HTML parsing, updated documentation, README changes describing the feature, new helper functions for HTML-to-Markdown conversion, and additional integration tests verifying the new functionality.

Changes

Files/Paths	Change Summary
Cargo.toml	Added `html5ever` and `markup5ever_rcdom` dependencies for HTML parsing.
README.md	Updated to mention HTML table to Markdown conversion feature.
docs/html-table-support.md	Added new documentation describing HTML table support and conversion process.
src/lib.rs	Added HTML table detection and conversion logic with new private helper functions; integrated into main processing.
tests/integration.rs	Added fixtures and tests for HTML table conversion and handling multiple Markdown tables.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant process_stream
    participant HTMLParser
    participant MarkdownConverter

    User->>process_stream: Provide Markdown input (may include HTML <table>)
    process_stream->>process_stream: Detect <table> block in input
    process_stream->>HTMLParser: Parse buffered HTML table fragment
    HTMLParser-->>process_stream: Return DOM structure of table
    process_stream->>MarkdownConverter: Convert DOM table to Markdown lines
    MarkdownConverter-->>process_stream: Return Markdown table lines
    process_stream->>User: Output Markdown with converted tables and reflowed Markdown tables

Possibly related PRs

Implement markdown table fixer #1: Implements the original Markdown table reflow and fixer functionality, which is now extended to support HTML table conversion in this PR.

Poem

In the land of Markdown, tables abound,
But some were in HTML, hidden and round.
Now with a hop and a parser so clever,
They’re Markdown again—HTML never!
🐇✨
Hop, skip, and convert with glee,
Tables united in Markdown, as neat as can be!

sourcery-ai

Hey @leynos - I've reviewed your changes - here's some feedback:

node_text currently concatenates text nodes without any separating whitespace, which may run words together—consider inserting spaces or normalizing whitespace between sibling text nodes.
Your HTML table detection trims all leading whitespace (line.trim_end()), so converted tables lose their original indentation—preserve or reapply indentation to maintain document consistency.
find_table only locates the first <table> element and ignores any subsequent ones or nested tables—consider iterating through all table elements or supporting multiple HTML tables in one pass.

Prompt for AI Agents

Please address the comments from this code review:
## Individual Comments

### Comment 1
<location> `src/lib.rs:361` </location>
<code_context>

+        if in_html {
+            html_buf.push(line.trim_end().to_string());
+            if line.contains("</table>") {
+                out.extend(html_table_to_markdown(&html_buf));
+                html_buf.clear();
</code_context>

<issue_to_address>
Nested <table> tags not handled correctly

Track table nesting by incrementing a counter on <table> and decrementing on </table>, only ending collection when the counter reaches zero.
</issue_to_address>
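
The counter-based approach suggested above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `update_table_depth` and `first_table_block` are hypothetical names, and the substring match is deliberately naive (it is case-sensitive and would also count tags inside comments).

```rust
/// Returns the new nesting depth after scanning one line.
/// Hypothetical helper for illustration; not taken from the PR.
fn update_table_depth(depth: usize, line: &str) -> usize {
    let opens = line.matches("<table").count();
    let closes = line.matches("</table>").count();
    // saturating_sub keeps a stray closing tag from underflowing the counter.
    (depth + opens).saturating_sub(closes)
}

/// Collects the lines of the first complete outermost table block, if any.
fn first_table_block(lines: &[&str]) -> Option<Vec<String>> {
    let mut depth = 0usize;
    let mut buf = Vec::new();
    for line in lines {
        if depth == 0 && !line.trim_start().starts_with("<table") {
            continue; // not inside a table yet
        }
        buf.push(line.trim_end().to_string());
        depth = update_table_depth(depth, line);
        if depth == 0 {
            // The outermost table has been closed: collection ends here.
            return Some(buf);
        }
    }
    None // the table was never closed
}
```

With this shape, an inner `</table>` only lowers the counter to 1, so buffering continues until the outer table closes.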

leynos · 2025-06-14T00:39:05Z

@sourcery-ai review

sourcery-ai

Hey @leynos - I've reviewed your changes and they look great!

Prompt for AI Agents

Please address the comments from this code review:
## Individual Comments

### Comment 1
<location> `src/lib.rs:103` </location>
<code_context>
+        .join(" ")
+}
+
+fn collect_text(handle: &Handle, out: &mut Vec<String>) {
+    match &handle.data {
+        NodeData::Text { contents } => out.push(contents.borrow().to_string()),
</code_context>

<issue_to_address>
Exclude non-content elements in `collect_text`

Currently, `collect_text` includes text from `<script>` and `<style>` tags, which can pollute your output. Please update it to skip these and other non-content tags.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
fn collect_text(handle: &Handle, out: &mut Vec<String>) {
    match &handle.data {
        NodeData::Text { contents } => out.push(contents.borrow().to_string()),
        NodeData::Element { .. } | NodeData::Document => {
            for child in handle.children.borrow().iter() {
                collect_text(child, out);
            }
        }
        _ => {}
    }
}
=======
fn collect_text(handle: &Handle, out: &mut Vec<String>) {
    match &handle.data {
        NodeData::Text { contents } => out.push(contents.borrow().to_string()),
        NodeData::Element { ref name, .. } => {
            // List of non-content tags to skip
            let tag = name.local.as_ref();
            if tag.eq_ignore_ascii_case("script")
                || tag.eq_ignore_ascii_case("style")
                || tag.eq_ignore_ascii_case("noscript")
                || tag.eq_ignore_ascii_case("template")
                || tag.eq_ignore_ascii_case("head")
            {
                // Skip these tags and their children
                return;
            }
            for child in handle.children.borrow().iter() {
                collect_text(child, out);
            }
        }
        NodeData::Document => {
            for child in handle.children.borrow().iter() {
                collect_text(child, out);
            }
        }
        _ => {}
    }
}
>>>>>>> REPLACE

</suggested_fix>

coderabbitai

Actionable comments posted: 2

🔭 Outside diff range comments (1)

Cargo.toml (1)
4-4: ⚠️ Potential issue

Check the Rust edition against the toolchain.

The edition "2024" is valid only on Rust 1.85 or later; older toolchains accept only "2015", "2018", or "2021".

If the crate must build on an older toolchain, apply this diff:
-edition = "2024"
+edition = "2021"

♻️ Duplicate comments (1)

src/lib.rs (1)

389-391: Good implementation of nested table tracking.

The depth tracking correctly handles nested tables by incrementing on <table and decrementing on </table>, addressing the previous review comment.

🧹 Nitpick comments (1)

src/lib.rs (1)

90-101: Simplify redundant whitespace normalization.

The function performs whitespace normalization twice - first when joining with spaces, then again by splitting and re-joining.

 fn node_text(handle: &Handle) -> String {
     let mut parts = Vec::new();
     collect_text(handle, &mut parts);
-    parts
-        .into_iter()
-        .filter(|s| !s.trim().is_empty())
-        .collect::<Vec<_>>()
-        .join(" ")
-        .split_whitespace()
-        .collect::<Vec<_>>()
-        .join(" ")
+    parts.join(" ").split_whitespace().collect::<Vec<_>>().join(" ")
 }

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2c367ec and 252c973.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (5)

Cargo.toml (1 hunks)
README.md (1 hunks)
docs/html-table-support.md (1 hunks)
src/lib.rs (5 hunks)
tests/integration.rs (2 hunks)

🧰 Additional context used

🪛 LanguageTool

docs/html-table-support.md

[style] ~3-~3: Would you like to use the Oxford spelling “recognize”? The spelling ‘recognise’ is also correct.
Context: ...ablefix uses the html5ever parser to recognise simple <table> elements embedded in M...

(OXFORD_SPELLING_Z_NOT_S)

⏰ Context from checks skipped due to timeout of 90000ms (2)

GitHub Check: coverage
GitHub Check: build-test

🔇 Additional comments (3)

README.md (1)

6-6: LGTM!

The README update accurately describes the new HTML table conversion feature.

docs/html-table-support.md (1)

1-11: Well-documented feature description.

The documentation clearly explains the HTML table support feature, including its limitations to basic table elements.

tests/integration.rs (1)

66-148: Comprehensive test coverage for new features.

The tests thoroughly verify both HTML table conversion and multiple table handling. The expected outputs correctly show markdown tables with proper formatting and header separators.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

src/lib.rs (3)
401-437: Duplicated HTML buffering logic – extract a helper

The block that pushes to html_buf, updates html_depth, and flushes when depth hits zero is duplicated almost verbatim in two branches. This hurts readability and makes future fixes error-prone.

Consider extracting a small function or closure, e.g. handle_html_line(line), and call it from both places.

This removes ~25 lines of duplication and aligns with DRY.

90-101: node_text performs two extra allocations per join/split

The text is first joined, then immediately split with split_whitespace and joined again, causing avoidable allocations.

A single pass with an iterator adapter gives the same normalisation:
-fn node_text(handle: &Handle) -> String {
-    let mut parts = Vec::new();
-    collect_text(handle, &mut parts);
-    parts
-        .into_iter()
-        .filter(|s| !s.trim().is_empty())
-        .collect::<Vec<_>>()
-        .join(" ")
-        .split_whitespace()
-        .collect::<Vec<_>>()
-        .join(" ")
+fn node_text(handle: &Handle) -> String {
+    let mut parts = Vec::new();
+    collect_text(handle, &mut parts);
+    parts
+        .iter()
+        .flat_map(|s| s.split_whitespace())
+        .collect::<Vec<_>>()
+        .join(" ")
 }
Minor, but worthwhile when walking large DOMs.
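
As a standalone illustration of the single-pass normalisation, the sketch below works on a plain slice of strings rather than a DOM handle. `normalise_text` is a hypothetical name; the real `node_text` gathers its parts via `collect_text` first.

```rust
/// Joins text fragments with single spaces, dropping empty and
/// whitespace-only fragments along the way.
/// Hypothetical stand-in for the whitespace step inside `node_text`.
fn normalise_text(parts: &[String]) -> String {
    parts
        .iter()
        // Borrow each fragment so the `&str` words outlive the closure.
        .flat_map(|s| s.split_whitespace())
        .collect::<Vec<_>>()
        .join(" ")
}
```

Iterating by reference matters here: consuming the `String`s with `into_iter()` and calling `split_whitespace()` inside the closure would return words borrowed from a value that is dropped when the closure ends.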

199-204: Use parse_fragment for snippets instead of parse_document

html_table_to_markdown feeds a fragment (no <html> / <body>) into parse_document.
html5ever will implicitly wrap it, but parse_fragment is the intended API for isolated snippets and avoids some edge-case quirks (e.g. incorrect foster parenting around <table>).
-    let opts = ParseOpts::default();
-    let dom: RcDom = parse_document(RcDom::default(), opts).one(html);
+    let dom: RcDom = parse_fragment(
+        RcDom::default(),
+        ParseOpts::default(),
+        QualName::new(None, ns!(html), local_name!("body")),
+        Vec::new(),
+    )
+    .one(html);
Requires use html5ever::{parse_fragment, QualName};.

Not critical but brings the code in line with html5ever best practice.

📥 Commits

Reviewing files that changed from the base of the PR and between 252c973 and f4de0a8.

📒 Files selected for processing (1)

src/lib.rs (5 hunks)

⏰ Context from checks skipped due to timeout of 90000ms (2)

GitHub Check: coverage
GitHub Check: build-test

🔇 Additional comments (1)

src/lib.rs (1)

467-470: Residual buffered HTML lost on early return scenarios

html_buf is flushed after the main loop, but only if the function wasn’t exited earlier.
Currently, the only early exits are through continue, so this is safe; however, should future refactors introduce a return inside the loop, this flush would be skipped.

A safer pattern is to wrap the loop body in a closure or move the flush into a Drop guard.
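
A Drop guard of the kind mentioned above might look like the following sketch. `FlushOnDrop` and `buffer_lines` are hypothetical names, and the flush simply moves the buffered lines to the output; in the real code it would presumably run the HTML conversion instead.

```rust
/// Holds buffered lines and appends them to `out` when dropped.
/// Hypothetical guard type for illustration only.
struct FlushOnDrop<'a> {
    buf: Vec<String>,
    out: &'a mut Vec<String>,
}

impl Drop for FlushOnDrop<'_> {
    fn drop(&mut self) {
        // Runs on every exit path, including early returns.
        self.out.append(&mut self.buf);
    }
}

/// Buffers lines until a "STOP" marker and returns early when it is seen.
fn buffer_lines(lines: &[&str], out: &mut Vec<String>) {
    let mut guard = FlushOnDrop { buf: Vec::new(), out };
    for line in lines {
        if *line == "STOP" {
            return; // the guard is dropped here and still flushes
        }
        guard.buf.push((*line).to_string());
    }
}
```

The trade-off is that the flush becomes implicit, which some reviewers find harder to follow than an explicit call after the loop.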

leynos · 2025-06-14T01:14:37Z

@sourcery-ai review

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

src/lib.rs (1)
160-176: Header-separator width can still be wrong
Same issue as previously reported: col_count is fixed from the first row, so later rows with extra cells produce an undersized | --- | separator.

See earlier review; the fix is to track the maximum column count:
-        if i == 0 {
-            first_header = header_row;
-            col_count = cells.len();
-        }
+        col_count = col_count.max(cells.len());
+        if i == 0 {
+            first_header = header_row;
+        }
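
The effect of tracking the maximum column count can be seen in a small self-contained sketch. `rows_to_markdown` is a hypothetical name, the first row is assumed to be the header, and short rows are emitted as-is rather than padded.

```rust
/// Renders rows of cell text as Markdown table lines, sizing the header
/// separator to the widest row. Hypothetical helper for illustration.
fn rows_to_markdown(rows: &[Vec<String>]) -> Vec<String> {
    // Use the widest row, not just the first one, for the separator.
    let col_count = rows.iter().map(Vec::len).max().unwrap_or(0);
    let mut out = Vec::new();
    for (i, cells) in rows.iter().enumerate() {
        out.push(format!("| {} |", cells.join(" | ")));
        if i == 0 {
            let sep = vec!["---"; col_count].join(" | ");
            out.push(format!("| {sep} |"));
        }
    }
    out
}
```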

🧹 Nitpick comments (3)

src/lib.rs (3)
5-6: Prefer making the new HTML-parsing dependencies optional

html5ever and markup5ever_rcdom add ~2 MB to build artefacts and noticeably increase compile time.
Consider putting them behind a html-tables Cargo feature so existing users who only need Markdown reflow don’t pay the cost.
-[dependencies]
-html5ever = "0.27"
-markup5ever_rcdom = "0.2"
+[dependencies]
+html5ever = { version = "0.27", optional = true }
+markup5ever_rcdom = { version = "0.2", optional = true }
+
+[features]
+html-tables = ["html5ever", "markup5ever_rcdom"]
90-98: node_text does two passes over whitespace

parts.join(" ") immediately followed by split_whitespace() + join(" ") allocates twice and traverses the string three times.

A single split_whitespace() on the raw text is enough:
-fn node_text(handle: &Handle) -> String {
-    let mut parts = Vec::new();
-    collect_text(handle, &mut parts);
-    parts
-        .join(" ")
-        .split_whitespace()
-        .collect::<Vec<_>>()
-        .join(" ")
-}
+fn node_text(handle: &Handle) -> String {
+    let mut parts = Vec::new();
+    collect_text(handle, &mut parts);
+    parts
+        .iter()
+        .flat_map(|s| s.split_whitespace())
+        .collect::<Vec<_>>()
+        .join(" ")
+}
(You can simplify further by building the output with a String and pushing spaces manually.)
Minor, but worth tidying.

214-231: Tag detection is case-sensitive and may mis-count multiple tags per line

html_depth += line.matches("<table").count(); and the </table> counterpart ignore <TABLE> or mixed-case tags and skip self-closing shorthand like <table/>.

Consider normalising the line (line.to_ascii_lowercase()) and counting all occurrences of "<table" / "</table" regardless of case.
Also increment/decrement for every closing tag, not only when contains returns true (you already have .matches() doing the count).

Low-risk edge case, but cheap to harden.
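
A case-insensitive count along the lines suggested could be sketched as below. `count_table_tags` is a hypothetical name; the function still works on raw text, so tags inside comments or a longer tag name that begins with `table` would be counted too.

```rust
/// Counts opening and closing table tags on one line, ignoring ASCII case.
/// Returns `(opens, closes)`. Hypothetical helper for illustration.
fn count_table_tags(line: &str) -> (usize, usize) {
    let lower = line.to_ascii_lowercase();
    // "<table" cannot match "</table" because of the slash, so the two
    // counts do not overlap.
    let opens = lower.matches("<table").count();
    let closes = lower.matches("</table").count();
    (opens, closes)
}
```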

📥 Commits

Reviewing files that changed from the base of the PR and between f4de0a8 and 2e7ab79.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (2)

Cargo.toml (1 hunks)
src/lib.rs (5 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

Cargo.toml

⏰ Context from checks skipped due to timeout of 90000ms (2)

GitHub Check: build-test
GitHub Check: coverage

🔇 Additional comments (1)

src/lib.rs (1)

466-468: Unbalanced <table> blocks silently converted

If the input starts a <table> and never closes it, html_buf is flushed as a table at EOF, producing partial or garbled Markdown.

Please confirm this is intentional. You might instead emit the original HTML when html_depth != 0:
if !html_buf.is_empty() {
    if html_depth == 0 {
        out.extend(html_table_to_markdown(&html_buf));
    } else {
        out.extend(html_buf);
    }
}

coderabbitai · 2025-06-14T01:15:59Z

+fn html_table_to_markdown(lines: &[String]) -> Vec<String> {
+    let indent: String = lines
+        .first()
+        .map(|l| l.chars().take_while(|c| c.is_whitespace()).collect())
+        .unwrap_or_default();
+    let html: String = lines
+        .iter()
+        .map(|l| l.trim_end())
+        .collect::<Vec<_>>()
+        .join("\n");
+    let opts = ParseOpts::default();
+    let dom: RcDom = parse_document(RcDom::default(), opts).one(html);
+
+    let mut tables = Vec::new();
+    collect_tables(&dom.document, &mut tables);
+    if tables.is_empty() {
+        return lines.to_vec();
+    }
+
+    let mut out = Vec::new();
+    for table in tables {
+        for line in table_node_to_markdown(&table) {
+            out.push(format!("{indent}{line}"));
+        }
+    }
+    out
+}


⚠️ Potential issue

Nested tables are emitted twice

collect_tables gathers every <table> node, including those already nested inside another table. Converting each node individually duplicates inner tables in the output.

Only emit the outermost tables: stop descending once a <table> element has been collected, so tables nested inside it are not converted a second time:

fn collect_tables(handle: &Handle, tables: &mut Vec<Handle>) {
    if is_table(handle) {
        tables.push(handle.clone());
        // Do not descend further: nested tables belong to this one.
        return;
    }
    for child in handle.children.borrow().iter() {
        collect_tables(child, tables);
    }
}

fn is_table(h: &Handle) -> bool {
    matches!(h.data, NodeData::Element { ref name, .. } if name.local.as_ref() == "table")
}

🤖 Prompt for AI Agents

In src/lib.rs around lines 186 to 212, the function html_table_to_markdown collects all table nodes including nested ones, causing nested tables to be emitted twice. To fix this, modify the collect_tables function to stop recursing once it reaches a table element, so that only outermost tables are collected. Implement an is_table helper to check if a node is a table, then update collect_tables to push the table and return early instead of visiting its children.
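
The "outermost tables only" idea can be tried without html5ever on a stand-in tree. In the sketch below, `Node` and `collect_outer_tables` are hypothetical and only mimic the shape of a DOM; they are not the `markup5ever_rcdom` types.

```rust
/// Minimal stand-in for a DOM element; not the rcdom `Node` type.
#[derive(Debug)]
struct Node {
    tag: &'static str,
    children: Vec<Node>,
}

/// Collects table elements that are not nested inside another table.
fn collect_outer_tables<'a>(node: &'a Node, tables: &mut Vec<&'a Node>) {
    if node.tag == "table" {
        tables.push(node);
        return; // do not descend: nested tables belong to this one
    }
    for child in &node.children {
        collect_outer_tables(child, tables);
    }
}
```

Checking only a node's direct parent would not be enough here, since a nested table normally sits inside a cell element rather than directly under the outer table.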

sourcery-ai

Hey @leynos - I've reviewed your changes - here's some feedback:

The HTML table start detection using line.trim_start().starts_with("<table") is brittle—consider using a regex or parser-based check to handle attributes and uppercase tags.
Counting <table and </table> occurrences via .matches(...) is naive and case-sensitive; you should use the parsed DOM or a case-insensitive approach to reliably track nested tables.
The line-based buffering for HTML tables adds complexity to process_stream; extracting HTML-to-Markdown conversion into a separate preprocessing step could simplify the main loop.
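
One parser-free way to harden the start check is sketched below. `is_table_start` is a hypothetical name; it accepts attributes and any ASCII case, and rejects longer tag names that merely begin with `table`. It remains a line-level heuristic, not a replacement for parsing.

```rust
/// Returns true if the line begins (after indentation) with an opening
/// table tag, in any ASCII case, with or without attributes.
/// Hypothetical helper for illustration.
fn is_table_start(line: &str) -> bool {
    let trimmed = line.trim_start();
    // `get` returns None when the line is shorter than six bytes or the
    // cut would split a multi-byte character.
    let Some(prefix) = trimmed.get(..6) else {
        return false;
    };
    if !prefix.eq_ignore_ascii_case("<table") {
        return false;
    }
    // The tag name must end here: end of line, '>', '/', or whitespace.
    match trimmed[6..].chars().next() {
        None => true,
        Some(c) => c == '>' || c == '/' || c.is_whitespace(),
    }
}
```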

leynos · 2025-06-14T01:19:23Z

@coderabbitai please create the following Github issues:

The HTML table start detection using line.trim_start().starts_with("<table") is brittle—consider using a regex or parser-based check to handle attributes and uppercase tags.

Counting <table and </table> occurrences via .matches(...) is naive and case-sensitive; you should use the parsed DOM or a case-insensitive approach to reliably track nested tables.

The line-based buffering for HTML tables adds complexity to process_stream; extracting HTML-to-Markdown conversion into a separate preprocessing step could simplify the main loop.

coderabbitai · 2025-06-14T01:20:29Z

🧩 Analysis chain

@leynos I'll create those GitHub issues for you right away.

🏁 Script executed:

#!/bin/bash

# Create first issue about HTML table start detection
gh issue create \
  --title "Improve HTML table start detection robustness" \
  --body "The current HTML table start detection using \`line.trim_start().starts_with(\"<table\")\` is brittle and doesn't handle:

- HTML tables with attributes (e.g., \`<table class=\"foo\">\`)
- Uppercase tags (e.g., \`<TABLE>\`)
- Mixed case scenarios

**Suggested improvement:**
Consider using a regex pattern or parser-based check to handle these cases more reliably.

**Related:**
- PR: https://github.com/leynos/mdtablefix/pull/13
- Comment: https://github.com/leynos/mdtablefix/pull/13#issuecomment-$(gh api repos/leynos/mdtablefix/pulls/13/comments --jq '.[-1].id')" \
  --assignee leynos

# Create second issue about table nesting detection  
gh issue create \
  --title "Fix naive HTML table nesting detection" \
  --body "The current approach of counting \`<table\` and \`</table>\` occurrences using \`.matches(...)\` is problematic:

- Case-sensitive matching (misses \`<TABLE>\` or \`</TABLE>\`)
- Naive string counting doesn't account for tables within comments or CDATA
- Could incorrectly handle malformed HTML

**Suggested improvement:**
Use the parsed DOM structure or implement a case-insensitive approach to reliably track nested table boundaries.

**Related:**
- PR: https://github.com/leynos/mdtablefix/pull/13
- Comment: https://github.com/leynos/mdtablefix/pull/13#issuecomment-$(gh api repos/leynos/mdtablefix/pulls/13/comments --jq '.[-1].id')" \
  --assignee leynos

# Create third issue about architectural complexity
gh issue create \
  --title "Refactor HTML table processing to separate preprocessing step" \
  --body "The current line-based buffering for HTML tables adds significant complexity to the main \`process_stream\` function with additional state variables (\`html_buf\`, \`html_depth\`, \`in_html\`).

**Suggested improvement:**
Extract HTML-to-Markdown conversion into a separate preprocessing step that:
- Runs before the main table reflow logic
- Simplifies the main processing loop
- Improves separation of concerns
- Makes the code more maintainable and testable

This would create a cleaner architecture where HTML tables are converted to Markdown tables first, then processed through the existing Markdown table reflow logic.

**Related:**
- PR: https://github.com/leynos/mdtablefix/pull/13
- Comment: https://github.com/leynos/mdtablefix/pull/13#issuecomment-$(gh api repos/leynos/mdtablefix/pulls/13/comments --jq '.[-1].id')" \
  --assignee leynos

echo "Created 3 GitHub issues for HTML table implementation improvements"

Length of output: 2639

Perfect! I've successfully created three GitHub issues for the HTML table implementation improvements you identified:

Issue #15: Improve HTML table start detection robustness
- Addresses the brittle <table detection that doesn't handle attributes or case variations
Issue #16: Fix naive HTML table nesting detection
- Covers the case-sensitive string counting approach for tracking nested tables
Issue #17: Refactor HTML table processing to separate preprocessing step
- Addresses the architectural complexity in the main process_stream function

All issues have been assigned to you and include backlinks to this PR and comment for context. These can now be tackled independently as follow-up improvements to the HTML table conversion feature.
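
To make the third issue concrete, the preprocessing idea could look roughly like the sketch below. Everything here is an assumption for illustration: `convert_html_tables` is a hypothetical name, the `convert` callback stands in for the crate's real HTML-to-Markdown conversion, and the depth tracking uses plain lowercase substring counting rather than a parser.

```rust
/// Hypothetical preprocessing pass: replaces each buffered HTML table block
/// with whatever `convert` returns and passes other lines through unchanged.
fn convert_html_tables<F>(lines: &[String], convert: F) -> Vec<String>
where
    F: Fn(&[String]) -> Vec<String>,
{
    let mut out = Vec::new();
    let mut buf: Vec<String> = Vec::new();
    let mut depth: i32 = 0;

    for line in lines {
        let lower = line.to_ascii_lowercase();
        if depth == 0 && !lower.trim_start().starts_with("<table") {
            out.push(line.clone());
            continue;
        }
        // Inside (or entering) a table: buffer the line and track nesting.
        depth += lower.matches("<table").count() as i32;
        depth -= lower.matches("</table").count() as i32;
        buf.push(line.clone());
        if depth <= 0 {
            out.extend(convert(&buf));
            buf.clear();
            depth = 0;
        }
    }
    // Unterminated table: emit the buffered lines untouched.
    out.extend(buf);
    out
}
```

Running a pass like this before the reflow logic would leave the main loop dealing only with Markdown lines, which is the separation of concerns the issue describes.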

Add process_stream test for multiple tables

46e84a0

leynos added the codex label Jun 13, 2025 — with ChatGPT Codex Connector

sourcery-ai Bot reviewed Jun 13, 2025

Comment thread src/lib.rs Outdated

Handle nested HTML tables

252c973

sourcery-ai Bot reviewed Jun 14, 2025

Comment thread src/lib.rs

coderabbitai Bot requested changes Jun 14, 2025

Comment thread Cargo.toml Outdated

Comment thread src/lib.rs

Filter non-content HTML nodes

f4de0a8

coderabbitai Bot requested changes Jun 14, 2025

Comment thread src/lib.rs

Simplify node text parsing

2e7ab79

coderabbitai Bot approved these changes Jun 14, 2025

coderabbitai Bot requested changes Jun 14, 2025

sourcery-ai Bot reviewed Jun 14, 2025

This was referenced Jun 14, 2025

Improve HTML table start detection robustness #15

Closed

Fix naive HTML table nesting detection #16

Closed

Refactor HTML table processing to separate preprocessing step #17

Closed

leynos merged commit 7546636 into main Jun 14, 2025
3 checks passed

leynos deleted the codex/add-html5ever-support-for-markdown-tables branch June 14, 2025 01:26

sourcery-ai Bot mentioned this pull request Jun 14, 2025

Refactor HTML table handling #20

Merged

This was referenced Jun 14, 2025

Add docstrings for HTML utilities #22

Merged

Add tests for HTML table conversion #24

Merged

Add text wrapping to mdtablefix #32

Merged

Add uppercase table test #41

Merged

Fix HTML helper comment #50

Merged

Refactor conditionals into predicates #51

Merged

coderabbitai Bot mentioned this pull request Jun 16, 2025

Fix markdown fenced code block languages #52

Merged

coderabbitai Bot mentioned this pull request Jul 13, 2025

Implement header fallback for HTML tables #55

Merged
