
Add HTML table conversion support#13

Merged
leynos merged 4 commits into main from codex/add-html5ever-support-for-markdown-tables
Jun 14, 2025

Conversation

@leynos
Owner

@leynos leynos commented Jun 13, 2025

Summary

  • parse HTML <table> elements using html5ever and convert them to Markdown
  • reflow converted tables with existing logic
  • handle HTML tables in process_stream
  • add tests for HTML table conversion
  • add test for multiple tables in one document
  • document HTML table support and update README

Testing

  • cargo fmt --all
  • cargo clippy -- -D warnings
  • cargo test
  • markdownlint README.md docs/html-table-support.md docs/rust-testing-with-rstest-fixtures.md (fails: various MD013 and MD040 errors)

https://chatgpt.com/codex/tasks/task_e_684caa9f4f7c8322bdc5b35423c8eb94

Summary by Sourcery

Enable support for simple HTML tables by parsing them with html5ever, converting to Markdown, and then reflowing using existing logic; include tests and documentation for this feature.

New Features:

  • Detect and convert HTML <table> elements into Markdown before reflowing

Enhancements:

  • Integrate HTML table buffering and parsing into the main processing stream

Build:

  • Add html5ever and markup5ever_rcdom dependencies for HTML parsing

Documentation:

  • Update README and add dedicated HTML table support documentation

Tests:

  • Add fixtures and tests for single and multiple HTML tables in document processing

Summary by CodeRabbit

  • New Features
    • Added support for detecting and converting simple HTML <table> elements embedded in Markdown documents into Markdown table format.
  • Documentation
    • Updated the README to describe the new HTML table conversion feature.
    • Added a new documentation file detailing HTML table support and limitations.
  • Tests
    • Introduced new tests to verify HTML table conversion and handling of multiple tables.

@sourcery-ai
Contributor

sourcery-ai Bot commented Jun 13, 2025

Reviewer's Guide

Implements support for detecting and converting HTML <table> elements to Markdown by parsing with html5ever, integrating this conversion into the main streaming logic before reflowing tables, and adding related dependencies, documentation, and tests.

Sequence Diagram for HTML Table Processing

sequenceDiagram
    participant UserInput
    participant P as process_stream
    participant PHL as push_html_line
    participant HTML_MD as html_table_to_markdown
    participant Parser as html5ever
    participant DOM_Utils as DOMTraversalUtils
    participant Node_MD as table_node_to_markdown

    UserInput->>P: Stream of lines (Markdown with HTML table)
    P->>P: Detects HTML table start (`<table...`)
    P->>PHL: push_html_line(current_line, html_buffer, html_depth, in_html, out_buffer)
    PHL->>PHL: Accumulate HTML table lines in html_buffer
    PHL->>PHL: Update html_depth based on <table> and </table> tags
    opt HTML table complete (html_depth is 0 after </table>)
        PHL->>HTML_MD: html_table_to_markdown(html_buffer)
        HTML_MD->>Parser: parse_document(html_content)
        Parser-->>HTML_MD: DOM (RcDom)
        HTML_MD->>DOM_Utils: collect_tables(DOM)
        DOM_Utils-->>HTML_MD: List of table_nodes
        loop for each table_node
            HTML_MD->>Node_MD: table_node_to_markdown(table_node)
            Node_MD->>DOM_Utils: collect_rows(table_node)
            DOM_Utils-->>Node_MD: row_handles
            loop for each row/cell in row_handles
                Node_MD->>DOM_Utils: node_text(cell_node)
                DOM_Utils-->>Node_MD: cell_text
            end
            Node_MD->>Node_MD: Converts to Markdown lines & calls reflow_table(markdown_lines)
            Node_MD-->>HTML_MD: Reflowed Markdown table lines
        end
        HTML_MD-->>PHL: Aggregated Markdown table lines
        PHL-->>P: Adds converted lines to main output stream
    end
    P->>P: Continues processing other lines

Class Diagram: New and Modified Components for HTML Table Conversion

classDiagram
    class process_stream {
      +process_stream(lines: &[String]) Vec~String~
      #html_buf: Vec~String~ (new state)
      #html_depth: usize (new state)
      #in_html: bool (new state)
      +push_html_line(line: &str, html_buf: &mut Vec~String~, html_depth: &mut usize, in_html: &mut bool, out: &mut Vec~String~) void (new method)
    }
    class html_table_to_markdown {
      +html_table_to_markdown(lines: &[String]) Vec~String~ (new function)
    }
    class table_node_to_markdown {
      +table_node_to_markdown(table_node: &Handle) Vec~String~ (new function)
    }
    class DOMTraversalUtils {
      <<Utility Module>>
      +collect_tables(handle: &Handle, tables: &mut Vec~Handle~) (new function)
      +collect_rows(handle: &Handle, rows: &mut Vec~Handle~) (new function)
      +node_text(handle: &Handle) String (new function)
      +collect_text(handle: &Handle, out: &mut Vec~String~) (new function)
    }
    class Html5everParser {
      <<External Library html5ever>>
      +parse_document(html: String) RcDom
    }
    class RcDomHandle {
      <<External DOM Node markup5ever_rcdom.Handle>>
    }
    class ExistingReflowLogic {
      <<Existing Module>>
      +reflow_table(markdown_lines: &[String]) Vec~String~
    }

    process_stream : uses internally > push_html_line
    push_html_line : calls > html_table_to_markdown
    html_table_to_markdown : uses > Html5everParser
    Html5everParser : returns > RcDomHandle
    html_table_to_markdown : uses > DOMTraversalUtils
    html_table_to_markdown : calls > table_node_to_markdown

    table_node_to_markdown : processes > RcDomHandle
    table_node_to_markdown : uses > DOMTraversalUtils
    table_node_to_markdown : calls > ExistingReflowLogic.reflow_table

    DOMTraversalUtils : operates_on > RcDomHandle

File-Level Changes

Change Details Files

Add HTML table parsing and conversion helpers
  • Introduce DOM traversal functions (collect_text, collect_tables, collect_rows) to extract table and cell data
  • Implement node_text to normalize text nodes and skip non-content elements
  • Create table_node_to_markdown to render table rows and generate header separators
  • Build html_table_to_markdown to parse raw HTML lines into Markdown with proper indentation
  • Implement push_html_line to buffer HTML lines and detect table boundaries
src/lib.rs

Integrate HTML table support into process_stream
  • Add html_buf, html_depth, and in_html state variables
  • Detect <table> start, buffer lines, and branch processing inside HTML context
  • Flush buffered HTML on </table> or end-of-input by converting to Markdown and appending to output
src/lib.rs

Add HTML parsing dependencies
  • Include html5ever and markup5ever_rcdom in Cargo.toml
Cargo.toml

Document HTML table support
  • Create docs/html-table-support.md detailing feature and limitations
  • Update README.md to mention HTML conversion
docs/html-table-support.md
README.md

Add integration tests for HTML table conversion
  • Define html_table and multiple_tables fixtures
  • Add tests verifying single and multiple HTML table handling in process_stream
tests/integration.rs


@coderabbitai
Contributor

coderabbitai Bot commented Jun 13, 2025

Walkthrough

Support for converting embedded HTML <table> elements into Markdown tables has been added. This includes new dependencies for HTML parsing, updated documentation, README changes describing the feature, new helper functions for HTML-to-Markdown conversion, and additional integration tests verifying the new functionality.

Changes

Files/Paths Change Summary
Cargo.toml Added html5ever and markup5ever_rcdom dependencies for HTML parsing.
README.md Updated to mention HTML table to Markdown conversion feature.
docs/html-table-support.md Added new documentation describing HTML table support and conversion process.
src/lib.rs Added HTML table detection and conversion logic with new private helper functions; integrated into main processing.
tests/integration.rs Added fixtures and tests for HTML table conversion and handling multiple Markdown tables.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant process_stream
    participant HTMLParser
    participant MarkdownConverter

    User->>process_stream: Provide Markdown input (may include HTML <table>)
    process_stream->>process_stream: Detect <table> block in input
    process_stream->>HTMLParser: Parse buffered HTML table fragment
    HTMLParser-->>process_stream: Return DOM structure of table
    process_stream->>MarkdownConverter: Convert DOM table to Markdown lines
    MarkdownConverter-->>process_stream: Return Markdown table lines
    process_stream->>User: Output Markdown with converted tables and reflowed Markdown tables

Possibly related PRs

  • Implement markdown table fixer #1: Implements the original Markdown table reflow and fixer functionality, which is now extended to support HTML table conversion in this PR.

Poem

In the land of Markdown, tables abound,
But some were in HTML, hidden and round.
Now with a hop and a parser so clever,
They’re Markdown again—HTML never!
🐇✨
Hop, skip, and convert with glee,
Tables united in Markdown, as neat as can be!


Contributor

@sourcery-ai sourcery-ai Bot left a comment

Hey @leynos - I've reviewed your changes - here's some feedback:

  • node_text currently concatenates text nodes without any separating whitespace, which may run words together—consider inserting spaces or normalizing whitespace between sibling text nodes.
  • Your HTML table detection trims all leading whitespace (line.trim_end()), so converted tables lose their original indentation—preserve or reapply indentation to maintain document consistency.
  • find_table only locates the first <table> element and ignores any subsequent ones or nested tables—consider iterating through all table elements or supporting multiple HTML tables in one pass.
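
The indentation point can be illustrated with a minimal sketch. The helper names `leading_indent` and `reindent` are hypothetical and not taken from the PR; the idea is only to capture the first source line's leading whitespace and reapply it to every converted line.

```rust
/// Returns the leading whitespace of a line (hypothetical helper).
fn leading_indent(line: &str) -> &str {
    let trimmed = line.trim_start();
    &line[..line.len() - trimmed.len()]
}

/// Prefixes every converted line with the indentation of the first source line.
fn reindent(source: &[String], converted: &[String]) -> Vec<String> {
    let indent = source.first().map(|l| leading_indent(l)).unwrap_or("");
    converted.iter().map(|l| format!("{indent}{l}")).collect()
}
```

With this shape, a table that sat inside an indented block keeps its place in the document after conversion.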

## Individual Comments

### Comment 1
<location> `src/lib.rs:361` </location>
<code_context>

+        if in_html {
+            html_buf.push(line.trim_end().to_string());
+            if line.contains("</table>") {
+                out.extend(html_table_to_markdown(&html_buf));
+                html_buf.clear();
</code_context>

<issue_to_address>
Nested <table> tags not handled correctly

Track table nesting by incrementing a counter on <table> and decrementing on </table>, only ending collection when the counter reaches zero.
</issue_to_address>
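
A rough sketch of the counter approach described above, assuming a hypothetical `TableBuffer` type rather than the PR's own `push_html_line`. The caller is assumed to start pushing only once a `<table` opening has been detected.

```rust
/// Hypothetical helper: buffers the lines of an HTML table and reports when
/// the outermost `</table>` has been seen.
#[derive(Default)]
struct TableBuffer {
    lines: Vec<String>,
    depth: usize,
}

impl TableBuffer {
    /// Pushes a line; returns the buffered block once nesting returns to zero.
    fn push(&mut self, line: &str) -> Option<Vec<String>> {
        self.lines.push(line.trim_end().to_string());
        // Count every opening and closing tag on the line, not just the first.
        self.depth += line.matches("<table").count();
        let closes = line.matches("</table>").count();
        self.depth = self.depth.saturating_sub(closes);
        if self.depth == 0 {
            // Hand the finished block to the caller and reset the buffer.
            Some(std::mem::take(&mut self.lines))
        } else {
            None
        }
    }
}
```

Using `saturating_sub` keeps a stray `</table>` from underflowing the counter; whether that is the right recovery for malformed input is a design choice left open here.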


Comment thread src/lib.rs Outdated
@leynos
Owner Author

leynos commented Jun 14, 2025

@sourcery-ai review

Contributor

@sourcery-ai sourcery-ai Bot left a comment

Hey @leynos - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments

### Comment 1
<location> `src/lib.rs:103` </location>
<code_context>
+        .join(" ")
+}
+
+fn collect_text(handle: &Handle, out: &mut Vec<String>) {
+    match &handle.data {
+        NodeData::Text { contents } => out.push(contents.borrow().to_string()),
</code_context>

<issue_to_address>
Exclude non-content elements in `collect_text`

Currently, `collect_text` includes text from `<script>` and `<style>` tags, which can pollute your output. Please update it to skip these and other non-content tags.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
fn collect_text(handle: &Handle, out: &mut Vec<String>) {
    match &handle.data {
        NodeData::Text { contents } => out.push(contents.borrow().to_string()),
        NodeData::Element { .. } | NodeData::Document => {
            for child in handle.children.borrow().iter() {
                collect_text(child, out);
            }
        }
        _ => {}
    }
}
=======
fn collect_text(handle: &Handle, out: &mut Vec<String>) {
    match &handle.data {
        NodeData::Text { contents } => out.push(contents.borrow().to_string()),
        NodeData::Element { ref name, .. } => {
            // List of non-content tags to skip
            let tag = name.local.as_ref();
            if tag.eq_ignore_ascii_case("script")
                || tag.eq_ignore_ascii_case("style")
                || tag.eq_ignore_ascii_case("noscript")
                || tag.eq_ignore_ascii_case("template")
                || tag.eq_ignore_ascii_case("head")
            {
                // Skip these tags and their children
                return;
            }
            for child in handle.children.borrow().iter() {
                collect_text(child, out);
            }
        }
        NodeData::Document => {
            for child in handle.children.borrow().iter() {
                collect_text(child, out);
            }
        }
        _ => {}
    }
}
>>>>>>> REPLACE

</suggested_fix>
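
The skip-list idea can be tried without pulling in html5ever. The sketch below uses a stand-in `Node` enum, which is an assumption for illustration only; the PR itself walks `markup5ever_rcdom` handles.

```rust
/// Stand-in for a DOM node; not the rcdom type used in the PR.
enum Node {
    Text(String),
    Element { tag: String, children: Vec<Node> },
}

/// Tags whose text should not end up in a table cell.
const SKIPPED: [&str; 5] = ["script", "style", "noscript", "template", "head"];

fn collect_text(node: &Node, out: &mut Vec<String>) {
    match node {
        Node::Text(text) => out.push(text.clone()),
        Node::Element { tag, children } => {
            if SKIPPED.iter().any(|s| tag.eq_ignore_ascii_case(s)) {
                return; // skip the element and its whole subtree
            }
            for child in children {
                collect_text(child, out);
            }
        }
    }
}
```

Keeping the tag names in one constant makes the list easier to extend than a chain of `||` comparisons.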


Comment thread src/lib.rs
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🔭 Outside diff range comments (1)
Cargo.toml (1)

4-4: ⚠️ Potential issue

Check the Rust edition against the toolchain.

The "2024" edition is valid only on Rust 1.85 or later; older toolchains accept just "2015", "2018", or "2021".

If the crate must build on an older toolchain, apply this diff:

-edition = "2024"
+edition = "2021"
♻️ Duplicate comments (1)
src/lib.rs (1)

389-391: Good implementation of nested table tracking.

The depth tracking correctly handles nested tables by incrementing on <table and decrementing on </table>, addressing the previous review comment.

🧹 Nitpick comments (1)
src/lib.rs (1)

90-101: Simplify redundant whitespace normalization.

The function performs whitespace normalization twice - first when joining with spaces, then again by splitting and re-joining.

 fn node_text(handle: &Handle) -> String {
     let mut parts = Vec::new();
     collect_text(handle, &mut parts);
-    parts
-        .into_iter()
-        .filter(|s| !s.trim().is_empty())
-        .collect::<Vec<_>>()
-        .join(" ")
-        .split_whitespace()
-        .collect::<Vec<_>>()
-        .join(" ")
+    parts.join(" ").split_whitespace().collect::<Vec<_>>().join(" ")
 }
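
To check that dropping the filter step is safe, here is a small sketch comparing the two forms on plain string fragments. The function names are hypothetical; `split_whitespace` already discards blank fragments, so both should agree.

```rust
/// Original form: drop blank fragments, join, then split and join again.
fn normalise_twice(parts: &[String]) -> String {
    parts
        .iter()
        .filter(|s| !s.trim().is_empty())
        .cloned()
        .collect::<Vec<_>>()
        .join(" ")
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
}

/// Simplified form: join once, then let split_whitespace do the filtering.
fn normalise_once(parts: &[String]) -> String {
    parts
        .join(" ")
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
}
```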
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2c367ec and 252c973.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • Cargo.toml (1 hunks)
  • README.md (1 hunks)
  • docs/html-table-support.md (1 hunks)
  • src/lib.rs (5 hunks)
  • tests/integration.rs (2 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/html-table-support.md

[style] ~3-~3: Would you like to use the Oxford spelling “recognize”? The spelling ‘recognise’ is also correct.
Context: ...ablefix uses the html5ever parser to recognise simple <table> elements embedded in M...

(OXFORD_SPELLING_Z_NOT_S)

⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: coverage
  • GitHub Check: build-test
🔇 Additional comments (3)
README.md (1)

6-6: LGTM!

The README update accurately describes the new HTML table conversion feature.

docs/html-table-support.md (1)

1-11: Well-documented feature description.

The documentation clearly explains the HTML table support feature, including its limitations to basic table elements.


tests/integration.rs (1)

66-148: Comprehensive test coverage for new features.

The tests thoroughly verify both HTML table conversion and multiple table handling. The expected outputs correctly show markdown tables with proper formatting and header separators.

Comment thread Cargo.toml Outdated
Comment thread src/lib.rs
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
src/lib.rs (3)

401-437: Duplicated HTML buffering logic – extract a helper

The block that pushes to html_buf, updates html_depth, and flushes when depth hits zero is duplicated almost verbatim in two branches. This hurts readability and makes future fixes error-prone.

Consider extracting a small function or closure, e.g. handle_html_line(line), and call it from both places.

This removes ~25 lines of duplication and aligns with DRY.


90-101: node_text performs two extra allocations per join/split

The text is first joined, then immediately split with split_whitespace and joined again, causing avoidable allocations.

A single pass with an iterator adapter gives the same normalisation:

-fn node_text(handle: &Handle) -> String {
-    let mut parts = Vec::new();
-    collect_text(handle, &mut parts);
-    parts
-        .into_iter()
-        .filter(|s| !s.trim().is_empty())
-        .collect::<Vec<_>>()
-        .join(" ")
-        .split_whitespace()
-        .collect::<Vec<_>>()
-        .join(" ")
+fn node_text(handle: &Handle) -> String {
+    let mut parts = Vec::new();
+    collect_text(handle, &mut parts);
+    parts
+        .iter()
+        .flat_map(|s| s.split_whitespace())
+        .collect::<Vec<_>>()
+        .join(" ")
 }

Minor, but worthwhile when walking large DOMs.


199-204: Use parse_fragment for snippets instead of parse_document

html_table_to_markdown feeds a fragment (no <html> / <body>) into parse_document.
html5ever will implicitly wrap it, but parse_fragment is the intended API for isolated snippets and avoids some edge-case quirks (e.g. incorrect foster parenting around <table>).

-    let opts = ParseOpts::default();
-    let dom: RcDom = parse_document(RcDom::default(), opts).one(html);
+    let dom: RcDom = parse_fragment(
+        RcDom::default(),
+        ParseOpts::default(),
+        QualName::new(None, ns!(html), local_name!("body")),
+        Vec::new(),
+    )
+    .one(html);

Requires use html5ever::{local_name, ns, parse_fragment, QualName}; some html5ever versions also need namespace_url in scope for the ns! macro.

Not critical but brings the code in line with html5ever best practice.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 252c973 and f4de0a8.

📒 Files selected for processing (1)
  • src/lib.rs (5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: coverage
  • GitHub Check: build-test
🔇 Additional comments (1)
src/lib.rs (1)

467-470: Residual buffered HTML lost on early return scenarios

html_buf is flushed after the main loop, but only if the function wasn’t exited earlier.
Currently, the only early exits are through continue, so this is safe; however, should future refactors introduce a return inside the loop, this flush would be skipped.

A safer pattern is to wrap the loop body in a closure or move the flush into a Drop guard.

Comment thread src/lib.rs
@leynos
Owner Author

leynos commented Jun 14, 2025

@sourcery-ai review

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
src/lib.rs (1)

160-176: Header-separator width can still be wrong
Same issue as previously reported: col_count is fixed from the first row, so later rows with extra cells produce an undersized | --- | separator.

See earlier review; the fix is to track the maximum column count:

-        if i == 0 {
-            first_header = header_row;
-            col_count = cells.len();
-        }
+        col_count = col_count.max(cells.len());
+        if i == 0 {
+            first_header = header_row;
+        }
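
As a self-contained illustration of sizing the separator from the widest row, here is a sketch that works on plain cell strings. `rows_to_markdown` is a hypothetical name, and padding short rows with empty cells is an extra assumption not stated in the review.

```rust
/// Renders rows as Markdown table lines; the first row is treated as the header.
fn rows_to_markdown(rows: &[Vec<String>]) -> Vec<String> {
    // Size the separator from the widest row, not just the first one.
    let col_count = rows.iter().map(Vec::len).max().unwrap_or(0);
    let mut out = Vec::new();
    for (i, row) in rows.iter().enumerate() {
        // Assumption: pad short rows with empty cells so every line has col_count columns.
        let mut cells = row.clone();
        cells.resize(col_count, String::new());
        out.push(format!("| {} |", cells.join(" | ")));
        if i == 0 {
            out.push(format!("|{}", " --- |".repeat(col_count)));
        }
    }
    out
}
```

Computing the maximum up front avoids having to patch the separator after later rows turn out to be wider.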
🧹 Nitpick comments (3)
src/lib.rs (3)

5-6: Prefer making the new HTML-parsing dependencies optional

html5ever and markup5ever_rcdom add ~2 MB to build artefacts and noticeably increase compile time.
Consider putting them behind a html-tables Cargo feature so existing users who only need Markdown reflow don’t pay the cost.

-[dependencies]
-html5ever = "0.27"
-markup5ever_rcdom = "0.2"
+[dependencies]
+html5ever = { version = "0.27", optional = true }
+markup5ever_rcdom = { version = "0.2", optional = true }
+
+[features]
+html-tables = ["html5ever", "markup5ever_rcdom"]

90-98: node_text does two passes over whitespace

parts.join(" ") immediately followed by split_whitespace() + join(" ") allocates twice and traverses the string three times.

A single split_whitespace() on the raw text is enough:

-fn node_text(handle: &Handle) -> String {
-    let mut parts = Vec::new();
-    collect_text(handle, &mut parts);
-    parts
-        .join(" ")
-        .split_whitespace()
-        .collect::<Vec<_>>()
-        .join(" ")
-}
+fn node_text(handle: &Handle) -> String {
+    let mut parts = Vec::new();
+    collect_text(handle, &mut parts);
+    parts
+        .iter()
+        .flat_map(|s| s.split_whitespace())
+        .collect::<Vec<_>>()
+        .join(" ")
+}

(You can simplify further by building the output with a String and pushing spaces manually.)
Minor, but worth tidying.


214-231: Tag detection is case-sensitive and may mis-count multiple tags per line

html_depth += line.matches("<table").count(); and the </table> counterpart ignore <TABLE> or mixed-case tags and skip self-closing shorthand like <table/>.

Consider normalising the line (line.to_ascii_lowercase()) and counting all occurrences of "<table" / "</table" regardless of case.
Also increment/decrement for every closing tag, not only when contains returns true (you already have .matches() doing the count).

Low-risk edge case, but cheap to harden.
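
A minimal sketch of the case-insensitive count suggested above. `count_table_tags` is a hypothetical helper; it still works on raw text, so it would also match an unrelated tag that merely begins with `<table`.

```rust
/// Returns (opening, closing) `<table>` tag counts for one line, ignoring case.
fn count_table_tags(line: &str) -> (usize, usize) {
    // Lowercase once so both searches see the same normalised text.
    let lower = line.to_ascii_lowercase();
    let opens = lower.matches("<table").count();
    let closes = lower.matches("</table").count();
    (opens, closes)
}
```

The caller can then add `opens` to the depth and subtract `closes`, which handles several tags on one line.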

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f4de0a8 and 2e7ab79.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (2)
  • Cargo.toml (1 hunks)
  • src/lib.rs (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • Cargo.toml
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: build-test
  • GitHub Check: coverage
🔇 Additional comments (1)
src/lib.rs (1)

466-468: Unbalanced <table> blocks silently converted

If the input starts a <table> and never closes it, html_buf is flushed as a table at EOF, producing partial or garbled Markdown.

Please confirm this is intentional. You might instead emit the original HTML when html_depth != 0:

if !html_buf.is_empty() {
    if html_depth == 0 {
        out.extend(html_table_to_markdown(&html_buf));
    } else {
        out.extend(html_buf);
    }
}

Comment thread src/lib.rs
Comment on lines +186 to +212
fn html_table_to_markdown(lines: &[String]) -> Vec<String> {
    let indent: String = lines
        .first()
        .map(|l| l.chars().take_while(|c| c.is_whitespace()).collect())
        .unwrap_or_default();
    let html: String = lines
        .iter()
        .map(|l| l.trim_end())
        .collect::<Vec<_>>()
        .join("\n");
    let opts = ParseOpts::default();
    let dom: RcDom = parse_document(RcDom::default(), opts).one(html);

    let mut tables = Vec::new();
    collect_tables(&dom.document, &mut tables);
    if tables.is_empty() {
        return lines.to_vec();
    }

    let mut out = Vec::new();
    for table in tables {
        for line in table_node_to_markdown(&table) {
            out.push(format!("{indent}{line}"));
        }
    }
    out
}
Contributor

⚠️ Potential issue

Nested tables are emitted twice

collect_tables gathers every <table> node, including those already nested inside another table. Converting each node individually duplicates inner tables in the output.

Only emit the outermost tables by not descending into a table once it has been collected:

fn collect_tables(handle: &Handle, tables: &mut Vec<Handle>) {
    if is_table(handle) {
        // Record the outermost table and stop, so nested tables are not collected again.
        tables.push(handle.clone());
        return;
    }
    for child in handle.children.borrow().iter() {
        collect_tables(child, tables);
    }
}

fn is_table(h: &Handle) -> bool {
    matches!(h.data, NodeData::Element { ref name, .. } if name.local.as_ref() == "table")
}
🤖 Prompt for AI Agents
In src/lib.rs around lines 186 to 212, the function html_table_to_markdown
collects all table nodes including nested ones, causing nested tables to be
emitted twice. To fix this, implement an is_table helper to check if a node is
a table, then update collect_tables to push a table node and return without
descending into its children, so that only outermost tables are collected.

@sourcery-ai sourcery-ai Bot left a comment

Hey @leynos - I've reviewed your changes - here's some feedback:

  • The HTML table start detection using line.trim_start().starts_with("<table") is brittle—consider using a regex or parser-based check to handle attributes and uppercase tags.
  • Counting <table and </table> occurrences via .matches(...) is naive and case-sensitive; you should use the parsed DOM or a case-insensitive approach to reliably track nested tables.
  • The line-based buffering for HTML tables adds complexity to process_stream; extracting HTML-to-Markdown conversion into a separate preprocessing step could simplify the main loop.
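
The first two points above could be addressed without a full parser by matching tag names case-insensitively and checking that the name actually ends after `table`. The following is a minimal sketch, not the project's implementation: `starts_html_table`, `table_depth_delta`, `count_tags` and `ends_tag_name` are hypothetical names introduced here for illustration. It still does not account for tags inside comments or CDATA, so a DOM-based check would remain the more reliable option.

```rust
/// True when the byte following a tag name terminates that name.
/// (Hypothetical helper: `<tablet>` must not be treated as `<table>`.)
fn ends_tag_name(next: Option<u8>) -> bool {
    matches!(next, None | Some(b'>' | b'/' | b' ' | b'\t'))
}

/// True when `line` begins (after indentation) with an opening table tag,
/// regardless of case and of any attributes that follow the tag name.
fn starts_html_table(line: &str) -> bool {
    let bytes = line.trim_start().as_bytes();
    bytes.len() >= 6
        && bytes[..6].eq_ignore_ascii_case(b"<table")
        && ends_tag_name(bytes.get(6).copied())
}

/// Counts occurrences of `tag` in an already lowercased line, keeping only
/// matches where the tag name is properly terminated.
fn count_tags(lower: &str, tag: &str) -> i32 {
    lower
        .match_indices(tag)
        .filter(|(i, _)| ends_tag_name(lower.as_bytes().get(i + tag.len()).copied()))
        .count() as i32
}

/// Net change in table nesting depth contributed by one line:
/// +1 per opening `<table` tag, -1 per closing `</table` tag.
fn table_depth_delta(line: &str) -> i32 {
    // ASCII lowercasing keeps byte offsets intact, so indices stay valid.
    let lower = line.to_ascii_lowercase();
    count_tags(&lower, "<table") - count_tags(&lower, "</table")
}
```

Because `</table` never contains the substring `<table`, the two counts do not interfere with each other.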


leynos commented Jun 14, 2025

@coderabbitai please create the following GitHub issues:

The HTML table start detection using line.trim_start().starts_with("<table") is brittle—consider using a regex or parser-based check to handle attributes and uppercase tags.

Counting <table and </table> occurrences via .matches(...) is naive and case-sensitive; you should use the parsed DOM or a case-insensitive approach to reliably track nested tables.

The line-based buffering for HTML tables adds complexity to process_stream; extracting HTML-to-Markdown conversion into a separate preprocessing step could simplify the main loop.


coderabbitai Bot commented Jun 14, 2025

🧩 Analysis chain

@leynos I'll create those GitHub issues for you right away.


🏁 Script executed:

#!/bin/bash

# Create first issue about HTML table start detection
gh issue create \
  --title "Improve HTML table start detection robustness" \
  --body "The current HTML table start detection using \`line.trim_start().starts_with(\"<table\")\` is brittle and doesn't handle:

- HTML tables with attributes (e.g., \`<table class=\"foo\">\`)
- Uppercase tags (e.g., \`<TABLE>\`)
- Mixed case scenarios

**Suggested improvement:**
Consider using a regex pattern or parser-based check to handle these cases more reliably.

**Related:**
- PR: https://github.com/leynos/mdtablefix/pull/13
- Comment: https://github.com/leynos/mdtablefix/pull/13#issuecomment-$(gh api repos/leynos/mdtablefix/pulls/13/comments --jq '.[-1].id')" \
  --assignee leynos

# Create second issue about table nesting detection  
gh issue create \
  --title "Fix naive HTML table nesting detection" \
  --body "The current approach of counting \`<table\` and \`</table>\` occurrences using \`.matches(...)\` is problematic:

- Case-sensitive matching (misses \`<TABLE>\` or \`</TABLE>\`)
- Naive string counting doesn't account for tables within comments or CDATA
- Could incorrectly handle malformed HTML

**Suggested improvement:**
Use the parsed DOM structure or implement a case-insensitive approach to reliably track nested table boundaries.

**Related:**
- PR: https://github.com/leynos/mdtablefix/pull/13
- Comment: https://github.com/leynos/mdtablefix/pull/13#issuecomment-$(gh api repos/leynos/mdtablefix/pulls/13/comments --jq '.[-1].id')" \
  --assignee leynos

# Create third issue about architectural complexity
gh issue create \
  --title "Refactor HTML table processing to separate preprocessing step" \
  --body "The current line-based buffering for HTML tables adds significant complexity to the main \`process_stream\` function with additional state variables (\`html_buf\`, \`html_depth\`, \`in_html\`).

**Suggested improvement:**
Extract HTML-to-Markdown conversion into a separate preprocessing step that:
- Runs before the main table reflow logic
- Simplifies the main processing loop
- Improves separation of concerns
- Makes the code more maintainable and testable

This would create a cleaner architecture where HTML tables are converted to Markdown tables first, then processed through the existing Markdown table reflow logic.

**Related:**
- PR: https://github.com/leynos/mdtablefix/pull/13
- Comment: https://github.com/leynos/mdtablefix/pull/13#issuecomment-$(gh api repos/leynos/mdtablefix/pulls/13/comments --jq '.[-1].id')" \
  --assignee leynos

echo "Created 3 GitHub issues for HTML table implementation improvements"

Length of output: 2639


Perfect! I've successfully created three GitHub issues for the HTML table implementation improvements you identified:

  1. Issue #15: Improve HTML table start detection robustness
    • Addresses the brittle <table detection that doesn't handle attributes or case variations
  2. Issue #16: Fix naive HTML table nesting detection
    • Covers the case-sensitive string counting approach for tracking nested tables
  3. Issue #17: Refactor HTML table processing to separate preprocessing step
    • Addresses the architectural complexity in the main process_stream function

All issues have been assigned to you and include backlinks to this PR and comment for context. These can now be tackled independently as follow-up improvements to the HTML table conversion feature.
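
For the third issue, one possible shape of a separate preprocessing pass is sketched below. This is an assumption about how the refactor might look, not code from the PR: `convert_html_tables` is a hypothetical name, and the `convert` parameter stands in for whatever function turns buffered HTML lines into Markdown (for example `html_table_to_markdown`). The buffering state lives entirely inside this pass, so the main reflow loop would only ever see Markdown.

```rust
/// Hypothetical preprocessing pass: replaces each buffered HTML table with
/// the lines returned by `convert` and leaves every other line untouched.
fn convert_html_tables<F>(lines: &[String], convert: F) -> Vec<String>
where
    F: Fn(&[String]) -> Vec<String>,
{
    let mut out = Vec::new();
    let mut buf: Vec<String> = Vec::new();
    let mut depth = 0i32;

    for line in lines {
        let lower = line.to_ascii_lowercase();
        // Outside a table, only a line that opens one starts buffering.
        if depth == 0 && !lower.trim_start().starts_with("<table") {
            out.push(line.clone());
            continue;
        }
        buf.push(line.clone());
        // "</table" does not contain "<table", so the counts are independent.
        depth += lower.matches("<table").count() as i32;
        depth -= lower.matches("</table").count() as i32;
        if depth <= 0 {
            // The outermost table has closed: convert the whole buffer.
            out.extend(convert(&buf));
            buf.clear();
            depth = 0;
        }
    }
    // An unterminated table is passed through unchanged.
    out.extend(buf);
    out
}
```

A caller could then run something like `convert_html_tables(&lines, html_table_to_markdown)` before handing the result to the existing Markdown table reflow logic.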

@leynos leynos merged commit 7546636 into main Jun 14, 2025
3 checks passed
@leynos leynos deleted the codex/add-html5ever-support-for-markdown-tables branch June 14, 2025 01:26