Conversation
Reviewer's GuideImplements support for detecting and converting HTML elements to Markdown by parsing with html5ever, integrating this conversion into the main streaming logic before reflowing tables, and adding related dependencies, documentation, and tests.Sequence Diagram for HTML Table ProcessingsequenceDiagram
participant UserInput
participant P as process_stream
participant PHL as push_html_line
participant HTML_MD as html_table_to_markdown
participant Parser as html5ever
participant DOM_Utils as DOMTraversalUtils
participant Node_MD as table_node_to_markdown
UserInput->>P: Stream of lines (Markdown with HTML table)
P->>P: Detects HTML table start (`<table...`)
P->>PHL: push_html_line(current_line, html_buffer, html_depth, in_html, out_buffer)
PHL->>PHL: Accumulate HTML table lines in html_buffer
PHL->>PHL: Update html_depth based on <table /table> tags
opt HTML table complete (html_depth is 0 after </table>)
PHL->>HTML_MD: html_table_to_markdown(html_buffer)
HTML_MD->>Parser: parse_document(html_content)
Parser-->>HTML_MD: DOM (RcDom)
HTML_MD->>DOM_Utils: collect_tables(DOM)
DOM_Utils-->>HTML_MD: List of table_nodes
loop for each table_node
HTML_MD->>Node_MD: table_node_to_markdown(table_node)
Node_MD->>DOM_Utils: collect_rows(table_node)
DOM_Utils-->>Node_MD: row_handles
loop for each row/cell in row_handles
Node_MD->>DOM_Utils: node_text(cell_node)
DOM_Utils-->>Node_MD: cell_text
end
Node_MD->>Node_MD: Converts to Markdown lines & calls reflow_table(markdown_lines)
Node_MD-->>HTML_MD: Reflowed Markdown table lines
end
HTML_MD-->>PHL: Aggregated Markdown table lines
PHL-->>P: Adds converted lines to main output stream
end
P->>P: Continues processing other lines
Class Diagram: New and Modified Components for HTML Table ConversionclassDiagram
class process_stream {
+process_stream(lines: &[String]) Vec~String~
#html_buf: Vec~String~ (new state)
#html_depth: usize (new state)
#in_html: bool (new state)
+push_html_line(line: &str, html_buf: &mut Vec~String~, html_depth: &mut usize, in_html: &mut bool, out: &mut Vec~String~) void (new method)
}
class html_table_to_markdown {
+html_table_to_markdown(lines: &[String]) Vec~String~ (new function)
}
class table_node_to_markdown {
+table_node_to_markdown(table_node: &Handle) Vec~String~ (new function)
}
class DOMTraversalUtils {
<<Utility Module>>
+collect_tables(handle: &Handle, tables: &mut Vec~Handle~) (new function)
+collect_rows(handle: &Handle, rows: &mut Vec~Handle~) (new function)
+node_text(handle: &Handle) String (new function)
+collect_text(handle: &Handle, out: &mut Vec~String~) (new function)
}
class Html5everParser {
<<External Library html5ever>>
+parse_document(html: String) RcDom
}
class RcDomHandle {
<<External DOM Node markup5ever_rcdom.Handle>>
}
class ExistingReflowLogic {
<<Existing Module>>
+reflow_table(markdown_lines: &[String]) Vec~String~
}
process_stream : uses internally > push_html_line
push_html_line : calls > html_table_to_markdown
html_table_to_markdown : uses > Html5everParser
Html5everParser : returns > RcDomHandle
html_table_to_markdown : uses > DOMTraversalUtils
html_table_to_markdown : calls > table_node_to_markdown
table_node_to_markdown : processes > RcDomHandle
table_node_to_markdown : uses > DOMTraversalUtils
table_node_to_markdown : calls > ExistingReflowLogic.reflow_table
DOMTraversalUtils : operates_on > RcDomHandle
File-Level Changes
|
WalkthroughSupport for converting embedded HTML Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant process_stream
participant HTMLParser
participant MarkdownConverter
User->>process_stream: Provide Markdown input (may include HTML <table>)
process_stream->>process_stream: Detect <table> block in input
process_stream->>HTMLParser: Parse buffered HTML table fragment
HTMLParser-->>process_stream: Return DOM structure of table
process_stream->>MarkdownConverter: Convert DOM table to Markdown lines
MarkdownConverter-->>process_stream: Return Markdown table lines
process_stream->>User: Output Markdown with converted tables and reflowed Markdown tables
Possibly related PRs
Poem
✨ Finishing Touches
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Hey @leynos - I've reviewed your changes - here's some feedback:
- node_text currently concatenates text nodes without any separating whitespace, which may run words together—consider inserting spaces or normalizing whitespace between sibling text nodes.
- Your HTML table detection trims all leading whitespace (
line.trim_end()), so converted tables lose their original indentation—preserve or reapply indentation to maintain document consistency. - find_table only locates the first
<table>element and ignores any subsequent ones or nested tables—consider iterating through all table elements or supporting multiple HTML tables in one pass.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- node_text currently concatenates text nodes without any separating whitespace, which may run words together—consider inserting spaces or normalizing whitespace between sibling text nodes.
- Your HTML table detection trims all leading whitespace (`line.trim_end()`), so converted tables lose their original indentation—preserve or reapply indentation to maintain document consistency.
- find_table only locates the first `<table>` element and ignores any subsequent ones or nested tables—consider iterating through all table elements or supporting multiple HTML tables in one pass.
## Individual Comments
### Comment 1
<location> `src/lib.rs:361` </location>
<code_context>
+ if in_html {
+ html_buf.push(line.trim_end().to_string());
+ if line.contains("</table>") {
+ out.extend(html_table_to_markdown(&html_buf));
+ html_buf.clear();
</code_context>
<issue_to_address>
Nested <table> tags not handled correctly
Track table nesting by incrementing a counter on <table> and decrementing on </table>, only ending collection when the counter reaches zero.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
|
@sourcery-ai review |
There was a problem hiding this comment.
Hey @leynos - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `src/lib.rs:103` </location>
<code_context>
+ .join(" ")
+}
+
+fn collect_text(handle: &Handle, out: &mut Vec<String>) {
+ match &handle.data {
+ NodeData::Text { contents } => out.push(contents.borrow().to_string()),
</code_context>
<issue_to_address>
Exclude non-content elements in `collect_text`
Currently, `collect_text` includes text from `<script>` and `<style>` tags, which can pollute your output. Please update it to skip these and other non-content tags.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
fn collect_text(handle: &Handle, out: &mut Vec<String>) {
match &handle.data {
NodeData::Text { contents } => out.push(contents.borrow().to_string()),
NodeData::Element { .. } | NodeData::Document => {
for child in handle.children.borrow().iter() {
collect_text(child, out);
}
}
_ => {}
}
}
=======
fn collect_text(handle: &Handle, out: &mut Vec<String>) {
match &handle.data {
NodeData::Text { contents } => out.push(contents.borrow().to_string()),
NodeData::Element { ref name, .. } => {
// List of non-content tags to skip
let tag = name.local.as_ref();
if tag.eq_ignore_ascii_case("script")
|| tag.eq_ignore_ascii_case("style")
|| tag.eq_ignore_ascii_case("noscript")
|| tag.eq_ignore_ascii_case("template")
|| tag.eq_ignore_ascii_case("head")
{
// Skip these tags and their children
return;
}
for child in handle.children.borrow().iter() {
collect_text(child, out);
}
}
NodeData::Document => {
for child in handle.children.borrow().iter() {
collect_text(child, out);
}
}
_ => {}
}
}
>>>>>>> REPLACE
</suggested_fix>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
There was a problem hiding this comment.
Actionable comments posted: 2
🔭 Outside diff range comments (1)
Cargo.toml (1)
4-4:⚠️ Potential issueFix invalid Rust edition.
The edition
"2024"is not a valid Rust edition. Valid editions are "2015", "2018", or "2021".Apply this diff to fix the edition:
-edition = "2024" +edition = "2021"
♻️ Duplicate comments (1)
src/lib.rs (1)
389-391: Good implementation of nested table tracking.The depth tracking correctly handles nested tables by incrementing on
<tableand decrementing on</table>, addressing the previous review comment.
🧹 Nitpick comments (1)
src/lib.rs (1)
90-101: Simplify redundant whitespace normalization.The function performs whitespace normalization twice - first when joining with spaces, then again by splitting and re-joining.
fn node_text(handle: &Handle) -> String { let mut parts = Vec::new(); collect_text(handle, &mut parts); - parts - .into_iter() - .filter(|s| !s.trim().is_empty()) - .collect::<Vec<_>>() - .join(" ") - .split_whitespace() - .collect::<Vec<_>>() - .join(" ") + parts.join(" ").split_whitespace().collect::<Vec<_>>().join(" ") }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (5)
Cargo.toml(1 hunks)README.md(1 hunks)docs/html-table-support.md(1 hunks)src/lib.rs(5 hunks)tests/integration.rs(2 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/html-table-support.md
[style] ~3-~3: Would you like to use the Oxford spelling “recognize”? The spelling ‘recognise’ is also correct.
Context: ...ablefixuses thehtml5everparser to recognise simple
(OXFORD_SPELLING_Z_NOT_S)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: coverage
- GitHub Check: build-test
🔇 Additional comments (3)
README.md (1)
6-6: LGTM!The README update accurately describes the new HTML table conversion feature.
docs/html-table-support.md (1)
1-11: Well-documented feature description.The documentation clearly explains the HTML table support feature, including its limitations to basic table elements.
🧰 Tools
🪛 LanguageTool
[style] ~3-~3: Would you like to use the Oxford spelling “recognize”? The spelling ‘recognise’ is also correct.
Context: ...ablefixuses thehtml5everparser to recognise simple
(OXFORD_SPELLING_Z_NOT_S)
tests/integration.rs (1)
66-148: Comprehensive test coverage for new features.
The tests thoroughly verify both HTML table conversion and multiple table handling. The expected outputs correctly show markdown tables with proper formatting and header separators.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
src/lib.rs (3)
401-437: Duplicated HTML buffering logic – extract a helperThe block that pushes to
html_buf, updateshtml_depth, and flushes when depth hits zero is duplicated almost verbatim in two branches. This hurts readability and makes future fixes error-prone.Consider extracting a small function or closure, e.g.
handle_html_line(line), and call it from both places.This removes ~25 lines of duplication and aligns with DRY.
90-101:node_textperforms two extra allocations per join/splitThe text is first
joined, then immediatelysplit_whitespaceed andjoined again, causing avoidable allocations.A single pass with an iterator adapter gives the same normalisation:
-fn node_text(handle: &Handle) -> String { - let mut parts = Vec::new(); - collect_text(handle, &mut parts); - parts - .into_iter() - .filter(|s| !s.trim().is_empty()) - .collect::<Vec<_>>() - .join(" ") - .split_whitespace() - .collect::<Vec<_>>() - .join(" ") +fn node_text(handle: &Handle) -> String { + let mut parts = Vec::new(); + collect_text(handle, &mut parts); + parts + .into_iter() + .flat_map(|s| s.split_whitespace().map(str::to_owned)) + .collect::<Vec<_>>() + .join(" ") }Minor, but worthwhile when walking large DOMs.
199-204: Useparse_fragmentfor snippets instead ofparse_document
html_table_to_markdownfeeds a fragment (no<html>/<body>) intoparse_document.
html5everwill implicitly wrap it, butparse_fragmentis the intended API for isolated snippets and avoids some edge-case quirks (e.g. incorrect foster parenting around<table>).- let opts = ParseOpts::default(); - let dom: RcDom = parse_document(RcDom::default(), opts).one(html); + let dom: RcDom = parse_fragment( + RcDom::default(), + ParseOpts::default(), + QualName::new(None, ns!(html), local_name!("body")), + Vec::new(), + ) + .one(html);Requires
use html5ever::{parse_fragment, QualName};.Not critical but brings the code in line with html5ever best practice.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/lib.rs(5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: coverage
- GitHub Check: build-test
🔇 Additional comments (1)
src/lib.rs (1)
467-470: Residual buffered HTML lost on early return scenarios
html_bufis flushed after the main loop, but only if the function wasn’t exited earlier.
Currently, the only early exits are throughcontinue, so this is safe; however, should future refactors introduce areturninside the loop, this flush would be skipped.A safer pattern is to wrap the loop body in a closure or move the flush into a
Dropguard.
|
@sourcery-ai review |
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/lib.rs (1)
160-176: Header-separator width can still be wrong
Same issue as previously reported:col_countis fixed from the first row, so later rows with extra cells produce an undersized| --- |separator.See earlier review; the fix is to track the maximum column count:
- if i == 0 { - first_header = header_row; - col_count = cells.len(); - } + col_count = col_count.max(cells.len()); + if i == 0 { + first_header = header_row; + }
🧹 Nitpick comments (3)
src/lib.rs (3)
5-6: Prefer making the new HTML-parsing dependencies optional
html5everandmarkup5ever_rcdomadd ~2 MB to build artefacts and noticeably increase compile time.
Consider putting them behind ahtml-tablesCargo feature so existing users who only need Markdown reflow don’t pay the cost.-[dependencies] -html5ever = "0.27" -markup5ever_rcdom = "0.2" +[dependencies] +html5ever = { version = "0.27", optional = true } +markup5ever_rcdom = { version = "0.2", optional = true } + +[features] +html-tables = ["html5ever", "markup5ever_rcdom"]
90-98:node_textdoes two passes over whitespace
parts.join(" ")immediately followed bysplit_whitespace()+join(" ")allocates twice and traverses the string three times.A single
split_whitespace()on the raw text is enough:-fn node_text(handle: &Handle) -> String { - let mut parts = Vec::new(); - collect_text(handle, &mut parts); - parts - .join(" ") - .split_whitespace() - .collect::<Vec<_>>() - .join(" ") -} +fn node_text(handle: &Handle) -> String { + let mut parts = Vec::new(); + collect_text(handle, &mut parts); + parts + .join(" ") + .split_whitespace() + .collect::<Vec<_>>() + .join(" ") +}(You can simplify further by building the output with a
Stringand pushing spaces manually.)
Minor, but worth tidying.
214-231: Tag detection is case-sensitive and may mis-count multiple tags per line
html_depth += line.matches("<table").count();and the</table>counterpart ignore<TABLE>or mixed-case tags and skip self-closing shorthand like<table/>.Consider normalising the line (
line.to_ascii_lowercase()) and counting all occurrences of"<table"/"</table"regardless of case.
Also increment/decrement for every closing tag, not only whencontainsreturns true (you already have.matches()doing the count).Low-risk edge case, but cheap to harden.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (2)
Cargo.toml(1 hunks)src/lib.rs(5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- Cargo.toml
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: build-test
- GitHub Check: coverage
🔇 Additional comments (1)
src/lib.rs (1)
466-468: Unbalanced<table>blocks silently convertedIf the input starts a
<table>and never closes it,html_bufis flushed as a table at EOF, producing partial or garbled Markdown.Please confirm this is intentional. You might instead emit the original HTML when
html_depth != 0:if !html_buf.is_empty() { if html_depth == 0 { out.extend(html_table_to_markdown(&html_buf)); } else { out.extend(html_buf); } }
| fn html_table_to_markdown(lines: &[String]) -> Vec<String> { | ||
| let indent: String = lines | ||
| .first() | ||
| .map(|l| l.chars().take_while(|c| c.is_whitespace()).collect()) | ||
| .unwrap_or_default(); | ||
| let html: String = lines | ||
| .iter() | ||
| .map(|l| l.trim_end()) | ||
| .collect::<Vec<_>>() | ||
| .join("\n"); | ||
| let opts = ParseOpts::default(); | ||
| let dom: RcDom = parse_document(RcDom::default(), opts).one(html); | ||
|
|
||
| let mut tables = Vec::new(); | ||
| collect_tables(&dom.document, &mut tables); | ||
| if tables.is_empty() { | ||
| return lines.to_vec(); | ||
| } | ||
|
|
||
| let mut out = Vec::new(); | ||
| for table in tables { | ||
| for line in table_node_to_markdown(&table) { | ||
| out.push(format!("{indent}{line}")); | ||
| } | ||
| } | ||
| out | ||
| } |
There was a problem hiding this comment.
Nested tables are emitted twice
collect_tables gathers every <table> node, including those already nested inside another table. Converting each node individually duplicates inner tables in the output.
Track depth and only emit the outermost tables or filter out tables whose parent is also a <table>:
fn collect_tables(handle: &Handle, tables: &mut Vec<Handle>) {
if let NodeData::Element { name, .. } = &handle.data {
if name.local.as_ref() == "table" {
if handle.parent().map_or(true, |p| !is_table(&p)) {
tables.push(handle.clone());
}
}
}
for child in handle.children.borrow().iter() {
collect_tables(child, tables);
}
}
fn is_table(h: &Handle) -> bool {
matches!(h.data, NodeData::Element { ref name, .. } if name.local.as_ref() == "table")
}🤖 Prompt for AI Agents
In src/lib.rs around lines 186 to 212, the function html_table_to_markdown
collects all table nodes including nested ones, causing nested tables to be
emitted twice. To fix this, modify the collect_tables function to only add
tables whose parent is not a table, effectively filtering out nested tables.
Implement an is_table helper to check if a node is a table, then update
collect_tables to check the parent node before adding the current table node to
the list.
There was a problem hiding this comment.
Hey @leynos - I've reviewed your changes - here's some feedback:
- The HTML table start detection using
line.trim_start().starts_with("<table")is brittle—consider using a regex or parser-based check to handle attributes and uppercase tags. - Counting
<tableand</table>occurrences via.matches(...)is naive and case-sensitive; you should use the parsed DOM or a case-insensitive approach to reliably track nested tables. - The line-based buffering for HTML tables adds complexity to
process_stream; extracting HTML-to-Markdown conversion into a separate preprocessing step could simplify the main loop.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The HTML table start detection using `line.trim_start().starts_with("<table")` is brittle—consider using a regex or parser-based check to handle attributes and uppercase tags.
- Counting `<table` and `</table>` occurrences via `.matches(...)` is naive and case-sensitive; you should use the parsed DOM or a case-insensitive approach to reliably track nested tables.
- The line-based buffering for HTML tables adds complexity to `process_stream`; extracting HTML-to-Markdown conversion into a separate preprocessing step could simplify the main loop.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
|
@coderabbitai please create the following Github issues: |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain@leynos I'll create those GitHub issues for you right away. 🏁 Script executed: Length of output: 2639 Perfect! I've successfully created three GitHub issues for the HTML table implementation improvements you identified:
All issues have been assigned to you and include backlinks to this PR and comment for context. These can now be tackled independently as follow-up improvements to the HTML table conversion feature. |
Summary
<table>elements using html5ever and convert them to Markdownprocess_streamTesting
cargo fmt --allcargo clippy -- -D warningscargo testmarkdownlint README.md docs/html-table-support.md docs/rust-testing-with-rstest-fixtures.md(fails: various MD013 and MD040 errors)https://chatgpt.com/codex/tasks/task_e_684caa9f4f7c8322bdc5b35423c8eb94
Summary by Sourcery
Enable support for simple HTML tables by parsing them with html5ever, converting to Markdown, and then reflowing using existing logic; include tests and documentation for this feature.
New Features:
Enhancements:
Build:
Documentation:
Tests:
Summary by CodeRabbit