diff --git a/README.md b/README.md index 24ad540f..c9a8b0bb 100644 --- a/README.md +++ b/README.md @@ -159,12 +159,14 @@ Only simple tables composed of `<tr>`, `<th>`, and `<td>` tags are supported. Tag case and attributes are ignored. After conversion, they are reformatted alongside regular Markdown tables. -See [HTML table support for more details](docs/html-table-support.md). +See +[HTML table support for more details](docs/architecture.md#html-table-support-in-mdtablefix) +. ## Module structure -For an overview of how the crate's internal modules relate to each other, see -[Module relationships](docs/module-relationships.md). +For an overview of how the crate's internal modules relate to each other, see \ +[Module relationships](docs/architecture.md#module-relationships). ## Testing diff --git a/docs/architecture.md b/docs/architecture.md new file mode 100644 index 00000000..7d706df2 --- /dev/null +++ b/docs/architecture.md @@ -0,0 +1,307 @@ +# Architecture + +## Contents + +- [Markdown stream processor](#markdown-stream-processor) +- [Footnote conversion](#footnote-conversion) +- [HTML table support](#html-table-support-in-mdtablefix) +- [Module relationships](#module-relationships) +- [Concurrency with `rayon`](#concurrency-with-rayon) +- [Unicode width handling](#unicode-width-handling) + +## Markdown stream processor + +`process_stream_inner` orchestrates line-by-line rewriting. The full +implementation lives in [src/process.rs](../src/process.rs). Its signature is: + +```rust +pub fn process_stream_inner(lines: &[String], opts: Options) -> Vec<String> +``` + +The function combines several helpers documented in `docs/`: + +- `fences::compress_fences` and `attach_orphan_specifiers` normalize code block + delimiters. +- `html::convert_html_tables` transforms basic HTML tables into Markdown so \ + they can be reflowed like regular tables. See \ + [HTML table support](#html-table-support-in-mdtablefix). +- `wrap::wrap_text` applies optional line wrapping. 
It relies on the + `unicode-width` crate for accurate character widths. + +The function maintains a small state machine that tracks whether it is inside a +Markdown table, an HTML table, or a fenced code block. The state determines how +incoming lines are buffered or emitted. Once the end of a table or fence is +reached, buffered lines are flushed and possibly reformatted. The simplified +behaviour is illustrated below. + +```mermaid +stateDiagram-v2 + + [*] --> Streaming: Start + + Streaming: Default state—processing lines individually + + InMarkdownTable: Buffering lines of a Markdown table + + InHtmlTable: Buffering lines of an HTML table + + InCodeFence: Passing through lines within a fenced code block + + Streaming --> InMarkdownTable: Line starts with "|" + Streaming --> InHtmlTable: Line contains table HTML tag + Streaming --> InCodeFence: Line is a fence delimiter ("```" or "~~~") + + InMarkdownTable --> Streaming: Flush buffer and reflow table on non-table line (e.g., blank, heading) + InMarkdownTable --> InMarkdownTable: Line contains "|" or separator pattern + + InHtmlTable --> Streaming: Flush buffer and convert table on final table HTML closing tag + InHtmlTable --> InHtmlTable: Line inside table tag + + InCodeFence --> Streaming: Line is a fence delimiter +``` + +Before: + +```markdown +|A|B| +|---|---| +|1|22| +
<table><tr><td>3</td><td>4</td></tr></table>
+``` + +After: + +```markdown +| A | B | +| --- | --- | +| 1 | 22 | +| 3 | 4 | +``` + +Code fences are passed through verbatim: + +```rust +| not | a | table | +``` + +After scanning all lines, the processor performs optional post-processing steps +such as ellipsis replacement and footnote conversion. See \ +[footnote conversion](#footnote-conversion) for details. The function then +returns the updated stream for writing to disk or further manipulation. + +## Footnote Conversion + +`mdtablefix` can optionally convert bare numeric references into +GitHub-flavoured Markdown footnotes. The `convert_footnotes` function performs +this operation and is exposed via the higher-level `process_stream_opts` +helper. Set `Options { footnotes: true, ..Default::default() }` when calling +`process_stream_opts` to enable the conversion logic. + +Inline references that appear after punctuation are rewritten as footnote links. + +Before: + +```markdown +A useful tip.1 +``` + +After: + +```markdown +A useful tip.[^1] +``` + +Numbers inside inline code or parentheses are ignored. + +Before: + +```markdown +Look at `code 1` for details. +Refer to equation (1) for context. +``` + +After: + +```markdown +Look at `code 1` for details. +Refer to equation (1) for context. +``` + +When the final lines of a document form a numbered list, they are replaced with +footnote definitions. + +Before: + +```markdown +Text. + + 1. First note + 2. Second note +``` + +After: + +```markdown +Text. + + [^1] First note +[^2] Second note +``` + +`convert_footnotes` only processes the final contiguous list of numeric +references. + +## HTML Table Support in `mdtablefix` + +`mdtablefix` can format simple HTML `<table>` elements embedded in Markdown. +These HTML tables are transformed into Markdown before the main table reflow +logic runs. That preprocessing is handled by the `convert_html_tables` function. + +Only straightforward tables with `<tr>`, `<th>` and `<td>` tags are detected. +Attributes and tag casing are ignored, and complex nested or styled tables are +not supported. After conversion, each HTML table is represented as a Markdown +table, so the usual reflow algorithm can align its columns consistently with +the rest of the document. + +```html +<table> + <tr><th>A</th><th>B</th></tr> + <tr><td>1</td><td>2</td></tr> +</table>
+``` + +The converter checks the first table row for `<th>
` cells or for `` or +`` tags inside `` elements to decide whether it is a header. If no such +markers exist and the table contains multiple rows, the first row is still +treated as the header, so the Markdown output includes a separator line. This +last-resort behaviour keeps simple tables readable after conversion. + +## Module Relationships + +This diagram illustrates the connections between the crate's modules. + +```mermaid +classDiagram + class lib { + <> + } + class html { + <> + +convert_html_tables() + +html_table_to_markdown() + } + class table { + <> + +reflow_table() + +split_cells() + +SEP_RE + } + class wrap { + <> + +wrap_text() + +is_fence() + } + class lists { + <> + +renumber_lists() + } + class breaks { + <> + +format_breaks() + +THEMATIC_BREAK_LEN + } + class ellipsis { + <> + +replace_ellipsis() + } + class fences { + <> + +compress_fences() + +attach_orphan_specifiers() + } + class footnotes { + <> + +convert_footnotes() + } + class process { + <> + +process_stream() + +process_stream_no_wrap() + } + class io { + <> + +rewrite() + +rewrite_no_wrap() + } + lib --> html + lib --> table + lib --> wrap + lib --> lists + lib --> breaks + lib --> ellipsis + lib --> fences + lib --> process + lib --> io + html ..> wrap : uses is_fence + table ..> reflow : uses parse_rows, etc. + lists ..> wrap : uses is_fence + breaks ..> wrap : uses is_fence + ellipsis ..> wrap : uses tokenize_markdown + process ..> html : uses convert_html_tables + process ..> table : uses reflow_table + process ..> wrap : uses wrap_text, is_fence + process ..> fences : uses compress_fences, attach_orphan_specifiers + process ..> ellipsis : uses replace_ellipsis + process ..> footnotes : uses convert_footnotes + io ..> process : uses process_stream, process_stream_no_wrap +``` + +The `lib` module re-exports the public API from the other modules. The +`ellipsis` module performs text normalization. 
The `process` module provides +streaming helpers that combine the lower-level functions, including ellipsis +replacement and footnote conversion. The `io` module handles filesystem +operations, delegating the text processing to `process`. + +## Concurrency with `rayon` + +`mdtablefix` uses the `rayon` crate to process multiple files concurrently. +`rayon` provides a work-stealing thread pool and simple parallel iterators. The +tool relies on Rayon's global thread pool so that no manual setup is required. +The dependency is specified as `^1.0` in `Cargo.toml` to track stable API +changes within the same major release. + +Parallelism is enabled automatically whenever more than one file path is +provided on the command line. Each worker gathers its output before printing, +so results appear in the original order. This buffering increases memory usage +and may reduce performance if many tiny files are processed. + +```mermaid +sequenceDiagram + participant User as actor User + participant CLI as CLI Main + participant FileHandler as handle_file + participant Stdout as Stdout + participant Stderr as Stderr + + User->>CLI: Run CLI with multiple files (not in-place) + CLI->>FileHandler: handle_file(file1) + CLI->>FileHandler: handle_file(file2) + CLI->>FileHandler: handle_file(file3) + Note over CLI,FileHandler: Files processed in parallel + FileHandler-->>CLI: Result (Ok(Some(output)) or Err(error)) + loop For each file in input order + CLI->>Stdout: Print output (if Ok) + CLI->>Stderr: Print error (if Err) + end + CLI-->>User: Exit (with error if any file errored) +``` + +## Unicode Width Handling + +`mdtablefix` wraps paragraphs and list items while respecting the display width +of Unicode characters. The `unicode-width` crate is used to compute the width +of strings when deciding where to break lines. This prevents emojis or other +multibyte characters from causing unexpected wraps or truncation. 
+ +Whenever wrapping logic examines the length of a token, it relies on +`UnicodeWidthStr::width` to measure visible columns rather than byte length. diff --git a/docs/footnote-conversion.md b/docs/footnote-conversion.md deleted file mode 100644 index 2dbce337..00000000 --- a/docs/footnote-conversion.md +++ /dev/null @@ -1,61 +0,0 @@ -# Footnote Conversion - -`mdtablefix` can optionally convert bare numeric references into -GitHub-flavoured Markdown footnotes. The `convert_footnotes` function performs -this operation and is exposed via the higher-level `process_stream_opts` -helper. Set `Options { footnotes: true, ..Default::default() }` when calling -`process_stream_opts` to enable the conversion logic. - -Inline references that appear after punctuation are rewritten as footnote links. - -Before: - -```markdown -A useful tip.1 -``` - -After: - -```markdown -A useful tip.[^1] -``` - -Numbers inside inline code or parentheses are ignored. - -Before: - -```markdown -Look at `code 1` for details. -Refer to equation (1) for context. -``` - -After: - -```markdown -Look at `code 1` for details. -Refer to equation (1) for context. -``` - -When the final lines of a document form a numbered list they are replaced with -footnote definitions. - -Before: - -```markdown -Text. - - 1. First note - 2. Second note -``` - -After: - -```markdown -Text. - - [^1] First note -[^2] Second note -``` - -`convert_footnotes` only processes the final contiguous list of numeric -references. diff --git a/docs/html-table-support.md b/docs/html-table-support.md deleted file mode 100644 index 6e255006..00000000 --- a/docs/html-table-support.md +++ /dev/null @@ -1,24 +0,0 @@ -# HTML Table Support in `mdtablefix` - -`mdtablefix` can format simple HTML `<table>` elements embedded in Markdown. -These HTML tables are transformed into Markdown before the main table reflow -logic runs. That preprocessing is handled by the `convert_html_tables` function. - -Only straightforward tables with `<tr>`, `<th>` and `<td>` tags are detected. -Attributes and tag casing are ignored, and complex nested or styled tables are -not supported. After conversion each HTML table is represented as a Markdown -table so the usual reflow algorithm can align its columns consistently with the -rest of the document. - -```html -<table> - <tr><th>A</th><th>B</th></tr> - <tr><td>1</td><td>2</td></tr> -</table>
-``` - -The converter checks the first table row for `<th>` cells or for `` or -`` tags inside `<td>` elements to decide whether it is a header. If no such -markers exist and the table contains multiple rows, the first row is still -treated as the header, so the Markdown output includes a separator line. This -last-resort behaviour keeps simple tables readable after conversion. diff --git a/docs/markdown-stream-processor.md b/docs/markdown-stream-processor.md deleted file mode 100644 index 82cbc098..00000000 --- a/docs/markdown-stream-processor.md +++ /dev/null @@ -1,79 +0,0 @@ -# Markdown stream processor - -`process_stream_inner` orchestrates line-by-line rewriting. The full -implementation lives in [src/process.rs](../src/process.rs). Its signature is: - -```rust -pub fn process_stream_inner(lines: &[String], opts: Options) -> Vec<String> -``` - -The function combines several helpers documented in `docs/`: - -- `fences::compress_fences` and `attach_orphan_specifiers` normalize code block - delimiters. -- `html::convert_html_tables` transforms basic HTML tables into Markdown so they - can be reflowed like regular tables. See - [HTML table support](html-table-support.md). -- `wrap::wrap_text` applies optional line wrapping. It relies on the - `unicode-width` crate for accurate character widths. - -The function maintains a small state machine that tracks whether it is inside a -Markdown table, an HTML table, or a fenced code block. The state determines how -incoming lines are buffered or emitted. Once the end of a table or fence is -reached, buffered lines are flushed and possibly reformatted. The simplified -behaviour is illustrated below. 
- -```mermaid -stateDiagram-v2 - - [*] --> Streaming: Start - - Streaming: Default state—processing lines individually - - InMarkdownTable: Buffering lines of a Markdown table - - InHtmlTable: Buffering lines of an HTML table - - InCodeFence: Passing through lines within a fenced code block - - Streaming --> InMarkdownTable: Line starts with "|" - Streaming --> InHtmlTable: Line contains table HTML tag - Streaming --> InCodeFence: Line is a fence delimiter ("```" or "~~~") - - InMarkdownTable --> Streaming: Flush buffer and reflow table on non-table line (e.g., blank, heading) - InMarkdownTable --> InMarkdownTable: Line contains "|" or separator pattern - - InHtmlTable --> Streaming: Flush buffer and convert table on final table HTML closing tag - InHtmlTable --> InHtmlTable: Line inside table tag - - InCodeFence --> Streaming: Line is a fence delimiter -``` - -Before: - -```markdown -|A|B| -|---|---| -|1|22| -
<table><tr><td>3</td><td>4</td></tr></table>
-``` - -After: - -```markdown -| A | B | -| --- | --- | -| 1 | 22 | -| 3 | 4 | -``` - -Code fences are passed through verbatim: - -```rust -| not | a | table | -``` - -After scanning all lines, the processor performs optional post-processing steps -such as ellipsis replacement and footnote conversion. See -[footnote conversion](footnote-conversion.md) for details. The function then -returns the updated stream for writing to disk or further manipulation. diff --git a/docs/module-relationships.md b/docs/module-relationships.md deleted file mode 100644 index 97e31aa6..00000000 --- a/docs/module-relationships.md +++ /dev/null @@ -1,85 +0,0 @@ -# Module Relationships - -This diagram illustrates the connections between the crate's modules. - -```mermaid -classDiagram - class lib { - <> - } - class html { - <> - +convert_html_tables() - +html_table_to_markdown() - } - class table { - <> - +reflow_table() - +split_cells() - +SEP_RE - } - class wrap { - <> - +wrap_text() - +is_fence() - } - class lists { - <> - +renumber_lists() - } - class breaks { - <> - +format_breaks() - +THEMATIC_BREAK_LEN - } - class ellipsis { - <> - +replace_ellipsis() - } - class fences { - <> - +compress_fences() - +attach_orphan_specifiers() - } - class footnotes { - <> - +convert_footnotes() - } - class process { - <> - +process_stream() - +process_stream_no_wrap() - } - class io { - <> - +rewrite() - +rewrite_no_wrap() - } - lib --> html - lib --> table - lib --> wrap - lib --> lists - lib --> breaks - lib --> ellipsis - lib --> fences - lib --> process - lib --> io - html ..> wrap : uses is_fence - table ..> reflow : uses parse_rows, etc. 
- lists ..> wrap : uses is_fence - breaks ..> wrap : uses is_fence - ellipsis ..> wrap : uses tokenize_markdown - process ..> html : uses convert_html_tables - process ..> table : uses reflow_table - process ..> wrap : uses wrap_text, is_fence - process ..> fences : uses compress_fences, attach_orphan_specifiers - process ..> ellipsis : uses replace_ellipsis - process ..> footnotes : uses convert_footnotes - io ..> process : uses process_stream, process_stream_no_wrap -``` - -The `lib` module re-exports the public API from the other modules. The -`ellipsis` module performs text normalization. The `process` module provides -streaming helpers that combine the lower-level functions, including ellipsis -replacement and footnote conversion. The `io` module handles filesystem -operations, delegating the text processing to `process`. diff --git a/docs/rayon-concurrency.md b/docs/rayon-concurrency.md deleted file mode 100644 index a0345c1d..00000000 --- a/docs/rayon-concurrency.md +++ /dev/null @@ -1,33 +0,0 @@ -# Concurrency with `rayon` - -`mdtablefix` uses the `rayon` crate to process multiple files concurrently. -`rayon` provides a work-stealing thread pool and simple parallel iterators. The -tool relies on Rayon’s global thread pool so that no manual setup is required. -The dependency is specified as `^1.0` in `Cargo.toml` to track stable API -changes within the same major release. - -Parallelism is enabled automatically whenever more than one file path is -provided on the command line. Each worker gathers its output before printing so -results appear in the original order. This buffering increases memory usage and -may reduce performance if many tiny files are processed. 
- -```mermaid -sequenceDiagram - participant User as actor User - participant CLI as CLI Main - participant FileHandler as handle_file - participant Stdout as Stdout - participant Stderr as Stderr - - User->>CLI: Run CLI with multiple files (not in-place) - CLI->>FileHandler: handle_file(file1) - CLI->>FileHandler: handle_file(file2) - CLI->>FileHandler: handle_file(file3) - Note over CLI,FileHandler: Files processed in parallel - FileHandler-->>CLI: Result (Ok(Some(output)) or Err(error)) - loop For each file in input order - CLI->>Stdout: Print output (if Ok) - CLI->>Stderr: Print error (if Err) - end - CLI-->>User: Exit (with error if any file errored) -``` diff --git a/docs/unicode-width.md b/docs/unicode-width.md deleted file mode 100644 index 8b5905fe..00000000 --- a/docs/unicode-width.md +++ /dev/null @@ -1,9 +0,0 @@ -# Unicode Width Handling - -`mdtablefix` wraps paragraphs and list items while respecting the display width -of Unicode characters. The `unicode-width` crate is used to compute the width -of strings when deciding where to break lines. This prevents emojis or other -multibyte characters from causing unexpected wraps or truncation. - -Whenever wrapping logic examines the length of a token, it relies on -`UnicodeWidthStr::width` to measure visible columns rather than byte length. diff --git a/src/table.rs b/src/table.rs index 378f8399..88bcb754 100644 --- a/src/table.rs +++ b/src/table.rs @@ -1,6 +1,7 @@ //! Markdown table reflow utilities. //! -//! Implements the algorithm outlined in `docs/html-table-support.md` lines 1-24. +//! Implements the algorithm outlined in +//! [`docs/architecture.md`](../../docs/architecture.md). //! Provides helpers used by the `reflow` module and `reflow_table` itself. use regex::Regex; diff --git a/src/wrap.rs b/src/wrap.rs index 3a5ae7fe..0eb8e45a 100644 --- a/src/wrap.rs +++ b/src/wrap.rs @@ -1,7 +1,8 @@ //! Text wrapping utilities respecting inline code and prefixes. //! -//! 
Unicode width handling follows `docs/unicode-width.md` lines 1-9 using the -//! `unicode-width` crate for accurate display calculations. +//! Unicode width handling follows the "Unicode Width Handling" section in +//! `docs/architecture.md` and uses the `unicode-width` crate for accurate +//! display calculations. use regex::Regex;
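
Note on the state machine described in the new `docs/architecture.md`: the transitions in its state diagram can be sketched in a few lines of Rust. This is a minimal sketch under stated assumptions, not the code in `src/process.rs`. The `State` enum and the `is_fence_delimiter`, `next_state`, and `trace` functions are hypothetical names introduced for illustration, and the table detection is deliberately cruder than a real implementation would need.

```rust
/// Hypothetical states mirroring the diagram in docs/architecture.md.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Streaming,
    InMarkdownTable,
    InHtmlTable,
    InCodeFence,
}

/// Hypothetical helper: true when the line opens or closes a fenced block.
fn is_fence_delimiter(line: &str) -> bool {
    let trimmed = line.trim_start();
    trimmed.starts_with("```") || trimmed.starts_with("~~~")
}

/// Advance the machine by one input line (illustrative only).
fn next_state(state: State, line: &str) -> State {
    let trimmed = line.trim();
    let lower = trimmed.to_ascii_lowercase();
    match state {
        State::Streaming => {
            if is_fence_delimiter(line) {
                State::InCodeFence
            } else if lower.contains("<table") {
                // A table opened and closed on one line never leaves Streaming.
                if lower.contains("</table>") {
                    State::Streaming
                } else {
                    State::InHtmlTable
                }
            } else if trimmed.starts_with('|') {
                State::InMarkdownTable
            } else {
                State::Streaming
            }
        }
        // Inside a fence, everything passes through until the closing delimiter.
        State::InCodeFence => {
            if is_fence_delimiter(line) {
                State::Streaming
            } else {
                State::InCodeFence
            }
        }
        State::InHtmlTable => {
            if lower.contains("</table>") {
                State::Streaming
            } else {
                State::InHtmlTable
            }
        }
        State::InMarkdownTable => {
            if trimmed.contains('|') {
                State::InMarkdownTable
            } else {
                // The table has ended; classify this line as if streaming.
                next_state(State::Streaming, line)
            }
        }
    }
}

/// Return the state the machine is in after consuming each line.
fn trace(lines: &[&str]) -> Vec<State> {
    let mut state = State::Streaming;
    lines
        .iter()
        .map(|line| {
            state = next_state(state, line);
            state
        })
        .collect()
}
```

Passing a slice of lines to `trace` returns the state after each one, which makes it easy to confirm that a `|`-prefixed line inside a fence leaves the machine in `InCodeFence` rather than starting a table.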