Merged
8 changes: 5 additions & 3 deletions README.md
@@ -159,12 +159,14 @@ Only simple tables composed of `<tr>`, `<th>`, and `<td>` tags are supported.
Tag case and attributes are ignored. After conversion, they are reformatted
alongside regular Markdown tables.

See [HTML table support for more details](docs/html-table-support.md).
See
[HTML&nbsp;table&nbsp;support&nbsp;for&nbsp;more&nbsp;details](docs/architecture.md#html-table-support-in-mdtablefix)
.

## Module structure

For an overview of how the crate's internal modules relate to each other, see
[Module relationships](docs/module-relationships.md).
For an overview of how the crate's internal modules relate to each other, see \
[Module relationships](docs/architecture.md#module-relationships).

## Testing

307 changes: 307 additions & 0 deletions docs/architecture.md
@@ -0,0 +1,307 @@
# Architecture

## Contents

- [Markdown stream processor](#markdown-stream-processor)
- [Footnote conversion](#footnote-conversion)
- [HTML table support](#html-table-support-in-mdtablefix)
- [Module relationships](#module-relationships)
- [Concurrency with `rayon`](#concurrency-with-rayon)
- [Unicode width handling](#unicode-width-handling)

## Markdown stream processor

`process_stream_inner` orchestrates line-by-line rewriting. The full
implementation lives in [src/process.rs](../src/process.rs). Its signature is:

```rust
pub fn process_stream_inner(lines: &[String], opts: Options) -> Vec<String>
```

The function combines several helpers documented in `docs/`:

- `fences::compress_fences` and `attach_orphan_specifiers` normalize code block
delimiters.
- `html::convert_html_tables` transforms basic HTML tables into Markdown so
they can be reflowed like regular tables. See
[HTML table support](#html-table-support-in-mdtablefix).
- `wrap::wrap_text` applies optional line wrapping. It relies on the
`unicode-width` crate for accurate character widths.

The function maintains a small state machine that tracks whether it is inside a
Comment on lines +24 to +31
🧹 Nitpick (assertive)

Wrap long list items to ≤ 80 columns.

Lines in this range overshoot the style limit for prose files. Hard-wrap to maintain consistency with the project’s Markdown guidelines.

🤖 Prompt for AI Agents
In docs/architecture.md around lines 24 to 31, the lines describing the
functions exceed the 80-column limit for prose files. Reformat these lines by
hard-wrapping the text so that no line is longer than 80 characters, ensuring
the content remains clear and consistent with the project's Markdown style
guidelines.

Markdown table, an HTML table, or a fenced code block. The state determines how
incoming lines are buffered or emitted. Once the end of a table or fence is
reached, buffered lines are flushed and possibly reformatted. The simplified
behaviour is illustrated below.

```mermaid
stateDiagram-v2

[*] --> Streaming: Start

Streaming: Default state—processing lines individually

InMarkdownTable: Buffering lines of a Markdown table

InHtmlTable: Buffering lines of an HTML table

InCodeFence: Passing through lines within a fenced code block

Streaming --> InMarkdownTable: Line starts with "|"
Streaming --> InHtmlTable: Line contains table HTML tag
Streaming --> InCodeFence: Line is a fence delimiter ("```" or "~~~")

InMarkdownTable --> Streaming: Flush buffer and reflow table on non-table line (e.g., blank, heading)
InMarkdownTable --> InMarkdownTable: Line contains "|" or separator pattern

InHtmlTable --> Streaming: Flush buffer and convert table on final table HTML closing tag
InHtmlTable --> InHtmlTable: Line inside table tag

InCodeFence --> Streaming: Line is a fence delimiter
```

Before:

```markdown
|A|B|
|---|---|
|1|22|
<table><tr><td>3</td><td>4</td></tr></table>
```

After:

```markdown
| A | B |
| --- | --- |
| 1 | 22 |
| 3 | 4 |
```

Code fences are passed through verbatim:

```rust
| not | a | table |
```
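The state machine described above can be sketched roughly as follows. This is an illustrative approximation based only on the diagram, not the code in `src/process.rs`: the `State` enum, `is_fence_line`, and `next_state` are hypothetical names, and buffering, flushing, and reflowing are left out so that only the transitions are visible.

```rust
/// Hypothetical states mirroring the diagram; the real type may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Streaming,
    InMarkdownTable,
    InHtmlTable,
    InCodeFence,
}

/// Returns true for a line that opens or closes a fenced code block.
fn is_fence_line(line: &str) -> bool {
    let trimmed = line.trim_start();
    trimmed.starts_with("```") || trimmed.starts_with("~~~")
}

/// Computes the state after consuming `line` (assumed transition rules).
fn next_state(state: State, line: &str) -> State {
    let lower = line.to_ascii_lowercase();
    match state {
        State::Streaming => {
            if is_fence_line(line) {
                State::InCodeFence
            } else if line.trim_start().starts_with('|') {
                State::InMarkdownTable
            } else if lower.contains("<table") && !lower.contains("</table>") {
                // A table opened and closed on one line never needs buffering.
                State::InHtmlTable
            } else {
                State::Streaming
            }
        }
        State::InMarkdownTable => {
            if line.contains('|') {
                State::InMarkdownTable
            } else {
                // The table has ended; classify this line from the default state.
                next_state(State::Streaming, line)
            }
        }
        State::InHtmlTable => {
            if lower.contains("</table>") {
                State::Streaming
            } else {
                State::InHtmlTable
            }
        }
        State::InCodeFence => {
            if is_fence_line(line) {
                State::Streaming
            } else {
                State::InCodeFence
            }
        }
    }
}
```

Keeping the transition function free of side effects makes the pass-through rule for code fences easy to check: a line such as `| not | a | table |` leaves `InCodeFence` unchanged.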

After scanning all lines, the processor performs optional post-processing steps
such as ellipsis replacement and footnote conversion. See
[footnote conversion](#footnote-conversion) for details. The function then
returns the updated stream for writing to disk or further manipulation.

## Footnote conversion

`mdtablefix` can optionally convert bare numeric references into
GitHub-flavoured Markdown footnotes. The `convert_footnotes` function performs
this operation and is exposed via the higher-level `process_stream_opts`
helper. Set `Options { footnotes: true, ..Default::default() }` when calling
`process_stream_opts` to enable the conversion logic.

Inline references that appear after punctuation are rewritten as footnote links.

Before:

```markdown
A useful tip.1
```

After:

```markdown
A useful tip.[^1]
```

Numbers inside inline code or parentheses are ignored.

Before:

```markdown
Look at `code 1` for details.
Refer to equation (1) for context.
```

After:

```markdown
Look at `code 1` for details.
Refer to equation (1) for context.
```
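The two rules above can be approximated with a small character scanner. This is a minimal sketch under stated assumptions, not the logic of `convert_footnotes`: the function name `convert_inline_refs` is hypothetical, the punctuation set is a guess, and the extra check that skips decimals such as `1.2` is an assumption added for the sketch.

```rust
/// Rewrites a bare number that directly follows punctuation as `[^n]`.
/// Numbers inside inline code are skipped; numbers in parentheses never
/// match because `(` is not in the assumed punctuation set.
fn convert_inline_refs(line: &str) -> String {
    let chars: Vec<char> = line.chars().collect();
    let mut out = String::with_capacity(line.len() + 4);
    let mut in_code = false;
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        if c == '`' {
            // Toggle inline-code mode on every backtick.
            in_code = !in_code;
            out.push(c);
            i += 1;
            continue;
        }
        let after_punct =
            i >= 1 && matches!(chars[i - 1], '.' | ',' | ';' | ':' | '!' | '?');
        // Assumption: "1.2" is a decimal, not a reference.
        let punct_follows_digit = i >= 2 && chars[i - 2].is_ascii_digit();
        if !in_code && c.is_ascii_digit() && after_punct && !punct_follows_digit {
            let start = i;
            while i < chars.len() && chars[i].is_ascii_digit() {
                i += 1;
            }
            let digits: String = chars[start..i].iter().collect();
            // Only convert when the number ends the word.
            if i == chars.len() || chars[i].is_whitespace() {
                out.push_str("[^");
                out.push_str(&digits);
                out.push(']');
            } else {
                out.push_str(&digits);
            }
            continue;
        }
        out.push(c);
        i += 1;
    }
    out
}
```

A production implementation would more likely tokenize the Markdown first; the scanner is only meant to make the matching conditions explicit.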

When the final lines of a document form a numbered list, they are replaced with
footnote definitions.

Before:

```markdown
Text.

1. First note
2. Second note
```

After:

```markdown
Text.

[^1]: First note
[^2]: Second note
```

`convert_footnotes` only converts the final contiguous numbered list in the
document. Numbered lists that appear earlier are left unchanged.
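The trailing-list rule can be sketched as a backwards scan from the end of the document. The helper names `parse_numbered` and `convert_trailing_list` are hypothetical, and the sketch assumes the output uses the GitHub-flavoured definition syntax `[^n]: text` (with a colon); the real function may handle indentation, trailing blank lines, and other edge cases differently.

```rust
/// Splits a line of the form "N. text" into ("N", "text").
fn parse_numbered(line: &str) -> Option<(&str, &str)> {
    let trimmed = line.trim_start();
    let digits_end = trimmed.find(|c: char| !c.is_ascii_digit())?;
    if digits_end == 0 {
        return None;
    }
    let (number, rest) = trimmed.split_at(digits_end);
    let text = rest.strip_prefix(". ")?;
    Some((number, text))
}

/// Rewrites only the final contiguous numbered list as footnote definitions.
fn convert_trailing_list(lines: &[String]) -> Vec<String> {
    let mut out = lines.to_vec();

    // Ignore blank lines at the very end of the document.
    let mut end = out.len();
    while end > 0 && out[end - 1].trim().is_empty() {
        end -= 1;
    }

    // Walk backwards while the lines still look like numbered list items.
    let mut start = end;
    while start > 0 && parse_numbered(&out[start - 1]).is_some() {
        start -= 1;
    }

    for line in &mut out[start..end] {
        let definition = parse_numbered(line.as_str())
            .map(|(number, text)| format!("[^{}]: {}", number, text));
        if let Some(definition) = definition {
            *line = definition;
        }
    }
    out
}
```

Scanning from the end keeps the rule cheap and guarantees that a numbered list in the middle of the document is never touched.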

## HTML table support in `mdtablefix`

`mdtablefix` can format simple HTML `<table>` elements embedded in Markdown.
These HTML tables are transformed into Markdown before the main table reflow
logic runs. That preprocessing is handled by the `convert_html_tables` function.

Only straightforward tables with `<tr>`, `<th>` and `<td>` tags are detected.
Attributes and tag casing are ignored, and complex nested or styled tables are
not supported. After conversion, each HTML table is represented as a Markdown
table, so the usual reflow algorithm can align its columns consistently with
the rest of the document.

```html
<table>
<tr><th>A</th><th>B</th></tr>
<tr><td>1</td><td>2</td></tr>
</table>
```

The converter checks the first table row for `<th>` cells or for `<strong>` or
`<b>` tags inside `<td>` elements to decide whether it is a header. If no such
markers exist and the table contains multiple rows, the first row is still
treated as the header, so the Markdown output includes a separator line. This
last-resort behaviour keeps simple tables readable after conversion.
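The header decision described above reduces to a short predicate. The sketch below is an assumption-laden simplification: `first_row_is_header` is a hypothetical name, it inspects the raw HTML of the first row with substring checks rather than a parsed DOM, and it does not verify that bold tags sit inside `<td>` elements.

```rust
/// Decides whether the first row of a simple HTML table becomes the header.
/// `first_row_html` is the markup of the first `<tr>`; `row_count` is the
/// total number of rows in the table.
fn first_row_is_header(first_row_html: &str, row_count: usize) -> bool {
    // Tag case is ignored, so compare against a lowercased copy.
    let row = first_row_html.to_ascii_lowercase();
    let has_th = row.contains("<th>") || row.contains("<th ");
    let has_bold =
        row.contains("<strong") || row.contains("<b>") || row.contains("<b ");
    // Fallback: any multi-row table still gets a header row.
    has_th || has_bold || row_count > 1
}
```

Under these rules only a single-row table without `<th>` or bold markers is emitted without a header.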

## Module relationships

This diagram illustrates the connections between the crate's modules.

```mermaid
classDiagram
class lib {
<<module>>
}
class html {
<<module>>
+convert_html_tables()
+html_table_to_markdown()
}
class table {
<<module>>
+reflow_table()
+split_cells()
+SEP_RE
}
class wrap {
<<module>>
+wrap_text()
+is_fence()
}
class lists {
<<module>>
+renumber_lists()
}
class breaks {
<<module>>
+format_breaks()
+THEMATIC_BREAK_LEN
}
class ellipsis {
<<module>>
+replace_ellipsis()
}
class fences {
<<module>>
+compress_fences()
+attach_orphan_specifiers()
}
class footnotes {
<<module>>
+convert_footnotes()
}
class process {
<<module>>
+process_stream()
+process_stream_no_wrap()
}
class io {
<<module>>
+rewrite()
+rewrite_no_wrap()
}
lib --> html
lib --> table
lib --> wrap
lib --> lists
lib --> breaks
lib --> ellipsis
lib --> fences
lib --> process
lib --> io
html ..> wrap : uses is_fence
table ..> reflow : uses parse_rows, etc.
lists ..> wrap : uses is_fence
breaks ..> wrap : uses is_fence
ellipsis ..> wrap : uses tokenize_markdown
process ..> html : uses convert_html_tables
process ..> table : uses reflow_table
process ..> wrap : uses wrap_text, is_fence
process ..> fences : uses compress_fences, attach_orphan_specifiers
process ..> ellipsis : uses replace_ellipsis
process ..> footnotes : uses convert_footnotes
io ..> process : uses process_stream, process_stream_no_wrap
```

The `lib` module re-exports the public API from the other modules. The
`ellipsis` module performs text normalization. The `process` module provides
streaming helpers that combine the lower-level functions, including ellipsis
replacement and footnote conversion. The `io` module handles filesystem
operations, delegating the text processing to `process`.

## Concurrency with `rayon`

`mdtablefix` uses the `rayon` crate to process multiple files concurrently.
`rayon` provides a work-stealing thread pool and simple parallel iterators. The
tool relies on Rayon's global thread pool so that no manual setup is required.
The dependency is specified as `^1.0` in `Cargo.toml` to track stable API
changes within the same major release.

Parallelism is enabled automatically whenever more than one file path is
provided on the command line. Each worker buffers its output instead of
printing it directly, and the main thread then prints the buffered results in
input order. This buffering increases memory usage and may reduce performance
when many tiny files are processed.

```mermaid
sequenceDiagram
actor User
participant CLI as CLI Main
participant FileHandler as handle_file
participant Stdout as Stdout
participant Stderr as Stderr

User->>CLI: Run CLI with multiple files (not in-place)
CLI->>FileHandler: handle_file(file1)
CLI->>FileHandler: handle_file(file2)
CLI->>FileHandler: handle_file(file3)
Note over CLI,FileHandler: Files processed in parallel
FileHandler-->>CLI: Result (Ok(Some(output)) or Err(error))
loop For each file in input order
CLI->>Stdout: Print output (if Ok)
CLI->>Stderr: Print error (if Err)
end
CLI-->>User: Exit (with error if any file errored)
```
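The buffer-then-print pattern in the diagram can be illustrated without any dependencies. The sketch below deliberately swaps Rayon for `std::thread::scope` so that it runs with the standard library alone; it is not the tool's implementation. `handle_input` and `process_all` are hypothetical names, and the "processing" is a placeholder.

```rust
use std::thread;

/// Placeholder for per-file work; returns the buffered output or an error.
fn handle_input(input: &str) -> Result<String, String> {
    if input.is_empty() {
        Err("empty input".to_string())
    } else {
        Ok(input.to_uppercase())
    }
}

/// Processes every input on its own thread and returns results in input order.
fn process_all(inputs: &[String]) -> Vec<Result<String, String>> {
    thread::scope(|scope| {
        // Spawn one worker per input; the handles keep the input order.
        let handles: Vec<_> = inputs
            .iter()
            .map(|input| scope.spawn(move || handle_input(input)))
            .collect();

        // Joining in order yields results in order, regardless of which
        // worker finished first.
        handles
            .into_iter()
            .map(|handle| {
                handle
                    .join()
                    .unwrap_or_else(|_| Err("worker panicked".to_string()))
            })
            .collect::<Vec<_>>()
    })
}
```

With Rayon, collecting an indexed parallel iterator into a `Vec` preserves input order in the same way, and the work-stealing pool avoids spawning one thread per file.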

## Unicode width handling

`mdtablefix` wraps paragraphs and list items while respecting the display width
of Unicode characters. The `unicode-width` crate is used to compute the width
of strings when deciding where to break lines. This prevents emoji, CJK text,
and other characters whose byte length differs from their column width from
causing unexpected wraps or truncation.

Whenever wrapping logic examines the length of a token, it relies on
`UnicodeWidthStr::width` to measure visible columns rather than byte length.
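To show why byte length is the wrong measure, the sketch below compares `str::len` with a rough column count. It is a simplified stand-in for `UnicodeWidthStr::width`, written with the standard library only: `approx_char_width`, `approx_width`, and `fits` are hypothetical helpers, and the character ranges cover just a few common wide blocks rather than the full Unicode East Asian Width data.

```rust
/// Rough column width of one character (assumed ranges, far from complete).
fn approx_char_width(c: char) -> usize {
    match c as u32 {
        // Combining diacritical marks occupy no column of their own.
        0x0300..=0x036F => 0,
        // Hangul Jamo, kana, CJK ideographs, Hangul syllables,
        // fullwidth forms, and a common emoji block.
        0x1100..=0x115F
        | 0x3041..=0x30FF
        | 0x4E00..=0x9FFF
        | 0xAC00..=0xD7A3
        | 0xFF01..=0xFF60
        | 0x1F300..=0x1F64F => 2,
        _ => 1,
    }
}

/// Approximate display width of a string in terminal columns.
fn approx_width(s: &str) -> usize {
    s.chars().map(approx_char_width).sum()
}

/// Returns true if `token` still fits on a line that already uses `used`
/// columns out of `limit`.
fn fits(token: &str, used: usize, limit: usize) -> bool {
    used + approx_width(token) <= limit
}
```

For instance, `"日本語"` occupies 9 bytes but 6 columns, so a wrap decision based on `len()` would break the line earlier than necessary.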
61 changes: 0 additions & 61 deletions docs/footnote-conversion.md

This file was deleted.