Merged
32 changes: 31 additions & 1 deletion Cargo.lock


1 change: 1 addition & 0 deletions Cargo.toml
Original file line number Diff line number Diff line change
@@ -24,6 +24,7 @@ once_cell = "1"
rayon = "1.11"
html5ever = "0.27"
markup5ever_rcdom = "0.3"
textwrap = "0.16.2"
unicode-width = "0.1"


4 changes: 3 additions & 1 deletion README.md
@@ -6,7 +6,9 @@ list items to 80 columns.

Hyphenated words are treated as indivisible during wrapping, so
`very-long-word` will move to the next line intact rather than split at the
hyphen. The wrap engine now delegates line fitting to the `textwrap` crate
while preserving Markdown-aware token grouping for inline code, links, and hard
breaks. The tool ignores fenced code blocks and respects escaped pipes (`\|`),
making it safe to use on Markdown with mixed content.

## Installation
64 changes: 64 additions & 0 deletions docs/adrs/0002-textwrap-wrapping-engine.md
@@ -0,0 +1,64 @@
# Architecture Decision Record (ADR) 0002: Delegate line fitting to `textwrap`

- Status: Accepted
- Date: 2026-04-22

## Context

The previous wrapping engine in `src/wrap/line_buffer.rs` implemented a bespoke
`LineBuffer` struct that accumulated tokens, tracked a split-point cursor, and
flushed completed lines one at a time. This approach had three compounding
problems:

- Width measurement was byte-based in early versions, producing incorrect splits
for non-ASCII characters such as CJK glyphs and emoji.
- The split-with-carry logic required carefully coordinated state between
`push_span`, `split_with_span`, and `flush_trailing_whitespace`, making the
code difficult to reason about and extend.
- Each fragment addition triggered a full re-evaluation of the buffer, risking
quadratic behaviour on long paragraphs.

## Decision

Replace `LineBuffer` with `textwrap::wrap_algorithms::wrap_first_fit` and a
fragment model built on the `textwrap::core::Fragment` trait. Each token group
becomes an `InlineFragment` that carries pre-computed display width (via
`unicode-width`) and a `FragmentKind` tag. `wrap_first_fit` performs greedy
line fitting over the fragment slice; post-processing in
`src/wrap/inline/postprocess.rs` normalizes whitespace-only lines and
rebalances atomic tails. Prefix handling is centralized in
`ParagraphWriter::wrap_with_prefix`, which computes available width once and
prepends the correct prefix to every wrapped output line.

The greedy first-fit algorithm is chosen over `textwrap`'s optimal-fit
algorithm because the optimal algorithm may produce non-local changes to
earlier lines when a later fragment is added, which conflicts with the
incremental buffer model and produces surprising diffs.
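
The local, never-look-back behaviour of greedy first-fit can be sketched in plain Rust. This is an illustrative stand-in, not the crate's API: real fragments implement the `textwrap::core::Fragment` trait with `f64` widths, and real measurement uses `unicode-width` rather than `chars().count()`.

```rust
/// Greedy first-fit over pre-measured fragments: a line is closed as soon
/// as the next fragment would overflow, and closed lines are never
/// revisited. (Simplified stand-in for `wrap_first_fit`.)
fn first_fit(frags: &[&str], max_width: usize) -> Vec<String> {
    let mut lines: Vec<String> = Vec::new();
    let mut current = String::new();
    for frag in frags {
        // Stand-in width measure: assumes one column per char.
        let width = frag.chars().count();
        let sep = if current.is_empty() { 0 } else { 1 };
        if !current.is_empty() && current.chars().count() + sep + width > max_width {
            // First-fit: flush the line and never touch it again.
            lines.push(std::mem::take(&mut current));
        }
        if !current.is_empty() {
            current.push(' ');
        }
        current.push_str(frag);
    }
    if !current.is_empty() {
        lines.push(current);
    }
    lines
}

fn main() {
    let lines = first_fit(&["Hyphenated", "words", "move", "as", "a", "unit"], 16);
    assert_eq!(lines, vec!["Hyphenated words".to_string(), "move as a unit".to_string()]);
    println!("{lines:?}");
}
```

Because each decision depends only on the current line, adding a fragment at the end can never reshuffle earlier lines, which is exactly the property the incremental buffer model needs.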

## Consequences

Positive:

- Line fitting is delegated to a well-tested upstream crate; the bespoke split
logic and `LineBuffer` state machine are removed entirely.
- Display widths are computed by `unicode-width` according to Unicode Standard
Annex `#11`, giving correct column counts for non-ASCII text.
- `InlineFragment::kind` centralizes token classification, so post-processing
predicates (`is_whitespace`, `is_atomic`, `is_plain`) do not repeat
classification logic.

Negative:

- Greedy first-fit produces wider first lines than optimal-fit would in some
cases, though this difference is not visible in standard Markdown prose.
- The project now depends on `textwrap 0.16` in addition to `unicode-width`.

## Alternatives considered

- **Optimal-fit algorithm** (`textwrap::wrap_algorithms::wrap_optimal_fit`):
rejected because it requires the complete fragment list upfront and may
redistribute earlier lines when later fragments are added, which conflicts
with the streaming model.
- **Patching `LineBuffer` for Unicode correctness**: rejected because the
split-point cursor and carry semantics remained inherently fragile; the
maintenance burden outweighed the risk of introducing a new dependency.
136 changes: 108 additions & 28 deletions docs/architecture.md
@@ -36,8 +36,9 @@ The function combines several helpers documented in `docs/`:
- `html::convert_html_tables` transforms basic HTML tables into Markdown so \
they can be reflowed like regular tables. See \
[HTML table support](#html-table-support-in-mdtablefix).
- `wrap::wrap_text` applies optional line wrapping. It classifies Markdown
block structure locally and delegates greedy line fitting to the `textwrap`
crate over Markdown-aware fragments measured with `unicode-width`.
- `wrap::tokenize_markdown` emits `Token` values for custom processing.
- `headings::convert_setext_headings` rewrites Setext headings with underline
markers into ATX headings when the CLI `--headings` flag is provided. The
@@ -374,35 +375,113 @@ module handles filesystem operations, delegating the text processing to

### Tokenizer flow

The inline tokenizer still iterates over the source string lazily, so no
duplicate `Vec<char>` representation is required. The resulting tokens are then
grouped into Markdown-aware fragments and passed to
`textwrap::wrap_algorithms::wrap_first_fit`, which chooses the breakpoints
without splitting code spans, links, or punctuation groups.

```mermaid
flowchart TD
A["Input text (&str)"] --> B["Tokenize into whitespace and inline Markdown tokens"]
B --> C["Group tokens into Markdown-aware fragments"]
C --> D["Measure fragment widths with unicode-width"]
D --> E["Run textwrap wrap_first_fit over current fragments"]
E --> F["Merge whitespace-only continuation lines forward"]
F --> G["Render wrapped lines, trimming only a single trailing separator space"]
```

Figure: Wrap-tokenizer flow. Starting from an input string, the wrapper emits
whitespace and inline Markdown tokens, groups them into fragments, measures
their display widths with `unicode-width`, feeds them through
`textwrap::wrap_algorithms::wrap_first_fit`, and then reconstructs wrapped
lines while preserving Markdown-aware spacing rules.
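
The grouping step above hinges on classifying each token once, at fragment-construction time. The following sketch shows the idea; the variant names mirror the `is_whitespace` / `is_atomic` / `is_plain` predicates described elsewhere in this document, but the matching rules here are simplified illustrations, not the real `classify_fragment` logic.

```rust
#[derive(Debug, PartialEq)]
enum FragmentKind {
    Whitespace,
    Atomic, // code spans, links, and images move as one unit
    Plain,
}

/// Illustrative classifier: the real rules in `src/wrap/inline.rs` are
/// richer (escapes, hard breaks, punctuation groups).
fn classify(token: &str) -> FragmentKind {
    if token.chars().all(char::is_whitespace) {
        FragmentKind::Whitespace
    } else if token.starts_with('`') || token.starts_with('[') || token.starts_with("![") {
        FragmentKind::Atomic
    } else {
        FragmentKind::Plain
    }
}

fn main() {
    assert_eq!(classify("`let x = 1;`"), FragmentKind::Atomic);
    assert_eq!(classify("[docs](https://example.com)"), FragmentKind::Atomic);
    assert_eq!(classify("   "), FragmentKind::Whitespace);
    assert_eq!(classify("prose"), FragmentKind::Plain);
}
```

Tagging fragments once up front lets every later pass ask cheap questions about kind instead of re-inspecting strings.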

### Wrap flow

The higher-level `wrap_text` entry point combines block classification,
paragraph buffering, prefix-aware wrapping, and inline line fitting. The
following flow shows how a line moves through those stages before it is either
preserved verbatim or emitted as wrapped output.

```mermaid
flowchart TD
A[Start: wrap_text called with lines and width] --> B{Classify line}

B -->|Fenced or indented code block| C[Preserve line verbatim]
B -->|Table or heading or directive| C
B -->|Blank line| D[Flush active paragraph and emit blank]
B -->|Paragraph or prefixed line| E[Send to ParagraphWriter]

E --> F{Has prefix such as bullet, blockquote, footnote}
F -->|Yes| G[wrap_with_prefix computes display width using unicode-width]
F -->|No| H[wrap_preserving_code wraps inline content]

G --> I[fragment-building / post-process helpers]
H --> I

I --> J[textwrap::wrap_algorithms::wrap_first_fit performs line breaking]
J --> K[Reconstruct wrapped lines with prefixes and preserved spans]
K --> L[Emit wrapped lines to wrap_text]

C --> M[Append line to output]
D --> M
L --> M

M --> N{More input lines?}
N -->|Yes| B
N -->|No| O[Flush remaining paragraph and finish]
```

Figure: `wrap_text` control flow. The wrapper classifies each incoming line,
passes fenced blocks, tables, headings, directives, and indented code through
unchanged, flushes paragraphs on blanks, routes prose and prefixed lines
through `ParagraphWriter`, computes visible widths with `unicode-width`, and
delegates inline line fitting to `textwrap` before reconstructing the emitted
Markdown lines.
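
The classification branch at the top of the flow can be illustrated with a minimal, stateless sketch. Note the hedge: the real `classify_block` is stateful (it must remember whether a fence is currently open, and it recognizes more block kinds), whereas this example only shows the per-line dispatch shape.

```rust
#[derive(Debug, PartialEq)]
enum BlockKind {
    Verbatim,  // fences, tables, headings, directives: emitted unchanged
    Blank,     // flushes the active paragraph
    Paragraph, // routed to the paragraph wrapper
}

/// Simplified, stateless sketch of line classification.
fn classify_line(line: &str) -> BlockKind {
    let trimmed = line.trim_start();
    if line.trim().is_empty() {
        BlockKind::Blank
    } else if trimmed.starts_with("```")
        || trimmed.starts_with("~~~")
        || trimmed.starts_with('#')
        || trimmed.starts_with('|')
    {
        BlockKind::Verbatim
    } else {
        BlockKind::Paragraph
    }
}

fn main() {
    assert_eq!(classify_line("```rust"), BlockKind::Verbatim);
    assert_eq!(classify_line("| a | b |"), BlockKind::Verbatim);
    assert_eq!(classify_line(""), BlockKind::Blank);
    assert_eq!(classify_line("Plain prose to wrap."), BlockKind::Paragraph);
}
```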

### Wrap sequence

The following sequence diagram focuses on the runtime collaboration between the
CLI entry point, `wrap_text`, `ParagraphWriter`, the inline wrapper, and
`textwrap` while a paragraph is being processed.

```mermaid
sequenceDiagram
participant CLI as mdtablefix_CLI
participant WT as wrap_text
participant PW as ParagraphWriter
participant WP as wrap_preserving_code
participant IH as inline.rs_helpers
participant TW as textwrap::wrap_first_fit

CLI->>WT: wrap_text(lines, width)
loop For each classified paragraph line
WT->>PW: handle_prefix_line / flush_paragraph
alt Prefixed or plain paragraph content
PW->>WP: wrap_preserving_code(text, width)
WP->>IH: build_fragments + merge/rebalance
IH->>TW: wrap_first_fit(fragments, line_widths)
TW-->>IH: wrapped_fragment_groups
IH-->>WP: wrapped_lines_with_spans
WP-->>PW: wrapped_lines_with_prefixes
PW-->>WT: wrapped_lines
WT-->>CLI: append wrapped output
else Nonwrappable line
PW-->>WT: push_verbatim / original_line
WT-->>CLI: append original output
end
end
WT-->>CLI: return final wrapped text
```

Figure: `wrap_text` sequence flow. The CLI calls `wrap_text`, which delegates
paragraph handling to `ParagraphWriter`; wrappable paragraph content then flows
through `wrap_preserving_code`, the fragment-building and post-processing
helpers in `src/wrap/inline.rs`, and the underlying `textwrap` engine before
wrapped lines return through the same stack to the CLI, while nonwrappable
lines bypass the inline wrapping path and are emitted unchanged.

The helper `html_table_to_markdown` is retained for backward compatibility but
is deprecated. New code should call `convert_html_tables` instead.

@@ -444,8 +523,9 @@ sequenceDiagram

`mdtablefix` wraps paragraphs and list items while respecting the display width
of Unicode characters. The `unicode-width` crate is used to compute the width
of prefixes and Markdown-aware wrapping fragments before `textwrap` performs
line fitting. This prevents emojis or other multibyte characters from causing
unexpected wraps or truncation.

Whenever wrapping logic examines the length of a token, it relies on
`UnicodeWidthStr::width` to measure visible columns rather than byte length.
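
A std-only illustration of why byte length cannot drive wrapping decisions. The assertions below use only the standard library; the six-column figure in the comment is the value `unicode-width` reports for this string, since CJK ideographs occupy two terminal columns each.

```rust
fn main() {
    let cjk = "日本語";
    // Byte length counts UTF-8 code units, not columns.
    assert_eq!(cjk.len(), 9);
    // Char count is closer, but still wrong for wide glyphs.
    assert_eq!(cjk.chars().count(), 3);
    // UnicodeWidthStr::width("日本語") returns 6, which is the number the
    // wrapper must use when fitting an 80-column line.
}
```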
68 changes: 68 additions & 0 deletions docs/developers-guide.md
@@ -88,3 +88,71 @@ The rationale for the staged table reflow pipeline is recorded in
`docs/adrs/0001-table-reflow-pipeline.md`. Refer to that ADR when changing the
parse, width-calculation, or separator-handling flow so implementation changes
stay aligned with the documented design constraints.

## Wrap module architecture

The wrapping pipeline for `--wrap` is:

1. **Block classification.** `classify_block` in `src/wrap.rs` inspects each
input line and decides whether it should pass through verbatim or enter the
paragraph wrapper. Fenced code blocks, indented code blocks, headings,
tables, directives, and blank lines stop paragraph accumulation.

2. **Prefix-aware paragraph handling.** `ParagraphWriter` in
`src/wrap/paragraph.rs` is the single entry point for prefix-aware wrapping.
`wrap_with_prefix` computes the available content width once from the
Unicode display width of the first-line prefix, then feeds the paragraph
text into `wrap_preserving_code`.

3. **Fragment construction and line fitting.** `wrap_preserving_code` in
`src/wrap/inline.rs` tokenizes prose with `tokenize::segment_inline`, groups
the tokens into `InlineFragment` values, and calls
`textwrap::wrap_algorithms::wrap_first_fit` over the accumulated fragment
buffer.

4. **Post-processing and rendering.** The `postprocess` module applies
`merge_whitespace_only_lines` and then `rebalance_atomic_tails` so
whitespace-only wrap artefacts and isolated tails are normalized before the
fragments are rendered back into output lines.

`InlineFragment` carries the rendered fragment text, its precomputed display
width, and a `FragmentKind` tag. That construction-time classification lets the
`is_whitespace`, `is_atomic`, and `is_plain` predicates answer all later
questions without repeating ad hoc string inspection in the post-processing
passes.

The `postprocess` module exists because greedy line fitting alone does not
reproduce the repository's historical whitespace semantics. The first pass
merges whitespace-only wrap lines into adjacent content, and the second pass
rebalances a trailing atomic or plain fragment only when the destination line
still fits within the configured width.
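
The observable effect of the first post-processing pass can be sketched at the string level. This is only an approximation: the real `merge_whitespace_only_lines` operates on fragment groups and merges the whitespace forward rather than discarding it, but the rendered result for a whitespace-only wrap artefact is the same.

```rust
/// String-level sketch of the first post-pass: a wrapped line that holds
/// nothing visible is folded away so no blank artefact appears
/// mid-paragraph.
fn merge_whitespace_only_lines(lines: Vec<String>) -> Vec<String> {
    let mut out: Vec<String> = Vec::new();
    for line in lines {
        if line.trim().is_empty() && !out.is_empty() {
            // Whitespace-only wrap artefact: skip instead of emitting.
            continue;
        }
        out.push(line);
    }
    out
}

fn main() {
    let wrapped = vec!["alpha beta".to_string(), "   ".to_string(), "gamma".to_string()];
    let merged = merge_whitespace_only_lines(wrapped);
    assert_eq!(merged, vec!["alpha beta".to_string(), "gamma".to_string()]);
}
```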

### Key types and functions

Table: Key types and functions.

| Symbol | File |
| ------------------------------------------------------- | -------------------------------- |
| `FragmentKind`, `InlineFragment`, `classify_fragment` | `src/wrap/inline.rs` |
| `build_fragments`, `wrap_preserving_code` | `src/wrap/inline.rs` |
| `merge_whitespace_only_lines`, `rebalance_atomic_tails` | `src/wrap/inline/postprocess.rs` |
| `ParagraphWriter`, `wrap_with_prefix` | `src/wrap/paragraph.rs` |
| `ParagraphState`, `PrefixLine` | `src/wrap/paragraph.rs` |

### Design constraints

- **Public API stability.** `mdtablefix::wrap::wrap_text`, `Token`, and
`tokenize_markdown` must not change their signatures or observable behaviour.
- **Atomic fragments.** Inline code spans and Markdown links are never split
across lines; they move as a unit when they would overflow the target width.
- **Hard breaks.** Trailing two-space hard breaks must survive on the emitted
line where they occur.
- **Verbatim blocks.** Fenced code blocks must pass through unchanged, along
with the other non-paragraph block kinds detected by `classify_block`.
- **Prefix width.** The visual width of every prefix string is measured with
`UnicodeWidthStr::width` before the available text width is computed, so
non-ASCII prefix characters (e.g. `「` in CJK blockquotes) are accounted for
correctly.
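
The prefix-width constraint amounts to one subtraction, shown below with a toy width table. The `display_width` helper is a hypothetical stand-in for `UnicodeWidthStr::width` covering only a few common wide ranges; it agrees with `unicode-width` that `「` (U+300C) is two columns, which is enough for the example.

```rust
/// Toy stand-in for `UnicodeWidthStr::width`: covers only a few common
/// wide ranges (CJK punctuation, CJK ideographs, fullwidth forms).
fn display_width(s: &str) -> usize {
    s.chars()
        .map(|c| match c as u32 {
            0x3000..=0x303F | 0x4E00..=0x9FFF | 0xFF00..=0xFF60 => 2,
            _ => 1,
        })
        .sum()
}

/// Available content width: target width minus the prefix's columns.
fn available_width(prefix: &str, target: usize) -> usize {
    target.saturating_sub(display_width(prefix))
}

fn main() {
    assert_eq!(available_width("> ", 80), 78);
    // The CJK bracket counts as two columns, not one char.
    assert_eq!(available_width("「", 80), 78);
}
```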

Refer to `docs/adrs/0002-textwrap-wrapping-engine.md` for the rationale behind
replacing `LineBuffer` with `textwrap`.