Revisit hand-rolled markdown parser: close setext / footnotes / HTML / indented-code gaps

**Coverage gap — documented non-parity with upstream remark.**

## Current state

`src/chat_sdk/shared/markdown_parser.py` is a ~880-line hand-rolled regex parser producing mdast-shaped output. It covers ATX headings, fenced code blocks, thematic breaks, blockquotes, ordered/unordered lists, GFM tables, inline emphasis/strong/delete/code/links/images, hard line breaks.

What it **doesn't** cover (listed at `docs/UPSTREAM_SYNC.md:442` as "by design, won't fix"):

- Setext headings (`=====` / `-----` underlines)
- Footnotes (`[^1]: ...`)
- Raw HTML nodes (block + inline)
- Indented code blocks (4-space indent)
- Escaped characters (`\*`, `\[`, `\\`, …)
- Backtick spans longer than 1 (`` ``code with ` inside`` ``)

The decision is recorded at `docs/DECISIONS.md:7-21`. Summary of the original rationale: (1) mdast compatibility — no Python parser produces mdast natively; (2) round-trip fidelity; (3) subset-is-sufficient; (4) zero runtime deps; (5) streaming renderer needs to understand the same constructs.

## Why reopen

Each "by design, won't fix" row accumulates risk:

- **LLM-generated content**: models frequently emit setext headings and escaped characters; these currently get rendered as raw text, breaking display on every adapter that pipes through our parser (all of them).
- **Upstream drift**: upstream remark handles these correctly. Every time upstream tests something that depends on them, we have to explicitly skip; the skip list grows.
- **Fragility**: the hand-rolled regex parser is already 880 lines and has had several correctness fixes for edge cases (see git history on `shared/markdown_parser.py`). Each new construct adds risk of subtle inline-span bugs; each fix risks regressing another construct.
- **Downstream asks**: consumers periodically ask why `\*not italic\*` doesn't render; the answer "we don't handle escaped characters" doesn't age well.

## Options

### Option A — close the gaps in the existing parser

Add setext, indented code, raw HTML, footnotes, escaped chars, and multi-backtick spans to `shared/markdown_parser.py`. Estimate: ~300–400 LOC, new edge cases to test. Keeps zero runtime deps.

### Option B — swap the backend

Evaluate `mistune` (has GFM + footnote plugins) or `markdown-it-py` (CommonMark + plugin ecosystem) with a thin translation layer to mdast. The original "no mdast from Python libs" objection stands, but the translation layer has shrunk in scope — we only need the node shapes we already emit (`paragraph`, `heading`, `code`, `list`, `listItem`, `blockquote`, `table`, `tableRow`, `tableCell`, `thematicBreak`, `emphasis`, `strong`, `delete`, `inlineCode`, `link`, `image`, `text`, `break`). A translator for those nodes is probably ~250 LOC. The parser itself shrinks to a call into the library.

Key concern with Option B: the `StreamingMarkdownRenderer` in `shared/streaming_markdown.py` needs to understand code-fence state and inline-marker state for mid-stream closure. If the parser library is used only for static parses (after-the-fact mdast), and streaming keeps its own state machine, they could diverge. Scope this carefully.

### Option C — close specific high-value gaps only

Order of ROI:
1. Escaped characters — common in LLM output; small implementation (~30 LOC).
2. Indented code blocks — common in technical content; small (~40 LOC).
3. Setext headings — common; medium (~60 LOC).
4. Multi-backtick spans — less common; small (~30 LOC).
5. Footnotes — rare in chat; defer.
6. Raw HTML — rare in chat; defer.

Option C is pragmatic: close items 1–3 in one PR, document 4–6 as remaining gaps, come back if consumers ask.

## Recommendation

Start with C. Benchmark against `mistune` in a separate spike to see whether the translator layer lands under 250 LOC; if yes, B becomes attractive for the 0.5 cycle.

## Acceptance

- [ ] Decide A / B / C (probably C, then reassess)
- [ ] If C: escaped chars + indented code + setext land with faithful tests from upstream remark test suite where available
- [ ] Remove matching rows from `docs/UPSTREAM_SYNC.md` non-parity table
- [ ] `docs/DECISIONS.md:7-21` updated to reflect the narrowed gap list
- [ ] Benchmark: parsing a 10KB mixed-content document stays under 5ms on CI hardware


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Revisit hand-rolled markdown parser: close setext / footnotes / HTML / indented-code gaps #69

Current state

Why reopen

Options

Option A — close the gaps in the existing parser

Option B — swap the backend

Option C — close specific high-value gaps only

Recommendation

Acceptance

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Revisit hand-rolled markdown parser: close setext / footnotes / HTML / indented-code gaps #69

Description

Current state

Why reopen

Options

Option A — close the gaps in the existing parser

Option B — swap the backend

Option C — close specific high-value gaps only

Recommendation

Acceptance

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions