Skip to content

Revisit hand-rolled markdown parser: close setext / footnotes / HTML / indented-code gaps #69

@patrick-chinchill

Description

@patrick-chinchill

Coverage gap — documented non-parity with upstream remark.

Current state

src/chat_sdk/shared/markdown_parser.py is a ~880-line hand-rolled regex parser producing mdast-shaped output. It covers ATX headings, fenced code blocks, thematic breaks, blockquotes, ordered/unordered lists, GFM tables, inline emphasis/strong/delete/code/links/images, hard line breaks.

What it doesn't cover (listed at docs/UPSTREAM_SYNC.md:442 as "by design, won't fix"):

  • Setext headings (===== / ----- underlines)
  • Footnotes ([^1]: ...)
  • Raw HTML nodes (block + inline)
  • Indented code blocks (4-space indent)
  • Escaped characters (\*, \[, \\, …)
  • Backtick spans longer than 1 ( code with ` inside )

The decision is recorded at docs/DECISIONS.md:7-21. Summary of the original rationale: (1) mdast compatibility — no Python parser produces mdast natively; (2) round-trip fidelity; (3) subset-is-sufficient; (4) zero runtime deps; (5) streaming renderer needs to understand the same constructs.

Why reopen

Each "by design, won't fix" row accumulates risk:

  • LLM-generated content: models frequently emit setext headings and escaped characters; these currently get rendered as raw text, breaking display on every adapter that pipes through our parser (all of them).
  • Upstream drift: upstream remark handles these correctly. Every time upstream tests something that depends on them, we have to explicitly skip; the skip list grows.
  • Fragility: the hand-rolled regex parser is already 880 lines and has had several correctness fixes for edge cases (see git history on shared/markdown_parser.py). Each new construct adds risk of subtle inline-span bugs; each fix risks regressing another construct.
  • Downstream asks: consumers periodically ask why \*not italic\* doesn't render; the answer "we don't handle escaped characters" doesn't age well.

Options

Option A — close the gaps in the existing parser

Add setext, indented code, raw HTML, footnotes, escaped chars, and multi-backtick spans to shared/markdown_parser.py. Estimate: ~300–400 LOC, new edge cases to test. Keeps zero runtime deps.

Option B — swap the backend

Evaluate mistune (has GFM + footnote plugins) or markdown-it-py (CommonMark + plugin ecosystem) with a thin translation layer to mdast. The original "no mdast from Python libs" objection stands, but the translation layer has shrunk in scope — we only need the node shapes we already emit (paragraph, heading, code, list, listItem, blockquote, table, tableRow, tableCell, thematicBreak, emphasis, strong, delete, inlineCode, link, image, text, break). A translator for those nodes is probably ~250 LOC. The parser itself shrinks to a call into the library.

Key concern with Option B: the StreamingMarkdownRenderer in shared/streaming_markdown.py needs to understand code-fence state and inline-marker state for mid-stream closure. If the parser library is used only for static parses (after-the-fact mdast), and streaming keeps its own state machine, they could diverge. Scope this carefully.

Option C — close specific high-value gaps only

Order of ROI:

  1. Escaped characters — common in LLM output; small implementation (~30 LOC).
  2. Indented code blocks — common in technical content; small (~40 LOC).
  3. Setext headings — common; medium (~60 LOC).
  4. Multi-backtick spans — less common; small (~30 LOC).
  5. Footnotes — rare in chat; defer.
  6. Raw HTML — rare in chat; defer.

Option C is pragmatic: close items 1–3 in one PR, document 4–6 as remaining gaps, come back if consumers ask.

Recommendation

Start with C. Benchmark against mistune in a separate spike to see whether the translator layer lands under 250 LOC; if yes, B becomes attractive for the 0.5 cycle.

Acceptance

  • Decide A / B / C (probably C, then reassess)
  • If C: escaped chars + indented code + setext land with faithful tests from upstream remark test suite where available
  • Remove matching rows from docs/UPSTREAM_SYNC.md non-parity table
  • docs/DECISIONS.md:7-21 updated to reflect the narrowed gap list
  • Benchmark: parsing a 10KB mixed-content document stays under 5ms on CI hardware

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions