Skip to content

spike(parser): bake-off harness for issue #69 Option B#100

Draft
patrick-chinchill wants to merge 6 commits into
mainfrom
claude/markdown-parser-swap-spike
Draft

spike(parser): bake-off harness for issue #69 Option B#100
patrick-chinchill wants to merge 6 commits into
mainfrom
claude/markdown-parser-swap-spike

Conversation

@patrick-chinchill
Copy link
Copy Markdown
Collaborator

@patrick-chinchill patrick-chinchill commented May 22, 2026

Summary

Spike for issue #69 Option B (parser swap). Bake-off harness comparing three candidate markdown libraries — mistune, markdown-it-py, marko — against the existing hand-rolled shared/markdown_parser.py.

Not wired into the runtime SDK. This PR adds:

  • A spike-parser dev dependency group (production install unaffected)
  • Per-library translators to mdast in src/chat_sdk/shared/parser_spike/
  • A divergence-report + completeness-gap test harness in tests/parser_spike/
  • A benchmark script in scripts/parser_spike/benchmark.py
  • Two fixtures: mixed_content.md (happy-path GFM, ≈3KB) and gap_cases.md (the constructs the baseline parser explicitly doesn't handle per the issue Revisit hand-rolled markdown parser: close setext / footnotes / HTML / indented-code gaps #69 non-parity list)

How to run

uv sync --group dev --group spike-parser
uv run pytest tests/parser_spike/test_mdast_parity.py -s
uv run python scripts/parser_spike/benchmark.py

Findings

Parse-and-translate time (12KB mixed corpus, 50 iterations)

parser median p95 5ms budget?
baseline (hand) 2.59ms 2.72ms
mistune 11.94ms 13.04ms ✗ (2.4× over)
markdown-it-py 13.36ms 20.64ms ✗ (2.7× over)
marko 46.62ms 49.58ms ✗ (9.3× over)

Caveat: part of the baseline's perf win is "fewer constructs to parse per byte" — setext, indented code, multi-backtick spans, escaped chars, and raw HTML all skip straight through the inline fast-paths. An apples-to-apples comparison would require extending the baseline to recognise those constructs first.

Translator LOC

library LOC 250-LOC budget
mistune 161
markdown-it-py 215
marko 152

Completeness gap on hard constructs (gap_cases.md)

Six constructs from the issue #69 non-parity list. Silent drop = construct flattened to plain text; recognised = correct mdast node type emitted.

construct baseline mistune markdown-it-py marko
setext heading silent drop recognised recognised recognised
indented code block silent drop recognised recognised recognised
task list item recognised¹ silent drop recognised recognised
footnote definition silent drop silent drop silent drop² silent drop
inline HTML silent drop silent drop silent drop silent drop
definition list silent drop silent drop silent drop silent drop
silent-drop count 5 4 3 3

¹ Baseline matches - [x] as a list item but doesn't extract the checkbox state.
² markdown-it-py supports footnotes via mdit-py-plugins (not pulled in by the spike); enabling it would drop the count to 2.

The baseline is strictly worse on completeness than every candidate.

Happy-path mdast fidelity (mixed_content.md)

library divergences
mistune 26
markdown-it-py 24
marko 27

Most "divergences" are cases where the baseline is less mdast-spec-compliant than the candidates (soft-break splitting, trailing-newline handling on code blocks, link-followed-by-text splitting). A library swap would also fix these baseline correctness bugs.

The one real candidate-side bug surfaced was marko losing GFM table alignment metadata — translator fix.

What this changes about the strategy decision

The spike data argues against a clean recommendation in either direction:

  • If 10ms median parse time is acceptable (likely true for chat streaming where LLM token latency dwarfs this), Option B with markdown-it-py is the cleanest path: best completeness score, only ~1.5ms slower than mistune, mdit-py-plugins available for footnotes later. One PR, one dep, both gap lists close.
  • If zero-dep core is sacred, the only path is Option D (split): close baseline gaps in-tree and port remend separately. Higher total effort but preserves perf and install footprint.
  • Option C (selective parser fixes only, the original framing) leaves the streaming-side bugs from the #69 follow-up comment unaddressed; should be ruled out unless paired with a _remend fix.

Full analysis in src/chat_sdk/shared/parser_spike/README.md.

Test plan

  • uv run pytest tests/parser_spike/ -q — 9 pass
  • uv run pytest tests/ --ignore=tests/parser_spike -q — 3,581 pass, 1 pre-existing failure (test_github_webhook.py::TestGitHubAdapterConstructor::test_throws_when_no_auth, unrelated)
  • uv run ruff check src/chat_sdk/shared/parser_spike/ tests/parser_spike/ scripts/parser_spike/ — passes
  • uv run ruff format --check ... — formatted
  • uv run python scripts/parser_spike/benchmark.py — produces the timing/LOC table

Scope notes

Related: #99 (the surgical fix for the three streaming bugs the spike data also informs).

Three candidate markdown libraries evaluated as drop-in replacements
for the hand-rolled `shared/markdown_parser.py`: mistune, markdown-it-py,
marko. Each gets a thin translator (~150-215 LOC) to the mdast dict
shape produced by the existing parser. Not wired into the runtime SDK
-- this directory exists purely so the harness can run them
side-by-side.

Components:
- `pyproject.toml`: new `spike-parser` dependency group with the three
  candidates pinned. Production install is unaffected.
- `src/chat_sdk/shared/parser_spike/`: per-library translators sharing
  the mdast `make_*` constructors from `markdown_parser.py`. README
  documents the spike findings inline.
- `tests/parser_spike/`: mdast parity bake-off + fixture corpus
  (~3KB mixed-content markdown exercising every node type). Reports
  divergences vs baseline without failing.
- `scripts/parser_spike/benchmark.py`: parse-and-translate timings
  across the four parsers on a ~12KB scaled corpus, plus translator
  LOC budget check.

Current findings (sample run):
- baseline (hand): 2.59ms median -- meets the 5ms target by 2x.
- mistune: 11.94ms median, 161 translator LOC, 26 divergences.
- markdown-it-py: 13.36ms median, 215 LOC, 24 divergences.
- marko: 46.62ms median, 152 LOC, 27 divergences (incl. dropped
  table alignment -- translator bug).

The 5ms perf budget set in the issue acceptance criteria is met only
by the baseline. mdast fidelity is roughly equivalent across the three
candidates -- and most "divergences" are baseline non-compliance the
candidates correct. Suggests a re-framed Option D (keep the fast
parser, port upstream remend for the streaming side) -- spelled out
in the spike README.

Numbers are reproducible: `uv run python scripts/parser_spike/benchmark.py`
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 22, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: eb634c1c-7ae1-4896-8a6d-0a8f0918011a

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a spike to evaluate replacing the current hand-rolled markdown parser with a library candidate (mistune, markdown-it-py, or marko). It includes translators for each library, a benchmarking script, and a parity test suite. Feedback identifies an inaccuracy in the line-of-code measurement logic regarding docstrings, unused variables in the markdown-it translator, and a fragile metadata delimiter that could fail on certain URLs or titles.

Comment thread scripts/parser_spike/benchmark.py Outdated
Comment on lines +61 to +62
code_lines = [line for line in text.splitlines() if line.strip() and not line.strip().startswith("#")]
# Subtract docstrings as a rough heuristic.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The LOC calculation heuristic does not actually subtract docstrings as stated in the comment. It only filters out lines that are entirely comments (starting with #). This means multi-line docstrings (e.g., """...""") are still counted in the LOC metric, which might lead to an overestimation of the translator complexity relative to the 250-LOC budget mentioned in the requirements.

Comment on lines +137 to +175
align: list[str | None] = []
in_header = False
header_aligns: list[str | None] = []
current_row: list[Content] = []
current_aligns: list[str | None] = []

while i < len(tokens):
tok = tokens[i]
if tok.type == "table_close":
return make_table(rows, align=header_aligns if any(header_aligns) else None), i + 1
if tok.type == "thead_open":
in_header = True
elif tok.type == "thead_close":
in_header = False
elif tok.type == "tr_open":
current_row = []
current_aligns = []
elif tok.type == "tr_close":
rows.append(make_table_row(current_row))
if in_header:
header_aligns = current_aligns
align = current_aligns
elif tok.type in ("th_open", "td_open"):
style = (tok.attrs or {}).get("style", "")
cell_align: str | None = None
if isinstance(style, str):
if "text-align:left" in style:
cell_align = "left"
elif "text-align:center" in style:
cell_align = "center"
elif "text-align:right" in style:
cell_align = "right"
current_aligns.append(cell_align)
inline = tokens[i + 1]
current_row.append(make_table_cell(_translate_inline(inline.children or [])))
i += 3 # th/td_open, inline, th/td_close
continue
i += 1
_ = align # quiet unused-var without changing semantics
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The variable align is unused. The table alignment is correctly handled by header_aligns (which is extracted from the thead section). You can safely remove the declaration on line 137, the assignment on line 158, and the linter-quieting line 175.

_, parent = stack.pop()
current = parent # type: ignore[assignment]
payload = meta_kind.removeprefix("_link_meta:")
href, _, title = payload.partition("|")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The use of | as a separator for link metadata is fragile because both URLs and titles can contain the pipe character. This could lead to incorrect parsing of links during the bake-off, skewing the fidelity results. Consider using a non-printable character like \0 as a delimiter on line 223 and here on line 231.

…eport

The original happy-path fixture only exercised constructs the baseline
parser handles, so divergence counts (24-27) measured "structural
differences on constructs we both parse" -- not the real cost of using
the baseline. That framing undersold the library candidates.

New `gap_cases.md` fixture exercises six constructs from the issue #69
non-parity list: setext headings, indented code blocks, footnote
definitions, inline HTML, task list items, definition lists. A new
`test_report_completeness_gap` walks each parser's output and reports
whether each construct was recognised (correct mdast node type) or
silently dropped to plain text.

Findings on the gap fixture:

  construct             baseline  mistune  markdown-it-py  marko
  setext heading        DROP      OK       OK              OK
  indented code         DROP      OK       OK              OK
  task list item        OK*       DROP     OK              OK
  footnote definition   DROP      DROP     DROP**          DROP
  inline HTML           DROP      DROP     DROP            DROP
  definition list       DROP      DROP     DROP            DROP
  silent-drop count     5         4        3               3

  * Baseline matches `- [x]` as a list item but doesn't extract
    checkbox state.
  ** markdown-it-py supports footnotes via mdit-py-plugins (not in the
     spike); enabling it would drop the count to 2.

The baseline is strictly worse on completeness than every candidate.
Part of its 2.59ms perf win is "fewer constructs to parse per byte" --
setext, indented code, multi-backtick, escaped chars, raw HTML all
skip straight through the inline fast-paths.

README updated to reflect the corrected framing: three options now
(A: in-tree gap closure, B: library swap, D: split the problem) with
markdown-it-py promoted to preferred Option B candidate based on the
completeness score. Option C (selective parser fixes only) ruled out
unless paired with a separate _remend fix.
claude added 4 commits May 22, 2026 06:37
The spike concluded with Option A (PRs #99 + #101) being the right
call for the SDK's current chat-only scope. Documenting the concrete
triggers under which that conclusion should be re-examined:

- Non-chat input source (RAG, user-authored memory, markdown-file
  ingestion, GitHub-body parsing -- anywhere humans write the input)
- Long-form artifact output surface (citations + footnotes; math
  rendering; one-shot parse where latency is less critical)
- Web rendering surface (port of upstream `@chat-adapter/web`)
- New chat platform demanding richer parsing

Plus a playbook for re-running the bake-off (author fixture, run
parity + benchmark harness, compare silent-drop count / parse time)
and a decision threshold for promoting markdown-it-py into runtime
if a trigger materialises.

Spot-checked upstream's `packages/` at the time of writing: only
`adapter-web` is the relevant new package (added in v4.27, Python
port deferred); no artifact / RAG / ingestion packages exist
upstream. All triggers are forward-looking.
Three medium-priority findings from gemini-code-assist:

1. **Unused `align` variable** in `_consume_table`. Removed the
   variable, its assignment, and the linter-quieting `_ = align`
   line. `header_aligns` already correctly carries the table-level
   alignment from the `<thead>` section.

2. **Fragile `|` separator for link metadata** in `_translate_inline`.
   The stack stored link href and title as a single string with `|`
   between them; a URL or title containing `|` would corrupt parsing.
   Replaced with a proper tuple-typed stack frame:
   `tuple[list[Content], tuple[str, str | None] | None]` -- the meta
   is `None` for plain containers (strong/emphasis/delete) and an
   `(href, title)` tuple for links. Removes the partition/format
   round-trip entirely.

3. **LOC heuristic mis-comment**. The previous comment said it
   subtracted docstrings but the implementation only filtered blank
   and comment lines. Replaced with a proper `ast`-based docstring
   exclusion that walks Module / ClassDef / FunctionDef bodies for
   `Expr(Constant(str))` first-statement patterns and excludes their
   line ranges. New LOC numbers (smaller, more honest):

   mistune        149 (was 161 — 12 lines were docstrings)
   markdown-it-py 194 (was 215 — 21 lines)
   marko          147 (was 152 — 5 lines)

   All still under the 250-LOC budget; conclusion unchanged. README
   updated with the new numbers.

Spike tests still pass (9/9), and the benchmark still runs.
Holistic re-review blocker: the spike test module imported mistune /
markdown_it / marko at module top-level, but the default CI test job
runs `uv sync --group dev` (NOT `--group spike-parser`) then
`pytest tests/`. Pytest collected `tests/parser_spike/` and failed
with ModuleNotFoundError -- confirmed red on the PR (`test (3.12)`
failed, others cancelled by fail-fast).

Added `pytest.importorskip(...)` for all three candidate libraries at
the top of the test module, before the translator imports. In the
default dev-only env the module now skips cleanly (verified:
`uv sync --group dev` then `pytest tests/` -> 3,581 passed, 12
skipped, 0 collection errors); with the spike-parser group installed
it runs all 9 tests as before.

conftest.py only imports pathlib + pytest (no candidate libs), so it
needs no guard.
Final-pass subagent review surfaced 3 pyrefly errors in the spike
translators that the lint workflow enforces (currently masked because
the PR is still draft and the lint job is draft-gated). Fixing them
so the PR is mergeable if the team decides to keep the spike rather
than close-and-archive.

Errors fixed:

1. `markdown_it_translator.py:225` (link_open). `attrs.get("title")`
   returned `float | int | str | None` per markdown-it's loose typing,
   but the stack frame is typed as `tuple[str, str | None]`. Coerce
   via `str(raw_title) if raw_title is not None else None` before
   stuffing into the tuple.

2. `markdown_it_translator.py:238` (image). Same root cause: title
   passed to `make_image(title=...)` (typed `str | None`) was the raw
   `attrs.get("title")` result. Same coercion.

3. `mistune_translator.py:56` (parse_markdown). mistune's `parse()`
   returns `list[dict | str]` -- the rare `str` token is a bare lazy
   text payload. The list comprehension fed it to `_translate_block`
   which expects `dict`. Replaced with an explicit loop that
   distinguishes the two cases: dict tokens go through the block
   translator; string tokens are lifted into `paragraph(text(...))`
   so they're not silently dropped.

`uv run pyrefly check`: 0 errors (was 3). All 9 spike tests pass
unchanged. Full suite: 3,590 pass, 1 pre-existing unrelated failure.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Revisit hand-rolled markdown parser: close setext / footnotes / HTML / indented-code gaps

2 participants