spike(parser): bake-off harness for issue #69 Option B#100
spike(parser): bake-off harness for issue #69 Option B#100patrick-chinchill wants to merge 6 commits into
Conversation
Three candidate markdown libraries evaluated as drop-in replacements for the hand-rolled `shared/markdown_parser.py`: mistune, markdown-it-py, marko. Each gets a thin translator (~150-215 LOC) to the mdast dict shape produced by the existing parser. Not wired into the runtime SDK -- this directory exists purely so the harness can run them side-by-side. Components: - `pyproject.toml`: new `spike-parser` dependency group with the three candidates pinned. Production install is unaffected. - `src/chat_sdk/shared/parser_spike/`: per-library translators sharing the mdast `make_*` constructors from `markdown_parser.py`. README documents the spike findings inline. - `tests/parser_spike/`: mdast parity bake-off + fixture corpus (~3KB mixed-content markdown exercising every node type). Reports divergences vs baseline without failing. - `scripts/parser_spike/benchmark.py`: parse-and-translate timings across the four parsers on a ~12KB scaled corpus, plus translator LOC budget check. Current findings (sample run): - baseline (hand): 2.59ms median -- meets the 5ms target by 2x. - mistune: 11.94ms median, 161 translator LOC, 26 divergences. - markdown-it-py: 13.36ms median, 215 LOC, 24 divergences. - marko: 46.62ms median, 152 LOC, 27 divergences (incl. dropped table alignment -- translator bug). The 5ms perf budget set in the issue acceptance criteria is met only by the baseline. mdast fidelity is roughly equivalent across the three candidates -- and most "divergences" are baseline non-compliance the candidates correct. Suggests a re-framed Option D (keep the fast parser, port upstream remend for the streaming side) -- spelled out in the spike README. Numbers are reproducible: `uv run python scripts/parser_spike/benchmark.py`
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Plus Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a spike to evaluate replacing the current hand-rolled markdown parser with a library candidate (mistune, markdown-it-py, or marko). It includes translators for each library, a benchmarking script, and a parity test suite. Feedback identifies an inaccuracy in the line-of-code measurement logic regarding docstrings, unused variables in the markdown-it translator, and a fragile metadata delimiter that could fail on certain URLs or titles.
| code_lines = [line for line in text.splitlines() if line.strip() and not line.strip().startswith("#")] | ||
| # Subtract docstrings as a rough heuristic. |
There was a problem hiding this comment.
The LOC calculation heuristic does not actually subtract docstrings as stated in the comment. It only filters out lines that are entirely comments (starting with #). This means multi-line docstrings (e.g., """...""") are still counted in the LOC metric, which might lead to an overestimation of the translator complexity relative to the 250-LOC budget mentioned in the requirements.
| align: list[str | None] = [] | ||
| in_header = False | ||
| header_aligns: list[str | None] = [] | ||
| current_row: list[Content] = [] | ||
| current_aligns: list[str | None] = [] | ||
|
|
||
| while i < len(tokens): | ||
| tok = tokens[i] | ||
| if tok.type == "table_close": | ||
| return make_table(rows, align=header_aligns if any(header_aligns) else None), i + 1 | ||
| if tok.type == "thead_open": | ||
| in_header = True | ||
| elif tok.type == "thead_close": | ||
| in_header = False | ||
| elif tok.type == "tr_open": | ||
| current_row = [] | ||
| current_aligns = [] | ||
| elif tok.type == "tr_close": | ||
| rows.append(make_table_row(current_row)) | ||
| if in_header: | ||
| header_aligns = current_aligns | ||
| align = current_aligns | ||
| elif tok.type in ("th_open", "td_open"): | ||
| style = (tok.attrs or {}).get("style", "") | ||
| cell_align: str | None = None | ||
| if isinstance(style, str): | ||
| if "text-align:left" in style: | ||
| cell_align = "left" | ||
| elif "text-align:center" in style: | ||
| cell_align = "center" | ||
| elif "text-align:right" in style: | ||
| cell_align = "right" | ||
| current_aligns.append(cell_align) | ||
| inline = tokens[i + 1] | ||
| current_row.append(make_table_cell(_translate_inline(inline.children or []))) | ||
| i += 3 # th/td_open, inline, th/td_close | ||
| continue | ||
| i += 1 | ||
| _ = align # quiet unused-var without changing semantics |
| _, parent = stack.pop() | ||
| current = parent # type: ignore[assignment] | ||
| payload = meta_kind.removeprefix("_link_meta:") | ||
| href, _, title = payload.partition("|") |
There was a problem hiding this comment.
The use of | as a separator for link metadata is fragile because both URLs and titles can contain the pipe character. This could lead to incorrect parsing of links during the bake-off, skewing the fidelity results. Consider using a non-printable character like \0 as a delimiter on line 223 and here on line 231.
…eport The original happy-path fixture only exercised constructs the baseline parser handles, so divergence counts (24-27) measured "structural differences on constructs we both parse" -- not the real cost of using the baseline. That framing undersold the library candidates. New `gap_cases.md` fixture exercises six constructs from the issue #69 non-parity list: setext headings, indented code blocks, footnote definitions, inline HTML, task list items, definition lists. A new `test_report_completeness_gap` walks each parser's output and reports whether each construct was recognised (correct mdast node type) or silently dropped to plain text. Findings on the gap fixture: construct baseline mistune markdown-it-py marko setext heading DROP OK OK OK indented code DROP OK OK OK task list item OK* DROP OK OK footnote definition DROP DROP DROP** DROP inline HTML DROP DROP DROP DROP definition list DROP DROP DROP DROP silent-drop count 5 4 3 3 * Baseline matches `- [x]` as a list item but doesn't extract checkbox state. ** markdown-it-py supports footnotes via mdit-py-plugins (not in the spike); enabling it would drop the count to 2. The baseline is strictly worse on completeness than every candidate. Part of its 2.59ms perf win is "fewer constructs to parse per byte" -- setext, indented code, multi-backtick, escaped chars, raw HTML all skip straight through the inline fast-paths. README updated to reflect the corrected framing: three options now (A: in-tree gap closure, B: library swap, D: split the problem) with markdown-it-py promoted to preferred Option B candidate based on the completeness score. Option C (selective parser fixes only) ruled out unless paired with a separate _remend fix.
The spike concluded with Option A (PRs #99 + #101) being the right call for the SDK's current chat-only scope. Documenting the concrete triggers under which that conclusion should be re-examined: - Non-chat input source (RAG, user-authored memory, markdown-file ingestion, GitHub-body parsing -- anywhere humans write the input) - Long-form artifact output surface (citations + footnotes; math rendering; one-shot parse where latency is less critical) - Web rendering surface (port of upstream `@chat-adapter/web`) - New chat platform demanding richer parsing Plus a playbook for re-running the bake-off (author fixture, run parity + benchmark harness, compare silent-drop count / parse time) and a decision threshold for promoting markdown-it-py into runtime if a trigger materialises. Spot-checked upstream's `packages/` at the time of writing: only `adapter-web` is the relevant new package (added in v4.27, Python port deferred); no artifact / RAG / ingestion packages exist upstream. All triggers are forward-looking.
Three medium-priority findings from gemini-code-assist: 1. **Unused `align` variable** in `_consume_table`. Removed the variable, its assignment, and the linter-quieting `_ = align` line. `header_aligns` already correctly carries the table-level alignment from the `<thead>` section. 2. **Fragile `|` separator for link metadata** in `_translate_inline`. The stack stored link href and title as a single string with `|` between them; a URL or title containing `|` would corrupt parsing. Replaced with a proper tuple-typed stack frame: `tuple[list[Content], tuple[str, str | None] | None]` -- the meta is `None` for plain containers (strong/emphasis/delete) and an `(href, title)` tuple for links. Removes the partition/format round-trip entirely. 3. **LOC heuristic mis-comment**. The previous comment said it subtracted docstrings but the implementation only filtered blank and comment lines. Replaced with a proper `ast`-based docstring exclusion that walks Module / ClassDef / FunctionDef bodies for `Expr(Constant(str))` first-statement patterns and excludes their line ranges. New LOC numbers (smaller, more honest): mistune 149 (was 161 — 12 lines were docstrings) markdown-it-py 194 (was 215 — 21 lines) marko 147 (was 152 — 5 lines) All still under the 250-LOC budget; conclusion unchanged. README updated with the new numbers. Spike tests still pass (9/9), and the benchmark still runs.
Holistic re-review blocker: the spike test module imported mistune / markdown_it / marko at module top-level, but the default CI test job runs `uv sync --group dev` (NOT `--group spike-parser`) then `pytest tests/`. Pytest collected `tests/parser_spike/` and failed with ModuleNotFoundError -- confirmed red on the PR (`test (3.12)` failed, others cancelled by fail-fast). Added `pytest.importorskip(...)` for all three candidate libraries at the top of the test module, before the translator imports. In the default dev-only env the module now skips cleanly (verified: `uv sync --group dev` then `pytest tests/` -> 3,581 passed, 12 skipped, 0 collection errors); with the spike-parser group installed it runs all 9 tests as before. conftest.py only imports pathlib + pytest (no candidate libs), so it needs no guard.
Final-pass subagent review surfaced 3 pyrefly errors in the spike
translators that the lint workflow enforces (currently masked because
the PR is still draft and the lint job is draft-gated). Fixing them
so the PR is mergeable if the team decides to keep the spike rather
than close-and-archive.
Errors fixed:
1. `markdown_it_translator.py:225` (link_open). `attrs.get("title")`
returned `float | int | str | None` per markdown-it's loose typing,
but the stack frame is typed as `tuple[str, str | None]`. Coerce
via `str(raw_title) if raw_title is not None else None` before
stuffing into the tuple.
2. `markdown_it_translator.py:238` (image). Same root cause: title
passed to `make_image(title=...)` (typed `str | None`) was the raw
`attrs.get("title")` result. Same coercion.
3. `mistune_translator.py:56` (parse_markdown). mistune's `parse()`
returns `list[dict | str]` -- the rare `str` token is a bare lazy
text payload. The list comprehension fed it to `_translate_block`
which expects `dict`. Replaced with an explicit loop that
distinguishes the two cases: dict tokens go through the block
translator; string tokens are lifted into `paragraph(text(...))`
so they're not silently dropped.
`uv run pyrefly check`: 0 errors (was 3). All 9 spike tests pass
unchanged. Full suite: 3,590 pass, 1 pre-existing unrelated failure.
Summary
Spike for issue #69 Option B (parser swap). Bake-off harness comparing three candidate markdown libraries —
mistune,markdown-it-py,marko— against the existing hand-rolledshared/markdown_parser.py.Not wired into the runtime SDK. This PR adds:
spike-parserdev dependency group (production install unaffected)src/chat_sdk/shared/parser_spike/tests/parser_spike/scripts/parser_spike/benchmark.pymixed_content.md(happy-path GFM, ≈3KB) andgap_cases.md(the constructs the baseline parser explicitly doesn't handle per the issue Revisit hand-rolled markdown parser: close setext / footnotes / HTML / indented-code gaps #69 non-parity list)How to run
Findings
Parse-and-translate time (12KB mixed corpus, 50 iterations)
Caveat: part of the baseline's perf win is "fewer constructs to parse per byte" — setext, indented code, multi-backtick spans, escaped chars, and raw HTML all skip straight through the inline fast-paths. An apples-to-apples comparison would require extending the baseline to recognise those constructs first.
Translator LOC
Completeness gap on hard constructs (
gap_cases.md)Six constructs from the issue #69 non-parity list. Silent drop = construct flattened to plain text; recognised = correct mdast node type emitted.
¹ Baseline matches
- [x]as a list item but doesn't extract the checkbox state.² markdown-it-py supports footnotes via
mdit-py-plugins(not pulled in by the spike); enabling it would drop the count to 2.The baseline is strictly worse on completeness than every candidate.
Happy-path mdast fidelity (
mixed_content.md)Most "divergences" are cases where the baseline is less mdast-spec-compliant than the candidates (soft-break splitting, trailing-newline handling on code blocks, link-followed-by-text splitting). A library swap would also fix these baseline correctness bugs.
The one real candidate-side bug surfaced was marko losing GFM table alignment metadata — translator fix.
What this changes about the strategy decision
The spike data argues against a clean recommendation in either direction:
markdown-it-pyis the cleanest path: best completeness score, only ~1.5ms slower than mistune,mdit-py-pluginsavailable for footnotes later. One PR, one dep, both gap lists close.remendseparately. Higher total effort but preserves perf and install footprint._remendfix.Full analysis in
src/chat_sdk/shared/parser_spike/README.md.Test plan
uv run pytest tests/parser_spike/ -q— 9 passuv run pytest tests/ --ignore=tests/parser_spike -q— 3,581 pass, 1 pre-existing failure (test_github_webhook.py::TestGitHubAdapterConstructor::test_throws_when_no_auth, unrelated)uv run ruff check src/chat_sdk/shared/parser_spike/ tests/parser_spike/ scripts/parser_spike/— passesuv run ruff format --check ...— formatteduv run python scripts/parser_spike/benchmark.py— produces the timing/LOC tableScope notes
spike-parserdependency group, notdependenciesor any extras.pip install chat-sdkstill pulls zero deps.Related: #99 (the surgical fix for the three streaming bugs the spike data also informs).