spike(parser): bake-off harness for issue #69 Option B by patrick-chinchill · Pull Request #100 · Chinchill-AI/chat-sdk-python

patrick-chinchill · 2026-05-22T04:20:52Z

Summary

Spike for issue #69 Option B (parser swap). Bake-off harness comparing three candidate markdown libraries — mistune, markdown-it-py, marko — against the existing hand-rolled shared/markdown_parser.py.

Not wired into the runtime SDK. This PR adds:

A spike-parser dev dependency group (production install unaffected)
Per-library translators to mdast in src/chat_sdk/shared/parser_spike/
A divergence-report + completeness-gap test harness in tests/parser_spike/
A benchmark script in scripts/parser_spike/benchmark.py
Two fixtures: mixed_content.md (happy-path GFM, ≈3KB) and gap_cases.md (the constructs the baseline parser explicitly doesn't handle per the issue Revisit hand-rolled markdown parser: close setext / footnotes / HTML / indented-code gaps #69 non-parity list)

How to run

uv sync --group dev --group spike-parser
uv run pytest tests/parser_spike/test_mdast_parity.py -s
uv run python scripts/parser_spike/benchmark.py

Findings

Parse-and-translate time (12KB mixed corpus, 50 iterations)

parser	median	p95	5ms budget?
baseline (hand)	2.59ms	2.72ms	✓
mistune	11.94ms	13.04ms	✗ (2.4× over)
markdown-it-py	13.36ms	20.64ms	✗ (2.7× over)
marko	46.62ms	49.58ms	✗ (9.3× over)

Caveat: part of the baseline's perf win is "fewer constructs to parse per byte" — setext, indented code, multi-backtick spans, escaped chars, and raw HTML all skip straight through the inline fast-paths. An apples-to-apples comparison would require extending the baseline to recognise those constructs first.

Translator LOC

library	LOC	250-LOC budget
mistune	161	✓
markdown-it-py	215	✓
marko	152	✓

Completeness gap on hard constructs (`gap_cases.md`)

Six constructs from the issue #69 non-parity list. Silent drop = construct flattened to plain text; recognised = correct mdast node type emitted.

construct	baseline	mistune	markdown-it-py	marko
setext heading	silent drop	recognised	recognised	recognised
indented code block	silent drop	recognised	recognised	recognised
task list item	recognised¹	silent drop	recognised	recognised
footnote definition	silent drop	silent drop	silent drop²	silent drop
inline HTML	silent drop	silent drop	silent drop	silent drop
definition list	silent drop	silent drop	silent drop	silent drop
silent-drop count	5	4	3	3

¹ Baseline matches - [x] as a list item but doesn't extract the checkbox state.
² markdown-it-py supports footnotes via mdit-py-plugins (not pulled in by the spike); enabling it would drop the count to 2.

The baseline is strictly worse on completeness than every candidate.

Happy-path mdast fidelity (`mixed_content.md`)

library	divergences
mistune	26
markdown-it-py	24
marko	27

Most "divergences" are cases where the baseline is less mdast-spec-compliant than the candidates (soft-break splitting, trailing-newline handling on code blocks, link-followed-by-text splitting). A library swap would also fix these baseline correctness bugs.

The one real candidate-side bug surfaced was marko losing GFM table alignment metadata — translator fix.

What this changes about the strategy decision

The spike data argues against a clean recommendation in either direction:

If 10ms median parse time is acceptable (likely true for chat streaming where LLM token latency dwarfs this), Option B with markdown-it-py is the cleanest path: best completeness score, only ~1.5ms slower than mistune, mdit-py-plugins available for footnotes later. One PR, one dep, both gap lists close.
If zero-dep core is sacred, the only path is Option D (split): close baseline gaps in-tree and port remend separately. Higher total effort but preserves perf and install footprint.
Option C (selective parser fixes only, the original framing) leaves the streaming-side bugs from the #69 follow-up comment unaddressed; should be ruled out unless paired with a _remend fix.

Full analysis in src/chat_sdk/shared/parser_spike/README.md.

Test plan

uv run pytest tests/parser_spike/ -q — 9 pass
uv run pytest tests/ --ignore=tests/parser_spike -q — 3,581 pass, 1 pre-existing failure (test_github_webhook.py::TestGitHubAdapterConstructor::test_throws_when_no_auth, unrelated)
uv run ruff check src/chat_sdk/shared/parser_spike/ tests/parser_spike/ scripts/parser_spike/ — passes
uv run ruff format --check ... — formatted
uv run python scripts/parser_spike/benchmark.py — produces the timing/LOC table

Scope notes

No runtime imports: the spike modules are never imported by the SDK.
Production install unaffected: the three libraries are in a spike-parser dependency group, not dependencies or any extras. pip install chat-sdk still pulls zero deps.
Doesn't close Revisit hand-rolled markdown parser: close setext / footnotes / HTML / indented-code gaps #69: this provides the data for the strategy decision, not the decision itself.

Related: #99 (the surgical fix for the three streaming bugs the spike data also informs).

Three candidate markdown libraries evaluated as drop-in replacements for the hand-rolled `shared/markdown_parser.py`: mistune, markdown-it-py, marko. Each gets a thin translator (~150-215 LOC) to the mdast dict shape produced by the existing parser. Not wired into the runtime SDK -- this directory exists purely so the harness can run them side-by-side. Components: - `pyproject.toml`: new `spike-parser` dependency group with the three candidates pinned. Production install is unaffected. - `src/chat_sdk/shared/parser_spike/`: per-library translators sharing the mdast `make_*` constructors from `markdown_parser.py`. README documents the spike findings inline. - `tests/parser_spike/`: mdast parity bake-off + fixture corpus (~3KB mixed-content markdown exercising every node type). Reports divergences vs baseline without failing. - `scripts/parser_spike/benchmark.py`: parse-and-translate timings across the four parsers on a ~12KB scaled corpus, plus translator LOC budget check. Current findings (sample run): - baseline (hand): 2.59ms median -- meets the 5ms target by 2x. - mistune: 11.94ms median, 161 translator LOC, 26 divergences. - markdown-it-py: 13.36ms median, 215 LOC, 24 divergences. - marko: 46.62ms median, 152 LOC, 27 divergences (incl. dropped table alignment -- translator bug). The 5ms perf budget set in the issue acceptance criteria is met only by the baseline. mdast fidelity is roughly equivalent across the three candidates -- and most "divergences" are baseline non-compliance the candidates correct. Suggests a re-framed Option D (keep the fast parser, port upstream remend for the streaming side) -- spelled out in the spike README. Numbers are reproducible: `uv run python scripts/parser_spike/benchmark.py`

coderabbitai · 2026-05-22T04:20:58Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: eb634c1c-7ae1-4896-8a6d-0a8f0918011a

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request introduces a spike to evaluate replacing the current hand-rolled markdown parser with a library candidate (mistune, markdown-it-py, or marko). It includes translators for each library, a benchmarking script, and a parity test suite. Feedback identifies an inaccuracy in the line-of-code measurement logic regarding docstrings, unused variables in the markdown-it translator, and a fragile metadata delimiter that could fail on certain URLs or titles.

gemini-code-assist · 2026-05-22T04:27:56Z

+        code_lines = [line for line in text.splitlines() if line.strip() and not line.strip().startswith("#")]
+        # Subtract docstrings as a rough heuristic.


The LOC calculation heuristic does not actually subtract docstrings as stated in the comment. It only filters out lines that are entirely comments (starting with #). This means multi-line docstrings (e.g., """...""") are still counted in the LOC metric, which might lead to an overestimation of the translator complexity relative to the 250-LOC budget mentioned in the requirements.

gemini-code-assist · 2026-05-22T04:27:56Z

+    align: list[str | None] = []
+    in_header = False
+    header_aligns: list[str | None] = []
+    current_row: list[Content] = []
+    current_aligns: list[str | None] = []
+
+    while i < len(tokens):
+        tok = tokens[i]
+        if tok.type == "table_close":
+            return make_table(rows, align=header_aligns if any(header_aligns) else None), i + 1
+        if tok.type == "thead_open":
+            in_header = True
+        elif tok.type == "thead_close":
+            in_header = False
+        elif tok.type == "tr_open":
+            current_row = []
+            current_aligns = []
+        elif tok.type == "tr_close":
+            rows.append(make_table_row(current_row))
+            if in_header:
+                header_aligns = current_aligns
+            align = current_aligns
+        elif tok.type in ("th_open", "td_open"):
+            style = (tok.attrs or {}).get("style", "")
+            cell_align: str | None = None
+            if isinstance(style, str):
+                if "text-align:left" in style:
+                    cell_align = "left"
+                elif "text-align:center" in style:
+                    cell_align = "center"
+                elif "text-align:right" in style:
+                    cell_align = "right"
+            current_aligns.append(cell_align)
+            inline = tokens[i + 1]
+            current_row.append(make_table_cell(_translate_inline(inline.children or [])))
+            i += 3  # th/td_open, inline, th/td_close
+            continue
+        i += 1
+    _ = align  # quiet unused-var without changing semantics


The variable align is unused. The table alignment is correctly handled by header_aligns (which is extracted from the thead section). You can safely remove the declaration on line 137, the assignment on line 158, and the linter-quieting line 175.

gemini-code-assist · 2026-05-22T04:27:56Z

+            _, parent = stack.pop()
+            current = parent  # type: ignore[assignment]
+            payload = meta_kind.removeprefix("_link_meta:")
+            href, _, title = payload.partition("|")


The use of | as a separator for link metadata is fragile because both URLs and titles can contain the pipe character. This could lead to incorrect parsing of links during the bake-off, skewing the fidelity results. Consider using a non-printable character like \0 as a delimiter on line 223 and here on line 231.

…eport The original happy-path fixture only exercised constructs the baseline parser handles, so divergence counts (24-27) measured "structural differences on constructs we both parse" -- not the real cost of using the baseline. That framing undersold the library candidates. New `gap_cases.md` fixture exercises six constructs from the issue #69 non-parity list: setext headings, indented code blocks, footnote definitions, inline HTML, task list items, definition lists. A new `test_report_completeness_gap` walks each parser's output and reports whether each construct was recognised (correct mdast node type) or silently dropped to plain text. Findings on the gap fixture: construct baseline mistune markdown-it-py marko setext heading DROP OK OK OK indented code DROP OK OK OK task list item OK* DROP OK OK footnote definition DROP DROP DROP** DROP inline HTML DROP DROP DROP DROP definition list DROP DROP DROP DROP silent-drop count 5 4 3 3 * Baseline matches `- [x]` as a list item but doesn't extract checkbox state. ** markdown-it-py supports footnotes via mdit-py-plugins (not in the spike); enabling it would drop the count to 2. The baseline is strictly worse on completeness than every candidate. Part of its 2.59ms perf win is "fewer constructs to parse per byte" -- setext, indented code, multi-backtick, escaped chars, raw HTML all skip straight through the inline fast-paths. README updated to reflect the corrected framing: three options now (A: in-tree gap closure, B: library swap, D: split the problem) with markdown-it-py promoted to preferred Option B candidate based on the completeness score. Option C (selective parser fixes only) ruled out unless paired with a separate _remend fix.

The spike concluded with Option A (PRs #99 + #101) being the right call for the SDK's current chat-only scope. Documenting the concrete triggers under which that conclusion should be re-examined: - Non-chat input source (RAG, user-authored memory, markdown-file ingestion, GitHub-body parsing -- anywhere humans write the input) - Long-form artifact output surface (citations + footnotes; math rendering; one-shot parse where latency is less critical) - Web rendering surface (port of upstream `@chat-adapter/web`) - New chat platform demanding richer parsing Plus a playbook for re-running the bake-off (author fixture, run parity + benchmark harness, compare silent-drop count / parse time) and a decision threshold for promoting markdown-it-py into runtime if a trigger materialises. Spot-checked upstream's `packages/` at the time of writing: only `adapter-web` is the relevant new package (added in v4.27, Python port deferred); no artifact / RAG / ingestion packages exist upstream. All triggers are forward-looking.

Three medium-priority findings from gemini-code-assist: 1. **Unused `align` variable** in `_consume_table`. Removed the variable, its assignment, and the linter-quieting `_ = align` line. `header_aligns` already correctly carries the table-level alignment from the `<thead>` section. 2. **Fragile `|` separator for link metadata** in `_translate_inline`. The stack stored link href and title as a single string with `|` between them; a URL or title containing `|` would corrupt parsing. Replaced with a proper tuple-typed stack frame: `tuple[list[Content], tuple[str, str | None] | None]` -- the meta is `None` for plain containers (strong/emphasis/delete) and an `(href, title)` tuple for links. Removes the partition/format round-trip entirely. 3. **LOC heuristic mis-comment**. The previous comment said it subtracted docstrings but the implementation only filtered blank and comment lines. Replaced with a proper `ast`-based docstring exclusion that walks Module / ClassDef / FunctionDef bodies for `Expr(Constant(str))` first-statement patterns and excludes their line ranges. New LOC numbers (smaller, more honest): mistune 149 (was 161 — 12 lines were docstrings) markdown-it-py 194 (was 215 — 21 lines) marko 147 (was 152 — 5 lines) All still under the 250-LOC budget; conclusion unchanged. README updated with the new numbers. Spike tests still pass (9/9), and the benchmark still runs.

Holistic re-review blocker: the spike test module imported mistune / markdown_it / marko at module top-level, but the default CI test job runs `uv sync --group dev` (NOT `--group spike-parser`) then `pytest tests/`. Pytest collected `tests/parser_spike/` and failed with ModuleNotFoundError -- confirmed red on the PR (`test (3.12)` failed, others cancelled by fail-fast). Added `pytest.importorskip(...)` for all three candidate libraries at the top of the test module, before the translator imports. In the default dev-only env the module now skips cleanly (verified: `uv sync --group dev` then `pytest tests/` -> 3,581 passed, 12 skipped, 0 collection errors); with the spike-parser group installed it runs all 9 tests as before. conftest.py only imports pathlib + pytest (no candidate libs), so it needs no guard.

Final-pass subagent review surfaced 3 pyrefly errors in the spike translators that the lint workflow enforces (currently masked because the PR is still draft and the lint job is draft-gated). Fixing them so the PR is mergeable if the team decides to keep the spike rather than close-and-archive. Errors fixed: 1. `markdown_it_translator.py:225` (link_open). `attrs.get("title")` returned `float | int | str | None` per markdown-it's loose typing, but the stack frame is typed as `tuple[str, str | None]`. Coerce via `str(raw_title) if raw_title is not None else None` before stuffing into the tuple. 2. `markdown_it_translator.py:238` (image). Same root cause: title passed to `make_image(title=...)` (typed `str | None`) was the raw `attrs.get("title")` result. Same coercion. 3. `mistune_translator.py:56` (parse_markdown). mistune's `parse()` returns `list[dict | str]` -- the rare `str` token is a bare lazy text payload. The list comprehension fed it to `_translate_block` which expects `dict`. Replaced with an explicit loop that distinguishes the two cases: dict tokens go through the block translator; string tokens are lifted into `paragraph(text(...))` so they're not silently dropped. `uv run pyrefly check`: 0 errors (was 3). All 9 spike tests pass unchanged. Full suite: 3,590 pass, 1 pre-existing unrelated failure.

gemini-code-assist Bot reviewed May 22, 2026

View reviewed changes

patrick-chinchill mentioned this pull request May 22, 2026

fix(markdown): chat-scoped completeness — escaped chars, task lists, math/escape in _remend #101

Open

5 tasks

claude added 4 commits May 22, 2026 06:37

patrick-chinchill mentioned this pull request May 28, 2026

Revisit hand-rolled markdown parser: close setext / footnotes / HTML / indented-code gaps #69

Closed

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

spike(parser): bake-off harness for issue #69 Option B#100

spike(parser): bake-off harness for issue #69 Option B#100
patrick-chinchill wants to merge 6 commits into
mainfrom
claude/markdown-parser-swap-spike

patrick-chinchill commented May 22, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented May 22, 2026 •

edited

Loading

Review skipped

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 22, 2026

Uh oh!

gemini-code-assist Bot May 22, 2026

Uh oh!

gemini-code-assist Bot May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		code_lines = [line for line in text.splitlines() if line.strip() and not line.strip().startswith("#")]
		# Subtract docstrings as a rough heuristic.

Conversation

patrick-chinchill commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

How to run

Findings

Parse-and-translate time (12KB mixed corpus, 50 iterations)

Translator LOC

Completeness gap on hard constructs (gap_cases.md)

Happy-path mdast fidelity (mixed_content.md)

What this changes about the strategy decision

Test plan

Scope notes

Uh oh!

coderabbitai Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

patrick-chinchill commented May 22, 2026 •

edited

Loading

Completeness gap on hard constructs (`gap_cases.md`)

Happy-path mdast fidelity (`mixed_content.md`)

coderabbitai Bot commented May 22, 2026 •

edited

Loading