Skip to content

fix(markdown): chat-scoped completeness — escaped chars, task lists, math/escape in _remend#101

Merged
patrick-chinchill merged 18 commits into
mainfrom
claude/parser-chat-completeness
May 28, 2026
Merged

fix(markdown): chat-scoped completeness — escaped chars, task lists, math/escape in _remend#101
patrick-chinchill merged 18 commits into
mainfrom
claude/parser-chat-completeness

Conversation

@patrick-chinchill
Copy link
Copy Markdown
Collaborator

@patrick-chinchill patrick-chinchill commented May 22, 2026

Summary

Chat-scoped completeness fixes for issue #69, targeting the gap subset that actually surfaces in LLM/agent output rather than chasing full CommonMark spec compliance.

Sequenced after #99 (depends on _is_excluded_asterisk from that PR). Targets main for the 4.27 cycle — same release, separate PR for review clarity.

Scope (chat-realistic gaps from the issue #69 follow-up)

Parser side (markdown_parser.py)

  1. Escaped characters — inline regexes get (?<!\\) lookbehinds so \*, \[, \_, \~~ stay literal; new _unescape_punct resolves the escape at text-leaf emission. Image alt text also unescapes. Inline-code contents deliberately preserve backslashes (CommonMark: code spans don't unescape).

  2. GFM task list checkboxes- [ ] / - [x] / - [X] parse to listItem with checked: bool. Ordered lists with 1. [x] content are NOT coerced into tasks (task lists are bullet-only per GFM).

Streaming side (streaming_markdown.py::_remend)

  1. Math regions — new _strip_math_regions drops $...$ (inline) and $$...$$ (display) before the inline-marker counters, so $a^* + b^*$ no longer accumulates phantom italic closers.

  2. Escape-aware counters — new _strip_escape_sequences removes \X pairs from the counting buffer so the tilde/backtick/bracket counters in _remend no longer fire on escaped delimiters. The bracket walk's existing \-skip is preserved.

What's deliberately NOT here

  • Word-internal asterisks (5*3=15) — prototyped, reverted. Excluding word-internal * also breaks text*foo* (legitimate paired italic whose *s are also word-flanked). Distinguishing the two needs a CommonMark delimiter-stack algorithm. Documented as the next Option A item in _is_excluded_asterisk's docstring.

  • Setext headings, footnotes, raw HTML, indented code blocks, definition lists — analyzed in the parser-swap spike (PR spike(parser): bake-off harness for issue #69 Option B #100). All silently dropped by the baseline and by every candidate without plugins. Not surfaced by LLM/chat output in practice (analysis).

  • Multi-backtick code spans — deferred. Less common than the items above and would require a non-trivial inline-parser change.

Test plan

26 new tests across 3 classes:

  • TestEscapedCharacters (8 tests): asterisk, underscore, bracket, tilde, backslash literal, mixed real/escaped emphasis, image-alt unescape, inline-code-preserves-backslash.
  • TestTaskListItems (7 tests): unchecked, checked, uppercase X, plain item with no attr, mixed list state, marker stripped from content, ordered list not task-coerced.
  • TestRemendChatCompleteness (11 tests): inline + display math regions, math doesn't break legitimate emphasis outside, math drops strike markers and bracket walk, escape-aware tilde/bracket/backtick counters, mixed escaped+real, escape neutrality.

Validation:

Scope notes

Related: #99 (the original three streaming bugs), #100 (the parser-swap spike that reframed the strategy decision).


Generated by Claude Code

Summary by CodeRabbit

  • New Features

    • Support for GitHub-flavored task lists (checkbox state preserved, including empty items)
  • Bug Fixes

    • Better handling of backslash-escaped punctuation and inline math so formatting is preserved
    • Improved nested-list and progressive table rendering to avoid misplaced markers and premature header emission
    • More faithful parse/stringify roundtrips to retain original markdown formatting

Review Change Stack

claude added 3 commits May 22, 2026 03:47
Three production Slack streaming bugs from issue #69 comment:

1. `_remend("* item one\n")` appended a stray `*`. `_close_emphasis`
   counted the line-leading bullet as an italic opener. Now skips
   single-`*` runs preceded by line-leading whitespace and followed by
   a space/tab (list-marker shape), matching upstream `remend`'s
   awareness of list markers.

2. `StreamingMarkdownRenderer.finish()` on an odd-count bullet list
   produced `...item three\n*` — same root cause as #1; the fix covers
   both.

3. `_get_committable_prefix` released a confirmed table the moment a
   separator arrived, even with zero body rows. Slack
   `chat.appendStream` saw header+separator alone as broken syntax and
   then bare body rows in subsequent appends, never reconnecting them
   as a single table. Now the header+separator block is held until at
   least one body row arrives, then released atomically.

The original tests pinned the buggy "commit on separator" behavior;
updated to require a body row before committing the table block, and
added a TestIssue69Regressions class with the exact reproducers from
the issue comment plus invariants for the new contract.

Closes part of #69 (the streaming bugs). The broader hand-rolled
parser question (Options A/B/C in the issue) remains open.
The previous `_is_list_marker` skipped only line-leading single `*`
followed by horizontal whitespace. Tighten to match upstream remend's
`shouldSkipAsterisk` (packages/remend/src/emphasis-handlers.ts):
exclude any single `*` flanked by whitespace (or text boundary) on
both sides, which is what CommonMark says isn't a valid emphasis
delimiter anyway. The line-leading bullet case still falls out
naturally.

Picks up three additional cases the previous narrower check missed:
- `text * more` -- whitespace-flanked mid-line
- `trailing *\n` -- asterisk at end of line
- `partial *`   -- asterisk at end of buffer

Same `_remend` over-counting failure mode as the original issue #69
bugs, just different surface forms. Renamed the helper to
`_is_excluded_asterisk` to reflect the broader scope.

Remaining remend divergences (word-internal asterisks, math-block
contents, escaped sequences, multi-backtick spans, setext headings,
indented code, raw HTML, footnotes) are tracked on issue #69 for the
parser-strategy discussion -- out of scope for this PR.
…ath/escape in remend)

Pulls in the chat-realistic subset of the issue #69 gap list, scoped to
what LLMs actually emit in agent/chat contexts. Builds on #99's
_is_excluded_asterisk helper rather than restating it.

Parser side (markdown_parser.py):
- Inline regexes get `(?<!\\)` lookbehinds so escaped openers (`\*`,
  `\[`, `\_`, `\~~`) stay literal. Bracket content tolerates inner
  `\]` / `\[` via `(?:[^\]\\]|\\.)+?`.
- New `_unescape_punct` runs at text-leaf emission and image-alt
  emission, resolving `\X` to `X` for ASCII-punctuation X. Inline-code
  values are deliberately not unescaped (CommonMark: backticks treat
  backslashes literally).
- `_collect_list_items` detects the GFM task-marker prefix (`[ ]` or
  `[x]/[X]`) on unordered items and stores `checked: bool` on the
  resulting listItem. Plain items have no `checked` attr.
- `make_list_item` extended with optional `checked` kwarg.

Streaming side (streaming_markdown.py):
- New `_strip_math_regions` pass drops `$...$` (inline) and `$$...$$`
  (display) before the inline-marker counters. Markers inside math are
  literal, not delimiters.
- New `_strip_escape_sequences` pass removes `\X` pairs from the
  counting buffer so the tilde/backtick/bracket counters don't fire on
  escaped delimiters.
- Both passes run after `_strip_fenced_code` in `_remend`.

Word-internal asterisk exclusion was prototyped (closes `5*3=15`) but
reverted: it also breaks paired emphasis like `text*foo*` whose two
asterisks are both word-flanked. Distinguishing those needs a proper
CommonMark delimiter-stack -- documented as the next Option A item in
the helper's docstring.

20 new tests (15 parser + 11 streaming): escape behaviors on every
delimiter family, image alt unescape, inline-code preservation,
checked/unchecked/uppercase-X task items, mixed task lists, ordered
lists are NOT task-coerced, math regions don't pollute counters,
escape-aware counters, composition cases.

No existing tests changed. 3,621 pass on the full suite; the lone
pre-existing GitHub webhook failure is unrelated (see #83 test plan).
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 22, 2026

Warning

Review limit reached

@patrick-chinchill, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 25 minutes and 9 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 72f311b6-8aeb-48eb-acc6-a4acb298d97a

📥 Commits

Reviewing files that changed from the base of the PR and between 03033b4 and 4fda9b4.

📒 Files selected for processing (4)
  • src/chat_sdk/shared/markdown_parser.py
  • src/chat_sdk/shared/streaming_markdown.py
  • tests/test_markdown_faithful.py
  • tests/test_streaming_markdown.py
📝 Walkthrough

Walkthrough

This PR implements GFM task-list support, escape-aware inline parsing/stringification for faithful round-trips, and streaming delimiter repair (escape & TeX-math awareness) plus table-holding changes so append-only streaming emits header+separator+first-row atomically.

Changes

Markdown Fidelity: Task Lists, Escaped Characters, and Streaming Fixes

Layer / File(s) Summary
Task list AST and parsing support
src/chat_sdk/shared/markdown_parser.py
make_list_item accepts optional checked; unordered list collection recognizes - [ ]/- [x] and sets listItem.checked, preserving empty task items.
Task list stringification and BaseFormat rendering
src/chat_sdk/shared/markdown_parser.py, src/chat_sdk/shared/base_format_converter.py
Stringifier emits [x]/[ ] for unordered items, preserves empty task items, handles nested-list-before-text by emitting parent prefix+marker, and _render_list preserves checkbox markers in formatted output.
Escape-protected inline parsing
src/chat_sdk/shared/markdown_parser.py
Adds sentinel-based escape protection/restoration, escape-aware inline regexes, and consistent restoration of escaped captures for links/images/inlineCode; hard breaks emitted consistently while preserving escapes.
Stringify escaping and paragraph safety
src/chat_sdk/shared/markdown_parser.py
Adds _escape_text_leaf and _escape_paragraph_block_markers, escapes image alt text, and ensures parse→stringify→parse roundtrips preserve escaped punctuation and block-start markers.
Streaming delimiter repair (escape & math aware)
src/chat_sdk/shared/streaming_markdown.py
Preprocesses text by removing backslash-escape sequences and TeX math regions before delimiter counting; excludes whitespace-flanked single * runs from emphasis tallying; updates _close_emphasis and _remend.
Table append-only holding behavior
src/chat_sdk/shared/streaming_markdown.py
_get_committable_prefix records separator position and holds header+separator until the first body row arrives, emitting header+separator+first-row atomically for append-only consumers.
Streaming markdown tests and regressions
tests/test_streaming_markdown.py
Updates tests to expect separator-only pushes to be held (empty delta) and first-body-row to release the assembled table; adds regressions for issue #69 covering _remend and table chunking.
Escaped characters & task-list tests
tests/test_markdown_faithful.py
Adds TestEscapedCharacters and TestTaskListItems; extends TestStringifyMarkdown with roundtrip assertions for escaped punctuation, block-marker safety, task-list state, and empty task items.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

A rabbit hops through markdown lands,
Where checkboxes march and escapes stand—
Tasks now kept whole, math and stars tamed,
Streams held steady, and text un-maimed;
Fidelity prances, tests applauded, 🐇✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 23.91% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title clearly summarizes the primary changes: fixing markdown handling for escaped characters, task lists, and math/escape sequences in the _remend function, which directly aligns with the main objectives.
Linked Issues check ✅ Passed The PR successfully implements Option C from issue #69 by delivering focused fixes for escaped characters and GFM task lists with comprehensive tests, while deferring lower-ROI gaps as recommended.
Out of Scope Changes check ✅ Passed All changes are scoped to the documented objectives: escaped character handling, task list parsing/rendering, and math/escape-aware streaming fixes; no unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for GFM task lists and backslash-escaped punctuation in the markdown parser. It also improves the streaming renderer's accuracy by ignoring escaped characters and math regions when calculating unclosed delimiters, and by ensuring tables are only emitted once a body row is present. Feedback was provided regarding the inline math regex, which is currently too broad and may incorrectly match currency patterns, potentially leading to incorrect delimiter closing in the streaming output.

# Display math: $$...$$ may span multiple lines.
text = re.sub(r"\$\$.*?\$\$", "", text, flags=re.DOTALL)
# Inline math: $...$ within a single line, non-greedy, no embedded $.
text = re.sub(r"\$[^\$\n]+\$", "", text)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The regex for inline math is too broad and will cause false positives with common text patterns like currency (e.g., "prices are $5 and $10"). This can lead to incorrect marker counting in _remend and potentially corrupt the output by appending unnecessary closing markers.

Following standard markdown math heuristics (like those in Pandoc or MathJax), the opening $ should be followed by a non-whitespace character, and the closing $ should be preceded by a non-whitespace character. It should also account for escaped dollar signs.

Suggested change
text = re.sub(r"\$[^\$\n]+\$", "", text)
text = re.sub(r"(?<!\\)\$(?!\s)(?:\\.|[^\$\n\\])+?(?<!\s)\$", "", text)

patrick-chinchill pushed a commit that referenced this pull request May 22, 2026
The spike concluded with Option A (PRs #99 + #101) being the right
call for the SDK's current chat-only scope. Documenting the concrete
triggers under which that conclusion should be re-examined:

- Non-chat input source (RAG, user-authored memory, markdown-file
  ingestion, GitHub-body parsing -- anywhere humans write the input)
- Long-form artifact output surface (citations + footnotes; math
  rendering; one-shot parse where latency is less critical)
- Web rendering surface (port of upstream `@chat-adapter/web`)
- New chat platform demanding richer parsing

Plus a playbook for re-running the bake-off (author fixture, run
parity + benchmark harness, compare silent-drop count / parse time)
and a decision threshold for promoting markdown-it-py into runtime
if a trigger materialises.

Spot-checked upstream's `packages/` at the time of writing: only
`adapter-web` is the relevant new package (added in v4.27, Python
port deferred); no artifact / RAG / ingestion packages exist
upstream. All triggers are forward-looking.
claude added 4 commits May 22, 2026 08:41
Independent review surfaced three actionable items:

1. **Math/escape ordering** (review #1) -- `_strip_math_regions` ran
   BEFORE `_strip_escape_sequences`, so input like
   `\$opener *unclosed text closer\$` had its two escaped `$`s paired
   as math, eating the `*` opener inside. Reordered: escape strip
   first, then math strip. Verified the original bug repro now
   correctly produces a closing `*`.

2. **Empty task items** (review #5) -- `r"^\[([ xX])\]\s+(.*)"`
   required at least one whitespace AFTER `]`, so `- [ ]` with no
   trailing content silently fell through to plain text. Loosened to
   `r"^\[([ xX])\](?:\s+(.*))?$"` -- trailing whitespace+content is
   now optional. `- [ ]` and `- [x]` (no trailing) both produce
   `listItem(checked=False/True, children=[])`.

3. **Strikethrough test strengthened** (review #7) -- the existing
   `test_escaped_tilde_does_not_form_strikethrough` only asserted the
   absence of a `delete` node. A buggy impl that dropped the tildes
   entirely would have passed. Added a content-shape assertion that
   `~~not strike~~` appears in the text leaves.

Reviewer's #3 (the `r"a *b\* * c"` case) was investigated and
determined NOT to be a bug -- the trailing `*` between two spaces is
not a valid CommonMark closer (whitespace-flanked), so italic
correctly stays open. Pre-fix behavior happened to balance by ignoring
flanking; post-fix is CommonMark-correct. Pre-existing edge cases #4
(`\\*text*`) and #6 (link-text unescape implicit) are documented in
the existing code; not blockers.

Tests added:
- `test_empty_unchecked_task_item_no_content` and `_checked_` variant
- `test_remend_escaped_dollar_does_not_pair_with_unescaped_dollar`
  (the exact #1 repro)
- `test_remend_escaped_dollar_does_not_create_phantom_math_region`
- Strengthened `test_escaped_tilde_does_not_form_strikethrough`

3,625 pass / 1 pre-existing failure unrelated.
Downstream-impact subagent found two blockers in PR #101's new AST
semantics: `stringify_markdown` and `_render_list` were not updated to
emit the new state, causing data loss when adapters round-trip through
the AST.

Blocker 1: `listItem.checked` silently dropped.
  parse_markdown("- [x] done") → stringify_markdown → "* done\n"
Affects every adapter that goes markdown → AST → stringify
(linear/github/telegram/whatsapp + the generic
BaseFormatConverter.to_markdown). On GitHub and Linear especially
(native task-list rendering), this is observable data loss.

Blocker 2: `_unescape_punct` + naive stringify produces re-parse drift.
  parse_markdown(r"Use \*literal\*") → text "Use *literal*"
  → stringify → "Use *literal*\n"
  → reparse → emphasis(literal)
Telegram MarkdownV2 / WhatsApp / Slack image alt / Google Chat all
render the un-escaped `*literal*` as bold/italic in the user's view.

Fixes in `markdown_parser.py`:
- New `_escape_text_leaf` re-escapes the seven delimiter chars
  (`* _ ~ \` [ ] \\`) on text-leaf emission. Inverse of
  `_unescape_punct`. Conservative: doesn't over-escape ordinary punct
  like `.` / `,` / `!`.
- `_stringify_node("text")` calls `_escape_text_leaf` on the value.
- `_stringify_node("image")` calls it on the `alt` attribute (alt is
  stored as a raw string, doesn't go through inline recursion).
- `_stringify_node("list")` emits `[x] ` / `[ ] ` prefix when
  `item.checked` is set (unordered items only -- task lists are a
  bullet-only GFM convention).
- Empty task items (`- [ ]` with no children) still get a line so the
  checkbox isn't lost on round-trip.

Fixes in `base_format_converter.py`:
- `_render_list` emits the same `[x] ` / `[ ] ` prefix so platform
  output (Slack mrkdwn, Discord, Teams, gchat, etc.) keeps the
  checkbox. Most platforms render the literal `[x]` as text; GitHub
  and Linear render it natively.

Tests added (TestStringifyMarkdown):
- `test_roundtrip_preserves_escaped_punctuation_in_text` (5 inputs:
  asterisk, bracket, backtick, tilde, backslash)
- `test_roundtrip_preserves_task_list_state` (checked, unchecked,
  mixed)
- `test_roundtrip_preserves_empty_task_item`
- `test_stringify_does_not_over_escape_ordinary_text`

3,629 pass / 1 pre-existing unrelated failure.
Round-trip verification subagent surfaced a residual gap in the
stringify path: while inline delimiter chars (`*`, `_`, `~`, `\``,
`[`, `]`, `\`) are now re-escaped on text-leaf emission, block-level
markers at paragraph start (`#`, `>`, `-`, `+`, ordered-list
`digits.`, thematic-break `---`) weren't. Result: AST drift on
roundtrip for inputs like `\# not heading` -> paragraph -> "# not
heading" -> heading.

The gap is narrow but real: it only fires when the very first text
leaf of a paragraph starts with a block-marker character, which
happens when a user typed an escape sequence to disambiguate from a
block construct. Five cases drifted (heading / blockquote / unordered
list / ordered list / thematic break). The sixth (`*`) was already
covered by `_escape_text_leaf` since `*` is in the inline-delimiter
set.

Fix in `markdown_parser.py`:
- New `_escape_paragraph_block_markers` helper applied to the joined
  output of `paragraph` nodes only. Mid-text occurrences of `#` / `>`
  / `-` / `1.` are safe and left alone (block markers only fire at
  line start).
- Heading / blockquote / unordered-list / thematic-break: prepend `\`.
- Ordered list: insert `\` between the digit run and the marker char
  (`.` or `)`), since digits aren't in CommonMark's escapable-punct
  set so `\1` wouldn't round-trip via `_unescape_punct`.

Tests added (TestStringifyMarkdown):
- `test_roundtrip_preserves_escaped_block_markers_at_paragraph_start`
  (8 inputs: all six block-marker families plus already-handled `*`
  and `---`)
- `test_stringify_does_not_over_escape_mid_text_block_markers`
  (4 inputs verifying `\` doesn't appear when block-marker chars
  occur mid-text)

3,631 pass / 1 pre-existing unrelated failure.
Fourth-round review subagent found that the previous fix only checked
position 0 of the joined paragraph text, but paragraph text leaves can
contain embedded `\n` (from `_unescape_punct` resolving an escaped
marker after a soft line break, or from joining text leaves around a
hard break). Block markers at the start of line 2+ of a paragraph
silently re-form constructs on re-parse.

Concrete drift cases the previous fix missed:
- `"Here is a paragraph\n\\# 1 is not a heading"` -> paragraph
  -> stringify -> "Here is a paragraph\n# 1 is not a heading\n"
  -> re-parse -> paragraph + heading.
- Same for `\>`, `\-`, `\+`, ordered-list `1\.`, thematic-break `\---`
  appearing on line 2+ of a multi-line paragraph.

Fix splits the joined paragraph text on `\n` and applies the
single-line escape per line, then rejoins. Single-line paragraphs
(the common case) take a no-`\n` fast path. Adds a one-test
`test_roundtrip_preserves_escaped_block_markers_after_soft_break`
pinning all five drift cases.

3,632 pass / 1 pre-existing unrelated failure.
@patrick-chinchill patrick-chinchill marked this pull request as ready for review May 22, 2026 21:55
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e67aa14f53

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

continue
for ci, child in enumerate(item_children):
child_type = child.get("type")
if child_type == "list":
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve task parent when first child is a nested list

In list stringification, nested list children are emitted without first writing the parent item prefix, so a task like - [ ] followed by an indented sublist (e.g. - [ ]\n - child) loses the parent item entirely and re-parses as plain text (" * child"). This is a new regression from task-list parsing because empty task items now legitimately produce listItem.checked with only nested list children, and the current branch drops both the checkbox state and list structure.

Useful? React with 👍 / 👎.

Comment on lines +136 to +138
# Empty task item -- still emit a line so the marker isn't lost.
if is_first_content and task_marker:
lines.append(f"{indent}{prefix} {task_marker}".rstrip())
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Emit task checkbox before nested list content

When a checked/unchecked item has no text node and only nested list children, is_first_content stays true and this fallback appends the checkbox line after rendering nested children, producing reversed output order (for Slack/Discord/Teams/Google Chat converters) such as nested bullets first and - [ ] last. That corrupts the visible hierarchy for checklist parents that only group subtasks.

Useful? React with 👍 / 👎.

claude added 2 commits May 23, 2026 00:42
PR #99 review (chatgpt-codex P1): in `_get_committable_prefix`, the
"hold pre-separator block" backward walk uses `TABLE_ROW_RE` to extend
`table_start`, but separator lines also match that pattern. For a
stream like `|A|B|\n|---|---|\n|1|2|\n|C|D|\n|---|---|\n` (two tables
with no blank line between them), the walk crosses the first table's
separator and collapses `_get_committable_prefix` back to "",
violating the monotonic append-only contract of
`get_committable_text()`. Adapters that compute deltas from prior
committed length then emit incorrect/no deltas.

Fix: the backward walk now stops at empty lines (existing), non-row
content (existing), AND at prior separators (new). When it hits a
prior separator -- meaning the candidate "new header" row above it was
already committed as a body row of the prior table -- the function
falls back to `return text` instead of holding. That emits one "stray
separator" on the rollback delta, which is broken markup but the
lesser evil compared to non-monotonic rollback.

The fix preserves the well-formed multi-table case where the second
table is separated from the first by a blank line; the empty-line
break in the walk fires before reaching the prior separator.

Tests added:
- `test_back_to_back_tables_keep_committable_monotonic` (the exact
  reviewer repro)
- `test_second_table_after_blank_line_still_holds_header` (verifies
  the fix doesn't regress the blank-line-separated case)

The outdated comment from gemini-code-assist on `_is_excluded_asterisk`
suggesting we also include `\n` and end-of-string in the
list-marker exclusion is already addressed in the second commit on
this branch, which broadened the helper to match remend's
`shouldSkipAsterisk` exactly.

79 streaming tests pass / 3598 total / 1 pre-existing unrelated failure.
Three findings actioned:

1. **Math regex too broad** (gemini-code-assist, high). The inline-math
   pattern `\$[^\$\n]+\$` matched plain currency text like
   `prices are $5 and $10`, eating the `$5 and $10` portion and
   corrupting downstream inline-marker counts. Tightened to require
   non-whitespace immediately after the opening `$` and before the
   closing `$` (standard pandoc / markdown-it heuristic). Display math
   also tightened (`.+?` instead of `.*?`) to require non-empty content
   and avoid matching bare `$$$$`.

2. **Task item with only nested list children loses parent prefix**
   (chatgpt-codex P1). `parse_markdown("- [ ]\n  - child")` produces a
   listItem with `checked: False` and only a nested-list child.
   Stringify hit the nested-list branch first, indented and appended,
   never wrote the `* [ ]` parent line. Re-parse then dropped the
   parent entirely. Fixed by tracking `prefix_emitted` across the
   item's children -- if the first non-skipped child is a nested
   list, emit the parent prefix line first.

3. **Same scenario in `_render_list` produces reversed output**
   (chatgpt-codex P2). `is_first_content` stayed True through the
   nested-list branch, so the empty-task fallback appended the
   checkbox line AFTER the nested rendering. Fixed by emitting the
   prefix before recursing into the nested list when it's the first
   content, then dropping the `if task_marker` qualifier on the
   fallback so any item with zero rendered children still gets a line
   (was previously only firing for task items).

Tests added:
- `test_task_item_with_only_nested_list_keeps_parent_prefix`
- `test_remend_unescaped_currency_does_not_pair_as_math`

3,634 pass / 1 pre-existing unrelated failure.
@patrick-chinchill patrick-chinchill marked this pull request as draft May 23, 2026 00:59
@patrick-chinchill patrick-chinchill marked this pull request as ready for review May 23, 2026 00:59
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 12fed105f8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/chat_sdk/shared/markdown_parser.py Outdated
("delete", re.compile(r"(?<!\\)~~((?:[^\\]|\\.)+?)~~")),
# Emphasis: *text* (not preceded/followed by *)
("emphasis_star", re.compile(r"(?<!\*)\*(?!\*)(.+?)(?<!\*)\*(?!\*)")),
("emphasis_star", re.compile(r"(?<![*\\])\*(?!\*)((?:[^\\]|\\.)+?)(?<!\*)\*(?!\*)")),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Treat doubled backslashes as non-escaping delimiters

The new (?<!\\) opener guards only check the immediately previous character, so delimiters after an even number of backslashes are incorrectly treated as escaped. For example, \\*real italic* (literal backslash + italic) now parses as plain text instead of emphasis, and the same regression applies to links/code/strike patterns using this guard. This changes valid markdown semantics and will silently drop formatting in common path-like text that precedes inline markdown with a doubled backslash.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed real bug — verified \\*real italic* parses as plain text instead of literal \ + italic. Deferring the fix to a follow-up rather than blocking this PR. Reasoning:

The fix can't be done with the existing (?<!\\) lookbehind approach — Python's re doesn't support variable-width lookbehind, and any fixed-width alternation (e.g. (?:(?<!\\)|(?<=\\\\))) handles even-count backslashes but breaks 3+ backslashes (which should be escape-backslash + escape-delimiter = literal). The proper fix is a placeholder-based pre-pass: substitute every \X pair with a private-use Unicode placeholder before regex runs, then restore at text-leaf and inline-code emission. That's a moderate refactor of _parse_inline, _unescape_punct, and several inline regexes; safer as its own PR.

For now, added a comment block at _INLINE_PATTERNS (commit 9a2027f) documenting the limitation and the recommended fix path. The pattern is rare in LLM/chat output (which is this PR's scope) — backslash-emphasis sequences typically appear in technical docs explaining regex/glob syntax, which LLMs usually wrap in code fences. Will revisit when we have a concrete production case or when we tackle the broader parser-rewrite question from #100.


Generated by Claude Code

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: actually fixed in 48fa4f3 rather than deferring. The placeholder-based pre-pass was smaller than I'd estimated (~50 LOC of helpers + ~10 LOC of wiring) and didn't break any existing tests.

Approach: _protect_escapes runs at the start of _parse_inline and replaces every \X escape with <SENTINEL>X (SENTINEL = U+0001, a control char that doesn't appear in chat markdown). The regex lookbehinds change from (?<!\\) to (?<!\x01). Because the sentinel marks the escape unambiguously, any number of preceding backslashes is handled correctly:

  • \*<S>* — lookbehind sees <S>, skips. Correct (escaped).
  • \\*<S>\* — the * is preceded by \ (not <S>), lookbehind passes, emphasis opens. Correct.
  • \\\*<S>\<S>* — the * is preceded by <S>, lookbehind skips. Correct (3 backslashes = escaped \ + escaped *, both literal).

Two restore functions handle the emission modes:

  • _restore_escapes (text leaves, image alt, link URL): <S>XX
  • _restore_escapes_as_literal_pair (inline code, where CommonMark preserves backslashes): <S>X\X

Tests added: test_escaped_backslash_followed_by_real_emphasis pins your exact repro, plus a triple-backslash case for the harder counting. 3,636 / 3,637 total tests pass.


Generated by Claude Code

PR #101 review (chatgpt-codex P2): the inline regex lookbehinds only
check one character back, so an escaped backslash followed by a real
delimiter (e.g. `\\*foo*`) is treated as an escaped delimiter and
falls through to plain text -- losing the intended emphasis.

Fixing this properly requires variable-width lookbehind (not
supported by Python's `re`) or a placeholder-based pre-pass that
substitutes escape sequences before the regex runs. Both are
moderate refactors of the inline parser. The pattern is rare in
LLM / chat output (technical docs explaining backslash + emphasis
typically use code fences), so deferring to a future
chat-completeness pass.

Added a comment block near `_INLINE_PATTERNS` documenting the
limitation and the recommended fix path so the next person to touch
this code sees it immediately.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/chat_sdk/shared/markdown_parser.py (1)

475-475: ⚡ Quick win

Replace or-defaulting with an explicit None check.

Line 475 uses task_m.group(2) or ""; please switch to explicit None handling to avoid truthiness traps.

Proposed patch
-                    item_text = task_m.group(2) or ""
+                    task_text = task_m.group(2)
+                    item_text = task_text if task_text is not None else ""

As per coding guidelines, src/**/*.py: Use x if x is not None else default instead of x or default to avoid truthiness traps when porting from TypeScript.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/chat_sdk/shared/markdown_parser.py` at line 475, Replace the
truthy-default pattern used when extracting the second regex group: instead of
using "task_m.group(2) or ''" assign item_text using an explicit None check
(e.g., set item_text to task_m.group(2) if task_m.group(2) is not None else ""),
locating the assignment to item_text that uses task_m and updating it
accordingly to avoid truthiness traps.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/chat_sdk/shared/markdown_parser.py`:
- Line 475: Replace the truthy-default pattern used when extracting the second
regex group: instead of using "task_m.group(2) or ''" assign item_text using an
explicit None check (e.g., set item_text to task_m.group(2) if task_m.group(2)
is not None else ""), locating the assignment to item_text that uses task_m and
updating it accordingly to avoid truthiness traps.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 21754e16-9544-4bcd-bedb-c09a8d4b38ac

📥 Commits

Reviewing files that changed from the base of the PR and between 08c42fa and 9a2027f.

📒 Files selected for processing (5)
  • src/chat_sdk/shared/base_format_converter.py
  • src/chat_sdk/shared/markdown_parser.py
  • src/chat_sdk/shared/streaming_markdown.py
  • tests/test_markdown_faithful.py
  • tests/test_streaming_markdown.py

Closes the PR #101 review #4 finding (chatgpt-codex P2) properly
rather than deferring. `\\*foo*` (escaped backslash + real italic)
now parses correctly as text(`\`) + emphasis(`foo`); previously the
fixed-width `(?<!\\)` lookbehind couldn't distinguish odd vs even
backslash runs, so the `*` was wrongly treated as escaped.

Approach: a `_protect_escapes` pre-pass runs at the start of
`_parse_inline`, replacing every `\X` (X in `_ESCAPABLE_PUNCT`) with
`<SENTINEL>X` where SENTINEL = U+0001 (a control codepoint that
never legitimately appears in chat markdown). The inline regex
lookbehinds now check for the sentinel instead of `\`, which trivially
handles any number of preceding backslashes:

- `\*` -> `<SENTINEL>*`. Regex lookbehind sees SENTINEL, skips.
- `\\*` -> `<SENTINEL>\*`. The `*` is preceded by `\` (not SENTINEL),
  so the lookbehind passes and emphasis opens correctly.
- `\\\*` -> `<SENTINEL>\<SENTINEL>*`. The `*` is preceded by
  SENTINEL, lookbehind skips. Correct (3 backslashes = escaped
  backslash + escaped asterisk = both literal).

The complementary restore functions handle the two emission modes:
- `_restore_escapes(text)`: SENTINEL pairs -> literal X (text leaves
  and image alt -- the backslash is dropped because the escape
  resolved to a literal char).
- `_restore_escapes_as_literal_pair(text)`: SENTINEL pairs -> `\X`
  (inline code -- CommonMark preserves backslashes inside code spans).

Changes:
- New helpers `_protect_escapes`, `_restore_escapes`,
  `_restore_escapes_as_literal_pair`. ~50 LOC.
- `_INLINE_PATTERNS` lookbehinds changed from `(?<!\\)` to `(?<!\x01)`;
  emphasis-star/_under add `\x01` to the exclusion sets; content
  groups for link/image use `(?:[^\]\x01]|\x01.)+?` to consume
  sentinel pairs without terminating early.
- `_parse_inline` calls `_protect_escapes` on first entry and
  `_restore_escapes` at emission sites (text leaves, image alt, link
  URL/title). Inline code uses `_restore_escapes_as_literal_pair`.
- Recursive calls into link text / emphasis content pass a
  `_already_protected=True` kwarg so the pre-pass doesn't run twice.
- Removed `_unescape_punct` (no longer used); updated docstring
  references to point at `_restore_escapes` instead.
- Removed the limitation comment added in 9a2027f.

Tests added:
- `test_escaped_backslash_followed_by_real_emphasis` (the reviewer's
  exact repro — `\\*real italic*` now produces text + emphasis)
- `test_triple_backslash_before_star_is_fully_escaped`

3,636 pass / 1 pre-existing unrelated failure. No existing tests
broke (escape behaviour preserved, inline code preservation
preserved, image alt unescape preserved).
Comment thread src/chat_sdk/shared/markdown_parser.py Fixed
Comment thread src/chat_sdk/shared/markdown_parser.py Fixed
Second round of subagent review on commit 48fa4f3 surfaced three
issues:

1. **U+0001 sentinel could be mangled by adapter-supplied input**
   (blocker). The pre-pass used `\x01` as the escape sentinel and
   assumed it would never appear in chat content -- but no adapter
   strips control characters, so input like `before\x01*after*` would
   have the sentinel paired with the asterisk and dropped, corrupting
   the visible text.

   Switched the sentinel to U+FDD0 (a Unicode noncharacter -- one of
   the 32 BMP codepoints permanently reserved by the Unicode standard
   for process-internal application use; never assigned to a real
   character). Also added a strip step at `_protect_escapes` entry:
   any literal U+FDD0 in input is removed before the escape walk runs,
   as defense-in-depth. Updated all the `(?<!\x01)` lookbehinds in
   `_INLINE_PATTERNS` to use the new sentinel.

2. **Silent URL behavior change** (medium). The PR refactored link /
   image URL handling to call `_restore_escapes(url)` (CommonMark says
   URLs honour backslash escapes too). This is a correctness
   improvement but wasn't called out -- added
   `test_link_url_resolves_backslash_escapes` to pin the new contract
   so future regressions are caught.

3. **Weak triple-backslash test** (nit). The existing
   `test_triple_backslash_before_star_is_fully_escaped` only asserted
   `"emphasis" not in types` -- a hypothetical buggy impl that
   dropped all input would pass. Strengthened to assert the exact AST
   shape: `[{"type": "text", "value": "\\*foo*"}]`.

Tests added / strengthened:
- `test_triple_backslash_before_star_is_fully_escaped` -- exact AST
  assertion
- `test_sentinel_codepoint_in_input_is_stripped` -- adversarial input
  with U+FDD0 is sanitised
- `test_link_url_resolves_backslash_escapes` -- new URL behavior pinned

3,638 pass / 1 pre-existing unrelated failure.
Copy link
Copy Markdown
Collaborator Author

@codex review

Substantive changes since your last review:

  • 12fed10 — fixed your P1 (task item with only nested list children loses parent prefix) in _stringify_node and the P2 mirror in _render_list; fixed gemini's math-regex currency false-positive.
  • 48fa4f3 — fixed your P2 (doubled-backslash drops emphasis) with a placeholder-based pre-pass. Replaced fixed-width (?<!\\) lookbehinds with (?<!<SENTINEL>) where SENTINEL marks each \X escape. Correctly handles any count of preceding backslashes.
  • 03033b4 — hardened the sentinel against adversarial input: switched from U+0001 to U+FDD0 (a Unicode noncharacter that's permanently reserved for internal use); added a strip step so pre-existing sentinels in input can't smuggle in fake escape sequences. Also pinned the new URL-escape behavior in a test.

Particularly interested in scrutiny of:

  • The sentinel choice and the strip-at-entry approach
  • The two restore variants (_restore_escapes for text leaves vs _restore_escapes_as_literal_pair for inline code per CommonMark)
  • Whether any callers feed parse_markdown content that's already been through an upstream escape pass

Generated by Claude Code

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Can't wait for the next one!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/chat_sdk/shared/markdown_parser.py (1)

265-299: ⚡ Quick win

Avoid hardcoding the sentinel glyph inside inline regexes.

_INLINE_PATTERNS repeats the literal in several places instead of deriving from _ESCAPE_SENTINEL. If the sentinel is changed again, escape matching can silently drift.

♻️ Proposed refactor
+_ESC = re.escape(_ESCAPE_SENTINEL)
+
 _INLINE_PATTERNS = [
-    ("inlineCode", re.compile(r"(?<!�)`([^`]+)`")),
+    ("inlineCode", re.compile(rf"(?<!{_ESC})`([^`]+)`")),
     # ...apply the same replacement for all other `(?<!�)` and `�.` usages...
 ]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/chat_sdk/shared/markdown_parser.py` around lines 265 - 299, The inline
regexes in _INLINE_PATTERNS currently hardcode the sentinel glyph "�"; update
them to derive the sentinel from the module-level _ESCAPE_SENTINEL so changes
won't drift: build each pattern string using re.escape(_ESCAPE_SENTINEL) (or a
variable like_ESC = re.escape(_ESCAPE_SENTINEL)) and interpolate that escaped
sentinel into the lookbehind/lookaround fragments (e.g., (?<!{ESC}) ), then
compile those constructed pattern strings into the same named entries in
_INLINE_PATTERNS; ensure behavior with _protect_escapes remains identical (i.e.,
escaped delimiters still match the sentinel form).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/test_markdown_faithful.py`:
- Around line 251-260: The test currently only checks that a link node exists;
update test_link_url_resolves_backslash_escapes to also assert the link node's
normalized URL value so regressions in unescaping fail the test: locate the
single link node from link_nodes and assert its URL equals
"http://example.com/path)more" (access the URL property on the node, e.g.
link_nodes[0].get('target') or link_nodes[0].get('destination') as appropriate
for your node shape) instead of only checking len(link_nodes) == 1.

---

Nitpick comments:
In `@src/chat_sdk/shared/markdown_parser.py`:
- Around line 265-299: The inline regexes in _INLINE_PATTERNS currently hardcode
the sentinel glyph "�"; update them to derive the sentinel from the module-level
_ESCAPE_SENTINEL so changes won't drift: build each pattern string using
re.escape(_ESCAPE_SENTINEL) (or a variable like_ESC =
re.escape(_ESCAPE_SENTINEL)) and interpolate that escaped sentinel into the
lookbehind/lookaround fragments (e.g., (?<!{ESC}) ), then compile those
constructed pattern strings into the same named entries in _INLINE_PATTERNS;
ensure behavior with _protect_escapes remains identical (i.e., escaped
delimiters still match the sentinel form).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 9c91ee92-602d-4d60-b27d-3614c4a169a6

📥 Commits

Reviewing files that changed from the base of the PR and between 9a2027f and 03033b4.

📒 Files selected for processing (2)
  • src/chat_sdk/shared/markdown_parser.py
  • tests/test_markdown_faithful.py

Comment on lines +251 to +260
def test_link_url_resolves_backslash_escapes(self):
# PR #101 review #3 (downstream-impact reviewer): the inline
# parser now resolves backslash escapes in link URLs per
# CommonMark spec. This is a behavior change from the previous
# raw pass-through; pin it so the contract is explicit.
children = self._para_children(r"[t](http://example.com/path\)more)")
link_nodes = [c for c in children if c.get("type") == "link"]
# The `\)` inside the URL is now resolved to literal `)`.
# Previously this URL would have been truncated at the first `)`.
assert len(link_nodes) == 1
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Assert the normalized URL value, not just link-node presence.

Line 260 only checks node count, so this can still pass when URL unescaping/regression is wrong. Add a concrete URL assertion for the \) case.

✅ Suggested test tightening
         children = self._para_children(r"[t](http://example.com/path\)more)")
         link_nodes = [c for c in children if c.get("type") == "link"]
@@
         assert len(link_nodes) == 1
+        assert link_nodes[0]["url"] == "http://example.com/path)more"

As per coding guidelines, tests/**/*.py: "Every test must fail when the code is wrong — no assert True stubs, no bare truthiness checks when specific values are available".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_markdown_faithful.py` around lines 251 - 260, The test currently
only checks that a link node exists; update
test_link_url_resolves_backslash_escapes to also assert the link node's
normalized URL value so regressions in unescaping fail the test: locate the
single link node from link_nodes and assert its URL equals
"http://example.com/path)more" (access the URL property on the node, e.g.
link_nodes[0].get('target') or link_nodes[0].get('destination') as appropriate
for your node shape) instead of only checking len(link_nodes) == 1.

claude added 3 commits May 23, 2026 15:05
…ackslash

Holistic re-review blocker: a code span ending in a backslash (e.g.
`` `C:\` ``, `` `\d+\` ``) left the escape sentinel (U+FDD0) dangling
in the captured content, leaking the noncharacter into
stringify_markdown / ast_to_plain_text output as a replacement glyph.

Root cause: `_protect_escapes` rewrites `\X` -> `<SENTINEL>X`. For
`` `C:\` `` the `\`+backtick becomes `<SENTINEL>`+backtick, and the
inline-code regex `` (?<!ï·�)`([^`]+)` `` captures `C:<SENTINEL>` --
the sentinel's paired backtick falls outside the group as the code-span
closer. Both restore functions guarded on `i + 1 < len(text)`, so a
trailing sentinel with no following char was emitted raw.

Fix: both restore functions now handle a dangling sentinel explicitly.
`_restore_escapes_as_literal_pair` (code spans) emits a bare backslash
-- matching CommonMark's "backslash is literal inside code spans" rule,
so `` `C:\` `` correctly yields code content `C:\`.
`_restore_escapes` (text leaves / URLs) likewise emits a backslash
rather than the raw sentinel. Neither can ever emit U+FDD0 now.

Test added: `test_code_span_ending_in_backslash_does_not_leak_sentinel`
covers three realistic code-span-ending-in-backslash inputs and asserts
the sentinel codepoint never reaches output.

3,638 pass / 1 pre-existing unrelated failure.
…at-completeness

# Conflicts:
#	tests/test_streaming_markdown.py
Adversarial re-review surfaced two pre-existing stringify round-trip
bugs that are commonly hit by LLM-produced markdown. Fixing them in
this PR rather than filing follow-ups (per the team preference for
clean PRs).

Both bugs predate PR #101 -- the same drift reproduces on `main`.
They were never exercised by the chat-completeness work specifically,
but they sit next to the escape-roundtrip code this PR added and
break the same general "parse -> stringify -> parse stable" contract
the rest of the PR pins.

Fix 1 (list-item continuation indent):
- `- item1\n  para2` parsed to one list with one item containing one
  paragraph `"item1\npara2"`. Stringify emitted `* item1\npara2\n` --
  no indent on `para2` -- so re-parse split into `[list, paragraph]`.
- `_stringify_node`'s list branch now splits each child's stringified
  text on `\n` and applies the standard 2-space continuation indent
  to lines after the first. Empty continuation lines stay empty so
  intra-item blank lines round-trip too.
- Affects every adapter that uses `from_markdown` / `to_markdown` to
  translate (linear, github, telegram, whatsapp + the generic
  `BaseFormatConverter.to_markdown`).

Fix 2 (hard break stringify):
- A `break` node (parsed from `Step  \nline2`) stringified as just
  `\n`, which re-parses as a soft break -- collapsing the
  `[text, break, text]` triple into a single text leaf with embedded
  `\n`. The semantic distinction between hard and soft breaks was
  silently lost on round-trip.
- `_stringify_node`'s break branch now emits `"  \n"` per CommonMark.

Tests added:
- `test_roundtrip_preserves_list_item_continuation_indent` (4 inputs:
  single-line continuation, multi-line continuation, multiple items
  with continuations, ordered list with continuation)
- `test_roundtrip_preserves_hard_break_in_paragraph` (asserts both
  AST stability AND that the `break` node survives)

Docstring note added near `_INLINE_PATTERNS` for the third finding
(O(n^2) on adversarial bracket-heavy input). Pre-existing in the
underlying linear-search-per-position; this PR's alternation regex
adds a constant-factor slowdown only on ~1000+ unclosed brackets,
sub-millisecond on typical chat content. Deferred to a future
character-level walker if a non-chat input surface ever lands.

3,643 pass / 1 pre-existing unrelated failure.
claude and others added 2 commits May 28, 2026 04:37
…mpleteness

# Conflicts:
#	src/chat_sdk/shared/streaming_markdown.py
#	tests/test_streaming_markdown.py
@patrick-chinchill patrick-chinchill merged commit b1004b1 into main May 28, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Revisit hand-rolled markdown parser: close setext / footnotes / HTML / indented-code gaps

3 participants