fix(markdown): chat-scoped completeness — escaped chars, task lists, math/escape in _remend#101
Conversation
Three production Slack streaming bugs from issue #69 comment: 1. `_remend("* item one\n")` appended a stray `*`. `_close_emphasis` counted the line-leading bullet as an italic opener. Now skips single-`*` runs preceded by line-leading whitespace and followed by a space/tab (list-marker shape), matching upstream `remend`'s awareness of list markers. 2. `StreamingMarkdownRenderer.finish()` on an odd-count bullet list produced `...item three\n*` — same root cause as #1; the fix covers both. 3. `_get_committable_prefix` released a confirmed table the moment a separator arrived, even with zero body rows. Slack `chat.appendStream` saw header+separator alone as broken syntax and then bare body rows in subsequent appends, never reconnecting them as a single table. Now the header+separator block is held until at least one body row arrives, then released atomically. The original tests pinned the buggy "commit on separator" behavior; updated to require a body row before committing the table block, and added a TestIssue69Regressions class with the exact reproducers from the issue comment plus invariants for the new contract. Closes part of #69 (the streaming bugs). The broader hand-rolled parser question (Options A/B/C in the issue) remains open.
The previous `_is_list_marker` skipped only line-leading single `*` followed by horizontal whitespace. Tighten to match upstream remend's `shouldSkipAsterisk` (packages/remend/src/emphasis-handlers.ts): exclude any single `*` flanked by whitespace (or text boundary) on both sides, which is what CommonMark says isn't a valid emphasis delimiter anyway. The line-leading bullet case still falls out naturally. Picks up three additional cases the previous narrower check missed: - `text * more` -- whitespace-flanked mid-line - `trailing *\n` -- asterisk at end of line - `partial *` -- asterisk at end of buffer Same `_remend` over-counting failure mode as the original issue #69 bugs, just different surface forms. Renamed the helper to `_is_excluded_asterisk` to reflect the broader scope. Remaining remend divergences (word-internal asterisks, math-block contents, escaped sequences, multi-backtick spans, setext headings, indented code, raw HTML, footnotes) are tracked on issue #69 for the parser-strategy discussion -- out of scope for this PR.
…ath/escape in remend) Pulls in the chat-realistic subset of the issue #69 gap list, scoped to what LLMs actually emit in agent/chat contexts. Builds on #99's _is_excluded_asterisk helper rather than restating it. Parser side (markdown_parser.py): - Inline regexes get `(?<!\\)` lookbehinds so escaped openers (`\*`, `\[`, `\_`, `\~~`) stay literal. Bracket content tolerates inner `\]` / `\[` via `(?:[^\]\\]|\\.)+?`. - New `_unescape_punct` runs at text-leaf emission and image-alt emission, resolving `\X` to `X` for ASCII-punctuation X. Inline-code values are deliberately not unescaped (CommonMark: backticks treat backslashes literally). - `_collect_list_items` detects the GFM task-marker prefix (`[ ]` or `[x]/[X]`) on unordered items and stores `checked: bool` on the resulting listItem. Plain items have no `checked` attr. - `make_list_item` extended with optional `checked` kwarg. Streaming side (streaming_markdown.py): - New `_strip_math_regions` pass drops `$...$` (inline) and `$$...$$` (display) before the inline-marker counters. Markers inside math are literal, not delimiters. - New `_strip_escape_sequences` pass removes `\X` pairs from the counting buffer so the tilde/backtick/bracket counters don't fire on escaped delimiters. - Both passes run after `_strip_fenced_code` in `_remend`. Word-internal asterisk exclusion was prototyped (closes `5*3=15`) but reverted: it also breaks paired emphasis like `text*foo*` whose two asterisks are both word-flanked. Distinguishing those needs a proper CommonMark delimiter-stack -- documented as the next Option A item in the helper's docstring. 20 new tests (15 parser + 11 streaming): escape behaviors on every delimiter family, image alt unescape, inline-code preservation, checked/unchecked/uppercase-X task items, mixed task lists, ordered lists are NOT task-coerced, math regions don't pollute counters, escape-aware counters, composition cases. No existing tests changed. 3,621 pass on the full suite; the lone pre-existing GitHub webhook failure is unrelated (see #83 test plan).
|
Warning Review limit reached
More reviews will be available in 25 minutes and 9 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (4)
📝 WalkthroughWalkthroughThis PR implements GFM task-list support, escape-aware inline parsing/stringification for faithful round-trips, and streaming delimiter repair (escape & TeX-math awareness) plus table-holding changes so append-only streaming emits header+separator+first-row atomically. ChangesMarkdown Fidelity: Task Lists, Escaped Characters, and Streaming Fixes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces support for GFM task lists and backslash-escaped punctuation in the markdown parser. It also improves the streaming renderer's accuracy by ignoring escaped characters and math regions when calculating unclosed delimiters, and by ensuring tables are only emitted once a body row is present. Feedback was provided regarding the inline math regex, which is currently too broad and may incorrectly match currency patterns, potentially leading to incorrect delimiter closing in the streaming output.
| # Display math: $$...$$ may span multiple lines. | ||
| text = re.sub(r"\$\$.*?\$\$", "", text, flags=re.DOTALL) | ||
| # Inline math: $...$ within a single line, non-greedy, no embedded $. | ||
| text = re.sub(r"\$[^\$\n]+\$", "", text) |
There was a problem hiding this comment.
The regex for inline math is too broad and will cause false positives with common text patterns like currency (e.g., "prices are $5 and $10"). This can lead to incorrect marker counting in _remend and potentially corrupt the output by appending unnecessary closing markers.
Following standard markdown math heuristics (like those in Pandoc or MathJax), the opening $ should be followed by a non-whitespace character, and the closing $ should be preceded by a non-whitespace character. It should also account for escaped dollar signs.
| text = re.sub(r"\$[^\$\n]+\$", "", text) | |
| text = re.sub(r"(?<!\\)\$(?!\s)(?:\\.|[^\$\n\\])+?(?<!\s)\$", "", text) |
The spike concluded with Option A (PRs #99 + #101) being the right call for the SDK's current chat-only scope. Documenting the concrete triggers under which that conclusion should be re-examined: - Non-chat input source (RAG, user-authored memory, markdown-file ingestion, GitHub-body parsing -- anywhere humans write the input) - Long-form artifact output surface (citations + footnotes; math rendering; one-shot parse where latency is less critical) - Web rendering surface (port of upstream `@chat-adapter/web`) - New chat platform demanding richer parsing Plus a playbook for re-running the bake-off (author fixture, run parity + benchmark harness, compare silent-drop count / parse time) and a decision threshold for promoting markdown-it-py into runtime if a trigger materialises. Spot-checked upstream's `packages/` at the time of writing: only `adapter-web` is the relevant new package (added in v4.27, Python port deferred); no artifact / RAG / ingestion packages exist upstream. All triggers are forward-looking.
Independent review surfaced three actionable items: 1. **Math/escape ordering** (review #1) -- `_strip_math_regions` ran BEFORE `_strip_escape_sequences`, so input like `\$opener *unclosed text closer\$` had its two escaped `$`s paired as math, eating the `*` opener inside. Reordered: escape strip first, then math strip. Verified the original bug repro now correctly produces a closing `*`. 2. **Empty task items** (review #5) -- `r"^\[([ xX])\]\s+(.*)"` required at least one whitespace AFTER `]`, so `- [ ]` with no trailing content silently fell through to plain text. Loosened to `r"^\[([ xX])\](?:\s+(.*))?$"` -- trailing whitespace+content is now optional. `- [ ]` and `- [x]` (no trailing) both produce `listItem(checked=False/True, children=[])`. 3. **Strikethrough test strengthened** (review #7) -- the existing `test_escaped_tilde_does_not_form_strikethrough` only asserted the absence of a `delete` node. A buggy impl that dropped the tildes entirely would have passed. Added a content-shape assertion that `~~not strike~~` appears in the text leaves. Reviewer's #3 (the `r"a *b\* * c"` case) was investigated and determined NOT to be a bug -- the trailing `*` between two spaces is not a valid CommonMark closer (whitespace-flanked), so italic correctly stays open. Pre-fix behavior happened to balance by ignoring flanking; post-fix is CommonMark-correct. Pre-existing edge cases #4 (`\\*text*`) and #6 (link-text unescape implicit) are documented in the existing code; not blockers. Tests added: - `test_empty_unchecked_task_item_no_content` and `_checked_` variant - `test_remend_escaped_dollar_does_not_pair_with_unescaped_dollar` (the exact #1 repro) - `test_remend_escaped_dollar_does_not_create_phantom_math_region` - Strengthened `test_escaped_tilde_does_not_form_strikethrough` 3,625 pass / 1 pre-existing failure unrelated.
Downstream-impact subagent found two blockers in PR #101's new AST semantics: `stringify_markdown` and `_render_list` were not updated to emit the new state, causing data loss when adapters round-trip through the AST. Blocker 1: `listItem.checked` silently dropped. parse_markdown("- [x] done") → stringify_markdown → "* done\n" Affects every adapter that goes markdown → AST → stringify (linear/github/telegram/whatsapp + the generic BaseFormatConverter.to_markdown). On GitHub and Linear especially (native task-list rendering), this is observable data loss. Blocker 2: `_unescape_punct` + naive stringify produces re-parse drift. parse_markdown(r"Use \*literal\*") → text "Use *literal*" → stringify → "Use *literal*\n" → reparse → emphasis(literal) Telegram MarkdownV2 / WhatsApp / Slack image alt / Google Chat all render the un-escaped `*literal*` as bold/italic in the user's view. Fixes in `markdown_parser.py`: - New `_escape_text_leaf` re-escapes the seven delimiter chars (`* _ ~ \` [ ] \\`) on text-leaf emission. Inverse of `_unescape_punct`. Conservative: doesn't over-escape ordinary punct like `.` / `,` / `!`. - `_stringify_node("text")` calls `_escape_text_leaf` on the value. - `_stringify_node("image")` calls it on the `alt` attribute (alt is stored as a raw string, doesn't go through inline recursion). - `_stringify_node("list")` emits `[x] ` / `[ ] ` prefix when `item.checked` is set (unordered items only -- task lists are a bullet-only GFM convention). - Empty task items (`- [ ]` with no children) still get a line so the checkbox isn't lost on round-trip. Fixes in `base_format_converter.py`: - `_render_list` emits the same `[x] ` / `[ ] ` prefix so platform output (Slack mrkdwn, Discord, Teams, gchat, etc.) keeps the checkbox. Most platforms render the literal `[x]` as text; GitHub and Linear render it natively. Tests added (TestStringifyMarkdown): - `test_roundtrip_preserves_escaped_punctuation_in_text` (5 inputs: asterisk, bracket, backtick, tilde, backslash) - `test_roundtrip_preserves_task_list_state` (checked, unchecked, mixed) - `test_roundtrip_preserves_empty_task_item` - `test_stringify_does_not_over_escape_ordinary_text` 3,629 pass / 1 pre-existing unrelated failure.
Round-trip verification subagent surfaced a residual gap in the stringify path: while inline delimiter chars (`*`, `_`, `~`, `\``, `[`, `]`, `\`) are now re-escaped on text-leaf emission, block-level markers at paragraph start (`#`, `>`, `-`, `+`, ordered-list `digits.`, thematic-break `---`) weren't. Result: AST drift on roundtrip for inputs like `\# not heading` -> paragraph -> "# not heading" -> heading. The gap is narrow but real: it only fires when the very first text leaf of a paragraph starts with a block-marker character, which happens when a user typed an escape sequence to disambiguate from a block construct. Five cases drifted (heading / blockquote / unordered list / ordered list / thematic break). The sixth (`*`) was already covered by `_escape_text_leaf` since `*` is in the inline-delimiter set. Fix in `markdown_parser.py`: - New `_escape_paragraph_block_markers` helper applied to the joined output of `paragraph` nodes only. Mid-text occurrences of `#` / `>` / `-` / `1.` are safe and left alone (block markers only fire at line start). - Heading / blockquote / unordered-list / thematic-break: prepend `\`. - Ordered list: insert `\` between the digit run and the marker char (`.` or `)`), since digits aren't in CommonMark's escapable-punct set so `\1` wouldn't round-trip via `_unescape_punct`. Tests added (TestStringifyMarkdown): - `test_roundtrip_preserves_escaped_block_markers_at_paragraph_start` (8 inputs: all six block-marker families plus already-handled `*` and `---`) - `test_stringify_does_not_over_escape_mid_text_block_markers` (4 inputs verifying `\` doesn't appear when block-marker chars occur mid-text) 3,631 pass / 1 pre-existing unrelated failure.
Fourth-round review subagent found that the previous fix only checked position 0 of the joined paragraph text, but paragraph text leaves can contain embedded `\n` (from `_unescape_punct` resolving an escaped marker after a soft line break, or from joining text leaves around a hard break). Block markers at the start of line 2+ of a paragraph silently re-form constructs on re-parse. Concrete drift cases the previous fix missed: - `"Here is a paragraph\n\\# 1 is not a heading"` -> paragraph -> stringify -> "Here is a paragraph\n# 1 is not a heading\n" -> re-parse -> paragraph + heading. - Same for `\>`, `\-`, `\+`, ordered-list `1\.`, thematic-break `\---` appearing on line 2+ of a multi-line paragraph. Fix splits the joined paragraph text on `\n` and applies the single-line escape per line, then rejoins. Single-line paragraphs (the common case) take a no-`\n` fast path. Adds a one-test `test_roundtrip_preserves_escaped_block_markers_after_soft_break` pinning all five drift cases. 3,632 pass / 1 pre-existing unrelated failure.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e67aa14f53
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| continue | ||
| for ci, child in enumerate(item_children): | ||
| child_type = child.get("type") | ||
| if child_type == "list": |
There was a problem hiding this comment.
Preserve task parent when first child is a nested list
In list stringification, nested list children are emitted without first writing the parent item prefix, so a task like - [ ] followed by an indented sublist (e.g. - [ ]\n - child) loses the parent item entirely and re-parses as plain text (" * child"). This is a new regression from task-list parsing because empty task items now legitimately produce listItem.checked with only nested list children, and the current branch drops both the checkbox state and list structure.
Useful? React with 👍 / 👎.
| # Empty task item -- still emit a line so the marker isn't lost. | ||
| if is_first_content and task_marker: | ||
| lines.append(f"{indent}{prefix} {task_marker}".rstrip()) |
There was a problem hiding this comment.
Emit task checkbox before nested list content
When a checked/unchecked item has no text node and only nested list children, is_first_content stays true and this fallback appends the checkbox line after rendering nested children, producing reversed output order (for Slack/Discord/Teams/Google Chat converters) such as nested bullets first and - [ ] last. That corrupts the visible hierarchy for checklist parents that only group subtasks.
Useful? React with 👍 / 👎.
PR #99 review (chatgpt-codex P1): in `_get_committable_prefix`, the "hold pre-separator block" backward walk uses `TABLE_ROW_RE` to extend `table_start`, but separator lines also match that pattern. For a stream like `|A|B|\n|---|---|\n|1|2|\n|C|D|\n|---|---|\n` (two tables with no blank line between them), the walk crosses the first table's separator and collapses `_get_committable_prefix` back to "", violating the monotonic append-only contract of `get_committable_text()`. Adapters that compute deltas from prior committed length then emit incorrect/no deltas. Fix: the backward walk now stops at empty lines (existing), non-row content (existing), AND at prior separators (new). When it hits a prior separator -- meaning the candidate "new header" row above it was already committed as a body row of the prior table -- the function falls back to `return text` instead of holding. That emits one "stray separator" on the rollback delta, which is broken markup but the lesser evil compared to non-monotonic rollback. The fix preserves the well-formed multi-table case where the second table is separated from the first by a blank line; the empty-line break in the walk fires before reaching the prior separator. Tests added: - `test_back_to_back_tables_keep_committable_monotonic` (the exact reviewer repro) - `test_second_table_after_blank_line_still_holds_header` (verifies the fix doesn't regress the blank-line-separated case) The outdated comment from gemini-code-assist on `_is_excluded_asterisk` suggesting we also include `\n` and end-of-string in the list-marker exclusion is already addressed in the second commit on this branch, which broadened the helper to match remend's `shouldSkipAsterisk` exactly. 79 streaming tests pass / 3598 total / 1 pre-existing unrelated failure.
Three findings actioned:
1. **Math regex too broad** (gemini-code-assist, high). The inline-math
pattern `\$[^\$\n]+\$` matched plain currency text like
`prices are $5 and $10`, eating the `$5 and $10` portion and
corrupting downstream inline-marker counts. Tightened to require
non-whitespace immediately after the opening `$` and before the
closing `$` (standard pandoc / markdown-it heuristic). Display math
also tightened (`.+?` instead of `.*?`) to require non-empty content
and avoid matching bare `$$$$`.
2. **Task item with only nested list children loses parent prefix**
(chatgpt-codex P1). `parse_markdown("- [ ]\n - child")` produces a
listItem with `checked: False` and only a nested-list child.
Stringify hit the nested-list branch first, indented and appended,
never wrote the `* [ ]` parent line. Re-parse then dropped the
parent entirely. Fixed by tracking `prefix_emitted` across the
item's children -- if the first non-skipped child is a nested
list, emit the parent prefix line first.
3. **Same scenario in `_render_list` produces reversed output**
(chatgpt-codex P2). `is_first_content` stayed True through the
nested-list branch, so the empty-task fallback appended the
checkbox line AFTER the nested rendering. Fixed by emitting the
prefix before recursing into the nested list when it's the first
content, then dropping the `if task_marker` qualifier on the
fallback so any item with zero rendered children still gets a line
(was previously only firing for task items).
Tests added:
- `test_task_item_with_only_nested_list_keeps_parent_prefix`
- `test_remend_unescaped_currency_does_not_pair_as_math`
3,634 pass / 1 pre-existing unrelated failure.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 12fed105f8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| ("delete", re.compile(r"(?<!\\)~~((?:[^\\]|\\.)+?)~~")), | ||
| # Emphasis: *text* (not preceded/followed by *) | ||
| ("emphasis_star", re.compile(r"(?<!\*)\*(?!\*)(.+?)(?<!\*)\*(?!\*)")), | ||
| ("emphasis_star", re.compile(r"(?<![*\\])\*(?!\*)((?:[^\\]|\\.)+?)(?<!\*)\*(?!\*)")), |
There was a problem hiding this comment.
Treat doubled backslashes as non-escaping delimiters
The new (?<!\\) opener guards only check the immediately previous character, so delimiters after an even number of backslashes are incorrectly treated as escaped. For example, \\*real italic* (literal backslash + italic) now parses as plain text instead of emphasis, and the same regression applies to links/code/strike patterns using this guard. This changes valid markdown semantics and will silently drop formatting in common path-like text that precedes inline markdown with a doubled backslash.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Confirmed real bug — verified \\*real italic* parses as plain text instead of literal \ + italic. Deferring the fix to a follow-up rather than blocking this PR. Reasoning:
The fix can't be done with the existing (?<!\\) lookbehind approach — Python's re doesn't support variable-width lookbehind, and any fixed-width alternation (e.g. (?:(?<!\\)|(?<=\\\\))) handles even-count backslashes but breaks 3+ backslashes (which should be escape-backslash + escape-delimiter = literal). The proper fix is a placeholder-based pre-pass: substitute every \X pair with a private-use Unicode placeholder before regex runs, then restore at text-leaf and inline-code emission. That's a moderate refactor of _parse_inline, _unescape_punct, and several inline regexes; safer as its own PR.
For now, added a comment block at _INLINE_PATTERNS (commit 9a2027f) documenting the limitation and the recommended fix path. The pattern is rare in LLM/chat output (which is this PR's scope) — backslash-emphasis sequences typically appear in technical docs explaining regex/glob syntax, which LLMs usually wrap in code fences. Will revisit when we have a concrete production case or when we tackle the broader parser-rewrite question from #100.
Generated by Claude Code
There was a problem hiding this comment.
Update: actually fixed in 48fa4f3 rather than deferring. The placeholder-based pre-pass was smaller than I'd estimated (~50 LOC of helpers + ~10 LOC of wiring) and didn't break any existing tests.
Approach: _protect_escapes runs at the start of _parse_inline and replaces every \X escape with <SENTINEL>X (SENTINEL = U+0001, a control char that doesn't appear in chat markdown). The regex lookbehinds change from (?<!\\) to (?<!\x01). Because the sentinel marks the escape unambiguously, any number of preceding backslashes is handled correctly:
\*→<S>*— lookbehind sees<S>, skips. Correct (escaped).\\*→<S>\*— the*is preceded by\(not<S>), lookbehind passes, emphasis opens. Correct.\\\*→<S>\<S>*— the*is preceded by<S>, lookbehind skips. Correct (3 backslashes = escaped\+ escaped*, both literal).
Two restore functions handle the emission modes:
_restore_escapes(text leaves, image alt, link URL):<S>X→X_restore_escapes_as_literal_pair(inline code, where CommonMark preserves backslashes):<S>X→\X
Tests added: test_escaped_backslash_followed_by_real_emphasis pins your exact repro, plus a triple-backslash case for the harder counting. 3,636 / 3,637 total tests pass.
Generated by Claude Code
PR #101 review (chatgpt-codex P2): the inline regex lookbehinds only check one character back, so an escaped backslash followed by a real delimiter (e.g. `\\*foo*`) is treated as an escaped delimiter and falls through to plain text -- losing the intended emphasis. Fixing this properly requires variable-width lookbehind (not supported by Python's `re`) or a placeholder-based pre-pass that substitutes escape sequences before the regex runs. Both are moderate refactors of the inline parser. The pattern is rare in LLM / chat output (technical docs explaining backslash + emphasis typically use code fences), so deferring to a future chat-completeness pass. Added a comment block near `_INLINE_PATTERNS` documenting the limitation and the recommended fix path so the next person to touch this code sees it immediately.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/chat_sdk/shared/markdown_parser.py (1)
475-475: ⚡ Quick winReplace
or-defaulting with an explicitNonecheck.Line 475 uses
task_m.group(2) or ""; please switch to explicitNonehandling to avoid truthiness traps.Proposed patch
- item_text = task_m.group(2) or "" + task_text = task_m.group(2) + item_text = task_text if task_text is not None else ""As per coding guidelines,
src/**/*.py: Usex if x is not None else defaultinstead ofx or defaultto avoid truthiness traps when porting from TypeScript.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/chat_sdk/shared/markdown_parser.py` at line 475, Replace the truthy-default pattern used when extracting the second regex group: instead of using "task_m.group(2) or ''" assign item_text using an explicit None check (e.g., set item_text to task_m.group(2) if task_m.group(2) is not None else ""), locating the assignment to item_text that uses task_m and updating it accordingly to avoid truthiness traps.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/chat_sdk/shared/markdown_parser.py`:
- Line 475: Replace the truthy-default pattern used when extracting the second
regex group: instead of using "task_m.group(2) or ''" assign item_text using an
explicit None check (e.g., set item_text to task_m.group(2) if task_m.group(2)
is not None else ""), locating the assignment to item_text that uses task_m and
updating it accordingly to avoid truthiness traps.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 21754e16-9544-4bcd-bedb-c09a8d4b38ac
📒 Files selected for processing (5)
src/chat_sdk/shared/base_format_converter.pysrc/chat_sdk/shared/markdown_parser.pysrc/chat_sdk/shared/streaming_markdown.pytests/test_markdown_faithful.pytests/test_streaming_markdown.py
Closes the PR #101 review #4 finding (chatgpt-codex P2) properly rather than deferring. `\\*foo*` (escaped backslash + real italic) now parses correctly as text(`\`) + emphasis(`foo`); previously the fixed-width `(?<!\\)` lookbehind couldn't distinguish odd vs even backslash runs, so the `*` was wrongly treated as escaped. Approach: a `_protect_escapes` pre-pass runs at the start of `_parse_inline`, replacing every `\X` (X in `_ESCAPABLE_PUNCT`) with `<SENTINEL>X` where SENTINEL = U+0001 (a control codepoint that never legitimately appears in chat markdown). The inline regex lookbehinds now check for the sentinel instead of `\`, which trivially handles any number of preceding backslashes: - `\*` -> `<SENTINEL>*`. Regex lookbehind sees SENTINEL, skips. - `\\*` -> `<SENTINEL>\*`. The `*` is preceded by `\` (not SENTINEL), so the lookbehind passes and emphasis opens correctly. - `\\\*` -> `<SENTINEL>\<SENTINEL>*`. The `*` is preceded by SENTINEL, lookbehind skips. Correct (3 backslashes = escaped backslash + escaped asterisk = both literal). The complementary restore functions handle the two emission modes: - `_restore_escapes(text)`: SENTINEL pairs -> literal X (text leaves and image alt -- the backslash is dropped because the escape resolved to a literal char). - `_restore_escapes_as_literal_pair(text)`: SENTINEL pairs -> `\X` (inline code -- CommonMark preserves backslashes inside code spans). Changes: - New helpers `_protect_escapes`, `_restore_escapes`, `_restore_escapes_as_literal_pair`. ~50 LOC. - `_INLINE_PATTERNS` lookbehinds changed from `(?<!\\)` to `(?<!\x01)`; emphasis-star/_under add `\x01` to the exclusion sets; content groups for link/image use `(?:[^\]\x01]|\x01.)+?` to consume sentinel pairs without terminating early. - `_parse_inline` calls `_protect_escapes` on first entry and `_restore_escapes` at emission sites (text leaves, image alt, link URL/title). Inline code uses `_restore_escapes_as_literal_pair`. - Recursive calls into link text / emphasis content pass a `_already_protected=True` kwarg so the pre-pass doesn't run twice. - Removed `_unescape_punct` (no longer used); updated docstring references to point at `_restore_escapes` instead. - Removed the limitation comment added in 9a2027f. Tests added: - `test_escaped_backslash_followed_by_real_emphasis` (the reviewer's exact repro — `\\*real italic*` now produces text + emphasis) - `test_triple_backslash_before_star_is_fully_escaped` 3,636 pass / 1 pre-existing unrelated failure. No existing tests broke (escape behaviour preserved, inline code preservation preserved, image alt unescape preserved).
Second round of subagent review on commit 48fa4f3 surfaced three issues: 1. **U+0001 sentinel could be mangled by adapter-supplied input** (blocker). The pre-pass used `\x01` as the escape sentinel and assumed it would never appear in chat content -- but no adapter strips control characters, so input like `before\x01*after*` would have the sentinel paired with the asterisk and dropped, corrupting the visible text. Switched the sentinel to U+FDD0 (a Unicode noncharacter -- one of the 32 BMP codepoints permanently reserved by the Unicode standard for process-internal application use; never assigned to a real character). Also added a strip step at `_protect_escapes` entry: any literal U+FDD0 in input is removed before the escape walk runs, as defense-in-depth. Updated all the `(?<!\x01)` lookbehinds in `_INLINE_PATTERNS` to use the new sentinel. 2. **Silent URL behavior change** (medium). The PR refactored link / image URL handling to call `_restore_escapes(url)` (CommonMark says URLs honour backslash escapes too). This is a correctness improvement but wasn't called out -- added `test_link_url_resolves_backslash_escapes` to pin the new contract so future regressions are caught. 3. **Weak triple-backslash test** (nit). The existing `test_triple_backslash_before_star_is_fully_escaped` only asserted `"emphasis" not in types` -- a hypothetical buggy impl that dropped all input would pass. Strengthened to assert the exact AST shape: `[{"type": "text", "value": "\\*foo*"}]`. Tests added / strengthened: - `test_triple_backslash_before_star_is_fully_escaped` -- exact AST assertion - `test_sentinel_codepoint_in_input_is_stripped` -- adversarial input with U+FDD0 is sanitised - `test_link_url_resolves_backslash_escapes` -- new URL behavior pinned 3,638 pass / 1 pre-existing unrelated failure.
|
@codex review Substantive changes since your last review:
Particularly interested in scrutiny of:
Generated by Claude Code |
|
Codex Review: Didn't find any major issues. Can't wait for the next one! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/chat_sdk/shared/markdown_parser.py (1)
265-299: ⚡ Quick winAvoid hardcoding the sentinel glyph inside inline regexes.
_INLINE_PATTERNSrepeats the literal�in several places instead of deriving from_ESCAPE_SENTINEL. If the sentinel is changed again, escape matching can silently drift.♻️ Proposed refactor
+_ESC = re.escape(_ESCAPE_SENTINEL) + _INLINE_PATTERNS = [ - ("inlineCode", re.compile(r"(?<!�)`([^`]+)`")), + ("inlineCode", re.compile(rf"(?<!{_ESC})`([^`]+)`")), # ...apply the same replacement for all other `(?<!�)` and `�.` usages... ]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/chat_sdk/shared/markdown_parser.py` around lines 265 - 299, The inline regexes in _INLINE_PATTERNS currently hardcode the sentinel glyph "�"; update them to derive the sentinel from the module-level _ESCAPE_SENTINEL so changes won't drift: build each pattern string using re.escape(_ESCAPE_SENTINEL) (or a variable like_ESC = re.escape(_ESCAPE_SENTINEL)) and interpolate that escaped sentinel into the lookbehind/lookaround fragments (e.g., (?<!{ESC}) ), then compile those constructed pattern strings into the same named entries in _INLINE_PATTERNS; ensure behavior with _protect_escapes remains identical (i.e., escaped delimiters still match the sentinel form).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/test_markdown_faithful.py`:
- Around line 251-260: The test currently only checks that a link node exists;
update test_link_url_resolves_backslash_escapes to also assert the link node's
normalized URL value so regressions in unescaping fail the test: locate the
single link node from link_nodes and assert its URL equals
"http://example.com/path)more" (access the URL property on the node, e.g.
link_nodes[0].get('target') or link_nodes[0].get('destination') as appropriate
for your node shape) instead of only checking len(link_nodes) == 1.
---
Nitpick comments:
In `@src/chat_sdk/shared/markdown_parser.py`:
- Around line 265-299: The inline regexes in _INLINE_PATTERNS currently hardcode
the sentinel glyph "�"; update them to derive the sentinel from the module-level
_ESCAPE_SENTINEL so changes won't drift: build each pattern string using
re.escape(_ESCAPE_SENTINEL) (or a variable like_ESC =
re.escape(_ESCAPE_SENTINEL)) and interpolate that escaped sentinel into the
lookbehind/lookaround fragments (e.g., (?<!{ESC}) ), then compile those
constructed pattern strings into the same named entries in _INLINE_PATTERNS;
ensure behavior with _protect_escapes remains identical (i.e., escaped
delimiters still match the sentinel form).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 9c91ee92-602d-4d60-b27d-3614c4a169a6
📒 Files selected for processing (2)
src/chat_sdk/shared/markdown_parser.pytests/test_markdown_faithful.py
| def test_link_url_resolves_backslash_escapes(self): | ||
| # PR #101 review #3 (downstream-impact reviewer): the inline | ||
| # parser now resolves backslash escapes in link URLs per | ||
| # CommonMark spec. This is a behavior change from the previous | ||
| # raw pass-through; pin it so the contract is explicit. | ||
| children = self._para_children(r"[t](http://example.com/path\)more)") | ||
| link_nodes = [c for c in children if c.get("type") == "link"] | ||
| # The `\)` inside the URL is now resolved to literal `)`. | ||
| # Previously this URL would have been truncated at the first `)`. | ||
| assert len(link_nodes) == 1 |
There was a problem hiding this comment.
Assert the normalized URL value, not just link-node presence.
Line 260 only checks node count, so this can still pass when URL unescaping/regression is wrong. Add a concrete URL assertion for the \) case.
✅ Suggested test tightening
children = self._para_children(r"[t](http://example.com/path\)more)")
link_nodes = [c for c in children if c.get("type") == "link"]
@@
assert len(link_nodes) == 1
+ assert link_nodes[0]["url"] == "http://example.com/path)more"As per coding guidelines, tests/**/*.py: "Every test must fail when the code is wrong — no assert True stubs, no bare truthiness checks when specific values are available".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/test_markdown_faithful.py` around lines 251 - 260, The test currently
only checks that a link node exists; update
test_link_url_resolves_backslash_escapes to also assert the link node's
normalized URL value so regressions in unescaping fail the test: locate the
single link node from link_nodes and assert its URL equals
"http://example.com/path)more" (access the URL property on the node, e.g.
link_nodes[0].get('target') or link_nodes[0].get('destination') as appropriate
for your node shape) instead of only checking len(link_nodes) == 1.
…ackslash Holistic re-review blocker: a code span ending in a backslash (e.g. `` `C:\` ``, `` `\d+\` ``) left the escape sentinel (U+FDD0) dangling in the captured content, leaking the noncharacter into stringify_markdown / ast_to_plain_text output as a replacement glyph. Root cause: `_protect_escapes` rewrites `\X` -> `<SENTINEL>X`. For `` `C:\` `` the `\`+backtick becomes `<SENTINEL>`+backtick, and the inline-code regex `` (?<!ï·�)`([^`]+)` `` captures `C:<SENTINEL>` -- the sentinel's paired backtick falls outside the group as the code-span closer. Both restore functions guarded on `i + 1 < len(text)`, so a trailing sentinel with no following char was emitted raw. Fix: both restore functions now handle a dangling sentinel explicitly. `_restore_escapes_as_literal_pair` (code spans) emits a bare backslash -- matching CommonMark's "backslash is literal inside code spans" rule, so `` `C:\` `` correctly yields code content `C:\`. `_restore_escapes` (text leaves / URLs) likewise emits a backslash rather than the raw sentinel. Neither can ever emit U+FDD0 now. Test added: `test_code_span_ending_in_backslash_does_not_leak_sentinel` covers three realistic code-span-ending-in-backslash inputs and asserts the sentinel codepoint never reaches output. 3,638 pass / 1 pre-existing unrelated failure.
…at-completeness # Conflicts: # tests/test_streaming_markdown.py
Adversarial re-review surfaced two pre-existing stringify round-trip bugs that are commonly hit by LLM-produced markdown. Fixing them in this PR rather than filing follow-ups (per the team preference for clean PRs). Both bugs predate PR #101 -- the same drift reproduces on `main`. They were never exercised by the chat-completeness work specifically, but they sit next to the escape-roundtrip code this PR added and break the same general "parse -> stringify -> parse stable" contract the rest of the PR pins. Fix 1 (list-item continuation indent): - `- item1\n para2` parsed to one list with one item containing one paragraph `"item1\npara2"`. Stringify emitted `* item1\npara2\n` -- no indent on `para2` -- so re-parse split into `[list, paragraph]`. - `_stringify_node`'s list branch now splits each child's stringified text on `\n` and applies the standard 2-space continuation indent to lines after the first. Empty continuation lines stay empty so intra-item blank lines round-trip too. - Affects every adapter that uses `from_markdown` / `to_markdown` to translate (linear, github, telegram, whatsapp + the generic `BaseFormatConverter.to_markdown`). Fix 2 (hard break stringify): - A `break` node (parsed from `Step \nline2`) stringified as just `\n`, which re-parses as a soft break -- collapsing the `[text, break, text]` triple into a single text leaf with embedded `\n`. The semantic distinction between hard and soft breaks was silently lost on round-trip. - `_stringify_node`'s break branch now emits `" \n"` per CommonMark. Tests added: - `test_roundtrip_preserves_list_item_continuation_indent` (4 inputs: single-line continuation, multi-line continuation, multiple items with continuations, ordered list with continuation) - `test_roundtrip_preserves_hard_break_in_paragraph` (asserts both AST stability AND that the `break` node survives) Docstring note added near `_INLINE_PATTERNS` for the third finding (O(n^2) on adversarial bracket-heavy input). Pre-existing in the underlying linear-search-per-position; this PR's alternation regex adds a constant-factor slowdown only on ~1000+ unclosed brackets, sub-millisecond on typical chat content. Deferred to a future character-level walker if a non-chat input surface ever lands. 3,643 pass / 1 pre-existing unrelated failure.
…mpleteness # Conflicts: # src/chat_sdk/shared/streaming_markdown.py # tests/test_streaming_markdown.py
Summary
Chat-scoped completeness fixes for issue #69, targeting the gap subset that actually surfaces in LLM/agent output rather than chasing full CommonMark spec compliance.
Sequenced after #99 (depends on
_is_excluded_asteriskfrom that PR). Targetsmainfor the 4.27 cycle — same release, separate PR for review clarity.Scope (chat-realistic gaps from the issue #69 follow-up)
Parser side (
markdown_parser.py)Escaped characters — inline regexes get
(?<!\\)lookbehinds so\*,\[,\_,\~~stay literal; new_unescape_punctresolves the escape at text-leaf emission. Image alt text also unescapes. Inline-code contents deliberately preserve backslashes (CommonMark: code spans don't unescape).GFM task list checkboxes —
- [ ]/- [x]/- [X]parse tolistItemwithchecked: bool. Ordered lists with1. [x]content are NOT coerced into tasks (task lists are bullet-only per GFM).Streaming side (
streaming_markdown.py::_remend)Math regions — new
_strip_math_regionsdrops$...$(inline) and$$...$$(display) before the inline-marker counters, so$a^* + b^*$no longer accumulates phantom italic closers.Escape-aware counters — new
_strip_escape_sequencesremoves\Xpairs from the counting buffer so the tilde/backtick/bracket counters in_remendno longer fire on escaped delimiters. The bracket walk's existing\-skip is preserved.What's deliberately NOT here
Word-internal asterisks (
5*3=15) — prototyped, reverted. Excluding word-internal*also breakstext*foo*(legitimate paired italic whose*s are also word-flanked). Distinguishing the two needs a CommonMark delimiter-stack algorithm. Documented as the next Option A item in_is_excluded_asterisk's docstring.Setext headings, footnotes, raw HTML, indented code blocks, definition lists — analyzed in the parser-swap spike (PR spike(parser): bake-off harness for issue #69 Option B #100). All silently dropped by the baseline and by every candidate without plugins. Not surfaced by LLM/chat output in practice (analysis).
Multi-backtick code spans — deferred. Less common than the items above and would require a non-trivial inline-parser change.
Test plan
26 new tests across 3 classes:
TestEscapedCharacters(8 tests): asterisk, underscore, bracket, tilde, backslash literal, mixed real/escaped emphasis, image-alt unescape, inline-code-preserves-backslash.TestTaskListItems(7 tests): unchecked, checked, uppercase X, plain item with no attr, mixed list state, marker stripped from content, ordered list not task-coerced.TestRemendChatCompleteness(11 tests): inline + display math regions, math doesn't break legitimate emphasis outside, math drops strike markers and bracket walk, escape-aware tilde/bracket/backtick counters, mixed escaped+real, escape neutrality.Validation:
uv run pytest tests/— 3,621 pass, 1 pre-existing failure (test_github_webhook.py::TestGitHubAdapterConstructor::test_throws_when_no_auth, unrelated — same as sync: upstream v4.27.0 (alpha 1) #83/fix(streaming-markdown): list-marker awareness + table chunk-boundary (issue #69) #99/spike(parser): bake-off harness for issue #69 Option B #100)uv run ruff check src/ tests/ scripts/— passesuv run ruff format --check src/ tests/ scripts/— 193 files already formatteduv run python scripts/audit_test_quality.py— 0 hard failuresScope notes
claude/wizardly-heisenberg-VLInVand extends_is_excluded_asteriskadded there. Should merge after fix(streaming-markdown): list-marker awareness + table chunk-boundary (issue #69) #99.*, multi-backtick spans) can land in a follow-up if they ever materialize as user-visible bugs.Related: #99 (the original three streaming bugs), #100 (the parser-swap spike that reframed the strategy decision).
Generated by Claude Code
Summary by CodeRabbit
New Features
Bug Fixes