Preserve angle brackets in code blocks and inline code spans during sanitization#22005

Merged
dsyme merged 4 commits into main from
copilot/fix-angle-bracket-redaction
Mar 21, 2026

Copilot AI commented Mar 20, 2026

convertXmlTags and removeXmlComments were applied to the entire content string, converting type parameters and code-block content like VBuffer<float32> → VBuffer(float32). This mangled code samples and issue titles containing generic/template syntax.

Changes

sanitize_content_core.cjs — three new helpers

  • getFencedCodeRanges(s) — returns [start, end) character ranges of fenced code blocks (``` or ~~~) by scanning line-by-line
  • applyFnOutsideInlineCode(text, fn) — applies a transform to a text segment, skipping backtick-delimited inline code spans (handles multi-backtick spans)
  • applyToNonCodeRegions(s, fn) — composes both helpers; falls back to fn(s) on any parse error (never reduces protection)

sanitizeContentCore updated to call applyToNonCodeRegions(sanitized, removeXmlComments) and applyToNonCodeRegions(sanitized, convertXmlTags) instead of the bare functions.
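
A minimal sketch of how the three helpers might fit together, for readers who want to see the shape of the approach. It is not the merged code: the bodies are simplified, the toy `convertXmlTags` stand-in and the regex-based inline-span matcher are assumptions made for illustration, and treating an unclosed fence as ordinary text is one conservative choice rather than something the PR states.

```javascript
// Hypothetical stand-in for the real convertXmlTags: <tag> becomes (tag).
const convertXmlTags = (s) => s.replace(/<([^<>]*)>/g, "($1)");

// Return [start, end) character ranges of fenced code blocks, scanning line by line.
function getFencedCodeRanges(s) {
  const ranges = [];
  let pos = 0;
  let open = null; // { start, ch, len } while inside a fence
  for (const line of s.split("\n")) {
    const end = pos + line.length + 1; // +1 for the newline removed by split()
    const t = line.trim();
    const m = t.match(/^(`{3,}|~{3,})/);
    if (!open && m) {
      open = { start: pos, ch: m[1][0], len: m[1].length };
    } else if (open && m && m[1][0] === open.ch && m[1].length >= open.len && t === m[1]) {
      ranges.push([open.start, Math.min(end, s.length)]);
      open = null;
    }
    pos = end;
  }
  // Assumption: an unclosed fence yields no range, so its content is still sanitized.
  return ranges;
}

// Apply fn to text outside backtick-delimited inline code spans.
function applyFnOutsideInlineCode(text, fn) {
  // An opening run of N backticks closed by the next run of exactly N backticks.
  const span = /(?<!`)(`+)(?!`)[\s\S]*?(?<!`)\1(?!`)/g;
  let out = "";
  let last = 0;
  for (const m of text.matchAll(span)) {
    out += fn(text.slice(last, m.index)) + m[0];
    last = m.index + m[0].length;
  }
  return out + fn(text.slice(last));
}

// Compose both helpers; on any error fall back to sanitizing the whole string.
function applyToNonCodeRegions(s, fn) {
  if (typeof s !== "string" || s === "") return "";
  try {
    let out = "";
    let pos = 0;
    for (const [start, end] of getFencedCodeRanges(s)) {
      out += applyFnOutsideInlineCode(s.slice(pos, start), fn) + s.slice(start, end);
      pos = end;
    }
    return out + applyFnOutsideInlineCode(s.slice(pos), fn);
  } catch (_e) {
    return fn(s); // never reduces protection
  }
}

const fence = "`".repeat(3);
const input = "Use `VBuffer<float32>` not <script>\n" + fence + "cpp\nstd::vector<int> v;\n" + fence + "\n";
const output = applyToNonCodeRegions(input, convertXmlTags);
console.log(output);
```

In this sketch the inline span and the fenced block keep their angle brackets, while `<script>` in regular text is still converted to `(script)`.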

sanitize_content.cjs

Same wrapping applied to the mention-filtering variant of the pipeline.

Scope

Only XML tag conversion and XML comment removal are made code-region aware. URL redaction, invisible-character stripping, and all other sanitizers are unchanged — they still apply everywhere.

Example

Before fix:
  `VBuffer<float32>` in inline code → `VBuffer(float32)`
  `std::vector<int> v;` in a fenced cpp block → `std::vector(int) v;`

After fix:
  `VBuffer<float32>` is preserved in inline code
  `std::vector<int> v;` is preserved in the fenced block

Angle brackets (e.g. VBuffer<float32>) inside fenced code blocks and
backtick inline-code spans are no longer converted to parentheses by
the XML-tag sanitizer.  The same code-region awareness is applied to
XML comment removal.

Three helpers are added to sanitize_content_core.cjs:
- getFencedCodeRanges: locates fenced code blocks by character position
- applyFnOutsideInlineCode: applies a transform skipping inline code spans
- applyToNonCodeRegions: composes both to skip all code regions

Both sanitizeContentCore and sanitizeContent now use these helpers when
calling removeXmlComments and convertXmlTags, with a safe fallback to
full-string processing if parsing fails.

Ten new tests verify the behaviour for fenced blocks, inline code,
regular text (still sanitised), XML comments, and the VBuffer<float32>
example from the original issue.

Co-authored-by: dsyme <7204669+dsyme@users.noreply.github.com>
Agent-Logs-Url: https://github.com/github/gh-aw/sessions/43a8350f-5a6b-41d6-b5cf-16233acc7ed4
Copilot AI changed the title from "[WIP] Fix redaction of angle brackets in code and issue titles" to "Preserve angle brackets in code blocks and inline code spans during sanitization" Mar 20, 2026
Copilot AI requested a review from dsyme March 20, 2026 18:16
@pelikhan pelikhan closed this Mar 20, 2026
@pelikhan pelikhan reopened this Mar 20, 2026
@pelikhan

Sorry, I closed this one accidentally... We need to be careful here, because markdown parsers may have different bugs that would allow unbalanced code regions to be created. We should never allow the dangerous HTML elements.

@github-actions

🤖 Contribution Check Report

Hey @Copilot 👋, great work on this sanitization fix! Preserving angle brackets inside fenced code blocks and inline code spans (e.g. VBuffer<float32>) is a real usability improvement: mangled type parameters in issue titles and code samples are a legitimate pain point worth fixing.


✅ Checklist

| Check | Result |
| --- | --- |
| On-topic | ✅ Yes: targeted bug fix in actions/setup/js sanitization pipeline |
| Follows process | ✅ Yes: Copilot coding agent PR assigned to core team member (dsyme) |
| Focused | ✅ Yes: all 3 changed files relate to the single XML sanitization concern |
| New dependencies | ✅ No: no manifest changes |
| Has tests | ✅ Yes: 10 test cases added to sanitize_content.test.cjs |
| Has description | ✅ Yes: detailed body with Changes section, scope callout, and before/after example |
| Diff size | 273 lines (265 additions, 8 deletions) |

🟢 Verdict: Aligned

The PR is well-structured and complete. The three new helpers (getFencedCodeRanges, applyFnOutsideInlineCode, applyToNonCodeRegions) are clearly documented with JSDoc, the safe fallback (catch (_e) → fn(s)) ensures the security invariant is never weakened, and the test suite covers backtick fences, tilde fences, language-specifier fences, multi-backtick inline spans, mixed content, and the original VBuffer<float32> reproduction case.

One optional enhancement worth considering: the helper functions in sanitize_content_core.cjs are tested indirectly through the sanitizeContent integration tests. Adding a small dedicated test block directly for getFencedCodeRanges and applyFnOutsideInlineCode (e.g. in a sanitize_content_core.test.cjs) would give tighter feedback on edge cases like deeply nested backticks or malformed fences — but this is not a blocker.

This PR looks ready for maintainer review. 🎉

Note

🔒 Integrity filtering filtered 1 item

Integrity filtering activated and filtered the following item during workflow execution.
This happens when a tool call accesses a resource that does not meet the required integrity or secrecy level of the workflow.

  • issue:#unknown (search_issues: has secrecy requirements that agent doesn't meet. The agent is not authorized to access private-scoped data.)

Generated by Contribution Check

@dsyme dsyme marked this pull request as ready for review March 21, 2026 13:27
Copilot AI review requested due to automatic review settings March 21, 2026 13:27
@dsyme dsyme merged commit 69d40fe into main Mar 21, 2026
51 checks passed
@dsyme dsyme deleted the copilot/fix-angle-bracket-redaction branch March 21, 2026 13:27

Copilot AI left a comment

Pull request overview

Prevents XML tag conversion (<...> → (...)) and XML comment stripping from mangling markdown code samples and inline code spans that include angle brackets (e.g., generics/templates like VBuffer<float32>).

Changes:

  • Added markdown-aware helpers to identify fenced code blocks and inline code spans, and to apply transforms only outside those regions.
  • Updated both sanitization pipelines (sanitizeContentCore and the mention-filtering sanitizeContent) to run removeXmlComments / convertXmlTags via the new wrapper.
  • Added tests asserting angle brackets and XML comments are preserved inside fenced code blocks and inline code spans, while still sanitizing regular text.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| actions/setup/js/sanitize_content_core.cjs | Adds helpers for code-region detection and applies XML sanitizers only outside code regions; exports applyToNonCodeRegions. |
| actions/setup/js/sanitize_content.cjs | Uses applyToNonCodeRegions in the mention-filtering sanitization pipeline as well. |
| actions/setup/js/sanitize_content.test.cjs | Adds tests for preserving angle brackets/comments in inline and fenced code regions and continued sanitization in normal text. |

Comment on lines +966 to 972
+ // Remove XML comments – skip code blocks and inline code to avoid altering code content
+ sanitized = applyToNonCodeRegions(sanitized, removeXmlComments);

- // Convert XML tags to parentheses format to prevent injection
- sanitized = convertXmlTags(sanitized);
+ // Convert XML tags to parentheses format – skip code blocks and inline code so that
+ // type parameters (e.g. VBuffer<float32>) and code containing angle brackets are preserved
+ sanitized = applyToNonCodeRegions(sanitized, convertXmlTags);

Copilot AI Mar 21, 2026

applyToNonCodeRegions determines code regions before balanceCodeRegions runs later in the pipeline. If balanceCodeRegions modifies fence lengths / closes unclosed fences (a scenario this repo explicitly expects for AI-generated markdown), content that ends up inside a code block after balancing may still have had XML comment removal / tag conversion applied earlier, reintroducing the original mangling for malformed markdown. Consider balancing code regions before running the code-aware XML transforms, or running the XML transforms on the balanced markdown so both stages agree on code boundaries.

expect(result).toContain("`VBuffer<float>`");
expect(result).not.toContain("VBuffer(float32)");
expect(result).not.toContain("VBuffer(float)");
});

Copilot AI Mar 21, 2026

The new tests cover top-level fenced blocks and inline code spans, but they don’t cover fenced code blocks inside blockquotes (e.g. > ```yaml) or list items—both are common markdown patterns and currently not detected by getFencedCodeRanges. Adding coverage for these cases would prevent regressions where angle brackets are still converted inside those code blocks.

Suggested change
});
});
it("should treat fenced code blocks inside blockquotes as code regions", () => {
  const markdown = [
    "> ```yaml",
    "> apiVersion: v1",
    "> kind: Pod<V1>",
    "> ```",
  ].join("\n");
  const result = sanitizeContent(markdown);
  expect(result).toContain("kind: Pod<V1>");
  expect(result).not.toContain("kind: Pod(V1)");
});
it("should treat fenced code blocks inside list items as code regions", () => {
  const markdown = [
    "- ```csharp",
    " var list = new List<string>();",
    " ```",
  ].join("\n");
  const result = sanitizeContent(markdown);
  expect(result).toContain("List<string>");
  expect(result).not.toContain("List(string)");
});

Comment on lines +344 to +351
if (!inBlock) {
  const m = trimmed.match(/^(`{3,}|~{3,})/);
  if (m) {
    inBlock = true;
    blockStart = pos;
    fenceChar = m[1][0];
    fenceLen = m[1].length;
  }

Copilot AI Mar 21, 2026

getFencedCodeRanges only recognizes fences when the trimmed line starts with backticks/tildes. This misses valid fenced code blocks inside blockquotes (e.g. > ```yaml) and list items (e.g. 1. ``` or - ```), so XML tag conversion/comment removal will still run inside those code blocks and continue mangling code samples in those common markdown constructs. Consider extending the fence detection to handle >-prefixed blockquote lines and list-item prefixes (or switching to a markdown fence parser that understands these cases).
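
One way the fence check could be extended is sketched below. `stripContainerPrefix` and `isFenceLine` are hypothetical names that do not appear in the PR, and the prefix patterns are a rough approximation of blockquote and list markers, not a full markdown parser. A looser detector also widens the regions that skip sanitization, so it would need the same scrutiny as the rest of the change.

```javascript
// Hypothetical helper: strip leading blockquote markers and one list marker
// so that a fence such as "> ```yaml" or "- ```csharp" can be recognized.
function stripContainerPrefix(line) {
  return line
    .replace(/^(?:\s{0,3}>\s?)+/, "") // one or more ">" blockquote markers
    .replace(/^\s*(?:[-*+]|\d{1,9}[.)])\s+/, "") // a single bullet or ordered-list marker
    .trimStart();
}

// Same fence pattern as the snippet quoted above: 3+ backticks or 3+ tildes.
const FENCE_OPEN = /^(`{3,}|~{3,})/;

function isFenceLine(line) {
  return FENCE_OPEN.test(stripContainerPrefix(line));
}

const fence = "`".repeat(3);
console.log(isFenceLine("> " + fence + "yaml")); // true
console.log(isFenceLine("- " + fence + "csharp")); // true
console.log(isFenceLine("> quoted text")); // false
```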

Comment on lines +459 to +462
function applyToNonCodeRegions(s, fn) {
  if (!s || typeof s !== "string") {
    return s || "";
  }

Copilot AI Mar 21, 2026

The JSDoc for applyToNonCodeRegions says it returns a string, but when s is truthy and not a string (e.g. Buffer/object) it returns s as-is (return s || ""). Since this helper is exported, that return-type mismatch can leak non-strings to callers. Consider always returning a string here (e.g. "" for non-strings) or updating the JSDoc/type contract accordingly.
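
The mismatch can be reproduced in isolation. In the sketch below, `guardAsQuoted` copies only the guard from the snippet above, and `guardStrict` is a hypothetical alternative (an assumption, not the merged code) that always yields a string.

```javascript
// Guard as quoted in the review: a truthy non-string passes through unchanged.
function guardAsQuoted(s) {
  if (!s || typeof s !== "string") {
    return s || "";
  }
  return s;
}

// Hypothetical stricter guard: anything that is not a string becomes "".
function guardStrict(s) {
  return typeof s === "string" ? s : "";
}

console.log(typeof guardAsQuoted({ a: 1 })); // "object"
console.log(typeof guardStrict({ a: 1 })); // "string"
```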

Comment on lines +392 to +437
while (i < text.length) {
  if (text[i] !== "`") {
    i++;
    continue;
  }

  // Count consecutive backticks at the current position
  const btStart = i;
  let btCount = 0;
  while (i < text.length && text[i] === "`") {
    btCount++;
    i++;
  }
  // i is now past the opening backtick sequence

  // Look for the matching closing sequence of exactly btCount backticks
  let closeIdx = -1;
  let j = i;
  while (j < text.length) {
    if (text[j] === "`") {
      let closeCount = 0;
      const jStart = j;
      while (j < text.length && text[j] === "`") {
        closeCount++;
        j++;
      }
      if (closeCount === btCount) {
        closeIdx = jStart;
        break;
      }
      // Different length – keep scanning (j already advanced past these backticks)
    } else {
      j++;
    }
  }

  if (closeIdx !== -1) {
    // Valid inline code span found: apply fn to the text before it, then keep the code span
    if (textStart < btStart) {
      parts.push(fn(text.slice(textStart, btStart)));
    }
    parts.push(text.slice(btStart, closeIdx + btCount));
    textStart = closeIdx + btCount;
    i = textStart;
  }
  // If no matching close was found, the backticks are treated as regular text (i already advanced)

Copilot AI Mar 21, 2026

applyFnOutsideInlineCode can become O(n²) for pathological inputs with many unmatched/differently-sized backtick runs because for each opener it linearly scans the remainder of the string to find a matching closer. Given the sanitizer allows up to ~524k chars, this can become a noticeable CPU hotspot / potential DoS vector. Consider rewriting this as a single-pass state machine that advances j monotonically (or using a regex-based tokenizer) so the overall runtime is O(n).

Suggested change
// To avoid O(n^2) behavior when scanning for matching backtick runs,
// we pre-scan the string once to find all backtick runs, then pair them
// by length in a single pass. This preserves the original semantics:
// for a given run length, the first occurrence opens a span, the next
// closes it, and so on, ignoring intervening runs of different lengths.
/** @type {{ start: number, length: number }[]} */
const runs = [];
for (let idx = 0; idx < text.length;) {
  if (text[idx] !== "`") {
    idx++;
    continue;
  }
  const start = idx;
  let count = 0;
  while (idx < text.length && text[idx] === "`") {
    count++;
    idx++;
  }
  runs.push({ start, length: count });
}
/**
 * Map from backtick run length to the start index of the currently
 * pending opener (if any) for that length.
 * @type {Record<string, number | undefined>}
 */
const pending = Object.create(null);
/** @type {{ openStart: number, openLen: number, closeStart: number }[]} */
const spans = [];
for (const run of runs) {
  const lenKey = String(run.length);
  const prev = pending[lenKey];
  if (prev === undefined) {
    // No pending opener for this length: mark this run as an opener.
    pending[lenKey] = run.start;
  } else {
    // Found a closer for the existing opener of this length.
    spans.push({
      openStart: prev,
      openLen: run.length,
      closeStart: run.start,
    });
    pending[lenKey] = undefined;
  }
}
// spans are generated in the order in which their closing runs appear.
// Since each openStart is always before its closeStart and runs are
// iterated in increasing order of start, spans are effectively ordered
// by appearance in the text and do not overlap for a given length.
// We now walk through the spans, emitting processed text and raw code.
for (const span of spans) {
  const btStart = span.openStart;
  const btCount = span.openLen;
  const closeIdx = span.closeStart;
  // Skip spans that start before the current textStart (e.g., if any
  // earlier spans have already advanced textStart past them).
  if (btStart < textStart) {
    continue;
  }
  // Apply fn to non-code text before this code span.
  if (textStart < btStart) {
    parts.push(fn(text.slice(textStart, btStart)));
  }
  // Preserve the code span (including backticks) unchanged.
  parts.push(text.slice(btStart, closeIdx + btCount));
  textStart = closeIdx + btCount;
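
For comparison, here is a self-contained sketch of a linear-time variant that resolves openers in text order, so a backtick run already consumed by an earlier span never acts as an opener. That matches the matching rule of the loop quoted at the top of this thread: the closer is the next run of exactly the same length. The function name `applyFnOutsideInlineCodeLinear` and the per-length cursor structure are assumptions for illustration, not code from the PR.

```javascript
function applyFnOutsideInlineCodeLinear(text, fn) {
  // 1. Record every backtick run once.
  const runs = [];
  for (let i = 0; i < text.length; ) {
    if (text[i] !== "`") {
      i++;
      continue;
    }
    const start = i;
    while (i < text.length && text[i] === "`") i++;
    runs.push({ start, length: i - start });
  }

  // 2. Group run indices by run length; each list is already in ascending order.
  const byLength = new Map();
  runs.forEach((run, idx) => {
    if (!byLength.has(run.length)) byLength.set(run.length, { idxs: [], next: 0 });
    byLength.get(run.length).idxs.push(idx);
  });

  // 3. Walk the runs in order. Each per-length cursor only moves forward,
  //    so the total work is proportional to the number of runs.
  const parts = [];
  let textStart = 0;
  for (let k = 0; k < runs.length; k++) {
    const group = byLength.get(runs[k].length);
    while (group.next < group.idxs.length && group.idxs[group.next] <= k) group.next++;
    if (group.next === group.idxs.length) continue; // no closer: treat as regular text
    const closeIdx = group.idxs[group.next];
    const end = runs[closeIdx].start + runs[closeIdx].length;
    if (textStart < runs[k].start) {
      parts.push(fn(text.slice(textStart, runs[k].start)));
    }
    parts.push(text.slice(runs[k].start, end)); // code span kept verbatim
    textStart = end;
    k = closeIdx; // resume after the closing run
  }
  if (textStart < text.length) {
    parts.push(fn(text.slice(textStart)));
  }
  return parts.join("");
}

// Usage with an uppercase transform so the untouched code spans are easy to see.
const upper = (s) => s.toUpperCase();
console.log(applyFnOutsideInlineCodeLinear("x `y` z", upper)); // X `y` Z
```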

Development

Successfully merging this pull request may close these issues.

Angle-bracket redaction too strong in code and issue titles

4 participants