Summary
hardenUnicodeText() in sanitize_content_core.cjs strips a specific allowlist of BMP invisible characters and bidi controls but omits the entire Unicode Tag Characters block (U+E0020–U+E007F, Plane 14). These 128 Cf-category codepoints have exact 1:1 ASCII equivalents (U+E0041 = TAG LATIN CAPITAL LETTER A, etc.) and produce output that renders as completely blank in all standard text renderers, including GitHub Markdown. An agent under prompt injection can encode any content (secrets, instruction payloads) entirely in Tag Characters; the encoded payload passes through sanitizeContentCore and the allowedAliases branch of sanitize_content.cjs unchanged, is posted to GitHub via the safe-outputs write path, and decodes 1:1 to the original ASCII content for any consumer parsing the raw API response (including subsequent AI workflow agents).
This is distinct from and qualitatively stronger than the U+2061–U+2064 fragmentation bypass (see related issue #1888 in gh-aw-security): fragmented text retains visual presence, while Tag Character encoding makes the entire payload completely invisible to human reviewers, substantially impairing human oversight of safe-output content.
Affected Area
Output sanitization / safe-outputs write path — hardenUnicodeText() in sanitize_content_core.cjs (same file deployed to both actions/ and safeoutputs/ paths, SHA 159c2fed, confirmed at v0.68.3).
Reproduction Outline
- Take any ASCII string (e.g., a secret pattern or prompt-injection payload).
- Encode every character using its Tag Character equivalent: codepoint C (0x20–0x7E) → U+E0000 + C (surrogate pair \uDB40\uDCxx).
- Pass the encoded string through hardenUnicodeText() from the deployed sanitize_content_core.cjs.
- Observe that the output bytes are unchanged (132 bytes in, 132 bytes out for a 33-character marker); NFKC normalization likewise leaves the string unchanged.
- Render the output in any standard Markdown renderer (GitHub, VS Code, browser): the string appears blank — no visible characters.
- Decode the raw output: the original ASCII content is recovered exactly.
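The steps above can be sketched in a few lines of Node.js. This is an illustrative reproduction only — encodeToTags and decodeFromTags are hypothetical helper names, not functions in gh-aw — and it assumes the 1:1 mapping described above (codepoint C → U+E0000 + C):

```javascript
// Map each printable ASCII character C (0x20–0x7E) to its Tag Character
// equivalent U+E0000 + C (a Plane 14 codepoint, i.e. a surrogate pair in JS).
function encodeToTags(ascii) {
  return [...ascii]
    .map(ch => String.fromCodePoint(0xe0000 + ch.codePointAt(0)))
    .join("");
}

// Reverse mapping: subtract 0xE0000 from each tag codepoint.
function decodeFromTags(tagged) {
  return [...tagged]
    .map(ch => String.fromCodePoint(ch.codePointAt(0) - 0xe0000))
    .join("");
}

const marker = "PENTEST-SECRET-MARKER-24665812475"; // 33 ASCII characters
const encoded = encodeToTags(marker);

// Each tag character is 4 bytes in UTF-8, so 33 chars → 132 bytes.
console.log(Buffer.byteLength(encoded, "utf8")); // 132
// NFKC normalization does not decompose Tag Characters.
console.log(encoded.normalize("NFKC") === encoded); // true
// The payload round-trips exactly.
console.log(decodeFromTags(encoded) === marker); // true
```

The 132-byte figure matches the Observed Behavior measurement: any sanitizer that leaves the string byte-identical has passed the payload through.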
Observed Behavior
Tag Characters (U+E0020–U+E007F) pass through hardenUnicodeText and sanitizeContentCore unchanged. A payload encoded entirely in Tag Characters appears as an empty string in all renderers but decodes 1:1 to the original ASCII content. Direct Node.js test against v0.68.3's sanitize_content_core.cjs:
Input bytes: 132 (33-char ASCII marker encoded in Tag Characters)
Output bytes: 132 (unchanged — pass-through confirmed)
NFKC unchanged: true
Rendered in GitHub Markdown: [blank]
Decoded: PENTEST-SECRET-MARKER-24665812475
Expected Behavior
hardenUnicodeText strips all Unicode Tag Characters (U+E0020–U+E007F) — either by adding the surrogate pair range \uDB40[\uDC00-\uDC7F] to the Step 3 regex, or preferably by adopting a Unicode General Category Cf (Format) strip that covers this block, the U+2061–U+2064 range, and other bypass-confirmed ranges (U+FE00–U+FE0F, U+FFF9–U+FFFB, U+180B–U+180D, U+061C, U+2065) without requiring future per-range patches.
Security Relevance
A prompt-injected agent can encode a secret or instruction payload entirely in Tag Characters and write it to a GitHub issue, comment, discussion, or PR body via the safe-outputs path. The content appears blank to any human reviewer performing oversight but is delivered in full to any downstream system or AI agent that reads the raw GitHub API response. This undermines the safe-outputs sanitization guarantee and human-in-the-loop review of agent-generated content. The fix for this issue and the related U+2061–U+2064 finding share a root cause and fix location and should be coordinated.
Suggested Fix
// Add to hardenUnicodeText Step 3 (or replace the entire step with a Cf-category strip):
result = result.replace(/\uDB40[\uDC00-\uDC7F]/g, "");
// Preferred: strip all General Category Cf (Format) characters, e.g. via /\p{Cf}/gu,
// to avoid future per-range patches.
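A minimal sketch checking both fix options against a tag-encoded payload (stripTags and stripFormat are illustrative names, not the gh-aw functions; the Unicode property escape \p{Cf} requires the u flag, supported in all current Node.js releases):

```javascript
// A tag-encoded payload, built per the reproduction outline above.
const payload = [..."secret"]
  .map(ch => String.fromCodePoint(0xe0000 + ch.codePointAt(0)))
  .join("");

// Option 1: targeted surrogate-pair range covering U+E0000–U+E007F.
const stripTags = s => s.replace(/\uDB40[\uDC00-\uDC7F]/g, "");

// Option 2: strip every General Category Cf (Format) character; this also
// covers U+2061–U+2064, U+061C, and the other bypass-confirmed ranges.
const stripFormat = s => s.replace(/\p{Cf}/gu, "");

console.log(stripTags(payload));   // "" — payload fully removed
console.log(stripFormat(payload)); // "" — payload fully removed
```

Note that a blanket Cf strip also removes characters some content may legitimately use (e.g. soft hyphen U+00AD, ZWJ in emoji sequences), so the tradeoff should be weighed against the allowlist approach.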
Additional Context
If the current Tag Character pass-through is by design (e.g., for legitimate emoji tag sequences), that assumption should be explicitly documented in the sanitization module and the security architecture docs, since the current documentation implies comprehensive stripping of non-printable/invisible characters.
gh-aw version: v0.68.3
Original finding: https://github.com/githubnext/gh-aw-security/issues/1943