
sanitize_content_core: hardenUnicodeText missing Tag Characters block (U+E0020–U+E007F) — fully invisible ASCII-equivalent encoding #28058

@szabta89

Description


Summary

hardenUnicodeText() in sanitize_content_core.cjs strips a specific allowlist of BMP invisible characters and bidi controls but omits the entire Unicode Tag Characters block (U+E0020–U+E007F, Plane 14). These 128 Cf-category codepoints have exact 1:1 ASCII equivalents (U+E0041 = TAG LATIN CAPITAL LETTER A, etc.) and produce output that renders as completely blank in all standard text renderers, including GitHub Markdown. An agent under prompt injection can encode any content (secrets, instruction payloads) entirely in Tag Characters; the encoded payload passes through sanitizeContentCore and the allowedAliases branch of sanitize_content.cjs unchanged, is posted to GitHub via the safe-outputs write path, and decodes 1:1 to the original ASCII content for any consumer parsing the raw API response (including subsequent AI workflow agents).

This is distinct from and qualitatively stronger than the U+2061–U+2064 fragmentation bypass (see related issue #1888 in gh-aw-security): fragmented text retains visual presence, while Tag Character encoding makes the entire payload completely invisible to human reviewers, substantially impairing human oversight of safe-output content.

Affected Area

Output sanitization / safe-outputs write path — hardenUnicodeText() in sanitize_content_core.cjs (same file deployed to both actions/ and safeoutputs/ paths, SHA 159c2fed, confirmed at v0.68.3).

Reproduction Outline

  1. Take any ASCII string (e.g., a secret pattern or prompt-injection payload).
  2. Encode every character using its Tag Character equivalent: codepoint C (0x20–0x7E) → U+E0000 + C (surrogate pair \uDB40\uDCxx).
  3. Pass the encoded string through hardenUnicodeText() from the deployed sanitize_content_core.cjs.
  4. Observe that the output bytes are unchanged (132 bytes in, 132 bytes out for a 33-character marker); NFKC normalization likewise leaves the string unchanged.
  5. Render the output in any standard Markdown renderer (GitHub, VS Code, browser): the string appears blank — no visible characters.
  6. Decode the raw output: the original ASCII content is recovered exactly.
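The encode/decode round trip in the steps above can be sketched in plain Node.js (no dependencies; the function names `encodeTag`/`decodeTag` are illustrative, not part of gh-aw):

```javascript
// encodeTag maps each printable ASCII character (0x20-0x7E) to its
// 1:1 equivalent in the Unicode Tag Characters block (Plane 14).
function encodeTag(ascii) {
  return [...ascii]
    .map(ch => String.fromCodePoint(0xe0000 + ch.codePointAt(0)))
    .join("");
}

// decodeTag reverses the mapping, as any consumer parsing the raw
// API response could do.
function decodeTag(text) {
  return [...text]
    .map(ch => {
      const cp = ch.codePointAt(0);
      return cp >= 0xe0020 && cp <= 0xe007f
        ? String.fromCodePoint(cp - 0xe0000)
        : ch;
    })
    .join("");
}

const payload = "PENTEST-SECRET-MARKER";
const encoded = encodeTag(payload);
// Each Tag Character is a surrogate pair, so UTF-16 length doubles.
console.log(encoded.length);                        // 42
console.log(encoded.normalize("NFKC") === encoded); // true — NFKC is a no-op
console.log(decodeTag(encoded));                    // PENTEST-SECRET-MARKER
```

Pasting `encoded` into any standard Markdown renderer shows nothing; the 2× byte ratio matches the 132-in/132-out observation below (each Tag Character occupies 4 UTF-8 bytes vs. 1 for ASCII).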

Observed Behavior

Tag Characters (U+E0020–U+E007F) pass through hardenUnicodeText and sanitizeContentCore unchanged. A payload encoded entirely in Tag Characters appears as an empty string in all renderers but decodes 1:1 to the original ASCII content. Direct Node.js test against v0.68.3's sanitize_content_core.cjs:

```
Input bytes:  132   (33-char ASCII marker encoded in Tag Characters)
Output bytes: 132   (unchanged — pass-through confirmed)
NFKC unchanged: true
Rendered in GitHub Markdown: [blank]
Decoded: PENTEST-SECRET-MARKER-24665812475
```

Expected Behavior

hardenUnicodeText strips all Unicode Tag Characters (U+E0020–U+E007F) — either by adding the surrogate pair range \uDB40[\uDC00-\uDC7F] to the Step 3 regex, or preferably by adopting a Unicode General Category Cf (Format) strip that covers this block, the U+2061–U+2064 range, and other bypass-confirmed ranges (U+FE00–U+FE0F, U+FFF9–U+FFFB, U+180B–U+180D, U+061C, U+2065) without requiring future per-range patches.
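A minimal sketch of the category-based variant, assuming Node ≥ 10 (Unicode property escapes in regexes) and that the function name `stripFormatChars` is illustrative. One caveat to weigh: ZWJ (U+200D) and ZWNJ (U+200C) are also General Category Cf and are load-bearing in emoji ZWJ sequences and some scripts, so they likely need an explicit carve-out:

```javascript
// Strip all Cf (Format) category characters except ZWJ/ZWNJ, which
// legitimate emoji sequences and complex scripts depend on.
function stripFormatChars(text) {
  return text.replace(/(?![\u200C\u200D])\p{Cf}/gu, "");
}

console.log(stripFormatChars("A\u{E0041}B\u2064C")); // ABC — tag char and invisible plus removed
```

This covers U+E0020–U+E007F, U+2061–U+2064, U+FFF9–U+FFFB, U+061C, and U+2065 in one rule, so future invisible-character bypasses in the Cf category need no per-range patch.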

Security Relevance

A prompt-injected agent can encode a secret or instruction payload entirely in Tag Characters and write it to a GitHub issue, comment, discussion, or PR body via the safe-outputs path. The content appears blank to any human reviewer performing oversight but is delivered in full to any downstream system or AI agent that reads the raw GitHub API response. This undermines the safe-outputs sanitization guarantee and human-in-the-loop review of agent-generated content. The fix for this issue and the related U+2061–U+2064 finding share a root cause and fix location and should be coordinated.

Suggested Fix

```js
// Add to hardenUnicodeText Step 3 (or replace the entire step with a
// Cf-category strip):
result = result.replace(/\uDB40[\uDC00-\uDC7F]/g, "");
// Preferred: strip all Cf-category characters to avoid future per-range patches.
```

Additional Context

If the current Tag Character pass-through is by design (e.g., for legitimate emoji tag sequences), that assumption should be explicitly documented in the sanitization module and the security architecture docs, since the current documentation implies comprehensive stripping of non-printable/invisible characters.
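For context on the legitimate-use tension: RGI subdivision flags (e.g. the flag of Scotland) are emoji tag sequences built from this same block — a base U+1F3F4 followed by Tag Characters and a terminating U+E007F CANCEL TAG — so a blanket strip would reduce them to the plain black flag. A quick illustration (the surrogate-pair regex is the one proposed in the Suggested Fix):

```javascript
// Flag of Scotland: U+1F3F4 + tag chars "gbsct" + CANCEL TAG (U+E007F).
const scotland = "\u{1F3F4}\u{E0067}\u{E0062}\u{E0073}\u{E0063}\u{E0074}\u{E007F}";

// Stripping the Tag Characters block leaves only the base black flag.
const stripped = scotland.replace(/\uDB40[\uDC00-\uDC7F]/g, "");
console.log(stripped === "\u{1F3F4}"); // true
```

If preserving such sequences matters, the strip could allow Tag Characters only when immediately following U+1F3F4 and properly terminated; otherwise the documented position should be that subdivision flags are sacrificed for the sanitization guarantee.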


gh-aw version: v0.68.3
Original finding: https://github.com/githubnext/gh-aw-security/issues/1943

