
sanitize_content_core: hardenUnicodeText missing Tag Characters block (U+E0020–U+E007F) — fully invisible ASCII-equivalent encoding #28058

@szabta89

Description


Summary

hardenUnicodeText() in sanitize_content_core.cjs strips a specific allowlist of BMP invisible characters and bidi controls but omits the entire Unicode Tag Characters block (U+E0020–U+E007F, Plane 14). These 128 Cf-category codepoints have exact 1:1 ASCII equivalents (U+E0041 = TAG LATIN CAPITAL LETTER A, etc.) and produce output that renders as completely blank in all standard text renderers, including GitHub Markdown. An agent under prompt injection can encode any content (secrets, instruction payloads) entirely in Tag Characters; the encoded payload passes through sanitizeContentCore and the allowedAliases branch of sanitize_content.cjs unchanged, is posted to GitHub via the safe-outputs write path, and decodes 1:1 to the original ASCII content for any consumer parsing the raw API response (including subsequent AI workflow agents).

This is distinct from and qualitatively stronger than the U+2061–U+2064 fragmentation bypass (see related issue #1888 in gh-aw-security): fragmented text retains visual presence, while Tag Character encoding makes the entire payload completely invisible to human reviewers, substantially impairing human oversight of safe-output content.

Affected Area

Output sanitization / safe-outputs write path — hardenUnicodeText() in sanitize_content_core.cjs (same file deployed to both actions/ and safeoutputs/ paths, SHA 159c2fed, confirmed at v0.68.3).

Reproduction Outline

  1. Take any ASCII string (e.g., a secret pattern or prompt-injection payload).
  2. Encode every character using its Tag Character equivalent: codepoint C (0x20–0x7E) → U+E0000 + C (surrogate pair \uDB40\uDCxx).
  3. Pass the encoded string through hardenUnicodeText() from the deployed sanitize_content_core.cjs.
  4. Observe that the output bytes are unchanged (132 bytes in, 132 bytes out for a 33-character marker); NFKC normalization likewise leaves the string unchanged.
  5. Render the output in any standard Markdown renderer (GitHub, VS Code, browser): the string appears blank — no visible characters.
  6. Decode the raw output: the original ASCII content is recovered exactly.
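The encode/decode round trip in the steps above can be sketched in plain Node.js (no dependencies; the function names `encodeTag`/`decodeTag` are illustrative, not part of gh-aw):

```javascript
// encodeTag maps each printable ASCII character (0x20-0x7E) to its
// 1:1 equivalent in the Unicode Tag Characters block (Plane 14).
function encodeTag(ascii) {
  return [...ascii]
    .map(ch => String.fromCodePoint(0xe0000 + ch.codePointAt(0)))
    .join("");
}

// decodeTag reverses the mapping, as any consumer parsing the raw
// API response could do.
function decodeTag(text) {
  return [...text]
    .map(ch => {
      const cp = ch.codePointAt(0);
      return cp >= 0xe0020 && cp <= 0xe007f
        ? String.fromCodePoint(cp - 0xe0000)
        : ch;
    })
    .join("");
}

const payload = "PENTEST-SECRET-MARKER";
const encoded = encodeTag(payload);
// Each Tag Character is a surrogate pair, so UTF-16 length doubles.
console.log(encoded.length);                        // 42
console.log(encoded.normalize("NFKC") === encoded); // true — NFKC is a no-op
console.log(decodeTag(encoded));                    // PENTEST-SECRET-MARKER
```

Pasting `encoded` into any standard Markdown renderer shows nothing; the 2× byte ratio matches the 132-in/132-out observation below (each Tag Character occupies 4 UTF-8 bytes vs. 1 for ASCII).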

Observed Behavior

Tag Characters (U+E0020–U+E007F) pass through hardenUnicodeText and sanitizeContentCore unchanged. A payload encoded entirely in Tag Characters appears as an empty string in all renderers but decodes 1:1 to the original ASCII content. Direct Node.js test against v0.68.3's sanitize_content_core.cjs:

```
Input bytes:  132   (33-char ASCII marker encoded in Tag Characters)
Output bytes: 132   (unchanged — pass-through confirmed)
NFKC unchanged: true
Rendered in GitHub Markdown: [blank]
Decoded: PENTEST-SECRET-MARKER-24665812475
```

Expected Behavior

hardenUnicodeText strips all Unicode Tag Characters (U+E0020–U+E007F) — either by adding the surrogate pair range \uDB40[\uDC00-\uDC7F] to the Step 3 regex, or preferably by adopting a Unicode General Category Cf (Format) strip that covers this block, the U+2061–U+2064 range, and other bypass-confirmed ranges (U+FE00–U+FE0F, U+FFF9–U+FFFB, U+180B–U+180D, U+061C, U+2065) without requiring future per-range patches.
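A minimal sketch of the category-based variant, assuming Node ≥ 10 (Unicode property escapes in regexes) and that the function name `stripFormatChars` is illustrative. One caveat to weigh: ZWJ (U+200D) and ZWNJ (U+200C) are also General Category Cf and are load-bearing in emoji ZWJ sequences and some scripts, so they likely need an explicit carve-out:

```javascript
// Strip all Cf (Format) category characters except ZWJ/ZWNJ, which
// legitimate emoji sequences and complex scripts depend on.
function stripFormatChars(text) {
  return text.replace(/(?![\u200C\u200D])\p{Cf}/gu, "");
}

console.log(stripFormatChars("A\u{E0041}B\u2064C")); // ABC — tag char and invisible plus removed
```

This covers U+E0020–U+E007F, U+2061–U+2064, U+FFF9–U+FFFB, U+061C, and U+2065 in one rule, so future invisible-character bypasses in the Cf category need no per-range patch.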

Security Relevance

A prompt-injected agent can encode a secret or instruction payload entirely in Tag Characters and write it to a GitHub issue, comment, discussion, or PR body via the safe-outputs path. The content appears blank to any human reviewer performing oversight but is delivered in full to any downstream system or AI agent that reads the raw GitHub API response. This undermines the safe-outputs sanitization guarantee and human-in-the-loop review of agent-generated content. The fix for this issue and the related U+2061–U+2064 finding share a root cause and fix location and should be coordinated.

Suggested Fix

```js
// Add to hardenUnicodeText Step 3 (or replace the entire step with a
// Cf-category strip):
result = result.replace(/\uDB40[\uDC00-\uDC7F]/g, "");
// Preferred: strip all Cf-category characters to avoid future per-range patches.
```

Additional Context

If the current Tag Character pass-through is by design (e.g., for legitimate emoji tag sequences), that assumption should be explicitly documented in the sanitization module and the security architecture docs, since the current documentation implies comprehensive stripping of non-printable/invisible characters.
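For context on the legitimate-use tension: RGI subdivision flags (e.g. the flag of Scotland) are emoji tag sequences built from this same block — a base U+1F3F4 followed by Tag Characters and a terminating U+E007F CANCEL TAG — so a blanket strip would reduce them to the plain black flag. A quick illustration (the surrogate-pair regex is the one proposed in the Suggested Fix):

```javascript
// Flag of Scotland: U+1F3F4 + tag chars "gbsct" + CANCEL TAG (U+E007F).
const scotland = "\u{1F3F4}\u{E0067}\u{E0062}\u{E0073}\u{E0063}\u{E0074}\u{E007F}";

// Stripping the Tag Characters block leaves only the base black flag.
const stripped = scotland.replace(/\uDB40[\uDC00-\uDC7F]/g, "");
console.log(stripped === "\u{1F3F4}"); // true
```

If preserving such sequences matters, the strip could allow Tag Characters only when immediately following U+1F3F4 and properly terminated; otherwise the documented position should be that subdivision flags are sacrificed for the sanitization guarantee.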


gh-aw version: v0.68.3
Original finding: https://github.com/githubnext/gh-aw-security/issues/1943

