SafeOutputs: `hardenUnicodeText` lacks Cyrillic/Greek homoglyph normalization; `threat_detection.md` "Encoded Strings" check not
[Content truncated due to length]

## Summary

Two confirmed architectural gaps in the SafeOutputs sanitization and threat-detection layers (v0.67.1). First, `hardenUnicodeText` in `sanitize_content_core.cjs` applies NFC normalization and strips zero-width/bidi controls and full-width ASCII (U+FF01–FF5E), but does **not** apply NFKC or a Unicode confusables table — Cyrillic and Greek lookalike characters pass through `sanitizeContent` unchanged and are submitted verbatim to the GitHub API. Second, the `"Encoded Strings"` check in `threat_detection.md` (Base64, hex, etc.) is nested as a sub-bullet under the `"Malicious Patch"` section; for `create_issue`-only workflows with no git patch, encoded strings in the issue body are not explicitly covered by either the static layer or a dedicated AI-detection instruction.

## Affected Area

SafeOutputs content sanitization (`hardenUnicodeText` / `sanitize_content_core.cjs`) and AI threat-detection prompt (`threat_detection.md`) — the two defense-in-depth controls that mediate all write operations through the `safe_outputs` job.

## Reproduction Outline

**Gap 1 — Cyrillic homoglyph bypass (static layer)**

1. Substitute Latin characters in a sensitive string with visually identical Cyrillic codepoints (e.g., A→А U+0410, E→Е U+0415, T→Т U+0422, K→К U+041A, M→М U+041C, S→С U+0421, R→Р U+0420).
2. Pass the substituted string as the body of a `create_issue` safe-outputs call in any gh-aw workflow.
3. `sanitizeContent` is invoked by `collect_ndjson_output.cjs`; `hardenUnicodeText` runs NFC + full-width ASCII mapping — Cyrillic codepoints are unchanged.
4. The body reaches the GitHub API verbatim.

**Gap 2 — Encoded strings not flagged for create_issue-only workflows (AI detection layer)**

1. Use a `create_issue`-only workflow (no git patch produced).
2. Include a Base64- or hex-encoded representation of a sensitive string in the proposed issue body.
3. The detection model evaluates agent output under "Prompt Injection" and "Secret Leak" only; the "Encoded Strings" sub-check is contextually tied to the "Malicious Patch" section and is not reliably applied to issue body content.

## Observed Behavior

- Cyrillic-substituted content: `sanitizeContent` returns the input unchanged (`input === output: true`; confirmed against deployed v0.67.1 scripts).
- Base64/hex-encoded content: passes through `sanitizeContent` unchanged; detection AI does not have an explicit instruction to flag encoded strings in non-patch output.

## Expected Behavior

- `hardenUnicodeText` should normalize Cyrillic/Greek lookalike characters to their Latin equivalents (e.g., via `result.normalize("NFKC")` or a Unicode TR#39 confusables table) before content reaches the GitHub API.
- `threat_detection.md` should include an explicit instruction under the "Secret Leak" check to flag Base64, hex, or homoglyph-substituted representations of secrets regardless of whether a code patch is present.

## Security Relevance

The SafeOutputs write path is the primary control preventing an AI agent from leaking sensitive content to GitHub. A homoglyph-substituted or encoded secret that bypasses both the static sanitizer and the AI detection prompt can be exfiltrated into a public issue body while appearing to satisfy all pre-write checks. This contradicts the documented guarantee that "detection analysis explicitly examines outputs for secret leaks before safe output jobs proceed."

## Suggested Fixes

1. In `hardenUnicodeText` (`sanitize_content_core.cjs`), add `result = result.normalize("NFKC")` after the existing NFC step, and/or add a Cyrillic→Latin / Greek→Latin confusables-map pass (Unicode TR#39).
2. In `threat_detection.md`, add an explicit bullet under "Secret Leak" instructing the model to flag Base64, hex, ROT13, or homoglyph-substituted representations of secrets — independent of whether a code patch is present.
3. Consider extending `redact_secrets.cjs` built-in patterns to scan `create_issue` body content as a static backstop independent of the AI detection result.

**gh-aw version**: v0.67.1

Original finding: https://github.com/githubnext/gh-aw-security/issues/1711




> Generated by [File Issue](https://github.com/githubnext/gh-aw-security/actions/runs/24188563085/agentic_workflow) · ● 348.8K · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+githubnext%2Fgh-aw-security%2Ffile-issue%22&type=issues)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SafeOutputs: `hardenUnicodeText` lacks Cyrillic/Greek homoglyph normalization; `threat_detection.md` "Encoded Strings" check not [Content truncated due to length] #25457

Summary

Affected Area

Reproduction Outline

Observed Behavior

Expected Behavior

Security Relevance

Suggested Fixes

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

SafeOutputs: hardenUnicodeText lacks Cyrillic/Greek homoglyph normalization; threat_detection.md "Encoded Strings" check not [Content truncated due to length] #25457

Description

Summary

Affected Area

Reproduction Outline

Observed Behavior

Expected Behavior

Security Relevance

Suggested Fixes

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

SafeOutputs: `hardenUnicodeText` lacks Cyrillic/Greek homoglyph normalization; `threat_detection.md` "Encoded Strings" check not [Content truncated due to length] #25457