feat: detect Unicode variation selectors in content scanner (Glassworm vector) by danielmeppiel · Pull Request #321 · microsoft/apm

danielmeppiel · 2026-03-16T09:10:15Z

Summary

Adds Unicode variation selector detection to the content scanner — the specific mechanism used in the Glassworm supply-chain attacks (March 2026) that compromised repositories and VS Code extensions.

Closes #320

What changed

Scanner ranges (`content_scanner.py`)

| Range | Severity | Category | Rationale |
|---|---|---|---|
| U+E0100–E01EF (VS17-256) | critical | variation-selector | No legitimate use in prompt files. 240 invisible chars that can encode arbitrary data. |
| U+FE00–FE0D (VS1-14) | warning | variation-selector | Rare CJK typography variants. Unusual in prompt files. |
| U+FE0E (VS15) | warning | variation-selector | Text presentation selector. Uncommon in prompts. |
| U+FE0F (VS16) | info | variation-selector | Emoji presentation selector. Common with emoji; only shown with `--verbose`. |
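
To make the "can encode arbitrary data" rationale concrete, here is a minimal sketch of how a reviewer might recover bytes hidden in variation selectors. The byte mapping (VS1-16 to 0-15, VS17-256 to 16-255) is an assumption chosen for illustration, not necessarily the scheme used in the Glassworm attacks, and `selector_to_byte` and `hidden_payload_bytes` are hypothetical helper names that are not part of APM.

```python
from typing import Optional


def selector_to_byte(ch: str) -> Optional[int]:
    """Map a variation selector to a byte value, or None for any other character.

    Assumed mapping (illustrative only): VS1-16 -> 0-15, VS17-256 -> 16-255.
    """
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F:        # VS1-16
        return cp - 0xFE00
    if 0xE0100 <= cp <= 0xE01EF:      # VS17-256
        return cp - 0xE0100 + 16
    return None


def hidden_payload_bytes(text: str) -> bytes:
    """Collect the bytes carried by every variation selector in the text."""
    values = (selector_to_byte(ch) for ch in text)
    return bytes(v for v in values if v is not None)
```

Under this assumed mapping, a string that renders as two visible characters can still carry extra bytes that only a code-point-level scan reveals.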

Design decisions

VS16 (U+FE0F) is info-level: Many emoji are written with this selector (❤️ = ❤ + U+FE0F), so flagging it as a warning would generate noise in most files that contain emoji. At info level it appears only with --verbose.
No architecture changes: Extends the existing _SUSPICIOUS_RANGES table and _CHAR_LOOKUP O(1) dict. The lookup table grows from ~156 to 412 entries — negligible.
Strip behavior: apm audit --strip removes warning/info-level VS (1-16) but preserves critical VS (17-256) for manual review — consistent with existing strip semantics.
Zero changes to consumers: The audit command, install security gate, and compile/pack scanning all use ContentScanner generically via severity — the new category renders correctly without any pipeline changes.
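
The range-table-plus-lookup design described above can be sketched as follows. This is an illustration under stated assumptions, not the actual `content_scanner.py`: `Rule`, `VS_RANGES`, `CHAR_LOOKUP`, and `scan_variation_selectors` are hypothetical names, and only the four variation selector rows from this PR are included.

```python
from typing import List, NamedTuple, Tuple


class Rule(NamedTuple):
    start: int       # first code point in the range (inclusive)
    end: int         # last code point in the range (inclusive)
    severity: str
    category: str


# Hypothetical stand-in for the scanner's range table (variation selectors only).
VS_RANGES = [
    Rule(0xE0100, 0xE01EF, "critical", "variation-selector"),  # VS17-256
    Rule(0xFE00, 0xFE0D, "warning", "variation-selector"),     # VS1-14
    Rule(0xFE0E, 0xFE0E, "warning", "variation-selector"),     # VS15
    Rule(0xFE0F, 0xFE0F, "info", "variation-selector"),        # VS16
]

# Expand the ranges once into a per-character dict so each lookup is O(1).
CHAR_LOOKUP = {
    chr(cp): rule
    for rule in VS_RANGES
    for cp in range(rule.start, rule.end + 1)
}


def scan_variation_selectors(text: str) -> List[Tuple[int, str, str]]:
    """Return (index, code point label, severity) for each flagged character."""
    findings = []
    for index, ch in enumerate(text):
        rule = CHAR_LOOKUP.get(ch)
        if rule is not None:
            findings.append((index, f"U+{ord(ch):04X}", rule.severity))
    return findings
```

The four rows expand to 240 + 14 + 1 + 1 = 256 entries, which matches the growth from ~156 to 412 quoted above.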

Tests

11 scanner tests: Detection at all severity levels, boundary values, Glassworm-style injection patterns, emoji false-positive verification, and strip behavior
8 audit e2e tests: Exit codes, --strip, --verbose, and realistic injected prompt fixtures

Documentation

Security page: Updated severity table + Glassworm context (2 sentences, integrated naturally)
CHANGELOG: Entry under [Unreleased]

Credit

Thanks to @raye-deng for the detailed analysis identifying this gap — the variation selector ranges and Glassworm reference came directly from their feedback on #312.

Replace category-specific labels ("zero-width", "unusual whitespace", "tag/bidi") with generic alternatives ("hidden characters", "unusual characters", "suspicious characters") so summary messages accurately describe findings from any scanner category, including the new variation-selector detection.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…m vector)

Add variation selector ranges to the content scanner's detection table:
- VS17-256 (U+E0100-E01EF): critical, no legitimate use in prompt files
- VS1-14 (U+FE00-FE0D): warning, rare CJK typography variants
- VS15 (U+FE0E): warning, text presentation selector
- VS16 (U+FE0F): info, emoji presentation, shown only with --verbose

These are the specific mechanism used in the Glassworm supply-chain attacks that compromised repositories and VS Code extensions by encoding invisible payload data that AST-based tools cannot detect.

Includes comprehensive tests (11 scanner + 8 audit e2e), security documentation update, and CHANGELOG entry.

Closes #320

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Adds Unicode variation selector detection to APM’s hidden-Unicode content scanning pipeline (scanner + apm audit UX), covering the Glassworm attack vector and ensuring the new severities behave correctly in scan, strip, and CLI output.

Changes:

Extend ContentScanner suspicious ranges to include variation selectors with severity mapping (VS17–256 critical; VS1–15 warning; VS16 info).
Add unit + CLI tests validating detection boundaries, Glassworm-style sequences, verbose rendering, and --strip behavior.
Update CLI/diagnostics wording and enterprise security documentation to reflect the broader “hidden characters” scope.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Show a summary per file

| File | Description |
|---|---|
| `src/apm_cli/security/content_scanner.py` | Adds variation selector ranges to `_SUSPICIOUS_RANGES` and lookup-driven detection/strip behavior. |
| `src/apm_cli/commands/audit.py` | Updates audit messaging/help text to reflect broader hidden-character categories (incl. variation selectors). |
| `src/apm_cli/utils/diagnostics.py` | Generalizes security summary strings from “zero-width/whitespace” to “hidden/unusual characters.” |
| `tests/unit/test_content_scanner.py` | Adds scanner/strip unit tests for variation selectors (severity + boundaries + attack pattern). |
| `tests/unit/test_audit_command.py` | Adds `apm audit --file` and `--strip` tests for VS critical/warning/info and verbose output. |
| `docs/src/content/docs/enterprise/security.md` | Documents variation selectors (incl. Glassworm context) and updates the severity table. |
| `CHANGELOG.md` | Adds an Unreleased entry for variation selector detection. |

CHANGELOG.md

strip_non_critical() → strip_dangerous(): now strips critical + warning characters (hidden ASCII, bidi overrides, variation selectors) and preserves info-level characters (emoji selectors, non-breaking spaces).

The old behavior was backwards: it removed harmless info-level chars (breaking emoji like ❤️ → ❤) while leaving the most dangerous critical chars (hidden instruction embedding) untouched.

Also fixes _apply_strip() to no longer skip critical-only files, and simplifies the strip exit path since all dangerous chars are now removed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
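
The corrected strip semantics (remove critical and warning, keep info) can be sketched as below. This is a minimal illustration covering only the variation selector ranges; `severity_of` and `strip_dangerous_sketch` are hypothetical names, and the real `strip_dangerous()` also handles the other scanner categories.

```python
from typing import Optional


def severity_of(ch: str) -> Optional[str]:
    """Severity for variation selectors only (other categories omitted here)."""
    cp = ord(ch)
    if 0xE0100 <= cp <= 0xE01EF:   # VS17-256
        return "critical"
    if 0xFE00 <= cp <= 0xFE0E:     # VS1-15
        return "warning"
    if cp == 0xFE0F:               # VS16, emoji presentation
        return "info"
    return None


def strip_dangerous_sketch(text: str) -> str:
    """Drop critical and warning characters; keep info-level and normal text."""
    return "".join(
        ch for ch in text
        if severity_of(ch) not in ("critical", "warning")
    )
```

With this rule an emoji such as ❤ followed by U+FE0F passes through unchanged, while VS1-15 and VS17-256 are removed.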

Scanner:
- Add 17 invisible Unicode characters to detection: bidi marks (LRM, RLM, ALM), invisible math operators (U+2061-4), interlinear annotation markers (U+FFF9-B), deprecated formatting (U+206A-F)
- All new ranges at warning severity (zero legitimate use in prompt files)

DX improvements:
- Critical findings now suggest '--strip' (was 'manual review' only)
- '--strip' help: 'Remove hidden characters (preserves emoji and whitespace)'
- '--verbose' help: 'Show all findings including harmless ones'
- Strip no-op prints 'Nothing to clean' instead of silent exit
- Exit codes documented in command help text

Tests: 15 new scanner tests, 2 new audit tests (103 total, all passing)

Docs: security.md detection table expanded, governance.md updated

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- ZWJ between emoji characters (e.g. 👨‍👩‍👧) is downgraded to info-level and preserved by --strip, preventing compound emoji from breaking
- _is_emoji_char() + _zwj_in_emoji_context() helpers with backward skip past VS16 and skin-tone modifiers
- Consistent ZWJ handling in both scan_text() and strip_dangerous()
- --strip --dry-run shows per-file counts of strippable characters in a Rich table without modifying any files
- Hint message when --dry-run used without --strip
- 16 new tests (12 ZWJ context + 5 dry-run, minus 1 overlap)
- Updated security.md, governance.md, and CHANGELOG.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
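
The ZWJ context check with a backward skip past VS16 and skin-tone modifiers might look roughly like this. It is a sketch under assumptions: the emoji test uses two coarse code point blocks rather than the full Unicode emoji data, and the function names only mirror the helpers named in the commit message; the actual implementations may differ.

```python
ZWJ = "\u200d"    # zero-width joiner
VS16 = "\ufe0f"   # emoji presentation selector


def is_emoji_char(ch: str) -> bool:
    """Coarse approximation: pictograph blocks plus misc symbols and dingbats."""
    cp = ord(ch)
    return 0x1F300 <= cp <= 0x1FAFF or 0x2600 <= cp <= 0x27BF


def is_skin_tone(ch: str) -> bool:
    """Fitzpatrick skin-tone modifiers U+1F3FB to U+1F3FF."""
    return 0x1F3FB <= ord(ch) <= 0x1F3FF


def zwj_in_emoji_context(text: str, index: int) -> bool:
    """True if the ZWJ at `index` joins two emoji (a compound emoji sequence)."""
    if text[index] != ZWJ:
        return False
    # Walk backward past VS16 and skin-tone modifiers to reach the base emoji.
    j = index - 1
    while j >= 0 and (text[j] == VS16 or is_skin_tone(text[j])):
        j -= 1
    before_ok = j >= 0 and is_emoji_char(text[j])
    after_ok = index + 1 < len(text) and is_emoji_char(text[index + 1])
    return before_ok and after_ok
```

A ZWJ that passes this check would be reported at info level and preserved by the strip step, while a ZWJ between ordinary letters would keep its usual severity.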

Fix cli-commands.md:
- --strip description was inverted (described old strip_non_critical behavior)
- Add missing --dry-run flag documentation
- Expand 'What it detects' to cover all 35 ranges (variation selectors, bidi marks, invisible operators, annotation markers, deprecated formatting)
- Update exit code descriptions (add strip success, variation selectors)
- Fix misleading example comment

Fix governance.md:
- Update exit code descriptions to match implementation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

danielmeppiel and others added 2 commits March 16, 2026 10:08

Copilot AI review requested due to automatic review settings March 16, 2026 09:10

Copilot started reviewing on behalf of danielmeppiel March 16, 2026 09:10 View session

Copilot AI reviewed Mar 16, 2026

danielmeppiel and others added 4 commits March 16, 2026 10:23

danielmeppiel merged commit 14ec543 into main Mar 16, 2026
9 checks passed

danielmeppiel deleted the feat/variation-selector-detection branch March 16, 2026 11:27

This was referenced Mar 16, 2026

chore: release v0.8.0 #323

Merged

[aw] No-Op Runs #335

Closed

danielmeppiel merged 6 commits into main from feat/variation-selector-detection
