
feat: detect Unicode variation selectors in content scanner (Glassworm vector)#321

Merged
danielmeppiel merged 6 commits into main from feat/variation-selector-detection
Mar 16, 2026
@danielmeppiel
Collaborator

Summary

Adds Unicode variation selector detection to the content scanner — the specific mechanism used in the Glassworm supply-chain attacks (March 2026) that compromised repositories and VS Code extensions.

Closes #320

What changed

Scanner ranges (content_scanner.py)

| Range | Severity | Category | Rationale |
| --- | --- | --- | --- |
| U+E0100–E01EF (VS17–256) | critical | variation-selector | No legitimate use in prompt files. 240 invisible chars that can encode arbitrary data. |
| U+FE00–FE0D (VS1–14) | warning | variation-selector | Rare CJK typography variants. Unusual in prompt files. |
| U+FE0E (VS15) | warning | variation-selector | Text presentation selector. Uncommon in prompts. |
| U+FE0F (VS16) | info | variation-selector | Emoji presentation selector. Common with emoji; only shown with --verbose. |

Design decisions

  • VS16 (U+FE0F) is info-level: Many emoji sequences include it (❤️ = ❤ + U+FE0F). Flagging it as a warning would generate noise in nearly every file that contains emoji. At info level it appears only with --verbose.
  • No architecture changes: Extends the existing _SUSPICIOUS_RANGES table and _CHAR_LOOKUP O(1) dict. The lookup table grows from ~156 to 412 entries — negligible.
  • Strip behavior: apm audit --strip removes warning/info-level VS (1-16) but preserves critical VS (17-256) for manual review — consistent with existing strip semantics.
  • Zero changes to consumers: The audit command, install security gate, and compile/pack scanning all use ContentScanner generically via severity — the new category renders correctly without any pipeline changes.
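
A minimal sketch of how the ranges in the table could expand into a per-character lookup. The names `SUSPICIOUS_RANGES` and `CHAR_LOOKUP` echo `_SUSPICIOUS_RANGES` and `_CHAR_LOOKUP` from the description, but the tuple layout and the `classify` helper are assumptions for illustration, not the actual contents of content_scanner.py.

```python
# Hypothetical layout: (first code point, last code point, severity, category).
SUSPICIOUS_RANGES = [
    (0xE0100, 0xE01EF, "critical", "variation-selector"),  # VS17-256
    (0xFE00, 0xFE0D, "warning", "variation-selector"),     # VS1-14
    (0xFE0E, 0xFE0E, "warning", "variation-selector"),     # VS15
    (0xFE0F, 0xFE0F, "info", "variation-selector"),        # VS16
]

# Expand every range once so that each later lookup is a single dict access.
CHAR_LOOKUP = {
    chr(cp): (severity, category)
    for first, last, severity, category in SUSPICIOUS_RANGES
    for cp in range(first, last + 1)
}


def classify(ch):
    """Return (severity, category) for a listed character, else None."""
    return CHAR_LOOKUP.get(ch)


if __name__ == "__main__":
    print(len(CHAR_LOOKUP))        # 240 + 14 + 1 + 1 entries
    print(classify("\ufe0f"))      # emoji presentation selector (VS16)
    print(classify("\U000e0100"))  # first supplementary selector (VS17)
    print(classify("a"))           # ordinary character
```

The four ranges contribute 256 entries, which matches the growth from roughly 156 to 412 entries quoted above.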

Tests

  • 11 scanner tests: Detection at all severity levels, boundary values, Glassworm-style injection patterns, emoji false-positive verification, and strip behavior
  • 8 audit e2e tests: Exit codes, --strip, --verbose, and realistic injected prompt fixtures
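
To illustrate what a "Glassworm-style injection pattern" fixture might look like, the sketch below hides bytes in variation selectors appended to a visible character and then locates them by code point. The byte-to-selector mapping (0-15 to U+FE00-FE0F, 16-255 to U+E0100-E01EF) is one publicly described encoding and is an assumption here; the PR does not state the exact encoding the attack used. `hide`, `reveal`, and `find_selectors` are hypothetical names, not functions from the test suite.

```python
def byte_to_selector(b):
    """Map one byte (0-255) to a variation selector (assumed encoding)."""
    if b < 16:
        return chr(0xFE00 + b)
    return chr(0xE0100 + (b - 16))


def selector_to_byte(ch):
    """Inverse of byte_to_selector; returns None for any other character."""
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F:
        return cp - 0xFE00
    if 0xE0100 <= cp <= 0xE01EF:
        return cp - 0xE0100 + 16
    return None


def hide(carrier, payload):
    """Append one invisible selector per payload byte to the carrier text."""
    return carrier + "".join(byte_to_selector(b) for b in payload)


def reveal(text):
    """Recover the hidden bytes, ignoring all non-selector characters."""
    decoded = (selector_to_byte(ch) for ch in text)
    return bytes(b for b in decoded if b is not None)


def find_selectors(text):
    """List (index, code point label) for every variation selector."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if selector_to_byte(ch) is not None
    ]


if __name__ == "__main__":
    stego = hide("x", b"rm -rf")
    print(len(stego))             # one visible character plus six selectors
    print(reveal(stego))
    print(find_selectors(stego))
```

Most renderers show the resulting string as a single `x`, which is why a scan by code point range is needed rather than visual review.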

Documentation

  • Security page: Updated severity table + Glassworm context (2 sentences, integrated naturally)
  • CHANGELOG: Entry under [Unreleased]

Credit

Thanks to @raye-deng for the detailed analysis identifying this gap — the variation selector ranges and Glassworm reference came directly from their feedback on #312.

danielmeppiel and others added 2 commits March 16, 2026 10:08
Replace category-specific labels ("zero-width", "unusual whitespace",
"tag/bidi") with generic alternatives ("hidden characters",
"unusual characters", "suspicious characters") so summary messages
accurately describe findings from any scanner category, including the
new variation-selector detection.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…m vector)

Add variation selector ranges to the content scanner's detection table:
- VS17-256 (U+E0100-E01EF): critical — no legitimate use in prompt files
- VS1-14 (U+FE00-FE0D): warning — rare CJK typography variants
- VS15 (U+FE0E): warning — text presentation selector
- VS16 (U+FE0F): info — emoji presentation, shown only with --verbose

These are the specific mechanism used in the Glassworm supply-chain attacks
that compromised repositories and VS Code extensions by encoding invisible
payload data that AST-based tools cannot detect.

Includes comprehensive tests (11 scanner + 8 audit e2e), security
documentation update, and CHANGELOG entry.

Closes #320

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 16, 2026 09:10
Contributor

Copilot AI left a comment

Pull request overview

Adds Unicode variation selector detection to APM’s hidden-Unicode content scanning pipeline (scanner + apm audit UX), covering the Glassworm attack vector and ensuring the new severities behave correctly in scan, strip, and CLI output.

Changes:

  • Extend ContentScanner suspicious ranges to include variation selectors with severity mapping (VS17–256 critical; VS1–15 warning; VS16 info).
  • Add unit + CLI tests validating detection boundaries, Glassworm-style sequences, verbose rendering, and --strip behavior.
  • Update CLI/diagnostics wording and enterprise security documentation to reflect the broader “hidden characters” scope.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/apm_cli/security/content_scanner.py | Adds variation selector ranges to _SUSPICIOUS_RANGES and lookup-driven detection/strip behavior. |
| src/apm_cli/commands/audit.py | Updates audit messaging/help text to reflect broader hidden-character categories (incl. variation selectors). |
| src/apm_cli/utils/diagnostics.py | Generalizes security summary strings from “zero-width/whitespace” to “hidden/unusual characters.” |
| tests/unit/test_content_scanner.py | Adds scanner/strip unit tests for variation selectors (severity + boundaries + attack pattern). |
| tests/unit/test_audit_command.py | Adds apm audit --file and --strip tests for VS critical/warning/info and verbose output. |
| docs/src/content/docs/enterprise/security.md | Documents variation selectors (incl. Glassworm context) and updates the severity table. |
| CHANGELOG.md | Adds an Unreleased entry for variation selector detection. |

danielmeppiel and others added 4 commits March 16, 2026 10:23
strip_non_critical() → strip_dangerous(): now strips critical + warning
characters (hidden ASCII, bidi overrides, variation selectors) and
preserves info-level characters (emoji selectors, non-breaking spaces).

The old behavior was backwards — it removed harmless info-level chars
(breaking emoji like ❤️ → ❤) while leaving the most dangerous
critical chars (hidden instruction embedding) untouched.

Also fixes _apply_strip() to no longer skip critical-only files, and
simplifies the strip exit path since all dangerous chars are now
removed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
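
The corrected strip semantics from the commit above can be sketched as a severity filter. This is a simplified illustration, not the real `strip_dangerous()`: the `SEVERITY` table here covers only the variation selectors plus a non-breaking space, and its layout is an assumption.

```python
# Hypothetical severity table: character -> severity.
SEVERITY = {}
for cp in range(0xE0100, 0xE01F0):   # VS17-256
    SEVERITY[chr(cp)] = "critical"
for cp in range(0xFE00, 0xFE0F):     # VS1-15 (U+FE00 through U+FE0E)
    SEVERITY[chr(cp)] = "warning"
SEVERITY["\ufe0f"] = "info"          # VS16, emoji presentation selector
SEVERITY["\u00a0"] = "info"          # non-breaking space

DANGEROUS = ("critical", "warning")


def strip_dangerous(text):
    """Drop critical and warning characters; keep info-level and normal ones."""
    return "".join(ch for ch in text if SEVERITY.get(ch) not in DANGEROUS)


if __name__ == "__main__":
    heart = "\u2764\ufe0f"                       # emoji with VS16
    injected = "run" + "\U000e0141" + "\ufe00"   # hidden selectors appended
    print(strip_dangerous(heart) == heart)       # emoji survives intact
    print(strip_dangerous(injected))             # only the visible text remains
```

Filtering on severity instead of on a list of categories means a newly added range picks up the right strip behavior from its severity alone.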
Scanner:
- Add 17 invisible Unicode characters to detection: bidi marks
  (LRM, RLM, ALM), invisible math operators (U+2061-4), interlinear
  annotation markers (U+FFF9-B), deprecated formatting (U+206A-F)
- All new ranges at warning severity — zero legitimate use in prompt files

DX improvements:
- Critical findings now suggest '--strip' (was 'manual review' only)
- '--strip' help: 'Remove hidden characters (preserves emoji and whitespace)'
- '--verbose' help: 'Show all findings including harmless ones'
- Strip no-op prints 'Nothing to clean' instead of silent exit
- Exit codes documented in command help text

Tests: 15 new scanner tests, 2 new audit tests (103 total, all passing)
Docs: security.md detection table expanded, governance.md updated

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
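
As a quick check on the ranges named in the commit above, the sketch below expands them and looks each code point up in Python's `unicodedata` module. `NEW_RANGES` is a hypothetical list built only from that commit message. The ranges spelled out there expand to 16 code points, while the message counts 17, so one character is presumably not listed in the message.

```python
import unicodedata

# Ranges taken from the commit message; the list name is hypothetical.
NEW_RANGES = [
    (0x061C, 0x061C),  # ALM
    (0x200E, 0x200F),  # LRM, RLM
    (0x2061, 0x2064),  # invisible math operators
    (0x206A, 0x206F),  # deprecated formatting characters
    (0xFFF9, 0xFFFB),  # interlinear annotation markers
]


def expand(ranges):
    """Flatten inclusive (first, last) pairs into a list of code points."""
    return [cp for first, last in ranges for cp in range(first, last + 1)]


def describe(cp):
    """Return (label, general category, Unicode name) for a code point."""
    ch = chr(cp)
    return f"U+{cp:04X}", unicodedata.category(ch), unicodedata.name(ch, "<unnamed>")


if __name__ == "__main__":
    for cp in expand(NEW_RANGES):
        print(*describe(cp))
```

Every code point in these ranges has general category `Cf` (format), meaning it normally renders with no visible glyph.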
- ZWJ between emoji characters (e.g. 👨‍👩‍👧) is downgraded to info-level
  and preserved by --strip, preventing compound emoji from breaking
- _is_emoji_char() + _zwj_in_emoji_context() helpers with backward
  skip past VS16 and skin-tone modifiers
- Consistent ZWJ handling in both scan_text() and strip_dangerous()
- --strip --dry-run shows per-file counts of strippable characters
  in a Rich table without modifying any files
- Hint message when --dry-run used without --strip
- 16 new tests (12 ZWJ context + 5 dry-run, minus 1 overlap)
- Updated security.md, governance.md, and CHANGELOG.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
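
A rough sketch of the ZWJ context check described in the commit above. The function names mirror `_is_emoji_char()` and `_zwj_in_emoji_context()`, but the bodies are assumptions: the emoji test is a coarse code point block check, not the project's actual logic.

```python
ZWJ = "\u200d"
VS16 = "\ufe0f"


def is_emoji_char(ch):
    """Approximate emoji test by code point block (an assumption)."""
    cp = ord(ch)
    return (
        0x1F300 <= cp <= 0x1FAFF    # pictographs, emoticons, transport, extended symbols
        or 0x2600 <= cp <= 0x27BF   # miscellaneous symbols and dingbats
    )


def is_skin_tone(ch):
    """True for the Fitzpatrick modifiers U+1F3FB through U+1F3FF."""
    return 0x1F3FB <= ord(ch) <= 0x1F3FF


def zwj_in_emoji_context(text, i):
    """True if text[i] is a ZWJ that joins two emoji characters."""
    if text[i] != ZWJ:
        return False
    # The character after the joiner must be an emoji.
    if i + 1 >= len(text) or not is_emoji_char(text[i + 1]):
        return False
    # Walk backward past VS16 and skin-tone modifiers to reach the base emoji.
    j = i - 1
    while j >= 0 and (text[j] == VS16 or is_skin_tone(text[j])):
        j -= 1
    return j >= 0 and is_emoji_char(text[j])


if __name__ == "__main__":
    family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"
    print(zwj_in_emoji_context(family, 1))
    print(zwj_in_emoji_context("ab\u200dcd", 2))
```

A scanner could report a ZWJ at info level when this check succeeds and at its usual severity otherwise, so compound emoji survive `--strip` while joiners hidden in plain text do not.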
Fix cli-commands.md:
- --strip description was inverted (described old strip_non_critical behavior)
- Add missing --dry-run flag documentation
- Expand 'What it detects' to cover all 35 ranges (variation selectors,
  bidi marks, invisible operators, annotation markers, deprecated formatting)
- Update exit code descriptions (add strip success, variation selectors)
- Fix misleading example comment

Fix governance.md:
- Update exit code descriptions to match implementation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
danielmeppiel merged commit 14ec543 into main Mar 16, 2026
9 checks passed
@danielmeppiel danielmeppiel deleted the feat/variation-selector-detection branch March 16, 2026 11:27
Development

Successfully merging this pull request may close these issues.

Add Unicode variation selector detection to content scanner (Glassworm attack vector)
