feat: detect Unicode variation selectors in content scanner (Glassworm vector) #321
Merged
danielmeppiel merged 6 commits into main on Mar 16, 2026
Replace category-specific labels ("zero-width", "unusual whitespace",
"tag/bidi") with generic alternatives ("hidden characters",
"unusual characters", "suspicious characters") so summary messages
accurately describe findings from any scanner category, including the
new variation-selector detection.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…m vector)

Add variation selector ranges to the content scanner's detection table:
- VS17-256 (U+E0100-E01EF): critical; no legitimate use in prompt files
- VS1-14 (U+FE00-FE0D): warning; rare CJK typography variants
- VS15 (U+FE0E): warning; text presentation selector
- VS16 (U+FE0F): info; emoji presentation, shown only with --verbose

These are the specific mechanism used in the Glassworm supply-chain attacks that compromised repositories and VS Code extensions by encoding invisible payload data that AST-based tools cannot detect.

Includes comprehensive tests (11 scanner + 8 audit e2e), security documentation update, and CHANGELOG entry.

Closes #320

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
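The severity mapping in this commit message can be sketched as a range table expanded into a per-code-point dictionary. This is only an illustration under stated assumptions: `VARIATION_SELECTOR_RANGES`, `build_lookup`, and `find_variation_selectors` are hypothetical names, not the identifiers used in `content_scanner.py`, and only the four variation selector ranges listed above are included.

```python
# Hypothetical sketch of a range table plus O(1) lookup; names are illustrative,
# not the project's actual API. Severities follow the commit message above.
VARIATION_SELECTOR_RANGES = [
    (0xFE00, 0xFE0D, "warning", "VS1-14"),
    (0xFE0E, 0xFE0E, "warning", "VS15 text presentation"),
    (0xFE0F, 0xFE0F, "info", "VS16 emoji presentation"),
    (0xE0100, 0xE01EF, "critical", "VS17-256"),
]


def build_lookup(ranges):
    """Expand inclusive (start, end) ranges into a code point -> (severity, label) dict."""
    lookup = {}
    for start, end, severity, label in ranges:
        for code_point in range(start, end + 1):
            lookup[code_point] = (severity, label)
    return lookup


CHAR_LOOKUP = build_lookup(VARIATION_SELECTOR_RANGES)


def find_variation_selectors(text):
    """Return (index, 'U+XXXX', severity) for every variation selector in text."""
    findings = []
    for index, ch in enumerate(text):
        entry = CHAR_LOOKUP.get(ord(ch))
        if entry is not None:
            findings.append((index, f"U+{ord(ch):04X}", entry[0]))
    return findings
```

Expanding the four ranges yields 256 dictionary entries (14 + 1 + 1 + 240), so each character costs one dictionary lookup during a scan.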
Pull request overview
Adds Unicode variation selector detection to APM’s hidden-Unicode content scanning pipeline (scanner + apm audit UX), covering the Glassworm attack vector and ensuring the new severities behave correctly in scan, strip, and CLI output.
Changes:
- Extend `ContentScanner` suspicious ranges to include variation selectors with severity mapping (VS17–256 critical; VS1–15 warning; VS16 info).
- Add unit + CLI tests validating detection boundaries, Glassworm-style sequences, verbose rendering, and `--strip` behavior.
- Update CLI/diagnostics wording and enterprise security documentation to reflect the broader “hidden characters” scope.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `src/apm_cli/security/content_scanner.py` | Adds variation selector ranges to `_SUSPICIOUS_RANGES` and lookup-driven detection/strip behavior. |
| `src/apm_cli/commands/audit.py` | Updates audit messaging/help text to reflect broader hidden-character categories (incl. variation selectors). |
| `src/apm_cli/utils/diagnostics.py` | Generalizes security summary strings from “zero-width/whitespace” to “hidden/unusual characters.” |
| `tests/unit/test_content_scanner.py` | Adds scanner/strip unit tests for variation selectors (severity + boundaries + attack pattern). |
| `tests/unit/test_audit_command.py` | Adds `apm audit --file` and `--strip` tests for VS critical/warning/info and verbose output. |
| `docs/src/content/docs/enterprise/security.md` | Documents variation selectors (incl. Glassworm context) and updates the severity table. |
| `CHANGELOG.md` | Adds an Unreleased entry for variation selector detection. |
strip_non_critical() → strip_dangerous(): now strips critical + warning characters (hidden ASCII, bidi overrides, variation selectors) and preserves info-level characters (emoji selectors, non-breaking spaces).

The old behavior was backwards: it removed harmless info-level chars (breaking emoji like ❤️ → ❤) while leaving the most dangerous critical chars (hidden instruction embedding) untouched.

Also fixes _apply_strip() to no longer skip critical-only files, and simplifies the strip exit path since all dangerous chars are now removed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
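A minimal sketch of the strip semantics described in this commit: drop critical and warning characters and keep info-level ones. The function names `severity_of` and `strip_dangerous_sketch` are hypothetical, and the severity table is an assumption that covers only the variation selector ranges and the non-breaking space; the real scanner covers more categories.

```python
# Hypothetical sketch; not the project's strip_dangerous() implementation.
def severity_of(ch):
    """Return 'critical', 'warning', 'info', or None for a single character."""
    cp = ord(ch)
    if 0xE0100 <= cp <= 0xE01EF:   # VS17-256
        return "critical"
    if 0xFE00 <= cp <= 0xFE0E:     # VS1-15
        return "warning"
    if cp == 0xFE0F:               # VS16, emoji presentation
        return "info"
    if cp == 0x00A0:               # non-breaking space
        return "info"
    return None


def strip_dangerous_sketch(text):
    """Remove critical and warning characters; keep info-level and ordinary text."""
    return "".join(
        ch for ch in text if severity_of(ch) not in ("critical", "warning")
    )
```

With this rule the emoji selector in ❤️ survives a strip, while a trailing VS17 or VS15 is removed.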
Scanner:
- Add 17 invisible Unicode characters to detection: bidi marks (LRM, RLM, ALM), invisible math operators (U+2061-4), interlinear annotation markers (U+FFF9-B), deprecated formatting (U+206A-F)
- All new ranges at warning severity; zero legitimate use in prompt files

DX improvements:
- Critical findings now suggest '--strip' (was 'manual review' only)
- '--strip' help: 'Remove hidden characters (preserves emoji and whitespace)'
- '--verbose' help: 'Show all findings including harmless ones'
- Strip no-op prints 'Nothing to clean' instead of silent exit
- Exit codes documented in command help text

Tests: 15 new scanner tests, 2 new audit tests (103 total, all passing)

Docs: security.md detection table expanded, governance.md updated

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- ZWJ between emoji characters (e.g. 👨👩👧) is downgraded to info-level and preserved by --strip, preventing compound emoji from breaking
- _is_emoji_char() + _zwj_in_emoji_context() helpers with backward skip past VS16 and skin-tone modifiers
- Consistent ZWJ handling in both scan_text() and strip_dangerous()
- --strip --dry-run shows per-file counts of strippable characters in a Rich table without modifying any files
- Hint message when --dry-run used without --strip
- 16 new tests (12 ZWJ context + 5 dry-run, minus 1 overlap)
- Updated security.md, governance.md, and CHANGELOG.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
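The ZWJ context check described above might look roughly like the following. This is a sketch, not the code from the PR: the function names drop the leading underscore to signal that they are illustrative, and the emoji ranges in `is_emoji_char` are a deliberately coarse assumption rather than the project's actual table.

```python
# Hypothetical sketch of the ZWJ-in-emoji-context rule; ranges are approximate.
ZWJ = "\u200d"
VS16 = "\ufe0f"


def is_emoji_char(ch):
    """Coarse emoji test: main pictograph blocks plus misc symbols/dingbats."""
    cp = ord(ch)
    return 0x1F300 <= cp <= 0x1FAFF or 0x2600 <= cp <= 0x27BF


def is_skin_tone(ch):
    """Fitzpatrick skin-tone modifiers U+1F3FB..U+1F3FF."""
    return 0x1F3FB <= ord(ch) <= 0x1F3FF


def zwj_in_emoji_context(text, index):
    """True if text[index] is a ZWJ joining two emoji characters."""
    if text[index] != ZWJ:
        return False
    # Skip backward past VS16 and skin-tone modifiers to reach the base emoji.
    before = index - 1
    while before >= 0 and (text[before] == VS16 or is_skin_tone(text[before])):
        before -= 1
    if before < 0 or not is_emoji_char(text[before]):
        return False
    # The character immediately after the ZWJ must also be an emoji.
    return index + 1 < len(text) and is_emoji_char(text[index + 1])
```

A ZWJ between two letters fails both checks and would keep its normal severity, while the joiners inside a family emoji pass.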
Fix cli-commands.md:
- --strip description was inverted (described old strip_non_critical behavior)
- Add missing --dry-run flag documentation
- Expand 'What it detects' to cover all 35 ranges (variation selectors, bidi marks, invisible operators, annotation markers, deprecated formatting)
- Update exit code descriptions (add strip success, variation selectors)
- Fix misleading example comment

Fix governance.md:
- Update exit code descriptions to match implementation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Adds Unicode variation selector detection to the content scanner — the specific mechanism used in the Glassworm supply-chain attacks (March 2026) that compromised repositories and VS Code extensions.
Closes #320
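To illustrate why variation selectors can carry invisible payload data, the sketch below maps each byte value to one of the 256 variation selectors and back. The PR does not describe Glassworm's exact encoding, so this mapping is an assumption based on a commonly discussed scheme, and both function names are hypothetical.

```python
# Illustrative only: one possible byte <-> variation selector mapping.
# The actual encoding used in the Glassworm attacks is not specified here.
def encode_bytes_as_selectors(payload: bytes) -> str:
    """Map bytes 0-15 to VS1-16 (U+FE00..FE0F) and 16-255 to VS17-256 (U+E0100..E01EF)."""
    chars = []
    for byte in payload:
        if byte < 16:
            chars.append(chr(0xFE00 + byte))
        else:
            chars.append(chr(0xE0100 + byte - 16))
    return "".join(chars)


def decode_selectors(text: str) -> bytes:
    """Recover the hidden bytes, ignoring every non-selector character."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xFE00 <= cp <= 0xFE0F:
            out.append(cp - 0xFE00)
        elif 0xE0100 <= cp <= 0xE01EF:
            out.append(cp - 0xE0100 + 16)
    return bytes(out)


# A two-letter visible string carrying six hidden bytes.
carrier = "ok" + encode_bytes_as_selectors(b"run me")
```

Most renderers display `carrier` as just "ok", which is why a scanner has to inspect code points rather than rendered text.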
What changed
Scanner ranges (`content_scanner.py`)
- … `--verbose`.
Design decisions
- … `--verbose`.
- … `_SUSPICIOUS_RANGES` table and `_CHAR_LOOKUP` O(1) dict. The lookup table grows from ~156 to 412 entries, which is negligible.
- `apm audit --strip` removes warning/info-level VS (1-16) but preserves critical VS (17-256) for manual review, consistent with existing strip semantics.
- … `ContentScanner` generically via severity, so the new category renders correctly without any pipeline changes.
Tests
- … `--strip`, `--verbose`, and realistic injected prompt fixtures
Documentation
- … `[Unreleased]`
Credit
Thanks to @raye-deng for the detailed analysis identifying this gap — the variation selector ranges and Glassworm reference came directly from their feedback on #312.