diff --git a/scratchpad/html-entity-mention-bypass-fix.md b/scratchpad/html-entity-mention-bypass-fix.md new file mode 100644 index 00000000000..595cd04b8c8 --- /dev/null +++ b/scratchpad/html-entity-mention-bypass-fix.md @@ -0,0 +1,137 @@ +# HTML Entity Encoding Bypass Fix for @mention Sanitization + +## Problem + +The safe-outputs sanitization system had a vulnerability where HTML entities could bypass @mention detection. If entities were decoded after @mention neutralization, an attacker could use entity-encoded @ symbols to trigger unwanted user notifications. + +### Attack Vectors + +1. Named entity: `@user` → `@user` +2. Decimal entity: `@user` → `@user` +3. Hexadecimal entity: `@user` or `@user` → `@user` +4. Double-encoded: `@user`, `@user`, `@user` → `@user` +5. Mixed encoding: `@user` → `@user` +6. Fully encoded: `@user` → `@user` + +## Solution + +Added `decodeHtmlEntities()` function in `actions/setup/js/sanitize_content_core.cjs` that: + +1. **Decodes named entities**: `@` → `@` (case-insensitive) +2. **Decodes decimal entities**: `&#NNN;` → corresponding Unicode character +3. **Decodes hexadecimal entities**: `&#xHHH;` or `&#XHHH;` → corresponding Unicode character +4. **Handles double-encoding**: `@`, `@`, `@` +5. **Validates code points**: Only accepts valid Unicode range (0x0 - 0x10FFFF) + +### Integration + +The `decodeHtmlEntities()` function is integrated into `hardenUnicodeText()` at **Step 2**, ensuring HTML entities are decoded **before** @mention detection occurs: + +```javascript +function hardenUnicodeText(text) { + // Step 1: Normalize Unicode (NFC) + result = result.normalize("NFC"); + + // Step 2: Decode HTML entities (CRITICAL - must be early) + result = decodeHtmlEntities(result); + + // Step 3: Strip zero-width characters + // Step 4: Remove bidirectional overrides + // Step 5: Convert full-width ASCII + + return result; +} +``` + +### Sanitization Pipeline + +``` +Input Text + ↓ +hardenUnicodeText() + ├─ Unicode normalization (NFC) + ├─ HTML entity decoding ← decodeHtmlEntities() + ├─ Zero-width character removal + ├─ Bidirectional control removal + └─ Full-width ASCII conversion + ↓ +ANSI escape sequence removal + ↓ +neutralizeMentions() or neutralizeAllMentions() + ↓ +Other sanitization steps + ↓ +Output (safe text) +``` + +## Test Coverage + +Comprehensive test suite in `actions/setup/js/sanitize_content.test.cjs` covers: + +- ✅ Named entity decoding (`@`) +- ✅ Double-encoded named entities (`@`) +- ✅ Decimal entity decoding (`@`) +- ✅ Double-encoded decimal entities (`@`) +- ✅ Hexadecimal entity decoding (lowercase `@`, uppercase `@`) +- ✅ Double-encoded hex entities (`@`, `@`) +- ✅ Multiple encoded mentions in one string +- ✅ Mixed encoded and normal mentions +- ✅ Org/team mentions with entities +- ✅ General entity decoding (non-@ characters) +- ✅ Invalid code point handling +- ✅ Malformed entity handling +- ✅ Case-insensitive named entities +- ✅ Interaction with other sanitization steps +- ✅ Allowed aliases with encoded mentions + +Total: 25+ test cases + +## Examples + +```javascript +// Named entity +sanitizeContent("@pelikhan") +// → "`@pelikhan`" + +// Decimal entity +sanitizeContent("@pelikhan") +// → "`@pelikhan`" + +// Hexadecimal entity +sanitizeContent("@pelikhan") +// → "`@pelikhan`" + +// Mixed encoding in username +sanitizeContent("@user") +// → "`@user`" + +// Fully encoded +sanitizeContent("@user") +// → "`@user`" + +// Double-encoded +sanitizeContent("@pelikhan") +// → "`@pelikhan`" +``` + +## Security Impact + +- **Risk Level**: MEDIUM → **RESOLVED** +- **Attack Surface**: Entity-encoded @ symbols could bypass mention detection +- **Fix**: All HTML entity encoding variants now decoded before @mention processing +- **Coverage**: Universal - applies to both `sanitizeContent()` and `sanitizeIncomingText()` + +## Files Modified + +- `actions/setup/js/sanitize_content_core.cjs` - Added `decodeHtmlEntities()` function and integrated into `hardenUnicodeText()` +- `actions/setup/js/sanitize_content.test.cjs` - Added 25+ test cases for HTML entity decoding +- Exported `decodeHtmlEntities` from module for potential standalone use + +## Defense in Depth + +This fix follows defense-in-depth principles: +1. **Early decoding**: Entities decoded at Step 2 of Unicode hardening +2. **Comprehensive coverage**: Handles all entity types and double-encoding +3. **Validation**: Rejects invalid Unicode code points +4. **Universal application**: Applies to all content sanitization flows +5. **Test coverage**: Extensive test suite validates all attack vectors