Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
137 changes: 137 additions & 0 deletions scratchpad/html-entity-mention-bypass-fix.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,137 @@
# HTML Entity Encoding Bypass Fix for @mention Sanitization

## Problem

The safe-outputs sanitization system had a vulnerability where HTML entities could bypass @mention detection. If entities were decoded after @mention neutralization, an attacker could use entity-encoded @ symbols to trigger unwanted user notifications.

### Attack Vectors

1. Named entity: `@user` → `@user`
2. Decimal entity: `@user` → `@user`
3. Hexadecimal entity: `@user` or `@user` → `@user`
4. Double-encoded: `@user`, `@user`, `@user` → `@user`
5. Mixed encoding: `@user` → `@user`
6. Fully encoded: `@user` → `@user`

## Solution

Added `decodeHtmlEntities()` function in `actions/setup/js/sanitize_content_core.cjs` that:

1. **Decodes named entities**: `@` → `@` (case-insensitive)
2. **Decodes decimal entities**: `&#NNN;` → corresponding Unicode character
3. **Decodes hexadecimal entities**: `&#xHHH;` or `&#XHHH;` → corresponding Unicode character
4. **Handles double-encoding**: `@`, `@`, `@`
5. **Validates code points**: Only accepts valid Unicode range (0x0 - 0x10FFFF)

### Integration

The `decodeHtmlEntities()` function is integrated into `hardenUnicodeText()` at **Step 2**, ensuring HTML entities are decoded **before** @mention detection occurs:

```javascript
function hardenUnicodeText(text) {
// Step 1: Normalize Unicode (NFC)
result = result.normalize("NFC");

// Step 2: Decode HTML entities (CRITICAL - must be early)
result = decodeHtmlEntities(result);

Comment on lines +31 to +37
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hardenUnicodeText() example snippet is not valid JavaScript as written: it uses result before initialization/declaring it, and it ignores the text parameter. Since this is documentation meant to illustrate the actual implementation, update the snippet to match the real flow (let result = text;, then normalize/decode, etc.) so readers don’t copy a broken example.

Copilot uses AI. Check for mistakes.
// Step 3: Strip zero-width characters
// Step 4: Remove bidirectional overrides
// Step 5: Convert full-width ASCII

return result;
}
```

### Sanitization Pipeline

```
Input Text
hardenUnicodeText()
├─ Unicode normalization (NFC)
├─ HTML entity decoding ← decodeHtmlEntities()
├─ Zero-width character removal
├─ Bidirectional control removal
└─ Full-width ASCII conversion
ANSI escape sequence removal
neutralizeMentions() or neutralizeAllMentions()
Other sanitization steps
Output (safe text)
```

## Test Coverage

Comprehensive test suite in `actions/setup/js/sanitize_content.test.cjs` covers:

- ✅ Named entity decoding (`@`)
- ✅ Double-encoded named entities (`@`)
- ✅ Decimal entity decoding (`@`)
- ✅ Double-encoded decimal entities (`@`)
- ✅ Hexadecimal entity decoding (lowercase `@`, uppercase `@`)
- ✅ Double-encoded hex entities (`@`, `@`)
- ✅ Multiple encoded mentions in one string
- ✅ Mixed encoded and normal mentions
- ✅ Org/team mentions with entities
- ✅ General entity decoding (non-@ characters)
- ✅ Invalid code point handling
- ✅ Malformed entity handling
- ✅ Case-insensitive named entities
- ✅ Interaction with other sanitization steps
- ✅ Allowed aliases with encoded mentions

Total: 25+ test cases

## Examples

```javascript
// Named entity
sanitizeContent("@pelikhan")
// → "`@pelikhan`"

// Decimal entity
sanitizeContent("@pelikhan")
// → "`@pelikhan`"

// Hexadecimal entity
sanitizeContent("@pelikhan")
// → "`@pelikhan`"

// Mixed encoding in username
sanitizeContent("@user")
// → "`@user`"

// Fully encoded
sanitizeContent("@user")
// → "`@user`"

// Double-encoded
sanitizeContent("@pelikhan")
// → "`@pelikhan`"
```

## Security Impact

- **Risk Level**: MEDIUM → **RESOLVED**
- **Attack Surface**: Entity-encoded @ symbols could bypass mention detection
- **Fix**: All HTML entity encoding variants now decoded before @mention processing
- **Coverage**: Universal - applies to both `sanitizeContent()` and `sanitizeIncomingText()`

## Files Modified

- `actions/setup/js/sanitize_content_core.cjs` - Added `decodeHtmlEntities()` function and integrated into `hardenUnicodeText()`
- `actions/setup/js/sanitize_content.test.cjs` - Added 25+ test cases for HTML entity decoding
- Exported `decodeHtmlEntities` from module for potential standalone use

## Defense in Depth

This fix follows defense-in-depth principles:
1. **Early decoding**: Entities decoded at Step 2 of Unicode hardening
2. **Comprehensive coverage**: Handles all entity types and double-encoding
3. **Validation**: Rejects invalid Unicode code points
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In “Defense in Depth”, point 3 says invalid Unicode code points are “rejected”, but the current decodeHtmlEntities() behavior keeps the original entity text unchanged when a code point is invalid. Update the wording to reflect that invalid entities are preserved/left undecoded (rather than removed), to avoid overstating what the mitigation does.

Suggested change
3. **Validation**: Rejects invalid Unicode code points
3. **Validation**: Leaves invalid Unicode entities undecoded (preserves original text)

Copilot uses AI. Check for mistakes.
4. **Universal application**: Applies to all content sanitization flows
5. **Test coverage**: Extensive test suite validates all attack vectors
Loading