fix: clean email formatting for Discord messages#1441
Conversation
…achment detection - Extract boundary from Content-Type header for proper MIME parsing - Add cleanEmailBody() to strip MIME boundaries, headers, and HTML tags - Decode HTML entities in email body text - Improve inline image detection with Content-ID support - Add RFC 2047 decoding for encoded subject lines Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Pass nested boundary explicitly to recursive calls - Add depth limit (max 10) to prevent stack overflow - Fixes RangeError: Maximum call stack size exceeded Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
Caution Review failedThe pull request is closed. Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. 📝 WalkthroughWalkthroughAdds RFC2047 subject decoding, header-derived boundary-aware multipart parsing with recursion limits and richer attachment detection, plus new HTML-entity decoding and MIME-artifact cleaning used when formatting email bodies for Discord. Changes
Sequence Diagram(s)sequenceDiagram
participant Client as Raw Email Source
participant Parser as Email Parser
participant Boundary as Boundary Extractor
participant Multipart as Multipart Handler
participant Formatter as Discord Formatter
Client->>Parser: submit rawEmail
Parser->>Boundary: parse Content-Type header
Boundary-->>Parser: headerBoundary
Parser->>Parser: decode RFC2047 subject
Parser->>Multipart: parse body with headerBoundary, depth=0
Multipart->>Multipart: split parts, detect nested boundaries
Multipart-->>Parser: text/html, text/plain, attachments
Parser->>Formatter: body parts + attachments
Formatter->>Formatter: select text/html vs text/plain, detect MIME artifacts
Formatter->>Formatter: cleanEmailBody + decodeHtmlEntities
Formatter->>External: post formatted message to Discord
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
📜 Recent review detailsConfiguration used: defaults Review profile: CHILL Plan: Pro 📒 Files selected for processing (2)
✏️ Tip: You can disable this entire section by setting Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 68e804b03b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| else if (encoding.toUpperCase() === 'Q') { | ||
| // Quoted-printable encoding | ||
| const decoded = encoded | ||
| .replace(/_/g, ' ') | ||
| .replace(/=([0-9A-Fa-f]{2})/g, (_: string, hex: string) => String.fromCharCode(Number.parseInt(hex, 16))) | ||
| return decoded |
There was a problem hiding this comment.
Decode Q-encoded RFC2047 subjects as UTF-8 bytes
This Q-decoding branch converts each =XX byte directly with String.fromCharCode, which treats bytes as Unicode code points rather than decoding the byte sequence using the declared charset. For UTF-8 subjects like =?UTF-8?Q?Fran=C3=A7ois?=, this yields mojibake (François) instead of the intended text. Any RFC2047 Q-encoded subject with multibyte UTF-8 (or other multibyte charsets) will be displayed incorrectly in Discord.
Useful? React with 👍 / 👎.
| // Strip any remaining HTML tags | ||
| cleaned = stripTagsSafely(cleaned) | ||
|
|
There was a problem hiding this comment.
Avoid stripping plain-text angle-bracket content
cleanEmailBody always runs stripTagsSafely even when the body came from the plain-text part, so any legitimate text enclosed in <...> (email addresses like <support@example.com>, XML snippets, code, or Markdown) gets removed entirely. Because cleanEmailBody is applied to email.body.text in both thread posts and embeds, plain-text emails can lose meaningful content that previously rendered correctly.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@cloudflare_workers/email/discord.ts`:
- Around line 429-440: The decodeHtmlEntities function can double-unescape
sequences like "&lt;" because & is decoded before other entities; update
decodeHtmlEntities so it decodes numeric/hex entities and named entities like
<, >, ", ', first and perform the & → & replacement
last (i.e., move the .replace(/&/g, '&') to the end), and ensure callers
(stripHtml, cleanEmailBody, stripTagsSafely) continue to call decodeHtmlEntities
only once per processing step to avoid repeated decoding.
In `@cloudflare_workers/email/email-parser.ts`:
- Around line 74-96: The Q-encoding branch in decodeRfc2047 decodes hex escapes
into individual JS code units which breaks multi-byte charsets like UTF-8;
update the Q branch in decodeRfc2047 to parse =HH sequences into a byte array
(preserve non-escaped ASCII bytes), then use TextDecoder with the specified
charset (fallback to 'utf-8') to decode the resulting Uint8Array into a proper
string; ensure charset is normalized (e.g., lowercased) and reuse the existing
decodeBase64Utf8 approach or its charset handling conventions so Q-decoded
multi-byte sequences are decoded correctly for charsets such as 'utf-8'.
🧹 Nitpick comments (2)
cloudflare_workers/email/email-parser.ts (1)
64-69: Consider making the boundary regex more robust.The current regex may fail to capture boundaries containing certain characters. RFC 2046 allows boundaries to contain characters like
=,:, and.which are not matched by[^"\s;]+.♻️ Suggested improvement
function extractBoundaryFromHeader(contentType: string): string | undefined { if (!contentType) return undefined - const boundaryMatch = contentType.match(/boundary="?([^"\s;]+)"?/i) + // Handle both quoted and unquoted boundaries per RFC 2046 + const quotedMatch = contentType.match(/boundary="([^"]+)"/i) + if (quotedMatch) return quotedMatch[1] + + const unquotedMatch = contentType.match(/boundary=([^\s;]+)/i) + return unquotedMatch?.[1] - return boundaryMatch?.[1] }cloudflare_workers/email/discord.ts (1)
290-302: Duplicated MIME detection logic - consider extracting to a helper.The same MIME boundary detection and fallback logic appears in both
postToThread(lines 216-222) andformatEmailForDiscord(lines 294-300). Consider extracting this to a shared helper function.♻️ Suggested refactor
/** * Extracts clean body text from email, handling MIME artifacts */ function extractCleanBodyText(email: ParsedEmail): string { let bodyText = email.body.text || '' // If text body is empty or looks like it contains MIME boundaries, try HTML const hasMimeBoundaries = bodyText.includes('--') && bodyText.includes('Content-Type:') if (!bodyText || hasMimeBoundaries) { bodyText = stripHtml(email.body.html || '') } // Clean up any MIME boundaries, headers, and HTML that leaked through return cleanEmailBody(bodyText) }Then use
extractCleanBodyText(email)in both functions.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
cloudflare_workers/email/discord.tscloudflare_workers/email/email-parser.ts
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{vue,ts,tsx,js,jsx}
📄 CodeRabbit inference engine (AGENTS.md)
Run
bun lintto lint Vue, TypeScript, and JavaScript files; usebun lint:fixto auto-fix issues
Files:
cloudflare_workers/email/discord.tscloudflare_workers/email/email-parser.ts
**/*.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (AGENTS.md)
Use single quotes and no semicolons per
@antfu/eslint-configrules
Files:
cloudflare_workers/email/discord.tscloudflare_workers/email/email-parser.ts
**/*.{ts,tsx,js,jsx,vue}
📄 CodeRabbit inference engine (AGENTS.md)
Run
bun lintor lint/format command before validating any backend or frontend task to ensure consistent formatting
Files:
cloudflare_workers/email/discord.tscloudflare_workers/email/email-parser.ts
🧬 Code graph analysis (1)
cloudflare_workers/email/email-parser.ts (1)
cloudflare_workers/email/types.ts (1)
EmailAttachment(50-55)
🪛 GitHub Check: CodeQL
cloudflare_workers/email/discord.ts
[failure] 430-433: Double escaping or unescaping
This replacement may produce '&' characters that are double-unescaped here.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Run tests
🔇 Additional comments (8)
cloudflare_workers/email/email-parser.ts (5)
29-46: LGTM! Good additions for subject decoding and boundary extraction.The RFC 2047 decoding for subject lines and extracting boundary from the Content-Type header are solid improvements. The logging provides useful debugging information for troubleshooting email parsing issues.
164-169: Good recursion depth limit to prevent stack overflow.The
MAX_DEPTH = 10limit is a sensible safeguard against maliciously crafted emails with deeply nested multipart structures. The warning log helps with debugging when this limit is hit.
212-227: Nested multipart parsing correctly passes boundary and increments depth.The explicit boundary passing and depth increment ensure proper parsing of nested structures while preventing infinite recursion. This addresses the PR objective of preventing stack overflow.
229-238: Good improvement for inline image detection via Content-ID.Detecting inline images using
Content-IDheaders (common in Gmail) ensures attachments render correctly in Discord threads. The fallback logic for images without explicit disposition is sensible.
248-268: Prefer first text/plain and text/html parts - good approach.Setting body text/html only if not already set ensures the first (typically most relevant) content part is used, which aligns with RFC 2046 recommendations for multipart/alternative.
cloudflare_workers/email/discord.ts (3)
213-223: MIME artifact detection is a reasonable heuristic.The check for
--combined withContent-Type:is a sensible way to detect when MIME boundaries have leaked into the text body. The fallback to HTML conversion and subsequent cleanup ensures cleaner Discord messages.
384-424:cleanEmailBodyfunction is well-structured for MIME artifact removal.The sequential cleanup of MIME boundaries, headers, HTML tags, and whitespace normalization addresses the PR objectives effectively. The character-safe
stripTagsSafelyusage avoids regex-based sanitization vulnerabilities.
451-473: Good integration of entity decoding in HTML-to-Markdown conversion.Calling
decodeHtmlEntitiesafter Turndown conversion and in the fallback path ensures consistent handling of HTML entities across all code paths.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
- Move & decoding to end of decodeHtmlEntities to prevent double-unescaping - Fix RFC 2047 Q-encoding to properly decode multi-byte UTF-8 charsets - Remove stripTagsSafely from cleanEmailBody to preserve <email@example.com> Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|


Summary
Fixes email-to-Discord transformation to remove MIME artifacts and properly handle attachments. Emails now display clean text without boundaries, headers, and HTML tags. Also prevents infinite recursion in nested multipart email parsing.
Changes
--xxxxx) and Content-* headers from email bodiesTest plan
Checklist
Summary by CodeRabbit
New Features
Bug Fixes
✏️ Tip: You can customize this high-level summary in your review settings.