Skip to content

Comments

perf: move html entity decoding to server side#1561

Merged
danielroe merged 5 commits intomainfrom
fix/entity-server
Feb 22, 2026
Merged

perf: move html entity decoding to server side#1561
danielroe merged 5 commits intomainfrom
fix/entity-server

Conversation

@danielroe
Copy link
Member

follow on from #1550 and #591 - we should be doing this on server-side

@vercel
Copy link

vercel bot commented Feb 22, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
npmx.dev Ready Ready Preview, Comment Feb 22, 2026 9:26am
2 Skipped Deployments
Project Deployment Actions Updated (UTC)
docs.npmx.dev Ignored Ignored Preview Feb 22, 2026 9:26am
npmx-lunaria Ignored Ignored Feb 22, 2026 9:26am

Request Review

@codecov
Copy link

codecov bot commented Feb 22, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

📢 Thoughts on this report? Let us know!

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 22, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉


📝 Walkthrough

Walkthrough

HTML entity decoding was moved out of app/utils/formatters.ts into a new shared/utils/html.ts (adds decodeHtmlEntities) and removed from the former module. Components app/components/Package/Playgrounds.vue and app/components/ReadmeTocDropdown.vue stopped decoding entity-encoded labels and now render text as-is. server/utils/readme.ts gained robust HTML-tag stripping and now uses the new decoder for TOC items and playground labels. app/composables/useMarkdown.ts updated its import path. The readme API cache key was bumped from v8 to v9. Tests were added/moved for the new decoder.

Possibly related PRs

🚥 Pre-merge checks | ✅ 1
✅ Passed checks (1 passed)
Check name Status Explanation
Description check ✅ Passed The pull request description clearly relates to the changeset, referencing prior PRs and explaining the objective to move HTML entity decoding to server-side for performance improvement.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/entity-server

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
server/utils/readme.ts (1)

376-383: ⚠️ Potential issue | 🟠 Major

Decoded plainText inserted into heading HTML template causes content loss for headings containing < or >.

The order of operations at line 377 is:

  1. Strip HTML tags: text.replace(/<[^>]*>/g, '')
  2. Decode entities: decodeHtmlEntities(...) — turns &lt;<, &amp;&

The decoded plainText (which may now contain raw < and &) is then string-interpolated directly into the heading HTML at line 383:

return `<h${semanticLevel} ...><a href="#${uniqueSlug}">${plainText}</a></h${semanticLevel}>\n`

sanitize-html's underlying htmlparser2 is fairly tolerant, but when a bare < appears in text content it begins tag parsing — confirmed reports show the string is truncated to the text before the < symbol, so the remainder of the text is discarded. For example, a heading ## Types &lt;T&gt; would lose T> in the final rendered README HTML.

Before this PR the entity-encoded &lt; was inserted into the template (valid HTML); the client decoded it for display. The decodeHtmlEntities call should only apply to the TOC entry, not to the value embedded in the HTML template.

The CodeQL "Incomplete multi-character sanitization" warning at line 377 tracks exactly this dataflow.

🐛 Proposed fix — separate decoded text (TOC) from entity-safe text (HTML template)
-    // Collect TOC item with plain text (HTML stripped, entities decoded)
-    const plainText = decodeHtmlEntities(text.replace(/<[^>]*>/g, '').trim())
+    // Strip HTML tags for heading template (keep entities so the HTML stays valid)
+    const strippedText = text.replace(/<[^>]*>/g, '').trim()
+    // Decode entities only for the plain-text TOC entry
+    const plainText = decodeHtmlEntities(strippedText)
     if (plainText) {
       toc.push({ text: plainText, id, depth })
     }

     /** The link href uses the unique slug WITHOUT the 'user-content-' prefix, because that will later be added for all links. */
-    return `<h${semanticLevel} id="${id}" data-level="${depth}"><a href="#${uniqueSlug}">${plainText}</a></h${semanticLevel}>\n`
+    return `<h${semanticLevel} id="${id}" data-level="${depth}"><a href="#${uniqueSlug}">${strippedText}</a></h${semanticLevel}>\n`
🧹 Nitpick comments (1)
shared/utils/html.ts (1)

1-13: Minimal but fit-for-purpose utility.

The regex alternatives and the htmlEntities map are in perfect correspondence, so the || match fallback is technically unreachable at runtime (every match the regex captures has a corresponding map entry). It serves as a compile-time hint when noUncheckedIndexedAccess is enabled, which is fine; just worth noting it cannot fire under normal operation.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
server/utils/readme.ts (1)

388-395: ⚠️ Potential issue | 🟠 Major

Decoded plainText used in the HTML template causes content loss via sanitizeHtml.

After decoding, plainText may contain raw < and > characters. When these are interpolated into the heading HTML template at line 395 and then processed by sanitizeHtml, the resulting markup is treated as unknown HTML tags and stripped — not preserved as text.

Concrete regression trace for a heading such as ## `Promise<T>`:

  1. text (from marked) = "<code>Promise&lt;T&gt;</code>"
  2. stripHtmlTags(text)"Promise&lt;T&gt;"
  3. decodeHtmlEntities(...)"Promise<T>"plainText now contains raw angle brackets
  4. Line 395 template: <a href="#...">Promise<T></a>
  5. sanitizeHtml sees <T> as an unknown tag (not in ALLOWED_TAGS), strips it
  6. Rendered anchor text: "Promise"<T> is silently lost

Before this PR, plainText was not decoded, so Promise&lt;T&gt; was embedded as HTML entities, which sanitizeHtml preserves and the browser renders correctly as Promise<T>.

The fix is to separate the stripped-but-not-decoded text (safe for HTML templates) from the fully decoded text (correct for TOC display, where Vue renders it as plain text via interpolation):

🐛 Proposed fix
-    // Collect TOC item with plain text (HTML stripped, entities decoded)
-    const plainText = decodeHtmlEntities(stripHtmlTags(text).trim())
+    // Collect TOC item with plain text (HTML stripped, entities decoded)
+    const strippedText = stripHtmlTags(text).trim()
+    const plainText = decodeHtmlEntities(strippedText)
     if (plainText) {
       toc.push({ text: plainText, id, depth })
     }

     /** The link href uses the unique slug WITHOUT the 'user-content-' prefix, because that will later be added for all links. */
-    return `<h${semanticLevel} id="${id}" data-level="${depth}"><a href="#${uniqueSlug}">${plainText}</a></h${semanticLevel}>\n`
+    return `<h${semanticLevel} id="${id}" data-level="${depth}"><a href="#${uniqueSlug}">${strippedText}</a></h${semanticLevel}>\n`

Comment on lines +178 to +181
* Strip all HTML tags from a string, looping until stable to prevent
* incomplete sanitization from nested/interleaved tags
* (e.g. `<scr<script>ipt>` → `<script>` after one pass).
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Comment example does not match the actual regex behaviour.

The doc comment illustrates <scr<script>ipt><script> after one pass, which is the failure mode of the narrower regex /<[^<>]*>/g (which excludes both < and >). The actual regex used here is /<[^>]*>/g, whose greedy [^>]* also matches <, so the first pass strips <scr<script> (from the opening < to the first >), leaving ipt> — not <script>.

📝 Proposed fix for the comment
 /**
  * Strip all HTML tags from a string, looping until stable to prevent
  * incomplete sanitization from nested/interleaved tags
- * (e.g. `<scr<script>ipt>` → `<script>` after one pass).
+ * (e.g. `<<script>>` → `<script>` after one pass, then `''` after the second).
  */

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
shared/utils/html.ts (2)

1-13: Derive the regex from htmlEntities keys to eliminate desync risk.

The regex alternation /&(?:amp|lt|gt|quot|apos|nbsp|#39);/g is a manual duplicate of the map's keys. Adding a new entity to the map without updating the regex will silently no-op.

♻️ Proposed refactor — derive regex from map keys
 const htmlEntities: Record<string, string> = {
   '&amp;': '&',
   '&lt;': '<',
   '&gt;': '>',
   '&quot;': '"',
   '&#39;': "'",
   '&apos;': "'",
   '&nbsp;': '\u00A0',
 }
 
+const htmlEntityRegex = new RegExp(
+  Object.keys(htmlEntities)
+    .map(k => k.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'))
+    .join('|'),
+  'g',
+)
+
 export function decodeHtmlEntities(text: string): string {
-  return text.replace(/&(?:amp|lt|gt|quot|apos|nbsp|#39);/g, match => htmlEntities[match] || match)
+  return text.replace(htmlEntityRegex, match => htmlEntities[match] || match)
 }

1-9: Consider extending entity coverage for common typographic characters.

README headings and package labels frequently contain entities outside this set — &mdash; (), &ndash; (), &hellip; (), &copy; (©), &reg; (®) — that will now appear as literal entity strings in the server-rendered output. Worth auditing whether the TOC/playground label paths encounter these in practice.

@danielroe danielroe added this pull request to the merge queue Feb 22, 2026
Merged via the queue into main with commit 54aa86a Feb 22, 2026
20 checks passed
@danielroe danielroe deleted the fix/entity-server branch February 22, 2026 09:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants