Repro
Source markdown line:
This type of code example should not be rendered using the HTML `<code>` tag.
Hugo (or any standard renderer) emits:
<p>This type of code example should not be rendered using the HTML <code>&lt;code&gt;</code> tag.</p>
extractHtmlText flow on this page:
- `node-html-parser`'s `.text` strips DOM tags and decodes entities, producing the string `This type of code example should not be rendered using the HTML <code> tag.` The literal text `<code>` is decoded from `&lt;code&gt;`.
- The tag-stripping regex then runs: `/<([a-zA-Z][a-zA-Z0-9-]*)([^>]*)>/g`. The literal text `<code>` matches, `code` is in `HTML_TAG_NAMES`, so the replacer returns `''`.
- Result: `This type of code example should not be rendered using the HTML tag.` The word "code" is gone.
On the markdown side, `extractMarkdownText` protects `<code>` as a code-span placeholder, restores it, and `normalize()`'s `<([^>\n]+)>` strips angle brackets, leaving `code`. So markdown has `...the HTML code tag.` and HTML has `...the HTML tag.`, which is a mismatch.
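The two sides of the mismatch can be reproduced without the tool itself. This is a minimal sketch, not the project's code: the `HTML_TAG_NAMES` contents, the helper names, and the whitespace collapsing are assumptions, and `unwrapAngleBrackets` only approximates the `normalize()` step described above. The tag-stripping regex is the one quoted in this issue.

```javascript
// Assumed small subset of the tool's HTML_TAG_NAMES; the real set is larger.
const HTML_TAG_NAMES = new Set(["code", "span", "pre", "main", "title"]);

// Tag-stripping regex as quoted in the issue.
const TAG_RE = /<([a-zA-Z][a-zA-Z0-9-]*)([^>]*)>/g;

// HTML side: runs on flat, entity-decoded text and drops known tag names.
function stripKnownTags(flatText) {
  return flatText.replace(TAG_RE, (match, name) =>
    HTML_TAG_NAMES.has(name.toLowerCase()) ? "" : match
  );
}

// Markdown side: approximation of normalize(), unwrapping <...> to its inner text.
function unwrapAngleBrackets(text) {
  return text.replace(/<([^>\n]+)>/g, "$1");
}

// Collapse whitespace so both sides are compared word by word (assumption).
const squash = (s) => s.replace(/\s+/g, " ").trim();

const decoded = "should not be rendered using the HTML <code> tag.";
const htmlSide = squash(stripKnownTags(decoded));
const markdownSide = squash(unwrapAngleBrackets(decoded));

console.log(htmlSide);     // should not be rendered using the HTML tag.
console.log(markdownSide); // should not be rendered using the HTML code tag.
console.log(markdownSide.includes(htmlSide)); // false
```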
The intent of the HTML_TAG_NAMES strip — per the comment — is to handle syntax-highlighting markup that survives DOM stripping when it appears inside <pre> (e.g., <span class="line">, nested <code>). That's a legitimate need. But the same regex can't distinguish "<span> from a real syntax-highlighted DOM element that survived as text" from "<code> that came from entity-decoded inline code content".
Why this is hard
The current implementation works at the text level after `node-html-parser` flattens the DOM. By the time tag stripping runs, the structural information is gone. Within the flat text, `<code>` from a syntax-highlighter span looks identical to `<code>` from `&lt;code&gt;` entity decoding inside a real `<code>` element.
Suggested approaches
Option A — DOM-aware extraction
Walk the DOM yourself instead of using `.text`. When you visit a `<code>` or `<pre>` element, extract its `.textContent` and append it as a single literal token (no further regex stripping inside it). This way, `<code>&lt;code&gt;</code>` becomes the literal string `<code>` and is preserved through the rest of the pipeline. Outside of `<code>`/`<pre>`, the existing tag-stripping logic still applies.
This is the most correct fix and also makes the syntax-highlighter case work better — instead of pattern-matching against tag names you guess at, you just don't run tag stripping inside <pre>/<code> at all, because the DOM gave you their text content directly.
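A rough sketch of what Option A could look like. To stay self-contained it walks a hypothetical plain-object tree (`{ text }` for already-decoded text nodes, `{ tag, children }` for elements) rather than `node-html-parser`'s real node classes, and the tag set is an assumed subset. All function names here are illustrative.

```javascript
// Elements whose text content is taken literally, with no regex stripping.
const LITERAL_TAGS = new Set(["code", "pre"]);
// Assumed subset of tag names to strip from text outside <code>/<pre>.
const KNOWN_TAGS = new Set(["span", "code", "div", "main", "title"]);
const TAG_PATTERN = /<([a-zA-Z][a-zA-Z0-9-]*)([^>]*)>/g;

// Concatenate all descendant text exactly as the tree provides it.
function rawText(node) {
  if (node.text !== undefined) return node.text;
  return node.children.map(rawText).join("");
}

// Existing flat-text behavior, now applied only outside literal elements.
function stripTagShapedText(text) {
  return text.replace(TAG_PATTERN, (match, name) =>
    KNOWN_TAGS.has(name.toLowerCase()) ? "" : match
  );
}

function extractText(node) {
  if (node.text !== undefined) return stripTagShapedText(node.text);
  if (LITERAL_TAGS.has(node.tag)) return rawText(node); // single literal token
  return node.children.map(extractText).join("");
}

// Tree for: <p>rendered using the HTML <code>&lt;code&gt;</code> tag.</p>
const paragraph = {
  tag: "p",
  children: [
    { text: "rendered using the HTML " },
    { tag: "code", children: [{ text: "<code>" }] },
    { text: " tag." },
  ],
};

console.log(extractText(paragraph)); // rendered using the HTML <code> tag.
```

Because a highlighted `<pre>` block holds real `<span>` elements in the DOM, `rawText` returns only their text, so no tag-name guessing is needed inside it.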
Option B — narrower stripping inside flattened text
Keep the flat-text approach but only strip tags whose surface form is highly likely to be a syntax-highlighter artifact, meaning tags that carry attributes, like `<span class=...>`, `<code class=...>`, `<div class=...>`. Bare `<code>`, `<main>`, `<title>` etc. are vanishingly unlikely to come from a syntax highlighter (highlighters generally include classes for token type). Strip only what matches `<(span|code|div)([\s][^>]*)>` (attributes required) and leave bare tags alone. This is a smaller change but won't catch `<span>` if a highlighter ever omits classes.
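The pattern from Option B can be tried in isolation. A small sketch, where the function name and sample strings are made up for illustration:

```javascript
// Option B pattern: span/code/div only, and only when attributes follow the name.
const ATTR_TAG_RE = /<(span|code|div)([\s][^>]*)>/g;

function stripAttributedTags(flatText) {
  return flatText.replace(ATTR_TAG_RE, "");
}

// Highlighter-style markup that survived as text is removed.
console.log(stripAttributedTags('<span class="line">let x = 1;'));
// let x = 1;

// A bare tag mentioned in prose is left alone.
console.log(stripAttributedTags("using the HTML <code> tag."));
// using the HTML <code> tag.
```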
Option C — denylist instead of allowlist
The current `HTML_TAG_NAMES` is essentially "every tag a browser knows about", and the intent is "strip these from flat text because they're probably highlight markup". The list of tags that actually appear inside `<pre>` from real syntax highlighters is small: `span`, `code`, `div`, `mark`, `i`, `b`, maybe `a`. Trim `HTML_TAG_NAMES` to that subset. This dramatically reduces collateral damage on tags like `<title>`, `<head>`, `<main>`, `<nav>` that authors mention in prose. Because `code` stays in the subset, a bare `<code>` in prose would still be stripped unless this is combined with Option B's attribute requirement.
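A sketch of the trimmed list in use, with the existing regex unchanged. The set name `HIGHLIGHT_TAGS` and the function name are hypothetical; the tag names are the ones listed above.

```javascript
// Trimmed list: only tags that highlighters are expected to emit inside <pre>.
const HIGHLIGHT_TAGS = new Set(["span", "code", "div", "mark", "i", "b", "a"]);
const ANY_TAG_RE = /<([a-zA-Z][a-zA-Z0-9-]*)([^>]*)>/g;

function stripHighlightTags(flatText) {
  return flatText.replace(ANY_TAG_RE, (match, name) =>
    HIGHLIGHT_TAGS.has(name.toLowerCase()) ? "" : match
  );
}

// Structural tags mentioned in prose now survive.
console.log(stripHighlightTags("Put content in <main>, not <title>."));
// Put content in <main>, not <title>.

// Tags still on the list are stripped even when bare.
console.log(JSON.stringify(stripHighlightTags("the HTML <code> tag")));
// "the HTML  tag"
```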
Repro material
Site: https://dacharycarey.com. Affected post (before workaround): https://dacharycarey.com/2025/09/07/audit-conclusions/ — references `<code>` in two paragraphs discussing how a code-classification system shouldn't tag certain examples with <code>. Other affected posts mention `<main>`, `<title>`, `<h1>`, `<link>` in prose about HTML.
Workaround used
Rewrote `<code>` → `code` in the source, dropping the angle brackets. This loses semantic intent — the author meant the literal HTML <code> element, not just the word "code" — and it doesn't generalize: every site mentioning HTML tags in prose has to pre-flatten them to satisfy the parity check.
Summary
`markdown-content-parity` reports content as missing when prose contains an inline code span whose content is an HTML-tag-shaped string, like `<code>`, `<main>`, `<title>`. The markdown is valid: `<code>` is a normal inline code span that renders to `<code>&lt;code&gt;</code>`, displaying the literal text "<code>" as code. But the HTML-side extraction strips the entity-decoded `<code>` text, so the segment ends up shorter on the HTML side than on the markdown side, and substring containment fails.