markdown-content-parity: inline <tag> code spans get text-stripped, causing false 'missing' #90

@dacharyc

Summary

markdown-content-parity reports content as missing when prose contains an inline code span whose content is an HTML-tag-shaped string, like `<code>`, `<main>`, `<title>`. The markdown is valid: `<code>` is a normal inline code span that renders to <code>&lt;code&gt;</code>, displaying the literal text "<code>" as code. But the HTML-side extraction strips the entity-decoded <code> text, so the segment ends up shorter on the HTML side than on the markdown side, and substring containment fails.

Repro

Source markdown line:

This type of code example should not be rendered using the HTML `<code>` tag.

Hugo (or any standard renderer) emits:

<p>This type of code example should not be rendered using the HTML <code>&lt;code&gt;</code> tag.</p>

extractHtmlText flow on this page:

  1. node-html-parser .text strips DOM tags and decodes entities, producing the string This type of code example should not be rendered using the HTML <code> tag. — the literal text <code> is decoded from &lt;code&gt;.
  2. The tag-stripping regex then runs: /<([a-zA-Z][a-zA-Z0-9-]*)([^>]*)>/g. The literal text <code> matches, code is in HTML_TAG_NAMES, so the replacer returns ''.
  3. Result: This type of code example should not be rendered using the HTML tag. — the word "code" is gone.
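The three steps above can be reproduced without the parser. This is a minimal sketch, not the project's actual code: `stripKnownTags` is a hypothetical helper name, the `HTML_TAG_NAMES` set here is an abbreviated stand-in for the real list, and the input string is what step 1 is described as producing.

```javascript
// Abbreviated stand-in for the real HTML_TAG_NAMES (assumption: the real list is much longer).
const HTML_TAG_NAMES = new Set(["code", "main", "title", "span", "div", "pre"]);

// The tag-stripping regex quoted in step 2.
const TAG_RE = /<([a-zA-Z][a-zA-Z0-9-]*)([^>]*)>/g;

// Hypothetical helper: remove tag-shaped text whose name is in the set.
function stripKnownTags(text) {
  return text.replace(TAG_RE, (match, name) =>
    HTML_TAG_NAMES.has(name.toLowerCase()) ? "" : match
  );
}

// Step 1 output: DOM tags removed, &lt;code&gt; decoded to the literal <code>.
const flattened =
  "This type of code example should not be rendered using the HTML <code> tag.";

// Whitespace collapsing is an assumption, added to match the result quoted in step 3.
const stripped = stripKnownTags(flattened).replace(/\s+/g, " ");

console.log(stripped);
```

The replacer has no way to know that this `<code>` came from entity decoding rather than from leftover markup, which is the core of the bug.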

On the markdown side, extractMarkdownText protects `<code>` as a code-span placeholder, restores it, and normalize()'s <([^>\n]+)> strips angle brackets, leaving code. So markdown has ...the HTML code tag. and HTML has ...the HTML tag. — mismatch.
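The mismatch can be shown as a failing substring check. This sketch assumes the code-span placeholder has already been restored to the bare text `<code>` (no backticks), and `stripAngleBrackets` is a hypothetical stand-in for the one `normalize()` step quoted above; the rest of `normalize()` is not modeled.

```javascript
// The angle-bracket pattern quoted above for normalize().
const ANGLE_RE = /<([^>\n]+)>/g;

// Hypothetical stand-in for that single step: keep the inner text, drop the brackets.
function stripAngleBrackets(text) {
  return text.replace(ANGLE_RE, "$1");
}

// Markdown side, after the code span has been restored (assumption: backticks already removed).
const markdownSegment = stripAngleBrackets(
  "This type of code example should not be rendered using the HTML <code> tag."
);

// HTML side, as produced by the extractHtmlText flow described earlier.
const htmlText =
  "This type of code example should not be rendered using the HTML tag.";

console.log(markdownSegment);
console.log(htmlText.includes(markdownSegment)); // false: "code" exists only on the markdown side
```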

The intent of the HTML_TAG_NAMES strip — per the comment — is to handle syntax-highlighting markup that survives DOM stripping when it appears inside <pre> (e.g., <span class="line">, nested <code>). That's a legitimate need. But the same regex can't distinguish "<span> from a real syntax-highlighted DOM element that survived as text" from "<code> that came from entity-decoded inline code content".

Why this is hard

The current implementation works at the text level after node-html-parser flattens the DOM. By the time tag stripping runs, the structural information is gone. Within the flat text, <code> from a syntax-highlighter span looks identical to <code> from &lt;code&gt; entity decoding inside a real <code> element.

Suggested approaches

Option A — DOM-aware extraction
Walk the DOM yourself instead of using .text. When you visit a <code> or <pre> element, extract its .textContent and append it as a single literal token (no further regex stripping inside it). This way, <code>&lt;code&gt;</code> becomes the literal string <code> and is preserved through the rest of the pipeline. Outside of <code>/<pre>, the existing tag-stripping logic still applies.

This is the most correct fix and also makes the syntax-highlighter case work better — instead of pattern-matching against tag names you guess at, you just don't run tag stripping inside <pre>/<code> at all, because the DOM gave you their text content directly.
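A rough sketch of what Option A could look like. To stay self-contained it walks a hypothetical minimal tree (`{ tag, children }` for elements, `{ text }` for already entity-decoded text nodes) rather than node-html-parser's real node classes, and `stripTagShapedText` is a simplified stand-in for the existing stripping step (it skips the HTML_TAG_NAMES lookup).

```javascript
// Elements whose text is taken literally, with no tag stripping inside.
const LITERAL_TAGS = new Set(["code", "pre"]);

// Concatenate all descendant text, similar to what textContent returns.
function textContent(node) {
  if (node.text !== undefined) return node.text;
  return (node.children || []).map(textContent).join("");
}

// Simplified stand-in for the existing flat-text tag stripping.
function stripTagShapedText(text) {
  return text.replace(/<([a-zA-Z][a-zA-Z0-9-]*)([^>]*)>/g, "");
}

// Walk the tree: literal text inside <code>/<pre>, stripped text everywhere else.
function extractText(node) {
  if (node.text !== undefined) return stripTagShapedText(node.text);
  if (LITERAL_TAGS.has(node.tag)) return textContent(node);
  return (node.children || []).map(extractText).join("");
}

// <p>...the HTML <code>&lt;code&gt;</code> tag.</p> as a hypothetical tree.
const paragraph = {
  tag: "p",
  children: [
    { text: "should not be rendered using the HTML " },
    { tag: "code", children: [{ text: "<code>" }] },
    { text: " tag." },
  ],
};

console.log(extractText(paragraph));
```

Because highlighter spans inside `<pre>` are real elements in the tree, their markup never reaches the text at all, so no tag-name guessing is needed there.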

Option B — narrower stripping inside flattened text
Keep the flat-text approach but only strip tags whose surface form is highly likely to be a syntax-highlighter artifact: tags carrying attributes, like <span class=...>, <code class=...>, <div class=...>. Bare <code>, <main>, <title> etc. are vanishingly unlikely to come from a syntax highlighter (highlighters typically include classes for token type). Match <(span|code|div)([\s][^>]*)> (which requires attributes) and strip only those matches, leaving bare tags alone. This is a smaller change, but it won't catch <span> if a highlighter ever omits classes.
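A small sketch of the Option B pattern in use. `stripAttributedTags` is a hypothetical helper name; only the regex comes from the description above.

```javascript
// Only matches span/code/div when at least one whitespace character follows the
// tag name, i.e. when the tag carries attributes.
const HIGHLIGHT_TAG_RE = /<(span|code|div)([\s][^>]*)>/g;

// Hypothetical helper: strip attributed tags, leave bare tags alone.
function stripAttributedTags(text) {
  return text.replace(HIGHLIGHT_TAG_RE, "");
}

// Bare tag mentioned in prose: left untouched.
console.log(stripAttributedTags("rendered using the HTML <code> tag."));

// Highlighter-style markup that survived as text: removed.
console.log(stripAttributedTags('<span class="line">const x = 1;'));
```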

Option C — denylist instead of allowlist
The current HTML_TAG_NAMES is essentially "every tag a browser knows about", and the intent is "strip these from flat text because they're probably highlight markup". The list of tags that actually appear inside <pre> from real syntax highlighters is small: span, code, div, mark, i, b, maybe a. Trim HTML_TAG_NAMES to that subset. This dramatically reduces collateral damage on tags like <code>, <title>, <head>, <main>, <nav> that authors mention in prose.
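A sketch of Option C with the trimmed list. `HIGHLIGHTER_TAG_NAMES` and `stripHighlighterTags` are hypothetical names; the regex is the existing one and the tag subset is the one listed above.

```javascript
// Trimmed subset proposed above (hypothetical name for the reduced list).
const HIGHLIGHTER_TAG_NAMES = new Set(["span", "code", "div", "mark", "i", "b", "a"]);

// The existing tag-stripping regex, unchanged.
const STRIP_RE = /<([a-zA-Z][a-zA-Z0-9-]*)([^>]*)>/g;

// Hypothetical helper: strip only tags that highlighters are expected to emit.
function stripHighlighterTags(text) {
  return text.replace(STRIP_RE, (match, name) =>
    HIGHLIGHTER_TAG_NAMES.has(name.toLowerCase()) ? "" : match
  );
}

// Tags outside the subset survive.
console.log(stripHighlighterTags("Put one <h1> inside <main>, and set a <title>."));

// Highlighter markup is still removed.
console.log(stripHighlighterTags('<span class="k">const</span> x'));
```

One caveat with this sketch: `code` is itself in the proposed subset, so a bare `<code>` in prose would still be stripped. Only tags outside the subset, such as `<title>` or `<main>`, are protected unless `code` is also dropped from the list or combined with the attribute requirement from Option B.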

Repro material

Site: https://dacharycarey.com. Affected post (before workaround): https://dacharycarey.com/2025/09/07/audit-conclusions/ — references `<code>` in two paragraphs discussing how a code-classification system shouldn't tag certain examples with <code>. Other affected posts mention `<main>`, `<title>`, `<h1>`, `<link>` in prose about HTML.

Workaround used

Rewrote `<code>` to `code` in the source, dropping the angle brackets. This loses semantic intent (the author meant the literal HTML <code> element, not just the word "code"), and it doesn't generalize: every site mentioning HTML tags in prose has to pre-flatten them to satisfy the parity check.
