markdown-content-parity: inline `<tag>` code spans get text-stripped, causing false 'missing' #90

## Summary

`markdown-content-parity` reports content as missing when prose contains an inline code span whose content is an HTML-tag-shaped string, like `` `<code>` ``, `` `<main>` ``, `` `<title>` ``. The markdown is valid — `` `<code>` `` is a normal inline code span that renders to `<code>&lt;code&gt;</code>`, displaying the literal text "<code>" as code. But the HTML-side extraction strips the entity-decoded `<code>` text, so the segment ends up shorter on the HTML side than on the markdown side, and substring containment fails.

## Repro

Source markdown line:
```markdown
This type of code example should not be rendered using the HTML `<code>` tag.
```

Hugo (or any standard renderer) emits:
```html
<p>This type of code example should not be rendered using the HTML <code>&lt;code&gt;</code> tag.</p>
```

`extractHtmlText` flow on this page:
1. `node-html-parser` `.text` strips DOM tags and decodes entities, producing the string `This type of code example should not be rendered using the HTML <code> tag.` — the literal text `<code>` is decoded from `&lt;code&gt;`.
2. The tag-stripping regex then runs: `/<([a-zA-Z][a-zA-Z0-9-]*)([^>]*)>/g`. The literal text `<code>` matches, `code` is in `HTML_TAG_NAMES`, so the replacer returns `''`.
3. Result: `This type of code example should not be rendered using the HTML  tag.` — the word "code" is gone.
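
The three steps above can be reproduced in isolation. This is a minimal sketch, not the project's actual code: the regex is the one quoted in step 2, but the `HTML_TAG_NAMES` subset and the `stripKnownTags` helper are assumptions made for illustration.

```javascript
// Hypothetical subset of HTML_TAG_NAMES; the real list is assumed to be much longer.
const HTML_TAG_NAMES = new Set(["code", "span", "div", "main", "title"]);

// The tag-stripping regex quoted in step 2.
const TAG_PATTERN = /<([a-zA-Z][a-zA-Z0-9-]*)([^>]*)>/g;

// Assumed replacer: drop the match when the tag name is known, otherwise keep it.
function stripKnownTags(flatText) {
  return flatText.replace(TAG_PATTERN, (match, name) =>
    HTML_TAG_NAMES.has(name.toLowerCase()) ? "" : match
  );
}

// Flat text as described in step 1, after entity decoding.
const flattened =
  "This type of code example should not be rendered using the HTML <code> tag.";

// Prints the sentence with "<code>" removed, leaving two spaces before "tag.".
console.log(JSON.stringify(stripKnownTags(flattened)));
```

Because the regex runs on flat text, it has no way to know that this `<code>` was inline code content rather than leftover markup.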

On the markdown side, `extractMarkdownText` protects `` `<code>` `` as a code-span placeholder, restores it, and `normalize()`'s `<([^>\n]+)>` strips angle brackets, leaving `code`. So markdown has `...the HTML code tag.` and HTML has `...the HTML  tag.` — mismatch.
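
The mismatch can be shown with the two resulting strings. This sketch assumes `normalize()` replaces each `<...>` match with its inner text, as described above; `stripAngleBrackets` and `collapseWhitespace` are hypothetical helpers, and the whitespace collapse is added only to show that the double space is not what makes the check fail.

```javascript
// Assumed form of the bracket-stripping step in normalize(): keep the inner text.
function stripAngleBrackets(text) {
  return text.replace(/<([^>\n]+)>/g, "$1");
}

// Hypothetical whitespace collapse, so spacing differences are ruled out.
function collapseWhitespace(text) {
  return text.replace(/\s+/g, " ").trim();
}

// Markdown side: the code span content is restored, then loses its brackets.
const markdownSide = collapseWhitespace(
  stripAngleBrackets("should not be rendered using the HTML <code> tag.")
);

// HTML side: "<code>" was already removed by the tag-stripping regex.
const htmlSide = collapseWhitespace(
  stripAngleBrackets("should not be rendered using the HTML  tag.")
);

console.log(markdownSide); // should not be rendered using the HTML code tag.
console.log(htmlSide); // should not be rendered using the HTML tag.
console.log(htmlSide.includes(markdownSide)); // false
```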

The intent of the HTML_TAG_NAMES strip — per the comment — is to handle syntax-highlighting markup that survives DOM stripping when it appears inside `<pre>` (e.g., `<span class="line">`, nested `<code>`). That's a legitimate need. But the same regex can't distinguish "`<span>` from a real syntax-highlighted DOM element that survived as text" from "`<code>` that came from entity-decoded inline code content".

## Why this is hard

The current implementation works at the text level after `node-html-parser` flattens the DOM. By the time tag stripping runs, the structural information is gone. Within the flat text, `<code>` from a syntax-highlighter span looks identical to `<code>` from `&lt;code&gt;` entity decoding inside a real `<code>` element.

## Suggested approaches

**Option A — DOM-aware extraction**
Walk the DOM yourself instead of using `.text`. When you visit a `<code>` or `<pre>` element, extract its `.textContent` and append it as a single literal token (no further regex stripping inside it). This way, `<code>&lt;code&gt;</code>` becomes the literal string `<code>` and is preserved through the rest of the pipeline. Outside of `<code>`/`<pre>`, the existing tag-stripping logic still applies.

This is the most correct fix and also makes the syntax-highlighter case work better — instead of pattern-matching against tag names you guess at, you just don't run tag stripping inside `<pre>`/`<code>` at all, because the DOM gave you their text content directly.
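
A rough sketch of the walk, under stated assumptions: the tree below is a hand-built stand-in for a parsed DOM (plain objects with `tag`/`children` or `text`), not the `node-html-parser` node API, and `extractText`, `stripKnownTags`, and `LITERAL_TAGS` are hypothetical names. Text nodes are assumed to be entity-decoded already.

```javascript
// Elements whose text content is taken literally, with no tag stripping.
const LITERAL_TAGS = new Set(["code", "pre"]);

// Hypothetical subset of the existing strip list.
const HTML_TAG_NAMES = new Set(["code", "span", "div", "main", "title"]);
const TAG_PATTERN = /<([a-zA-Z][a-zA-Z0-9-]*)([^>]*)>/g;

// Existing flat-text stripping, applied only outside <code>/<pre>.
function stripKnownTags(text) {
  return text.replace(TAG_PATTERN, (match, name) =>
    HTML_TAG_NAMES.has(name.toLowerCase()) ? "" : match
  );
}

// Recursively collect text; once inside a literal element, text passes through untouched.
function extractText(node, insideLiteral = false) {
  if (typeof node.text === "string") {
    return insideLiteral ? node.text : stripKnownTags(node.text);
  }
  const literal = insideLiteral || LITERAL_TAGS.has(node.tag);
  return node.children.map((child) => extractText(child, literal)).join("");
}

// Stand-in for the parsed form of the <p> element from the repro.
const paragraph = {
  tag: "p",
  children: [
    { text: "This type of code example should not be rendered using the HTML " },
    { tag: "code", children: [{ text: "<code>" }] },
    { text: " tag." },
  ],
};

// The literal "<code>" inside the code element is preserved.
console.log(extractText(paragraph));
```

The markdown side would need to treat code-span content the same way for the two extractions to line up.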

**Option B — narrower stripping inside flattened text**
Keep the flat-text approach, but strip only tags whose surface form is very likely a syntax-highlighter artifact, i.e. tags that carry attributes: `<span class=...>`, `<code class=...>`, `<div class=...>`. Bare `<code>`, `<main>`, `<title>`, etc. are vanishingly unlikely to come from a syntax highlighter, since highlighters typically attach classes for the token type. Match `<(span|code|div)(\s[^>]*)>` (attributes required) and strip those; leave bare tags alone. This is a smaller change, but it won't catch `<span>` if a highlighter ever omits classes.
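
A minimal sketch of the narrower pattern, assuming closing tags are handled elsewhere (like the current regex, this one only matches opening tags). `stripHighlighterTags` is a hypothetical name.

```javascript
// Only span/code/div tags that carry at least one attribute are stripped.
const HIGHLIGHT_TAG_WITH_ATTRS = /<(span|code|div)(\s[^>]*)>/g;

function stripHighlighterTags(flatText) {
  return flatText.replace(HIGHLIGHT_TAG_WITH_ATTRS, "");
}

// Bare tag mentioned in prose: left alone.
console.log(stripHighlighterTags("rendered using the HTML <code> tag."));

// Highlighter-style opening tag with a class: stripped.
console.log(stripHighlighterTags('<span class="line">const x = 1;'));

// Known gap: a highlighter tag without attributes survives.
console.log(stripHighlighterTags("<span>const x = 1;"));
```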

**Option C — denylist instead of allowlist**
The current `HTML_TAG_NAMES` is essentially "every tag a browser knows about", and the intent is "strip these from flat text because they're probably highlight markup". The list of tags that actually appear inside `<pre>` from real syntax highlighters is small: `span`, `code`, `div`, `mark`, `i`, `b`, maybe `a`. Trim `HTML_TAG_NAMES` to that subset. This dramatically reduces collateral damage on tags like `<title>`, `<head>`, `<main>`, `<nav>` that authors mention in prose. Note that bare `<code>` would still be stripped as long as `code` stays in the subset, so this option alone does not fix the repro above; it would need to be combined with Option B, or `code` would have to be dropped from the list.
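
A sketch of the trimmed list, reusing the regex quoted in the repro. The set name `HIGHLIGHTER_TAG_NAMES` and the helper are hypothetical. It also shows that a bare `<code>` is still removed while `code` remains in the set.

```javascript
// Trimmed strip list: only tags assumed to show up in highlighter output.
const HIGHLIGHTER_TAG_NAMES = new Set(["span", "code", "div", "mark", "i", "b", "a"]);

// Same tag-stripping regex as in the repro.
const TAG_PATTERN = /<([a-zA-Z][a-zA-Z0-9-]*)([^>]*)>/g;

function stripHighlighterTags(flatText) {
  return flatText.replace(TAG_PATTERN, (match, name) =>
    HIGHLIGHTER_TAG_NAMES.has(name.toLowerCase()) ? "" : match
  );
}

// Tags outside the trimmed list now survive in prose.
console.log(stripHighlighterTags("Set the <title> and wrap content in <main>."));

// "code" is still in the list, so the original repro sentence still loses "<code>".
console.log(JSON.stringify(stripHighlighterTags("rendered using the HTML <code> tag.")));
```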

## Repro material

Site: https://dacharycarey.com. Affected post (before workaround): https://dacharycarey.com/2025/09/07/audit-conclusions/ — references `` `<code>` `` in two paragraphs discussing how a code-classification system shouldn't tag certain examples with `<code>`. Other affected posts mention `` `<main>` ``, `` `<title>` ``, `` `<h1>` ``, `` `<link>` `` in prose about HTML.

## Workaround used

Rewrote `` `<code>` `` → `` `code` `` in the source, dropping the angle brackets. This loses semantic intent — the author meant the literal HTML `<code>` element, not just the word "code" — and it doesn't generalize: every site mentioning HTML tags in prose has to pre-flatten them to satisfy the parity check.
