Summary
markdown-content-parity reports content as missing when the source markdown uses underscore-style emphasis (_text_), even though CommonMark treats _text_ and *text* as equivalent and both render to <em> in HTML.
Repro
Source markdown contains: my office was _cold_ and dark
- HTML extraction (after <em> tag stripping): my office was cold and dark
- Markdown extraction (after current normalization): my office was _cold_ and dark
normalizedMd.includes(normalizedSegment) → false → segment reported missing
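The mismatch can be reproduced in isolation with a minimal sketch. It assumes the only relevant normalization step is the asterisk-only regex quoted under Root cause; `stripAsteriskEmphasis` is a hypothetical wrapper name, not a function in the tool.

```javascript
// Hypothetical wrapper around the asterisk-only regex quoted in this issue.
const stripAsteriskEmphasis = (text) => text.replace(/(\*{1,3})(.*?)\1/g, '$2');

// What the HTML side yields once the <em> tags are stripped.
const htmlSegment = 'my office was cold and dark';
// What the markdown source contains.
const markdownSource = 'my office was _cold_ and dark';

// Underscores survive normalization, so the substring check fails.
const normalizedMd = stripAsteriskEmphasis(markdownSource);
const reportedMissing = !normalizedMd.includes(htmlSegment);

console.log(normalizedMd);    // my office was _cold_ and dark
console.log(reportedMissing); // true
```

With `*cold*` in the source instead, the same function returns the plain sentence and the check passes.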
In production this causes roughly 5–17% of content to be reported missing on prose-heavy posts that lean on underscore emphasis. None of the flagged content is actually missing: it is the same prose, rendered to <em> on the HTML side and left as _..._ on the markdown side.
Root cause
extractMarkdownText in checks/observability/markdown-content-parity.js strips only asterisk emphasis:
.replace(/(\*{1,3})(.*?)\1/g, '$2')
// Remove emphasis markers (* only — underscores are too common in
// code identifiers like mongoc_client_get_database and cause false
// mismatches when stripped as emphasis)
The code comment is the key concern: naively stripping _..._ would mangle code identifiers like mongoc_client_get_database. But extractMarkdownText already protects code via fenced-block and inline-code-span placeholders before any stripping runs, so identifiers inside `code` and ```fences``` are already safe.
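To illustrate why placeholder protection makes identifiers in code safe, here is a rough sketch of the general technique. It is not the tool's actual implementation: `protectCode`, `restoreCode`, and the `@@CODEn@@` placeholder format are all hypothetical, and the real extractMarkdownText may differ in detail.

```javascript
// Hypothetical sketch of placeholder protection; names and format are assumptions.
function protectCode(markdown) {
  const saved = [];
  const stash = (match) => {
    saved.push(match);
    return `@@CODE${saved.length - 1}@@`;
  };
  const text = markdown
    .replace(/```[\s\S]*?```/g, stash) // fenced blocks first
    .replace(/`[^`\n]+`/g, stash);     // then inline code spans
  return { text, saved };
}

function restoreCode(text, saved) {
  return text.replace(/@@CODE(\d+)@@/g, (_, i) => saved[Number(i)]);
}

const source = 'Call `mongoc_client_get_database` when it is _really_ needed.';
const { text, saved } = protectCode(source);

// Deliberately naive underscore stripping, applied only to the protected text.
const stripped = text.replace(/_(.+?)_/g, '$1');
const result = restoreCode(stripped, saved);

console.log(result);
// Call `mongoc_client_get_database` when it is really needed.
```

Even the naive regex leaves the identifier intact here, because the code span is no longer in the text by the time stripping runs.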
The remaining false-positive risk is word-internal underscores in prose (foo_bar written outside of code spans). CommonMark's emphasis rules already handle this: unlike *, an underscore cannot open or close emphasis inside a word. Roughly, an opening _ must not be preceded by an alphanumeric character and a closing _ must not be followed by one. A regex that requires a non-word character (or the string boundary) on both sides of the _..._ pair will match _emphasis_ in prose but not foo_bar_baz.
Suggested fix
Replace the asterisk-only emphasis regex with a pair of regexes that handle both delimiters with appropriate flanking rules:
// Strip *emphasis* (left/right flanking less strict, asterisks aren't word chars)
.replace(/(\*{1,3})(\S(?:.*?\S)?)\1/g, '$2')
// Strip _emphasis_ but only at word boundaries, to preserve identifiers
.replace(/(?<![A-Za-z0-9_])(_{1,3})(\S(?:.*?\S)?)\1(?![A-Za-z0-9_])/g, '$2')
The lookbehind/lookahead ensures foo_bar_baz is left alone (the _ between foo and bar is preceded by o, a word char) while text _emphasis_ here is stripped (the leading _ is preceded by space).
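A quick way to sanity-check the two proposed replacements is to run them over a handful of inputs. This is a minimal sketch: `stripEmphasis` is a hypothetical helper that chains only the two regexes above, without the rest of the tool's normalization, and it assumes a runtime with regex lookbehind support.

```javascript
// Hypothetical helper chaining the two replacements proposed in this issue.
function stripEmphasis(text) {
  return text
    // *emphasis*: content must start and end with a non-space character
    .replace(/(\*{1,3})(\S(?:.*?\S)?)\1/g, '$2')
    // _emphasis_: same, plus no word character directly outside the pair
    .replace(/(?<![A-Za-z0-9_])(_{1,3})(\S(?:.*?\S)?)\1(?![A-Za-z0-9_])/g, '$2');
}

const cases = [
  'my office was _cold_ and dark', // underscore emphasis: stripped
  'my office was *cold* and dark', // asterisk emphasis: stripped
  'it was __very__ cold',          // strong emphasis: stripped
  'call foo_bar_baz here',         // identifier in prose: untouched
  'snake_case and other_name',     // two identifiers: untouched
  '2 * 3 * 4',                     // spaced asterisks: untouched
];

for (const input of cases) {
  console.log(JSON.stringify(input), '->', JSON.stringify(stripEmphasis(input)));
}
```

The first three inputs come back without their markers, and the last three come back unchanged.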
Repro material
Site where I hit this: https://dacharycarey.com (Hugo, Markdown output format). Affected pages: any post with underscore emphasis in prose, e.g. _cold_, _extremely_, _too_, [*Beauty*](amazon-link). Before the workaround, the parity check reported 10–13 of 50 sampled pages as missing 5–17% of their content; after converting all source _..._ → *...* (a workaround, not a fix), the same pages pass at 0% missing.
Workaround used
Bulk-converted _emphasis_ → *emphasis* in 26 source files (89 replacements). This isn't a real fix — both forms are valid CommonMark and the tool should handle both.