markdown-content-parity: _emphasis_ not stripped, causes false 'missing' on CommonMark-valid prose #89

@dacharyc

Summary

markdown-content-parity reports content as missing when the source markdown uses underscore-style emphasis (_text_), even though CommonMark treats _text_ and *text* as equivalent and both render to <em> in HTML.

Repro

Source markdown contains: my office was _cold_ and dark

  • HTML extraction (after <em> tag stripping): my office was cold and dark
  • Markdown extraction (after current normalization): my office was _cold_ and dark
  • normalizedMd.includes(normalizedSegment) → false → segment reported missing
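The mismatch can be reproduced in isolation. This is a minimal sketch, not the check's actual code path: the variable names mirror the ones in the list above, the two strings are hand-written stand-ins for the real extraction results, and the only piece taken from the tool is the asterisk-only replace quoted under Root cause.

```javascript
// Stand-in for the HTML side: <em> tags already stripped.
const normalizedSegment = 'my office was cold and dark';

// Stand-in for the markdown source line.
const source = 'my office was _cold_ and dark';

// Asterisk-only emphasis stripping, as quoted from the check.
// There are no asterisks here, so the underscores pass through untouched.
const normalizedMd = source.replace(/(\*{1,3})(.*?)\1/g, '$2');

console.log(normalizedMd);                              // my office was _cold_ and dark
console.log(normalizedMd.includes(normalizedSegment));  // false
```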

In production this produces ~5–17% missing on prose-heavy posts that lean on underscore emphasis. None of the flagged content is actually missing — it's the same prose, just rendered to <em> on the HTML side and left as _..._ on the markdown side.

Root cause

extractMarkdownText in checks/observability/markdown-content-parity.js strips only asterisk emphasis:

.replace(/(\*{1,3})(.*?)\1/g, '$2')
// Remove emphasis markers (* only — underscores are too common in
// code identifiers like mongoc_client_get_database and cause false
// mismatches when stripped as emphasis)

The code comment states the key concern: naively stripping _..._ would mangle code identifiers like mongoc_client_get_database. But extractMarkdownText already protects code via fenced-block and inline-code-span placeholders before any stripping runs, so identifiers inside `code` and ```fences``` are already safe.
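To illustrate why placeholder protection makes underscore stripping safe for code, here is a rough sketch of the pattern. It is not the implementation in extractMarkdownText: protectCode, restoreCode, and the NUL-delimited placeholder format are hypothetical, and restoring the code afterwards is done only to make the result visible. The underscore regex is the one proposed under Suggested fix.

```javascript
// Hypothetical helper: swap code for opaque placeholders so later regexes cannot touch it.
function protectCode(md) {
  const saved = [];
  const stash = (match) => {
    saved.push(match);
    return `\u0000${saved.length - 1}\u0000`; // placeholder contains no underscores
  };
  const text = md
    .replace(/`{3}[\s\S]*?`{3}/g, stash) // fenced blocks first
    .replace(/`[^`\n]+`/g, stash);       // then inline code spans
  return { text, saved };
}

// Hypothetical helper: put the stashed code back (for demonstration only).
function restoreCode(text, saved) {
  return text.replace(/\u0000(\d+)\u0000/g, (_, i) => saved[Number(i)]);
}

const md = 'call `_private_helper_` from a _cold_ office';
const { text, saved } = protectCode(md);

// Underscore emphasis stripping runs only on the protected text.
const stripped = text.replace(
  /(?<![A-Za-z0-9_])(_{1,3})(\S(?:.*?\S)?)\1(?![A-Za-z0-9_])/g,
  '$2'
);
const restored = restoreCode(stripped, saved);

console.log(restored); // call `_private_helper_` from a cold office
```

Without the placeholder step, the same regex would also strip the underscores around _private_helper_, since the leading underscore follows a backtick rather than a word character.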

The remaining false-positive risk is word-internal underscores in prose (foo_bar written outside of code spans). CommonMark's emphasis rules already handle this: unlike *, an underscore cannot open or close emphasis inside a word, so intraword underscores are literal text. A regex that requires a non-word character (or string boundary) on both sides of the _..._ pair will match _emphasis_ in prose but not foo_bar_baz.

Suggested fix

Replace the asterisk-only emphasis regex with a pair of regexes that handle both delimiters with appropriate flanking rules:

// Strip *emphasis* (left/right flanking less strict, asterisks aren't word chars)
.replace(/(\*{1,3})(\S(?:.*?\S)?)\1/g, '$2')
// Strip _emphasis_ but only at word boundaries, to preserve identifiers
.replace(/(?<![A-Za-z0-9_])(_{1,3})(\S(?:.*?\S)?)\1(?![A-Za-z0-9_])/g, '$2')

The lookbehind/lookahead ensures foo_bar_baz is left alone (the _ between foo and bar is preceded by o, a word char) while text _emphasis_ here is stripped (the leading _ is preceded by space).
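A quick way to sanity-check the two regexes is to run them over a few representative strings. The sketch below wraps the suggested replaces in a helper named stripEmphasis, which is hypothetical and exists only for this example; the sample strings are illustrative and not taken from the affected site.

```javascript
// Hypothetical wrapper around the two suggested replaces.
function stripEmphasis(text) {
  return text
    // *emphasis*, **strong**, ***both***
    .replace(/(\*{1,3})(\S(?:.*?\S)?)\1/g, '$2')
    // _emphasis_ only when the delimiters are not adjacent to word characters
    .replace(/(?<![A-Za-z0-9_])(_{1,3})(\S(?:.*?\S)?)\1(?![A-Za-z0-9_])/g, '$2');
}

const samples = [
  'my office was _cold_ and dark', // underscore emphasis in prose
  'it was *extremely* cold',       // asterisk emphasis
  'call foo_bar_baz here',         // identifier outside a code span
  '__bold__ and ___both___',       // strong and strong+emphasis
];

for (const s of samples) {
  console.log(stripEmphasis(s));
}
// my office was cold and dark
// it was extremely cold
// call foo_bar_baz here
// bold and both
```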

Repro material

Site where I hit this: https://dacharycarey.com (Hugo, Markdown output format). Affected pages: any post with underscore emphasis in prose, e.g. _cold_, _extremely_, _too_, [*Beauty*](amazon-link). Before the workaround, the parity check flagged 10–13 of 50 sampled pages with 5–17% missing; after converting all source _..._ → *...* (a workaround, not a fix), the same pages pass at 0% missing.

Workaround used

Bulk-converted _emphasis_ → *emphasis* in 26 source files (89 replacements). This isn't a real fix: both forms are valid CommonMark and the tool should handle both.
