markdown-content-parity: `_emphasis_` not stripped, causes false 'missing' on CommonMark-valid prose

## Summary

`markdown-content-parity` reports content as missing when the source markdown uses underscore-style emphasis (`_text_`), even though CommonMark treats `_text_` and `*text*` as equivalent and both render to `<em>` in HTML.

## Repro

Source markdown contains: `my office was _cold_ and dark`

- HTML extraction (after `<em>` tag stripping): `my office was cold and dark`
- Markdown extraction (after current normalization): `my office was _cold_ and dark`
- `normalizedMd.includes(normalizedSegment)` → `false` → segment reported missing
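
The mismatch can be reproduced in isolation. The sketch below is not the tool's real code: `normalizeMarkdownCurrent` is a hypothetical stand-in that applies only the asterisk-emphasis rule quoted under Root cause, and the HTML-side string is hard-coded.

```js
// Hypothetical stand-in for the current markdown normalization:
// only the asterisk-only emphasis rule from this issue is reproduced.
function normalizeMarkdownCurrent(md) {
  return md.replace(/(\*{1,3})(.*?)\1/g, '$2');
}

// HTML side, with the <em> tags already stripped.
const htmlSegment = 'my office was cold and dark';

const withUnderscores = normalizeMarkdownCurrent('my office was _cold_ and dark');
const withAsterisks = normalizeMarkdownCurrent('my office was *cold* and dark');

console.log(withUnderscores.includes(htmlSegment)); // false: segment reported missing
console.log(withAsterisks.includes(htmlSegment));   // true: segment found
```

The only difference between the two inputs is the emphasis delimiter, which is why the same prose passes with `*` and fails with `_`.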

In production, the check reports roughly 5–17% of content as missing on prose-heavy posts that lean on underscore emphasis. None of the flagged content is actually missing: it's the same prose, rendered to `<em>` on the HTML side and left as `_..._` on the markdown side.

## Root cause

`extractMarkdownText` in `checks/observability/markdown-content-parity.js` strips only asterisk emphasis:

```js
// Remove emphasis markers (* only — underscores are too common in
// code identifiers like mongoc_client_get_database and cause false
// mismatches when stripped as emphasis)
.replace(/(\*{1,3})(.*?)\1/g, '$2')
```

The code comment states the key concern: naively stripping `_..._` would mangle code identifiers like `mongoc_client_get_database`. But `extractMarkdownText` already protects code by swapping fenced blocks and inline code spans for placeholders before any stripping runs, so identifiers inside `` `code` `` and ` ```fences``` ` are already safe.
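
To illustrate why placeholder protection makes code a non-issue, here is a minimal sketch. It is an assumption about the general technique, not a copy of `extractMarkdownText`: the function name, the placeholder format, and the deliberately naive underscore regex are all hypothetical.

```js
// Sketch only: protect code, strip underscores naively, then restore code.
function stripUnderscoresOutsideCode(md) {
  const saved = [];
  // Replace a code region with an index placeholder that contains no underscores.
  const protect = (match) => {
    saved.push(match);
    return `\u0000${saved.length - 1}\u0000`;
  };
  const protectedMd = md
    .replace(/```[\s\S]*?```/g, protect) // fenced blocks first
    .replace(/`[^`\n]+`/g, protect);     // then inline code spans

  // Deliberately naive: strips any _..._ pair, with no boundary checks.
  const stripped = protectedMd.replace(/_(.+?)_/g, '$1');

  // Put the original code back where the placeholders are.
  return stripped.replace(/\u0000(\d+)\u0000/g, (_, i) => saved[Number(i)]);
}

console.log(stripUnderscoresOutsideCode('call `mongoc_client_get_database` when _cold_'));
// call `mongoc_client_get_database` when cold

console.log(stripUnderscoresOutsideCode('set foo_bar_baz here'));
// set foobarbaz here
```

The first call shows that an identifier inside a code span survives even the naive rule. The second shows the case that still needs care: an identifier written in plain prose.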

The remaining false-positive risk is word-internal underscores in prose (`foo_bar` written outside of code spans). CommonMark's emphasis rules already handle this: unlike `*`, an underscore that sits between two alphanumeric characters can neither open nor close emphasis, so intraword `_` is always literal. A regex that requires a non-word character (or a string boundary) on both sides of the `_..._` pair approximates that rule: it matches `_emphasis_` in prose but not `foo_bar_baz`.

## Suggested fix

Replace the asterisk-only emphasis regex with a pair of regexes that handle both delimiters with appropriate flanking rules:

```js
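// Strip *emphasis*. No boundary check: CommonMark allows intraword emphasis with asterisks.
// The inner group is lazy (??) so single-character emphasis such as *a* closes at the
// nearest delimiter instead of running on to a later one on the same line.
.replace(/(\*{1,3})(\S(?:.*?\S)??)\1/g, '$2')
// Strip _emphasis_ only when the delimiters are not adjacent to a word character,
// to preserve identifiers such as foo_bar_baz
.replace(/(?<![A-Za-z0-9_])(_{1,3})(\S(?:.*?\S)??)\1(?![A-Za-z0-9_])/g, '$2')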
```

The lookbehind and lookahead ensure that `foo_bar_baz` is left alone (the `_` between `foo` and `bar` is preceded by `o`, a word char) while `text _emphasis_ here` is stripped (the leading `_` is preceded by a space).
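
A quick way to sanity-check the two rules is to run them over a handful of inputs. This is a sketch, not a patch: `stripEmphasis` is a hypothetical wrapper, and it makes the optional inner group lazy (`??`). With a greedy `?`, single-character emphasis such as `_I_` followed by another emphasis on the same line is matched as one long span.

```js
// Hypothetical wrapper around the two proposed replacements.
function stripEmphasis(text) {
  return text
    // Asterisks: no boundary check, content must start and end with non-whitespace.
    .replace(/(\*{1,3})(\S(?:.*?\S)??)\1/g, '$2')
    // Underscores: delimiters must not touch an ASCII word character on the outside.
    .replace(/(?<![A-Za-z0-9_])(_{1,3})(\S(?:.*?\S)??)\1(?![A-Za-z0-9_])/g, '$2');
}

const cases = [
  'my office was _cold_ and dark', // -> my office was cold and dark
  'my office was *cold* and dark', // -> my office was cold and dark
  'it was __very__ cold',          // -> it was very cold
  '_I_ was _cold_',                // -> I was cold
  '(_quoted_), then more',         // -> (quoted), then more
  'call foo_bar_baz here',         // unchanged: intraword underscores
  '2 * 3 * 4',                     // unchanged: asterisks followed by whitespace
];

for (const input of cases) {
  console.log(JSON.stringify(input), '->', JSON.stringify(stripEmphasis(input)));
}
```

Two caveats: lookbehind assertions need an ES2018-capable engine, and the ASCII-only character class means a delimiter next to a non-ASCII letter is treated as a boundary.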

## Repro material

Site where I hit this: https://dacharycarey.com (Hugo, Markdown output format). Affected pages: any post with underscore emphasis in prose, e.g. `_cold_`, `_extremely_`, `_too_`, `[*Beauty*](amazon-link)`. Before the workaround, the parity check flagged 10–13 of 50 sampled pages, each reported as missing 5–17% of its content; after converting all source `_..._` → `*...*` (a workaround, not a fix), the same pages pass at 0% missing.

## Workaround used

Bulk-converted `_emphasis_` → `*emphasis*` in 26 source files (89 replacements). This isn't a real fix — both forms are valid CommonMark and the tool should handle both.
