markdown-content-parity: numbered-list regex strips leading '1. ' from headings, causing false 'missing' #91

## Summary

`extractMarkdownText` applies heading-marker stripping and then list-marker stripping to the same line. When a heading begins with a number (e.g. `### 1. How well are...`), the heading regex turns it into `1. How well are...`, and the numbered-list regex then strips the `1. `, leaving `How well are...`. The HTML side keeps `1. ` as part of the heading text, so the segments don't match.

## Repro

Source markdown:
```markdown
### 1. How well are key programming languages supported by code examples?
```

Hugo renders:
```html
<h3 id="1-how-well-are-key-programming-languages-supported-by-code-examples">1. How well are key programming languages supported by code examples?</h3>
```

`extractHtmlText` segment: `1. How well are key programming languages supported by code examples?`

`extractMarkdownText` flow on this line:
1. `^#{1,6}\s+` strips `### ` → `1. How well are key programming languages supported by code examples?`
2. `^[\s]*\d+\.\s+` strips `1. ` → `How well are key programming languages supported by code examples?`

Result: HTML has the `1.` prefix, markdown doesn't. Substring containment fails.
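The flow above can be reproduced in isolation. This is a minimal sketch that uses only the two regexes quoted in this issue; `stripSequentially` is a hypothetical stand-in for the relevant part of `extractMarkdownText`, and the containment check is an assumption about how the parity comparison decides a segment is "missing".

```js
// Regexes as quoted in the issue, applied in sequence to the same line.
const HEADING_MARKER = /^#{1,6}\s+/;
const NUMBERED_LIST_MARKER = /^[\s]*\d+\.\s+/;

// Hypothetical stand-in for the stripping done inside extractMarkdownText.
function stripSequentially(line) {
  return line.replace(HEADING_MARKER, '').replace(NUMBERED_LIST_MARKER, '');
}

const markdownLine =
  '### 1. How well are key programming languages supported by code examples?';
const htmlSegment =
  '1. How well are key programming languages supported by code examples?';

const markdownText = stripSequentially(markdownLine);

// Assumed check: the HTML segment must appear in the extracted markdown text.
const found = markdownText.includes(htmlSegment);

console.log(markdownText); // How well are key programming languages supported by code examples?
console.log(found);        // false
```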

## Why this is a bug

The numbered-list regex was meant to strip `1. ` from list items like `1. First thing` so the item content matches `<li>First thing</li>`. It's wrong to apply it to text that came from a heading, because in `<h3>1. How well…</h3>` the `1.` is part of the heading text, not list markup.

The source and the rendered output are both correct:
- Authors who want numbered headings (research questions, RFC sections, "Step 1:") legitimately write `### 1. Title`.
- The HTML renderer preserves the literal text inside the `<h3>`.

## Suggested fix

Track which lines were headings and skip list-marker stripping on those lines. Two ways to do it:

**Option A — placeholder-based, like the code protection**

```js
// Replace heading lines with placeholders before any other stripping
const headings = [];
text = text.replace(/^#{1,6}\s+(.*)$/gm, (_m, content) => {
  const idx = headings.length;
  headings.push(content);
  return `\x00HEAD${idx}\x00`;
});
// ...all other stripping (bullets, numbered lists, emphasis, etc.)...
// Restore heading text
text = text.replace(/\x00HEAD(\d+)\x00/g, (_m, idx) => headings[parseInt(idx, 10)]);
```

This guarantees no later regex touches heading content.

**Option B — line-by-line state**

Process the text line by line. If a line matches `^#{1,6}\s+`, strip the marker and do not run the bullet/numbered-list regexes against that line.
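A rough sketch of Option B, for comparison. `stripMarkers` is a hypothetical helper, not the project's actual function, and the bullet regex is an assumed pattern; only the heading and numbered-list regexes come from this issue. The sketch also ignores fenced code blocks, which the real code already protects separately.

```js
// Hypothetical helper illustrating Option B; not the project's real code.
function stripMarkers(markdown) {
  const headingRe = /^#{1,6}\s+/;    // from the issue
  const numberedRe = /^[\s]*\d+\.\s+/; // from the issue
  const bulletRe = /^\s*[-*+]\s+/;   // assumed bullet pattern

  return markdown
    .split('\n')
    .map((line) => {
      if (headingRe.test(line)) {
        // Heading: drop only the '#' marker, keep a leading '1. ' as text.
        return line.replace(headingRe, '');
      }
      // Non-heading: list markers are markup, so strip them.
      return line.replace(bulletRe, '').replace(numberedRe, '');
    })
    .join('\n');
}

const sample = ['### 1. Title', '1. First thing', '- A bullet'].join('\n');
console.log(stripMarkers(sample));
// 1. Title
// First thing
// A bullet
```

One difference worth weighing: here, later line-independent stripping such as emphasis could still run on heading text, whereas the placeholder approach in Option A shields heading content from every later regex, including emphasis stripping.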

Option A is more in line with how the existing code already protects code spans/blocks.

## Repro material

Site: https://dacharycarey.com. Affected post: https://dacharycarey.com/2025/09/07/audit-conclusions/. Four H3 headings of the form `### 1. ...` through `### 4. ...` produce four false "missing" segments. Removing the leading numbers from the headings is a workaround but loses author intent (these are numbered research questions, and the surrounding prose refers to them by number).

## Related

- #89 (CommonMark `_emphasis_` not stripped) — same family of "extractMarkdownText doesn't fully match what the HTML renderer produces"
- #90 (inline `` `<tag>` `` code spans get text-stripped on HTML side) — the other half of the audit-conclusions parity warning
