markdown-content-parity: numbered-list regex strips leading '1. ' from headings, causing false 'missing' #91

@dacharyc

Summary

extractMarkdownText strips heading markers and then list markers, one after the other, on the same line. When a heading begins with a number (e.g. ### 1. How well are...), the heading regex reduces it to 1. How well are..., and the numbered-list regex then strips the 1. as well, leaving How well are.... The HTML side keeps 1. as part of the heading text, so the segments don't match.

Repro

Source markdown:

### 1. How well are key programming languages supported by code examples?

Hugo renders:

<h3 id="1-how-well-are-key-programming-languages-supported-by-code-examples">1. How well are key programming languages supported by code examples?</h3>

extractHtmlText segment: 1. How well are key programming languages supported by code examples?

extractMarkdownText flow on this line:

  1. ^#{1,6}\s+ strips the ### marker, leaving: 1. How well are key programming languages supported by code examples?
  2. ^[\s]*\d+\.\s+ strips the 1. prefix, leaving: How well are key programming languages supported by code examples?

Result: HTML has the 1. prefix, markdown doesn't. Substring containment fails.
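The flow above can be reproduced in isolation. This is a minimal sketch, not the tool's actual code: the two regexes are the ones quoted in this issue, while the function name stripMarkersNaive and the surrounding structure are assumptions made for illustration.

```javascript
// Hypothetical reduction of the two stripping steps described above.
// Only the regexes come from this issue; everything else is illustrative.
function stripMarkersNaive(text) {
  return text
    .replace(/^#{1,6}\s+/gm, "")       // step 1: heading markers
    .replace(/^[\s]*\d+\.\s+/gm, "");  // step 2: numbered-list markers
}

const heading =
  "### 1. How well are key programming languages supported by code examples?";
const htmlSegment =
  "1. How well are key programming languages supported by code examples?";

const markdownText = stripMarkersNaive(heading);

// The "1. " prefix is gone from the markdown side...
console.log(markdownText);
// ...so the HTML segment is not contained in it.
console.log(markdownText.includes(htmlSegment)); // false
```

The second regex cannot tell a list item from the remainder of a heading, because by the time it runs the ### marker that distinguished them has already been removed.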

Why this is a bug

The numbered-list regex was meant to strip 1. from list items like 1. First thing so the item content matches <li>First thing</li>. It's wrong to apply it to text that came from a heading, because in <h3>1. How well…</h3> the 1. is part of the heading text, not list markup.

Both the markdown source and the rendered HTML are behaving legitimately here:

  • Authors who want numbered headings (research questions, RFC sections, "Step 1:") legitimately write ### 1. Title.
  • The HTML renderer preserves the literal text inside the <h3>.

Suggested fix

Track which lines were headings and skip list-marker stripping on those lines. Two ways to do it:

Option A — placeholder-based, like the code protection

// Replace heading lines with placeholders before any other stripping
const headings = [];
text = text.replace(/^#{1,6}\s+(.*)$/gm, (_m, content) => {
  const idx = headings.length;
  headings.push(content);
  return `\x00HEAD${idx}\x00`;
});
// ...all other stripping (bullets, numbered lists, emphasis, etc.)...
// Restore heading text
text = text.replace(/\x00HEAD(\d+)\x00/g, (_m, idx) => headings[parseInt(idx, 10)]);

This guarantees no later regex touches heading content.
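To check the placeholder approach end to end, here is a self-contained sketch that wraps it in a function. The name stripWithHeadingPlaceholders is hypothetical, and the only "other stripping" included is the numbered-list regex quoted in this issue; the real function runs more steps than this.

```javascript
// Sketch of Option A under the assumptions stated above.
function stripWithHeadingPlaceholders(text) {
  const headings = [];

  // Park heading content behind placeholders, dropping the # markers.
  text = text.replace(/^#{1,6}\s+(.*)$/gm, (_m, content) => {
    headings.push(content);
    return `\x00HEAD${headings.length - 1}\x00`;
  });

  // Stand-in for the remaining stripping steps: numbered-list markers only.
  text = text.replace(/^[\s]*\d+\.\s+/gm, "");

  // Restore heading text untouched.
  return text.replace(/\x00HEAD(\d+)\x00/g, (_m, idx) => headings[parseInt(idx, 10)]);
}

const sample = [
  "### 1. Numbered heading",
  "1. First thing",
  "2. Second thing",
].join("\n");

const result = stripWithHeadingPlaceholders(sample);
console.log(result.split("\n"));
// [ '1. Numbered heading', 'First thing', 'Second thing' ]
```

One consequence worth weighing: because no later regex touches heading content, any other stripping the function performs (emphasis markers, for example) would also be skipped for headings unless it is applied to the stored heading text separately.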

Option B — line-by-line state

Process the text line by line. If a line matches ^#{1,6}\s+, strip the heading marker and skip the bullet and numbered-list regexes for that line.
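A minimal sketch of the line-by-line variant, for comparison. The function name stripMarkersLineByLine is hypothetical, and the bullet regex is an assumption, since this issue only quotes the heading and numbered-list patterns.

```javascript
const HEADING_MARKER = /^#{1,6}\s+/;
const NUMBERED_MARKER = /^\s*\d+\.\s+/;
// Assumed bullet pattern; the tool's actual regex is not quoted in this issue.
const BULLET_MARKER = /^\s*[-*+]\s+/;

function stripMarkersLineByLine(text) {
  return text
    .split("\n")
    .map((line) => {
      if (HEADING_MARKER.test(line)) {
        // Heading line: remove the # marker and leave the rest alone.
        return line.replace(HEADING_MARKER, "");
      }
      // Non-heading line: list markers are safe to strip.
      return line.replace(NUMBERED_MARKER, "").replace(BULLET_MARKER, "");
    })
    .join("\n");
}

const input = ["### 1. Title", "1. First thing", "- Bullet item"].join("\n");
const output = stripMarkersLineByLine(input);
console.log(output.split("\n"));
// [ '1. Title', 'First thing', 'Bullet item' ]
```

Unlike the placeholder approach, this keeps heading text in the normal flow, so steps that should still apply to headings can be run on every line after the map.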

Option A is more in line with how the existing code already protects code spans/blocks.

Repro material

Site: https://dacharycarey.com. Affected post: https://dacharycarey.com/2025/09/07/audit-conclusions/. Four H3 headings of the form ### 1. ... through ### 4. ... produce four false "missing" segments. Removing the leading numbers from the headings is a workaround but loses author intent (these are numbered research questions, and the surrounding prose refers to them by number).
