From 98f3edc7bd5ac6939970a04f2497ccff43df4f26 Mon Sep 17 00:00:00 2001
From: dacharyc
Date: Sun, 3 May 2026 13:18:22 -0400
Subject: [PATCH] Clarify http-status-code behavior, fix site 500 v 404

---
 SCORING.md                   | 18 ++++++++++--------
 docs/checks/url-stability.md | 22 ++++++++++++++++------
 docs/public/.htaccess        |  6 ++++++
 3 files changed, 32 insertions(+), 14 deletions(-)

diff --git a/SCORING.md b/SCORING.md
index 5335eeb..cce8354 100644
--- a/SCORING.md
+++ b/SCORING.md
@@ -143,14 +143,16 @@ This behavior does **not** apply when:
 
 Not all warnings represent the same degree of degradation. A warning on `llms-txt-valid` (structure is non-standard but links are parseable) is less severe than a warning on `rendering-strategy` (sparse content that might need JavaScript). Most checks have a specific warn coefficient:
 
-| Coefficient | Meaning | Checks |
-| ----------- | ---------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| **0.75** | Content substantively intact | `llms-txt-valid`, `content-negotiation`, `llms-txt-links-resolve`, `llms-txt-coverage`, `markdown-content-parity` |
-| **0.60** | Partial coverage or platform-dependent | `llms-txt-directive-html`, `llms-txt-directive-md`, `redirect-behavior` |
-| **0.50** | Genuine functional degradation | `llms-txt-exists`, `llms-txt-size`, `rendering-strategy`, `markdown-url-support`, `page-size-markdown`, `page-size-html`, `content-start-position`, `tabbed-content-serialization`, `section-header-quality`, `cache-header-hygiene`, `auth-gate-detection`, `auth-alternative-access` |
-| **0.25** | Actively steering agents to a worse path | `llms-txt-links-markdown` (markdown exists but llms.txt links to HTML; agents don't discover .md variants on their own) |
-
-`markdown-code-fence-validity` only has pass/fail (no warn state). `http-status-codes` is normally pass/fail but warns when every sampled response is indeterminate (HTTP 202 from CDN cache-miss/build, or 5xx) so the check couldn't measure bad-URL handling. +| Coefficient | Meaning | Checks | +| ----------- | ---------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| **0.75** | Content substantively intact | `llms-txt-valid`, `content-negotiation`, `llms-txt-links-resolve`, `llms-txt-coverage`, `markdown-content-parity` | +| **0.60** | Partial coverage or platform-dependent | `llms-txt-directive-html`, `llms-txt-directive-md`, `redirect-behavior` | +| **0.50** | Genuine functional degradation | `llms-txt-exists`, `llms-txt-size`, `rendering-strategy`, `markdown-url-support`, `page-size-markdown`, `page-size-html`, `content-start-position`, `tabbed-content-serialization`, `section-header-quality`, `cache-header-hygiene`, `auth-gate-detection`, `auth-alternative-access`, `http-status-codes`† | +| **0.25** | Actively steering agents to a worse path | `llms-txt-links-markdown` (markdown exists but llms.txt links to HTML; agents don't discover .md variants on their own) | + +`markdown-code-fence-validity` only has pass/fail (no warn state). + +† `http-status-codes` is normally pass/fail. It warns only when every sampled response is indeterminate (HTTP 202 from CDN cache-miss/build, or 5xx), meaning bad-URL handling couldn't be measured. In that case the check applies the default 0.5 warn coefficient rather than scoring zero. Mixed responses (e.g., some `correct-error`, some `indeterminate`) are scored from the determinate subset only. 
 ## Score caps
 
diff --git a/docs/checks/url-stability.md b/docs/checks/url-stability.md
index d6ec9dc..251be5c 100644
--- a/docs/checks/url-stability.md
+++ b/docs/checks/url-stability.md
@@ -17,19 +17,29 @@ In empirical testing, soft 404s (pages returning 200 with "page not found" conte
 
 ### Results
 
-| Result | Condition |
-| ------ | -------------------------------------------------- |
-| Pass | Fabricated bad URLs return proper 4xx status codes |
-| Fail | Bad URLs return 200 (soft 404) |
+| Result | Condition |
+| ------ | ------------------------------------------------------------------------------------------ |
+| Pass | Fabricated bad URLs return proper 4xx status codes |
+| Warn | Every sampled response was indeterminate (HTTP 202 or 5xx); bad-URL handling is unmeasured |
+| Fail | Bad URLs return 200 (soft 404) |
 
-This check has no warn state; it's strictly pass/fail.
+AFDocs tests this by generating non-existent URLs based on your site's URL structure and checking whether the server returns 404 or 200. Per-page responses fall into one of three buckets:
 
-AFDocs tests this by generating non-existent URLs based on your site's URL structure and checking whether the server returns 404 or 200.
+- **`correct-error`** (counts toward pass): 4xx status code.
+- **`soft-404`** (counts toward fail): 2xx/3xx status code, often a templated "page not found" page.
+- **`indeterminate`** (excluded from the soft-404 tally): HTTP 202 or 5xx. RFC 7231 says 202 means "still processing," and Vercel/Next.js ISR returns it during cache-miss/build for fresh URLs. 5xx responses tell us nothing about how the site handles bad URLs. Both are reported separately rather than penalized as soft 404s.
+
+If at least one response is determinate, the check scores from the determinate subset (e.g., 2 correct-error + 1 indeterminate scores as 2/2 = pass). The warn state only fires when **every** sampled response is indeterminate, in which case the check applies the default 0.5 warn coefficient because bad-URL handling could not be measured.
 
 ### How to fix
 
 Configure your server or hosting platform to return a 404 status code for pages that don't exist. Most docs platforms handle this correctly by default; the common exception is single-page applications that serve the shell HTML for all routes and handle 404s client-side.
 
+**If this check warns** with "all sampled pages returned indeterminate responses," the most common causes are:
+
+- **Vercel/Next.js ISR** returning 202 during cache-miss or build. Real agents (low concurrency, warm cache) typically don't hit this, so it's noise rather than signal. No action needed.
+- **A misconfigured server returning 5xx for missing paths** (e.g., an Apache rewrite rule that maps `/foo` to `/foo.html` without checking that the target file exists, then loops or hits an internal error). This is a real issue: agents requesting a typo'd URL get a 500 instead of a clean 404. Add a guard so the rewrite only fires when the target exists, and set an `ErrorDocument 404` directive that points at your platform's 404 page.
+
 ### What about serving helpful content on missing pages?
 
 It's tempting to serve something useful when an agent requests a page that doesn't exist. For example, you might return your `llms.txt` as a fallback, or a "did you mean?" page with links to related content. This seems like an elegant solution to agents hallucinating URLs.
diff --git a/docs/public/.htaccess b/docs/public/.htaccess
index 216f20f..b669ab6 100644
--- a/docs/public/.htaccess
+++ b/docs/public/.htaccess
@@ -40,10 +40,16 @@ RewriteRule ^llms\.txt$ /log-agent-signal.php?path=llms.txt&trigger=llms-txt [L,
 # VitePress builds non-index pages as flat .html files (quick-start.html),
 # not directories (quick-start/index.html). This rule maps trailing-slash
 # URLs to their .html counterparts so the directive check can fetch them.
+# Guard with a -f check on the .html target so missing paths fall through
+# to a real 404 (via ErrorDocument below) rather than looping into a 500.
 RewriteCond %{REQUEST_FILENAME} !-f
 RewriteCond %{REQUEST_FILENAME} !-d
+RewriteCond %{DOCUMENT_ROOT}/$1.html -f
 RewriteRule ^(.*?)/?$ /$1.html [L]
 
+# Serve the VitePress 404 page body for missing paths and return a real 404.
+ErrorDocument 404 /404.html
+
 # Serve .md files with the correct content type
 AddType text/markdown .md
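
Review note: the bucket-and-score rule this patch documents for `http-status-codes` can be sketched in a few lines of Python. This is an illustrative sketch, not the AFDocs implementation. The function names `classify` and `score` are hypothetical, and several details are assumptions that the patch text does not state: that any `soft-404` in the determinate subset means fail, that pass and fail map to coefficients 1.0 and 0.0, and that status codes outside the 2xx to 5xx range are treated as indeterminate.

```python
from collections import Counter

# Default warn coefficient named in SCORING.md for the all-indeterminate case.
WARN_COEFFICIENT = 0.5


def classify(status: int) -> str:
    """Bucket one response to a fabricated bad URL (hypothetical helper)."""
    if status == 202 or 500 <= status <= 599:
        return "indeterminate"  # 202 (cache-miss/build) or 5xx: nothing measured
    if 400 <= status <= 499:
        return "correct-error"  # a real 4xx for a missing page
    if 200 <= status <= 399:
        return "soft-404"       # remaining 2xx/3xx: the page "exists"
    return "indeterminate"      # assumption: anything else is unmeasurable


def score(statuses: list[int]) -> tuple[str, float]:
    """Return (result, coefficient) for a sample of status codes."""
    counts = Counter(classify(s) for s in statuses)
    determinate = counts["correct-error"] + counts["soft-404"]
    if determinate == 0:
        # Every sampled response was indeterminate: warn instead of scoring zero.
        return ("warn", WARN_COEFFICIENT)
    if counts["soft-404"] > 0:
        # Assumption: a single soft 404 in the determinate subset fails the check.
        return ("fail", 0.0)
    return ("pass", 1.0)
```

Under these assumptions, `score([404, 404, 202])` returns `("pass", 1.0)`, matching the "2 correct-error + 1 indeterminate scores as 2/2" example in the docs change, and `score([202, 503])` returns `("warn", 0.5)`.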