Apex llms.txt drowns out {baseUrl}/llms.txt when both exist #53

@SahilAujla

Context

When a site has both an apex llms.txt (e.g. example.com/llms.txt) and a docs-section llms.txt (e.g. example.com/docs/llms.txt), and the user passes the docs URL to afdocs, the scorer picks the apex one as canonical for sampling. Every link-following check (llms-txt-directive, markdown-url-support, content-negotiation, page-size-html, markdown-content-parity) then samples apex pages instead of docs pages.

For the common case where the apex is a marketing site and docs live at /docs, this means agent-readiness improvements made in the docs section get masked by the marketing site's lack of agent-friendly features.

Concrete example

Site: alchemy.com/docs

  • alchemy.com/llms.txt → 159K marketing file, 683 links to /blog/, /case-studies/, /overviews/
  • alchemy.com/docs/llms.txt → 495-byte docs index, 6 section links (split per the llms-txt-size fix recommendation in the spec)

Verbose afdocs output:

✓ llms-txt-exists: llms.txt found at 2 location(s)
⚠ llms-txt-valid: ... https://alchemy.com/llms.txt: No blockquote summary found
✗ llms-txt-size: llms.txt is 158,998 characters

Sampled URLs in llms-txt-directive, markdown-url-support, and content-negotiation are all marketing pages (/overviews/..., /blog/..., /case-studies/...). The 19/50 directive-pass and 19/50 markdown-pass come from the few docs pages that happen to be in the marketing llms.txt.
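A quick way to check how much of a given llms.txt actually points into the docs section is to count its links by path prefix. The snippet below is only a diagnostic sketch: `links_under_base` is a hypothetical helper, not part of afdocs, and it assumes links are written as absolute Markdown links on the same host as the base URL.

```python
import re
from urllib.parse import urlsplit

# Matches absolute Markdown links such as [Title](https://example.com/docs/page)
LINK_RE = re.compile(r"\[[^\]]*\]\((https?://[^)\s]+)\)")


def links_under_base(llms_txt: str, base_url: str):
    """Return (links under base_url, all links) found in an llms.txt body.

    Hypothetical helper for diagnosis only; afdocs may parse links differently.
    """
    base = urlsplit(base_url)
    # Normalise "/docs" and "/docs/" to the same "/docs/" prefix.
    prefix = base.path.rstrip("/") + "/"
    links = LINK_RE.findall(llms_txt)
    inside = []
    for link in links:
        parts = urlsplit(link)
        # Same host, and the link path sits at or below the base path.
        if parts.netloc == base.netloc and (parts.path + "/").startswith(prefix):
            inside.append(link)
    return inside, links


if __name__ == "__main__":
    sample = (
        "- [Quickstart](https://example.com/docs/quickstart)\n"
        "- [Launch post](https://example.com/blog/launch)\n"
        "- [Customer story](https://example.com/case-studies/acme)\n"
    )
    inside, total = links_under_base(sample, "https://example.com/docs")
    print(f"{len(inside)} of {len(total)} links are under the docs base URL")
```

Running this against the apex file and the docs file separately shows which of the two is dominated by non-docs pages.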

The score regressed from 78 (C) to 68 (D) after we split the docs llms.txt per the spec's recommendation, apparently because shrinking our file flipped the canonical pick to the apex one.

Suggested behaviors (in priority order)

  1. Prefer the more-specific candidate. When {baseUrl}/llms.txt exists, prefer it over {origin}/llms.txt since it's by definition more aligned with the URL the user passed.
  2. Add a --llms-txt-url <url> flag. Lets users explicitly point afdocs at the canonical llms.txt for their docs, bypassing the heuristic. Especially useful for monorepo / multi-property setups.
  3. Surface the picked URL in output. Show which llms.txt was selected as canonical so users understand why their score is what it is.
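Behaviors 1 and 2 could be combined roughly as follows. This is a minimal sketch, not the current afdocs selection logic: `pick_canonical_llms_txt` and its `override` parameter (standing in for the proposed `--llms-txt-url` flag) are hypothetical names, and "more specific" is assumed to mean "longest directory path that is still a prefix of the URL the user passed".

```python
from urllib.parse import urlsplit


def _dir_path(url: str) -> str:
    """Return the directory part of a URL path, always ending in '/'."""
    path = urlsplit(url).path
    return path.rsplit("/", 1)[0] + "/"


def pick_canonical_llms_txt(base_url, candidates, override=None):
    """Pick the llms.txt closest to base_url (hypothetical helper).

    override models the proposed --llms-txt-url flag: when given, it wins
    outright and the heuristic is skipped.
    """
    if override is not None:
        return override
    base = urlsplit(base_url)
    base_path = base.path if base.path.endswith("/") else base.path + "/"
    best, best_len = None, -1
    for url in candidates:
        if urlsplit(url).netloc != base.netloc:
            continue  # ignore llms.txt files on other hosts
        directory = _dir_path(url)
        # Keep the candidate whose directory is the longest prefix of the base path.
        if base_path.startswith(directory) and len(directory) > best_len:
            best, best_len = url, len(directory)
    return best


if __name__ == "__main__":
    found = ["https://example.com/llms.txt", "https://example.com/docs/llms.txt"]
    picked = pick_canonical_llms_txt("https://example.com/docs", found)
    # Behavior 3: surface the pick so users can see why they got their score.
    print(f"canonical llms.txt: {picked}")
```

Under this rule the file size no longer influences the pick, so splitting a large docs llms.txt cannot hand the selection back to the apex file.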

Workarounds tried

  • Splitting per the spec's llms-txt-size recommendation: backfired (made our file smaller, so the apex won the heuristic).
  • --canonical-origin: doesn't change which llms.txt is picked.
  • --sampling curated --urls ...: works for sampling but the size/freshness checks still hit the apex.
