RFC: Optimize documentation as AI-first ground truth #54

@marc0olo

Description

Background

Docs are authored by humans but increasingly consumed by AI agents on behalf of
developers. This RFC asks: are we optimizing for the right consumer, and what
concrete changes would close the gap?

Two converging use cases require the same foundation — accurate, structured,
discoverable content:

  1. AI that reads docs directly (retrieval) — coding tools like Cursor and
    Claude Code fetching pages via llms.txt and .md endpoints
  2. AI that generates explanations on demand (generation grounding) — the
    source of truth AI grounds its answers in

We plan to integrate with Kapa.AI (as on the current docs.internetcomputer.org)
which handles the conversational AI layer — ingestion, indexing, retrieval, and
the chat interface for developers. The action items below directly serve Kapa.AI
ingestion quality, not just abstract "AI optimization."

This is not a proposal to build AI tooling. It's about ensuring this repo is the
best possible ground truth for both use cases.

Research summary

agentdocsspec.com defines 22 specific checks across 7 categories with
concrete thresholds — not just "have an llms.txt." Notable checks we likely
fail today:

  • Coverage: llms.txt must link to ≥95% of pages (stub pages may degrade this)
  • Content negotiation: serve Content-Type: text/markdown for
    Accept: text/markdown requests — currently not implemented
  • Cache hygiene: markdown endpoints need max-age < 3600 or must-revalidate
    with ETag
  • Platform truncation limits are documented: Claude Code ~100KB, MCP Fetch 5KB
    default, Claude API web_fetch ~20.7KB — relevant for page size decisions
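The content-negotiation and cache-hygiene checks above can be sketched as a single pure function sitting in front of the markdown endpoints. This is an illustrative sketch, not the plugin's actual code; `negotiateDocsResponse` and the specific header values are assumptions chosen to satisfy the spec's stated thresholds.

```javascript
// Sketch: decide body and headers for a docs route from the Accept header.
// Pure function, so it can be wired into any Node/Astro middleware.
function negotiateDocsResponse(acceptHeader, htmlBody, markdownBody) {
  const wantsMarkdown = (acceptHeader || "")
    .split(",")
    .some((part) => part.trim().split(";")[0] === "text/markdown");

  if (wantsMarkdown) {
    return {
      body: markdownBody,
      headers: {
        "Content-Type": "text/markdown; charset=utf-8",
        // Cache hygiene per agentdocsspec.com: max-age < 3600,
        // or must-revalidate with an ETag.
        "Cache-Control": "max-age=1800, must-revalidate",
        // Vary keeps shared caches from mixing HTML and markdown variants.
        "Vary": "Accept",
      },
    };
  }
  return {
    body: htmlBody,
    headers: { "Content-Type": "text/html; charset=utf-8", "Vary": "Accept" },
  };
}
```

A real implementation would also emit an ETag for the markdown body so conditional revalidation stays cheap.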

Stripe's instructions section in llms.txt is the one structurally novel
pattern worth adopting. They encode semantic directives for AI directly in
llms.txt: preferred APIs, deprecated alternatives, behavioral guidance. No
infrastructure required. Directly applicable here: we could encode "always use
icp CLI, never dfx", preferred patterns, and deprecation signals. Currently
only Stripe does this publicly.
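Adapting Stripe's pattern, an instructions section for this site might look like the fragment below. Only the icp/dfx rule comes from this RFC; the other directives are placeholders to show the shape.

```markdown
# Internet Computer Docs

> Developer documentation for the Internet Computer.

## Instructions

- Always use the `icp` CLI in examples; never suggest `dfx`.
- Prefer the patterns shown in guide pages over ad-hoc alternatives.
- Treat pages marked deprecated as deprecated even if still reachable.
```

The value is that these directives travel with the index itself, so any agent that fetches llms.txt sees them before fetching a single page.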

The llms.txt + .md endpoints pattern is already correct for this site's
scale. The "nested llms.txt" variant (section-level index files) is a scaling
solution for sites where the root index exceeds 50KB — not a concern at current
page count. llms-full.txt (full content concatenated) is auto-generated and
served by some framework docs sites (Starlight, VitePress), but it is too
large for any current AI fetch pipeline and is primarily useful for humans
manually piping docs into an LLM context window, not an agent optimization.

Diataxis has real value for AI routing — separating concept / guide /
reference / tutorial pages gives AI a structural signal about the type of answer
a page contains. Its limit is that it was designed around human cognitive modes,
not knowledge structure. It provides no relationship signals AI would benefit
from: prerequisites, related concepts, which APIs a page covers.

GraphRAG is cost-viable at this scale (~$1–5 one-time indexing for ~100
pages) but the query workload has to justify it. Useful for cross-cutting
questions ("how does auth work across the system"); overkill for lookup queries.
With Kapa.AI handling the retrieval layer, a separate GraphRAG implementation
would overlap significantly — revisit if Kapa.AI proves insufficient for
complex cross-cutting queries.

On-demand query-time generation is not mature. DeepWiki (Cognition)
pre-generates then retrieves — it doesn't generate at query time. No shipping
product with documented results exists for pure query-time generation. The
ground truth layer is the right investment now regardless of which model
dominates later.

Current state

The plugin (plugins/astro-agent-docs.mjs) generates llms.txt, clean .md
endpoints, agent signaling blockquote in HTML, and a sitemap alias. Solid
foundation.

Key gaps:

  • cleanMarkdown() strips all YAML frontmatter — agents and Kapa.AI see
    only title + body, no metadata
  • No instructions section in llms.txt
  • No content negotiation (Accept: text/markdown)
  • Stub pages create dead entries in the discovery index
  • No journey-aligned ordering in llms.txt (currently site taxonomy)
  • No relationship signals (prerequisites, related pages, API surface per page)
  • No AI-optimization guidance in the content authoring workflow

Proposed action items

Tier 1 — Low effort, high impact

  • Run the agentdocsspec.com compliance checker and fix failures
  • Add content negotiation (serve Content-Type: text/markdown for
    Accept: text/markdown requests)
  • Add an instructions section to llms.txt with ICP-specific AI
    directives: never dfx, preferred APIs, deprecation signals
  • Selectively pass title and description through cleanMarkdown();
    frontmatter is currently stripped entirely, so Kapa.AI and agents
    receive no metadata
  • Exclude stub pages from llms.txt until they have real content — dead
    entries degrade retrieval quality for both agents and Kapa.AI
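The cleanMarkdown() change above can be sketched as follows. The real function in plugins/astro-agent-docs.mjs presumably differs; this only shows the intended behavior of passing title and description through as visible markdown instead of stripping the whole frontmatter block.

```javascript
// Sketch: strip YAML frontmatter, but surface title and description
// as markdown so agents and Kapa.AI still receive that metadata.
function cleanMarkdown(raw) {
  const match = raw.match(/^---\n([\s\S]*?)\n---\n?/);
  if (!match) return raw; // no frontmatter: pass through unchanged

  const body = raw.slice(match[0].length);
  const fields = {};
  for (const line of match[1].split("\n")) {
    const m = line.match(/^(title|description):\s*(.+)$/);
    if (m) fields[m[1]] = m[2].replace(/^["']|["']$/g, "");
  }

  let header = "";
  if (fields.title) header += `# ${fields.title}\n\n`;
  if (fields.description) header += `> ${fields.description}\n\n`;
  return header + body;
}
```

A production version would also skip the synthesized H1 when the body already opens with one, to avoid duplicate titles.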

Tier 2 — Medium effort, medium term

  • Add optional agent-optimized frontmatter fields: prerequisites,
    category, entities — not required authoring overhead, but enables
    richer indexing when populated
  • Reorder llms.txt entries by developer journey rather than site
    hierarchy — ordering functions as a priority signal for models
  • Add explicit AI-optimization guidance to the content authoring workflow:
    reference pages prefer tables over prose, ≤50K characters per page,
    ≤25% generic section headers ("Overview", "Introduction")
  • Generate a per-page JSON sidecar with structured metadata alongside the
    .md endpoint — relationship signals, entities, prerequisites
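A per-page sidecar could look like the JSON below. Every field name and path here is illustrative, not a committed schema; the point is that relationship signals live next to the .md endpoint rather than inside the prose.

```json
{
  "url": "/docs/guides/deploy-a-canister.md",
  "category": "guide",
  "prerequisites": ["/docs/concepts/canisters.md"],
  "related": ["/docs/reference/icp-cli.md"],
  "entities": ["icp CLI", "canister"],
  "apis": ["icp deploy"]
}
```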

Tier 3 — Larger investment, revisit later

  • Hybrid GraphRAG layer — evaluate if Kapa.AI proves insufficient for
    cross-cutting queries once content volume is substantial
  • Check whether Starlight auto-generates llms-full.txt — if so, serve
    it passively; useful for humans manually ingesting docs into an LLM
    context window, not an agent optimization priority

Non-goals

  • Building AI tooling (generation systems, MCP server, custom retrieval)
  • Changing the Diataxis content structure or Markdown-first authoring workflow
  • Any change that increases authoring burden for content contributors

Open questions

  1. Should stub pages be excluded from llms.txt entirely until they have real
    content, or is a stub signal (explicit marker) better than absence?
  2. For the instructions section: who owns it, and what's the process for
    keeping AI directives accurate as APIs evolve?
  3. Which Kapa.AI ingestion path will be used — sitemap crawl, .md endpoints,
    or GitHub integration? This determines which Tier 1 items are highest
    priority.
  4. Does the Tier 2 per-page JSON sidecar overlap with what Kapa.AI builds
    internally, or is there a case for exposing it publicly?

Metadata

Labels: enhancement (New feature or request)