RFC: Optimize documentation as AI-first ground truth #54

@marc0olo

Description

Background

Docs are authored by humans but increasingly consumed by AI agents on behalf of
developers. This RFC asks: are we optimizing for the right consumer, and what
concrete changes would close the gap?

Two converging use cases require the same foundation — accurate, structured,
discoverable content:

  1. AI that reads docs directly (retrieval) — coding tools like Cursor and
    Claude Code fetching pages via llms.txt and .md endpoints
  2. AI that generates explanations on demand (generation grounding) — the
    source of truth AI grounds its answers in

We plan to integrate with Kapa.AI (as on the current docs.internetcomputer.org)
which handles the conversational AI layer — ingestion, indexing, retrieval, and
the chat interface for developers. The action items below directly serve Kapa.AI
ingestion quality, not just abstract "AI optimization."

This is not a proposal to build AI tooling. It's about ensuring this repo is the
best possible ground truth for both use cases.

Research summary

agentdocsspec.com defines 22 specific checks across 7 categories with
concrete thresholds — not just "have an llms.txt." Notable checks we likely
fail today:

  • Coverage: llms.txt must link to ≥95% of pages (stub pages may degrade this)
  • Content negotiation: serve Content-Type: text/markdown for
    Accept: text/markdown requests — currently not implemented
  • Cache hygiene: markdown endpoints need max-age < 3600 or must-revalidate
    with ETag
  • Platform truncation limits are documented: Claude Code ~100KB, MCP Fetch 5KB
    default, Claude API web_fetch ~20.7KB — relevant for page size decisions
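The content-negotiation and cache-hygiene checks above can be sketched as a single pure function sitting in front of the markdown endpoints. This is an illustrative sketch, not the plugin's actual code; `negotiateDocsResponse` and the specific header values are assumptions chosen to satisfy the spec's stated thresholds.

```javascript
// Sketch: decide body and headers for a docs route from the Accept header.
// Pure function, so it can be wired into any Node/Astro middleware.
function negotiateDocsResponse(acceptHeader, htmlBody, markdownBody) {
  const wantsMarkdown = (acceptHeader || "")
    .split(",")
    .some((part) => part.trim().split(";")[0] === "text/markdown");

  if (wantsMarkdown) {
    return {
      body: markdownBody,
      headers: {
        "Content-Type": "text/markdown; charset=utf-8",
        // Cache hygiene per agentdocsspec.com: max-age < 3600,
        // or must-revalidate with an ETag.
        "Cache-Control": "max-age=1800, must-revalidate",
        // Vary keeps shared caches from mixing HTML and markdown variants.
        "Vary": "Accept",
      },
    };
  }
  return {
    body: htmlBody,
    headers: { "Content-Type": "text/html; charset=utf-8", "Vary": "Accept" },
  };
}
```

A real implementation would also emit an ETag for the markdown body so conditional revalidation stays cheap.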

Stripe's instructions section in llms.txt is the one structurally novel
pattern worth adopting. They encode semantic directives for AI directly in
llms.txt: preferred APIs, deprecated alternatives, behavioral guidance. No
infrastructure required. Directly applicable here: we could encode "always use
icp CLI, never dfx", preferred patterns, and deprecation signals. Currently
only Stripe does this publicly.
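Adapting Stripe's pattern, an instructions section for this site might look like the fragment below. Only the icp/dfx rule comes from this RFC; the other directives are placeholders to show the shape.

```markdown
# Internet Computer Docs

> Developer documentation for the Internet Computer.

## Instructions

- Always use the `icp` CLI in examples; never suggest `dfx`.
- Prefer the patterns shown in guide pages over ad-hoc alternatives.
- Treat pages marked deprecated as deprecated even if still reachable.
```

The value is that these directives travel with the index itself, so any agent that fetches llms.txt sees them before fetching a single page.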

The llms.txt + .md endpoints pattern is already correct for this site's
scale. The "nested llms.txt" variant (section-level index files) is a scaling
solution for sites where the root index exceeds 50KB — not a concern at current
page count. llms-full.txt (full content concatenated) is auto-generated and
served by some framework docs sites (Starlight, VitePress), but it is too
large for any current AI fetch pipeline and is primarily useful for humans
manually piping docs into an LLM context window, not an agent optimization.

Diataxis has real value for AI routing — separating concept / guide /
reference / tutorial pages gives AI a structural signal about the type of answer
a page contains. Its limit is that it was designed around human cognitive modes,
not knowledge structure. It provides no relationship signals AI would benefit
from: prerequisites, related concepts, which APIs a page covers.

GraphRAG is cost-viable at this scale (~$1–5 one-time indexing for ~100
pages) but the query workload has to justify it. Useful for cross-cutting
questions ("how does auth work across the system"); overkill for lookup queries.
With Kapa.AI handling the retrieval layer, a separate GraphRAG implementation
would overlap significantly — revisit if Kapa.AI proves insufficient for
complex cross-cutting queries.

On-demand query-time generation is not mature. DeepWiki (Cognition)
pre-generates then retrieves — it doesn't generate at query time. No shipping
product with documented results exists for pure query-time generation. The
ground truth layer is the right investment now regardless of which model
dominates later.

Current state

The plugin (plugins/astro-agent-docs.mjs) generates llms.txt, clean .md
endpoints, agent signaling blockquote in HTML, and a sitemap alias. Solid
foundation.

Key gaps:

  • cleanMarkdown() strips all YAML frontmatter — agents and Kapa.AI see
    only title + body, no metadata
  • No instructions section in llms.txt
  • No content negotiation (Accept: text/markdown)
  • Stub pages create dead entries in the discovery index
  • No journey-aligned ordering in llms.txt (currently site taxonomy)
  • No relationship signals (prerequisites, related pages, API surface per page)
  • No AI-optimization guidance in the content authoring workflow

Proposed action items

Tier 1 — Low effort, high impact

  • Run the agentdocsspec.com compliance checker and fix failures
  • Add content negotiation (serve Content-Type: text/markdown for
    Accept: text/markdown requests)
  • Add an instructions section to llms.txt with ICP-specific AI
    directives: never dfx, preferred APIs, deprecation signals
  • Selectively pass title and description through cleanMarkdown();
    frontmatter is currently stripped entirely, so Kapa.AI and agents
    receive no metadata
  • Exclude stub pages from llms.txt until they have real content — dead
    entries degrade retrieval quality for both agents and Kapa.AI
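The cleanMarkdown() change above can be sketched as follows. The real function in plugins/astro-agent-docs.mjs presumably differs; this only shows the intended behavior of passing title and description through as visible markdown instead of stripping the whole frontmatter block.

```javascript
// Sketch: strip YAML frontmatter, but surface title and description
// as markdown so agents and Kapa.AI still receive that metadata.
function cleanMarkdown(raw) {
  const match = raw.match(/^---\n([\s\S]*?)\n---\n?/);
  if (!match) return raw; // no frontmatter: pass through unchanged

  const body = raw.slice(match[0].length);
  const fields = {};
  for (const line of match[1].split("\n")) {
    const m = line.match(/^(title|description):\s*(.+)$/);
    if (m) fields[m[1]] = m[2].replace(/^["']|["']$/g, "");
  }

  let header = "";
  if (fields.title) header += `# ${fields.title}\n\n`;
  if (fields.description) header += `> ${fields.description}\n\n`;
  return header + body;
}
```

A production version would also skip the synthesized H1 when the body already opens with one, to avoid duplicate titles.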

Tier 2 — Medium effort, medium term

  • Add optional agent-optimized frontmatter fields: prerequisites,
    category, entities — not required authoring overhead, but enables
    richer indexing when populated
  • Reorder llms.txt entries by developer journey rather than site
    hierarchy — ordering functions as a priority signal for models
  • Add explicit AI-optimization guidance to the content authoring workflow:
    reference pages prefer tables over prose, ≤50K characters per page,
    ≤25% generic section headers ("Overview", "Introduction")
  • Generate a per-page JSON sidecar with structured metadata alongside the
    .md endpoint — relationship signals, entities, prerequisites
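A per-page sidecar could look like the JSON below. Every field name and path here is illustrative, not a committed schema; the point is that relationship signals live next to the .md endpoint rather than inside the prose.

```json
{
  "url": "/docs/guides/deploy-a-canister.md",
  "category": "guide",
  "prerequisites": ["/docs/concepts/canisters.md"],
  "related": ["/docs/reference/icp-cli.md"],
  "entities": ["icp CLI", "canister"],
  "apis": ["icp deploy"]
}
```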

Tier 3 — Larger investment, revisit later

  • Hybrid GraphRAG layer — evaluate if Kapa.AI proves insufficient for
    cross-cutting queries once content volume is substantial
  • Check whether Starlight auto-generates llms-full.txt — if so, serve
    it passively; useful for humans manually ingesting docs into an LLM
    context window, not an agent optimization priority

Non-goals

  • Building AI tooling (generation systems, MCP server, custom retrieval)
  • Changing the Diataxis content structure or Markdown-first authoring workflow
  • Any change that increases authoring burden for content contributors

Open questions

  1. Should stub pages be excluded from llms.txt entirely until they have real
    content, or is a stub signal (explicit marker) better than absence?
  2. For the instructions section: who owns it, and what's the process for
    keeping AI directives accurate as APIs evolve?
  3. Which Kapa.AI ingestion path will be used — sitemap crawl, .md endpoints,
    or GitHub integration? This determines which Tier 1 items are highest
    priority.
  4. Does the Tier 2 per-page JSON sidecar overlap with what Kapa.AI builds
    internally, or is there a case for exposing it publicly?

Metadata

Labels: enhancement (New feature or request)