15 changes: 15 additions & 0 deletions docs/checks/content-discoverability.md
@@ -41,6 +41,21 @@ If any of these redirect cross-host (e.g., `example.com` redirects to `docs.exam

If your `llms.txt` lives at a location not covered by these candidates, AFDocs won't find it. You can either move it to one of the candidate locations or [open an issue](https://github.com/agent-ecosystem/afdocs/issues) to suggest expanding the candidate list.

### Canonical selection

When more than one candidate returns a file (e.g. an apex `llms.txt` for the marketing site _and_ a `/docs/llms.txt` for the docs section), AFDocs picks one as **canonical**. The canonical file is the single source of truth for downstream checks: link sampling, size, validation, freshness, and link-resolution all operate on it alone. Other discovered files still appear in `details.discoveredFiles` for visibility, and `cache-header-hygiene` still verifies headers on every llms.txt found.
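
As a rough illustration of how a downstream check might consume this, the sketch below reads the `canonicalLlmsTxt` and `discoveredFiles` keys from the `llms-txt-exists` result. The trimmed-down types, the `filesForAnalysis` name, the `content` field, and the fallback to all discovered files are assumptions made for the example, not the project's actual helper.

```typescript
// Hypothetical, trimmed-down shapes for illustration; the real types live in
// the project's types module. The `content` field is an assumption.
interface DiscoveredFile {
  url: string;
  content: string;
}

interface CheckResultLike {
  details?: Record<string, unknown>;
}

// Return the files a downstream check should analyse: the canonical file
// alone when one was selected, otherwise everything that was discovered
// (the fallback is an assumption of this sketch).
function filesForAnalysis(existsResult: CheckResultLike | undefined): DiscoveredFile[] {
  const details = existsResult?.details;
  const canonical = details?.canonicalLlmsTxt as DiscoveredFile | undefined;
  if (canonical) return [canonical];
  return (details?.discoveredFiles ?? []) as DiscoveredFile[];
}
```

A check such as `cache-header-hygiene`, which deliberately inspects every file, would keep reading `details.discoveredFiles` directly instead.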

The selection rule is _most-specific-to-the-baseUrl wins_. AFDocs picks the file whose directory is the longest prefix of the URL you passed. For example:

| You passed | Files found | Canonical |
| --------------------- | ------------------------------------------- | ------------------------------- |
| `example.com/docs` | `/llms.txt` and `/docs/llms.txt` | `/docs/llms.txt` |
| `example.com` | `/llms.txt` and `/docs/llms.txt` | `/llms.txt` |
| `example.com/docs/v1` | `/llms.txt`, `/docs/llms.txt`, `/docs/v1/…` | `/docs/v1/llms.txt` |
| `example.com/docs/v1` | `/llms.txt` and `/docs/llms.txt` | `/docs/llms.txt` (longer match) |

Use `--llms-txt-url` (or the `llmsTxtUrl` config option) to override the heuristic when the canonical lives at a non-standard path. See the [CLI reference](/reference/cli#llms-txt-selection) for details.
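
The longest-prefix rule in the table above can be sketched as follows. This is a minimal illustration, not the actual `selectCanonicalLlmsTxt` implementation: the `pickCanonical` and `directoryOf` names, the same-origin filter, and the "first file found" fallback are all assumptions.

```typescript
// Hypothetical minimal shape; only `url` is needed for selection.
interface DiscoveredFile {
  url: string;
}

// Directory of a URL path, always ending in "/": "/docs/llms.txt" -> "/docs/".
function directoryOf(pathname: string): string {
  return pathname.slice(0, pathname.lastIndexOf("/") + 1);
}

// Pick the file whose directory is the longest prefix of the base URL's path.
function pickCanonical(files: DiscoveredFile[], baseUrl: string): DiscoveredFile | undefined {
  const base = new URL(baseUrl);
  // Normalise the base path to end in "/" so "/docs" cannot match "/docs-old/".
  const basePath = base.pathname.endsWith("/") ? base.pathname : base.pathname + "/";

  let best: DiscoveredFile | undefined;
  let bestLength = -1;
  for (const file of files) {
    const candidate = new URL(file.url);
    if (candidate.origin !== base.origin) continue; // assumption: compare same-origin files only
    const dir = directoryOf(candidate.pathname);
    if (basePath.startsWith(dir) && dir.length > bestLength) {
      best = file;
      bestLength = dir.length;
    }
  }
  // Assumed fallback: if no directory is a prefix of the base path, use the first file found.
  return best ?? files[0];
}
```

With `https://example.com/docs/v1` as the base and only `/llms.txt` and `/docs/llms.txt` present, both directories are prefixes of `/docs/v1/`, and `/docs/` wins because it is longer.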

### How to fix

**If this check fails**, create an `llms.txt` at one of the candidate locations above. The file should contain an H1 title, a blockquote summary, and markdown links to your key documentation pages. See the [llms.txt specification](https://llmstxt.org/) for the format.
21 changes: 21 additions & 0 deletions docs/reference/cli.md
@@ -166,6 +166,27 @@ Use `--canonical-origin` when your site's URLs in `sitemap.xml` and `llms.txt` d
afdocs check https://preview-xyz-example.app/docs --canonical-origin https://example.com
```

### llms.txt selection

| Flag | Default | Description |
| ---------------------- | ------- | ------------------------------------------------------------------------ |
| `--llms-txt-url <url>` | | Explicit llms.txt URL to use as canonical (bypasses discovery heuristic) |

By default, `afdocs` looks for `llms.txt` at three candidate locations: `{baseUrl}/llms.txt`, `{origin}/llms.txt`, and `{origin}/docs/llms.txt`. When more than one of these returns a file, the most specific one (the file whose directory is the longest prefix of the URL you passed) is used as canonical. Downstream checks (size, validity, link sampling) all operate on the canonical file.

For most sites this heuristic does the right thing. Use `--llms-txt-url` to override it when:

- The canonical llms.txt lives at a non-standard path (e.g. `/docs/v3/llms.txt`)
- A monorepo serves multiple docs surfaces at one origin and you want to score one specifically
- You want to verify a specific file before publishing

```bash
# Score a docs section explicitly, ignoring an apex /llms.txt
afdocs check https://example.com/docs --llms-txt-url https://example.com/docs/llms.txt
```

When the override is set, `llms-txt-exists` probes only that URL and reports failure if it isn't reachable. The cross-host redirect fallback is skipped.
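
A minimal sketch of that behaviour, assuming the three default candidates listed above: when an explicit URL is supplied it becomes the only URL probed. The `defaultCandidates` and `candidatesToProbe` names and the de-duplication step are illustrative assumptions, not the project's `getCandidateUrls`.

```typescript
// Build the three default candidates described above and drop duplicates
// (for example, when baseUrl already points at {origin}/docs).
function defaultCandidates(baseUrl: string, origin: string): string[] {
  const base = baseUrl.replace(/\/+$/, ""); // strip trailing slashes
  const urls = [`${base}/llms.txt`, `${origin}/llms.txt`, `${origin}/docs/llms.txt`];
  return urls.filter((url, index) => urls.indexOf(url) === index);
}

// With an explicit override, probe that URL alone; otherwise use discovery.
function candidatesToProbe(baseUrl: string, origin: string, explicitUrl?: string): string[] {
  return explicitUrl ? [explicitUrl] : defaultCandidates(baseUrl, origin);
}
```

Because the override replaces the candidate list rather than extending it, a missing file at that URL is reported as a failure instead of being masked by another location.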

### Size thresholds

| Flag | Default | Description |
26 changes: 14 additions & 12 deletions docs/reference/config-file.md
@@ -32,6 +32,7 @@ options:
   preferredLocale: en
   preferredVersion: v3
   canonicalOrigin: https://example.com
+  llmsTxtUrl: https://example.com/docs/llms.txt
   thresholds:
     pass: 50000
     fail: 100000
@@ -70,18 +71,19 @@ skipChecks:

Override default runner options. All fields are optional:

-| Field | Default | Description |
-| ------------------ | ----------- | ---------------------------------------------------------- |
-| `maxLinksToTest` | `50` | Maximum number of pages to sample |
-| `samplingStrategy` | `random` | `random`, `deterministic`, `curated`, or `none` |
-| `maxConcurrency` | `3` | Maximum concurrent HTTP requests |
-| `requestDelay` | `200` | Delay between requests in milliseconds |
-| `requestTimeout` | `30000` | Timeout for individual HTTP requests in milliseconds |
-| `preferredLocale` | auto-detect | Preferred locale for URL discovery (e.g. `en`, `fr`, `ja`) |
-| `preferredVersion` | auto-detect | Preferred version for URL discovery (e.g. `v3`, `2.x`) |
-| `canonicalOrigin` | | The production domain your content links to |
-| `thresholds.pass` | `50000` | Page size pass threshold in characters |
-| `thresholds.fail` | `100000` | Page size fail threshold in characters |
+| Field | Default | Description |
+| ------------------ | ----------- | ------------------------------------------------------------------------------------------- |
+| `maxLinksToTest` | `50` | Maximum number of pages to sample |
+| `samplingStrategy` | `random` | `random`, `deterministic`, `curated`, or `none` |
+| `maxConcurrency` | `3` | Maximum concurrent HTTP requests |
+| `requestDelay` | `200` | Delay between requests in milliseconds |
+| `requestTimeout` | `30000` | Timeout for individual HTTP requests in milliseconds |
+| `preferredLocale` | auto-detect | Preferred locale for URL discovery (e.g. `en`, `fr`, `ja`) |
+| `preferredVersion` | auto-detect | Preferred version for URL discovery (e.g. `v3`, `2.x`) |
+| `canonicalOrigin` | | The production domain your content links to |
+| `llmsTxtUrl` | | Explicit llms.txt URL to use as canonical (overrides the discovery heuristic; see CLI docs) |
+| `thresholds.pass` | `50000` | Page size pass threshold in characters |
+| `thresholds.fail` | `100000` | Page size fail threshold in characters |

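To make the relationship between `llmsTxtUrl` and the `--llms-txt-url` flag concrete, here is a hedged sketch of how the two values might be combined: the CLI value wins over the config value, the result is parsed with the standard `URL` constructor, and a different origin produces a warning rather than an error. The `resolveLlmsTxtUrl` function and `ResolvedOverride` shape are hypothetical, and the sketch omits any extra URL normalisation the CLI may apply.

```typescript
// Hypothetical result shape for this sketch.
interface ResolvedOverride {
  url?: string; // normalised override, when one was supplied
  warning?: string; // set when the override is on a different origin than the target
}

function resolveLlmsTxtUrl(
  cliValue: string | undefined,
  configValue: string | undefined,
  targetUrl: string,
): ResolvedOverride {
  const raw = cliValue ?? configValue; // the CLI flag takes precedence over the config file
  if (!raw) return {};

  const url = new URL(raw).toString(); // throws a TypeError for an invalid URL
  const targetOrigin = new URL(targetUrl).origin;
  const overrideOrigin = new URL(url).origin;
  if (overrideOrigin !== targetOrigin) {
    return {
      url,
      warning: `llmsTxtUrl origin (${overrideOrigin}) differs from target origin (${targetOrigin})`,
    };
  }
  return { url };
}
```

Returning the warning as data instead of writing to stderr keeps the sketch easy to test; a real command would decide how to surface it.
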
### `pages` (optional)

45 changes: 41 additions & 4 deletions src/checks/content-discoverability/llms-txt-exists.ts
@@ -1,4 +1,5 @@
 import { registerCheck } from '../registry.js';
+import { selectCanonicalLlmsTxt } from '../../helpers/llms-txt.js';
 import { isCrossHostRedirect } from '../../helpers/to-md-urls.js';
 import type { CheckContext, CheckResult, DiscoveredFile } from '../../types.js';

@@ -16,7 +17,8 @@ function getCandidateUrls(baseUrl: string, origin: string): string[] {
 }

 async function checkLlmsTxtExists(ctx: CheckContext): Promise<CheckResult> {
-  const candidates = getCandidateUrls(ctx.baseUrl, ctx.origin);
+  const explicitUrl = ctx.options.llmsTxtUrl;
+  const candidates = explicitUrl ? [explicitUrl] : getCandidateUrls(ctx.baseUrl, ctx.origin);
   const discovered: DiscoveredFile[] = [];
   const checkedUrls: Array<{
     url: string;
@@ -68,8 +70,11 @@ async function checkLlmsTxtExists(ctx: CheckContext): Promise<CheckResult> {

   // When no llms.txt found, check if any candidates redirected cross-host.
   // If so, try {redirected_origin}/llms.txt as a fallback.
+  // Skip the fallback when the user explicitly specified an llmsTxtUrl:
+  // they told us exactly where to look, so silently probing other origins
+  // would defeat the purpose of the override.
   const redirectedOrigins: string[] = [];
-  if (discovered.length === 0) {
+  if (discovered.length === 0 && !explicitUrl) {
     const checkedSet = new Set(checkedUrls.map((u) => u.url));
     const seenOrigins = new Set<string>();
     for (const checked of checkedUrls) {
@@ -134,6 +139,12 @@ async function checkLlmsTxtExists(ctx: CheckContext): Promise<CheckResult> {
     (fetchErrors > 0 ? `; ${fetchErrors} failed to fetch` : '') +
     (rateLimited > 0 ? `; ${rateLimited} rate-limited (HTTP 429)` : '');

+  // Pick the canonical llms.txt, i.e. the one downstream checks should use as
+  // the single source of truth for sampling links, measuring size, validating
+  // structure, etc. When multiple llms.txt files exist (apex + docs section),
+  // the heuristic prefers the most-specific one relative to the baseUrl.
+  const canonical = selectCanonicalLlmsTxt(discovered, ctx.baseUrl);
+
   // Store discovered files for downstream checks
   const details: Record<string, unknown> = {
     candidateUrls: checkedUrls,
@@ -142,6 +153,16 @@ async function checkLlmsTxtExists(ctx: CheckContext): Promise<CheckResult> {
     rateLimited,
   };

+  if (canonical) {
+    details.canonicalLlmsTxt = canonical;
+    details.canonicalUrl = canonical.url;
+    if (explicitUrl) {
+      details.canonicalSource = 'explicit';
+    } else if (discovered.length > 1) {
+      details.canonicalSource = 'heuristic';
+    }
+  }
+
   if (redirectedOrigins.length > 0) {
     details.redirectedOrigins = redirectedOrigins;
   }
@@ -174,11 +195,14 @@ async function checkLlmsTxtExists(ctx: CheckContext): Promise<CheckResult> {
       redirectedOrigins.length > 0
         ? `; candidates redirected cross-host to ${redirectedOrigins.join(', ')} (agents can't follow cross-host redirects)`
         : '';
+    const message = explicitUrl
+      ? `No llms.txt found at the URL specified via --llms-txt-url (${explicitUrl})${redirectNote}${suffix}`
+      : `No llms.txt found at any candidate location (${candidates.join(', ')})${redirectNote}${suffix}`;
     return {
       id: 'llms-txt-exists',
       category: 'content-discoverability',
       status: 'fail',
-      message: `No llms.txt found at any candidate location (${candidates.join(', ')})${redirectNote}${suffix}`,
+      message,
       details,
     };
   }
@@ -203,11 +227,24 @@ async function checkLlmsTxtExists(ctx: CheckContext): Promise<CheckResult> {
     details.sameContent = allSame;
   }

+  // Build a message that surfaces which file was picked as canonical, so users
+  // can see at a glance which one drives the rest of the report.
+  let message: string;
+  if (explicitUrl && canonical) {
+    message = `llms.txt found at ${canonical.url} (specified via --llms-txt-url)`;
+  } else if (discovered.length === 1) {
+    message = `llms.txt found at ${discovered[0].url}`;
+  } else if (canonical) {
+    message = `llms.txt found at ${discovered.length} locations; using ${canonical.url} as canonical`;
+  } else {
+    message = `llms.txt found at ${discovered.length} location(s)`;
+  }
+
   return {
     id: 'llms-txt-exists',
     category: 'content-discoverability',
     status: 'pass',
-    message: `llms.txt found at ${discovered.length} location(s)${suffix}`,
+    message: message + suffix,
     details,
   };
 }
5 changes: 3 additions & 2 deletions src/checks/content-discoverability/llms-txt-links-markdown.ts
@@ -1,9 +1,10 @@
 import { registerCheck } from '../registry.js';
 import { extractMarkdownLinks } from './llms-txt-valid.js';
 import { filterByPathPrefix, getPathFilterBase } from '../../helpers/get-page-urls.js';
+import { getLlmsTxtFilesForAnalysis } from '../../helpers/llms-txt.js';
 import { toMdUrls } from '../../helpers/to-md-urls.js';
 import { looksLikeMarkdown } from '../../helpers/detect-markdown.js';
-import type { CheckContext, CheckResult, DiscoveredFile } from '../../types.js';
+import type { CheckContext, CheckResult } from '../../types.js';

 interface LinkMarkdownResult {
   url: string;
@@ -25,7 +26,7 @@ function hasMarkdownExtension(url: string): boolean {

 async function checkLlmsTxtLinksMarkdown(ctx: CheckContext): Promise<CheckResult> {
   const existsResult = ctx.previousResults.get('llms-txt-exists');
-  const discovered = (existsResult?.details?.discoveredFiles ?? []) as DiscoveredFile[];
+  const discovered = getLlmsTxtFilesForAnalysis(existsResult);

   if (discovered.length === 0) {
     return {
5 changes: 3 additions & 2 deletions src/checks/content-discoverability/llms-txt-links-resolve.ts
@@ -2,7 +2,8 @@ import { registerCheck } from '../registry.js';
 import { LINK_RESOLVE_THRESHOLD } from '../../constants.js';
 import { extractMarkdownLinks } from './llms-txt-valid.js';
 import { filterByPathPrefix, getPathFilterBase } from '../../helpers/get-page-urls.js';
-import type { CheckContext, CheckResult, DiscoveredFile } from '../../types.js';
+import { getLlmsTxtFilesForAnalysis } from '../../helpers/llms-txt.js';
+import type { CheckContext, CheckResult } from '../../types.js';

 interface LinkCheckResult {
   url: string;
@@ -13,7 +14,7 @@ interface LinkCheckResult {

 async function checkLlmsTxtLinksResolve(ctx: CheckContext): Promise<CheckResult> {
   const existsResult = ctx.previousResults.get('llms-txt-exists');
-  const discovered = (existsResult?.details?.discoveredFiles ?? []) as DiscoveredFile[];
+  const discovered = getLlmsTxtFilesForAnalysis(existsResult);

   if (discovered.length === 0) {
     return {
5 changes: 3 additions & 2 deletions src/checks/content-discoverability/llms-txt-size.ts
@@ -1,9 +1,10 @@
 import { registerCheck } from '../registry.js';
-import type { CheckContext, CheckResult, DiscoveredFile } from '../../types.js';
+import { getLlmsTxtFilesForAnalysis } from '../../helpers/llms-txt.js';
+import type { CheckContext, CheckResult } from '../../types.js';

 async function checkLlmsTxtSize(ctx: CheckContext): Promise<CheckResult> {
   const existsResult = ctx.previousResults.get('llms-txt-exists');
-  const discovered = (existsResult?.details?.discoveredFiles ?? []) as DiscoveredFile[];
+  const discovered = getLlmsTxtFilesForAnalysis(existsResult);

   if (discovered.length === 0) {
     return {
5 changes: 3 additions & 2 deletions src/checks/content-discoverability/llms-txt-valid.ts
@@ -1,5 +1,6 @@
 import { registerCheck } from '../registry.js';
-import type { CheckContext, CheckResult, DiscoveredFile } from '../../types.js';
+import { getLlmsTxtFilesForAnalysis } from '../../helpers/llms-txt.js';
+import type { CheckContext, CheckResult } from '../../types.js';

 interface ValidationResult {
   url: string;
@@ -48,7 +49,7 @@ function validateLlmsTxt(content: string, url: string): ValidationResult {

 async function checkLlmsTxtValid(ctx: CheckContext): Promise<CheckResult> {
   const existsResult = ctx.previousResults.get('llms-txt-exists');
-  const discovered = (existsResult?.details?.discoveredFiles ?? []) as DiscoveredFile[];
+  const discovered = getLlmsTxtFilesForAnalysis(existsResult);

   if (discovered.length === 0) {
     return {
4 changes: 3 additions & 1 deletion src/checks/observability/cache-header-hygiene.ts
@@ -118,7 +118,9 @@ async function check(ctx: CheckContext): Promise<CheckResult> {
   // Collect URLs to check: llms.txt files + sampled page URLs
   const urlsToCheck: string[] = [];

-  // llms.txt URLs
+  // llms.txt URLs: intentionally check ALL discovered files (not just the
+  // canonical one), because every llms.txt location (apex + docs) is
+  // expected to serve appropriate cache headers.
   const existsResult = ctx.previousResults.get('llms-txt-exists');
   const discovered = (existsResult?.details?.discoveredFiles ?? []) as DiscoveredFile[];
   for (const file of discovered) {
23 changes: 23 additions & 0 deletions src/cli/commands/check.ts
@@ -42,6 +42,10 @@ export function registerCheckCommand(program: Command): void {
       '--canonical-origin <url>',
       'The production domain your content links to (for preview/staging testing)',
     )
+    .option(
+      '--llms-txt-url <url>',
+      'Explicit llms.txt URL to use as canonical (bypasses discovery heuristic)',
+    )
     .action(async (rawUrl: string | undefined, opts: Record<string, unknown>) => {
       // Load config: explicit path or auto-discover
       let config;
@@ -199,6 +203,24 @@ export function registerCheckCommand(program: Command): void {
         }
       }

+      let llmsTxtUrl: string | undefined;
+      const rawLlmsTxtUrl = (opts.llmsTxtUrl as string | undefined) ?? config?.options?.llmsTxtUrl;
+      if (rawLlmsTxtUrl) {
+        try {
+          llmsTxtUrl = new URL(normalizeUrl(rawLlmsTxtUrl)).toString();
+        } catch {
+          process.stderr.write(`Error: Invalid --llms-txt-url "${rawLlmsTxtUrl}".\n`);
+          process.exitCode = 1;
+          return;
+        }
+        const targetOrigin = new URL(url).origin;
+        if (new URL(llmsTxtUrl).origin !== targetOrigin) {
+          process.stderr.write(
+            `Warning: --llms-txt-url origin (${new URL(llmsTxtUrl).origin}) differs from target origin (${targetOrigin}). The flag will still be used as canonical.\n`,
+          );
+        }
+      }
+
       const report = await runChecks(url, {
         checkIds,
         skipCheckIds,
@@ -214,6 +236,7 @@ export function registerCheckCommand(program: Command): void {
         ...(preferredLocale && { preferredLocale }),
         ...(preferredVersion && { preferredVersion }),
         ...(canonicalOrigin && { canonicalOrigin }),
+        ...(llmsTxtUrl && { llmsTxtUrl }),
       });

       let output: string;