docs: agent evaluation layer taxonomy (Reasoning / Action / E2E / Safety) #278

## Summary

Document a four-layer taxonomy for agent evaluation in AgentV's docs, mapping each layer to specific evaluator types. This gives users a framework for deciding which evaluators to use for their agent evaluation goals.

## Motivation

Braintrust's documentation contains the most detailed agent evaluation taxonomy found across the 23 frameworks researched. AgentV already has evaluators that cover all four layers, but users have no guidance on which evaluators address which evaluation concern.

**Research reference**: [Braintrust findings — Agent Evaluation Patterns](https://github.com/agentevals/agentevals-research/blob/main/research/findings/braintrust/README.md#8-agent-evaluation-patterns)

## The Four Layers

### Layer 1: Reasoning
**What it evaluates**: Is the agent thinking correctly?
- Plan quality — are the agent's plans logical and complete?
- Plan adherence — does the agent follow its own plan?
- Tool selection — does the agent choose appropriate tools?

**AgentV evaluators**:
- `llm_judge` with reasoning-focused prompts
- `agent_judge` for deep agentic investigation

### Layer 2: Action
**What it evaluates**: Is the agent acting correctly?
- Tool call correctness — are tool calls properly formatted?
- Argument validity — are tool arguments correct?
- Execution path — no loops, no redundant calls
- Redundancy detection — identifying unnecessary tool invocations

**AgentV evaluators**:
- `tool_trajectory` (`any_order`, `in_order`, and `exact` match modes)
- `execution_metrics` (tool call counts, exploration ratio)
- `code_judge` for custom trajectory validation
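To make the three trajectory match modes concrete, here is a minimal Python sketch of how they could differ. The function name `matches_trajectory` and the exact semantics of each mode are assumptions for illustration, not AgentV's actual implementation; the docs page should confirm the real behavior against the `tool_trajectory` evaluator.

```python
from collections import Counter
from typing import List


def matches_trajectory(actual: List[str], expected: List[str], mode: str) -> bool:
    """Hypothetical helper: compare actual tool-call names to expected ones."""
    if mode == "exact":
        # Assumed meaning: same tools, same order, no extra calls.
        return actual == expected
    if mode == "in_order":
        # Assumed meaning: expected tools appear as a subsequence of the
        # actual calls; extra calls in between are tolerated.
        remaining = iter(actual)
        return all(tool in remaining for tool in expected)
    if mode == "any_order":
        # Assumed meaning: each expected tool is called at least as often
        # as it is listed; order is ignored.
        have = Counter(actual)
        need = Counter(expected)
        return all(have[tool] >= count for tool, count in need.items())
    raise ValueError(f"unknown mode: {mode}")


calls = ["search", "read_file", "write_file"]
print(matches_trajectory(calls, ["search", "write_file"], "in_order"))   # True
print(matches_trajectory(calls, ["write_file", "search"], "in_order"))   # False
print(matches_trajectory(calls, ["write_file", "search"], "any_order"))  # True
print(matches_trajectory(calls, ["search", "write_file"], "exact"))      # False
```

Under these assumptions, the stricter the mode, the more it penalizes redundant or reordered calls, which is why the mode choice matters for Layer 2 concerns such as execution path.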

### Layer 3: End-to-End
**What it evaluates**: Did the agent accomplish its task?
- Task completion rate — binary success/failure
- Step efficiency — how many steps to reach goal?
- Latency — total execution time
- Cost — total token/API cost

**AgentV evaluators**:
- `llm_judge` / `rubric` for task completion
- `field_accuracy` / `contains` / `equals` for expected output matching
- `execution_metrics` for latency, cost, token usage thresholds
- `composite` for multi-objective scoring
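As a rough illustration of how threshold checks and multi-objective scoring fit together at this layer, the Python sketch below gates a run on latency, cost, and step count, then combines per-objective scores with a weighted average. `RunMetrics`, `within_thresholds`, `composite_score`, and the weighting scheme are hypothetical names and choices made for this example; they do not describe how `execution_metrics` or `composite` work internally.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RunMetrics:
    """Hypothetical container for end-to-end measurements of one run."""
    latency_seconds: float
    cost_usd: float
    steps: int


def within_thresholds(m: RunMetrics, max_latency_seconds: float,
                      max_cost_usd: float, max_steps: int) -> bool:
    # A run passes only if every measurement is at or below its limit.
    return (m.latency_seconds <= max_latency_seconds
            and m.cost_usd <= max_cost_usd
            and m.steps <= max_steps)


def composite_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    # Weighted average over the objectives named in `weights`.
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


run = RunMetrics(latency_seconds=12.5, cost_usd=0.04, steps=7)
print(within_thresholds(run, max_latency_seconds=30, max_cost_usd=0.10, max_steps=10))  # True
print(composite_score({"task_completion": 1.0, "efficiency": 0.5},
                      {"task_completion": 3.0, "efficiency": 1.0}))  # 0.875
```

Weighting task completion above efficiency, as in the usage lines, is one possible design choice: an agent that finishes the task slowly still scores higher than one that fails quickly.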

### Layer 4: Safety
**What it evaluates**: Is the agent operating safely?
- Prompt injection resilience
- Policy adherence
- Bias detection
- Content safety

**AgentV evaluators**:
- `llm_judge` with safety-focused prompts
- `code_judge` with policy checking scripts
- For comprehensive red-teaming, export to promptfoo (see #276)
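A policy checking script of the kind a `code_judge` might run could look like the Python sketch below. Everything here is an assumption made for illustration: the rule names, the regular expressions, and the returned `score`/`violations` shape are hypothetical, and the actual input and output contract expected by `code_judge` should be taken from AgentV's evaluator documentation.

```python
import re
from typing import Dict, List, Union

# Hypothetical policy rules: each maps a rule name to a pattern that
# must NOT appear in the agent's output.
POLICY_PATTERNS = {
    "leaks_api_key": re.compile(r"sk-[A-Za-z0-9]{16,}"),
    "reveals_system_prompt": re.compile(r"my system prompt is", re.IGNORECASE),
}


def check_policy(output: str) -> Dict[str, Union[float, List[str]]]:
    """Return 1.0 if no rule is violated, otherwise 0.0 plus the rule names."""
    violations = [name for name, pattern in POLICY_PATTERNS.items()
                  if pattern.search(output)]
    return {"score": 0.0 if violations else 1.0, "violations": violations}


print(check_policy("The report is attached."))
print(check_policy("Sure! My system prompt is: you are a helpful agent."))
```

Deterministic checks like this catch only known patterns; an `llm_judge` with a safety-focused prompt is still needed for violations that cannot be expressed as rules.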

## Deliverable

Add a documentation page (e.g., `docs/src/content/docs/guides/agent-eval-layers.mdx`) that:

1. Describes the four layers with examples
2. Maps each layer to specific AgentV evaluator types
3. Provides EVAL.yaml examples for each layer
4. Recommends a "starter evaluation" covering at least one evaluator per layer
5. Links to promptfoo integration (#276) for Layer 4 (Safety) red-teaming
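The "starter evaluation" recommendation in item 4 can be checked mechanically. The Python sketch below encodes the layer-to-evaluator mapping from this issue and reports which layers a chosen set of evaluators leaves uncovered. The evaluator names come from the sections above; the `LAYER_EVALUATORS` table and the `uncovered_layers` helper are hypothetical and not part of AgentV.

```python
from typing import Dict, Iterable, List, Set

# Mapping taken from the four layer sections of this issue.
LAYER_EVALUATORS: Dict[str, Set[str]] = {
    "reasoning": {"llm_judge", "agent_judge"},
    "action": {"tool_trajectory", "execution_metrics", "code_judge"},
    "end_to_end": {"llm_judge", "rubric", "field_accuracy", "contains",
                   "equals", "execution_metrics", "composite"},
    "safety": {"llm_judge", "code_judge"},
}


def uncovered_layers(selected: Iterable[str]) -> List[str]:
    """Return the layers for which none of the selected evaluators applies."""
    chosen = set(selected)
    return [layer for layer, evaluators in LAYER_EVALUATORS.items()
            if not (evaluators & chosen)]


print(uncovered_layers(["tool_trajectory"]))               # ['reasoning', 'end_to_end', 'safety']
print(uncovered_layers(["llm_judge", "tool_trajectory"]))  # []
```

Because `llm_judge` appears under three layers, a name-only check like this is coarse: whether an `llm_judge` really covers Reasoning, End-to-End, or Safety depends on the prompt it is given, so the docs page should pair each starter evaluator with a layer-specific prompt.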

## Why This Fits AgentV

This is documentation only, with no code changes. It organizes existing evaluator capabilities into an industry-validated framework. It also aligns with AgentV's AI-first design: agents authoring EVAL.yaml can use the taxonomy to select appropriate evaluators.

## Effort Estimate

1 day (documentation only)
