Summary
Document a four-layer taxonomy for agent evaluation in AgentV's docs, mapping each layer to specific evaluator types. This gives users a framework for deciding which evaluators to use for their agent evaluation goals.
Motivation
Braintrust's documentation contains the most detailed agent evaluation taxonomy found across 23 researched frameworks. AgentV already has evaluators that cover all four layers — but users don't have guidance on which evaluators to use for which evaluation concern.
Research reference: Braintrust findings — Agent Evaluation Patterns
The Four Layers
Layer 1: Reasoning
What it evaluates: Is the agent thinking correctly?
AgentV evaluators:
- llm_judge with reasoning-focused prompts
- agent_judge for deep agentic investigation
Layer 2: Action
What it evaluates: Is the agent acting correctly?
AgentV evaluators:
- tool_trajectory (any_order, in_order, exact match modes)
- execution_metrics (tool call counts, exploration ratio)
- code_judge for custom trajectory validation
Layer 3: End-to-End
What it evaluates: Did the agent accomplish its task?
AgentV evaluators:
- llm_judge / rubric for task completion
- field_accuracy / contains / equals for expected output matching
- execution_metrics for latency, cost, token usage thresholds
- composite for multi-objective scoring
Layer 4: Safety
What it evaluates: Is the agent operating safely?
AgentV evaluators:
- llm_judge with safety-focused prompts
- code_judge with policy checking scripts
Deliverable
Add a documentation page (e.g., docs/src/content/docs/guides/agent-eval-layers.mdx) that:
Why This Fits AgentV
This is a documentation-only change with no code changes. It organizes existing evaluator capabilities into an industry-validated framework, and it aligns with AgentV's AI-first design: agents authoring EVAL.yaml can use the taxonomy to select appropriate evaluators.
Effort Estimate
1 day (documentation only)