diff --git a/docs/decisions/0020-foundry-evals-integration.md b/docs/decisions/0020-foundry-evals-integration.md
new file mode 100644
index 0000000000..f5b5db4db5
--- /dev/null
+++ b/docs/decisions/0020-foundry-evals-integration.md
@@ -0,0 +1,815 @@
---
status: accepted
contact: bentho
date: 2026-02-27
deciders: bentho, markwallace-microsoft, westey-m
consulted: Pratyush Mishra, Shivam Shrivastava, Manni Arora (Centrica eval scenario)
informed: Agent Framework team, Foundry Evals team
---

# Agent Evaluation Architecture with Azure AI Foundry Integration

## Context and Problem Statement

Azure AI Foundry provides a rich evaluation service for AI agents — built-in evaluators for agent behavior (task adherence, intent resolution), tool usage (tool call accuracy, tool selection), quality (coherence, fluency, relevance), and safety (violence, self-harm, prohibited actions). Results are viewable in the Foundry portal with dashboards and comparison views.

However, using Foundry Evals with an agent-framework agent today requires significant manual effort. Developers must:

1. Transform agent-framework's `Message`/`Content` types into the OpenAI-style agent message schema that Foundry evaluators expect
2. Map tool definitions from agent-framework's `FunctionTool` format to evaluator-compatible schemas
3. Manually wire up the correct Foundry data source type (`azure_ai_traces`, `jsonl`, `azure_ai_target_completions`, etc.) depending on their scenario
4. Handle App Insights trace ID queries, response ID collection, and eval polling

Additionally, evaluation is a concern that extends beyond any single provider. Developers may want to use local evaluators (LLM-as-judge, regex, keyword matching), third-party evaluation libraries, or multiple providers in combination. The architecture must support this without creating a Foundry-specific lock-in at the API level.

### Functional Requirements for Agent Evaluation

- **Single agents and workflows.** Evaluate both individual agent responses and multi-agent workflow results, with per-agent breakdown to pinpoint underperformance.
- **One-shot and multi-turn conversations.** Capture full conversation trajectories — including tool calls and results — not just final query/response pairs.
- **Conversation factoring.** Support splitting conversations into query/response in multiple ways (last turn, full trajectory, per-turn) because different factorings measure different things.
- **Multiple providers, mix and match.** Run Foundry LLM-as-judge evaluators alongside fast local checks and custom evaluators on the same data, without restructuring code.
- **Third-party extensibility.** Any evaluation library can participate by implementing the `Evaluator` protocol (Python) or MEAI's `IEvaluator` interface (.NET). No predetermined list of supported libraries — the protocol is intentionally simple (`evaluate(items) → results`) so that wrappers for libraries like DeepEval, RAGAS, or Promptfoo are straightforward to write (a sketch follows this list).
- **Bring your own evaluator.** Creating a custom evaluator should be as simple as writing a function.
- **Evaluate without re-running.** Evaluate existing responses from logs or previous runs without invoking the agent again.
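To illustrate the third-party path, here is a minimal sketch of a wrapper that conforms to the protocol. Everything in it is illustrative: only the protocol shape (a `name` plus `evaluate(items)` returning results) comes from this ADR, and the stand-in dataclasses merely approximate the real `EvalResults`/`EvalItemResult` types described under "What To Build":

```python
from dataclasses import dataclass, field

@dataclass
class _ItemResult:   # stand-in for EvalItemResult
    query: str
    response: str
    scores: dict[str, float]

@dataclass
class _Results:      # stand-in for EvalResults
    status: str
    items: list[_ItemResult] = field(default_factory=list)

class WordOverlapEvaluator:
    """Hypothetical wrapper: scores each response by word overlap with expected_output."""

    name = "word_overlap"

    async def evaluate(self, items, *, eval_name: str = "Agent Framework Eval") -> _Results:
        results = _Results(status="completed")
        for item in items:
            expected = set((item.expected_output or "").lower().split())
            actual = set(item.response.lower().split())
            score = len(expected & actual) / len(expected) if expected else 0.0
            results.items.append(_ItemResult(item.query, item.response, {self.name: score}))
        return results
```

Because orchestration depends only on the protocol, such a wrapper can be passed anywhere a built-in evaluator can, including in the same `evaluators=` list as Foundry.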
## Decision Drivers

- **Zero-friction evaluation**: Developers should go from "I have an agent" to "I have eval results" with minimal code.
- **Provider-agnostic API**: Core evaluation capabilities must not be tied to any specific provider. Provider configuration should be separate from the evaluation call.
- **Lowest concept count**: Introduce the fewest possible new types, abstractions, and APIs for developers to learn.
- **Leverage existing knowledge**: The framework already knows which agents exist, what tools they have, and what conversations occurred. Evals should use this automatically rather than requiring the developer to re-specify it.
- **Foundry-native results**: When using Foundry, results should be viewable in the Foundry portal with dashboards and comparison views.
- **Progressive disclosure**: Simple scenarios should be near-zero code. Advanced scenarios should build on the same primitives.
- **Cross-language parity**: Design must be implementable in both Python and .NET.

## Considered Options

1. **Provider-specific functions** — Build Foundry-specific helper functions (`evaluate_agent()`, etc.) directly in the Azure package. All eval functions take Foundry connection parameters.
2. **Evaluator protocol with shared orchestration** — Define a provider-agnostic `Evaluator` protocol in the base agent library (`agent_framework` in Python, `Microsoft.Agents.AI` in .NET). Orchestration functions live alongside it. Providers implement the protocol.
3. **Full eval framework** — Build comprehensive eval infrastructure including custom evaluator definitions, scoring profiles, and reporting inside agent-framework.

## Decision Outcome

Chosen option: "Evaluator protocol with shared orchestration", because it delivers the low-friction developer experience, supports multiple providers without API changes, and keeps the concept count low.

### Usage Examples

#### Evaluate an agent

The agent is invoked once per query by default. For statistically meaningful evaluation, provide multiple diverse queries. For measuring **consistency** (does the same query produce reliable results?), use `num_repetitions` to run each query N times independently:

**Python:**

```python
evals = FoundryEvals(
    project_client=client,
    model_deployment="gpt-4o",
    evaluators=[FoundryEvals.RELEVANCE, FoundryEvals.COHERENCE],
)

results = await evaluate_agent(
    agent=my_agent,
    queries=[
        "What's the weather in Seattle?",
        "Plan a weekend trip to Portland",
        "What restaurants are near Pike Place?",
    ],
    evaluators=evals,
)
for r in results:
    r.assert_passed()
```

**C#:**

```csharp
var evals = new FoundryEvals(chatConfiguration, FoundryEvals.Relevance, FoundryEvals.Coherence);

AgentEvaluationResults results = await agent.EvaluateAsync(
    new[] {
        "What's the weather in Seattle?",
        "Plan a weekend trip to Portland",
        "What restaurants are near Pike Place?",
    },
    evals);

results.AssertAllPassed();
```

`evaluate_agent` returns one `EvalResults` per evaluator. Each result contains per-item scores with the evaluated response for auditing:

```
# results[0] (FoundryEvals)
EvalResults(status="completed", passed=3, failed=0, total=3)
  items[0]: EvalItemResult(
      query="What's the weather in Seattle?",
      response="It's currently 72°F and sunny in Seattle.",
      scores={"relevance": 5, "coherence": 5})
  items[1]: EvalItemResult(
      query="Plan a weekend trip to Portland",
      response="Here's a 2-day Portland itinerary...",
      scores={"relevance": 4, "coherence": 5})
  items[2]: EvalItemResult(
      query="What restaurants are near Pike Place?",
      response="Top restaurants near Pike Place Market: ...",
      scores={"relevance": 5, "coherence": 4})
```

#### Measure consistency with repetitions

Run each query multiple times to detect non-deterministic behavior:

**Python:**

```python
results = await evaluate_agent(
    agent=my_agent,
    queries=["What's the weather in Seattle?"],
    evaluators=evals,
    num_repetitions=3,  # each query runs 3 times independently
)
# results contain 3 items (1 query × 3 repetitions)
```

**C#:**

```csharp
AgentEvaluationResults results = await agent.EvaluateAsync(
    new[] { "What's the weather in Seattle?" },
    evals,
    numRepetitions: 3);  // each query runs 3 times independently
// results contain 3 items (1 query × 3 repetitions)
```
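For consistency runs it helps to aggregate the repeated scores per query. A small sketch that assumes only the `EvalResults.items` shape shown above (each item carrying `query` and a `scores` dict); `score_spread` is a hypothetical helper, not framework API:

```python
from collections import defaultdict
from statistics import mean, pstdev

def score_spread(results, metric: str) -> dict[str, tuple[float, float]]:
    """Mean and spread of one metric, grouped by query across repetitions."""
    by_query: dict[str, list[float]] = defaultdict(list)
    for item in results.items:
        by_query[item.query].append(item.scores[metric])
    return {q: (mean(vals), pstdev(vals)) for q, vals in by_query.items()}

# A large spread on a query flags non-deterministic agent behavior.
spread = score_spread(results, "relevance")
```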
#### Evaluate a response you already have

When you already have agent responses, pass them directly to skip re-running the agent. Each query is paired with its corresponding response:

**Python:**

```python
queries = ["What's the weather?", "What's the capital of France?"]
responses = [await agent.run([Message("user", [q])]) for q in queries]

results = await evaluate_agent(
    responses=responses,
    evaluators=evals,
)
```

**C#:**

```csharp
var queries = new[] { "What's the weather?" };
var responses = new List<AgentResponse>();
foreach (var q in queries)
    responses.Add(await agent.RunAsync(new[] { new ChatMessage(ChatRole.User, q) }));

AgentEvaluationResults results = await agent.EvaluateAsync(
    responses: responses,
    evals);
```

Each `AgentResponse` already contains the conversation (query + response), so the evaluator extracts query/response from the conversation. When you pass `responses` without `queries`, the conversation is the source of truth.
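The same no-rerun path covers logged conversations: because evaluators accept `EvalItem`s directly (see "Core: EvalItem" below), you can rebuild items from stored messages and call the evaluator yourself. A sketch under assumptions: the `Message(role, [text])` construction mirrors the example above, while the import path and the JSONL record layout are illustrative:

```python
import json

from agent_framework import EvalItem, Message  # assumed import path

def items_from_log(path: str) -> list[EvalItem]:
    """One EvalItem per logged conversation; query/response derive from the messages."""
    items = []
    with open(path) as f:
        for line in f:
            record = json.loads(line)  # e.g. {"messages": [{"role": "...", "text": "..."}]}
            conversation = [Message(m["role"], [m["text"]]) for m in record["messages"]]
            items.append(EvalItem(conversation=conversation))
    return items

results = await evals.evaluate(items_from_log("runs.jsonl"), eval_name="offline replay")
```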
#### Evaluate with conversation split strategies

By default, evaluators see only the last turn (final user message → final assistant response). For multi-turn conversations, you can control how the conversation is factored for evaluation:

**Python:**

```python
results = await evaluate_agent(
    agent=agent,
    queries=["Plan a 3-day trip to Paris"],
    evaluators=evals,
    conversation_split=ConversationSplit.FULL,  # evaluate entire trajectory
)

# Or per-turn: each user→assistant exchange scored independently
results = await evaluate_agent(
    agent=agent,
    queries=["Plan a 3-day trip to Paris"],
    evaluators=evals,
    conversation_split=ConversationSplit.PER_TURN,
)
```

**C#:**

```csharp
// Full conversation as context
AgentEvaluationResults results = await agent.EvaluateAsync(
    new[] { "Plan a 3-day trip to Paris" },
    evals,
    splitter: ConversationSplitters.Full);

// Per-turn splitting
var items = EvalItem.PerTurnItems(conversation);  // one EvalItem per user turn
var results = await evals.EvaluateAsync(items);
```

With `PER_TURN`, a 3-turn conversation produces 3 scored items:

```
EvalResults(status="completed", passed=3, failed=0, total=3)
  items[0]: query="Plan a 3-day trip to Paris"  scores={"relevance": 5}
  items[1]: query="What about restaurants?"     scores={"relevance": 4}
  items[2]: query="Make it budget-friendly"     scores={"relevance": 5}
```

#### Evaluate a multi-agent workflow

**Python:**

```python
result = await workflow.run("Plan a trip to Paris")
eval_results = await evaluate_workflow(
    workflow=workflow,
    workflow_result=result,
    evaluators=evals,
)

for r in eval_results:
    print(f"  overall: {r.passed}/{r.total}")
    for name, sub in r.sub_results.items():
        print(f"  {name}: {sub.passed}/{sub.total}")
```

**C#:**

```csharp
WorkflowRunResult result = await workflow.RunAsync("Plan a trip to Paris");

IReadOnlyList<AgentEvaluationResults> evalResults = await result.EvaluateAsync(evals);

foreach (var r in evalResults)
{
    Console.WriteLine($"  overall: {r.Passed}/{r.Total}");
    foreach (var (name, sub) in r.SubResults)
        Console.WriteLine($"  {name}: {sub.Passed}/{sub.Total}");
}
```

Workflows return one result per evaluator, with sub-results per agent in the workflow:

```
EvalResults(status="completed", passed=2, failed=0, total=2)
  sub_results:
    "planner": EvalResults(passed=1, total=1)
    "researcher": EvalResults(passed=1, total=1)
```

#### Mix multiple providers

**Python:**

```python
@evaluator
def is_helpful(response: str) -> bool:
    return len(response.split()) > 10

foundry = FoundryEvals(
    project_client=client,
    model_deployment="gpt-4o",
    evaluators=[FoundryEvals.RELEVANCE, FoundryEvals.COHERENCE],
)

results = await evaluate_agent(
    agent=agent,
    queries=queries,
    evaluators=[is_helpful, keyword_check("weather"), foundry],
)
```

**C#:**

```csharp
IReadOnlyList<AgentEvaluationResults> results = await agent.EvaluateAsync(
    queries,
    evaluators: new IEvaluator[]
    {
        new LocalEvaluator(
            EvalChecks.KeywordCheck("weather"),
            FunctionEvaluator.Create("is_helpful", (string r) => r.Split(' ').Length > 10)),
        new FoundryEvals(chatConfiguration, FoundryEvals.Relevance, FoundryEvals.Coherence),
    });
```

Multiple evaluators return one result each — `results[0]` is the local evaluator, `results[1]` is Foundry.
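Since results come back in evaluator order, zipping them against the evaluator list gives a quick per-provider summary. A small reporting sketch reusing the Python example above; the `getattr` fallback is an assumption about how `@evaluator`-wrapped functions expose their name:

```python
evaluators = [is_helpful, keyword_check("weather"), foundry]
results = await evaluate_agent(agent=agent, queries=queries, evaluators=evaluators)

for ev, res in zip(evaluators, results):
    name = getattr(ev, "name", getattr(ev, "__name__", repr(ev)))
    print(f"{name}: {res.passed}/{res.total} passed")
```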
#### Custom function evaluators

**Python:**

```python
@evaluator
def mentions_city(response: str, expected_output: str) -> bool:
    return expected_output.lower() in response.lower()

@evaluator
def used_tools(conversation: list, tools: list) -> float:
    # ... scoring logic
    return score

local = LocalEvaluator(mentions_city, used_tools)
```

`@evaluator` uses **parameter name injection** — the function's parameter names determine what data it receives from the `EvalItem`. Supported names: `query`, `response`, `expected_output`, `expected_tool_calls`, `conversation`, `tools`, `context`. Any combination is valid.

**C#:**

```csharp
var local = new LocalEvaluator(
    FunctionEvaluator.Create("mentions_city",
        (EvalItem item) => item.ExpectedOutput != null
            && item.Response.Contains(item.ExpectedOutput, StringComparison.OrdinalIgnoreCase)),
    FunctionEvaluator.Create("is_concise",
        (string response) => response.Split(' ').Length < 500));
```

## What To Build

### Core: Evaluator Protocol

A runtime-checkable protocol that any evaluation provider implements:

```python
@runtime_checkable
class Evaluator(Protocol):
    name: str

    async def evaluate(
        self, items: Sequence[EvalItem], *, eval_name: str = "Agent Framework Eval"
    ) -> EvalResults: ...
```

The protocol is minimal — just `name` and `evaluate()`.

### Core: EvalItem

Provider-agnostic data format for items to evaluate:

```python
@dataclass
class ExpectedToolCall:
    name: str                                # Tool/function name
    arguments: dict[str, Any] | None = None  # None = don't check args

@dataclass
class EvalItem:
    conversation: list[Message]              # Single source of truth
    tools: list[FunctionTool] | None = None  # Agent's available tools
    context: str | None = None
    expected_output: str | None = None       # Ground-truth for comparison
    expected_tool_calls: list[ExpectedToolCall] | None = None
    split_strategy: ConversationSplitter | None = None

    query: str     # property — derived from conversation split
    response: str  # property — derived from conversation split
```

`conversation` is the single source of truth. `query` and `response` are derived properties — splitting the conversation at the last user message (default) and extracting text from each side. Changing the `split_strategy` consistently changes all derived values.

`tools` provides typed `FunctionTool` objects — including MCP tools, which are automatically extracted after agent runs.
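To make the derivation concrete, here is a standalone sketch of the default last-turn projection. It is illustrative only (the real logic lives behind the `EvalItem` properties and honors `split_strategy`), and it assumes each `Message` exposes `role` and `text`:

```python
def last_turn_projection(conversation):
    """Derive (query, response) strings by splitting at the last user message."""
    last_user = max(i for i, m in enumerate(conversation) if m.role == "user")
    query_messages = conversation[: last_user + 1]
    response_messages = conversation[last_user + 1 :]
    # String projections: the latest question, and the agent's answering text.
    query = query_messages[-1].text
    response = " ".join(m.text for m in response_messages if m.role == "assistant")
    return query, response
```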
### Internal: AgentEvalConverter

Internal class that converts agent-framework types to `EvalItem`. Used by `evaluate_agent()` and `evaluate_workflow()` — not part of the public API:

| Agent Framework | Eval Format |
|---|---|
| `Content.function_call` | `tool_call` in OpenAI chat format |
| `Content.function_result` | `tool_result` in OpenAI chat format |
| `FunctionTool` | `{name, description, parameters}` schema |
| `Message` history | `conversation` list + `query`/`response` extraction |

### Core: EvalResults

Rich result type with convenience properties for CI integration:

```python
results.all_passed       # bool: no failures or errors (recursive for workflow)
results.passed           # int: passing count
results.failed           # int: failure count
results.total            # int: total = passed + failed + errored
results.items            # list[EvalItemResult]: per-item detail with query, response, and scores
results.error            # str | None: error details on failure
results.sub_results      # dict: per-agent breakdown (workflow evals)
results.report_url       # str | None: portal link (Foundry)
results.assert_passed()  # raises AssertionError with details
```

### Core: Orchestration Functions

Provider-agnostic functions that extract data and delegate to evaluators:

| Function | What it does |
|---|---|
| `evaluate_agent()` | Runs agent against test queries (or evaluates pre-existing `responses=`), converts to `EvalItem`s, passes to evaluator. Accepts optional `expected_output=` for ground-truth comparison, `expected_tool_calls=` for tool-correctness evaluation, and `num_repetitions=` for consistency measurement |
| `evaluate_workflow()` | Extracts per-agent data from `WorkflowRunResult`, evaluates each agent and overall output. Per-agent breakdown in `sub_results`. Also accepts `num_repetitions=` |

### Core: Conversation Split Strategies

Multi-turn conversations must be split into query (input) and response (output) halves for evaluation. How you split determines *what you're evaluating*:

**Last-turn split** — split at the last user message. Everything up to and including it is the query context; the agent's subsequent actions are the response:

```
conversation:      user1 → assistant1 → user2 → assistant2(tool) → tool_result → assistant3
query_messages:    [user1, assistant1, user2]
response_messages: [assistant2(tool), tool_result, assistant3]
```

This evaluates: "Given all the context so far, did the agent answer the latest question well?" Best for response quality at a specific point in the conversation.

**Full-conversation split** — the first user message is the query; everything after is the response:

```
query_messages:    [user1]
response_messages: [assistant1, user2, assistant2(tool), tool_result, assistant3]
```

This evaluates: "Given the original request, did the entire conversation trajectory serve the user?" Best for task completion and overall conversation quality.

**Per-turn split** — produces N eval items from an N-turn conversation. Each turn is evaluated with its cumulative context:

```
item 1: query = [user1], response = [assistant1]
item 2: query = [user1, assistant1, user2], response = [assistant2(tool), tool_result, assistant3]
```

This evaluates each response independently. Best for fine-grained analysis and pinpointing where a conversation goes wrong.

These factorings produce different scores for the same conversation. The framework ships all three as built-in strategies, defaulting to last-turn. Developers can also provide a custom splitter — a function (Python) or `IConversationSplitter` implementation (.NET) — and override the strategy at the call site or per evaluator.
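As a sketch of that extension point: a custom splitter that scores only the agent's final answer, dropping intermediate tool chatter from the response half. The `(query_messages, response_messages)` return shape is assumed from the built-in strategies above, and passing a callable via `conversation_split=` is illustrative:

```python
def final_answer_split(conversation):
    """Query = context through the last user message (as in last-turn);
    response = only the final assistant message, ignoring tool traffic."""
    last_user = max(i for i, m in enumerate(conversation) if m.role == "user")
    tail = [m for m in conversation[last_user + 1 :] if m.role == "assistant"]
    return conversation[: last_user + 1], tail[-1:]

results = await evaluate_agent(
    agent=agent,
    queries=["Plan a 3-day trip to Paris"],
    evaluators=evals,
    conversation_split=final_answer_split,  # custom splitter at the call site
)
```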
### Azure AI: FoundryEvals

`Evaluator` implementation backed by Azure AI Foundry:

```python
class FoundryEvals:
    def __init__(self, *, project_client=None, openai_client=None,
                 model_deployment: str, evaluators=None, ...)
    async def evaluate(self, items, *, eval_name) -> EvalResults
```

**Smart auto-detection in `evaluate()`:**
- Default evaluators: relevance, coherence, task_adherence
- Auto-adds `tool_call_accuracy` when items have tools/`tool_definitions`
- Filters out tool evaluators for items without tools

### Azure AI: FoundryEvals Constants

```python
from agent_framework_azure_ai import FoundryEvals

evaluators = [FoundryEvals.RELEVANCE, FoundryEvals.TOOL_CALL_ACCURACY]
```

Categories: Agent behavior, Tool usage, Quality, Safety.

### Azure AI: Foundry-Specific Functions

| Function | What it does |
|---|---|
| `evaluate_traces()` | Evaluate from stored response IDs or OTel traces |
| `evaluate_foundry_target()` | Evaluate a Foundry-registered agent or deployment |

### Core: LocalEvaluator and Function Evaluators

`LocalEvaluator` implements the `Evaluator` protocol for fast, API-free evaluation. It runs check functions locally — useful for inner-loop development, CI smoke tests, and combining with cloud-based evaluators.

Built-in checks:
- `keyword_check(*keywords)` — response must contain specified keywords
- `tool_called_check(*tool_names)` — agent must have called specified tools
- `tool_calls_present` — all `expected_tool_calls` names appear in conversation (unordered, extras OK)
- `tool_call_args_match` — expected tool calls match on name + arguments (subset match on args)

Custom function evaluators use `@evaluator` to wrap plain Python functions. The function's **parameter names** determine what data it receives from the `EvalItem`:

```python
from agent_framework import evaluator, LocalEvaluator

# Tier 1: Simple check — just query + response
@evaluator
def is_concise(response: str) -> bool:
    return len(response.split()) < 500

# Tier 2: Ground truth — compare against expected output
@evaluator
def mentions_city(response: str, expected_output: str) -> bool:
    return expected_output.lower() in response.lower()

# Tier 3: Full context — inspect conversation and tools
@evaluator
def used_tools(conversation: list, tools: list) -> float:
    # ... scoring logic
    return score

local = LocalEvaluator(is_concise, mentions_city, used_tools)
```

Supported parameters: `query`, `response`, `expected_output`, `expected_tool_calls`, `conversation`, `tools`, `context`.
Return types: `bool`, `float` (≥0.5 = pass), `dict` with `score` or `passed` key, or `CheckResult`.

Async functions are handled automatically — `@evaluator` detects `async def` and produces the right wrapper.
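The injection mechanism is easy to picture: inspect the function's signature once, then pull matching fields off each `EvalItem`. A minimal sketch of the idea, not the actual implementation:

```python
import inspect

def call_with_injection(check_fn, item):
    """Invoke a check function with arguments drawn from EvalItem fields by name."""
    available = {
        "query": item.query,
        "response": item.response,
        "expected_output": item.expected_output,
        "expected_tool_calls": item.expected_tool_calls,
        "conversation": item.conversation,
        "tools": item.tools,
        "context": item.context,
    }
    wanted = inspect.signature(check_fn).parameters
    return check_fn(**{name: available[name] for name in wanted if name in available})
```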
### Example: GAIA Benchmark

[GAIA](https://huggingface.co/gaia-benchmark) tests real-world multi-step tasks with known expected answers. Each task has a question and a ground-truth answer, with optional file attachments. The framework accommodates GAIA's knobs (difficulty levels, file inputs, multi-step tool use) through the existing `EvalItem` fields:

```python
from datasets import load_dataset
from agent_framework import evaluate_agent, evaluator, LocalEvaluator

# Ground-truth answers are published for the validation split
gaia = load_dataset("gaia-benchmark/GAIA", "2023_level1", split="validation")

@evaluator
def exact_match(response: str, expected_output: str) -> bool:
    return expected_output.strip().lower() in response.strip().lower()

# Simple path — evaluate_agent handles running + expected_output stamping
results = await evaluate_agent(
    agent=agent,
    queries=[task["Question"] for task in gaia],
    expected_output=[task["Final answer"] for task in gaia],
    evaluators=LocalEvaluator(exact_match),
)
```

### Package Location

- Core types and orchestration: `agent_framework._eval`, `agent_framework._local_eval` (Python), `Microsoft.Agents.AI` (.NET)
- Foundry provider: `agent_framework_azure_ai._foundry_evals` (Python), `Microsoft.Agents.AI.AzureAI` (.NET)
- Azure-AI re-exports core types for convenience (Python)

## Known Limitations

1. **Tool evaluators require query + agent**: Tool evaluators need tool definition schemas. When using these evaluators with `evaluate_agent(responses=...)`, provide `queries=` and pass an agent with tool definitions.
2. **`model_deployment` always required**: Could potentially be inferred from the Foundry project configuration.

## Open Questions

1. **Red teaming non-registered agents**: Requires Foundry API support for callback-based flows.
2. **Datasets with expected outputs**: A dataset abstraction for pre-populating `expected_output` values across eval runs is a natural next step but not yet designed.
3. **Multi-modal evaluation**: The `conversation` field on `EvalItem` already stores full `Message`/`Content` (Python) and `ChatMessage` (.NET) objects, which can represent multi-modal content (images, audio, structured data). Evaluators that accept the full `EvalItem` or `conversation` parameter can access this content today. However, the convenience shortcuts — `query`/`response` string projections and the `FunctionEvaluator` string overloads — are text-only. Multi-modal-aware evaluators should use the full-item path (the `Func<EvalItem, ...>` overloads in .NET, the `conversation: list` parameter in Python).

## .NET Implementation Design

### Key Difference: MEAI Ecosystem

Unlike Python, the .NET ecosystem already has `Microsoft.Extensions.AI.Evaluation` (v10.3.0) providing:

- `IEvaluator` — per-item evaluation of `(messages, chatResponse) → EvaluationResult`
- `CompositeEvaluator` — combines multiple evaluators
- Quality evaluators — `RelevanceEvaluator`, `CoherenceEvaluator`, `GroundednessEvaluator`
- Safety evaluators — `ContentHarmEvaluator`, `ProtectedMaterialEvaluator`
- Metric types — `NumericMetric`, `BooleanMetric`, `StringMetric`

The .NET integration uses MEAI's `IEvaluator` directly — no new evaluator interface. Our contribution is the **orchestration layer**: extension methods that run agents, extract data, call `IEvaluator` per item, and aggregate results.
### Architecture

```
┌──────────────────────────────────────────────────────────────┐
│ Developer Code                                               │
│   agent.EvaluateAsync(queries, evaluator)                    │
│   run.EvaluateAsync(evaluator)                               │
└────────────────┬─────────────────────────────────────────────┘
                 │
┌────────────────▼─────────────────────────────────────────────┐
│ Orchestration Layer (Microsoft.Agents.AI)                    │
│   AgentEvaluationExtensions — runs agents, extracts data,    │
│   calls IEvaluator per item, aggregates into                 │
│   AgentEvaluationResults                                     │
└────────────────┬─────────────────────────────────────────────┘
                 │ IEvaluator (MEAI)
                 │
     ┌───────────┼────────────┐
     │           │            │
 ┌───▼────┐  ┌───▼────┐  ┌────▼──────────┐
 │ MEAI   │  │ Local  │  │ Foundry       │
 │ Quality│  │ Checks │  │ (cloud batch) │
 │ Safety │  │ Lambdas│  │               │
 └────────┘  └────────┘  └───────────────┘
```

All evaluators implement MEAI's `IEvaluator`. The orchestration layer doesn't need to know which kind — it calls `EvaluateAsync(messages, chatResponse)` per item on all of them. `FoundryEvals` handles batching internally (buffers items, submits once, returns per-item results).

### .NET Core Types

**No new evaluator interface.** Use MEAI's `IEvaluator` directly.

**`AgentEvaluationResults`** — The only new type. Aggregates per-item MEAI `EvaluationResult`s across a batch of queries:

```csharp
public class AgentEvaluationResults
{
    public string Provider { get; init; }
    public string? ReportUrl { get; init; }

    // Per-item — standard MEAI EvaluationResult, unchanged
    public IReadOnlyList<EvaluationResult> Items { get; init; }

    // Aggregate pass/fail derived from metric interpretations
    public int Passed { get; }
    public int Failed { get; }
    public int Total { get; }
    public bool AllPassed { get; }

    // Workflow: per-agent breakdown
    public IReadOnlyDictionary<string, AgentEvaluationResults>? SubResults { get; init; }

    public void AssertAllPassed(string? message = null);
}
```

### .NET Evaluator Implementations

All implement MEAI's `IEvaluator`:

**`LocalEvaluator`** — Runs lambda checks locally, returns `BooleanMetric` per check:

```csharp
var local = new LocalEvaluator(
    FunctionEvaluator.Create("is_concise",
        (string response) => response.Split().Length < 500),
    EvalChecks.KeywordCheck("weather"),
    EvalChecks.ToolCalledCheck("get_weather"));
```

**MEAI evaluators** — Used directly, no adapter needed:

```csharp
var quality = new CompositeEvaluator(
    new RelevanceEvaluator(),
    new CoherenceEvaluator());
```

**`FoundryEvals`** — Implements `IEvaluator` but batches internally. On first call, buffers the item. On the last item (or when explicitly flushed), submits the batch to Foundry and distributes per-item results:

```csharp
var foundry = new FoundryEvals(projectClient, "gpt-4o");
```

### .NET Orchestration: Extension Methods

```csharp
public static class AgentEvaluationExtensions
{
    // Evaluate an agent against test queries
    public static Task<AgentEvaluationResults> EvaluateAsync(
        this AIAgent agent,
        IEnumerable<string> queries,
        IEvaluator evaluator,
        ChatConfiguration? chatConfiguration = null,
        IEnumerable<string>? expectedOutput = null,
        CancellationToken cancellationToken = default);

    // Evaluate pre-existing responses (without re-running the agent)
    public static Task<AgentEvaluationResults> EvaluateAsync(
        this AIAgent agent,
        AgentResponse responses,
        IEvaluator evaluator,
        IEnumerable<string>? queries = null,
        ChatConfiguration? chatConfiguration = null,
        IEnumerable<string>? expectedOutput = null,
        CancellationToken cancellationToken = default);

    // Evaluate with multiple evaluators (one result per evaluator)
    public static Task<IReadOnlyList<AgentEvaluationResults>> EvaluateAsync(
        this AIAgent agent,
        IEnumerable<string> queries,
        IEnumerable<IEvaluator> evaluators,
        ChatConfiguration? chatConfiguration = null,
        IEnumerable<string>? expectedOutput = null,
        CancellationToken cancellationToken = default);

    // Evaluate a workflow run with per-agent breakdown
    public static Task<AgentEvaluationResults> EvaluateAsync(
        this Run run,
        IEvaluator evaluator,
        ChatConfiguration? chatConfiguration = null,
        bool includeOverall = true,
        bool includePerAgent = true,
        CancellationToken cancellationToken = default);
}
```
**Usage:**

```csharp
// MEAI evaluators — just works
var results = await agent.EvaluateAsync(
    queries: ["What's the weather?"],
    evaluator: new RelevanceEvaluator(),
    chatConfiguration: new ChatConfiguration(evalClient));

// Local checks
var results = await agent.EvaluateAsync(
    queries: ["What's the weather?"],
    evaluator: new LocalEvaluator(
        EvalChecks.KeywordCheck("weather")));

// Foundry cloud
var results = await agent.EvaluateAsync(
    queries: ["What's the weather?"],
    evaluator: new FoundryEvals(projectClient, "gpt-4o"));

// Evaluate existing response (without re-running the agent)
var response = await agent.RunAsync("What's the weather?");
var results = await agent.EvaluateAsync(
    responses: response,
    queries: ["What's the weather?"],
    evaluator: new FoundryEvals(projectClient, "gpt-4o"));

// Mixed — one result per evaluator
var results = await agent.EvaluateAsync(
    queries: ["What's the weather?"],
    evaluators: [
        new LocalEvaluator(EvalChecks.KeywordCheck("weather")),
        new RelevanceEvaluator(),
        new FoundryEvals(projectClient, "gpt-4o")
    ],
    chatConfiguration: new ChatConfiguration(evalClient));

// Workflow with per-agent breakdown
Run run = await workflowRunner.RunAsync(workflow, "Plan a trip");
var results = await run.EvaluateAsync(
    evaluator: new FoundryEvals(projectClient, "gpt-4o"));
```

### .NET Function Evaluators

Typed factory overloads (C# equivalent of Python's `@evaluator`):

```csharp
public static class FunctionEvaluator
{
    public static EvalCheck Create(string name, Func<string, bool> check);                // response only
    public static EvalCheck Create(string name, Func<string, string, bool> check);        // response + expectedOutput
    public static EvalCheck Create(string name, Func<EvalItem, bool> check);              // full item
    public static EvalCheck Create(string name, Func<EvalItem, EvaluationResult> check);  // full control
    public static EvalCheck Create(string name, Func<EvalItem, Task<bool>> check);        // async
}
```

`EvalItem` is a lightweight record used only by `FunctionEvaluator` and `LocalEvaluator` to pass context to check functions. It is not part of the `IEvaluator` interface:

```csharp
public record ExpectedToolCall(string Name, IReadOnlyDictionary<string, object?>? Arguments = null);

public sealed class EvalItem
{
    public EvalItem(string query, string response, IReadOnlyList<ChatMessage> conversation);

    public string Query { get; }
    public string Response { get; }
    public IReadOnlyList<ChatMessage> Conversation { get; }
    public IReadOnlyList<AITool>? Tools { get; set; }
    public string? ExpectedOutput { get; set; }
    public IReadOnlyList<ExpectedToolCall>? ExpectedToolCalls { get; set; }
    public string? Context { get; set; }
    public IConversationSplitter? Splitter { get; set; }
}
```
### Workflow Data Extraction (.NET)

`run.EvaluateAsync()` walks `Run.OutgoingEvents` via LINQ:

1. Pair `ExecutorInvokedEvent` / `ExecutorCompletedEvent` by `ExecutorId`
2. Extract `AgentResponseEvent` for per-agent `ChatResponse`
3. Call `evaluator.EvaluateAsync()` per invocation
4. Group by `ExecutorId` for per-agent `SubResults`
5. Use final workflow output for overall eval

### .NET Package Structure

| Package | Contents |
|---------|----------|
| `Microsoft.Agents.AI` | `AgentEvaluationResults`, `LocalEvaluator`, `FunctionEvaluator`, `EvalChecks`, `EvalItem`, `ExpectedToolCall`, `AgentEvaluationExtensions` |
| `Microsoft.Agents.AI.AzureAI` | `FoundryEvals` (provider + constants) |

### Python ↔ .NET Mapping

| Python | .NET |
|--------|------|
| `Evaluator` protocol | MEAI `IEvaluator` (reused directly; no new interface) |
| `EvalItem` dataclass | `EvalItem` class |
| `EvalResults` | `AgentEvaluationResults` |
| `EvalItemResult` / `EvalScoreResult` | MEAI `EvaluationResult` / `EvaluationMetric` (reused) |
| `LocalEvaluator` | `LocalEvaluator` (implements `IEvaluator`) |
| `@evaluator` | `FunctionEvaluator.Create()` overloads |
| `keyword_check()` / `tool_called_check()` | `EvalChecks.KeywordCheck()` / `EvalChecks.ToolCalledCheck()` |
| `tool_calls_present` / `tool_call_args_match` | (custom `FunctionEvaluator` — same pattern) |
| `ExpectedToolCall` dataclass | `ExpectedToolCall` record |
| `FoundryEvals` | `FoundryEvals` (implements `IEvaluator`, includes evaluator name constants) |
| `evaluate_agent()` | `agent.EvaluateAsync(queries, evaluator)` extension method |
| `evaluate_agent(responses=)` | `agent.EvaluateAsync(responses, evaluator)` extension method |
| `evaluate_workflow()` | `run.EvaluateAsync()` extension method |

## More Information

- [Foundry Evals documentation](https://learn.microsoft.com/azure/ai-foundry/concepts/evaluation-approach-gen-ai) — Azure AI Foundry evaluation overview