diff --git a/docs/decisions/0020-foundry-evals-integration.md b/docs/decisions/0020-foundry-evals-integration.md
new file mode 100644
index 0000000000..f5b5db4db5
--- /dev/null
+++ b/docs/decisions/0020-foundry-evals-integration.md
@@ -0,0 +1,815 @@
---
status: accepted
contact: bentho
date: 2026-02-27
deciders: bentho, markwallace-microsoft, westey-m
consulted: Pratyush Mishra, Shivam Shrivastava, Manni Arora (Centrica eval scenario)
informed: Agent Framework team, Foundry Evals team
---

# Agent Evaluation Architecture with Azure AI Foundry Integration

## Context and Problem Statement

Azure AI Foundry provides a rich evaluation service for AI agents — built-in evaluators for agent behavior (task adherence, intent resolution), tool usage (tool call accuracy, tool selection), quality (coherence, fluency, relevance), and safety (violence, self-harm, prohibited actions). Results are viewable in the Foundry portal with dashboards and comparison views.

However, using Foundry Evals with an agent-framework agent today requires significant manual effort. Developers must:

1. Transform agent-framework's `Message`/`Content` types into the OpenAI-style agent message schema that Foundry evaluators expect
2. Map tool definitions from agent-framework's `FunctionTool` format to evaluator-compatible schemas
3. Manually wire up the correct Foundry data source type (`azure_ai_traces`, `jsonl`, `azure_ai_target_completions`, etc.) depending on their scenario
4. Handle App Insights trace ID queries, response ID collection, and eval polling

Additionally, evaluation is a concern that extends beyond any single provider. Developers may want to use local evaluators (LLM-as-judge, regex, keyword matching), third-party evaluation libraries, or multiple providers in combination. The architecture must support this without creating a Foundry-specific lock-in at the API level.

### Functional Requirements for Agent Evaluation

- **Single agents and workflows.** Evaluate both individual agent responses and multi-agent workflow results, with per-agent breakdown to pinpoint underperformance.
- **One-shot and multi-turn conversations.** Capture full conversation trajectories — including tool calls and results — not just final query/response pairs.
- **Conversation factoring.** Support splitting conversations into query/response in multiple ways (last turn, full trajectory, per-turn) because different factorings measure different things.
- **Multiple providers, mix and match.** Run Foundry LLM-as-judge evaluators alongside fast local checks and custom evaluators on the same data, without restructuring code.
- **Third-party extensibility.** Any evaluation library can participate by implementing the `Evaluator` protocol (Python) or MEAI's `IEvaluator` interface (.NET). No predetermined list of supported libraries — the protocol is intentionally simple (`evaluate(items) → results`) so that wrappers for libraries like DeepEval, RAGAS, or Promptfoo are straightforward to write (a sketch follows this list).
- **Bring your own evaluator.** Creating a custom evaluator should be as simple as writing a function.
- **Evaluate without re-running.** Evaluate existing responses from logs or previous runs without invoking the agent again.
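To illustrate the third-party path, here is a minimal sketch of a wrapper that conforms to the protocol. Everything in it is illustrative: only the protocol shape (a `name` plus `evaluate(items)` returning results) comes from this ADR, and the stand-in dataclasses merely approximate the real `EvalResults`/`EvalItemResult` types described under "What To Build":

```python
from dataclasses import dataclass, field

@dataclass
class _ItemResult:   # stand-in for EvalItemResult
    query: str
    response: str
    scores: dict[str, float]

@dataclass
class _Results:      # stand-in for EvalResults
    status: str
    items: list[_ItemResult] = field(default_factory=list)

class WordOverlapEvaluator:
    """Hypothetical wrapper: scores each response by word overlap with expected_output."""

    name = "word_overlap"

    async def evaluate(self, items, *, eval_name: str = "Agent Framework Eval") -> _Results:
        results = _Results(status="completed")
        for item in items:
            expected = set((item.expected_output or "").lower().split())
            actual = set(item.response.lower().split())
            score = len(expected & actual) / len(expected) if expected else 0.0
            results.items.append(_ItemResult(item.query, item.response, {self.name: score}))
        return results
```

Because orchestration depends only on the protocol, such a wrapper can be passed anywhere a built-in evaluator can, including in the same `evaluators=` list as Foundry.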
## Decision Drivers

- **Zero-friction evaluation**: Developers should go from "I have an agent" to "I have eval results" with minimal code.
- **Provider-agnostic API**: Core evaluation capabilities must not be tied to any specific provider. Provider configuration should be separate from the evaluation call.
- **Lowest concept count**: Introduce the fewest possible new types, abstractions, and APIs for developers to learn.
- **Leverage existing knowledge**: The framework already knows which agents exist, what tools they have, and what conversations occurred. Evals should use this automatically rather than requiring the developer to re-specify it.
- **Foundry-native results**: When using Foundry, results should be viewable in the Foundry portal with dashboards and comparison views.
- **Progressive disclosure**: Simple scenarios should be near-zero code. Advanced scenarios should build on the same primitives.
- **Cross-language parity**: Design must be implementable in both Python and .NET.

## Considered Options

1. **Provider-specific functions** — Build Foundry-specific helper functions (`evaluate_agent()`, etc.) directly in the Azure package. All eval functions take Foundry connection parameters.
2. **Evaluator protocol with shared orchestration** — Define a provider-agnostic `Evaluator` protocol in the base agent library (`agent_framework` in Python, `Microsoft.Agents.AI` in .NET). Orchestration functions live alongside it. Providers implement the protocol.
3. **Full eval framework** — Build comprehensive eval infrastructure including custom evaluator definitions, scoring profiles, and reporting inside agent-framework.

## Decision Outcome

Chosen option: "Evaluator protocol with shared orchestration", because it delivers the low-friction developer experience, supports multiple providers without API changes, and keeps the concept count low.

### Usage Examples

#### Evaluate an agent

The agent is invoked once per query by default. For statistically meaningful evaluation, provide multiple diverse queries. For measuring **consistency** (does the same query produce reliable results?), use `num_repetitions` to run each query N times independently:

**Python:**

```python
evals = FoundryEvals(
    project_client=client,
    model_deployment="gpt-4o",
    evaluators=[FoundryEvals.RELEVANCE, FoundryEvals.COHERENCE],
)

results = await evaluate_agent(
    agent=my_agent,
    queries=[
        "What's the weather in Seattle?",
        "Plan a weekend trip to Portland",
        "What restaurants are near Pike Place?",
    ],
    evaluators=evals,
)
for r in results:
    r.assert_passed()
```

**C#:**

```csharp
var evals = new FoundryEvals(chatConfiguration, FoundryEvals.Relevance, FoundryEvals.Coherence);

AgentEvaluationResults results = await agent.EvaluateAsync(
    new[] {
        "What's the weather in Seattle?",
        "Plan a weekend trip to Portland",
        "What restaurants are near Pike Place?",
    },
    evals);

results.AssertAllPassed();
```

`evaluate_agent` returns one `EvalResults` per evaluator. Each result contains per-item scores with the evaluated response for auditing:

```
# results[0] (FoundryEvals)
EvalResults(status="completed", passed=3, failed=0, total=3)
  items[0]: EvalItemResult(
      query="What's the weather in Seattle?",
      response="It's currently 72°F and sunny in Seattle.",
      scores={"relevance": 5, "coherence": 5})
  items[1]: EvalItemResult(
      query="Plan a weekend trip to Portland",
      response="Here's a 2-day Portland itinerary...",
      scores={"relevance": 4, "coherence": 5})
  items[2]: EvalItemResult(
      query="What restaurants are near Pike Place?",
      response="Top restaurants near Pike Place Market: ...",
      scores={"relevance": 5, "coherence": 4})
```

#### Measure consistency with repetitions

Run each query multiple times to detect non-deterministic behavior:

**Python:**

```python
results = await evaluate_agent(
    agent=my_agent,
    queries=["What's the weather in Seattle?"],
    evaluators=evals,
    num_repetitions=3,  # each query runs 3 times independently
)
# results contain 3 items (1 query × 3 repetitions)
```

**C#:**

```csharp
AgentEvaluationResults results = await agent.EvaluateAsync(
    new[] { "What's the weather in Seattle?" },
    evals,
    numRepetitions: 3);  // each query runs 3 times independently
// results contain 3 items (1 query × 3 repetitions)
```
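For consistency runs it helps to aggregate the repeated scores per query. A small sketch that assumes only the `EvalResults.items` shape shown above (each item carrying `query` and a `scores` dict); `score_spread` is a hypothetical helper, not framework API:

```python
from collections import defaultdict
from statistics import mean, pstdev

def score_spread(results, metric: str) -> dict[str, tuple[float, float]]:
    """Mean and spread of one metric, grouped by query across repetitions."""
    by_query: dict[str, list[float]] = defaultdict(list)
    for item in results.items:
        by_query[item.query].append(item.scores[metric])
    return {q: (mean(vals), pstdev(vals)) for q, vals in by_query.items()}

# A large spread on a query flags non-deterministic agent behavior.
spread = score_spread(results, "relevance")
```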
#### Evaluate a response you already have

When you already have agent responses, pass them directly to skip re-running the agent. Each query is paired with its corresponding response:

**Python:**

```python
queries = ["What's the weather?", "What's the capital of France?"]
responses = [await agent.run([Message("user", [q])]) for q in queries]

results = await evaluate_agent(
    responses=responses,
    evaluators=evals,
)
```

**C#:**

```csharp
var queries = new[] { "What's the weather?" };
var responses = new List<AgentResponse>();
foreach (var q in queries)
    responses.Add(await agent.RunAsync(new[] { new ChatMessage(ChatRole.User, q) }));

AgentEvaluationResults results = await agent.EvaluateAsync(
    responses: responses,
    evals);
```

Each `AgentResponse` already contains the conversation (query + response), so the evaluator extracts query/response from the conversation. When you pass `responses` without `queries`, the conversation is the source of truth.
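The same no-rerun path covers logged conversations: because evaluators accept `EvalItem`s directly (see "Core: EvalItem" below), you can rebuild items from stored messages and call the evaluator yourself. A sketch under assumptions: the `Message(role, [text])` construction mirrors the example above, while the import path and the JSONL record layout are illustrative:

```python
import json

from agent_framework import EvalItem, Message  # assumed import path

def items_from_log(path: str) -> list[EvalItem]:
    """One EvalItem per logged conversation; query/response derive from the messages."""
    items = []
    with open(path) as f:
        for line in f:
            record = json.loads(line)  # e.g. {"messages": [{"role": "...", "text": "..."}]}
            conversation = [Message(m["role"], [m["text"]]) for m in record["messages"]]
            items.append(EvalItem(conversation=conversation))
    return items

results = await evals.evaluate(items_from_log("runs.jsonl"), eval_name="offline replay")
```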
#### Evaluate with conversation split strategies

By default, evaluators see only the last turn (final user message → final assistant response). For multi-turn conversations, you can control how the conversation is factored for evaluation:

**Python:**

```python
results = await evaluate_agent(
    agent=agent,
    queries=["Plan a 3-day trip to Paris"],
    evaluators=evals,
    conversation_split=ConversationSplit.FULL,  # evaluate entire trajectory
)

# Or per-turn: each user→assistant exchange scored independently
results = await evaluate_agent(
    agent=agent,
    queries=["Plan a 3-day trip to Paris"],
    evaluators=evals,
    conversation_split=ConversationSplit.PER_TURN,
)
```

**C#:**

```csharp
// Full conversation as context
AgentEvaluationResults results = await agent.EvaluateAsync(
    new[] { "Plan a 3-day trip to Paris" },
    evals,
    splitter: ConversationSplitters.Full);

// Per-turn splitting
var items = EvalItem.PerTurnItems(conversation);  // one EvalItem per user turn
var results = await evals.EvaluateAsync(items);
```

With `PER_TURN`, a 3-turn conversation produces 3 scored items:

```
EvalResults(status="completed", passed=3, failed=0, total=3)
  items[0]: query="Plan a 3-day trip to Paris"  scores={"relevance": 5}
  items[1]: query="What about restaurants?"     scores={"relevance": 4}
  items[2]: query="Make it budget-friendly"     scores={"relevance": 5}
```

#### Evaluate a multi-agent workflow

**Python:**

```python
result = await workflow.run("Plan a trip to Paris")
eval_results = await evaluate_workflow(
    workflow=workflow,
    workflow_result=result,
    evaluators=evals,
)

for r in eval_results:
    print(f"  overall: {r.passed}/{r.total}")
    for name, sub in r.sub_results.items():
        print(f"  {name}: {sub.passed}/{sub.total}")
```

**C#:**

```csharp
WorkflowRunResult result = await workflow.RunAsync("Plan a trip to Paris");

IReadOnlyList<AgentEvaluationResults> evalResults = await result.EvaluateAsync(evals);

foreach (var r in evalResults)
{
    Console.WriteLine($"  overall: {r.Passed}/{r.Total}");
    foreach (var (name, sub) in r.SubResults)
        Console.WriteLine($"  {name}: {sub.Passed}/{sub.Total}");
}
```

Workflows return one result per evaluator, with sub-results per agent in the workflow:

```
EvalResults(status="completed", passed=2, failed=0, total=2)
  sub_results:
    "planner": EvalResults(passed=1, total=1)
    "researcher": EvalResults(passed=1, total=1)
```

#### Mix multiple providers

**Python:**

```python
@evaluator
def is_helpful(response: str) -> bool:
    return len(response.split()) > 10

foundry = FoundryEvals(
    project_client=client,
    model_deployment="gpt-4o",
    evaluators=[FoundryEvals.RELEVANCE, FoundryEvals.COHERENCE],
)

results = await evaluate_agent(
    agent=agent,
    queries=queries,
    evaluators=[is_helpful, keyword_check("weather"), foundry],
)
```

**C#:**

```csharp
IReadOnlyList<AgentEvaluationResults> results = await agent.EvaluateAsync(
    queries,
    evaluators: new IEvaluator[]
    {
        new LocalEvaluator(
            EvalChecks.KeywordCheck("weather"),
            FunctionEvaluator.Create("is_helpful", (string r) => r.Split(' ').Length > 10)),
        new FoundryEvals(chatConfiguration, FoundryEvals.Relevance, FoundryEvals.Coherence),
    });
```

Multiple evaluators return one result each — `results[0]` is the local evaluator, `results[1]` is Foundry.
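Since results come back in evaluator order, zipping them against the evaluator list gives a quick per-provider summary. A small reporting sketch reusing the Python example above; the `getattr` fallback is an assumption about how `@evaluator`-wrapped functions expose their name:

```python
evaluators = [is_helpful, keyword_check("weather"), foundry]
results = await evaluate_agent(agent=agent, queries=queries, evaluators=evaluators)

for ev, res in zip(evaluators, results):
    name = getattr(ev, "name", getattr(ev, "__name__", repr(ev)))
    print(f"{name}: {res.passed}/{res.total} passed")
```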
#### Custom function evaluators

**Python:**

```python
@evaluator
def mentions_city(response: str, expected_output: str) -> bool:
    return expected_output.lower() in response.lower()

@evaluator
def used_tools(conversation: list, tools: list) -> float:
    # ... scoring logic
    return score

local = LocalEvaluator(mentions_city, used_tools)
```

`@evaluator` uses **parameter name injection** — the function's parameter names determine what data it receives from the `EvalItem`. Supported names: `query`, `response`, `expected_output`, `expected_tool_calls`, `conversation`, `tools`, `context`. Any combination is valid.

**C#:**

```csharp
var local = new LocalEvaluator(
    FunctionEvaluator.Create("mentions_city",
        (EvalItem item) => item.ExpectedOutput != null
            && item.Response.Contains(item.ExpectedOutput, StringComparison.OrdinalIgnoreCase)),
    FunctionEvaluator.Create("is_concise",
        (string response) => response.Split(' ').Length < 500));
```

## What To Build

### Core: Evaluator Protocol

A runtime-checkable protocol that any evaluation provider implements:

```python
@runtime_checkable
class Evaluator(Protocol):
    name: str

    async def evaluate(
        self, items: Sequence[EvalItem], *, eval_name: str = "Agent Framework Eval"
    ) -> EvalResults: ...
```

The protocol is minimal — just `name` and `evaluate()`.

### Core: EvalItem

Provider-agnostic data format for items to evaluate:

```python
@dataclass
class ExpectedToolCall:
    name: str                                # Tool/function name
    arguments: dict[str, Any] | None = None  # None = don't check args

@dataclass
class EvalItem:
    conversation: list[Message]              # Single source of truth
    tools: list[FunctionTool] | None = None  # Agent's available tools
    context: str | None = None
    expected_output: str | None = None       # Ground-truth for comparison
    expected_tool_calls: list[ExpectedToolCall] | None = None
    split_strategy: ConversationSplitter | None = None

    query: str     # property — derived from conversation split
    response: str  # property — derived from conversation split
```

`conversation` is the single source of truth. `query` and `response` are derived properties — splitting the conversation at the last user message (default) and extracting text from each side. Changing the `split_strategy` consistently changes all derived values.

`tools` provides typed `FunctionTool` objects — including MCP tools, which are automatically extracted after agent runs.
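To make the derivation concrete, here is a standalone sketch of the default last-turn projection. It is illustrative only (the real logic lives behind the `EvalItem` properties and honors `split_strategy`), and it assumes each `Message` exposes `role` and `text`:

```python
def last_turn_projection(conversation):
    """Derive (query, response) strings by splitting at the last user message."""
    last_user = max(i for i, m in enumerate(conversation) if m.role == "user")
    query_messages = conversation[: last_user + 1]
    response_messages = conversation[last_user + 1 :]
    # String projections: the latest question, and the agent's answering text.
    query = query_messages[-1].text
    response = " ".join(m.text for m in response_messages if m.role == "assistant")
    return query, response
```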
### Internal: AgentEvalConverter

Internal class that converts agent-framework types to `EvalItem`. Used by `evaluate_agent()` and `evaluate_workflow()` — not part of the public API:

| Agent Framework | Eval Format |
|---|---|
| `Content.function_call` | `tool_call` in OpenAI chat format |
| `Content.function_result` | `tool_result` in OpenAI chat format |
| `FunctionTool` | `{name, description, parameters}` schema |
| `Message` history | `conversation` list + `query`/`response` extraction |

### Core: EvalResults

Rich result type with convenience properties for CI integration:

```python
results.all_passed       # bool: no failures or errors (recursive for workflow)
results.passed           # int: passing count
results.failed           # int: failure count
results.total            # int: total = passed + failed + errored
results.items            # list[EvalItemResult]: per-item detail with query, response, and scores
results.error            # str | None: error details on failure
results.sub_results      # dict: per-agent breakdown (workflow evals)
results.report_url       # str | None: portal link (Foundry)
results.assert_passed()  # raises AssertionError with details
```

### Core: Orchestration Functions

Provider-agnostic functions that extract data and delegate to evaluators:

| Function | What it does |
|---|---|
| `evaluate_agent()` | Runs agent against test queries (or evaluates pre-existing `responses=`), converts to `EvalItem`s, passes to evaluator. Accepts optional `expected_output=` for ground-truth comparison, `expected_tool_calls=` for tool-correctness evaluation, and `num_repetitions=` for consistency measurement |
| `evaluate_workflow()` | Extracts per-agent data from `WorkflowRunResult`, evaluates each agent and overall output. Per-agent breakdown in `sub_results`. Also accepts `num_repetitions=` |

### Core: Conversation Split Strategies

Multi-turn conversations must be split into query (input) and response (output) halves for evaluation. How you split determines *what you're evaluating*:

**Last-turn split** — split at the last user message. Everything up to and including it is the query context; the agent's subsequent actions are the response:

```
conversation:      user1 → assistant1 → user2 → assistant2(tool) → tool_result → assistant3
query_messages:    [user1, assistant1, user2]
response_messages: [assistant2(tool), tool_result, assistant3]
```

This evaluates: "Given all the context so far, did the agent answer the latest question well?" Best for response quality at a specific point in the conversation.

**Full-conversation split** — the first user message is the query; everything after is the response:

```
query_messages:    [user1]
response_messages: [assistant1, user2, assistant2(tool), tool_result, assistant3]
```

This evaluates: "Given the original request, did the entire conversation trajectory serve the user?" Best for task completion and overall conversation quality.

**Per-turn split** — produces N eval items from an N-turn conversation. Each turn is evaluated with its cumulative context:

```
item 1: query = [user1], response = [assistant1]
item 2: query = [user1, assistant1, user2], response = [assistant2(tool), tool_result, assistant3]
```

This evaluates each response independently. Best for fine-grained analysis and pinpointing where a conversation goes wrong.

These factorings produce different scores for the same conversation. The framework ships all three as built-in strategies, defaulting to last-turn. Developers can also provide a custom splitter — a function (Python) or `IConversationSplitter` implementation (.NET) — and override the strategy at the call site or per evaluator.
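As a sketch of that extension point: a custom splitter that scores only the agent's final answer, dropping intermediate tool chatter from the response half. The `(query_messages, response_messages)` return shape is assumed from the built-in strategies above, and passing a callable via `conversation_split=` is illustrative:

```python
def final_answer_split(conversation):
    """Query = context through the last user message (as in last-turn);
    response = only the final assistant message, ignoring tool traffic."""
    last_user = max(i for i, m in enumerate(conversation) if m.role == "user")
    tail = [m for m in conversation[last_user + 1 :] if m.role == "assistant"]
    return conversation[: last_user + 1], tail[-1:]

results = await evaluate_agent(
    agent=agent,
    queries=["Plan a 3-day trip to Paris"],
    evaluators=evals,
    conversation_split=final_answer_split,  # custom splitter at the call site
)
```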
### Azure AI: FoundryEvals

`Evaluator` implementation backed by Azure AI Foundry:

```python
class FoundryEvals:
    def __init__(self, *, project_client=None, openai_client=None,
                 model_deployment: str, evaluators=None, ...)
    async def evaluate(self, items, *, eval_name) -> EvalResults
```

**Smart auto-detection in `evaluate()`:**
- Default evaluators: relevance, coherence, task_adherence
- Auto-adds `tool_call_accuracy` when items have tools/`tool_definitions`
- Filters out tool evaluators for items without tools

### Azure AI: FoundryEvals Constants

```python
from agent_framework_azure_ai import FoundryEvals

evaluators = [FoundryEvals.RELEVANCE, FoundryEvals.TOOL_CALL_ACCURACY]
```

Categories: Agent behavior, Tool usage, Quality, Safety.

### Azure AI: Foundry-Specific Functions

| Function | What it does |
|---|---|
| `evaluate_traces()` | Evaluate from stored response IDs or OTel traces |
| `evaluate_foundry_target()` | Evaluate a Foundry-registered agent or deployment |

### Core: LocalEvaluator and Function Evaluators

`LocalEvaluator` implements the `Evaluator` protocol for fast, API-free evaluation. It runs check functions locally — useful for inner-loop development, CI smoke tests, and combining with cloud-based evaluators.

Built-in checks:
- `keyword_check(*keywords)` — response must contain specified keywords
- `tool_called_check(*tool_names)` — agent must have called specified tools
- `tool_calls_present` — all `expected_tool_calls` names appear in conversation (unordered, extras OK)
- `tool_call_args_match` — expected tool calls match on name + arguments (subset match on args)

Custom function evaluators use `@evaluator` to wrap plain Python functions. The function's **parameter names** determine what data it receives from the `EvalItem`:

```python
from agent_framework import evaluator, LocalEvaluator

# Tier 1: Simple check — just query + response
@evaluator
def is_concise(response: str) -> bool:
    return len(response.split()) < 500

# Tier 2: Ground truth — compare against expected output
@evaluator
def mentions_city(response: str, expected_output: str) -> bool:
    return expected_output.lower() in response.lower()

# Tier 3: Full context — inspect conversation and tools
@evaluator
def used_tools(conversation: list, tools: list) -> float:
    # ... scoring logic
    return score

local = LocalEvaluator(is_concise, mentions_city, used_tools)
```

Supported parameters: `query`, `response`, `expected_output`, `expected_tool_calls`, `conversation`, `tools`, `context`.
Return types: `bool`, `float` (≥0.5 = pass), `dict` with `score` or `passed` key, or `CheckResult`.

Async functions are handled automatically — `@evaluator` detects `async def` and produces the right wrapper.
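The injection mechanism is easy to picture: inspect the function's signature once, then pull matching fields off each `EvalItem`. A minimal sketch of the idea, not the actual implementation:

```python
import inspect

def call_with_injection(check_fn, item):
    """Invoke a check function with arguments drawn from EvalItem fields by name."""
    available = {
        "query": item.query,
        "response": item.response,
        "expected_output": item.expected_output,
        "expected_tool_calls": item.expected_tool_calls,
        "conversation": item.conversation,
        "tools": item.tools,
        "context": item.context,
    }
    wanted = inspect.signature(check_fn).parameters
    return check_fn(**{name: available[name] for name in wanted if name in available})
```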
### Example: GAIA Benchmark

[GAIA](https://huggingface.co/gaia-benchmark) tests real-world multi-step tasks with known expected answers. Each task has a question and a ground-truth answer, with optional file attachments. The framework accommodates GAIA's knobs (difficulty levels, file inputs, multi-step tool use) through the existing `EvalItem` fields:

```python
from datasets import load_dataset
from agent_framework import evaluate_agent, evaluator, LocalEvaluator

# Ground-truth answers are published for the validation split
gaia = load_dataset("gaia-benchmark/GAIA", "2023_level1", split="validation")

@evaluator
def exact_match(response: str, expected_output: str) -> bool:
    return expected_output.strip().lower() in response.strip().lower()

# Simple path — evaluate_agent handles running + expected_output stamping
results = await evaluate_agent(
    agent=agent,
    queries=[task["Question"] for task in gaia],
    expected_output=[task["Final answer"] for task in gaia],
    evaluators=LocalEvaluator(exact_match),
)
```

### Package Location

- Core types and orchestration: `agent_framework._eval`, `agent_framework._local_eval` (Python), `Microsoft.Agents.AI` (.NET)
- Foundry provider: `agent_framework_azure_ai._foundry_evals` (Python), `Microsoft.Agents.AI.AzureAI` (.NET)
- Azure-AI re-exports core types for convenience (Python)

## Known Limitations

1. **Tool evaluators require query + agent**: Tool evaluators need tool definition schemas. When using these evaluators with `evaluate_agent(responses=...)`, provide `queries=` and pass an agent with tool definitions.
2. **`model_deployment` always required**: Could potentially be inferred from the Foundry project configuration.

## Open Questions

1. **Red teaming non-registered agents**: Requires Foundry API support for callback-based flows.
2. **Datasets with expected outputs**: A dataset abstraction for pre-populating `expected_output` values across eval runs is a natural next step but not yet designed.
3. **Multi-modal evaluation**: The `conversation` field on `EvalItem` already stores full `Message`/`Content` (Python) and `ChatMessage` (.NET) objects, which can represent multi-modal content (images, audio, structured data). Evaluators that accept the full `EvalItem` or `conversation` parameter can access this content today. However, the convenience shortcuts — `query`/`response` string projections and the `FunctionEvaluator` string overloads — are text-only. Multi-modal-aware evaluators should use the full-item path (the `Func<EvalItem, ...>` overloads in .NET, the `conversation: list` parameter in Python).

## .NET Implementation Design

### Key Difference: MEAI Ecosystem

Unlike Python, the .NET ecosystem already has `Microsoft.Extensions.AI.Evaluation` (v10.3.0) providing:

- `IEvaluator` — per-item evaluation of `(messages, chatResponse) → EvaluationResult`
- `CompositeEvaluator` — combines multiple evaluators
- Quality evaluators — `RelevanceEvaluator`, `CoherenceEvaluator`, `GroundednessEvaluator`
- Safety evaluators — `ContentHarmEvaluator`, `ProtectedMaterialEvaluator`
- Metric types — `NumericMetric`, `BooleanMetric`, `StringMetric`

The .NET integration uses MEAI's `IEvaluator` directly — no new evaluator interface. Our contribution is the **orchestration layer**: extension methods that run agents, extract data, call `IEvaluator` per item, and aggregate results.
### Architecture

```
┌──────────────────────────────────────────────────────────────┐
│ Developer Code                                               │
│   agent.EvaluateAsync(queries, evaluator)                    │
│   run.EvaluateAsync(evaluator)                               │
└────────────────┬─────────────────────────────────────────────┘
                 │
┌────────────────▼─────────────────────────────────────────────┐
│ Orchestration Layer (Microsoft.Agents.AI)                    │
│   AgentEvaluationExtensions — runs agents, extracts data,    │
│   calls IEvaluator per item, aggregates into                 │
│   AgentEvaluationResults                                     │
└────────────────┬─────────────────────────────────────────────┘
                 │ IEvaluator (MEAI)
                 │
     ┌───────────┼────────────┐
     │           │            │
 ┌───▼────┐  ┌───▼────┐  ┌────▼──────────┐
 │ MEAI   │  │ Local  │  │ Foundry       │
 │ Quality│  │ Checks │  │ (cloud batch) │
 │ Safety │  │ Lambdas│  │               │
 └────────┘  └────────┘  └───────────────┘
```

All evaluators implement MEAI's `IEvaluator`. The orchestration layer doesn't need to know which kind — it calls `EvaluateAsync(messages, chatResponse)` per item on all of them. `FoundryEvals` handles batching internally (buffers items, submits once, returns per-item results).

### .NET Core Types

**No new evaluator interface.** Use MEAI's `IEvaluator` directly.

**`AgentEvaluationResults`** — The only new type. Aggregates per-item MEAI `EvaluationResult`s across a batch of queries:

```csharp
public class AgentEvaluationResults
{
    public string Provider { get; init; }
    public string? ReportUrl { get; init; }

    // Per-item — standard MEAI EvaluationResult, unchanged
    public IReadOnlyList<EvaluationResult> Items { get; init; }

    // Aggregate pass/fail derived from metric interpretations
    public int Passed { get; }
    public int Failed { get; }
    public int Total { get; }
    public bool AllPassed { get; }

    // Workflow: per-agent breakdown
    public IReadOnlyDictionary<string, AgentEvaluationResults>? SubResults { get; init; }

    public void AssertAllPassed(string? message = null);
}
```

### .NET Evaluator Implementations

All implement MEAI's `IEvaluator`:

**`LocalEvaluator`** — Runs lambda checks locally, returns `BooleanMetric` per check:

```csharp
var local = new LocalEvaluator(
    FunctionEvaluator.Create("is_concise",
        (string response) => response.Split().Length < 500),
    EvalChecks.KeywordCheck("weather"),
    EvalChecks.ToolCalledCheck("get_weather"));
```

**MEAI evaluators** — Used directly, no adapter needed:

```csharp
var quality = new CompositeEvaluator(
    new RelevanceEvaluator(),
    new CoherenceEvaluator());
```

**`FoundryEvals`** — Implements `IEvaluator` but batches internally. On first call, buffers the item. On the last item (or when explicitly flushed), submits the batch to Foundry and distributes per-item results:

```csharp
var foundry = new FoundryEvals(projectClient, "gpt-4o");
```

### .NET Orchestration: Extension Methods

```csharp
public static class AgentEvaluationExtensions
{
    // Evaluate an agent against test queries
    public static Task<AgentEvaluationResults> EvaluateAsync(
        this AIAgent agent,
        IEnumerable<string> queries,
        IEvaluator evaluator,
        ChatConfiguration? chatConfiguration = null,
        IEnumerable<string>? expectedOutput = null,
        CancellationToken cancellationToken = default);

    // Evaluate pre-existing responses (without re-running the agent)
    public static Task<AgentEvaluationResults> EvaluateAsync(
        this AIAgent agent,
        AgentResponse responses,
        IEvaluator evaluator,
        IEnumerable<string>? queries = null,
        ChatConfiguration? chatConfiguration = null,
        IEnumerable<string>? expectedOutput = null,
        CancellationToken cancellationToken = default);

    // Evaluate with multiple evaluators (one result per evaluator)
    public static Task<IReadOnlyList<AgentEvaluationResults>> EvaluateAsync(
        this AIAgent agent,
        IEnumerable<string> queries,
        IEnumerable<IEvaluator> evaluators,
        ChatConfiguration? chatConfiguration = null,
        IEnumerable<string>? expectedOutput = null,
        CancellationToken cancellationToken = default);

    // Evaluate a workflow run with per-agent breakdown
    public static Task<AgentEvaluationResults> EvaluateAsync(
        this Run run,
        IEvaluator evaluator,
        ChatConfiguration? chatConfiguration = null,
        bool includeOverall = true,
        bool includePerAgent = true,
        CancellationToken cancellationToken = default);
}
```
**Usage:**

```csharp
// MEAI evaluators — just works
var results = await agent.EvaluateAsync(
    queries: ["What's the weather?"],
    evaluator: new RelevanceEvaluator(),
    chatConfiguration: new ChatConfiguration(evalClient));

// Local checks
var results = await agent.EvaluateAsync(
    queries: ["What's the weather?"],
    evaluator: new LocalEvaluator(
        EvalChecks.KeywordCheck("weather")));

// Foundry cloud
var results = await agent.EvaluateAsync(
    queries: ["What's the weather?"],
    evaluator: new FoundryEvals(projectClient, "gpt-4o"));

// Evaluate existing response (without re-running the agent)
var response = await agent.RunAsync("What's the weather?");
var results = await agent.EvaluateAsync(
    responses: response,
    queries: ["What's the weather?"],
    evaluator: new FoundryEvals(projectClient, "gpt-4o"));

// Mixed — one result per evaluator
var results = await agent.EvaluateAsync(
    queries: ["What's the weather?"],
    evaluators: [
        new LocalEvaluator(EvalChecks.KeywordCheck("weather")),
        new RelevanceEvaluator(),
        new FoundryEvals(projectClient, "gpt-4o")
    ],
    chatConfiguration: new ChatConfiguration(evalClient));

// Workflow with per-agent breakdown
Run run = await workflowRunner.RunAsync(workflow, "Plan a trip");
var results = await run.EvaluateAsync(
    evaluator: new FoundryEvals(projectClient, "gpt-4o"));
```

### .NET Function Evaluators

Typed factory overloads (C# equivalent of Python's `@evaluator`):

```csharp
public static class FunctionEvaluator
{
    public static EvalCheck Create(string name, Func<string, bool> check);                // response only
    public static EvalCheck Create(string name, Func<string, string, bool> check);        // response + expectedOutput
    public static EvalCheck Create(string name, Func<EvalItem, bool> check);              // full item
    public static EvalCheck Create(string name, Func<EvalItem, EvaluationResult> check);  // full control
    public static EvalCheck Create(string name, Func<EvalItem, Task<bool>> check);        // async
}
```

`EvalItem` is a lightweight record used only by `FunctionEvaluator` and `LocalEvaluator` to pass context to check functions. It is not part of the `IEvaluator` interface:

```csharp
public record ExpectedToolCall(string Name, IReadOnlyDictionary<string, object?>? Arguments = null);

public sealed class EvalItem
{
    public EvalItem(string query, string response, IReadOnlyList<ChatMessage> conversation);

    public string Query { get; }
    public string Response { get; }
    public IReadOnlyList<ChatMessage> Conversation { get; }
    public IReadOnlyList<AITool>? Tools { get; set; }
    public string? ExpectedOutput { get; set; }
    public IReadOnlyList<ExpectedToolCall>? ExpectedToolCalls { get; set; }
    public string? Context { get; set; }
    public IConversationSplitter? Splitter { get; set; }
}
```
### Workflow Data Extraction (.NET)

`run.EvaluateAsync()` walks `Run.OutgoingEvents` via LINQ:

1. Pair `ExecutorInvokedEvent` / `ExecutorCompletedEvent` by `ExecutorId`
2. Extract `AgentResponseEvent` for per-agent `ChatResponse`
3. Call `evaluator.EvaluateAsync()` per invocation
4. Group by `ExecutorId` for per-agent `SubResults`
5. Use final workflow output for overall eval

### .NET Package Structure

| Package | Contents |
|---------|----------|
| `Microsoft.Agents.AI` | `AgentEvaluationResults`, `LocalEvaluator`, `FunctionEvaluator`, `EvalChecks`, `EvalItem`, `ExpectedToolCall`, `AgentEvaluationExtensions` |
| `Microsoft.Agents.AI.AzureAI` | `FoundryEvals` (provider + constants) |

### Python ↔ .NET Mapping

| Python | .NET |
|--------|------|
| `Evaluator` protocol | MEAI `IEvaluator` (reused directly; no new interface) |
| `EvalItem` dataclass | `EvalItem` class |
| `EvalResults` | `AgentEvaluationResults` |
| `EvalItemResult` / `EvalScoreResult` | MEAI `EvaluationResult` / `EvaluationMetric` (reused) |
| `LocalEvaluator` | `LocalEvaluator` (implements `IEvaluator`) |
| `@evaluator` | `FunctionEvaluator.Create()` overloads |
| `keyword_check()` / `tool_called_check()` | `EvalChecks.KeywordCheck()` / `EvalChecks.ToolCalledCheck()` |
| `tool_calls_present` / `tool_call_args_match` | (custom `FunctionEvaluator` — same pattern) |
| `ExpectedToolCall` dataclass | `ExpectedToolCall` record |
| `FoundryEvals` | `FoundryEvals` (implements `IEvaluator`, includes evaluator name constants) |
| `evaluate_agent()` | `agent.EvaluateAsync(queries, evaluator)` extension method |
| `evaluate_agent(responses=)` | `agent.EvaluateAsync(responses, evaluator)` extension method |
| `evaluate_workflow()` | `run.EvaluateAsync()` extension method |

## More Information

- [Foundry Evals documentation](https://learn.microsoft.com/azure/ai-foundry/concepts/evaluation-approach-gen-ai) — Azure AI Foundry evaluation overview