
Add evaluation pipeline for MCP quality#362

Open
ashragrawal wants to merge 32 commits into main from
users/ashragrawal/evaluate

Conversation


@ashragrawal ashragrawal commented Apr 11, 2026

Summary

Adds `a365 develop-mcp evaluate` — a new subcommand that measures the quality
of an MCP server's tool schemas and produces an interactive HTML report with
prioritized action items.

Poor tool names, descriptions, or parameter schemas cause AI agents to select
the wrong tool or pass incorrect arguments. This command makes that quality
visible and actionable before the server is shipped to agents.

What the command does

  1. Discovers tools from the MCP server over Streamable HTTP (JSON-RPC 2.0
    handshake).
  2. Generates a checklist combining deterministic checks
    (structural/objective, evaluated inline) and semantic checks (require
    language judgment).
  3. Evaluates semantic checks via a locally installed coding agent —
    GitHub Copilot or Claude Code (both default to Haiku).
  4. Analyzes results to compute a 0-100 score, a 0-4 maturity level, and a
    ranked set of action items.
  5. Writes a machine-readable JSON report and a self-contained HTML report
    that opens in the browser.
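Step 1 amounts to two JSON-RPC 2.0 calls in sequence: an MCP `initialize` handshake followed by `tools/list`. A minimal sketch of the request payloads in Python for illustration (the implementation is C#; the protocol version and client-info values here are placeholders, not what SchemaDiscoveryService sends):

```python
import json

def jsonrpc(method: str, params: dict, req_id: int) -> dict:
    """Build a JSON-RPC 2.0 request envelope."""
    return {"jsonrpc": "2.0", "id": req_id, "method": method, "params": params}

def discovery_requests() -> list[dict]:
    """The two requests discovery sends, in order: the MCP initialize
    handshake, then tools/list to enumerate tool schemas."""
    init = jsonrpc("initialize", {
        "protocolVersion": "2025-03-26",  # placeholder version string
        "capabilities": {},
        "clientInfo": {"name": "a365-evaluate", "version": "0.0.0"},
    }, 1)
    list_tools = jsonrpc("tools/list", {}, 2)
    return [init, list_tools]

for req in discovery_requests():
    print(json.dumps(req["method"]))
```

Each request is POSTed to the server's Streamable HTTP endpoint; the `tools/list` result is what feeds checklist generation in step 2.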

Usage

# Fully automatic (requires GitHub Copilot or Claude Code CLI)
a365 develop-mcp evaluate --server-url https://my-mcp-server.com/mcp

# With auth, specific engine, custom output
a365 develop-mcp evaluate \
  --server-url https://my-server/mcp \
  --auth-token $TOKEN \
  --eval-engine claude-code \
  --output-dir ./reports

Bring-your-own-LLM workflow

If no coding agent is installed (or the user wants to use a different LLM),
the command writes the checklist and a self-contained
semantic_eval_prompt.txt to the output directory, then stops. The user scores
each score: null item with their own tool (ChatGPT, Gemini, IDE assistant,
etc.) and re-runs the exact same command — the pipeline detects the existing
scored checklist, skips discovery, and generates the report. No special flag
required for the resume.
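The resume decision reduces to a file check. A sketch in Python (the filename and JSON field names here are hypothetical, chosen only to illustrate the shape of the logic):

```python
import json
from pathlib import Path

def is_fully_scored(checklist: dict) -> bool:
    """True iff every semantic check carries a non-null score."""
    return all(item["score"] is not None
               for tool in checklist["tools"]          # hypothetical field
               for item in tool["semantic_checks"])    # hypothetical field

def should_resume(output_dir: str) -> bool:
    """Resume (skip discovery, go straight to analysis and report)
    iff a fully scored checklist already sits in the output directory."""
    path = Path(output_dir) / "checklist.json"  # hypothetical filename
    if not path.exists():
        return False  # fresh run: discover tools and generate checklist
    return is_fully_scored(json.loads(path.read_text()))
```

Because the decision keys off the file on disk rather than a flag, any tool that fills in the `score` fields — a coding agent or the user's own LLM — makes the next run resume automatically.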

Key design decisions

  • Thin command, fat service. DevelopMcpCommand only defines args/options and
    delegates to IEvaluationPipelineService, which orchestrates the 5 steps.
    Keeps the command focused and the pipeline unit-testable.
  • Resume-by-file. The checklist JSON on disk is the source of truth between
    runs. This is what makes the BYOL workflow round-trip without special flags.
  • Strict completion check. The pipeline refuses to generate a report if any
    semantic check is still score: null, because Scorer treats all-null
    categories as 100 — proceeding would produce an inflated, misleading score.
  • Required option, not positional arg. --server-url is a required option for
    consistency with other develop-mcp subcommands and the Azure CLI regression
    tests.
  • Neutral terminology. The 18 schema-quality problems are called "issues"
    across types, JSON fields, prompts, and the HTML report.
  • Same HTTP pattern as the rest of the repo. SchemaDiscoveryService uses
    HttpClientFactory.CreateAuthenticatedClient() — same as GraphApiService,
    ArmApiService, etc. — so user output isn't polluted by the default
    LoggingHttpMessageHandler's 4-lines-per-request noise.
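The strict completion check exists because of how category averages behave when a category has no scored items. A sketch of that failure mode (simplified; not Scorer's actual weighting):

```python
def category_score(items: list) -> int:
    """Category average over scored items only. An all-None category
    has no signal, so it defaults to 100 — which is exactly why
    reporting on a partially scored run would inflate the result."""
    scored = [i for i in items if i is not None]
    if not scored:
        return 100
    return round(100 * sum(scored) / len(scored))

# A half-finished run: one category fully scored, one untouched.
print(category_score([True, False, True]))  # 67
print(category_score([None, None]))         # 100 — misleading if reported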

Test plan

  • dotnet test tests.proj — 1544 passing, 0 failing
  • End-to-end with --eval-engine github-copilot against learn.microsoft.com (3
    tools, 48 semantic checks)
  • End-to-end with --eval-engine claude-code
  • End-to-end with --eval-engine none on fresh run (stops with guidance)
  • End-to-end with --eval-engine auto and no agent on PATH (stops with
    install + BYOL guidance)
  • BYOL resume: score the checklist manually (simulated via
    scripts/simulate_llm_scoring.py), re-run same command, confirm report
    generates without re-discovery
  • Engine fallthrough: Copilot timeout falls through to Claude Code
  • Clean logs (no Start/Sending/Received/End HTTP noise)
  • Help text mentions coding-agent dependency + BYOL alternative

5-step pipeline: discover tools from MCP server, generate auditable checklist,
evaluate semantic checks via coding agent CLI (GitHub Copilot or Claude Code),
analyze scores/maturity/action items, render HTML report.

Key design decisions:
- Extract-evaluate-merge pattern: each tool evaluated in its own ~25KB temp
  file to avoid coding agent timeouts on large checklists
- Engine fallthrough: tries Copilot first, then Claude Code, with per-tool
  6-minute timeout and process tree cleanup on timeout
- Copilot uses prompt-file approach (no stdin support); Claude uses stdin piping
- 25 deterministic checks (C#) + 12 semantic checks per tool (coding agent)
- 18-smell taxonomy with weighted 5-category scoring and maturity levels 0-4
- 318 new tests (xUnit + FluentAssertions)
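The extract-evaluate-merge pattern can be pictured as two small functions (a Python sketch with hypothetical field names; the real ChecklistEvaluator is C#):

```python
import copy

def extract_per_tool(checklist: dict) -> dict:
    """Step 1: one small payload per tool, so the coding agent never
    sees the whole checklist and each subprocess stays within budget."""
    return {tool["name"]: copy.deepcopy(tool) for tool in checklist["tools"]}

def merge_scores(checklist: dict, tool_name: str, evaluated: dict) -> dict:
    """Step 3: copy scores the agent produced back into the master
    checklist, keyed by check id; unscored items stay null for retry."""
    by_id = {c["id"]: c for c in evaluated["semantic_checks"]}
    for tool in checklist["tools"]:
        if tool["name"] != tool_name:
            continue
        for check in tool["semantic_checks"]:
            scored = by_id.get(check["id"])
            if scored is not None and scored["score"] is not None:
                check["score"] = scored["score"]
    return checklist
```

Merging by check id (rather than trusting the agent's whole file) means a partially successful agent pass still makes forward progress.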

github-actions Bot commented Apr 11, 2026

⚠️ Deprecation Warning: The deny-licenses option is deprecated for possible removal in the next major release. For more information, see issue 997.

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

@ashragrawal ashragrawal marked this pull request as ready for review April 13, 2026 19:46
@ashragrawal ashragrawal requested review from a team as code owners April 13, 2026 19:46
Copilot AI review requested due to automatic review settings April 13, 2026 19:46
- Switch EvaluateCommand to InvocationContext pattern with CancellationToken
  threaded through the entire evaluation pipeline
- Fix Claude Code on Windows: use prompt-file instead of stdin piping
  (cmd.exe /c does not forward stdin to child processes)
- Fix SemanticEvaluationCompleted returning false when all checks were
  already scored (pre-evaluated checklists)
- Remove no-op --verbose option
- Remove redundant Environment.ExitCode = 1 assignments
- Add CHANGELOG entry for a365 evaluate

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds the new a365 evaluate command and supporting evaluation pipeline to discover MCP tools, run deterministic + semantic checks (via GitHub Copilot / Claude Code), score + compute maturity, and generate JSON/HTML reports.

Changes:

  • Introduces evaluation pipeline services (schema discovery, checklist generation/evaluation, scoring, analysis, reporting) and wires them into the CLI.
  • Adds models for checklists/results, smell taxonomy + maturity scoring, and prompt templates for coding-agent semantic evaluation.
  • Adds extensive xUnit coverage for the new evaluation components and CLI command.

Reviewed changes

Copilot reviewed 46 out of 46 changed files in this pull request and generated 3 comments.

Summary per file:
CHANGELOG.md Documents the new a365 evaluate command.
src/Microsoft.Agents.A365.DevTools.Cli/Commands/EvaluateCommand.cs Adds the evaluate CLI command orchestrating the 5-step pipeline.
src/Microsoft.Agents.A365.DevTools.Cli/Constants/ErrorCodes.cs Adds evaluation-specific error codes.
src/Microsoft.Agents.A365.DevTools.Cli/Exceptions/EvaluationException.cs Introduces a dedicated exception type for evaluation failures.
src/Microsoft.Agents.A365.DevTools.Cli/Microsoft.Agents.A365.DevTools.Cli.csproj Adds HTTP client factory package + embeds HTML report template.
src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/ActionItem.cs Adds report model for prioritized remediations.
src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/ChecklistItem.cs Adds model for deterministic/semantic checklist items.
src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/EvalReportData.cs Adds HTML-template data model (result + impact map + ladder).
src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/EvaluateEnums.cs Adds enums for categories, priorities, engines, etc.
src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/EvaluationChecklist.cs Adds checklist root + metadata model.
src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/MaturityLevel.cs Adds maturity level model.
src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/SchemaEvalResult.cs Adds top-level evaluation result model.
src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/SmellDefinition.cs Adds smell taxonomy definition model.
src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/ToolChecklist.cs Adds per-tool checklist + grouped checks model.
src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/ToolEvalResult.cs Adds per-tool evaluation result model.
src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/ToolSchema.cs Adds discovered tool schema model.
src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/ToolsetEvalResult.cs Adds toolset-level evaluation result model.
src/Microsoft.Agents.A365.DevTools.Cli/Program.cs Registers evaluate command + DI services (HttpClient + pipeline services).
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ActionItemGenerator.cs Generates action items from failed checks with score impact + smell mapping.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ChecklistEvaluator.cs Runs extract-evaluate-merge with coding agents and merges results back.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/CodingAgentRunner.cs Detects/runs Copilot/Claude CLIs with timeouts and cleanup.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/EvaluationAnalyzer.cs Computes scores, maturity, smell summary, and action items aggregation.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/IChecklistEvaluator.cs Interface + result type for semantic evaluation step.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/IChecklistGenerator.cs Interface for checklist generation step.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/IEvaluationAnalyzer.cs Interface for analysis/scoring step.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/IReportGenerator.cs Interface for report generation step.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ISchemaDiscoveryService.cs Interface for MCP tool discovery step.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/MaturityCalculator.cs Implements maturity level calculation + ladder rendering data.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ReportGenerator.cs Writes JSON/HTML report and optionally opens browser.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SchemaDiscoveryService.cs Implements MCP initialize + tools/list discovery over JSON-RPC + SSE handling.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/Scorer.cs Implements category/tool/overall scoring + averages.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SemanticCheckDefinitions.cs Defines semantic check metadata for tools/params/toolset.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SemanticCheckPrompts.cs Builds prompts/commands for coding-agent semantic evaluation.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SmellTaxonomy.cs Adds 18-smell taxonomy and impact map for HTML report.
src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Commands/EvaluateCommandTests.cs Tests CLI structure + engine parsing + server-name derivation.
src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/ActionItemGeneratorTests.cs Tests action item generation + score impact + sorting.
src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/DeterministicChecksTests.cs Tests deterministic checks behavior (naming/description/schema/toolset).
src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/EvaluationAnalyzerTests.cs Tests analyzer aggregation, scoring, maturity, smells, priorities.
src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/MaturityCalculatorTests.cs Tests maturity thresholds/caps + ladder output.
src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/ReportGeneratorTests.cs Tests JSON/HTML report generation + sanitization + arg validation.
src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/ScorerTests.cs Tests scoring rules and weight math.
src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/SemanticCheckDefinitionsTests.cs Tests semantic check definitions are consistent and stable.
Comments suppressed due to low confidence (5)

  • src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ReportGenerator.cs:1 — htmlPath may contain spaces/shell-special characters, and passing it via the (fileName, arguments) constructor can break argument parsing (especially on Linux/macOS). Prefer ProcessStartInfo.ArgumentList (single arg) or explicitly quote/escape the path to ensure the report opens reliably.
  • src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ReportGenerator.cs:1 — The HTML template receives raw JSON substituted into the document. If any server-supplied fields (tool names/descriptions, reasons, etc.) contain sequences like </script> or other HTML-breaking content, this can break the page or enable script injection when the report is opened. Consider embedding the JSON in a safer form (e.g., HTML-escaped JSON, base64-encoded payload decoded at runtime, or a <script type="application/json"> block with escaping of <, >, &, and </script>).
  • src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SemanticCheckPrompts.cs:1 — The JSON structure example suggests the parameters object is keyed by "param_name", but in the actual model it’s a dictionary keyed by the actual parameter name (e.g., "userId": { "param_name": [...], ... }). This mismatch can mislead the coding agent and reduce evaluation success; update the example to reflect the real shape.
  • src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SchemaDiscoveryService.cs:1 — SerializerOptions is declared but not used anywhere in this file. Remove it to avoid dead code, or use it consistently for any deserialization that needs case-insensitive behavior.
  • src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SchemaDiscoveryService.cs:1 — For SSE responses, ReadAsStringAsync reads the entire response body before parsing, which can be problematic if the server keeps the event stream open (the call may not complete promptly). Consider switching to streaming processing (e.g., SendAsync(..., ResponseHeadersRead) and reading the content stream line-by-line until a JSON data: message is found) to avoid hangs and reduce memory usage.

Comment thread src/Microsoft.Agents.A365.DevTools.Cli/Commands/EvaluateCommand.cs Outdated
ashragrawal and others added 3 commits April 16, 2026 14:01
Drop DeterministicChecks and its tests (unreferenced after inlining
into ChecklistGenerator), plus unused methods ActionItemGenerator.GenerateFromChecks
and SemanticCheckPrompts.BuildClaudeCodeCommand/BuildGithubCopilotCommand.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Inline the evaluate subcommand in DevelopMcpCommand and extract the
5-step pipeline into IEvaluationPipelineService so the command stays thin.
Adds a DevelopMcpCommand.CreateCommand overload that accepts the pipeline
service; the existing 2-param signature remains for tests that don't
need evaluate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Repair JSON produced by coding agents: tolerate trailing commas and
insert missing commas before deserializing the updated checklist,
since agents occasionally emit structurally invalid JSON.

Run Copilot with the Haiku model (extracted to a single constant) so
both engines default to the same fast/cheap tier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
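The repair described above — tolerating trailing commas and inserting missing ones — can be sketched in a few regexes (a simplified Python illustration of the idea; the real RepairJson is C#, and a regex approach like this would also rewrite brace sequences inside string literals, so it is strictly best-effort):

```python
import json
import re

def repair_json(text: str) -> str:
    """Best-effort repair of agent-emitted JSON: drop trailing commas
    and insert commas missing between back-to-back objects."""
    # Remove trailing commas before a closing brace or bracket.
    text = re.sub(r",\s*([}\]])", r"\1", text)
    # Insert a comma between adjacent objects: `} {` -> `}, {`.
    text = re.sub(r"}\s*{", "}, {", text)
    return text

print(json.loads(repair_json('[{"a": 1,} {"b": 2}]')))  # [{'a': 1}, {'b': 2}]
```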
Copilot AI review requested due to automatic review settings April 16, 2026 21:10

Copilot AI left a comment


Pull request overview

Copilot reviewed 48 out of 48 changed files in this pull request and generated 1 comment.

@ashragrawal

@microsoft-github-policy-service agree company="Microsoft"

- CodingAgentRunner: correct the class summary to describe actual prompt
  delivery (Claude Code uses stdin on Unix, temp file on Windows;
  Copilot always uses a temp file).
- ActionItemGenerator: map unknown CheckCategory values to "unknown"
  instead of "schema_structure", so new categories fall back to the
  default weight rather than silently inheriting schema-structure weight.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 16, 2026 21:51

Copilot AI left a comment


Pull request overview

Copilot reviewed 48 out of 48 changed files in this pull request and generated 9 comments.

Comment thread src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ChecklistEvaluator.cs Outdated
Comment thread src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ChecklistEvaluator.cs Outdated
Comment thread src/Microsoft.Agents.A365.DevTools.Cli/Commands/DevelopMcpCommand.cs Outdated
ashragrawal and others added 2 commits April 16, 2026 16:32
Switch from AddHttpClient<T>() to the project's standard HttpClientFactory
pattern (matches GraphApiService, ArmApiService, etc.). This removes the
default LoggingHttpMessageHandler that emitted four "Start/Sending/
Received/End processing" lines per request at Information level, cleaning
up the user-facing output during schema discovery.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the "Where You Stand" section rendered the maturity ladder
and nothing below it when the server was at Level 4 (the top) — no
"To reach Level N+1" box to guide users. This left a visual gap that
looked like missing content.

Add a terminal-state message acknowledging the server has reached the
highest maturity level and pointing to the action items for remaining
refinements.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 16, 2026 23:41

Copilot AI left a comment


Pull request overview

Copilot reviewed 48 out of 48 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (2)

  • src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ReportGenerator.cs:1 — Injecting raw JSON into an HTML template via string replacement can be unsafe if the template places it inside a <script> tag: tool descriptions could contain </script> and prematurely terminate the script, enabling HTML/script injection in the generated report. Escape JSON for safe embedding (e.g., replace </ with <\\/), or embed the JSON as HTML-encoded text within a <script type=\"application/json\"> element and parse it at runtime.
  • src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SchemaDiscoveryService.cs:1 — SerializerOptions is declared but never used anywhere in this file. Remove it to reduce noise, or apply it consistently where deserialization occurs (if the intent is to deserialize into typed models rather than using JsonDocument/manual parsing).

ashragrawal and others added 2 commits April 16, 2026 17:02
Previously the evaluate pipeline emitted a mix of developer-facing noise
(duplicate "Engines available" / "Engines available again" lines, stray
"Coding agent completed successfully" after every tool) and lacked clear
progress indicators, making it hard to tell where the run was at a glance.

Rework the output around a 5-step pipeline with aligned indented detail
lines. Key changes:

- Step markers [1/5]..[5/5] for discovery, checklist, eval, analysis, report.
- Single "Using <Engine>" line (with optional fallback) instead of three
  "Detecting / Available / Engines available" lines.
- Per-tool progress prints once per tool with an inline status ("ok" or
  "failed (continuing)"), not before+after.
- Demote "Coding agent completed / exited / timed out" to debug — the
  user already sees success/failure on the per-tool line.
- When no coding agent CLI is found, write the semantic eval prompt to
  semantic_eval_prompt.txt next to the checklist and guide users through
  install options OR scoring with their own LLM.
- Remove the old "Analyzing results..." / "Analysis complete" / "Generating
  report..." intermediate lines; the step markers and trailing "Done. Score"
  line already convey that information.
- Suppress the extraneous initial checklist-path log at Information level.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Correct SemanticEvaluationCompleted: require zero remaining unevaluated
  semantic checks before marking complete. Previously a single successful
  tool would flip the flag to true, letting Scorer treat still-null
  categories as perfect 100 and inflate overall scores on partial runs.
- Switch `develop-mcp evaluate`'s required input from a positional
  `server-url` argument to a required `--server-url` / `-u` option, for
  consistency with the other develop-mcp subcommands and the Azure CLI
  compliance regression test.
- Route `ToolsetDesign` checks to `Scorer.ToolsetWeight` in
  ActionItemGenerator so action-item score impact stays aligned with
  overall scoring; removes an implicit reliance on the 0.15 fallback
  coincidentally matching ToolsetWeight.
- Add ArgumentNullException guards to the EvaluationPipelineService
  constructor for parity with the rest of the codebase's DI services.
- Expose ChecklistEvaluator.RepairJson as internal and add unit tests
  covering well-formed input, missing commas between objects/strings/
  booleans, and empty input.
- Relax DevelopMcpCommandTests subcommand-count assertions to check for
  presence/absence of "evaluate" instead of asserting a hardcoded total,
  so unrelated subcommand additions don't break these tests.
- Add `because:` clauses to DeriveServerName assertions so the intent of
  each URL-sanitization invariant is documented at the assertion site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 17, 2026 00:12

Copilot AI left a comment


Pull request overview

Copilot reviewed 49 out of 49 changed files in this pull request and generated 6 comments.

Comment thread src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/EvaluateEnums.cs Outdated
Comment thread src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ChecklistEvaluator.cs Outdated
ashragrawal and others added 2 commits April 20, 2026 15:03
Per-tool scoring was flaky (0-34 of 48 scored across runs) because the
prompt said "use a whole-file write tool if available" and the agent
non-deterministically chose edit/str_replace for individual items. Those
edits failed on the repeating "score: null" pattern that isn't unique
across checks, and the subprocess still exited 0 so the pipeline logged
"ok" with nothing merged.

Fix: build a per-engine prompt that names the exact tool the agent should
use. SemanticCheckPrompts now takes an AgentToolset record describing
ReadToolName/WriteToolName/EditToolName, and ChecklistEvaluator maps
EvalEngine to the concrete names (Copilot: view/create/edit,
Claude Code: Read/Write/Edit). The prompt instructs "use Write/create
ONCE" and warns away from targeted string replacements.

Also add Write to Claude Code's --allowedTools since a whole-file write
is the reliable strategy for both engines.

E2E on learn.microsoft.com: 46/48 scored consistently (was 20-34 flaky);
the 2 remaining are the toolset-level server checks, which we'll follow
up on separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Restricting Copilot to --available-tools=view,create caused the model to
thrash and leave checks unscored — it had the ability to do the task but
not the flexibility to pick its own strategy. Inverting the restriction
(allow everything, deny the dangerous families) lets the agent use its
full toolkit for the scoring task while blocking the two ways it could
escape the sandbox or leak data.

Denies:
  Copilot:
    shell, write_shell, read_shell, stop_shell, list_shell (macOS/Linux),
    powershell, write_powershell, read_powershell, stop_powershell, list_powershell (Windows),
    web_fetch, web_search
  Claude Code:
    Bash, BashOutput, KillBash, WebFetch, WebSearch

File access remains bounded by the per-invocation temp-dir sandbox —
file tools respect cwd by default, and we don't pass --allow-all-paths.

Prompt simplified: we no longer over-instruct the agent on which tool
to use, just name the read/write tool names it has and describe the
write-in-one-call strategy as a preference, not a restriction.

E2E on learn.microsoft.com: 48/48 scored, score 92/100, HTML report
generated (was flaky 20-46/48 previously).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 20, 2026 23:25

Copilot AI left a comment


Pull request overview

Copilot reviewed 49 out of 49 changed files in this pull request and generated 4 comments.

Comment thread src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ChecklistEvaluator.cs Outdated
Comment thread src/Microsoft.Agents.A365.DevTools.Cli/Commands/DevelopMcpCommand.cs Outdated
ashragrawal and others added 3 commits April 20, 2026 17:38
- Rename EvalEngine.GithubCopilot to GitHubCopilot so the serialized enum
  name matches GitHub branding (and report JSON stays consistent)
- Use FormatEngineName display name in report eval-engine field instead
  of raw enum ToString() so downstream consumers see "GitHub Copilot"
- Pass derived server name through ReportGenerator.SanitizeFileName so
  the UriFormatException fallback can't produce an invalid filename
- Drop unused workingDir parameter from EvaluateToolChecks and
  EvaluateServerChecks (sandbox dir is created internally)
- Fix ReportGenerator comment to drop the bogus "<!" escape mention
- Reword evaluate help text so it doesn't imply --eval-engine none is
  required for BYOL (auto-mode also falls back to the written checklist)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Copilot model sometimes hedges on "pass if no issues" prompts and
leaves the score as null instead of committing to true/false. Before
this change, the pipeline accepted whatever came back from the first
agent call, so runs would flake between 30/48 and 48/48 scored on
identical inputs — the same tool or same pair of server-level checks
would score one run and skip the next.

Change: EvaluateToolChecks and EvaluateServerChecks now loop up to
MaxAttempts (3) times. After each agent pass we merge scored items
back into the in-memory checklist, re-serialize the current state to
the temp file (so the next attempt only sees the items that are still
null), and stop early as soon as everything is scored.

Also wrap the deserialize-and-merge step in try/catch (JsonException).
When the agent writes structurally invalid JSON (e.g. an abbreviated
ChecklistItem object), we now log and retry instead of crashing the
whole pipeline with an unhandled exception.

E2E on learn.microsoft.com: 48/48 scored in a single run, score 90/100,
full report generated (previously needed a resume run to finish the
last 2 server-level checks).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The fixed 6-minute per-tool timeout was fine for tools with ~18 checks
(AddDraftAttachments completed in ~3.5 min) but UpdateDraft, which has
46 semantic checks, hit the wall: 46 views + 31 creates + 78 reasoning
rounds from Haiku in 6 minutes wasn't enough, so the subprocess was
killed and all 46 checks came back null.

Change: PerToolTimeout becomes TimeoutForChecks(checkCount) =
  120s base + 15s per check, clamped to [3min, 20min]

ChecklistEvaluator passes the unscored-check count into each attempt,
so tools with more work get more time and small tools don't idle on
an over-generous budget.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 21, 2026 00:47
Observed Haiku needs closer to 15-20s per check (view + reason + write,
with several thinking rounds) — 15s was cutting it close. Bumping to 20s
keeps the same shape (base 120s + N*perCheck, clamped to [3, 20] min)
but reduces the chance of hitting the ceiling mid-thought.

UpdateDraft (46 checks) now gets 120 + 46*20 = 1040s = 17.3 min
(was 13.0 min).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
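With the bump, the timeout formula from these two commits is base 120s + 20s per unscored check, clamped to [3 min, 20 min]. A direct transcription in Python (illustrative; the implementation is C#):

```python
def timeout_for_checks(check_count: int) -> int:
    """Per-tool coding-agent timeout in seconds:
    120s base + 20s per unscored check, clamped to [3 min, 20 min]."""
    return min(max(120 + 20 * check_count, 180), 1200)

print(timeout_for_checks(46))  # UpdateDraft: 1040s, about 17.3 min
print(timeout_for_checks(3))   # tiny tool: clamped up to the 180s floor
```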

Copilot AI left a comment


Pull request overview

Copilot reviewed 49 out of 49 changed files in this pull request and generated 6 comments.

Comment thread src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/Scorer.cs
Comment thread src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ChecklistEvaluator.cs Outdated
Comment thread src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ChecklistEvaluator.cs Outdated
Comment thread src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/Scorer.cs
ashragrawal and others added 2 commits April 20, 2026 20:56
The prior retry loop only re-invoked the agent when the subprocess
exited 0 but left items null. If the first attempt hit the per-tool
timeout, we gave up immediately ("retry would just repeat the same
subprocess failure"). That assumption was wrong: on Haiku + Copilot
we see non-deterministic timeouts — the same tool that times out on
attempt 1 often completes on attempt 2 or 3 because Copilot's runtime
is warmer, or the model happens to pick a shorter reasoning path.

On the Mail MCP eval, 6 tools (SendEmailWithAttachments, GetMessage,
FlagEmail, UploadAttachment, UploadLargeAttachment, ForwardMessage)
ended with 0/N scored — all single-attempt timeouts that never got a
retry. Similar-sized tools next to them in the pipeline completed fine
on first attempt.

Change: on subprocess failure, log and continue the retry loop instead
of returning false. Still return false if *all* MaxAttempts subprocess
calls fail — we're not pretending an unreachable agent succeeded.

Same fix applied to EvaluateServerChecks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
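The retry behavior this commit describes — keep looping on subprocess failure, merge whatever got scored after each successful pass, stop early once nothing is left, and report failure only when every attempt's subprocess failed — can be sketched as (Python illustration with hypothetical callables; the real loop lives in ChecklistEvaluator, in C#):

```python
MAX_ATTEMPTS = 3

def evaluate_with_retries(run_agent, merge, unscored):
    """run_agent(items) -> (subprocess_ok, scored_items);
    merge(items, scored) -> items still unscored after the pass."""
    any_subprocess_ok = False
    for _ in range(MAX_ATTEMPTS):
        ok, scored = run_agent(unscored)
        if not ok:
            continue  # timeout/crash: retry instead of giving up
        any_subprocess_ok = True
        unscored = merge(unscored, scored)
        if not unscored:
            return True  # everything scored: stop early
    return any_subprocess_ok  # False only if every subprocess call failed
```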
…ndbox doc

- Scorer and ActionItemGenerator: remove null checks on parameters declared
  non-nullable. Production callers never pass null; tests that did are dropped.
- ChecklistEvaluator: reword EvaluateToolChecks doc to reflect that setting
  WorkingDirectory is a reduced-surface defense (via each engine's path
  verification), not a full sandbox.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 21, 2026 05:09

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 49 out of 49 changed files in this pull request and generated 1 comment.

ashragrawal and others added 2 commits April 20, 2026 23:23
Root cause of the Mail-MCP 0/N failures on GetMessage, FlagEmail, and
UploadAttachment: Copilot's `create` tool cannot overwrite existing
files ("Cannot be used if the specified path already exists"). We were
telling the agent to "rewrite the whole file via create" — a strategy
that physically fails the moment the pre-populated temp file exists.
Some tools happened to stumble onto workarounds (create siblings, copy
fields back); others (usually smaller ones like GetMessage, 54-char
description) kept looping on the create->edit->view fallback dance for
9 minutes straight until timeout.

Fix: use an edit-only (string-replace) flow.

- SemanticCheckPrompts:
  - AgentToolset now names a read tool and an edit tool (no write tool).
  - New prompt instructs the agent to call edit once per null item with
    an old_str that includes both the item's id and its prompt field,
    which is globally unique in the file.
  - Explicit "answer with first instinct, do not re-read after a
    successful edit" rule to discourage the checking loop.
- ChecklistEvaluator.ToolsetFor: Copilot=(view, edit); Claude=(Read, Edit).
- CodingAgentRunner:
  - Copilot: --available-tools=view,edit (drops `create`).
  - Claude:  --allowedTools Read,Edit (drops Write).

Validated on learn.microsoft.com and the Mail MCP server:
- learn.microsoft.com: 48/48 scored, 92/100, ~6.5 min total (was 46/48).
- Mail MCP resume: 6 previously-failing tools all score first-attempt
  in ~2 min each (was 28 min + failing). Final: 638/638 scored, 82/100.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
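The `old_str` uniqueness trick above can be illustrated with a minimal string-replace edit tool. The file layout, field names, and item ids below are hypothetical, but they show why matching on the id *and* prompt together is safe even when prompts repeat across items:

```python
def apply_edit(file_text, old_str, new_str):
    """Minimal string-replace edit tool: old_str must match exactly once."""
    if file_text.count(old_str) != 1:
        raise ValueError("old_str must be globally unique in the file")
    return file_text.replace(old_str, new_str)

# Two items share a prompt; only id + prompt together is unique.
checklist = (
    '{"id": "tool.name.clarity", "prompt": "Is the name clear?", "score": null}\n'
    '{"id": "tool.desc.clarity", "prompt": "Is the name clear?", "score": null}\n'
)
old = '"id": "tool.name.clarity", "prompt": "Is the name clear?", "score": null'
new = '"id": "tool.name.clarity", "prompt": "Is the name clear?", "score": 1'
checklist = apply_edit(checklist, old, new)
```

An edit keyed on the prompt alone would match twice and be rejected; including the id pins it to one item.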
MergeScores built its lookup with ToDictionary(e => e.Id), which throws
ArgumentException on duplicate keys or a null id. The surrounding try/catch
only catches JsonException, so a malformed agent batch would crash the run
even when earlier attempts had made real progress. Drop empty ids and take
last-wins on duplicates so a broken batch is treated like other agent-JSON
quirks (retry on the next attempt).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
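The drop-empty-ids / last-wins merge described above, sketched in Python (the entry shape is illustrative; the real code operates on the C# checklist item type):

```python
def merge_scores(existing, batch):
    """Merge an agent batch into the score lookup without throwing:
    drop entries with a missing or empty id, last-wins on duplicates."""
    lookup = dict(existing)
    for entry in batch:
        entry_id = entry.get("id")
        if not entry_id:
            continue  # malformed entry: skip rather than crash the run
        lookup[entry_id] = entry  # duplicate id: later entry wins
    return lookup

# A batch with a duplicate id and a null id no longer aborts the run,
# unlike a strict ToDictionary(e => e.Id)-style build.
merged = merge_scores({}, [
    {"id": "a", "score": 0},
    {"id": "a", "score": 1},   # duplicate: last wins
    {"id": None, "score": 1},  # dropped
])
```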
Copilot AI review requested due to automatic review settings April 21, 2026 06:31
Copilot AI left a comment


Pull request overview

Copilot reviewed 49 out of 49 changed files in this pull request and generated 4 comments.

Comment thread src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ChecklistEvaluator.cs Outdated
Comment thread src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ChecklistEvaluator.cs Outdated
ashragrawal and others added 2 commits April 21, 2026 13:59
… on availability

- EvaluationPipelineService: when the user passes --eval-engine auto, the
  report used to record "auto" instead of whichever engine actually scored
  the checks. Thread a ChecklistEvaluationResult.EngineUsed back through
  TryEvaluateWithFallthrough / EvaluateToolChecks / EvaluateServerChecks so
  the report is stamped with the engine that ran (GitHub Copilot or Claude
  Code), falling back to the requested engine when none ran.
- ChecklistEvaluator.BuildEngineList: when an explicit engine is requested
  (e.g. --eval-engine github-copilot), check availability first. If the CLI
  isn't on PATH, return an empty list so the caller surfaces the same
  "engine not found, here's how to install" guidance it uses in Auto mode,
  instead of looping through per-tool failures and printing the misleading
  "agent ran but left checks unscored" message.
- ChecklistEvaluator: fix RepairJson XML doc — the implementation only
  inserts missing commas; trailing commas are handled separately by
  AllowTrailingCommas in ReadOptions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
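The engine-selection behavior described above, sketched in Python (names and the availability set are illustrative stand-ins for the actual `BuildEngineList` signature):

```python
def build_engine_list(requested, available):
    """Auto mode: every installed engine in preference order.
    Explicit mode: [engine] only if its CLI is installed, else an
    empty list so the caller surfaces install guidance instead of
    looping through per-tool failures."""
    preference = ["github-copilot", "claude-code"]
    if requested == "auto":
        return [e for e in preference if e in available]
    return [requested] if requested in available else []
```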
Adds 4-layer F-001 XPIA mitigation. Each layer covers a specific failure
the others miss:

- L1 PromptSanitizer (new): strips bidi overrides, zero-width chars,
  C0/C1 controls, and U+E0000-U+E01EF tag-block from tool names,
  descriptions, and param names before they reach the agent. Without
  this, hidden Unicode in MCP content survives spotlighting and L3
  keyword filters.
- L2 spotlighting: prepends a SECURITY BOUNDARY header and wraps tool
  names in <untrusted-data> tags in all 3 prompt builders. Without
  this, the agent has no signal that schema content is untrusted.
- L3 ScoringSafetyFilter (new): rejects agent reasons containing
  exfil URLs (http/https/ftp/data:) or prompt-injection markers
  ("ignore previous instructions", "system:", etc.). Cleared items
  go through the existing retry loop. Without this, exfil links and
  reproduced injection text reach the report.
- L4 canary: injects a fake check whose correct answer is always
  false (random UUID match). A true score signals plan drift, logged
  as SECURITY error and surfaced via PlanDriftDetected on the result.
  This is the only post-hoc detector if L1-L3 fail silently.

Also adds F-002 XSS defense-in-depth: routes maturity.label and
AREA_LABELS values through esc() in SchemaEvalReport.html. Combined
with the existing System.Text.Json encoding and EscapeForInlineScript
layers, all 24 MCP-controlled fields are now escaped before any
innerHTML assignment.

Tests: PromptSanitizerTests, ScoringSafetyFilterTests, plus XSS
regression tests in ReportGeneratorTests. All 148 affected tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
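The L1 sanitizer can be sketched as a single strip pass over the character classes named above. This is a Python illustration of the idea; the actual implementation is C# and its exact character coverage may differ:

```python
import re

# Character classes from the commit message: bidi overrides/isolates,
# zero-width characters, C0/C1 controls, and the U+E0000-U+E01EF tag
# block. Ranges here are illustrative, not the production list.
_HIDDEN = re.compile(
    "["
    "\u202a-\u202e\u2066-\u2069"            # bidi embedding/override/isolate
    "\u200b-\u200f\u2060\ufeff"             # zero-width chars and joiners
    "\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f"   # C0/C1 controls (keeps \t\n\r)
    "\U000e0000-\U000e01ef"                 # Unicode tag block
    "]"
)

def sanitize(text: str) -> str:
    """L1 defense: strip hidden Unicode from MCP-supplied strings
    before they are embedded in the agent prompt."""
    return _HIDDEN.sub("", text)
```

Without this pass, a tool name like `list_mail` followed by tag-block characters would carry invisible content through spotlighting untouched.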
Copilot AI review requested due to automatic review settings April 27, 2026 22:01
…luate

# Conflicts:
#	CHANGELOG.md
#	src/Microsoft.Agents.A365.DevTools.Cli/Constants/ErrorCodes.cs
@github-actions github-actions Bot added the documentation Improvements or additions to documentation label Apr 27, 2026
Resolved conflicts:
- CHANGELOG.md: combined Added entries from both sides
- DevelopMcpCommand.cs: collapsed CreateCommand overloads into a single
  signature with both evaluationPipelineService and graphApiService as
  optional, preserving all existing 2-arg call sites
- Program.cs: pass both new args to DevelopMcpCommand.CreateCommand and
  keep DI registrations for both evaluate-pipeline services and the
  bootstrap-config-resolver

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>