Add evaluation pipeline for MCP quality #362
5-step pipeline: discover tools from the MCP server, generate an auditable checklist, evaluate semantic checks via a coding agent CLI (GitHub Copilot or Claude Code), analyze scores/maturity/action items, render an HTML report.

Key design decisions:
- Extract-evaluate-merge pattern: each tool evaluated in its own ~25KB temp file to avoid coding agent timeouts on large checklists
- Engine fallthrough: tries Copilot first, then Claude Code, with per-tool 6-minute timeout and process tree cleanup on timeout (sketched below)
- Copilot uses prompt-file approach (no stdin support); Claude uses stdin piping
- 25 deterministic checks (C#) + 12 semantic checks per tool (coding agent)
- 18-smell taxonomy with weighted 5-category scoring and maturity levels 0-4
- 318 new tests (xUnit + FluentAssertions)
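The engine-fallthrough bullet's timeout-plus-cleanup behavior maps onto standard `System.Diagnostics` APIs. A minimal sketch, with `RunWithTimeoutAsync` and its parameters as assumed names rather than the PR's actual code:

```csharp
using System.Diagnostics;

static async Task<bool> RunWithTimeoutAsync(
    ProcessStartInfo startInfo, TimeSpan perToolTimeout, CancellationToken ct)
{
    using var proc = Process.Start(startInfo)
        ?? throw new InvalidOperationException("Failed to start the coding agent CLI.");
    using var cts = CancellationTokenSource.CreateLinkedTokenSource(ct);
    cts.CancelAfter(perToolTimeout);
    try
    {
        await proc.WaitForExitAsync(cts.Token);
        return proc.ExitCode == 0;
    }
    catch (OperationCanceledException) when (!ct.IsCancellationRequested)
    {
        proc.Kill(entireProcessTree: true); // reap the CLI and any child processes
        return false;                       // caller falls through to the next engine
    }
}
```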
Dependency Review: ✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found. Scanned files: none.
- Switch EvaluateCommand to the InvocationContext pattern with CancellationToken threaded through the entire evaluation pipeline (see the sketch below)
- Fix Claude Code on Windows: use prompt-file instead of stdin piping (cmd.exe /c does not forward stdin to child processes)
- Fix SemanticEvaluationCompleted returning false when all checks were already scored (pre-evaluated checklists)
- Remove no-op --verbose option
- Remove redundant Environment.ExitCode = 1 assignments
- Add CHANGELOG entry for a365 evaluate
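For context, the InvocationContext pattern from the first bullet looks roughly like this in the System.CommandLine 2.0 beta API (`serverUrlOption` and `pipeline` are illustrative names, not the actual members):

```csharp
// using System.CommandLine.Invocation;
evaluateCommand.SetHandler(async (InvocationContext ctx) =>
{
    // GetCancellationToken() wires Ctrl+C through the whole evaluation pipeline.
    CancellationToken ct = ctx.GetCancellationToken();
    string? serverUrl = ctx.ParseResult.GetValueForOption(serverUrlOption);
    ctx.ExitCode = await pipeline.RunAsync(serverUrl!, ct) ? 0 : 1;
});
```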
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds the new a365 evaluate command and supporting evaluation pipeline to discover MCP tools, run deterministic + semantic checks (via GitHub Copilot / Claude Code), score + compute maturity, and generate JSON/HTML reports.
Changes:
- Introduces evaluation pipeline services (schema discovery, checklist generation/evaluation, scoring, analysis, reporting) and wires them into the CLI.
- Adds models for checklists/results, smell taxonomy + maturity scoring, and prompt templates for coding-agent semantic evaluation.
- Adds extensive xUnit coverage for the new evaluation components and CLI command.
Reviewed changes
Copilot reviewed 46 out of 46 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| CHANGELOG.md | Documents the new a365 evaluate command. |
| src/Microsoft.Agents.A365.DevTools.Cli/Commands/EvaluateCommand.cs | Adds the evaluate CLI command orchestrating the 5-step pipeline. |
| src/Microsoft.Agents.A365.DevTools.Cli/Constants/ErrorCodes.cs | Adds evaluation-specific error codes. |
| src/Microsoft.Agents.A365.DevTools.Cli/Exceptions/EvaluationException.cs | Introduces a dedicated exception type for evaluation failures. |
| src/Microsoft.Agents.A365.DevTools.Cli/Microsoft.Agents.A365.DevTools.Cli.csproj | Adds HTTP client factory package + embeds HTML report template. |
| src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/ActionItem.cs | Adds report model for prioritized remediations. |
| src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/ChecklistItem.cs | Adds model for deterministic/semantic checklist items. |
| src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/EvalReportData.cs | Adds HTML-template data model (result + impact map + ladder). |
| src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/EvaluateEnums.cs | Adds enums for categories, priorities, engines, etc. |
| src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/EvaluationChecklist.cs | Adds checklist root + metadata model. |
| src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/MaturityLevel.cs | Adds maturity level model. |
| src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/SchemaEvalResult.cs | Adds top-level evaluation result model. |
| src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/SmellDefinition.cs | Adds smell taxonomy definition model. |
| src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/ToolChecklist.cs | Adds per-tool checklist + grouped checks model. |
| src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/ToolEvalResult.cs | Adds per-tool evaluation result model. |
| src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/ToolSchema.cs | Adds discovered tool schema model. |
| src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/ToolsetEvalResult.cs | Adds toolset-level evaluation result model. |
| src/Microsoft.Agents.A365.DevTools.Cli/Program.cs | Registers evaluate command + DI services (HttpClient + pipeline services). |
| src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ActionItemGenerator.cs | Generates action items from failed checks with score impact + smell mapping. |
| src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ChecklistEvaluator.cs | Runs extract-evaluate-merge with coding agents and merges results back. |
| src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/CodingAgentRunner.cs | Detects/runs Copilot/Claude CLIs with timeouts and cleanup. |
| src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/EvaluationAnalyzer.cs | Computes scores, maturity, smell summary, and action items aggregation. |
| src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/IChecklistEvaluator.cs | Interface + result type for semantic evaluation step. |
| src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/IChecklistGenerator.cs | Interface for checklist generation step. |
| src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/IEvaluationAnalyzer.cs | Interface for analysis/scoring step. |
| src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/IReportGenerator.cs | Interface for report generation step. |
| src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ISchemaDiscoveryService.cs | Interface for MCP tool discovery step. |
| src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/MaturityCalculator.cs | Implements maturity level calculation + ladder rendering data. |
| src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ReportGenerator.cs | Writes JSON/HTML report and optionally opens browser. |
| src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SchemaDiscoveryService.cs | Implements MCP initialize + tools/list discovery over JSON-RPC + SSE handling. |
| src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/Scorer.cs | Implements category/tool/overall scoring + averages. |
| src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SemanticCheckDefinitions.cs | Defines semantic check metadata for tools/params/toolset. |
| src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SemanticCheckPrompts.cs | Builds prompts/commands for coding-agent semantic evaluation. |
| src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SmellTaxonomy.cs | Adds 18-smell taxonomy and impact map for HTML report. |
| src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Commands/EvaluateCommandTests.cs | Tests CLI structure + engine parsing + server-name derivation. |
| src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/ActionItemGeneratorTests.cs | Tests action item generation + score impact + sorting. |
| src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/DeterministicChecksTests.cs | Tests deterministic checks behavior (naming/description/schema/toolset). |
| src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/EvaluationAnalyzerTests.cs | Tests analyzer aggregation, scoring, maturity, smells, priorities. |
| src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/MaturityCalculatorTests.cs | Tests maturity thresholds/caps + ladder output. |
| src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/ReportGeneratorTests.cs | Tests JSON/HTML report generation + sanitization + arg validation. |
| src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/ScorerTests.cs | Tests scoring rules and weight math. |
| src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/SemanticCheckDefinitionsTests.cs | Tests semantic check definitions are consistent and stable. |
Comments suppressed due to low confidence (5)
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ReportGenerator.cs:1
`htmlPath` may contain spaces/shell-special characters, and passing it via the `(fileName, arguments)` constructor can break argument parsing (especially on Linux/macOS). Prefer `ProcessStartInfo.ArgumentList` (single arg) or explicitly quote/escape the path to ensure the report opens reliably.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ReportGenerator.cs:1
The HTML template receives raw JSON substituted into the document. If any server-supplied fields (tool names/descriptions, reasons, etc.) contain sequences like `</script>` or other HTML-breaking content, this can break the page or enable script injection when the report is opened. Consider embedding the JSON in a safer form (e.g., HTML-escaped JSON, a base64-encoded payload decoded at runtime, or a `<script type="application/json">` block with escaping of `<`, `>`, `&`, and `</script>`).
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SemanticCheckPrompts.cs:1
The JSON structure example suggests the `parameters` object is keyed by `"param_name"`, but in the actual model it's a dictionary keyed by the actual parameter name (e.g., `"userId": { "param_name": [...], ... }`). This mismatch can mislead the coding agent and reduce evaluation success; update the example to reflect the real shape.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SchemaDiscoveryService.cs:1
`SerializerOptions` is declared but not used anywhere in this file. Remove it to avoid dead code, or use it consistently for any deserialization that needs case-insensitive behavior.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SchemaDiscoveryService.cs:1
For SSE responses, `ReadAsStringAsync` reads the entire response body before parsing, which can be problematic if the server keeps the event stream open (the call may not complete promptly). Consider switching to streaming processing (e.g., `SendAsync(..., ResponseHeadersRead)` and reading the content stream line-by-line until a JSON `data:` message is found) to avoid hangs and reduce memory usage.
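A minimal sketch of that streaming approach, assuming a JSON-RPC POST whose response may be an SSE stream (the method and variable names are illustrative, not the service's actual API):

```csharp
using System.Text.Json;

static async Task<JsonDocument?> ReadFirstSseJsonAsync(
    HttpClient http, HttpRequestMessage request, CancellationToken ct)
{
    // ResponseHeadersRead returns as soon as headers arrive, before the body completes.
    using var response = await http.SendAsync(
        request, HttpCompletionOption.ResponseHeadersRead, ct);
    response.EnsureSuccessStatusCode();

    await using var stream = await response.Content.ReadAsStreamAsync(ct);
    using var reader = new StreamReader(stream);

    // Read line-by-line until the first SSE "data:" payload, then stop.
    while (await reader.ReadLineAsync() is { } line)
    {
        if (!line.StartsWith("data:", StringComparison.Ordinal)) continue;
        var payload = line["data:".Length..].Trim();
        if (payload.Length > 0) return JsonDocument.Parse(payload);
    }
    return null; // stream ended without a data: message
}
```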
Drop DeterministicChecks and its tests (unreferenced after inlining into ChecklistGenerator), plus unused methods ActionItemGenerator.GenerateFromChecks and SemanticCheckPrompts.BuildClaudeCodeCommand/BuildGithubCopilotCommand. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Inline the evaluate subcommand in DevelopMcpCommand and extract the 5-step pipeline into IEvaluationPipelineService so the command stays thin. Adds a DevelopMcpCommand.CreateCommand overload that accepts the pipeline service; the existing 2-param signature remains for tests that don't need evaluate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Repair JSON produced by coding agents: tolerate trailing commas and insert missing commas before deserializing the updated checklist, since agents occasionally emit structurally invalid JSON. Run Copilot with the Haiku model (extracted to a single constant) so both engines default to the same fast/cheap tier. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
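As an illustration of the missing-comma repair (a guess at one possible approach, not the PR's actual RepairJson; trailing commas can instead be tolerated at parse time via `AllowTrailingCommas` on the reader options):

```csharp
using System.Text.RegularExpressions;

// Sketch: insert a comma when a value terminator (}, ], ", true/false/null, digit)
// is followed only by a line break and the start of the next member or element.
// Real code should re-parse afterwards and fall back to a retry if still invalid.
static string RepairJson(string json) =>
    Regex.Replace(
        json,
        pattern: @"([}\]""]|true|false|null|\d)(\s*\r?\n\s*)([""{\[])",
        replacement: "$1,$2$3");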
…luate
Conflicts: CHANGELOG.md
@microsoft-github-policy-service agree company="Microsoft"
- CodingAgentRunner: correct the class summary to describe actual prompt delivery (Claude Code uses stdin on Unix, temp file on Windows; Copilot always uses a temp file).
- ActionItemGenerator: map unknown CheckCategory values to "unknown" instead of "schema_structure", so new categories fall back to the default weight rather than silently inheriting the schema-structure weight.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switch from AddHttpClient<T>() to the project's standard HttpClientFactory pattern (matches GraphApiService, ArmApiService, etc.). This removes the default LoggingHttpMessageHandler that emitted four "Start/Sending/Received/End processing" lines per request at Information level, cleaning up the user-facing output during schema discovery.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the "Where You Stand" section rendered the maturity ladder and nothing below it when the server was at Level 4 (the top) — no "To reach Level N+1" box to guide users. This left a visual gap that looked like missing content. Add a terminal-state message acknowledging the server has reached the highest maturity level and pointing to the action items for remaining refinements. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Copilot reviewed 48 out of 48 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (2)
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ReportGenerator.cs:1
Injecting raw JSON into an HTML template via string replacement can be unsafe if the template places it inside a `<script>` tag: tool descriptions could contain `</script>` and prematurely terminate the script, enabling HTML/script injection in the generated report. Escape the JSON for safe embedding (e.g., replace `</` with `<\/`), or embed it as HTML-encoded text within a `<script type="application/json">` element and parse it at runtime.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SchemaDiscoveryService.cs:1
`SerializerOptions` is declared but never used anywhere in this file. Remove it to reduce noise, or apply it consistently where deserialization occurs (if the intent is to deserialize into typed models rather than using JsonDocument/manual parsing).
Previously the evaluate pipeline emitted a mix of developer-facing noise
(duplicate "Engines available" / "Engines available again" lines, stray
"Coding agent completed successfully" after every tool) and lacked clear
progress indicators, making it hard to tell where the run was at a glance.
Rework the output around a 5-step pipeline with aligned indented detail
lines. Key changes:
- Step markers [1/5]..[5/5] for discovery, checklist, eval, analysis, report.
- Single "Using <Engine>" line (with optional fallback) instead of three
"Detecting / Available / Engines available" lines.
- Per-tool progress prints once per tool with an inline status ("ok" or
"failed (continuing)"), not before+after.
- Demote "Coding agent completed / exited / timed out" to debug — the
user already sees success/failure on the per-tool line.
- When no coding agent CLI is found, write the semantic eval prompt to
semantic_eval_prompt.txt next to the checklist and guide users through
install options OR scoring with their own LLM.
- Remove the old "Analyzing results..." / "Analysis complete" / "Generating
report..." intermediate lines; the step markers and trailing "Done. Score"
line already convey that information.
- Suppress the extraneous initial checklist-path log at Information level.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Correct SemanticEvaluationCompleted: require zero remaining unevaluated semantic checks before marking complete. Previously a single successful tool would flip the flag to true, letting Scorer treat still-null categories as perfect 100 and inflate overall scores on partial runs.
- Switch `develop-mcp evaluate`'s required input from a positional `server-url` argument to a required `--server-url` / `-u` option, for consistency with the other develop-mcp subcommands and the Azure CLI compliance regression test.
- Route `ToolsetDesign` checks to `Scorer.ToolsetWeight` in ActionItemGenerator so action-item score impact stays aligned with overall scoring; removes an implicit reliance on the 0.15 fallback coincidentally matching ToolsetWeight.
- Add ArgumentNullException guards to the EvaluationPipelineService constructor for parity with the rest of the codebase's DI services.
- Expose ChecklistEvaluator.RepairJson as internal and add unit tests covering well-formed input, missing commas between objects/strings/booleans, and empty input.
- Relax DevelopMcpCommandTests subcommand-count assertions to check for presence/absence of "evaluate" instead of asserting a hardcoded total, so unrelated subcommand additions don't break these tests.
- Add `because:` clauses to DeriveServerName assertions so the intent of each URL-sanitization invariant is documented at the assertion site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per-tool scoring was flaky (0-34 of 48 scored across runs) because the prompt said "use a whole-file write tool if available" and the agent non-deterministically chose edit/str_replace for individual items. Those edits failed on the repeating "score: null" pattern that isn't unique across checks, and the subprocess still exited 0, so the pipeline logged "ok" with nothing merged.

Fix: build a per-engine prompt that names the exact tool the agent should use. SemanticCheckPrompts now takes an AgentToolset record describing ReadToolName/WriteToolName/EditToolName (sketched below), and ChecklistEvaluator maps EvalEngine to the concrete names (Copilot: view/create/edit; Claude Code: Read/Write/Edit). The prompt instructs "use Write/create ONCE" and warns away from targeted string replacements. Also add Write to Claude Code's --allowedTools, since a whole-file write is the reliable strategy for both engines.

E2E on learn.microsoft.com: 46/48 scored consistently (was 20-34 flaky); the 2 remaining are the toolset-level server checks, which we'll follow up on separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
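The AgentToolset record might look like the following (shape inferred from this commit message; note a later commit in this PR drops the write tool in favor of an edit-only flow):

```csharp
// Inferred shape, not the verbatim source.
internal sealed record AgentToolset(
    string ReadToolName,    // Copilot: "view",   Claude Code: "Read"
    string WriteToolName,   // Copilot: "create", Claude Code: "Write"
    string EditToolName);   // Copilot: "edit",   Claude Code: "Edit"
```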
Restricting Copilot to --available-tools=view,create caused the model to
thrash and leave checks unscored — it had the ability to do the task but
not the flexibility to pick its own strategy. Inverting the restriction
(allow everything, deny the dangerous families) lets the agent use its
full toolkit for the scoring task while blocking the two ways it could
escape the sandbox or leak data.
Denies:
Copilot:
shell, write_shell, read_shell, stop_shell, list_shell (macOS/Linux),
powershell, write_powershell, read_powershell, stop_powershell, list_powershell (Windows),
web_fetch, web_search
Claude Code:
Bash, BashOutput, KillBash, WebFetch, WebSearch
File access remains bounded by the per-invocation temp-dir sandbox —
file tools respect cwd by default, and we don't pass --allow-all-paths.
Prompt simplified: we no longer over-instruct the agent on which tool
to use, just name the read/write tool names it has and describe the
write-in-one-call strategy as a preference, not a restriction.
E2E on learn.microsoft.com: 48/48 scored, score 92/100, HTML report
generated (was flaky 20-46/48 previously).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Rename EvalEngine.GithubCopilot to GitHubCopilot so the serialized enum name matches GitHub branding (and report JSON stays consistent)
- Use the FormatEngineName display name in the report's eval-engine field instead of raw enum ToString() so downstream consumers see "GitHub Copilot"
- Pass the derived server name through ReportGenerator.SanitizeFileName so the UriFormatException fallback can't produce an invalid filename
- Drop unused workingDir parameter from EvaluateToolChecks and EvaluateServerChecks (sandbox dir is created internally)
- Fix ReportGenerator comment to drop the bogus "<!" escape mention
- Reword evaluate help text so it doesn't imply --eval-engine none is required for BYOL (auto mode also falls back to the written checklist)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Copilot model sometimes hedges on "pass if no issues" prompts and leaves the score as null instead of committing to true/false. Before this change, the pipeline accepted whatever came back from the first agent call, so runs would flake between 30/48 and 48/48 scored on identical inputs — the same tool or the same pair of server-level checks would score one run and skip the next.

Change: EvaluateToolChecks and EvaluateServerChecks now loop up to MaxAttempts (3) times. After each agent pass we merge scored items back into the in-memory checklist, re-serialize the current state to the temp file (so the next attempt only sees the items that are still null), and stop early as soon as everything is scored (sketched below).

Also wrap the deserialize-and-merge step in try/catch (JsonException). When the agent writes structurally invalid JSON (e.g. an abbreviated ChecklistItem object), we now log and retry instead of crashing the whole pipeline with an unhandled exception.

E2E on learn.microsoft.com: 48/48 scored in a single run, score 90/100, full report generated (previously needed a resume run to finish the last 2 server-level checks).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
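In outline, the retry loop reads roughly like this (all names are placeholders for the actual ChecklistEvaluator members):

```csharp
const int MaxAttempts = 3;
for (var attempt = 1; attempt <= MaxAttempts; attempt++)
{
    if (checklist.Items.All(i => i.Score is not null))
        break; // everything scored: stop early

    // Re-serialize so this attempt sees the current state, nulls included.
    await File.WriteAllTextAsync(tempFile, Serialize(checklist), ct);

    var run = await RunCodingAgentAsync(tempFile, ct);
    if (!run.Succeeded)
        continue; // per a follow-up commit: timeouts are non-deterministic, retry

    try
    {
        var updated = Deserialize(await File.ReadAllTextAsync(tempFile, ct));
        MergeScores(checklist, updated); // keep partial progress from this pass
    }
    catch (JsonException)
    {
        // Structurally invalid agent JSON: log and retry instead of crashing the run.
    }
}
```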
The fixed 6-minute per-tool timeout was fine for tools with ~18 checks (AddDraftAttachments completed in ~3.5 min), but UpdateDraft, which has 46 semantic checks, hit the wall: 46 views + 31 creates + 78 reasoning rounds from Haiku in 6 minutes wasn't enough, so the subprocess was killed and all 46 checks came back null.

Change: PerToolTimeout becomes TimeoutForChecks(checkCount) = 120s base + 15s per check, clamped to [3 min, 20 min]. ChecklistEvaluator passes the unscored-check count into each attempt, so tools with more work get more time and small tools don't idle on an over-generous budget.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Observed: Haiku needs closer to 15-20s per check (view + reason + write, with several thinking rounds) — 15s was cutting it close. Bumping to 20s keeps the same shape (base 120s + N*perCheck, clamped to [3, 20] min) but reduces the chance of hitting the ceiling mid-thought. UpdateDraft (46 checks) now gets 120 + 46*20 = 1040s = 17.3 min (was 810s = 13.5 min).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
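With the bump, the adaptive timeout reduces to a one-liner; a sketch matching the formula in these two commits:

```csharp
// base 120s + 20s per unscored check, clamped to [3 min, 20 min]
static TimeSpan TimeoutForChecks(int checkCount) =>
    TimeSpan.FromSeconds(Math.Clamp(120 + checkCount * 20, 180, 1200));

// Example: UpdateDraft's 46 checks -> 120 + 46*20 = 1040s ≈ 17.3 min.
```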
The prior retry loop only re-invoked the agent when the subprocess
exited 0 but left items null. If the first attempt hit the per-tool
timeout, we gave up immediately ("retry would just repeat the same
subprocess failure"). That assumption was wrong: on Haiku + Copilot
we see non-deterministic timeouts — the same tool that times out on
attempt 1 often completes on attempt 2 or 3 because Copilot's runtime
is warmer, or the model happens to pick a shorter reasoning path.
On the Mail MCP eval, 6 tools (SendEmailWithAttachments, GetMessage,
FlagEmail, UploadAttachment, UploadLargeAttachment, ForwardMessage)
ended with 0/N scored — all single-attempt timeouts that never got a
retry. Similar-sized tools next to them in the pipeline completed fine
on first attempt.
Change: on subprocess failure, log and continue the retry loop instead
of returning false. Still return false if *all* MaxAttempts subprocess
calls fail — we're not pretending an unreachable agent succeeded.
Same fix applied to EvaluateServerChecks.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ndbox doc

- Scorer and ActionItemGenerator: remove null checks on parameters declared non-nullable. Production callers never pass null; tests that did are dropped.
- ChecklistEvaluator: reword the EvaluateToolChecks doc to reflect that setting WorkingDirectory is a reduced-surface defense (via each engine's path verification), not a full sandbox.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause of the Mail-MCP 0/N failures on GetMessage, FlagEmail, and
UploadAttachment: Copilot's `create` tool cannot overwrite existing
files ("Cannot be used if the specified path already exists"). We were
telling the agent to "rewrite the whole file via create" — a strategy
that physically fails the moment the pre-populated temp file exists.
Some tools happened to stumble onto workarounds (create siblings, copy
fields back); others (usually smaller ones like GetMessage, 54-char
description) kept looping on the create->edit->view fallback dance for
9 minutes straight until timeout.
Fix: use an edit-only (string-replace) flow.
- SemanticCheckPrompts:
- AgentToolset now names a read tool and an edit tool (no write tool).
- New prompt instructs the agent to call edit once per null item with
an old_str that includes both the item's id and its prompt field,
which is globally unique in the file.
- Explicit "answer with first instinct, do not re-read after a
successful edit" rule to discourage the checking loop.
- ChecklistEvaluator.ToolsetFor: Copilot=(view, edit); Claude=(Read, Edit).
- CodingAgentRunner:
- Copilot: --available-tools=view,edit (drops `create`).
- Claude: --allowedTools Read,Edit (drops Write).
Validated on learn.microsoft.com and the Mail MCP server:
- learn.microsoft.com: 48/48 scored, 92/100, ~6.5 min total (was 46/48).
- Mail MCP resume: 6 previously-failing tools all score first-attempt
in ~2 min each (was 28 min + failing). Final: 638/638 scored, 82/100.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MergeScores built its lookup with ToDictionary(e => e.Id), which throws ArgumentException on duplicate keys or a null id. The surrounding try/catch only catches JsonException, so a malformed agent batch would crash the run even when earlier attempts had made real progress. Drop empty ids and take last-wins on duplicates so a broken batch is treated like other agent-JSON quirks (retry on the next attempt).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
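The duplicate-tolerant lookup can be built with GroupBy; a sketch under the assumption that `updatedItems` holds the deserialized agent batch:

```csharp
// Drop null/empty ids, then last-wins on duplicates, so a broken agent batch
// degrades to "retry on the next attempt" instead of an ArgumentException.
var byId = updatedItems
    .Where(e => !string.IsNullOrEmpty(e.Id))
    .GroupBy(e => e.Id)
    .ToDictionary(g => g.Key, g => g.Last());
```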
… on availability

- EvaluationPipelineService: when the user passes --eval-engine auto, the report used to record "auto" instead of whichever engine actually scored the checks. Thread a ChecklistEvaluationResult.EngineUsed back through TryEvaluateWithFallthrough / EvaluateToolChecks / EvaluateServerChecks so the report is stamped with the engine that ran (GitHub Copilot or Claude Code), falling back to the requested engine when none ran.
- ChecklistEvaluator.BuildEngineList: when an explicit engine is requested (e.g. --eval-engine github-copilot), check availability first. If the CLI isn't on PATH, return an empty list so the caller surfaces the same "engine not found, here's how to install" guidance it uses in Auto mode, instead of looping through per-tool failures and printing the misleading "agent ran but left checks unscored" message.
- ChecklistEvaluator: fix the RepairJson XML doc — the implementation only inserts missing commas; trailing commas are handled separately by AllowTrailingCommas in ReadOptions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds 4-layer F-001 XPIA mitigation. Each layer covers a specific failure
the others miss:
- L1 PromptSanitizer (new): strips bidi overrides, zero-width chars,
C0/C1 controls, and U+E0000-U+E01EF tag-block from tool names,
descriptions, and param names before they reach the agent. Without
this, hidden Unicode in MCP content survives spotlighting and L3
keyword filters. (A sketch of this sanitizer follows this commit message.)
- L2 spotlighting: prepends a SECURITY BOUNDARY header and wraps tool
names in <untrusted-data> tags in all 3 prompt builders. Without
this, the agent has no signal that schema content is untrusted.
- L3 ScoringSafetyFilter (new): rejects agent reasons containing
exfil URLs (http/https/ftp/data:) or prompt-injection markers
("ignore previous instructions", "system:", etc.). Cleared items
go through the existing retry loop. Without this, exfil links and
reproduced injection text reach the report.
- L4 canary: injects a fake check whose correct answer is always
false (random UUID match). A true score signals plan drift, logged
as SECURITY error and surfaced via PlanDriftDetected on the result.
This is the only post-hoc detector if L1-L3 fail silently.
Also adds F-002 XSS defense-in-depth: routes maturity.label and
AREA_LABELS values through esc() in SchemaEvalReport.html. Combined
with the existing System.Text.Json encoding and EscapeForInlineScript
layers, all 24 MCP-controlled fields are now escaped before any
innerHTML assignment.
Tests: PromptSanitizerTests, ScoringSafetyFilterTests, plus XSS
regression tests in ReportGeneratorTests. All 148 affected tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
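A self-contained sketch of the L1 stripping, forward-referenced in the first bullet above; the precise code-point list is an assumption beyond the four categories the commit names:

```csharp
using System.Text;

// Assumes well-formed UTF-16 input (ConvertToUtf32 throws on unpaired surrogates).
static string StripHiddenCharacters(string input)
{
    var sb = new StringBuilder(input.Length);
    for (var i = 0; i < input.Length; )
    {
        var cp = char.ConvertToUtf32(input, i);
        var width = char.IsSurrogatePair(input, i) ? 2 : 1;
        var hidden =
            cp is 0x202A or 0x202B or 0x202C or 0x202D or 0x202E   // bidi embeds/overrides
               or 0x2066 or 0x2067 or 0x2068 or 0x2069             // bidi isolates (assumed)
               or 0x200B or 0x200C or 0x200D or 0xFEFF             // zero-width characters
            || (cp <= 0x1F && cp is not (0x09 or 0x0A or 0x0D))    // C0 controls, keep tab/CR/LF
            || (cp >= 0x7F && cp <= 0x9F)                          // DEL + C1 controls
            || (cp >= 0xE0000 && cp <= 0xE01EF);                   // Unicode tag block
        if (!hidden) sb.Append(input, i, width);
        i += width;
    }
    return sb.ToString();
}
```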
…luate
Conflicts: CHANGELOG.md, src/Microsoft.Agents.A365.DevTools.Cli/Constants/ErrorCodes.cs
Resolved conflicts:
- CHANGELOG.md: combined Added entries from both sides
- DevelopMcpCommand.cs: collapsed CreateCommand overloads into a single signature with both evaluationPipelineService and graphApiService as optional, preserving all existing 2-arg call sites
- Program.cs: pass both new args to DevelopMcpCommand.CreateCommand and keep DI registrations for both evaluate-pipeline services and the bootstrap-config-resolver

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds `a365 develop-mcp evaluate` — a new subcommand that measures the quality of an MCP server's tool schemas and produces an interactive HTML report with prioritized action items.
Poor tool names, descriptions, or parameter schemas cause AI agents to select
the wrong tool or pass incorrect arguments. This command makes that quality
visible and actionable before the server is shipped to agents.
What the command does
1. Discovers tools from the MCP server (MCP initialize + tools/list handshake).
2. Generates an auditable checklist of deterministic checks (structural/objective, evaluated inline) and semantic checks (require language judgment).
3. Evaluates the semantic checks via a coding agent CLI: GitHub Copilot or Claude Code (both default to Haiku).
4. Analyzes the results into scores, a maturity level, and a ranked set of action items.
5. Renders an interactive HTML report that opens in the browser.
Usage
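A representative invocation, based on the options described in the commits above (`--server-url` is required; `--eval-engine` accepts values such as auto or github-copilot; the URL here is a placeholder):

```
a365 develop-mcp evaluate --server-url https://example.com/mcp --eval-engine auto
```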
Bring-your-own-LLM workflow
If no coding agent is installed (or the user wants to use a different LLM),
the command writes the checklist and a self-contained
semantic_eval_prompt.txt to the output directory, then stops. The user scores
each `score: null` item with their own tool (ChatGPT, Gemini, IDE assistant,
etc.) and re-runs the exact same command — the pipeline detects the existing
scored checklist, skips discovery, and generates the report. No special flag
required for the resume.
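The resume detection this depends on could be as small as the following sketch; the checklist.json name and AllSemanticChecksScored property are hypothetical:

```csharp
// Hypothetical names: "checklist.json" and AllSemanticChecksScored are illustrative.
var checklistPath = Path.Combine(outputDir, "checklist.json");
if (File.Exists(checklistPath))
{
    var checklist = Deserialize(await File.ReadAllTextAsync(checklistPath, ct));
    if (checklist.AllSemanticChecksScored)
    {
        // Skip discovery and agent evaluation; go straight to analysis + report.
    }
}
```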
Key design decisions
- The command stays thin: it delegates to IEvaluationPipelineService, which orchestrates the 5 steps. Keeps the command focused and the pipeline unit-testable.
- An already-scored checklist short-circuits discovery and evaluation on re-runs. This is what makes the BYOL workflow round-trip without special flags.
- Scoring refuses to proceed while any semantic check is still `score: null`, because Scorer treats all-null categories as 100 — proceeding would produce an inflated, misleading score.
- `--server-url` is a required option rather than a positional argument, for consistency with other develop-mcp subcommands and the Azure CLI regression tests.
- Naming is kept consistent across types, JSON fields, prompts, and the HTML report.
- Schema discovery uses HttpClientFactory.CreateAuthenticatedClient() — same as GraphApiService, ArmApiService, etc. — so user output isn't polluted by the default LoggingHttpMessageHandler's 4-lines-per-request noise.
Test plan
- End-to-end evaluation against the live learn.microsoft.com MCP server (48 semantic checks across its tools).
- BYOL round-trip: score the written checklist offline (scripts/simulate_llm_scoring.py), re-run the same command, confirm the report generates without re-discovery.