
Add evaluation pipeline for MCP quality#362

Open
ashragrawal wants to merge 32 commits into main from
users/ashragrawal/evaluate

Conversation


@ashragrawal ashragrawal commented Apr 11, 2026

Summary

Adds `a365 develop-mcp evaluate` — a new subcommand that measures the quality
of an MCP server's tool schemas and produces an interactive HTML report with
prioritized action items.

Poor tool names, descriptions, or parameter schemas cause AI agents to select
the wrong tool or pass incorrect arguments. This command makes that quality
visible and actionable before the server is shipped to agents.

What the command does

  1. Discovers tools from the MCP server over Streamable HTTP (JSON-RPC 2.0
    handshake).
  2. Generates a checklist combining deterministic checks
    (structural/objective, evaluated inline) and semantic checks (require
    language judgment).
  3. Evaluates semantic checks via a locally installed coding agent —
    GitHub Copilot or Claude Code (both default to Haiku).
  4. Analyzes results to compute a 0-100 score, a 0-4 maturity level, and a
    ranked set of action items.
  5. Writes a machine-readable JSON report and a self-contained HTML report
    that opens in the browser.
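Step 1 amounts to two JSON-RPC 2.0 calls in sequence: an MCP `initialize` handshake followed by `tools/list`. A minimal sketch of the request payloads in Python for illustration (the implementation is C#; the protocol version and client-info values here are placeholders, not what SchemaDiscoveryService sends):

```python
import json

def jsonrpc(method: str, params: dict, req_id: int) -> dict:
    """Build a JSON-RPC 2.0 request envelope."""
    return {"jsonrpc": "2.0", "id": req_id, "method": method, "params": params}

def discovery_requests() -> list[dict]:
    """The two requests discovery sends, in order: the MCP initialize
    handshake, then tools/list to enumerate tool schemas."""
    init = jsonrpc("initialize", {
        "protocolVersion": "2025-03-26",  # placeholder version string
        "capabilities": {},
        "clientInfo": {"name": "a365-evaluate", "version": "0.0.0"},
    }, 1)
    list_tools = jsonrpc("tools/list", {}, 2)
    return [init, list_tools]

for req in discovery_requests():
    print(json.dumps(req["method"]))
```

Each request is POSTed to the server's Streamable HTTP endpoint; the `tools/list` result is what feeds checklist generation in step 2.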

Usage

# Fully automatic (requires GitHub Copilot or Claude Code CLI)
a365 develop-mcp evaluate --server-url https://my-mcp-server.com/mcp

# With auth, specific engine, custom output
a365 develop-mcp evaluate \
  --server-url https://my-server/mcp \
  --auth-token $TOKEN \
  --eval-engine claude-code \
  --output-dir ./reports

Bring-your-own-LLM workflow

If no coding agent is installed (or the user wants to use a different LLM),
the command writes the checklist and a self-contained
semantic_eval_prompt.txt to the output directory, then stops. The user scores
each score: null item with their own tool (ChatGPT, Gemini, IDE assistant,
etc.) and re-runs the exact same command — the pipeline detects the existing
scored checklist, skips discovery, and generates the report. No special flag
required for the resume.
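The resume decision reduces to a file check. A sketch in Python (the filename and JSON field names here are hypothetical, chosen only to illustrate the shape of the logic):

```python
import json
from pathlib import Path

def is_fully_scored(checklist: dict) -> bool:
    """True iff every semantic check carries a non-null score."""
    return all(item["score"] is not None
               for tool in checklist["tools"]          # hypothetical field
               for item in tool["semantic_checks"])    # hypothetical field

def should_resume(output_dir: str) -> bool:
    """Resume (skip discovery, go straight to analysis and report)
    iff a fully scored checklist already sits in the output directory."""
    path = Path(output_dir) / "checklist.json"  # hypothetical filename
    if not path.exists():
        return False  # fresh run: discover tools and generate checklist
    return is_fully_scored(json.loads(path.read_text()))
```

Because the decision keys off the file on disk rather than a flag, any tool that fills in the `score` fields — a coding agent or the user's own LLM — makes the next run resume automatically.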

Key design decisions

  • Thin command, fat service. DevelopMcpCommand only defines args/options and
    delegates to IEvaluationPipelineService, which orchestrates the 5 steps.
    Keeps the command focused and the pipeline unit-testable.
  • Resume-by-file. The checklist JSON on disk is the source of truth between
    runs. This is what makes the BYOL workflow round-trip without special flags.
  • Strict completion check. The pipeline refuses to generate a report if any
    semantic check is still score: null, because Scorer treats all-null
    categories as 100 — proceeding would produce an inflated, misleading score.
  • Required option, not positional arg. --server-url is a required option for
    consistency with other develop-mcp subcommands and the Azure CLI regression
    tests.
  • Neutral terminology. The 18 schema-quality problems are called "issues"
    across types, JSON fields, prompts, and the HTML report.
  • Same HTTP pattern as the rest of the repo. SchemaDiscoveryService uses
    HttpClientFactory.CreateAuthenticatedClient() — same as GraphApiService,
    ArmApiService, etc. — so user output isn't polluted by the default
    LoggingHttpMessageHandler's 4-lines-per-request noise.
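The strict completion check exists because of how category averages behave when a category has no scored items. A sketch of that failure mode (simplified; not Scorer's actual weighting):

```python
def category_score(items: list) -> int:
    """Category average over scored items only. An all-None category
    has no signal, so it defaults to 100 — which is exactly why
    reporting on a partially scored run would inflate the result."""
    scored = [i for i in items if i is not None]
    if not scored:
        return 100
    return round(100 * sum(scored) / len(scored))

# A half-finished run: one category fully scored, one untouched.
print(category_score([True, False, True]))  # 67
print(category_score([None, None]))         # 100 — misleading if reported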

Test plan

  • dotnet test tests.proj — 1544 passing, 0 failing
  • End-to-end with --eval-engine github-copilot against learn.microsoft.com (3
    tools, 48 semantic checks)
  • End-to-end with --eval-engine claude-code
  • End-to-end with --eval-engine none on fresh run (stops with guidance)
  • End-to-end with --eval-engine auto and no agent on PATH (stops with
    install + BYOL guidance)
  • BYOL resume: score the checklist manually (simulated via
    scripts/simulate_llm_scoring.py), re-run same command, confirm report
    generates without re-discovery
  • Engine fallthrough: Copilot timeout falls through to Claude Code
  • Clean logs (no Start/Sending/Received/End HTTP noise)
  • Help text mentions coding-agent dependency + BYOL alternative

5-step pipeline: discover tools from MCP server, generate auditable checklist,
evaluate semantic checks via coding agent CLI (GitHub Copilot or Claude Code),
analyze scores/maturity/action items, render HTML report.

Key design decisions:
- Extract-evaluate-merge pattern: each tool evaluated in its own ~25KB temp
  file to avoid coding agent timeouts on large checklists
- Engine fallthrough: tries Copilot first, then Claude Code, with per-tool
  6-minute timeout and process tree cleanup on timeout
- Copilot uses prompt-file approach (no stdin support); Claude uses stdin piping
- 25 deterministic checks (C#) + 12 semantic checks per tool (coding agent)
- 18-smell taxonomy with weighted 5-category scoring and maturity levels 0-4
- 318 new tests (xUnit + FluentAssertions)
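The extract-evaluate-merge pattern can be pictured as two small functions (a Python sketch with hypothetical field names; the real ChecklistEvaluator is C#):

```python
import copy

def extract_per_tool(checklist: dict) -> dict:
    """Step 1: one small payload per tool, so the coding agent never
    sees the whole checklist and each subprocess stays within budget."""
    return {tool["name"]: copy.deepcopy(tool) for tool in checklist["tools"]}

def merge_scores(checklist: dict, tool_name: str, evaluated: dict) -> dict:
    """Step 3: copy scores the agent produced back into the master
    checklist, keyed by check id; unscored items stay null for retry."""
    by_id = {c["id"]: c for c in evaluated["semantic_checks"]}
    for tool in checklist["tools"]:
        if tool["name"] != tool_name:
            continue
        for check in tool["semantic_checks"]:
            scored = by_id.get(check["id"])
            if scored is not None and scored["score"] is not None:
                check["score"] = scored["score"]
    return checklist
```

Merging by check id (rather than trusting the agent's whole file) means a partially successful agent pass still makes forward progress.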

github-actions Bot commented Apr 11, 2026

⚠️ Deprecation Warning: The deny-licenses option is deprecated for possible removal in the next major release. For more information, see issue 997.

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

@ashragrawal ashragrawal marked this pull request as ready for review April 13, 2026 19:46
@ashragrawal ashragrawal requested review from a team as code owners April 13, 2026 19:46
Copilot AI review requested due to automatic review settings April 13, 2026 19:46
- Switch EvaluateCommand to InvocationContext pattern with CancellationToken
  threaded through the entire evaluation pipeline
- Fix Claude Code on Windows: use prompt-file instead of stdin piping
  (cmd.exe /c does not forward stdin to child processes)
- Fix SemanticEvaluationCompleted returning false when all checks were
  already scored (pre-evaluated checklists)
- Remove no-op --verbose option
- Remove redundant Environment.ExitCode = 1 assignments
- Add CHANGELOG entry for a365 evaluate

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds the new a365 evaluate command and supporting evaluation pipeline to discover MCP tools, run deterministic + semantic checks (via GitHub Copilot / Claude Code), score + compute maturity, and generate JSON/HTML reports.

Changes:

  • Introduces evaluation pipeline services (schema discovery, checklist generation/evaluation, scoring, analysis, reporting) and wires them into the CLI.
  • Adds models for checklists/results, smell taxonomy + maturity scoring, and prompt templates for coding-agent semantic evaluation.
  • Adds extensive xUnit coverage for the new evaluation components and CLI command.

Reviewed changes

Copilot reviewed 46 out of 46 changed files in this pull request and generated 3 comments.

Summary per file:
CHANGELOG.md Documents the new a365 evaluate command.
src/Microsoft.Agents.A365.DevTools.Cli/Commands/EvaluateCommand.cs Adds the evaluate CLI command orchestrating the 5-step pipeline.
src/Microsoft.Agents.A365.DevTools.Cli/Constants/ErrorCodes.cs Adds evaluation-specific error codes.
src/Microsoft.Agents.A365.DevTools.Cli/Exceptions/EvaluationException.cs Introduces a dedicated exception type for evaluation failures.
src/Microsoft.Agents.A365.DevTools.Cli/Microsoft.Agents.A365.DevTools.Cli.csproj Adds HTTP client factory package + embeds HTML report template.
src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/ActionItem.cs Adds report model for prioritized remediations.
src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/ChecklistItem.cs Adds model for deterministic/semantic checklist items.
src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/EvalReportData.cs Adds HTML-template data model (result + impact map + ladder).
src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/EvaluateEnums.cs Adds enums for categories, priorities, engines, etc.
src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/EvaluationChecklist.cs Adds checklist root + metadata model.
src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/MaturityLevel.cs Adds maturity level model.
src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/SchemaEvalResult.cs Adds top-level evaluation result model.
src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/SmellDefinition.cs Adds smell taxonomy definition model.
src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/ToolChecklist.cs Adds per-tool checklist + grouped checks model.
src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/ToolEvalResult.cs Adds per-tool evaluation result model.
src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/ToolSchema.cs Adds discovered tool schema model.
src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/ToolsetEvalResult.cs Adds toolset-level evaluation result model.
src/Microsoft.Agents.A365.DevTools.Cli/Program.cs Registers evaluate command + DI services (HttpClient + pipeline services).
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ActionItemGenerator.cs Generates action items from failed checks with score impact + smell mapping.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ChecklistEvaluator.cs Runs extract-evaluate-merge with coding agents and merges results back.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/CodingAgentRunner.cs Detects/runs Copilot/Claude CLIs with timeouts and cleanup.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/EvaluationAnalyzer.cs Computes scores, maturity, smell summary, and action items aggregation.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/IChecklistEvaluator.cs Interface + result type for semantic evaluation step.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/IChecklistGenerator.cs Interface for checklist generation step.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/IEvaluationAnalyzer.cs Interface for analysis/scoring step.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/IReportGenerator.cs Interface for report generation step.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ISchemaDiscoveryService.cs Interface for MCP tool discovery step.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/MaturityCalculator.cs Implements maturity level calculation + ladder rendering data.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ReportGenerator.cs Writes JSON/HTML report and optionally opens browser.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SchemaDiscoveryService.cs Implements MCP initialize + tools/list discovery over JSON-RPC + SSE handling.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/Scorer.cs Implements category/tool/overall scoring + averages.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SemanticCheckDefinitions.cs Defines semantic check metadata for tools/params/toolset.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SemanticCheckPrompts.cs Builds prompts/commands for coding-agent semantic evaluation.
src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SmellTaxonomy.cs Adds 18-smell taxonomy and impact map for HTML report.
src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Commands/EvaluateCommandTests.cs Tests CLI structure + engine parsing + server-name derivation.
src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/ActionItemGeneratorTests.cs Tests action item generation + score impact + sorting.
src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/DeterministicChecksTests.cs Tests deterministic checks behavior (naming/description/schema/toolset).
src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/EvaluationAnalyzerTests.cs Tests analyzer aggregation, scoring, maturity, smells, priorities.
src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/MaturityCalculatorTests.cs Tests maturity thresholds/caps + ladder output.
src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/ReportGeneratorTests.cs Tests JSON/HTML report generation + sanitization + arg validation.
src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/ScorerTests.cs Tests scoring rules and weight math.
src/Tests/Microsoft.Agents.A365.DevTools.Cli.Tests/Services/Evaluate/SemanticCheckDefinitionsTests.cs Tests semantic check definitions are consistent and stable.
Comments suppressed due to low confidence (5)

  • src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ReportGenerator.cs:1 — htmlPath may contain spaces/shell-special characters, and passing it via the (fileName, arguments) constructor can break argument parsing (especially on Linux/macOS). Prefer ProcessStartInfo.ArgumentList (single arg) or explicitly quote/escape the path to ensure the report opens reliably.
  • src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ReportGenerator.cs:1 — The HTML template receives raw JSON substituted into the document. If any server-supplied fields (tool names/descriptions, reasons, etc.) contain sequences like </script> or other HTML-breaking content, this can break the page or enable script injection when the report is opened. Consider embedding the JSON in a safer form (e.g., HTML-escaped JSON, base64-encoded payload decoded at runtime, or a <script type="application/json"> block with escaping of <, >, &, and </script>).
  • src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SemanticCheckPrompts.cs:1 — The JSON structure example suggests the parameters object is keyed by "param_name", but in the actual model it’s a dictionary keyed by the actual parameter name (e.g., "userId": { "param_name": [...], ... }). This mismatch can mislead the coding agent and reduce evaluation success; update the example to reflect the real shape.
  • src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SchemaDiscoveryService.cs:1 — SerializerOptions is declared but not used anywhere in this file. Remove it to avoid dead code, or use it consistently for any deserialization that needs case-insensitive behavior.
  • src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SchemaDiscoveryService.cs:1 — For SSE responses, ReadAsStringAsync reads the entire response body before parsing, which can be problematic if the server keeps the event stream open (the call may not complete promptly). Consider switching to streaming processing (e.g., SendAsync(..., ResponseHeadersRead) and reading the content stream line-by-line until a JSON data: message is found) to avoid hangs and reduce memory usage.

Comment thread src/Microsoft.Agents.A365.DevTools.Cli/Commands/EvaluateCommand.cs Outdated
ashragrawal and others added 3 commits April 16, 2026 14:01
Drop DeterministicChecks and its tests (unreferenced after inlining
into ChecklistGenerator), plus unused methods ActionItemGenerator.GenerateFromChecks
and SemanticCheckPrompts.BuildClaudeCodeCommand/BuildGithubCopilotCommand.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Inline the evaluate subcommand in DevelopMcpCommand and extract the
5-step pipeline into IEvaluationPipelineService so the command stays thin.
Adds a DevelopMcpCommand.CreateCommand overload that accepts the pipeline
service; the existing 2-param signature remains for tests that don't
need evaluate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Repair JSON produced by coding agents: tolerate trailing commas and
insert missing commas before deserializing the updated checklist,
since agents occasionally emit structurally invalid JSON.

Run Copilot with the Haiku model (extracted to a single constant) so
both engines default to the same fast/cheap tier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
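The repair described above — tolerating trailing commas and inserting missing ones — can be sketched in a few regexes (a simplified Python illustration of the idea; the real RepairJson is C#, and a regex approach like this would also rewrite brace sequences inside string literals, so it is strictly best-effort):

```python
import json
import re

def repair_json(text: str) -> str:
    """Best-effort repair of agent-emitted JSON: drop trailing commas
    and insert commas missing between back-to-back objects."""
    # Remove trailing commas before a closing brace or bracket.
    text = re.sub(r",\s*([}\]])", r"\1", text)
    # Insert a comma between adjacent objects: `} {` -> `}, {`.
    text = re.sub(r"}\s*{", "}, {", text)
    return text

print(json.loads(repair_json('[{"a": 1,} {"b": 2}]')))  # [{'a': 1}, {'b': 2}]
```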
Copilot AI review requested due to automatic review settings April 16, 2026 21:10

Copilot AI left a comment


Pull request overview

Copilot reviewed 48 out of 48 changed files in this pull request and generated 1 comment.

@ashragrawal

@microsoft-github-policy-service agree company="Microsoft"

- CodingAgentRunner: correct the class summary to describe actual prompt
  delivery (Claude Code uses stdin on Unix, temp file on Windows;
  Copilot always uses a temp file).
- ActionItemGenerator: map unknown CheckCategory values to "unknown"
  instead of "schema_structure", so new categories fall back to the
  default weight rather than silently inheriting schema-structure weight.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 16, 2026 21:51

Copilot AI left a comment


Pull request overview

Copilot reviewed 48 out of 48 changed files in this pull request and generated 9 comments.

Comment thread src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ChecklistEvaluator.cs Outdated
Comment thread src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ChecklistEvaluator.cs Outdated
Comment thread src/Microsoft.Agents.A365.DevTools.Cli/Commands/DevelopMcpCommand.cs Outdated
ashragrawal and others added 2 commits April 16, 2026 16:32
Switch from AddHttpClient<T>() to the project's standard HttpClientFactory
pattern (matches GraphApiService, ArmApiService, etc.). This removes the
default LoggingHttpMessageHandler that emitted four "Start/Sending/
Received/End processing" lines per request at Information level, cleaning
up the user-facing output during schema discovery.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the "Where You Stand" section rendered the maturity ladder
and nothing below it when the server was at Level 4 (the top) — no
"To reach Level N+1" box to guide users. This left a visual gap that
looked like missing content.

Add a terminal-state message acknowledging the server has reached the
highest maturity level and pointing to the action items for remaining
refinements.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 16, 2026 23:41

Copilot AI left a comment


Pull request overview

Copilot reviewed 48 out of 48 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (2)

  • src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ReportGenerator.cs:1 — Injecting raw JSON into an HTML template via string replacement can be unsafe if the template places it inside a <script> tag: tool descriptions could contain </script> and prematurely terminate the script, enabling HTML/script injection in the generated report. Escape JSON for safe embedding (e.g., replace </ with <\\/), or embed the JSON as HTML-encoded text within a <script type=\"application/json\"> element and parse it at runtime.
  • src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/SchemaDiscoveryService.cs:1 — SerializerOptions is declared but never used anywhere in this file. Remove it to reduce noise, or apply it consistently where deserialization occurs (if the intent is to deserialize into typed models rather than using JsonDocument/manual parsing).

ashragrawal and others added 2 commits April 16, 2026 17:02
Previously the evaluate pipeline emitted a mix of developer-facing noise
(duplicate "Engines available" / "Engines available again" lines, stray
"Coding agent completed successfully" after every tool) and lacked clear
progress indicators, making it hard to tell where the run was at a glance.

Rework the output around a 5-step pipeline with aligned indented detail
lines. Key changes:

- Step markers [1/5]..[5/5] for discovery, checklist, eval, analysis, report.
- Single "Using <Engine>" line (with optional fallback) instead of three
  "Detecting / Available / Engines available" lines.
- Per-tool progress prints once per tool with an inline status ("ok" or
  "failed (continuing)"), not before+after.
- Demote "Coding agent completed / exited / timed out" to debug — the
  user already sees success/failure on the per-tool line.
- When no coding agent CLI is found, write the semantic eval prompt to
  semantic_eval_prompt.txt next to the checklist and guide users through
  install options OR scoring with their own LLM.
- Remove the old "Analyzing results..." / "Analysis complete" / "Generating
  report..." intermediate lines; the step markers and trailing "Done. Score"
  line already convey that information.
- Suppress the extraneous initial checklist-path log at Information level.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Correct SemanticEvaluationCompleted: require zero remaining unevaluated
  semantic checks before marking complete. Previously a single successful
  tool would flip the flag to true, letting Scorer treat still-null
  categories as perfect 100 and inflate overall scores on partial runs.
- Switch `develop-mcp evaluate`'s required input from a positional
  `server-url` argument to a required `--server-url` / `-u` option, for
  consistency with the other develop-mcp subcommands and the Azure CLI
  compliance regression test.
- Route `ToolsetDesign` checks to `Scorer.ToolsetWeight` in
  ActionItemGenerator so action-item score impact stays aligned with
  overall scoring; removes an implicit reliance on the 0.15 fallback
  coincidentally matching ToolsetWeight.
- Add ArgumentNullException guards to the EvaluationPipelineService
  constructor for parity with the rest of the codebase's DI services.
- Expose ChecklistEvaluator.RepairJson as internal and add unit tests
  covering well-formed input, missing commas between objects/strings/
  booleans, and empty input.
- Relax DevelopMcpCommandTests subcommand-count assertions to check for
  presence/absence of "evaluate" instead of asserting a hardcoded total,
  so unrelated subcommand additions don't break these tests.
- Add `because:` clauses to DeriveServerName assertions so the intent of
  each URL-sanitization invariant is documented at the assertion site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 17, 2026 00:12

Copilot AI left a comment


Pull request overview

Copilot reviewed 49 out of 49 changed files in this pull request and generated 6 comments.

Comment thread src/Microsoft.Agents.A365.DevTools.Cli/Models/Evaluate/EvaluateEnums.cs Outdated
Comment thread src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ChecklistEvaluator.cs Outdated
ashragrawal and others added 2 commits April 20, 2026 15:03
Per-tool scoring was flaky (0-34 of 48 scored across runs) because the
prompt said "use a whole-file write tool if available" and the agent
non-deterministically chose edit/str_replace for individual items. Those
edits failed on the repeating "score: null" pattern that isn't unique
across checks, and the subprocess still exited 0 so the pipeline logged
"ok" with nothing merged.

Fix: build a per-engine prompt that names the exact tool the agent should
use. SemanticCheckPrompts now takes an AgentToolset record describing
ReadToolName/WriteToolName/EditToolName, and ChecklistEvaluator maps
EvalEngine to the concrete names (Copilot: view/create/edit,
Claude Code: Read/Write/Edit). The prompt instructs "use Write/create
ONCE" and warns away from targeted string replacements.

Also add Write to Claude Code's --allowedTools since a whole-file write
is the reliable strategy for both engines.

E2E on learn.microsoft.com: 46/48 scored consistently (was 20-34 flaky);
the 2 remaining are the toolset-level server checks, which we'll follow
up on separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Restricting Copilot to --available-tools=view,create caused the model to
thrash and leave checks unscored — it had the ability to do the task but
not the flexibility to pick its own strategy. Inverting the restriction
(allow everything, deny the dangerous families) lets the agent use its
full toolkit for the scoring task while blocking the two ways it could
escape the sandbox or leak data.

Denies:
  Copilot:
    shell, write_shell, read_shell, stop_shell, list_shell (macOS/Linux),
    powershell, write_powershell, read_powershell, stop_powershell, list_powershell (Windows),
    web_fetch, web_search
  Claude Code:
    Bash, BashOutput, KillBash, WebFetch, WebSearch

File access remains bounded by the per-invocation temp-dir sandbox —
file tools respect cwd by default, and we don't pass --allow-all-paths.

Prompt simplified: we no longer over-instruct the agent on which tool
to use, just name the read/write tool names it has and describe the
write-in-one-call strategy as a preference, not a restriction.

E2E on learn.microsoft.com: 48/48 scored, score 92/100, HTML report
generated (was flaky 20-46/48 previously).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 20, 2026 23:25

Copilot AI left a comment


Pull request overview

Copilot reviewed 49 out of 49 changed files in this pull request and generated 4 comments.

Comment thread src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ChecklistEvaluator.cs Outdated
Comment thread src/Microsoft.Agents.A365.DevTools.Cli/Commands/DevelopMcpCommand.cs Outdated
ashragrawal and others added 3 commits April 20, 2026 17:38
- Rename EvalEngine.GithubCopilot to GitHubCopilot so the serialized enum
  name matches GitHub branding (and report JSON stays consistent)
- Use FormatEngineName display name in report eval-engine field instead
  of raw enum ToString() so downstream consumers see "GitHub Copilot"
- Pass derived server name through ReportGenerator.SanitizeFileName so
  the UriFormatException fallback can't produce an invalid filename
- Drop unused workingDir parameter from EvaluateToolChecks and
  EvaluateServerChecks (sandbox dir is created internally)
- Fix ReportGenerator comment to drop the bogus "<!" escape mention
- Reword evaluate help text so it doesn't imply --eval-engine none is
  required for BYOL (auto-mode also falls back to the written checklist)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Copilot model sometimes hedges on "pass if no issues" prompts and
leaves the score as null instead of committing to true/false. Before
this change, the pipeline accepted whatever came back from the first
agent call, so runs would flake between 30/48 and 48/48 scored on
identical inputs — the same tool or same pair of server-level checks
would score one run and skip the next.

Change: EvaluateToolChecks and EvaluateServerChecks now loop up to
MaxAttempts (3) times. After each agent pass we merge scored items
back into the in-memory checklist, re-serialize the current state to
the temp file (so the next attempt only sees the items that are still
null), and stop early as soon as everything is scored.

Also wrap the deserialize-and-merge step in try/catch (JsonException).
When the agent writes structurally invalid JSON (e.g. an abbreviated
ChecklistItem object), we now log and retry instead of crashing the
whole pipeline with an unhandled exception.

E2E on learn.microsoft.com: 48/48 scored in a single run, score 90/100,
full report generated (previously needed a resume run to finish the
last 2 server-level checks).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The fixed 6-minute per-tool timeout was fine for tools with ~18 checks
(AddDraftAttachments completed in ~3.5 min) but UpdateDraft, which has
46 semantic checks, hit the wall: 46 views + 31 creates + 78 reasoning
rounds from Haiku in 6 minutes wasn't enough, so the subprocess was
killed and all 46 checks came back null.

Change: PerToolTimeout becomes TimeoutForChecks(checkCount) =
  120s base + 15s per check, clamped to [3min, 20min]

ChecklistEvaluator passes the unscored-check count into each attempt,
so tools with more work get more time and small tools don't idle on
an over-generous budget.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 21, 2026 00:47
Observed Haiku needs closer to 15-20s per check (view + reason + write,
with several thinking rounds) — 15s was cutting it close. Bumping to 20s
keeps the same shape (base 120s + N*perCheck, clamped to [3, 20] min)
but reduces the chance of hitting the ceiling mid-thought.

UpdateDraft (46 checks) now gets 120 + 46*20 = 1040s = 17.3 min
(was 13.0 min).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
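With the bump, the timeout formula from these two commits is base 120s + 20s per unscored check, clamped to [3 min, 20 min]. A direct transcription in Python (illustrative; the implementation is C#):

```python
def timeout_for_checks(check_count: int) -> int:
    """Per-tool coding-agent timeout in seconds:
    120s base + 20s per unscored check, clamped to [3 min, 20 min]."""
    return min(max(120 + 20 * check_count, 180), 1200)

print(timeout_for_checks(46))  # UpdateDraft: 1040s, about 17.3 min
print(timeout_for_checks(3))   # tiny tool: clamped up to the 180s floor
```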

Copilot AI left a comment


Pull request overview

Copilot reviewed 49 out of 49 changed files in this pull request and generated 6 comments.

Comment thread src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/Scorer.cs
Comment thread src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ChecklistEvaluator.cs Outdated
Comment thread src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ChecklistEvaluator.cs Outdated
Comment thread src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/Scorer.cs
ashragrawal and others added 2 commits April 20, 2026 20:56
The prior retry loop only re-invoked the agent when the subprocess
exited 0 but left items null. If the first attempt hit the per-tool
timeout, we gave up immediately ("retry would just repeat the same
subprocess failure"). That assumption was wrong: on Haiku + Copilot
we see non-deterministic timeouts — the same tool that times out on
attempt 1 often completes on attempt 2 or 3 because Copilot's runtime
is warmer, or the model happens to pick a shorter reasoning path.

On the Mail MCP eval, 6 tools (SendEmailWithAttachments, GetMessage,
FlagEmail, UploadAttachment, UploadLargeAttachment, ForwardMessage)
ended with 0/N scored — all single-attempt timeouts that never got a
retry. Similar-sized tools next to them in the pipeline completed fine
on first attempt.

Change: on subprocess failure, log and continue the retry loop instead
of returning false. Still return false if *all* MaxAttempts subprocess
calls fail — we're not pretending an unreachable agent succeeded.

Same fix applied to EvaluateServerChecks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
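The retry behavior this commit describes — keep looping on subprocess failure, merge whatever got scored after each successful pass, stop early once nothing is left, and report failure only when every attempt's subprocess failed — can be sketched as (Python illustration with hypothetical callables; the real loop lives in ChecklistEvaluator, in C#):

```python
MAX_ATTEMPTS = 3

def evaluate_with_retries(run_agent, merge, unscored):
    """run_agent(items) -> (subprocess_ok, scored_items);
    merge(items, scored) -> items still unscored after the pass."""
    any_subprocess_ok = False
    for _ in range(MAX_ATTEMPTS):
        ok, scored = run_agent(unscored)
        if not ok:
            continue  # timeout/crash: retry instead of giving up
        any_subprocess_ok = True
        unscored = merge(unscored, scored)
        if not unscored:
            return True  # everything scored: stop early
    return any_subprocess_ok  # False only if every subprocess call failed
```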
…ndbox doc

- Scorer and ActionItemGenerator: remove null checks on parameters declared
  non-nullable. Production callers never pass null; tests that did are dropped.
- ChecklistEvaluator: reword EvaluateToolChecks doc to reflect that setting
  WorkingDirectory is a reduced-surface defense (via each engine's path
  verification), not a full sandbox.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 21, 2026 05:09

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 49 out of 49 changed files in this pull request and generated 1 comment.

ashragrawal and others added 2 commits April 20, 2026 23:23
Root cause of the Mail-MCP 0/N failures on GetMessage, FlagEmail, and
UploadAttachment: Copilot's `create` tool cannot overwrite existing
files ("Cannot be used if the specified path already exists"). We were
telling the agent to "rewrite the whole file via create" — a strategy
that physically fails the moment the pre-populated temp file exists.
Some tools happened to stumble onto workarounds (create siblings, copy
fields back); others (usually smaller ones like GetMessage, 54-char
description) kept looping on the create->edit->view fallback dance for
9 minutes straight until timeout.

Fix: use an edit-only (string-replace) flow.

- SemanticCheckPrompts:
  - AgentToolset now names a read tool and an edit tool (no write tool).
  - New prompt instructs the agent to call edit once per null item with
    an old_str that includes both the item's id and its prompt field,
    which is globally unique in the file.
  - Explicit "answer with first instinct, do not re-read after a
    successful edit" rule to discourage the checking loop.
- ChecklistEvaluator.ToolsetFor: Copilot=(view, edit); Claude=(Read, Edit).
- CodingAgentRunner:
  - Copilot: --available-tools=view,edit (drops `create`).
  - Claude:  --allowedTools Read,Edit (drops Write).

Validated on learn.microsoft.com and the Mail MCP server:
- learn.microsoft.com: 48/48 scored, 92/100, ~6.5 min total (was 46/48).
- Mail MCP resume: 6 previously-failing tools all score first-attempt
  in ~2 min each (was 28 min + failing). Final: 638/638 scored, 82/100.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
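The `old_str` uniqueness trick above can be illustrated with a minimal string-replace edit tool. The file layout, field names, and item ids below are hypothetical, but they show why matching on the id *and* prompt together is safe even when prompts repeat across items:

```python
def apply_edit(file_text, old_str, new_str):
    """Minimal string-replace edit tool: old_str must match exactly once."""
    if file_text.count(old_str) != 1:
        raise ValueError("old_str must be globally unique in the file")
    return file_text.replace(old_str, new_str)

# Two items share a prompt; only id + prompt together is unique.
checklist = (
    '{"id": "tool.name.clarity", "prompt": "Is the name clear?", "score": null}\n'
    '{"id": "tool.desc.clarity", "prompt": "Is the name clear?", "score": null}\n'
)
old = '"id": "tool.name.clarity", "prompt": "Is the name clear?", "score": null'
new = '"id": "tool.name.clarity", "prompt": "Is the name clear?", "score": 1'
checklist = apply_edit(checklist, old, new)
```

An edit keyed on the prompt alone would match twice and be rejected; including the id pins it to one item.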
MergeScores built its lookup with ToDictionary(e => e.Id), which throws
ArgumentException on duplicate keys or a null id. The surrounding try/catch
only catches JsonException, so a malformed agent batch would crash the run
even when earlier attempts had made real progress. Drop empty ids and take
last-wins on duplicates so a broken batch is treated like other agent-JSON
quirks (retry on the next attempt).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
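The drop-empty-ids / last-wins merge described above, sketched in Python (the entry shape is illustrative; the real code operates on the C# checklist item type):

```python
def merge_scores(existing, batch):
    """Merge an agent batch into the score lookup without throwing:
    drop entries with a missing or empty id, last-wins on duplicates."""
    lookup = dict(existing)
    for entry in batch:
        entry_id = entry.get("id")
        if not entry_id:
            continue  # malformed entry: skip rather than crash the run
        lookup[entry_id] = entry  # duplicate id: later entry wins
    return lookup

# A batch with a duplicate id and a null id no longer aborts the run,
# unlike a strict ToDictionary(e => e.Id)-style build.
merged = merge_scores({}, [
    {"id": "a", "score": 0},
    {"id": "a", "score": 1},   # duplicate: last wins
    {"id": None, "score": 1},  # dropped
])
```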
Copilot AI review requested due to automatic review settings April 21, 2026 06:31
Copilot AI left a comment


Pull request overview

Copilot reviewed 49 out of 49 changed files in this pull request and generated 4 comments.

Comment thread src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ChecklistEvaluator.cs Outdated
Comment thread src/Microsoft.Agents.A365.DevTools.Cli/Services/Evaluate/ChecklistEvaluator.cs Outdated
ashragrawal and others added 2 commits April 21, 2026 13:59
… on availability

- EvaluationPipelineService: when the user passes --eval-engine auto, the
  report used to record "auto" instead of whichever engine actually scored
  the checks. Thread a ChecklistEvaluationResult.EngineUsed back through
  TryEvaluateWithFallthrough / EvaluateToolChecks / EvaluateServerChecks so
  the report is stamped with the engine that ran (GitHub Copilot or Claude
  Code), falling back to the requested engine when none ran.
- ChecklistEvaluator.BuildEngineList: when an explicit engine is requested
  (e.g. --eval-engine github-copilot), check availability first. If the CLI
  isn't on PATH, return an empty list so the caller surfaces the same
  "engine not found, here's how to install" guidance it uses in Auto mode,
  instead of looping through per-tool failures and printing the misleading
  "agent ran but left checks unscored" message.
- ChecklistEvaluator: fix RepairJson XML doc — the implementation only
  inserts missing commas; trailing commas are handled separately by
  AllowTrailingCommas in ReadOptions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
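The engine-selection behavior described above, sketched in Python (names and the availability set are illustrative stand-ins for the actual `BuildEngineList` signature):

```python
def build_engine_list(requested, available):
    """Auto mode: every installed engine in preference order.
    Explicit mode: [engine] only if its CLI is installed, else an
    empty list so the caller surfaces install guidance instead of
    looping through per-tool failures."""
    preference = ["github-copilot", "claude-code"]
    if requested == "auto":
        return [e for e in preference if e in available]
    return [requested] if requested in available else []
```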
Adds 4-layer F-001 XPIA mitigation. Each layer covers a specific failure
the others miss:

- L1 PromptSanitizer (new): strips bidi overrides, zero-width chars,
  C0/C1 controls, and U+E0000-U+E01EF tag-block from tool names,
  descriptions, and param names before they reach the agent. Without
  this, hidden Unicode in MCP content survives spotlighting and L3
  keyword filters.
- L2 spotlighting: prepends a SECURITY BOUNDARY header and wraps tool
  names in <untrusted-data> tags in all 3 prompt builders. Without
  this, the agent has no signal that schema content is untrusted.
- L3 ScoringSafetyFilter (new): rejects agent reasons containing
  exfil URLs (http/https/ftp/data:) or prompt-injection markers
  ("ignore previous instructions", "system:", etc.). Cleared items
  go through the existing retry loop. Without this, exfil links and
  reproduced injection text reach the report.
- L4 canary: injects a fake check whose correct answer is always
  false (random UUID match). A true score signals plan drift, logged
  as SECURITY error and surfaced via PlanDriftDetected on the result.
  This is the only post-hoc detector if L1-L3 fail silently.

Also adds F-002 XSS defense-in-depth: routes maturity.label and
AREA_LABELS values through esc() in SchemaEvalReport.html. Combined
with the existing System.Text.Json encoding and EscapeForInlineScript
layers, all 24 MCP-controlled fields are now escaped before any
innerHTML assignment.

Tests: PromptSanitizerTests, ScoringSafetyFilterTests, plus XSS
regression tests in ReportGeneratorTests. All 148 affected tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
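The L1 sanitizer can be sketched as a single strip pass over the character classes named above. This is a Python illustration of the idea; the actual implementation is C# and its exact character coverage may differ:

```python
import re

# Character classes from the commit message: bidi overrides/isolates,
# zero-width characters, C0/C1 controls, and the U+E0000-U+E01EF tag
# block. Ranges here are illustrative, not the production list.
_HIDDEN = re.compile(
    "["
    "\u202a-\u202e\u2066-\u2069"            # bidi embedding/override/isolate
    "\u200b-\u200f\u2060\ufeff"             # zero-width chars and joiners
    "\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f"   # C0/C1 controls (keeps \t\n\r)
    "\U000e0000-\U000e01ef"                 # Unicode tag block
    "]"
)

def sanitize(text: str) -> str:
    """L1 defense: strip hidden Unicode from MCP-supplied strings
    before they are embedded in the agent prompt."""
    return _HIDDEN.sub("", text)
```

Without this pass, a tool name like `list_mail` followed by tag-block characters would carry invisible content through spotlighting untouched.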
Copilot AI review requested due to automatic review settings April 27, 2026 22:01
…luate

# Conflicts:
#	CHANGELOG.md
#	src/Microsoft.Agents.A365.DevTools.Cli/Constants/ErrorCodes.cs
@github-actions github-actions Bot added the documentation Improvements or additions to documentation label Apr 27, 2026
Resolved conflicts:
- CHANGELOG.md: combined Added entries from both sides
- DevelopMcpCommand.cs: collapsed CreateCommand overloads into a single
  signature with both evaluationPipelineService and graphApiService as
  optional, preserving all existing 2-arg call sites
- Program.cs: pass both new args to DevelopMcpCommand.CreateCommand and
  keep DI registrations for both evaluate-pipeline services and the
  bootstrap-config-resolver

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>