diff --git a/.github/skills/ci-analysis/SKILL.md b/.github/skills/ci-analysis/SKILL.md index 13177698c00a33..8ae99a98c8b1d1 100644 --- a/.github/skills/ci-analysis/SKILL.md +++ b/.github/skills/ci-analysis/SKILL.md @@ -1,6 +1,6 @@ --- name: ci-analysis -description: Analyze CI build and test status from Azure DevOps and Helix for dotnet repository PRs. Use when checking CI status, investigating failures, determining if a PR is ready to merge, or given URLs containing dev.azure.com or helix.dot.net. Also use when asked "why is CI red", "test failures", "retry CI", "rerun tests", or "is CI green". +description: Analyze CI build and test status from Azure DevOps and Helix for dotnet repository PRs. Use when checking CI status, investigating failures, determining if a PR is ready to merge, or given URLs containing dev.azure.com or helix.dot.net. Also use when asked "why is CI red", "test failures", "retry CI", "rerun tests", "is CI green", "build failed", "checks failing", or "flaky tests". --- # Azure DevOps and Helix CI Analysis @@ -9,6 +9,8 @@ Analyze CI build status and test failures in Azure DevOps and Helix for dotnet r > 🚨 **NEVER** use `gh pr review --approve` or `--request-changes`. Only `--comment` is allowed. Approval and blocking are human-only actions. +**Workflow**: Gather PR context (Step 0) → run the script → read the human-readable output + `[CI_ANALYSIS_SUMMARY]` JSON → synthesize recommendations yourself. The script collects data; you generate the advice. + ## When to Use This Skill Use this skill when: @@ -64,7 +66,7 @@ The script operates in three distinct modes depending on what information you ha | You have... | Use | What you get | |-------------|-----|-------------| -| A GitHub PR number | `-PRNumber 12345` | Full analysis: all builds, failures, known issues, retry recommendation | +| A GitHub PR number | `-PRNumber 12345` | Full analysis: all builds, failures, known issues, structured JSON summary | | An AzDO build ID | `-BuildId 1276327` | Single build analysis: timeline, failures, Helix results | | A Helix job ID (optionally a specific work item) | `-HelixJob "..." [-WorkItem "..."]` | Deep dive: list work items for the job, or with `-WorkItem`, focus on a single work item's console logs, artifacts, and test results | @@ -73,7 +75,7 @@ The script operates in three distinct modes depending on what information you ha ## What the Script Does ### PR Analysis Mode (`-PRNumber`) -1. Discovers all AzDO builds associated with the PR +1. Discovers AzDO builds associated with the PR (via `gh pr checks` — finds failing builds and one non-failing build as fallback; for full build history, use `azure-devops-pipelines_get_builds`) 2. Fetches Build Analysis for known issues 3. Gets failed jobs from Azure DevOps timeline 4. **Separates canceled jobs from failed jobs** (canceled may be dependency-canceled or timeout-canceled) @@ -81,26 +83,28 @@ The script operates in three distinct modes depending on what information you ha 6. Fetches console logs (with `-ShowLogs`) 7. Searches for known issues with "Known Build Error" label 8. Correlates failures with PR file changes -9. **Provides smart retry recommendations** +9. **Emits structured summary** — `[CI_ANALYSIS_SUMMARY]` JSON block with all key facts for the agent to reason over + +> **After the script runs**, you (the agent) generate recommendations. The script collects data; you synthesize the advice. See [Generating Recommendations](#generating-recommendations) below. ### Build ID Mode (`-BuildId`) 1. 
Fetches the build timeline directly (skips PR discovery) -2. Performs steps 3–7 and 9 from PR Analysis Mode, but does **not** fetch Build Analysis known issues or correlate failures with PR file changes (those require a PR number) +2. Performs steps 3–7 from PR Analysis Mode, but does **not** fetch Build Analysis known issues or correlate failures with PR file changes (those require a PR number). Still emits `[CI_ANALYSIS_SUMMARY]` JSON. ### Helix Job Mode (`-HelixJob` [and optional `-WorkItem`]) 1. With `-HelixJob` alone: enumerates work items for the job and summarizes their status 2. With `-HelixJob` and `-WorkItem`: queries the specific work item for status and artifacts 3. Fetches console logs and file listings, displays detailed failure information -> ⚠️ **Canceled ≠ Failed.** Canceled jobs often have completed Helix work items — the AzDO wrapper timed out but tests may have passed. See "Recovering Results from Canceled Jobs" below. - ## Interpreting Results **Known Issues section**: Failures matching existing GitHub issues - these are tracked and being investigated. -**Canceled jobs**: Jobs that were canceled (not failed) due to earlier stage failures or timeouts. Dependency-canceled jobs (canceled because an earlier stage failed) don't need investigation. Timeout-canceled jobs may still have recoverable Helix results — see "Recovering Results from Canceled Jobs" below. +**Build Analysis check status**: The "Build Analysis" GitHub check is **green** only when *every* failure is matched to a known issue. If it's **red**, at least one failure is unaccounted for — do NOT claim "all failures are known issues" just because some known issues were found. You must verify each failing job is covered by a specific known issue before calling it safe to retry. -> ❌ **Don't dismiss canceled jobs.** Timeout-canceled jobs may have passing Helix results that prove the "failure" was just an AzDO timeout wrapper issue. +**Canceled/timed-out jobs**: Jobs canceled due to earlier stage failures or AzDO timeouts. Dependency-canceled jobs don't need investigation. **Timeout-canceled jobs may have all-passing Helix results** — the "failure" is just the AzDO job wrapper timing out, not actual test failures. To verify: use `hlx_status` on each Helix job in the timed-out build. If all work items passed, the build effectively passed. + +> ❌ **Don't dismiss timed-out builds.** A build marked "failed" due to a 3-hour AzDO timeout can have 100% passing Helix work items. Check before concluding it failed. **PR Change Correlation**: Files changed by PR appearing in failures - likely PR-related. @@ -110,173 +114,147 @@ The script operates in three distinct modes depending on what information you ha **Local test failures**: Some repos (e.g., dotnet/sdk) run tests directly on build agents. These can also match known issues - search for the test name with the "Known Build Error" label. -> ⚠️ **Be cautious labeling failures as "infrastructure."** If Build Analysis didn't flag a failure as a known issue, treat it as potentially real — even if it looks like a device failure, Docker issue, or network timeout. Only conclude "infrastructure" when you have strong evidence (e.g., identical failure on main branch, Build Analysis match, or confirmed outage). Dismissing failures as transient without evidence delays real bug discovery. 
- -> ❌ **Don't confuse "environment-related" with "infrastructure."** A test that fails because a required framework isn't installed (e.g., .NET 2.2) is a **test defect** — the test has wrong assumptions about what's available. Infrastructure failures are *transient*: network timeouts, Docker pull failures, agent crashes, disk space. If the failure would reproduce 100% of the time on any machine with the same setup, it's a code/test issue, not infra. The word "environment" in the error doesn't make it an infrastructure problem. - -> ❌ **Missing packages on flow PRs are NOT always infrastructure failures.** When a codeflow or dependency-update PR fails with "package not found" or "version not available", don't assume it's a feed propagation delay. Flow PRs bring in behavioral changes from upstream repos that can cause the build to request *different* packages than before. Example: an SDK flow changed runtime pack resolution logic, causing builds to look for `Microsoft.NETCore.App.Runtime.browser-wasm` (CoreCLR — doesn't exist) instead of `Microsoft.NETCore.App.Runtime.Mono.browser-wasm` (what had always been used). The fix was in the flowed code, not in feed infrastructure. Always check *which* package is missing and *why* it's being requested before diagnosing as infrastructure. - -## Retry Recommendations - -The script provides a recommendation at the end: - -| Recommendation | Meaning | -|----------------|---------| -| **KNOWN ISSUES DETECTED** | Tracked issues found that may correlate with failures. Review details. | -| **LIKELY PR-RELATED** | Failures correlate with PR changes. Fix issues first. | -| **POSSIBLY TRANSIENT** | No clear cause - check main branch, search for issues. | -| **REVIEW REQUIRED** | Could not auto-determine cause. Manual review needed. | - -## Analysis Workflow - -1. **Read PR context first** - Check title, description, comments -2. **Run the script** with `-ShowLogs` for detailed failure info -3. **Check Build Analysis** - Known issues are safe to retry -4. **Correlate with PR changes** - Same files failing = likely PR-related -5. **Interpret patterns** (but don't jump to conclusions): - - Same error across many jobs → Real code issue - - Build Analysis flags a known issue → Safe to retry - - Failure is **not** in Build Analysis → Investigate further before assuming transient - - Device failures, Docker pulls, network timeouts → *Could* be infrastructure, but verify against main branch first - - Test timeout but tests passed → Executor issue, not test failure - -## Presenting Results - -The script provides a recommendation at the end, but this is based on heuristics and may be incomplete. Before presenting conclusions to the user: +**Per-failure details** (`failedJobDetails` in JSON): Each failed job includes `errorCategory`, `errorSnippet`, and `helixWorkItems`. Use these for per-job classification instead of applying a single `recommendationHint` to all failures. -> ❌ **Don't blindly trust the script's recommendation.** The heuristic can misclassify failures. If the recommendation says "POSSIBLY TRANSIENT" but you see the same test failing 5 times on the same code path the PR touched — it's PR-related. +Error categories: `test-failure`, `build-error`, `test-timeout`, `crash` (exit codes 139/134/-4), `tests-passed-reporter-failed` (all tests passed but reporter crashed — genuinely infrastructure), `unclassified` (investigate manually). -1. Review the detailed failure information, not just the summary -2. 
Look for patterns the script may have missed (e.g., related failures across jobs) -3. Consider the PR context (what files changed, what the PR is trying to do) -4. Present findings with appropriate caveats - state what is known vs. uncertain -5. If the script's recommendation seems inconsistent with the details, trust the details - -## References +> ⚠️ **`crash` does NOT always mean tests failed.** Exit code -4 often means the Helix work item wrapper timed out *after* tests completed. Always check `testResults.xml` before concluding a crash is a real failure. See [Recovering Results from Crashed/Canceled Jobs](#recovering-results-from-crashedcanceled-jobs). -- **Helix artifacts & binlogs**: See [references/helix-artifacts.md](references/helix-artifacts.md) -- **Manual investigation steps**: See [references/manual-investigation.md](references/manual-investigation.md) -- **AzDO/Helix details**: See [references/azdo-helix-reference.md](references/azdo-helix-reference.md) +> ⚠️ **Be cautious labeling failures as "infrastructure."** Only conclude infrastructure with strong evidence: Build Analysis match, identical failure on target branch, or confirmed outage. Exception: `tests-passed-reporter-failed` is genuinely infrastructure. -## Recovering Results from Canceled Jobs +> ❌ **Missing packages on flow PRs ≠ infrastructure.** Flow PRs can cause builds to request *different* packages. Check *which* package and *why* before assuming feed delay. -Canceled jobs (typically from timeouts) often still have useful artifacts. The Helix work items may have completed successfully even though the AzDO job was killed while waiting to collect results. +### Recovering Results from Crashed/Canceled Jobs -**To investigate canceled jobs:** +When an AzDO job is canceled (timeout) or Helix work items show `Crash` (exit code -4), the tests may have actually passed. Follow this procedure: -1. **Download build artifacts**: Use the AzDO artifacts API to get `Logs_Build_*` pipeline artifacts for the canceled job. These contain binlogs even for canceled jobs. -2. **Extract Helix job IDs**: Use the MSBuild MCP server to load the `SendToHelix.binlog` and search for `"Sent Helix Job"` messages. Each contains a Helix job ID. -3. **Query Helix directly**: For each job ID, query `https://helix.dot.net/api/2019-06-17/jobs/{jobId}/workitems` to get actual pass/fail results. +1. **Find the Helix job IDs** — Read the AzDO "Send to Helix" step log (use `azure-devops-pipelines_get_build_log_by_id`) and search for lines containing `Sent Helix Job`. Extract the job GUIDs. -**Example**: A `browser-wasm windows WasmBuildTests` job was canceled after 3 hours. The binlog (truncated) still contained 12 Helix job IDs. Querying them revealed all 226 work items passed — the "failure" was purely a timeout in the AzDO wrapper. +2. **Check Helix job status** — Use `hlx_batch_status` (accepts comma-separated job IDs) or `hlx_status` per job. Look at `failedCount` vs `passedCount`. -**Key insight**: "Canceled" ≠ "Failed". Always check artifacts before concluding results are lost. +3. **For work items marked Crash/Failed** — Use `hlx_files` to check if `testResults.xml` was uploaded. If it exists: + - Download it with `hlx_download_url` + - Parse the XML: `total`, `passed`, `failed` attributes on the `` element + - If `failed=0` and `passed > 0`, the tests passed — the "crash" is the wrapper timing out after test completion -## Deep Investigation with Azure CLI +4. 
**Verdict**: + - All work items passed or crash-with-passing-results → **Tests effectively passed.** The failure is infrastructure (wrapper timeout). + - Some work items have `failed > 0` in testResults.xml → **Real test failures.** Investigate those specific tests. + - No testResults.xml uploaded → Tests may not have run at all. Check console logs for errors. -When the script and GitHub APIs aren't enough (e.g., investigating internal pipeline definitions or downloading build artifacts), you can use the Azure CLI with the `azure-devops` extension. +> This pattern is common with long-running test suites (e.g., WasmBuildTests) where tests complete but the Helix work item wrapper exceeds its timeout during result upload or cleanup. -> 💡 **Prefer `az pipelines` / `az devops` commands over raw REST API calls.** The CLI handles authentication, pagination, and JSON output formatting. Only fall back to manual `Invoke-RestMethod` calls when the CLI doesn't expose the endpoint you need (e.g., artifact download URLs, specialized timeline queries). The CLI's `--query` (JMESPath) and `-o table` flags are powerful for filtering without extra scripting. +## Generating Recommendations -### Checking Azure CLI Authentication +After the script outputs the `[CI_ANALYSIS_SUMMARY]` JSON block, **you** synthesize recommendations. Do not parrot the JSON — reason over it. -Before making direct AzDO API calls, verify the CLI is installed and authenticated: +### Decision logic -```powershell -# Ensure az is on PATH (Windows may need a refresh after install) -$env:Path = [System.Environment]::GetEnvironmentVariable("Path", "Machine") + ";" + [System.Environment]::GetEnvironmentVariable("Path", "User") +Read `recommendationHint` as a starting point, then layer in context: -# Check if az CLI is available -az --version 2>$null | Select-Object -First 1 +| Hint | Action | +|------|--------| +| `BUILD_SUCCESSFUL` | No failures. Confirm CI is green. | +| `KNOWN_ISSUES_DETECTED` | Known tracked issues found — but this does NOT mean all failures are covered. Check the Build Analysis check status: if it's red, some failures are unmatched. Only recommend retry for failures that specifically match a known issue; investigate the rest. | +| `LIKELY_PR_RELATED` | Failures correlate with PR changes. Lead with "fix these before retrying" and list `correlatedFiles`. | +| `POSSIBLY_TRANSIENT` | Failures could not be automatically classified — does NOT mean they are transient. Use `failedJobDetails` to investigate each failure individually. | +| `REVIEW_REQUIRED` | Could not auto-determine cause. Review failures manually. | +| `MERGE_CONFLICTS` | PR has merge conflicts — CI won't run. Tell the user to resolve conflicts. Offer to analyze a previous build by ID. | +| `NO_BUILDS` | No AzDO builds found (CI not triggered). Offer to check if CI needs to be triggered or analyze a previous build. | -# Check if logged in and get current account -az account show --query "{name:name, user:user.name}" -o table 2>$null +Then layer in nuance the heuristic can't capture: -# If not logged in, prompt the user to authenticate: -# az login # Interactive browser login -# az login --use-device-code # Device code flow (for remote/headless) +- **Mixed signals**: Some failures match known issues AND some correlate with PR changes → separate them. Known issues = safe to retry; correlated = fix first. 
+- **Canceled jobs with recoverable results**: If `canceledJobNames` is non-empty, mention that canceled jobs may have passing Helix results (see "Recovering Results from Crashed/Canceled Jobs"). +- **Build still in progress**: If `lastBuildJobSummary.pending > 0`, note that more failures may appear. +- **Multiple builds**: If `builds` has >1 entry, `lastBuildJobSummary` reflects only the last build — use `totalFailedJobs` for the aggregate count. +- **BuildId mode**: `knownIssues` and `prCorrelation` won't be populated. Say "Build Analysis and PR correlation not available in BuildId mode." -# Get an AAD access token for AzDO REST API calls -$accessToken = (az account get-access-token --resource 499b84ac-1321-427f-aa17-267ca6975798 --query accessToken -o tsv) -$headers = @{ "Authorization" = "Bearer $accessToken" } -``` +### How to Retry -> ⚠️ If `az` is not installed, use `winget install -e --id Microsoft.AzureCLI` (Windows). The `azure-devops` extension is also required — install or verify it with `az extension add --name azure-devops` (safe to run if already installed). Ask the user to authenticate if needed. +- **AzDO builds**: Comment `/azp run {pipeline-name}` on the PR (e.g., `/azp run dotnet-sdk-public`) +- **All pipelines**: Comment `/azp run` to retry all failing pipelines +- **Helix work items**: Cannot be individually retried — must re-run the entire AzDO build -> ⚠️ **Do NOT use `az devops configure --defaults`** — it writes to a global config file and will cause conflicts if multiple agents are running concurrently. Always pass `--org` and `--project` (or `-p`) explicitly on each command. +### Tone and output format -### Querying Pipeline Definitions and Builds +Be direct. Lead with the most important finding. Structure your response as: +1. **Summary verdict** (1-2 sentences) — Is CI green? Failures PR-related? Known issues? +2. **Failure details** (2-4 bullets) — what failed, why, evidence +3. **Recommended actions** (numbered) — retry, fix, investigate. Include `/azp run` commands. -When investigating build failures, it's often useful to look at the pipeline definition itself to understand what stages, jobs, and templates are involved. +Synthesize from: JSON summary (structured facts) + human-readable output (details/logs) + Step 0 context (PR type, author intent). -**Use `az` CLI commands first** — they're simpler and handle auth automatically. Set `$buildId` from a runs list or from the AzDO URL: +## Analysis Workflow -```powershell -$org = "https://dev.azure.com/dnceng" -$project = "internal" +### Step 0: Gather Context (before running anything) -# Find a pipeline definition by name -az pipelines list --name "dotnet-unified-build" --org $org -p $project --query "[].{id:id, name:name, path:path}" -o table +Before running the script, read the PR to understand what you're analyzing. Context changes how you interpret every failure. -# Get pipeline definition details (shows YAML path, triggers, etc.) -az pipelines show --id 1330 --org $org -p $project --query "{id:id, name:name, yamlPath:process.yamlFilename, repo:repository.name}" -o table +1. **Read PR metadata** — title, description, author, labels, linked issues +2. 
**Classify the PR type** — this determines your interpretation framework: -# List recent builds for a pipeline (with filtering) -az pipelines runs list --pipeline-ids 1330 --branch "refs/heads/main" --top 5 --org $org -p $project --query "[].{id:id, result:result, finish:finishTime}" -o table +| PR Type | How to detect | Interpretation shift | +|---------|--------------|---------------------| +| **Code PR** | Human author, code changes | Failures likely relate to the changes | +| **Flow/Codeflow PR** | Author is `dotnet-maestro[bot]`, title mentions "Update dependencies" | Missing packages may be behavioral, not infrastructure (see anti-pattern below) | +| **Backport** | Title mentions "backport", targets a release branch | Failures may be branch-specific; check if test exists on target branch | +| **Merge PR** | Merging between branches (e.g., release → main) | Conflicts and merge artifacts cause failures, not the individual changes | +| **Dependency update** | Bumps package versions, global.json changes | Build failures often trace to the dependency, not the PR's own code | -# Get a specific build's details -az pipelines runs show --id $buildId --org $org -p $project --query "{id:id, result:result, sourceBranch:sourceBranch}" -o table +3. **Check existing comments** — has someone already diagnosed the failures? Is there a retry pending? +4. **Note the changed files** — you'll use these to evaluate correlation after the script runs -# List build artifacts -az pipelines runs artifact list --run-id $buildId --org $org -p $project --query "[].{name:name, type:resource.type}" -o table -``` +> ❌ **Don't skip Step 0.** Running the script without PR context leads to misdiagnosis — especially for flow PRs where "package not found" looks like infrastructure but is actually a code issue. -**Fall back to REST API** only when the CLI doesn't expose what you need (e.g., build timelines, artifact downloads): +### Step 1: Run the script -```powershell -# Get build timeline (stages, jobs, tasks with results and durations) — no CLI equivalent -$accessToken = (az account get-access-token --resource 499b84ac-1321-427f-aa17-267ca6975798 --query accessToken -o tsv) -$headers = @{ "Authorization" = "Bearer $accessToken" } -$timelineUrl = "https://dev.azure.com/dnceng/internal/_apis/build/builds/$buildId/timeline?api-version=7.1" -$timeline = (Invoke-RestMethod -Uri $timelineUrl -Headers $headers) -$timeline.records | Where-Object { $_.result -eq "failed" -and $_.type -eq "Job" } - -# Download a specific artifact (e.g., build logs with binlogs) — no CLI equivalent for zip download -$artifactName = "Windows_Workloads_x64_BuildPass2_BuildLogs_Attempt1" -$downloadUrl = "https://dev.azure.com/dnceng/internal/_apis/build/builds/$buildId/artifacts?artifactName=$artifactName&api-version=7.1&`$format=zip" -Invoke-WebRequest -Uri $downloadUrl -Headers $headers -OutFile "$env:TEMP\artifact.zip" -``` +Run with `-ShowLogs` for detailed failure info. -### Examining Pipeline YAML +### Step 2: Analyze results -All dotnet repos that use arcade put their pipeline definitions under `eng/pipelines/`. Use `az pipelines show` to find the YAML file path, then fetch it: +1. **Check Build Analysis** — If the Build Analysis GitHub check is **green**, all failures matched known issues and it's safe to retry. If it's **red**, some failures are unaccounted for — you must identify which failing jobs are covered by known issues and which are not. 
For 3+ failures, use SQL tracking to avoid missed matches (see [references/sql-tracking.md](references/sql-tracking.md)). +2. **Correlate with PR changes** — Same files failing = likely PR-related +3. **Compare with baseline** — If a test passes on the target branch but fails on the PR, compare Helix binlogs. See [references/binlog-comparison.md](references/binlog-comparison.md) — **delegate binlog download/extraction to subagents** to avoid burning context on mechanical work. +4. **Check build progression** — If the PR has multiple builds (multiple pushes), check whether earlier builds passed. A failure that appeared after a specific push narrows the investigation to those commits. See [references/build-progression-analysis.md](references/build-progression-analysis.md). Present findings as facts, not fix recommendations. +5. **Interpret patterns** (but don't jump to conclusions): + - Same error across many jobs → Real code issue + - Build Analysis flags a known issue → That *specific failure* is safe to retry (but others may not be) + - Failure is **not** in Build Analysis → Investigate further before assuming transient + - Device failures, Docker pulls, network timeouts → *Could* be infrastructure, but verify against the target branch first + - Test timeout but tests passed → Executor issue, not test failure +6. **Check for mismatch with user's question** — The script only reports builds for the current head SHA. If the user asks about a job, error, or cancellation that doesn't appear in the results, **ask** if they're referring to a prior build. Common triggers: + - User mentions a canceled job but `canceledJobNames` is empty + - User says "CI is failing" but the latest build is green + - User references a specific job name not in the current results + Offer to re-run with `-BuildId` if the user can provide the earlier build ID from AzDO. -```powershell -# Find the YAML path for a pipeline -az pipelines show --id 1330 --org $org -p $project --query "{yamlPath:process.yamlFilename, repo:repository.name}" -o table +### Step 3: Verify before claiming -# Fetch the YAML from the repo (example: dotnet/runtime's runtime-official pipeline) -# github-mcp-server-get_file_contents owner:dotnet repo:runtime path:eng/pipelines/runtime-official.yml +Before stating a failure's cause, verify your claim: -# For VMR unified builds, the YAML is in dotnet/dotnet: -# github-mcp-server-get_file_contents owner:dotnet repo:dotnet path:eng/pipelines/unified-build.yml +- **"Infrastructure failure"** → Did Build Analysis flag it? Does the same test pass on the target branch? If neither, don't call it infrastructure. +- **"Transient/flaky"** → Has it failed before? Is there a known issue? A single non-reproducing failure isn't enough to call it flaky. +- **"PR-related"** → Do the changed files actually relate to the failing test? Correlation in the script output is heuristic, not proof. +- **"Safe to retry"** → Are ALL failures accounted for (known issues or infrastructure), or are you ignoring some? Check the Build Analysis check status — if it's red, not all failures are matched. Map each failing job to a specific known issue before concluding "safe to retry." +- **"Not related to this PR"** → Have you checked if the test passes on the target branch? Don't assume — verify. 
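+
+A minimal sketch of the "safe to retry" part of that checklist, assuming the `gh` CLI is available (the PR number and repo are placeholders; the JSON fields are the `gh pr checks --json` set listed under Tips):
+
+```powershell
+# List failing checks and the Build Analysis verdict for a PR before claiming "safe to retry".
+$checks = gh pr checks 12345 --repo dotnet/sdk --json name,state,link | ConvertFrom-Json
+
+$buildAnalysis = $checks | Where-Object { $_.name -eq 'Build Analysis' }
+$failing       = $checks | Where-Object { $_.state -eq 'FAILURE' -and $_.name -ne 'Build Analysis' }
+
+"Build Analysis check: $($buildAnalysis.state)"  # SUCCESS means every failure matched a known issue
+$failing | ForEach-Object { "Needs a matching known issue: $($_.name) -> $($_.link)" }
+```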
-# Templates are usually in eng/pipelines/common/ or eng/pipelines/templates/ -``` +## References -This is especially useful when: -- A job name doesn't clearly indicate what it builds -- You need to understand stage dependencies (why a job was canceled) -- You want to find which template defines a specific step -- Investigating whether a pipeline change caused new failures +- **Helix artifacts & binlogs**: See [references/helix-artifacts.md](references/helix-artifacts.md) +- **Binlog comparison (passing vs failing)**: See [references/binlog-comparison.md](references/binlog-comparison.md) +- **Build progression (commit-to-build correlation)**: See [references/build-progression-analysis.md](references/build-progression-analysis.md) +- **Subagent delegation patterns**: See [references/delegation-patterns.md](references/delegation-patterns.md) +- **Azure CLI deep investigation**: See [references/azure-cli.md](references/azure-cli.md) +- **Manual investigation steps**: See [references/manual-investigation.md](references/manual-investigation.md) +- **SQL tracking for investigations**: See [references/sql-tracking.md](references/sql-tracking.md) +- **AzDO/Helix details**: See [references/azdo-helix-reference.md](references/azdo-helix-reference.md) ## Tips -1. Read PR description and comments first for context -2. Check if same test fails on main branch before assuming transient -3. Look for `[ActiveIssue]` attributes for known skipped tests -4. Use `-SearchMihuBot` for semantic search of related issues -5. Binlogs in artifacts help diagnose MSB4018 task failures -6. Use the MSBuild MCP server (`binlog.mcp`) to search binlogs for Helix job IDs, build errors, and properties -7. If checking CI status via `gh pr checks --json`, the valid fields are `bucket`, `completedAt`, `description`, `event`, `link`, `name`, `startedAt`, `state`, `workflow`. There is **no `conclusion` field** — `state` contains `SUCCESS`/`FAILURE` directly -8. When investigating internal AzDO pipelines, check `az account show` first to verify authentication before making REST API calls +1. Check if same test fails on the target branch before assuming transient +2. Look for `[ActiveIssue]` attributes for known skipped tests +3. Use `-SearchMihuBot` for semantic search of related issues +4. Use the binlog MCP tools (`mcp-binlog-tool-*`) to search binlogs for Helix job IDs, build errors, and properties +5. `gh pr checks --json` valid fields: `bucket`, `completedAt`, `description`, `event`, `link`, `name`, `startedAt`, `state`, `workflow` — no `conclusion` field, `state` has `SUCCESS`/`FAILURE` directly +6. "Canceled" ≠ "Failed" — canceled jobs may have recoverable Helix results. Check artifacts before concluding results are lost. diff --git a/.github/skills/ci-analysis/references/azure-cli.md b/.github/skills/ci-analysis/references/azure-cli.md new file mode 100644 index 00000000000000..ba0c5995e2f42f --- /dev/null +++ b/.github/skills/ci-analysis/references/azure-cli.md @@ -0,0 +1,93 @@ +# Deep Investigation with Azure CLI + +When the CI script and GitHub APIs aren't enough (e.g., investigating internal pipeline definitions or downloading build artifacts), use the Azure CLI with the `azure-devops` extension. + +> 💡 **Prefer `az pipelines` / `az devops` commands over raw REST API calls.** The CLI handles authentication, pagination, and JSON output formatting. Only fall back to manual `Invoke-RestMethod` calls when the CLI doesn't expose the endpoint you need (e.g., build timelines). 
The CLI's `--query` (JMESPath) and `-o table` flags are powerful for filtering without extra scripting. + +## Checking Authentication + +Before making AzDO API calls, verify the CLI is installed and authenticated: + +```powershell +# Ensure az is on PATH (Windows may need a refresh after install) +$env:Path = [System.Environment]::GetEnvironmentVariable("Path", "Machine") + ";" + [System.Environment]::GetEnvironmentVariable("Path", "User") + +# Check if az CLI is available +az --version 2>$null | Select-Object -First 1 + +# Check if logged in and get current account +az account show --query "{name:name, user:user.name}" -o table 2>$null + +# If not logged in, prompt the user to authenticate: +# az login # Interactive browser login +# az login --use-device-code # Device code flow (for remote/headless) + +# Get an AAD access token for AzDO REST API calls (only needed for raw REST) +$accessToken = (az account get-access-token --resource 499b84ac-1321-427f-aa17-267ca6975798 --query accessToken -o tsv) +$headers = @{ "Authorization" = "Bearer $accessToken" } +``` + +> ⚠️ If `az` is not installed, use `winget install -e --id Microsoft.AzureCLI` (Windows). The `azure-devops` extension is also required — install or verify it with `az extension add --name azure-devops` (safe to run if already installed). Ask the user to authenticate if needed. + +> ⚠️ **Do NOT use `az devops configure --defaults`** — it sets user-wide defaults that may not match the organization/project needed for dotnet repositories. Always pass `--org` and `--project` (or `-p`) explicitly on each command. + +## Querying Pipeline Definitions and Builds + +```powershell +$org = "https://dev.azure.com/dnceng" +$project = "internal" + +# Find a pipeline definition by name +az pipelines list --name "dotnet-unified-build" --org $org -p $project --query "[].{id:id, name:name, path:path}" -o table + +# Get pipeline definition details (shows YAML path, triggers, etc.) 
+az pipelines show --id 1330 --org $org -p $project --query "{id:id, name:name, yamlPath:process.yamlFilename, repo:repository.name}" -o table + +# List recent builds for a pipeline (replace {TARGET_BRANCH} with the PR's base branch, e.g., main or release/9.0) +az pipelines runs list --pipeline-ids 1330 --branch "refs/heads/{TARGET_BRANCH}" --top 5 --org $org -p $project --query "[].{id:id, result:result, finish:finishTime}" -o table + +# Get a specific build's details +az pipelines runs show --id $buildId --org $org -p $project --query "{id:id, result:result, sourceBranch:sourceBranch}" -o table + +# List build artifacts +az pipelines runs artifact list --run-id $buildId --org $org -p $project --query "[].{name:name, type:resource.type}" -o table + +# Download a build artifact +az pipelines runs artifact download --run-id $buildId --artifact-name "TestBuild_linux_x64" --path "$env:TEMP\artifact" --org $org -p $project +``` + +## REST API Fallback + +Fall back to REST API only when the CLI doesn't expose what you need: + +```powershell +# Get build timeline (stages, jobs, tasks with results and durations) — no CLI equivalent +$accessToken = (az account get-access-token --resource 499b84ac-1321-427f-aa17-267ca6975798 --query accessToken -o tsv) +$headers = @{ "Authorization" = "Bearer $accessToken" } +$timelineUrl = "https://dev.azure.com/dnceng/internal/_apis/build/builds/$buildId/timeline?api-version=7.1" +$timeline = (Invoke-RestMethod -Uri $timelineUrl -Headers $headers) +$timeline.records | Where-Object { $_.result -eq "failed" -and $_.type -eq "Job" } +``` + +## Examining Pipeline YAML + +All dotnet repos that use arcade put their pipeline definitions under `eng/pipelines/`. Use `az pipelines show` to find the YAML file path, then fetch it: + +```powershell +# Find the YAML path for a pipeline +az pipelines show --id 1330 --org $org -p $project --query "{yamlPath:process.yamlFilename, repo:repository.name}" -o table + +# Fetch the YAML from the repo (example: dotnet/runtime's runtime-official pipeline) +# github-mcp-server-get_file_contents owner:dotnet repo:runtime path:eng/pipelines/runtime-official.yml + +# For VMR unified builds, the YAML is in dotnet/dotnet: +# github-mcp-server-get_file_contents owner:dotnet repo:dotnet path:eng/pipelines/unified-build.yml + +# Templates are usually in eng/pipelines/common/ or eng/pipelines/templates/ +``` + +This is especially useful when: +- A job name doesn't clearly indicate what it builds +- You need to understand stage dependencies (why a job was canceled) +- You want to find which template defines a specific step +- Investigating whether a pipeline change caused new failures diff --git a/.github/skills/ci-analysis/references/binlog-comparison.md b/.github/skills/ci-analysis/references/binlog-comparison.md new file mode 100644 index 00000000000000..15a5df3460a745 --- /dev/null +++ b/.github/skills/ci-analysis/references/binlog-comparison.md @@ -0,0 +1,144 @@ +# Deep Investigation: Binlog Comparison + +When a test **passes on the target branch but fails on a PR**, comparing MSBuild binlogs from both runs reveals the exact difference in task parameters without guessing. 
+ +## When to Use This Pattern + +- Test assertion compares "expected vs actual" build outputs (e.g., CSC args, reference lists) +- A build succeeds on one branch but fails on another with different MSBuild behavior +- You need to find which MSBuild property/item change caused a specific task to behave differently + +## The Pattern: Delegate to Subagents + +> ⚠️ **Do NOT download, load, and parse binlogs in the main conversation context.** This burns 10+ turns on mechanical work. Delegate to subagents instead. + +### Step 1: Identify the two work items to compare + +Use `Get-CIStatus.ps1` to find the failing Helix job + work item, then find a corresponding passing build (recent PR merged to the target branch, or a CI run on that branch). + +**Finding Helix job IDs from build artifacts (binlogs to find binlogs):** +When the failing work item's Helix job ID isn't visible (e.g., canceled jobs, or finding a matching job from a passing build), the IDs are inside the build's `SendToHelix.binlog`: + +1. Download the build artifact with `az`: + ``` + az pipelines runs artifact list --run-id $buildId --org "https://dev.azure.com/dnceng-public" -p public --query "[].name" -o tsv + az pipelines runs artifact download --run-id $buildId --artifact-name "TestBuild_linux_x64" --path "$env:TEMP\artifact" --org "https://dev.azure.com/dnceng-public" -p public + ``` +2. Load the binlog and search for job IDs: + ``` + mcp-binlog-tool-load_binlog path:"$env:TEMP\artifact\...\SendToHelix.binlog" + mcp-binlog-tool-search_binlog binlog_file:"..." query:"Sent Helix Job" + ``` +3. Query each Helix job GUID with the CI script: + ``` + ./scripts/Get-CIStatus.ps1 -HelixJob "{GUID}" -FindBinlogs + ``` + +**For Helix work item binlogs (the common case):** +The CI script shows binlog URLs directly when you query a specific work item: +``` +./scripts/Get-CIStatus.ps1 -HelixJob "{JOB_ID}" -WorkItem "{WORK_ITEM}" +# Output includes: 🔬 msbuild.binlog: https://helix...blob.core.windows.net/... +``` + +### Step 2: Dispatch parallel subagents for extraction + +Launch two `task` subagents (can run in parallel), each with a prompt like: + +``` +Download the msbuild.binlog from Helix job {JOB_ID} work item {WORK_ITEM}. +Use the CI skill script to get the artifact URL: + ./scripts/Get-CIStatus.ps1 -HelixJob "{JOB_ID}" -WorkItem "{WORK_ITEM}" +Download the binlog URL to $env:TEMP\{label}.binlog. +Load it with the binlog MCP server (mcp-binlog-tool-load_binlog). +Search for the {TASK_NAME} task (mcp-binlog-tool-search_tasks_by_name). +Get full task details (mcp-binlog-tool-list_tasks_in_target) for the target containing the task. +Extract the CommandLineArguments parameter value. +Normalize paths: + - Replace Helix work dirs (/datadisks/disk1/work/XXXXXXXX) with {W} + - Replace runfile hashes (Program-[a-f0-9]+) with Program-{H} + - Replace temp dir names (dotnetSdkTests.[a-zA-Z0-9]+) with dotnetSdkTests.{T} +Parse into individual args using regex: (?:"[^"]+"|/[^\s]+|[^\s]+) +Sort the list and return it. +Report the total arg count prominently. +``` + +**Important:** When diffing, look for **extra or missing args** (different count), not value differences in existing args. A Debug/Release difference in `/define:` is expected noise — an extra `/analyzerconfig:` or `/reference:` arg is the real signal. + +### Step 3: Diff the results + +With two normalized arg lists, `Compare-Object` instantly reveals the difference. 
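+
+A minimal sketch of that final diff, assuming each subagent wrote its normalized, sorted arg list to a text file (the file names are illustrative):
+
+```powershell
+# Diff the two normalized CSC arg lists returned by the subagents.
+# passing-args.txt / failing-args.txt are illustrative names, one normalized arg per line.
+$passing = Get-Content "$env:TEMP\passing-args.txt"
+$failing = Get-Content "$env:TEMP\failing-args.txt"
+
+"Arg counts: passing=$($passing.Count) failing=$($failing.Count)"
+
+Compare-Object -ReferenceObject $passing -DifferenceObject $failing | ForEach-Object {
+    $label = if ($_.SideIndicator -eq '=>') { 'FAIL-ONLY' } else { 'PASS-ONLY' }
+    "{0}: {1}" -f $label, $_.InputObject
+}
+```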
+ +## Useful Binlog MCP Queries + +After loading a binlog with `mcp-binlog-tool-load_binlog`, use these queries (pass the loaded path as `binlog_file`): + +``` +# Find all invocations of a specific task +mcp-binlog-tool-search_tasks_by_name binlog_file:"$env:TEMP\my.binlog" taskName:"Csc" + +# Search for a property value +mcp-binlog-tool-search_binlog binlog_file:"..." query:"analysislevel" + +# Find what happened inside a specific target +mcp-binlog-tool-search_binlog binlog_file:"..." query:"under($target AddGlobalAnalyzerConfigForPackage_MicrosoftCodeAnalysisNetAnalyzers)" + +# Get all properties matching a pattern +mcp-binlog-tool-search_binlog binlog_file:"..." query:"GlobalAnalyzerConfig" + +# List tasks in a target (returns full parameter details including CommandLineArguments) +mcp-binlog-tool-list_tasks_in_target binlog_file:"..." projectId:22 targetId:167 +``` + +## Path Normalization + +Helix work items run on different machines with different paths. Normalize before comparing: + +| Pattern | Replacement | Example | +|---------|-------------|---------| +| `/datadisks/disk1/work/[A-F0-9]{8}` | `{W}` | Helix work directory (Linux) | +| `C:\h\w\[A-F0-9]{8}` | `{W}` | Helix work directory (Windows) | +| `Program-[a-f0-9]{64}` | `Program-{H}` | Runfile content hash | +| `dotnetSdkTests\.[a-zA-Z0-9]+` | `dotnetSdkTests.{T}` | Temp test directory | + +### After normalizing paths, focus on structural differences + +> ⚠️ **Ignore value-only differences in existing args** (e.g., Debug vs Release in `/define:`, different hash paths). These are expected configuration differences. Focus on **extra or missing args** — a different arg count indicates a real build behavior change. + +## Example: CscArguments Investigation + +A merge PR (release/10.0.3xx → main) had 208 CSC args vs 207 on main. The diff: + +``` +FAIL-ONLY: /analyzerconfig:{W}/p/d/sdk/11.0.100-ci/Sdks/Microsoft.NET.Sdk/analyzers/build/config/analysislevel_11_default.globalconfig +``` + +### What the binlog properties showed + +Both builds had identical property resolution: +- `EffectiveAnalysisLevel = 11.0` +- `_GlobalAnalyzerConfigFileName = analysislevel_11_default.globalconfig` +- `_GlobalAnalyzerConfigFile = .../config/analysislevel_11_default.globalconfig` + +### The actual root cause + +The `AddGlobalAnalyzerConfigForPackage` target has an `Exists()` condition: +```xml + + + +``` + +The merge's SDK layout **shipped** `analysislevel_11_default.globalconfig` on disk (from a newer roslyn-analyzers that flowed from 10.0.3xx), while main's SDK didn't have that file yet. Same property values, different files on disk = different build behavior. + +### Lesson learned + +Same MSBuild property resolution + different files on disk = different build behavior. Always check what's actually in the SDK layout, not just what the targets compute. + +## Anti-Patterns + +> ❌ **Don't manually split/parse CSC command lines in the main conversation.** CSC args have quoted paths, spaces, and complex structure. Regex parsing in PowerShell is fragile and burns turns on trial-and-error. Use a subagent. + +> ❌ **Don't assume the MSBuild property diff explains the behavior diff.** Two branches can compute identical property values but produce different outputs because of different files on disk, different NuGet packages, or different task assemblies. Compare the actual task invocation. 
+ +> ❌ **Don't load large binlogs and browse them interactively in main context.** Use targeted searches: `mcp-binlog-tool-search_tasks_by_name` for a specific task, `mcp-binlog-tool-search_binlog` with a focused query. Get in, get the data, get out. diff --git a/.github/skills/ci-analysis/references/build-progression-analysis.md b/.github/skills/ci-analysis/references/build-progression-analysis.md new file mode 100644 index 00000000000000..df327454a404ac --- /dev/null +++ b/.github/skills/ci-analysis/references/build-progression-analysis.md @@ -0,0 +1,209 @@ +# Deep Investigation: Build Progression Analysis + +When the current build is failing, the PR's build history can reveal whether the failure existed from the start or appeared after specific changes. This is a fact-gathering technique — like target-branch comparison — that provides context for understanding the current failure. + +## When to Use This Pattern + +- Standard analysis (script + logs) hasn't identified the root cause of the current failure +- The PR has multiple pushes and you want to know whether earlier builds passed or failed +- You need to understand whether a failure is inherent to the PR's approach or was introduced by a later change + +## The Pattern + +### Step 0: Start with the recent builds + +Don't try to analyze the full build history upfront — especially on large PRs with many pushes. Start with the most recent N builds (5-8), present the progression table, and let the user decide whether to dig deeper into earlier builds. + +On large PRs, the user is usually iterating toward a solution. The recent builds are the most relevant. Offer: "Here are the last N builds — the pass→fail transition was between X and Y. Want me to look at earlier builds?" + +### Step 1: List builds for the PR + +`gh pr checks` only shows checks for the current HEAD SHA. To see the full build history, use AzDO MCP or CLI: + +**With AzDO MCP (preferred):** +``` +azure-devops-pipelines_get_builds with: + project: "public" + branchName: "refs/pull/{PR}/merge" + top: 20 + queryOrder: "QueueTimeDescending" +``` + +The response includes `triggerInfo` with `pr.sourceSha` — the PR's HEAD commit for each build. + +**Without MCP (fallback):** +```powershell +$org = "https://dev.azure.com/dnceng-public" +$project = "public" +az pipelines runs list --branch "refs/pull/{PR}/merge" --top 20 --org $org -p $project -o json +``` + +### Step 2: Map builds to the PR's head commit + +Each build's `triggerInfo` contains `pr.sourceSha` — the PR's HEAD commit when the build was triggered. Extract it from the `azure-devops-pipelines_get_builds` response or the `az` JSON output. + +> ⚠️ **`sourceVersion` is the merge commit**, not the PR's head commit. Use `triggerInfo.'pr.sourceSha'` instead. + +> ⚠️ **Target branch moves between builds.** Each build merges `pr.sourceSha` into the target branch HEAD *at the time the build starts*. If `main` received new commits between build N and N+1, the two builds merged against different baselines — even if `pr.sourceSha` is the same. Always extract the target branch HEAD to detect baseline shifts. + +### Step 2b: Extract the target branch HEAD from checkout logs + +The AzDO build API doesn't expose the target branch SHA. Extract it from the checkout task log. 
+ +**With AzDO MCP (preferred):** +``` +azure-devops-pipelines_get_build_log_by_id with: + project: "public" + buildId: {BUILD_ID} + logId: 5 + startLine: 500 +``` + +Search the output for the merge line: +``` +HEAD is now at {mergeCommit} Merge {prSourceSha} into {targetBranchHead} +``` + +**Without MCP (fallback):** +```powershell +$token = az account get-access-token --resource "499b84ac-1321-427f-aa17-267ca6975798" --query accessToken -o tsv +$headers = @{ Authorization = "Bearer $token" } +$logUrl = "https://dev.azure.com/{org}/{project}/_apis/build/builds/{BUILD_ID}/logs/5" +$log = Invoke-RestMethod -Uri $logUrl -Headers $headers +``` + +> Note: log ID 5 is the first checkout task in most pipelines. The merge line is typically around line 500-650. If log 5 doesn't contain it, check the build timeline for "Checkout" tasks. + +Note: a PR may have more unique `pr.sourceSha` values than commits visible on GitHub, because force-pushes replace the commit history. Each force-push triggers a new build with a new merge commit and a new `pr.sourceSha`. + +### Step 3: Store progression in SQL + +Use the SQL tool to track builds as you discover them. This avoids losing context and enables queries across the full history: + +```sql +CREATE TABLE IF NOT EXISTS build_progression ( + build_id INT PRIMARY KEY, + pr_sha TEXT, + target_sha TEXT, + result TEXT, -- passed, failed, canceled + queued_at TEXT, + failed_jobs TEXT, -- comma-separated job names + notes TEXT +); +``` + +Insert rows as you extract data from each build: + +```sql +INSERT INTO build_progression VALUES + (1283986, '7af79ad', '2d638dc', 'failed', '2026-02-08T10:00:00Z', 'WasmBuildTests', 'Initial commits'), + (1284169, '28ec8a0', '0b691ba', 'failed', '2026-02-08T14:00:00Z', 'WasmBuildTests', 'Iteration 2'), + (1284433, '39dc0a6', '18a3069', 'passed', '2026-02-09T09:00:00Z', NULL, 'Iteration 3'); +``` + +Then query to find the pass→fail transition: + +```sql +-- Find where it went from passing to failing +SELECT * FROM build_progression ORDER BY queued_at; + +-- Did the target branch move between pass and fail? +SELECT pr_sha, target_sha, result FROM build_progression +WHERE result IN ('passed', 'failed') ORDER BY queued_at; + +-- Which builds share the same PR SHA? (force-push detection) +SELECT pr_sha, COUNT(*) as builds, GROUP_CONCAT(result) as results +FROM build_progression GROUP BY pr_sha HAVING builds > 1; +``` + +Present the table to the user: + +| PR HEAD | Target HEAD | Builds | Result | Notes | +|---------|-------------|--------|--------|-------| +| 7af79ad | 2d638dc | 1283986 | ❌ | Initial commits | +| 28ec8a0 | 0b691ba | 1284169 | ❌ | Iteration 2 | +| 39dc0a6 | 18a3069 | 1284433 | ✅ | Iteration 3 | +| f186b93 | 5709f35 | 1286087 | ❌ | Added commit C; target moved ~35 commits | +| 2e74845 | 482d8f9 | 1286967 | ❌ | Modified commit C | + +When both `pr.sourceSha` AND `Target HEAD` change between a pass→fail transition, either could be the cause. Analyze the failure content to determine which. If only the target moved (same `pr.sourceSha`), the failure came from the new baseline. 
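+
+To fill the `target_sha` column when you used the Step 2b REST fallback instead of the MCP log search, a small sketch that pulls both SHAs out of the downloaded checkout log text (the regex mirrors the merge-line format shown in Step 2b):
+
+```powershell
+# $log holds the raw checkout log text from the Step 2b fallback (Invoke-RestMethod).
+if ($log -match 'HEAD is now at [0-9a-f]+ Merge ([0-9a-f]+) into ([0-9a-f]+)') {
+    $prSourceSha = $Matches[1]
+    $targetHead  = $Matches[2]
+    "pr.sourceSha=$prSourceSha targetHead=$targetHead"
+}
+else {
+    Write-Warning "Merge line not found in log 5; check the timeline for other Checkout tasks."
+}
+```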
+ +#### Tracking individual test failures across builds + +For deeper analysis, track which tests failed in each build: + +```sql +CREATE TABLE IF NOT EXISTS build_failures ( + build_id INT, + job_name TEXT, + test_name TEXT, + error_snippet TEXT, + helix_job TEXT, + work_item TEXT, + PRIMARY KEY (build_id, job_name, test_name) +); +``` + +Insert failures as you investigate each build, then query for patterns: + +```sql +-- Tests that fail in every build (persistent, not flaky) +SELECT test_name, COUNT(DISTINCT build_id) as fail_count, GROUP_CONCAT(build_id) as builds +FROM build_failures GROUP BY test_name HAVING fail_count > 1; + +-- New failures in the latest build (what changed?) +SELECT f.* FROM build_failures f +LEFT JOIN build_failures prev ON f.test_name = prev.test_name AND prev.build_id = {PREV_BUILD_ID} +WHERE f.build_id = {LATEST_BUILD_ID} AND prev.test_name IS NULL; + +-- Flaky tests: fail in some builds, pass in others +SELECT test_name FROM build_failures GROUP BY test_name +HAVING COUNT(DISTINCT build_id) < (SELECT COUNT(*) FROM build_progression WHERE result = 'failed'); +``` + +### Step 4: Present findings, not conclusions + +Report what the progression shows: +- Which builds passed and which failed +- What commits were added between the last passing and first failing build +- Whether the failing commits were added in response to review feedback (check review threads) + +**Do not** make fix recommendations based solely on build progression. The progression narrows the investigation — it doesn't determine the right fix. The human may have context about why changes were made, what constraints exist, or what the reviewer intended. + +## Checking review context + +When the progression shows that a failure appeared after new commits, check whether those commits were review-requested: + +```powershell +# Get review comments with timestamps +gh api "repos/{OWNER}/{REPO}/pulls/{PR}/comments" ` + --jq '.[] | {author: .user.login, body: .body, created: .created_at}' +``` + +Present this as additional context: "Commit C was pushed after reviewer X commented requesting Y." Let the author decide how to proceed. + +## Combining with Binlog Comparison + +Build progression identifies **which change** correlates with the current failure. Binlog comparison (see [binlog-comparison.md](binlog-comparison.md)) shows **what's different** in the build between a passing and failing state. Together they provide a complete picture: + +1. Progression → "The current failure first appeared in build N+1, which added commit C" +2. Binlog comparison → "In the current (failing) build, task X receives parameter Y=Z, whereas in the passing build it received Y=W" + +## Relationship to Target-Branch Comparison + +Both techniques compare a failing build against a passing one: + +| Technique | Passing build from | Answers | +|-----------|-------------------|---------| +| **Target-branch comparison** | Recent build on the base branch (e.g., main) | "Does this test pass without the PR's changes at all?" | +| **Build progression** | Earlier build on the same PR | "Did this test pass with the PR's *earlier* changes?" | + +Use target-branch comparison first to confirm the failure is PR-related. Use build progression to narrow down *which part* of the PR introduced it. If build progression shows a pass→fail transition with the same `pr.sourceSha`, the target branch is the more likely culprit — use target-branch comparison to confirm. 
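+
+When the progression is already in the `build_progression` table from Step 3, a query along these lines surfaces that distinction directly (a sketch; the window functions assume a reasonably recent SQLite):
+
+```sql
+-- Flag pass→fail transitions and whether the PR HEAD or only the target branch moved.
+WITH ordered AS (
+  SELECT build_id, pr_sha, target_sha, result, queued_at,
+         LAG(result) OVER (ORDER BY queued_at) AS prev_result,
+         LAG(pr_sha) OVER (ORDER BY queued_at) AS prev_pr_sha
+  FROM build_progression
+)
+SELECT build_id, pr_sha, target_sha,
+       CASE WHEN pr_sha = prev_pr_sha THEN 'same pr.sourceSha: target branch moved'
+            ELSE 'PR HEAD changed (target may also have moved)' END AS likely_driver
+FROM ordered
+WHERE result = 'failed' AND prev_result = 'passed';
+```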
+ +## Anti-Patterns + +> ❌ **Don't treat build history as a substitute for analyzing the current build.** The current build determines CI status. Build history is context for understanding and investigating the current failure. + +> ❌ **Don't make fix recommendations from progression alone.** "Build N passed and build N+1 failed after adding commit C" is a fact worth reporting. "Therefore revert commit C" is a judgment that requires more context than the agent has — the commit may be addressing a critical review concern, fixing a different bug, or partially correct. + +> ❌ **Don't assume earlier passing builds prove the original approach was complete.** A build may pass because it didn't change enough to trigger the failing test scenario. The reviewer who requested additional changes may have identified a real gap. diff --git a/.github/skills/ci-analysis/references/delegation-patterns.md b/.github/skills/ci-analysis/references/delegation-patterns.md new file mode 100644 index 00000000000000..e0b191ed68c37f --- /dev/null +++ b/.github/skills/ci-analysis/references/delegation-patterns.md @@ -0,0 +1,127 @@ +# Subagent Delegation Patterns + +CI investigations involve repetitive, mechanical work that burns main conversation context. Delegate data gathering to subagents; keep interpretation in the main agent. + +## Pattern 1: Scanning Multiple Console Logs + +**When:** Multiple failing work items across several jobs. + +**Delegate:** +``` +Extract all unique test failures from these Helix work items: + +Job: {JOB_ID_1}, Work items: {ITEM_1}, {ITEM_2} +Job: {JOB_ID_2}, Work items: {ITEM_3} + +For each, use hlx_logs with jobId and workItem to get console output. +If hlx MCP is not available, fall back to: + ./scripts/Get-CIStatus.ps1 -HelixJob "{JOB}" -WorkItem "{ITEM}" + +Extract lines ending with [FAIL] (xUnit format). Ignore [OUTPUT] and [PASS] lines. + +Return JSON: { "failures": [{ "test": "Namespace.Class.Method", "workItems": ["item1", "item2"] }] } +``` + +## Pattern 2: Finding a Baseline Build + +**When:** A test fails on a PR — need to confirm it passes on the target branch. + +**Delegate:** +``` +Find a recent passing build on {TARGET_BRANCH} of dotnet/{REPO} that ran the same test leg. + +Failing build: {BUILD_ID}, job: {JOB_NAME}, work item: {WORK_ITEM} + +Steps: +1. Search for recently merged PRs: + github-mcp-server-search_pull_requests query:"is:merged base:{TARGET_BRANCH}" owner:dotnet repo:{REPO} +2. Run: ./scripts/Get-CIStatus.ps1 -PRNumber {MERGED_PR} -Repository "dotnet/{REPO}" +3. Find the build with same job name that passed +4. Locate the Helix job ID (may need artifact download — see [azure-cli.md](azure-cli.md)) + +Return JSON: { "found": true, "buildId": N, "helixJob": "...", "workItem": "...", "result": "Pass" } +Or: { "found": false, "reason": "no passing build in last 5 merged PRs" } + +If authentication fails or API returns errors, STOP and return the error — don't troubleshoot. +``` + +## Pattern 3: Extracting Merge PR Changed Files + +**When:** A large merge PR (hundreds of files) has test failures — need the file list for the main agent to analyze. + +**Delegate:** +``` +List all changed files on merge PR #{PR_NUMBER} in dotnet/{REPO}. + +Use: github-mcp-server-pull_request_read method:get_files owner:dotnet repo:{REPO} pullNumber:{PR_NUMBER} + +For each file, note: path, change type (added/modified/deleted), lines changed. 
+ +Return JSON: { "totalFiles": N, "files": [{ "path": "...", "changeType": "modified", "linesChanged": N }] } +``` + +> The main agent decides which files are relevant to the specific failures — don't filter in the subagent. + +## Pattern 4: Parallel Artifact Extraction + +**When:** Multiple builds or artifacts need independent analysis — binlog comparison, canceled job recovery, multi-build progression. + +**Key insight:** Launch one subagent per build/artifact in parallel. Each does its mechanical extraction independently. The main agent synthesizes results across all of them. + +**Delegate (per build, for binlog analysis):** +``` +Download and analyze binlog from AzDO build {BUILD_ID}, artifact {ARTIFACT_NAME}. + +Steps: +1. Download the artifact (see [azure-cli.md](azure-cli.md)) +2. Load: mcp-binlog-tool-load_binlog path:"{BINLOG_PATH}" +3. Find tasks: mcp-binlog-tool-search_tasks_by_name taskName:"Csc" +4. Get task parameters: mcp-binlog-tool-get_task_info + +Return JSON: { "buildId": N, "project": "...", "args": ["..."] } +``` + +**Delegate (per build, for canceled job recovery):** +``` +Check if canceled job "{JOB_NAME}" from build {BUILD_ID} has recoverable Helix results. + +Steps: +1. Use hlx_files with jobId:"{HELIX_JOB_ID}" workItem:"{WORK_ITEM}" to find testResults.xml +2. Download with hlx_download_url using the testResults.xml URI +3. Parse the XML for pass/fail counts on the element + +Return JSON: { "jobName": "...", "hasResults": true, "passed": N, "failed": N } +Or: { "jobName": "...", "hasResults": false, "reason": "no testResults.xml uploaded" } +``` + +This pattern scales to any number of builds — launch N subagents for N builds, collect results, compare. + +## Pattern 5: Build Progression with Target HEAD Extraction + +**When:** PR has multiple builds and you need the full progression table with target branch HEADs. + +**Delegate (one subagent per build):** +``` +Extract the target branch HEAD from AzDO build {BUILD_ID}. + +Use azure-devops-pipelines_get_build_log_by_id with: + project: "public", buildId: {BUILD_ID}, logId: 5, startLine: 500 + +Search for: "HEAD is now at {mergeCommit} Merge {prSourceSha} into {targetBranchHead}" + +Return JSON: { "buildId": N, "targetHead": "abc1234", "mergeCommit": "def5678" } +Or: { "buildId": N, "targetHead": null, "error": "merge line not found in log 5" } +``` + +Launch one per build in parallel. The main agent combines with `azure-devops-pipelines_get_builds` results to build the full progression table. + +## General Guidelines + +- **Use `general-purpose` agent type** — it has shell + MCP access (`hlx_status`, `azure-devops-pipelines_get_builds`, `mcp-binlog-tool-load_binlog`, etc.) 
+- **Run independent tasks in parallel** — the whole point of delegation +- **Include script paths** — subagents don't inherit skill context +- **Require structured JSON output** — enables comparison across subagents +- **Don't delegate interpretation** — subagents return facts, main agent reasons +- **STOP on errors** — subagents should return error details immediately, not troubleshoot auth/environment issues +- **Use SQL for many results** — when launching 5+ subagents or doing multi-phase delegation, store results in a SQL table (`CREATE TABLE results (agent_id TEXT, build_id INT, data TEXT, status TEXT)`) so you can query across all results instead of holding them in context +- **Specify `model: "claude-sonnet-4"` for MCP-heavy tasks** — default model may time out on multi-step MCP tool chains diff --git a/.github/skills/ci-analysis/references/helix-artifacts.md b/.github/skills/ci-analysis/references/helix-artifacts.md index 16bd426aad04ad..6b73691bd44a45 100644 --- a/.github/skills/ci-analysis/references/helix-artifacts.md +++ b/.github/skills/ci-analysis/references/helix-artifacts.md @@ -184,6 +184,100 @@ Get-ChildItem -Path $extractPath -Filter "*.binlog" -Recurse | ForEach-Object { If a test runs `dotnet build` internally (like SDK end-to-end tests), both sources may have relevant binlogs. +## Downloaded Artifact Layout + +When you download artifacts via MCP tools or manually, the directory structure can be confusing. Here's what to expect. + +### Helix Work Item Downloads + +Two MCP tools download Helix artifacts: +- **`hlx_download`** — downloads multiple files from a work item, with optional glob `pattern` (e.g., `pattern:"*.binlog"`). Returns local file paths. +- **`hlx_download_url`** — downloads a single file by direct URI (from `hlx_files` output). Use when you know exactly which file you need. + +`hlx_download` saves files to a temp directory. The structure is **flat** — all files from the work item land in one directory: + +``` +C:\...\Temp\helix-{hash}\ +├── console.d991a56d.log # Console output +├── testResults.xml # Test pass/fail details +├── msbuild.binlog # Only if test invoked MSBuild +├── publish.msbuild.binlog # Only if test did a publish +├── msbuild0.binlog # Numbered: first test's build +├── msbuild1.binlog # Numbered: second test's build +└── core.1000.34 # Only on crash +``` + +**Key confusion point:** Numbered binlogs (`msbuild0.binlog`, `msbuild1.binlog`) correspond to individual test cases within the work item, not to build phases. A work item like `Microsoft.NET.Build.Tests.dll.18` runs dozens of tests, each invoking MSBuild separately. To map a binlog to a specific test: +1. Load it with `mcp-binlog-tool-load_binlog` +2. Check the project paths inside — they usually contain the test name +3. Or check `testResults.xml` to correlate test execution order with binlog numbering + +### AzDO Build Artifact Downloads + +AzDO artifacts download as **ZIP files** with nested directory structures: + +``` +$env:TEMP\TestBuild_linux_x64\ +└── TestBuild_linux_x64\ # Artifact name repeated as subfolder + └── log\Release\ + ├── Build.binlog # Main build + ├── TestBuildTests.binlog # Test build verification + ├── ToolsetRestore.binlog # Toolset restore + └── SendToHelix.binlog # Contains Helix job GUIDs +``` + +**Key confusion point:** The artifact name appears twice in the path (extract folder + subfolder inside the ZIP). Use the full nested path with `mcp-binlog-tool-load_binlog`. 
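+
+For example, a minimal sketch (the artifact name comes from the layout above; the ZIP is assumed to be downloaded already, e.g. per [azure-cli.md](azure-cli.md)) that expands the archive and prints the full nested binlog paths to hand to `mcp-binlog-tool-load_binlog`:
+
+```powershell
+# Assumption: the artifact ZIP already sits in $env:TEMP; adjust names for your build.
+$artifactName = "TestBuild_linux_x64"                 # example artifact from the tree above
+$zipPath      = Join-Path $env:TEMP "$artifactName.zip"
+$extractPath  = Join-Path $env:TEMP $artifactName
+
+Expand-Archive -Path $zipPath -DestinationPath $extractPath -Force
+
+# The artifact name repeats as a subfolder inside the ZIP, so search recursively
+# and print absolute paths; these are what mcp-binlog-tool-load_binlog expects.
+Get-ChildItem -Path $extractPath -Filter "*.binlog" -Recurse |
+    Select-Object -ExpandProperty FullName
+```
+
+The printed paths include the doubled folder (e.g. `...\TestBuild_linux_x64\TestBuild_linux_x64\log\Release\Build.binlog`), so pass them through as-is rather than reconstructing them by hand.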
+ +### Mapping Binlogs to Failures + +This table shows the **typical** source for each binlog type. The boundaries aren't absolute — some repos run tests on the build agent (producing test binlogs in AzDO artifacts), and Helix work items for SDK/Blazor tests invoke `dotnet build` internally (producing build binlogs as Helix artifacts). + +| You want to investigate... | Look here first | But also check... | +|---------------------------|-----------------|-------------------| +| Why a test's internal `dotnet build` failed | Helix work item (`msbuild{N}.binlog`) | AzDO artifact if tests ran on agent | +| Why the CI build itself failed to compile | AzDO build artifact (`Build.binlog`) | — | +| Which Helix jobs were dispatched | AzDO build artifact (`SendToHelix.binlog`) | — | +| AOT compilation failure | Helix work item (`AOTBuild.binlog`) | — | +| Test build/publish behavior | Helix work item (`publish.msbuild.binlog`) | AzDO artifact (`TestBuildTests.binlog`) | + +> **Rule of thumb:** If the failing job name contains "Helix" or "Send to Helix", the test binlogs are in Helix. If the job runs tests directly (common in dotnet/sdk), check AzDO artifacts. + +### Tracking Downloaded Artifacts with SQL + +When downloading from multiple work items (e.g., binlog comparison between passing and failing builds), use SQL to avoid losing track of what's where: + +```sql +CREATE TABLE IF NOT EXISTS downloaded_artifacts ( + local_path TEXT PRIMARY KEY, + helix_job TEXT, + work_item TEXT, + build_id INT, + artifact_source TEXT, -- 'helix' or 'azdo' + file_type TEXT, -- 'binlog', 'testResults', 'console', 'crash' + notes TEXT -- e.g., 'passing baseline', 'failing PR build' +); +``` + +Key queries: +```sql +-- Find the pair of binlogs for comparison +SELECT local_path, notes FROM downloaded_artifacts +WHERE file_type = 'binlog' ORDER BY notes; + +-- What have I downloaded from a specific work item? +SELECT local_path, file_type FROM downloaded_artifacts +WHERE work_item = 'Microsoft.NET.Build.Tests.dll.18'; +``` + +Use this whenever you're juggling artifacts from 2+ Helix jobs (especially during the binlog comparison pattern in [binlog-comparison.md](binlog-comparison.md)). + +### Tips + +- **Multiple binlogs ≠ multiple builds.** A single work item can produce several binlogs if the test suite runs multiple `dotnet build`/`dotnet publish` commands. +- **Helix and AzDO binlogs can overlap.** Helix binlogs are *usually* from test execution and AzDO binlogs from the build phase, but SDK/Blazor tests invoke MSBuild inside Helix (producing build-like binlogs), and some repos run tests directly on the build agent (producing test binlogs in AzDO). Check both sources if you can't find what you need. +- **Not all work items have binlogs.** Standard unit tests only produce `testResults.xml` and console logs. +- **Use `hlx_download` with `pattern:"*.binlog"`** to filter downloads and avoid pulling large console logs. + ## Artifact Retention Helix artifacts are retained for a limited time (typically 30 days). Download important artifacts promptly if needed for long-term analysis. 
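+
+To keep results past that retention window, here is a minimal sketch (not part of the analysis script; the folder, job ID, and work item names are illustrative) that zips a flat `hlx_download` folder and emits `INSERT` rows for the `downloaded_artifacts` tracking table above:
+
+```powershell
+# Assumptions: $downloadDir is a flat hlx_download temp folder; names are placeholders.
+$downloadDir = Join-Path $env:TEMP "helix-abc123"
+$helixJob    = "00000000-0000-0000-0000-000000000000"   # hypothetical Helix job ID
+$workItem    = "Microsoft.NET.Build.Tests.dll.18"       # work item from the example above
+
+# 1. Archive the whole folder so results outlive the ~30-day Helix retention.
+$archive = Join-Path $env:TEMP "$workItem-artifacts.zip"
+Compress-Archive -Path (Join-Path $downloadDir '*') -DestinationPath $archive -Force
+
+# 2. Emit INSERT statements for the downloaded_artifacts table; run them with your SQL tool.
+Get-ChildItem -Path $downloadDir -File | ForEach-Object {
+    $file = $_
+    $type = switch -Wildcard ($file.Name) {
+        '*.binlog'        { 'binlog' }
+        'testResults.xml' { 'testResults' }
+        'console*.log'    { 'console' }
+        'core.*'          { 'crash' }
+        default           { 'other' }
+    }
+    "INSERT INTO downloaded_artifacts (local_path, helix_job, work_item, artifact_source, file_type) " +
+    "VALUES ('$($file.FullName)', '$helixJob', '$workItem', 'helix', '$type');"
+}
+```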
diff --git a/.github/skills/ci-analysis/references/manual-investigation.md b/.github/skills/ci-analysis/references/manual-investigation.md index ea3e82fb589198..c7a67b98ea91a3 100644 --- a/.github/skills/ci-analysis/references/manual-investigation.md +++ b/.github/skills/ci-analysis/references/manual-investigation.md @@ -73,10 +73,11 @@ Binlogs contain detailed MSBuild execution traces for diagnosing: - NuGet restore problems - Target execution order issues -**Using MSBuild MCP Server:** +**Using MSBuild binlog MCP tools:** ``` -msbuild-mcp analyze --binlog path/to/build.binlog --errors -msbuild-mcp analyze --binlog path/to/build.binlog --target ResolveReferences +mcp-binlog-tool-load_binlog path:"path/to/build.binlog" +mcp-binlog-tool-get_diagnostics binlog_file:"path/to/build.binlog" +mcp-binlog-tool-search_binlog binlog_file:"path/to/build.binlog" query:"error" ``` **Manual Analysis:** diff --git a/.github/skills/ci-analysis/references/sql-tracking.md b/.github/skills/ci-analysis/references/sql-tracking.md new file mode 100644 index 00000000000000..950e2f61a4465e --- /dev/null +++ b/.github/skills/ci-analysis/references/sql-tracking.md @@ -0,0 +1,107 @@ +# SQL Tracking for CI Investigations + +Use the SQL tool to track structured data during complex investigations. This avoids losing context across tool calls and enables queries that catch mistakes (like claiming "all failures known" when some are unmatched). + +## Failed Job Tracking + +Track each failure from the script output and map it to known issues as you verify them: + +```sql +CREATE TABLE IF NOT EXISTS failed_jobs ( + build_id INT, + job_name TEXT, + error_category TEXT, -- from failedJobDetails: test-failure, build-error, crash, etc. + error_snippet TEXT, + known_issue_url TEXT, -- NULL if unmatched + known_issue_title TEXT, + is_pr_correlated BOOLEAN DEFAULT FALSE, + recovery_status TEXT DEFAULT 'not-checked', -- effectively-passed, real-failure, no-results + notes TEXT, + PRIMARY KEY (build_id, job_name) +); +``` + +### Key queries + +```sql +-- Unmatched failures (Build Analysis red = these exist) +SELECT job_name, error_category, error_snippet FROM failed_jobs +WHERE known_issue_url IS NULL; + +-- Are ALL failures accounted for? +SELECT COUNT(*) as total, + SUM(CASE WHEN known_issue_url IS NOT NULL THEN 1 ELSE 0 END) as matched +FROM failed_jobs; + +-- Which crash/canceled jobs need recovery verification? +SELECT job_name, build_id FROM failed_jobs +WHERE error_category IN ('crash', 'unclassified') AND recovery_status = 'not-checked'; + +-- PR-correlated failures (fix before retrying) +SELECT job_name, error_snippet FROM failed_jobs WHERE is_pr_correlated = TRUE; +``` + +### Workflow + +1. After the script runs, insert one row per failed job from `failedJobDetails` (each entry includes `buildId`) +2. For each known issue from `knownIssues`, UPDATE matching rows with the issue URL +3. Query for unmatched failures — these need investigation +4. For crash/canceled jobs, update `recovery_status` after checking Helix results + +## Build Progression + +See [build-progression-analysis.md](build-progression-analysis.md) for the `build_progression` and `build_failures` tables that track pass/fail across multiple builds. + +> **`failed_jobs` vs `build_failures` — when to use each:** +> - `failed_jobs` (above): **Job-level** — maps each failed AzDO job to a known issue. Use for single-build triage ("are all failures accounted for?"). 
+> - `build_failures` (build-progression-analysis.md): **Test-level** — tracks individual test names across builds. Use for progression analysis ("which tests started failing after commit X?"). + +## PR Comment Tracking + +For deep-dive analysis — especially across a chain of related PRs (e.g., dependency flow failures, sequential merge PRs, or long-lived PRs with weeks of triage) — store PR comments so you can query them without re-fetching: + +```sql +CREATE TABLE IF NOT EXISTS pr_comments ( + pr_number INT, + repo TEXT DEFAULT 'dotnet/runtime', + comment_id INT PRIMARY KEY, + author TEXT, + created_at TEXT, + body TEXT, + is_triage BOOLEAN DEFAULT FALSE -- set TRUE if comment diagnoses a failure +); +``` + +### Key queries + +```sql +-- What has already been diagnosed? (avoid re-investigating) +SELECT author, created_at, substr(body, 1, 200) FROM pr_comments +WHERE is_triage = TRUE ORDER BY created_at; + +-- Cross-PR: same failure discussed in multiple PRs? +SELECT pr_number, author, substr(body, 1, 150) FROM pr_comments +WHERE body LIKE '%BlazorWasm%' ORDER BY created_at; + +-- Who was asked to investigate what? +SELECT author, substr(body, 1, 200) FROM pr_comments +WHERE body LIKE '%PTAL%' OR body LIKE '%could you%look%'; +``` + +### When to use + +- Long-lived PRs (>1 week) with 10+ comments containing triage context +- Analyzing a chain of related PRs where earlier PRs have relevant diagnosis +- When the same failure appears across multiple merge/flow PRs and you need to know what was already tried + +## When to Use SQL vs. Not + +| Situation | Use SQL? | +|-----------|----------| +| 1-2 failed jobs, all match known issues | No — straightforward, hold in context | +| 3+ failed jobs across multiple builds | Yes — prevents missed matches | +| Build progression with 5+ builds | Yes — see [build-progression-analysis.md](build-progression-analysis.md) | +| Crash recovery across multiple work items | Yes — cache testResults.xml findings | +| Single build, single failure | No — overkill | +| PR chain or long-lived PR with extensive triage comments | Yes — preserves diagnosis context across tool calls | +| Downloading artifacts from 2+ Helix jobs (e.g., binlog comparison) | Yes — see [helix-artifacts.md](helix-artifacts.md) | diff --git a/.github/skills/ci-analysis/scripts/Get-CIStatus.ps1 b/.github/skills/ci-analysis/scripts/Get-CIStatus.ps1 index e386604be3fa5e..07a7e29ce280dd 100644 --- a/.github/skills/ci-analysis/scripts/Get-CIStatus.ps1 +++ b/.github/skills/ci-analysis/scripts/Get-CIStatus.ps1 @@ -360,6 +360,16 @@ function Get-AzDOBuildIdFromPR { throw "Failed to fetch CI status for PR #$PR in $Repository - check PR number and permissions" } + # Check if PR has merge conflicts (no CI runs when mergeable_state is dirty) + $prMergeState = $null + $prMergeStateOutput = & gh api "repos/$Repository/pulls/$PR" --jq '.mergeable_state' 2>$null + $ghMergeStateExitCode = $LASTEXITCODE + if ($ghMergeStateExitCode -eq 0 -and $prMergeStateOutput) { + $prMergeState = $prMergeStateOutput.Trim() + } else { + Write-Verbose "Could not determine PR merge state (gh exit code $ghMergeStateExitCode)." 
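+        # A null merge state is non-fatal: build discovery below still runs and the
+        # JSON summary simply reports mergeState as null.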
+ } + # Find ALL failing Azure DevOps builds $failingBuilds = @{} foreach ($line in $checksOutput) { @@ -382,11 +392,19 @@ function Get-AzDOBuildIdFromPR { $buildIdStr = $anyBuildMatch.Groups[1].Value $buildIdInt = 0 if ([int]::TryParse($buildIdStr, [ref]$buildIdInt)) { - return @($buildIdInt) + return @{ BuildIds = @($buildIdInt); Reason = $null; MergeState = $prMergeState } } } } - throw "No CI build found for PR #$PR in $Repository - the CI pipeline has not been triggered yet" + if ($prMergeState -eq 'dirty') { + Write-Host "`nPR #$PR has merge conflicts (mergeable_state: dirty)" -ForegroundColor Red + Write-Host "CI will not run until conflicts are resolved." -ForegroundColor Yellow + Write-Host "Resolve conflicts and push to trigger CI, or use -BuildId to analyze a previous build." -ForegroundColor Gray + return @{ BuildIds = @(); Reason = "MERGE_CONFLICTS"; MergeState = $prMergeState } + } + Write-Host "`nNo CI build found for PR #$PR in $Repository" -ForegroundColor Red + Write-Host "The CI pipeline has not been triggered yet." -ForegroundColor Yellow + return @{ BuildIds = @(); Reason = "NO_BUILDS"; MergeState = $prMergeState } } # Return all unique failing build IDs @@ -399,7 +417,7 @@ function Get-AzDOBuildIdFromPR { } } - return $buildIds + return @{ BuildIds = $buildIds; Reason = $null; MergeState = $prMergeState } } function Get-BuildAnalysisKnownIssues { @@ -523,53 +541,14 @@ function Get-PRChangedFiles { } function Get-PRCorrelation { - param( - [array]$ChangedFiles, - [string]$FailureInfo - ) - - # Extract potential file/test names from the failure info - $correlations = @() - - foreach ($file in $ChangedFiles) { - $fileName = [System.IO.Path]::GetFileNameWithoutExtension($file) - $fileNameWithExt = [System.IO.Path]::GetFileName($file) - - # Check if the failure mentions this file - if ($FailureInfo -match [regex]::Escape($fileName) -or - $FailureInfo -match [regex]::Escape($fileNameWithExt)) { - $correlations += @{ - File = $file - MatchType = "direct" - } - } - - # Check for test file patterns - if ($file -match '\.Tests?\.' -or $file -match '/tests?/' -or $file -match '\\tests?\\') { - # This is a test file - check if the test name appears in failures - if ($FailureInfo -match [regex]::Escape($fileName)) { - $correlations += @{ - File = $file - MatchType = "test" - } - } - } - } - - return $correlations | Select-Object -Unique -Property File, MatchType -} - -function Show-PRCorrelationSummary { param( [array]$ChangedFiles, [array]$AllFailures ) - if ($ChangedFiles.Count -eq 0) { - return - } + $result = @{ CorrelatedFiles = @(); TestFiles = @() } + if ($ChangedFiles.Count -eq 0 -or $AllFailures.Count -eq 0) { return $result } - # Combine all failure info into searchable text $failureText = ($AllFailures | ForEach-Object { $_.TaskName $_.JobName @@ -578,23 +557,12 @@ function Show-PRCorrelationSummary { $_.FailedTests -join "`n" }) -join "`n" - # Also include the raw local test failure messages which may contain test class names - # These come from the "issues" property on local failures - - # Find correlations - $correlatedFiles = @() - $testFiles = @() - foreach ($file in $ChangedFiles) { $fileName = [System.IO.Path]::GetFileNameWithoutExtension($file) $fileNameWithExt = [System.IO.Path]::GetFileName($file) + $baseTestName = $fileName -replace '\.[^.]+$', '' - # For files like NtAuthTests.FakeServer.cs, also check NtAuthTests - $baseTestName = $fileName -replace '\.[^.]+$', '' # Remove .FakeServer etc. 
- - # Check if this file appears in any failure $isCorrelated = $false - if ($failureText -match [regex]::Escape($fileName) -or $failureText -match [regex]::Escape($fileNameWithExt) -or $failureText -match [regex]::Escape($file) -or @@ -602,18 +570,31 @@ function Show-PRCorrelationSummary { $isCorrelated = $true } - # Track test files separately - $isTestFile = $file -match '\.Tests?\.' -or $file -match '[/\\]tests?[/\\]' -or $file -match 'Test\.cs$' -or $file -match 'Tests\.cs$' - if ($isCorrelated) { - if ($isTestFile) { - $testFiles += $file - } else { - $correlatedFiles += $file - } + $isTestFile = $file -match '\.Tests?\.' -or $file -match '[/\\]tests?[/\\]' -or $file -match 'Test\.cs$' -or $file -match 'Tests\.cs$' + if ($isTestFile) { $result.TestFiles += $file } else { $result.CorrelatedFiles += $file } } } + $result.CorrelatedFiles = @($result.CorrelatedFiles | Select-Object -Unique) + $result.TestFiles = @($result.TestFiles | Select-Object -Unique) + return $result +} + +function Show-PRCorrelationSummary { + param( + [array]$ChangedFiles, + [array]$AllFailures + ) + + if ($ChangedFiles.Count -eq 0) { + return + } + + $correlation = Get-PRCorrelation -ChangedFiles $ChangedFiles -AllFailures $AllFailures + $correlatedFiles = $correlation.CorrelatedFiles + $testFiles = $correlation.TestFiles + # Show results if ($correlatedFiles.Count -gt 0 -or $testFiles.Count -gt 0) { Write-Host "`n=== PR Change Correlation ===" -ForegroundColor Magenta @@ -644,7 +625,7 @@ function Show-PRCorrelationSummary { } } - Write-Host "`nThese failures are likely PR-related." -ForegroundColor Yellow + Write-Host "`nCorrelated files found — check JSON summary for details." -ForegroundColor Yellow } } @@ -1390,7 +1371,7 @@ function Get-HelixWorkItemDetails { # (https://github.com/dotnet/dnceng/issues/6072). ListFiles returns direct # blob storage URIs that always work. $listFiles = Get-HelixWorkItemFiles -JobId $JobId -WorkItemName $WorkItemName - if ($listFiles) { + if ($null -ne $listFiles) { $response.Files = @($listFiles | ForEach-Object { [PSCustomObject]@{ FileName = $_.Name @@ -1494,6 +1475,8 @@ function Format-TestFailure { $failureCount = 0 # Expanded failure detection patterns + # CAUTION: These trigger "failure block" capture. Overly broad patterns (e.g. \w+Error:) + # will grab Python harness/reporter noise and swamp the real test failure. 
$failureStartPatterns = @( '\[FAIL\]', 'Assert\.\w+\(\)\s+Failure', @@ -1501,7 +1484,8 @@ function Format-TestFailure { 'BUG:', 'FAILED\s*$', 'END EXECUTION - FAILED', - 'System\.\w+Exception:' + 'System\.\w+Exception:', + 'Timed Out \(timeout' ) $combinedPattern = ($failureStartPatterns -join '|') @@ -1737,8 +1721,43 @@ try { $buildIds = @() $knownIssuesFromBuildAnalysis = @() $prChangedFiles = @() + $noBuildReason = $null if ($PSCmdlet.ParameterSetName -eq 'PRNumber') { - $buildIds = @(Get-AzDOBuildIdFromPR -PR $PRNumber) + $buildResult = Get-AzDOBuildIdFromPR -PR $PRNumber + if ($buildResult.Reason) { + # No builds found — emit summary with reason and exit + $noBuildReason = $buildResult.Reason + $buildIds = @() + $summary = [ordered]@{ + mode = "PRNumber" + repository = $Repository + prNumber = $PRNumber + builds = @() + totalFailedJobs = 0 + totalLocalFailures = 0 + lastBuildJobSummary = [ordered]@{ + total = 0; succeeded = 0; failed = 0; canceled = 0; pending = 0; warnings = 0; skipped = 0 + } + failedJobNames = @() + failedJobDetails = @() + canceledJobNames = @() + knownIssues = @() + prCorrelation = [ordered]@{ + changedFileCount = 0 + hasCorrelation = $false + correlatedFiles = @() + } + recommendationHint = if ($noBuildReason -eq "MERGE_CONFLICTS") { "MERGE_CONFLICTS" } else { "NO_BUILDS" } + noBuildReason = $noBuildReason + mergeState = $buildResult.MergeState + } + Write-Host "" + Write-Host "[CI_ANALYSIS_SUMMARY]" + Write-Host ($summary | ConvertTo-Json -Depth 5) + Write-Host "[/CI_ANALYSIS_SUMMARY]" + exit 0 + } + $buildIds = @($buildResult.BuildIds) # Check Build Analysis for known issues $knownIssuesFromBuildAnalysis = @(Get-BuildAnalysisKnownIssues -PR $PRNumber) @@ -1757,6 +1776,10 @@ try { $totalFailedJobs = 0 $totalLocalFailures = 0 $allFailuresForCorrelation = @() + $allFailedJobNames = @() + $allCanceledJobNames = @() + $allFailedJobDetails = @() + $lastBuildJobSummary = $null foreach ($currentBuildId in $buildIds) { Write-Host "`n=== Azure DevOps Build $currentBuildId ===" -ForegroundColor Yellow @@ -1800,6 +1823,36 @@ try { # Also check for local test failures (non-Helix) $localTestFailures = Get-LocalTestFailures -Timeline $timeline -BuildId $currentBuildId + # Accumulate totals and compute job summary BEFORE any continue branches + $totalFailedJobs += $failedJobs.Count + $totalLocalFailures += $localTestFailures.Count + $allFailedJobNames += @($failedJobs | ForEach-Object { $_.name }) + $allCanceledJobNames += @($canceledJobs | ForEach-Object { $_.name }) + + $allJobs = @() + $succeededJobs = 0 + $pendingJobs = 0 + $canceledJobCount = 0 + $skippedJobs = 0 + $warningJobs = 0 + if ($timeline -and $timeline.records) { + $allJobs = @($timeline.records | Where-Object { $_.type -eq "Job" }) + $succeededJobs = @($allJobs | Where-Object { $_.result -eq "succeeded" }).Count + $warningJobs = @($allJobs | Where-Object { $_.result -eq "succeededWithIssues" }).Count + $pendingJobs = @($allJobs | Where-Object { -not $_.result -or $_.state -eq "pending" -or $_.state -eq "inProgress" }).Count + $canceledJobCount = @($allJobs | Where-Object { $_.result -eq "canceled" }).Count + $skippedJobs = @($allJobs | Where-Object { $_.result -eq "skipped" }).Count + } + $lastBuildJobSummary = [ordered]@{ + total = $allJobs.Count + succeeded = $succeededJobs + failed = if ($failedJobs) { $failedJobs.Count } else { 0 } + canceled = $canceledJobCount + pending = $pendingJobs + warnings = $warningJobs + skipped = $skippedJobs + } + if ((-not $failedJobs -or $failedJobs.Count -eq 0) -and 
$localTestFailures.Count -eq 0) { if ($buildStatus -and $buildStatus.Status -eq "inProgress") { Write-Host "`nNo failures yet - build still in progress" -ForegroundColor Cyan @@ -1885,7 +1938,6 @@ try { Write-Host "`n=== Summary ===" -ForegroundColor Yellow Write-Host "Local test failures: $($localTestFailures.Count)" -ForegroundColor Red Write-Host "Build URL: https://dev.azure.com/$Organization/$Project/_build/results?buildId=$currentBuildId" -ForegroundColor Cyan - $totalLocalFailures += $localTestFailures.Count continue } @@ -1914,6 +1966,15 @@ try { Write-Host "`n--- $($job.name) ---" -ForegroundColor Cyan Write-Host " Build: https://dev.azure.com/$Organization/$Project/_build/results?buildId=$currentBuildId&view=logs&j=$($job.id)" -ForegroundColor Gray + # Track per-job failure details for JSON summary + $jobDetail = [ordered]@{ + jobName = $job.name + buildId = $currentBuildId + errorSnippet = "" + helixWorkItems = @() + errorCategory = "unclassified" + } + # Get Helix tasks for this job $helixTasks = Get-HelixJobInfo -Timeline $timeline -JobId $job.id @@ -1941,6 +2002,8 @@ try { HelixLogs = @() FailedTests = $failures | ForEach-Object { $_.TestName } } + $jobDetail.errorCategory = "test-failure" + $jobDetail.errorSnippet = ($failures | Select-Object -First 3 | ForEach-Object { $_.TestName }) -join "; " } # Extract and optionally fetch Helix URLs @@ -1956,6 +2019,7 @@ try { $workItemName = "" if ($url -match '/workitems/([^/]+)/console') { $workItemName = $Matches[1] + $jobDetail.helixWorkItems += $workItemName } $helixLog = Get-HelixConsoleLog -Url $url @@ -1964,9 +2028,41 @@ try { if ($failureInfo) { Write-Host $failureInfo -ForegroundColor White + # Categorize failure from log content + if ($failureInfo -match 'Timed Out \(timeout') { + $jobDetail.errorCategory = "test-timeout" + } elseif ($failureInfo -match 'Exit Code:\s*(139|134|-4)' -or $failureInfo -match 'createdump') { + # Crash takes highest precedence — don't downgrade + if ($jobDetail.errorCategory -notin @("crash")) { + $jobDetail.errorCategory = "crash" + } + } elseif ($failureInfo -match 'Traceback \(most recent call last\)' -and $helixLog -match 'Tests run:.*Failures:\s*0') { + # Work item failed (non-zero exit from reporter crash) but all tests passed. + # The Python traceback is from Helix infrastructure, not from the test itself. 
+ if ($jobDetail.errorCategory -notin @("crash", "test-timeout")) { + $jobDetail.errorCategory = "tests-passed-reporter-failed" + } + } elseif ($jobDetail.errorCategory -eq "unclassified") { + $jobDetail.errorCategory = "test-failure" + } + if (-not $jobDetail.errorSnippet) { + $jobDetail.errorSnippet = $failureInfo.Substring(0, [Math]::Min(200, $failureInfo.Length)) + } + # Search for known issues Show-KnownIssues -TestName $workItemName -ErrorMessage $failureInfo -IncludeMihuBot:$SearchMihuBot } + else { + # No failure pattern matched — show tail of log + $lines = $helixLog -split "`n" + $lastLines = $lines | Select-Object -Last 20 + $tailText = $lastLines -join "`n" + Write-Host $tailText -ForegroundColor White + if (-not $jobDetail.errorSnippet) { + $jobDetail.errorSnippet = $tailText.Substring(0, [Math]::Min(200, $tailText.Length)) + } + Show-KnownIssues -TestName $workItemName -ErrorMessage $tailText -IncludeMihuBot:$SearchMihuBot + } } } } @@ -2007,6 +2103,11 @@ try { HelixLogs = @() FailedTests = @() } + $jobDetail.errorCategory = "build-error" + if (-not $jobDetail.errorSnippet) { + $snippet = ($buildErrors | Select-Object -First 2) -join "; " + $jobDetail.errorSnippet = $snippet.Substring(0, [Math]::Min(200, $snippet.Length)) + } # Extract Helix log URLs from the full log content $helixLogUrls = Extract-HelixLogUrls -LogContent $logContent @@ -2042,6 +2143,7 @@ try { } } + $allFailedJobDetails += $jobDetail $processedJobs++ } catch { @@ -2055,25 +2157,6 @@ try { } } - $totalFailedJobs += $failedJobs.Count - $totalLocalFailures += $localTestFailures.Count - - # Compute job summary from timeline - $allJobs = @() - $succeededJobs = 0 - $pendingJobs = 0 - $canceledJobCount = 0 - $skippedJobs = 0 - $warningJobs = 0 - if ($timeline -and $timeline.records) { - $allJobs = @($timeline.records | Where-Object { $_.type -eq "Job" }) - $succeededJobs = @($allJobs | Where-Object { $_.result -eq "succeeded" }).Count - $warningJobs = @($allJobs | Where-Object { $_.result -eq "succeededWithIssues" }).Count - $pendingJobs = @($allJobs | Where-Object { -not $_.result -or $_.state -eq "pending" -or $_.state -eq "inProgress" }).Count - $canceledJobCount = @($allJobs | Where-Object { $_.result -eq "canceled" }).Count - $skippedJobs = @($allJobs | Where-Object { $_.result -eq "skipped" }).Count - } - Write-Host "`n=== Build $currentBuildId Summary ===" -ForegroundColor Yellow if ($allJobs.Count -gt 0) { $parts = @() @@ -2121,54 +2204,67 @@ if ($buildIds.Count -gt 1) { } } -# Smart retry recommendation -Write-Host "`n=== Recommendation ===" -ForegroundColor Magenta - -if ($knownIssuesFromBuildAnalysis.Count -gt 0) { - $knownIssueCount = $knownIssuesFromBuildAnalysis.Count - Write-Host "KNOWN ISSUES DETECTED" -ForegroundColor Yellow - Write-Host "$knownIssueCount tracked issue(s) found that may correlate with failures above." -ForegroundColor White - Write-Host "Review the failure details and linked issues to determine if retry is needed." 
-ForegroundColor Gray +# Build structured summary and emit as JSON +$summary = [ordered]@{ + mode = $PSCmdlet.ParameterSetName + repository = $Repository + prNumber = if ($PSCmdlet.ParameterSetName -eq 'PRNumber') { $PRNumber } else { $null } + builds = @($buildIds | ForEach-Object { + [ordered]@{ + buildId = $_ + url = "https://dev.azure.com/$Organization/$Project/_build/results?buildId=$_" + } + }) + totalFailedJobs = $totalFailedJobs + totalLocalFailures = $totalLocalFailures + lastBuildJobSummary = if ($lastBuildJobSummary) { $lastBuildJobSummary } else { [ordered]@{ + total = 0; succeeded = 0; failed = 0; canceled = 0; pending = 0; warnings = 0; skipped = 0 + } } + failedJobNames = @($allFailedJobNames) + failedJobDetails = @($allFailedJobDetails) + failedJobDetailsTruncated = ($allFailedJobNames.Count -gt $allFailedJobDetails.Count) + canceledJobNames = @($allCanceledJobNames) + knownIssues = @($knownIssuesFromBuildAnalysis | ForEach-Object { + [ordered]@{ number = $_.Number; title = $_.Title; url = $_.Url } + }) + prCorrelation = [ordered]@{ + changedFileCount = $prChangedFiles.Count + hasCorrelation = $false + correlatedFiles = @() + } + recommendationHint = "" } -elseif ($totalFailedJobs -eq 0 -and $totalLocalFailures -eq 0) { - Write-Host "BUILD SUCCESSFUL" -ForegroundColor Green - Write-Host "No failures detected." -ForegroundColor White -} -elseif ($prChangedFiles.Count -gt 0 -and $allFailuresForCorrelation.Count -gt 0) { - # Check if failures correlate with PR changes - $hasCorrelation = $false - foreach ($failure in $allFailuresForCorrelation) { - $failureText = ($failure.Errors + $failure.HelixLogs + $failure.FailedTests) -join " " - foreach ($file in $prChangedFiles) { - $fileName = [System.IO.Path]::GetFileNameWithoutExtension($file) - if ($failureText -match [regex]::Escape($fileName)) { - $hasCorrelation = $true - break - } - } - if ($hasCorrelation) { break } - } - - if ($hasCorrelation) { - Write-Host "LIKELY PR-RELATED" -ForegroundColor Red - Write-Host "Failures appear to correlate with files changed in this PR." -ForegroundColor White - Write-Host "Review the 'PR Change Correlation' section above and fix the issues before retrying." -ForegroundColor Gray - } - else { - Write-Host "POSSIBLY TRANSIENT" -ForegroundColor Yellow - Write-Host "No known issues matched, but failures don't clearly correlate with PR changes." -ForegroundColor White - Write-Host "Consider:" -ForegroundColor Gray - Write-Host " 1. Check if same tests are failing on main branch" -ForegroundColor Gray - Write-Host " 2. Search for existing issues: gh issue list --label 'Known Build Error' --search ''" -ForegroundColor Gray - Write-Host " 3. If infrastructure-related (device not found, network errors), retry may help" -ForegroundColor Gray - } + +# Compute PR correlation using shared helper +if ($prChangedFiles.Count -gt 0 -and $allFailuresForCorrelation.Count -gt 0) { + $correlation = Get-PRCorrelation -ChangedFiles $prChangedFiles -AllFailures $allFailuresForCorrelation + $allCorrelated = @($correlation.CorrelatedFiles) + @($correlation.TestFiles) | Select-Object -Unique + $summary.prCorrelation.hasCorrelation = $allCorrelated.Count -gt 0 + $summary.prCorrelation.correlatedFiles = @($allCorrelated) } -else { - Write-Host "REVIEW REQUIRED" -ForegroundColor Yellow - Write-Host "Could not automatically determine failure cause." -ForegroundColor White - Write-Host "Review the failures above to determine if they are PR-related or infrastructure issues." 
-ForegroundColor Gray + +# Compute recommendation hint +# Priority: KNOWN_ISSUES wins over LIKELY_PR_RELATED intentionally. +# When both exist, SKILL.md "Mixed signals" guidance tells the agent to separate them. +if (-not $lastBuildJobSummary -and $buildIds.Count -gt 0) { + $summary.recommendationHint = "REVIEW_REQUIRED" +} elseif ($knownIssuesFromBuildAnalysis.Count -gt 0) { + $summary.recommendationHint = "KNOWN_ISSUES_DETECTED" +} elseif ($totalFailedJobs -eq 0 -and $totalLocalFailures -eq 0) { + $summary.recommendationHint = "BUILD_SUCCESSFUL" +} elseif ($summary.prCorrelation.hasCorrelation) { + $summary.recommendationHint = "LIKELY_PR_RELATED" +} elseif ($prChangedFiles.Count -gt 0 -and $allFailuresForCorrelation.Count -gt 0) { + $summary.recommendationHint = "POSSIBLY_TRANSIENT" +} else { + $summary.recommendationHint = "REVIEW_REQUIRED" } +Write-Host "" +Write-Host "[CI_ANALYSIS_SUMMARY]" +Write-Host ($summary | ConvertTo-Json -Depth 5) +Write-Host "[/CI_ANALYSIS_SUMMARY]" + } catch { Write-Error "Error: $_"