Merged
45 commits
bc8b786
ci-analysis: replace canned recommendations with JSON summary + agent…
lewing Feb 10, 2026
a393494
ci-analysis: add Step 0 context gathering, structured output, verify-…
lewing Feb 10, 2026
c3cbad9
Address review: add missing reference files, fix BuildId mode wording…
lewing Feb 10, 2026
9a3194c
Fix empty array falsy check in Get-HelixWorkItemDetails
lewing Feb 10, 2026
5cf7c95
Address review: fix target-branch refs in reference docs, remove hard…
lewing Feb 10, 2026
bce86be
ci-analysis: move deep-dive content to references, reduce SKILL.md to…
lewing Feb 10, 2026
e931619
ci-analysis: add prior-build mismatch detection guidance
lewing Feb 10, 2026
3a44083
Fix POSSIBLY_TRANSIENT hint: require correlation data before claiming…
lewing Feb 10, 2026
28f02cc
Address review: guard against timeline fetch failure, fix target-bran…
lewing Feb 10, 2026
13ef005
ci-analysis: emit JSON summary for no-builds and merge-conflict PRs
lewing Feb 10, 2026
7c66ba0
Fix remaining main branch reference in delegation-patterns.md
lewing Feb 10, 2026
60f9b6e
Add build progression analysis reference and fix step numbering
lewing Feb 11, 2026
eea00eb
Address review: consistent return type, fix main→target branch refs, …
lewing Feb 11, 2026
1340331
Extract Get-PRCorrelation helper to eliminate divergent duplication
lewing Feb 11, 2026
bdcfe22
Remove duplicate Get-PRCorrelation function (old dead code)
lewing Feb 11, 2026
fe251bf
Rewrite delegation patterns: JSON output, parallel artifact extractio…
lewing Feb 11, 2026
5a852b1
Trim SKILL.md from ~4.6K to ~4K tokens: condense anti-patterns, merge…
lewing Feb 11, 2026
532a1b7
Address review: standardize binlog tool names, document hint priority
lewing Feb 11, 2026
2a9ffb9
build-progression: document triggerInfo.pr.sourceSha for commit mapping
lewing Feb 11, 2026
18d9331
Add per-failure details, Python error patterns, and log tail fallback
lewing Feb 11, 2026
f81cab5
Remove overly broad \w+Error: and Traceback failure-start patterns
lewing Feb 11, 2026
d26014f
build-progression: warn that target branch moves between builds
lewing Feb 11, 2026
dab10c6
ci-analysis: polish SKILL.md, add target HEAD tracking to build progr…
lewing Feb 11, 2026
4130053
ci-analysis: MCP-first patterns from live analysis
lewing Feb 11, 2026
e32a08e
ci-analysis: fix review findings from multi-model audit
lewing Feb 11, 2026
417a138
ci-analysis: trim mergeable_state output to fix whitespace comparison
lewing Feb 11, 2026
658d1bb
ci-analysis: consistent JSON schema and canonical MCP tool names
lewing Feb 11, 2026
b89a53e
Address Copilot review: errorCategory precedence, gh exit code handli…
lewing Feb 11, 2026
8851a6d
ci-analysis: enforce Build Analysis check status in recommendations
lewing Feb 11, 2026
6923d0e
ci-analysis: add crash/canceled job recovery procedure
lewing Feb 11, 2026
c405efc
ci-analysis: document MCP tool limitation for subagents
lewing Feb 12, 2026
5e2d427
Address review: truncation metadata, tool name consistency, hlx_batch…
lewing Feb 12, 2026
087ec04
Fix hlx tool name consistency in delegation-patterns
lewing Feb 12, 2026
bff659c
Add SQL-based progression tracking to build-progression-analysis
lewing Feb 12, 2026
e2fe9c4
Add SQL failure tracking across builds for progression analysis
lewing Feb 12, 2026
72efae4
Add SQL tracking reference for failure-to-known-issue mapping
lewing Feb 12, 2026
7ba79d2
Merge branch 'main' into skill/ci-analysis-json-summary
lewing Feb 12, 2026
20cf6c7
Fix section reference: 'Recovering Results from Crashed/Canceled Jobs'
lewing Feb 12, 2026
2566a31
Add PR comment tracking pattern for deep analysis and PR chains
lewing Feb 12, 2026
f9d9263
Fix plain-text cross-references to use markdown links
lewing Feb 12, 2026
a3e2bdb
Add buildId to failedJobDetails, include exit code -4 in crash regex,…
lewing Feb 12, 2026
9533720
Add downloaded artifact layout guide and SQL tracking for artifact ma…
lewing Feb 12, 2026
ce7083c
Clarify build discovery scope, SQL table purposes, and use concrete t…
lewing Feb 12, 2026
788a4f8
Soften binlog source guidance: AzDO and Helix boundaries aren't absolute
lewing Feb 12, 2026
c6a96ab
Clarify hlx_download vs hlx_download_url usage
lewing Feb 12, 2026
252 changes: 115 additions & 137 deletions .github/skills/ci-analysis/SKILL.md

Large diffs are not rendered by default.

93 changes: 93 additions & 0 deletions .github/skills/ci-analysis/references/azure-cli.md
@@ -0,0 +1,93 @@
# Deep Investigation with Azure CLI

When the CI script and GitHub APIs aren't enough (e.g., investigating internal pipeline definitions or downloading build artifacts), use the Azure CLI with the `azure-devops` extension.

> 💡 **Prefer `az pipelines` / `az devops` commands over raw REST API calls.** The CLI handles authentication, pagination, and JSON output formatting. Only fall back to manual `Invoke-RestMethod` calls when the CLI doesn't expose the endpoint you need (e.g., build timelines). The CLI's `--query` (JMESPath) and `-o table` flags are powerful for filtering without extra scripting.

## Checking Authentication

Before making AzDO API calls, verify the CLI is installed and authenticated:

```powershell
# Ensure az is on PATH (Windows may need a refresh after install)
$env:Path = [System.Environment]::GetEnvironmentVariable("Path", "Machine") + ";" + [System.Environment]::GetEnvironmentVariable("Path", "User")

# Check if az CLI is available
az --version 2>$null | Select-Object -First 1

# Check if logged in and get current account
az account show --query "{name:name, user:user.name}" -o table 2>$null

# If not logged in, prompt the user to authenticate:
# az login # Interactive browser login
# az login --use-device-code # Device code flow (for remote/headless)

# Get an AAD access token for AzDO REST API calls (only needed for raw REST)
$accessToken = (az account get-access-token --resource 499b84ac-1321-427f-aa17-267ca6975798 --query accessToken -o tsv)
$headers = @{ "Authorization" = "Bearer $accessToken" }
```

> ⚠️ If `az` is not installed, use `winget install -e --id Microsoft.AzureCLI` (Windows). The `azure-devops` extension is also required — install or verify it with `az extension add --name azure-devops` (safe to run if already installed). Ask the user to authenticate if needed.

> ⚠️ **Do NOT use `az devops configure --defaults`** — it sets user-wide defaults that may not match the organization/project needed for dotnet repositories. Always pass `--org` and `--project` (or `-p`) explicitly on each command.

## Querying Pipeline Definitions and Builds

```powershell
$org = "https://dev.azure.com/dnceng"
$project = "internal"

# Find a pipeline definition by name
az pipelines list --name "dotnet-unified-build" --org $org -p $project --query "[].{id:id, name:name, path:path}" -o table

# Get pipeline definition details (shows YAML path, triggers, etc.)
az pipelines show --id 1330 --org $org -p $project --query "{id:id, name:name, yamlPath:process.yamlFilename, repo:repository.name}" -o table

# List recent builds for a pipeline (replace {TARGET_BRANCH} with the PR's base branch, e.g., main or release/9.0)
az pipelines runs list --pipeline-ids 1330 --branch "refs/heads/{TARGET_BRANCH}" --top 5 --org $org -p $project --query "[].{id:id, result:result, finish:finishTime}" -o table

# Get a specific build's details
az pipelines runs show --id $buildId --org $org -p $project --query "{id:id, result:result, sourceBranch:sourceBranch}" -o table

# List build artifacts
az pipelines runs artifact list --run-id $buildId --org $org -p $project --query "[].{name:name, type:resource.type}" -o table

# Download a build artifact
az pipelines runs artifact download --run-id $buildId --artifact-name "TestBuild_linux_x64" --path "$env:TEMP\artifact" --org $org -p $project
```

## REST API Fallback

Fall back to REST API only when the CLI doesn't expose what you need:

```powershell
# Get build timeline (stages, jobs, tasks with results and durations) — no CLI equivalent
$accessToken = (az account get-access-token --resource 499b84ac-1321-427f-aa17-267ca6975798 --query accessToken -o tsv)
$headers = @{ "Authorization" = "Bearer $accessToken" }
$timelineUrl = "https://dev.azure.com/dnceng/internal/_apis/build/builds/$buildId/timeline?api-version=7.1"
$timeline = (Invoke-RestMethod -Uri $timelineUrl -Headers $headers)
$timeline.records | Where-Object { $_.result -eq "failed" -and $_.type -eq "Job" }
```
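
If the timeline JSON is saved to disk or handed to a subagent, the same filter can be applied outside PowerShell. The following is a minimal Python sketch, not part of the skill's scripts: `failed_jobs` is a hypothetical helper, and `sample_timeline` is hand-written data that only mimics the `records`, `type`, `result`, and `name` fields used above.

```python
def failed_jobs(timeline):
    """Return the names of timeline records that are jobs with result 'failed'."""
    names = []
    for record in timeline.get("records", []):
        # Compare case-insensitively to mirror PowerShell's -eq on strings.
        is_job = str(record.get("type", "")).lower() == "job"
        is_failed = str(record.get("result", "")).lower() == "failed"
        if is_job and is_failed:
            names.append(record.get("name", "<unnamed>"))
    return names


# Hand-written sample in the shape of a timeline response; all values are made up.
sample_timeline = {
    "records": [
        {"type": "Stage", "name": "Build", "result": "failed"},
        {"type": "Job", "name": "linux_x64", "result": "failed"},
        {"type": "Job", "name": "windows_x64", "result": "succeeded"},
        {"type": "Task", "name": "Run tests", "result": "failed"},
    ]
}

print(failed_jobs(sample_timeline))
```

Only the record that is both a job and failed is kept; failed stages and tasks are ignored, matching the `Where-Object` filter.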

## Examining Pipeline YAML

Dotnet repos that use arcade typically keep their pipeline definitions under `eng/pipelines/`. Use `az pipelines show` to find the YAML file path, then fetch it:

```powershell
# Find the YAML path for a pipeline
az pipelines show --id 1330 --org $org -p $project --query "{yamlPath:process.yamlFilename, repo:repository.name}" -o table

# Fetch the YAML from the repo (example: dotnet/runtime's runtime-official pipeline)
# github-mcp-server-get_file_contents owner:dotnet repo:runtime path:eng/pipelines/runtime-official.yml

# For VMR unified builds, the YAML is in dotnet/dotnet:
# github-mcp-server-get_file_contents owner:dotnet repo:dotnet path:eng/pipelines/unified-build.yml

# Templates are usually in eng/pipelines/common/ or eng/pipelines/templates/
```

This is especially useful when:
- A job name doesn't clearly indicate what it builds
- You need to understand stage dependencies (why a job was canceled)
- You want to find which template defines a specific step
- Investigating whether a pipeline change caused new failures
144 changes: 144 additions & 0 deletions .github/skills/ci-analysis/references/binlog-comparison.md
@@ -0,0 +1,144 @@
# Deep Investigation: Binlog Comparison

When a test **passes on the target branch but fails on a PR**, comparing MSBuild binlogs from both runs reveals the exact difference in task parameters without guessing.

## When to Use This Pattern

- Test assertion compares "expected vs actual" build outputs (e.g., CSC args, reference lists)
- A build succeeds on one branch but fails on another with different MSBuild behavior
- You need to find which MSBuild property/item change caused a specific task to behave differently

## The Pattern: Delegate to Subagents

> ⚠️ **Do NOT download, load, and parse binlogs in the main conversation context.** This burns 10+ turns on mechanical work. Delegate to subagents instead.

### Step 1: Identify the two work items to compare

Use `Get-CIStatus.ps1` to find the failing Helix job + work item, then find a corresponding passing build (recent PR merged to the target branch, or a CI run on that branch).

**Finding Helix job IDs from build artifacts (binlogs to find binlogs):**
When the failing work item's Helix job ID isn't visible (e.g., canceled jobs, or finding a matching job from a passing build), the IDs are inside the build's `SendToHelix.binlog`:

1. Download the build artifact with `az`:
   ```
   az pipelines runs artifact list --run-id $buildId --org "https://dev.azure.com/dnceng-public" -p public --query "[].name" -o tsv
   az pipelines runs artifact download --run-id $buildId --artifact-name "TestBuild_linux_x64" --path "$env:TEMP\artifact" --org "https://dev.azure.com/dnceng-public" -p public
   ```
2. Load the binlog and search for job IDs:
   ```
   mcp-binlog-tool-load_binlog path:"$env:TEMP\artifact\...\SendToHelix.binlog"
   mcp-binlog-tool-search_binlog binlog_file:"..." query:"Sent Helix Job"
   ```
3. Query each Helix job GUID with the CI script:
   ```
   ./scripts/Get-CIStatus.ps1 -HelixJob "{GUID}" -FindBinlogs
   ```

**For Helix work item binlogs (the common case):**
The CI script shows binlog URLs directly when you query a specific work item:
```
./scripts/Get-CIStatus.ps1 -HelixJob "{JOB_ID}" -WorkItem "{WORK_ITEM}"
# Output includes: 🔬 msbuild.binlog: https://helix...blob.core.windows.net/...
```

### Step 2: Dispatch parallel subagents for extraction

Launch two `task` subagents (can run in parallel), each with a prompt like:

```
Download the msbuild.binlog from Helix job {JOB_ID} work item {WORK_ITEM}.
Use the CI skill script to get the artifact URL:
./scripts/Get-CIStatus.ps1 -HelixJob "{JOB_ID}" -WorkItem "{WORK_ITEM}"
Download the binlog URL to $env:TEMP\{label}.binlog.
Load it with the binlog MCP server (mcp-binlog-tool-load_binlog).
Search for the {TASK_NAME} task (mcp-binlog-tool-search_tasks_by_name).
Get full task details (mcp-binlog-tool-list_tasks_in_target) for the target containing the task.
Extract the CommandLineArguments parameter value.
Normalize paths:
- Replace Helix work dirs (/datadisks/disk1/work/XXXXXXXX) with {W}
- Replace runfile hashes (Program-[a-f0-9]+) with Program-{H}
- Replace temp dir names (dotnetSdkTests.[a-zA-Z0-9]+) with dotnetSdkTests.{T}
Parse into individual args using regex: (?:[^\s"]+|"[^"]*")+
Sort the list and return it.
Report the total arg count prominently.
```

**Important:** When diffing, look for **extra or missing args** (different count), not value differences in existing args. A Debug/Release difference in `/define:` is expected noise — an extra `/analyzerconfig:` or `/reference:` arg is the real signal.
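
For illustration, here is how a subagent might tokenize and count the arguments, sketched in Python. The pattern is an assumption rather than the skill's own code: it treats an argument as a run of unquoted characters and double-quoted segments, so a switch with an embedded quoted path stays in one piece. `split_args` and the sample command line are hypothetical.

```python
import re

# Assumed token shape: one or more runs of non-space, non-quote characters
# and/or double-quoted segments. This keeps /reference:"C:\a b\x.dll" whole.
ARG_PATTERN = re.compile(r'(?:[^\s"]+|"[^"]*")+')


def split_args(command_line):
    """Split a compiler command line into individual arguments."""
    return ARG_PATTERN.findall(command_line)


# Made-up command line; real CommandLineArguments values are far longer.
sample = r'/noconfig /define:TRACE;DEBUG /reference:"C:\sdk dir\System.Runtime.dll" "My Program.cs"'

args = sorted(split_args(sample))
print(f"Total args: {len(args)}")
for arg in args:
    print(arg)
```

Reporting the count first makes a one-argument discrepancy between the two runs visible before anyone reads the full list.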

### Step 3: Diff the results

With two normalized arg lists, `Compare-Object` instantly reveals the difference.
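
When the lists come back from subagents as plain data rather than PowerShell objects, the same comparison is a few lines in any language. The Python sketch below is illustrative only: `diff_args` is a hypothetical helper, the sample lists are made up, and the `PASS-ONLY` label is an assumption chosen to mirror the `FAIL-ONLY` label used in the example later in this document.

```python
from collections import Counter


def diff_args(pass_args, fail_args):
    """Return args present in only one list, counting repeated args separately."""
    pass_counts, fail_counts = Counter(pass_args), Counter(fail_args)
    return {
        "PASS-ONLY": sorted((pass_counts - fail_counts).elements()),
        "FAIL-ONLY": sorted((fail_counts - pass_counts).elements()),
    }


# Made-up, already-normalized argument lists.
pass_args = ["/noconfig", "/reference:{W}/a.dll"]
fail_args = ["/noconfig", "/reference:{W}/a.dll", "/analyzerconfig:{W}/x.globalconfig"]

result = diff_args(pass_args, fail_args)
print(len(pass_args), len(fail_args))
print(result)
```

Using `Counter` rather than `set` means an argument that appears twice in one run and once in the other still shows up in the diff.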

## Useful Binlog MCP Queries

After loading a binlog with `mcp-binlog-tool-load_binlog`, use these queries (pass the loaded path as `binlog_file`):

```
# Find all invocations of a specific task
mcp-binlog-tool-search_tasks_by_name binlog_file:"$env:TEMP\my.binlog" taskName:"Csc"

# Search for a property value
mcp-binlog-tool-search_binlog binlog_file:"..." query:"analysislevel"

# Find what happened inside a specific target
mcp-binlog-tool-search_binlog binlog_file:"..." query:"under($target AddGlobalAnalyzerConfigForPackage_MicrosoftCodeAnalysisNetAnalyzers)"

# Get all properties matching a pattern
mcp-binlog-tool-search_binlog binlog_file:"..." query:"GlobalAnalyzerConfig"

# List tasks in a target (returns full parameter details including CommandLineArguments)
mcp-binlog-tool-list_tasks_in_target binlog_file:"..." projectId:22 targetId:167
```

## Path Normalization

Helix work items run on different machines with different paths. Normalize before comparing:

| Pattern (regex) | Replacement | What it matches |
|-----------------|-------------|-----------------|
| `/datadisks/disk1/work/[A-F0-9]{8}` | `{W}` | Helix work directory (Linux) |
| `C:\\h\\w\\[A-F0-9]{8}` | `{W}` | Helix work directory (Windows) |
| `Program-[a-f0-9]{64}` | `Program-{H}` | Runfile content hash |
| `dotnetSdkTests\.[a-zA-Z0-9]+` | `dotnetSdkTests.{T}` | Temp test directory |
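
The table can be applied as an ordered list of regex substitutions. Below is a minimal Python sketch of that idea; `RULES` and `normalize` are hypothetical names, the sample paths are made up, and the backslashes in the Windows work-directory pattern are escaped so they match literally.

```python
import re

# (pattern, replacement) pairs taken from the normalization table.
RULES = [
    (re.compile(r"/datadisks/disk1/work/[A-F0-9]{8}"), "{W}"),
    (re.compile(r"C:\\h\\w\\[A-F0-9]{8}"), "{W}"),
    (re.compile(r"Program-[a-f0-9]{64}"), "Program-{H}"),
    (re.compile(r"dotnetSdkTests\.[a-zA-Z0-9]+"), "dotnetSdkTests.{T}"),
]


def normalize(text):
    """Replace machine-specific path fragments with stable placeholders."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


# Made-up sample arguments from two different machines.
linux_arg = "/reference:/datadisks/disk1/work/A1B2C3D4/p/d/sdk/ref.dll"
windows_arg = r"/reference:C:\h\w\0A1B2C3D\p\d\sdk\ref.dll"

print(normalize(linux_arg))
print(normalize(windows_arg))
```

Normalizing each argument before sorting keeps the two lists aligned, so the diff shows only genuine differences.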

### After normalizing paths, focus on structural differences

> ⚠️ **Ignore value-only differences in existing args** (e.g., Debug vs Release in `/define:`, different hash paths). These are expected configuration differences. Focus on **extra or missing args** — a different arg count indicates a real build behavior change.
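
One way to separate structural differences from value-only noise is to compare how often each switch appears instead of comparing full argument strings. The Python sketch below is a rough illustration under that assumption; `switch_name`, `structural_diff`, and the sample lists are hypothetical.

```python
from collections import Counter


def switch_name(arg):
    """Return the switch part of an argument, e.g. '/define:TRACE' -> '/define'."""
    if arg.startswith("/") and ":" in arg:
        return arg.split(":", 1)[0]
    return arg  # Non-switch args (for example source files) are kept as-is.


def structural_diff(pass_args, fail_args):
    """Map each switch whose count differs to (count in pass, count in fail)."""
    pass_counts = Counter(switch_name(a) for a in pass_args)
    fail_counts = Counter(switch_name(a) for a in fail_args)
    return {
        name: (pass_counts[name], fail_counts[name])
        for name in sorted(set(pass_counts) | set(fail_counts))
        if pass_counts[name] != fail_counts[name]
    }


# Made-up, already-normalized argument lists.
pass_args = ["/define:TRACE;RELEASE", "/analyzerconfig:{W}/a.globalconfig", "/reference:{W}/x.dll"]
fail_args = ["/define:TRACE;DEBUG", "/analyzerconfig:{W}/a.globalconfig",
             "/analyzerconfig:{W}/b.globalconfig", "/reference:{W}/x.dll"]

summary = structural_diff(pass_args, fail_args)
print(summary)
```

The differing `/define:` values drop out because both runs have exactly one `/define` switch, while the extra `/analyzerconfig:` argument is reported.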

## Example: CscArguments Investigation

A merge PR (release/10.0.3xx → main) had 208 CSC args vs 207 on main. The diff:

```
FAIL-ONLY: /analyzerconfig:{W}/p/d/sdk/11.0.100-ci/Sdks/Microsoft.NET.Sdk/analyzers/build/config/analysislevel_11_default.globalconfig
```

### What the binlog properties showed

Both builds had identical property resolution:
- `EffectiveAnalysisLevel = 11.0`
- `_GlobalAnalyzerConfigFileName = analysislevel_11_default.globalconfig`
- `_GlobalAnalyzerConfigFile = .../config/analysislevel_11_default.globalconfig`

### The actual root cause

The `AddGlobalAnalyzerConfigForPackage` target has an `Exists()` condition:
```xml
<ItemGroup Condition="Exists('$(_GlobalAnalyzerConfigFile_...)')">
  <EditorConfigFiles Include="$(_GlobalAnalyzerConfigFile_...)" />
</ItemGroup>
```

The merge's SDK layout **shipped** `analysislevel_11_default.globalconfig` on disk (from a newer roslyn-analyzers that flowed from 10.0.3xx), while main's SDK didn't have that file yet. Same property values, different files on disk = different build behavior.

### Lesson learned

Same MSBuild property resolution + different files on disk = different build behavior. Always check what's actually in the SDK layout, not just what the targets compute.
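
If both SDK layouts are available locally (for example, extracted from downloaded artifacts), a file-level comparison can confirm this kind of difference directly. The following Python sketch assumes two already-extracted directory trees; `relative_files`, `layout_diff`, and the temporary demo layout are hypothetical and only illustrate the check.

```python
import tempfile
from pathlib import Path


def relative_files(root):
    """Return the set of file paths under root, relative to root."""
    root = Path(root)
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}


def layout_diff(pass_root, fail_root):
    """List files that exist in only one of the two layouts."""
    pass_files, fail_files = relative_files(pass_root), relative_files(fail_root)
    return {
        "PASS-ONLY": sorted(pass_files - fail_files),
        "FAIL-ONLY": sorted(fail_files - pass_files),
    }


# Demo with a made-up layout: the "fail" tree ships one extra config file.
with tempfile.TemporaryDirectory() as tmp:
    layouts = {"pass": ["a.globalconfig"], "fail": ["a.globalconfig", "b.globalconfig"]}
    for label, names in layouts.items():
        for name in names:
            target = Path(tmp, label, "config", name)
            target.parent.mkdir(parents=True, exist_ok=True)
            target.touch()
    layout_result = layout_diff(Path(tmp, "pass"), Path(tmp, "fail"))

print(layout_result)
```

A file that appears only on the failing side is a candidate for an `Exists()` condition that evaluates differently between the two builds.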

## Anti-Patterns

> ❌ **Don't manually split/parse CSC command lines in the main conversation.** CSC args have quoted paths, spaces, and complex structure. Regex parsing in PowerShell is fragile and burns turns on trial-and-error. Use a subagent.

> ❌ **Don't assume the MSBuild property diff explains the behavior diff.** Two branches can compute identical property values but produce different outputs because of different files on disk, different NuGet packages, or different task assemblies. Compare the actual task invocation.

> ❌ **Don't load large binlogs and browse them interactively in main context.** Use targeted searches: `mcp-binlog-tool-search_tasks_by_name` for a specific task, `mcp-binlog-tool-search_binlog` with a focused query. Get in, get the data, get out.