Merged
23 changes: 18 additions & 5 deletions .github/scripts/ci/merge_gate_wait.sh
@@ -18,10 +18,14 @@
# Inputs (environment variables):
# GH_TOKEN required. Token with 'checks:read' for the repo.
# REPO required. owner/repo (e.g. microsoft/apm).
- # SHA required. Head SHA of the PR.
+ # SHA required. Head SHA to poll (PR head, merge_group temp
+ # branch head, or workflow_dispatch-resolved PR head).
# EXPECTED_CHECKS required. Comma-separated list of check-run names to
# wait for. Whitespace around commas is trimmed.
# Example: "Build & Test (Linux),Build (Linux)"
+ # EVENT_NAME optional. The triggering event ('pull_request',
+ # 'merge_group', 'workflow_dispatch'). Used only to
+ # emit the right recovery instructions on timeout.
# TIMEOUT_MIN optional. Total wall-clock budget in minutes.
# Default: 30.
# POLL_SEC optional. Poll interval in seconds. Default: 30.
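
The `EXPECTED_CHECKS` contract above (comma-separated, whitespace around commas trimmed) can be sketched roughly as follows. This is an illustration only: the script's actual parsing code falls outside this hunk, and `split_checks` is a hypothetical helper name, not something defined in the PR.

```shell
# Hypothetical helper (not from the PR): print one check name per line,
# trimming whitespace around each name and skipping empty entries.
split_checks() (
  # Subshell body: the IFS and globbing changes stay local to this function.
  IFS=','
  set -f
  for item in $1; do
    # Strip leading, then trailing, whitespace via parameter expansion.
    item="${item#"${item%%[![:space:]]*}"}"
    item="${item%"${item##*[![:space:]]}"}"
    if [ -n "$item" ]; then printf '%s\n' "$item"; fi
  done
)

split_checks "Build & Test (Linux) , Build (Linux)"
```

Only the comma acts as a separator, so names that contain spaces, `&`, or parentheses pass through intact.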
@@ -96,11 +100,13 @@ while [ "$(date +%s)" -lt "$deadline" ]; do
[ "${check_status[i]}" = "pending" ] || continue
pending_count=$((pending_count + 1))

- # Filter by check-run name server-side. Most-recent first.
+ # Filter by check-run name server-side, asking GitHub for only the
+ # latest run per name (avoids client-side sort / pagination races
+ # when a check has been re-run on the same SHA).
Comment on lines +104 to +105
Copilot AI Apr 24, 2026

The new comment says filter=latest "avoids client-side sort", but the code still does sort_by(.started_at) | reverse when selecting .[0]. Consider tweaking the comment to reflect reality (e.g., filter=latest reduces pagination/rerun ambiguity; client-side sort remains as a defensive tie-breaker).

Suggested change
- # latest run per name (avoids client-side sort / pagination races
- # when a check has been re-run on the same SHA).
+ # latest run per name. This reduces pagination and re-run ambiguity
+ # on the same SHA; the client-side sort below remains as a defensive
+ # tie-breaker before selecting .[0].
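
As an editorial illustration of the tie-breaker described in the comment above: if more than one run came back for a name, ordering by `started_at` and taking the first entry picks the newest. The script does this in jq (`sort_by(.started_at) | reverse`); the sketch below swaps in `sort` on ISO 8601 timestamps, which order lexicographically when they share one format and time zone. `latest_run` and the sample rows are hypothetical.

```shell
# Hypothetical helper (not from the PR): read "started_at<TAB>conclusion"
# lines on stdin and print the conclusion of the most recently started run.
latest_run() {
  LC_ALL=C sort -r | head -n 1 | cut -f 2
}

# Made-up sample: an earlier cancelled run and a later successful re-run.
printf '%s\t%s\n' \
  '2026-01-01T10:00:00Z' 'cancelled' \
  '2026-01-01T10:05:00Z' 'success' |
  latest_run
```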

encoded=$(jq -rn --arg n "$c" '$n|@uri')
payload=$(gh api \
-H "Accept: application/vnd.github+json" \
- "repos/${REPO}/commits/${SHA}/check-runs?check_name=${encoded}&per_page=10" \
+ "repos/${REPO}/commits/${SHA}/check-runs?check_name=${encoded}&filter=latest&per_page=10" \
2>/dev/null) || payload='{"check_runs":[]}'

total=$(echo "$payload" | jq '.check_runs | length' 2>/dev/null || echo 0)
@@ -166,8 +172,15 @@ if [ "${#missing[@]}" -gt 0 ]; then
for c in "${missing[@]}"; do echo " - ${c}"; done
echo ""
echo "This usually indicates a transient GitHub Actions webhook delivery failure. Recovery:"
- echo " 1. Push an empty commit to retrigger: git commit --allow-empty -m 'ci: retrigger' && git push"
- echo " 2. If that fails, close and reopen the PR."
+ if [ "${EVENT_NAME:-}" = "merge_group" ]; then
+ echo " Merge-queue context: pushing a commit will NOT retrigger the merge_group event."
+ echo " 1. Remove the PR from the merge queue and re-add it."
+ echo " 2. If it still fails, push an empty commit to the PR branch and re-queue:"
+ echo " git commit --allow-empty -m 'ci: retrigger' && git push"
+ else
+ echo " 1. Push an empty commit to retrigger: git commit --allow-empty -m 'ci: retrigger' && git push"
+ echo " 2. If that fails, close and reopen the PR."
+ fi
echo ""
echo "Merge Gate catches this failure mode so it surfaces as a clear red check instead of a stuck 'Expected -- Waiting'. See .github/workflows/merge-gate.yml."
} >&2
4 changes: 0 additions & 4 deletions .github/workflows/ci.yml
@@ -6,10 +6,6 @@ env:
on:
pull_request:
branches: [ main ]
- paths-ignore:
- - 'docs/**'
- - '.gitignore'
- - 'LICENSE'
# Tier 1 also runs in merge queue context so the same unit + build checks
# execute against the tentative merge commit that the queue creates. See
# microsoft/apm#770 for the design.
107 changes: 69 additions & 38 deletions .github/workflows/merge-gate.yml
@@ -1,49 +1,63 @@
# Merge Gate -- single-authority orchestrator that aggregates ALL required
- # PR-time checks into one verdict. Branch protection requires only this
- # check; this workflow polls the Checks API for all underlying checks.
+ # checks into one verdict for BOTH PR-time and merge-queue contexts.
+ # Branch protection / merge-queue ruleset requires only this check; this
+ # workflow polls the Checks API for all underlying checks.
#
# Why this file exists:
# GitHub's required-status-checks model is name-based, not workflow-based.
- # Without this gate, branch protection had to require each PR-time check
- # individually -- adding or renaming a check meant a ruleset edit. With
- # this gate, branch protection requires only 'gate' (the check-run name
- # of the job below) and the gate aggregates whatever underlying checks
- # we declare in EXPECTED_CHECKS. Tide / bors single-authority pattern.
+ # Without this gate, every required check would need to be listed in the
+ # ruleset (and listed again for the merge-queue ruleset) -- adding or
+ # renaming a check meant editing rulesets. With this gate, the ruleset
+ # requires only 'gate' and the gate aggregates whatever underlying checks
+ # we declare in EXPECTED_CHECKS for each event context.
#
- # Why a single trigger (not dual pull_request + pull_request_target):
+ # Why both pull_request and merge_group triggers:
+ # The merge-queue ruleset also requires 'gate'. Without a merge_group
+ # trigger, the gate check would never fire against the merge-queue temp
+ # branch SHA and PRs would sit in the queue with 'gate' stuck in
+ # "Expected -- Waiting for status to be reported" indefinitely
+ # (observed on PR #899). The merge_group trigger fires the same gate
+ # logic against the temp branch SHA and aggregates the merge-queue-time
+ # checks (ci.yml + ci-integration.yml).
+ #
+ # Why no pull_request_target dual trigger:
# We tried dual-trigger redundancy in PR #865 to harden against rare
# dropped 'pull_request' webhook deliveries (observed once on PR #856).
# It backfired: 'concurrency: cancel-in-progress' produced TWO check-runs
# per SHA -- one SUCCESS and one CANCELLED -- which poisons branch
# protection's status-check rollup ('CANCELLED' counts as failure ->
- # PR BLOCKED). No GitHub Actions primitive cleanly de-duplicates checks
- # across event channels. World-class OSS projects (k8s, rust, deno,
- # next.js) accept this and use a single trigger plus manual recovery.
+ # PR BLOCKED). pull_request and merge_group are different event channels
+ # that target different SHAs, so they don't collide.
#
- # Recovery if a 'pull_request' webhook is dropped:
- # - Push an empty commit: git commit --allow-empty -m 'retrigger' && git push
- # - Or trigger manually: gh workflow run merge-gate.yml -f pr_number=NNN
- # - Or close + reopen the PR.
+ # Recovery if a webhook is dropped:
+ # pull_request context:
+ # - Push an empty commit: git commit --allow-empty -m 'retrigger' && git push
+ # - Or trigger manually: gh workflow run merge-gate.yml -f pr_number=NNN
+ # - Or close + reopen the PR.
+ # merge_group context (NB: pushing a commit will not retrigger the
+ # merge_group event, only pull_request):
+ # - Remove the PR from the merge queue and re-add it.

name: Merge Gate

on:
pull_request:
branches: [ main ]
- paths-ignore:
- - 'docs/**'
- - '.gitignore'
- - 'LICENSE'
+ merge_group:
+ branches: [ main ]
+ types: [ checks_requested ]
workflow_dispatch:
inputs:
pr_number:
description: 'PR number to re-run the gate against'
required: true
type: string

- # Dedup pushes to the same PR: cancel any older in-flight gate run on
- # the same PR head. Now safe -- only one trigger channel, so cancellations
- # only happen on rapid push-after-push, not on cross-event collisions.
+ # Dedup pushes to the same PR / merge-queue entry: cancel any older
+ # in-flight gate run on the same head. In merge_group context, github.ref
+ # is refs/heads/gh-readonly-queue/main/pr-N-<sha>, which is unique per
+ # queue entry -- so cancel-in-progress only cancels rapid push-after-push
+ # on the same temp branch, never across PR <-> merge_group channels.
concurrency:
group: merge-gate-${{ github.event.pull_request.number || inputs.pr_number || github.ref }}
cancel-in-progress: true
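
The group expression relies on `||` yielding its first truthy operand, so each event type falls through to a different key. A rough shell analogue, using `${var:-fallback}` in place of the Actions expression, is sketched below. `group_key` and its three positional arguments are hypothetical stand-ins for `github.event.pull_request.number`, `inputs.pr_number`, and `github.ref`; the PR number and SHA fragment are made up.

```shell
# Hypothetical analogue (not from the PR) of the concurrency group expression.
#   $1 = PR number       (set on pull_request)
#   $2 = pr_number input (set on workflow_dispatch)
#   $3 = ref             (always set)
group_key() {
  printf 'merge-gate-%s\n' "${1:-${2:-$3}}"
}

group_key 123 '' 'refs/pull/123/merge'
group_key '' 123 'refs/heads/main'
group_key '' '' 'refs/heads/gh-readonly-queue/main/pr-123-abc123'
```

The first two calls produce the same key, so a manual re-run and a push-triggered run for the same PR share one group, while the merge-queue entry gets a key of its own.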
@@ -57,26 +71,37 @@ jobs:
gate:
name: gate
runs-on: ubuntu-24.04
- timeout-minutes: 35
+ # Job timeout sized above the poll budget (TIMEOUT_MIN below) to leave
+ # headroom for setup/teardown without false-failing the gate.
+ timeout-minutes: 60
steps:
- - name: Resolve PR head SHA
+ - name: Resolve head SHA
id: sha
env:
GH_TOKEN: ${{ github.token }}
run: |
- if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
- sha=$(gh api "repos/${{ github.repository }}/pulls/${{ inputs.pr_number }}" --jq '.head.sha')
- else
- sha="${{ github.event.pull_request.head.sha }}"
- fi
+ case "${{ github.event_name }}" in
+ workflow_dispatch)
+ sha=$(gh api "repos/${{ github.repository }}/pulls/${{ inputs.pr_number }}" --jq '.head.sha')
+ ;;
+ merge_group)
+ # Temp merge commit on the merge-queue temp branch; check
+ # runs from ci.yml/ci-integration.yml are reported here, NOT
+ # against the PR head SHA.
+ sha="${{ github.event.merge_group.head_sha }}"
+ ;;
+ *)
+ sha="${{ github.event.pull_request.head.sha }}"
+ ;;
+ esac
if [ -z "$sha" ]; then
- echo "::error::Could not resolve PR head SHA"
+ echo "::error::Could not resolve head SHA for event ${{ github.event_name }}"
exit 1
fi
echo "sha=$sha" >> "$GITHUB_OUTPUT"
- echo "[merge-gate] resolved head SHA: $sha"
+ echo "[merge-gate] event=${{ github.event_name }} resolved head SHA: $sha"

- - name: Checkout PR head
+ - name: Checkout head
uses: actions/checkout@v4
with:
ref: ${{ steps.sha.outputs.sha }}
@@ -88,17 +113,23 @@ jobs:
GH_TOKEN: ${{ github.token }}
REPO: ${{ github.repository }}
SHA: ${{ steps.sha.outputs.sha }}
- # All PR-time checks the gate aggregates. Keep this in sync with
- # the underlying workflows. Currently only ci.yml emits PR-time
- # checks ('Build & Test (Linux)', 'APM Self-Check');
- # ci-integration.yml is merge_group-only and is NOT polled here.
+ EVENT_NAME: ${{ github.event_name }}
+ # All required checks the gate aggregates, branched by event:
+ # pull_request / workflow_dispatch -> only ci.yml runs at PR time
+ # merge_group -> ci.yml AND ci-integration.yml run
+ # Keep this in sync with the underlying workflows.
# NOTE: 'gate' (this job) MUST NOT appear here -- it would
# deadlock waiting for itself.
- EXPECTED_CHECKS: 'Build & Test (Linux),APM Self-Check'
- TIMEOUT_MIN: '30'
+ EXPECTED_CHECKS: ${{ github.event_name == 'merge_group' && 'Build & Test (Linux),APM Self-Check,Build (Linux),Smoke Test (Linux),Integration Tests (Linux),Release Validation (Linux)' || 'Build & Test (Linux),APM Self-Check' }}
+ # Poll budget: ci-integration.yml chains Build -> Smoke ->
+ # Integration (timeout 20m) -> Release Validation (timeout 20m).
+ # Theoretical worst case ~50m; observed today ~5m end-to-end.
+ # Sized at 55m to absorb growth without false-failing the gate.
+ TIMEOUT_MIN: '55'
POLL_SEC: '30'
run: |
chmod +x .github/scripts/ci/merge_gate_wait.sh
.github/scripts/ci/merge_gate_wait.sh
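
The new `EXPECTED_CHECKS` value uses the `cond && a || b` idiom as a ternary inside the Actions expression. The same branching, written as a shell sketch with the check names copied from the workflow above, might look like this. `expected_checks_for` is a hypothetical name; the workflow itself does this selection in the expression, not in shell.

```shell
# Hypothetical helper (not from the PR): map an event name to the
# comma-separated list of checks the gate should wait for.
expected_checks_for() {
  case "$1" in
    merge_group)
      # Merge-queue context: ci.yml and ci-integration.yml both run.
      printf '%s\n' 'Build & Test (Linux),APM Self-Check,Build (Linux),Smoke Test (Linux),Integration Tests (Linux),Release Validation (Linux)'
      ;;
    *)
      # pull_request / workflow_dispatch: only ci.yml runs.
      printf '%s\n' 'Build & Test (Linux),APM Self-Check'
      ;;
  esac
}

expected_checks_for pull_request
```

The `&& ... ||` form is safe in the workflow because the middle operand is a non-empty string and therefore always truthy; with an empty middle operand the expression would fall through to the right-hand side.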


