From 9946d5a4d7b76976f82f3f07466f878bd5067b19 Mon Sep 17 00:00:00 2001 From: Rolf Bjarne Kvinge Date: Mon, 27 Apr 2026 19:12:19 +0200 Subject: [PATCH 1/4] Add fix-random-ci-test-failure skill for diagnosing flaky tests Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .../fix-random-ci-test-failure/SKILL.md | 108 ++++++++++++++++++ 1 file changed, 108 insertions(+) create mode 100644 .github/skills/fix-random-ci-test-failure/SKILL.md diff --git a/.github/skills/fix-random-ci-test-failure/SKILL.md b/.github/skills/fix-random-ci-test-failure/SKILL.md new file mode 100644 index 00000000000..ce55a66650d --- /dev/null +++ b/.github/skills/fix-random-ci-test-failure/SKILL.md @@ -0,0 +1,108 @@ +--- +name: fix-random-ci-test-failure +description: >- + Investigate and fix flaky/random CI test failures in dotnet/macios. Trigger on + GitHub issues describing intermittent test failures, CI postmortem issues, or when + asked to fix a flaky test. Analyzes test code, identifies root causes + (shared state, environment dependencies, race conditions), and applies fixes. +--- + +# Fix Random CI Test Failure + +Investigate and fix flaky CI test failures reported in GitHub issues for dotnet/macios. + +## When to Use + +- User shares a GitHub issue about a flaky/intermittent test failure +- Issue has the `ci-postmortem` label +- User asks to fix a randomly failing test +- Test failures appear across unrelated PRs (indicating environment-dependent flakiness) + +## Workflow + +### 1. Gather Failure Context + +Read the GitHub issue to extract: +- **Test name** and **test suite** (e.g., monotouch-test) +- **Error message** and **assertion failure details** +- **Platform** (iOS, macOS, tvOS, Mac Catalyst) +- **Affected builds** — check if failures span unrelated PRs (confirms flakiness vs. regression) + +If build URLs are provided, inspect logs for additional context (stack traces, timing info). + +### 2. Locate and Analyze the Test + +Find the test source file: +``` +grep -r "TestMethodName" tests/ +``` + +Read the **full test method** and all helper methods it calls. Understand: +- What external resources does it use? (keychain, file system, network, simulators) +- Does it use shared/hardcoded identifiers that could collide across parallel runs? +- Does it properly clean up before and after execution? +- Are there error paths that silently swallow failures? + +### 3. Identify Root Cause Category + +Common flaky test root causes in this repo: + +#### Shared State / Resource Conflicts +- **Symptom**: Test uses hardcoded identifiers (e.g., fixed keychain entries, file paths, port numbers) +- **Fix**: Use process-unique identifiers (PID, GUID, bundle ID + test name) +- **Reference**: `tests/monotouch-test/Security/KeyChainTest.cs` uses per-process unique IDs + +#### Environment-Dependent State +- **Symptom**: Test depends on OS-level state (keychain, permissions, network availability) +- **Fix**: Add robust setup/teardown, handle unexpected initial states, add retry logic for transient errors + +#### Unhandled Error Codes +- **Symptom**: Code only handles expected success/failure codes, silently fails on unexpected ones +- **Fix**: Add fallback handling for unexpected status codes, log diagnostic info + +#### Race Conditions / Timing +- **Symptom**: Test passes locally but fails intermittently in CI +- **Fix**: Add proper synchronization, increase timeouts, avoid timing-dependent assertions + +#### LAContext / Authentication Issues (Security tests) +- **Symptom**: `InvalidRecord` or authentication-related errors on keychain operations +- **Fix**: Only attach `LAContext` where actually needed (not on plain query/delete operations) + +### 4. Apply the Fix + +When fixing: + +1. **Prefer unique identifiers over shared ones.** Use `Environment.ProcessId`, `Guid.NewGuid()`, or `{bundleId}-{testName}-{pid}` patterns for resource identifiers. + +2. **Create minimal query records.** For search/delete operations, don't attach unnecessary attributes (like `LAContext`) that can cause intermittent errors. + +3. **Handle all status codes.** Never silently return `false` for unexpected error codes. Either handle them with a fallback path or fail with a descriptive assertion message. + +4. **Add diagnostic logging.** Use `TestContext.Out.WriteLine` to log operation results that would help diagnose future failures. + +5. **Clean up legacy state.** If renaming identifiers, also clean up entries from old names that may linger on CI agents. + +6. **Always clean up in `finally` blocks.** Ensure test resources are released even on failure. + +### 5. If the Fix is Unclear + +If you cannot determine the root cause with available information: + +1. **Add diagnostic logging** to capture operation results (status codes, error details) in future failures. +2. **Log the initial state** at test start (e.g., does the resource already exist before the test runs?). +3. **Include all relevant status codes** in assertion failure messages. +4. Explain to the user what additional information the logging will provide and what hypotheses it will help test. + +### 6. Create the PR + +- Branch naming: `dev/rolf/fix-{test-name}` or similar +- Commit message should explain all root causes addressed +- Reference the GitHub issue with `Fixes #NNNN` +- Add the `copilot` label to the PR + +## Key Patterns in This Repo + +- Keychain tests in `tests/monotouch-test/Security/` are particularly prone to flakiness due to shared macOS keychain state on CI agents +- `InitSecRecord` in `RecordTest.cs` attaches `LAContext` to all `SecRecord` instances — this is sometimes unnecessary and can cause `InvalidRecord` errors +- `KeyChainTest.cs` demonstrates the recommended pattern: per-process unique identifiers using bundle ID + test type + PID +- Tests run on shared CI agent machines where leftover state from previous runs can cause interference From 1ead3fa5b3388ada3bf5890d413bd6227d07958c Mon Sep 17 00:00:00 2001 From: Rolf Bjarne Kvinge Date: Mon, 27 Apr 2026 19:16:13 +0200 Subject: [PATCH 2/4] Use 'Ref' instead of 'Fixes' for logging-only changes in skill Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .github/skills/fix-random-ci-test-failure/SKILL.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/skills/fix-random-ci-test-failure/SKILL.md b/.github/skills/fix-random-ci-test-failure/SKILL.md index ce55a66650d..8d5b2b38318 100644 --- a/.github/skills/fix-random-ci-test-failure/SKILL.md +++ b/.github/skills/fix-random-ci-test-failure/SKILL.md @@ -97,7 +97,7 @@ If you cannot determine the root cause with available information: - Branch naming: `dev/rolf/fix-{test-name}` or similar - Commit message should explain all root causes addressed -- Reference the GitHub issue with `Fixes #NNNN` +- Reference the GitHub issue with `Fixes #NNNN` if the change actually fixes the problem, or `Ref #NNNN` if it only adds logging/diagnostics - Add the `copilot` label to the PR ## Key Patterns in This Repo From 5c5d6132286b82af28b03aac27d93f3559fa5702 Mon Sep 17 00:00:00 2001 From: Rolf Bjarne Kvinge Date: Wed, 29 Apr 2026 10:51:17 +0200 Subject: [PATCH 3/4] Update .github/skills/fix-random-ci-test-failure/SKILL.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- .github/skills/fix-random-ci-test-failure/SKILL.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/skills/fix-random-ci-test-failure/SKILL.md b/.github/skills/fix-random-ci-test-failure/SKILL.md index 8d5b2b38318..6e5caddb0b5 100644 --- a/.github/skills/fix-random-ci-test-failure/SKILL.md +++ b/.github/skills/fix-random-ci-test-failure/SKILL.md @@ -72,7 +72,7 @@ Common flaky test root causes in this repo: When fixing: -1. **Prefer unique identifiers over shared ones.** Use `Environment.ProcessId`, `Guid.NewGuid()`, or `{bundleId}-{testName}-{pid}` patterns for resource identifiers. +1. **Prefer unique identifiers over shared ones.** Use `Process.GetCurrentProcess ().Id`, `Guid.NewGuid ()`, or `{bundleId}-{testType}-{pid}` patterns for resource identifiers. 2. **Create minimal query records.** For search/delete operations, don't attach unnecessary attributes (like `LAContext`) that can cause intermittent errors. From 4e33fa76723d4da4b98df1dfe41aabb9d099ddbd Mon Sep 17 00:00:00 2001 From: Rolf Bjarne Kvinge Date: Wed, 29 Apr 2026 10:51:59 +0200 Subject: [PATCH 4/4] Apply suggestion from @rolfbjarne --- .github/skills/fix-random-ci-test-failure/SKILL.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/skills/fix-random-ci-test-failure/SKILL.md b/.github/skills/fix-random-ci-test-failure/SKILL.md index 6e5caddb0b5..2d9335b0bb2 100644 --- a/.github/skills/fix-random-ci-test-failure/SKILL.md +++ b/.github/skills/fix-random-ci-test-failure/SKILL.md @@ -95,7 +95,7 @@ If you cannot determine the root cause with available information: ### 6. Create the PR -- Branch naming: `dev/rolf/fix-{test-name}` or similar +- Branch naming: `dev/{username}/fix-{test-name}` or similar - Commit message should explain all root causes addressed - Reference the GitHub issue with `Fixes #NNNN` if the change actually fixes the problem, or `Ref #NNNN` if it only adds logging/diagnostics - Add the `copilot` label to the PR