Fix long-running test workflow: add resource type health check and defensive cleanup #11618
willdavsmith merged 7 commits into main
When the skip-delete-resources-list cache is unavailable (e.g. after cancelled runs), the cleanup script was deleting all resources.ucp.dev entries including system-critical ones. This caused UCP to lose resource type registrations, breaking every subsequent run.

manage-radius-installation.sh:
- Add verify_resource_types_available() that makes a live API call to detect missing resource types (instead of only checking stale pod logs)
- When versions match but resource types are missing, automatically uninstall and reinstall Radius to re-register them
- When versions match and types are healthy, still verify manifests and refresh the skip-delete-resources list

cleanup-long-running-cluster.sh:
- When no skip list is available, only delete scope.* entries (test resource groups) and preserve non-scope resource.* entries that may include system-critical UCP resources
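The fallback described for cleanup-long-running-cluster.sh can be sketched as a small filter. This is a minimal illustration, not the script's actual code: the function name `select_deletable_without_skip_list` and the sample entry names are hypothetical, and the sketch assumes entry names arrive one per line on stdin.

```shell
# Hypothetical filter for the "no skip list" case: print only scope.* entries
# (test resource groups) as deletable; everything else is preserved.
select_deletable_without_skip_list() {
  while IFS= read -r name; do
    case "$name" in
      scope.*) printf '%s\n' "$name" ;;              # safe to delete
      *) echo "preserving $name (no skip list)" >&2 ;; # may be system-critical
    esac
  done
}

# Example with made-up entry names:
printf '%s\n' 'scope.test-rg-1' 'resource.system-type' 'scope.test-rg-2' \
  | select_deletable_without_skip_list
```

Only the two `scope.*` names reach stdout; the `resource.*` entry is reported on stderr and left alone.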
The previous health check used rad env list with a temporary group, but the group creation failed silently, causing the check to return a 'resource group not found' error instead of detecting the missing resource types. Use rad resource-provider list instead, which only needs a workspace and directly checks whether Applications.Core is registered.
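A rough sketch of a check built on `rad resource-provider list` follows. The PR does not show that command's output format, so the substring match below is an assumption, and the helper name `provider_registered` is hypothetical. The listing command is passed as arguments so the sketch can be exercised without a cluster.

```shell
# Hypothetical helper: provider_registered PROVIDER CMD [ARGS...]
# Runs CMD, then checks whether PROVIDER appears anywhere in its output.
# Returns 0 if found, 1 if not found or if CMD itself failed.
provider_registered() {
  local provider="$1"
  shift
  local listing
  # Assign separately from `local` so the exit status of CMD is preserved.
  listing="$("$@")" || return 1
  case "$listing" in
    *"$provider"*) return 0 ;;  # literal substring match (assumed sufficient)
    *) return 1 ;;
  esac
}

# Intended use, assuming a workspace is already configured:
#   provider_registered Applications.Core rad resource-provider list
```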
Pull request overview
This PR hardens the long-running test GitHub Actions workflow by making the Radius installation step self-healing when the control plane has matching versions but missing resource type registrations, and by preventing cleanup from deleting system-critical UCP resources when the skip list cache is unavailable.
Changes:
- Add a live health check (`rad resource-provider list`) to detect missing `Applications.Core` registration and trigger an automatic reinstall when versions match.
- Refresh verification + skip-delete resource list even when install/upgrade is skipped (versions match and types are healthy).
- Make cleanup safe when the skip list is missing by deleting only `scope.*` entries and preserving non-scope `resource.*` entries.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| .github/scripts/manage-radius-installation.sh | Adds resource provider “live” health check and reinstalls Radius if types are missing despite matching versions. |
| .github/scripts/cleanup-long-running-cluster.sh | Prevents catastrophic cleanup when the skip list is absent by restricting deletions to scope.* entries. |
- Let rad workspace create fail visibly instead of suppressing errors
- Remove redundant log-based verify_manifests_registered from the version-match path since verify_resource_types_available already does a live API check; log-based check is only used after install/upgrade
- Use grep -F -x for exact fixed-string whole-line matching in the skip list lookup to avoid regex and substring false positives
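To illustrate why `grep -F -x` matters for the skip list lookup, here is a minimal sketch. The helper name `in_skip_list`, the temporary file, and the entry names are all hypothetical.

```shell
# Hypothetical helper: succeeds only if NAME is an exact whole line of SKIP_FILE.
in_skip_list() {
  local name="$1" skip_file="$2"
  # -F: fixed string, so '.' in a name is not a regex wildcard
  # -x: the whole line must match, so substrings do not count
  # -q: quiet; only the exit status is used
  # --: stop option parsing in case a name starts with '-'
  grep -F -x -q -- "$name" "$skip_file"
}

demo_skip_file="$(mktemp)"
printf '%s\n' 'resource.abc.def' > "$demo_skip_file"

in_skip_list 'resource.abc.def' "$demo_skip_file" && echo "exact name: skipped"
in_skip_list 'resource.abc'     "$demo_skip_file" || echo "substring: not skipped"
in_skip_list 'resource.abcXdef' "$demo_skip_file" || echo "regex look-alike: not skipped"

rm -f "$demo_skip_file"
```

With a plain `grep "$name"`, both the substring and the look-alike would have matched the listed entry.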
Codecov Report
✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

@@ Coverage Diff @@
## main #11618 +/- ##
==========================================
- Coverage 51.38% 51.37% -0.01%
==========================================
Files 699 699
Lines 44111 44111
==========================================
- Hits 22667 22663 -4
- Misses 19277 19279 +2
- Partials 2167 2169 +2
- verify_resource_types_available now returns distinct codes: 0=healthy, 1=provider missing, 2=query failed
- Only trigger reinstall when the provider is definitively missing (rc=1); for query failures (rc=2), retry once after 30s before failing
- Update cleanup log message to reflect conditional behavior when no skip list is available
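The dispatch on those return codes might look roughly like the sketch below. It is not the script's actual code: `handle_health_check` is a hypothetical name, the check is injected as a function name so the sketch runs without a cluster, and the printed words stand in for the real reinstall and failure paths.

```shell
# Hypothetical dispatcher: handle_health_check CHECK_FN [RETRY_DELAY_SECONDS]
# CHECK_FN must return 0=healthy, 1=provider missing, 2=query failed.
handle_health_check() {
  local check_fn="$1" retry_delay="${2:-30}"
  local rc=0
  "$check_fn" || rc=$?

  if [ "$rc" -eq 2 ]; then
    # A failed query says nothing about registration, so retry once.
    echo "query failed; retrying once after ${retry_delay}s" >&2
    sleep "$retry_delay"
    rc=0
    "$check_fn" || rc=$?
  fi

  case "$rc" in
    0) echo "healthy" ;;
    1) echo "reinstall" ;;       # provider definitively missing
    *) echo "fail"; return 1 ;;  # still failing after the retry
  esac
}
```

A caller would pass its real check, for example `handle_health_check verify_resource_types_available`, and act on the printed decision.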
Capture rad workspace create failure explicitly and return rc=2 instead of letting set -euo pipefail exit the script, so the caller's retry logic can handle it.
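One way to capture such a failure without tripping `set -e` is to test the command in an `if`, as in this minimal sketch. The wrapper name `verify_with_setup` is hypothetical, and the setup step is injected as a function name in place of the real `rad workspace create` call.

```shell
set -e  # the real script uses `set -euo pipefail`

# Hypothetical wrapper: verify_with_setup SETUP_FN
# Returns 2 (query failed) if the setup step fails, 0 otherwise.
verify_with_setup() {
  local setup_fn="$1"
  # A command tested by `if` is exempt from errexit, so a failure here
  # becomes an explicit return code instead of terminating the script.
  if ! "$setup_fn"; then
    echo "workspace setup failed; reporting rc=2" >&2
    return 2
  fi
  return 0
}

# Caller side: collect the code without exiting, then decide what to do.
#   rc=0; verify_with_setup create_workspace || rc=$?
```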
…fensive cleanup (#11618)

## Description

Passing run: https://github.com/radius-project/radius/actions/runs/24211599641

The long-running test workflow has been consistently failing since March 31 with:

```
"resource type \"Applications.Core/environments\" not found"
```

### Root Cause

Cancelled runs on March 28-30 caused the `skip-delete-resources-list` cache to become unavailable. On March 31, the pre-test cleanup ran without a skip list and deleted **852 `resource.*` entries** from `resources.ucp.dev`, including system-critical UCP resources. This caused the UCP to lose resource type registrations.

Every subsequent run failed because:

- `manage-radius-installation.sh` saw versions matched (0.56.1) → skipped all verification
- Resource types were never re-registered
- `rad env create` failed every time

### Fix

**`manage-radius-installation.sh`** (self-healing recovery):

- Add `verify_resource_types_available()` that uses `rad resource-provider list` to make a **live API check** for `Applications.Core` registration (instead of only checking stale pod logs)
- When versions match but resource types are missing, automatically **uninstall and reinstall** Radius to re-register them
- When versions match and types are healthy, refresh the skip-delete-resources list

**`cleanup-long-running-cluster.sh`** (prevention):

- When no skip list is available, **only delete `scope.*` entries** (test resource groups)
- Preserve non-scope `resource.*` entries that may include system-critical UCP resources
- This prevents catastrophic cleanup when the cache is missing
- Use exact fixed-string whole-line matching (`grep -F -x`) for skip list lookups

### Validation

Tested on the `fix-lrt-4826` branch: the workflow [passed successfully](https://github.com/radius-project/radius/actions/runs/24159347749) after being broken for 12+ days.

## Type of change

- This pull request is a minor refactor, code cleanup, test improvement, or other maintenance task and does not change the functionality of Radius (issue link optional).

## Contributor checklist

Please verify that the PR meets the following requirements, where applicable:

<!--
This checklist uses "TaskRadio" comments to make certain options mutually exclusive.
See: https://github.com/mheap/require-checklist-action?tab=readme-ov-file#radio-groups
For details on how this works and why it's required.
-->

- An overview of proposed schema changes is included in a linked GitHub issue.
    - [ ] Yes <!-- TaskRadio schema -->
    - [x] Not applicable <!-- TaskRadio schema -->
- A design document PR is created in the [design-notes repository](https://github.com/radius-project/design-notes/), if new APIs are being introduced.
    - [ ] Yes <!-- TaskRadio design-pr -->
    - [x] Not applicable <!-- TaskRadio design-pr -->
- The design document has been reviewed and approved by Radius maintainers/approvers.
    - [ ] Yes <!-- TaskRadio design-review -->
    - [x] Not applicable <!-- TaskRadio design-review -->
- A PR for the [samples repository](https://github.com/radius-project/samples) is created, if existing samples are affected by the changes in this PR.
    - [ ] Yes <!-- TaskRadio samples-pr -->
    - [x] Not applicable <!-- TaskRadio samples-pr -->
- A PR for the [documentation repository](https://github.com/radius-project/docs) is created, if the changes in this PR affect the documentation or any user facing updates are made.
    - [ ] Yes <!-- TaskRadio docs-pr -->
    - [x] Not applicable <!-- TaskRadio docs-pr -->
- A PR for the [recipes repository](https://github.com/radius-project/recipes) is created, if existing recipes are affected by the changes in this PR.
    - [ ] Yes <!-- TaskRadio recipes-pr -->
    - [x] Not applicable <!-- TaskRadio recipes-pr -->