
Fix long-running test workflow: add resource type health check and defensive cleanup #11618

Merged
willdavsmith merged 7 commits into main from fix-lrt-4826 on Apr 9, 2026

@willdavsmith willdavsmith commented Apr 9, 2026

Description

Passing run: https://github.com/radius-project/radius/actions/runs/24211599641

The long-running test workflow has been consistently failing since March 31 with:

"resource type \"Applications.Core/environments\" not found"

Root Cause

Cancelled runs on March 28-30 caused the skip-delete-resources-list cache to become unavailable. On March 31, the pre-test cleanup ran without a skip list and deleted 852 resource.* entries from resources.ucp.dev, including system-critical UCP resources. This caused the UCP to lose resource type registrations.

Every subsequent run failed because:

  • manage-radius-installation.sh saw that the versions matched (0.56.1), so it skipped all verification
  • Resource types were never re-registered
  • rad env create failed every time

Fix

manage-radius-installation.sh — Self-healing recovery:

  • Add verify_resource_types_available() that uses rad resource-provider list to make a live API check for Applications.Core registration (instead of only checking stale pod logs)
  • When versions match but resource types are missing, automatically uninstall and reinstall Radius to re-register them
  • When versions match and types are healthy, refresh the skip-delete-resources list
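
The version-match behavior in the bullets above can be sketched as a small decision function. This is an illustration under stated assumptions, not the script's actual code: `decide_action` and the action names it prints are hypothetical, and the health result is passed in as a plain string instead of being queried live.

```shell
# Hypothetical sketch: pick the next step for the installation script.
# $1 = installed version, $2 = desired version,
# $3 = "healthy" or "missing" (result of the resource type health check)
decide_action() {
  installed="$1"; desired="$2"; types="$3"
  if [ "$installed" != "$desired" ]; then
    # Versions differ: the normal install/upgrade path runs.
    echo "install-or-upgrade"
  elif [ "$types" = "missing" ]; then
    # Versions match but registrations are gone: reinstall to re-register.
    echo "reinstall"
  else
    # Versions match and types are healthy: only refresh the skip list.
    echo "refresh-skip-list"
  fi
}

decide_action 0.56.1 0.56.1 missing
```

The point of the middle branch is that a matching version alone no longer short-circuits the check, which is the condition that kept the workflow broken after March 31.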

cleanup-long-running-cluster.sh — Prevention:

  • When no skip list is available, only delete scope.* entries (test resource groups)
  • Preserve non-scope resource.* entries that may include system-critical UCP resources
  • This prevents catastrophic cleanup when the cache is missing
  • Use exact fixed-string whole-line matching (grep -F -x) for skip list lookups
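
A minimal sketch of the per-entry cleanup decision described above, under stated assumptions: `cleanup_decision` is a hypothetical helper, the entry names are passed in as plain strings, and an empty skip list file is treated the same as a missing one (a defensive choice made for this example, not something the PR states).

```shell
# Hypothetical sketch: decide whether one resources.ucp.dev entry may be deleted.
# $1 = entry name, $2 = path to the skip list file (may be absent)
cleanup_decision() {
  name="$1"; skip_list="$2"
  if [ -s "$skip_list" ]; then
    # Skip list available: keep exactly the entries it names.
    if grep -F -x -q -e "$name" "$skip_list"; then
      echo "skip"
    else
      echo "delete"
    fi
    return 0
  fi
  # No skip list: only test resource groups (scope.*) are safe to delete.
  case "$name" in
    scope.*) echo "delete" ;;
    *)       echo "skip" ;;
  esac
}
```

With no skip list, `cleanup_decision resource.anything /missing/file` prints `skip`, so the non-scope entries that may be system-critical survive a run whose cache was lost.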

Validation

Tested on the fix-lrt-4826 branch — the workflow passed successfully after being broken for 12+ days.

Type of change

  • This pull request is a minor refactor, code cleanup, test improvement, or other maintenance task and does not change the functionality of Radius (issue link optional).

Contributor checklist

Please verify that the PR meets the following requirements, where applicable:

  • An overview of proposed schema changes is included in a linked GitHub issue.
    • [ ] Yes
    • [x] Not applicable
  • A design document PR is created in the design-notes repository, if new APIs are being introduced.
    • [ ] Yes
    • [x] Not applicable
  • The design document has been reviewed and approved by Radius maintainers/approvers.
    • [ ] Yes
    • [x] Not applicable
  • A PR for the samples repository is created, if existing samples are affected by the changes in this PR.
    • [ ] Yes
    • [x] Not applicable
  • A PR for the documentation repository is created, if the changes in this PR affect the documentation or any user facing updates are made.
    • [ ] Yes
    • [x] Not applicable
  • A PR for the recipes repository is created, if existing recipes are affected by the changes in this PR.
    • [ ] Yes
    • [x] Not applicable

When the skip-delete-resources-list cache is unavailable (e.g. after
cancelled runs), the cleanup script was deleting all resources.ucp.dev
entries including system-critical ones. This caused UCP to lose resource
type registrations, breaking every subsequent run.

manage-radius-installation.sh:
- Add verify_resource_types_available() that makes a live API call to
  detect missing resource types (instead of only checking stale pod logs)
- When versions match but resource types are missing, automatically
  uninstall and reinstall Radius to re-register them
- When versions match and types are healthy, still verify manifests and
  refresh the skip-delete-resources list

cleanup-long-running-cluster.sh:
- When no skip list is available, only delete scope.* entries (test
  resource groups) and preserve non-scope resource.* entries that may
  include system-critical UCP resources

The previous health check used rad env list with a temporary group,
but the group creation failed silently, causing the check to return
a 'resource group not found' error instead of detecting the missing
resource types.

Use rad resource-provider list instead, which only needs a workspace
and directly checks whether Applications.Core is registered.
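
A minimal sketch of a check built on `rad resource-provider list`, as the commit message above describes. It assumes the command's output contains the provider name as plain text, which this PR does not show. `provider_registered` and the `RAD_CMD` override are hypothetical names introduced so the check can be exercised without a cluster.

```shell
# Hypothetical sketch: succeed only if the named provider appears in the
# listing. RAD_CMD is an assumed override for testing; it defaults to rad.
provider_registered() {
  provider="$1"
  # If the listing fails, grep sees no input and the check returns non-zero.
  ${RAD_CMD:-rad} resource-provider list 2>/dev/null | grep -F -q -e "$provider"
}
```

A caller could run `provider_registered Applications.Core` and treat a non-zero status as a signal to investigate. Note that this simple form cannot tell "provider missing" apart from "query failed".
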
@willdavsmith willdavsmith requested a review from a team as a code owner April 9, 2026 17:40
Copilot AI review requested due to automatic review settings April 9, 2026 17:40
@willdavsmith willdavsmith had a problem deploying to external-contributor-approval April 9, 2026 17:40 — with GitHub Actions Failure

Copilot AI left a comment

Pull request overview

This PR hardens the long-running test GitHub Actions workflow by making the Radius installation step self-healing when the control plane has matching versions but missing resource type registrations, and by preventing cleanup from deleting system-critical UCP resources when the skip list cache is unavailable.

Changes:

  • Add a live health check (rad resource-provider list) to detect missing Applications.Core registration and trigger an automatic reinstall when versions match.
  • Refresh verification + skip-delete resource list even when install/upgrade is skipped (versions match and types are healthy).
  • Make cleanup safe when the skip list is missing by deleting only scope.* entries and preserving non-scope resource.* entries.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| .github/scripts/manage-radius-installation.sh | Adds resource provider “live” health check and reinstalls Radius if types are missing despite matching versions. |
| .github/scripts/cleanup-long-running-cluster.sh | Prevents catastrophic cleanup when the skip list is absent by restricting deletions to scope.* entries. |

Comment thread .github/scripts/manage-radius-installation.sh Outdated
Comment thread .github/scripts/manage-radius-installation.sh Outdated
Comment thread .github/scripts/cleanup-long-running-cluster.sh Outdated
- Let rad workspace create fail visibly instead of suppressing errors
- Remove redundant log-based verify_manifests_registered from the
  version-match path since verify_resource_types_available already
  does a live API check; log-based check is only used after install/upgrade
- Use grep -F -x for exact fixed-string whole-line matching in the
  skip list lookup to avoid regex and substring false positives
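
The difference between a plain `grep` lookup and `grep -F -x` can be seen with a small made-up skip list. The entry names and the two helper functions below are illustrative only and do not come from the scripts in this PR.

```shell
# Illustrative skip list; the entry names are invented for this example.
skip_list=$(mktemp)
trap 'rm -f "$skip_list"' EXIT
printf '%s\n' 'resource.app-1' 'resource-app-2' > "$skip_list"

# Plain grep: the name is treated as a regex and may match part of a line.
in_skip_list_loose() { grep -q -e "$1" "$skip_list"; }
# -F: literal string, -x: must match the whole line.
in_skip_list_exact() { grep -F -x -q -e "$1" "$skip_list"; }

for name in resource.app resource.app-2 resource.app-1; do
  if in_skip_list_loose "$name"; then loose=yes; else loose=no; fi
  if in_skip_list_exact "$name"; then exact=yes; else exact=no; fi
  echo "$name loose=$loose exact=$exact"
done
```

Here `resource.app` is a substring of a listed entry, and the `.` in `resource.app-2` acts as a regex wildcard that matches the `-` in `resource-app-2`. Both count as hits for the loose lookup, while the exact lookup accepts only `resource.app-1`.
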
@willdavsmith willdavsmith temporarily deployed to external-contributor-approval April 9, 2026 17:52 — with GitHub Actions Inactive

codecov Bot commented Apr 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 51.37%. Comparing base (b3af875) to head (55a3cf2).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #11618      +/-   ##
==========================================
- Coverage   51.38%   51.37%   -0.01%     
==========================================
  Files         699      699              
  Lines       44111    44111              
==========================================
- Hits        22667    22663       -4     
- Misses      19277    19279       +2     
- Partials     2167     2169       +2     


github-actions Bot commented Apr 9, 2026

Unit Tests

    2 files  ±0    415 suites  ±0   6m 40s ⏱️ -4s
4 872 tests ±0  4 870 ✅ ±0  2 💤 ±0  0 ❌ ±0 
5 774 runs  ±0  5 772 ✅ ±0  2 💤 ±0  0 ❌ ±0 

Results for commit 55a3cf2. ± Comparison against base commit b3af875.

♻️ This comment has been updated with latest results.

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread .github/scripts/manage-radius-installation.sh Outdated
Comment thread .github/scripts/manage-radius-installation.sh

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment thread .github/scripts/manage-radius-installation.sh Outdated
Comment thread .github/scripts/manage-radius-installation.sh Outdated
Comment thread .github/scripts/cleanup-long-running-cluster.sh Outdated
- verify_resource_types_available now returns distinct codes: 0=healthy,
  1=provider missing, 2=query failed
- Only trigger reinstall when the provider is definitively missing (rc=1);
  for query failures (rc=2), retry once after 30s before failing
- Update cleanup log message to reflect conditional behavior when no
  skip list is available
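
The three return codes and the single retry described above might be handled by a caller along these lines. This is a sketch under assumptions: `handle_health_check` and `RETRY_DELAY` are hypothetical names, the check is passed in as a command that follows the 0/1/2 convention, and the result of the retry is treated like a first result (the commit message does not spell out that case).

```shell
# Hypothetical sketch of the caller side.
# $1 = command returning 0 (healthy), 1 (provider missing), 2 (query failed)
handle_health_check() {
  check_cmd="$1"
  rc=0
  "$check_cmd" || rc=$?
  if [ "$rc" -eq 2 ]; then
    # A failed query is not proof that the provider is missing: retry once.
    echo "query failed; retrying once" >&2
    sleep "${RETRY_DELAY:-30}"
    rc=0
    "$check_cmd" || rc=$?
  fi
  case "$rc" in
    0) echo "healthy" ;;
    1) echo "reinstall" ;;
    *) echo "fail"; return 1 ;;
  esac
}
```

Keeping "missing" and "query failed" apart means a transient API error cannot trigger an uninstall and reinstall on its own.
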

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment thread .github/scripts/manage-radius-installation.sh
Capture rad workspace create failure explicitly and return rc=2
instead of letting set -euo pipefail exit the script, so the
caller's retry logic can handle it.
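
The pattern of capturing a setup failure and returning rc=2 can be sketched as follows. `check_with_setup` is a hypothetical name, and the setup step is passed in as a command because the real `rad workspace create` arguments are not shown in this PR. The demo uses `set -eu` instead of `set -euo pipefail` so that it also runs under a plain POSIX `sh`.

```shell
# Hypothetical sketch: report a setup failure as rc=2 instead of letting
# errexit terminate the script.
# $1 = setup command (stand-in for the workspace creation step)
check_with_setup() {
  setup_cmd="$1"
  if ! "$setup_cmd" >/dev/null 2>&1; then
    echo "workspace setup failed" >&2
    return 2
  fi
  return 0
}

# Demo in a subshell with errexit on: the failure is captured, not fatal.
(
  set -eu
  rc=0
  check_with_setup false || rc=$?
  echo "script still running, rc=$rc"
)
```

Because the failure comes back as a return code, the caller's retry logic gets a chance to run before the script gives up.
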
@willdavsmith willdavsmith temporarily deployed to external-contributor-approval April 9, 2026 20:28 — with GitHub Actions Inactive

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

radius-functional-tests Bot commented Apr 9, 2026

Radius functional test overview

| Name | Value |
| --- | --- |
| Repository | radius-project/radius |
| Commit ref | 55a3cf2 |
| Unique ID | func4a062fdb24 |
| Image tag | pr-func4a062fdb24 |

  • gotestsum 1.13.0
  • KinD: v0.29.0
  • Dapr: 1.14.4
  • Azure KeyVault CSI driver: 1.4.2
  • Azure Workload identity webhook: 1.3.0
  • Bicep recipe location ghcr.io/radius-project/dev/test/testrecipes/test-bicep-recipes/<name>:pr-func4a062fdb24
  • Terraform recipe location http://tf-module-server.radius-test-tf-module-server.svc.cluster.local/<name>.zip (in cluster)
  • applications-rp test image location: ghcr.io/radius-project/dev/applications-rp:pr-func4a062fdb24
  • dynamic-rp test image location: ghcr.io/radius-project/dev/dynamic-rp:pr-func4a062fdb24
  • controller test image location: ghcr.io/radius-project/dev/controller:pr-func4a062fdb24
  • ucp test image location: ghcr.io/radius-project/dev/ucpd:pr-func4a062fdb24
  • deployment-engine test image location: ghcr.io/radius-project/deployment-engine:latest

Test Status

⌛ Building Radius and pushing container images for functional tests...
✅ Container images build succeeded
⌛ Publishing Bicep Recipes for functional tests...
✅ Recipe publishing succeeded
⌛ Starting ucp-cloud functional tests...
⌛ Starting corerp-cloud functional tests...
✅ ucp-cloud functional tests succeeded
✅ corerp-cloud functional tests succeeded

@willdavsmith willdavsmith merged commit 719edd1 into main Apr 9, 2026
61 checks passed
@willdavsmith willdavsmith deleted the fix-lrt-4826 branch April 9, 2026 22:42
sk593 pushed a commit that referenced this pull request Apr 14, 2026
…fensive cleanup (#11618)

## Description

Passing run:
https://github.com/radius-project/radius/actions/runs/24211599641

The long-running test workflow has been consistently failing since March
31 with:
```
"resource type \"Applications.Core/environments\" not found"
```

### Root Cause

Cancelled runs on March 28-30 caused the `skip-delete-resources-list`
cache to become unavailable. On March 31, the pre-test cleanup ran
without a skip list and deleted **852 `resource.*` entries** from
`resources.ucp.dev`, including system-critical UCP resources. This
caused the UCP to lose resource type registrations.

Every subsequent run failed because:
- `manage-radius-installation.sh` saw versions matched (0.56.1) →
skipped all verification
- Resource types were never re-registered
- `rad env create` failed every time

### Fix

**`manage-radius-installation.sh`** — Self-healing recovery:
- Add `verify_resource_types_available()` that uses `rad
resource-provider list` to make a **live API check** for
`Applications.Core` registration (instead of only checking stale pod
logs)
- When versions match but resource types are missing, automatically
**uninstall and reinstall** Radius to re-register them
- When versions match and types are healthy, refresh the
skip-delete-resources list

**`cleanup-long-running-cluster.sh`** — Prevention:
- When no skip list is available, **only delete `scope.*` entries**
(test resource groups)
- Preserve non-scope `resource.*` entries that may include
system-critical UCP resources
- This prevents catastrophic cleanup when the cache is missing
- Use exact fixed-string whole-line matching (`grep -F -x`) for skip
list lookups

### Validation

Tested on the `fix-lrt-4826` branch — the workflow [passed
successfully](https://github.com/radius-project/radius/actions/runs/24159347749)
after being broken for 12+ days.

## Type of change

- This pull request is a minor refactor, code cleanup, test improvement,
or other maintenance task and does not change the functionality of
Radius (issue link optional).


## Contributor checklist
Please verify that the PR meets the following requirements, where
applicable:

<!--
This checklist uses "TaskRadio" comments to make certain options
mutually exclusive.
See:
https://github.com/mheap/require-checklist-action?tab=readme-ov-file#radio-groups
For details on how this works and why it's required.
-->

- An overview of proposed schema changes is included in a linked GitHub
issue.
    - [ ] Yes <!-- TaskRadio schema -->
    - [x] Not applicable <!-- TaskRadio schema -->
- A design document PR is created in the [design-notes
repository](https://github.com/radius-project/design-notes/), if new
APIs are being introduced.
    - [ ] Yes <!-- TaskRadio design-pr -->
    - [x] Not applicable <!-- TaskRadio design-pr -->
- The design document has been reviewed and approved by Radius
maintainers/approvers.
    - [ ] Yes <!-- TaskRadio design-review -->
    - [x] Not applicable <!-- TaskRadio design-review -->
- A PR for the [samples
repository](https://github.com/radius-project/samples) is created, if
existing samples are affected by the changes in this PR.
    - [ ] Yes <!-- TaskRadio samples-pr -->
    - [x] Not applicable <!-- TaskRadio samples-pr -->
- A PR for the [documentation
repository](https://github.com/radius-project/docs) is created, if the
changes in this PR affect the documentation or any user facing updates
are made.
    - [ ] Yes <!-- TaskRadio docs-pr -->
    - [x] Not applicable <!-- TaskRadio docs-pr -->
- A PR for the [recipes
repository](https://github.com/radius-project/recipes) is created, if
existing recipes are affected by the changes in this PR.
    - [ ] Yes <!-- TaskRadio recipes-pr -->
    - [x] Not applicable <!-- TaskRadio recipes-pr -->