Skip to content

fix(bundler): add pre-flight finalizer check to undeploy.sh (#406)#561

Merged
mchmarny merged 2 commits intomainfrom
fix/undeploy-hangs
Apr 14, 2026
Merged

fix(bundler): add pre-flight finalizer check to undeploy.sh (#406)#561
mchmarny merged 2 commits intomainfrom
fix/undeploy-hangs

Conversation

@lockwobr
Copy link
Copy Markdown
Contributor

Summary

Add a pre-flight check to undeploy.sh that detects custom resources with active finalizers on CRDs owned by bundle operators. Warns the
user and exits before teardown to prevent the unrecoverable hang described in #406.

Motivation / Context

undeploy.sh hangs indefinitely when CRs have finalizers that require their controller to process, but the controller has already been
uninstalled. The deadlock: delete_release_cluster_resources() deletes CRDs without --wait=false (line 81), the CRD's
customresourcecleanup finalizer blocks until all CR instances are gone, but CRs in user namespaces have controller-managed finalizers that
can never be processed.

Fixes: #406
Related: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: ____________

Implementation Notes

Pre-flight check (check_release_for_stuck_crds / check_crd_for_stuck_resources):

  • Uses helm get manifest to discover CRDs owned by each Helm component (generic — no hardcoded CRD group list needed)
  • For each CRD, queries all namespaces for CR instances with non-empty .metadata.finalizers
  • If found: lists blocking objects with their finalizer names, then exits with instructions
  • Silently skips releases that aren't installed (helm get manifest failure → return 0)
  • --skip-preflight flag to bypass when the user knows what they're doing

Component manifest: Lists all components that will be removed (with type tags: [helm], [kustomize], [manifests]) so users can see
exactly what undeploy will do — important for shared components like cert-manager.

jq hard requirement: Replaced the HAS_JQ soft-check with a hard gate. Removes ~8 branching guards throughout the script. jq was
already effectively required for CRD cleanup and webhook handling to work correctly.

Stuck CRD handling: Changed the existing silent force-clear of CRD finalizers to a warning with remediation commands. Force-clearing CRD
finalizers can orphan CR data in etcd — the user should decide.

Testing

make qualify   # all tests, lint, e2e pass; scan fails on pre-existing stdlib CVEs (go1.26.1→1.26.2)

Unit tests (pkg/bundler/deployer/helm/helm_test.go):

=== RUN   TestGenerate_UndeployScriptExecutable
--- PASS: TestGenerate_UndeployScriptExecutable (0.02s)
PASS
ok    github.com/NVIDIA/aicr/pkg/bundler/deployer/helm    1.81s

New assertions verify:

  • jq is a hard requirement (no HAS_JQ)
  • Pre-flight section exists and runs before component uninstall (index comparison)
  • Pre-flight calls check_release_for_stuck_crds per component
  • Helper functions exist and use helm get manifest
  • Stuck CRD handling warns instead of force-clearing

Manual Kind cluster test (cert-manager + Issuer with fake finalizer):

❯ ./undeploy.sh
Undeploying Cloud Native Stack components...

  The following components will be removed (in order):
    [helm]       nvsentinel (nvsentinel)
    [helm]       nvidia-dra-driver-gpu (nvidia-dra-driver)
    [helm]       kai-scheduler (kai-scheduler)
    [helm]       gpu-operator (gpu-operator)
    [manifests]  skyhook-customizations
    [helm]       skyhook-operator (skyhook)
    [helm]       prometheus-adapter (monitoring)
    [helm]       k8s-ephemeral-storage-metrics (monitoring)
    [helm]       kube-prometheus-stack (monitoring)
    [helm]       cert-manager (cert-manager)
    [helm]       aws-efa (kube-system)
    [helm]       aws-ebs-csi-driver (kube-system)

Running pre-flight checks...

ERROR: Found custom resources with active finalizers that will block undeploy.
  After the operator is removed, these finalizers cannot be processed —
  causing an unrecoverable hang during CRD deletion.

  Delete these resources while their controller is still running,
  then re-run ./undeploy.sh

  cert-manager — issuers.cert-manager.io:
    test-workload/stuck-issuer  finalizers=[test.example.com/block-cleanup]

  To skip this check: ./undeploy.sh --skip-preflight

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert

Rollout notes: The pre-flight is additive — existing undeploy behavior is unchanged when no CRs with finalizers are found.
--skip-preflight provides an escape hatch. The jq hard requirement could block users without jq, but jq was already effectively required
for correct operation.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@lockwobr lockwobr requested a review from a team as a code owner April 13, 2026 22:40
@lockwobr lockwobr self-assigned this Apr 13, 2026
@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 13, 2026

Coverage Report ✅

Metric Value
Coverage 74.6%
Threshold 70%
Status Pass
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-74.6%25-green)

No Go source files changed in this PR.

mchmarny

This comment was marked as resolved.

@lockwobr lockwobr force-pushed the fix/undeploy-hangs branch from b242a3e to bc8bc91 Compare April 14, 2026 17:03
@lockwobr lockwobr requested a review from mchmarny April 14, 2026 17:04
@lockwobr lockwobr enabled auto-merge (squash) April 14, 2026 18:25
@mchmarny mchmarny disabled auto-merge April 14, 2026 18:31
@mchmarny mchmarny merged commit 228e518 into main Apr 14, 2026
23 checks passed
@mchmarny mchmarny deleted the fix/undeploy-hangs branch April 14, 2026 18:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: undeploy.sh hangs when CRs have finalizers and controller is already uninstalled

2 participants