From 0861f00fc1f0d0c5aa160ce325c0760c3f91f5af Mon Sep 17 00:00:00 2001 From: Alex Yuskauskas Date: Fri, 3 Apr 2026 16:36:57 -0700 Subject: [PATCH 1/3] docs(design): improve uninstall to support configuration, package support and label removal --- docs/designs/uninstall-enhancement.md | 225 ++++++++++++++++++++++++++ 1 file changed, 225 insertions(+) create mode 100644 docs/designs/uninstall-enhancement.md diff --git a/docs/designs/uninstall-enhancement.md b/docs/designs/uninstall-enhancement.md new file mode 100644 index 00000000..eb23f58d --- /dev/null +++ b/docs/designs/uninstall-enhancement.md @@ -0,0 +1,225 @@ +# Uninstall Enhancement design + +This document specifies target behavior for Skyhook package uninstall, and Custom Resource instance finalizers. + +## Problems + +### Uninstall configuration loss +The way uninstall is triggered is by deleting a package definition. This has a major problem in that all the environment variables and configmaps will not be available to the package when it runs uninstall which breaks many assumptions on different packages and makes uninstall not usable. + +### Label removal +Once uninstalled the labels and annotations need to be removed. However, not all packages actually support uninstall. So a package needs a way to let the operator know that it is okay to remove labels/annotations related to a package. + +### Finalizer and uninstall +When a Skyhook Custom Resource is deleted the packages are not cleaned up. Finalizer should trigger an uninstall if the packages mark themselves as supporting uninstall. + +## Goals + +1. **No configuration loss on uninstall**: Uninstall workloads must receive the same class of inputs as apply (ConfigMap data, env vars, resources) by running uninstall **while the full package definition still exists** in the Skyhook spec. +2. **Explicit capability**: Only packages that **effectively** support uninstall may run uninstall hooks and may have package-scoped operator metadata removed afterward. +3. **Explicit trigger**: Uninstall is requested with `uninstall.apply`, not by deleting a package key from `spec.packages`. +4. **Safe removal**: Admission blocks removing a package from the spec until it is safe (uninstalled / absent from node state per defined rules). +5. **Skyhook deletion**: The existing Skyhook finalizer must orchestrate uninstall (when supported) before releasing the CR and cleaning SCR-related metadata. + +## Feature flag: uninstall capability discovery + +Automatic capability discovery (registry access, Tier 1 / Tier 2) is **off by default** and must be explicitly enabled via an **operator feature flag** (exact mechanism is implementation detail: e.g. Helm `values.yaml`, operator `Deployment` env var, or command-line flag). + +| Flag | Behavior | +|------|----------| +| **Discovery disabled** (default) | The operator **does not** run Tier 1 or Tier 2. For `uninstall.enabled` **omitted / `null`** (`nil`), **effective** uninstall support is **`false`** (same as an explicit `false`). Uninstall hooks and post-uninstall metadata cleanup that depend on “supports uninstall” behave as **not supported** unless the user sets `uninstall.enabled: true` in the Skyhook spec. | +| **Discovery enabled** | When `uninstall.enabled` is `nil`, the operator may run [Capability discovery](#capability-discovery) (Tier 1 → Tier 2) and persist results per the rest of this document. | + +Explicit **`uninstall.enabled: true`** or **`false`** in spec is always honored regardless of the flag; the flag only gates **automatic** discovery for the unset case. + +## Current behavior (baseline) + +Today, uninstall is driven largely by **spec drift** relative to **node state**: + +- In [`HandleVersionChange`](../../operator/internal/controller/skyhook_controller.go), when a package **no longer exists** in `spec.packages` but still appears in node state, the controller transitions it to `StageUninstall` (`!exists && packageStatus.Stage != StageUninstall`). +- For that path the controller builds a **synthetic** `Package` with only `PackageRef` and `Image` for uninstall pods. [`createPodFromPackage`](../../operator/internal/controller/skyhook_controller.go) mounts per-package ConfigMap volumes and env only from the `Package` struct, so uninstall runs **without** CR `configMap` / `env`. +- [`HandleFinalizer`](../../operator/internal/controller/skyhook_controller.go) on Skyhook delete uncordons nodes, removes the finalizer, and does **not** run uninstall pods. + +The validating webhook in [`skyhook_webhook.go`](../../operator/api/v1alpha1/skyhook_webhook.go) implements `ValidateUpdate` for create/update but does **not** enforce rules for removing packages from `spec.packages`. `ValidateDelete` is a no-op. + +--- + +## API changes + +### `Package.uninstall` block + +Extend each entry under `spec.packages` with: + +```yaml +packages: + foo: + version: "1.0.0" + image: ghcr.io/example/pkg + configMap: { ... } + env: [ ... ] + uninstall: + enabled: true # see semantics below + apply: true # one-shot: request uninstall run +``` + +### `uninstall.enabled` — use `*bool` in Go / optional in CRD + +| Representation | Meaning (discovery **enabled**) | Meaning (discovery **disabled**, default) | +|----------------|--------------------------------|-------------------------------------------| +| **Omitted / `null`** (`nil` in Go) | Operator may **discover** (Tier 1 → Tier 2) and **persist** (see [Capability discovery](#capability-discovery)). | **Effective support is `false`**; no registry or layer access. Same as explicit `false` for policy purposes. | +| **`false`** | Package **does not** support uninstall. | Same. | +| **`true`** | Package **supports** uninstall. | Same. | + +When discovery is enabled, explicit non-nil values **override** discovery. After Tier 2 discovery, the controller may persist `true`/`false` into the spec or status (see [Persistence: spec vs status](#persistence-spec-vs-status)). + +### `uninstall.apply` + +- When `true`, the reconciler should schedule uninstall for that package **while the package remains** in `spec.packages` with full `configMap`, `env`, etc. +- After a **successful** uninstall on all relevant nodes, the controller **clears** `uninstall.apply` (e.g. back to `false` or omit). Exact defaulting is an implementation detail; the CRD should document one behavior. + +### Validation when `apply: true` + +With the discovery flag **off** (default), **`uninstall.enabled` omitted** yields **effective `false`**, so `uninstall.apply: true` alone does **not** imply uninstall will run unless the user sets **`uninstall.enabled: true`**. + +If `uninstall.apply` is `true` and **effective** uninstall support is `false` **Reject** in validating admission. + +--- + +## Capability discovery + +Capability answers: “May this package run uninstall and may the operator strip package-scoped metadata after success?” + +**Prerequisite**: This entire section applies **only when** the [uninstall capability discovery feature flag](#feature-flag-uninstall-capability-discovery) is **enabled**. If the flag is off, skip to treating `nil` `enabled` as effective **`false`** (no Tier 1, Tier 2, or registry calls). + +**Persistence location after discovery:** annotation at `nodewright/repository@digest` or `nodewright/repository@SHA` when sha is set + +**Persistence annotation value meanings** + +* `true` - same as `uninstall.enabled=true` +* `false` - same as `uninstall.enabled=false` +* `unkown` - same as `uninstall.enabled=false` see failure modes for more + +### Tier 1 — OCI image config / labels (cheap) + +- Use a registry client (e.g. **google/go-containerregistry** or **containers/image**) to fetch **manifest + image config JSON** only — **no full layer pull**. +- Read **`config.Labels`** (OCI/Docker config). Read a label e.g `nodewright.nvidia.com/uninstall=enabled` (value convention TBD in implementation). +- **Auth**: Operator must use credentials available in the defined `imagePullSecret` set at Operator deploy time +- Persist + +**Execution context**: This should be done within the operator when the package is first seen. + +### Tier 2 — First-seen filesystem `config.json` + persist `enabled` + +- Stream **layer blobs** and inspect tar contents for a `/skyhook-package/config.json` inside the image (path is from the NodeWright package image contract) +- Parse **`config.json`** using the schema defined at [agent/skyhook-agent/src/skyhook_agent/schemas](agent/skyhook-agent/src/skyhook_agent/schemas) +- Persist + +### Failure modes + +If all methods fail either because they do not return anything or package fetching fails for any reason set a `nodewright.nvidia.com/UninstallCapabilityUnknown` condition on the SCR. This should also be considered as an effective `false` for `uninstall.enabled`. To avoid loops it should set the persist annotation value to `unknown` + +--- + +## Effective capability (`uninstallSupportedEffective`) + +Use this single notion everywhere: reconciliation, finalizer, label cleanup, and (optionally) admission. + +| Step | Rule | +|------|------| +| 0 | If the **discovery feature flag** is **off**, and `uninstall.enabled` is **`nil`**, **effective = `false`** (stop). If `enabled` is non-nil, use step 1 only. | +| 1 | If `spec.packages[name].uninstall.enabled` is **non-nil**, use that bool. | +| 2 | Else if **annotation** already stores the outcome for `(name, version, digest)`, use it. | +| 3 | Else run **Tier 1** (OCI labels). | +| 4 | If still unknown, run **Tier 2** **once** per `(name, version, digest)`, then persist per [Tier 2](#tier-2--first-seen-filesystem-configjson--persist-enabled). | +| 5 | If registry is unreachable and there is no cache, set `nodewright.nvidia.com/UninstallCapabilityUnknown` condition; policy for allowing `uninstall.apply` while unknown is: **block** uninstall pods until resolved or user sets `enabled` explicitly. | + +--- + +## Trigger semantics + +### Stop using “remove package key” as uninstall trigger + +- Remove the branch that starts uninstall when the package is **missing** from `spec.packages` (today `!exists` in `HandleVersionChange`). +- **New**: Uninstall runs when `uninstall.apply: true` and effective support is true, with the **full** `Package` from spec passed into pod creation. + +### Version upgrade / downgrade + +Keep existing flows where the spec still names the package but **version** changes: upgrade/downgrade logic that uses `StageUninstall` / `StageUpgrade` may remain, with a documented caveat: **downgrade** uninstall of the *old* version may see **new** version’s `configMap` in spec. Further enhancement of this is out of scope for this design. + +### Admission: “reject delete” + +Implement **validating admission** in [`SkyhookWebhook.ValidateUpdate`](../../operator/api/v1alpha1/skyhook_webhook.go): + +- For each package key **present** in `old.Spec.Packages` and **absent** in `new.Spec.Packages`, **reject** unless the package is **safe to remove**. + +**Safe to remove**: For package name `P`, for **every** node, `status.nodeState[node]` has **no** `PackageStatus` for `P` at any version + +--- + +## Reconciliation + +- When `uninstall.apply` is true and **effective** support is true: create uninstall pods using the **real** `*Package` from `spec.packages[name]` (same path as [`createPodFromPackage`](../../operator/internal/controller/skyhook_controller.go)), not the minimal synthetic package. +- When effective support is false: **do not** run uninstall containers; **do not** remove package-scoped labels/annotations or node state +- On successful uninstall for a capable package: remove entries from node state as today (see [`pod_controller`](../../operator/internal/controller/pod_controller.go) handling of `StageUninstall`). + +--- + +## Label and annotation cleanup + +- **Per package**: After successful uninstall, if **effective** support was true: + - remove operator-managed keys **scoped to that package** where they exist. Today much of node metadata is **per Skyhook** (`skyhook.nvidia.com/status_`, `nodeState_`, etc. in [`wrapper/node.go`](../../operator/internal/wrapper/node.go)); the implementation should enumerate which keys are truly per-package (e.g. pod labels `skyhook.nvidia.com/package`) vs per-SCR and only remove what is correct. + - zero out metrics related to that package and remove them from being reported. +- **Skyhook (SCR)**: When **all** packages are uninstalled / absent from node state per policy, either **remove** SCR labels/annotations the operator owns for rollout. + +--- + +## Finalizer on Skyhook delete + +Extend [`HandleFinalizer`](../../operator/internal/controller/skyhook_controller.go) (or a dedicated phase invoked from it): + +1. While `metadata.deletionTimestamp` is set and the Skyhook finalizer is present, if any node still has state for packages whose **effective** uninstall support is true, **run the same uninstall orchestration** as `uninstall.apply` (the **spec is still available** until the object is removed). +2. Packages **without** uninstall support: **skip** uninstall pods; node metadata should be kept for packages that do not support uninstall so that users may still know that those packages have been applied to the nodes. +3. When obligations are met, perform existing behavior (uncordon, metrics, remove finalizer) and **SCR metadata cleanup** per above. + +```mermaid +sequenceDiagram + participant User + participant API as Kubernetes API + participant Op as Skyhook reconciler + participant Node as Nodes and pods + User->>API: delete Skyhook + API->>Op: Reconcile with deletionTimestamp + Op->>Node: Run uninstall where effective support true + Node-->>Op: Uninstall complete + Op->>API: Patch nodes and SCR metadata + Op->>API: Remove finalizer + API-->>User: Skyhook removed +``` + +--- + +## Migration and testing + +- **Breaking change**: Clusters that rely on “remove package from spec → uninstall” must move to **`uninstall.apply: true`** before removing keys, subject to admission rules. +- **Chainsaw**: [`k8s-tests/chainsaw/skyhook/uninstall-upgrade-skyhook`](../../k8s-tests/chainsaw/skyhook/uninstall-upgrade-skyhook) currently documents removal-driven uninstall; update scenarios for `uninstall.apply`, discovery (with flag on/off), and admission. +- **Docs**: User-facing README / operator docs should describe uninstall, the discovery feature flag (default off), explicit `uninstall.enabled: true` when discovery is off, and GitOps implications. + +--- + +## Status conditions + +| Type | Purpose | +|------|---------| +| `nodewright.nvidia.com/UninstallCapabilityUnknown` | Registry/discovery failure; blocks on `apply` | +| `nodewright.nvidia.com/UninstallInProgress` | Set when finalizer is triggered | +| `nodewright.nvidia.com/UninstallFailed` | Set during finalizer for failures | + +--- + +## References + +- [uninstall-problem.md](uninstall-problem.md) +- [`operator/internal/controller/skyhook_controller.go`](../../operator/internal/controller/skyhook_controller.go) — `HandleVersionChange`, `createPodFromPackage`, `HandleFinalizer` +- [`operator/internal/controller/pod_controller.go`](../../operator/internal/controller/pod_controller.go) — uninstall completion / node state +- [`operator/api/v1alpha1/skyhook_webhook.go`](../../operator/api/v1alpha1/skyhook_webhook.go) — admission extension points +- [`operator/internal/wrapper/node.go`](../../operator/internal/wrapper/node.go) — node labels/annotations From 150739ef647606b25d53c957bc34beb5e20e7675 Mon Sep 17 00:00:00 2001 From: Alex Yuskauskas Date: Mon, 6 Apr 2026 10:16:49 -0700 Subject: [PATCH 2/3] docs(design/uninstall-enhancement): update for commenets --- docs/designs/uninstall-enhancement.md | 21 +++++++++++++++------ 1 file changed, 15 insertions(+), 6 deletions(-) diff --git a/docs/designs/uninstall-enhancement.md b/docs/designs/uninstall-enhancement.md index eb23f58d..0b7373f1 100644 --- a/docs/designs/uninstall-enhancement.md +++ b/docs/designs/uninstall-enhancement.md @@ -70,12 +70,19 @@ packages: | **`false`** | Package **does not** support uninstall. | Same. | | **`true`** | Package **supports** uninstall. | Same. | -When discovery is enabled, explicit non-nil values **override** discovery. After Tier 2 discovery, the controller may persist `true`/`false` into the spec or status (see [Persistence: spec vs status](#persistence-spec-vs-status)). +When discovery is enabled, explicit non-nil values **override** discovery. During discovery, the controller will persist `true`/`false` into an annotation (see [Persistence](#persistence)). ### `uninstall.apply` - When `true`, the reconciler should schedule uninstall for that package **while the package remains** in `spec.packages` with full `configMap`, `env`, etc. -- After a **successful** uninstall on all relevant nodes, the controller **clears** `uninstall.apply` (e.g. back to `false` or omit). Exact defaulting is an implementation detail; the CRD should document one behavior. +- While uninstall is in progress state should be `uninstall_in_progress` to differeniate from the applicative `in_progress` state. +- After a **successful** uninstall on all relevant nodes, the package will be in an uninstalled state and the spec will remain unchanged. This will be a special state such that when the reconciler sees a package with `uninstall.apply=true` and the package is NOT in the node state annotation for a node it is considered to be in the `uninstalled` state and therefore does NOT run any apply, config, etc steps. + +#### Failure modes + +**Uninstall fails on some nodes**: This is the same as a failing install package and pods will continue to be scheduled until complete. + +**Canceling an uninstall**: Removal of `uninstall.apply` or setting `uninstall.apply=false` or setting the pause or stop on the package/custom resource will halt scheduling of uninstall pods. ### Validation when `apply: true` @@ -91,13 +98,14 @@ Capability answers: “May this package run uninstall and may the operator strip **Prerequisite**: This entire section applies **only when** the [uninstall capability discovery feature flag](#feature-flag-uninstall-capability-discovery) is **enabled**. If the flag is off, skip to treating `nil` `enabled` as effective **`false`** (no Tier 1, Tier 2, or registry calls). +### Persistence **Persistence location after discovery:** annotation at `nodewright/repository@digest` or `nodewright/repository@SHA` when sha is set **Persistence annotation value meanings** * `true` - same as `uninstall.enabled=true` * `false` - same as `uninstall.enabled=false` -* `unkown` - same as `uninstall.enabled=false` see failure modes for more +* `unknown` - same as `uninstall.enabled=false` see failure modes for more ### Tier 1 — OCI image config / labels (cheap) @@ -144,7 +152,7 @@ Use this single notion everywhere: reconciliation, finalizer, label cleanup, and ### Version upgrade / downgrade -Keep existing flows where the spec still names the package but **version** changes: upgrade/downgrade logic that uses `StageUninstall` / `StageUpgrade` may remain, with a documented caveat: **downgrade** uninstall of the *old* version may see **new** version’s `configMap` in spec. Further enhancement of this is out of scope for this design. +Keep existing flows where the spec still names the package but **version** changes: upgrade/downgrade logic that uses `StageUninstall` / `StageUpgrade` may remain, with a documented caveat: **downgrade** uninstall of the *old* version may see **new** version’s `configMap` in spec. Document this issue and provide work around examples such as uninstalling before apply if inducing a downgrade.Further enhancement of this is out of scope for this design. ### Admission: “reject delete” @@ -152,7 +160,9 @@ Implement **validating admission** in [`SkyhookWebhook.ValidateUpdate`](../../op - For each package key **present** in `old.Spec.Packages` and **absent** in `new.Spec.Packages`, **reject** unless the package is **safe to remove**. -**Safe to remove**: For package name `P`, for **every** node, `status.nodeState[node]` has **no** `PackageStatus` for `P` at any version +**Safe to remove**: At least one of the following is true: + * `uninstall.enabled=True` or discovered capability is true and For package name `P`, for **every** node, `status.nodeState[node]` has **no** `PackageStatus` for `P` at any version + * `uninstall.enabled=False` or discovered capability is false --- @@ -218,7 +228,6 @@ sequenceDiagram ## References -- [uninstall-problem.md](uninstall-problem.md) - [`operator/internal/controller/skyhook_controller.go`](../../operator/internal/controller/skyhook_controller.go) — `HandleVersionChange`, `createPodFromPackage`, `HandleFinalizer` - [`operator/internal/controller/pod_controller.go`](../../operator/internal/controller/pod_controller.go) — uninstall completion / node state - [`operator/api/v1alpha1/skyhook_webhook.go`](../../operator/api/v1alpha1/skyhook_webhook.go) — admission extension points From 58414581ede2327739afe16482687db725522e64 Mon Sep 17 00:00:00 2001 From: Alex Yuskauskas Date: Mon, 6 Apr 2026 11:07:30 -0700 Subject: [PATCH 3/3] docs(design/uninstall-enhancement): remove capability discovery to save for later design --- docs/designs/uninstall-enhancement.md | 121 ++++++++------------------ 1 file changed, 34 insertions(+), 87 deletions(-) diff --git a/docs/designs/uninstall-enhancement.md b/docs/designs/uninstall-enhancement.md index 0b7373f1..cc60fc79 100644 --- a/docs/designs/uninstall-enhancement.md +++ b/docs/designs/uninstall-enhancement.md @@ -2,6 +2,8 @@ This document specifies target behavior for Skyhook package uninstall, and Custom Resource instance finalizers. +Capability discovery from container images or registries (OCI labels, layer inspection, ORAS, etc.) is **out of scope** and may be covered by a separate design. This document assumes **`uninstall.enabled` is set only in the Skyhook spec** (boolean, no automatic discovery). + ## Problems ### Uninstall configuration loss @@ -16,21 +18,10 @@ When a Skyhook Custom Resource is deleted the packages are not cleaned up. Final ## Goals 1. **No configuration loss on uninstall**: Uninstall workloads must receive the same class of inputs as apply (ConfigMap data, env vars, resources) by running uninstall **while the full package definition still exists** in the Skyhook spec. -2. **Explicit capability**: Only packages that **effectively** support uninstall may run uninstall hooks and may have package-scoped operator metadata removed afterward. +2. **Explicit capability in spec**: Only packages with **`uninstall.enabled: true`** may run uninstall hooks and may have package-scoped operator metadata removed afterward. There is **no** operator-driven discovery from images or registries in this design. 3. **Explicit trigger**: Uninstall is requested with `uninstall.apply`, not by deleting a package key from `spec.packages`. 4. **Safe removal**: Admission blocks removing a package from the spec until it is safe (uninstalled / absent from node state per defined rules). -5. **Skyhook deletion**: The existing Skyhook finalizer must orchestrate uninstall (when supported) before releasing the CR and cleaning SCR-related metadata. - -## Feature flag: uninstall capability discovery - -Automatic capability discovery (registry access, Tier 1 / Tier 2) is **off by default** and must be explicitly enabled via an **operator feature flag** (exact mechanism is implementation detail: e.g. Helm `values.yaml`, operator `Deployment` env var, or command-line flag). - -| Flag | Behavior | -|------|----------| -| **Discovery disabled** (default) | The operator **does not** run Tier 1 or Tier 2. For `uninstall.enabled` **omitted / `null`** (`nil`), **effective** uninstall support is **`false`** (same as an explicit `false`). Uninstall hooks and post-uninstall metadata cleanup that depend on “supports uninstall” behave as **not supported** unless the user sets `uninstall.enabled: true` in the Skyhook spec. | -| **Discovery enabled** | When `uninstall.enabled` is `nil`, the operator may run [Capability discovery](#capability-discovery) (Tier 1 → Tier 2) and persist results per the rest of this document. | - -Explicit **`uninstall.enabled: true`** or **`false`** in spec is always honored regardless of the flag; the flag only gates **automatic** discovery for the unset case. +5. **Skyhook deletion**: The existing Skyhook finalizer must orchestrate uninstall when **`uninstall.enabled`** is true before releasing the CR and cleaning SCR-related metadata. ## Current behavior (baseline) @@ -58,24 +49,25 @@ packages: configMap: { ... } env: [ ... ] uninstall: - enabled: true # see semantics below + enabled: true # required semantics: see below apply: true # one-shot: request uninstall run ``` -### `uninstall.enabled` — use `*bool` in Go / optional in CRD +### `uninstall.enabled` + +Boolean in the API (e.g. **`+kubebuilder:default=false`** so omitted is treated as **false**). -| Representation | Meaning (discovery **enabled**) | Meaning (discovery **disabled**, default) | -|----------------|--------------------------------|-------------------------------------------| -| **Omitted / `null`** (`nil` in Go) | Operator may **discover** (Tier 1 → Tier 2) and **persist** (see [Capability discovery](#capability-discovery)). | **Effective support is `false`**; no registry or layer access. Same as explicit `false` for policy purposes. | -| **`false`** | Package **does not** support uninstall. | Same. | -| **`true`** | Package **supports** uninstall. | Same. | +| Value | Meaning | +|-------|--------| +| **`false`** (default) | Package **does not** support uninstall. The operator does **not** run uninstall pods, does **not** strip package-scoped metadata as if uninstall ran, and finalizer skips uninstall for this package. | +| **`true`** | Package **supports** uninstall. `uninstall.apply` may schedule uninstall work and post-success cleanup applies as described below. | -When discovery is enabled, explicit non-nil values **override** discovery. During discovery, the controller will persist `true`/`false` into an annotation (see [Persistence](#persistence)). +There is **no** `nil`/unset distinction beyond the CRD default: **unset behaves like `false`**. ### `uninstall.apply` -- When `true`, the reconciler should schedule uninstall for that package **while the package remains** in `spec.packages` with full `configMap`, `env`, etc. -- While uninstall is in progress state should be `uninstall_in_progress` to differeniate from the applicative `in_progress` state. +- When `true`, the reconciler should schedule uninstall for that package **while the package remains** in `spec.packages` with full `configMap`, `env`, etc., **only if** `uninstall.enabled` is **`true`**. +- While uninstall is in progress state should be `uninstall_in_progress` to differentiate from the applicative `in_progress` state. - After a **successful** uninstall on all relevant nodes, the package will be in an uninstalled state and the spec will remain unchanged. This will be a special state such that when the reconciler sees a package with `uninstall.apply=true` and the package is NOT in the node state annotation for a node it is considered to be in the `uninstalled` state and therefore does NOT run any apply, config, etc steps. #### Failure modes @@ -86,60 +78,15 @@ When discovery is enabled, explicit non-nil values **override** discovery. Durin ### Validation when `apply: true` -With the discovery flag **off** (default), **`uninstall.enabled` omitted** yields **effective `false`**, so `uninstall.apply: true` alone does **not** imply uninstall will run unless the user sets **`uninstall.enabled: true`**. - -If `uninstall.apply` is `true` and **effective** uninstall support is `false` **Reject** in validating admission. +If `uninstall.apply` is `true` and **`uninstall.enabled` is not `true`**, **reject** in validating admission. --- -## Capability discovery - -Capability answers: “May this package run uninstall and may the operator strip package-scoped metadata after success?” - -**Prerequisite**: This entire section applies **only when** the [uninstall capability discovery feature flag](#feature-flag-uninstall-capability-discovery) is **enabled**. If the flag is off, skip to treating `nil` `enabled` as effective **`false`** (no Tier 1, Tier 2, or registry calls). - -### Persistence -**Persistence location after discovery:** annotation at `nodewright/repository@digest` or `nodewright/repository@SHA` when sha is set - -**Persistence annotation value meanings** - -* `true` - same as `uninstall.enabled=true` -* `false` - same as `uninstall.enabled=false` -* `unknown` - same as `uninstall.enabled=false` see failure modes for more +## Uninstall support (single rule) -### Tier 1 — OCI image config / labels (cheap) +Use one rule everywhere: reconciliation, finalizer, label cleanup, and admission. -- Use a registry client (e.g. **google/go-containerregistry** or **containers/image**) to fetch **manifest + image config JSON** only — **no full layer pull**. -- Read **`config.Labels`** (OCI/Docker config). Read a label e.g `nodewright.nvidia.com/uninstall=enabled` (value convention TBD in implementation). -- **Auth**: Operator must use credentials available in the defined `imagePullSecret` set at Operator deploy time -- Persist - -**Execution context**: This should be done within the operator when the package is first seen. - -### Tier 2 — First-seen filesystem `config.json` + persist `enabled` - -- Stream **layer blobs** and inspect tar contents for a `/skyhook-package/config.json` inside the image (path is from the NodeWright package image contract) -- Parse **`config.json`** using the schema defined at [agent/skyhook-agent/src/skyhook_agent/schemas](agent/skyhook-agent/src/skyhook_agent/schemas) -- Persist - -### Failure modes - -If all methods fail either because they do not return anything or package fetching fails for any reason set a `nodewright.nvidia.com/UninstallCapabilityUnknown` condition on the SCR. This should also be considered as an effective `false` for `uninstall.enabled`. To avoid loops it should set the persist annotation value to `unknown` - ---- - -## Effective capability (`uninstallSupportedEffective`) - -Use this single notion everywhere: reconciliation, finalizer, label cleanup, and (optionally) admission. - -| Step | Rule | -|------|------| -| 0 | If the **discovery feature flag** is **off**, and `uninstall.enabled` is **`nil`**, **effective = `false`** (stop). If `enabled` is non-nil, use step 1 only. | -| 1 | If `spec.packages[name].uninstall.enabled` is **non-nil**, use that bool. | -| 2 | Else if **annotation** already stores the outcome for `(name, version, digest)`, use it. | -| 3 | Else run **Tier 1** (OCI labels). | -| 4 | If still unknown, run **Tier 2** **once** per `(name, version, digest)`, then persist per [Tier 2](#tier-2--first-seen-filesystem-configjson--persist-enabled). | -| 5 | If registry is unreachable and there is no cache, set `nodewright.nvidia.com/UninstallCapabilityUnknown` condition; policy for allowing `uninstall.apply` while unknown is: **block** uninstall pods until resolved or user sets `enabled` explicitly. | +**Supports uninstall** iff `spec.packages[name].uninstall.enabled == true`. --- @@ -148,11 +95,11 @@ Use this single notion everywhere: reconciliation, finalizer, label cleanup, and ### Stop using “remove package key” as uninstall trigger - Remove the branch that starts uninstall when the package is **missing** from `spec.packages` (today `!exists` in `HandleVersionChange`). -- **New**: Uninstall runs when `uninstall.apply: true` and effective support is true, with the **full** `Package` from spec passed into pod creation. +- **New**: Uninstall runs when `uninstall.apply: true` and **`uninstall.enabled: true`**, with the **full** `Package` from spec passed into pod creation. ### Version upgrade / downgrade -Keep existing flows where the spec still names the package but **version** changes: upgrade/downgrade logic that uses `StageUninstall` / `StageUpgrade` may remain, with a documented caveat: **downgrade** uninstall of the *old* version may see **new** version’s `configMap` in spec. Document this issue and provide work around examples such as uninstalling before apply if inducing a downgrade.Further enhancement of this is out of scope for this design. +Keep existing flows where the spec still names the package but **version** changes: upgrade/downgrade logic that uses `StageUninstall` / `StageUpgrade` may remain, with a documented caveat: **downgrade** uninstall of the *old* version may see **new** version’s `configMap` in spec. Document this issue and provide work around examples such as uninstalling before apply if inducing a downgrade. Further enhancement of this is out of scope for this design. ### Admission: “reject delete” @@ -161,22 +108,23 @@ Implement **validating admission** in [`SkyhookWebhook.ValidateUpdate`](../../op - For each package key **present** in `old.Spec.Packages` and **absent** in `new.Spec.Packages`, **reject** unless the package is **safe to remove**. **Safe to remove**: At least one of the following is true: - * `uninstall.enabled=True` or discovered capability is true and For package name `P`, for **every** node, `status.nodeState[node]` has **no** `PackageStatus` for `P` at any version - * `uninstall.enabled=False` or discovered capability is false + +- **`uninstall.enabled` is `true`** and for package name `P`, for **every** node, `status.nodeState[node]` has **no** `PackageStatus` for `P` at any version (fully gone from state after uninstall). +- **`uninstall.enabled` is `false`**: the package never supported uninstall; removal from spec is allowed per policy without an uninstall run (see reconciliation: no automatic node-state purge required for unsupported packages beyond what is explicitly implemented). --- ## Reconciliation -- When `uninstall.apply` is true and **effective** support is true: create uninstall pods using the **real** `*Package` from `spec.packages[name]` (same path as [`createPodFromPackage`](../../operator/internal/controller/skyhook_controller.go)), not the minimal synthetic package. -- When effective support is false: **do not** run uninstall containers; **do not** remove package-scoped labels/annotations or node state -- On successful uninstall for a capable package: remove entries from node state as today (see [`pod_controller`](../../operator/internal/controller/pod_controller.go) handling of `StageUninstall`). +- When `uninstall.apply` is true and **`uninstall.enabled` is true**: create uninstall pods using the **real** `*Package` from `spec.packages[name]` (same path as [`createPodFromPackage`](../../operator/internal/controller/skyhook_controller.go)), not the minimal synthetic package. +- When **`uninstall.enabled` is false**: **do not** run uninstall containers; **do not** remove package-scoped labels/annotations or node state as part of an uninstall path. +- On successful uninstall for a package with **`uninstall.enabled` true**: remove entries from node state as today (see [`pod_controller`](../../operator/internal/controller/pod_controller.go) handling of `StageUninstall`). --- ## Label and annotation cleanup -- **Per package**: After successful uninstall, if **effective** support was true: +- **Per package**: After successful uninstall, if **`uninstall.enabled` was true**: - remove operator-managed keys **scoped to that package** where they exist. Today much of node metadata is **per Skyhook** (`skyhook.nvidia.com/status_`, `nodeState_`, etc. in [`wrapper/node.go`](../../operator/internal/wrapper/node.go)); the implementation should enumerate which keys are truly per-package (e.g. pod labels `skyhook.nvidia.com/package`) vs per-SCR and only remove what is correct. - zero out metrics related to that package and remove them from being reported. - **Skyhook (SCR)**: When **all** packages are uninstalled / absent from node state per policy, either **remove** SCR labels/annotations the operator owns for rollout. @@ -187,8 +135,8 @@ Implement **validating admission** in [`SkyhookWebhook.ValidateUpdate`](../../op Extend [`HandleFinalizer`](../../operator/internal/controller/skyhook_controller.go) (or a dedicated phase invoked from it): -1. While `metadata.deletionTimestamp` is set and the Skyhook finalizer is present, if any node still has state for packages whose **effective** uninstall support is true, **run the same uninstall orchestration** as `uninstall.apply` (the **spec is still available** until the object is removed). -2. Packages **without** uninstall support: **skip** uninstall pods; node metadata should be kept for packages that do not support uninstall so that users may still know that those packages have been applied to the nodes. +1. While `metadata.deletionTimestamp` is set and the Skyhook finalizer is present, if any node still has state for packages with **`uninstall.enabled: true`**, **run the same uninstall orchestration** as `uninstall.apply` (the **spec is still available** until the object is removed). +2. Packages with **`uninstall.enabled: false`**: **skip** uninstall pods; node metadata should be kept for those packages so users may still know that those packages have been applied to the nodes. 3. When obligations are met, perform existing behavior (uncordon, metrics, remove finalizer) and **SCR metadata cleanup** per above. ```mermaid @@ -199,7 +147,7 @@ sequenceDiagram participant Node as Nodes and pods User->>API: delete Skyhook API->>Op: Reconcile with deletionTimestamp - Op->>Node: Run uninstall where effective support true + Op->>Node: Run uninstall where uninstall.enabled true Node-->>Op: Uninstall complete Op->>API: Patch nodes and SCR metadata Op->>API: Remove finalizer @@ -210,9 +158,9 @@ sequenceDiagram ## Migration and testing -- **Breaking change**: Clusters that rely on “remove package from spec → uninstall” must move to **`uninstall.apply: true`** before removing keys, subject to admission rules. -- **Chainsaw**: [`k8s-tests/chainsaw/skyhook/uninstall-upgrade-skyhook`](../../k8s-tests/chainsaw/skyhook/uninstall-upgrade-skyhook) currently documents removal-driven uninstall; update scenarios for `uninstall.apply`, discovery (with flag on/off), and admission. -- **Docs**: User-facing README / operator docs should describe uninstall, the discovery feature flag (default off), explicit `uninstall.enabled: true` when discovery is off, and GitOps implications. +- **Breaking change**: Clusters that rely on “remove package from spec → uninstall” must move to **`uninstall.apply: true`** (with **`uninstall.enabled: true`**) before removing keys, subject to admission rules. +- **Chainsaw**: [`k8s-tests/chainsaw/skyhook/uninstall-upgrade-skyhook`](../../k8s-tests/chainsaw/skyhook/uninstall-upgrade-skyhook) currently documents removal-driven uninstall; update scenarios for `uninstall.apply`, explicit `uninstall.enabled`, and admission. +- **Docs**: User-facing README / operator docs should describe uninstall and the requirement to set **`uninstall.enabled: true`** for packages that support uninstall. --- @@ -220,7 +168,6 @@ sequenceDiagram | Type | Purpose | |------|---------| -| `nodewright.nvidia.com/UninstallCapabilityUnknown` | Registry/discovery failure; blocks on `apply` | | `nodewright.nvidia.com/UninstallInProgress` | Set when finalizer is triggered | | `nodewright.nvidia.com/UninstallFailed` | Set during finalizer for failures |