
docs(design): improve uninstall to support configuration, package support and label removal#189

Open
ayuskauskas wants to merge 3 commits into `main` from `design/uninstall_enhancement`

Conversation

@ayuskauskas
Collaborator

No description provided.


### Failure modes

If all methods fail, either because they do not return anything or because package fetching fails for any reason, set a `nodewright.nvidia.com/UninstallCapabilityUnknown` condition on the SCR. This should also be treated as an effective `false` for `uninstall.enabled`. To avoid loops, it should set the persist annotation value to `unknown`.
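As a non-authoritative sketch of what this could look like on the SCR (the condition type and the `unknown` value come from the text above; the `reason`, `message`, and annotation key are assumptions):

```yaml
# Hypothetical SCR fragment after uninstall-capability detection fails.
metadata:
  annotations:
    # Assumed annotation key; the value "unknown" prevents re-probe loops.
    nodewright.nvidia.com/uninstall-capability: unknown
status:
  conditions:
    - type: nodewright.nvidia.com/UninstallCapabilityUnknown
      status: "True"
      reason: PackageFetchFailed          # assumed reason string
      message: could not determine uninstall support for this package
```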
Collaborator


This seems like the wrong level: supporting uninstall is a package-level property, but this condition is set at the SCR level.

Collaborator Author


Persistence is via an annotation that is specific to an image:

`nodewright/repository@digest` or `nodewright/repository@SHA`

So while it is at the custom resource instance level it can still be targeted at a specific package.
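A sketch of how that image-scoped persistence could be shaped (the annotation key and the JSON value layout are assumptions; only the `repository@digest` keying is from the text above):

```yaml
metadata:
  annotations:
    # Hypothetical: one annotation whose JSON value maps each image to its
    # detected uninstall capability, so the result is per package even
    # though the annotation lives on the custom resource instance.
    nodewright.nvidia.com/uninstall-capability: |
      {"nodewright/repository@sha256:0123abcd": "true"}
```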

### `uninstall.apply`

- When `true`, the reconciler should schedule uninstall for that package **while the package remains** in `spec.packages` with full `configMap`, `env`, etc.
- After a **successful** uninstall on all relevant nodes, the controller **clears** `uninstall.apply` (e.g. sets it back to `false` or omits it). The exact defaulting is an implementation detail; the CRD should document one behavior.
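For illustration, a package scheduled for uninstall might look like the fragment below (the package name, configMap name, and env values are hypothetical, and the exact field layout of `spec.packages` may differ):

```yaml
spec:
  packages:
    - name: example-package        # hypothetical package
      version: 1.2.3
      configMap: example-config    # stays present during uninstall
      env:
        - name: EXAMPLE_FLAG
          value: "1"
      uninstall:
        enabled: true
        apply: true                # cleared by the controller on success
```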
Collaborator


The design covers the happy path for uninstall.apply well but is thin on failure recovery:

  • What happens if uninstall fails on some nodes but succeeds on others? Is uninstall.apply left true indefinitely?
  • Is there a backoff/retry strategy? A max-retry count?
  • How does a user cancel a failing uninstall (set apply: false? remove the field?)?
  • The status conditions table lists UninstallFailed for the finalizer path, but there is no equivalent condition for the uninstall.apply path.

Consider adding a "Failure and cancellation semantics" subsection for the uninstall.apply flow.

Collaborator Author


Added a section.

@rayaank-afk

Hey @ayuskauskas, can you confirm if my understanding is correct?

A. Uninstall cases

Case 1: Package supports uninstall (has uninstall.sh and uninstall_check.sh)

  1. The package in the SCR will have uninstall.enabled: true (users with existing SCRs will need to add this manually or through GitOps).
  2. When the user sets uninstall.apply: true (manually or through GitOps), the operator creates the uninstall pod with the full configMap and env from the spec — since the package definition is still present in the spec.
  3. The uninstall pod runs uninstall.sh successfully, cleaning up the host-level changes.
  4. The controller removes the package's entry from the node state annotation.
  5. The controller removes per-package labels/annotations from the targeted nodes.
  6. The package is marked as uninstalled.

Case 2: Package does NOT support uninstall — two sub-cases:

Case 2.1: User sets uninstall.enabled: true but no uninstall.sh exists

  1. When user sets uninstall.apply: true, the uninstall pod is created with the full configMap and env.
  2. The agent runs shellscript_run.sh uninstall which looks for configmaps/uninstall.sh — the file won't be found (since it was never provided in the configMap), and shellscript_run.sh exits 0 silently.
  3. The operator treats this as a successful uninstall — labels and annotations are removed from the node. [Can @ayuskauskas confirm this removal of labels/annotations in this case?]
  4. The package is marked as uninstalled.

Note: This is a user error — they declared uninstall.enabled: true but didn't provide the uninstall scripts. The operator trusts the declaration and proceeds as if uninstall succeeded, even though nothing was actually cleaned up on the host.
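The lookup behavior in step 2 can be sketched as follows. This is not the actual `shellscript_run.sh`; the `configmaps/` path layout is an assumption, and the point is only that a missing step script is treated as silent success:

```shell
#!/bin/sh
# Minimal sketch: run configmaps/<step>.sh if it exists, otherwise
# succeed silently (a missing uninstall.sh is treated as success).
STEP="${1:-uninstall}"
SCRIPT="configmaps/${STEP}.sh"

if [ -f "$SCRIPT" ]; then
  sh "$SCRIPT"
  status=$?
else
  # No uninstall.sh was provided in the configMap: treated as success.
  status=0
fi
echo "step '${STEP}' exit status: ${status}"
```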

Case 2.2: Package has uninstall.enabled: false (or field is missing)

  1. If uninstall.apply: true is set, the validating webhook rejects the update because uninstall.enabled is not true.
  2. If the package is removed from the spec entirely:
    • If there is no node state for the package → webhook allows removal.
    • If node state exists → webhook allows removal (since the package never claimed to support uninstall, there's no cleanup path to enforce). Labels/annotations remain on the node.

Additional notes

  1. Metrics corresponding to an uninstalled package are zeroed out and removed from reporting.
  2. When all packages in the SCR are uninstalled or absent from node state, the controller also removes the SCR-level labels/annotations from the nodes (e.g. skyhook.nvidia.com/status_<name>, nodeState_<name>, version_<name>, and node conditions).

B. Downgrade workflow

  1. First, set uninstall.apply: true on the current version and let it complete — this runs uninstall with the correct configMap/env for that version.
  2. After successful uninstallation, change the version to the lower one and set uninstall.apply: false.
  3. If you skip step 1 and just change the version directly, the downgrade uninstall of the old version would run with the new version's configMap/env, which may have different scripts or env vars — leading to incorrect or failed cleanup.

| Value | Meaning |
|-------|--------|
| **`false`** (default) | Package **does not** support uninstall. The operator does **not** run uninstall pods, does **not** strip package-scoped metadata as if uninstall ran, and finalizer skips uninstall for this package. |
| **`true`** | Package **supports** uninstall. `uninstall.apply` may schedule uninstall work and post-success cleanup applies as described below. |

Why does it say "may", is there any case when uninstall.enabled is true and pod is not scheduled?

Collaborator Author


I will change it to "will". The only case where the pod wouldn't be scheduled is when the custom resource has the pause or skip annotation.


Ok, got it.


#### Failure modes

**Uninstall fails on some nodes**: this is handled the same as a failing package install; pods will continue to be scheduled until the uninstall completes.


Shouldn't there be limit to retries if uninstall fails?

@ayuskauskas
Collaborator Author

> Hey @ayuskauskas, can you confirm if my understanding is correct?
>
> Case 2.1: User sets uninstall.enabled: true but no uninstall.sh exists
>
>   1. When user sets uninstall.apply: true, the uninstall pod is created with the full configMap and env.
>   2. The agent runs shellscript_run.sh uninstall which looks for configmaps/uninstall.sh — the file won't be found (since it was never provided in the configMap), and shellscript_run.sh exits 0 silently.
>   3. The operator treats this as a successful uninstall — labels and annotations are removed from the node. [Can @ayuskauskas confirm this removal of labels/annotations in this case?]
>   4. The package is marked as uninstalled.
>
> Note: This is a user error — they declared uninstall.enabled: true but didn't provide the uninstall scripts. The operator trusts the declaration and proceeds as if uninstall succeeded, even though nothing was actually cleaned up on the host.

Yes, the operator will respect the user's decision to set `uninstall.enabled=true`, even if it results in nothing happening.

> Case 2.2: Package has uninstall.enabled: false (or field is missing)
>
>   1. If uninstall.apply: true is set, the validating webhook rejects the update because uninstall.enabled is not true.
>   2. If the package is removed from the spec entirely:
>     • If there is no node state for the package → webhook allows removal.
>     • If node state exists → webhook allows removal (since the package never claimed to support uninstall, there's no cleanup path to enforce). Labels/annotations remain on the node.

Yes, that is the case.

> Additional notes
>
>   1. Metrics corresponding to an uninstalled package are zeroed out and removed from reporting.
>   2. When all packages in the SCR are uninstalled or absent from node state, the controller also removes the SCR-level labels/annotations from the nodes (e.g. skyhook.nvidia.com/status_<name>, nodeState_<name>, version_<name>, and node conditions).

That is correct.

> B. Downgrade workflow
>
>   1. First, set uninstall.apply: true on the current version and let it complete — this runs uninstall with the correct configMap/env for that version.
>   2. After successful uninstallation, change the version to the lower one and set uninstall.apply: false.
>   3. If you skip step 1 and just change the version directly, the downgrade uninstall of the old version would run with the new version's configMap/env, which may have different scripts or env vars — leading to incorrect or failed cleanup.

That is correct.
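A two-step sketch of the confirmed downgrade flow, assuming hypothetical package, version, and configMap names:

```yaml
# Step 1: uninstall the currently installed version with its own config.
spec:
  packages:
    - name: example-package
      version: 2.0.0
      configMap: example-config-v2
      uninstall:
        enabled: true
        apply: true     # wait for this uninstall to complete
---
# Step 2: once uninstalled, switch to the lower version.
spec:
  packages:
    - name: example-package
      version: 1.9.0
      configMap: example-config-v1
      uninstall:
        enabled: true
        apply: false
```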

@rayaank-afk

So when is the implementation of this change expected? @ayuskauskas

@ayuskauskas
Collaborator Author

So when is the implementation of this change expected? @ayuskauskas

@rayaank-afk it will be the next thing worked on; the ETA is 1 to 2 weeks.
