docs(design): improve uninstall to support configuration, package support and label removal#189
ayuskauskas wants to merge 3 commits into main
### Failure modes

If all methods fail, either because they return nothing or because package fetching fails for any reason, set a `nodewright.nvidia.com/UninstallCapabilityUnknown` condition on the SCR. This should also be treated as an effective `false` for `uninstall.enabled`. To avoid loops, the persistence annotation value should be set to `unknown`.
This seems like the wrong level. Uninstall support is a package-level property, but this condition is set at the SCR level.
Persistence is via an annotation that is specific to an image:
`nodewright/repository@digest` or `nodewright/repository@SHA`
So while the condition lives at the custom resource instance level, it can still be targeted at a specific package.
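
The fallback discussed in this thread could be sketched roughly as follows. This is a minimal Python illustration, not the operator's code: the function name `resolve_uninstall_capability`, the probe signature, the use of the image reference as the annotation key, and the persisted values `"true"`/`"false"` are all assumptions. Only the condition type and the `unknown` value come from the design text.

```python
from typing import Callable, Dict, List, Optional

# Condition type taken from the design text.
CONDITION_UNKNOWN = "nodewright.nvidia.com/UninstallCapabilityUnknown"

# Hypothetical probe: returns True/False when it can tell whether the package
# supports uninstall, None when it finds nothing, and may raise if a fetch fails.
Probe = Callable[[str], Optional[bool]]


def resolve_uninstall_capability(
    image_ref: str,
    probes: List[Probe],
    annotations: Dict[str, str],
    conditions: List[str],
) -> bool:
    """Return the effective value of uninstall.enabled for one image."""
    persisted = annotations.get(image_ref)
    if persisted is not None:
        # A persisted answer (including "unknown") short-circuits probing,
        # which is what prevents a detection loop.
        return persisted == "true"

    for probe in probes:
        try:
            result = probe(image_ref)
        except Exception:
            continue  # assumption: a failed fetch counts as "no answer"
        if result is not None:
            annotations[image_ref] = "true" if result else "false"
            return result

    # Every method failed: record the condition on the resource and persist
    # "unknown" for this specific image, then behave as enabled == false.
    if CONDITION_UNKNOWN not in conditions:
        conditions.append(CONDITION_UNKNOWN)
    annotations[image_ref] = "unknown"
    return False
```

Because the persisted value is keyed by image, a different package (or a new digest of the same package) would be probed again even though the condition itself is reported on the resource.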
### `uninstall.apply`

- When `true`, the reconciler should schedule uninstall for that package **while the package remains** in `spec.packages` with full `configMap`, `env`, etc.
- After a **successful** uninstall on all relevant nodes, the controller **clears** `uninstall.apply` (e.g. back to `false` or omit). Exact defaulting is an implementation detail; the CRD should document one behavior.
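
As a rough illustration of the two bullets above, the following Python sketch clears `uninstall.apply` only after every relevant node reports success, and leaves the rest of the package entry untouched. The function name `reconcile_apply`, the dictionary layout, and the choice to reset the flag to `false` (rather than omit it) are assumptions made for the example; an empty result map is also assumed to mean "nothing reported yet".

```python
from typing import Dict


def reconcile_apply(package: Dict, node_results: Dict[str, bool]) -> Dict:
    """Return a copy of one spec.packages entry after a reconcile pass.

    node_results maps node name -> whether uninstall succeeded on that node
    (hypothetical shape).
    """
    # Copy the entry and its uninstall block so the input is not mutated.
    updated = {**package, "uninstall": dict(package.get("uninstall", {}))}
    uninstall = updated["uninstall"]

    if not uninstall.get("apply", False):
        return updated  # nothing requested

    # Clear the flag only once every relevant node has succeeded.
    if node_results and all(node_results.values()):
        uninstall["apply"] = False  # assumption: reset to false, not omitted

    return updated
```

The package keeps its `configMap`, `env`, and other fields throughout, matching the requirement that it remains in `spec.packages` while uninstall runs.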
The design covers the happy path for `uninstall.apply` well but is thin on failure recovery:
- What happens if uninstall fails on some nodes but succeeds on others? Is `uninstall.apply` left `true` indefinitely?
- Is there a backoff/retry strategy? A max-retry count?
- How does a user cancel a failing uninstall (set `apply: false`? remove the field?)?
- The status conditions table lists `UninstallFailed` for the finalizer path, but there is no equivalent condition for the `uninstall.apply` path.

Consider adding a "Failure and cancellation semantics" subsection for the `uninstall.apply` flow.
Added a section
Hey @ayuskauskas, can you confirm if my understanding is correct?
A. Uninstall cases
Case 1: Package supports uninstall (has
Case 2: Package does NOT support uninstall, with two sub-cases:
Case 2.1: User sets
Case 2.2: Package has
Additional notes
B. Downgrade workflow
| Value | Meaning |
|-------|--------|
| **`false`** (default) | Package **does not** support uninstall. The operator does **not** run uninstall pods, does **not** strip package-scoped metadata as if uninstall ran, and finalizer skips uninstall for this package. |
| **`true`** | Package **supports** uninstall. `uninstall.apply` may schedule uninstall work and post-success cleanup applies as described below. |
Why does it say "may"? Is there any case where `uninstall.enabled` is `true` and the pod is not scheduled?
I will change it to "will". The only reason the pod wouldn't be scheduled is if the custom resource has the pause or skip annotation.
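
The scheduling rule from this thread can be summarized in a small Python sketch. The annotation keys below are placeholders, since the actual pause and skip annotation names are not given here, and the function name and argument shapes are likewise hypothetical.

```python
from typing import Mapping

# Placeholder keys: the real pause/skip annotation names are not stated
# in this conversation.
PAUSE_ANNOTATION = "example.invalid/pause"
SKIP_ANNOTATION = "example.invalid/skip"


def should_schedule_uninstall(
    uninstall: Mapping[str, bool],
    cr_annotations: Mapping[str, str],
) -> bool:
    """Decide whether an uninstall pod is scheduled for one package."""
    if not uninstall.get("enabled", False):
        return False  # default false: package does not support uninstall
    if not uninstall.get("apply", False):
        return False  # uninstall supported but not requested
    # With enabled and apply both true, only pause or skip on the custom
    # resource holds scheduling back.
    return not (
        PAUSE_ANNOTATION in cr_annotations or SKIP_ANNOTATION in cr_annotations
    )
```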
#### Failure modes

**Uninstall fails on some nodes**: This is handled the same way as a failing package install: pods will continue to be scheduled until the uninstall completes.
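
A minimal Python sketch of that behavior, assuming per-node state is tracked as a simple string: any node that is not yet complete gets another uninstall pod on the next reconcile, with no retry cap. The state names and the function name `nodes_to_retry` are hypothetical.

```python
from typing import Dict, List


def nodes_to_retry(node_state: Dict[str, str]) -> List[str]:
    """Return the nodes that still need an uninstall pod.

    node_state maps node name -> "complete", "failed", or "pending"
    (hypothetical state names).
    """
    # No retry limit is applied: a node is rescheduled until it completes.
    return sorted(node for node, state in node_state.items() if state != "complete")
```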
Shouldn't there be a limit to retries if uninstall fails?
Yes the operator will respect the decisions by the user to set the
Yes. That is the case.
That is correct.
That is correct.
So when is the implementation of this change expected? @ayuskauskas
@rayaank-afk it will be the next thing worked on. The ETA is 1 to 2 weeks.