
Uninstall pod does not receive package ConfigMap or Env — user-provided scripts and environment variables are silently dropped #186

@rayaank-afk

Description

When a package is removed from an SCR, the operator triggers an uninstall pod. However, that pod receives neither the package's configMap (scripts) nor its env (environment variables). It therefore runs with an empty configuration, silently succeeds without executing any user-provided cleanup logic, and the operator marks the uninstall as complete.

This means any host-level changes made during apply/config are never reversed during uninstall, leaving the node in a dirty state.

Testing environment:

  1. Skyhook Controller: v0.12.0
➜  workload-clusters git:(testing-skyhook) ✗ k describe deploy -n skyhook skyhook-skyhook-operator-controller-manager | grep Image
Annotations:            checkov.io/skip1: CKV_K8S_43=Image digest not required - we use tags
    Image:      nvcr.io/nvidia/skyhook/operator:v0.12.0@sha256:ce79a9778fca453e54d58506c71c8ff6765b65d44a73fb167441ab851c108dc2
    Image:      quay.io/brancz/kube-rbac-proxy:v0.15.0@sha256:2c7b120590cbe9f634f5099f2cbb91d0b668569023a81505ca124a5c437e7663
➜  workload-clusters git:(testing-skyhook) ✗ 
  2. Kubernetes: v1.34.5
➜  workload-clusters git:(testing-skyhook) ✗ k get nodes            
NAME                                        STATUS   ROLES           AGE   VERSION
1u1g-x570-0432.pdc1a2.colossus.nvidia.com   Ready    control-plane   30d   v1.34.5
1u1g-x570-0444.pdc1a2.colossus.nvidia.com   Ready    <none>          30d   v1.34.5
z370-0433.ipp3a1.colossus.nvidia.com        Ready    <none>          30d   v1.34.5
  3. The SCR I used opens a particular port on a control-plane node.
➜  workload-clusters git:(testing-skyhook) ✗ k get cm -n skyhook         
NAME                                                                          DATA   AGE
demo-baz-1.1.0                                                                2      7d23h
kube-root-ca.crt                                                              1      7d23h
open-port-12379-1u1g-x570-0432-pdc1a2-colossus-nvidia-com-metadata-453076ca   3      3m30s
open-port-12379-firewall-1.0.0                                                6      3m30s
➜  workload-clusters git:(testing-skyhook) ✗ k describe cm -n skyhook open-port-12379-1u1g-x570-0432-pdc1a2-colossus-nvidia-com-metadata-453076ca                              
Name:         open-port-12379-1u1g-x570-0432-pdc1a2-colossus-nvidia-com-metadata-453076ca
Namespace:    skyhook
Labels:       skyhook.nvidia.com/skyhook-node-meta=open-port-12379
Annotations:  skyhook.nvidia.com/Node.name: 1u1g-x570-0432.pdc1a2.colossus.nvidia.com
              skyhook.nvidia.com/name: open-port-12379

Data
====
packages.json:
----
{"agentVersion":"2bc0fe8c5c11130c843859dd0c8325e316bf4a9bb1d5883554c90a7a0574a771","packages":{"firewall":{"name":"firewall","version":"1.0.0","image":"ghcr.io/nvidia/skyhook-packages/shellscript"}}}
annotations.json:
----
{"cluster.x-k8s.io/annotations-from-machine":"","cluster.x-k8s.io/cluster-name":"pdc-nca-rayaankhan","cluster.x-k8s.io/cluster-namespace":"play","cluster.x-k8s.io/labels-from-machine":"","cluster.x-k8s.io/machine":"pdc-nca-rayaankhan-db5vw-qcdcg","cluster.x-k8s.io/owner-kind":"KubeadmControlPlane","cluster.x-k8s.io/owner-name":"pdc-nca-rayaankhan-db5vw","csi.volume.kubernetes.io/nodeid":"{\"csi.trident.netapp.io\":\"1u1g-x570-0432.pdc1a2.colossus.nvidia.com\"}","node.alpha.kubernetes.io/ttl":"0","projectcalico.org/IPv4Address":"10.46.254.176/16","projectcalico.org/IPv4IPIPTunnelAddr":"100.103.13.128","skyhook.nvidia.com/nodeState_open-port-12379":"{\"firewall|1.0.0\":{\"name\":\"firewall\",\"version\":\"1.0.0\",\"image\":\"ghcr.io/nvidia/skyhook-packages/shellscript\",\"stage\":\"config\",\"state\":\"complete\"}}","skyhook.nvidia.com/status_open-port-12379":"complete","skyhook.nvidia.com/version_open-port-12379":"v0.12.0","volumes.kubernetes.io/controller-managed-attach-detach":"true"}
labels.json:
----
{"beta.kubernetes.io/arch":"amd64","beta.kubernetes.io/os":"linux","kubernetes.io/arch":"amd64","kubernetes.io/hostname":"1u1g-x570-0432.pdc1a2.colossus.nvidia.com","kubernetes.io/os":"linux","node-role.kubernetes.io/control-plane":"","node.kubernetes.io/exclude-from-external-load-balancers":"","skyhook.nvidia.com/status_open-port-12379":"complete"}

BinaryData
====

Events:  <none>
➜  workload-clusters git:(testing-skyhook) ✗ k describe cm -n skyhook open-port-12379-firewall-1.0.0                                                                           
Name:         open-port-12379-firewall-1.0.0
Namespace:    skyhook
Labels:       skyhook.nvidia.com/name=open-port-12379
Annotations:  skyhook.nvidia.com/Package.Name: firewall
              skyhook.nvidia.com/Package.Version: 1.0.0
              skyhook.nvidia.com/name: open-port-12379

Data
====
apply.sh:
----
#!/bin/bash
set -e
if ! nsenter -t 1 -n -- iptables -C INPUT -p tcp --dport $PORT -j ACCEPT -m comment --comment "$COMMENT" 2>/dev/null; then
  nsenter -t 1 -n -- iptables -I INPUT -p tcp --dport $PORT -j ACCEPT -m comment --comment "$COMMENT"
  echo "Opened port $PORT"
else
  echo "Port $PORT already open"
fi
apply_check.sh:
----
#!/bin/bash
set -e
nsenter -t 1 -n -- iptables -C INPUT -p tcp --dport $PORT -j ACCEPT -m comment --comment "$COMMENT" 2>/dev/null
config.sh:
----
#!/bin/bash
set -e
if ! nsenter -t 1 -n -- iptables -C INPUT -p tcp --dport $PORT -j ACCEPT -m comment --comment "$COMMENT" 2>/dev/null; then
  nsenter -t 1 -n -- iptables -I INPUT -p tcp --dport $PORT -j ACCEPT -m comment --comment "$COMMENT"
  echo "Opened port $PORT"
else
  echo "Port $PORT already open"
fi
config_check.sh:
----
#!/bin/bash
set -e
nsenter -t 1 -n -- iptables -C INPUT -p tcp --dport $PORT -j ACCEPT -m comment --comment "$COMMENT" 2>/dev/null
uninstall.sh:
----
#!/bin/bash
set -e
if nsenter -t 1 -n -- iptables -C INPUT -p tcp --dport $PORT -j ACCEPT -m comment --comment "$COMMENT" 2>/dev/null; then
  nsenter -t 1 -n -- iptables -D INPUT -p tcp --dport $PORT -j ACCEPT -m comment --comment "$COMMENT"
  echo "Removed port $PORT"
else
  echo "Port $PORT rule not found, nothing to remove"
fi
uninstall_check.sh:
----
#!/bin/bash
set -e
! nsenter -t 1 -n -- iptables -C INPUT -p tcp --dport $PORT -j ACCEPT -m comment --comment "$COMMENT" 2>/dev/null

BinaryData
====

Events:  <none>
➜  workload-clusters git:(testing-skyhook) ✗ 

Steps I performed:

  1. I installed the above SCR on my cluster and it succeeded, i.e. the port was opened (I confirmed it).
  2. I removed the package from the SCR and applied it again.
  3. While watching the logs of all containers across all pods in the skyhook namespace, I found this:
2026-04-01T08:26:13.72056843Z stdout F [out]2026-04-01T08:26:13.720263 Could not find file /var/lib/skyhook/open-port-12379/firewall-1.0.0-82a01934-6cb6-4ff7-a250-b00b7e7844bd-2/configmaps/uninstall.sh was this in the configmap?
2026-04-01T08:26:13.758153955Z stdout F [out]2026-04-01T08:26:13.720263 SUCEEDED: shellscript_run.sh uninstall

So even though uninstall.sh was present in the package's configMap, the uninstall pod could not find it.

Expected Behavior

The uninstall pod should receive the same configMap scripts and env variables that were used during apply, so that uninstall.sh can execute the cleanup logic (e.g., removing the iptables rule).

Findings

  1. Faux package missing ConfigMap and Env
    In HandleVersionChange, when a package is removed from the spec, a "faux" package is created with only PackageRef and Image:
newPackage := &v1alpha1.Package{
    PackageRef: packageStatusRef,
    Image:      packageStatus.Image,
}

The Env and ConfigMap fields are not set because the node state annotation only stores name, version, image, stage, and state — not the original package configuration.
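
One possible direction (a minimal sketch only; the struct and field names below are illustrative and do not match the operator's actual types): persist the package's env and a reference to its configMap in the node-state annotation, so that the faux package can be rebuilt completely when the package disappears from the spec.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PackageState mirrors the shape of the skyhook.nvidia.com/nodeState_*
// annotation shown above (name, version, image, stage, state), extended
// with the two fields the uninstall path currently loses. These field
// names are a hypothetical proposal, not the operator's real schema.
type PackageState struct {
	Name      string            `json:"name"`
	Version   string            `json:"version"`
	Image     string            `json:"image"`
	Stage     string            `json:"stage"`
	State     string            `json:"state"`
	Env       map[string]string `json:"env,omitempty"`       // proposed addition
	ConfigMap string            `json:"configMap,omitempty"` // proposed addition
}

// fauxPackage rebuilds the full package the uninstall pod needs from the
// persisted node state, instead of only PackageRef and Image.
func fauxPackage(annotation, key string) (*PackageState, error) {
	states := map[string]PackageState{}
	if err := json.Unmarshal([]byte(annotation), &states); err != nil {
		return nil, err
	}
	s, ok := states[key]
	if !ok {
		return nil, fmt.Errorf("package %q not in node state", key)
	}
	return &s, nil
}

func main() {
	// Annotation shape taken from the node state above, with the
	// proposed env/configMap fields added.
	ann := `{"firewall|1.0.0":{"name":"firewall","version":"1.0.0",` +
		`"image":"ghcr.io/nvidia/skyhook-packages/shellscript",` +
		`"stage":"config","state":"complete",` +
		`"env":{"PORT":"12379"},"configMap":"open-port-12379-firewall-1.0.0"}}`

	pkg, err := fauxPackage(ann, "firewall|1.0.0")
	if err != nil {
		panic(err)
	}
	fmt.Println(pkg.ConfigMap, pkg.Env["PORT"])
}
```

With this, HandleVersionChange could mount the named configMap and inject the stored env into the uninstall pod, at the cost of growing the annotation (node annotations have size limits, so storing a configMap reference rather than the scripts themselves seems safer).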

Logs and other important details are available at this drive link (accessible only to NVIDIANs).

Metadata

Labels

documentation (Improvements or additions to documentation), enhancement (New feature or request)
