OCPBUGS-4101: Do not allow empty system reserved values#3439

Merged
openshift-merge-robot merged 1 commit into openshift:master from harche:no_empty_system_reserved
Dec 10, 2022

Conversation

@harche
Contributor

@harche harche commented Dec 1, 2022

Signed-off-by: Harshal Patil harpatil@redhat.com

Fixes: https://issues.redhat.com/browse/OCPBUGS-4101

- What I did
Make sure none of the system reserved values end up empty, which would make the kubelet crash during startup.

- How to verify it

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: system1-reserved-config
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  kubeletConfig:
    systemReserved:
      cpu: 500m
      memory: 1500Mi
      ephemeral-storage: "" 

Even after deploying that MachineConfig, the kubelet should start up with the default ephemeral-storage value for system reserved (1Gi).

- Description for the changelog

Force default system reserved values when empty values are supplied.
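As an aside on the defaulting semantics involved (illustrative shell only, not the actual MCO code): POSIX `${VAR-default}` substitutes only when the variable is unset, while `${VAR:-default}` also substitutes when it is set but empty — the latter is the behavior needed for an empty ephemeral-storage value to fall back to 1Gi.

```shell
#!/bin/sh
# Illustrative only: an empty (not merely unset) value needs ':-' defaulting.
ES=""                 # supplied but empty, like ephemeral-storage: "" above
echo "es=${ES-1Gi}"   # prints "es=": '-' defaults only when unset
echo "es=${ES:-1Gi}"  # prints "es=1Gi": ':-' also defaults when empty
```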

@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Dec 1, 2022
@openshift-ci-robot
Contributor

@harche: This pull request references Jira Issue OCPBUGS-4104, which is invalid:

  • expected the bug to target the "4.13.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci Bot requested review from cgwalters and jkyros December 1, 2022 15:21
@harche
Contributor Author

harche commented Dec 1, 2022

/jira referesh

@harche
Contributor Author

harche commented Dec 1, 2022

/jira refresh

@openshift-ci-robot
Contributor

@harche: This pull request references Jira Issue OCPBUGS-4104, which is invalid:

  • expected the bug to target the "4.13.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.


@harche
Contributor Author

harche commented Dec 1, 2022

/jira refresh

@openshift-ci-robot
Contributor

@harche: This pull request references Jira Issue OCPBUGS-4104, which is invalid:

  • expected the bug to target the "4.13.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.


@harche harche changed the title OCPBUGS-4104: Do not allow empty system reserved values OCPBUGS-4101: Do not allow empty system reserved values Dec 1, 2022
@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Dec 1, 2022
@openshift-ci-robot
Contributor

@harche: This pull request references Jira Issue OCPBUGS-4101, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.13.0) matches configured target version for branch (4.13.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @rioliu-rh

The bug has been updated to refer to the pull request using the external bug tracker.


@openshift-ci openshift-ci Bot requested a review from rioliu-rh December 1, 2022 15:24
@harche
Contributor Author

harche commented Dec 1, 2022

/cc @yuqi-zhang

@openshift-ci openshift-ci Bot requested a review from yuqi-zhang December 1, 2022 15:27
@harche
Contributor Author

harche commented Dec 1, 2022

/retest-required

Contributor

@yuqi-zhang yuqi-zhang left a comment


lgtm, do you want to have Rio from the MCO QE test this? Or is there someone from the node team's side that can help verify?

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 1, 2022
@harche
Contributor Author

harche commented Dec 1, 2022

lgtm, do you want to have Rio from the MCO QE test this? Or is there someone from the node team's side that can help verify?

From the node team I have verified it, so it would be better if MCO QE also verifies it. Thanks.

@yuqi-zhang
Contributor

@rioliu-rh is this something you would like to verify pre-merge?

@harche
Contributor Author

harche commented Dec 2, 2022

/hold

They are manually overriding files used internally by auto node sizing via a MachineConfig, so even if we handle this in the MCO, their upgrade will still fail. I am not sure at this point whether this should even be supported; manually overridden internal files are difficult to protect from external interference.

https://issues.redhat.com/browse/OCPBUGS-4101?focusedCommentId=21354604&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-21354604

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 2, 2022
@harche
Contributor Author

harche commented Dec 2, 2022

I will spend some more time on this to see if we can still fix it for them.

@rioliu-rh

@rioliu-rh is this something you would like to verify pre-merge?

@yuqi-zhang let me try to verify it with cluster-bot image

@rioliu-rh

rioliu-rh commented Dec 6, 2022

tried to verify this fix, but it does not work for me

apply node sizing mc and kubelet config based on 4.11.16 successfully

cat node-sizing-enabled-for-worker.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-node-sizing
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          compression: ""
          source: data:text/plain;charset=utf-8;base64,Tk9ERV9TSVpJTkdfRU5BQkxFRD1mYWxzZQpTWVNURU1fUkVTRVJWRURfTUVNT1JZPTNHaQpTWVNURU1fUkVTRVJWRURfQ1BVPTEwMDBtCg==
        mode: 420
        overwrite: true
        path: /etc/node-sizing-enabled.env

oc create -f node-sizing-enabled-for-worker.yaml
machineconfig.machineconfiguration.openshift.io/99-worker-node-sizing created

oc debug node/ci-ln-smybk7b-72292-x29cs-worker-a-gdhlg -- chroot /host cat /etc/node-sizing-enabled.env
Temporary namespace openshift-debug-kd6mt is created for debugging node...
Starting pod/ci-ln-smybk7b-72292-x29cs-worker-a-gdhlg-debug ...
To use host binaries, run `chroot /host`
NODE_SIZING_ENABLED=false
SYSTEM_RESERVED_MEMORY=3Gi
SYSTEM_RESERVED_CPU=1000m
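The ignition `source` above is just the env file base64-encoded in a data URL; decoding it locally reproduces exactly what the debug pod read back from /etc/node-sizing-enabled.env:

```shell
# Decode the data: URL payload from the MachineConfig above.
echo 'Tk9ERV9TSVpJTkdfRU5BQkxFRD1mYWxzZQpTWVNURU1fUkVTRVJWRURfTUVNT1JZPTNHaQpTWVNURU1fUkVTRVJWRURfQ1BVPTEwMDBtCg==' \
  | base64 -d
# NODE_SIZING_ENABLED=false
# SYSTEM_RESERVED_MEMORY=3Gi
# SYSTEM_RESERVED_CPU=1000m
```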

curl -sL https://raw.githubusercontent.com/openshift/managed-cluster-config/d4522b5245ae9ef97d579ec2e661b12f7d256dfd/deploy/kubelet-config/01-kubelet-config.yaml > kubelet-config.yaml

cat kubelet-config.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: custom-kubelet
spec:
  machineConfigPoolSelector:
    matchExpressions:
    - key: machineconfiguration.openshift.io/mco-built-in
      operator: Exists
  autoSizingReserved: true

oc create -f kubelet-config.yaml
kubeletconfig.machineconfiguration.openshift.io/custom-kubelet created

oc debug node/ci-ln-smybk7b-72292-x29cs-worker-a-gdhlg -- chroot /host cat /etc/kubernetes/kubelet.conf | jq .systemReserved
Temporary namespace openshift-debug-qmr9b is created for debugging node...
Starting pod/ci-ln-smybk7b-72292-x29cs-worker-a-gdhlg-debug ...
To use host binaries, run `chroot /host`
{
  "ephemeral-storage": "1Gi"
}

upgrade the cluster to the cluster-bot-built image registry.build05.ci.openshift.org/ci-ln-qhxbptb/release:latest

get auth token from https://console-openshift-console.apps.build05.l9oh.p1.openshiftapps.com/k8s/cluster/projects/ci-ln-qhxbptb and reset the pull-secret

oc adm upgrade --to-image registry.build05.ci.openshift.org/ci-ln-qhxbptb/release:latest --force --allow-explicit-upgrade

Upgradeable=False

  Reason: AdminAckRequired
  Message: Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6955381 for details and instructions.


oc -n openshift-config patch cm admin-acks --patch '{"data":{"ack-4.11-kube-1.25-api-removals-in-4.12":"true"}}' --type=merge
configmap/admin-acks patched

nodes are not ready

node
NAME                                       STATUS                        ROLES    AGE     VERSION
...
ci-ln-smybk7b-72292-x29cs-worker-a-gdhlg   NotReady,SchedulingDisabled   worker   3h25m   v1.24.6+5157800
ci-ln-smybk7b-72292-x29cs-worker-b-g2hp8   Ready                         worker   3h24m   v1.24.6+5157800
ci-ln-smybk7b-72292-x29cs-worker-c-t5st4   Ready

from kubelet log I can see

run.go:74] "command failed" err="failed to run Kubelet: quantities must match the regular expression '^([+-]?[0-9.]+)([eEinumkKMGTP]*[-+]?[0-9]*)$'"
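That error comes from the kubelet's resource quantity parser: the regex it quotes rejects the empty string, which is why an empty reserved value is fatal at startup. A quick local check (illustrative only, using grep rather than the kubelet's Go parser):

```shell
#!/bin/sh
# Check strings against the quantity regex quoted in the kubelet error above.
QTY_RE='^([+-]?[0-9.]+)([eEinumkKMGTP]*[-+]?[0-9]*)$'
check() {
  printf '%s' "$1" | grep -Eq "$QTY_RE" && echo valid || echo invalid
}
check 1Gi    # valid
check 500m   # valid
check ""     # invalid: empty input never matches, so the kubelet crashes
```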

@harche
Contributor Author

harche commented Dec 6, 2022

@rioliu-rh Thanks for testing, however I am not sure if the node sizing MC should be applied in the first place. That MC is modifying the files used by auto node sizing directly. Also, the system reserved values it is trying to set should be set using kubelet config. That's the only supported way.

@rioliu-rh

rioliu-rh commented Dec 6, 2022

@rioliu-rh Thanks for testing, however I am not sure if the node sizing MC should be applied in the first place. That MC is modifying the files used by auto node sizing directly. Also, the system reserved values it is trying to set should be set using kubelet config. That's the only supported way.

@harche So, I just need to apply below mc

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: system1-reserved-config
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  kubeletConfig:
    systemReserved:
      cpu: 500m
      memory: 1500Mi
      ephemeral-storage: "" 

then upgrade the cluster to the image which contains this fix

@rioliu-rh

rioliu-rh commented Dec 6, 2022

verified with following steps. @harche

apply following mc

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: system-reserved-config
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  kubeletConfig:
    systemReserved:
      cpu: 500m
      memory: 1500Mi
      ephemeral-storage: ""

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: enable-auto-sizing-reserved
spec:
  machineConfigPoolSelector:
    matchExpressions:
    - key: machineconfiguration.openshift.io/mco-built-in
      operator: Exists
  autoSizingReserved: true

mcp update is completed successfully

upgrade cluster from 4.11.16 to the image contains this pr registry.build05.ci.openshift.org/ci-ln-80tcgqk/release:latest

upgrade is success

oc adm upgrade
Cluster version is 4.12.0-0.ci.test-2022-12-06-083703-ci-ln-80tcgqk-latest

check file content of /etc/node-sizing-enabled.env on worker node

oc debug node/rioliu-1206b-ldgpl-worker-a-7cr96.c.openshift-qe.internal -- chroot /host cat /etc/node-sizing-enabled.env
Temporary namespace openshift-debug-r2gl7 is created for debugging node...
Starting pod/rioliu-1206b-ldgpl-worker-a-7cr96copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
NODE_SIZING_ENABLED=true
SYSTEM_RESERVED_MEMORY=1Gi
SYSTEM_RESERVED_CPU=500m
SYSTEM_RESERVED_ES=1Gi

check file content of kubelet.conf on worker node

oc debug node/rioliu-1206b-ldgpl-worker-a-7cr96.c.openshift-qe.internal -- chroot /host cat /etc/kubernetes/kubelet.conf
Temporary namespace openshift-debug-7wr89 is created for debugging node...
Starting pod/rioliu-1206b-ldgpl-worker-a-7cr96copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
{
  "kind": "KubeletConfiguration",
  "apiVersion": "kubelet.config.k8s.io/v1beta1",
  "staticPodPath": "/etc/kubernetes/manifests",
  "syncFrequency": "0s",
  "fileCheckFrequency": "0s",
  "httpCheckFrequency": "0s",
  "tlsCipherSuites": [
    "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256",
    "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
    "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384",
    "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
    "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256",
    "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256"
  ],
  "tlsMinVersion": "VersionTLS12",
  "rotateCertificates": true,
  "serverTLSBootstrap": true,
  "authentication": {
    "x509": {
      "clientCAFile": "/etc/kubernetes/kubelet-ca.crt"
    },
    "webhook": {
      "cacheTTL": "0s"
    },
    "anonymous": {
      "enabled": false
    }
  },
  "authorization": {
    "webhook": {
      "cacheAuthorizedTTL": "0s",
      "cacheUnauthorizedTTL": "0s"
    }
  },
  "clusterDomain": "cluster.local",
  "clusterDNS": [
    "172.30.0.10"
  ],
  "streamingConnectionIdleTimeout": "0s",
  "nodeStatusUpdateFrequency": "0s",
  "nodeStatusReportFrequency": "0s",
  "imageMinimumGCAge": "0s",
  "volumeStatsAggPeriod": "0s",
  "systemCgroups": "/system.slice",
  "cgroupRoot": "/",
  "cgroupDriver": "systemd",
  "cpuManagerReconcilePeriod": "0s",
  "runtimeRequestTimeout": "0s",
  "maxPods": 250,
  "podPidsLimit": 4096,
  "kubeAPIQPS": 50,
  "kubeAPIBurst": 100,
  "serializeImagePulls": false,
  "evictionPressureTransitionPeriod": "0s",
  "featureGates": {
    "APIPriorityAndFairness": true,
    "CSIMigrationAzureFile": false,
    "CSIMigrationvSphere": false,
    "DownwardAPIHugePages": true,
    "RotateKubeletServerCertificate": true
  },
  "memorySwap": {},
  "containerLogMaxSize": "50Mi",
  "logging": {
    "flushFrequency": 0,
    "verbosity": 0,
    "options": {
      "json": {
        "infoBufferSize": "0"
      }
    }
  },
  "shutdownGracePeriod": "0s",
  "shutdownGracePeriodCriticalPods": "0s"
}

@harche harche force-pushed the no_empty_system_reserved branch from 66eddc3 to 1eae18c Compare December 7, 2022 14:20
Signed-off-by: Harshal Patil <harpatil@redhat.com>
@harche harche force-pushed the no_empty_system_reserved branch from 1eae18c to bd24f17 Compare December 7, 2022 14:22
@harche
Contributor Author

harche commented Dec 7, 2022

@rioliu-rh I made the final script more resilient to unexpected input.

sh-4.4#  /usr/local/sbin/dynamic-system-reserved-calc.sh true 
sh-4.4# cat /etc/node-sizing.env 
SYSTEM_RESERVED_MEMORY=3Gi
SYSTEM_RESERVED_CPU=0.08
SYSTEM_RESERVED_ES=1Gi
sh-4.4# 
sh-4.4# 
sh-4.4#  /usr/local/sbin/dynamic-system-reserved-calc.sh false
sh-4.4# cat /etc/node-sizing.env 
SYSTEM_RESERVED_MEMORY=1Gi
SYSTEM_RESERVED_CPU=500m
SYSTEM_RESERVED_ES=1Gi
sh-4.4#                          
sh-4.4# 
sh-4.4#  /usr/local/sbin/dynamic-system-reserved-calc.sh false 3Gi 1000m 1.5Gi
sh-4.4# cat /etc/node-sizing.env 
SYSTEM_RESERVED_MEMORY=3Gi
SYSTEM_RESERVED_CPU=1000m
SYSTEM_RESERVED_ES=1.5Gi
sh-4.4# 
sh-4.4#  /usr/local/sbin/dynamic-system-reserved-calc.sh false 3Gi 1000m      
sh-4.4# cat /etc/node-sizing.env 
SYSTEM_RESERVED_MEMORY=3Gi
SYSTEM_RESERVED_CPU=1000m
SYSTEM_RESERVED_ES=1Gi
sh-4.4# 
sh-4.4#  /usr/local/sbin/dynamic-system-reserved-calc.sh false 3Gi       
sh-4.4# cat /etc/node-sizing.env 
SYSTEM_RESERVED_MEMORY=3Gi
SYSTEM_RESERVED_CPU=500m
SYSTEM_RESERVED_ES=1Gi
sh-4.4# 
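The defaulting shown above is consistent with positional-argument fallback; a hypothetical sketch of that behavior (`write_node_sizing` and its defaults are illustrative, not the actual dynamic-system-reserved-calc.sh):

```shell
#!/bin/sh
# Hypothetical sketch: args are (auto-enabled, memory, cpu, ephemeral-storage);
# ':-' makes missing *and* empty arguments fall back to the defaults.
write_node_sizing() {
  mem=${2:-1Gi}
  cpu=${3:-500m}
  es=${4:-1Gi}
  printf 'SYSTEM_RESERVED_MEMORY=%s\nSYSTEM_RESERVED_CPU=%s\nSYSTEM_RESERVED_ES=%s\n' \
    "$mem" "$cpu" "$es"
}
write_node_sizing false 3Gi 1000m   # ES omitted: falls back to 1Gi
```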

@harche
Contributor Author

harche commented Dec 7, 2022

/hold cancel

@openshift-ci openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 7, 2022
@openshift-ci
Contributor

openshift-ci Bot commented Dec 7, 2022

@harche: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/okd-scos-e2e-aws bd24f17 link false /test okd-scos-e2e-aws
ci/prow/e2e-hypershift bd24f17 link false /test e2e-hypershift

Full PR test history. Your PR dashboard.


@rioliu-rh

@harche verified with latest code

apply below machine config with empty value of ephemeral-storage

cat system-reserved-config.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: system-reserved-config
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  kubeletConfig:
    systemReserved:
      cpu: 500m
      memory: 1500Mi
      ephemeral-storage: ""

check kubelet.conf and node-sizing-enabled.env for reserved setting

oc debug node/rioliu-1208a-228sg-worker-a-qw42r.c.openshift-qe.internal -- chroot /host cat /etc/kubernetes/kubelet.conf | jq .systemReserved
Temporary namespace openshift-debug-khb5m is created for debugging node...
Starting pod/rioliu-1208a-228sg-worker-a-qw42rcopenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
{
  "ephemeral-storage": ""
}

oc debug node/rioliu-1208a-228sg-worker-a-qw42r.c.openshift-qe.internal -- chroot /host cat /etc/node-sizing-enabled.env
Temporary namespace openshift-debug-5dgxx is created for debugging node...
Starting pod/rioliu-1208a-228sg-worker-a-qw42rcopenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
NODE_SIZING_ENABLED=false
SYSTEM_RESERVED_MEMORY=1500Mi
SYSTEM_RESERVED_CPU=500m

trigger upgrade

oc adm upgrade --to-image registry.build05.ci.openshift.org/ci-ln-0qy8c9k/release:latest --force --allow-explicit-upgrade
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requesting update to release image registry.build05.ci.openshift.org/ci-ln-0qy8c9k/release:latest

upgrade is completed successfully

cv -o yaml | yq -y '.items[].status.history'
- completionTime: '2022-12-08T06:39:57Z'
  image: registry.build05.ci.openshift.org/ci-ln-0qy8c9k/release:latest
  startedTime: '2022-12-08T05:34:59Z'
  state: Completed
  verified: false
  version: 4.12.0-0.ci.test-2022-12-08-022049-ci-ln-0qy8c9k-latest
- completionTime: '2022-12-08T02:56:10Z'
  image: quay.io/openshift-release-dev/ocp-release@sha256:01c9ef7f802f0ba254847c22e960cd6453ed9f61d9296fdda8bd497712f53e23
  startedTime: '2022-12-08T02:36:53Z'
  state: Completed
  verified: false
  version: 4.11.16

check node-sizing-enabled.env on worker node to make sure SYSTEM_RESERVED_ES has default value

oc debug node/rioliu-1208a-228sg-worker-a-qw42r.c.openshift-qe.internal -- chroot /host cat /etc/node-sizing-enabled.env
Temporary namespace openshift-debug-7h5tt is created for debugging node...
Starting pod/rioliu-1208a-228sg-worker-a-qw42rcopenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
NODE_SIZING_ENABLED=false
SYSTEM_RESERVED_MEMORY=1500Mi
SYSTEM_RESERVED_CPU=500m
SYSTEM_RESERVED_ES=1Gi

@rioliu-rh

rioliu-rh commented Dec 8, 2022

@harche is it expected, if yes, I will add label qe-approved
cc: @yuqi-zhang

@harche
Contributor Author

harche commented Dec 8, 2022

@harche is it expected, if yes, I will add label qe-approved cc: @yuqi-zhang

Sounds good. Please add the qe-approved label whenever you are ready.

I also did additional testing with the actual script that was generating the empty value for the SYSTEM_RESERVED_ES variable. With the fix, the script should never output any empty system reserved values, irrespective of the input.

sh-4.4# /usr/local/sbin/dynamic-system-reserved-calc.sh true
sh-4.4# cat /etc/node-sizing.env 
SYSTEM_RESERVED_MEMORY=3Gi
SYSTEM_RESERVED_CPU=0.08
SYSTEM_RESERVED_ES=1Gi
sh-4.4# 
sh-4.4# 
sh-4.4# /usr/local/sbin/dynamic-system-reserved-calc.sh false
sh-4.4# cat /etc/node-sizing.env 
SYSTEM_RESERVED_MEMORY=1Gi
SYSTEM_RESERVED_CPU=500m
SYSTEM_RESERVED_ES=1Gi
sh-4.4# 
sh-4.4# 
sh-4.4# /usr/local/sbin/dynamic-system-reserved-calc.sh false 3Gi 1000m 1.5Gi
sh-4.4# cat /etc/node-sizing.env 
SYSTEM_RESERVED_MEMORY=3Gi
SYSTEM_RESERVED_CPU=1000m
SYSTEM_RESERVED_ES=1.5Gi
sh-4.4# 
sh-4.4# /usr/local/sbin/dynamic-system-reserved-calc.sh false 3Gi 1000m 
sh-4.4# cat /etc/node-sizing.env 
SYSTEM_RESERVED_MEMORY=3Gi
SYSTEM_RESERVED_CPU=1000m
SYSTEM_RESERVED_ES=1Gi
sh-4.4# 
sh-4.4# /usr/local/sbin/dynamic-system-reserved-calc.sh false 3Gi 
sh-4.4# cat /etc/node-sizing.env 
SYSTEM_RESERVED_MEMORY=3Gi
SYSTEM_RESERVED_CPU=500m
SYSTEM_RESERVED_ES=1Gi
sh-4.4# 

@yuqi-zhang
Contributor

I am +1 and can apply lgtm once we are satisfied

@rioliu-rh

/label qe-approved

@openshift-ci openshift-ci Bot added the qe-approved Signifies that QE has signed off on this PR label Dec 9, 2022
@yuqi-zhang
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Dec 9, 2022
@openshift-ci
Contributor

openshift-ci Bot commented Dec 9, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: harche, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 98447c2 and 2 for PR HEAD bd24f17 in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 931c8a3 and 1 for PR HEAD bd24f17 in total

@openshift-merge-robot openshift-merge-robot merged commit 748e449 into openshift:master Dec 10, 2022
@openshift-ci-robot
Contributor

@harche: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-4101 has been moved to the MODIFIED state.


@harche harche deleted the no_empty_system_reserved branch December 13, 2022 15:41
@harche
Contributor Author

harche commented Dec 13, 2022

/cherry-pick release-4.12

@openshift-cherrypick-robot

@harche: new pull request created: #3453

