OTA-1307: pkg/cvo/updatepayload: Drop the Job controller for release-manifests downloads #1105
Conversation
Previously, the CVO launched a Job and waited for it to complete to get manifests for an incoming release payload. But the Job controller doesn't bubble up details about why the pod has trouble (e.g. Init:SignatureValidationFailed), so to get those details, we need direct access to the Pod. The Job controller doesn't seem like it's adding much value here, so we're dropping it and monitoring the Pod ourselves.
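For illustration, here is a minimal client-go sketch of the kind of Pod status the CVO can now read directly; the package and function name are hypothetical, not the PR's actual code:

package example

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// reportInitContainerState reads the retrieval Pod's status directly and
// surfaces init-container waiting reasons such as "SignatureValidationFailed"
// or "ImagePullBackOff", which a Job status would not bubble up on its own.
func reportInitContainerState(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	pod, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	for _, status := range pod.Status.InitContainerStatuses {
		if waiting := status.State.Waiting; waiting != nil && waiting.Reason != "" {
			return fmt.Errorf("init container %s is waiting with reason %q and message %q",
				status.Name, waiting.Reason, waiting.Message)
		}
	}
	return nil
}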
Force-pushed from f29874e to 622efd2
/retest-required
Force-pushed from 0db06cb to 3996ca9
Testing:
$ oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000 --force --allow-explicit-upgrade
$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2024-11-07T17:22:33Z'
message: 'Retrieving payload failed version="" image="quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000"
failure=Unable to download and prepare the update: pod version--sjfwz failed with
reason "DeadlineExceeded" and message "Pod was active on the node longer than the
specified deadline" and status initcontainer cleanup is waiting with reason "ImagePullBackOff"
and message "Back-off pulling image \"quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000\""'
reason: RetrievePayload
status: 'False'
type: ReleaseAccepted
$ oc adm upgrade --clear
$ oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:2bf757225166a2d1f51370f0bd2ec734bcc8cf759fa1a53ab9531ce8c584c17c --force --allow-explicit-upgrade
$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2024-11-07T17:28:53Z'
message: Payload loaded version="4.17.3" image="quay.io/openshift-release-dev/ocp-release@sha256:2bf757225166a2d1f51370f0bd2ec734bcc8cf759fa1a53ab9531ce8c584c17c"
architecture="amd64"
reason: PayloadLoaded
status: 'True'
type: ReleaseAccepted
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.17.2-0.test-2024-11-07-161355-ci-ln-zycfpg2-latest True True 19s Working towards 4.17.3: 10 of 901 done (1% complete)
@hongkailiu: This pull request references OTA-1307 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.
/jira refresh
Testing "unacceptable Sigstore signatures":
$ oc adm upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:221ddaead9c1fe0151650954a0094a63b5669ba3b06aefe916e073c18de58db7 --allow-explicit-upgrade
$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2024-11-07T22:08:00Z'
message: 'Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:221ddaead9c1fe0151650954a0094a63b5669ba3b06aefe916e073c18de58db7"
failure=The update cannot be verified: unable to verify sha256:221ddaead9c1fe0151650954a0094a63b5669ba3b06aefe916e073c18de58db7
against keyrings: verifier-public-key-redhat'
reason: RetrievePayload
status: 'False'
type: ReleaseAccepted

The failure has the nice message, but it seems not related to my PR. The above testing is wrong. I will redo it tomorrow.
Force-pushed from 4ecfda9 to 622e4b0
Redid the testing with:
### too old to have a Sigstore sig
$ oc adm release info quay.io/openshift-release-dev/ocp-release:4.12.0-x86_64 | grep 'Digest'
Digest: sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18
$ oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18 --allow-explicit-upgrade
$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2024-11-08T12:16:36Z'
message: 'Retrieving payload failed version="" image="quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18"
failure=Unable to download and prepare the update: pod version--kzxpw failed at
pending with reason "" and message "" and status initcontainer cleanup is waiting
with reason "SignatureValidationFailed" and message "image pull failed for quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18
because the signature validation failed: Source image rejected: A signature was
required, but no signature exists"'
reason: RetrievePayload
status: 'False'
type: ReleaseAccepted
$ oc adm upgrade --clear
$ oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000 --force --allow-explicit-upgrade
$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2024-11-08T12:20:30Z'
message: 'Retrieving payload failed version="" image="quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000"
failure=Unable to download and prepare the update: pod version--bv4n8 failed with
reason "DeadlineExceeded" and message "Pod was active on the node longer than the
specified deadline" and status initcontainer cleanup is waiting with reason "ImagePullBackOff"
and message "Back-off pulling image \"quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000\""'
reason: RetrievePayload
status: 'False'
type: ReleaseAccepted
$ oc adm upgrade --clear
$ oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:d2d34aafe0adda79953dd928b946ecbda34673180ee9a80d2ee37c123a0f510c --force --allow-explicit-upgrade
$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2024-11-08T12:22:52Z'
message: Payload loaded version="4.18.0-ec.3" image="quay.io/openshift-release-dev/ocp-release@sha256:d2d34aafe0adda79953dd928b946ecbda34673180ee9a80d2ee37c123a0f510c"
architecture="amd64"
reason: PayloadLoaded
status: 'True'
type: ReleaseAccepted
$ oc get clusterversion version
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.18.0-0.test-2024-11-08-112624-ci-ln-2nk1wv2-latest True True 71s Working towards 4.18.0-ec.3: 115 of 952 done (12% complete), waiting on etcd, kube-apiserver
TBH I am not too comfortable with a gutting like this at this point of the cycle. It seems like something that should be done at the start of the cycle so that we have a chance to weed out any fallout, not right at the end 🤔
Force-pushed from 622e4b0 to 48b579e
Tested again with commit 48b579e:
petr-muller left a comment:
Finally found time to properly read this, comments inline.
Name:      name,
Namespace: namespace,
Labels: map[string]string{
	"k8s-app": "retrieve-openshift-release",
This is a workload closely coupled to the CVO; I'd put them under a single k8s-app label.
Re-used the label here:
I had to bring back the label because it is used to list and prune the version-- pods. If the CVO pod carried the same label, it would be killed too.
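For context, a minimal sketch of how pruning by that label could work, reusing the client-go imports from the sketch above; the helper name is hypothetical and this is not the PR's exact prunePods implementation:

// prunePayloadPods lists the pods carrying the retrieval label and deletes
// them. This is why the label has to stay distinct from the CVO's own
// k8s-app label: a shared label would sweep up the CVO pod with the same
// selector.
func prunePayloadPods(ctx context.Context, client kubernetes.Interface, namespace string) error {
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "k8s-app=retrieve-openshift-release",
	})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		if err := client.CoreV1().Pods(namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}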
| klog.Warningf("failed to prune jobs: %v", err) | ||
| // Prune older pods and directories while gracefully handling errors. | ||
| if err := r.prunePods(ctx); err != nil { | ||
| klog.Warningf("failed to prune pods: %v", err) |
I wonder if a failure to prune pods is as safe to just warn about and continue as it was for jobs. We won't name-conflict because we add a random component to the pod name, but I wonder what the other implications of failing to delete pods are: is it safe to spawn new pods? If there is an old pod still somehow running and refusing to be deleted, won't the two pods conflict somehow, like clobber each other's output?
I honestly do not know the answer.
Let me make sure I understand the concern:
- CVO started Pod A to retrieve the payload.
- CVO got restarted (while Pod A was still running) and tried to prune Pod A, but failed for some unknown reason.
- CVO started Pod B to retrieve the payload.
Then Pod A and Pod B may "clobber each other's output"?
Would the same race exist if the Pods above are replaced with Jobs? The Job controller may fail to delete a Job's pod too.
Would the same race exist if the Pods above are replaced with Jobs?
https://kubernetes.io/docs/concepts/workloads/controllers/job/#parallel-jobs
I think I get the answer: "normally, only one Pod is started, unless the Pod fails" implies that with a Job, the Pod for Job B won't get started while the Pod for Job A is still running.
The concern is mainly whether it is safe to continue at L169: we know that r.prunePods failed, for whatever reason (err != nil). We log it (correct), but then we continue. It could be something like "the apiserver was disrupted at that very moment". Maybe temporary, maybe permanent.
Is it safe to continue working, as opposed to returning and maybe retrying the whole procedure later? By "clobbering the output" I meant whether two simultaneously executing pods (one that we failed to delete, and one that we will spawn after we continue) somehow conflict. The version pods' job is to extract the manifests into a filesystem shared with the CVO itself; hypothetically, the two pods could write the manifests into the same location (maybe, I did not look at the actual code).
It looks like continuing on error here was me in 507b474 (pkg/cvo/updatepayload: Prune previous payload downloads, 2022-04-03, #760) when we went from "no pruning at all" to "best-effort pruning". I'm comfortable moving to "require pruning" and failing out of this function on error instead of just warning about errors. But I agree with Hongkai that it seems like preexisting exposure.
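For reference, a sketch of the two options being weighed here, written as a fragment against the r.prunePods call quoted above (not the merged change itself):

// Option A (current behavior): best-effort pruning, warn and continue.
if err := r.prunePods(ctx); err != nil {
	klog.Warningf("failed to prune pods: %v", err)
}

// Option B ("require pruning"): fail out of the retrieval and let the caller
// retry later, so a pod we could not delete never overlaps with a freshly
// spawned one.
if err := r.prunePods(ctx); err != nil {
	return fmt.Errorf("failed to prune pods before starting a new retrieval: %w", err)
}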
Force-pushed from 48b579e to 7c3c704
/retest-required
/retest-required
The failure seems not related to us: the upgrade has reached the point where the CVO applied manifests to the cluster, which means the payload was retrieved well.
$ curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1105/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-ovn-upgrade-out-of-change/1856896318669066240/artifacts/e2e-agnostic-ovn-upgrade-out-of-change/gather-extra/artifacts/clusterversion.json' | yq -y '.items[].status.conditions'
- lastTransitionTime: '2024-11-14T03:53:38Z'
message: The update channel has not been configured.
reason: NoChannel
status: 'False'
type: RetrievedUpdates
- lastTransitionTime: '2024-11-14T03:53:38Z'
message: Capabilities match configured spec
reason: AsExpected
status: 'False'
type: ImplicitlyEnabledCapabilities
- lastTransitionTime: '2024-11-14T03:53:38Z'
message: Payload loaded version="4.18.0-0.ci.test-2024-11-14-031235-ci-op-p5g3g1kq-initial"
image="registry.build09.ci.openshift.org/ci-op-p5g3g1kq/release@sha256:75f3bafc70910062034c8d45f08b7fd46ee8b4f44c19839f385ac179197c9dfe"
architecture="amd64"
reason: PayloadLoaded
status: 'True'
type: ReleaseAccepted
- lastTransitionTime: '2024-11-14T04:29:13Z'
message: Done applying 4.18.0-0.ci.test-2024-11-14-031713-ci-op-p5g3g1kq-latest
status: 'True'
type: Available
- lastTransitionTime: '2024-11-14T07:11:40Z'
message: 'Multiple errors are preventing progress:
* Could not update role "openshift-kni-infra/host-networking-services" (754 of
892)
* Could not update service "openshift-service-ca-operator/metrics" (705 of 892)'
reason: MultipleErrors
status: 'True'
type: Failing
- lastTransitionTime: '2024-11-14T04:29:13Z'
message: 'Error while reconciling 4.18.0-0.ci.test-2024-11-14-031713-ci-op-p5g3g1kq-latest:
an unknown error has occurred: MultipleErrors'
reason: MultipleErrors
status: 'False'
type: Progressing
Tested with 4386a7b:
$ oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18 --allow-explicit-upgrade
$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2024-11-15T13:48:52Z'
message: 'Retrieving payload failed version="" image="quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18"
failure=Unable to download and prepare the update: pod version--dl4gq failed at
pending with reason "" and message "" and status initcontainer cleanup is waiting
with reason "SignatureValidationFailed" and message "image pull failed for quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18
because the signature validation failed: Source image rejected: A signature was
required, but no signature exists"'
reason: RetrievePayload
status: 'False'
type: ReleaseAccepted
$ oc adm upgrade --clear
$ oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000 --force --allow-explicit-upgrade
$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2024-11-15T13:49:43Z'
message: 'Retrieving payload failed version="" image="quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000"
failure=Unable to download and prepare the update: pod version--cfd2d failed at
pending with reason "" and message "" and status initcontainer cleanup is waiting
with reason "ErrImagePull" and message "initializing source docker://quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000:
reading manifest sha256:0000000000000000000000000000000000000000000000000000000000000000
in quay.io/openshift-release-dev/ocp-release: manifest unknown"'
reason: RetrievePayload
status: 'False'
type: ReleaseAccepted
$ oc adm upgrade --clear
$ oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:d2d34aafe0adda79953dd928b946ecbda34673180ee9a80d2ee37c123a0f510c --force --allow-explicit-upgrade
$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2024-11-15T13:50:05Z'
message: Payload loaded version="4.18.0-ec.3" image="quay.io/openshift-release-dev/ocp-release@sha256:d2d34aafe0adda79953dd928b946ecbda34673180ee9a80d2ee37c123a0f510c"
architecture="amd64"
reason: PayloadLoaded
status: 'True'
type: ReleaseAccepted
$ oc get clusterversion version
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.18.0-0.test-2024-11-15-125415-ci-ln-5bt62p2-latest True True 45m Working towards 4.18.0-ec.3: 3 of 952 done (0% complete)

Inspected the CVO log: it got stuck there because of #1108 (comment). So we upgrade to the most recent nightly:
$ oc adm upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:a784a557b2d7dec0376002a52197897d16864940ce128ab51f7e63d82ff9574d --force --allow-explicit-upgrade --allow-upgrade-with-warnings
$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2024-11-15T13:50:05Z'
message: Payload loaded version="4.18.0-0.ci-2024-11-15-140111" image="registry.ci.openshift.org/ocp/release@sha256:a784a557b2d7dec0376002a52197897d16864940ce128ab51f7e63d82ff9574d"
architecture="amd64"
reason: PayloadLoaded
status: 'True'
type: ReleaseAccepted
$ oc get clusterversion version
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.18.0-0.test-2024-11-15-125415-ci-ln-5bt62p2-latest True True 50m Working towards 4.18.0-0.ci-2024-11-15-140111: 115 of 954 done (12% complete), waiting on etcd, kube-apiserver

Now everything seems OK.
wking left a comment:
Nice work!
/lgtm
Petr had held for more review, and now I've also reviewed again, and you've worked up the testing, and everything looks good, so:
/hold cancel
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: hongkailiu, petr-muller, wking
/label qe-approved
Install a TechPreview cluster:
- Trigger an upgrade to a payload too old to have a Sigstore sig; the clusterversion operator should report a detailed error about Sigstore validation.
- Clear the upgrade and trigger an upgrade to an invalid digest; clusterversion should report an image pull error.
We decided this will go to 4.19 instead of 4.18, as it is not considered a "critical fix".
/hold
Branching is complete.
/hold cancel
@hongkailiu: all tests passed!
[ART PR BUILD NOTIFIER] Distgit: cluster-version-operator
Testing with:
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.19.0-ec.4 True False 13m Cluster version is 4.19.0-ec.4
$ oc get -o json featuregate cluster | jq .spec
{
"featureSet": "TechPreviewNoUpgrade"
}

Upgrade to:
$ oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18 --allow-explicit-upgrade
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
Requested update to release image quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18
$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2025-04-22T16:07:29Z'
message: 'Retrieving payload failed version="" image="quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18"
failure=Unable to download and prepare the update: fetching update payload: pod
version--f94hp failed at pending with reason "" and message "" and status initcontainer
cleanup is waiting with reason "SignatureValidationFailed" and message "image pull
failed for quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18
because the signature validation failed: Source image rejected: A signature was
required, but no signature exists"'
reason: RetrievePayload
status: 'False'
type: ReleaseAccepted

Clear the previous upgrade:
$ oc adm upgrade --clear
Cancelled requested upgrade to quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18

and upgrade to:
$ oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000 --force --allow-explicit-upgrade
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requested update to release image quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000
$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2025-04-22T16:09:12Z'
message: 'Retrieving payload failed version="" image="quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000"
failure=Unable to download and prepare the update: fetching update payload: pod
version--drjdf failed at pending with reason "" and message "" and status initcontainer
cleanup is waiting with reason "ErrImagePull" and message "initializing source docker://quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000:
reading manifest sha256:0000000000000000000000000000000000000000000000000000000000000000
in quay.io/openshift-release-dev/ocp-release: manifest unknown"'
reason: RetrievePayload
status: 'False'
type: ReleaseAccepted
Previously, the CVO launched a Job and waited for it to complete to
get manifests for an incoming release payload. But the Job controller
doesn't bubble up details about why the pod has trouble
(e.g. Init:SignatureValidationFailed), so to get those details, we
need direct access to the Pod. The Job controller doesn't seem like
it's adding much value here, so we're dropping it and monitoring the
Pod ourselves.
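A minimal sketch of what "monitoring the Pod ourselves" can look like with client-go polling; the helper name, interval, and timeout here are assumptions, not the actual implementation:

package example

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForPodCompletion polls the retrieval Pod until it reports Succeeded or
// Failed, returning the failure so callers can surface it in the
// ReleaseAccepted condition.
func waitForPodCompletion(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	return wait.PollUntilContextTimeout(ctx, 2*time.Second, 5*time.Minute, true, func(ctx context.Context) (bool, error) {
		pod, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		switch pod.Status.Phase {
		case corev1.PodSucceeded:
			return true, nil
		case corev1.PodFailed:
			return false, fmt.Errorf("pod %s failed with reason %q and message %q", name, pod.Status.Reason, pod.Status.Message)
		}
		return false, nil
	})
}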