
OTA-1307: pkg/cvo/updatepayload: Drop the Job controller for release-manifests downloads#1105

Merged
openshift-merge-bot[bot] merged 2 commits into openshift:master from hongkailiu:drop-version-job-controller
Nov 28, 2024
Conversation

@hongkailiu
Member

@hongkailiu hongkailiu commented Nov 6, 2024

Previously, the CVO launched a Job and waited for it to complete to
get manifests for an incoming release payload. But the Job controller
doesn't bubble up details about why the pod has trouble
(e.g. Init:SignatureValidationFailed), so to get those details, we
need direct access to the Pod. The Job controller doesn't seem like
it's adding much value here, so we're dropping it and monitoring the
Pod ourselves.
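A minimal Go sketch of the direct-Pod approach described above (simplified local types stand in for corev1, and the function name and shape are illustrative assumptions, not the actual CVO implementation — the real code reads `Pod.Status.InitContainerStatuses` via client-go):

```go
package main

import "fmt"

// Simplified stand-ins for the corev1 Pod status fields the CVO inspects.
type waitingState struct {
	Reason  string
	Message string
}

type containerState struct {
	Waiting *waitingState
}

type containerStatus struct {
	Name  string
	State containerState
}

// reasonForPendingPod surfaces why a pod is stuck, e.g. an init container
// waiting with SignatureValidationFailed -- detail the Job controller does
// not bubble up into the Job's own status.
func reasonForPendingPod(initStatuses []containerStatus) string {
	for _, s := range initStatuses {
		if w := s.State.Waiting; w != nil && w.Reason != "" {
			return fmt.Sprintf("initcontainer %s is waiting with reason %q and message %q",
				s.Name, w.Reason, w.Message)
		}
	}
	return "pod is pending for an unknown reason"
}

func main() {
	statuses := []containerStatus{{
		Name: "cleanup",
		State: containerState{Waiting: &waitingState{
			Reason:  "SignatureValidationFailed",
			Message: "Source image rejected: A signature was required, but no signature exists",
		}},
	}}
	fmt.Println(reasonForPendingPod(statuses))
}
```

With a Job in between, only the Job's terminal conditions are visible; watching the Pod directly lets messages like the one above flow into the ReleaseAccepted condition.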

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 6, 2024
@openshift-ci openshift-ci Bot requested review from DavidHurta and wking November 6, 2024 22:39
@hongkailiu hongkailiu force-pushed the drop-version-job-controller branch 2 times, most recently from f29874e to 622efd2 Compare November 7, 2024 04:05
@hongkailiu
Member Author

/retest-required

@hongkailiu hongkailiu force-pushed the drop-version-job-controller branch 2 times, most recently from 0db06cb to 3996ca9 Compare November 7, 2024 16:02
@hongkailiu hongkailiu changed the title WIP: pkg/cvo/updatepayload: Drop the Job controller for release-manifests downloads pkg/cvo/updatepayload: Drop the Job controller for release-manifests downloads Nov 7, 2024
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 7, 2024
@hongkailiu
Member Author

Testing:

launch 4.17.2,#1105 aws

$ oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000 --force --allow-explicit-upgrade

$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2024-11-07T17:22:33Z'
message: 'Retrieving payload failed version="" image="quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000"
  failure=Unable to download and prepare the update: pod version--sjfwz failed with
  reason "DeadlineExceeded" and message "Pod was active on the node longer than the
  specified deadline" and status initcontainer cleanup is waiting with reason "ImagePullBackOff"
  and message "Back-off pulling image \"quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000\""'
reason: RetrievePayload
status: 'False'
type: ReleaseAccepted

$ oc adm upgrade --clear

$ oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:2bf757225166a2d1f51370f0bd2ec734bcc8cf759fa1a53ab9531ce8c584c17c --force --allow-explicit-upgrade

$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2024-11-07T17:28:53Z'
message: Payload loaded version="4.17.3" image="quay.io/openshift-release-dev/ocp-release@sha256:2bf757225166a2d1f51370f0bd2ec734bcc8cf759fa1a53ab9531ce8c584c17c"
  architecture="amd64"
reason: PayloadLoaded
status: 'True'
type: ReleaseAccepted

$ oc get clusterversion
NAME      VERSION                                                AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.17.2-0.test-2024-11-07-161355-ci-ln-zycfpg2-latest   True        True          19s     Working towards 4.17.3: 10 of 901 done (1% complete)

@hongkailiu hongkailiu changed the title pkg/cvo/updatepayload: Drop the Job controller for release-manifests downloads OTA-1307: pkg/cvo/updatepayload: Drop the Job controller for release-manifests downloads Nov 7, 2024
@openshift-ci-robot
Contributor

openshift-ci-robot commented Nov 7, 2024

@hongkailiu: This pull request references OTA-1307 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

Details

In response to this:

Previously, the CVO launched a Job and waited for it to complete to
get manifests for an incoming release payload. But the Job controller
doesn't bubble up details about why the pod has trouble
(e.g. Init:SignatureValidationFailed), so to get those details, we
need direct access to the Pod. The Job controller doesn't seem like
it's adding much value here, so we're dropping it and monitoring the
Pod ourselves.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Nov 7, 2024
@hongkailiu
Member Author

/jira refresh

@openshift-ci-robot
Contributor

openshift-ci-robot commented Nov 7, 2024

@hongkailiu: This pull request references OTA-1307 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Comment thread pkg/cvo/updatepayload.go Outdated
@hongkailiu
Member Author

hongkailiu commented Nov 7, 2024

Testing "unacceptable Sigstore signatures":

launch 4.17.2,openshift/cluster-version-operator#1105 aws

$ oc adm upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:221ddaead9c1fe0151650954a0094a63b5669ba3b06aefe916e073c18de58db7 --allow-explicit-upgrade

$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2024-11-07T22:08:00Z'
message: 'Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:221ddaead9c1fe0151650954a0094a63b5669ba3b06aefe916e073c18de58db7"
  failure=The update cannot be verified: unable to verify sha256:221ddaead9c1fe0151650954a0094a63b5669ba3b06aefe916e073c18de58db7
  against keyrings: verifier-public-key-redhat'
reason: RetrievePayload
status: 'False'
type: ReleaseAccepted

The failure has a nice message, but it seems unrelated to my PR.


The above testing is wrong. I will redo it tomorrow.

@hongkailiu hongkailiu force-pushed the drop-version-job-controller branch 2 times, most recently from 4ecfda9 to 622e4b0 Compare November 8, 2024 02:11
@hongkailiu
Member Author

Redid the testing with

launch 4.18,openshift/cluster-version-operator#1105 aws,techpreview

### too old to have a Sigstore sig
$ oc adm release info quay.io/openshift-release-dev/ocp-release:4.12.0-x86_64 | grep 'Digest'
Digest:         sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18

$ oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18 --allow-explicit-upgrade

$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2024-11-08T12:16:36Z'
message: 'Retrieving payload failed version="" image="quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18"
  failure=Unable to download and prepare the update: pod version--kzxpw failed at
  pending with reason "" and message "" and status initcontainer cleanup is waiting
  with reason "SignatureValidationFailed" and message "image pull failed for quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18
  because the signature validation failed: Source image rejected: A signature was
  required, but no signature exists"'
reason: RetrievePayload
status: 'False'
type: ReleaseAccepted

$ oc adm upgrade --clear

$ oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000 --force --allow-explicit-upgrade

$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2024-11-08T12:20:30Z'
message: 'Retrieving payload failed version="" image="quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000"
  failure=Unable to download and prepare the update: pod version--bv4n8 failed with
  reason "DeadlineExceeded" and message "Pod was active on the node longer than the
  specified deadline" and status initcontainer cleanup is waiting with reason "ImagePullBackOff"
  and message "Back-off pulling image \"quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000\""'
reason: RetrievePayload
status: 'False'
type: ReleaseAccepted

$ oc adm upgrade --clear

$ oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:d2d34aafe0adda79953dd928b946ecbda34673180ee9a80d2ee37c123a0f510c --force --allow-explicit-upgrade

$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2024-11-08T12:22:52Z'
message: Payload loaded version="4.18.0-ec.3" image="quay.io/openshift-release-dev/ocp-release@sha256:d2d34aafe0adda79953dd928b946ecbda34673180ee9a80d2ee37c123a0f510c"
  architecture="amd64"
reason: PayloadLoaded
status: 'True'
type: ReleaseAccepted

oc get clusterversion version
NAME      VERSION                                                AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.0-0.test-2024-11-08-112624-ci-ln-2nk1wv2-latest   True        True          71s     Working towards 4.18.0-ec.3: 115 of 952 done (12% complete), waiting on etcd, kube-apiserver

Comment thread pkg/cvo/updatepayload.go
@petr-muller
Member

TBH I am not too comfortable with gutting like this at this point in the cycle. It seems like something that should be done at the start of a cycle, so that we have a chance to weed out any fallout, not right at the end 🤔

@hongkailiu hongkailiu force-pushed the drop-version-job-controller branch from 622e4b0 to 48b579e Compare November 8, 2024 21:37
@hongkailiu
Member Author

Tested again with commit 48b579e:

launch 4.18,openshift/cluster-version-operator#1105 aws,techpreview

$ oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18 --allow-explicit-upgrade

$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2024-11-08T23:00:55Z'
message: 'Retrieving payload failed version="" image="quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18"
  failure=Unable to download and prepare the update: pod version--prtj2 failed at
  pending with reason "" and message "" and status initcontainer cleanup is waiting
  with reason "SignatureValidationFailed" and message "image pull failed for quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18
  because the signature validation failed: Source image rejected: A signature was
  required, but no signature exists"'
reason: RetrievePayload
status: 'False'
type: ReleaseAccepted

$ oc adm upgrade --clear
$ oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000 --force --allow-explicit-upgrade

$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2024-11-08T23:01:55Z'
message: 'Retrieving payload failed version="" image="quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000"
  failure=Unable to download and prepare the update: pod version--ldjdj failed at
  pending with reason "" and message "" and status initcontainer cleanup is waiting
  with reason "ErrImagePull" and message "initializing source docker://quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000:
  reading manifest sha256:0000000000000000000000000000000000000000000000000000000000000000
  in quay.io/openshift-release-dev/ocp-release: manifest unknown"'
reason: RetrievePayload
status: 'False'
type: ReleaseAccepted

$ oc adm upgrade --clear
$ oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:d2d34aafe0adda79953dd928b946ecbda34673180ee9a80d2ee37c123a0f510c --force --allow-explicit-upgrade

$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2024-11-08T23:02:51Z'
message: Payload loaded version="4.18.0-ec.3" image="quay.io/openshift-release-dev/ocp-release@sha256:d2d34aafe0adda79953dd928b946ecbda34673180ee9a80d2ee37c123a0f510c"
  architecture="amd64"
reason: PayloadLoaded
status: 'True'
type: ReleaseAccepted

$ oc get clusterversion version
NAME      VERSION                                                AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.0-0.test-2024-11-08-221058-ci-ln-hkv88s2-latest   True        True          26s     Working towards 4.18.0-ec.3: 115 of 952 done (12% complete), waiting on etcd, kube-apiserver

Member

@petr-muller petr-muller left a comment


Finally found time to properly read this, comments inline

Comment thread pkg/cvo/updatepayload.go
Name: name,
Namespace: namespace,
Labels: map[string]string{
"k8s-app": "retrieve-openshift-release",
Member


This is a workload closely coupled to the CVO; I'd put them under a single k8s-app

Member Author


Re-used the label here:

labels:
k8s-app: cluster-version-operator

Member Author

@hongkailiu hongkailiu Nov 15, 2024


I had to restore the label because it is used to list and prune the version-- pods.
The CVO pod would carry the same k8s-app label, so it would be killed too.
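The label problem can be illustrated with a small Go sketch (local types only, not the CVO's actual pruning code; the name-prefix filter and function name are assumptions for illustration): selecting retrieval pods purely by a shared `k8s-app: cluster-version-operator` label would also match the CVO pod itself, which is why the dedicated `retrieve-openshift-release` label had to be restored.

```go
package main

import (
	"fmt"
	"strings"
)

type pod struct {
	Name   string
	Labels map[string]string
}

// podsToPrune selects version-- retrieval pods by label. If retrieval pods
// shared k8s-app=cluster-version-operator with the CVO, a pure label match
// would select (and delete) the CVO pod too.
func podsToPrune(pods []pod, label string) []string {
	var out []string
	for _, p := range pods {
		if p.Labels["k8s-app"] == label && strings.HasPrefix(p.Name, "version--") {
			out = append(out, p.Name)
		}
	}
	return out
}

func main() {
	pods := []pod{
		{Name: "cluster-version-operator-abc", Labels: map[string]string{"k8s-app": "cluster-version-operator"}},
		{Name: "version--sjfwz", Labels: map[string]string{"k8s-app": "retrieve-openshift-release"}},
	}
	fmt.Println(podsToPrune(pods, "retrieve-openshift-release"))
}
```

A dedicated label keeps the prune selector from ever matching the operator's own pod, even without relying on the name prefix.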

Comment thread pkg/cvo/updatepayload.go Outdated
klog.Warningf("failed to prune jobs: %v", err)
// Prune older pods and directories while gracefully handling errors.
if err := r.prunePods(ctx); err != nil {
klog.Warningf("failed to prune pods: %v", err)
Member


I wonder if a failure to prune pods is as safe to just warn about and continue as it was for jobs. We won't name-conflict because we add a random component to the pod name, but I wonder what the other implications of failing to delete pods are - is it safe to spawn new pods? If there is an old pod still somehow running and refusing to be deleted, won't the two pods conflict somehow, like clobber each other's output or something?

Member Author

@hongkailiu hongkailiu Nov 13, 2024


I honestly do not know the answer.

Let me understand the concern:

  • CVO started Pod A retrieving the payload.
  • CVO got restarted (while Pod A is still running) and started to prune Pod A but failed for some unknown reason.
  • CVO started Pod B retrieving the payload.

Then Pod A and Pod B may "clobber each other's output"?

Would this be the same racing story if the "Pod"s above were replaced with "Job"s?
The Job controller may fail to delete the pod for its job too.

Member Author


Would this be the same racing story if the "Pod"s above were replaced with "Job"s?

https://kubernetes.io/docs/concepts/workloads/controllers/job/#parallel-jobs

I think I get the answer: "normally, only one Pod is started, unless the Pod fails" implies that with a Job, the Pod for Job B won't get started while the Pod for Job A is still running.

Member

@petr-muller petr-muller Nov 14, 2024


The concern is mainly whether it is safe to continue at L169 - we know that r.prunePods failed, for whatever reason (err != nil). We log it (correct), but then we continue. It can be something like "apiserver was disrupted at that very moment". Maybe temporary, maybe permanent.

Is it safe to continue working? As opposed to returning and maybe trying the whole procedure again later? By "clobbering the output" I meant whether two simultaneously executing pods (one that we failed to delete, and one that we will spawn after we continue) are somehow not conflicting. The version pods' job is to extract the manifests into the filesystem shared with the CVO itself - hypothetically, the two pods could write the manifests into the same location (maybe; I did not look at the actual code).

Member


It looks like continuing on error here was me in 507b474 (pkg/cvo/updatepayload: Prune previous payload downloads, 2022-04-03, #760) when we went from "no pruning at all" to "best-effort pruning". I'm comfortable moving to "require pruning" and failing out of this function on error instead of just warning about errors. But I agree with Hongkai that it seems like preexisting exposure.
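The two behaviors discussed here - "best-effort pruning" versus "require pruning" - can be contrasted in a minimal Go sketch (illustrative only; `prunePods` is a stub that always fails, standing in for the real cleanup in pkg/cvo/updatepayload.go):

```go
package main

import (
	"errors"
	"fmt"
)

// prunePods is a stand-in for the real cleanup; here it always fails,
// simulating e.g. a disrupted apiserver at that moment.
func prunePods() error { return errors.New("apiserver unavailable") }

// bestEffort mirrors the pre-existing behavior: log the pruning failure
// and continue preparing the payload anyway.
func bestEffort() error {
	if err := prunePods(); err != nil {
		fmt.Printf("warning: failed to prune pods: %v\n", err)
	}
	return nil // continue regardless
}

// requirePruning mirrors the stricter alternative: bail out on a pruning
// failure so the whole retrieval is retried later.
func requirePruning() error {
	if err := prunePods(); err != nil {
		return fmt.Errorf("failed to prune pods: %w", err)
	}
	return nil
}

func main() {
	fmt.Println("best-effort:", bestEffort())
	fmt.Println("require:", requirePruning())
}
```

With the stricter variant, a pruning failure surfaces in the ReleaseAccepted condition (as later seen in QE testing: "failed to prune pods: error deleting pods: ...") instead of being silently warned about.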

@hongkailiu hongkailiu force-pushed the drop-version-job-controller branch from 48b579e to 7c3c704 Compare November 13, 2024 21:32
@hongkailiu
Member Author

/retest-required

@hongkailiu
Member Author

/retest-required

The failure seems unrelated to this PR: the upgrade reached the point where the CVO applied manifests to the cluster, which means the payload was retrieved successfully (reason: PayloadLoaded).

$ curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1105/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-ovn-upgrade-out-of-change/1856896318669066240/artifacts/e2e-agnostic-ovn-upgrade-out-of-change/gather-extra/artifacts/clusterversion.json' | yq -y '.items[].status.conditions'
- lastTransitionTime: '2024-11-14T03:53:38Z'
  message: The update channel has not been configured.
  reason: NoChannel
  status: 'False'
  type: RetrievedUpdates
- lastTransitionTime: '2024-11-14T03:53:38Z'
  message: Capabilities match configured spec
  reason: AsExpected
  status: 'False'
  type: ImplicitlyEnabledCapabilities
- lastTransitionTime: '2024-11-14T03:53:38Z'
  message: Payload loaded version="4.18.0-0.ci.test-2024-11-14-031235-ci-op-p5g3g1kq-initial"
    image="registry.build09.ci.openshift.org/ci-op-p5g3g1kq/release@sha256:75f3bafc70910062034c8d45f08b7fd46ee8b4f44c19839f385ac179197c9dfe"
    architecture="amd64"
  reason: PayloadLoaded
  status: 'True'
  type: ReleaseAccepted
- lastTransitionTime: '2024-11-14T04:29:13Z'
  message: Done applying 4.18.0-0.ci.test-2024-11-14-031713-ci-op-p5g3g1kq-latest
  status: 'True'
  type: Available
- lastTransitionTime: '2024-11-14T07:11:40Z'
  message: 'Multiple errors are preventing progress:

    * Could not update role "openshift-kni-infra/host-networking-services" (754 of
    892)

    * Could not update service "openshift-service-ca-operator/metrics" (705 of 892)'
  reason: MultipleErrors
  status: 'True'
  type: Failing
- lastTransitionTime: '2024-11-14T04:29:13Z'
  message: 'Error while reconciling 4.18.0-0.ci.test-2024-11-14-031713-ci-op-p5g3g1kq-latest:
    an unknown error has occurred: MultipleErrors'
  reason: MultipleErrors
  status: 'False'
  type: Progressing

@hongkailiu
Member Author

Tested with commit 4386a7b:

launch 4.18,openshift/cluster-version-operator#1105 aws,techpreview

$ oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18 --allow-explicit-upgrade
$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2024-11-15T13:48:52Z'
message: 'Retrieving payload failed version="" image="quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18"
  failure=Unable to download and prepare the update: pod version--dl4gq failed at
  pending with reason "" and message "" and status initcontainer cleanup is waiting
  with reason "SignatureValidationFailed" and message "image pull failed for quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18
  because the signature validation failed: Source image rejected: A signature was
  required, but no signature exists"'
reason: RetrievePayload
status: 'False'
type: ReleaseAccepted

$ oc adm upgrade --clear
$ oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000 --force --allow-explicit-upgrade
$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2024-11-15T13:49:43Z'
message: 'Retrieving payload failed version="" image="quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000"
  failure=Unable to download and prepare the update: pod version--cfd2d failed at
  pending with reason "" and message "" and status initcontainer cleanup is waiting
  with reason "ErrImagePull" and message "initializing source docker://quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000:
  reading manifest sha256:0000000000000000000000000000000000000000000000000000000000000000
  in quay.io/openshift-release-dev/ocp-release: manifest unknown"'
reason: RetrievePayload
status: 'False'
type: ReleaseAccepted

$ oc adm upgrade --clear
$ oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:d2d34aafe0adda79953dd928b946ecbda34673180ee9a80d2ee37c123a0f510c --force --allow-explicit-upgrade
$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2024-11-15T13:50:05Z'
message: Payload loaded version="4.18.0-ec.3" image="quay.io/openshift-release-dev/ocp-release@sha256:d2d34aafe0adda79953dd928b946ecbda34673180ee9a80d2ee37c123a0f510c"
  architecture="amd64"
reason: PayloadLoaded
status: 'True'
type: ReleaseAccepted

$ oc get clusterversion version
NAME      VERSION                                                AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.0-0.test-2024-11-15-125415-ci-ln-5bt62p2-latest   True        True          45m     Working towards 4.18.0-ec.3: 3 of 952 done (0% complete)

Inspected the CVO log:

I1115 14:01:17.694736       1 cvo.go:636] Error handling openshift-cluster-version/version: ClusterVersion.config.openshift.io "version" is invalid: [status.capabilities.enabledCapabilities[12]: Unsupported value: "OperatorLifecycleManagerV1": supported values: "openshift-samples", "baremetal", "marketplace", "Console", "Insights", "Storage", "CSISnapshot", "NodeTuning", "MachineAPI", "Build", "DeploymentConfig", "ImageRegistry", "OperatorLifecycleManager", "CloudCredential", "Ingress", "CloudControllerManager", <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]

It got stuck there because of #1108 (comment):
the target version 4.18.0-ec.3 is too old to know about the new capability in the version installed by the bot, 4.18.0-0.test-2024-11-15-125415-ci-ln-5bt62p2-latest.

So we upgrade to the most recent CI release instead:
https://amd64.ocp.releases.ci.openshift.org/releasestream/4.18.0-0.ci/release/4.18.0-0.ci-2024-11-15-140111

$ oc adm upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:a784a557b2d7dec0376002a52197897d16864940ce128ab51f7e63d82ff9574d --force --allow-explicit-upgrade --allow-upgrade-with-warnings
$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2024-11-15T13:50:05Z'
message: Payload loaded version="4.18.0-0.ci-2024-11-15-140111" image="registry.ci.openshift.org/ocp/release@sha256:a784a557b2d7dec0376002a52197897d16864940ce128ab51f7e63d82ff9574d"
  architecture="amd64"
reason: PayloadLoaded
status: 'True'
type: ReleaseAccepted

$ oc get clusterversion version
NAME      VERSION                                                AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.0-0.test-2024-11-15-125415-ci-ln-5bt62p2-latest   True        True          50m     Working towards 4.18.0-0.ci-2024-11-15-140111: 115 of 954 done (12% complete), waiting on etcd, kube-apiserver

Now everything seems OK.

Member

@wking wking left a comment


Nice work!

/lgtm

Petr had held for more review, and now I've also reviewed again, and you've worked up the testing, and everything looks good, so:

/hold cancel

@openshift-ci openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 15, 2024
@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Nov 15, 2024
@openshift-ci
Contributor

openshift-ci Bot commented Nov 15, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hongkailiu, petr-muller, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dis016

dis016 commented Nov 20, 2024

/label qe-approved

@openshift-ci openshift-ci Bot added the qe-approved Signifies that QE has signed off on this PR label Nov 20, 2024
@openshift-ci-robot
Contributor

openshift-ci-robot commented Nov 20, 2024

@hongkailiu: This pull request references OTA-1307 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

Details

In response to this:

Previously, the CVO launched a Job and waited for it to complete to
get manifests for an incoming release payload. But the Job controller
doesn't bubble up details about why the pod has trouble
(e.g. Init:SignatureValidationFailed), so to get those details, we
need direct access to the Pod. The Job controller doesn't seem like
it's adding much value here, so we're dropping it and monitoring the
Pod ourselves.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@dis016

dis016 commented Nov 20, 2024

Install Techpreview Cluster:

[root@preserve-dis016-centos9 cloud-user]# oc get clusterversion 
NAME      VERSION                                                AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.0-0.test-2024-11-20-162921-ci-ln-n0pyvh2-latest   True        False         45m     Cluster version is 4.18.0-0.test-2024-11-20-162921-ci-ln-n0pyvh2-latest
[root@preserve-dis016-centos9 cloud-user]# oc adm upgrade 
Cluster version is 4.18.0-0.test-2024-11-20-162921-ci-ln-n0pyvh2-latest


Upgradeable=False


 Reason: FeatureGates_RestrictedFeatureGates_TechPreviewNoUpgrade
 Message: Cluster operator config-operator should not be upgraded between minor versions: FeatureGatesUpgradeable: "TechPreviewNoUpgrade" does not allow updates


warning: Cannot display available updates:
 Reason: NoChannel
 Message: The update channel has not been configured.


[root@preserve-dis016-centos9 cloud-user]# 

Trigger an update to a payload that is too old to have a Sigstore signature.

[root@preserve-dis016-centos9 cloud-user]# oc adm release info quay.io/openshift-release-dev/ocp-release:4.12.0-x86_64 | grep 'Digest'
Digest:         sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18
[root@preserve-dis016-centos9 cloud-user]#
[root@preserve-dis016-centos9 cloud-user]# oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18 --allow-explicit-upgrade
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
Requested update to release image quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18
[root@preserve-dis016-centos9 cloud-user]# 

The cluster-version operator should report a detailed error about Sigstore signature validation.

[root@preserve-dis016-centos9 cloud-user]# oc adm upgrade 
Cluster version is 4.18.0-0.test-2024-11-20-162921-ci-ln-n0pyvh2-latest


Upgradeable=False


  Reason: FeatureGates_RestrictedFeatureGates_TechPreviewNoUpgrade
  Message: Cluster operator config-operator should not be upgraded between minor versions: FeatureGatesUpgradeable: "TechPreviewNoUpgrade" does not allow updates


ReleaseAccepted=False


  Reason: RetrievePayload
  Message: Retrieving payload failed version="" image="quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18" failure=Unable to download and prepare the update: failed to prune pods: error deleting pods: failed to delete pod version--pjk7z: pods "version--pjk7z" not found


warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.


[root@preserve-dis016-centos9 cloud-user]# 
[root@preserve-dis016-centos9 cloud-user]# oc adm upgrade 
Cluster version is 4.18.0-0.test-2024-11-20-162921-ci-ln-n0pyvh2-latest


Upgradeable=False


  Reason: FeatureGates_RestrictedFeatureGates_TechPreviewNoUpgrade
  Message: Cluster operator config-operator should not be upgraded between minor versions: FeatureGatesUpgradeable: "TechPreviewNoUpgrade" does not allow updates


ReleaseAccepted=False


  Reason: RetrievePayload
  Message: Retrieving payload failed version="" image="quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18" failure=Unable to download and prepare the update: pod version--k2qb5 failed at pending with reason "" and message "" and status initcontainer cleanup is waiting with reason "SignatureValidationFailed" and message "image pull failed for quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18 because the signature validation failed: Source image rejected: A signature was required, but no signature exists"


warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.


[root@preserve-dis016-centos9 cloud-user]#  
 
[root@preserve-dis016-centos9 cloud-user]# oc get clusterversion version -o json  | jq -r  '.status.conditions[]|select(.type=="ReleaseAccepted")'
{
  "lastTransitionTime": "2024-11-20T17:54:09Z",
  "message": "Retrieving payload failed version=\"\" image=\"quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18\" failure=Unable to download and prepare the update: pod version--ncgpc failed at pending with reason \"\" and message \"\" and status initcontainer cleanup is waiting with reason \"SignatureValidationFailed\" and message \"image pull failed for quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18 because the signature validation failed: Source image rejected: A signature was required, but no signature exists\"",
  "reason": "RetrievePayload",
  "status": "False",
  "type": "ReleaseAccepted"
}
[root@preserve-dis016-centos9 cloud-user]#
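
For reference, the jq filter used above can be exercised offline against a saved dump. A minimal sketch, where the JSON below is a hand-written stand-in for real `oc get clusterversion version -o json` output (field values are illustrative only):

```shell
# Hand-written stand-in for `oc get clusterversion version -o json`;
# values are illustrative, not real cluster output.
cat > /tmp/cv.json <<'EOF'
{"status":{"conditions":[
  {"type":"Available","status":"True","reason":"AsExpected"},
  {"type":"ReleaseAccepted","status":"False","reason":"RetrievePayload"}
]}}
EOF
# Same filter as used against the live cluster above.
jq -r '.status.conditions[] | select(.type=="ReleaseAccepted") | .reason' /tmp/cv.json
```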

Clear the upgrade and trigger an upgrade to an invalid digest.

[root@preserve-dis016-centos9 cloud-user]# oc adm upgrade --clear 
Cancelled requested upgrade to quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18
[root@preserve-dis016-centos9 cloud-user]# oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000 --force --allow-explicit-upgrade
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requested update to release image quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000
[root@preserve-dis016-centos9 cloud-user]# 

ClusterVersion should report an image pull error.

[root@preserve-dis016-centos9 cloud-user]# oc adm upgrade 
Cluster version is 4.18.0-0.test-2024-11-20-162921-ci-ln-n0pyvh2-latest


Upgradeable=False


  Reason: FeatureGates_RestrictedFeatureGates_TechPreviewNoUpgrade
  Message: Cluster operator config-operator should not be upgraded between minor versions: FeatureGatesUpgradeable: "TechPreviewNoUpgrade" does not allow updates


ReleaseAccepted=False


  Reason: RetrievePayload
  Message: Retrieving payload failed version="" image="quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000" failure=Unable to download and prepare the update: pod version--c9df9 failed at pending with reason "" and message "" and status initcontainer cleanup is waiting with reason "ErrImagePull" and message "initializing source docker://quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000: reading manifest sha256:0000000000000000000000000000000000000000000000000000000000000000 in quay.io/openshift-release-dev/ocp-release: manifest unknown"


warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.


[root@preserve-dis016-centos9 cloud-user]# oc get clusterversion version -o json  | jq -r  '.status.conditions[]|select(.type=="ReleaseAccepted")'
{
  "lastTransitionTime": "2024-11-20T18:01:38Z",
  "message": "Retrieving payload failed version=\"\" image=\"quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000\" failure=Unable to download and prepare the update: pod version--vpmbj failed at pending with reason \"\" and message \"\" and status initcontainer cleanup is waiting with reason \"ErrImagePull\" and message \"initializing source docker://quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000: reading manifest sha256:0000000000000000000000000000000000000000000000000000000000000000 in quay.io/openshift-release-dev/ocp-release: manifest unknown\"",
  "reason": "RetrievePayload",
  "status": "False",
  "type": "ReleaseAccepted"
}
[root@preserve-dis016-centos9 cloud-user]# 
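
Because the CVO now watches the Pod directly instead of going through the Job controller, the init container's waiting reason is what surfaces in the condition message. A sketch of pulling that reason straight from a Pod dump; the JSON below is a hand-written stand-in for `oc get pod -o json` output of a version pod, not real cluster data:

```shell
# Hand-written stand-in for a `version--*` pod dump; contents are illustrative.
cat > /tmp/pod.json <<'EOF'
{"metadata":{"name":"version--k2qb5"},
 "status":{"phase":"Pending",
   "initContainerStatuses":[
     {"name":"cleanup",
      "state":{"waiting":{"reason":"SignatureValidationFailed",
                          "message":"image pull failed: signature validation failed"}}}]}}
EOF
# Extract the waiting reason of any init container, in the shape of the
# condition message shown above.
jq -r '.status.initContainerStatuses[] | select(.state.waiting) |
  "initcontainer \(.name) is waiting with reason \"\(.state.waiting.reason)\""' /tmp/pod.json
```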

@hongkailiu
Member Author

We decided this should go to 4.19 instead of 4.18, as it is not considered a critical fix.

/hold

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 22, 2024
@hongkailiu
Member Author

Branching is complete.

/hold cancel

@openshift-ci openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 26, 2024
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD b6b7345 and 2 for PR HEAD 4386a7b in total

11 similar comments
@openshift-ci
Contributor

openshift-ci Bot commented Nov 28, 2024

@hongkailiu: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot Bot merged commit 0659bfb into openshift:master Nov 28, 2024
@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: cluster-version-operator
This PR has been included in build cluster-version-operator-container-v4.19.0-202411280839.p0.g0659bfb.assembly.stream.el9.
All builds following this will include this PR.

@hongkailiu
Member Author

Testing with:

launch 4.19.0-ec.4 aws,techpreview

and

$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.19.0-ec.4   True        False         13m     Cluster version is 4.19.0-ec.4

$ oc get -o json featuregate cluster | jq .spec
{
  "featureSet": "TechPreviewNoUpgrade"
}

Upgrade to quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18:

oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18 --allow-explicit-upgrade
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
Requested update to release image quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18

$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2025-04-22T16:07:29Z'
message: 'Retrieving payload failed version="" image="quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18"
  failure=Unable to download and prepare the update: fetching update payload: pod
  version--f94hp failed at pending with reason "" and message "" and status initcontainer
  cleanup is waiting with reason "SignatureValidationFailed" and message "image pull
  failed for quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18
  because the signature validation failed: Source image rejected: A signature was
  required, but no signature exists"'
reason: RetrievePayload
status: 'False'
type: ReleaseAccepted

Clear the previous upgrade:

$ oc adm upgrade --clear
Cancelled requested upgrade to quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18

and upgrade to quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000:

$ oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000 --force --allow-explicit-upgrade
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requested update to release image quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000

$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2025-04-22T16:09:12Z'
message: 'Retrieving payload failed version="" image="quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000"
  failure=Unable to download and prepare the update: fetching update payload: pod
  version--drjdf failed at pending with reason "" and message "" and status initcontainer
  cleanup is waiting with reason "ErrImagePull" and message "initializing source docker://quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000:
  reading manifest sha256:0000000000000000000000000000000000000000000000000000000000000000
  in quay.io/openshift-release-dev/ocp-release: manifest unknown"'
reason: RetrievePayload
status: 'False'
type: ReleaseAccepted


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. qe-approved Signifies that QE has signed off on this PR
