OTA-1307: pkg/cvo/updatepayload: Drop the Job controller for release-manifests downloads #1105
Conversation
Previously, the CVO launched a Job and waited for it to complete to get manifests for an incoming release payload. But the Job controller doesn't bubble up details about why the pod has trouble (e.g. Init:SignatureValidationFailed), so to get those details, we need direct access to the Pod. The Job controller doesn't seem like it's adding much value here, so we're dropping it and monitoring the Pod ourselves.
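For illustration, here is a minimal client-go sketch of the kind of Pod status the CVO can now read directly; the package and function name are hypothetical, not the PR's actual code:

package example

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// reportInitContainerState reads the retrieval Pod's status directly and
// surfaces init-container waiting reasons such as "SignatureValidationFailed"
// or "ImagePullBackOff", which a Job status would not bubble up on its own.
func reportInitContainerState(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	pod, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	for _, status := range pod.Status.InitContainerStatuses {
		if waiting := status.State.Waiting; waiting != nil && waiting.Reason != "" {
			return fmt.Errorf("init container %s is waiting with reason %q and message %q",
				status.Name, waiting.Reason, waiting.Message)
		}
	}
	return nil
}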
Force-pushed from f29874e to 622efd2
/retest-required
Force-pushed from 0db06cb to 3996ca9
Testing:
$ oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000 --force --allow-explicit-upgrade
$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2024-11-07T17:22:33Z'
message: 'Retrieving payload failed version="" image="quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000"
failure=Unable to download and prepare the update: pod version--sjfwz failed with
reason "DeadlineExceeded" and message "Pod was active on the node longer than the
specified deadline" and status initcontainer cleanup is waiting with reason "ImagePullBackOff"
and message "Back-off pulling image \"quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000\""'
reason: RetrievePayload
status: 'False'
type: ReleaseAccepted
$ oc adm upgrade --clear
$ oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:2bf757225166a2d1f51370f0bd2ec734bcc8cf759fa1a53ab9531ce8c584c17c --force --allow-explicit-upgrade
$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2024-11-07T17:28:53Z'
message: Payload loaded version="4.17.3" image="quay.io/openshift-release-dev/ocp-release@sha256:2bf757225166a2d1f51370f0bd2ec734bcc8cf759fa1a53ab9531ce8c584c17c"
architecture="amd64"
reason: PayloadLoaded
status: 'True'
type: ReleaseAccepted
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.17.2-0.test-2024-11-07-161355-ci-ln-zycfpg2-latest True True 19s Working towards 4.17.3: 10 of 901 done (1% complete)
@hongkailiu: This pull request references OTA-1307 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.
/jira refresh
Testing "unacceptable Sigstore signatures":
$ oc adm upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:221ddaead9c1fe0151650954a0094a63b5669ba3b06aefe916e073c18de58db7 --allow-explicit-upgrade
$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2024-11-07T22:08:00Z'
message: 'Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:221ddaead9c1fe0151650954a0094a63b5669ba3b06aefe916e073c18de58db7"
failure=The update cannot be verified: unable to verify sha256:221ddaead9c1fe0151650954a0094a63b5669ba3b06aefe916e073c18de58db7
against keyrings: verifier-public-key-redhat'
reason: RetrievePayload
status: 'False'
type: ReleaseAccepted

The failure has the nice message, but it seems not related to my PR. The above testing is wrong. I will redo it tomorrow.
Force-pushed from 4ecfda9 to 622e4b0
Redid the testing with:
### too old to have a Sigstore sig
$ oc adm release info quay.io/openshift-release-dev/ocp-release:4.12.0-x86_64 | grep 'Digest'
Digest: sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18
$ oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18 --allow-explicit-upgrade
$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2024-11-08T12:16:36Z'
message: 'Retrieving payload failed version="" image="quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18"
failure=Unable to download and prepare the update: pod version--kzxpw failed at
pending with reason "" and message "" and status initcontainer cleanup is waiting
with reason "SignatureValidationFailed" and message "image pull failed for quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18
because the signature validation failed: Source image rejected: A signature was
required, but no signature exists"'
reason: RetrievePayload
status: 'False'
type: ReleaseAccepted
$ oc adm upgrade --clear
$ oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000 --force --allow-explicit-upgrade
$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2024-11-08T12:20:30Z'
message: 'Retrieving payload failed version="" image="quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000"
failure=Unable to download and prepare the update: pod version--bv4n8 failed with
reason "DeadlineExceeded" and message "Pod was active on the node longer than the
specified deadline" and status initcontainer cleanup is waiting with reason "ImagePullBackOff"
and message "Back-off pulling image \"quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000\""'
reason: RetrievePayload
status: 'False'
type: ReleaseAccepted
$ oc adm upgrade --clear
$ oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:d2d34aafe0adda79953dd928b946ecbda34673180ee9a80d2ee37c123a0f510c --force --allow-explicit-upgrade
$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2024-11-08T12:22:52Z'
message: Payload loaded version="4.18.0-ec.3" image="quay.io/openshift-release-dev/ocp-release@sha256:d2d34aafe0adda79953dd928b946ecbda34673180ee9a80d2ee37c123a0f510c"
architecture="amd64"
reason: PayloadLoaded
status: 'True'
type: ReleaseAccepted
$ oc get clusterversion version
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.18.0-0.test-2024-11-08-112624-ci-ln-2nk1wv2-latest True True 71s Working towards 4.18.0-ec.3: 115 of 952 done (12% complete), waiting on etcd, kube-apiserver
TBH I am not too comfortable with a gutting like this at this point of the cycle. It seems like something that should be done at the start of the cycle so that we have a chance to weed out any fallout, not right at the end 🤔
Force-pushed from 622e4b0 to 48b579e
Tested again with commit 48b579e:
petr-muller left a comment:
Finally found time to properly read this, comments inline.
Name:      name,
Namespace: namespace,
Labels: map[string]string{
	"k8s-app": "retrieve-openshift-release",
This is a workload closely coupled to the CVO; I'd put them under a single k8s-app label.
Re-used the label here:
I had to bring back the label because it is used to list and prune the version-- pods. If the CVO pod carried the same label, it would be killed too.
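For context, a minimal sketch of how pruning by that label could work, reusing the client-go imports from the sketch above; the helper name is hypothetical and this is not the PR's exact prunePods implementation:

// prunePayloadPods lists the pods carrying the retrieval label and deletes
// them. This is why the label has to stay distinct from the CVO's own
// k8s-app label: a shared label would sweep up the CVO pod with the same
// selector.
func prunePayloadPods(ctx context.Context, client kubernetes.Interface, namespace string) error {
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "k8s-app=retrieve-openshift-release",
	})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		if err := client.CoreV1().Pods(namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}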
| klog.Warningf("failed to prune jobs: %v", err) | ||
| // Prune older pods and directories while gracefully handling errors. | ||
| if err := r.prunePods(ctx); err != nil { | ||
| klog.Warningf("failed to prune pods: %v", err) |
I wonder if a failure to prune pods is as safe to just warn about and continue as it was for jobs. We won't name-conflict because we add a random component to the pod name, but I wonder what the other implications of failing to delete pods are: is it safe to spawn new pods? If there is an old pod still somehow running and refusing to be deleted, won't the two pods conflict somehow, like clobber each other's output?
I honestly do not know the answer.
Let me make sure I understand the concern:
- CVO started Pod A to retrieve the payload.
- CVO got restarted (while Pod A was still running) and tried to prune Pod A, but failed for some unknown reason.
- CVO started Pod B to retrieve the payload.
Then Pod A and Pod B may "clobber each other's output"?
Would the same race exist if the Pods above are replaced with Jobs? The Job controller may fail to delete a Job's pod too.
Would the same race exist if the Pods above are replaced with Jobs?
https://kubernetes.io/docs/concepts/workloads/controllers/job/#parallel-jobs
I think I get the answer: "normally, only one Pod is started, unless the Pod fails" implies that with a Job, the Pod for Job B won't get started while the Pod for Job A is still running.
The concern is mainly whether it is safe to continue at L169: we know that r.prunePods failed, for whatever reason (err != nil). We log it (correct), but then we continue. It could be something like "the apiserver was disrupted at that very moment". Maybe temporary, maybe permanent.
Is it safe to continue working, as opposed to returning and maybe retrying the whole procedure later? By "clobbering the output" I meant whether two simultaneously executing pods (one that we failed to delete, and one that we will spawn after we continue) somehow conflict. The version pods' job is to extract the manifests into a filesystem shared with the CVO itself; hypothetically, the two pods could write the manifests into the same location (maybe, I did not look at the actual code).
It looks like continuing on error here was me in 507b474 (pkg/cvo/updatepayload: Prune previous payload downloads, 2022-04-03, #760) when we went from "no pruning at all" to "best-effort pruning". I'm comfortable moving to "require pruning" and failing out of this function on error instead of just warning about errors. But I agree with Hongkai that it seems like preexisting exposure.
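For reference, a sketch of the two options being weighed here, written as a fragment against the r.prunePods call quoted above (not the merged change itself):

// Option A (current behavior): best-effort pruning, warn and continue.
if err := r.prunePods(ctx); err != nil {
	klog.Warningf("failed to prune pods: %v", err)
}

// Option B ("require pruning"): fail out of the retrieval and let the caller
// retry later, so a pod we could not delete never overlaps with a freshly
// spawned one.
if err := r.prunePods(ctx); err != nil {
	return fmt.Errorf("failed to prune pods before starting a new retrieval: %w", err)
}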
Force-pushed from 48b579e to 7c3c704
/retest-required
/retest-required
The failure seems not related to us: the upgrade has reached the point where the CVO applied manifests to the cluster, which means the payload was retrieved well.
$ curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1105/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-ovn-upgrade-out-of-change/1856896318669066240/artifacts/e2e-agnostic-ovn-upgrade-out-of-change/gather-extra/artifacts/clusterversion.json' | yq -y '.items[].status.conditions'
- lastTransitionTime: '2024-11-14T03:53:38Z'
message: The update channel has not been configured.
reason: NoChannel
status: 'False'
type: RetrievedUpdates
- lastTransitionTime: '2024-11-14T03:53:38Z'
message: Capabilities match configured spec
reason: AsExpected
status: 'False'
type: ImplicitlyEnabledCapabilities
- lastTransitionTime: '2024-11-14T03:53:38Z'
message: Payload loaded version="4.18.0-0.ci.test-2024-11-14-031235-ci-op-p5g3g1kq-initial"
image="registry.build09.ci.openshift.org/ci-op-p5g3g1kq/release@sha256:75f3bafc70910062034c8d45f08b7fd46ee8b4f44c19839f385ac179197c9dfe"
architecture="amd64"
reason: PayloadLoaded
status: 'True'
type: ReleaseAccepted
- lastTransitionTime: '2024-11-14T04:29:13Z'
message: Done applying 4.18.0-0.ci.test-2024-11-14-031713-ci-op-p5g3g1kq-latest
status: 'True'
type: Available
- lastTransitionTime: '2024-11-14T07:11:40Z'
message: 'Multiple errors are preventing progress:
* Could not update role "openshift-kni-infra/host-networking-services" (754 of
892)
* Could not update service "openshift-service-ca-operator/metrics" (705 of 892)'
reason: MultipleErrors
status: 'True'
type: Failing
- lastTransitionTime: '2024-11-14T04:29:13Z'
message: 'Error while reconciling 4.18.0-0.ci.test-2024-11-14-031713-ci-op-p5g3g1kq-latest:
an unknown error has occurred: MultipleErrors'
reason: MultipleErrors
status: 'False'
type: Progressing
Tested with 4386a7b:
$ oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18 --allow-explicit-upgrade
$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2024-11-15T13:48:52Z'
message: 'Retrieving payload failed version="" image="quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18"
failure=Unable to download and prepare the update: pod version--dl4gq failed at
pending with reason "" and message "" and status initcontainer cleanup is waiting
with reason "SignatureValidationFailed" and message "image pull failed for quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18
because the signature validation failed: Source image rejected: A signature was
required, but no signature exists"'
reason: RetrievePayload
status: 'False'
type: ReleaseAccepted
$ oc adm upgrade --clear
$ oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000 --force --allow-explicit-upgrade
$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2024-11-15T13:49:43Z'
message: 'Retrieving payload failed version="" image="quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000"
failure=Unable to download and prepare the update: pod version--cfd2d failed at
pending with reason "" and message "" and status initcontainer cleanup is waiting
with reason "ErrImagePull" and message "initializing source docker://quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000:
reading manifest sha256:0000000000000000000000000000000000000000000000000000000000000000
in quay.io/openshift-release-dev/ocp-release: manifest unknown"'
reason: RetrievePayload
status: 'False'
type: ReleaseAccepted
$ oc adm upgrade --clear
$ oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:d2d34aafe0adda79953dd928b946ecbda34673180ee9a80d2ee37c123a0f510c --force --allow-explicit-upgrade
$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2024-11-15T13:50:05Z'
message: Payload loaded version="4.18.0-ec.3" image="quay.io/openshift-release-dev/ocp-release@sha256:d2d34aafe0adda79953dd928b946ecbda34673180ee9a80d2ee37c123a0f510c"
architecture="amd64"
reason: PayloadLoaded
status: 'True'
type: ReleaseAccepted
$ oc get clusterversion version
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.18.0-0.test-2024-11-15-125415-ci-ln-5bt62p2-latest True True 45m Working towards 4.18.0-ec.3: 3 of 952 done (0% complete)

Inspected the CVO log: it got stuck there because of #1108 (comment). So we upgrade to the most recent nightly:
$ oc adm upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:a784a557b2d7dec0376002a52197897d16864940ce128ab51f7e63d82ff9574d --force --allow-explicit-upgrade --allow-upgrade-with-warnings
$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2024-11-15T13:50:05Z'
message: Payload loaded version="4.18.0-0.ci-2024-11-15-140111" image="registry.ci.openshift.org/ocp/release@sha256:a784a557b2d7dec0376002a52197897d16864940ce128ab51f7e63d82ff9574d"
architecture="amd64"
reason: PayloadLoaded
status: 'True'
type: ReleaseAccepted
$ oc get clusterversion version
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.18.0-0.test-2024-11-15-125415-ci-ln-5bt62p2-latest True True 50m Working towards 4.18.0-0.ci-2024-11-15-140111: 115 of 954 done (12% complete), waiting on etcd, kube-apiserver

Now everything seems OK.
wking left a comment:
Nice work!
/lgtm
Petr had held for more review, and now I've also reviewed again, and you've worked up the testing, and everything looks good, so:
/hold cancel
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: hongkailiu, petr-muller, wking
/label qe-approved
Install a TechPreview cluster:
- Trigger an upgrade to a payload too old to have a Sigstore sig; the clusterversion operator should report a detailed error about Sigstore validation.
- Clear the upgrade and trigger an upgrade to an invalid digest; clusterversion should report an image pull error.
We decided this will go to 4.19 instead of 4.18, as it is not considered a "critical fix".
/hold
Branching is complete.
/hold cancel
@hongkailiu: all tests passed!
[ART PR BUILD NOTIFIER] Distgit: cluster-version-operator
Testing with:
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.19.0-ec.4 True False 13m Cluster version is 4.19.0-ec.4
$ oc get -o json featuregate cluster | jq .spec
{
"featureSet": "TechPreviewNoUpgrade"
}

Upgrade to:
$ oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18 --allow-explicit-upgrade
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
Requested update to release image quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18
$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2025-04-22T16:07:29Z'
message: 'Retrieving payload failed version="" image="quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18"
failure=Unable to download and prepare the update: fetching update payload: pod
version--f94hp failed at pending with reason "" and message "" and status initcontainer
cleanup is waiting with reason "SignatureValidationFailed" and message "image pull
failed for quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18
because the signature validation failed: Source image rejected: A signature was
required, but no signature exists"'
reason: RetrievePayload
status: 'False'
type: ReleaseAccepted

Clear the previous upgrade:
$ oc adm upgrade --clear
Cancelled requested upgrade to quay.io/openshift-release-dev/ocp-release@sha256:4c5a7e26d707780be6466ddc9591865beb2e3baa5556432d23e8d57966a2dd18

and upgrade to:
$ oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000 --force --allow-explicit-upgrade
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requested update to release image quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000
$ oc get clusterversion version -o yaml | yq -y '.status.conditions[]|select(.type=="ReleaseAccepted")'
lastTransitionTime: '2025-04-22T16:09:12Z'
message: 'Retrieving payload failed version="" image="quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000"
failure=Unable to download and prepare the update: fetching update payload: pod
version--drjdf failed at pending with reason "" and message "" and status initcontainer
cleanup is waiting with reason "ErrImagePull" and message "initializing source docker://quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000:
reading manifest sha256:0000000000000000000000000000000000000000000000000000000000000000
in quay.io/openshift-release-dev/ocp-release: manifest unknown"'
reason: RetrievePayload
status: 'False'
type: ReleaseAccepted
Previously, the CVO launched a Job and waited for it to complete to
get manifests for an incoming release payload. But the Job controller
doesn't bubble up details about why the pod has trouble
(e.g. Init:SignatureValidationFailed), so to get those details, we
need direct access to the Pod. The Job controller doesn't seem like
it's adding much value here, so we're dropping it and monitoring the
Pod ourselves.
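A minimal sketch of what "monitoring the Pod ourselves" can look like with client-go polling; the helper name, interval, and timeout here are assumptions, not the actual implementation:

package example

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForPodCompletion polls the retrieval Pod until it reports Succeeded or
// Failed, returning the failure so callers can surface it in the
// ReleaseAccepted condition.
func waitForPodCompletion(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	return wait.PollUntilContextTimeout(ctx, 2*time.Second, 5*time.Minute, true, func(ctx context.Context) (bool, error) {
		pod, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		switch pod.Status.Phase {
		case corev1.PodSucceeded:
			return true, nil
		case corev1.PodFailed:
			return false, fmt.Errorf("pod %s failed with reason %q and message %q", name, pod.Status.Reason, pod.Status.Message)
		}
		return false, nil
	})
}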