Bug 2080058: pkg/cvo/updatepayload: Prune previous payload downloads #769
Conversation
Avoid [1]:

    mv: inter-device move failed: '/manifests' to '/etc/cvo/updatepayloads/HbO7IDc7tyIg9utw3sd_tg/manifests/manifests'; unable to remove target: Directory not empty

by pruning all scratch directories before downloading a new payload. Also shift the job pruning up here, so it's all collected together. For jobs, pending and active jobs are now added to the deletion queue too, and we pivot to not retaining any old jobs. Either the older jobs were for other targets, in which case we no longer care about them, or they are looking at the same directory we're about to target with the job we're about to launch, in which case they will likely step on each other's toes. If we want to return to keeping more jobs around, we'd need to pivot to target directory names that are more unique than just "pullspec hash", e.g. by including a job-launch timestamp.

The directory pruning avoids subsequent download jobs getting stuck after a corrupted download attempt (e.g. if something terminates a download job partway through its 'mv' call, subsequent download jobs would hit the same inter-device move failure on the non-empty target directory). Reproducing this issue locally, with /tmp on a different filesystem:

    $ mkdir -p a /tmp/a/a
    $ /bin/mv a /tmp
    /bin/mv: inter-device move failed: 'a' to '/tmp/a'; unable to remove target: Directory not empty
    $ /bin/mv --version
    mv (GNU coreutils) 8.31
    ...

The directory pruning will also avoid leaking previous sets of manifests. Before this commit, the /etc/cvo/updatepayloads directories on the control-plane nodes would gradually accumulate more and more release manifest sets as new targets were downloaded for inspection. The manifest sets aren't that big:

    $ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.10.6-x86_64
    Extracted release payload from digest sha256:88b394e633e09dc23aa1f1a61ededd8e52478edf34b51a7dbbb21d9abde2511a created at 2022-03-23T07:34:52Z
    $ du -hs manifests
    6.5M    manifests

But still, better to have the pruning than leak forever.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=2070805#c0
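For illustration, here is a minimal Go sketch of the pruning idea. The helper name and the hard-coded /etc/cvo/updatepayloads path are assumptions for the example, not the CVO's actual code, and as the follow-up below explains, the removal ultimately has to run inside the download job rather than in the CVO process:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// pruneBaseDir removes every existing entry under baseDir so a stale or
// partially-populated scratch directory cannot make a later 'mv' fail with
// "Directory not empty".
func pruneBaseDir(baseDir string) error {
	entries, err := os.ReadDir(baseDir)
	if err != nil {
		if os.IsNotExist(err) {
			return nil // nothing downloaded yet, nothing to prune
		}
		return err
	}
	for _, entry := range entries {
		path := filepath.Join(baseDir, entry.Name())
		if err := os.RemoveAll(path); err != nil {
			return fmt.Errorf("failed to prune %s: %w", path, err)
		}
	}
	return nil
}

func main() {
	if err := pruneBaseDir("/etc/cvo/updatepayloads"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```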
I'd attempted to add this cleanup on the CVO side in 507b474 (pkg/cvo/updatepayload: Prune previous payload downloads, 2022-04-03, openshift#760). But Evgeni pointed out that that doesn't work, because the CVO's volume mount is 'readOnly: true':

    2022-04-11T20:42:32.056428784Z W0411 20:42:32.056387 1 updatepayload.go:149] failed to prune update payload directory: unlinkat /etc/cvo/updatepayloads/brRTeZnZSIkZ2J2YMdQozg: read-only file system

This commit shifts removal into the job. To avoid issues with loading half-populated directories, I've also adjusted the job to populate {baseDir}/{targetName}-{randomSuffix}, and then rename that populated result to {baseDir}/{targetName}. That way, future targetUpdatePayloadDir calls using the ValidateDirectory precheck will successfully distinguish "previous job got everything for this release" (where we don't want a new download job) from "previous job got some bits over, but not everything" (where we do want a new download job).

POSIX 2018 requires mv to use rename [1]. While rename discusses atomicity in the informative Rationale section, there's nothing in the normative description that mentions atomicity [2]. Still, "atomic renames" are a common dance on Linux, especially when renaming within a directory, where it is a single directory table being edited. [3] talks through the pattern, and points out the need for a pre-rename fsync or similar if you want to avoid:

1. Open a/file, write to it, and close the file descriptor.
2. The kernel holds the file contents in memory, not yet on disk.
3. Atomically move a/ to b/.
4. Crash, losing any kernel-memory caches.
5. b/file exists, but it may be empty or partial.

I think the risk of kernel crashes leaving us with corrupted manifests is small enough that we can ignore it, to save the cost of calling fsync.

[1]: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/mv.html
[2]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/rename.html
[3]: https://lwn.net/Articles/789600/
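As a minimal Go sketch of that populate-then-rename dance (the function names, the placeholder payload, and the example target name are illustrative; the real job drives this with mv inside a container rather than in Go):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// downloadAndCommit fills a {baseDir}/{targetName}-{randomSuffix} scratch
// directory, and only renames it to {baseDir}/{targetName} once it is fully
// populated, so a later validation pass can treat the presence of the final
// directory as "the previous job got everything for this release".
func downloadAndCommit(baseDir, targetName string, populate func(dir string) error) error {
	scratch, err := os.MkdirTemp(baseDir, targetName+"-")
	if err != nil {
		return err
	}
	defer os.RemoveAll(scratch) // no-op once the rename below has succeeded

	if err := populate(scratch); err != nil {
		return fmt.Errorf("populating %s: %w", scratch, err)
	}

	// Within a single filesystem this is a plain rename(2).  We accept the
	// small risk of a kernel crash leaving partially-written files behind
	// rather than paying for an fsync before the rename.
	return os.Rename(scratch, filepath.Join(baseDir, targetName))
}

func main() {
	err := downloadAndCommit("/etc/cvo/updatepayloads", "HbO7IDc7tyIg9utw3sd_tg", func(dir string) error {
		return os.WriteFile(filepath.Join(dir, "placeholder.yaml"), []byte("{}\n"), 0o644)
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```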
All of the input strings were internally controlled, so there wasn't a risk of them containing a shell-sensitive character. But that safety was not immediately obvious when reading the code. By converting to initContainers, which run sequentially to completion before the next container is run [1], we have the same effect without involving the shell. As a side benefit, we also get clearer status logging showing exactly which steps have succeeded or failed.

[1]: https://kubernetes.io/docs/concepts/workloads/pods/init-containers/#understanding-init-containers
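As an illustration of the initContainers approach, here is a hedged Go sketch of such a Job; the container names, image, paths, and step breakdown are assumptions for the example, not the CVO's exact job definition:

```go
package payload

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// downloadJob expresses the sequential steps as init containers instead of
// a single shell "a && b && c" command.  Each init container must exit
// successfully before the next one starts, and each step reports its own
// status in the pod, so a failure points at the exact step that broke.
func downloadJob(name, image, baseDir, targetName string) *batchv1.Job {
	scratch := baseDir + "/" + targetName + "-scratch"
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: batchv1.JobSpec{
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					InitContainers: []corev1.Container{
						{Name: "cleanup", Image: image, Command: []string{"rm", "-fR", scratch}},
						{Name: "make-temporary-directory", Image: image, Command: []string{"mkdir", scratch}},
						{Name: "move-manifests", Image: image, Command: []string{"mv", "/manifests", scratch + "/manifests"}},
					},
					Containers: []corev1.Container{
						{Name: "rename-to-final-location", Image: image, Command: []string{"mv", scratch, baseDir + "/" + targetName}},
					},
					RestartPolicy: corev1.RestartPolicyOnFailure,
				},
			},
		},
	}
}
```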
@wking: This pull request references Bugzilla bug 2080058, which is invalid.
Some background on recent changes:

* 507b474 (pkg/cvo/updatepayload: Prune previous payload downloads, 2022-04-03, openshift#760) attempted to add CVO-side directory removal, but that failed because the CVO mounts the shared volume 'readOnly: true'.
* a5af89d (pkg/cvo/updatepayload: Shift previous-download removal into the job, 2022-04-18, openshift#765) shifted removal into the job itself. As far as I can tell, this worked.
* c45a981 (pkg/cvo/updatepayload: Use initContainers instead of shell &&-chains, 2022-04-20, openshift#765) addressed concerns with unquoted shell arguments by pivoting to initContainers and dropping the shell. This broke the '*' pathname expansion that rm depends on to find directories to remove.

This commit returns to using the shell to invoke the rm call, so we get pathname expansion back [1]. But I avoid the possibility of unquoted argument injection by using workingDir to bring in baseDir.

[1]: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_06_06
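A small Go sketch of what such a cleanup container spec could look like (the container name, and treating this as a standalone helper, are assumptions for the example rather than the CVO's exact definition):

```go
package payload

import corev1 "k8s.io/api/core/v1"

// cleanupContainer sketches the compromise described above: the prune step
// runs through a shell so that '*' pathname expansion works again, but the
// operator-controlled baseDir never appears in the command line; it is
// supplied via WorkingDir, so the fixed script has nothing to quote or
// inject into.
func cleanupContainer(image, baseDir string) corev1.Container {
	return corev1.Container{
		Name:  "cleanup",
		Image: image,
		// The script is a fixed string; the glob is expanded by the shell
		// relative to WorkingDir.
		Command:    []string{"sh", "-c", "rm -fR ./*"},
		WorkingDir: baseDir,
	}
}
```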
I fixed the base branch to point at 4.10. /bugzilla refresh
/bugzilla refresh
@wking: This pull request references Bugzilla bug 2080058, which is valid. 6 validation(s) were run on this bug.

Requesting review from QA contact:
/refresh

/skip
LalatenduMohanty left a comment:

/label backport-risk-assessed
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: LalatenduMohanty, wking
/label cherry-pick-approved
I think Tide is confused about my originally opening this PR against
@wking: All pull requests linked via external trackers have merged: Bugzilla bug 2080058 has been moved to the MODIFIED state.
Avoid:

    mv: inter-device move failed: '/manifests' to '/etc/cvo/updatepayloads/HbO7IDc7tyIg9utw3sd_tg/manifests/manifests'; unable to remove target: Directory not empty

by pruning all scratch directories before downloading a new payload.
Picks #760, #765, and #767 back to 4.10.