Skip to content

Bug 2070805: pkg/cvo/updatepayload: Shift previous-download removal into the job#765

Merged
openshift-merge-robot merged 2 commits intoopenshift:masterfrom
wking:job-side-payload-cleanup
Apr 25, 2022
Merged

Bug 2070805: pkg/cvo/updatepayload: Shift previous-download removal into the job#765
openshift-merge-robot merged 2 commits intoopenshift:masterfrom
wking:job-side-payload-cleanup

Conversation

@wking
Copy link
Copy Markdown
Member

@wking wking commented Apr 18, 2022

I'd attempted to add this cleanup on the CVO side in 507b474 (#760), but @evakhoni pointed out that that doesn't work because the CVO's volume mount is readOnly: true:

2022-04-11T20:42:32.056428784Z W0411 20:42:32.056387       1 updatepayload.go:149] failed to prune update payload directory: unlinkat /etc/cvo/updatepayloads/brRTeZnZSIkZ2J2YMdQozg: read-only file system

This commit shifts removal into the job.

And because we want that removal to also cover "a previous job run half-populated the target release directory", I've removed the pre-fetch ValidateDirectory check. In cases where we already had valid content locally, that's one extra download, which seems like a lower price than assuming a previous partial download was actually complete. If we see many extra downloads, we use atomic renames or something so it's easy for an incoming CVO to distinguish between content that was completely populated by a previous job run from content that was incompletely populated.

@openshift-ci openshift-ci Bot added bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Apr 18, 2022
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 18, 2022

@wking: This pull request references Bugzilla bug 2070805, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.11.0) matches configured target release for branch (4.11.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @evakhoni

Details

In response to this:

Bug 2070805: pkg/cvo/updatepayload: Shift previous-download removal into the job

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 18, 2022
@wking wking force-pushed the job-side-payload-cleanup branch from 4be3a8f to 624f7fd Compare April 18, 2022 20:55
@jottofar
Copy link
Copy Markdown
Contributor

In cases where we already had valid content locally, that's one extra download, which seems like a lower price than assuming a previous partial download was actually complete. If we see many extra download

With https://bugzilla.redhat.com/show_bug.cgi?id=2072389 still open a typical extra download scenario is payload fails signature validation and user comes back and usesforce. However I'm thinking the fix for the aforementioned bug is to continually syncPayload (and thereby recheck preconditions) if the reason for payload load failure is precondition failed which, with this change, will continually download the payload too if I'm understanding correctly.

Comment thread pkg/cvo/updatepayload.go Outdated
@wking wking force-pushed the job-side-payload-cleanup branch 2 times, most recently from d70b3c9 to dee4b9b Compare April 19, 2022 17:26
@wking
Copy link
Copy Markdown
Member Author

wking commented Apr 19, 2022

However I'm thinking the fix for the aforementioned bug is to continually syncPayload (and thereby recheck preconditions) if the reason for payload load failure is precondition failed which, with this change, will continually download the payload too if I'm understanding correctly.

I've pushed d70b3c9 -> dee4b9b to pivot to populating {baseDir}/{targetName}-{randomSuffix} and then atomically rename that populated result to {baseDir}/{targetName}. That way, future targetUpdatePayloadDir calls using the ValidateDirectory precheck will successfully distinguish "previous job got everything for this release" (where we don't want a new download job) from "previous job got some bits over, but not everything" (where we do want a new download job).

Comment thread pkg/cvo/updatepayload.go Outdated
Comment thread pkg/cvo/updatepayload.go Outdated
toReleasePath := filepath.Join(tmpDir, payload.ReleaseManifestDir)
releaseCmd := fmt.Sprintf("mv %s %s", fromReleasePath, toReleasePath)

moveInPlaceCmd := fmt.Sprintf("mv %s %s", tmpDirCmd, tdir)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you may have to first create tdir to make the rename atomic, https://man7.org/linux/man-pages/man2/rename.2.html

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From POSIX:

This rename() function is equivalent for regular files to that defined by the ISO C standard. Its inclusion here expands that definition to include actions on directories and specifies behavior when the new parameter names a file that already exists. That specification requires that the action of the function be atomic.

The "If newpath already exists ..." section of the Linux man page is also in the "... and also when the target does exist..." category. Although it would be nice if someone said "this is atomic when the target does not exist" more explicitly on either side.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This article talks through the atomic rename approach, and from that, it looks like we might need a pre-rename fsync to be very secure about the transaction. I'm not sure that covering the kernel-crash corner case is worth the overhead of calling fsync.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've pushed 4be737e -> a5af89d with an updated commit message to walk through this.

@wking wking force-pushed the job-side-payload-cleanup branch 3 times, most recently from 295db8e to 4be737e Compare April 19, 2022 21:17
I'd attempted to add this cleanup on the CVO side in 507b474
(pkg/cvo/updatepayload: Prune previous payload downloads, 2022-04-03, openshift#760).
But Evgeni pointed out that that doesn't work because the CVO's volume
mount is 'readOnly: true':

  2022-04-11T20:42:32.056428784Z W0411 20:42:32.056387       1 updatepayload.go:149] failed to prune update payload directory: unlinkat /etc/cvo/updatepayloads/brRTeZnZSIkZ2J2YMdQozg: read-only file system

This commit shifts removal into the job.

To avoid issues with loading half-populated directories, I've also
adjusted the job to populate {baseDir}/{targetName}-{randomSuffix},
and then rename that populated result to {baseDir}/{targetName}.  That
way, future targetUpdatePayloadDir calls using the ValidateDirectory
precheck will successfully distinguish "previous job got everything
for this release" (where we don't want a new download job) from
"previous job got some bits over, but not everything" (where we do
want a new download job).  POSIX 2018 requires mv to use rename [1].
While rename discusses atomicity in the informative Rationale section,
there's nothing in the normative description that mentions atomicity
[2].  Still, "atomic renames" are a common dance on Linux, especially
when renaming within a directory, where it is a single directory table
being edited.  [3] talks through the pattern, and points out the need
for a pre-rename fsync or similar if you want to avoid:

1. Open a/file, write to it, and close the file descriptor.
2. Kernel holding file contents in memory, not yet on disk.
3. Atomically move a/ to b/.
4. Crash, losing any kernel-memory caches.
5. b/file exists, but it may be empty or partial.

I think the risk of kernel-crashes leaving us with currupted manifests
is small enough that we can ignore it, to save the cost of calling
fsync.

[1]: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/mv.html
[2]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/rename.html
[3]: https://lwn.net/Articles/789600/
@wking wking force-pushed the job-side-payload-cleanup branch 5 times, most recently from d8d8e2c to 452f872 Compare April 21, 2022 06:23
All of the input strings were internally controlled, so there wasn't a
risk of them containing a shell-sensitive character.  But that safety
was not immediately obvious when reading the code.  By converting to
initContainers, which run sequentially to completion before the next
container is run [1], we have the same effect without involving the
shell.  As a side benefit, we also get clearer status logging showing
exactly which steps have succeeded or failed.

[1]: https://kubernetes.io/docs/concepts/workloads/pods/init-containers/#understanding-init-containers
@wking wking force-pushed the job-side-payload-cleanup branch from 452f872 to c45a981 Compare April 21, 2022 06:29
@wking
Copy link
Copy Markdown
Member Author

wking commented Apr 21, 2022

Or better, I think in this case instead of returning commands-as-strings you could return an array of strings and avoid /bin/sh interpreting them at all.

I've added c45a981 with a pivot to initContainers for command chaining, so I can avoid sh and &&.

@jottofar
Copy link
Copy Markdown
Contributor

/test e2e-agnostic-upgrade

@jottofar
Copy link
Copy Markdown
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Apr 25, 2022
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 25, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jottofar, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@wking
Copy link
Copy Markdown
Member Author

wking commented Apr 25, 2022

Watchdog gap is unrelated.

/override ci/prow/e2e-agnostic-upgrade

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 25, 2022

@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic-upgrade

Details

In response to this:

Watchdog gap is unrelated.

/override ci/prow/e2e-agnostic-upgrade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 25, 2022

@wking: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 76f1885 into openshift:master Apr 25, 2022
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 25, 2022

@wking: All pull requests linked via external trackers have merged:

Bugzilla bug 2070805 has been moved to the MODIFIED state.

Details

In response to this:

Bug 2070805: pkg/cvo/updatepayload: Shift previous-download removal into the job

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking deleted the job-side-payload-cleanup branch April 25, 2022 22:25
wking added a commit to wking/cluster-version-operator that referenced this pull request Apr 27, 2022
Some backround on recent changes:

* 507b474 (pkg/cvo/updatepayload: Prune previous payload downloads,
  2022-04-03, openshift#760) attempted to add CVO-side directory removal, but
  that failed because the CVO mounts the shared volume 'readOnly:
  true'.

* a5af89d (pkg/cvo/updatepayload: Shift previous-download removal
  into the job, 2022-04-18, openshift#765) shifted removal into the job itself.
  As far as I can tell, this worked.

* c45a981 (pkg/cvo/updatepayload: Use initContainers instead of
  shell &&-chains, 2022-04-20, openshift#765) addressed concerns with unquoted
  shell arguments by pivoting to initContainers and dropping the
  shell.  This broke the * pathname expansion that rm depends on to
  find directories to remove.

This commit returns to using the shell to invoke the rm call, so we
get pathname expansion back [1].  But I avoid the possibility of
unquoted argument injection by using workingDir to bring in baseDir.

[1]: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_06_06
wking added a commit to wking/cluster-version-operator that referenced this pull request Apr 27, 2022
Some background on recent changes:

* 507b474 (pkg/cvo/updatepayload: Prune previous payload downloads,
  2022-04-03, openshift#760) attempted to add CVO-side directory removal, but
  that failed because the CVO mounts the shared volume 'readOnly:
  true'.

* a5af89d (pkg/cvo/updatepayload: Shift previous-download removal
  into the job, 2022-04-18, openshift#765) shifted removal into the job itself.
  As far as I can tell, this worked.

* c45a981 (pkg/cvo/updatepayload: Use initContainers instead of
  shell &&-chains, 2022-04-20, openshift#765) addressed concerns with unquoted
  shell arguments by pivoting to initContainers and dropping the
  shell.  This broke the * pathname expansion that rm depends on to
  find directories to remove.

This commit returns to using the shell to invoke the rm call, so we
get pathname expansion back [1].  But I avoid the possibility of
unquoted argument injection by using workingDir to bring in baseDir.

[1]: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_06_06
wking added a commit to wking/cluster-version-operator that referenced this pull request Apr 28, 2022
Some background on recent changes:

* 507b474 (pkg/cvo/updatepayload: Prune previous payload downloads,
  2022-04-03, openshift#760) attempted to add CVO-side directory removal, but
  that failed because the CVO mounts the shared volume 'readOnly:
  true'.

* a5af89d (pkg/cvo/updatepayload: Shift previous-download removal
  into the job, 2022-04-18, openshift#765) shifted removal into the job itself.
  As far as I can tell, this worked.

* c45a981 (pkg/cvo/updatepayload: Use initContainers instead of
  shell &&-chains, 2022-04-20, openshift#765) addressed concerns with unquoted
  shell arguments by pivoting to initContainers and dropping the
  shell.  This broke the * pathname expansion that rm depends on to
  find directories to remove.

This commit returns to using the shell to invoke the rm call, so we
get pathname expansion back [1].  But I avoid the possibility of
unquoted argument injection by using workingDir to bring in baseDir.

[1]: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_06_06
wking added a commit to wking/cluster-version-operator that referenced this pull request Apr 28, 2022
Some background on recent changes:

* 507b474 (pkg/cvo/updatepayload: Prune previous payload downloads,
  2022-04-03, openshift#760) attempted to add CVO-side directory removal, but
  that failed because the CVO mounts the shared volume 'readOnly:
  true'.

* a5af89d (pkg/cvo/updatepayload: Shift previous-download removal
  into the job, 2022-04-18, openshift#765) shifted removal into the job itself.
  As far as I can tell, this worked.

* c45a981 (pkg/cvo/updatepayload: Use initContainers instead of
  shell &&-chains, 2022-04-20, openshift#765) addressed concerns with unquoted
  shell arguments by pivoting to initContainers and dropping the
  shell.  This broke the * pathname expansion that rm depends on to
  find directories to remove.

This commit returns to using the shell to invoke the rm call, so we
get pathname expansion back [1].  But I avoid the possibility of
unquoted argument injection by using workingDir to bring in baseDir.

[1]: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_06_06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants