Bug 2111817: daemon: Add a workaround for bug 2111817 by cgwalters · Pull Request #3291 · openshift/machine-config-operator

cgwalters · 2022-08-16T14:43:46Z

So about a month ago we merged
0bcb4d2
which dropped all the old workarounds we had for rpm-ostree bugs.

I'm sure you are all shocked that less than a month later, we have
a new one to add in this section.

Actually though one of those previous bugs was arguably really a
systemd bug provoked by rpm-ostree...and that's the case again here.

This is all covered in
https://bugzilla.redhat.com/show_bug.cgi?id=2104619
and
coreos/rpm-ostree@21c82ff
but TL;DR systemd's implementation of InaccessiblePaths has
quadratic behavior in number of mounts, and so of course
we only discover this when we're testing heavily loaded OCP clusters.

The fix to rpm-ostree is inbound. But...adding this workaround which
has the MCD dynamically reconfigure the system to drop that config will
help unstick clusters so they can get the real fix.

So about a month ago we merged openshift@0bcb4d2 which dropped all the old workarounds we had for rpm-ostree bugs. I'm sure you are all shocked that less than a month later, we have a new one to add in this section. Actually though one of those previous bugs was arguably really a systemd bug provoked by rpm-ostree...and that's the case again here. This is all covered in https://bugzilla.redhat.com/show_bug.cgi?id=2104619 and coreos/rpm-ostree@21c82ff but TL;DR systemd's implementation of `InaccessiblePaths` has quadratic behavior in number of mounts, and so of course we only discover this when we're testing heavily loaded OCP clusters. The fix to rpm-ostree is inbound. But...adding this workaround which has the MCD dynamically reconfigure the system to drop that config will help unstick clusters so they can get the real fix.

cgwalters · 2022-08-16T14:47:51Z

Only compile tested locally, but CI should likely prove out the difference by comparing CPU time.

openshift-ci · 2022-08-16T14:53:34Z

@cgwalters: This pull request references Bugzilla bug 2111817, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.12.0) matches configured target release for branch (4.12.0)
bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @mike-nguyen

Details

In response to this:

Bug 2111817: daemon: Add a workaround for bug 2111817

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

yuqi-zhang · 2022-08-16T16:06:16Z

As a general question, is this because backporting rpm-ostree changes via RHEL would take substantially longer, and we just want to do this to unblock 4.11? Or is this also required as a parallel?

cgwalters · 2022-08-16T17:11:00Z

As a general question, is this because backporting rpm-ostree changes via RHEL would take substantially longer

That's most definitely part of it. But to be clear, I am starting all the paperwork for that...doing it properly involves things like building in RHEL9 first etc.

Or is this also required as a parallel?

I touched on the other rationale in the commit message:

But...adding this workaround which
has the MCD dynamically reconfigure the system to drop that config will
help unstick clusters so they can get the real fix.

In other words, to get the fixed OS, they need to be able to upgrade to it first...this unsticks that.

yuqi-zhang · 2022-08-16T17:45:32Z


+// See https://bugzilla.redhat.com/show_bug.cgi?id=2111817
+func bug2111817Workaround() error {
+	targetUnit := "/run/systemd/system/rpm-ostreed.service.d/bug2111817.conf"


Given the relative urgency of the issue involved, I'm not against doing it this way, but this does come with the consideration of whether we want to eventually remove this service from nodes.

Put another way, would this work with a dropin file template shipped in 4.11 instead? Then we can just remove the template file when we want it gone

Note that the MCD is writing the file explicitly into /run. That means it's transient, and will go away on reboot (and be reapplied the next time the MCD lands - hence when we drop the workaround from the MCD, it will automatically go away).

Ah yeah ok I see. That's fine then. I am +1 to shipping this since its only transient

yuqi-zhang · 2022-08-16T17:46:58Z

In other words, to get the fixed OS, they need to be able to upgrade to it first...this unsticks that.

Right, but as I understand it, this is only an issue in 4.11 and 4.12. If the rpm-ostree fix gets into 4.11, if the user goes from 4.10->4.11.z with fix, we wouldn't need this fix right? That was my main question (how long we would need to keep this workaround)

sdodson · 2022-08-16T18:08:49Z

In other words, to get the fixed OS, they need to be able to upgrade to it first...this unsticks that.

Right, but as I understand it, this is only an issue in 4.11 and 4.12. If the rpm-ostree fix gets into 4.11, if the user goes from 4.10->4.11.z with fix, we wouldn't need this fix right? That was my main question (how long we would need to keep this workaround)

I think Colin is pointing out that shipping a new rpm-ostree doesn't unstick already stuck upgrades. So the MCD workaround allows us to unstick existing stuck upgrades. I don't have a firm opinion on whether this is necessary but if we believe the rpm-ostree delivery timelines are in excess of one week (in RHCOS 4.11 by next Wednesday) this MCO change, provided it's all agreed to be safe, is probably worth pursuing.

cgwalters · 2022-08-16T18:10:05Z

If the rpm-ostree fix gets into 4.11, if the user goes from 4.10->4.11.z with fix, we wouldn't need this fix right?

Ah yes sorry, you are 100% correct here. I was thinking of the 4.11 ➡️ 4.12 scenario but that's not production yet.

So... https://bugzilla.redhat.com/show_bug.cgi?id=2118774 now tracks the 8.7 bug and I'll have to z-stream clone to 8.6. I think I can plumb that all through this week, fingers crossed.

Dunno. I am OK to close this and pursue that path exclusively if you prefer.

cgwalters · 2022-08-16T18:13:43Z

I launched a cluster from this PR and verified it works. You can tell by doing e.g.:

$ systemctl restart rpm-ostreed
$ systemctl stop rpm-ostreed
$ systemctl status rpm-ostreed

The CPU timing should be very low, e.g. here on a relatively idle node:

Aug 16 18:11:20 ci-ln-t48c9vk-f76d1-7vhvx-master-2 systemd[1]: rpm-ostreed.service: Consumed 160ms CPU time

But doing rm /run/systemd/system/rpm-ostreed.service.d/bug2111817.conf && systemctl daemon-reload to remove the workaround (i.e. current state) I see

Aug 16 18:12:05 ci-ln-t48c9vk-f76d1-7vhvx-master-2 systemd[1]: rpm-ostreed.service: Consumed 2.975s CPU time

yuqi-zhang · 2022-08-16T18:33:58Z

Dunno. I am OK to close this and pursue that path exclusively if you prefer.

I think based on Scott's assessment, this has value. I am +1 on getting this in in that context.

/retest

Just to be safe we pass the upgrade test. I can lgtm today since I don't see much risk here.

yuqi-zhang · 2022-08-16T18:37:52Z

Also gives us an opportunity to test the new Jira backport process :)

openshift-ci · 2022-08-16T21:20:59Z

@cgwalters: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

yuqi-zhang

/lgtm

Should be a safe workaround, to be removed when rpm-ostree fix goes in

openshift-ci · 2022-08-16T22:50:40Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [cgwalters,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci · 2022-08-16T22:53:10Z

@cgwalters: All pull requests linked via external trackers have merged:

Bugzilla bug 2111817 has been moved to the MODIFIED state.

Details

In response to this:

Bug 2111817: daemon: Add a workaround for bug 2111817

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

sinnykumari · 2022-08-17T09:43:20Z

Since we need this fix in 4.11, backporting.
/cherry-pick release-4.11

openshift-cherrypick-robot · 2022-08-17T09:44:05Z

@sinnykumari: #3291 failed to apply on top of branch "release-4.11":

Applying: daemon: Add a workaround for bug 2111817
Using index info to reconstruct a base tree...
M	pkg/daemon/rpm-ostree.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/daemon/rpm-ostree.go
CONFLICT (content): Merge conflict in pkg/daemon/rpm-ostree.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 daemon: Add a workaround for bug 2111817
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

Details

In response to this:

Since we need this fix in 4.11, backporting.
/cherry-pick release-4.11

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

sinnykumari · 2022-08-17T10:36:09Z

Manual backport PR #3292
Although Jira automation is messed up, this is going to be fun

openshift-ci Bot requested review from sinnykumari and yuqi-zhang August 16, 2022 14:48

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 16, 2022

cgwalters changed the title ~~daemon: Add a workaround for bug 2111817~~ Bug 2111817: daemon: Add a workaround for bug 2111817 Aug 16, 2022

openshift-ci Bot added bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Aug 16, 2022

openshift-ci Bot requested a review from mike-nguyen August 16, 2022 14:53

yuqi-zhang reviewed Aug 16, 2022

View reviewed changes

yuqi-zhang approved these changes Aug 16, 2022

View reviewed changes

openshift-ci Bot assigned yuqi-zhang Aug 16, 2022

openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Aug 16, 2022

openshift-merge-robot merged commit 3e333ad into openshift:master Aug 16, 2022

Conversation

cgwalters commented Aug 16, 2022

Uh oh!

cgwalters commented Aug 16, 2022

Uh oh!

openshift-ci Bot commented Aug 16, 2022

Uh oh!

yuqi-zhang commented Aug 16, 2022

Uh oh!

cgwalters commented Aug 16, 2022

Uh oh!

yuqi-zhang Aug 16, 2022

Choose a reason for hiding this comment

Uh oh!

cgwalters Aug 16, 2022

Choose a reason for hiding this comment

Uh oh!

yuqi-zhang Aug 16, 2022

Choose a reason for hiding this comment

Uh oh!

yuqi-zhang commented Aug 16, 2022

Uh oh!

sdodson commented Aug 16, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cgwalters commented Aug 16, 2022

Uh oh!

cgwalters commented Aug 16, 2022

Uh oh!

yuqi-zhang commented Aug 16, 2022

Uh oh!

yuqi-zhang commented Aug 16, 2022

Uh oh!

openshift-ci Bot commented Aug 16, 2022

Uh oh!

yuqi-zhang left a comment

Choose a reason for hiding this comment

Uh oh!

openshift-ci Bot commented Aug 16, 2022

Uh oh!

openshift-ci Bot commented Aug 16, 2022

Uh oh!

sinnykumari commented Aug 17, 2022

Uh oh!

openshift-cherrypick-robot commented Aug 17, 2022

Uh oh!

sinnykumari commented Aug 17, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

sdodson commented Aug 16, 2022 •

edited

Loading

sinnykumari commented Aug 17, 2022 •

edited

Loading