Skip to content

Bug 2111817: daemon: Add a workaround for bug 2111817#3291

Merged
openshift-merge-robot merged 1 commit intoopenshift:masterfrom
cgwalters:fix-rpmostree-systemd-bug
Aug 16, 2022
Merged

Bug 2111817: daemon: Add a workaround for bug 2111817#3291
openshift-merge-robot merged 1 commit intoopenshift:masterfrom
cgwalters:fix-rpmostree-systemd-bug

Conversation

@cgwalters
Copy link
Copy Markdown
Member

So about a month ago we merged
0bcb4d2
which dropped all the old workarounds we had for rpm-ostree bugs.

I'm sure you are all shocked that less than a month later, we have
a new one to add in this section.

Actually though one of those previous bugs was arguably really a
systemd bug provoked by rpm-ostree...and that's the case again here.

This is all covered in
https://bugzilla.redhat.com/show_bug.cgi?id=2104619
and
coreos/rpm-ostree@21c82ff
but TL;DR systemd's implementation of InaccessiblePaths has
quadratic behavior in number of mounts, and so of course
we only discover this when we're testing heavily loaded OCP clusters.

The fix to rpm-ostree is inbound. But...adding this workaround which
has the MCD dynamically reconfigure the system to drop that config will
help unstick clusters so they can get the real fix.

So about a month ago we merged
openshift@0bcb4d2
which dropped all the old workarounds we had for rpm-ostree bugs.

I'm sure you are all shocked that less than a month later, we have
a new one to add in this section.

Actually though one of those previous bugs was arguably really a
systemd bug provoked by rpm-ostree...and that's the case again here.

This is all covered in
https://bugzilla.redhat.com/show_bug.cgi?id=2104619
and
coreos/rpm-ostree@21c82ff
but TL;DR systemd's implementation of `InaccessiblePaths` has
quadratic behavior in number of mounts, and so of course
we only discover this when we're testing heavily loaded OCP clusters.

The fix to rpm-ostree is inbound.  But...adding this workaround which
has the MCD dynamically reconfigure the system to drop that config will
help unstick clusters so they can get the real fix.
@cgwalters
Copy link
Copy Markdown
Member Author

Only compile tested locally, but CI should likely prove out the difference by comparing CPU time.

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 16, 2022
@cgwalters cgwalters changed the title daemon: Add a workaround for bug 2111817 Bug 2111817: daemon: Add a workaround for bug 2111817 Aug 16, 2022
@openshift-ci openshift-ci Bot added bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Aug 16, 2022
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Aug 16, 2022

@cgwalters: This pull request references Bugzilla bug 2111817, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.12.0) matches configured target release for branch (4.12.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @mike-nguyen

Details

In response to this:

Bug 2111817: daemon: Add a workaround for bug 2111817

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci Bot requested a review from mike-nguyen August 16, 2022 14:53
@yuqi-zhang
Copy link
Copy Markdown
Contributor

As a general question, is this because backporting rpm-ostree changes via RHEL would take substantially longer, and we just want to do this to unblock 4.11? Or is this also required as a parallel?

@cgwalters
Copy link
Copy Markdown
Member Author

As a general question, is this because backporting rpm-ostree changes via RHEL would take substantially longer

That's most definitely part of it. But to be clear, I am starting all the paperwork for that...doing it properly involves things like building in RHEL9 first etc.

Or is this also required as a parallel?

I touched on the other rationale in the commit message:

But...adding this workaround which
has the MCD dynamically reconfigure the system to drop that config will
help unstick clusters so they can get the real fix.

In other words, to get the fixed OS, they need to be able to upgrade to it first...this unsticks that.

Comment thread pkg/daemon/rpm-ostree.go

// See https://bugzilla.redhat.com/show_bug.cgi?id=2111817
func bug2111817Workaround() error {
targetUnit := "/run/systemd/system/rpm-ostreed.service.d/bug2111817.conf"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the relative urgency of the issue involved, I'm not against doing it this way, but this does come with the consideration of whether we want to eventually remove this service from nodes.

Put another way, would this work with a dropin file template shipped in 4.11 instead? Then we can just remove the template file when we want it gone

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the MCD is writing the file explicitly into /run. That means it's transient, and will go away on reboot (and be reapplied the next time the MCD lands - hence when we drop the workaround from the MCD, it will automatically go away).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah ok I see. That's fine then. I am +1 to shipping this since its only transient

@yuqi-zhang
Copy link
Copy Markdown
Contributor

In other words, to get the fixed OS, they need to be able to upgrade to it first...this unsticks that.

Right, but as I understand it, this is only an issue in 4.11 and 4.12. If the rpm-ostree fix gets into 4.11, if the user goes from 4.10->4.11.z with fix, we wouldn't need this fix right? That was my main question (how long we would need to keep this workaround)

@sdodson
Copy link
Copy Markdown
Member

sdodson commented Aug 16, 2022

In other words, to get the fixed OS, they need to be able to upgrade to it first...this unsticks that.

Right, but as I understand it, this is only an issue in 4.11 and 4.12. If the rpm-ostree fix gets into 4.11, if the user goes from 4.10->4.11.z with fix, we wouldn't need this fix right? That was my main question (how long we would need to keep this workaround)

I think Colin is pointing out that shipping a new rpm-ostree doesn't unstick already stuck upgrades. So the MCD workaround allows us to unstick existing stuck upgrades. I don't have a firm opinion on whether this is necessary but if we believe the rpm-ostree delivery timelines are in excess of one week (in RHCOS 4.11 by next Wednesday) this MCO change, provided it's all agreed to be safe, is probably worth pursuing.

@cgwalters
Copy link
Copy Markdown
Member Author

If the rpm-ostree fix gets into 4.11, if the user goes from 4.10->4.11.z with fix, we wouldn't need this fix right?

Ah yes sorry, you are 100% correct here. I was thinking of the 4.11 ➡️ 4.12 scenario but that's not production yet.

So... https://bugzilla.redhat.com/show_bug.cgi?id=2118774 now tracks the 8.7 bug and I'll have to z-stream clone to 8.6. I think I can plumb that all through this week, fingers crossed.

Dunno. I am OK to close this and pursue that path exclusively if you prefer.

@cgwalters
Copy link
Copy Markdown
Member Author

I launched a cluster from this PR and verified it works. You can tell by doing e.g.:

$ systemctl restart rpm-ostreed
$ systemctl stop rpm-ostreed
$ systemctl status rpm-ostreed

The CPU timing should be very low, e.g. here on a relatively idle node:

Aug 16 18:11:20 ci-ln-t48c9vk-f76d1-7vhvx-master-2 systemd[1]: rpm-ostreed.service: Consumed 160ms CPU time

But doing rm /run/systemd/system/rpm-ostreed.service.d/bug2111817.conf && systemctl daemon-reload to remove the workaround (i.e. current state) I see

Aug 16 18:12:05 ci-ln-t48c9vk-f76d1-7vhvx-master-2 systemd[1]: rpm-ostreed.service: Consumed 2.975s CPU time

@yuqi-zhang
Copy link
Copy Markdown
Contributor

Dunno. I am OK to close this and pursue that path exclusively if you prefer.

I think based on Scott's assessment, this has value. I am +1 on getting this in in that context.

/retest

Just to be safe we pass the upgrade test. I can lgtm today since I don't see much risk here.

@yuqi-zhang
Copy link
Copy Markdown
Contributor

Also gives us an opportunity to test the new Jira backport process :)

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Aug 16, 2022

@cgwalters: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Copy link
Copy Markdown
Contributor

@yuqi-zhang yuqi-zhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Should be a safe workaround, to be removed when rpm-ostree fix goes in

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Aug 16, 2022
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Aug 16, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [cgwalters,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 3e333ad into openshift:master Aug 16, 2022
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Aug 16, 2022

@cgwalters: All pull requests linked via external trackers have merged:

Bugzilla bug 2111817 has been moved to the MODIFIED state.

Details

In response to this:

Bug 2111817: daemon: Add a workaround for bug 2111817

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sinnykumari
Copy link
Copy Markdown
Contributor

Since we need this fix in 4.11, backporting.
/cherry-pick release-4.11

@openshift-cherrypick-robot
Copy link
Copy Markdown

@sinnykumari: #3291 failed to apply on top of branch "release-4.11":

Applying: daemon: Add a workaround for bug 2111817
Using index info to reconstruct a base tree...
M	pkg/daemon/rpm-ostree.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/daemon/rpm-ostree.go
CONFLICT (content): Merge conflict in pkg/daemon/rpm-ostree.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 daemon: Add a workaround for bug 2111817
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

Details

In response to this:

Since we need this fix in 4.11, backporting.
/cherry-pick release-4.11

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sinnykumari
Copy link
Copy Markdown
Contributor

sinnykumari commented Aug 17, 2022

Manual backport PR #3292
Although Jira automation is messed up, this is going to be fun

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants