Skip to content

daemon: Be very loud about failures of ostree-finalize-staged.service#3404

Merged
openshift-merge-robot merged 3 commits intoopenshift:masterfrom
cgwalters:finalize-staged-query
Nov 22, 2022
Merged

daemon: Be very loud about failures of ostree-finalize-staged.service#3404
openshift-merge-robot merged 3 commits intoopenshift:masterfrom
cgwalters:finalize-staged-query

Conversation

@cgwalters
Copy link
Copy Markdown
Member

@cgwalters cgwalters commented Nov 7, 2022

Depends: #3403

Also draft while I test this.


daemon: Be very loud about failures of ostree-finalize-staged.service

A while ago, as part opf
ostreedev/ostree#2589
and motivated by
https://bugzilla.redhat.com/show_bug.cgi?id=2075126

We added some validation of kernel arguments.

However...sadly still today we don't make failures of systemd
units very visible. Further, a failure in kernel argument validation
is almost certainly root caused to the failure of ostree-finalize-staged.service.

For maximum visibility, we want the error messages from both
to live in the operator status.

So this uses a new method in the client code check for such a failure,
and if we find one, it adds it into the error context we find
from e.g. comparing kargs.

To make it even more visible, we add a node annotation and an
event.


@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 7, 2022
@cgwalters cgwalters force-pushed the finalize-staged-query branch 2 times, most recently from 0b81698 to fd740ce Compare November 8, 2022 15:12
@cgwalters cgwalters marked this pull request as draft November 8, 2022 15:30
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 8, 2022
@cgwalters
Copy link
Copy Markdown
Member Author

/test e2e-aws
/test e2e-gcp-op

@cgwalters
Copy link
Copy Markdown
Member Author

The customer that is hitting failures here is still running 4.7...backporting this even to 4.11 will immediately hit on the fact that we only landed #3302 for 4.12. But, we can probably deal with that by inlining the code for the backport.

There is another alternative here, where I just try to backport the ostree side where we have a failing systemd unit. But in that path, we would want to gain awareness of systemd failures and highlight those somehow which seems larger in scope. Although, it'd certainly be very valuable for other reasons (custom systemd units are much more likely with layering).

Yet another variant which I'm just thinking about now - we could teach rpm-ostree to automatically roll this up into rpm-ostree status --json (it exists in the CLI text today in an ad-hoc way!)...then we backport that, plus a bit of glue to bridge it in the MCD.

But...my inclination is probably to just go with this for now.

@cgwalters cgwalters force-pushed the finalize-staged-query branch from fd740ce to 8610761 Compare November 8, 2022 20:16
@cgwalters
Copy link
Copy Markdown
Member Author

OK yep, with one bugfix I've verified this works! 🎉

  - lastTransitionTime: "2022-11-08T19:43:48Z"
    message: 'Node jupiter.verbum.org is reporting: "unexpected on-disk state validating
      against rendered-master-c5d18087256956974fb5ed36c34e3629: Missing expected kernel
      arguments: [foo=bar]; possible root cause: error: mkdir(boot/loader.1): Operation
      not permitted"'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded

Note the new "possible root cause". I forced an error here by doing (as root on the node) unshare -m /bin/sh -c 'mount -o remount,rw /boot && chattr +i /boot, then creating a dummy MC which tries to roll a kernel argument:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 55-controlplane-foobar
spec:
  config:
    ignition:
      version: 3.2.0
  kernelArguments:
    - foo=bar

(This is a SNO pet machine, so I did master)

@cgwalters cgwalters marked this pull request as ready for review November 9, 2022 14:48
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 9, 2022
@openshift-ci openshift-ci Bot requested a review from sinnykumari November 9, 2022 14:50
@cgwalters cgwalters marked this pull request as draft November 9, 2022 14:51
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 9, 2022
@cgwalters
Copy link
Copy Markdown
Member Author

(well, still draft until #3403 merges)

@cgwalters
Copy link
Copy Markdown
Member Author

OK...so this will help us detect the problem better. But, in playing with things we still have the problem in that the MCD will still try to remove nonexistent kernel arguments - I think we want to fix that so that recovering from this situation just requires using the force flag and not a manual addition of the karg and a reboot.

I seem to have added this but not used it...then I got confused
why it was null.
A while ago, as part opf
ostreedev/ostree#2589
and motivated by
https://bugzilla.redhat.com/show_bug.cgi?id=2075126

We added some validation of kernel arguments.

However...sadly still today we don't make failures of systemd
units very visible.  Further, a failure in kernel argument validation
is almost certainly root caused to the failure of `ostree-finalize-staged.service`.

For maximum visibility, we want the error messages from *both*
to live in the operator status.

So this uses a new method in the client code check for such a failure,
and if we find one, it adds it into the error context we find
from e.g. comparing kargs.

To make it even more visible, we add a node annotation *and* an
event.
@cgwalters cgwalters force-pushed the finalize-staged-query branch from 8610761 to c13af6d Compare November 9, 2022 19:21
@cgwalters cgwalters marked this pull request as ready for review November 9, 2022 19:21
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 9, 2022
@cgwalters
Copy link
Copy Markdown
Member Author

OK rebased 🏄 and I split out a prep patch for bumping the vendored deps.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Nov 9, 2022

@cgwalters: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/okd-scos-e2e-aws c13af6d link false /test okd-scos-e2e-aws

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@cgwalters
Copy link
Copy Markdown
Member Author

/test e2e-aws

@cgwalters
Copy link
Copy Markdown
Member Author

Ping?

@sinnykumari
Copy link
Copy Markdown
Contributor

This is great. This also makes much easier to figure out these failure by looking at logs without asking for journal logs.
/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Nov 22, 2022
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Nov 22, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, sinnykumari

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [cgwalters,sinnykumari]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 7ac378b into openshift:master Nov 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants