daemon: Be very loud about failures of ostree-finalize-staged.service #3404
/test e2e-aws |
|
The customer that is hitting failures here is still running 4.7. Backporting this even to 4.11 will immediately run into the fact that we only landed #3302 for 4.12, but we can probably deal with that by inlining the code for the backport. There is another alternative here, where I just backport the ostree side, where we have a failing systemd unit. But on that path, we would want to gain awareness of systemd failures and highlight those somehow, which seems larger in scope. It would certainly be very valuable for other reasons, though (custom systemd units are much more likely with layering). Yet another variant which I'm just thinking about now - we could teach rpm-ostree to automatically roll this up into But...my inclination is probably to just go with this for now. |
|
OK yep, with one bugfix I've verified this works! 🎉 Note the new "possible root cause". I forced an error here by doing (as root on the node) (This is a SNO pet machine, so I did master) |
|
(well, still draft until #3403 merges) |
|
OK...so this will help us detect the problem better. But in playing with things, we still have the problem that the MCD will try to remove nonexistent kernel arguments. I think we want to fix that, so that recovering from this situation just requires using the force flag rather than manually adding the karg and rebooting. |
I seem to have added this but not used it...then I got confused why it was null.
Prep for using a new API.
A while ago, as part of ostreedev/ostree#2589 and motivated by https://bugzilla.redhat.com/show_bug.cgi?id=2075126 we added some validation of kernel arguments. However...sadly still today we don't make failures of systemd units very visible. Further, a failure in kernel argument validation almost certainly has its root cause in a failure of `ostree-finalize-staged.service`. For maximum visibility, we want the error messages from *both* to live in the operator status. So this uses a new method in the client code to check for such a failure, and if we find one, it adds it into the error context we find from e.g. comparing kargs. To make it even more visible, we add a node annotation *and* an event.
|
OK rebased 🏄 and I split out a prep patch for bumping the vendored deps. |
|
/test e2e-aws |
|
Ping? |
|
This is great. It also makes it much easier to figure out these failures by looking at logs, without asking for journal logs. |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: cgwalters, sinnykumari |
Depends: #3403
Also draft while I test this.
daemon: Be very loud about failures of ostree-finalize-staged.service
A while ago, as part of
ostreedev/ostree#2589
and motivated by
https://bugzilla.redhat.com/show_bug.cgi?id=2075126
we added some validation of kernel arguments.
However...sadly still today we don't make failures of systemd
units very visible. Further, a failure in kernel argument validation
almost certainly has its root cause in a failure of
`ostree-finalize-staged.service`.
For maximum visibility, we want the error messages from *both*
to live in the operator status.
So this uses a new method in the client code to check for such a failure,
and if we find one, it adds it into the error context we find
from e.g. comparing kargs.
To make it even more visible, we add a node annotation *and* an
event.
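For readers who want a concrete picture of the idea described above, here is a minimal sketch in Go. It is not the code from this PR: `unitStatus`, `finalizeFailure`, and `wrapWithRootCause` are hypothetical names invented for illustration, and the unit state is passed in as a plain struct rather than queried from systemd. It only shows the shape of the logic: if the unit failed, attach that failure as a "possible root cause" to whatever error the later check (e.g. the karg comparison) produced.

```go
package main

import (
	"errors"
	"fmt"
)

// unitStatus is a hypothetical, simplified view of a systemd unit.
// A real implementation would obtain this from systemd or rpm-ostree.
type unitStatus struct {
	Name        string
	ActiveState string // e.g. "active", "inactive", "failed"
	LastLog     string // placeholder for the unit's last error message
}

// finalizeFailure returns a non-nil error only when the unit is in the
// "failed" state; otherwise it returns nil.
func finalizeFailure(u unitStatus) error {
	if u.ActiveState != "failed" {
		return nil
	}
	return fmt.Errorf("%s failed: %s", u.Name, u.LastLog)
}

// wrapWithRootCause adds the unit failure to the primary error so that
// both messages surface together. With no primary error or no root
// cause, the primary error is returned unchanged.
func wrapWithRootCause(primary, rootCause error) error {
	if primary == nil || rootCause == nil {
		return primary
	}
	return fmt.Errorf("%w (possible root cause: %v)", primary, rootCause)
}

func main() {
	// Illustrative values only; these are not real log messages.
	kargErr := errors.New("kernel argument validation failed")
	unit := unitStatus{
		Name:        "ostree-finalize-staged.service",
		ActiveState: "failed",
		LastLog:     "placeholder error text",
	}
	fmt.Println(wrapWithRootCause(kargErr, finalizeFailure(unit)))
}
```

The combined error is what would then be reported in the operator status; the node annotation and the event mentioned above would carry the same root-cause text and are left out of this sketch.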