Bug 1913536: update.go: add broken symlink check + removal during unit enable #2338
@yuqi-zhang: This pull request references Bugzilla bug 1913536, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
/retest
Shouldn't we do a check for RHEL 7?
That's another option: I could hard-check for RHEL 7 and remove the broken symlinks that we originally wrote.
Edit: added a RHEL check.
The enable code does not assume multi-user.target.wants; the unit could have a different target.
Right, this attempts to fix ONLY multi-user.target.wants, because that's where we wrote broken symlinks back when we had hardcoding. This ONLY attempts to fix those, and ONLY in the list of enables, and other user error is not handled.
The effect of this is that any failure to enable the units will trigger the check. Since you are changing the on-disk state, you may have to do a systemctl daemon-reload.
Under the axiom that RHEL 7 can't handle null-target symlinks, I wonder if it would be simpler/safer to:
- check for symlinks in /etc/systemd/system (since you can't assume the unit is just for multi-user.target.wants)
- check if the target is dangling, and remove it if so
- then systemctl enable
- guard for RHEL 7 only
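For reference, the "check if the target is dangling, remove it" step in the list above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `removeDanglingSymlinks` and `demo` are hypothetical names, and the demo runs against a temporary directory rather than `/etc/systemd/system`.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// removeDanglingSymlinks is a hypothetical helper: it deletes symlinks that
// sit directly inside dir and whose targets no longer exist, and returns the
// names it removed. Regular files and valid symlinks are left alone.
func removeDanglingSymlinks(dir string) ([]string, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	var removed []string
	for _, entry := range entries {
		// ReadDir reports the entry's own type without following links.
		if entry.Type()&os.ModeSymlink == 0 {
			continue
		}
		path := filepath.Join(dir, entry.Name())
		// os.Stat follows the link; a "not exist" error means it dangles.
		if _, err := os.Stat(path); err == nil || !os.IsNotExist(err) {
			continue
		}
		if err := os.Remove(path); err != nil {
			return removed, err
		}
		removed = append(removed, entry.Name())
	}
	return removed, nil
}

// demo builds a throwaway directory with one valid and one dangling symlink,
// then runs the cleanup on it.
func demo() ([]string, error) {
	dir, err := os.MkdirTemp("", "wants")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)

	target := filepath.Join(dir, "real.service")
	if err := os.WriteFile(target, []byte("[Unit]\n"), 0o644); err != nil {
		return nil, err
	}
	if err := os.Symlink(target, filepath.Join(dir, "ok.service")); err != nil {
		return nil, err
	}
	if err := os.Symlink(filepath.Join(dir, "missing"), filepath.Join(dir, "broken.service")); err != nil {
		return nil, err
	}
	return removeDanglingSymlinks(dir)
}

func main() {
	removed, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("removed:", removed)
}
```

Scanning a whole directory like this is the broader variant; restricting the check to the units in the enable list would mean testing only those specific paths.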
> The effect of this is that any failure to enable the units will trigger the check. Since you are changing the on-disk state, you may have to do a systemctl daemon-reload.

Right, and I propagate the old error just in case. Put another way, I am just performing one extra "cleanup" when I fail the first time, so normal operation doesn't hit this.

> check for symlinks in /etc/systemd/system (since you can't assume the unit is just for multi-user.target.wants)

I don't want to do that, because then we'd potentially be managing items not defined in a machineconfig, or correcting user error, which I do not want to happen.
Basically, if user error occurs, I still want to error as usual. I don't want this to fix incorrect unit definitions, for example, or a bad link that a user wrote through a files section config. I am only attempting to clean up bad links from our own pre-4.7 code. I can try to make that clearer in the commit message if you think the approach makes sense.
Would you be OK if the logic stayed the same, but I only attempted this on RHEL 7? Or if I parsed the error specifically for "file exists", which is what the RHEL 7 bug was? (Somewhat hacky, I feel.)
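To make the two options concrete, here is a rough sketch of what each guard might look like. Both helpers are hypothetical and are not taken from the MCO: `isRHEL7` assumes the node's os-release content carries `ID="rhel"` and a `VERSION_ID` starting with 7, and `mentionsFileExists` assumes the failure text contains "file exists" in some capitalization. The sample strings are made up for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// isRHEL7 is a hypothetical guard: it parses os-release style KEY=VALUE
// content and reports whether it describes a RHEL 7 host.
func isRHEL7(osRelease string) bool {
	fields := map[string]string{}
	scanner := bufio.NewScanner(strings.NewReader(osRelease))
	for scanner.Scan() {
		key, value, ok := strings.Cut(strings.TrimSpace(scanner.Text()), "=")
		if !ok {
			continue
		}
		// Values may be quoted; strip surrounding quotes.
		fields[key] = strings.Trim(value, `"'`)
	}
	version := fields["VERSION_ID"]
	return fields["ID"] == "rhel" && (version == "7" || strings.HasPrefix(version, "7."))
}

// mentionsFileExists is the string-matching alternative: it only looks for
// the phrase in the error text, which is why it feels fragile.
func mentionsFileExists(message string) bool {
	return strings.Contains(strings.ToLower(message), "file exists")
}

func main() {
	// Illustrative os-release content, not copied from a real node.
	sample := "NAME=\"Red Hat Enterprise Linux Server\"\nID=\"rhel\"\nVERSION_ID=\"7.9\"\n"
	fmt.Println("RHEL 7:", isRHEL7(sample))
	fmt.Println("file exists:", mentionsFileExists("enable failed: File exists"))
}
```

The OS check keys off a stable property of the node, while the string match depends on the exact wording of an error message.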
/retest
Added a check for RHEL.
/retest
If unit enable fails, remove broken symlinks in multi-user.target.wants and try again. This fixes a bug where enables would fail on cluster upgrades with RHEL 7 nodes between 4.6 -> 4.7.

Context: before openshift#2145, the MCO hard-coded a symlink from /etc/systemd/system/$UNIT to /etc/systemd/system/multi-user.target.wants/$UNIT, which is not the case for every unit and thus caused broken symlinks. On RHCOS/FCOS, the systemd version is newer and is able to remove broken symlinks, but on RHEL 7 nodes, systemd does not first attempt to remove broken symlinks, so the enable fails. As a workaround, this PR attempts to remove broken symlinks when the first enable fails, and then tries again. Successful FCOS/RHCOS upgrades should not hit this, and failing ones would report full errors.

The error checking is perhaps a bit overkill, but the original bug case should only run through this logic once before it is fixed. Future errors are likely actual errors and will be reported as such.

Signed-off-by: Yu Qi Zhang <jerzhang@redhat.com>
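The enable, clean up, retry flow described in the commit message could be sketched as below. This is an illustration under stated assumptions, not the code in update.go: `enableWithCleanup`, `errFileExists`, and the injected `enable` and `cleanup` functions are hypothetical stand-ins for the real systemctl call and symlink removal.

```go
package main

import (
	"errors"
	"fmt"
)

// errFileExists is a hypothetical stand-in for the enable failure seen on
// RHEL 7 when a broken symlink is already in place.
var errFileExists = errors.New("file exists")

// enableWithCleanup tries enable once. Only if that fails does it run
// cleanup and retry, so a healthy node never reaches the cleanup path.
// If the retry also fails, both errors are reported.
func enableWithCleanup(units []string, enable, cleanup func([]string) error) error {
	firstErr := enable(units)
	if firstErr == nil {
		return nil
	}
	if err := cleanup(units); err != nil {
		return fmt.Errorf("enable failed: %v; cleanup failed: %w", firstErr, err)
	}
	if err := enable(units); err != nil {
		return fmt.Errorf("enable failed after cleanup: %w (first attempt: %v)", err, firstErr)
	}
	return nil
}

func main() {
	// Simulated node state: a broken symlink makes the first enable fail.
	broken := true
	enable := func([]string) error {
		if broken {
			return errFileExists
		}
		return nil
	}
	cleanup := func([]string) error {
		broken = false
		return nil
	}
	err := enableWithCleanup([]string{"example.service"}, enable, cleanup)
	fmt.Println("result:", err)
}
```

Passing the enable and cleanup steps in as functions keeps the retry logic testable without touching systemd. A persistent failure still surfaces as an error that includes the first attempt.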
b4b84cb to 8e90105
Rebased, all tests passed. Tested locally with 4.6.6 -> 4.6.11 -> 4.7 nightly with this PR plus a RHEL node: all errors were automatically fixed, the upgrade was successful, and no dangling symlinks remain. This should be good to go (assuming we're OK with the above assumptions).
sinnykumari left a comment
Overall LGTM.
Will leave final approval to Ben.
/retest
/approve After talking to Jerry offline, looks good to me.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: darkmuggle, sinnykumari, yuqi-zhang.
/retest Please review the full test history for this PR and help us cut down flakes.
@yuqi-zhang: All pull requests linked via external trackers have merged: Bugzilla bug 1913536 has been moved to the MODIFIED state.