OCPBUGS-11124, OCPBUGS-11411: overlay: Inject pcrphase service definition #1279
mkowalski wants to merge 1 commit into openshift:master
|
Skipping CI for Draft Pull Request. |
|
@mkowalski: This pull request references Jira Issue OCPBUGS-11411, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. |
|
/test all |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: mkowalski |
|
OK, I have read some of the background on this. ISTM we should just mask all the |
|
I am not sure if I'm missing something, but many operations don't work on this service (i.e. I cannot mask it, just as we couldn't simply disable it). Maybe it's related to the fact that this is a systemd-provided service |
|
Sorry I know we've debugged this in multiple places in those bugs, but are you really, 100% sure that this will fix the problem?
One thing I notice here is that indeed we have but also: So I am still not convinced that I was wrong over here https://issues.redhat.com/browse/OCPBUGS-11124?focusedId=22056799&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-22056799 |
To address this part: is this with a custom RHCOS containing this PR? It worked for me locally on the latest RHCOS 4.13 build. If we go this way, I can provide more information about how we'd do this at compose time. If you want to test whether it fixes the issue, you can also write an Ignition config/MC to mask the unit. |
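For anyone who wants to try the masking suggestion on a standalone node, a minimal Butane sketch might look like the following. It assumes the unit to mask is `systemd-pcrphase.service` (the thread only refers to it as "pcrphase") and reuses the `fcos` variant and spec version from the mock config posted elsewhere in this conversation; related pcrphase units may need the same treatment.

```yaml
variant: fcos
version: 1.4.0
systemd:
  units:
    # Assumption: this is the unit under discussion. The full unit name is
    # not spelled out in the thread, so adjust it to match the node.
    - name: systemd-pcrphase.service
      # Masking links the unit to /dev/null so nothing can start it.
      mask: true
```

This only tests whether masking changes the login behaviour; it is not a statement about how the fix should be shipped at compose time.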
The more I look at this the more confused I am, but I am now looking at RHCOS 411.86.202302021552-0 and what I see is the following So we have OCP 4.11 which is not affected by this problem but But the more I think about it, the more I feel that maybe we should simply edit My overall rationale is as follows: between RHEL 8 and RHEL 9 we didn't change the definitions of already existing units. But we did introduce the pcrphase service, which has some dependencies. By removing those dependencies, we almost go back to the previous state. Note that the pcrphase service pulls |
|
/hold The following combination of changes gives me the ability to log in at any time (i.e. even while nodeip-configuration is still running) and This looks relatively elegant to me, the only disadvantage being that we cannot simply use a drop-in to replace the dependencies but need to copy-paste the whole unit. Of course instead of a copy-paste of |
…tion

This PR removes the `After=` section from the definition of systemd-pcrphase, because it currently blocks SSH access to a node on which nodeip-configuration or configure-ovs is, for any reason, not succeeding.

The self-healing functionality of the latter creates a scenario in which network-online.target is not yet reached but we already want to access the node for debugging purposes.

Because systemd-pcrphase by default blocks user sessions and depends on remote-fs, this creates a deadlock. To remediate this, we remove the dependency on remote-fs here. This is justified, as OpenShift nodes are not meant to use remote home directories.

We also modify the dependencies of systemd-user-sessions.service to state explicitly that the network is not needed in order to allow access to the node.

Fixes: OCPBUGS-11124
Fixes: OCPBUGS-11411
|
@mkowalski: This pull request references Jira Issue OCPBUGS-11411, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: |
|
@mkowalski: The following test failed, say |
|
I agree with Colin and I think that the problem is somewhere else. If masking this unit really unblocks us then I'll say OK but this still needs more investigation.
Edit: See comment below. |
That's not true. As per upstream documentation https://github.com/systemd/systemd/blob/250e9fadbcc0ca90e697d7efb40855b054ed3b8f/man/systemd.unit.xml#L1786 you cannot set empty |
Wow, nice catch. I did not know that. Thanks and sorry. |
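To illustrate the limitation being pointed out here, the sketch below shows a drop-in that tries to clear an ordering dependency. The unit name `systemd-pcrphase.service` and the drop-in file name are assumptions made for illustration. Per the systemd.unit documentation, dependency settings such as `After=` cannot be reset to an empty list, so a drop-in can only add dependencies, never remove them.

```yaml
variant: fcos
version: 1.4.0
systemd:
  units:
    - name: systemd-pcrphase.service      # assumed unit name
      dropins:
        - name: 10-clear-after.conf       # hypothetical drop-in name
          contents: |
            [Unit]
            # Dependency settings cannot be reset to an empty list, so the
            # ordering shipped in the original unit file stays in effect.
            After=
```

Removing a dependency therefore means overriding the entire unit file, which is the copy-paste trade-off discussed in this thread.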
|
This unfortunately makes this whole fix less desirable for us as we'll have to keep those files in sync with the systemd ones. If we go this route, a sed line in a post-process script would be more maintainable. |
Agreed, that makes sense. Can we just agree here on whether pcrphase itself should be masked or copied with its dependencies removed? For user-sessions.service there is only one way of dealing with the issue (modifying |
|
Shouldn't masking the unit also work in this case? |
|
One thing that would really help drill down into the exact issue is to try to reproduce this outside of OCP on a single node RHCOS. E.g. have a Butane config which creates a systemd service similar in effect to the systemd units concerned here (e.g. with the same dependencies, but the actual I briefly tried this exercise, and did get it to block logins on 4.13, but it exhibited the same behaviour on 4.12, so clearly I wasn't capturing all the subtleties. |
|
Totally agree about isolating this. @jlebon can you post your butane config? |
|
Here's what I've been playing with:

```yaml
variant: fcos
version: 1.4.0
systemd:
  units:
    # override the builtin openvswitch.service
    - name: openvswitch.service
      enabled: true
      contents: |
        [Unit]
        Description=Fake Open vSwitch
        # These match openvswitch.service
        Before=network.target network.service
        After=network-pre.target ovsdb-server.service ovs-vswitchd.service
        PartOf=network.target
        Requires=ovsdb-server.service
        Requires=ovs-vswitchd.service
        [Service]
        Type=oneshot
        RemainAfterExit=yes
        ExecStart=true
        [Install]
        WantedBy=multi-user.target
    - name: nodeip-configuration.service
      enabled: true
      contents: |
        [Unit]
        Description=Fake nodeip-configuration
        # These match the nodeip-configuration.service template
        Wants=NetworkManager-wait-online.service crio-wipe.service
        After=NetworkManager-wait-online.service ignition-firstboot-complete.service crio-wipe.service
        Before=kubelet.service crio.service ovs-configuration.service
        [Service]
        Type=oneshot
        RemainAfterExit=yes
        ExecStart=sleep infinity
        [Install]
        WantedBy=multi-user.target
    - name: ovs-configuration.service
      enabled: true
      contents: |
        [Unit]
        Description=Fake ovs-configuration
        # These match the ovs-configuration.service template
        Requires=openvswitch.service
        Wants=NetworkManager-wait-online.service
        After=NetworkManager-wait-online.service openvswitch.service network.service nodeip-configuration.service
        Before=network-online.target kubelet.service crio.service node-valid-hostname.service
        [Service]
        Type=oneshot
        RemainAfterExit=yes
        ExecStart=sleep infinity
        [Install]
        WantedBy=multi-user.target
```

So likely there's some other detail I'm missing in this mock test that actually makes it hang on 4.13 but not 4.12. |
I may need to digest the details, but this is not the behaviour we see in OCP 4.13; what I do see when setting Not sure what's missing from the Butane config, but I don't understand how we could possibly get a console with In fact, if we consider only the serial console, 4.12 and 4.13 are the same. The difference is that on 4.13, when the console is stuck, I cannot SSH to the node. On 4.12 I can (so no console login, but SSH is possible). |
|
That's interesting. I'd like to dig into this. I'm mostly offline today but will take a look at it more deeply tomorrow. |
|
OK, I think this is superseded by #1294. |
|
@mkowalski: This pull request references Jira Issue OCPBUGS-11411. The bug has been updated to no longer refer to the pull request using the external bug tracker. |
