Skip to content

overlay: disable iscsi.service by default#1294

Merged
openshift-merge-robot merged 1 commit intoopenshift:masterfrom
jlebon:pr/disable-iscsi
May 29, 2023
Merged

overlay: disable iscsi.service by default#1294
openshift-merge-robot merged 1 commit intoopenshift:masterfrom
jlebon:pr/disable-iscsi

Conversation

@jlebon
Copy link
Copy Markdown
Member

@jlebon jlebon commented May 23, 2023

iscsi.service has Before=remote-fs-pre.target and After=network-online.target. This forces remote-fs-pre.target to block on network-online.target and hence in OCP, on ovs-configuration.service (which has Before=network-online.target).

So this transitively makes systemd-user-sessions.service block on network-online.target.

This was an issue in Fedora as well and was discussed in a devel thread[1]. iscsi.service was subsequently reworked[2][3] so that it was only activated if iSCSI was actually used by the system.

On RHEL 8, iscsi.service and co. were directly enabled by RPM scriptlets rather than using presets. In RHCOS, we explicitly make presets canonical[4] so we shipped with iscsi.service disabled by default. On RHEL 9, the units were fixed to use presets[5]. This is why we started seeing this issue after moving to RHEL 9.

So all we need in theory is to have the Fedora patch backported to RHEL
9. However, since we don't really need the functionality from iscsi.service by default in RHCOS, we can fast-track its (re-)disablement and not wait for the iscsi-starter.service workaround.

Note that iscsi.service is only used to bring up iSCSI sessions marked for autostart in /var/lib/iscsi/nodes and is separate from iscsid.service, which is what actually manages the iSCSI connections. In OpenShift, we rely on the latter only (e.g. configured iSCSI PVCs are done by the kubelet directly calling out to iscsiadm). It's also separate from iSCSI devices that use host bus adapters, which are transparent to RHCOS/OCP.

Fixes: https://issues.redhat.com/browse/OCPBUGS-11124

@openshift-ci openshift-ci Bot requested review from RishabhSaini and jschintag May 23, 2023 21:52
@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 23, 2023
@jlebon
Copy link
Copy Markdown
Member Author

jlebon commented May 23, 2023

@mkowalski Can you verify that the real life situation in OCPBUGS-11124 is indeed fixed if you also add a MachineConfig/Ignition fragment to disable iscsi.service?

{
  "ignition": {
    "version": "3.3.0"
  },
  "systemd": {
    "units": [
      {
        "enabled": false,
        "name": "iscsi.service"
      }
    ]
  }
}

@@ -1 +1,7 @@
disable iscsid.socket
Copy link
Copy Markdown
Member Author

@jlebon jlebon May 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside: this came in via 929ac48 (in #905) which doesn't have any context on what exactly this fixes, which isn't great. @travier, do you know why we needed this? I'd like to add a comment there (and otherwise just drop it).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arg sorry, it was meant as a workaround to be investigated later and then I forgot.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I disabled it as it was an additionally enabled unit and that made our kola test fail.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack. I'll remove it in a follow-up PR and if anything breaks, at least we'll be able to add some context to it. :)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I hadn't seen your second comment here. If that's the case, I think that was fixed by coreos/coreos-assembler#3275.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, we disable the iscsi.service service here, not the d one.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can drop this here or in another PR. Up to you.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'll drop it in a follow-up.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened follow-up in #1298.

@jlebon
Copy link
Copy Markdown
Member Author

jlebon commented May 23, 2023

It's possible (though IMO unlikely) some customers are configuring host-level OS-initiated iSCSI mounts and relying on iscsi.service to autostart sessions. Before 4.13, they would have had to enable it themselves. If they upgraded from 4.12, the e.g. MachineConfig to enable it would still be hanging around there. If they started at 4.13, they didn't technically have to enable it, but with this patch, they would have to (until the Fedora patch is backported).

@cgwalters
Copy link
Copy Markdown
Member

Nice investigation! That makes total sense.
/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 23, 2023
@openshift-ci-robot
Copy link
Copy Markdown

/retest-required

Remaining retests: 0 against base HEAD b3adeb1 and 2 for PR HEAD a6c9055 in total

@mkowalski
Copy link
Copy Markdown

Can you verify that the real life situation in OCPBUGS-11124 is indeed fixed if you also add a MachineConfig/Ignition fragment to disable iscsi.service?

I can confirm that disabling iscsi.service is solving the issue completely, i.e. hanging nodeip-configuration.service does not anymore block access to the system. Thanks a lot for finding this out!

@mike-nguyen
Copy link
Copy Markdown
Member

/retest

@openshift-ci-robot
Copy link
Copy Markdown

/retest-required

Remaining retests: 0 against base HEAD 425197c and 1 for PR HEAD a6c9055 in total

@travier
Copy link
Copy Markdown
Member

travier commented May 24, 2023

/lgtm

@openshift-ci-robot
Copy link
Copy Markdown

/retest-required

Remaining retests: 0 against base HEAD 2f1c135 and 0 for PR HEAD a6c9055 in total

@openshift-ci-robot
Copy link
Copy Markdown

/hold

Revision a6c9055 was retested 3 times: holding

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 24, 2023
@jlebon
Copy link
Copy Markdown
Member Author

jlebon commented May 26, 2023

/retest

jlebon added a commit to jlebon/fedora-coreos-config that referenced this pull request May 26, 2023
This test was originally added in RHCOS[[1]], but it's equally valid to
run in FCOS. That would have caught the original issue when it was still
present in Fedora.

[1]: openshift/os#1294
@jlebon
Copy link
Copy Markdown
Member Author

jlebon commented May 26, 2023

The test added in this PR is being upstreamed in coreos/fedora-coreos-config#2437. Once the FCOS submodule is updated to include it, we can remove our copy here.

jlebon added a commit to jlebon/fedora-coreos-config that referenced this pull request May 26, 2023
This test was originally added in RHCOS[[1]], but it's equally valid to
run in FCOS. That would have caught the original issue when it was still
present in Fedora.

[1]: openshift/os#1294
@jlebon
Copy link
Copy Markdown
Member Author

jlebon commented May 27, 2023

This requires coreos/coreos-assembler#3487.

@jlebon jlebon removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 27, 2023
`iscsi.service` has `Before=remote-fs-pre.target` *and*
`After=network-online.target`. This forces `remote-fs-pre.target`
to block on `network-online.target` and hence in OCP, on
`ovs-configuration.service` (which has `Before=network-online.target`).

So this transitively makes `systemd-user-sessions.service` block on
`network-online.target`.

This was an issue in Fedora as well and was discussed in a devel
thread[[1]]. `iscsi.service` was subsequently reworked[[2]][[3]] so that
it was only activated if iSCSI was actually used by the system.

On RHEL 8, `iscsi.service` and co. were directly enabled by RPM
scriptlets rather than using presets. In RHCOS, we explicitly make
presets canonical[[4]] so we shipped with `iscsi.service` disabled by
default. On RHEL 9, the units were fixed to use presets[[5]]. This is
why we started seeing this issue after moving to RHEL 9.

So all we need in theory is to have the Fedora patch backported to RHEL
9. However, since we don't really need the functionality from
`iscsi.service` by default in RHCOS, we can fast-track its
(re-)disablement and not wait for the `iscsi-starter.service` workaround.

Note that `iscsi.service` is only used to bring up iSCSI sessions
marked for autostart in `/var/lib/iscsi/nodes` and is separate from
`iscsid.service`, which is what actually manages the iSCSI connections.
In OpenShift, we rely on the latter only (e.g. configured iSCSI PVCs
are done by the kubelet directly calling out to `iscsiadm`). It's
also separate from iSCSI devices that use host bus adapters, which are
transparent to RHCOS/OCP.

Fixes: https://issues.redhat.com/browse/OCPBUGS-11124

[1]: https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/thread/HACVEJ3FMOIM2TOENOVH5CPOUNR7NCMS
[2]: https://src.fedoraproject.org/rpms/iscsi-initiator-utils/c/1e689cd0c6667eca838c85975a1b7a070209e5ad
[3]: https://src.fedoraproject.org/rpms/fedora-release/pull-request/246
[4]: https://github.com/coreos/fedora-coreos-config/blob/1553518214088a89d6a2360a6fcdddbd3915628a/manifests/ignition-and-ostree.yaml#L35-L44
[5]: https://bugzilla.redhat.com/show_bug.cgi?id=1930458
@jlebon jlebon force-pushed the pr/disable-iscsi branch from a6c9055 to b5c5a05 Compare May 27, 2023 18:30
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label May 27, 2023
jlebon added a commit to coreos/fedora-coreos-config that referenced this pull request May 27, 2023
This test was originally added in RHCOS[[1]], but it's equally valid to
run in FCOS. That would have caught the original issue when it was still
present in Fedora.

[1]: openshift/os#1294
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 28, 2023

@jlebon: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@jlebon
Copy link
Copy Markdown
Member Author

jlebon commented May 28, 2023

This one just needs another /lgtm now! (No changes, just restamped to tickle a full CI rerun after coreos/coreos-assembler#3487.)

Copy link
Copy Markdown
Contributor

@c4rt0 c4rt0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 29, 2023
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 29, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: c4rt0, cgwalters, jlebon, travier

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [c4rt0,cgwalters,jlebon,travier]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit ff3d5b0 into openshift:master May 29, 2023
@mkowalski
Copy link
Copy Markdown

/cherry-pick release-4.13

@openshift-cherrypick-robot
Copy link
Copy Markdown

@mkowalski: new pull request created: #1301

Details

In response to this:

/cherry-pick release-4.13

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mkowalski
Copy link
Copy Markdown

Thanks for this a lot, @jlebon! Given that we need backport to 4.13, should coreos/coreos-assembler#3487 get backport to rhcos-4.13 too?

@travier
Copy link
Copy Markdown
Member

travier commented May 30, 2023

It will need an f-c-c backport and bump as well

travier pushed a commit to travier/fedora-coreos-config that referenced this pull request Jun 1, 2023
This test was originally added in RHCOS[[1]], but it's equally valid to
run in FCOS. That would have caught the original issue when it was still
present in Fedora.

[1]: openshift/os#1294

(cherry picked from commit 5f322cb)
travier pushed a commit to coreos/fedora-coreos-config that referenced this pull request Jun 8, 2023
This test was originally added in RHCOS[[1]], but it's equally valid to
run in FCOS. That would have caught the original issue when it was still
present in Fedora.

[1]: openshift/os#1294

(cherry picked from commit 5f322cb)
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this pull request Oct 10, 2023
This test was originally added in RHCOS[[1]], but it's equally valid to
run in FCOS. That would have caught the original issue when it was still
present in Fedora.

[1]: openshift/os#1294
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this pull request Oct 10, 2023
This test was originally added in RHCOS[[1]], but it's equally valid to
run in FCOS. That would have caught the original issue when it was still
present in Fedora.

[1]: openshift/os#1294
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants