[ci_gen_kustomize_values] Co-locate provisionserver with metal3 to prevent DHCP failures #3738

Open
mnietoji wants to merge 1 commit into openstack-k8s-operators:main from mnietoji:dhcp_provisioning_with_fix

Contributor

@mnietoji mnietoji commented Mar 4, 2026

…ures

When the metal3-dnsmasq pod restarts during a node's DHCP lease renewal on the provisioning network (172.23.0.0/24), NetworkManager fails to renew the lease and sets ipv4.method=disabled. The NMState operator then preserves this disabled state, causing permanent loss of provisioning network connectivity on that node.

The issue occurs when the OpenStackProvisionServer and metal3 pods run on different nodes. If metal3 restarts while a node is attempting DHCP renewal, the temporary unavailability of metal3-dnsmasq causes the renewal to fail.

Solution:
Automatically detect the node running the metal3 pod (via the k8s-app=metal3 label) and configure provisionServerNodeSelector in baremetalSetTemplate to schedule OpenStackProvisionServer on the same node. This keeps provisioning network connectivity intact because metal3-static-ip-manager maintains a static IP (172.23.0.3) on the metal3 node regardless of dnsmasq restarts.
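
For illustration only, the detection and selector steps described above could look roughly like the Python sketch below. It is not the code in this PR (the change itself lives in Ansible tasks under roles/ci_gen_kustomize_values). The function names `metal3_node_name` and `provision_server_values` are hypothetical, the input is assumed to be a PodList such as `oc get pods -l k8s-app=metal3 -o json` would return, and using the `kubernetes.io/hostname` label as the selector key is an assumption, since the description does not say which node label is used.

```python
import json
from typing import Optional


def metal3_node_name(pod_list: dict) -> Optional[str]:
    """Return the node name of the first scheduled pod in a Kubernetes PodList."""
    for pod in pod_list.get("items", []):
        node = pod.get("spec", {}).get("nodeName")
        if node:  # pods that are still Pending have no nodeName yet
            return node
    return None


def provision_server_values(node_name: str) -> dict:
    """Build a values fragment pinning the provision server to node_name.

    Assumption: the node carries the well-known kubernetes.io/hostname label
    with a value equal to its node name.
    """
    return {
        "baremetalSetTemplate": {
            "provisionServerNodeSelector": {"kubernetes.io/hostname": node_name}
        }
    }


if __name__ == "__main__":
    # Hand-written sample standing in for the output of
    #   oc get pods -l k8s-app=metal3 -o json
    sample = {"items": [{"spec": {"nodeName": "master-0"}}]}
    node = metal3_node_name(sample)
    if node is not None:
        print(json.dumps(provision_server_values(node), indent=2))
```

Returning None when no metal3 pod is scheduled lets the caller skip the selector entirely, so deployments without metal3 keep the default scheduling.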

Contributor

openshift-ci Bot commented Mar 4, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tosky for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@softwarefactory-project-zuul

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/ci-framework for 3738,d660efa12350eb88ab3c89b1d91a04abcbc82293

@mnietoji mnietoji force-pushed the dhcp_provisioning_with_fix branch from d660efa to 369ae18 Compare March 4, 2026 11:42
@softwarefactory-project-zuul

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/ci-framework for 3738,369ae185bc2b7d5a266e63c93224f86f1d2723cd

@mnietoji mnietoji force-pushed the dhcp_provisioning_with_fix branch from 369ae18 to 3fa51c9 Compare March 4, 2026 11:49
@softwarefactory-project-zuul

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/ci-framework for 3738,3fa51c9d28a6f3c53f0c99dbbdef1baf476724d5

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/3217375613864e0d83f7f88f394dcfaa

openstack-k8s-operators-content-provider FAILURE in 7m 16s
⚠️ podified-multinode-edpm-deployment-crc SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
⚠️ cifmw-crc-podified-edpm-baremetal SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
⚠️ cifmw-crc-podified-edpm-baremetal-minor-update SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
✔️ cifmw-pod-zuul-files SUCCESS in 4m 25s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 9m 21s
✔️ cifmw-pod-k8s-snippets-source SUCCESS in 4m 49s
✔️ cifmw-pod-pre-commit SUCCESS in 8m 44s
✔️ cifmw-architecture-validate-hci SUCCESS in 4m 54s
✔️ cifmw-molecule-ci_gen_kustomize_values SUCCESS in 5m 24s
✔️ cifmw-molecule-kustomize_deploy SUCCESS in 4m 14s

@mnietoji mnietoji force-pushed the dhcp_provisioning_with_fix branch from d0cf92f to 6b9c8b0 Compare March 4, 2026 14:44
@mnietoji mnietoji force-pushed the dhcp_provisioning_with_fix branch from 6b9c8b0 to 1339a1d Compare March 4, 2026 17:48
@mnietoji mnietoji force-pushed the dhcp_provisioning_with_fix branch from 1339a1d to cf58db9 Compare March 5, 2026 10:55
@mnietoji mnietoji force-pushed the dhcp_provisioning_with_fix branch 3 times, most recently from e29b915 to 08a2b2b Compare March 10, 2026 15:14
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/0b04bcb1f4f54d518d017da862888f74

✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 03m 30s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 22m 08s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 25m 42s
✔️ cifmw-crc-podified-edpm-baremetal-minor-update SUCCESS in 1h 49m 34s
✔️ cifmw-pod-zuul-files SUCCESS in 5m 28s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 8m 35s
✔️ cifmw-pod-k8s-snippets-source SUCCESS in 4m 49s
cifmw-pod-pre-commit TIMED_OUT in 31m 04s
✔️ cifmw-architecture-validate-hci SUCCESS in 4m 29s
✔️ cifmw-molecule-ci_gen_kustomize_values SUCCESS in 5m 33s
✔️ cifmw-molecule-kustomize_deploy SUCCESS in 4m 07s

@mnietoji mnietoji force-pushed the dhcp_provisioning_with_fix branch from 309d835 to 073a7c2 Compare April 14, 2026 15:11
@mnietoji mnietoji force-pushed the dhcp_provisioning_with_fix branch from 073a7c2 to 6ed9e2a Compare April 14, 2026 21:16
@mnietoji mnietoji force-pushed the dhcp_provisioning_with_fix branch 7 times, most recently from 7720fe4 to 2ffc089 Compare April 14, 2026 21:42
@mnietoji mnietoji changed the title from "[multiple] Co-locate provisionserver with metal3 to prevent DHCP fail…" to "[ci_gen_kustomize_values] Co-locate provisionserver with metal3 to prevent DHCP failures" Apr 14, 2026
@mnietoji mnietoji force-pushed the dhcp_provisioning_with_fix branch from 2ffc089 to 2fc519d Compare April 15, 2026 10:46
@mnietoji mnietoji force-pushed the dhcp_provisioning_with_fix branch from 2fc519d to d8e4383 Compare April 20, 2026 10:52
Contributor

@michburk michburk left a comment

Sorry for the delay!
A couple of style nits. I'm still uncertain about the location of the two tasks you're introducing.

Comment thread roles/ci_gen_kustomize_values/tasks/main.yml Outdated
Comment thread roles/ci_gen_kustomize_values/tasks/main.yml
@mnietoji mnietoji force-pushed the dhcp_provisioning_with_fix branch from d8e4383 to 2ac037f Compare April 24, 2026 10:52
@mnietoji mnietoji force-pushed the dhcp_provisioning_with_fix branch 3 times, most recently from 27c2747 to e36910b Compare April 24, 2026 11:19
@mnietoji mnietoji force-pushed the dhcp_provisioning_with_fix branch from e36910b to 976f9b1 Compare April 24, 2026 21:52
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/1ac27c6d8a5a4031b7def732f9c1b147

✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 09m 58s
podified-multinode-edpm-deployment-crc FAILURE in 21m 13s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 29m 45s
✔️ cifmw-crc-podified-edpm-baremetal-minor-update SUCCESS in 1h 56m 59s
✔️ cifmw-pod-zuul-files SUCCESS in 8m 21s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 9m 19s
✔️ cifmw-pod-k8s-snippets-source SUCCESS in 4m 46s
✔️ cifmw-pod-pre-commit SUCCESS in 9m 20s
✔️ cifmw-architecture-validate-hci SUCCESS in 4m 47s
✔️ cifmw-molecule-ci_gen_kustomize_values SUCCESS in 6m 33s

…event DHCP failures

When metal3-dnsmasq pod restarts during a node's DHCP lease renewal on
the provisioning network (172.23.0.0/24), NetworkManager fails to renew
and sets ipv4.method=disabled. NMState operator then preserves this
disabled state, causing permanent loss of provisioning network
connectivity on that node.

The issue occurs when OpenStackProvisionServer and metal3 pods run on
different nodes. If metal3 restarts while a node is attempting DHCP
renewal, the temporary unavailability of metal3-dnsmasq causes the
renewal to fail.

Solution:
Automatically detect the node running metal3 pod (via k8s-app=metal3
label) and configure provisionServerNodeSelector in baremetalSetTemplate
to schedule OpenStackProvisionServer on the same node. This ensures
provisioning network connectivity is maintained because
metal3-static-ip-manager maintains a static IP (172.23.0.3) on the
metal3 node regardless of dnsmasq restarts.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Miguel Angel Nieto Jimenez <mnietoji@redhat.com>
Contributor

michburk commented May 1, 2026

changes are looking good, just wanted to double check: is this tested and working as expected?
