Skip to content

baremetal: stop ironic on bootstrap after masters are booted#3075

Closed
stbenjam wants to merge 1 commit intoopenshift:masterfrom
stbenjam:ironic-stop
Closed

baremetal: stop ironic on bootstrap after masters are booted#3075
stbenjam wants to merge 1 commit intoopenshift:masterfrom
stbenjam:ironic-stop

Conversation

@stbenjam
Copy link
Copy Markdown
Member

@stbenjam stbenjam commented Feb 7, 2020

The bootstrap can now co-exist with machine-api being online. That
means there could be an instance of Ironic, dnsmasq, etc running in
both the cluster and the bootstrap. This causes problems, as it's not
deterministic which dnsmasq instance the worker provisioned by the
machine-api will use. If it uses the bootstrap, then the worker will not
come online.

This is causing a percentage of baremetal installs to fail, with the
worker being offline, ingress and other operators never come up.

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 7, 2020
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign stbenjam
You can assign the PR to them by writing /assign @stbenjam in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

The bootstrap can now co-exist with machine-api being online. That
means there could be an instance of Ironic, dnsmasq, etc running in
both the cluster and the bootstrap. This causes problems, as it's not
deterministic which dnsmasq instance the worker provisioned by the
machine-api will use. If it uses the bootstrap, then the worker will not
come online.

This is causing a percentage of baremetal installs to fail, with the
worker being offline, ingress and other operators never come up.
@stbenjam
Copy link
Copy Markdown
Member Author

stbenjam commented Feb 7, 2020

/label platform/baremetal

@openshift-ci-robot openshift-ci-robot added the platform/baremetal IPI bare metal hosts platform label Feb 7, 2020
@stbenjam stbenjam changed the title [WIP] baremetal: stop ironic on bootstrap after masters are booted baremetal: stop ironic on bootstrap after masters are booted Feb 7, 2020
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 7, 2020
# We must ensure we stop the bootstrap's Ironic before that can happen.
ACTIVE_NODES=$(curl -H "X-OpenStack-Ironic-API-Version: 1.9" 'http://localhost:6385/v1/nodes?provision_state=active' | jq '.nodes | length')
if [[ "$ACTIVE_NODES" == "3" ]]; then
sleep 60
Copy link
Copy Markdown
Member Author

@stbenjam stbenjam Feb 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to wait for the hosts to be active, and give enough time for the installer to see that as well via it's polling

# machine-api is started in the cluster, there can end up being 2 DHCP servers running on the network.
# We must ensure we stop the bootstrap's Ironic before that can happen.
ACTIVE_NODES=$(curl -H "X-OpenStack-Ironic-API-Version: 1.9" 'http://localhost:6385/v1/nodes?provision_state=active' | jq '.nodes | length')
if [[ "$ACTIVE_NODES" == "3" ]]; then
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we go with this approach we'll need to template the number of nodes as I know some folks have tested with a single master

@hardys
Copy link
Copy Markdown

hardys commented Feb 7, 2020

I wonder if we could instead make the MAO defer starting the ironic-api and dnsmasq containers until it sees the bootstrap-complete, but this seems like a viable workaround.

We also discussed restricting the bootstrap dnsmasq to only reply to the mac addresses in the install-config, which may work but I'm not certain if we'd still see racy behavior since there are still going to be two DHCP servers on the network with that solution?

@dhellmann
Copy link
Copy Markdown
Contributor

I wonder if we could instead make the MAO defer starting the ironic-api and dnsmasq containers until it sees the bootstrap-complete, but this seems like a viable workaround.

I like that.

We also discussed restricting the bootstrap dnsmasq to only reply to the mac addresses in the install-config, which may work but I'm not certain if we'd still see racy behavior since there are still going to be two DHCP servers on the network with that solution?

I like this too, but after more thought I'm not sure it's enough. If we think we're ever going to have 2 servers running, even if one only responds to the control plane hosts, then we have to ensure that they do not give the same IP to different hosts at the same time.

@stbenjam
Copy link
Copy Markdown
Member Author

stbenjam commented Feb 7, 2020

/hold

pending outcome of discussion of other mechanisms

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 7, 2020
@metal3ci
Copy link
Copy Markdown

metal3ci commented Feb 7, 2020

Build SUCCESS, see build http://10.8.144.11:8080/job/dev-tools/1498/

@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@stbenjam: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-libvirt 4212443 link /test e2e-libvirt
ci/prow/e2e-aws-fips 4212443 link /test e2e-aws-fips
ci/prow/e2e-openstack 4212443 link /test e2e-openstack
ci/prow/e2e-ovirt 4212443 link /test e2e-ovirt
ci/prow/e2e-aws 4212443 link /test e2e-aws
ci/prow/e2e-aws-scaleup-rhel7 4212443 link /test e2e-aws-scaleup-rhel7

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@stbenjam
Copy link
Copy Markdown
Member Author

stbenjam commented Feb 7, 2020

/close

In favor of #3079 for now

@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@stbenjam: Closed this PR.

Details

In response to this:

/close

In favor of #3079 for now

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. platform/baremetal IPI bare metal hosts platform size/S Denotes a PR that changes 10-29 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants