baremetal: stop ironic on bootstrap after masters are booted #3075
stbenjam wants to merge 1 commit into openshift:master
The bootstrap can now co-exist with machine-api being online. That means an instance of Ironic, dnsmasq, etc. could be running in both the cluster and the bootstrap at the same time. This causes problems because it is not deterministic which dnsmasq instance the worker provisioned by the machine-api will use. If it uses the bootstrap's instance, the worker will not come online. This is causing a percentage of baremetal installs to fail: the worker stays offline, so ingress and other operators never come up.
|
/label platform/baremetal |
    # We must ensure we stop the bootstrap's Ironic before that can happen.
    ACTIVE_NODES=$(curl -H "X-OpenStack-Ironic-API-Version: 1.9" 'http://localhost:6385/v1/nodes?provision_state=active' | jq '.nodes | length')
    if [[ "$ACTIVE_NODES" == "3" ]]; then
      sleep 60
We have to wait for the hosts to be active, and then give the installer enough time to see that as well via its polling.
    # machine-api is started in the cluster, there can end up being 2 DHCP servers running on the network.
    # We must ensure we stop the bootstrap's Ironic before that can happen.
    ACTIVE_NODES=$(curl -H "X-OpenStack-Ironic-API-Version: 1.9" 'http://localhost:6385/v1/nodes?provision_state=active' | jq '.nodes | length')
    if [[ "$ACTIVE_NODES" == "3" ]]; then
If we go with this approach, we'll need to template the number of nodes, as I know some folks have tested with a single master.
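For illustration, a templated version of the check could look roughly like the sketch below. This is not the code from the PR: `EXPECTED_MASTERS`, `masters_active`, `count_active_nodes`, `wait_and_stop_ironic`, and `stop_bootstrap_ironic` are hypothetical names, the 5-second poll interval is an assumption, and the step that actually stops Ironic is left as a placeholder because it is not shown in this excerpt. It also compares with `>=` rather than `==`, which is a design choice of the sketch.

```bash
#!/usr/bin/env bash
# Sketch only: all names below are hypothetical and not taken from the PR.

# Number of control plane hosts to wait for. Assumed to be templated in when
# the bootstrap script is rendered; defaults to 3 here.
EXPECTED_MASTERS="${EXPECTED_MASTERS:-3}"

# Succeeds when the active count is a number and has reached the expected count.
masters_active() {
    local active="$1" expected="$2"
    [[ "$active" =~ ^[0-9]+$ ]] && (( active >= expected ))
}

# Same request as in the diff: ask the bootstrap Ironic for active nodes
# and print how many there are.
count_active_nodes() {
    curl -s -H "X-OpenStack-Ironic-API-Version: 1.9" \
        'http://localhost:6385/v1/nodes?provision_state=active' | jq '.nodes | length'
}

# Placeholder: how Ironic is stopped on the bootstrap is not shown in this excerpt.
stop_bootstrap_ironic() {
    echo "stopping bootstrap Ironic (placeholder)"
}

# Poll until all expected masters are active, wait, then stop Ironic.
wait_and_stop_ironic() {
    until masters_active "$(count_active_nodes)" "$EXPECTED_MASTERS"; do
        sleep 5   # assumed poll interval
    done
    sleep 60      # give the installer's polling time to see the hosts as active
    stop_bootstrap_ironic
}
```

With this shape, a single-master test would run `EXPECTED_MASTERS=1 wait_and_stop_ironic` instead of relying on a hard-coded `3`.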
|
I wonder if we could instead make the MAO defer starting the ironic-api and dnsmasq containers until it sees bootstrap-complete, but this seems like a viable workaround. We also discussed restricting the bootstrap dnsmasq to only reply to the MAC addresses in the install-config. That may work, but I'm not certain whether we'd still see racy behavior, since that solution still leaves two DHCP servers on the network.
I like that.
I like this too, but after more thought I'm not sure it's enough. If we think we're ever going to have two DHCP servers running, even if one only responds to the control plane hosts, then we have to ensure that they never give the same IP to different hosts at the same time.
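To make the MAC restriction discussed above concrete, here is a rough sketch of how a bootstrap dnsmasq allowlist could be rendered. It relies on two standard dnsmasq options: `dhcp-host`, which marks a MAC address as known, and `dhcp-ignore=tag:!known`, which makes dnsmasq ignore every other client. The function name `render_dnsmasq_allowlist`, the `master` tag, and the example MAC addresses are hypothetical, and how the MACs would be read from the install-config is not covered. It only illustrates the restriction; it does not address the concern about two servers handing out the same IP.

```bash
#!/usr/bin/env bash
# Sketch only: function name, tag name, and MAC addresses are hypothetical.

# Print dnsmasq config lines that make the given MAC addresses "known"
# and tell dnsmasq to ignore DHCP requests from everyone else.
render_dnsmasq_allowlist() {
    local mac
    for mac in "$@"; do
        printf 'dhcp-host=%s,set:master\n' "$mac"
    done
    # Equivalent to ISC dhcpd's "deny unknown-clients".
    printf 'dhcp-ignore=tag:!known\n'
}
```

A caller would pass the control plane MACs and append the output to the bootstrap dnsmasq configuration, for example `render_dnsmasq_allowlist 52:54:00:00:00:01 52:54:00:00:00:02 52:54:00:00:00:03`.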
|
/hold pending outcome of discussion of other mechanisms |
|
Build SUCCESS, see build http://10.8.144.11:8080/job/dev-tools/1498/ |
|
@stbenjam: The following tests failed, say
|
/close In favor of #3079 for now |
|
@stbenjam: Closed this PR.