node_controller: mark unavailable if configs differ#699

Closed
runcom wants to merge 1 commit into openshift:master from runcom:fix-race-mcc

Conversation

@runcom
Member

@runcom runcom commented May 3, 2019

Skip Unreconcilable nodes to allow them to roll back though.
The NodeController shouldn't rely just on what it's syncing at that moment to deduce that a node is unavailable. It may happen that the pool is updating to rendered-config-A but isn't done yet, while the code would still apply a new rendered-config-B, causing more than one node at a time to go unschedulable. This patch should fix that.

@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label May 3, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 3, 2019
```go
nodeNotReady := !isNodeReady(node)
if dconfig == currentConfig && (dconfig != cconfig || nodeNotReady) {
// we want to be able to roll back if a bad MC caused an unreconcilable state
if (dconfig != cconfig || nodeNotReady) && dstate != daemonconsts.MachineConfigDaemonStateUnreconcilable {
```
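The intent of the hunk above can be sketched in isolation. This is a simplified model, not the real MCO types: `node`, `unavailable`, and the state strings are illustrative stand-ins for the annotation and daemon-state checks, and the pool's target config deliberately no longer appears in the predicate.

```go
package main

import "fmt"

// Simplified stand-ins for the MCO types; names are illustrative,
// not the real machine-config-operator API.
type node struct {
	current, desired string // rendered-config annotations set by the MCD
	ready            bool
	mcdState         string // e.g. "Done", "Working", "Unreconcilable"
}

// unavailable mirrors the patched check: a node counts as unavailable when
// its desired config differs from its current one or it is not ready --
// regardless of which config the pool is currently targeting -- except that
// Unreconcilable nodes are skipped so a bad MC can still be rolled back.
func unavailable(n node) bool {
	return (n.current != n.desired || !n.ready) && n.mcdState != "Unreconcilable"
}

func main() {
	mid := node{current: "rendered-A", desired: "rendered-B", ready: true, mcdState: "Working"}
	stuck := node{current: "rendered-A", desired: "rendered-bad", ready: true, mcdState: "Unreconcilable"}
	fmt.Println(unavailable(mid), unavailable(stuck)) // → true false
}
```

Dropping the `dconfig == currentConfig` guard is the key change: a node halfway to rendered-config-A stays "unavailable" even while the controller is already targeting rendered-config-B.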
Member

I think this makes sense...just a note to self that the function is now not the inverse of getReadyMachines.

(also this PR will conflict with the doc comments I added in the other PR)

Member Author

I think this makes sense...just a note to self that the function is now not the inverse of getReadyMachines.

I'm not really sure about this actually, but wanted to check what tests say :(

@cgwalters
Member

/test e2e-aws-op

@cgwalters
Member

OK right so this PR is aiming to fix #697 (comment)

I think I see now how it could be doing so. However, it feels like there's a lot of other related logic issues here. For example:

```go
func getCandidateMachines(pool *mcfgv1.MachineConfigPool, nodes []*corev1.Node, progress int) []*corev1.Node {
	acted := getReadyMachines(pool.Status.Configuration.Name, nodes)
	acted = append(acted, getUnavailableMachines(pool.Status.Configuration.Name, nodes)...)
```

If we have a node that went degraded, and we're targeting a new config, we should probably fix that one first right?

Also this code should recognize that it's always safe to revert a node to its current config. IOW we don't need to respect maxUnavailable when resetting back to current.
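The revert exemption suggested here could look roughly like the following; `pickCandidates` and `needsBudget` are hypothetical helpers sketched under the assumption that reverting a node to the config it is already running cannot make it unschedulable, so it need not consume maxUnavailable budget. This is not the controller's actual code.

```go
package main

import "fmt"

// Illustrative stand-in for a node's rendered-config annotations.
type node struct {
	current, desired string
}

// needsBudget: only a real config change can take the node down, so only
// that consumes maxUnavailable budget; a revert to current is always safe.
func needsBudget(n node, target string) bool {
	return n.current != target
}

// pickCandidates selects nodes to retarget at `target`, spending the
// unavailability budget only on nodes that will actually change config.
func pickCandidates(nodes []node, target string, capacity int) []node {
	var out []node
	for _, n := range nodes {
		if n.desired == target {
			continue // already targeted at the desired config
		}
		if needsBudget(n, target) {
			if capacity == 0 {
				continue // budget exhausted; defer this update
			}
			capacity--
		}
		out = append(out, n)
	}
	return out
}

func main() {
	nodes := []node{
		{current: "rendered-A", desired: "rendered-B"},     // revert B → A: free
		{current: "rendered-old", desired: "rendered-old"}, // real update: costs budget
	}
	fmt.Println(len(pickCandidates(nodes, "rendered-A", 1))) // → 2
}
```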

@cgwalters
Member

OK I feel like I'm down a big rabbit hole here...trying to figure out what we really intend things like "ready" versus "updated" versus "unavailable" to mean and what the state transitions should be.

In the end...what's clearly broken right now is reverting a MC. I don't understand quite how this test even passed before.

We also really need to verify that we're not ever exceeding maxUnavailable = 1.
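The invariant asked for here could be checked with something like the following property-style assertion over a snapshot of the pool. All names (`node`, `unavailable`, `assertMaxUnavailable`) are illustrative, not the operator's test helpers.

```go
package main

import "fmt"

// Illustrative simplified node model.
type node struct {
	current, desired string
	ready            bool
}

func unavailable(n node) bool { return n.current != n.desired || !n.ready }

// assertMaxUnavailable scans one snapshot of the pool and fails if the
// number of unavailable nodes exceeds the budget. Run against every
// intermediate state of a simulated rollout, it catches the "more than
// one node unschedulable at a time" bug this PR is about.
func assertMaxUnavailable(nodes []node, budget int) error {
	count := 0
	for _, n := range nodes {
		if unavailable(n) {
			count++
		}
	}
	if count > budget {
		return fmt.Errorf("invariant violated: %d unavailable > maxUnavailable %d", count, budget)
	}
	return nil
}

func main() {
	snapshot := []node{
		{current: "A", desired: "B", ready: true}, // one node mid-update
		{current: "A", desired: "A", ready: true},
	}
	if err := assertMaxUnavailable(snapshot, 1); err != nil {
		panic(err)
	}
}
```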

@runcom
Member Author

runcom commented May 3, 2019

We also really need to verify that we're not ever exceeding maxUnavailable = 1.

I believe we were never actually honoring that, since we were wrongly checking against a given MC for the pool (that's why I dropped that check here). So yeah, we need to make sure of that.

I don't understand quite how this test even passed before.

This has worked until now because we effectively allow e.g. one node to be degraded for a given rendered-MC of a pool while still making progress once the rendered-MC for the pool changes (I hope that makes sense, but maybe I'm not communicating it well).

"ready" versus "updated" versus "unavailable" to mean and what the state transitions should be.

yeah, I guess we need to reverse engineer all that code and cross our fingers that the current unit tests will catch any regression we might introduce during a potential refactor.

@cgwalters
Member

I took a stab at this in #701

@kikisdeliveryservice
Contributor

Is there a specific order these (now) 3 PRs need to go in?

@kikisdeliveryservice
Contributor

or is 701 going to be the PR?

@cgwalters
Member

or is 701 going to be the PR?

I think we don't know yet 😦 The other upgrade BZ is taking most of the mental energy right now and this bug is...ugly. I am fearing the rabbit hole that trying to fix it is going to lead us down.

@ashcrow
Member

ashcrow commented May 7, 2019

```
=== RUN   TestMCDeployed
panic: test timed out after 55m0s
```

Retesting in the event that's a flake.

/test e2e-aws-op

@cgwalters
Member

So this one did pass CI and it's obviously a lot simpler than #701

However, this PR isn't explicitly trying to fix https://bugzilla.redhat.com/show_bug.cgi?id=1707212 either, though it may be doing so also? I think the key is the combination of not filtering unavailable by the target config, and also explicitly checking the MCD state. Both PRs do that.

I found the logic very confusing before though, and 701 goes a lot farther in trying to address it; we avoid the duplicate logic in makeProgress, and the unit tests are also extended with some MCD state.

@runcom
Member Author

runcom commented May 8, 2019

I'm closing this in favor of #701 which I like more

@runcom runcom closed this May 8, 2019
cgwalters added a commit to cgwalters/machine-config-operator that referenced this pull request May 8, 2019
Bigger rewrite for openshift#699

This should ensure that the node controller avoids exceeding maxUnavailable,
by changing the notion of "unavailable" to mean nodes which are either
not ready, *or* may become not ready as they're working on an update.

That list is also further filtered by nodes which are degraded/unreconcilable.

Also while we're here, just hardcode `maxUnavailable=1` for any pool with
the name `master` since that's all we support.

Note that the clearer separation between "degraded" and "unavailable" required changes in the tests.
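The "hardcode maxUnavailable=1 for the master pool" rule from this commit can be sketched as follows. `pool` and `effectiveMaxUnavailable` are illustrative names, not the operator's API; the fallback-to-1 for an unset value is an assumption for the sketch.

```go
package main

import "fmt"

// Illustrative simplified pool; the real object is a MachineConfigPool.
type pool struct {
	name           string
	maxUnavailable int
}

// effectiveMaxUnavailable pins the master pool to 1 regardless of what the
// pool spec says, since that's all that is supported; other pools use their
// configured value, defaulting to 1 when unset.
func effectiveMaxUnavailable(p pool) int {
	if p.name == "master" {
		return 1
	}
	if p.maxUnavailable < 1 {
		return 1
	}
	return p.maxUnavailable
}

func main() {
	fmt.Println(effectiveMaxUnavailable(pool{name: "master", maxUnavailable: 3})) // → 1
	fmt.Println(effectiveMaxUnavailable(pool{name: "worker", maxUnavailable: 3})) // → 3
}
```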