node_controller: mark unavailable if configs differ #699

runcom wants to merge 1 commit into openshift:master
Conversation
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: runcom

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
Skip Unreconcilable nodes to allow them to roll back though. The NodeController shouldn't rely just on what it's syncing at that moment to deduce that a node is unavailable. It may happen that the pool is updating to rendered-config-A but isn't done, and the code would still go ahead and apply a new rendered-config-B, causing more than one node at a time to go unschedulable. This patch should fix that.
```diff
  nodeNotReady := !isNodeReady(node)
- if dconfig == currentConfig && (dconfig != cconfig || nodeNotReady) {
+ // we want to be able to roll back if a bad MC caused an unreconcilable state
+ if (dconfig != cconfig || nodeNotReady) && dstate != daemonconsts.MachineConfigDaemonStateUnreconcilable {
```
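For context, here is a minimal sketch of how a check like this could sit inside the controller's unavailability helper. The helper name `getUnavailableMachines`, the annotation keys, and the state constant are modeled on machine-config-operator conventions, but read them as assumptions rather than the actual code:

```go
package node

import corev1 "k8s.io/api/core/v1"

// Annotation keys and the unreconcilable state value, modeled on
// machine-config-daemon conventions; treat the exact strings as assumptions.
const (
	currentConfigAnnotation = "machineconfiguration.openshift.io/currentConfig"
	desiredConfigAnnotation = "machineconfiguration.openshift.io/desiredConfig"
	daemonStateAnnotation   = "machineconfiguration.openshift.io/state"
	stateUnreconcilable     = "Unreconcilable"
)

// isNodeReady is a stand-in for the real readiness check on node conditions.
func isNodeReady(node *corev1.Node) bool {
	for _, c := range node.Status.Conditions {
		if c.Type == corev1.NodeReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}

// getUnavailableMachines (hypothetical) counts a node as unavailable whenever
// its desired and current configs differ or it is not ready -- regardless of
// which rendered config the pool is targeting right now -- but skips
// Unreconcilable nodes so a bad MachineConfig can still be rolled back.
func getUnavailableMachines(nodes []*corev1.Node) []*corev1.Node {
	var unavailable []*corev1.Node
	for _, node := range nodes {
		cconfig := node.Annotations[currentConfigAnnotation]
		dconfig := node.Annotations[desiredConfigAnnotation]
		dstate := node.Annotations[daemonStateAnnotation]
		nodeNotReady := !isNodeReady(node)
		if (dconfig != cconfig || nodeNotReady) && dstate != stateUnreconcilable {
			unavailable = append(unavailable, node)
		}
	}
	return unavailable
}
```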
I think this makes sense...just a note to self that the function is now not the inverse of getReadyMachines.
(also this PR will conflict with the doc comments I added in the other PR)
> I think this makes sense...just a note to self that the function is now not the inverse of getReadyMachines.
I'm not really sure about this actually, but wanted to check what tests say :(
/test e2e-aws-op
OK, right, so this PR is aiming to fix #697 (comment); I think I see now how it could be doing so. However, it feels like there are a lot of other related logic issues here. For example: if we have a node that went degraded, and we're targeting a new config, we should probably fix that one first, right? Also, this code should recognize that it's always safe to revert a node to its current config. IOW, we don't need to respect maxUnavailable for that.
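To illustrate the "reverting is always safe" point: if the desired config we're about to set equals the config the node is already running, the change cannot take the node down, so it arguably shouldn't count against the unavailable budget. A minimal sketch, reusing the hypothetical annotation constants from the sketch above:

```go
// isSafeRevert (hypothetical) reports whether assigning desiredConfig to the
// node is a revert to the config it is already running. Such an assignment
// can never make the node unavailable, so it need not be throttled by
// maxUnavailable.
func isSafeRevert(node *corev1.Node, desiredConfig string) bool {
	return node.Annotations[currentConfigAnnotation] == desiredConfig
}
```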
OK, I feel like I'm down a big rabbit hole here...trying to figure out what we really intend things like "ready" versus "updated" versus "unavailable" to mean, and what the state transitions should be. In the end...what's clearly broken right now is reverting an MC. I don't quite understand how this test even passed before. We also really need to verify that we're not ever exceeding maxUnavailable.
I believe we were never honoring that in reality, since we're wrongly checking a given MC for a pool (that's why I dropped that check here). So yeah, we need to make sure about that.
This has worked until now since we effectively allow, e.g., one node to be degraded for a given rendered-MC of a pool, but we can still make progress if the rendered-MC for the pool changes (I hope that makes sense; maybe I'm not communicating it well).
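To make the failure mode concrete: when the unavailable count is filtered by the pool's current target config, a node still mid-update to an older rendered config drops out of the count the moment the target changes, so each new rendered config effectively resets the budget. A hedged before/after sketch (names assumed, constants from the sketch above):

```go
// countUnavailableBuggy filters by the pool's latest rendered config, so a
// node mid-update to rendered-config-A stops counting as unavailable the
// moment the pool moves on to rendered-config-B.
func countUnavailableBuggy(nodes []*corev1.Node, poolTarget string) int {
	n := 0
	for _, node := range nodes {
		dconfig := node.Annotations[desiredConfigAnnotation]
		cconfig := node.Annotations[currentConfigAnnotation]
		if dconfig == poolTarget && dconfig != cconfig {
			n++
		}
	}
	return n
}

// countUnavailableFixed treats any node whose desired config differs from its
// current config as in flight, and therefore unavailable, no matter which
// rendered config the pool is targeting now.
func countUnavailableFixed(nodes []*corev1.Node) int {
	n := 0
	for _, node := range nodes {
		if node.Annotations[desiredConfigAnnotation] != node.Annotations[currentConfigAnnotation] {
			n++
		}
	}
	return n
}
```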
Yeah, I guess we need to reverse-engineer all that code and cross our fingers that the current unit tests will catch any regression we might introduce during a potential refactor.
I took a stab at this in #701
Is there a specific order these (now) 3 PRs need to go in?
Or is #701 going to be the PR?
I think we don't know yet 😦 The other upgrade BZ is taking most of the mental energy right now and this bug is...ugly. I am fearing the rabbit hole that trying to fix it is going to lead us down.
Retesting in the event that's a flake.

/test e2e-aws-op
So this one did pass CI, and it's obviously a lot simpler than #701. However, this PR isn't explicitly trying to fix https://bugzilla.redhat.com/show_bug.cgi?id=1707212 either, though it may be doing so also? I think the key is the combination of not filtering unavailable nodes by the target config, and also explicitly checking the MCD state. Both PRs do that. I found the logic very confusing before though, and #701 goes a lot farther in trying to address it; we avoid the duplicate logic.
I'm closing this in favor of #701, which I like more.
Bigger rewrite for openshift#699

This should ensure that the node controller avoids exceeding maxUnavailable, by changing the notion of "unavailable" to mean nodes which are either not ready, *or* may become not ready as they're working on an update. That list is also further filtered by nodes which are degraded/unreconcilable. Also while we're here, just hardcode `maxUnavailable=1` for any pool with the name `master`, since that's all we support. Note the clearer separation between "degraded" and "unavailable" now required changes in the tests.
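A rough sketch of the throttling that commit message describes, with a pared-down stand-in for the real MachineConfigPool type; the helper names are assumptions:

```go
// poolInfo is a pared-down stand-in for the real MachineConfigPool type.
type poolInfo struct {
	Name           string
	MaxUnavailable int // 0 means unset
}

// maxUnavailableFor mirrors the behavior described above: master pools are
// pinned to 1; other pools use their configured value, defaulting to 1.
func maxUnavailableFor(p poolInfo) int {
	if p.Name == "master" {
		return 1
	}
	if p.MaxUnavailable > 0 {
		return p.MaxUnavailable
	}
	return 1
}

// updateCapacity is how many additional nodes may start updating right now,
// given how many nodes are already unavailable.
func updateCapacity(p poolInfo, unavailable int) int {
	if c := maxUnavailableFor(p) - unavailable; c > 0 {
		return c
	}
	return 0
}
```

With unavailability defined as "not ready, or working on an update", this cap is what keeps a pool from surging past one master node at a time.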