Bug 1707212: Rework controller progress by cgwalters · Pull Request #701 · openshift/machine-config-operator

cgwalters · 2019-05-03T23:06:10Z

Bigger rewrite for #699

This should ensure that the node controller avoids exceeding maxUnavailable,
by changing the notion of "unavailable" to mean nodes which are either
not ready, or may become not ready as they're working on an update.

That list is also further filtered by nodes which are degraded/unreconcilable.
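To make the new notion of "unavailable" concrete, here is a minimal sketch of one reading of the description above. The `nodeState` type and its fields are hypothetical simplifications, not the operator's real types, and treating degraded/unreconcilable nodes as excluded from the unavailable list is an assumption about what "filtered by" means here.

```go
package main

import "fmt"

// nodeState is a hypothetical, simplified view of a node: its readiness
// plus the config it is running and the config it has been asked to run.
type nodeState struct {
	Name          string
	Ready         bool
	CurrentConfig string
	DesiredConfig string
	Degraded      bool // degraded or unreconcilable (assumed to be tracked separately)
}

// isUnavailable reports whether a node counts against maxUnavailable:
// it is either not ready, or it is still working on an update and so
// may become not ready. Degraded nodes are filtered out (assumption).
func isUnavailable(n nodeState) bool {
	if n.Degraded {
		return false
	}
	if !n.Ready {
		return true
	}
	return n.CurrentConfig != n.DesiredConfig
}

// unavailableNodes returns the subset of nodes for which isUnavailable is true.
func unavailableNodes(nodes []nodeState) []nodeState {
	var out []nodeState
	for _, n := range nodes {
		if isUnavailable(n) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	nodes := []nodeState{
		{Name: "node-0", Ready: true, CurrentConfig: "v1", DesiredConfig: "v1"},
		{Name: "node-1", Ready: false, CurrentConfig: "v1", DesiredConfig: "v1"},
		{Name: "node-2", Ready: true, CurrentConfig: "v0", DesiredConfig: "v1"},
		{Name: "node-3", Ready: true, CurrentConfig: "v0", DesiredConfig: "v1", Degraded: true},
	}
	for _, n := range unavailableNodes(nodes) {
		fmt.Println(n.Name)
	}
}
```

In this sketch node-1 (not ready) and node-2 (mid-update) count as unavailable, while node-3 is left to the degraded bookkeeping.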

ashcrow · 2019-05-07T17:46:42Z

Unit tests are failing. Specifically:

TestGetCandidateMachines/case#3
TestGetCandidateMachines/case#4
TestGetCandidateMachines/case#5

cgwalters · 2019-05-08T02:43:38Z

Notice how this case changed: before, even though we already had one node targeting v1, we'd go ahead and change the desired config for node-3 too.

cgwalters · 2019-05-08T02:44:30Z

This case also changed to differentiate on "non-working" MCD state - we do want to reset annotations on nodes that have failed on an update and aren't targeting our current config.

cgwalters · 2019-05-08T02:46:24Z

This is obviously a lot of code churn. However, we do have decent unit test coverage here.

runcom · 2019-05-08T09:59:14Z

This looks great, but I believe this is breaking the rollback scenario, which is:

1. create a bad machineconfig that can't reconcile
2. watch nodes (1) going unreconcilable
3. delete the bad machineconfig
4. watch nodes reconciling back to a rendered config w/o the bad mc

RobertKrawitz · 2019-05-08T13:28:58Z

Thank you for the very useful comment 👍 Maybe someone won't be tempted to "optimize" this out in the future.

cgwalters · 2019-05-08T13:41:02Z

/hold
This depends on #718 and I'd like that to go in first.

This also could use some extra review.

cgwalters · 2019-05-08T14:58:09Z

OK e2e-aws-op needs changes in the test to check for degraded and not unavailable, fixing.

cgwalters · 2019-05-08T17:27:15Z

OK interesting, that e2e-aws-upgrade run had a failed reboot cycle.

Yet there are no previous MCD logs, for any of the MCDs, even though they have 3-4 reboots. I think we should change the MCD to scrape all the journal entries it wrote during the previous boot, or something like that?

cgwalters · 2019-05-08T17:38:56Z

test comment

runcom · 2019-05-08T17:49:41Z

> OK interesting, that e2e-aws-upgrade run had a failed reboot cycle.

this is definitely #719 if you look at clusteroperators.json (not sure why the MCD logs are empty tho)

runcom · 2019-05-08T20:35:07Z

just triple checking with @RobertKrawitz and @hexfusion that we never want more than 1 master unschedulable/rebooted

runcom · 2019-05-08T20:38:33Z

> do we have coverage on the flip above though? or can it never happen with these new changes?

Mmmm...I think I'd answer that question by saying it's a better test if we indeed keep one unready node, and make it look like this:

diff --git a/pkg/controller/node/node_controller_test.go b/pkg/controller/node/node_controller_test.go
index 74e951df..0a772379 100644
--- a/pkg/controller/node/node_controller_test.go
+++ b/pkg/controller/node/node_controller_test.go
@@ -440,10 +440,10 @@ func TestGetCandidateMachines(t *testing.T) {
 		expected: nil,
 	}, {
 		// node-2 is going to change config, so we can only progress one more
-		progress: 2,
+		progress: 3,
 		nodes: []*corev1.Node{
 			newNodeWithReady("node-0", "v1", "v1", corev1.ConditionTrue),
-			newNodeWithReady("node-1", "v1", "v1", corev1.ConditionTrue),
+			newNodeWithReady("node-1", "v1", "v1", corev1.ConditionFalse),
 			newNodeWithReady("node-2", "v0", "v1", corev1.ConditionTrue),
 			newNodeWithReady("node-3", "v0", "v0", corev1.ConditionTrue),
 			newNodeWithReady("node-4", "v0", "v0", corev1.ConditionTrue),

will push after this current CI run if you agree.
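For readers following the arithmetic in that test case, here is a rough standalone sketch of how the candidate count might fall out. It assumes that `progress` plays the role of maxUnavailable and that the `newNodeWithReady` arguments are (name, current config, desired config, readiness); `testNode` and `candidates` are hypothetical names, not the controller's real code, and only the five nodes visible in the hunk are used.

```go
package main

import "fmt"

// testNode is a hypothetical stand-in for the nodes built by the test helper.
type testNode struct {
	name    string
	current string
	desired string
	ready   bool
}

// candidates returns the names of nodes that could be moved to target
// without letting the number of unavailable nodes exceed maxUnavailable.
// A node is counted as unavailable if it is not ready or is mid-update.
func candidates(nodes []testNode, target string, maxUnavailable int) []string {
	unavailable := 0
	var pending []string
	for _, n := range nodes {
		switch {
		case !n.ready || n.current != n.desired:
			unavailable++
		case n.desired != target:
			// Ready, settled on an older config: eligible for an update.
			pending = append(pending, n.name)
		}
	}
	capacity := maxUnavailable - unavailable
	if capacity <= 0 {
		return nil
	}
	if capacity < len(pending) {
		pending = pending[:capacity]
	}
	return pending
}

func main() {
	nodes := []testNode{
		{"node-0", "v1", "v1", true},
		{"node-1", "v1", "v1", false}, // unready
		{"node-2", "v0", "v1", true},  // going to change config
		{"node-3", "v0", "v0", true},
		{"node-4", "v0", "v0", true},
	}
	fmt.Println(candidates(nodes, "v1", 3))
}
```

With one unready node and one node mid-update, two of the three slots are taken, so this sketch selects only node-3, which is consistent with the "we can only progress one more" comment in the test.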

runcom · 2019-05-08T20:39:58Z

@cgwalters about the failing aws op maybe #701 (comment) ?

cgwalters · 2019-05-08T21:33:29Z

@abhinavdahiya if you get a few minutes to sanity check this PR that could be very useful - particularly note #701 (comment) but fixing that unwound into a big patch.

Continuing on the quest to make the controller logs more useful.

The `nodeChanged` logic was not accounting for the fact that the controller takes readiness into account too. However, rather than change that function, let's just remove it and extend our logic below that which was also effectively doing change detection so it could log it.

This also removes the logging from `nodeReady()` which got very noisy in status; we now consistently log just changes in the node controller.

However, the change detection logic was also implicitly ignoring nodes which didn't appear to be managed by the MCD at all - think Windows nodes. Let's explicitly skip nodes that don't have a `currentConfig` annotation.
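A small sketch of what combined change detection and logging could look like, under stated assumptions: `nodeSnapshot` and `describeChange` are hypothetical names, readiness is reduced to a boolean, and the annotation keys are the `machineconfiguration.openshift.io/currentConfig` and `desiredConfig` keys the MCD uses. This is an illustration of the idea, not the code in the commit.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	currentConfigAnnotation = "machineconfiguration.openshift.io/currentConfig"
	desiredConfigAnnotation = "machineconfiguration.openshift.io/desiredConfig"
)

// nodeSnapshot is a hypothetical, reduced view of a node at one point in time.
type nodeSnapshot struct {
	Name        string
	Annotations map[string]string
	Ready       bool
}

// describeChange compares two snapshots of the same node. It returns a log
// message and true when readiness or a config annotation changed. Nodes
// without a currentConfig annotation are skipped explicitly, since they do
// not appear to be managed by the MCD.
func describeChange(old, cur nodeSnapshot) (string, bool) {
	if _, managed := cur.Annotations[currentConfigAnnotation]; !managed {
		return "", false
	}
	var changes []string
	if old.Ready != cur.Ready {
		changes = append(changes, fmt.Sprintf("ready: %v -> %v", old.Ready, cur.Ready))
	}
	for _, key := range []string{currentConfigAnnotation, desiredConfigAnnotation} {
		if old.Annotations[key] != cur.Annotations[key] {
			changes = append(changes,
				fmt.Sprintf("%s: %q -> %q", key, old.Annotations[key], cur.Annotations[key]))
		}
	}
	if len(changes) == 0 {
		return "", false
	}
	return fmt.Sprintf("node %s changed: %s", cur.Name, strings.Join(changes, "; ")), true
}

func main() {
	old := nodeSnapshot{Name: "node-2", Ready: true, Annotations: map[string]string{
		currentConfigAnnotation: "v0", desiredConfigAnnotation: "v0",
	}}
	cur := nodeSnapshot{Name: "node-2", Ready: true, Annotations: map[string]string{
		currentConfigAnnotation: "v0", desiredConfigAnnotation: "v1",
	}}
	if msg, changed := describeChange(old, cur); changed {
		fmt.Println(msg)
	}
}
```

Doing the detection and the logging in one place means a message is produced only when something relevant actually changed, which is the noise reduction the commit message is after.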

Bigger rewrite for openshift#699

This should ensure that the node controller avoids exceeding maxUnavailable, by changing the notion of "unavailable" to mean nodes which are either not ready, *or* may become not ready as they're working on an update.

That list is also further filtered by nodes which are degraded/unreconcilable.

Also while we're here, just hardcode `maxUnavailable=1` for any pool with the name `master` since that's all we support.

Note the clearer separation between "degraded" and "unavailable" now required changes in the tests.
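The `master` special case mentioned in this commit message can be pictured roughly as follows. The function name `effectiveMaxUnavailable` and the plain `int` parameter are hypothetical; how the configured value is actually read from the pool is not shown in this thread and is left out here.

```go
package main

import "fmt"

// effectiveMaxUnavailable returns the limit to enforce for a pool.
// Any pool named "master" is pinned to 1 regardless of what is configured;
// other pools use their configured value (hypothetical simplification).
func effectiveMaxUnavailable(poolName string, configured int) int {
	if poolName == "master" {
		return 1
	}
	return configured
}

func main() {
	fmt.Println(effectiveMaxUnavailable("master", 3))
	fmt.Println(effectiveMaxUnavailable("worker", 3))
}
```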

runcom · 2019-05-09T09:20:14Z

/approve

So glad we now have this and can control all of this in a better and more readable way... @abhinavdahiya just echoing Colin's ping if you have a minute to validate this

/assign @kikisdeliveryservice

cgwalters · 2019-05-09T13:25:41Z

/retest

cgwalters · 2019-05-09T13:26:20Z

/test e2e-aws
/test e2e-aws-op
/test e2e-aws-upgrade

cgwalters · 2019-05-09T13:28:43Z

/test e2e-aws e2e-aws-op

cgwalters · 2019-05-09T15:09:04Z

All tests passed again

runcom · 2019-05-09T15:33:45Z

/lgtm

openshift-ci-robot · 2019-05-09T15:34:01Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [cgwalters,runcom]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 3, 2019

openshift-ci-robot requested review from kikisdeliveryservice and runcom May 3, 2019 23:06

cgwalters mentioned this pull request May 3, 2019

node_controller: mark unavailable if configs differ #699

Closed

cgwalters force-pushed the controller-progress branch 2 times, most recently from c1c3056 to 551ed0f Compare May 4, 2019 18:38

cgwalters force-pushed the controller-progress branch 2 times, most recently from fd4f4da to 8751c3e Compare May 8, 2019 02:39

cgwalters commented May 8, 2019

Comment thread pkg/controller/node/node_controller_test.go Outdated

cgwalters commented May 8, 2019

cgwalters force-pushed the controller-progress branch from 8751c3e to 9ac5a78 Compare May 8, 2019 02:44

runcom reviewed May 8, 2019

Comment thread pkg/controller/node/node_controller.go Outdated

RobertKrawitz reviewed May 8, 2019

cgwalters force-pushed the controller-progress branch from 9ac5a78 to 21f030a Compare May 8, 2019 13:39

openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 8, 2019

cgwalters force-pushed the controller-progress branch from 21f030a to f4d53d3 Compare May 8, 2019 15:01

cgwalters changed the title ~~Controller progress~~ Bug 1707212: Rework controller progress May 8, 2019

cgwalters force-pushed the controller-progress branch from f4d53d3 to cd51e66 Compare May 8, 2019 19:03

openshift deleted a comment from cgwalters May 8, 2019

runcom reviewed May 8, 2019

Comment thread test/e2e/mcd_test.go Outdated

runcom reviewed May 8, 2019

runcom mentioned this pull request May 8, 2019

controller: Fix "changed" logic to account for nodeReady changes #718

Closed

cgwalters force-pushed the controller-progress branch from cd51e66 to 5a33cb8 Compare May 8, 2019 21:19

cgwalters added 3 commits May 8, 2019 23:57

status: Log when a pool completes

45af7e4

Continuing on the quest to make the controller logs more useful.

cgwalters force-pushed the controller-progress branch from 5a33cb8 to 03b6843 Compare May 8, 2019 23:57

openshift-ci-robot assigned kikisdeliveryservice May 9, 2019

openshift-ci-robot assigned runcom May 9, 2019

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 9, 2019

runcom removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 9, 2019

openshift-merge-robot merged commit ecc6f58 into openshift:master May 9, 2019

runcom mentioned this pull request May 9, 2019

Bug 1707928: hack/build-go.sh: use just git hashes for building #728

Merged

7 participants
