Skip to content

Bug 1707212: Rework controller progress#701

Merged
openshift-merge-robot merged 3 commits intoopenshift:masterfrom
cgwalters:controller-progress
May 9, 2019
Merged

Bug 1707212: Rework controller progress#701
openshift-merge-robot merged 3 commits intoopenshift:masterfrom
cgwalters:controller-progress

Conversation

@cgwalters
Copy link
Copy Markdown
Member

@cgwalters cgwalters commented May 3, 2019

Bigger rewrite for #699

This should ensure that the node controller avoids exceeding maxUnavailable,
by changing the notion of "unavailable" to mean nodes which are either
not ready, or may become not ready as they're working on an update.

That list is also further filtered by nodes which are degraded/unreconcilable.

@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 3, 2019
@cgwalters cgwalters force-pushed the controller-progress branch 2 times, most recently from c1c3056 to 551ed0f Compare May 4, 2019 18:38
@ashcrow
Copy link
Copy Markdown
Member

ashcrow commented May 7, 2019

Unittests are failing. Specifically:

  • TestGetCandidateMachines/case#3
  • TestGetCandidateMachines/case#4
  • TestGetCandidateMachines/case#5

@cgwalters cgwalters force-pushed the controller-progress branch 2 times, most recently from fd4f4da to 8751c3e Compare May 8, 2019 02:39
Comment thread pkg/controller/node/node_controller_test.go Outdated
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notice how this case changed - before even though we had one node already targeting v1 we'd go ahead and change desired for node-3 too.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case also changed to differentiate on "non-working" MCD state - we do want to reset annotations on nodes that have failed on an update and aren't targeting our current config.

@cgwalters cgwalters force-pushed the controller-progress branch from 8751c3e to 9ac5a78 Compare May 8, 2019 02:44
@cgwalters
Copy link
Copy Markdown
Member Author

This is obviously a lot of code churn. However, we do have decent unit test coverage here.

Comment thread pkg/controller/node/node_controller.go Outdated
@runcom
Copy link
Copy Markdown
Member

runcom commented May 8, 2019

This looks graet but I believe this is breaking the rollback scenario which is:

  • create a bad machineconfig that can't reconcile
  • watch nodes (1) going unreconcilable
  • delete the bad machineconfig
  • watch nodes reconciling back to a rendered w/o the bad mc

Comment thread pkg/controller/node/status.go Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the very useful comment 👍 Maybe someone won't be tempted to "optimize" this out in the future.

@cgwalters cgwalters force-pushed the controller-progress branch from 9ac5a78 to 21f030a Compare May 8, 2019 13:39
@cgwalters
Copy link
Copy Markdown
Member Author

/hold
This depends on #718 and I'd like that to go in first.

This also could use some extra review.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 8, 2019
@cgwalters
Copy link
Copy Markdown
Member Author

OK e2e-aws-op needs changes in the test to check for degraded and not unavailable, fixing.

@cgwalters cgwalters force-pushed the controller-progress branch from 21f030a to f4d53d3 Compare May 8, 2019 15:01
@cgwalters
Copy link
Copy Markdown
Member Author

OK interesting, that e2e-aws-upgrade run had a
failed reboot cycle.

Yet...there is no previous MCD logs. Actually for any MCDs. Yet they have 3-4 reboots. I think we should change the MCD to scrape all journal entries from the previous boot that it wrote or so?

@cgwalters
Copy link
Copy Markdown
Member Author

test comment

@runcom
Copy link
Copy Markdown
Member

runcom commented May 8, 2019

OK interesting, that e2e-aws-upgrade run had a
failed reboot cycle.

this is definitely #719 if you look at clusteroperators.json (not sure why MCD are empty tho)

@cgwalters cgwalters changed the title Controller progress Bug 1707212: Rework controller progress May 8, 2019
@cgwalters cgwalters force-pushed the controller-progress branch from f4d53d3 to cd51e66 Compare May 8, 2019 19:03
@openshift openshift deleted a comment from cgwalters May 8, 2019
@openshift openshift deleted a comment from cgwalters May 8, 2019
Comment thread pkg/controller/node/node_controller.go Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just triple checking with @RobertKrawitz and @hexfusion that we never wants more than 1 master unschedulable/rebooted

Comment thread test/e2e/mcd_test.go Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have coverage on the flip above though? or it can't never happen with this new changes?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmmm...I think I'd answer that question by saying it's a better test if we indeed keep one unready node, and make it look like this:

diff --git a/pkg/controller/node/node_controller_test.go b/pkg/controller/node/node_controller_test.go
index 74e951df..0a772379 100644
--- a/pkg/controller/node/node_controller_test.go
+++ b/pkg/controller/node/node_controller_test.go
@@ -440,10 +440,10 @@ func TestGetCandidateMachines(t *testing.T) {
 		expected: nil,
 	}, {
 		// node-2 is going to change config, so we can only progress one more
-		progress: 2,
+		progress: 3,
 		nodes: []*corev1.Node{
 			newNodeWithReady("node-0", "v1", "v1", corev1.ConditionTrue),
-			newNodeWithReady("node-1", "v1", "v1", corev1.ConditionTrue),
+			newNodeWithReady("node-1", "v1", "v1", corev1.ConditionFalse),
 			newNodeWithReady("node-2", "v0", "v1", corev1.ConditionTrue),
 			newNodeWithReady("node-3", "v0", "v0", corev1.ConditionTrue),
 			newNodeWithReady("node-4", "v0", "v0", corev1.ConditionTrue),

will push after this current CI run if you agree.

@runcom
Copy link
Copy Markdown
Member

runcom commented May 8, 2019

@cgwalters about the failing aws op maybe #701 (comment) ?

@cgwalters
Copy link
Copy Markdown
Member Author

cgwalters commented May 8, 2019

@abhinavdahiya if you get a few minutes to sanity check this PR that could be very useful - particularly note #701 (comment) but fixing that unwound into a big patch.

cgwalters added 3 commits May 8, 2019 23:57
Continuing on the quest to make the controller logs more useful.
The `nodeChanged` logic was not accounting for the fact that
the controller takes readiness into account too.  However, rather
than change that function, let's just remove it and extend our
logic below that which was also effectively doing change detection
so it could log it.

This also removes the logging from `nodeReady()` which got very
noisy in status; we now consistently log just changes in the node
controller.

However, the change detection logic was also implicitly ignoring
nodes which didn't appear to be managed by the MCD at all - think
Windows nodes.  Let's explicitly skip nodes that don't have a `currentConfig`
annotation.
Bigger rewrite for openshift#699

This should ensure that the node controller avoids exceeding maxUnavailable,
by changing the notion of "unavailable" to mean nodes which are either
not ready, *or* may become not ready as they're working on an update.

That list is also further filtered by nodes which are degraded/unreconcilable.

Also while we're here, just hardcode `maxUnavailable=1` for any pool with
the name `master` since that's all we support.

Note the clearer separation between "degraded" and "unavailable" now required changes
in the tests.
@cgwalters cgwalters force-pushed the controller-progress branch from 5a33cb8 to 03b6843 Compare May 8, 2019 23:57
@runcom
Copy link
Copy Markdown
Member

runcom commented May 9, 2019

/approve

So glad we now have this and can control all of this in a better and readable way... @abhinavdahiya just echoing Colin's ping if you have a minute to validate this

/assign @kikisdeliveryservice

@cgwalters
Copy link
Copy Markdown
Member Author

/retest

@cgwalters
Copy link
Copy Markdown
Member Author

/test e2e-aws
/test e2e-aws-op
/test e2e-aws-upgrade

@cgwalters
Copy link
Copy Markdown
Member Author

/test e2e-aws e2e-aws-op

@cgwalters
Copy link
Copy Markdown
Member Author

All tests passed again

@runcom
Copy link
Copy Markdown
Member

runcom commented May 9, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 9, 2019
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@runcom runcom removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 9, 2019
@openshift-merge-robot openshift-merge-robot merged commit ecc6f58 into openshift:master May 9, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants