Bug 1707212: Rework controller progress #701
openshift-merge-robot merged 3 commits into openshift:master
Conversation
Force-pushed from c1c3056 to 551ed0f
Unit tests are failing. Specifically:
Force-pushed from fd4f4da to 8751c3e
Notice how this case changed: before, even though we had one node already targeting v1, we'd go ahead and change desired for node-3 too.
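For readers without the diff context, here's a minimal sketch of the accounting change being described (a hypothetical helper, assuming the standard MCO node annotations; not the PR's actual code): a node already in flight toward the target config now consumes an update slot instead of being ignored.

```go
package node

import corev1 "k8s.io/api/core/v1"

const (
	currentConfig = "machineconfiguration.openshift.io/currentConfig"
	desiredConfig = "machineconfiguration.openshift.io/desiredConfig"
)

// getCandidates stops retargeting further nodes (e.g. node-3 above) once
// in-flight updates have consumed the available capacity.
func getCandidates(nodes []*corev1.Node, target string, capacity int) []*corev1.Node {
	var candidates []*corev1.Node
	for _, n := range nodes {
		if n.Annotations[desiredConfig] == target {
			if n.Annotations[currentConfig] != target {
				// Already targeting the new config but not done: uses a slot.
				capacity--
			}
			continue
		}
		candidates = append(candidates, n)
	}
	if capacity <= 0 {
		return nil
	}
	if len(candidates) > capacity {
		candidates = candidates[:capacity]
	}
	return candidates
}
```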
This case also changed to differentiate on "non-working" MCD state: we do want to reset annotations on nodes that have failed an update and aren't targeting our current config.
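A hedged sketch of that condition (names are illustrative, and the state values assumed here are the usual MCD state annotation set; not the merged code):

```go
package node

import corev1 "k8s.io/api/core/v1"

// shouldResetDesired: only nodes whose MCD reported a failed (non-working)
// state and which aren't targeting the pool's current config get their
// desired annotation reset.
func shouldResetDesired(node *corev1.Node, poolTarget string) bool {
	state := node.Annotations["machineconfiguration.openshift.io/state"]
	failed := state == "Degraded" || state == "Unreconcilable"
	return failed && node.Annotations["machineconfiguration.openshift.io/desiredConfig"] != poolTarget
}
```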
Force-pushed from 8751c3e to 9ac5a78
This is obviously a lot of code churn. However, we do have decent unit test coverage here.
This looks great, but I believe this is breaking the rollback scenario, which is:
Thank you for the very useful comment 👍 Maybe someone won't be tempted to "optimize" this out in the future.
Force-pushed from 9ac5a78 to 21f030a
/hold
This also could use some extra review.
OK, e2e-aws-op needs changes in the test to check for degraded and not unavailable; fixing.
Force-pushed from 21f030a to f4d53d3
OK interesting, that e2e-aws-upgrade run had a … Yet there are no previous MCD logs. Actually, for any MCDs. Yet they have 3-4 reboots. I think we should change the MCD to scrape all journal entries from the previous boot that it wrote, or so?
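For what that scraping could look like, a rough sketch assuming we just shell out to journalctl (`--boot=-1` is journalctl's standard selector for the previous boot; everything else here is hypothetical, not MCD code):

```go
package main

import (
	"fmt"
	"os/exec"
)

// gatherPreviousBootJournal returns the raw journal entries from the
// previous boot, so pre-reboot MCD logs aren't lost after an update.
func gatherPreviousBootJournal() ([]byte, error) {
	return exec.Command("journalctl", "--boot=-1", "--no-pager").CombinedOutput()
}

func main() {
	out, err := gatherPreviousBootJournal()
	if err != nil {
		fmt.Printf("failed to gather previous-boot journal: %v\n", err)
		return
	}
	fmt.Printf("previous boot journal:\n%s", out)
}
```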
test comment
This is definitely #719 if you look at clusteroperators.json (not sure why the MCD logs are empty, though).
Force-pushed from f4d53d3 to cd51e66
Just triple-checking with @RobertKrawitz and @hexfusion that we never want more than 1 master unschedulable/rebooted.
Do we have coverage on the flip above, though? Or can it never happen with these new changes?
Mmmm...I think I'd answer that question by saying it's a better test if we indeed keep one unready node, and make it look like this:
```diff
diff --git a/pkg/controller/node/node_controller_test.go b/pkg/controller/node/node_controller_test.go
index 74e951df..0a772379 100644
--- a/pkg/controller/node/node_controller_test.go
+++ b/pkg/controller/node/node_controller_test.go
@@ -440,10 +440,10 @@ func TestGetCandidateMachines(t *testing.T) {
expected: nil,
}, {
// node-2 is going to change config, so we can only progress one more
- progress: 2,
+ progress: 3,
nodes: []*corev1.Node{
newNodeWithReady("node-0", "v1", "v1", corev1.ConditionTrue),
- newNodeWithReady("node-1", "v1", "v1", corev1.ConditionTrue),
+ newNodeWithReady("node-1", "v1", "v1", corev1.ConditionFalse),
newNodeWithReady("node-2", "v0", "v1", corev1.ConditionTrue),
newNodeWithReady("node-3", "v0", "v0", corev1.ConditionTrue),
newNodeWithReady("node-4", "v0", "v0", corev1.ConditionTrue),
```
Will push after this current CI run if you agree.
@cgwalters about the failing aws-op, maybe #701 (comment)?
Force-pushed from cd51e66 to 5a33cb8
@abhinavdahiya if you get a few minutes to sanity-check this PR, that would be very useful. Particularly note #701 (comment), but fixing that unwound into a big patch.
Continuing on the quest to make the controller logs more useful.
The `nodeChanged` logic was not accounting for the fact that the controller takes readiness into account too. However, rather than change that function, let's just remove it and extend our logic below it, which was also effectively doing change detection so that it could log it. This also removes the logging from `nodeReady()`, which got very noisy in status; we now consistently log just changes in the node controller.

However, the change detection logic was also implicitly ignoring nodes which didn't appear to be managed by the MCD at all (think Windows nodes). Let's explicitly skip nodes that don't have a `currentConfig` annotation.
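A minimal sketch of that skip, assuming the standard MCO annotation key (the helper name is illustrative, not the merged code):

```go
package node

import corev1 "k8s.io/api/core/v1"

// isNodeManaged reports whether the MCD has ever touched this node; nodes
// without a currentConfig annotation (e.g. Windows nodes) are skipped by
// the change-detection logic entirely.
func isNodeManaged(node *corev1.Node) bool {
	cconfig, ok := node.Annotations["machineconfiguration.openshift.io/currentConfig"]
	return ok && cconfig != ""
}
```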
Bigger rewrite for openshift#699.

This should ensure that the node controller avoids exceeding maxUnavailable, by changing the notion of "unavailable" to mean nodes which are either not ready, *or* may become not ready as they're working on an update. That list is also further filtered by nodes which are degraded/unreconcilable.

Also while we're here, just hardcode `maxUnavailable=1` for any pool with the name `master`, since that's all we support. Note that the clearer separation between "degraded" and "unavailable" required changes in the tests.
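A sketch of that reworked notion, under the same annotation assumptions as above (illustrative names, not the merged code):

```go
package node

import corev1 "k8s.io/api/core/v1"

const (
	currentConfigKey = "machineconfiguration.openshift.io/currentConfig"
	desiredConfigKey = "machineconfiguration.openshift.io/desiredConfig"
)

func isNodeReady(node *corev1.Node) bool {
	for _, c := range node.Status.Conditions {
		if c.Type == corev1.NodeReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}

// isNodeUnavailable: a node counts against maxUnavailable if it is not
// ready, or if it is (or is about to be) rebooting toward a new config.
func isNodeUnavailable(node *corev1.Node) bool {
	if !isNodeReady(node) {
		return true
	}
	return node.Annotations[desiredConfigKey] != node.Annotations[currentConfigKey]
}

// maxUnavailableFor pins master pools to 1 regardless of spec, per the
// commit message; other pools would derive the value from their pool spec.
func maxUnavailableFor(poolName string, specMaxUnavailable int) int {
	if poolName == "master" {
		return 1
	}
	return specMaxUnavailable
}
```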
Force-pushed from 5a33cb8 to 03b6843
/approve
So glad we now have this and can control all of this in a better, more readable way... @abhinavdahiya just echoing Colin's ping if you have a minute to validate this.
/assign @kikisdeliveryservice
/retest
/test e2e-aws
/test e2e-aws e2e-aws-op
All tests passed again.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull request has been approved by: cgwalters, runcom
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.