Bump timeout for e2e-aws-op by cgwalters · Pull Request #692 · openshift/machine-config-operator

cgwalters · 2019-05-02T11:34:55Z

We seem to be taking longer here.

runcom · 2019-05-02T12:12:21Z

I'm fine with this if we're just timing out on the latest test added (quorum guard afaict)

runcom · 2019-05-02T12:12:38Z

/approve
/lgtm

hexfusion · 2019-05-02T12:14:25Z

/lgtm

Thanks @cgwalters @runcom !

openshift-ci-robot · 2019-05-02T12:14:33Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, hexfusion, runcom

Needs approval from an approver in each of these files:

~~OWNERS~~ [cgwalters,runcom]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

hexfusion · 2019-05-02T12:51:35Z

/test e2e-aws-upgrade

cgwalters · 2019-05-02T13:47:30Z

                "degradedMachineCount": 0,
                "machineCount": 3,
                "observedGeneration": 1,
                "readyMachineCount": 2,
                "unavailableMachineCount": 1,
                "updatedMachineCount": 2
            }

runcom · 2019-05-02T13:48:11Z

@cgwalters that just looks like the worker pool is still rolling and reconciling to me
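A minimal sketch of how the counts in that status paste can be read as "still rolling". The `poolStatus` type and the `stillRolling` helper are hypothetical stand-ins written for illustration; they are not the operator's real types or logic.

```go
package main

import "fmt"

// poolStatus mirrors the count fields visible in the pasted status.
// Hypothetical stand-in, not the real MachineConfigPool status type.
type poolStatus struct {
	MachineCount            int
	UpdatedMachineCount     int
	ReadyMachineCount       int
	UnavailableMachineCount int
	DegradedMachineCount    int
}

// stillRolling reports whether the counts describe a pool that has not
// finished reconciling: some machine is not yet updated, not yet ready,
// or is currently unavailable.
func stillRolling(s poolStatus) bool {
	return s.UpdatedMachineCount < s.MachineCount ||
		s.ReadyMachineCount < s.MachineCount ||
		s.UnavailableMachineCount > 0
}

func main() {
	// Values copied from the status fragment in the thread.
	pasted := poolStatus{
		MachineCount:            3,
		UpdatedMachineCount:     2,
		ReadyMachineCount:       2,
		UnavailableMachineCount: 1,
	}
	fmt.Println(stillRolling(pasted))
}
```

With 2 of 3 machines updated and 1 unavailable, `stillRolling` returns true for the pasted counts.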

openshift-ci-robot · 2019-05-02T13:48:49Z

New changes are detected. LGTM label has been removed.

kikisdeliveryservice · 2019-05-02T14:28:52Z

120 seems awfully high? I assume we're just trying it to see if it's truly a timeout issue?

cgwalters · 2019-05-02T15:11:25Z

> I assume we're just trying it to see if it's truly a timeout issue?

Right.
/hold
Since we probably don't want to merge as is.
I also suspect 2 hours may be the timeout for Prow jobs in general. Need to investigate that.
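For context on why a bigger timeout only helps if the wait would eventually succeed, here is a rough sketch of a poll-until-deadline loop. `waitFor`, `pollWithTimeout`, and `pollInterval` are hypothetical names, and the durations are arbitrary; this is not the e2e suite's actual wait code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// pollInterval is an arbitrary illustrative value.
const pollInterval = 5 * time.Millisecond

// waitFor calls check immediately and then once per interval, returning nil
// as soon as check reports true, or ctx.Err() once the context expires.
func waitFor(ctx context.Context, interval time.Duration, check func() bool) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if check() {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// pollWithTimeout wraps waitFor with a deadline, standing in for a test timeout.
func pollWithTimeout(timeout, interval time.Duration, check func() bool) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return waitFor(ctx, interval, check)
}

func main() {
	// A condition that never becomes true hits the deadline regardless of its length.
	err := pollWithTimeout(10*pollInterval, pollInterval, func() bool { return false })
	fmt.Println(errors.Is(err, context.DeadlineExceeded))
}
```

If the condition can never become true, as a stuck rollout would imply, raising the deadline only delays the failure.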

cgwalters · 2019-05-02T15:52:04Z

I find myself getting confused by getUnavailableMachines... it feels like it's supposed to be getMachinesToUpdate, but shouldn't it be checking for nodes which are ready?

diff --git a/pkg/controller/node/status.go b/pkg/controller/node/status.go
index 75c16477..dfdd2ee1 100644
--- a/pkg/controller/node/status.go
+++ b/pkg/controller/node/status.go
@@ -192,10 +192,12 @@ func getUnavailableMachines(currentConfig string, nodes []*corev1.Node) []*corev
 			continue
 		}
 
-		nodeNotReady := !isNodeReady(node)
+		if !isNodeReady(node) {
+			continue
+		}
-		if dconfig == currentConfig && (dconfig != cconfig || nodeNotReady) {
+		if dconfig == currentConfig && dconfig != cconfig {
 			unavail = append(unavail, node)
-			glog.V(2).Infof("Node %s unavailable: different configs (desired: %s, current %s) or node not ready %v", node.Name, dconfig, cconfig, nodeNotReady)
+			glog.V(2).Infof("Node %s unavailable: different configs (desired: %s, current %s)", node.Name, dconfig, cconfig)
 		}
 	}
 	return unavail

cgwalters · 2019-05-02T15:53:36Z

Paste from @sjenning

default                                      72s         Warning   MCDBootstrapSyncFailure       node/master-1                                          pending config rendered-master-1eb4fcd28e9f85ea9ac914d1789ad529 bootID a86994cc-fd33-47d4-a3e6-76da3c3d6fa2 matches current! Failed to reboot?
default                                      47s         Warning   MCDBootstrapSyncFailure       node/master-0                                          pending config rendered-master-8e7d5408d19158180e4cd0b6750b434c bootID d1a6236c-df42-412d-bac6-49155ea4ea82 matches current! Failed to reboot?
openshift-machine-config-operator            14s         Warning   FailedScheduling              pod/etcd-quorum-guard-745f7c4d8f-9wstp                 0/6 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules, 2 node(s) were unschedulable, 3 node(s) didn't match node selector.

$ oc get nodes
NAME       STATUS                     ROLES    AGE   VERSION
master-0   Ready,SchedulingDisabled   master   20h   v1.13.4+e8d1fd69b
master-1   Ready,SchedulingDisabled   master   20h   v1.13.4+e8d1fd69b
master-2   Ready                      master   20h   v1.13.4+e8d1fd69b
worker-0   Ready                      worker   20h   v1.13.4+1f90f9755
worker-1   Ready                      worker   20h   v1.13.4+1f90f9755
worker-2   Ready                      worker   20h   v1.13.4+1f90f9755

kikisdeliveryservice · 2019-05-02T15:53:43Z

I'm noticing in the CI artifact logs that one of the worker daemons has no logs whatsoever.

Edit: double-checked, and it was a worker daemon; see openshift-machine-config-operator_machine-config-daemon-gfwj7_machine-config-daemon.log.gz

openshift-ci-robot · 2019-05-02T16:01:16Z

@cgwalters: The following tests failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-aws-upgrade | `cde17af` | link | `/test e2e-aws-upgrade` |
| ci/prow/e2e-aws-op | `cde17af` | link | `/test e2e-aws-op` |

runcom · 2019-05-02T16:13:38Z

> I find myself getting confused by getUnavailableMachines... it feels like it's supposed to be getMachinesToUpdate, but shouldn't it be checking for nodes which are ready?

@cgwalters your patch reports unavailable=0 when nodes aren't ready, though, and unavailable=0 is what lets the rollout make progress, as if everything were available.
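A simplified sketch of that objection, comparing the condition before and after the proposed patch. The `node` struct and both function names are hypothetical; the real function takes `[]*corev1.Node`, reads the configs from node annotations, and has an earlier `continue` check that is not shown in the diff. Once not-ready nodes are skipped, the `nodeNotReady` term can only be false, so the patched sketch drops it.

```go
package main

import "fmt"

// node is a hypothetical, simplified stand-in for *corev1.Node plus the
// current/desired config values the real function reads from annotations.
type node struct {
	Name          string
	CurrentConfig string
	DesiredConfig string
	Ready         bool
}

// unavailableOriginal follows the condition on the removed side of the diff:
// a node targeting currentConfig counts as unavailable if it has not reached
// that config yet or if it is not ready.
func unavailableOriginal(currentConfig string, nodes []node) []string {
	var unavail []string
	for _, n := range nodes {
		if n.DesiredConfig == currentConfig && (n.DesiredConfig != n.CurrentConfig || !n.Ready) {
			unavail = append(unavail, n.Name)
		}
	}
	return unavail
}

// unavailablePatched follows the proposed change: not-ready nodes are skipped
// before the config comparison, so they are never counted.
func unavailablePatched(currentConfig string, nodes []node) []string {
	var unavail []string
	for _, n := range nodes {
		if !n.Ready {
			continue
		}
		if n.DesiredConfig == currentConfig && n.DesiredConfig != n.CurrentConfig {
			unavail = append(unavail, n.Name)
		}
	}
	return unavail
}

func main() {
	// One node is fully updated but not ready; the other is updated and ready.
	nodes := []node{
		{Name: "worker-0", CurrentConfig: "rendered-a", DesiredConfig: "rendered-a", Ready: false},
		{Name: "worker-1", CurrentConfig: "rendered-a", DesiredConfig: "rendered-a", Ready: true},
	}
	fmt.Println(len(unavailableOriginal("rendered-a", nodes)), len(unavailablePatched("rendered-a", nodes)))
}
```

For this input the original condition counts the not-ready node as unavailable, while the patched version counts none.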

cgwalters · 2019-05-02T20:20:30Z

Closing in favor of #697

openshift-ci-robot added the size/XS label (denotes a PR that changes 0-9 lines, ignoring generated files) May 2, 2019

openshift-ci-robot requested review from jlebon and runcom May 2, 2019 11:35

openshift-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) May 2, 2019

openshift-ci-robot assigned runcom May 2, 2019

openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) May 2, 2019

openshift-ci-robot assigned hexfusion May 2, 2019

runcom mentioned this pull request May 2, 2019

templates: increase kubelet loglevel for better troubleshooting #681

Merged

Bump timeout for e2e-aws-op

cde17af

We seem to be taking longer here.

cgwalters force-pushed the bump-e2e-op-time branch from e6bfb28 to cde17af Compare May 2, 2019 13:48

openshift-ci-robot removed the lgtm label (indicates that a PR is ready to be merged) May 2, 2019

openshift-ci-robot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) May 2, 2019

cgwalters closed this May 2, 2019
