Skip to content

Bump timeout for e2e-aws-op#692

Closed
cgwalters wants to merge 1 commit into
openshift:masterfrom
cgwalters:bump-e2e-op-time
Closed

Bump timeout for e2e-aws-op#692
cgwalters wants to merge 1 commit into
openshift:masterfrom
cgwalters:bump-e2e-op-time

Conversation

@cgwalters
Copy link
Copy Markdown
Member

We seem to be taking longer here.

@openshift-ci-robot openshift-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label May 2, 2019
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 2, 2019
@runcom
Copy link
Copy Markdown
Member

runcom commented May 2, 2019

I'm fine with this if we're just timing out on the latest test added (quorum guard afaict)

@runcom
Copy link
Copy Markdown
Member

runcom commented May 2, 2019

/approve
/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 2, 2019
@hexfusion
Copy link
Copy Markdown
Contributor

/lgtm

Thanks @cgwalters @runcom !

@openshift-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, hexfusion, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hexfusion
Copy link
Copy Markdown
Contributor

/test e2e-aws-upgrade

@cgwalters
Copy link
Copy Markdown
Member Author

              "degradedMachineCount": 0,
                "machineCount": 3,
                "observedGeneration": 1,
                "readyMachineCount": 2,
                "unavailableMachineCount": 1,
                "updatedMachineCount": 2
            }

@runcom
Copy link
Copy Markdown
Member

runcom commented May 2, 2019

@cgwalters that just looks like the worker pool is still rolling and reconciling to me

We seem to be taking longer here.
@cgwalters cgwalters force-pushed the bump-e2e-op-time branch from e6bfb28 to cde17af Compare May 2, 2019 13:48
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label May 2, 2019
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

New changes are detected. LGTM label has been removed.

@kikisdeliveryservice
Copy link
Copy Markdown
Contributor

120 seems awfully high? i assume we just trying it to see if it's truly a timeout issue?

@cgwalters
Copy link
Copy Markdown
Member Author

i assume we just trying it to see if it's truly a timeout issue?

Right.
/hold
Since we probably don't want to merge as is.
Also I suspect 2 hours may also be the timeout for Prow jobs in general? Need to investigate that.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 2, 2019
@cgwalters
Copy link
Copy Markdown
Member Author

I find myself getting confused by getUnavailableMachines...it feels like it's supposed to be getMachinesToUpdate but shouldn't it be checking for nodes which are ready?

diff --git a/pkg/controller/node/status.go b/pkg/controller/node/status.go
index 75c16477..dfdd2ee1 100644
--- a/pkg/controller/node/status.go
+++ b/pkg/controller/node/status.go
@@ -192,10 +192,12 @@ func getUnavailableMachines(currentConfig string, nodes []*corev1.Node) []*corev
 			continue
 		}
 
-		nodeNotReady := !isNodeReady(node)
+		if !isNodeReady(node) {
+			continue
+		}
 		if dconfig == currentConfig && (dconfig != cconfig || nodeNotReady) {
 			unavail = append(unavail, node)
-			glog.V(2).Infof("Node %s unavailable: different configs (desired: %s, current %s) or node not ready %v", node.Name, dconfig, cconfig, nodeNotReady)
+			glog.V(2).Infof("Node %s unavailable: different configs (desired: %s, current %s)", node.Name, dconfig, cconfig)
 		}
 	}
 	return unavail

@cgwalters
Copy link
Copy Markdown
Member Author

Paste from @sjenning

default                                      72s         Warning   MCDBootstrapSyncFailure       node/master-1                                          pending config rendered-master-1eb4fcd28e9f85ea9ac914d1789ad529 bootID a86994cc-fd33-47d4-a3e6-76da3c3d6fa2 matches current! Failed to reboot?
default                                      47s         Warning   MCDBootstrapSyncFailure       node/master-0                                          pending config rendered-master-8e7d5408d19158180e4cd0b6750b434c bootID d1a6236c-df42-412d-bac6-49155ea4ea82 matches current! Failed to reboot?
openshift-machine-config-operator            14s         Warning   FailedScheduling              pod/etcd-quorum-guard-745f7c4d8f-9wstp                 0/6 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules, 2 node(s) were unschedulable, 3 node(s) didn't match node selector.

$ oc get nodes
NAME       STATUS                     ROLES    AGE   VERSION
master-0   Ready,SchedulingDisabled   master   20h   v1.13.4+e8d1fd69b
master-1   Ready,SchedulingDisabled   master   20h   v1.13.4+e8d1fd69b
master-2   Ready                      master   20h   v1.13.4+e8d1fd69b
worker-0   Ready                      worker   20h   v1.13.4+1f90f9755
worker-1   Ready                      worker   20h   v1.13.4+1f90f9755
worker-2   Ready                      worker   20h   v1.13.4+1f90f9755

@kikisdeliveryservice
Copy link
Copy Markdown
Contributor

kikisdeliveryservice commented May 2, 2019

Noticing in the CI artifacts logs that one of the worker daemons has no logs whatsoever

edit: double checked and it was a worker daemon see: openshift-machine-config-operator_machine-config-daemon-gfwj7_machine-config-daemon.log.gz

@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@cgwalters: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/prow/e2e-aws-upgrade cde17af link /test e2e-aws-upgrade
ci/prow/e2e-aws-op cde17af link /test e2e-aws-op

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@runcom
Copy link
Copy Markdown
Member

runcom commented May 2, 2019

I find myself getting confused by getUnavailableMachines...it feels like it's supposed to be getMachinesToUpdate but shouldn't it be checking for nodes which are ready?

@cgwalters your patch is reporting unavailable=0 when nodes aren't ready tho, which is what makes progress if everything is available.

@cgwalters
Copy link
Copy Markdown
Member Author

Closing in favor of #697

@cgwalters cgwalters closed this May 2, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants