Bug 1834895: Bump waitForPoolComplete to 30 mins #1723

Closed

yuqi-zhang wants to merge 1 commit into openshift:master from yuqi-zhang:bump-wait-pool-timeout

Conversation

@yuqi-zhang
Contributor

Tests are timing out for e2e-gcp-op due to hitting this timeout,
which takes just a little over 20 minutes at the moment.
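
For context, waitForPoolComplete is the e2e helper that polls the machine config pool until it reports the target rendered config. A minimal sketch of that kind of poll, assuming the MCO's generated clientset; the names and wiring here are illustrative, not a verbatim copy of the test helper:

```go
package e2e

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"

	mcfgclient "github.com/openshift/machine-config-operator/pkg/generated/clientset/versioned"
)

// waitForPoolComplete polls until the pool targets the rendered config and
// every machine reports updated. Sketch only; the real helper differs.
func waitForPoolComplete(cs mcfgclient.Interface, pool, target string) error {
	// Bumped to 30 minutes: e2e-gcp-op runs were finishing just over the
	// old 20-minute limit.
	return wait.Poll(2*time.Second, 30*time.Minute, func() (bool, error) {
		mcp, err := cs.MachineconfigurationV1().MachineConfigPools().Get(context.TODO(), pool, metav1.GetOptions{})
		if err != nil {
			// Tolerate transient API errors rather than aborting the wait.
			return false, nil
		}
		return mcp.Status.Configuration.Name == target &&
			mcp.Status.UpdatedMachineCount == mcp.Status.MachineCount, nil
	})
}
```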

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 11, 2020
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 11, 2020
@haircommander
Member

LOL I was just about to submit a PR for 25 😄

@yuqi-zhang
Contributor Author

/retest

Different errors

@sinnykumari
Contributor

/retest

@sinnykumari
Contributor

Looks like waiting 30 minutes for the MCP to complete the update on all nodes is not enough for all the tests we run. Two of the e2e-op tests failed due to timeout:

  1. During the KernelType test - one of the MCD pods stopped doing updates in the middle
  2. During the KernelArguments test - two worker nodes got updated, and it seems the timeout was reached, so the third worker node never even started updating (mcd log, controller log)

Not sure why this is happening all of a sudden. From the e2e-gcp-op test history, this seems to have started around 2020-05-08 19:11:36 UTC.

This doesn't seem related to the MCO, since the last PR merged into MCO master was on 2020-05-06.

@miabbott @cgwalters Wondering if something changed on the RHCOS side which could be related 🤔
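
For reference, the per-node progress the controller waits on is tracked through node annotations (the desiredConfig and state keys are visible in the controller logs below). A minimal client-go sketch that dumps them; the currentConfig key and the kubeconfig path are assumptions:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Hypothetical kubeconfig path.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// List only the worker nodes, since that is the pool being updated.
	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{
		LabelSelector: "node-role.kubernetes.io/worker",
	})
	if err != nil {
		panic(err)
	}
	for _, n := range nodes.Items {
		a := n.Annotations
		fmt.Printf("%s: desired=%s current=%s state=%s\n", n.Name,
			a["machineconfiguration.openshift.io/desiredConfig"],
			a["machineconfiguration.openshift.io/currentConfig"],
			a["machineconfiguration.openshift.io/state"])
	}
}
```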

@yuqi-zhang
Contributor Author

Right, that sounds like we hit the 2h timeout; let me also try bumping that

@yuqi-zhang yuqi-zhang force-pushed the bump-wait-pool-timeout branch from 15b5236 to 050bdee on May 12, 2020 14:31
@sinnykumari
Contributor

Looks like I spoke a bit too early about the time issue. After looking closely at the controller log, it seems each node takes approximately 8 mins:

I0512 09:01:09.220518       1 render_controller.go:516] Pool worker: now targeting: rendered-worker-726c784ab300dec5da9f9a2467a5082e
I0512 09:01:14.220845       1 node_controller.go:759] Setting node ci-op-q5hh5-w-b-cf67x.c.openshift-gce-devel-ci.internal to desired config rendered-worker-726c784ab300dec5da9f9a2467a5082e
I0512 09:01:14.243081       1 node_controller.go:453] Pool worker: node ci-op-q5hh5-w-b-cf67x.c.openshift-gce-devel-ci.internal changed machineconfiguration.openshift.io/desiredConfig = rendered-worker-726c784ab300dec5da9f9a2467a5082e
I0512 09:01:15.267968       1 node_controller.go:453] Pool worker: node ci-op-q5hh5-w-b-cf67x.c.openshift-gce-devel-ci.internal changed machineconfiguration.openshift.io/state = Working
I0512 09:01:15.959566       1 node_controller.go:434] Pool worker: node ci-op-q5hh5-w-b-cf67x.c.openshift-gce-devel-ci.internal is now reporting unready: node ci-op-q5hh5-w-b-cf67x.c.openshift-gce-devel-ci.internal is reporting Unschedulable
I0512 09:08:29.298006       1 node_controller.go:443] Pool worker: node ci-op-q5hh5-w-b-cf67x.c.openshift-gce-devel-ci.internal has completed update to rendered-worker-726c784ab300dec5da9f9a2467a5082e
I0512 09:08:29.322461       1 node_controller.go:436] Pool worker: node ci-op-q5hh5-w-b-cf67x.c.openshift-gce-devel-ci.internal is now reporting ready

so, 30 mins should be enough?
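
Back-of-envelope on those numbers, assuming the three workers update serially at ~8 minutes each (as in the log above):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// ~8 minutes per node, from the controller log timestamps above
	// (09:01:14 -> 09:08:29), three workers updated one at a time.
	perNode := 8 * time.Minute
	total := 3 * perNode        // 24m serially
	timeout := 30 * time.Minute // the bumped waitForPoolComplete budget
	fmt.Printf("total=%v headroom=%v\n", total, timeout-total)
	// total=24m0s headroom=6m0s
}
```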

Comment thread on Makefile (Outdated)

Contributor

ah, good catch Jerry

@yuqi-zhang
Contributor Author

It seems that the time it takes for the MCD to get ready post-reboot went up from 30s to 5 min.

example of a successful run:

I0508 02:05:54.874116    2278 daemon.go:564] Starting MachineConfigDaemon
I0508 02:05:54.874240    2278 daemon.go:571] Enabling Kubelet Healthz Monitor
E0508 02:05:58.357412    2278 reflector.go:178] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Node: Get https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
E0508 02:05:58.357419    2278 reflector.go:178] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to list *v1.MachineConfig: Get https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
I0508 02:06:20.702700    2278 daemon.go:767] Current config: rendered-worker-3cc47eec6c9d5de66f0e27f866ea1ef3
I0508 02:06:20.702726    2278 daemon.go:768] Desired config: rendered-worker-297ef2c427f1dc0c8b9189a8f3caa500

example now:

I0512 07:45:27.942365    2186 daemon.go:564] Starting MachineConfigDaemon
I0512 07:45:27.942468    2186 daemon.go:571] Enabling Kubelet Healthz Monitor
E0512 07:45:30.886759    2186 reflector.go:178] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Node: Get https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
E0512 07:45:30.886943    2186 reflector.go:178] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to list *v1.MachineConfig: Get https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
I0512 07:50:26.363793    2186 daemon.go:767] Current config: rendered-worker-b35a6311f1b4a539a89158ff922790a9
I0512 07:50:26.363819    2186 daemon.go:768] Desired config: rendered-worker-231aad58dd7b1c50829219d7b4026cc9

so it's taking the informer an extra 4 minutes to get ready?
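
If the informer is what's slow, a minimal standalone sketch like this could time that window (kubeconfig path hypothetical); startup blocks in the cache sync while the reflector retries its initial List against the unreachable apiserver VIP:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	factory := informers.NewSharedInformerFactory(client, 0)
	nodeInformer := factory.Core().V1().Nodes().Informer()

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)

	start := time.Now()
	// Blocks until the reflector's initial List of nodes succeeds; while
	// the apiserver is unreachable ("no route to host"), the reflector
	// retries with backoff, and the extra minutes accumulate here.
	if !cache.WaitForCacheSync(stop, nodeInformer.HasSynced) {
		fmt.Println("cache never synced")
		return
	}
	fmt.Printf("node informer synced after %v\n", time.Since(start))
}
```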

@yuqi-zhang yuqi-zhang changed the title from "WIP: Bump waitForPoolComplete to 30 mins" to "Bug 1834895: Bump waitForPoolComplete to 30 mins" on May 12, 2020
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-unspecified Referenced Bugzilla bug's severity is unspecified for the PR. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels May 12, 2020
@openshift-ci-robot
Contributor

@yuqi-zhang: This pull request references Bugzilla bug 1834895, which is invalid:

  • expected the bug to target the "4.5.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.


In response to this:

Bug 1834895: Bump waitForPoolComplete to 30 mins

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@yuqi-zhang
Contributor Author

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/severity-unspecified Referenced Bugzilla bug's severity is unspecified for the PR. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels May 12, 2020
@openshift-ci-robot
Contributor

@yuqi-zhang: This pull request references Bugzilla bug 1834895, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kikisdeliveryservice
Contributor

The underlying cause of the timeouts should be figured out, and that fix will fix our CI... I don't think bumping the timeout is going to facilitate that. For now, holding as I will look further.

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 12, 2020
@kikisdeliveryservice
Contributor

kikisdeliveryservice commented May 13, 2020

Well, it turns out bumping didn't do anything anyway, and the test ran for 3.5 hours.

Looking at TestMCDeployed from this run with the extra time, it is taking 3750.12s, but when I look at a recent passing run (last Thursday), it takes ~1558.13s. In 4.4 it took ~1508.10s, and in 4.3 the same test took ~926.09s. Something very recent is causing this to take a ton of time, and I'm not yet sure what it is.

Here's a passing run from a few days ago for comparison, showing the timing differences:
https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1719/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2123/artifacts/e2e-gcp-op/container-logs/test.log

@kikisdeliveryservice
Contributor

kikisdeliveryservice commented May 13, 2020

Opened 2 BZs for the differences I see between new runs that fail and older runs (from a few days ago) that pass:
https://bugzilla.redhat.com/show_bug.cgi?id=1835042
https://bugzilla.redhat.com/show_bug.cgi?id=1834948

Tests are timing out for e2e-gcp-op due to hitting this timeout,
which takes just a little over 20 minutes at the moment.

Signed-off-by: Yu Qi Zhang <jerzhang@redhat.com>
@yuqi-zhang yuqi-zhang force-pushed the bump-wait-pool-timeout branch from 050bdee to aeb0328 on May 13, 2020 13:41
@yuqi-zhang
Contributor Author

Just out of curiosity, I bumped to 4h to see how long it actually takes the test to finish.

@openshift-ci-robot
Contributor

@yuqi-zhang: The following test failed, say /retest to rerun all failed tests:

Test name: ci/prow/e2e-gcp-op
Commit: aeb0328
Details: link
Rerun command: /test e2e-gcp-op

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@yuqi-zhang
Contributor Author

Fixed in #1731
