Bug 1834895: Bump waitForPoolComplete to 30 mins #1723

Closed

yuqi-zhang wants to merge 1 commit into openshift:master from yuqi-zhang:bump-wait-pool-timeout

Conversation

@yuqi-zhang
Contributor

Tests are timing out for e2e-gcp-op due to hitting this timeout,
which takes just a little over 20 minutes at the moment.
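
For context, waitForPoolComplete is the e2e helper that polls the machine config pool until it reports the target rendered config. A minimal sketch of that kind of poll, assuming the MCO's generated clientset; the names and wiring here are illustrative, not a verbatim copy of the test helper:

```go
package e2e

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"

	mcfgclient "github.com/openshift/machine-config-operator/pkg/generated/clientset/versioned"
)

// waitForPoolComplete polls until the pool targets the rendered config and
// every machine reports updated. Sketch only; the real helper differs.
func waitForPoolComplete(cs mcfgclient.Interface, pool, target string) error {
	// Bumped to 30 minutes: e2e-gcp-op runs were finishing just over the
	// old 20-minute limit.
	return wait.Poll(2*time.Second, 30*time.Minute, func() (bool, error) {
		mcp, err := cs.MachineconfigurationV1().MachineConfigPools().Get(context.TODO(), pool, metav1.GetOptions{})
		if err != nil {
			// Tolerate transient API errors rather than aborting the wait.
			return false, nil
		}
		return mcp.Status.Configuration.Name == target &&
			mcp.Status.UpdatedMachineCount == mcp.Status.MachineCount, nil
	})
}
```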

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 11, 2020
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 11, 2020
@haircommander
Member

LOL I was just about to submit a PR for 25 😄

@yuqi-zhang
Contributor Author

/retest

Different errors

@sinnykumari
Contributor

/retest

@sinnykumari
Contributor

Looks like waiting 30 minutes for the MCP to complete the update on all nodes is not enough for all the tests we run. Two of the e2e-op tests failed due to timeout:

  1. During the KernelType test - one of the MCD pods stopped doing updates in the middle
  2. During the KernelArguments test - two worker nodes got updated, and it seems the timeout was reached, so the third worker node never even started updating (mcd log, controller log)

Not sure why this is happening all of a sudden. From the e2e-gcp-op test history, this seems to have started around 2020-05-08 19:11:36 UTC.

This doesn't seem related to the MCO, since the last PR merged into MCO master was on 2020-05-06.

@miabbott @cgwalters Wondering if something changed on the RHCOS side which could be related 🤔
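
For reference, the per-node progress the controller waits on is tracked through node annotations (the desiredConfig and state keys are visible in the controller logs below). A minimal client-go sketch that dumps them; the currentConfig key and the kubeconfig path are assumptions:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Hypothetical kubeconfig path.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// List only the worker nodes, since that is the pool being updated.
	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{
		LabelSelector: "node-role.kubernetes.io/worker",
	})
	if err != nil {
		panic(err)
	}
	for _, n := range nodes.Items {
		a := n.Annotations
		fmt.Printf("%s: desired=%s current=%s state=%s\n", n.Name,
			a["machineconfiguration.openshift.io/desiredConfig"],
			a["machineconfiguration.openshift.io/currentConfig"],
			a["machineconfiguration.openshift.io/state"])
	}
}
```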

@yuqi-zhang
Contributor Author

Right, that sounds like we hit the 2h timeout; let me also try bumping that

@yuqi-zhang yuqi-zhang force-pushed the bump-wait-pool-timeout branch from 15b5236 to 050bdee on May 12, 2020 14:31
@sinnykumari
Contributor

Looks like I spoke a bit too early about the time issue. After looking closely at the controller log, it seems each node takes approximately 8 mins:

I0512 09:01:09.220518       1 render_controller.go:516] Pool worker: now targeting: rendered-worker-726c784ab300dec5da9f9a2467a5082e
I0512 09:01:14.220845       1 node_controller.go:759] Setting node ci-op-q5hh5-w-b-cf67x.c.openshift-gce-devel-ci.internal to desired config rendered-worker-726c784ab300dec5da9f9a2467a5082e
I0512 09:01:14.243081       1 node_controller.go:453] Pool worker: node ci-op-q5hh5-w-b-cf67x.c.openshift-gce-devel-ci.internal changed machineconfiguration.openshift.io/desiredConfig = rendered-worker-726c784ab300dec5da9f9a2467a5082e
I0512 09:01:15.267968       1 node_controller.go:453] Pool worker: node ci-op-q5hh5-w-b-cf67x.c.openshift-gce-devel-ci.internal changed machineconfiguration.openshift.io/state = Working
I0512 09:01:15.959566       1 node_controller.go:434] Pool worker: node ci-op-q5hh5-w-b-cf67x.c.openshift-gce-devel-ci.internal is now reporting unready: node ci-op-q5hh5-w-b-cf67x.c.openshift-gce-devel-ci.internal is reporting Unschedulable
I0512 09:08:29.298006       1 node_controller.go:443] Pool worker: node ci-op-q5hh5-w-b-cf67x.c.openshift-gce-devel-ci.internal has completed update to rendered-worker-726c784ab300dec5da9f9a2467a5082e
I0512 09:08:29.322461       1 node_controller.go:436] Pool worker: node ci-op-q5hh5-w-b-cf67x.c.openshift-gce-devel-ci.internal is now reporting ready

so, 30 mins should be enough?
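
Back-of-envelope on those numbers, assuming the three workers update serially at ~8 minutes each (as in the log above):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// ~8 minutes per node, from the controller log timestamps above
	// (09:01:14 -> 09:08:29), three workers updated one at a time.
	perNode := 8 * time.Minute
	total := 3 * perNode        // 24m serially
	timeout := 30 * time.Minute // the bumped waitForPoolComplete budget
	fmt.Printf("total=%v headroom=%v\n", total, timeout-total)
	// total=24m0s headroom=6m0s
}
```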

Comment thread on Makefile (Outdated)

Contributor

ah, good catch Jerry

@yuqi-zhang
Contributor Author

It seems that the time it takes for the MCD to get ready post-reboot went up from 30s to 5 min.

example of a successful run:

I0508 02:05:54.874116    2278 daemon.go:564] Starting MachineConfigDaemon
I0508 02:05:54.874240    2278 daemon.go:571] Enabling Kubelet Healthz Monitor
E0508 02:05:58.357412    2278 reflector.go:178] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Node: Get https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
E0508 02:05:58.357419    2278 reflector.go:178] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to list *v1.MachineConfig: Get https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
I0508 02:06:20.702700    2278 daemon.go:767] Current config: rendered-worker-3cc47eec6c9d5de66f0e27f866ea1ef3
I0508 02:06:20.702726    2278 daemon.go:768] Desired config: rendered-worker-297ef2c427f1dc0c8b9189a8f3caa500

example now:

I0512 07:45:27.942365    2186 daemon.go:564] Starting MachineConfigDaemon
I0512 07:45:27.942468    2186 daemon.go:571] Enabling Kubelet Healthz Monitor
E0512 07:45:30.886759    2186 reflector.go:178] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Node: Get https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
E0512 07:45:30.886943    2186 reflector.go:178] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to list *v1.MachineConfig: Get https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
I0512 07:50:26.363793    2186 daemon.go:767] Current config: rendered-worker-b35a6311f1b4a539a89158ff922790a9
I0512 07:50:26.363819    2186 daemon.go:768] Desired config: rendered-worker-231aad58dd7b1c50829219d7b4026cc9

so it's taking the informer an extra 4 minutes to get ready?
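
If the informer is what's slow, a minimal standalone sketch like this could time that window (kubeconfig path hypothetical); startup blocks in the cache sync while the reflector retries its initial List against the unreachable apiserver VIP:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	factory := informers.NewSharedInformerFactory(client, 0)
	nodeInformer := factory.Core().V1().Nodes().Informer()

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)

	start := time.Now()
	// Blocks until the reflector's initial List of nodes succeeds; while
	// the apiserver is unreachable ("no route to host"), the reflector
	// retries with backoff, and the extra minutes accumulate here.
	if !cache.WaitForCacheSync(stop, nodeInformer.HasSynced) {
		fmt.Println("cache never synced")
		return
	}
	fmt.Printf("node informer synced after %v\n", time.Since(start))
}
```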

@yuqi-zhang yuqi-zhang changed the title from "WIP: Bump waitForPoolComplete to 30 mins" to "Bug 1834895: Bump waitForPoolComplete to 30 mins" on May 12, 2020
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-unspecified Referenced Bugzilla bug's severity is unspecified for the PR. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels May 12, 2020
@openshift-ci-robot
Contributor

@yuqi-zhang: This pull request references Bugzilla bug 1834895, which is invalid:

  • expected the bug to target the "4.5.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.


In response to this:

Bug 1834895: Bump waitForPoolComplete to 30 mins

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@yuqi-zhang
Contributor Author

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/severity-unspecified Referenced Bugzilla bug's severity is unspecified for the PR. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels May 12, 2020
@openshift-ci-robot
Contributor

@yuqi-zhang: This pull request references Bugzilla bug 1834895, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kikisdeliveryservice
Contributor

The underlying cause of the timeouts should be figured out, and that fix will fix our CI... I don't think bumping the timeout is going to facilitate that. For now, holding as I will look further.

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 12, 2020
@kikisdeliveryservice
Contributor

kikisdeliveryservice commented May 13, 2020

Well, it turns out bumping didn't do anything anyway, and the test ran for 3.5 hours.

Looking at TestMCDeployed from this run with the extra time, it is taking 3750.12s, but when I look at a recent passing run (last Thursday), it takes ~1558.13s. In 4.4 it took ~1508.10s, and in 4.3 the same test took ~926.09s. Something very recent is causing this to take a ton of time, and I'm not yet sure what it is.

Here's a passing run from a few days ago for comparison, showing the timing differences:
https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1719/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2123/artifacts/e2e-gcp-op/container-logs/test.log

@kikisdeliveryservice
Contributor

kikisdeliveryservice commented May 13, 2020

Opened 2 BZs for the differences I see between new runs that fail and older runs (from a few days ago) that pass:
https://bugzilla.redhat.com/show_bug.cgi?id=1835042
https://bugzilla.redhat.com/show_bug.cgi?id=1834948

Tests are timing out for e2e-gcp-op due to hitting this timeout,
which takes just a little over 20 minutes at the moment.

Signed-off-by: Yu Qi Zhang <jerzhang@redhat.com>
@yuqi-zhang yuqi-zhang force-pushed the bump-wait-pool-timeout branch from 050bdee to aeb0328 on May 13, 2020 13:41
@yuqi-zhang
Contributor Author

Just out of curiosity, I bumped to 4h to see how long it actually takes the test to finish.

@openshift-ci-robot
Contributor

@yuqi-zhang: The following test failed, say /retest to rerun all failed tests:

Test name: ci/prow/e2e-gcp-op
Commit: aeb0328
Details: link
Rerun command: /test e2e-gcp-op

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@yuqi-zhang
Contributor Author

Fixed in #1731
