
test: increase e2e test run with 15 minutes #2184

Closed
sinnykumari wants to merge 1 commit into openshift:master from sinnykumari:gcp-op-timeout

Conversation

@sinnykumari
Contributor

Fixes the e2e-gcp-op test failing in CI due to a timeout.
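The single-commit diff isn't rendered in this thread. As a rough sketch only (the constant names and values below are assumptions, not the PR's actual change), a 15-minute bump to a Go e2e wait timeout might look like:

```go
// Illustrative sketch only: constant names and values are assumed,
// not taken from this PR's diff.
package e2e

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

const (
	pollInterval = 2 * time.Second
	// Assumed previous value: 20 * time.Minute. The +15m headroom absorbs
	// the slower node reboots discussed later in this thread.
	poolTimeout = 35 * time.Minute
)

// waitForPoolComplete polls until done reports the MachineConfigPool has
// converged, failing when the (now longer) timeout elapses first.
func waitForPoolComplete(done wait.ConditionFunc) error {
	return wait.Poll(pollInterval, poolTimeout, done)
}
```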

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sinnykumari

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 27, 2020
@sinnykumari
Contributor Author

sinnykumari commented Oct 27, 2020

Analyzing some of the gcp-op test run logs, it seems the system reboot time has increased by around 30 seconds (a sketch of the delta computation follows the log excerpts below).

  1. With increased reboot time (~90 seconds):
    https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2182/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1320812936498778112/artifacts/e2e-gcp-op/pods/openshift-machine-config-operator_machine-config-daemon-5c86g_machine-config-daemon.log
I1026 20:31:17.810600    1972 update.go:1590] initiating reboot: Node will reboot into config rendered-worker-ddf64801127323a695f308d91109d951
I1026 20:31:17.893443    1972 daemon.go:641] Shutting down MachineConfigDaemon
I1026 20:32:50.940436    2088 start.go:108] Version: machine-config-daemon-4.6.0-202006240615.p0-370-g999521b6-dirty (999521b61c81577b156331b7bf8495347a8503c1)
I1026 20:32:50.949591    2088 start.go:121] Calling chroot("/rootfs")
I1026 20:32:50.949813    2088 rpm-ostree.go:261] Running captured: rpm-ostree status --json
I1026 20:32:51.457544    2088 daemon.go:226] Booted osImageURL: registry.build01.ci.openshift.org/ci-op-y4rk2lpr/stable@sha256:ce348cfb50d39297969c9a0c2f928d23eb2ab8ded7cacd5e39685bd0931bbfac (47.82.202010261347-0)

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2181/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1321047711511744512/artifacts/e2e-gcp-op/pods/openshift-machine-config-operator_machine-config-daemon-h5bd9_machine-config-daemon.log

I1027 12:09:08.575381    1848 update.go:1607] initiating reboot: Node will reboot into config rendered-worker-4ad27ca30531f77416a00e38bda8c8e6
I1027 12:09:08.668785    1848 daemon.go:641] Shutting down MachineConfigDaemon
I1027 12:10:44.764737    2142 start.go:108] Version: machine-config-daemon-4.6.0-202006240615.p0-370-gf72a5ace-dirty (f72a5ace0b5432ad96bba59b2b91633d8bb8315c)
I1027 12:10:44.772268    2142 start.go:121] Calling chroot("/rootfs")
I1027 12:10:44.772468    2142 rpm-ostree.go:261] Running captured: rpm-ostree status --json
I1027 12:10:45.343467    2142 daemon.go:226] Booted osImageURL: registry.build01.ci.openshift.org/ci-op-gdsvbywh/stable@sha256:cd6d0d6c4cbaa7ceaca75e7a00fad2ced6344bc89c38cadfa121b24209038e2f (47.82.202010270142-0)
  2. Previous reboot time (~60 seconds):

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2177/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1319724839077941248/artifacts/e2e-gcp-op/pods/openshift-machine-config-operator_machine-config-daemon-wmvdr_machine-config-daemon.log

I1023 21:06:25.466932    1842 update.go:1590] initiating reboot: Node will reboot into config rendered-infra-54a8f08a03796f8c94187a986edcbc7a
I1023 21:06:25.594958    1842 daemon.go:641] Shutting down MachineConfigDaemon
I1023 21:07:30.693671    1850 start.go:108] Version: machine-config-daemon-4.6.0-202006240615.p0-370-g56ded555-dirty (56ded5550c030d88cdffa8de630ff1a1287303f3)
I1023 21:07:30.700093    1850 start.go:121] Calling chroot("/rootfs")
I1023 21:07:30.700507    1850 rpm-ostree.go:261] Running captured: rpm-ostree status --json
I1023 21:07:31.107832    1850 daemon.go:226] Booted osImageURL: registry.build01.ci.openshift.org/ci-op-vcw3wmdq/stable@sha256:113a36da35d4aff5f8ef43ff97bed0e97cdaea2139c8db44fbac31051bec43c8 (47.82.202010231442-0)
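
The ~90-second and ~60-second figures above are just the gap between the MCD's "Shutting down MachineConfigDaemon" timestamp and the first log line of the restarted daemon. A minimal sketch (illustration only, not code from this repo) of computing that delta from klog timestamp prefixes:

```go
// Rough illustration: derive the reboot deltas quoted above from
// klog timestamp prefixes.
package main

import (
	"fmt"
	"time"
)

// klogTime parses the "I1026 20:31:17.893443" prefix of a klog line.
// klog omits the year, so this is only valid for deltas within one year.
func klogTime(prefix string) (time.Time, error) {
	// Skip the severity letter ("I"); layout is month/day plus microseconds.
	return time.Parse("0102 15:04:05.000000", prefix[1:])
}

func main() {
	// Timestamps copied from the first log excerpt above.
	shutdown, _ := klogTime("I1026 20:31:17.893443")  // daemon shuts down
	restarted, _ := klogTime("I1026 20:32:50.940436") // first line after reboot
	fmt.Println("reboot took ~" + restarted.Sub(shutdown).Round(time.Second).String())
	// Prints: reboot took ~1m33s (the older run's delta works out to ~1m5s)
}
```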

@sinnykumari
Contributor Author

I am not sure why the time has increased; we can investigate that later. Until then, let's increase the timeout so that we unblock PRs.

@kikisdeliveryservice
Contributor

kikisdeliveryservice commented Oct 27, 2020

I think we should figure out the underlying problem and not extend the test time, since the reboot time per node is 50% higher.

@kikisdeliveryservice
Contributor

(As per slack, we're doing some investigation on this before deciding how to resolve)

@sinnykumari
Contributor Author

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 27, 2020
@cgwalters
Member

Sadly, we never summarized this bug anywhere public on the MCO repo, AFAICS.

openshift/cluster-network-operator#859 merged, which should help; let's try to verify that.
The other case is https://bugzilla.redhat.com/show_bug.cgi?id=1893360 - it needs some design work.

@cgwalters
Member

OK, another status update on this: I kept being confused about why it wasn't helping, but it turns out CI (and nightly) payload generation had been broken until just now, so our CI runs were still using an old cluster-network-operator.

Let's keep an eye out now to see if openshift/cluster-network-operator#859 actually improves things!

fixes e2e-gcp-op test failing in ci due to timeout
@sinnykumari
Contributor Author

Nightly images are green now, so openshift/cluster-network-operator#859 should be included in the recent payload in CI runs. As a sanity check, I re-triggered the test here.

@cgwalters
Member

We should probably consider this to start clearing out the PR backlog.
/retest

@kikisdeliveryservice
Contributor

We should probably consider this to start clearing out the PR backlog.

e2e-aws and e2e-aws-serial are also both currently broken and being worked on. We shouldn't be overriding all of those (3+) tests just to get PRs in, IMO.

For gcp-op: openshift/cluster-dns-operator#213 (comment) (and #2229) need to be merged, but can't because of the above e2e-aws issues.

So it seems the e2e-aws tests need to get fixed first, because they're blocking the DNS PR, which, once it merges, unblocks our CI.

@cgwalters
Member

It's all interlinked though. We now have so many PRs outstanding that we're waiting on AWS "leases" for some of them; e.g.
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-dns-operator/213/pull-ci-openshift-cluster-dns-operator-master-e2e-aws/1328393945482268672
is waiting and won't even start until we clear the queue some.

For example, we could combine this PR with #2229 and merge when e2e-gcp-op goes green, even if e2e-aws isn't (changes in our tests clearly can't affect that).

Basically I think we should try to do something other than be blocked.

@kikisdeliveryservice
Contributor

kikisdeliveryservice commented Nov 16, 2020

Right, but a bigger problem mentioned in Slack is that CI and nightly payload acceptance are also broken due to this...

I'm going to go and try to create more urgency on the AWS issue, because we shouldn't be merging in these conditions, and it should be treated as more important than it currently is.

I don't think overriding required tests across the board is the right choice. We can land #2229, but we will still be blocked on other required tests. If there are problems with payloads and across OCP, AWS really needs to get fixed, because otherwise green means nothing. ☹️

@openshift-merge-robot
Contributor

@sinnykumari: The following tests failed, say /retest to rerun all failed tests:

Test name                      Commit   Details  Rerun command
ci/prow/e2e-gcp-op             dcf26d4  link     /test e2e-gcp-op
ci/prow/e2e-ovn-step-registry  dcf26d4  link     /test e2e-ovn-step-registry
ci/prow/e2e-agnostic-upgrade   dcf26d4  link     /test e2e-agnostic-upgrade
ci/prow/e2e-aws-serial         dcf26d4  link     /test e2e-aws-serial
ci/prow/e2e-aws                dcf26d4  link     /test e2e-aws
ci/prow/okd-e2e-aws            dcf26d4  link     /test okd-e2e-aws

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@kikisdeliveryservice
Contributor

As an update: this comes down to https://bugzilla.redhat.com/show_bug.cgi?id=1897604 (which is also blocking the DNS fix, openshift/cluster-dns-operator#213) and is blocking all merges across OCP. There's a new channel (incident-kcm..) now where people have started working on it.

Instead of trying to hack around and merge this, which won't help because other required tests are blocking on all repos, including this one, we are waiting for the above BZ to be resolved.

@sinnykumari
Contributor Author

Closing this PR since the actual slowness issue has been fixed with openshift/cluster-dns-operator#213.
/close

@openshift-ci-robot
Contributor

@sinnykumari: Closed this PR.


In response to this:

Closing this PR since the actual slowness issue has been fixed with openshift/cluster-dns-operator#213.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


Labels

approved: Indicates a PR has been approved by an approver from all required OWNERS files.
do-not-merge/hold: Indicates that a PR should not merge because someone has issued a /hold command.
team-mco
