ci-operator/step-registry/openshift/e2e/test: Add 2h active_deadline_seconds #12647
Conversation
Looking at [1]:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/785/pull-ci-openshift-cluster-network-operator-master-e2e-upgrade/1308837080670932992/build-log.txt | grep 'Executing\|Pod .* succeeded after\|Process did not'
2020/09/23 18:38:46 Executing "e2e-upgrade-ipi-install-monitoringpvc"
2020/09/23 18:38:52 Pod e2e-upgrade-ipi-install-monitoringpvc succeeded after 4s
2020/09/23 18:38:52 Executing "e2e-upgrade-ipi-install-loki"
2020/09/23 18:38:58 Pod e2e-upgrade-ipi-install-loki succeeded after 4s
2020/09/23 18:38:58 Executing "e2e-upgrade-ipi-conf"
2020/09/23 18:39:25 Pod e2e-upgrade-ipi-conf succeeded after 26s
2020/09/23 18:39:25 Executing "e2e-upgrade-ipi-conf-gcp"
2020/09/23 18:39:31 Pod e2e-upgrade-ipi-conf-gcp succeeded after 4s
2020/09/23 18:39:31 Executing "e2e-upgrade-ipi-install-rbac"
2020/09/23 18:39:36 Pod e2e-upgrade-ipi-install-rbac succeeded after 4s
2020/09/23 18:39:36 Executing "e2e-upgrade-ipi-install-install"
2020/09/23 19:18:00 Pod e2e-upgrade-ipi-install-install succeeded after 38m23s
2020/09/23 19:18:01 Executing "e2e-upgrade-openshift-e2e-test"
{"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2020-09-23T22:34:21Z"}
2020/09/23 22:34:22 Executing "e2e-upgrade-gather-loki"
{"component":"entrypoint","file":"prow/entrypoint/run.go:250","func":"k8s.io/test-infra/prow/entrypoint.gracefullyTerminate","level":"error","msg":"Process did not exit before 30m0s grace period","severity":"error","time":"2020-09-23T23:04:21Z"}

That job was moving along fine, but hung up in the openshift-e2e-test step. After around 3h in the step, it hit the cumulative 4h timeout for the ci-operator run, terminated the openshift-e2e-test step, and entered a 30m teardown grace period. The first step of that grace period was the Loki gather, but, presumably because the cluster was dead, the Loki step hung and absorbed the entire grace period. This left no time for the emergency gather-extra collection, cluster teardown, or artifact uploads.

This commit adds a 2h timeout to openshift-e2e-test and a 10m timeout to gather-loki to limit the damage from similar hangs in the future, allowing us to get artifacts that will help us understand why the cluster hung in the openshift-e2e-test step.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/785/pull-ci-openshift-cluster-network-operator-master-e2e-upgrade/1308837080670932992
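For reference, a minimal sketch of what the ref change looks like, assuming the conventional step-registry layout (the active_deadline_seconds field name comes from the PR title; the file name, source image, and resource values here are illustrative, not the verbatim diff):

# openshift-e2e-test-ref.yaml (illustrative sketch)
ref:
  as: openshift-e2e-test
  from: tests                               # assumed source image
  commands: openshift-e2e-test-commands.sh  # assumed per-convention script name
  active_deadline_seconds: 7200             # 2h = 7200s: fail the step pod well
                                            # before the cumulative 4h job timeout
  resources:
    requests:
      cpu: "3"        # illustrative values
      memory: 600Mi

The gather-loki ref would similarly get active_deadline_seconds: 600 (10m = 600s), so a dead cluster can no longer let the Loki gather eat the whole grace period.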
Using the 'upgrade-all' precedent from cfcd60f (release: Standardize all ci-chat-bot jobs, 2020-04-27, openshift#8594). I'm not clear on why we are joining with a newline instead of '&&'; presumably this is getting wrapped in a 'set -e' or equivalent. But I'm sticking with newline to match precedent.

This increases the risk that we time out these slow jobs (e.g. [1] took 3h42m), but we really want to exercise tests like openshift/origin@9f7fe0089d (Add test for scaling machineSets, 2019-04-11, openshift/origin#22564), which is in openshift/conformance/serial, because machines launch with the born-in boot images until we get [2].

And in fact, the reason we didn't have this post-update suite in 4.6 was 3bc9d8e (stop running e2e tests after three upgrades because we hit timeouts and lose upgrade signal, 2020-10-05, openshift#12436). But since 3c915e2 (ci-operator/step-registry/openshift/e2e/test: Add 2h active_deadline_seconds, 2020-10-09, openshift#12647), we no longer have to worry about getting logs when that step is slow. So we might not pass if we're slow, but we'll still get logs to debug why we're slow.

Only for 4.6 and later, because 4.5 is live, and if we had problems there we'd probably have already heard about them from customers.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.3-to-4.4-to-4.5-to-4.6-ci/1318709056830967808
[2]: openshift/enhancements#201
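To make the 'set -e' presumption concrete: under set -e, newline-joined commands short-circuit on failure just like an '&&' chain would. A shell sketch (the suite name is the one cited above; the wrapper itself is the presumption being described, not something this commit confirms):

# Sketch: why a newline join can be equivalent to '&&' under 'set -e'.
set -e
openshift-tests run openshift/conformance/serial
# With 'set -e', reaching this line means the suite above exited 0 -- the same
# guarantee '&&' gives. Without 'set -e', the newline join would keep running
# past a failure, which is where the two joins actually differ.
echo "serial conformance passed; continuing"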
Like 3c915e2 (ci-operator/step-registry/openshift/e2e/test: Add 2h active_deadline_seconds, 2020-10-09, openshift#12647), but for the gather-core-dump step. This helps preserve time to run further gathers and tear down the cluster under test if gather-core-dump gets hung up, like it did for over 2h here [1]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-update-keys/25/pull-ci-openshift-cluster-update-keys-master-e2e-aws/1326950727196610560/build-log.txt | grep -A2 'Executing pod.*gather-core-dump'
2020/11/12 20:02:15 Executing pod "e2e-aws-gather-core-dump"
2020/11/12 20:02:42 Container cp-secret-wrapper in pod e2e-aws-gather-core-dump completed successfully
{"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2020-11-12T22:11:29Z"}

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-update-keys/25/pull-ci-openshift-cluster-update-keys-master-e2e-aws/1326950727196610560
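Mechanically, a step-level active_deadline_seconds presumably lands on the generated test pod as Kubernetes' spec.activeDeadlineSeconds, which the kubelet enforces even when the container is wedged talking to a dead cluster. A sketch of the resulting pod shape (the deadline value, image, and command below are hypothetical; the commit's actual number for gather-core-dump isn't quoted above):

apiVersion: v1
kind: Pod
metadata:
  name: e2e-aws-gather-core-dump
spec:
  activeDeadlineSeconds: 600   # hypothetical 10m; once exceeded, Kubernetes marks the
                               # pod Failed (DeadlineExceeded) regardless of what the
                               # gather process is stuck on
  restartPolicy: Never
  containers:
  - name: test
    image: image-registry.example/ocp/tests:latest                   # placeholder image
    command: ["/bin/bash", "-c", "/tmp/gather-core-dump-commands.sh"] # assumed convention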
Like 3c915e2 (ci-operator/step-registry/openshift/e2e/test: Add 2h active_deadline_seconds, 2020-10-09, openshift#12647), but for the deprovision step. This helps preserve time to gather assets if ipi-deprovision gets hung up, like it did for over 2h here [1]:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-api-actuator-pkg/190/pull-ci-openshift-cluster-api-actuator-pkg-master-e2e-aws-operator/1329702034437312512/build-log.txt | grep -A2 'Executing pod.*ipi-deprovision'
2020/11/20 09:54:54 Executing pod "e2e-aws-operator-ipi-deprovision-deprovision"
2020/11/20 09:54:57 Container cp-secret-wrapper in pod e2e-aws-operator-ipi-deprovision-deprovision completed successfully
{"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2020-11-20T12:24:09Z"}

25 minutes should be enough for most teardowns. We have a number of successful runs which take longer:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&type=build-log&context=0&search=Pod+.*ipi-deprovision-deprovision+succeeded+after' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's/.*after \([0-9]*\)\([smh]\).*/\2 \1/' | sort -n | uniq -c
     29 h 1
      1 h 2
     16 m 1
     21 m 10
     36 m 11
     33 m 12
     47 m 13
     40 m 14
     23 m 15
     21 m 16
     22 m 17
     20 m 18
     17 m 19
     18 m 2
     11 m 20
     13 m 21
     19 m 22
     22 m 23
      8 m 24
     17 m 25
     12 m 26
      8 m 27
     14 m 28
      9 m 29
     16 m 3
     15 m 30
     17 m 31
     ...

(Each output line is the uniq -c occurrence count, the duration's unit character, and the duration's leading digits; the sort is lexical, so 'm 1' and 'm 19' land before 'm 2'. The two 'h' rows say 29 runs took 1h-something and one took 2h-something.)

Success after over an hour is crazy. Which jobs were those?

$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&type=build-log&context=0&search=Pod+.*ipi-deprovision-deprovision+succeeded+after+[0-9]h' | jq -r 'to_entries[] | .key as $uri | .value | to_entries[].value[].context[] | $uri + " " + .'
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-kube-reporting-metering-operator-release-4.5-metering-periodic-aws/1333923915013033984 2020/12/02 01:27:54 Pod metering-periodic-aws-ipi-deprovision-deprovision succeeded after 1h12m0s
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-odo-master-v4.6-integration-e2e-periodic/1333923958377943040 2020/12/02 01:28:12 Pod integration-e2e-periodic-ipi-deprovision-deprovision succeeded after 1h5m41s
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1333804827905363968 2020/12/01 19:44:10 Pod e2e-aws-proxy-ipi-deprovision-deprovision succeeded after 1h4m2s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/che-incubator_chectl/992/pull-ci-che-incubator-chectl-master-v5-chectl-e2e-operator-installer/1333897643826352128 2020/12/02 01:02:07 Pod chectl-e2e-operator-installer-ipi-deprovision-deprovision succeeded after 1h41m5s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/cri-o_cri-o/4377/pull-ci-cri-o-cri-o-master-e2e-agnostic/1333901999330037760 2020/12/02 00:10:14 Pod e2e-agnostic-ipi-deprovision-deprovision succeeded after 1h8m0s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/cri-o_cri-o/4394/pull-ci-cri-o-cri-o-master-e2e-agnostic/1333866701950816256 2020/12/01 23:03:06 Pod e2e-agnostic-ipi-deprovision-deprovision succeeded after 1h15m3s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift-psap_special-resource-operator/41/pull-ci-openshift-psap-special-resource-operator-master-e2e-aws/1333903530364243968 2020/12/02 00:16:52 Pod e2e-aws-ipi-deprovision-deprovision succeeded after 1h12m8s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-logging-operator/821/pull-ci-openshift-cluster-logging-operator-master-functional/1333919264872075264 2020/12/02 01:00:51 Pod functional-ipi-deprovision-deprovision succeeded after 1h3m35s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/852/pull-ci-openshift-cluster-network-operator-master-e2e-ovn-hybrid-step-registry/1333766191600111616 2020/12/01 16:15:42 Pod e2e-ovn-hybrid-step-registry-ipi-deprovision-deprovision succeeded after 1h4m35s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/886/pull-ci-openshift-cluster-network-operator-master-e2e-aws-ovn-windows-custom-vxlan/1333813238646706176 2020/12/01 18:01:39 Pod e2e-aws-ovn-windows-custom-vxlan-ipi-deprovision-deprovision succeeded after 1h8m32s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/891/pull-ci-openshift-cluster-network-operator-master-e2e-aws-ovn-windows-custom-vxlan/1333907617315033088 2020/12/02 01:32:28 Pod e2e-aws-ovn-windows-custom-vxlan-ipi-deprovision-deprovision succeeded after 1h23m3s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_console-operator/491/pull-ci-openshift-console-operator-master-e2e-aws-operator/1333879999068901376 2020/12/02 00:11:37 Pod e2e-aws-operator-ipi-deprovision-deprovision succeeded after 1h7m16s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/4429/pull-ci-openshift-installer-master-e2e-aws-workers-rhel7/1333912709506273280 2020/12/02 01:34:18 Pod e2e-aws-workers-rhel7-ipi-deprovision-deprovision succeeded after 1h12m35s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/4440/pull-ci-openshift-installer-master-e2e-aws-shared-vpc/1333864843601514496 2020/12/01 23:32:40 Pod e2e-aws-shared-vpc-ipi-deprovision-deprovision succeeded after 1h2m49s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes-autoscaler/180/pull-ci-openshift-kubernetes-autoscaler-master-e2e-aws-operator/1333874595857436672 2020/12/01 23:02:01 Pod e2e-aws-operator-ipi-deprovision-deprovision succeeded after 1h19m11s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/769/pull-ci-openshift-machine-api-operator-master-e2e-aws/1333776539803717632 2020/12/01 16:58:56 Pod e2e-aws-ipi-deprovision-deprovision succeeded after 1h15m39s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2210/pull-ci-openshift-machine-config-operator-master-e2e-aws-serial/1333867792788623360 2020/12/02 00:09:53 Pod e2e-aws-serial-ipi-deprovision-deprovision succeeded after 1h19m58s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2210/pull-ci-openshift-machine-config-operator-master-okd-e2e-aws/1333839577021943808 2020/12/01 21:59:09 Pod e2e-aws-ipi-deprovision-deprovision succeeded after 1h2m1s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2272/pull-ci-openshift-machine-config-operator-master-e2e-aws-serial/1333899974395564032 2020/12/02 01:07:48 Pod e2e-aws-serial-ipi-deprovision-deprovision succeeded after 1h5m13s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_oc/657/pull-ci-openshift-oc-master-e2e-aws/1333850678639988736 2020/12/01 21:43:41 Pod e2e-aws-ipi-deprovision-deprovision succeeded after 1h6m6s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_oc/657/pull-ci-openshift-oc-master-e2e-aws/1333910781766406144 2020/12/02 00:49:46 Pod e2e-aws-ipi-deprovision-deprovision succeeded after 1h17m17s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_oc/660/pull-ci-openshift-oc-master-e2e-aws-serial/1333891301321478144 2020/12/01 23:33:01 Pod e2e-aws-serial-ipi-deprovision-deprovision succeeded after 1h3m49s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_oc/660/pull-ci-openshift-oc-master-e2e-aws-serial/1333919666019504128 2020/12/02 01:01:44 Pod e2e-aws-serial-ipi-deprovision-deprovision succeeded after 1h2m25s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/365/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-windows-custom-vxlan/1333781276464779264 2020/12/01 17:09:03 Pod e2e-aws-ovn-windows-custom-vxlan-ipi-deprovision-deprovision succeeded after 1h31m37s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/366/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-windows-custom-vxlan/1333877629530411008 2020/12/02 00:02:31 Pod e2e-aws-ovn-windows-custom-vxlan-ipi-deprovision-deprovision succeeded after 1h7m45s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_prometheus/67/pull-ci-openshift-prometheus-master-e2e-aws/1333873401026056192 2020/12/02 00:45:30 Pod e2e-aws-ipi-deprovision-deprovision succeeded after 1h23m0s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/13998/rehearse-13998-pull-ci-openshift-windows-machine-config-operator-release-4.7-aws-e2e-operator/1333906001283256320 2020/12/02 01:22:07 Pod aws-e2e-operator-ipi-deprovision-deprovision succeeded after 1h1m36s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_sriov-network-operator/416/pull-ci-openshift-sriov-network-operator-master-e2e-aws/1333878269035941888 2020/12/02 00:29:39 Pod e2e-aws-ipi-deprovision-deprovision succeeded after 1h51m18s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_sriov-network-operator/425/pull-ci-openshift-sriov-network-operator-master-e2e-aws/1333762244420308992 2020/12/01 16:07:41 Pod e2e-aws-ipi-deprovision-deprovision succeeded after 1h22m15s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_image-registry/257/pull-ci-openshift-image-registry-master-e2e-aws/1333873914832490496 2020/12/02 01:06:12 Pod e2e-aws-ipi-deprovision-deprovision succeeded after 2h15m28s

Checking that slowest job:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_image-registry/257/pull-ci-openshift-image-registry-master-e2e-aws/1333873914832490496/artifacts/e2e-aws/ipi-deprovision-deprovision/.openshift_install.log 2>/dev/null | head -n1
time="2020-12-01T22:51:17Z" level=debug msg="OpenShift Installer unreleased-master-3988-g47b4c1a046ee43bc500ff19c98a231a754cc2a70-dirty"
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_image-registry/257/pull-ci-openshift-image-registry-master-e2e-aws/1333873914832490496/artifacts/e2e-aws/ipi-deprovision-deprovision/.openshift_install.log | tail -n1
time="2020-12-02T01:06:11Z" level=info msg="Time elapsed: 2h14m54s"

So the delay is legitimate. But it's due to a hiccup in IAM-user listing (AWS-side bug? Throttling?):

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_image-registry/257/pull-ci-openshift-image-registry-master-e2e-aws/1333873914832490496/artifacts/e2e-aws/ipi-deprovision-deprovision/.openshift_install.log | grep -A1 'search for IAM users'
time="2020-12-01T22:53:53Z" level=debug msg="search for IAM users"
time="2020-12-02T00:08:08Z" level=debug msg="search for IAM instance profiles"
--
time="2020-12-02T00:38:54Z" level=debug msg="search for IAM users"
time="2020-12-02T01:04:14Z" level=debug msg="search for IAM instance profiles"
--
time="2020-12-02T01:05:27Z" level=debug msg="search for IAM users"
time="2020-12-02T01:05:42Z" level=debug msg="search for IAM instance profiles"
--
time="2020-12-02T01:05:45Z" level=debug msg="search for IAM users"
time="2020-12-02T01:05:48Z" level=debug msg="search for IAM instance profiles"

Anyhow, letting a few of those slip through the cracks isn't a problem, because we will still have the first 25 minutes of deprovision logs in the job itself, and any remaining resources will eventually be reaped by [1] and other backstops. Picking a number that's less than 30m gives us a chance to wrap things up within the termination grace period granted to Prow jobs.

[1]: https://github.com/openshift/ci-tools/blob/9dd6f87a5cd070f9e81af7a6cb181bb466f47933/cmd/ipi-deprovision/ipi-deprovision.sh
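Under the same assumptions as the sketch in the first commit above, the deprovision ref would look roughly like this (the field name comes from the PR title; the source image, script name, and resource values are illustrative):

# ipi-deprovision-deprovision-ref.yaml (illustrative sketch)
ref:
  as: ipi-deprovision-deprovision
  from: installer                  # assumed: teardown runs openshift-install destroy
  commands: ipi-deprovision-deprovision-commands.sh
  active_deadline_seconds: 1500    # 25m = 1500s, deliberately under the 30m
                                   # termination grace period granted to Prow jobs
  resources:
    requests:
      cpu: 100m     # illustrative values
      memory: 100Mi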