ci-operator/step-registry/openshift/e2e/test: Add 2h active_deadline_seconds #12647
Conversation
Looking at [1]:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/785/pull-ci-openshift-cluster-network-operator-master-e2e-upgrade/1308837080670932992/build-log.txt | grep 'Executing\|Pod .* succeeded after\|Process did not'
2020/09/23 18:38:46 Executing "e2e-upgrade-ipi-install-monitoringpvc"
2020/09/23 18:38:52 Pod e2e-upgrade-ipi-install-monitoringpvc succeeded after 4s
2020/09/23 18:38:52 Executing "e2e-upgrade-ipi-install-loki"
2020/09/23 18:38:58 Pod e2e-upgrade-ipi-install-loki succeeded after 4s
2020/09/23 18:38:58 Executing "e2e-upgrade-ipi-conf"
2020/09/23 18:39:25 Pod e2e-upgrade-ipi-conf succeeded after 26s
2020/09/23 18:39:25 Executing "e2e-upgrade-ipi-conf-gcp"
2020/09/23 18:39:31 Pod e2e-upgrade-ipi-conf-gcp succeeded after 4s
2020/09/23 18:39:31 Executing "e2e-upgrade-ipi-install-rbac"
2020/09/23 18:39:36 Pod e2e-upgrade-ipi-install-rbac succeeded after 4s
2020/09/23 18:39:36 Executing "e2e-upgrade-ipi-install-install"
2020/09/23 19:18:00 Pod e2e-upgrade-ipi-install-install succeeded after 38m23s
2020/09/23 19:18:01 Executing "e2e-upgrade-openshift-e2e-test"
{"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2020-09-23T22:34:21Z"}
2020/09/23 22:34:22 Executing "e2e-upgrade-gather-loki"
{"component":"entrypoint","file":"prow/entrypoint/run.go:250","func":"k8s.io/test-infra/prow/entrypoint.gracefullyTerminate","level":"error","msg":"Process did not exit before 30m0s grace period","severity":"error","time":"2020-09-23T23:04:21Z"}

That job was moving along fine, but hung up in the openshift-e2e-test step. After around 3h in the step, it hit the cumulative 4h timeout for the ci-operator run, terminated the openshift-e2e-test step, and entered a 30m teardown grace period. The first step of that grace period was the Loki gather, but, presumably because the cluster was dead, the Loki step hung and absorbed the entire grace period. This left no time for the emergency gather-extra collection, cluster teardown, or artifact uploads.

This commit adds a 2h timeout to openshift-e2e-test and a 10m timeout to gather-loki to limit the damage from similar hangs in the future, allowing us to get artifacts that will help us understand why the cluster hung in the openshift-e2e-test step.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/785/pull-ci-openshift-cluster-network-operator-master-e2e-upgrade/1308837080670932992
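For reference, a minimal sketch of what the ref change looks like, assuming the conventional step-registry layout (the active_deadline_seconds field name comes from the PR title; the file name, source image, and resource values here are illustrative, not the verbatim diff):

# openshift-e2e-test-ref.yaml (illustrative sketch)
ref:
  as: openshift-e2e-test
  from: tests                               # assumed source image
  commands: openshift-e2e-test-commands.sh  # assumed per-convention script name
  active_deadline_seconds: 7200             # 2h = 7200s: fail the step pod well
                                            # before the cumulative 4h job timeout
  resources:
    requests:
      cpu: "3"        # illustrative values
      memory: 600Mi

The gather-loki ref would similarly get active_deadline_seconds: 600 (10m = 600s), so a dead cluster can no longer let the Loki gather eat the whole grace period.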
Using the 'upgrade-all' precedent from cfcd60f (release: Standardize all ci-chat-bot jobs, 2020-04-27, openshift#8594). I'm not clear on why we are joining with a newline instead of '&&'; presumably this is getting wrapped in a 'set -e' or equivalent. But I'm sticking with newline to match precedent.

This increases the risk that we time out these slow jobs (e.g. [1] took 3h42m), but we really want to exercise tests like openshift/origin@9f7fe0089d (Add test for scaling machineSets, 2019-04-11, openshift/origin#22564), which is in openshift/conformance/serial, because machines launch with the born-in boot images until we get [2].

And in fact, the reason we didn't have this post-update suite in 4.6 was 3bc9d8e (stop running e2e tests after three upgrades because we hit timeouts and lose upgrade signal, 2020-10-05, openshift#12436). But since 3c915e2 (ci-operator/step-registry/openshift/e2e/test: Add 2h active_deadline_seconds, 2020-10-09, openshift#12647), we no longer have to worry about getting logs when that step is slow. So we might not pass if we're slow, but we'll still get logs to debug why we're slow.

Only for 4.6 and later, because 4.5 is live, and if we had problems there we'd probably have already heard about them from customers.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.3-to-4.4-to-4.5-to-4.6-ci/1318709056830967808
[2]: openshift/enhancements#201
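To make the 'set -e' presumption concrete: under set -e, newline-joined commands short-circuit on failure just like an '&&' chain would. A shell sketch (the suite name is the one cited above; the wrapper itself is the presumption being described, not something this commit confirms):

# Sketch: why a newline join can be equivalent to '&&' under 'set -e'.
set -e
openshift-tests run openshift/conformance/serial
# With 'set -e', reaching this line means the suite above exited 0 -- the same
# guarantee '&&' gives. Without 'set -e', the newline join would keep running
# past a failure, which is where the two joins actually differ.
echo "serial conformance passed; continuing"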
Like 3c915e2 (ci-operator/step-registry/openshift/e2e/test: Add 2h active_deadline_seconds, 2020-10-09, openshift#12647), but for the gather-core-dump step. This helps preserve time to run further gathers and tear down the cluster under test if gather-core-dump gets hung up, like it did for over 2h here [1]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-update-keys/25/pull-ci-openshift-cluster-update-keys-master-e2e-aws/1326950727196610560/build-log.txt | grep -A2 'Executing pod.*gather-core-dump'
2020/11/12 20:02:15 Executing pod "e2e-aws-gather-core-dump"
2020/11/12 20:02:42 Container cp-secret-wrapper in pod e2e-aws-gather-core-dump completed successfully
{"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2020-11-12T22:11:29Z"}

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-update-keys/25/pull-ci-openshift-cluster-update-keys-master-e2e-aws/1326950727196610560
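Mechanically, a step-level active_deadline_seconds presumably lands on the generated test pod as Kubernetes' spec.activeDeadlineSeconds, which the kubelet enforces even when the container is wedged talking to a dead cluster. A sketch of the resulting pod shape (the deadline value, image, and command below are hypothetical; the commit's actual number for gather-core-dump isn't quoted above):

apiVersion: v1
kind: Pod
metadata:
  name: e2e-aws-gather-core-dump
spec:
  activeDeadlineSeconds: 600   # hypothetical 10m; once exceeded, Kubernetes marks the
                               # pod Failed (DeadlineExceeded) regardless of what the
                               # gather process is stuck on
  restartPolicy: Never
  containers:
  - name: test
    image: image-registry.example/ocp/tests:latest                   # placeholder image
    command: ["/bin/bash", "-c", "/tmp/gather-core-dump-commands.sh"] # assumed convention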
Like 3c915e2 (ci-operator/step-registry/openshift/e2e/test: Add 2h active_deadline_seconds, 2020-10-09, openshift#12647), but for the deprovision step. This helps preserve time to gather assets if ipi-deprovision gets hung up, like it did for over 2h here [1]:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-api-actuator-pkg/190/pull-ci-openshift-cluster-api-actuator-pkg-master-e2e-aws-operator/1329702034437312512/build-log.txt | grep -A2 'Executing pod.*ipi-deprovision'
2020/11/20 09:54:54 Executing pod "e2e-aws-operator-ipi-deprovision-deprovision"
2020/11/20 09:54:57 Container cp-secret-wrapper in pod e2e-aws-operator-ipi-deprovision-deprovision completed successfully
{"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2020-11-20T12:24:09Z"}

25 minutes should be enough for most teardowns. We have a number of successful runs which take longer:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&type=build-log&context=0&search=Pod+.*ipi-deprovision-deprovision+succeeded+after' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's/.*after \([0-9]*\)\([smh]\).*/\2 \1/' | sort -n | uniq -c
     29 h 1
      1 h 2
     16 m 1
     21 m 10
     36 m 11
     33 m 12
     47 m 13
     40 m 14
     23 m 15
     21 m 16
     22 m 17
     20 m 18
     17 m 19
     18 m 2
     11 m 20
     13 m 21
     19 m 22
     22 m 23
      8 m 24
     17 m 25
     12 m 26
      8 m 27
     14 m 28
      9 m 29
     16 m 3
     15 m 30
     17 m 31
     ...

(Each output line is the uniq -c occurrence count, the duration's unit character, and the duration's leading digits; the sort is lexical, so 'm 1' and 'm 19' land before 'm 2'. The two 'h' rows say 29 runs took 1h-something and one took 2h-something.)

Success after over an hour is crazy. Which jobs were those?

$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&type=build-log&context=0&search=Pod+.*ipi-deprovision-deprovision+succeeded+after+[0-9]h' | jq -r 'to_entries[] | .key as $uri | .value | to_entries[].value[].context[] | $uri + " " + .'
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-kube-reporting-metering-operator-release-4.5-metering-periodic-aws/1333923915013033984 2020/12/02 01:27:54 Pod metering-periodic-aws-ipi-deprovision-deprovision succeeded after 1h12m0s
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-odo-master-v4.6-integration-e2e-periodic/1333923958377943040 2020/12/02 01:28:12 Pod integration-e2e-periodic-ipi-deprovision-deprovision succeeded after 1h5m41s
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1333804827905363968 2020/12/01 19:44:10 Pod e2e-aws-proxy-ipi-deprovision-deprovision succeeded after 1h4m2s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/che-incubator_chectl/992/pull-ci-che-incubator-chectl-master-v5-chectl-e2e-operator-installer/1333897643826352128 2020/12/02 01:02:07 Pod chectl-e2e-operator-installer-ipi-deprovision-deprovision succeeded after 1h41m5s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/cri-o_cri-o/4377/pull-ci-cri-o-cri-o-master-e2e-agnostic/1333901999330037760 2020/12/02 00:10:14 Pod e2e-agnostic-ipi-deprovision-deprovision succeeded after 1h8m0s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/cri-o_cri-o/4394/pull-ci-cri-o-cri-o-master-e2e-agnostic/1333866701950816256 2020/12/01 23:03:06 Pod e2e-agnostic-ipi-deprovision-deprovision succeeded after 1h15m3s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift-psap_special-resource-operator/41/pull-ci-openshift-psap-special-resource-operator-master-e2e-aws/1333903530364243968 2020/12/02 00:16:52 Pod e2e-aws-ipi-deprovision-deprovision succeeded after 1h12m8s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-logging-operator/821/pull-ci-openshift-cluster-logging-operator-master-functional/1333919264872075264 2020/12/02 01:00:51 Pod functional-ipi-deprovision-deprovision succeeded after 1h3m35s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/852/pull-ci-openshift-cluster-network-operator-master-e2e-ovn-hybrid-step-registry/1333766191600111616 2020/12/01 16:15:42 Pod e2e-ovn-hybrid-step-registry-ipi-deprovision-deprovision succeeded after 1h4m35s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/886/pull-ci-openshift-cluster-network-operator-master-e2e-aws-ovn-windows-custom-vxlan/1333813238646706176 2020/12/01 18:01:39 Pod e2e-aws-ovn-windows-custom-vxlan-ipi-deprovision-deprovision succeeded after 1h8m32s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/891/pull-ci-openshift-cluster-network-operator-master-e2e-aws-ovn-windows-custom-vxlan/1333907617315033088 2020/12/02 01:32:28 Pod e2e-aws-ovn-windows-custom-vxlan-ipi-deprovision-deprovision succeeded after 1h23m3s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_console-operator/491/pull-ci-openshift-console-operator-master-e2e-aws-operator/1333879999068901376 2020/12/02 00:11:37 Pod e2e-aws-operator-ipi-deprovision-deprovision succeeded after 1h7m16s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/4429/pull-ci-openshift-installer-master-e2e-aws-workers-rhel7/1333912709506273280 2020/12/02 01:34:18 Pod e2e-aws-workers-rhel7-ipi-deprovision-deprovision succeeded after 1h12m35s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/4440/pull-ci-openshift-installer-master-e2e-aws-shared-vpc/1333864843601514496 2020/12/01 23:32:40 Pod e2e-aws-shared-vpc-ipi-deprovision-deprovision succeeded after 1h2m49s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes-autoscaler/180/pull-ci-openshift-kubernetes-autoscaler-master-e2e-aws-operator/1333874595857436672 2020/12/01 23:02:01 Pod e2e-aws-operator-ipi-deprovision-deprovision succeeded after 1h19m11s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/769/pull-ci-openshift-machine-api-operator-master-e2e-aws/1333776539803717632 2020/12/01 16:58:56 Pod e2e-aws-ipi-deprovision-deprovision succeeded after 1h15m39s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2210/pull-ci-openshift-machine-config-operator-master-e2e-aws-serial/1333867792788623360 2020/12/02 00:09:53 Pod e2e-aws-serial-ipi-deprovision-deprovision succeeded after 1h19m58s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2210/pull-ci-openshift-machine-config-operator-master-okd-e2e-aws/1333839577021943808 2020/12/01 21:59:09 Pod e2e-aws-ipi-deprovision-deprovision succeeded after 1h2m1s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2272/pull-ci-openshift-machine-config-operator-master-e2e-aws-serial/1333899974395564032 2020/12/02 01:07:48 Pod e2e-aws-serial-ipi-deprovision-deprovision succeeded after 1h5m13s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_oc/657/pull-ci-openshift-oc-master-e2e-aws/1333850678639988736 2020/12/01 21:43:41 Pod e2e-aws-ipi-deprovision-deprovision succeeded after 1h6m6s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_oc/657/pull-ci-openshift-oc-master-e2e-aws/1333910781766406144 2020/12/02 00:49:46 Pod e2e-aws-ipi-deprovision-deprovision succeeded after 1h17m17s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_oc/660/pull-ci-openshift-oc-master-e2e-aws-serial/1333891301321478144 2020/12/01 23:33:01 Pod e2e-aws-serial-ipi-deprovision-deprovision succeeded after 1h3m49s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_oc/660/pull-ci-openshift-oc-master-e2e-aws-serial/1333919666019504128 2020/12/02 01:01:44 Pod e2e-aws-serial-ipi-deprovision-deprovision succeeded after 1h2m25s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/365/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-windows-custom-vxlan/1333781276464779264 2020/12/01 17:09:03 Pod e2e-aws-ovn-windows-custom-vxlan-ipi-deprovision-deprovision succeeded after 1h31m37s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/366/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-windows-custom-vxlan/1333877629530411008 2020/12/02 00:02:31 Pod e2e-aws-ovn-windows-custom-vxlan-ipi-deprovision-deprovision succeeded after 1h7m45s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_prometheus/67/pull-ci-openshift-prometheus-master-e2e-aws/1333873401026056192 2020/12/02 00:45:30 Pod e2e-aws-ipi-deprovision-deprovision succeeded after 1h23m0s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/13998/rehearse-13998-pull-ci-openshift-windows-machine-config-operator-release-4.7-aws-e2e-operator/1333906001283256320 2020/12/02 01:22:07 Pod aws-e2e-operator-ipi-deprovision-deprovision succeeded after 1h1m36s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_sriov-network-operator/416/pull-ci-openshift-sriov-network-operator-master-e2e-aws/1333878269035941888 2020/12/02 00:29:39 Pod e2e-aws-ipi-deprovision-deprovision succeeded after 1h51m18s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_sriov-network-operator/425/pull-ci-openshift-sriov-network-operator-master-e2e-aws/1333762244420308992 2020/12/01 16:07:41 Pod e2e-aws-ipi-deprovision-deprovision succeeded after 1h22m15s
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_image-registry/257/pull-ci-openshift-image-registry-master-e2e-aws/1333873914832490496 2020/12/02 01:06:12 Pod e2e-aws-ipi-deprovision-deprovision succeeded after 2h15m28s

Checking that slowest job:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_image-registry/257/pull-ci-openshift-image-registry-master-e2e-aws/1333873914832490496/artifacts/e2e-aws/ipi-deprovision-deprovision/.openshift_install.log 2>/dev/null | head -n1
time="2020-12-01T22:51:17Z" level=debug msg="OpenShift Installer unreleased-master-3988-g47b4c1a046ee43bc500ff19c98a231a754cc2a70-dirty"
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_image-registry/257/pull-ci-openshift-image-registry-master-e2e-aws/1333873914832490496/artifacts/e2e-aws/ipi-deprovision-deprovision/.openshift_install.log | tail -n1
time="2020-12-02T01:06:11Z" level=info msg="Time elapsed: 2h14m54s"

So the delay is legitimate. But it's due to a hiccup in IAM-user listing (AWS-side bug? Throttling?):

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_image-registry/257/pull-ci-openshift-image-registry-master-e2e-aws/1333873914832490496/artifacts/e2e-aws/ipi-deprovision-deprovision/.openshift_install.log | grep -A1 'search for IAM users'
time="2020-12-01T22:53:53Z" level=debug msg="search for IAM users"
time="2020-12-02T00:08:08Z" level=debug msg="search for IAM instance profiles"
--
time="2020-12-02T00:38:54Z" level=debug msg="search for IAM users"
time="2020-12-02T01:04:14Z" level=debug msg="search for IAM instance profiles"
--
time="2020-12-02T01:05:27Z" level=debug msg="search for IAM users"
time="2020-12-02T01:05:42Z" level=debug msg="search for IAM instance profiles"
--
time="2020-12-02T01:05:45Z" level=debug msg="search for IAM users"
time="2020-12-02T01:05:48Z" level=debug msg="search for IAM instance profiles"

Anyhow, letting a few of those slip through the cracks isn't a problem, because we will still have the first 25 minutes of deprovision logs in the job itself, and any remaining resources will eventually be reaped by [1] and other backstops. Picking a number that's less than 30m gives us a chance to wrap things up within the termination grace period granted to Prow jobs.

[1]: https://github.com/openshift/ci-tools/blob/9dd6f87a5cd070f9e81af7a6cb181bb466f47933/cmd/ipi-deprovision/ipi-deprovision.sh
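Under the same assumptions as the sketch in the first commit above, the deprovision ref would look roughly like this (the field name comes from the PR title; the source image, script name, and resource values are illustrative):

# ipi-deprovision-deprovision-ref.yaml (illustrative sketch)
ref:
  as: ipi-deprovision-deprovision
  from: installer                  # assumed: teardown runs openshift-install destroy
  commands: ipi-deprovision-deprovision-commands.sh
  active_deadline_seconds: 1500    # 25m = 1500s, deliberately under the 30m
                                   # termination grace period granted to Prow jobs
  resources:
    requests:
      cpu: 100m     # illustrative values
      memory: 100Mi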