From 3c915e2f0c99d672c53450a5d99f5cbc71cd8e62 Mon Sep 17 00:00:00 2001
From: "W. Trevor King"
Date: Fri, 9 Oct 2020 16:22:04 -0700
Subject: [PATCH] ci-operator/step-registry/openshift/e2e/test: Add 2h
 active_deadline_seconds

Looking at [1]:

  $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/785/pull-ci-openshift-cluster-network-operator-master-e2e-upgrade/1308837080670932992/build-log.txt | grep 'Executing\|Pod .* succeeded after\|Process did not'
  2020/09/23 18:38:46 Executing "e2e-upgrade-ipi-install-monitoringpvc"
  2020/09/23 18:38:52 Pod e2e-upgrade-ipi-install-monitoringpvc succeeded after 4s
  2020/09/23 18:38:52 Executing "e2e-upgrade-ipi-install-loki"
  2020/09/23 18:38:58 Pod e2e-upgrade-ipi-install-loki succeeded after 4s
  2020/09/23 18:38:58 Executing "e2e-upgrade-ipi-conf"
  2020/09/23 18:39:25 Pod e2e-upgrade-ipi-conf succeeded after 26s
  2020/09/23 18:39:25 Executing "e2e-upgrade-ipi-conf-gcp"
  2020/09/23 18:39:31 Pod e2e-upgrade-ipi-conf-gcp succeeded after 4s
  2020/09/23 18:39:31 Executing "e2e-upgrade-ipi-install-rbac"
  2020/09/23 18:39:36 Pod e2e-upgrade-ipi-install-rbac succeeded after 4s
  2020/09/23 18:39:36 Executing "e2e-upgrade-ipi-install-install"
  2020/09/23 19:18:00 Pod e2e-upgrade-ipi-install-install succeeded after 38m23s
  2020/09/23 19:18:01 Executing "e2e-upgrade-openshift-e2e-test"
  {"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2020-09-23T22:34:21Z"}
  2020/09/23 22:34:22 Executing "e2e-upgrade-gather-loki"
  {"component":"entrypoint","file":"prow/entrypoint/run.go:250","func":"k8s.io/test-infra/prow/entrypoint.gracefullyTerminate","level":"error","msg":"Process did not exit before 30m0s grace period","severity":"error","time":"2020-09-23T23:04:21Z"}

That job was moving along fine, but hung in the openshift-e2e-test step.
After around 3h in that step, it hit the cumulative 4h timeout for the
ci-operator run, which terminated the openshift-e2e-test step and
entered a 30m teardown grace period. The first step of that grace
period was the Loki gather, but, presumably because the cluster was
dead, the Loki step hung and absorbed the entire grace period. That
left no time for the emergency gather-extra collection, cluster
teardown, or artifact uploads. This commit adds a 2h timeout to
openshift-e2e-test and a 10m timeout to gather-loki to limit the
damage from similar hangs in the future, so we can collect artifacts
that help explain why the cluster hung in the openshift-e2e-test step.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/785/pull-ci-openshift-cluster-network-operator-master-e2e-upgrade/1308837080670932992
---
 ci-operator/step-registry/gather/loki/gather-loki-ref.yaml       | 1 +
 .../step-registry/openshift/e2e/test/openshift-e2e-test-ref.yaml | 1 +
 2 files changed, 2 insertions(+)

diff --git a/ci-operator/step-registry/gather/loki/gather-loki-ref.yaml b/ci-operator/step-registry/gather/loki/gather-loki-ref.yaml
index 527ee2e8ce0ed..0be96779da811 100644
--- a/ci-operator/step-registry/gather/loki/gather-loki-ref.yaml
+++ b/ci-operator/step-registry/gather/loki/gather-loki-ref.yaml
@@ -5,6 +5,7 @@ ref:
     name: cli-jq
     tag: latest
   commands: gather-loki-commands.sh
+  active_deadline_seconds: 600
   resources:
     requests:
       cpu: 300m
diff --git a/ci-operator/step-registry/openshift/e2e/test/openshift-e2e-test-ref.yaml b/ci-operator/step-registry/openshift/e2e/test/openshift-e2e-test-ref.yaml
index 0e511454c1c02..4859357891e25 100644
--- a/ci-operator/step-registry/openshift/e2e/test/openshift-e2e-test-ref.yaml
+++ b/ci-operator/step-registry/openshift/e2e/test/openshift-e2e-test-ref.yaml
@@ -2,6 +2,7 @@ ref:
   as: openshift-e2e-test
   from: tests
   commands: openshift-e2e-test-commands.sh
+  active_deadline_seconds: 7200
   env:
   - name: TEST_COMMAND
     default: run
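
For context, `active_deadline_seconds` takes its name from the Kubernetes
pod-spec field `activeDeadlineSeconds`, which bounds how long a pod may
stay active before the kubelet kills it. A minimal sketch of a step pod
with the new deadline applied (the exact plumbing from the step-registry
ref into the pod spec is ci-operator internals and is assumed here, not
shown by this patch; the pod name and container layout are illustrative):

```yaml
# Hypothetical pod for the openshift-e2e-test step after this patch.
# activeDeadlineSeconds is the standard Kubernetes field; the mapping
# from the ref's active_deadline_seconds to it is an assumption.
apiVersion: v1
kind: Pod
metadata:
  name: e2e-upgrade-openshift-e2e-test
spec:
  activeDeadlineSeconds: 7200  # 2h: pod is terminated once this elapses
  restartPolicy: Never
  containers:
  - name: test
    image: tests                        # the "from: tests" image in the ref
    command: ["/bin/bash", "-c", "openshift-e2e-test-commands.sh"]
```

With the step bounded at 2h and gather-loki at 10m, a hang like the one
above should leave well over an hour of the 4h budget for gather-extra,
teardown, and artifact uploads.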