From 3c915e2f0c99d672c53450a5d99f5cbc71cd8e62 Mon Sep 17 00:00:00 2001
From: "W. Trevor King"
Date: Fri, 9 Oct 2020 16:22:04 -0700
Subject: [PATCH] ci-operator/step-registry/openshift/e2e/test: Add 2h
 active_deadline_seconds

Looking at [1]:

  $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/785/pull-ci-openshift-cluster-network-operator-master-e2e-upgrade/1308837080670932992/build-log.txt | grep 'Executing\|Pod .* succeeded after\|Process did not'
  2020/09/23 18:38:46 Executing "e2e-upgrade-ipi-install-monitoringpvc"
  2020/09/23 18:38:52 Pod e2e-upgrade-ipi-install-monitoringpvc succeeded after 4s
  2020/09/23 18:38:52 Executing "e2e-upgrade-ipi-install-loki"
  2020/09/23 18:38:58 Pod e2e-upgrade-ipi-install-loki succeeded after 4s
  2020/09/23 18:38:58 Executing "e2e-upgrade-ipi-conf"
  2020/09/23 18:39:25 Pod e2e-upgrade-ipi-conf succeeded after 26s
  2020/09/23 18:39:25 Executing "e2e-upgrade-ipi-conf-gcp"
  2020/09/23 18:39:31 Pod e2e-upgrade-ipi-conf-gcp succeeded after 4s
  2020/09/23 18:39:31 Executing "e2e-upgrade-ipi-install-rbac"
  2020/09/23 18:39:36 Pod e2e-upgrade-ipi-install-rbac succeeded after 4s
  2020/09/23 18:39:36 Executing "e2e-upgrade-ipi-install-install"
  2020/09/23 19:18:00 Pod e2e-upgrade-ipi-install-install succeeded after 38m23s
  2020/09/23 19:18:01 Executing "e2e-upgrade-openshift-e2e-test"
  {"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2020-09-23T22:34:21Z"}
  2020/09/23 22:34:22 Executing "e2e-upgrade-gather-loki"
  {"component":"entrypoint","file":"prow/entrypoint/run.go:250","func":"k8s.io/test-infra/prow/entrypoint.gracefullyTerminate","level":"error","msg":"Process did not exit before 30m0s grace period","severity":"error","time":"2020-09-23T23:04:21Z"}

That job was moving along fine, but hung in the openshift-e2e-test step.
After around 3h in that step, it hit the cumulative 4h timeout for the
ci-operator run, which terminated the openshift-e2e-test step and
entered a 30m teardown grace period. The first step of that grace
period was the Loki gather, but, presumably because the cluster was
dead, the Loki step hung and absorbed the entire grace period. That
left no time for the emergency gather-extra collection, cluster
teardown, or artifact uploads. This commit adds a 2h timeout to
openshift-e2e-test and a 10m timeout to gather-loki to limit the
damage from similar hangs in the future, so we can collect artifacts
that help explain why the cluster hung in the openshift-e2e-test step.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/785/pull-ci-openshift-cluster-network-operator-master-e2e-upgrade/1308837080670932992
---
 ci-operator/step-registry/gather/loki/gather-loki-ref.yaml       | 1 +
 .../step-registry/openshift/e2e/test/openshift-e2e-test-ref.yaml | 1 +
 2 files changed, 2 insertions(+)

diff --git a/ci-operator/step-registry/gather/loki/gather-loki-ref.yaml b/ci-operator/step-registry/gather/loki/gather-loki-ref.yaml
index 527ee2e8ce0ed..0be96779da811 100644
--- a/ci-operator/step-registry/gather/loki/gather-loki-ref.yaml
+++ b/ci-operator/step-registry/gather/loki/gather-loki-ref.yaml
@@ -5,6 +5,7 @@ ref:
     name: cli-jq
     tag: latest
   commands: gather-loki-commands.sh
+  active_deadline_seconds: 600
   resources:
     requests:
       cpu: 300m
diff --git a/ci-operator/step-registry/openshift/e2e/test/openshift-e2e-test-ref.yaml b/ci-operator/step-registry/openshift/e2e/test/openshift-e2e-test-ref.yaml
index 0e511454c1c02..4859357891e25 100644
--- a/ci-operator/step-registry/openshift/e2e/test/openshift-e2e-test-ref.yaml
+++ b/ci-operator/step-registry/openshift/e2e/test/openshift-e2e-test-ref.yaml
@@ -2,6 +2,7 @@ ref:
   as: openshift-e2e-test
   from: tests
   commands: openshift-e2e-test-commands.sh
+  active_deadline_seconds: 7200
   env:
   - name: TEST_COMMAND
     default: run
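
For context, `active_deadline_seconds` takes its name from the Kubernetes
pod-spec field `activeDeadlineSeconds`, which bounds how long a pod may
stay active before the kubelet kills it. A minimal sketch of a step pod
with the new deadline applied (the exact plumbing from the step-registry
ref into the pod spec is ci-operator internals and is assumed here, not
shown by this patch; the pod name and container layout are illustrative):

```yaml
# Hypothetical pod for the openshift-e2e-test step after this patch.
# activeDeadlineSeconds is the standard Kubernetes field; the mapping
# from the ref's active_deadline_seconds to it is an assumption.
apiVersion: v1
kind: Pod
metadata:
  name: e2e-upgrade-openshift-e2e-test
spec:
  activeDeadlineSeconds: 7200  # 2h: pod is terminated once this elapses
  restartPolicy: Never
  containers:
  - name: test
    image: tests                        # the "from: tests" image in the ref
    command: ["/bin/bash", "-c", "openshift-e2e-test-commands.sh"]
```

With the step bounded at 2h and gather-loki at 10m, a hang like the one
above should leave well over an hour of the 4h budget for gather-extra,
teardown, and artifact uploads.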