
ci-operator/step-registry/openshift/e2e/test: Add 2h active_deadline_seconds #12647

Merged
openshift-merge-robot merged 1 commit into openshift:master from wking:step-timeouts on Oct 9, 2020
Conversation

@wking (Member) commented Oct 9, 2020

Looking at this job:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/785/pull-ci-openshift-cluster-network-operator-master-e2e-upgrade/1308837080670932992/build-log.txt | grep 'Executing\|Pod .* succeeded after\|Process did not'
2020/09/23 18:38:46 Executing "e2e-upgrade-ipi-install-monitoringpvc"
2020/09/23 18:38:52 Pod e2e-upgrade-ipi-install-monitoringpvc succeeded after 4s
2020/09/23 18:38:52 Executing "e2e-upgrade-ipi-install-loki"
2020/09/23 18:38:58 Pod e2e-upgrade-ipi-install-loki succeeded after 4s
2020/09/23 18:38:58 Executing "e2e-upgrade-ipi-conf"
2020/09/23 18:39:25 Pod e2e-upgrade-ipi-conf succeeded after 26s
2020/09/23 18:39:25 Executing "e2e-upgrade-ipi-conf-gcp"
2020/09/23 18:39:31 Pod e2e-upgrade-ipi-conf-gcp succeeded after 4s
2020/09/23 18:39:31 Executing "e2e-upgrade-ipi-install-rbac"
2020/09/23 18:39:36 Pod e2e-upgrade-ipi-install-rbac succeeded after 4s
2020/09/23 18:39:36 Executing "e2e-upgrade-ipi-install-install"
2020/09/23 19:18:00 Pod e2e-upgrade-ipi-install-install succeeded after 38m23s
2020/09/23 19:18:01 Executing "e2e-upgrade-openshift-e2e-test"
{"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2020-09-23T22:34:21Z"}
2020/09/23 22:34:22 Executing "e2e-upgrade-gather-loki"
{"component":"entrypoint","file":"prow/entrypoint/run.go:250","func":"k8s.io/test-infra/prow/entrypoint.gracefullyTerminate","level":"error","msg":"Process did not exit before 30m0s grace period","severity":"error","time":"2020-09-23T23:04:21Z"}

That job was moving along fine, but hung up in the openshift-e2e-test step. After around 3h in the step, it hit the cumulative 4h timeout for the ci-operator run, terminated the openshift-e2e-test step, and entered a 30m teardown grace period. The first step of that grace period was the Loki gather, but, presumably because the cluster was dead, the Loki step hung and absorbed the entire grace period. This left no time for the emergency gather-extra collection, cluster teardown, or artifact uploads.

This commit adds a 2h timeout to openshift-e2e-test and a 10m timeout to gather-loki to limit the damage from similar hangs in the future, allowing us to gather artifacts that will help us understand why the cluster hung in the openshift-e2e-test step.
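Concretely, the timeout is a single field on each step's ref. A minimal sketch of what the changed openshift-e2e-test ref file might look like (active_deadline_seconds is the field this PR adds; the surrounding fields are illustrative, from memory of the step-registry ref schema, and may not match the real file exactly):

```yaml
# ci-operator/step-registry/openshift/e2e/test/openshift-e2e-test-ref.yaml (sketch)
ref:
  as: openshift-e2e-test
  from: tests
  commands: openshift-e2e-test-commands.sh
  resources:
    requests:
      cpu: "3"
      memory: 600Mi
  # New: kill the step pod after 2h (7200s) so a hang cannot absorb the
  # whole 4h ci-operator budget plus the 30m grace period.
  active_deadline_seconds: 7200
```

gather-loki would get the same treatment with `active_deadline_seconds: 600` (10m).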

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 9, 2020
@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 9, 2020
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: stevekuznetsov, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 1f0f65b into openshift:master Oct 9, 2020
@openshift-ci-robot (Contributor)

@wking: Updated the following 2 configmaps:

  • step-registry configmap in namespace ci at cluster api.ci using the following files:
    • key gather-loki-ref.yaml using file ci-operator/step-registry/gather/loki/gather-loki-ref.yaml
    • key openshift-e2e-test-ref.yaml using file ci-operator/step-registry/openshift/e2e/test/openshift-e2e-test-ref.yaml
  • step-registry configmap in namespace ci at cluster app.ci using the following files:
    • key gather-loki-ref.yaml using file ci-operator/step-registry/gather/loki/gather-loki-ref.yaml
    • key openshift-e2e-test-ref.yaml using file ci-operator/step-registry/openshift/e2e/test/openshift-e2e-test-ref.yaml

In response to this:


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking deleted the step-timeouts branch October 9, 2020 23:52
wking added a commit to wking/openshift-release that referenced this pull request Oct 22, 2020
Using the 'upgrade-all' precedent from cfcd60f (release:
Standardize all ci-chat-bot jobs, 2020-04-27, openshift#8594).  I'm not clear
on why we are joining with a newline instead of '&&'; presumably this
is getting wrapped in a 'set -e' or equivalent.  But I'm sticking with
newline to match precedent.

This increases the risk that we time out these slow jobs (e.g. [1]
took 3h42m), but we really want to exercise tests like
openshift/origin@9f7fe0089d (Add test for scaling machineSets,
2019-04-11, openshift/origin#22564), which is in
openshift/conformance/serial, because machines launch with the born-in
boot images until we get [2].

And in fact, the reason we didn't have this post-update suite in
4.6 was 3bc9d8e (stop running e2e tests after three
upgrades because we hit timeouts and lose upgrade signal, 2020-10-05, openshift#12436).
But since 3c915e2 (ci-operator/step-registry/openshift/e2e/test:
Add 2h active_deadline_seconds, 2020-10-09, openshift#12647), we no longer have
to worry about getting logs when that step is slow.  So we might not
pass if we're slow, but we'll still get logs to debug why we're slow.

Only for 4.6 and later, because 4.5 is live and if we had problems
there we'd probably have already heard about them from customers.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.3-to-4.4-to-4.5-to-4.6-ci/1318709056830967808
[2]: openshift/enhancements#201
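The newline-vs-'&&' question above can be checked with a toy snippet (not the ci-chat-bot config itself): under `set -e`, newline-joined commands abort on the first failure, matching what '&&' would do:

```shell
#!/bin/sh
# Run newline-joined command lists under 'set -e' (sh -e): a failing
# command aborts the rest of the list, just as '&&' would.
sh -e -c 'true
echo newline-ok'                     # prints: newline-ok

sh -e -c 'false
echo unreachable' || echo 'aborted as expected'   # prints: aborted as expected
```

So as long as the wrapper runs the joined script with `set -e`, the two join styles behave the same on failure.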
wking added a commit to wking/openshift-release that referenced this pull request Nov 13, 2020
Like 3c915e2 (ci-operator/step-registry/openshift/e2e/test: Add 2h
active_deadline_seconds, 2020-10-09, openshift#12647), but for the
gather-core-dump step.  This helps preserve time to run further gathers
and tear down the cluster under test if gather-core-dump gets hung up,
like it did for over 2h here [1]:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-update-keys/25/pull-ci-openshift-cluster-update-keys-master-e2e-aws/1326950727196610560/build-log.txt | grep -A2 'Executing pod.*gather-core-dump'
  2020/11/12 20:02:15 Executing pod "e2e-aws-gather-core-dump"
  2020/11/12 20:02:42 Container cp-secret-wrapper in pod e2e-aws-gather-core-dump completed successfully
  {"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2020-11-12T22:11:29Z"}

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-update-keys/25/pull-ci-openshift-cluster-update-keys-master-e2e-aws/1326950727196610560
wking added a commit to wking/openshift-release that referenced this pull request Dec 2, 2020
Like 3c915e2 (ci-operator/step-registry/openshift/e2e/test: Add 2h
active_deadline_seconds, 2020-10-09, openshift#12647), but for the deprovision
step.  This helps preserve time to gather assets if ipi-deprovision gets
hung up, like it did for over 2h here [1]:

  $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-api-actuator-pkg/190/pull-ci-openshift-cluster-api-actuator-pkg-master-e2e-aws-operator/1329702034437312512/build-log.txt | grep -A2 'Executing pod.*ipi-deprovision'
  2020/11/20 09:54:54 Executing pod "e2e-aws-operator-ipi-deprovision-deprovision"
  2020/11/20 09:54:57 Container cp-secret-wrapper in pod e2e-aws-operator-ipi-deprovision-deprovision completed successfully
  {"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2020-11-20T12:24:09Z"}

25 minutes should be enough for most teardown.  We have a number of
successful runs which take longer:

  $ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&type=build-log&context=0&search=Pod+.*ipi-deprovision-deprovision+succeeded+after' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's/.*after \([0-9]*\)\([smh]\).*/\2 \1/' | sort -n | uniq -c
       29 h 1
        1 h 2
       16 m 1
       21 m 10
       36 m 11
       33 m 12
       47 m 13
       40 m 14
       23 m 15
       21 m 16
       22 m 17
       20 m 18
       17 m 19
       18 m 2
       11 m 20
       13 m 21
       19 m 22
       22 m 23
        8 m 24
       17 m 25
       12 m 26
        8 m 27
       14 m 28
        9 m 29
       16 m 3
       15 m 30
       17 m 31
       ...
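The bucketing stage of that pipeline can be tried offline on canned log lines (the sample durations here are invented for illustration): the sed expression rewrites "succeeded after 1h12m0s" into a "unit leading-digits" pair, which `sort -n | uniq -c` then counts:

```shell
# Feed a few fake 'succeeded after' lines through the same sed/sort/uniq
# stages used against search.ci above.
printf '%s\n' \
  'Pod a-ipi-deprovision-deprovision succeeded after 14m23s' \
  'Pod b-ipi-deprovision-deprovision succeeded after 1h12m0s' \
  'Pod c-ipi-deprovision-deprovision succeeded after 14m2s' |
  sed 's/.*after \([0-9]*\)\([smh]\).*/\2 \1/' |
  sort -n | uniq -c
# Expected buckets: "1 h 1" and "2 m 14" (count, unit, leading digits)
```

Note the pair keeps only the leading digits before the first unit letter, which is why "14m23s" and "14m2s" land in the same "m 14" bucket.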

Success after over an hour is crazy.  Which jobs were those?

  $ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&type=build-log&context=0&search=Pod+.*ipi-deprovision-deprovision+succeeded+after+[0-9]h' | jq -r 'to_entries[] | .key as $uri | .value | to_entries[].value[].context[] | $uri + " " + .'
  https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-kube-reporting-metering-operator-release-4.5-metering-periodic-aws/1333923915013033984 2020/12/02 01:27:54 Pod metering-periodic-aws-ipi-deprovision-deprovision succeeded after 1h12m0s
  https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-odo-master-v4.6-integration-e2e-periodic/1333923958377943040 2020/12/02 01:28:12 Pod integration-e2e-periodic-ipi-deprovision-deprovision succeeded after 1h5m41s
  https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1333804827905363968 2020/12/01 19:44:10 Pod e2e-aws-proxy-ipi-deprovision-deprovision succeeded after 1h4m2s
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/che-incubator_chectl/992/pull-ci-che-incubator-chectl-master-v5-chectl-e2e-operator-installer/1333897643826352128 2020/12/02 01:02:07 Pod chectl-e2e-operator-installer-ipi-deprovision-deprovision succeeded after 1h41m5s
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/cri-o_cri-o/4377/pull-ci-cri-o-cri-o-master-e2e-agnostic/1333901999330037760 2020/12/02 00:10:14 Pod e2e-agnostic-ipi-deprovision-deprovision succeeded after 1h8m0s
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/cri-o_cri-o/4394/pull-ci-cri-o-cri-o-master-e2e-agnostic/1333866701950816256 2020/12/01 23:03:06 Pod e2e-agnostic-ipi-deprovision-deprovision succeeded after 1h15m3s
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift-psap_special-resource-operator/41/pull-ci-openshift-psap-special-resource-operator-master-e2e-aws/1333903530364243968 2020/12/02 00:16:52 Pod e2e-aws-ipi-deprovision-deprovision succeeded after 1h12m8s
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-logging-operator/821/pull-ci-openshift-cluster-logging-operator-master-functional/1333919264872075264 2020/12/02 01:00:51 Pod functional-ipi-deprovision-deprovision succeeded after 1h3m35s
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/852/pull-ci-openshift-cluster-network-operator-master-e2e-ovn-hybrid-step-registry/1333766191600111616 2020/12/01 16:15:42 Pod e2e-ovn-hybrid-step-registry-ipi-deprovision-deprovision succeeded after 1h4m35s
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/886/pull-ci-openshift-cluster-network-operator-master-e2e-aws-ovn-windows-custom-vxlan/1333813238646706176 2020/12/01 18:01:39 Pod e2e-aws-ovn-windows-custom-vxlan-ipi-deprovision-deprovision succeeded after 1h8m32s
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/891/pull-ci-openshift-cluster-network-operator-master-e2e-aws-ovn-windows-custom-vxlan/1333907617315033088 2020/12/02 01:32:28 Pod e2e-aws-ovn-windows-custom-vxlan-ipi-deprovision-deprovision succeeded after 1h23m3s
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_console-operator/491/pull-ci-openshift-console-operator-master-e2e-aws-operator/1333879999068901376 2020/12/02 00:11:37 Pod e2e-aws-operator-ipi-deprovision-deprovision succeeded after 1h7m16s
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/4429/pull-ci-openshift-installer-master-e2e-aws-workers-rhel7/1333912709506273280 2020/12/02 01:34:18 Pod e2e-aws-workers-rhel7-ipi-deprovision-deprovision succeeded after 1h12m35s
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/4440/pull-ci-openshift-installer-master-e2e-aws-shared-vpc/1333864843601514496 2020/12/01 23:32:40 Pod e2e-aws-shared-vpc-ipi-deprovision-deprovision succeeded after 1h2m49s
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes-autoscaler/180/pull-ci-openshift-kubernetes-autoscaler-master-e2e-aws-operator/1333874595857436672 2020/12/01 23:02:01 Pod e2e-aws-operator-ipi-deprovision-deprovision succeeded after 1h19m11s
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/769/pull-ci-openshift-machine-api-operator-master-e2e-aws/1333776539803717632 2020/12/01 16:58:56 Pod e2e-aws-ipi-deprovision-deprovision succeeded after 1h15m39s
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2210/pull-ci-openshift-machine-config-operator-master-e2e-aws-serial/1333867792788623360 2020/12/02 00:09:53 Pod e2e-aws-serial-ipi-deprovision-deprovision succeeded after 1h19m58s
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2210/pull-ci-openshift-machine-config-operator-master-okd-e2e-aws/1333839577021943808 2020/12/01 21:59:09 Pod e2e-aws-ipi-deprovision-deprovision succeeded after 1h2m1s
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2272/pull-ci-openshift-machine-config-operator-master-e2e-aws-serial/1333899974395564032 2020/12/02 01:07:48 Pod e2e-aws-serial-ipi-deprovision-deprovision succeeded after 1h5m13s
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_oc/657/pull-ci-openshift-oc-master-e2e-aws/1333850678639988736 2020/12/01 21:43:41 Pod e2e-aws-ipi-deprovision-deprovision succeeded after 1h6m6s
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_oc/657/pull-ci-openshift-oc-master-e2e-aws/1333910781766406144 2020/12/02 00:49:46 Pod e2e-aws-ipi-deprovision-deprovision succeeded after 1h17m17s
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_oc/660/pull-ci-openshift-oc-master-e2e-aws-serial/1333891301321478144 2020/12/01 23:33:01 Pod e2e-aws-serial-ipi-deprovision-deprovision succeeded after 1h3m49s
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_oc/660/pull-ci-openshift-oc-master-e2e-aws-serial/1333919666019504128 2020/12/02 01:01:44 Pod e2e-aws-serial-ipi-deprovision-deprovision succeeded after 1h2m25s
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/365/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-windows-custom-vxlan/1333781276464779264 2020/12/01 17:09:03 Pod e2e-aws-ovn-windows-custom-vxlan-ipi-deprovision-deprovision succeeded after 1h31m37s
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/366/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-windows-custom-vxlan/1333877629530411008 2020/12/02 00:02:31 Pod e2e-aws-ovn-windows-custom-vxlan-ipi-deprovision-deprovision succeeded after 1h7m45s
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_prometheus/67/pull-ci-openshift-prometheus-master-e2e-aws/1333873401026056192 2020/12/02 00:45:30 Pod e2e-aws-ipi-deprovision-deprovision succeeded after 1h23m0s
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/13998/rehearse-13998-pull-ci-openshift-windows-machine-config-operator-release-4.7-aws-e2e-operator/1333906001283256320 2020/12/02 01:22:07 Pod aws-e2e-operator-ipi-deprovision-deprovision succeeded after 1h1m36s
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_sriov-network-operator/416/pull-ci-openshift-sriov-network-operator-master-e2e-aws/1333878269035941888 2020/12/02 00:29:39 Pod e2e-aws-ipi-deprovision-deprovision succeeded after 1h51m18s
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_sriov-network-operator/425/pull-ci-openshift-sriov-network-operator-master-e2e-aws/1333762244420308992 2020/12/01 16:07:41 Pod e2e-aws-ipi-deprovision-deprovision succeeded after 1h22m15s
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_image-registry/257/pull-ci-openshift-image-registry-master-e2e-aws/1333873914832490496 2020/12/02 01:06:12 Pod e2e-aws-ipi-deprovision-deprovision succeeded after 2h15m28s

Checking that slowest job:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_image-registry/257/pull-ci-openshift-image-registry-master-e2e-aws/1333873914832490496/artifacts/e2e-aws/ipi-deprovision-deprovision/.openshift_install.log 2>/dev/null | head -n1
  time="2020-12-01T22:51:17Z" level=debug msg="OpenShift Installer unreleased-master-3988-g47b4c1a046ee43bc500ff19c98a231a754cc2a70-dirty"
  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_image-registry/257/pull-ci-openshift-image-registry-master-e2e-aws/1333873914832490496/artifacts/e2e-aws/ipi-deprovision-deprovision/.openshift_install.log | tail -n1
  time="2020-12-02T01:06:11Z" level=info msg="Time elapsed: 2h14m54s"

So the delay is legitimate.  But it's due to a hiccup in IAM-user
listing (AWS-side bug?  Throttling?):

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_image-registry/257/pull-ci-openshift-image-registry-master-e2e-aws/1333873914832490496/artifacts/e2e-aws/ipi-deprovision-deprovision/.openshift_install.log | grep -A1 'search for IAM users'
  time="2020-12-01T22:53:53Z" level=debug msg="search for IAM users"
  time="2020-12-02T00:08:08Z" level=debug msg="search for IAM instance profiles"
  --
  time="2020-12-02T00:38:54Z" level=debug msg="search for IAM users"
  time="2020-12-02T01:04:14Z" level=debug msg="search for IAM instance profiles"
  --
  time="2020-12-02T01:05:27Z" level=debug msg="search for IAM users"
  time="2020-12-02T01:05:42Z" level=debug msg="search for IAM instance profiles"
  --
  time="2020-12-02T01:05:45Z" level=debug msg="search for IAM users"
  time="2020-12-02T01:05:48Z" level=debug msg="search for IAM instance profiles"

Anyhow, letting a few of those slip through the cracks isn't a
problem, because we will still have the first 25 minutes of deprovision
logs in the job itself, and any remaining resources will eventually be
reaped by [1] and other backstops.

Picking a number that's less than 30m gives us a chance to wrap
things up within the termination grace period granted to Prow jobs.
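The arithmetic behind that choice, as a quick sketch (25m is the deadline proposed in the commit text; 30m is the Prow grace period cited above):

```shell
# Convert the 25m step deadline into the seconds value the ref file wants,
# and show the headroom left inside the 30m Prow termination grace period.
deadline_minutes=25
grace_minutes=30
echo "active_deadline_seconds: $(( deadline_minutes * 60 ))"                       # 1500
echo "headroom before Prow kill: $(( (grace_minutes - deadline_minutes) * 60 ))s"  # 300s
```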

[1]: https://github.com/openshift/ci-tools/blob/9dd6f87a5cd070f9e81af7a6cb181bb466f47933/cmd/ipi-deprovision/ipi-deprovision.sh