pkg/steps: Configurable activeDeadlineSeconds and terminationGracePeriodSeconds (#1257)
Conversation
Force-pushed 09086e8 to 7f6de2a.
stevekuznetsov left a comment:
Add a test to test/e2e/multi-stage
```go
// Commands are the shell commands to run in
// the repository root to execute tests.
Commands string `json:"commands,omitempty"`
```
You want these in TestStep and LiteralTestStep - not here.
As of a67ec4ba6, I'm touching two streams:
a. api.TestStepConfiguration -> steps.TestStep -> steps.PodStepConfiguration -> podStep.generatePodForStep.
b. api.LiteralTestStep -> multiStageTestStep.generatePods.
(b) makes sense to me for multi-step ref YAML. I'm not really sure what (a) is about, or how (a) is related to (b).
TestStepConfiguration corresponds to the entries in the tests section of the configuration. It will affect all types of tests (container, template, multi-stage).
Is that ok? Or do we want to exclude those for now?
I think we want to have this on TestStep level, not TestStepConfiguration for consistency with LiteralTestStep (so that YAML indent levels are consistent between the two). We'll want people to mostly use test steps anyway.
TestStep contains a LiteralTestStep, so if we add these there, you'd have the possibility of disagreement between TestStep.ActiveDeadlineSeconds and TestStep.LiteralTestStep.ActiveDeadlineSeconds, right? My current positioning has the new properties as siblings of Commands in two of the three types that have Commands (PipelineImageCacheStepConfiguration is the third type with Commands, and I dunno if we want the timeout properties in there or not). Do you still want me to shift some of these properties to a different struct?
No, there can be no disagreement. These fields should not be on TestStepConfiguration.
Dropped from TestStepConfiguration with 35e4b8e -> e3eb04d. So now the only type in pkg/api/types.go that carries them is LiteralTestStep. Did you want them on TestStep too, per this earlier comment, or does my argument that TestStep contains LiteralTestStep mean that touching just LiteralTestStep is sufficient?
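For concreteness, here is a minimal sketch of the shape these fields end up with on LiteralTestStep; the exact surrounding fields and comments in pkg/api/types.go will differ, and the snake_case JSON tags follow the casing decision recorded in the commit message below:

```go
// Sketch only; LiteralTestStep in pkg/api/types.go carries many more fields.
type LiteralTestStep struct {
	// Commands are the shell commands to run in
	// the repository root to execute tests.
	Commands string `json:"commands,omitempty"`
	// ActiveDeadlineSeconds is copied into the test pod's
	// spec.activeDeadlineSeconds, bounding the pod's total runtime.
	ActiveDeadlineSeconds *int64 `json:"active_deadline_seconds,omitempty"`
	// TerminationGracePeriodSeconds is copied into the test pod's
	// spec.terminationGracePeriodSeconds, bounding graceful shutdown.
	TerminationGracePeriodSeconds *int64 `json:"termination_grace_period_seconds,omitempty"`
}
```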
Force-pushed 7f6de2a to a67ec4b.
I'm not sure if that's my new …

The integration failure is a known but rare race.
…iodSeconds

Sometimes test steps like updates take a long time [1]. When that happens, something wrapping the workflow may be timed out instead, and the whole thing gets torn down without any post-test gather/teardown steps running. With this commit, we expose two timeout-related properties from PodSpec [2] to step maintainers to use to control this behavior. Step maintainers can now set per-step timeouts that are comfortably less than the wrapper timeout, ensuring that slow/hung steps get killed, and the remaining steps have some time to gather details that will explain the delay and perform other teardown.

I personally prefer matching the casing of the wrapped Kubernetes properties, but Bruno and Petr prefer matching the snake_casing of the existing step-config properties [3], so that's what we have in this commit.

[1]: openshift/cluster-network-operator#785 (comment)
[2]: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.18/#podspec-v1-core
[3]: openshift#1257 (comment)
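A rough sketch of the wiring this implies at pod-generation time (a hypothetical helper; the real wiring lives in generatePodForStep / generatePods per the thread above, and the field names are assumptions consistent with this PR):

```go
package steps

import (
	corev1 "k8s.io/api/core/v1"

	"github.com/openshift/ci-tools/pkg/api"
)

// applyStepTimeouts copies the step-level timeouts onto the generated pod,
// leaving the cluster defaults in place when the step does not set them.
func applyStepTimeouts(step api.LiteralTestStep, pod *corev1.Pod) {
	if step.ActiveDeadlineSeconds != nil {
		pod.Spec.ActiveDeadlineSeconds = step.ActiveDeadlineSeconds
	}
	if step.TerminationGracePeriodSeconds != nil {
		pod.Spec.TerminationGracePeriodSeconds = step.TerminationGracePeriodSeconds
	}
}
```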
Force-pushed a67ec4b to 616d675.
Force-pushed 616d675 to 35e4b8e.
Hmm, 35e4b8e has an e2e hang:

```console
$ date --utc --iso=m
2020-09-28T21:10+0000
$ oc -n ci-op-jsv79xhg logs --timestamps e2e-e2e -c test | tail -n1
2020-09-28T20:35:59.194973943Z Running test/e2e/multi-stage.sh:30: executing 'ci-operator --artifact-dir /tmp/openshift/test-e2e --resolver-address http://127.0.0.1:8080 --target timeout --unresolved-config /go/src/github.com/openshift/ci-tools/test/e2e/multi-stage/config.yaml' expecting exit code 1 and text 'fixme-error-message'...
```

So it's been hung for over 30m on a test that has a 10s timeout on a 1m sleep. Not sure how to figure out what it's hung on...
/test e2e
Something deleted the namespace, but the failure mode should not be so slow, right?

/test e2e
@wking the logic that's watching the Pod does not see it complete or something - so the hang we're seeing is that the …
stevekuznetsov left a comment:
We detect failure with `podJobIsFailed` - might want to check the resulting Pod YAML for an aborted Pod with this mechanism to see why we're not catching that.
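For readers following along, a sketch of the kind of terminal-state check a helper like `podJobIsFailed` performs (an assumed shape, not the actual pkg/steps implementation):

```go
package steps

import corev1 "k8s.io/api/core/v1"

// podLooksFailed reports whether the pod reached a failed terminal state.
// A pod killed by activeDeadlineSeconds lands in PodFailed, with
// status.reason set to "DeadlineExceeded" by the kubelet.
func podLooksFailed(pod *corev1.Pod) bool {
	if pod.Status.Phase == corev1.PodFailed {
		return true
	}
	for _, status := range pod.Status.ContainerStatuses {
		if t := status.State.Terminated; t != nil && t.ExitCode != 0 {
			return true
		}
	}
	return false
}
```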
Force-pushed 35e4b8e to e3eb04d.
…iodSeconds

Sometimes test steps like updates take a long time [1]. When that happens, something wrapping the workflow may be timed out instead, and the whole thing gets torn down without any post-test gather/teardown steps running. With this commit, we expose two timeout-related properties from PodSpec [2] to step maintainers to use to control this behavior. Step maintainers can now set per-step timeouts that are comfortably less than the wrapper timeout, ensuring that slow/hung steps get killed, and the remaining steps have some time to gather details that will explain the delay and perform other teardown.

I personally prefer matching the casing of the wrapped Kubernetes properties, but Bruno and Petr prefer matching the snake_casing of the existing step-config properties [3], so that's what we have in this commit.

It seems like we'd want these to be configurable alongside all the places where 'commands' was configurable. But:

* I dunno what PipelineImageCacheStepConfiguration is about, so I did not add these properties there.
* Steve says he doesn't want them on TestStepConfiguration [4], which handles the 'tests' section of the configuration, including container, template, and multi-step tests [5].

So I'm only adding them to the single-step LiteralTestStep.

[1]: openshift/cluster-network-operator#785 (comment)
[2]: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.18/#podspec-v1-core
[3]: openshift#1257 (comment)
[4]: openshift#1257 (comment)
[5]: openshift#1257 (comment)
Force-pushed e3eb04d to 747bb5a.
/retest

So I can hopefully figure out why …
Force-pushed 9176e2c to 75ee666.
Force-pushed fa8b493 to 6265230.
And, when we see that finalizer in the pod and are no longer interested in watching it, remove the finalizer so the deletion can go through. This should avoid cases where the parallel Delete (e.g. the one in podStep.run()) removes the pod, the change somehow slips through the watch, and we lose track of the pod [1]:

```
2020/10/06 23:04:25 error: could not wait for pod 'timeout-timeout': it is no longer present on the cluster (usually a result of a race or resource pressure. re-running the job should help)
```

[1]: openshift#1257 (comment)
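A sketch of the finalizer dance this commit describes; the finalizer name and helper are assumptions for illustration:

```go
package steps

import corev1 "k8s.io/api/core/v1"

// watchingFinalizer is an assumed name for illustration only.
const watchingFinalizer = "ci-operator.openshift.io/watching-pod"

// removeFinalizer strips our finalizer so a pending deletion can complete
// once we have observed the pod's terminal state; the caller still needs
// to persist the change with an Update against the cluster.
func removeFinalizer(pod *corev1.Pod) {
	kept := pod.Finalizers[:0]
	for _, f := range pod.Finalizers {
		if f != watchingFinalizer {
			kept = append(kept, f)
		}
	}
	pod.Finalizers = kept
}
```

While the finalizer is present, a parallel Delete only marks the pod for deletion; the pod stays visible to the watch until the finalizer is removed, which closes the race described above.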
Force-pushed 6265230 to 493dbff.
Force-pushed dd15573 to d91d7c2.
vrutkovs left a comment:
golangci-lint wants to check for `pod != nil` exactly once, before the first attempt to use it - see dominikh/go-tools#641 (comment).
See 8b7ea33
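The complaint is staticcheck's nil-pointer check (SA5011 in dominikh/go-tools): any dereference that precedes the nil check is flagged, because the later check is taken as proof the pointer may be nil. A contrived illustration of the pattern it wants, not the actual diff in 8b7ea33:

```go
package steps

import corev1 "k8s.io/api/core/v1"

// podName dereferences pod only after a single up-front nil check.
// Dereferencing first (e.g. name := pod.Name) and checking pod == nil
// afterwards trips SA5011 (possible nil pointer dereference).
func podName(pod *corev1.Pod) string {
	if pod == nil {
		return ""
	}
	return pod.Name
}
```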
/lgtm cancel

oops, wrong button
Force-pushed d91d7c2 to 972bd91.
Linter is fixed \o/, time to update e2e
e2e: Not sure what's wrong there. When I grep it locally, I get a match:

```console
$ grep '"timeout" pod "timeout-timeout" exceeded the configured timeout activeDeadlineSeconds=10: the pod [^ ]* failed after 1[0-5]s (failed containers: ): DeadlineExceeded Pod was active on the node longer than the specified deadline' <<EOF
> error: some steps failed:
> * could not run steps: step timeout failed: "timeout" test steps failed: "timeout" pod "timeout-timeout" exceeded the configured timeout activeDeadlineSeconds=10: the pod ci-op-dvxikfl7/timeout-timeout failed after 11s (failed containers: ): DeadlineExceeded Pod was active on the node longer than the specified deadline
> ...
> EOF
* could not run steps: step timeout failed: "timeout" test steps failed: "timeout" pod "timeout-timeout" exceeded the configured timeout activeDeadlineSeconds=10: the pod ci-op-dvxikfl7/timeout-timeout failed after 11s (failed containers: ): DeadlineExceeded Pod was active on the node longer than the specified deadline
```
Get something that makes it a bit more clear that this is a "pod ran long" issue, vs. the pod controller's:

> Pod was active on the node longer than the specified deadline

which sounds like it might be a node-side issue. Upstream Kubernetes has no public constants wrapping DeadlineExceeded:

```console
kubernetes$ git --no-pager log --oneline -1
f30d6a463dd (HEAD -> master, origin/master, origin/HEAD) Merge pull request #93779 from yodarshafrir1/fix_restart_job_failure_with_restart_policy_never
kubernetes$ git grep '"DeadlineExceeded"' | grep -v test | sed 's/\t/ /g'
pkg/controller/job/job_controller.go: failureReason = "DeadlineExceeded"
pkg/kubelet/active_deadline.go: reason = "DeadlineExceeded"
vendor/golang.org/x/tools/internal/imports/zstdlib.go: "DeadlineExceeded",
vendor/google.golang.org/grpc/codes/code_string.go: return "DeadlineExceeded"
```
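A sketch of the friendlier wrapping that would produce the message matched by the e2e grep above (a hypothetical helper and error format; the actual wording landed in 9d209c7):

```go
package steps

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// timeoutError decorates err when the pod was killed by its
// activeDeadlineSeconds, naming the configured timeout so the failure
// does not read as a node-side problem.
func timeoutError(testName string, pod *corev1.Pod, err error) error {
	if pod.Status.Reason == "DeadlineExceeded" && pod.Spec.ActiveDeadlineSeconds != nil {
		return fmt.Errorf("%q pod %q exceeded the configured timeout activeDeadlineSeconds=%d: %w",
			testName, pod.Name, *pod.Spec.ActiveDeadlineSeconds, err)
	}
	return err
}
```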
Force-pushed 972bd91 to 9d209c7.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: stevekuznetsov, wking
Hooray! Thank you to all the folks who pitched in here and helped get this landed 🎉
I've opened an initial consumer: openshift/release#12647
…sage

I'd fixed this for multi-step while rerolling 9d209c7 (pkg/steps: Logging around activeDeadlineSeconds DeadlineExceeded, 2020-10-08, openshift#1257), but missed fixing the template log. Catch that up now.