pkg/steps: Configurable activeDeadlineSeconds and terminationGracePeriodSeconds (#1257)
Conversation
Force-pushed 09086e8 to 7f6de2a.
stevekuznetsov left a comment:
Add a test to test/e2e/multi-stage
```go
// Commands are the shell commands to run in
// the repository root to execute tests.
Commands string `json:"commands,omitempty"`
```
You want these in TestStep and LiteralTestStep - not here.
As of a67ec4ba6, I'm touching two streams:
a. api.TestStepConfiguration -> steps.TestStep -> steps.PodStepConfiguration -> podStep.generatePodForStep.
b. api.LiteralTestStep -> multiStageTestStep.generatePods.
(b) makes sense to me for multi-step ref YAML. I'm not really sure what (a) is about, or how (a) is related to (b).
TestStepConfiguration corresponds to the entries in the tests section of the configuration. It will affect all types of tests (container, template, multi-stage).
Is that ok? Or do we want to exclude those for now?
I think we want to have this on TestStep level, not TestStepConfiguration for consistency with LiteralTestStep (so that YAML indent levels are consistent between the two). We'll want people to mostly use test steps anyway.
TestStep contains a LiteralTestStep, so if we add these there, you'd have the possibility of disagreement between TestStep.ActiveDeadlineSeconds and TestStep.LiteralTestStep.ActiveDeadlineSeconds, right? My current positioning has the new properties as siblings of Commands in two of the three types that have Commands (PipelineImageCacheStepConfiguration is the third type with Commands, and I dunno if we want the timeout properties in there or not). Do you still want me to shift some of these properties to a different struct?
No, there can be no disagreement. These fields should not be on TestStepConfiguration.
Dropped from TestStepConfiguration with 35e4b8e -> e3eb04d. So now the only type in pkg/api/types.go that carries them is LiteralTestStep. Did you want them on TestStep too, per this earlier comment, or does my argument that TestStep contains LiteralTestStep mean that touching just LiteralTestStep is sufficient?
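For concreteness, here is a minimal sketch of the shape these fields end up with on LiteralTestStep; the exact surrounding fields and comments in pkg/api/types.go will differ, and the snake_case JSON tags follow the casing decision recorded in the commit message below:

```go
// Sketch only; LiteralTestStep in pkg/api/types.go carries many more fields.
type LiteralTestStep struct {
	// Commands are the shell commands to run in
	// the repository root to execute tests.
	Commands string `json:"commands,omitempty"`
	// ActiveDeadlineSeconds is copied into the test pod's
	// spec.activeDeadlineSeconds, bounding the pod's total runtime.
	ActiveDeadlineSeconds *int64 `json:"active_deadline_seconds,omitempty"`
	// TerminationGracePeriodSeconds is copied into the test pod's
	// spec.terminationGracePeriodSeconds, bounding graceful shutdown.
	TerminationGracePeriodSeconds *int64 `json:"termination_grace_period_seconds,omitempty"`
}
```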
Force-pushed 7f6de2a to a67ec4b.
I'm not sure if that's my new …

The integration failure is a known but rare race.
…iodSeconds

Sometimes test steps like updates take a long time [1]. When that happens, something wrapping the workflow may be timed out instead, and the whole thing gets torn down without any post-test gather/teardown steps running. With this commit, we expose two timeout-related properties from PodSpec [2] to step maintainers to use to control this behavior. Step maintainers can now set per-step timeouts that are comfortably less than the wrapper timeout, ensuring that slow/hung steps get killed, and the remaining steps have some time to gather details that will explain the delay and perform other teardown.

I personally prefer matching the casing of the wrapped Kubernetes properties, but Bruno and Petr prefer matching the snake_casing of the existing step-config properties [3], so that's what we have in this commit.

[1]: openshift/cluster-network-operator#785 (comment)
[2]: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.18/#podspec-v1-core
[3]: openshift#1257 (comment)
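A rough sketch of the wiring this implies at pod-generation time (a hypothetical helper; the real wiring lives in generatePodForStep / generatePods per the thread above, and the field names are assumptions consistent with this PR):

```go
package steps

import (
	corev1 "k8s.io/api/core/v1"

	"github.com/openshift/ci-tools/pkg/api"
)

// applyStepTimeouts copies the step-level timeouts onto the generated pod,
// leaving the cluster defaults in place when the step does not set them.
func applyStepTimeouts(step api.LiteralTestStep, pod *corev1.Pod) {
	if step.ActiveDeadlineSeconds != nil {
		pod.Spec.ActiveDeadlineSeconds = step.ActiveDeadlineSeconds
	}
	if step.TerminationGracePeriodSeconds != nil {
		pod.Spec.TerminationGracePeriodSeconds = step.TerminationGracePeriodSeconds
	}
}
```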
Force-pushed a67ec4b to 616d675.
Force-pushed 616d675 to 35e4b8e.
Hmm, 35e4b8e has an e2e hang:

```console
$ date --utc --iso=m
2020-09-28T21:10+0000
$ oc -n ci-op-jsv79xhg logs --timestamps e2e-e2e -c test | tail -n1
2020-09-28T20:35:59.194973943Z Running test/e2e/multi-stage.sh:30: executing 'ci-operator --artifact-dir /tmp/openshift/test-e2e --resolver-address http://127.0.0.1:8080 --target timeout --unresolved-config /go/src/github.com/openshift/ci-tools/test/e2e/multi-stage/config.yaml' expecting exit code 1 and text 'fixme-error-message'...
```

So it's been hung for over 30m on a test that has a 10s timeout on a 1m sleep. Not sure how to figure out what it's hung on...
/test e2e
Something deleted the namespace, but the failure mode should not be so slow, right?

/test e2e
@wking the logic that's watching the Pod does not see it complete or something - so the hang we're seeing is that the …
stevekuznetsov left a comment:
We detect failure with `podJobIsFailed` - might want to check the resulting Pod YAML for an aborted Pod with this mechanism to see why we're not catching that.
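For readers following along, a sketch of the kind of terminal-state check a helper like `podJobIsFailed` performs (an assumed shape, not the actual pkg/steps implementation):

```go
package steps

import corev1 "k8s.io/api/core/v1"

// podLooksFailed reports whether the pod reached a failed terminal state.
// A pod killed by activeDeadlineSeconds lands in PodFailed, with
// status.reason set to "DeadlineExceeded" by the kubelet.
func podLooksFailed(pod *corev1.Pod) bool {
	if pod.Status.Phase == corev1.PodFailed {
		return true
	}
	for _, status := range pod.Status.ContainerStatuses {
		if t := status.State.Terminated; t != nil && t.ExitCode != 0 {
			return true
		}
	}
	return false
}
```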
Force-pushed 35e4b8e to e3eb04d.
…iodSeconds

Sometimes test steps like updates take a long time [1]. When that happens, something wrapping the workflow may be timed out instead, and the whole thing gets torn down without any post-test gather/teardown steps running. With this commit, we expose two timeout-related properties from PodSpec [2] to step maintainers to use to control this behavior. Step maintainers can now set per-step timeouts that are comfortably less than the wrapper timeout, ensuring that slow/hung steps get killed, and the remaining steps have some time to gather details that will explain the delay and perform other teardown.

I personally prefer matching the casing of the wrapped Kubernetes properties, but Bruno and Petr prefer matching the snake_casing of the existing step-config properties [3], so that's what we have in this commit.

It seems like we'd want these to be configurable alongside all the places where 'commands' was configurable. But:

* I dunno what PipelineImageCacheStepConfiguration is about, so I did not add these properties there.
* Steve says he doesn't want them on TestStepConfiguration [4], which handles the 'tests' section of the configuration, including container, template, and multi-step tests [5].

So I'm only adding them to the single-step LiteralTestStep.

[1]: openshift/cluster-network-operator#785 (comment)
[2]: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.18/#podspec-v1-core
[3]: openshift#1257 (comment)
[4]: openshift#1257 (comment)
[5]: openshift#1257 (comment)
Force-pushed e3eb04d to 747bb5a.
/retest

So I can hopefully figure out why …
Force-pushed 9176e2c to 75ee666.
Force-pushed fa8b493 to 6265230.
And, when we see that finalizer in the pod and are no longer interested in watching it, remove the finalizer so the deletion can go through. This should avoid cases where the parallel Delete (e.g. the one in podStep.run()) removes the pod, the change somehow slips through the watch, and we lose track of the pod [1]:

```
2020/10/06 23:04:25 error: could not wait for pod 'timeout-timeout': it is no longer present on the cluster (usually a result of a race or resource pressure. re-running the job should help)
```

[1]: openshift#1257 (comment)
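A sketch of the finalizer dance this commit describes; the finalizer name and helper are assumptions for illustration:

```go
package steps

import corev1 "k8s.io/api/core/v1"

// watchingFinalizer is an assumed name for illustration only.
const watchingFinalizer = "ci-operator.openshift.io/watching-pod"

// removeFinalizer strips our finalizer so a pending deletion can complete
// once we have observed the pod's terminal state; the caller still needs
// to persist the change with an Update against the cluster.
func removeFinalizer(pod *corev1.Pod) {
	kept := pod.Finalizers[:0]
	for _, f := range pod.Finalizers {
		if f != watchingFinalizer {
			kept = append(kept, f)
		}
	}
	pod.Finalizers = kept
}
```

While the finalizer is present, a parallel Delete only marks the pod for deletion; the pod stays visible to the watch until the finalizer is removed, which closes the race described above.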
Force-pushed 6265230 to 493dbff.
Force-pushed dd15573 to d91d7c2.
vrutkovs left a comment:
golangci-lint wants to check for `pod != nil` exactly once, before the first attempt to use it - see dominikh/go-tools#641 (comment).
See 8b7ea33
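The complaint is staticcheck's nil-pointer check (SA5011 in dominikh/go-tools): any dereference that precedes the nil check is flagged, because the later check is taken as proof the pointer may be nil. A contrived illustration of the pattern it wants, not the actual diff in 8b7ea33:

```go
package steps

import corev1 "k8s.io/api/core/v1"

// podName dereferences pod only after a single up-front nil check.
// Dereferencing first (e.g. name := pod.Name) and checking pod == nil
// afterwards trips SA5011 (possible nil pointer dereference).
func podName(pod *corev1.Pod) string {
	if pod == nil {
		return ""
	}
	return pod.Name
}
```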
/lgtm cancel

oops, wrong button
Force-pushed d91d7c2 to 972bd91.
Linter is fixed \o/, time to update e2e
e2e: Not sure what's wrong there. When I grep it locally, I get a match:

```console
$ grep '"timeout" pod "timeout-timeout" exceeded the configured timeout activeDeadlineSeconds=10: the pod [^ ]* failed after 1[0-5]s (failed containers: ): DeadlineExceeded Pod was active on the node longer than the specified deadline' <<EOF
> error: some steps failed:
> * could not run steps: step timeout failed: "timeout" test steps failed: "timeout" pod "timeout-timeout" exceeded the configured timeout activeDeadlineSeconds=10: the pod ci-op-dvxikfl7/timeout-timeout failed after 11s (failed containers: ): DeadlineExceeded Pod was active on the node longer than the specified deadline
> ...
> EOF
* could not run steps: step timeout failed: "timeout" test steps failed: "timeout" pod "timeout-timeout" exceeded the configured timeout activeDeadlineSeconds=10: the pod ci-op-dvxikfl7/timeout-timeout failed after 11s (failed containers: ): DeadlineExceeded Pod was active on the node longer than the specified deadline
```
Get something that makes it a bit more clear that this is a "pod ran long" issue, vs. the pod controller's:

> Pod was active on the node longer than the specified deadline

which sounds like it might be a node-side issue. Upstream Kubernetes has no public constants wrapping DeadlineExceeded:

```console
kubernetes$ git --no-pager log --oneline -1
f30d6a463dd (HEAD -> master, origin/master, origin/HEAD) Merge pull request #93779 from yodarshafrir1/fix_restart_job_failure_with_restart_policy_never
kubernetes$ git grep '"DeadlineExceeded"' | grep -v test | sed 's/\t/ /g'
pkg/controller/job/job_controller.go: failureReason = "DeadlineExceeded"
pkg/kubelet/active_deadline.go: reason = "DeadlineExceeded"
vendor/golang.org/x/tools/internal/imports/zstdlib.go: "DeadlineExceeded",
vendor/google.golang.org/grpc/codes/code_string.go: return "DeadlineExceeded"
```
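A sketch of the friendlier wrapping that would produce the message matched by the e2e grep above (a hypothetical helper and error format; the actual wording landed in 9d209c7):

```go
package steps

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// timeoutError decorates err when the pod was killed by its
// activeDeadlineSeconds, naming the configured timeout so the failure
// does not read as a node-side problem.
func timeoutError(testName string, pod *corev1.Pod, err error) error {
	if pod.Status.Reason == "DeadlineExceeded" && pod.Spec.ActiveDeadlineSeconds != nil {
		return fmt.Errorf("%q pod %q exceeded the configured timeout activeDeadlineSeconds=%d: %w",
			testName, pod.Name, *pod.Spec.ActiveDeadlineSeconds, err)
	}
	return err
}
```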
Force-pushed 972bd91 to 9d209c7.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: stevekuznetsov, wking
Hooray! Thank you to all the folks who pitched in here and helped get this landed 🎉
I've opened an initial consumer: openshift/release#12647
…sage

I'd fixed this for multi-step while rerolling 9d209c7 (pkg/steps: Logging around activeDeadlineSeconds DeadlineExceeded, 2020-10-08, openshift#1257), but missed fixing the template log. Catch that up now.