Fail activation if we exceed a timeout#4614
Fail activation if we exceed a timeout#4614knative-prow-robot merged 6 commits intoknative:masterfrom
Conversation
|
/assign @vagababov WDYT? |
knative-prow-robot
left a comment
There was a problem hiding this comment.
@jonjohnsonjr: 0 warnings.
Details
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
857b555 to
1a296ed
Compare
This avoids racing the Revision reconciler, which needs to inspect pod statuses before we scale down the Deployment.
|
/retest |
|
/retest |
|
The following is the coverage report on pkg/.
|
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jonjohnsonjr, mattmoor The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest |
This partially addresses #3456
The KPA reconciler will consider activation to have TimedOut if it is activating (Active=Unknown) for ProgressDeadlineSeconds (this is currently a hardcoded constant, but we might want to make it configurable in the future if we revive #810) and scale the deployment to zero, since we don't expect activation to ever succeed.
The Revision reconciler gets triggered after ProgressDeadlineSeconds because the Deployment that it watches will have it status update with Progressing=False,Reason=ProgressDeadlineExceeded. The Revision reconciler currently does pod failure diagnosis, which breaks if the pods get scaled down to zero. To avoid breaking that, the KPA reconciler will wait for ProgressDeadlineSeconds + 10 seconds before scaling to zero, to give the Revision reconciler an opportunity to look at the pod status. In the future we can make the KPA responsible for pod failure diagnosis to remove this race (and the 10 second buffer).
This PR is a continuation of #4094, but I've split apart the not-scaling-to-zero part from the lifecycle changes to make the PR smaller.