Right now, when the cluster is full, Knative will try to deploy the revision for ProgressDeadline (PD) seconds (120s by default in Knative). If there are not enough resources and the cluster autoscaler (CAS) does not do its job fast enough, the PD will expire, the Autoscaler will wait some 10s more, and then finally give up and scale to 0, effectively marking the revision as failed.
@jonjohnsonjr added this crutch long ago to deal with revisions that fail to ever progress, so that we don't end up with zombie revisions that can never succeed.
But in the case above it is possible that the deployment would eventually succeed.
Now, a simple suggestion might be to just crank the PD setting up to 11. But this has the unintended consequence that all deployments now wait that long to fail, even the ones that were going to fail anyway.
Another suggestion is to have the Autoscaler check, after PD+10s has passed, whether the reason for pod unavailability is resource insufficiency, and if so, wait an additional Xs rather than mark the revision failed right away. Otherwise, behave as we do now.
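As a rough sketch of what that check could look like: when the scheduler cannot place a pod for lack of resources, it sets the `PodScheduled` condition to `False` with reason `Unschedulable` on the pod status. The types below are simplified stand-ins for the real Kubernetes API objects, just to illustrate the condition test; the actual implementation would inspect `corev1.Pod` via the informer cache.

```go
package main

import "fmt"

// PodCondition is a simplified stand-in for corev1.PodCondition.
type PodCondition struct {
	Type   string
	Status string
	Reason string
}

// Pod is a simplified stand-in for corev1.Pod.
type Pod struct {
	Name       string
	Conditions []PodCondition
}

// isUnschedulable reports whether the pod is pending because the
// scheduler could not place it, e.g. due to insufficient resources.
// In that case the Autoscaler could keep waiting instead of marking
// the revision failed.
func isUnschedulable(p Pod) bool {
	for _, c := range p.Conditions {
		if c.Type == "PodScheduled" && c.Status == "False" && c.Reason == "Unschedulable" {
			return true
		}
	}
	return false
}

func main() {
	pending := Pod{
		Name: "rev-00001-deployment-abc",
		Conditions: []PodCondition{
			{Type: "PodScheduled", Status: "False", Reason: "Unschedulable"},
		},
	}
	fmt.Println(isUnschedulable(pending)) // true: wait additional Xs rather than fail
}
```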
Deployments already watch for quota and will proceed once resources become available (https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#failed-deployment).
This seems like a mostly simple change that would save random Knative users from having to tweak PD for this reason.
Or I am missing something :-)
/cc @julz @markusthoemmes @mattmoor