Right now, when the cluster is full, Knative will try to deploy the revision for ProgressDeadline (PD) seconds (120s by default in Knative). If there are not enough resources and the cluster autoscaler (CAS) does not do its job fast enough, the PD will expire, the Autoscaler will wait some 10s more, and then finally give up and scale to 0, effectively marking the revision as failed.
@jonjohnsonjr added this crutch long ago to deal with revisions that fail to ever progress, so that we don't end up with zombie revisions that can never succeed.
But in the case above it is possible that the deployment would eventually succeed.
Now, a simple suggestion might be to just crank the PD setting up to 11. But this has the unintended consequence that all deployments now wait that long to fail, even the ones that were going to fail anyway.
Another suggestion is to have the Autoscaler check, after PD+10s has passed, whether the reason for pod unavailability is resource insufficiency, and if so, wait an additional Xs rather than mark the revision failed right away. Otherwise, behave as we do now.
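As a rough sketch of what that check could look like: when the scheduler cannot place a pod for lack of resources, it sets the `PodScheduled` condition to `False` with reason `Unschedulable` on the pod status. The types below are simplified stand-ins for the real Kubernetes API objects, just to illustrate the condition test; the actual implementation would inspect `corev1.Pod` via the informer cache.

```go
package main

import "fmt"

// PodCondition is a simplified stand-in for corev1.PodCondition.
type PodCondition struct {
	Type   string
	Status string
	Reason string
}

// Pod is a simplified stand-in for corev1.Pod.
type Pod struct {
	Name       string
	Conditions []PodCondition
}

// isUnschedulable reports whether the pod is pending because the
// scheduler could not place it, e.g. due to insufficient resources.
// In that case the Autoscaler could keep waiting instead of marking
// the revision failed.
func isUnschedulable(p Pod) bool {
	for _, c := range p.Conditions {
		if c.Type == "PodScheduled" && c.Status == "False" && c.Reason == "Unschedulable" {
			return true
		}
	}
	return false
}

func main() {
	pending := Pod{
		Name: "rev-00001-deployment-abc",
		Conditions: []PodCondition{
			{Type: "PodScheduled", Status: "False", Reason: "Unschedulable"},
		},
	}
	fmt.Println(isUnschedulable(pending)) // true: wait additional Xs rather than fail
}
```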
Deployments already watch for quota and will proceed once resources become available (https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#failed-deployment).
This seems like a mostly simple change that would save random Knative users from having to tweak PD for this reason.
Or I am missing something :-)
/cc @julz @markusthoemmes @mattmoor