Since 1.12, healthy Revision is taken down because of temporary glitch during Pod creation #14660

@SaschaSchwarze0

Description

In Knative 1.12, there is a change in how the reachability of a Revision is calculated: #14309. That change has a negative side effect in the following scenario:

  1. You have Kubernetes with Knative Serving.
  2. One simple service with currentScale=1 exists, e.g. kn service create test --image ghcr.io/src2img/http-synthetics:latest --scale-min 1 --scale-max 1
  3. You have an admission webhook for Pod creation in place with failurePolicy=Fail, and the webhook is not functional (in our case it was a temporary network glitch from the Kubernetes master to the worker nodes that run the service). To reproduce this, just apply a webhook like this:
cat <<EOF | kubectl create -f -
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: dummy-webhook
webhooks:
- admissionReviewVersions:
  - v1
  clientConfig:
    service:
      name: non-existing
      namespace: non-existing
      path: /defaulting
      port: 443
  failurePolicy: Fail
  matchPolicy: Equivalent
  name: webhook.non-existing.dev
  objectSelector: {}
  reinvocationPolicy: IfNeeded
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - CREATE
    resources:
    - pods
    scope: '*'
  sideEffects: None
  timeoutSeconds: 5
EOF
  4. Trigger an update on the Knative configuration that causes all deployments to be updated, for example kubectl -n knative-serving patch configmap config-deployment -p '{"data":{"queue-sidecar-image":"gcr.io/knative-releases/knative.dev/serving/cmd/queue:v1.12.1"}}'

The following is now happening:

  1. Knative Serving updates the Deployment's queue-proxy image.

  2. The Deployment controller creates a new ReplicaSet for this with desiredScale 1.

  3. The ReplicaSet controller fails to create the Pod for this ReplicaSet. This leads to the Deployment having the following status:

    status:
      conditions:
      - lastTransitionTime: "2023-11-21T13:54:56Z"
        lastUpdateTime: "2023-11-22T10:49:39Z"
        message: ReplicaSet "http-synthetics-00001-deployment-bcbddf84" has successfully progressed.
        reason: NewReplicaSetAvailable
        status: "True"
        type: Progressing
      - lastTransitionTime: "2023-11-22T10:49:59Z"
        lastUpdateTime: "2023-11-22T10:49:59Z"
        message: Deployment does not have minimum availability.
        reason: MinimumReplicasUnavailable
        status: "False"
        type: Available
      - lastTransitionTime: "2023-11-22T10:49:59Z"
        lastUpdateTime: "2023-11-22T10:49:59Z"
        message: 'Internal error occurred: failed calling webhook "webhook.non-existing.dev": failed to call webhook: ...'
        reason: FailedCreate
        status: "True"
        type: ReplicaFailure
      observedGeneration: 18
      unavailableReplicas: 1

    Without Knative in the picture, the Deployment would still be up, as the Pod of the old ReplicaSet still exists. The ReplicaSet controller goes into exponential backoff while retrying the Pod creation; eventually, when webhook communication works again, it succeeds.

    Knative Serving 1.11 behaves this way: it keeps the Revision active and the KService therefore fully reachable.

  4. Knative Serving 1.12 breaks here. It determines that the Revision is no longer reachable and propagates the Deployment's reason (FailedCreate) to the Revision. Because the Revision is considered unreachable, its Deployment is scaled down to 0. This breaks the availability of the KService.

In what area(s)?

/area autoscale

What version of Knative?

1.12

Expected Behavior

A temporary problem in creating a Pod should not cause the KService to be down.

Actual Behavior

Since Serving 1.12, the KService goes down. One can repair the Revision by deleting its Deployment; the new one comes up, assuming Pod creation works.

Without knowing the exact design details, this is just an opinion: in general, I think an active Revision may go into a failed status, but it should not do so this quickly. For example, if I delete the image of a Revision, the new Pod will never come up, and Knative may eventually mark the Revision as Failed. But for temporary problems that resolve within a few minutes, it should give the Deployment time to become healthy (especially if the running Pods were never actually broken).

Steps to Reproduce the Problem

Included above.

Labels

kind/bug