Since 1.12, healthy Revision is taken down because of temporary glitch during Pod creation #14660

@SaschaSchwarze0

Description

In Knative 1.12, there is a change in how the reachability of a Revision is calculated: #14309. That change has a negative side effect in the following scenario:

  1. You have Kubernetes with Knative Serving.
  2. One simple service with currentScale=1 exists, e.g. kn service create test --image ghcr.io/src2img/http-synthetics:latest --scale-min 1 --scale-max 1
  3. You have an admission webhook for Pod creation in place with failurePolicy=Fail, and the webhook is not functional (in our case it was a temporary network glitch from the Kubernetes master to the worker nodes that run the service). To reproduce this, just apply a webhook like this:
cat <<EOF | kubectl create -f -
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: dummy-webhook
webhooks:
- admissionReviewVersions:
  - v1
  clientConfig:
    service:
      name: non-existing
      namespace: non-existing
      path: /defaulting
      port: 443
  failurePolicy: Fail
  matchPolicy: Equivalent
  name: webhook.non-existing.dev
  objectSelector: {}
  reinvocationPolicy: IfNeeded
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - CREATE
    resources:
    - pods
    scope: '*'
  sideEffects: None
  timeoutSeconds: 5
EOF
  4. Trigger an update on the Knative configuration that causes all deployments to be updated, for example kubectl -n knative-serving patch configmap config-deployment -p '{"data":{"queue-sidecar-image":"gcr.io/knative-releases/knative.dev/serving/cmd/queue:v1.12.1"}}'

The following is now happening:

  1. Knative Serving updates the Deployment's queue-proxy image.

  2. The Deployment controller creates a new ReplicaSet for this with desiredScale 1.

  3. The ReplicaSet controller fails to create the Pod for this ReplicaSet. This leads to the Deployment having the following status:

    status:
      conditions:
      - lastTransitionTime: "2023-11-21T13:54:56Z"
        lastUpdateTime: "2023-11-22T10:49:39Z"
        message: ReplicaSet "http-synthetics-00001-deployment-bcbddf84" has successfully progressed.
        reason: NewReplicaSetAvailable
        status: "True"
        type: Progressing
      - lastTransitionTime: "2023-11-22T10:49:59Z"
        lastUpdateTime: "2023-11-22T10:49:59Z"
        message: Deployment does not have minimum availability.
        reason: MinimumReplicasUnavailable
        status: "False"
        type: Available
      - lastTransitionTime: "2023-11-22T10:49:59Z"
        lastUpdateTime: "2023-11-22T10:49:59Z"
        message: 'Internal error occurred: failed calling webhook "webhook.non-existing.dev": failed to call webhook: ...'
        reason: FailedCreate
        status: "True"
        type: ReplicaFailure
      observedGeneration: 18
      unavailableReplicas: 1

    Without Knative in the picture, the Deployment would still be up, as the Pod of the old ReplicaSet still exists. The ReplicaSet controller goes into exponential backoff while retrying the Pod creation; eventually, when webhook communication works again, it succeeds.

    Knative Serving 1.11 behaves this way: it keeps the Revision active and the KService therefore fully reachable.

  4. Knative Serving 1.12 breaks here. It determines that the Revision is no longer reachable and propagates the Deployment's reason (FailedCreate) to the Revision. Because the Revision is considered unreachable, its Deployment is scaled down to 0. This breaks the availability of the KService.

In what area(s)?

/area autoscale

What version of Knative?

1.12

Expected Behavior

A temporary problem in creating a Pod should not cause the KService to be down.

Actual Behavior

Since Serving 1.12, the KService goes down. One can repair the Revision by deleting its Deployment; the new one comes up, assuming Pod creation works.

Without knowing the exact design details, this is just an opinion: in general, I think an active Revision may go into a failed status, but it should not do so this quickly. For example, if I delete the image of a Revision, the new Pod will never come up, and Knative may eventually mark the Revision as Failed. But for temporary problems that resolve within a few minutes, it should give the Deployment time to become healthy (especially if the running Pods were never actually broken).

Steps to Reproduce the Problem

Included above.

Labels

kind/bug