Revisions stay in ContainerHealthy=False status forever #15487

@SaschaSchwarze0

Description

In what area(s)?

/area API

We think that #14744 introduced a regression due to the change it made in revision_lifecycle.go. Let's go through it:

At any point in time, a container in a pod backing the KService can fail. Errors like an OOM kill are recoverable: Kubernetes restarts the container and, assuming the OOM does not immediately come back, the pod returns to the Running state and everything works again.

When the OOM happens, the logic in https://github.com/knative/serving/blob/v0.42.2/pkg/reconciler/revision/reconcile_resources.go#L107 calls MarkContainerHealthyFalse, which sets the ContainerHealthy condition of the revision to False. So far so good.

The only code location that resets this is https://github.com/knative/serving/blob/v0.42.2/pkg/apis/serving/v1/revision_lifecycle.go#L202, but since #14744 this code never runs anymore. If ContainerHealthy is False, then resUnavailable is set to true in https://github.com/knative/serving/blob/v0.42.2/pkg/apis/serving/v1/revision_lifecycle.go#L173, and if resUnavailable is true, the code in https://github.com/knative/serving/blob/v0.42.2/pkg/apis/serving/v1/revision_lifecycle.go#L198-L203 is never entered.

What version of Knative?

Latest release.

Expected Behavior

The ContainerHealthy condition is set back to True when all containers are healthy.

Actual Behavior

The ContainerHealthy condition is never set from False to True.

Steps to Reproduce the Problem

  1. Create a KSvc using the image ghcr.io/src2img/http-synthetics:latest (code is from https://github.com/src2img/http-synthetics), and specify a memory limit on the container, for example 100M. Set minScale and maxScale to 1.
  2. Wait for the KSvc to become ready.
  3. Call its endpoint with curl -X PUT http://endpoint/claim-memory?amount=500000000

The call will fail because the container goes OOM. The revision switches to ContainerHealthy=False. Kubernetes restarts the container and it runs again, but the revision status never changes back to healthy.
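For reference, a sketch of the manifest from step 1 (the service name is arbitrary; the min-scale/max-scale annotations are the standard Knative autoscaling annotations):

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: http-synthetics
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "1"
        autoscaling.knative.dev/max-scale: "1"
    spec:
      containers:
        - image: ghcr.io/src2img/http-synthetics:latest
          resources:
            limits:
              memory: 100M
```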

We are actually uncertain whether it is even conceptually correct for the revision to change the ContainerHealthy condition after the revision was fully ready, but we are not sure how Knative specifies the behavior there.


Labels: area/API, kind/bug