In what area(s)?
/area API
We think that #14744 introduced a regression due to the change it made in revision_lifecycle.go. Let's go through it:
At any point in time, a container in a pod that is part of the KService may fail. Errors like OOM are recoverable: Kubernetes will restart the pod, and assuming the OOM does not immediately recur, the pod will be back in the Running state and everything keeps working.
When the OOM happens, the logic in https://github.com/knative/serving/blob/v0.42.2/pkg/reconciler/revision/reconcile_resources.go#L107 calls MarkContainerHealthyFalse and sets the ContainerHealthy condition of the revision to False. So far so good.
The only code location that resets this is https://github.com/knative/serving/blob/v0.42.2/pkg/apis/serving/v1/revision_lifecycle.go#L202, but that code never runs anymore since #14744. Basically, if ContainerHealthy is False, then resUnavailable is set to true in https://github.com/knative/serving/blob/v0.42.2/pkg/apis/serving/v1/revision_lifecycle.go#L173. And if resUnavailable is true, the code in https://github.com/knative/serving/blob/v0.42.2/pkg/apis/serving/v1/revision_lifecycle.go#L198-L203 is never entered.
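To make the deadlock concrete, here is a minimal sketch (not the real Knative code; the struct, field names, and reconcile function are simplified stand-ins for the logic described above) of how deriving resUnavailable from ContainerHealthy prevents the reset branch from ever running:

```go
package main

import "fmt"

// revisionStatus is a simplified stand-in for the revision's conditions.
type revisionStatus struct {
	containerHealthy bool // the ContainerHealthy condition
	resUnavailable   bool // "resources unavailable" flag
}

// reconcile mimics the order of checks described above for revision_lifecycle.go
// after #14744: resUnavailable is derived from ContainerHealthy first, and the
// branch that would reset ContainerHealthy to True is guarded on resUnavailable.
func reconcile(s *revisionStatus, deploymentReady bool) {
	// revision_lifecycle.go#L173: an unhealthy container marks resources unavailable.
	if !s.containerHealthy {
		s.resUnavailable = true
	}
	// revision_lifecycle.go#L198-L203: only reached when resources are available,
	// so once containerHealthy is false this reset can never execute again.
	if deploymentReady && !s.resUnavailable {
		s.containerHealthy = true // stand-in for MarkContainerHealthyTrue
	}
}

func main() {
	s := &revisionStatus{containerHealthy: false} // state after the OOM kill
	reconcile(s, true)                            // deployment is ready again
	fmt.Println(s.containerHealthy)               // stays false: the condition is stuck
}
```

Even with the deployment fully ready again, the sketch's reconcile pass leaves ContainerHealthy at False, which matches the behavior we observe.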
What version of Knative?
Latest release.
Expected Behavior
The ContainerHealthy condition is set back to True when all containers are healthy.
Actual Behavior
The ContainerHealthy condition is never set from False to True.
Steps to Reproduce the Problem
- Create a KSvc using the image ghcr.io/src2img/http-synthetics:latest (code is from https://github.com/src2img/http-synthetics), and specify a memory limit on the container, for example 100M. Use minScale and maxScale of 1.
- Wait for the KSvc to become ready.
- Call its endpoint with
curl -X PUT http://endpoint/claim-memory?amount=500000000
The call will fail because the container goes OOM. The revision switches to ContainerHealthy=False. Kubernetes restarts the container and it runs again, but the revision status never changes back to healthy.
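For reference, a minimal Service manifest matching the steps above might look like the following (a sketch, assuming the standard Knative autoscaling annotations; the service name is made up):

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: http-synthetics
spec:
  template:
    metadata:
      annotations:
        # Pin the revision to exactly one pod so the restarted
        # container is the same one that went OOM.
        autoscaling.knative.dev/minScale: "1"
        autoscaling.knative.dev/maxScale: "1"
    spec:
      containers:
        - image: ghcr.io/src2img/http-synthetics:latest
          resources:
            limits:
              memory: 100M
```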
We are actually uncertain whether it is even conceptually correct for the revision to change the ContainerHealthy condition after the revision was fully ready, but we are not sure how Knative specifies the behavior there.