This is a tracking issue for detecting and surfacing problems with a user's pods. There are a variety of failure modes, and so far we've been dealing with them in a very ad-hoc manner. Let's enumerate them here and start a discussion towards a more deliberate solution so we don't have to continue playing whack-a-mole.
Detection
We currently try to detect pod failures in the revision reconciler when reconciling a deployment. This logic will probably move to the autoscaler, but should remain largely the same.
We look at a single pod to determine if:
Since we only look at a single pod, we can only surface issues that affect every pod in a deployment -- e.g. the image cannot be pulled, the container crashes on start, or the cluster is out of resources. We should fix this, most likely by inspecting every pod's status.
It's unclear to me if there's a way to generically detect all of these issues.
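As a starting point for the "look at every pod" approach, here is a rough sketch (not the actual reconciler code) of classifying failures across all pods in a deployment. It assumes pod status dicts shaped like the Kubernetes core/v1 API; the set of waiting reasons treated as failures is an assumption, not a settled list.

```python
from collections import Counter

# Waiting reasons that typically mean a pod will not become ready on its own.
# This set is illustrative; the real list would need discussion.
FATAL_WAITING_REASONS = {
    "ImagePullBackOff",
    "ErrImagePull",
    "CrashLoopBackOff",
    "CreateContainerConfigError",
}

def pod_failure_reasons(pod):
    """Return the set of failure reasons observed on a single pod."""
    reasons = set()
    status = pod.get("status", {})
    # Unschedulable pods surface the problem on the PodScheduled condition.
    for cond in status.get("conditions", []):
        if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
            reasons.add(cond.get("reason", "Unschedulable"))
    # Container-level failures show up as a waiting state with a reason.
    for cs in status.get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting")
        if waiting and waiting.get("reason") in FATAL_WAITING_REASONS:
            reasons.add(waiting["reason"])
    return reasons

def deployment_failure_summary(pods):
    """Tally failure reasons across all pods, so issues that affect only
    some pods are still visible (maps reason -> number of affected pods)."""
    counts = Counter()
    for pod in pods:
        for reason in pod_failure_reasons(pod):
            counts[reason] += 1
    return counts
```

Tallying per-reason counts (rather than checking one pod) would let us distinguish "every pod is failing" from "some pods are failing", which matters for the categories below.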
Categorization
Ideally we could distill these issues down to a small set of buckets so we can deal with the issues in a generic way. I don't have a good answer here, but a non-exhaustive list of things we've encountered thus far:
A: For 1, 2, 4, and 5, the revision may never be able to serve traffic, but also may be caused by a temporary issue.
B: For 1 and 3, the revision may be serving traffic, but we are unable to continue scaling.
C: For 6, the revision can serve traffic, but will experience intermittent failures. This could be caused by a memory leak, a query of death, a bug in the code, or insufficient resource limits.
I invite suggestions for names/conditions for these categories. I suspect we'd want to surface these different kinds of failures in different ways...
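To make the buckets concrete, here is one hypothetical mapping from observed Kubernetes failure reasons to the categories above. The names and the reason-to-category assignments are placeholders for discussion, not a proposal -- as noted, some failure modes (like resource exhaustion) can plausibly land in more than one bucket.

```python
# Placeholder category names -- suggestions welcome.
CATEGORY_A = "UserActionRequired"   # revision may never serve traffic
CATEGORY_B = "ScalingBlocked"       # serving, but cannot scale further
CATEGORY_C = "IntermittentFailure"  # serving, with sporadic failures

# Illustrative mapping only; the actual assignment is open for debate.
REASON_TO_CATEGORY = {
    "ImagePullBackOff": CATEGORY_A,          # image cannot be pulled
    "CrashLoopBackOff": CATEGORY_A,          # container crashes on start
    "CreateContainerConfigError": CATEGORY_A,
    "Unschedulable": CATEGORY_B,             # cluster out of resources
    "OOMKilled": CATEGORY_C,                 # e.g. leak or low limits
}

def categorize(reason):
    # Default to the intermittent bucket when we cannot classify the reason.
    return REASON_TO_CATEGORY.get(reason, CATEGORY_C)
```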
Reporting
For category A, we definitely want to surface a fatal condition in the Revision status, and have it propagated upward, because the user needs to take some action in order to fix their Revision.
For category B, I suspect we want to do something similar, but as an informational condition rather than a fatal one. The user should take action to unblock the autoscaler, perhaps by notifying the cluster operator. In the case where we can't scale up even to min_scale, this should probably be fatal.
For category C, the problem will be intermittent, and Kubernetes is designed to handle these failures. The best we can do here is help the user diagnose the issue by surfacing what happened -- possibly by injecting some information into their logs?
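One way to sketch the fatal vs. informational distinction: conditions could carry a severity, in the spirit of the usual Kubernetes condition shape. The field names and severity values below are assumptions for illustration, not an existing API.

```python
import datetime

def make_condition(cond_type, status, reason, message, severity):
    """Build a status condition in the usual Kubernetes shape.
    The severity field (Error = fatal, Info = informational) mirrors
    the distinction between categories A and B above."""
    return {
        "type": cond_type,
        "status": status,      # "True" / "False" / "Unknown"
        "severity": severity,  # "Error" | "Warning" | "Info" (assumed values)
        "reason": reason,
        "message": message,
        "lastTransitionTime": datetime.datetime.now(
            datetime.timezone.utc).isoformat(),
    }

# Category A: fatal -- the user must act before the revision can serve.
fatal = make_condition("Ready", "False", "ImagePullBackOff",
                       "Unable to pull the container image.", "Error")

# Category B: informational -- serving, but scaling is blocked.
info = make_condition("Active", "Unknown", "Unschedulable",
                      "Cannot schedule additional pods.", "Info")
```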