Send logs for pods in bad status #4572
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: yanweiguo

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
knative-prow-robot left a comment
@yanweiguo: 1 warning.
In response to this:
Proposed Changes
- Check every pod when reconciling a revision instead of one single pod.
- If not all pods of a revision are unavailable, only send warning logs.
- Also propagate the reason of a container in terminated state to the revision when all pods of a revision are unavailable.
- Add a key to some controller logs indicating that they are worth surfacing to users; it can be used as a log query filter.
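The reconcile logic these bullets describe can be sketched dependency-free. The types below are illustrative stand-ins for the corev1/appsv1 structs used by the PR, not the real client-go API:

```go
package main

import "fmt"

// Stand-in types mirroring the corev1 fields the PR reads;
// names are illustrative, not the real client-go structs.
type Terminated struct {
	ExitCode int32
	Reason   string
	Message  string
}

type ContainerStatus struct {
	Name                 string
	LastTerminationState *Terminated
}

type Pod struct {
	Name     string
	Statuses []ContainerStatus
}

// markRevision scans every pod of the revision. It always emits a per-pod
// warning for a terminated user container, but only marks the revision
// itself when no replicas are available at all.
func markRevision(availableReplicas int32, pods []Pod, userContainer string) []string {
	var msgs []string
	shouldMarkRev := availableReplicas == 0
	for _, pod := range pods {
		for _, status := range pod.Statuses {
			if status.Name != userContainer {
				continue
			}
			if t := status.LastTerminationState; t != nil {
				msgs = append(msgs, fmt.Sprintf("warn: container %s in pod %s terminated: %s",
					status.Name, pod.Name, t.Reason))
				if shouldMarkRev {
					msgs = append(msgs, fmt.Sprintf("mark revision exiting: %d/%s", t.ExitCode, t.Reason))
				}
			}
		}
	}
	return msgs
}

func main() {
	pods := []Pod{{Name: "pod-0", Statuses: []ContainerStatus{{
		Name:                 "user-container",
		LastTerminationState: &Terminated{ExitCode: 1, Reason: "CrashLoopBackOff"},
	}}}}
	// One replica is still available: warn only, do not mark the revision.
	for _, m := range markRevision(1, pods, "user-container") {
		fmt.Println(m)
	}
}
```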
Release Note
The reason for crashing pods is now propagated to the revision, combined with the existing exit code, to ease debuggability.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
```diff
 }

-func (rs *RevisionStatus) MarkContainerExiting(exitCode int32, message string) {
+func (rs *RevisionStatus) MarkContainerExiting(exitCode int32, reason, message string) {
```
Golint comments: exported method RevisionStatus.MarkContainerExiting should have comment or be unexported. More info.
The following is the coverage report on pkg/.
@yanweiguo: The following test failed, say

Full PR test history. Your PR dashboard.
```go
// Should change revision status if all pods are crashing
shouldMarkRev := deployment.Status.AvailableReplicas == 0
for _, pod := range pods.Items {
	// Update the revision status if pod cannot be scheduled(possibly resource constraints)
```

Suggested change:

```diff
-	// Update the revision status if pod cannot be scheduled(possibly resource constraints)
+	// Update the revision status if pod cannot be scheduled (possibly resource constraints)
```
```go
		status.Name, pod.Name, rev.Name, t.Reason, t.Message)
	if shouldMarkRev {
		logger.Infof("%s marking exiting with: %d/%s: %s", rev.Name, t.ExitCode, t.Reason, t.Message)
		rev.Status.MarkContainerExiting(t.ExitCode, t.Reason, t.Message)
```
So if I have 3 pods, one OK and 2 crashing for different reasons, you'll set the status twice to different things?
markusthoemmes left a comment
As it stands, I think this makes this code quite expensive. The pod listing is not backed by an informer or any cache at all, so this will now make API calls at any scale. I think this is fine for scale == 0 cases; I'm not sure we want to trigger this logic at arbitrary scale though.
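The cost difference behind this comment can be illustrated with a toy model; the `apiServer` and `informerCache` types below are invented for this sketch (a real informer keeps its cache fresh via watch events rather than a one-time sync):

```go
package main

import "fmt"

// apiServer counts List calls to make the cost visible; a stand-in for the
// real Kubernetes API server, not client-go.
type apiServer struct{ listCalls int }

func (a *apiServer) listPods() []string {
	a.listCalls++
	return []string{"pod-0", "pod-1"}
}

// informerCache is filled once (by a watch in a real informer) and then
// serves reads locally, so per-reconcile listing never hits the API server.
type informerCache struct{ pods []string }

func (c *informerCache) listPods() []string { return c.pods }

func main() {
	api := &apiServer{}

	// Direct listing, as in this PR: one API call per reconcile.
	for i := 0; i < 3; i++ {
		_ = api.listPods()
	}
	fmt.Println("API calls without cache:", api.listCalls)

	// Informer-backed: one initial sync, then cached reads.
	cache := &informerCache{pods: api.listPods()}
	for i := 0; i < 3; i++ {
		_ = cache.listPods()
	}
	fmt.Println("API calls with cache:", api.listCalls)
}
```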
```diff
 // If a container keeps crashing (no active pods in the deployment although we want some)
-if *deployment.Spec.Replicas > 0 && deployment.Status.AvailableReplicas == 0 {
+if *deployment.Spec.Replicas > deployment.Status.AvailableReplicas {
```
This will always be true if we're scaling up, right?
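The concern can be checked mechanically: during a scale-up with healthy pods, the new predicate fires while the old one does not. A minimal stdlib-only illustration (the function names are mine, not from the PR):

```go
package main

import "fmt"

// crashingOld is the original check: we want pods but none are available.
func crashingOld(specReplicas, available int32) bool {
	return specReplicas > 0 && available == 0
}

// crashingNew is the PR's check: fewer pods available than desired,
// which is also true for any in-progress scale-up.
func crashingNew(specReplicas, available int32) bool {
	return specReplicas > available
}

func main() {
	// Scale-up in progress: 5 desired, 3 already available and healthy.
	fmt.Println(crashingOld(5, 3)) // false: old check does not fire
	fmt.Println(crashingNew(5, 3)) // true: new check fires mid scale-up
}
```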
```go
// Update the revision status if pod cannot be scheduled(possibly resource constraints)
// If pod cannot be scheduled then we expect the container status to be empty.
for _, cond := range pod.Status.Conditions {
	if cond.Type == corev1.PodScheduled && cond.Status == corev1.ConditionFalse {
```
#4136 tries to grab this status from the deployment rather than from individual pods. Is that possible here as well?
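For reference, the per-pod scan in the diff above amounts to the following, sketched with local stand-ins for the corev1 condition types (a deployment-level variant, as #4136 suggests, would read the Deployment's conditions instead of iterating pods):

```go
package main

import "fmt"

// Minimal stand-ins for the corev1 condition fields used in the diff;
// not the real Kubernetes API types.
type ConditionType string
type ConditionStatus string

const (
	PodScheduled   ConditionType   = "PodScheduled"
	ConditionFalse ConditionStatus = "False"
)

type PodCondition struct {
	Type   ConditionType
	Status ConditionStatus
	Reason string
}

// unschedulableReason returns the reason of a PodScheduled=False condition,
// mirroring the per-pod scan in the diff above.
func unschedulableReason(conds []PodCondition) (string, bool) {
	for _, cond := range conds {
		if cond.Type == PodScheduled && cond.Status == ConditionFalse {
			return cond.Reason, true
		}
	}
	return "", false
}

func main() {
	conds := []PodCondition{{Type: PodScheduled, Status: ConditionFalse, Reason: "Unschedulable"}}
	if r, ok := unschedulableReason(conds); ok {
		fmt.Println("pod unschedulable:", r)
	}
}
```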
```go
// the user container.
if status.Name == rev.Spec.GetContainer().Name {
	if t := status.LastTerminationState.Terminated; t != nil {
		userFacingLogger.Warnf("Container %s in pod %s of revision %s is in terminated status: %s/%s",
```
Should we produce events instead of logs? They'd be actually user-facing. On that note: are these events already produced for the individual pods?
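A minimal sketch of what the event-based alternative might look like. `formatEvent` is a hypothetical helper that only mimics the message shape of client-go's `record.EventRecorder.Eventf` (event type, short CamelCase reason, formatted message); it is not the real API, and the object reference is omitted:

```go
package main

import "fmt"

// formatEvent builds an event-style message: eventtype is "Normal" or
// "Warning", reason a short CamelCase word, followed by the formatted body.
// This is a dependency-free stand-in, not client-go.
func formatEvent(eventtype, reason, messageFmt string, args ...interface{}) string {
	return fmt.Sprintf("%s %s: %s", eventtype, reason, fmt.Sprintf(messageFmt, args...))
}

func main() {
	// Instead of a user-facing log line, the same information could be
	// attached to the revision as a Warning event.
	fmt.Println(formatEvent("Warning", "ContainerTerminated",
		"Container %s in pod %s terminated: %s", "user-container", "foo-pod-0", "OOMKilled"))
}
```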
```go
if shouldMarkRev {
	// Arbitrarily check the very first pod, as they all should be crashing
```
This comment seems wrong now.
I'm going to hold this. Yes, I agree this makes this code expensive just for logging. We may have a better solution to cover more cases, as discussed in #4557.

/hold
@yanweiguo any updates here? Are you still working on this?
Not actively working on this.

/close
@yanweiguo: Closed this PR. In response to this:
Fixes #4534
Part of #4557
Proposed Changes
Release Note