
Send logs for pods in bad status #4572

Closed
yanweiguo wants to merge 6 commits into knative:master from yanweiguo:log_reason

Conversation

@yanweiguo
Contributor

Fixes #4534
Part of #4557

Proposed Changes

  • Check every pod when reconciling a revision instead of a single pod.
  • If some but not all pods of a revision are unavailable, only emit warning logs.
  • When all pods of a revision are unavailable, also propagate the terminated container's reason to the revision.
  • Add a key to some controller logs marking them as useful to surface to users; the key can be used as a log query filter.

Release Note

The reason for crashing pods is now propagated to the revision, combined with the existing exit code, to ease debugging.
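The signature change at the heart of the PR adds the termination reason alongside the exit code. A minimal, self-contained sketch of that shape (the real `RevisionStatus` lives in `pkg/apis/serving/v1alpha1` and carries many more fields; the message format here is illustrative, not the actual one):

```go
package main

import "fmt"

// RevisionStatus is a simplified stand-in for the real knative type.
type RevisionStatus struct {
	Reason  string
	Message string
}

// MarkContainerExiting mirrors the signature change in this PR: the
// container's termination reason is now passed alongside the exit code.
func (rs *RevisionStatus) MarkContainerExiting(exitCode int32, reason, message string) {
	rs.Reason = reason
	rs.Message = fmt.Sprintf("Container failed with exit code %d, reason: %s, message: %s",
		exitCode, reason, message)
}

func main() {
	var rs RevisionStatus
	rs.MarkContainerExiting(137, "OOMKilled", "container killed due to memory limit")
	fmt.Println(rs.Message)
}
```

Callers that previously passed only the exit code and message now also thread through `t.Reason` from the container's terminated state.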

@googlebot googlebot added the cla: yes Indicates the PR's author has signed the CLA. label Jun 29, 2019
@knative-prow-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: yanweiguo
To complete the pull request process, please assign mattmoor
You can assign the PR to them by writing /assign @mattmoor in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 29, 2019
@yanweiguo yanweiguo requested a review from markusthoemmes June 29, 2019 01:05
Contributor

@knative-prow-robot knative-prow-robot left a comment


@yanweiguo: 1 warning.

Details

In response to this: (the PR description quoted above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

}

- func (rs *RevisionStatus) MarkContainerExiting(exitCode int32, message string) {
+ func (rs *RevisionStatus) MarkContainerExiting(exitCode int32, reason, message string) {
Contributor


Golint comments: exported method RevisionStatus.MarkContainerExiting should have comment or be unexported. More info.

@knative-prow-robot knative-prow-robot added area/API API objects and controllers area/test-and-release It flags unit/e2e/conformance/perf test issues for product features labels Jun 29, 2019
@yanweiguo yanweiguo requested a review from jonjohnsonjr June 29, 2019 01:05
@knative-metrics-robot

The following is the coverage report on pkg/.
Say /test pull-knative-serving-go-coverage to re-run this coverage report

File | Old Coverage | New Coverage | Delta
--- | --- | --- | ---
pkg/apis/serving/v1alpha1/revision_lifecycle.go | 78.2% | 84.2% | 6.0
pkg/reconciler/revision/reconcile_resources.go | 91.9% | 92.7% | 0.8

@knative-prow-robot
Contributor

@yanweiguo: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-knative-serving-integration-tests e19cc0c link /test pull-knative-serving-integration-tests

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

// Should change revision status if all pods are crashing
shouldMarkRev := deployment.Status.AvailableReplicas == 0
for _, pod := range pods.Items {
// Update the revision status if pod cannot be scheduled(possibly resource constraints)
Contributor


Suggested change
- // Update the revision status if pod cannot be scheduled(possibly resource constraints)
+ // Update the revision status if pod cannot be scheduled (possibly resource constraints)

status.Name, pod.Name, rev.Name, t.Reason, t.Message)
if shouldMarkRev {
logger.Infof("%s marking exiting with: %d/%s: %s", rev.Name, t.ExitCode, t.Reason, t.Message)
rev.Status.MarkContainerExiting(t.ExitCode, t.Reason, t.Message)
Contributor


So if I have 3 pods one is ok, 2 are crashing but for different reasons, you'll set the status twice to different things?
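The concern above is that the loop calls `MarkContainerExiting` once per crashing pod, so with two pods crashing for different reasons the revision status is overwritten nondeterministically. A self-contained sketch (hypothetical simplified types, not the PR's code) of one fix, breaking after the first crashing pod:

```go
package main

import "fmt"

// terminated is a stand-in for corev1.ContainerStateTerminated.
type terminated struct {
	ExitCode int32
	Reason   string
}

type revStatus struct{ Reason string }

func (r *revStatus) markExiting(t terminated) { r.Reason = t.Reason }

// markFirstCrashing stops after the first crashing pod. Without the break,
// the last crashing pod would win, and since pod iteration order from the
// API is not guaranteed, the surfaced reason could flap between reconciles.
func markFirstCrashing(pods []*terminated, rs *revStatus) {
	for _, t := range pods {
		if t == nil {
			continue // pod is healthy
		}
		rs.markExiting(*t)
		break // keep the status stable
	}
}

func main() {
	pods := []*terminated{
		nil, // healthy pod
		{ExitCode: 1, Reason: "Error"},
		{ExitCode: 137, Reason: "OOMKilled"},
	}
	var rs revStatus
	markFirstCrashing(pods, &rs)
	fmt.Println(rs.Reason)
}
```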

Contributor

@markusthoemmes markusthoemmes left a comment


As it stands, I think this makes this code quite expensive. The pod listing is not backed by an informer or any cache at all so this will now do API calls at any scale. I think this is fine at scale == 0 cases, I'm not sure we want to trigger this logic at arbitrary scale though.
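One mitigation consistent with the reviewer's point (a sketch under my own assumptions, not what the PR does): gate the uncached pod list behind the condition that no replica is available, so the expensive call only fires in the scale == 0 failure case:

```go
package main

import "fmt"

// listPods stands in for an uncached API call; in the real controller this
// would go through the Kubernetes client rather than an informer cache.
func listPods(calls *int) []string {
	*calls++
	return []string{"pod-a", "pod-b"}
}

// reconcile only pays for the pod list when no replica is available,
// i.e. the all-pods-unavailable case the reviewer considers acceptable.
func reconcile(availableReplicas int, calls *int) {
	if availableReplicas > 0 {
		return // revision is at least partially healthy; skip the expensive list
	}
	pods := listPods(calls)
	fmt.Println("inspecting", len(pods), "pods")
}

func main() {
	var calls int
	reconcile(3, &calls) // healthy: no API call
	reconcile(0, &calls) // all unavailable: one API call
	fmt.Println("API calls:", calls)
}
```

The longer-term alternative discussed later in the thread is to back the lookup with an informer so listing comes from a shared cache instead of the API server.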


// If a container keeps crashing (no active pods in the deployment although we want some)
if *deployment.Spec.Replicas > 0 && deployment.Status.AvailableReplicas == 0 {
if *deployment.Spec.Replicas > deployment.Status.AvailableReplicas {
Contributor


This will always be true if we're scaling up, right?

// Update the revision status if pod cannot be scheduled(possibly resource constraints)
// If pod cannot be scheduled then we expect the container status to be empty.
for _, cond := range pod.Status.Conditions {
if cond.Type == corev1.PodScheduled && cond.Status == corev1.ConditionFalse {
Contributor


#4136 tries to grab this status from the deployment rather than from individual pods. Is that possible here as well?
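The suggestion is to read scheduling failures off deployment-level conditions rather than walking individual pods. A simplified sketch of that shape (the `condition` type here is a stand-in modeled on `appsv1.DeploymentCondition`; `ReplicaFailure` is a real deployment condition type, the helper name is hypothetical):

```go
package main

import "fmt"

// condition is a simplified stand-in for appsv1.DeploymentCondition.
type condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// unschedulableReason scans deployment-level conditions for a ReplicaFailure,
// avoiding a per-pod API round trip entirely.
func unschedulableReason(conds []condition) (string, bool) {
	for _, c := range conds {
		if c.Type == "ReplicaFailure" && c.Status == "True" {
			return c.Reason, true
		}
	}
	return "", false
}

func main() {
	conds := []condition{
		{Type: "Available", Status: "False"},
		{Type: "ReplicaFailure", Status: "True", Reason: "FailedCreate"},
	}
	if reason, ok := unschedulableReason(conds); ok {
		fmt.Println("deployment failing:", reason)
	}
}
```

The trade-off: deployment conditions are coarser than per-pod status, so the per-container terminated reason this PR surfaces would still require looking at pods.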

// the user container.
if status.Name == rev.Spec.GetContainer().Name {
if t := status.LastTerminationState.Terminated; t != nil {
userFacingLogger.Warnf("Container %s in pod %s of revision %s is in terminated status: %s/%s",
Contributor


Should we produce events instead of logs? They'd actually be user-facing. On that note: are these events already produced for the individual pods?
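Events would surface through `kubectl describe` on the revision rather than requiring log access. A minimal sketch of the recorder pattern (the interface mirrors the shape of client-go's `record.EventRecorder.Eventf`; the fake recorder and helper names are illustrative):

```go
package main

import "fmt"

// recorder mirrors the shape of client-go's record.EventRecorder.Eventf,
// reduced to what this sketch needs.
type recorder interface {
	Eventf(objName, eventType, reason, format string, args ...interface{})
}

type fakeRecorder struct{ events []string }

func (f *fakeRecorder) Eventf(objName, eventType, reason, format string, args ...interface{}) {
	f.events = append(f.events,
		fmt.Sprintf("%s %s %s: %s", objName, eventType, reason, fmt.Sprintf(format, args...)))
}

// reportTermination emits a Warning event on the revision instead of (or in
// addition to) writing a controller log line.
func reportTermination(r recorder, revName, podName, reason string, exitCode int32) {
	r.Eventf(revName, "Warning", "ContainerTerminated",
		"pod %s exited with %d (%s)", podName, exitCode, reason)
}

func main() {
	fr := &fakeRecorder{}
	reportTermination(fr, "rev-1", "rev-1-pod-abc", "OOMKilled", 137)
	fmt.Println(fr.events[0])
}
```

As the comment notes, the kubelet already emits events on the pods themselves (e.g. BackOff); the open question is whether duplicating them on the revision adds value.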

}

if shouldMarkRev {
// Arbitrarily check the very first pod, as they all should be crashing
Contributor


This comment seems wrong now.

@yanweiguo
Contributor Author

As it stands, I think this makes this code quite expensive. The pod listing is not backed by an informer or any cache at all so this will now do API calls at any scale. I think this is fine at scale == 0 cases, I'm not sure we want to trigger this logic at arbitrary scale though.

I'm going to hold this. Yes, I agree this makes the code expensive just for logging. We may have a better solution that covers more cases, as discussed in #4557.

/hold

@knative-prow-robot knative-prow-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 2, 2019
@jonjohnsonjr jonjohnsonjr mentioned this pull request Jul 10, 2019
@markusthoemmes
Contributor

@yanweiguo any updates here? Are you still working on this?

@yanweiguo
Contributor Author

Not actively working on this.

/close

@knative-prow-robot
Contributor

@yanweiguo: Closed this PR.

Details

In response to this:

Not actively working on this.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@yanweiguo yanweiguo deleted the log_reason branch July 6, 2020 19:33

Labels

area/API: API objects and controllers
area/test-and-release: Flags unit/e2e/conformance/perf test issues for product features
cla: yes: Indicates the PR's author has signed the CLA.
do-not-merge/hold: Indicates that a PR should not merge because someone has issued a /hold command.
size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add logs for container in bad status

6 participants