
Send logs for pods in bad status #4572

Closed
yanweiguo wants to merge 6 commits into knative:master from yanweiguo:log_reason

Conversation

@yanweiguo
Contributor

Fixes #4534
Part of #4557

Proposed Changes

  • Check every pod when reconciling a revision instead of a single pod.
  • If some but not all pods of a revision are unavailable, only emit warning logs.
  • When all pods of a revision are unavailable, also propagate the terminated container's reason to the revision.
  • Add a key to some controller logs marking them as useful to surface to users; the key can be used as a log query filter.

Release Note

The reason for crashing pods is now propagated to the revision, combined with the existing exit code, to ease debugging.
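The signature change at the heart of the PR adds the termination reason alongside the exit code. A minimal, self-contained sketch of that shape (the real `RevisionStatus` lives in `pkg/apis/serving/v1alpha1` and carries many more fields; the message format here is illustrative, not the actual one):

```go
package main

import "fmt"

// RevisionStatus is a simplified stand-in for the real knative type.
type RevisionStatus struct {
	Reason  string
	Message string
}

// MarkContainerExiting mirrors the signature change in this PR: the
// container's termination reason is now passed alongside the exit code.
func (rs *RevisionStatus) MarkContainerExiting(exitCode int32, reason, message string) {
	rs.Reason = reason
	rs.Message = fmt.Sprintf("Container failed with exit code %d, reason: %s, message: %s",
		exitCode, reason, message)
}

func main() {
	var rs RevisionStatus
	rs.MarkContainerExiting(137, "OOMKilled", "container killed due to memory limit")
	fmt.Println(rs.Message)
}
```

Callers that previously passed only the exit code and message now also thread through `t.Reason` from the container's terminated state.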

@googlebot googlebot added the cla: yes Indicates the PR's author has signed the CLA. label Jun 29, 2019
@knative-prow-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: yanweiguo
To complete the pull request process, please assign mattmoor
You can assign the PR to them by writing /assign @mattmoor in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 29, 2019
@yanweiguo yanweiguo requested a review from markusthoemmes June 29, 2019 01:05
Contributor

@knative-prow-robot knative-prow-robot left a comment


@yanweiguo: 1 warning.

Details

In response to this: (the PR description quoted above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

}

- func (rs *RevisionStatus) MarkContainerExiting(exitCode int32, message string) {
+ func (rs *RevisionStatus) MarkContainerExiting(exitCode int32, reason, message string) {
Contributor


Golint comments: exported method RevisionStatus.MarkContainerExiting should have comment or be unexported. More info.

@knative-prow-robot knative-prow-robot added area/API API objects and controllers area/test-and-release It flags unit/e2e/conformance/perf test issues for product features labels Jun 29, 2019
@yanweiguo yanweiguo requested a review from jonjohnsonjr June 29, 2019 01:05
@knative-metrics-robot

The following is the coverage report on pkg/.
Say /test pull-knative-serving-go-coverage to re-run this coverage report

File | Old Coverage | New Coverage | Delta
--- | --- | --- | ---
pkg/apis/serving/v1alpha1/revision_lifecycle.go | 78.2% | 84.2% | 6.0
pkg/reconciler/revision/reconcile_resources.go | 91.9% | 92.7% | 0.8

@knative-prow-robot
Contributor

@yanweiguo: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-knative-serving-integration-tests e19cc0c link /test pull-knative-serving-integration-tests

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

// Should change revision status if all pods are crashing
shouldMarkRev := deployment.Status.AvailableReplicas == 0
for _, pod := range pods.Items {
// Update the revision status if pod cannot be scheduled(possibly resource constraints)
Contributor


Suggested change
- // Update the revision status if pod cannot be scheduled(possibly resource constraints)
+ // Update the revision status if pod cannot be scheduled (possibly resource constraints)

status.Name, pod.Name, rev.Name, t.Reason, t.Message)
if shouldMarkRev {
logger.Infof("%s marking exiting with: %d/%s: %s", rev.Name, t.ExitCode, t.Reason, t.Message)
rev.Status.MarkContainerExiting(t.ExitCode, t.Reason, t.Message)
Contributor


So if I have 3 pods one is ok, 2 are crashing but for different reasons, you'll set the status twice to different things?
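The concern above is that the loop calls `MarkContainerExiting` once per crashing pod, so with two pods crashing for different reasons the revision status is overwritten nondeterministically. A self-contained sketch (hypothetical simplified types, not the PR's code) of one fix, breaking after the first crashing pod:

```go
package main

import "fmt"

// terminated is a stand-in for corev1.ContainerStateTerminated.
type terminated struct {
	ExitCode int32
	Reason   string
}

type revStatus struct{ Reason string }

func (r *revStatus) markExiting(t terminated) { r.Reason = t.Reason }

// markFirstCrashing stops after the first crashing pod. Without the break,
// the last crashing pod would win, and since pod iteration order from the
// API is not guaranteed, the surfaced reason could flap between reconciles.
func markFirstCrashing(pods []*terminated, rs *revStatus) {
	for _, t := range pods {
		if t == nil {
			continue // pod is healthy
		}
		rs.markExiting(*t)
		break // keep the status stable
	}
}

func main() {
	pods := []*terminated{
		nil, // healthy pod
		{ExitCode: 1, Reason: "Error"},
		{ExitCode: 137, Reason: "OOMKilled"},
	}
	var rs revStatus
	markFirstCrashing(pods, &rs)
	fmt.Println(rs.Reason)
}
```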

Contributor

@markusthoemmes markusthoemmes left a comment


As it stands, I think this makes this code quite expensive. The pod listing is not backed by an informer or any cache at all so this will now do API calls at any scale. I think this is fine at scale == 0 cases, I'm not sure we want to trigger this logic at arbitrary scale though.
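One mitigation consistent with the reviewer's point (a sketch under my own assumptions, not what the PR does): gate the uncached pod list behind the condition that no replica is available, so the expensive call only fires in the scale == 0 failure case:

```go
package main

import "fmt"

// listPods stands in for an uncached API call; in the real controller this
// would go through the Kubernetes client rather than an informer cache.
func listPods(calls *int) []string {
	*calls++
	return []string{"pod-a", "pod-b"}
}

// reconcile only pays for the pod list when no replica is available,
// i.e. the all-pods-unavailable case the reviewer considers acceptable.
func reconcile(availableReplicas int, calls *int) {
	if availableReplicas > 0 {
		return // revision is at least partially healthy; skip the expensive list
	}
	pods := listPods(calls)
	fmt.Println("inspecting", len(pods), "pods")
}

func main() {
	var calls int
	reconcile(3, &calls) // healthy: no API call
	reconcile(0, &calls) // all unavailable: one API call
	fmt.Println("API calls:", calls)
}
```

The longer-term alternative discussed later in the thread is to back the lookup with an informer so listing comes from a shared cache instead of the API server.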


// If a container keeps crashing (no active pods in the deployment although we want some)
if *deployment.Spec.Replicas > 0 && deployment.Status.AvailableReplicas == 0 {
if *deployment.Spec.Replicas > deployment.Status.AvailableReplicas {
Contributor


This will always be true if we're scaling up, right?

// Update the revision status if pod cannot be scheduled(possibly resource constraints)
// If pod cannot be scheduled then we expect the container status to be empty.
for _, cond := range pod.Status.Conditions {
if cond.Type == corev1.PodScheduled && cond.Status == corev1.ConditionFalse {
Contributor


#4136 tries to grab this status from the deployment rather than from individual pods. Is that possible here as well?
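The suggestion is to read scheduling failures off deployment-level conditions rather than walking individual pods. A simplified sketch of that shape (the `condition` type here is a stand-in modeled on `appsv1.DeploymentCondition`; `ReplicaFailure` is a real deployment condition type, the helper name is hypothetical):

```go
package main

import "fmt"

// condition is a simplified stand-in for appsv1.DeploymentCondition.
type condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// unschedulableReason scans deployment-level conditions for a ReplicaFailure,
// avoiding a per-pod API round trip entirely.
func unschedulableReason(conds []condition) (string, bool) {
	for _, c := range conds {
		if c.Type == "ReplicaFailure" && c.Status == "True" {
			return c.Reason, true
		}
	}
	return "", false
}

func main() {
	conds := []condition{
		{Type: "Available", Status: "False"},
		{Type: "ReplicaFailure", Status: "True", Reason: "FailedCreate"},
	}
	if reason, ok := unschedulableReason(conds); ok {
		fmt.Println("deployment failing:", reason)
	}
}
```

The trade-off: deployment conditions are coarser than per-pod status, so the per-container terminated reason this PR surfaces would still require looking at pods.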

// the user container.
if status.Name == rev.Spec.GetContainer().Name {
if t := status.LastTerminationState.Terminated; t != nil {
userFacingLogger.Warnf("Container %s in pod %s of revision %s is in terminated status: %s/%s",
Contributor


Should we produce events instead of logs? They'd actually be user-facing. On that note: are these events already produced for the individual pods?
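Events would surface through `kubectl describe` on the revision rather than requiring log access. A minimal sketch of the recorder pattern (the interface mirrors the shape of client-go's `record.EventRecorder.Eventf`; the fake recorder and helper names are illustrative):

```go
package main

import "fmt"

// recorder mirrors the shape of client-go's record.EventRecorder.Eventf,
// reduced to what this sketch needs.
type recorder interface {
	Eventf(objName, eventType, reason, format string, args ...interface{})
}

type fakeRecorder struct{ events []string }

func (f *fakeRecorder) Eventf(objName, eventType, reason, format string, args ...interface{}) {
	f.events = append(f.events,
		fmt.Sprintf("%s %s %s: %s", objName, eventType, reason, fmt.Sprintf(format, args...)))
}

// reportTermination emits a Warning event on the revision instead of (or in
// addition to) writing a controller log line.
func reportTermination(r recorder, revName, podName, reason string, exitCode int32) {
	r.Eventf(revName, "Warning", "ContainerTerminated",
		"pod %s exited with %d (%s)", podName, exitCode, reason)
}

func main() {
	fr := &fakeRecorder{}
	reportTermination(fr, "rev-1", "rev-1-pod-abc", "OOMKilled", 137)
	fmt.Println(fr.events[0])
}
```

As the comment notes, the kubelet already emits events on the pods themselves (e.g. BackOff); the open question is whether duplicating them on the revision adds value.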

}

if shouldMarkRev {
// Arbitrarily check the very first pod, as they all should be crashing
Contributor


This comment seems wrong now.

@yanweiguo
Contributor Author

As it stands, I think this makes this code quite expensive. The pod listing is not backed by an informer or any cache at all so this will now do API calls at any scale. I think this is fine at scale == 0 cases, I'm not sure we want to trigger this logic at arbitrary scale though.

I'm going to hold this. Yes, I agree this makes the code expensive just for logging. We may have a better solution that covers more cases, as discussed in #4557.

/hold

@knative-prow-robot knative-prow-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 2, 2019
@jonjohnsonjr jonjohnsonjr mentioned this pull request Jul 10, 2019
@markusthoemmes
Contributor

@yanweiguo any updates here? Are you still working on this?

@yanweiguo
Contributor Author

Not actively working on this.

/close

@knative-prow-robot
Contributor

@yanweiguo: Closed this PR.

Details

In response to this:

Not actively working on this.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@yanweiguo yanweiguo deleted the log_reason branch July 6, 2020 19:33

Labels

area/API: API objects and controllers
area/test-and-release: Flags unit/e2e/conformance/perf test issues for product features
cla: yes: Indicates the PR's author has signed the CLA.
do-not-merge/hold: Indicates that a PR should not merge because someone has issued a /hold command.
size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add logs for container in bad status

6 participants