Skip to content

Bubble up pod schedule errors to revision status#4191

Merged
knative-prow-robot merged 1 commit intoknative:masterfrom
shashwathi:pod_error
Jun 3, 2019
Merged

Bubble up pod schedule errors to revision status#4191
knative-prow-robot merged 1 commit intoknative:masterfrom
shashwathi:pod_error

Conversation

@shashwathi
Copy link
Copy Markdown
Contributor

Pod schedule error information is updated in revision status so user can debug what's happening from the knative revision, configuration and svc object.

Eg:
Sample configuration status (with pod spec that has unschedulable resource )

    conditions:
    - lastTransitionTime: 2019-05-30T18:42:45Z
      message: 'Revision "configuration-example-jmk22" failed with message: 0/3 nodes
        are available: 3 Insufficient cpu..'
      reason: RevisionFailed
      status: "False"
      type: Ready

Revision status

   - lastTransitionTime: 2019-05-30T18:42:45Z
      message: '0/3 nodes are available: 3 Insufficient cpu.'
      reason: Unschedulable
      status: "False"
      type: ResourcesAvailable

Fixes #4153

@googlebot googlebot added the cla: yes Indicates the PR's author has signed the CLA. label May 30, 2019
@knative-prow-robot knative-prow-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 30, 2019
Copy link
Copy Markdown
Contributor

@knative-prow-robot knative-prow-robot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shashwathi: 1 warning.

Details

In response to this:

Pod schedule error information is updated in revision status so user can debug what's happening from the knative revision, configuration and svc object.

Eg:
Sample configuration status (with pod spec that has unschedulable resource )

   conditions:
   - lastTransitionTime: 2019-05-30T18:42:45Z
     message: 'Revision "configuration-example-jmk22" failed with message: 0/3 nodes
       are available: 3 Insufficient cpu..'
     reason: RevisionFailed
     status: "False"
     type: Ready

Revision status

  - lastTransitionTime: 2019-05-30T18:42:45Z
     message: '0/3 nodes are available: 3 Insufficient cpu.'
     reason: Unschedulable
     status: "False"
     type: ResourcesAvailable

Fixes #4153

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

revCondSet.Manage(rs).MarkTrue(RevisionConditionResourcesAvailable)
}

func (rs *RevisionStatus) MarkResourcesUnavailable(reason, message string) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Golint comments: exported method RevisionStatus.MarkResourcesUnavailable should have comment or be unexported. More info.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix this as well.

@knative-prow-robot knative-prow-robot added the area/API API objects and controllers label May 30, 2019
@knative-metrics-robot
Copy link
Copy Markdown

The following is the coverage report on pkg/.
Say /test pull-knative-serving-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/serving/v1alpha1/revision_lifecycle.go 84.1% 84.4% 0.2
pkg/reconciler/revision/reconcile_resources.go 91.4% 91.7% 0.3

@jonjohnsonjr
Copy link
Copy Markdown
Contributor

Nice, I'll need MarkResourcesUnavailable for #4192

@dprotaso
Copy link
Copy Markdown
Member

/lgtm

@shashwathi after rebasing/resolving conflict you'll need to assign this to Matt because of the API package changes

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label May 31, 2019
@knative-prow-robot knative-prow-robot removed the lgtm Indicates that a PR is ready to be merged. label May 31, 2019
@shashwathi
Copy link
Copy Markdown
Contributor Author

/assign @mattmoor

@knative-metrics-robot
Copy link
Copy Markdown

The following is the coverage report on pkg/.
Say /test pull-knative-serving-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/revision/reconcile_resources.go 91.4% 91.7% 0.3

@knative-metrics-robot
Copy link
Copy Markdown

The following is the coverage report on pkg/.
Say /test pull-knative-serving-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/serving/v1alpha1/revision_lifecycle.go 84.1% 77.2% -6.9
pkg/reconciler/revision/reconcile_resources.go 91.4% 90.0% -1.4

Copy link
Copy Markdown
Member

@mattmoor mattmoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@knative-prow-robot knative-prow-robot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jun 3, 2019
@mattmoor
Copy link
Copy Markdown
Member

mattmoor commented Jun 3, 2019

/hold

Is an e2e test for this possible? If not, feel free to remove the hold.

@knative-prow-robot knative-prow-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 3, 2019
@mattmoor
Copy link
Copy Markdown
Member

mattmoor commented Jun 3, 2019

Perhaps you could ask for an unreasonably large amount of CPU? Like 50k cores :)

@knative-prow-robot knative-prow-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. area/test-and-release It flags unit/e2e/conformance/perf test issues for product features and removed lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 3, 2019
@shashwathi
Copy link
Copy Markdown
Contributor Author

@mattmoor : I have updated the PR to include e2e test with high CPU like you recommended :) Thanks for the suggestion.

@shashwathi
Copy link
Copy Markdown
Contributor Author

/hold cancel

@knative-prow-robot knative-prow-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 3, 2019
@knative-metrics-robot
Copy link
Copy Markdown

The following is the coverage report on pkg/.
Say /test pull-knative-serving-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/serving/v1alpha1/revision_lifecycle.go 84.1% 77.2% -6.9
pkg/reconciler/revision/reconcile_resources.go 91.4% 91.7% 0.3

@shashwathi
Copy link
Copy Markdown
Contributor Author

/test pull-knative-serving-unit-tests

@@ -0,0 +1,98 @@
package e2e
Copy link
Copy Markdown
Contributor

@jonjohnsonjr jonjohnsonjr Jun 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs:

// +build e2e

/*
Copyright 2019 The Knative Authors

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much

break
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we care about overriding this status with the statuses below if they all happen?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean updating revision status if resources are present on nodes?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NM, I think it's fine. But what I meant is code below would override the status value. But we don't have idea of priorities, so I guess it doesn't matter here.

Comment thread test/e2e/pod_schedule_error_test.go Outdated

names.Config = serviceresourcenames.Configuration(svc)

errorReason := "RevisionFailed"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

those both are constants.

Comment thread test/e2e/pod_schedule_error_test.go Outdated
t.Fatalf("Failed to get revision from configuration %s: %v", names.Config, err)
}

revisionReason := "Unschedulable"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as are these 2.

Comment thread test/e2e/pod_schedule_error_test.go Outdated
}

// Get revision name from configuration.
func getRevisionFromConfiguration(clients *test.Clients, configName string) (string, error) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just revisionFromConfiguration.

@knative-metrics-robot
Copy link
Copy Markdown

The following is the coverage report on pkg/.
Say /test pull-knative-serving-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/serving/v1alpha1/revision_lifecycle.go 84.1% 77.2% -6.9
pkg/reconciler/revision/reconcile_resources.go 91.4% 91.7% 0.3

Copy link
Copy Markdown
Contributor

@vagababov vagababov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

superficial issues only

Comment thread test/e2e/pod_schedule_error_test.go Outdated
revisionName, revisionReason, errorMsg, cond.Reason, cond.Message)
}
return false, nil
}, "RevisionFailed")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use errorReason here?

Comment thread test/e2e/pod_schedule_error_test.go Outdated
if config.Status.LatestCreatedRevisionName != "" {
return config.Status.LatestCreatedRevisionName, nil
}
return "", fmt.Errorf("No valid revision name found in configuration %s", configName)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

errors start with lowercase in Go.

Suggested change
return "", fmt.Errorf("No valid revision name found in configuration %s", configName)
return "", fmt.Errorf("no valid revision name found in configuration %s", configName)

Comment thread test/e2e/pod_schedule_error_test.go Outdated
if cond.Reason == revisionReason && strings.Contains(cond.Message, errorMsg) {
return true, nil
}
return true, fmt.Errorf("The revision %s was not marked with expected error condition (Reason=%q, Message=%q), but with (Reason=%q, Message=%q)",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return true, fmt.Errorf("The revision %s was not marked with expected error condition (Reason=%q, Message=%q), but with (Reason=%q, Message=%q)",
return true, fmt.Errorf("the revision %s was not marked with expected error condition (Reason=%q, Message=%q), but with (Reason=%q, Message=%q)",

@shashwathi
Copy link
Copy Markdown
Contributor Author

shashwathi commented Jun 3, 2019

@vagababov Addressed your comments. Ready for another review

@knative-metrics-robot
Copy link
Copy Markdown

The following is the coverage report on pkg/.
Say /test pull-knative-serving-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/serving/v1alpha1/revision_lifecycle.go 84.1% 77.2% -6.9
pkg/reconciler/revision/reconcile_resources.go 91.4% 91.7% 0.3

Copy link
Copy Markdown
Contributor

@vagababov vagababov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do the linter error and found one more error string :)
But I am sure those are the last ones :)

revCondSet.Manage(rs).MarkTrue(RevisionConditionResourcesAvailable)
}

func (rs *RevisionStatus) MarkResourcesUnavailable(reason, message string) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix this as well.

Comment thread test/e2e/pod_schedule_error_test.go Outdated
return true, nil
}
t.Logf("Reason: %s ; Message: %s ; Status: %s", cond.Reason, cond.Message, cond.Status)
return true, fmt.Errorf("The service %s was not marked with expected error condition (Reason=\"%s\", Message=\"%s\", Status=\"%s\"), but with (Reason=\"%s\", Message=\"%s\", Status=\"%s\")",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return true, fmt.Errorf("The service %s was not marked with expected error condition (Reason=\"%s\", Message=\"%s\", Status=\"%s\"), but with (Reason=\"%s\", Message=\"%s\", Status=\"%s\")",
return true, fmt.Errorf("the service %s was not marked with expected error condition (Reason=\"%s\", Message=\"%s\", Status=\"%s\"), but with (Reason=\"%s\", Message=\"%s\", Status=\"%s\")",

@knative-metrics-robot
Copy link
Copy Markdown

The following is the coverage report on pkg/.
Say /test pull-knative-serving-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/serving/v1alpha1/revision_lifecycle.go 84.1% 77.2% -6.9
pkg/reconciler/revision/reconcile_resources.go 91.4% 91.7% 0.3

Copy link
Copy Markdown
Contributor

@vagababov vagababov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve
Thanks!

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 3, 2019
@knative-prow-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mattmoor, shashwathi, vagababov

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/API API objects and controllers area/test-and-release It flags unit/e2e/conformance/perf test issues for product features cla: yes Indicates the PR's author has signed the CLA. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Surface pod scheduling errors in service status

8 participants