Revise PodAutoscaler's life cycle#4094

Closed
hohaichi wants to merge 18 commits into knative:master from hohaichi:pnr

Conversation

@hohaichi
Contributor

@hohaichi hohaichi commented May 14, 2019

Currently, PodAutoscaler's Ready condition is the same as its Active condition. This is inconsistent with Revision, where Active is only an informational condition that does not define Ready, and it also fails to capture the PodAutoscaler's status well. For example, when everything is healthy but there is no traffic, the PodAutoscaler should be Ready but Inactive.

Fixes #3456

Proposed Changes

  • Make Active an informational condition for PodAutoscaler.
  • Define Bootstrap as a living condition for PodAutoscaler's Ready.
  • Add ReadyReplicas to PodScalable's Status so that PodScalable can be used to detect when the PA may be stuck scaling up from zero.
  • When the PA is stuck and unable to scale up from zero, PodAutoscaler checks for the following conditions: container exiting, pod unscheduled, and image pull problems.
  • Revision reads PodAutoscaler's Ready condition and updates its ContainerHealthy condition.
  • PodAutoscaler scales down (to 0 if allowed) when it fails to become Ready.
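
As a rough illustration of the proposed changes above, the sketch below derives Ready from a Bootstrap condition only, keeps Active informational, and classifies the three "stuck scaling from zero" cases. It is not the code in this PR: `podState`, `bootstrap`, `ready`, and the reason strings are hypothetical names, and keeping Bootstrap True once it has succeeded is an assumption made so that a later scale to zero leaves the PA Ready.

```go
package main

import "fmt"

// Status is a tri-state condition value, as in Kubernetes conditions.
type Status string

const (
	True    Status = "True"
	False   Status = "False"
	Unknown Status = "Unknown"
)

// podState is a hypothetical summary of what a reconciler might read from a Pod.
type podState struct {
	Scheduled     bool   // false when the scheduler could not place the pod
	WaitingReason string // container waiting reason, e.g. "ImagePullBackOff"
	Terminated    bool   // true when the user container has exited
}

// bootstrap evaluates the (hypothetical) Bootstrap condition and a reason.
// Assumption: once Bootstrap has been True it stays True, so that a later
// scale to zero does not make the PodAutoscaler un-Ready.
func bootstrap(prev Status, readyReplicas int, pods []podState) (Status, string) {
	if prev == True || readyReplicas > 0 {
		return True, ""
	}
	for _, p := range pods {
		switch {
		case !p.Scheduled:
			return False, "PodUnscheduled"
		case p.WaitingReason == "ImagePullBackOff" || p.WaitingReason == "ErrImagePull":
			return False, "ImagePullFailed"
		case p.Terminated:
			return False, "ContainerExiting"
		}
	}
	return Unknown, "" // nothing conclusive yet; keep waiting
}

// ready derives Ready from the living condition only. The second argument
// (Active) is informational and deliberately ignored.
func ready(boot, _ Status) Status {
	return boot
}

func main() {
	// Healthy but idle: bootstrapped earlier, scaled to zero, no traffic.
	boot, _ := bootstrap(True, 0, nil)
	fmt.Printf("Ready=%s Active=%s\n", ready(boot, False), False)

	// Stuck scaling from zero: the image cannot be pulled.
	boot, reason := bootstrap(Unknown, 0, []podState{{Scheduled: true, WaitingReason: "ErrImagePull"}})
	fmt.Printf("Ready=%s Reason=%s\n", ready(boot, Unknown), reason)
}
```

The point of the split is that an idle PA reports Ready=True with Active=False, while a PA that never produced a ready replica reports Ready=False with a reason the Revision can surface.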

Release Note

Revise PodAutoscaler's life cycle to capture its Ready status more accurately.

@googlebot googlebot added the cla: yes Indicates the PR's author has signed the CLA. label May 14, 2019
@knative-prow-robot knative-prow-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 14, 2019
Contributor

@knative-prow-robot knative-prow-robot left a comment

@hohaichi: 0 warnings.

Details

In response to this:

…eck container health

Fixes #3456

Proposed Changes

  • Introduce a Requirements condition in PodAutoscaler. This condition is to capture all preconditions for a PodAutoscaler to become Ready, besides the Active condition.
  • Use this Requirements condition to fix the crashlooping problem by having healthy containers as one requirement.

Release Note


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@hohaichi
Contributor Author

/hold
Prototyping to gather feedback

@knative-prow-robot knative-prow-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. area/API API objects and controllers area/autoscale labels May 14, 2019
@hohaichi
Contributor Author

/assign @mattmoor
Matt, what do you think about this approach? I think it is extensible to fix #3077 too.

Comment thread pkg/reconciler/autoscaling/kpa/kpa.go Outdated
}

func (c *Reconciler) verifyContainer(pa *pav1alpha1.PodAutoscaler) (*apis.Condition, error) {
rev, err := c.revLister.Revisions(pa.Namespace).Get(pa.Name)
Member

I don't like the layering violation that's here. Revision creates the PodAutoscaler abstraction, so this is a back-edge in the layering graph, which I've been trying to eliminate. :(

@knative-prow-robot knative-prow-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 22, 2019
@hohaichi hohaichi changed the title Introduce a Requirements condition for PodAutoscaler and use it to ch… @hohaichi Revise PodAutoscaler's life cycle May 22, 2019
@hohaichi hohaichi changed the title @hohaichi Revise PodAutoscaler's life cycle Revise PodAutoscaler's life cycle May 22, 2019
@hohaichi
Contributor Author

@mattmoor I've prototyped a fix based on our discussion. Could you please take a look? If the change is good, I'll add tests.

@hohaichi
Contributor Author

hohaichi commented Jun 6, 2019

/assign @jonjohnsonjr
Jon, @mattmoor mentioned that you are working on separation of concerns in knative serving and doing something similar to this PR. Could you please have a look?

Contributor

@vagababov vagababov left a comment

Some superficial items; I need to read this in more detail.

Comment thread pkg/apis/autoscaling/v1alpha1/pa_lifecycle.go Outdated
Comment thread pkg/apis/autoscaling/v1alpha1/pa_lifecycle.go
Comment thread pkg/apis/autoscaling/v1alpha1/pa_lifecycle.go
Comment thread pkg/apis/autoscaling/v1alpha1/pa_types.go Outdated
Comment thread pkg/apis/autoscaling/v1alpha1/pa_types.go Outdated
Comment thread pkg/reconciler/autoscaling/kpa/kpa.go Outdated
Comment thread pkg/reconciler/autoscaling/kpa/kpa.go Outdated
Comment thread pkg/reconciler/autoscaling/kpa/kpa.go
Comment thread pkg/reconciler/revision/reconcile_resources.go
@knative-prow-robot knative-prow-robot added the area/test-and-release It flags unit/e2e/conformance/perf test issues for product features label Jun 14, 2019
@hohaichi
Contributor Author

@mattmoor @vagababov @jonjohnsonjr
Since all the feedback is on details, it looks like the high-level approach is OK with you. I've addressed the feedback and am working on tests now.

Comment thread pkg/apis/autoscaling/v1alpha1/pa_types.go Outdated
Comment thread pkg/apis/autoscaling/v1alpha1/pa_types.go Outdated
Comment thread pkg/reconciler/autoscaling/kpa/kpa.go Outdated
@hohaichi
Contributor Author

/hold
(work in progress)

@knative-prow-robot knative-prow-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 24, 2019
@knative-prow-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hohaichi
To complete the pull request process, please assign mattmoor
You can assign the PR to them by writing /assign @mattmoor in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hohaichi
Contributor Author

/hold cancel

@knative-prow-robot knative-prow-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 24, 2019
// MarkResourceFailedCreation changes the "Active" condition to false to reflect that a
// critical resource of the given kind and name was unable to be created.
func (pas *PodAutoscalerStatus) MarkResourceFailedCreation(kind, name string) {
// TODO: This looks more like a bootstrap condition?
Contributor

This is only used here when we fail to create a HorizontalPodAutoscaler.

Looking at other resources, what happens if we fail to create an owned resource?

Just return an error:

Outliers:

It seems like we should consider this entire dependency graph when we think about the lifecycle of any single resource (and how failures to reconcile get propagated)...

In this case, I think I'd just return an error and leave Ready=Unknown. This might be an issue, though, since the HPA will still be Active, but it definitely shouldn't be... should we just remove that line? I haven't entirely thought through that...

Contributor

I turned the previous comment into a graph because I wanted to visualize which resources actually update their conditions based on failure to create a child resource (in red):
[Image: graph]

It seems like we should be more consistent here... cc @mattmoor

// MarkResourceNotOwned changes the "Active" condition to false to reflect that the
// resource of the given kind and name has already been created, and we do not own it.
func (pas *PodAutoscalerStatus) MarkResourceNotOwned(kind, name string) {
// TODO: This looks more like a bootstrap condition?
Contributor

Seems like it should set Ready=False, if we're being consistent with other reconcilers.

@jonjohnsonjr jonjohnsonjr mentioned this pull request Jun 27, 2019
@hohaichi
Contributor Author

Updated per our discussion yesterday--making resource-not-owned and resource-failed-creation readiness conditions.
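
A minimal sketch of what "readiness conditions" could mean here, assuming both helpers set Ready=False instead of touching Active. The `condition` type, the reason strings, and the messages are illustrative assumptions, not the actual Knative definitions.

```go
package main

import "fmt"

// condition is a hypothetical, simplified condition record.
type condition struct {
	Status  string
	Reason  string
	Message string
}

// paStatus is a hypothetical stand-in for PodAutoscalerStatus.
type paStatus struct {
	Ready condition
}

// MarkResourceNotOwned sets Ready to False because a resource with the
// expected name already exists and is not owned by this PodAutoscaler.
func (s *paStatus) MarkResourceNotOwned(kind, name string) {
	s.Ready = condition{
		Status:  "False",
		Reason:  "NotOwned", // illustrative reason string
		Message: fmt.Sprintf("There is an existing %s %q that we do not own.", kind, name),
	}
}

// MarkResourceFailedCreation sets Ready to False because a critical child
// resource could not be created.
func (s *paStatus) MarkResourceFailedCreation(kind, name string) {
	s.Ready = condition{
		Status:  "False",
		Reason:  "FailedCreate", // illustrative reason string
		Message: fmt.Sprintf("Failed to create %s %q.", kind, name),
	}
}

func main() {
	var s paStatus
	s.MarkResourceNotOwned("HorizontalPodAutoscaler", "hello")
	fmt.Printf("%s/%s: %s\n", s.Ready.Status, s.Ready.Reason, s.Ready.Message)
}
```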

@knative-metrics-robot

The following is the coverage report on pkg/.
Say /test pull-knative-serving-go-coverage to re-run this coverage report

| File | Old Coverage | New Coverage | Delta |
| --- | --- | --- | --- |
| pkg/apis/autoscaling/v1alpha1/pa_lifecycle.go | 98.3% | 95.7% | -2.7 |
| pkg/apis/serving/v1alpha1/revision_lifecycle.go | 78.2% | 83.0% | 4.8 |
| pkg/reconciler/autoscaling/hpa/hpa.go | 86.4% | 87.0% | 0.6 |
| pkg/reconciler/autoscaling/kpa/kpa.go | 92.2% | 79.9% | -12.4 |
| pkg/reconciler/autoscaling/kpa/scaler.go | 88.9% | 89.4% | 0.5 |
| pkg/reconciler/revision/reconcile_resources.go | 91.9% | 91.7% | -0.3 |

@knative-prow-robot
Contributor

@hohaichi: The following test failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-knative-serving-go-coverage | f3ac59e | link | /test pull-knative-serving-go-coverage |

@mattmoor
Member

mattmoor commented Jul 8, 2019

I think @jonjohnsonjr is going to reopen this as another PR. Going to close this to clear from gubernator.

Labels

area/API API objects and controllers area/autoscale area/test-and-release It flags unit/e2e/conformance/perf test issues for product features cla: yes Indicates the PR's author has signed the CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Crash looping revisions never gets scaled to zero

7 participants