[WIP] Stabilize 1->0 case#1251

Closed
akyyy wants to merge 13 commits into knative:master from akyyy:tearingdown

Conversation

@akyyy
Contributor

@akyyy akyyy commented Jun 18, 2018

Fixes #1250

Proposed Changes

  • Add a deactivating condition for revisions so there is no race between the revision controller and the route controller. Tearing down k8s resources needs to happen after we route traffic to the activator service.
  • Keep the revision service and deployment while we are tearing down k8s resources for 1->0, so the terminating pod does not block creating a new pod. This is useful for serving traffic while tearing down.
  • Don't include the route label in revisions, so when we update the k8s deployment there is only one ReplicaSet.

@google-prow-robot google-prow-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 18, 2018
@knative-metrics-robot

The following is the coverage report on pkg/. Say /test pull-knative-serving-go-coverage to run the coverage report again

File Old Coverage New Coverage Delta
pkg/apis/serving/v1alpha1/revision_types.go 98.3% 96.6% -1.7
pkg/controller/route/route.go 78.5% 78.1% -0.4
pkg/controller/route/route_test.go 78.5% 78.1% -0.4
pkg/controller/revision/revision.go 80.0% 76.6% -3.4
pkg/controller/revision/revision_test.go 80.0% 76.6% -3.4

*TestCoverage feature is being tested, do not rely on any info here yet

@akyyy akyyy removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 18, 2018
@akyyy akyyy changed the title [WIP] stabilize 1->0 case Stabilize 1->0 case Jun 18, 2018
@akyyy
Contributor Author

akyyy commented Jun 18, 2018

/retest

Contributor

@josephburnett josephburnett left a comment

Just noticed a few nits.

/lgtm

Comment thread cmd/activator/main.go
timeout := 500 * time.Millisecond

i := 0
for ; i < maxRetry; i++ {
Contributor

@isdal had a good point. We should probe for iptable programming by hitting the health check endpoint. That way POST requests are actually retried. But that's fine to do as a separate pull request.

Comment thread pkg/controller/revision/revision.go Outdated
logger.Info("Deleted service")
} else {
// Serving state is RevisionServingStateReserve. Delete the revision service and update
// the dpeloyment replicas to be 0.
Contributor

Nit: spelling: d[e]ployment

Comment thread pkg/controller/revision/revision.go Outdated
}
}

logger.Info("Scale the deployment to 0")
Contributor

Nit: scal[ing] the deployment to 0

Comment thread pkg/controller/revision/revision.go Outdated
// Serving state is RevisionServingStateReserve. Delete the revision service and update
// the dpeloyment replicas to be 0.
if cond := rev.Status.GetCondition(v1alpha1.RevisionConditionReady); cond != nil {
if cond.Reason != "Inactive" {
Contributor

I think you can combine lines 497 and 498 into a single if statement.

Comment thread pkg/controller/revision/revision.go Outdated
// the dpeloyment replicas to be 0.
if cond := rev.Status.GetCondition(v1alpha1.RevisionConditionReady); cond != nil {
if cond.Reason != "Inactive" {
if cond.Reason != "Deactivating" {
Contributor

To reduce the amount of indentation here, how about writing this as a short-circuiting check.

if cond.Reason == "Deactivating" {
    return nil
}
...

Comment thread pkg/controller/revision/revision.go Outdated
return err
}

logger.Infof("Deactvaing Deployment %q", deploymentName)
Contributor

Nit: spelling: deact[i]va[t]ing

@google-prow-robot google-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 18, 2018
@google-prow-robot google-prow-robot removed the lgtm Indicates that a PR is ready to be merged. label Jun 19, 2018
@knative-metrics-robot

The following is the coverage report on pkg/. Say /test pull-knative-serving-go-coverage to run the coverage report again

File Old Coverage New Coverage Delta
pkg/apis/serving/v1alpha1/revision_types.go 98.3% 96.6% -1.7
pkg/controller/route/route.go 78.7% 78.1% -0.6
pkg/controller/route/route_test.go 78.7% 78.1% -0.6
pkg/controller/revision/revision.go 79.7% 76.6% -3.1
pkg/controller/revision/revision_test.go 79.7% 76.6% -3.1

*TestCoverage feature is being tested, do not rely on any info here yet

Comment thread pkg/controller/revision/revision.go Outdated
}
logger.Info("Deleted service")
} else {
// Serving state is RevisionServingStateReserve. Delete the revision service and update
Contributor

I don't see a deletion below, is this comment correct?

} else {
// Serving state is RevisionServingStateReserve. Delete the revision service and update
// the deployment replicas to be 0.
if cond := rev.Status.GetCondition(v1alpha1.RevisionConditionReady); cond != nil && cond.Reason != "Inactive" {
Contributor

I believe there are now constants for reasons in revision_types.go ?

Contributor

@josephburnett josephburnett Jun 19, 2018

+1. There is also a "Deactivating" reason string above that could be a constant.

Contributor Author

The constants are still strings. I don't see them defined in revision_types.go? The issue is still open. #880
We may address that later.

*deployment.Spec.Replicas = int32(0)
_, err = dc.Update(deployment)
if err != nil {
logger.Errorf("Error deactivating deployment %q: %s", deploymentName, err)
Contributor

return err ?

(or if that isn't desirable, we should still return to avoid seeing the log.Infof success here?)

Contributor Author

ah thanks!

Comment thread cmd/activator/main.go Outdated
@@ -48,28 +47,28 @@ type activationHandler struct {
type retryRoundTripper struct{}

func (rrt retryRoundTripper) RoundTrip(r *http.Request) (*http.Response, error) {
Contributor

Creating a new transport for every transaction might be problematic. Should we not cache this?

Contributor Author

Fixed. Thanks!

@knative-metrics-robot

The following is the coverage report on pkg/. Say /test pull-knative-serving-go-coverage to run the coverage report again

File Old Coverage New Coverage Delta
pkg/apis/serving/v1alpha1/revision_types.go 98.3% 96.6% -1.7
pkg/controller/route/route.go 78.3% 78.1% -0.2
pkg/controller/route/route_test.go 78.3% 78.1% -0.2
pkg/controller/revision/revision.go 79.7% 75.5% -4.2
pkg/controller/revision/revision_test.go 79.7% 75.5% -4.2

*TestCoverage feature is being tested, do not rely on any info here yet

Contributor

@tcnghia tcnghia left a comment

/lgtm
/approve

@google-prow-robot google-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 19, 2018
@josephburnett
Contributor

/lgtm
/approve

@josephburnett
Contributor

/lgtm
/approve

@google-prow-robot google-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 19, 2018
@google-prow-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: akyyy, josephburnett, tcnghia
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: vaikas-google

Assign the PR to them by writing /assign @vaikas-google in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

})
}

func (rs *RevisionStatus) MarkDeactivating() {
Member

This needs unit test coverage.

@google-prow-robot

New changes are detected. LGTM label has been removed.

@google-prow-robot google-prow-robot removed the lgtm Indicates that a PR is ready to be merged. label Jun 19, 2018

for k, v := range revision.ObjectMeta.Labels {
labels[k] = v
if k != serving.RouteLabelKey {
Member

Why isn't the comment below up here?

Contributor Author

Added.

Comment thread pkg/controller/route/route_test.go Outdated
Name: fmt.Sprintf("%s-service", cfgrev.Name),
Namespace: testNamespace,
Route: []v1alpha2.DestinationWeight{
{
Member

collapse as we had on the LHS everywhere please.

Contributor Author

done.

// Serving state is RevisionServingStateReserve. Keep the revision service and update
// the deployment replicas to be 0.
if cond := rev.Status.GetCondition(v1alpha1.RevisionConditionReady); cond != nil && cond.Reason != "Inactive" {
if cond.Reason == "Deactivating" {
Member

We should not be predicating logic on cond.Reason. To me this is an indicator that our model for the lifecycle of inactive revisions is incomplete. This is clearly an extension of the logic we currently have in here, but we need to prioritize fixing this.

Contributor

Yes, we want to do this (#645 (comment)) but before we pivot to the new model, I want what we have at head to not throw 500's. This fixes the last of the 500's that happens just during deactivation.

Contributor Author

Yea, that's an issue out of the scope of this pr.

Member

Ack. Does that mean that cleaning this up is the top priority after this goes in?

Contributor Author

I can take it after this pr and oncall.

Contributor

+1. After we fix this 503 issue, the top priority is migrating us to the new model you described and (hopefully) getting rid of serving state.

@knative-metrics-robot

The following is the coverage report on pkg/. Say /test pull-knative-serving-go-coverage to run the coverage report again

File Old Coverage New Coverage Delta
pkg/apis/serving/v1alpha1/revision_types.go 98.3% 98.3% 0.0
pkg/apis/serving/v1alpha1/revision_types_test.go 98.3% 98.3% 0.0
pkg/controller/route/route.go 78.5% 78.1% -0.4
pkg/controller/route/route_test.go 78.5% 78.1% -0.4
pkg/controller/revision/revision.go 79.7% 76.3% -3.4
pkg/controller/revision/revision_test.go 79.7% 76.3% -3.4

*TestCoverage feature is being tested, do not rely on any info here yet

@knative-metrics-robot

The following is the coverage report on pkg/. Say /test pull-knative-serving-go-coverage to run the coverage report again

File Old Coverage New Coverage Delta
pkg/apis/serving/v1alpha1/revision_types.go 98.3% 98.3% 0.0
pkg/apis/serving/v1alpha1/revision_types_test.go 98.3% 98.3% 0.0
pkg/controller/route/route.go 78.5% 77.9% -0.6
pkg/controller/route/route_test.go 78.5% 77.9% -0.6
pkg/controller/revision/revision.go 79.7% 76.3% -3.4
pkg/controller/revision/revision_test.go 79.7% 76.3% -3.4

*TestCoverage feature is being tested, do not rely on any info here yet

Contributor

@vaikas vaikas left a comment

Just a couple of suggestions, in addition to what Matt is requesting.

logger.Errorf("Failed to get deployment %q", deploymentName)
return err
}

Contributor

Nit: you could check the deployment here, see if Replicas is already 0, and then short-circuit without updating?

_, err = kubeClient.AppsV1().Deployments(testNamespace).Get(expectedDeploymentName, metav1.GetOptions{})

if err != nil {
t.Fatalf("Expected k8s deployment to be there but it was gone: %s/%s", testNamespace, expectedDeploymentName)
Contributor

Seems like this could fail for another reason as well and the deployment could be there? Perhaps check isNotFound and then fail with this message?

Comment thread pkg/controller/route/route.go Outdated
// A revision is considered inactive (yet) if it's in
// "Inactive" condition or "Activating" condition.
// "Inactive" condition, "Activating" or "Deactivating" condition.
logger.Infof("cond: %v", cond)
Contributor

Seems like it's not only the cond but also the status that makes this condition happen. I am a little confused about this 'double-checking'.

Contributor Author

This could be part of issue #645
#645 (comment)
I'll fix it in that issue.

// end up with multiple replica sets.
if k != serving.RouteLabelKey {
labels[k] = v
}
Member

(thinking about this some more) I'm not thrilled about the Revision controller needing to know things about the Route controller (these are essentially the only references to route in the directory).

This also feels like a band-aid for the one case we know about now, but doesn't solve the more general problem that label mutations could trigger, right?

Contributor Author

Yes, this is just a temporary solution. I opened issue #1293 and will add a link to this piece of code in the issue.
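
A self-contained sketch of the label filtering discussed in this thread. The constant `routeLabelKey` stands in for `serving.RouteLabelKey`, and its string value here is an assumption; `deploymentLabels` is a hypothetical helper name. The point is that a label which changes when routes change never reaches the deployment's pod template, so route updates alone do not produce a second ReplicaSet.

```go
package main

import "fmt"

// routeLabelKey stands in for serving.RouteLabelKey; the value is assumed.
const routeLabelKey = "serving.knative.dev/route"

// deploymentLabels copies the revision's labels, skipping the route label so
// that route changes do not alter the deployment's labels.
func deploymentLabels(revisionLabels map[string]string) map[string]string {
	labels := make(map[string]string, len(revisionLabels))
	for k, v := range revisionLabels {
		if k != routeLabelKey {
			labels[k] = v
		}
	}
	return labels
}

func main() {
	in := map[string]string{
		routeLabelKey: "my-route",
		"app":         "helloworld",
	}
	fmt.Println(deploymentLabels(in))
}
```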

err = c.deleteService(ctx, rev)
if err != nil {
logger.Error("Failed to delete k8s service", zap.Error(err))
if rev.Spec.ServingState == v1alpha1.RevisionServingStateRetired {
Member

@josephburnett I thought we just added a comment indicating this state was unused? Should we just delete the code?

Contributor Author

I agree it's cleaner to delete Retired state. I added this comment to issue #645

// Then create the actual route rules.
logger.Info("Creating Istio route rules")
revisionRoutes, err := c.createOrUpdateRouteRules(ctx, route, configMap, revMap)
revisionRoutes, inactiveRev, err := c.createOrUpdateRouteRules(ctx, route, configMap, revMap)
Member

It took some digging for me to determine what inactiveRev was, whereas before the method and return value were fairly self-describing. What I don't get is why there is only one inactive revision (name?) returned from this method. The traffic block could have N revisions, all reserve.

Contributor Author

For now, the activator only forwards traffic to the one inactive revision with the largest traffic weight. More details are here. #882
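
A minimal sketch of that selection rule, assuming a simplified traffic target. The `trafficTarget` type and `largestInactive` function are hypothetical names, not what `createOrUpdateRouteRules` actually uses, and breaking ties in favor of the first target is an arbitrary choice for the example.

```go
package main

import "fmt"

// trafficTarget is a hypothetical, simplified view of one entry in a route's
// traffic block.
type trafficTarget struct {
	RevisionName string
	Percent      int
	Inactive     bool
}

// largestInactive returns the name of the inactive revision with the largest
// traffic percentage. The bool is false when no target is inactive. On a tie
// the earlier target wins.
func largestInactive(targets []trafficTarget) (string, bool) {
	best, found := trafficTarget{}, false
	for _, t := range targets {
		if !t.Inactive {
			continue
		}
		if !found || t.Percent > best.Percent {
			best, found = t, true
		}
	}
	return best.RevisionName, found
}

func main() {
	targets := []trafficTarget{
		{RevisionName: "rev-a", Percent: 50, Inactive: false},
		{RevisionName: "rev-b", Percent: 20, Inactive: true},
		{RevisionName: "rev-c", Percent: 30, Inactive: true},
	}
	name, ok := largestInactive(targets)
	fmt.Println(name, ok)
}
```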

return nil, err
}
rev.Status.MarkInactive()
if _, err = revisionClient.Update(rev); err != nil {
Member

I didn't notice this before, but we're updating Revision.Status from the Route controller. That makes me even more uncomfortable than predicating on cond.Reason.

@josephburnett I understand wanting to root out 5XXs within our current model, but cleaning this up needs to be a top priority. We've known about this problem for 2+ months now, I'd really like to see a proper fix.

Member

So I find this aspect of the PR troubling (chatted a bit with @vaikas-google too), and I think we should reconsider our choice of band-aid to fix this.

I'd initially proposed this to remediate this, and it got a bit lost in broader thinking about how we make autoscaling more extensible.

If we are thinking about a nearer term fix, I think we should go with this proposal instead, eliding the aspects around Conditions and Retired for now.

Comment thread pkg/controller/route/route_test.go Outdated
},
Route: []v1alpha2.DestinationWeight{getActivatorDestinationWeight(100)},
Route: []v1alpha2.DestinationWeight{
{
Member

you can collapse this as well.

@knative-metrics-robot

The following is the coverage report on pkg/. Say /test pull-knative-serving-go-coverage to run the coverage report again

File Old Coverage New Coverage Delta
pkg/apis/serving/v1alpha1/revision_types.go 98.3% 98.3% 0.0
pkg/controller/route/route.go 78.5% 77.7% -0.8
pkg/controller/revision/revision.go 79.7% 75.7% -4.0

*TestCoverage feature is being tested, do not rely on any info here yet

@akyyy
Contributor Author

akyyy commented Jun 20, 2018

I assigned #645 to myself and will work on it right after finishing my oncall and one half-done item.

@knative-metrics-robot

The following is the coverage report on pkg/. Say /test pull-knative-serving-go-coverage to run the coverage report again

File Old Coverage New Coverage Delta
pkg/apis/serving/v1alpha1/revision_types.go 98.3% 98.3% 0.0
pkg/controller/route/route.go 78.5% 77.9% -0.6
pkg/controller/revision/revision.go 79.7% 75.7% -4.0

*TestCoverage feature is being tested, do not rely on any info here yet

@knative-metrics-robot

The following is the coverage report on pkg/. Say /test pull-knative-serving-go-coverage to run the coverage report again

File Old Coverage New Coverage Delta
pkg/apis/serving/v1alpha1/revision_types.go 98.5% 98.5% 0.0
pkg/controller/route/route.go 78.7% 77.7% -1.0
pkg/controller/revision/revision.go 81.8% 77.0% -4.8

*TestCoverage feature is being tested, do not rely on any info here yet

@akyyy akyyy changed the title Stabilize 1->0 case [WIP] Stabilize 1->0 case Jun 21, 2018
@google-prow-robot google-prow-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 21, 2018
@akyyy
Contributor Author

akyyy commented Jun 21, 2018

Based on discussion, I'll add a serving state ActiveReserve (name subject to change), which the autoscaler will use to initiate scale to zero. After the route controller has set the route rules to point to the activator, it will set the revision serving state to Reserve. Then the revision controller will tear down the k8s resources. This was proposed here.
#645 (comment)

I'll split this pr into two prs.
The first pr is to keep k8s service and deployment.
The second pr is to resolve the race condition by adding a new serving state.
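
A rough sketch of the hand-off described in this comment, under the assumption that each controller only ever makes one kind of transition. The `servingState` type, the state string values, and all three function names are hypothetical; `ActiveReserve` is the proposed name and may change.

```go
package main

import "fmt"

// servingState is a hypothetical stand-in for the revision serving state.
type servingState string

const (
	stateActive        servingState = "Active"
	stateActiveReserve servingState = "ActiveReserve" // proposed; name pending change
	stateReserve       servingState = "Reserve"
)

// autoscalerRequestScaleToZero is the autoscaler's only move: it asks for
// scale to zero without touching any k8s resources.
func autoscalerRequestScaleToZero(s servingState) servingState {
	if s == stateActive {
		return stateActiveReserve
	}
	return s
}

// routeControllerSync moves to Reserve only after traffic has been pointed
// at the activator, which is what removes the race with teardown.
func routeControllerSync(s servingState, routedToActivator bool) servingState {
	if s == stateActiveReserve && routedToActivator {
		return stateReserve
	}
	return s
}

// revisionControllerShouldTearDown reports whether teardown may start.
func revisionControllerShouldTearDown(s servingState) bool {
	return s == stateReserve
}

func main() {
	s := stateActive
	s = autoscalerRequestScaleToZero(s)
	fmt.Println(s, "tear down:", revisionControllerShouldTearDown(s))

	s = routeControllerSync(s, false) // route rules not updated yet
	fmt.Println(s, "tear down:", revisionControllerShouldTearDown(s))

	s = routeControllerSync(s, true) // traffic now goes to the activator
	fmt.Println(s, "tear down:", revisionControllerShouldTearDown(s))
}
```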

@steuhs
Contributor

steuhs commented Jun 23, 2018

/test pull-knative-serving-go-coverage

@google-prow-robot

@akyyy: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-knative-serving-go-coverage 014f36e link /test pull-knative-serving-go-coverage

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@mattmoor mattmoor closed this Jul 12, 2018
creydr pushed a commit to creydr/knative-serving that referenced this pull request Mar 25, 2025
Signed-off-by: red-hat-konflux-kflux-prd-rh02 <190377777+red-hat-konflux-kflux-prd-rh02[bot]@users.noreply.github.com>
Co-authored-by: red-hat-konflux-kflux-prd-rh02[bot] <190377777+red-hat-konflux-kflux-prd-rh02[bot]@users.noreply.github.com>