Keep the deployment when scale to zero. #1320
akyyy wants to merge 8 commits into knative:master from akyyy:keepDeploymentWhenScaleToZero
Conversation
This PR contains a subset of the changes (keeping the deployment and service) from my other PR #1251.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: akyyy. Needs approval from an approver in each of these files. The full list of commands accepted by this bot can be found here; the pull request process is described here.
// teardownK8SResources deletes autoscaler resources, but keeps the revision service and deployment.
// It is used when the revision serving state is Reserve.
func (c *Controller) teardownK8SResources(ctx context.Context, rev *v1alpha1.Revision) error {
A clearer name would be better here, e.g. scaleRevisionResourcesToZero. Otherwise it is unclear what this does versus what delete does unless you read the func comments.
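To make the naming suggestion concrete, here is a minimal sketch of what a scaleRevisionResourcesToZero could decide per resource kind. The kind strings and action names are illustrative stand-ins, not the actual knative/serving types or helpers.

```go
package main

import "fmt"

type action string

const (
	actionDelete      action = "delete"
	actionScaleToZero action = "scale-to-zero"
	actionKeep        action = "keep"
)

// scaleRevisionResourcesToZero decides what to do with each resource when a
// revision moves to the Reserve state: autoscaler resources are deleted,
// while the revision's deployment is scaled to zero and its service is kept
// so activation can route traffic back quickly.
func scaleRevisionResourcesToZero(kind string) action {
	switch kind {
	case "autoscaler-deployment", "autoscaler-service":
		return actionDelete
	case "revision-deployment":
		return actionScaleToZero
	default: // e.g. "revision-service"
		return actionKeep
	}
}

func main() {
	for _, k := range []string{"autoscaler-deployment", "revision-deployment", "revision-service"} {
		fmt.Printf("%s -> %s\n", k, scaleRevisionResourcesToZero(k))
	}
}
```

The name makes the intent visible at the call site, which is the point of the rename suggestion above.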
func (c *Controller) deleteAutoscalerResources(ctx context.Context, rev *v1alpha1.Revision) error {
	logger := logging.FromContext(ctx)
	if err := c.deleteAutoscalerDeployment(ctx, rev); err != nil {
Why not set the replica size to 0 for this one as well? (Keep in mind that if we do that, updating controller code will no longer update any existing revision's autoscaler).
For the revision deployment, the reason to keep it is to unblock new pod creation while existing pods are terminating. For the autoscaler deployment, I don't see a reason to keep it.
Keep in mind that if we do that, updating controller code will no longer update any existing revision's autoscaler
@mdemirhan Curious to know why you think changes couldn't be reconciled to update existing autoscaler deployments?
+1 to setting the autoscaler deployment to 0 as well. I hope that will allow the autoscaler to come back up more quickly and therefore respond more quickly to large traffic spikes when scaled to zero. It would also help avoid a similar 30-second dead zone when scaling from 1 to N (e.g. the deployment is being deleted but, because it still exists, passes the controller reconciliation loop).
@dprotaso, I think that he means we won't update the single-tenant autoscaler binary if we just scale to zero instead of creating a deployment anew from the latest autoscaler container passed into the controller through commandline args. That's true. But we are in the process of migrating to an always-on multiple-revision autoscaler. So that issue won't be around for long.
@mdemirhan, @josephburnett You're right. I should keep autoscaler deployment as well. I'll work on that. Thanks!
@dprotaso I was looking at the current implementation of reconcileAutoscalerDeployment: if there is an existing deployment, it will skip reconciliation. So, if we don't delete the deployment but just update the replica count of the existing deployment, the autoscaler will never be updated. Currently though, if something scales to zero, the autoscaler will get the latest bits once activation happens next time.
That being said, I like deterministic behavior, and I think the current behavior is worse because if there is an issue with an updated autoscaler, you will see its negative effects sometime in the future when N->0->1 happens.
We need a better way to upgrade Knative. ko apply -f config/ is too unpredictable for upgrades. I will file an issue about this.
@mdemirhan IMO if we update our sidecars (because of a controller update), we should update deployments during the normal reconciliation. The same if Operators push other config changes (in fact, the sidecars should probably move to ConfigMaps with updates picked up via configmap.Watcher).
My attempt at updating deployments during reconciliation was thwarted by its defaulter, but I'd like to reach a point where we reconcile updates around our controllers for all resources we manage.
I'd also like any Deployment reconciliation we do to follow the pattern of the commented checkAndUpdateDeployment as much as possible (though its implementation may remain heavily commented).
// TODO(mattmoor): Compare the deployments and update if it has changed
// out from under us.
logger.Infof("Found existing deployment %q, updating", deploymentName)
_, err = dc.Update(deployment)
The old code used to skip updating the deployment if it already existed. However, this one will keep updating the deployment objects every 30 seconds. Seems like a regression.
	return err
// TODO(mattmoor): Compare the deployments and update if it has changed
// out from under us. So far the deployment could only be updated for replicas field.
if *existingDeployment.Spec.Replicas == *desiredDeployment.Spec.Replicas {
Something happened to my comments. I think I accidentally deleted them. Here it goes again:
This code will reset the replica count that is set by the autoscaler every 30 seconds, and that seems wrong. If a deployment exists, we should just return.
Yeah, this seems strange to me too. I think the controller should be overwriting everything except the replicas count. We leave that to the Activator and the Autoscaler. And the update is punted for now by the TODO(mattmoor) above.
I am not sure, but I think the reason for not updating deployments every 30 seconds today is that updating deployments even without changes might cause the pods to restart (I don't think that is the case unless the pod spec changes, but I am not 100% sure). @mattmoor is that the reason that deployments are never reconciled today?
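One way to reconcile the two concerns in this thread (keep the deployment spec up to date, but never stomp on the autoscaler's replica count) is to merge the desired spec while preserving the existing replicas, and only write when something other than replicas differs. This is an illustrative sketch with a minimal stand-in struct, not the actual appsv1.DeploymentSpec:

```go
package main

import "fmt"

// deploymentSpec is a tiny stand-in for appsv1.DeploymentSpec.
type deploymentSpec struct {
	Replicas int32
	Image    string
}

// reconcile returns the spec to write and whether an update is needed.
// The desired spec is applied, but the existing replica count is preserved,
// so a 30-second resync never overwrites the autoscaler's scaling decision.
func reconcile(existing, desired deploymentSpec) (deploymentSpec, bool) {
	merged := desired
	merged.Replicas = existing.Replicas // replicas belong to the autoscaler
	return merged, merged != existing
}

func main() {
	existing := deploymentSpec{Replicas: 5, Image: "app:v1"}
	// Image changed: update, but keep the autoscaler-chosen replica count.
	merged, update := reconcile(existing, deploymentSpec{Replicas: 1, Image: "app:v2"})
	fmt.Println(merged, update)
	// Only replicas differ: no update, avoiding a pointless write every resync.
	_, update = reconcile(existing, deploymentSpec{Replicas: 1, Image: "app:v1"})
	fmt.Println(update)
}
```

Under this scheme the controller stays authoritative for everything except replicas, and the "update every 30 seconds" regression disappears because identical specs produce no write.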
// And the deployment is no longer ready, so update that
rev.Status.MarkInactive()
logger.Infof("Updating status with the following conditions %+v", rev.Status.Conditions)
if _, err := c.updateStatus(rev); err != nil {
if err := c.resolver.Resolve(desiredDeployment); err != nil {
	logger.Error("Error resolving deployment", zap.Error(err))
	rev.Status.MarkContainerMissing(err.Error())
	if _, updateErr := c.updateStatus(rev); updateErr != nil {
var (
	elaPodReplicaCount = int32(1)
	elaPodMaxUnavailable = intstr.IntOrString{Type: intstr.Int, IntVal: 1}
	elaPodMaxUnavailable = intstr.IntOrString{Type: intstr.Int, IntVal: 0}
Just curious to know what your motivation was to change this to 0?
	return err
}
if *deployment.Spec.Replicas == 0 {
	logger.Infof("Deployment %s is scaled to 0 already.", deploymentName)
	return err
}
logger.Infof("Successfully scaled deployment %s to 0.", deploymentName)
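The diff above adds an idempotence check: skip the write entirely when the deployment is already at zero replicas. A hedged sketch of that shape, using a plain struct instead of the real client-go deployments interface:

```go
package main

import "fmt"

// deployment is a stand-in for the appsv1.Deployment the real code fetches
// and updates through the Kubernetes API.
type deployment struct {
	Name     string
	Replicas int32
}

// scaleToZero sets replicas to 0, returning false (no API write) when the
// deployment is already scaled down, mirroring the early return in the diff.
func scaleToZero(d *deployment) bool {
	if d.Replicas == 0 {
		fmt.Printf("Deployment %s is scaled to 0 already.\n", d.Name)
		return false
	}
	d.Replicas = 0
	fmt.Printf("Successfully scaled deployment %s to 0.\n", d.Name)
	return true
}

func main() {
	d := &deployment{Name: "my-rev-deployment", Replicas: 3}
	scaleToZero(d) // performs the scale-down
	scaleToZero(d) // no-op: already at zero
}
```

Making the operation a no-op when the state is already correct keeps the reconciliation loop safe to re-run every resync period.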
Labels: map[string]string{
	"testLabel1": "foo",
	"testLabel2": "bar",
	serving.RouteLabelKey: "test-route",
Is the constant serving.RouteLabelKey used anymore?
@akyyy: The following tests failed.
Full PR test history. Your PR dashboard.
I believe this is subsumed by @mdemirhan's new PR.
Fixes #1250
Proposed Changes