Keep the deployment when scale to zero #1383

google-prow-robot merged 4 commits into knative:master from mdemirhan:deployfix
Conversation
Instead of deleting the revision service and the deployment while scaling to 0, we need to keep the deployment and just set its replicas to 0. This way, the deployment can have one pod in a terminating state while another pod is spinning up, so we don't have to wait for the pod to be deleted.
* To be able to update the deployment, we need to remove the route label from the revision. See issue #1293 for more details.
* Remove some of the defaults for deployments (such as update strategy, max unavailability & surge) and use the k8s defaults.
mattmoor
left a comment
Generally I think this looks good to me. (I'll hold /lgtm /approve for comments)
I would characterize essentially all of my comments as largely stylistic nits to try and keep the broader controller codebase in a consistent style.
I'm going to post this PR in #api to see if folks have any thoughts on dropping the Route label from Revision, but I don't know that it's giving us anything but headaches :)
Thanks for picking this up.
-M
rollingUpdateConfig := appsv1.RollingUpdateDeployment{
	MaxUnavailable: &intstr.IntOrString{Type: intstr.Int, IntVal: 1},
	MaxSurge:       &intstr.IntOrString{Type: intstr.Int, IntVal: 1},
}
Are you just removing defaults?
I see from the PR description you are, thanks for the cleanup.
case v1alpha1.RevisionServingStateActive, v1alpha1.RevisionServingStateReserve:
	// When Active or Reserved, deployment should exist and have a particular specification.
	if err != nil {
		if !apierrs.IsNotFound(err) {
I found the previous structure more readable, since it avoids this double negative "not not found" (see the current episode of testing on the toilet :D )
Perhaps this is what you were talking about earlier, if so let's chat tomorrow.
The issue I had with the previous structure is that the err that gets overridden is then checked in the else block. While that is not an issue the way the previous code was written (because the if apierrs.IsNotFound(err) block returns), any change that makes the if block not return could easily confuse a developer about what is happening. For me personally, trying to follow err and see what it is at each point was confusing enough that I spent a couple of minutes understanding the logic.
I can keep the original structure but change the name of the first err to something else in order to prevent the confusion. Let me know.
I think my bias would be towards renaming the long-lived err to something more descriptive.
// Defaulter should have set this field.
logger.Errorf("Deployment has nil Replicas. This is unexpected. Reconciling the deployment: %v", deployment)
deployment.Spec.Replicas = new(int32)
changed = WasChanged
I'd drop this line. If for some reason Kubernetes changes something (e.g. nil == 0), we could end up in a reconciliation spin loop for scaled to zero Revisions.
nil defaults to 1 today, not 0, unfortunately. I can take that into account, but either way, if k8s changes the defaults, our code has to change to understand that. LMKWYT
I'll tweak my example below to incorporate this.
This is moot because I am using the code block suggested below, so please ignore.
var replicaCount int32 = 1
if rev.Spec.ServingState == v1alpha1.RevisionServingStateReserve {
	replicaCount = 0
}
For Deployments, we do this calculation in checkAndUpdateDeployment, so I think I'd prefer the symmetry of sinking this logic into createDeployment here and createAutoscalerDeployment below.
logger.Infof("Reconciling deployment %v to update the replica count to %v", deployment.Name, *deployment.Spec.Replicas)
d, err := c.KubeClientSet.AppsV1().Deployments(deployment.Namespace).Update(deployment)
return d, changed, err
There are parts of the commented body I like, I thought it simplest to just show what I'd write than leave a bunch of nits.
{
logger := logging.FromContext(ctx)
// TODO(mattmoor): Generalize this to reconcile discrepancies vs. what
// MakeServingDeployment() would produce.
desiredDeployment := deployment.DeepCopy()
if desiredDeployment.Spec.Replicas == nil {
// Replicas defaults to one.
var one int32 = 1
desiredDeployment.Spec.Replicas = &one
}
if rev.Spec.ServingState == v1alpha1.RevisionServingStateActive && *desiredDeployment.Spec.Replicas == 0 {
*desiredDeployment.Spec.Replicas = 1
} else if rev.Spec.ServingState == v1alpha1.RevisionServingStateReserve && *desiredDeployment.Spec.Replicas != 0 {
*desiredDeployment.Spec.Replicas = 0
}
if equality.Semantic.DeepEqual(desiredDeployment.Spec, deployment.Spec) {
return deployment, Unchanged, nil
}
logger.Infof("Reconciling deployment diff (-desired, +observed): %v",
cmp.Diff(desiredDeployment.Spec, deployment.Spec, cmpopts.IgnoreUnexported(resource.Quantity{})))
deployment.Spec = desiredDeployment.Spec
d, err := c.KubeClientSet.AppsV1().Deployments(deployment.Namespace).Update(deployment)
return d, WasChanged, err
}
I prefer the more generalized determination of Changed/Unchanged based on equality.Semantic.DeepEqual, and I prefer the logging based on cmp.Diff. Assuming we can get the construction of desiredDeployment fixed (to avoid fighting with the defaulter), it means these surrounding elements won't change.
I will change the code to that, but this is a much heavier implementation than the current one. We are making a deep copy of an object and running a full-blown equality check every 30 seconds for every revision. My worry is the overhead of this with hundreds of revisions.
Yeah, I'd like to see us put this under load and see where we fall short. Until then, I think my bias is towards readability and debuggability.
case v1alpha1.RevisionServingStateActive, v1alpha1.RevisionServingStateReserve:
	// When Active or Reserved, Autoscaler deployment should exist and have a particular specification.
	if err != nil {
		if !apierrs.IsNotFound(err) {
same comments re: double-negative and keeping the existing structure.
var replicaCount int32 = 1
if rev.Spec.ServingState == v1alpha1.RevisionServingStateReserve {
	replicaCount = 0
}
see comment above about sinking this into the createAutoscalerDeployment method for symmetry with checkAndUpdateDeployment.
}
if err := c.setLabelForGivenRevisions(ctx, route, revMap); err != nil {
	return nil, err
}
Can you make sure that the docs in docs/specs/... don't have examples with Route labels?
Done. Couldn't find any references to knative.dev/route in labels in the spec.
# Conflicts:
#	pkg/controller/revision/pod.go
#	pkg/controller/revision/revision.go
#	pkg/controller/revision/revision_test.go
#	pkg/controller/route/route_test.go
mattmoor
left a comment
One comment, but otherwise LGTM. Thanks for the changes.
// MakeServingDeployment() would produce.
desiredDeployment := deployment.DeepCopy()
if desiredDeployment.Spec.Replicas == nil {
	desiredDeployment.Spec.Replicas = new(int32)
Based on what you said, I think this should be:
one := int32(1)
desiredDeployment.Spec.Replicas = &one
Good catch. Will fix.
// computeRevisionRoutes computes RevisionRoute for a route object. If there is one or more inactive revisions and enableScaleToZero
// is true, a route rule with the activator service as the destination will be added. It returns the revision routes, the inactive
// revision name to which the activator should forward requests to, and error if there is any.
In this function, if you search for
cond.Reason == "Activating" && cond.Status == corev1.ConditionUnknown,
we're routing traffic to the activator when it's Inactive or Activating. I think it's safe to replace "Activating" with "Deploying", since you're changing the revision condition in the reconcileDeployment func in revision.go.
Changed this to "Updating" which captures both activating and deactivating cases. This is a hacky fix that will be addressed correctly as part of removing our reliance on revision conditions.
}
if changed == WasChanged {
	logger.Infof("Updated deployment %q", deploymentName)
	rev.Status.MarkDeploying("Updating")
We need to be careful about the revision condition reasons. They're used to route traffic today in route.go.
	*desiredDeployment.Spec.Replicas = 0
}

if equality.Semantic.DeepEqual(desiredDeployment.Spec, deployment.Spec) {
Since we only update the replicas field for deployments today, it's cheaper to just compare that value? Same comment for the log.
I'd asked for this, since I'm hoping that isn't true for long :)
Thanks for taking over this while I'm oncall!

/hold

The following is the coverage report on pkg/. Say
*TestCoverage feature is being tested, do not rely on any info here yet

/cancel hold

That didn't work :) Cancelling the hold as my manual testing seems to be working fine.

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mattmoor, mdemirhan

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
}

func TestSetLabelNotChangeConfigurationAndRevisionLabelIfLabelExists(t *testing.T) {
func TestSetLabelNotChangeConfigurationLabelIfLabelExists(t *testing.T) {
Can we call this test TestRouteEventDoesNotUpdateConfiguration
You should probably then delete revision update reactor here: https://github.com/knative/serving/pull/1383/files#diff-e256c737a20333c1e37bb5452c9df333R1154
}
logger.Infof("Reconciling deployment diff (-desired, +observed): %v",
	cmp.Diff(desiredDeployment.Spec, deployment.Spec, cmpopts.IgnoreUnexported(resource.Quantity{})))
deployment.Spec = desiredDeployment.Spec
This assignment seems unnecessary when we can just use desiredDeployment in the Update call, unless I'm missing something.
// Activate the revision. Replicas should increase to 1
rev.Spec.ServingState = v1alpha1.RevisionServingStateActive
updateRevision(t, kubeClient, kubeInformer, elaClient, elaInformer, controller, rev)
d1, d2 = getDeployments()
This test could have better readability, though maybe not in this PR.
i.e. I'd actually prefer explicit assertions:
assertRevisionDeploymentHasReplicaCount(t, 1)
assertAutoscalingDeploymentHasReplicaCount(t, 1)
This is a continuation of Yao's PR (#1320) - it addresses the feedback posted in that PR. Sorry for forking this to yet another PR, but Yao got pulled into another issue and I took over finalizing this one.
Instead of deleting the revision service and the deployment while scaling to 0, we need to keep the deployment and just set its replicas to 0. This way, the deployment can have 1 pod in terminating state while another pod is spinning up. So we don't have to wait for the pod to be deleted.
To be able to update the deployment, we need to remove the route label from the revision. See issue #1293 (The revision labels should be a static set) for more details.
Remove some of the defaults for deployments (such as update strategy, max unavailability & surge) and use k8s defaults.
Fixes #1250 and #1293