Keep the deployment when scale to zero #1383

google-prow-robot merged 4 commits into knative:master from mdemirhan:deployfix
Conversation
Instead of deleting the revision service and the deployment while scaling to 0, we need to keep the deployment and just set its replicas to 0. This way, the deployment can have one pod in a terminating state while another pod is spinning up, so we don't have to wait for the pod to be deleted.
* To be able to update the deployment, we need to remove the route label from the revision. See issue #1293 for more details.
* Remove some of the defaults for deployments (such as update strategy, max unavailability & surge) and use the k8s defaults.
mattmoor
left a comment
Generally I think this looks good to me. (I'll hold /lgtm /approve for comments)
I would characterize essentially all of my comments as largely stylistic nits to try and keep the broader controller codebase in a consistent style.
I'm going to post this PR in #api to see if folks have any thoughts on dropping the Route label from Revision, but I don't know that it's giving us anything but headaches :)
Thanks for picking this up.
-M
rollingUpdateConfig := appsv1.RollingUpdateDeployment{
	MaxUnavailable: &intstr.IntOrString{Type: intstr.Int, IntVal: 1},
	MaxSurge:       &intstr.IntOrString{Type: intstr.Int, IntVal: 1},
}
Are you just removing defaults?
I see from the PR description you are, thanks for the cleanup.
case v1alpha1.RevisionServingStateActive, v1alpha1.RevisionServingStateReserve:
	// When Active or Reserved, deployment should exist and have a particular specification.
	if err != nil {
		if !apierrs.IsNotFound(err) {
I found the previous structure more readable, since it avoids this double negative "not not found" (see the current episode of testing on the toilet :D )
Perhaps this is what you were talking about earlier, if so let's chat tomorrow.
The issue I had with the previous structure is that the err that gets overridden is then checked in the else block. While that is not an issue the way the previous code was written (because the if apierrs.IsNotFound(err) block returns), any change that makes the if block not return could easily confuse a developer about what is happening. For me personally, trying to follow err and see what it is at each point was confusing enough that I spent a couple of minutes understanding the logic.
I can keep the original structure but change the name of the first err to something else in order to prevent the confusion. Let me know.
I think my bias would be towards renaming the long-lived err to something more descriptive.
// Defaulter should have set this field.
logger.Errorf("Deployment has nil Replicas. This is unexpected. Reconciling the deployment: %v", deployment)
deployment.Spec.Replicas = new(int32)
changed = WasChanged
I'd drop this line. If for some reason Kubernetes changes something (e.g. nil == 0), we could end up in a reconciliation spin loop for scaled to zero Revisions.
nil defaults to 1 today, not 0, unfortunately. I can take that into account, but either way, if k8s changes the defaults, our code has to change to understand that. LMKWYT
I'll tweak my example below to incorporate this.
This is moot because I am using the code block suggested below, so please ignore.
var replicaCount int32 = 1
if rev.Spec.ServingState == v1alpha1.RevisionServingStateReserve {
	replicaCount = 0
}
For Deployments, we do this calculation in checkAndUpdateDeployment, so I think I'd prefer the symmetry of sinking this logic into createDeployment here and createAutoscalerDeployment below.
logger.Infof("Reconciling deployment %v to update the replica count to %v", deployment.Name, *deployment.Spec.Replicas)
d, err := c.KubeClientSet.AppsV1().Deployments(deployment.Namespace).Update(deployment)
return d, changed, err
There are parts of the commented body I like, I thought it simplest to just show what I'd write than leave a bunch of nits.
{
logger := logging.FromContext(ctx)
// TODO(mattmoor): Generalize this to reconcile discrepancies vs. what
// MakeServingDeployment() would produce.
desiredDeployment := deployment.DeepCopy()
if desiredDeployment.Spec.Replicas == nil {
// Replicas defaults to one.
var one int32 = 1
desiredDeployment.Spec.Replicas = &one
}
if rev.Spec.ServingState == v1alpha1.RevisionServingStateActive && *desiredDeployment.Spec.Replicas == 0 {
*desiredDeployment.Spec.Replicas = 1
} else if rev.Spec.ServingState == v1alpha1.RevisionServingStateReserve && *desiredDeployment.Spec.Replicas != 0 {
*desiredDeployment.Spec.Replicas = 0
}
if equality.Semantic.DeepEqual(desiredDeployment.Spec, deployment.Spec) {
return deployment, Unchanged, nil
}
logger.Infof("Reconciling deployment diff (-desired, +observed): %v",
cmp.Diff(desiredDeployment.Spec, deployment.Spec, cmpopts.IgnoreUnexported(resource.Quantity{})))
deployment.Spec = desiredDeployment.Spec
d, err := c.KubeClientSet.AppsV1().Deployments(deployment.Namespace).Update(deployment)
return d, WasChanged, err
}
I prefer the more generalized determination of Changed/Unchanged based on equality.Semantic.DeepEqual, and I prefer the logging based on cmp.Diff. Assuming we can get the construction of desiredDeployment fixed (to avoid fighting with the defaulter), it means these surrounding elements won't change.
I will change the code to that, but this is a much heavier implementation than the current one. We are making a deep copy of an object and running a full-blown equality check every 30 seconds for every revision. My worry is the overhead of this with hundreds of revisions.
Yeah, I'd like to see us put this under load and see where we fall short. Until then, I think my bias is towards readability and debuggability.
case v1alpha1.RevisionServingStateActive, v1alpha1.RevisionServingStateReserve:
	// When Active or Reserved, Autoscaler deployment should exist and have a particular specification.
	if err != nil {
		if !apierrs.IsNotFound(err) {
same comments re: double-negative and keeping the existing structure.
var replicaCount int32 = 1
if rev.Spec.ServingState == v1alpha1.RevisionServingStateReserve {
	replicaCount = 0
}
see comment above about sinking this into the createAutoscalerDeployment method for symmetry with checkAndUpdateDeployment.
}
if err := c.setLabelForGivenRevisions(ctx, route, revMap); err != nil {
	return nil, err
}
Can you make sure that the docs in docs/specs/... don't have examples with Route labels?
Done. Couldn't find any references to knative.dev/route in labels in the spec.
# Conflicts:
#	pkg/controller/revision/pod.go
#	pkg/controller/revision/revision.go
#	pkg/controller/revision/revision_test.go
#	pkg/controller/route/route_test.go
mattmoor
left a comment
One comment, but otherwise LGTM. Thanks for the changes.
// MakeServingDeployment() would produce.
desiredDeployment := deployment.DeepCopy()
if desiredDeployment.Spec.Replicas == nil {
	desiredDeployment.Spec.Replicas = new(int32)
Based on what you said, I think this should be:
one := int32(1)
desiredDeployment.Spec.Replicas = &one
Good catch. Will fix.
// computeRevisionRoutes computes RevisionRoute for a route object. If there is one or more inactive revisions and enableScaleToZero
// is true, a route rule with the activator service as the destination will be added. It returns the revision routes, the inactive
// revision name to which the activator should forward requests to, and error if there is any.
In this function, if you search for
cond.Reason == "Activating" && cond.Status == corev1.ConditionUnknown,
we're routing traffic to the activator when it's Inactive or Activating. I think it's safe to replace "Activating" with "Deploying", since you're changing the revision condition in the reconcileDeployment func in revision.go.
Changed this to "Updating" which captures both activating and deactivating cases. This is a hacky fix that will be addressed correctly as part of removing our reliance on revision conditions.
}
if changed == WasChanged {
	logger.Infof("Updated deployment %q", deploymentName)
	rev.Status.MarkDeploying("Updating")
We need to be careful about the revision condition reasons. They're used to route traffic today in route.go.
	*desiredDeployment.Spec.Replicas = 0
}

if equality.Semantic.DeepEqual(desiredDeployment.Spec, deployment.Spec) {
Since we only update the replicas field for deployments today, it's cheaper to just compare that value? Same comment for the log.
I'd asked for this, since I'm hoping that isn't true for long :)
Thanks for taking over this while I'm oncall!

/hold

The following is the coverage report on pkg/. Say
*TestCoverage feature is being tested, do not rely on any info here yet

/cancel hold

That didn't work :) Cancelling the hold as my manual testing seems to be working fine.

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mattmoor, mdemirhan

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
}

func TestSetLabelNotChangeConfigurationAndRevisionLabelIfLabelExists(t *testing.T) {
func TestSetLabelNotChangeConfigurationLabelIfLabelExists(t *testing.T) {
Can we call this test TestRouteEventDoesNotUpdateConfiguration
You should probably then delete revision update reactor here: https://github.com/knative/serving/pull/1383/files#diff-e256c737a20333c1e37bb5452c9df333R1154
}
logger.Infof("Reconciling deployment diff (-desired, +observed): %v",
	cmp.Diff(desiredDeployment.Spec, deployment.Spec, cmpopts.IgnoreUnexported(resource.Quantity{})))
deployment.Spec = desiredDeployment.Spec
This assignment seems unnecessary when we can just use desiredDeployment in the Update call, unless I'm missing something.
// Activate the revision. Replicas should increase to 1
rev.Spec.ServingState = v1alpha1.RevisionServingStateActive
updateRevision(t, kubeClient, kubeInformer, elaClient, elaInformer, controller, rev)
d1, d2 = getDeployments()
This test could have better readability, though maybe not in this PR.
i.e. I'd actually prefer explicit assertions:
assertRevisionDeploymentHasReplicaCount(t, 1)
assertAutoscalingDeploymentHasReplicaCount(t, 1)
This is a continuation of Yao's PR (#1320) - it addresses the feedback posted in that PR. Sorry for forking this to yet another PR, but Yao got pulled into another issue and I took over finalizing this one.
Instead of deleting the revision service and the deployment while scaling to 0, we need to keep the deployment and just set its replicas to 0. This way, the deployment can have 1 pod in terminating state while another pod is spinning up. So we don't have to wait for the pod to be deleted.
To be able to update the deployment, we need to remove the route label from the revision. See issue #1293 (The revision labels should be a static set) for more details.
Remove some of the defaults for deployments (such as update strategy, max unavailability & surge) and use k8s defaults.
Fixes #1250 and #1293