Skip to content

Keep the deployment when scale to zero#1383

Merged
google-prow-robot merged 4 commits intoknative:masterfrom
mdemirhan:deployfix
Jun 28, 2018
Merged

Keep the deployment when scale to zero#1383
google-prow-robot merged 4 commits intoknative:masterfrom
mdemirhan:deployfix

Conversation

@mdemirhan
Copy link
Copy Markdown
Contributor

This is a continuation of Yao's PR (#1320) - it addresses the feedback posted in that PR. Sorry for forking this to yet another PR, but Yao got pulled into another issue and I took over finalizing this one.

  • Instead of deleting the revision service and the deployment while scaling to 0, we need to keep the deployment and just set its replicas to 0. This way, the deployment can have 1 pod in terminating state while another pod is spinning up. So we don't have to wait for the pod to be deleted.

  • To be able to update the deployment, we need to remove the route label from the revision. See issue The revision labels should be a static set #1293 for more details.

  • Remove some of the defaults for deployments (such as update strategy, max unavailability & surge) and use k8s defaults.

Fixes #1250 and #1293

…caling to 0, we need to keep the deployment and just set its replicas to 0. This way, the deployment can have 1 pod in terminating state while another pod is spinning up. So we don't have to wait for the pod to be deleted.

* To be able to update the deployment, we need to remove the route label from the revision. See issue #1293 for more details.
* Remove some of the defaults for deployments (such as update strategy, max unavailability & surge) and use k8s defaults.
@google-prow-robot google-prow-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jun 27, 2018
Copy link
Copy Markdown
Member

@mattmoor mattmoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally I think this looks good to me. (I'll hold /lgtm /approve for comments)

I would characterize essentially of my comments as largely stylistic nits to try and keep the broader controller codebase in a consistent style.

I'm going to post this PR in #api so see if folks have any thoughts on dropping the Route label from Revision, but I don't know that it's giving us anything but headaches :)

thanks for picking this up.
-M

rollingUpdateConfig := appsv1.RollingUpdateDeployment{
MaxUnavailable: &intstr.IntOrString{Type: intstr.Int, IntVal: 1},
MaxSurge: &intstr.IntOrString{Type: intstr.Int, IntVal: 1},
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you just removing defaults?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see from the PR description you are, thanks for the cleanup.

Comment thread pkg/controller/revision/revision.go Outdated
case v1alpha1.RevisionServingStateActive, v1alpha1.RevisionServingStateReserve:
// When Active or Reserved, deployment should exist and have a particular specification.
if err != nil {
if !apierrs.IsNotFound(err) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found the previous structure more readable, since it avoids this double negative "not not found" (see the current episode of testing on the toilet :D )

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this is what you were talking about earlier, if so let's chat tomorrow.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue I had with the previous one was the fact that err that gets overridden is then checked in the else block. While that is not an issue the way it was in the previous code (because if apierrs.IsNotFound(err) block returns), any change in the if block not returning could easily confuse the developer on what is happening. For me personally, trying to follow the err and see what it is at what point was confusing enough that I spent a couple of mins to understand the logic.

I can keep the original structure but change the name of the first err to something else in order to prevent the confusion. let me know.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my bias would be towards renaming the long-lived err to something more descriptive.

Comment thread pkg/controller/revision/revision.go Outdated
// Defaulter should have set this field.
logger.Errorf("Deployment has nil Replicas. This is unexpected. Reconciling the deployment: %v", deployment)
deployment.Spec.Replicas = new(int32)
changed = WasChanged
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd drop this line. If for some reason Kubernetes changes something (e.g. nil == 0), we could end up in a reconciliation spin loop for scaled to zero Revisions.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nil defaults to 1 today, not 0 unfortunately. I can take that into account, but either way, if k8s change the defaults, our code has to change to understand that. LMKWYT

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll tweak my example below to incorporate this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is moot because i am using the code block suggested below. So please ignore.

Comment thread pkg/controller/revision/revision.go Outdated
var replicaCount int32 = 1
if rev.Spec.ServingState == v1alpha1.RevisionServingStateReserve {
replicaCount = 0
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Deployments, we do this calculation in checkAndUpdateDeployment, so I think I'd prefer the symmetry of sinking this logic into createDeployment here and createAutoscalerDeployment below.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

Comment thread pkg/controller/revision/revision.go Outdated

logger.Infof("Reconciling deployment %v to update the replica count to %v", deployment.Name, *deployment.Spec.Replicas)
d, err := c.KubeClientSet.AppsV1().Deployments(deployment.Namespace).Update(deployment)
return d, changed, err
Copy link
Copy Markdown
Member

@mattmoor mattmoor Jun 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are parts of the commented body I like, I thought it simplest to just show what I'd write than leave a bunch of nits.

{
	logger := logging.FromContext(ctx)

	// TODO(mattmoor): Generalize this to reconcile discrepancies vs. what
	// MakeServingDeployment() would produce.
	desiredDeployment := deployment.DeepCopy()
	if desiredDeployment.Spec.Replicas == nil {
		// Replicas defaults to one.
		var one int32 = 1
		desiredDeployment.Spec.Replicas = &one
	}
	if rev.Spec.ServingState == v1alpha1.RevisionServingStateActive && *desiredDeployment.Spec.Replicas == 0 {
		*desiredDeployment.Spec.Replicas = 1
	} else if rev.Spec.ServingState == v1alpha1.RevisionServingStateReserve && *desiredDeployment.Spec.Replicas != 0 {
		*desiredDeployment.Spec.Replicas = 0
	}

	if equality.Semantic.DeepEqual(desiredDeployment.Spec, deployment.Spec) {
		return deployment, Unchanged, nil
	}
	logger.Infof("Reconciling deployment diff (-desired, +observed): %v",
		cmp.Diff(desiredDeployment.Spec, deployment.Spec, cmpopts.IgnoreUnexported(resource.Quantity{})))
	deployment.Spec = desiredDeployment.Spec
	d, err := c.KubeClientSet.AppsV1().Deployments(deployment.Namespace).Update(deployment)
	return d, WasChanged, err
}

I prefer the more generalized determination of Changed/Unchanged based on equality.Semantic.DeepEqual, and I prefer the logging based on cmp.Diff. Assuming we can get the construction of desiredDeployment fixed (to avoid fighting with the defaulter), it means these surrounding elements won't change.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will change the code to that, but this is an a lot heavier implementation than the current one. We are making a deep copy of an object and running a full blown equality checker every 30 seconds for every revision. My worry is the overhead of this with 100s of revisions.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'd like to see us put this under load and see where we fall short. Until then, I think my bias is towards readability and debuggability.

Comment thread pkg/controller/revision/revision.go Outdated
case v1alpha1.RevisionServingStateActive, v1alpha1.RevisionServingStateReserve:
// When Active or Reserved, Autoscaler deployment should exist and have a particular specification.
if err != nil {
if !apierrs.IsNotFound(err) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comments re: double-negative and keeping the existing structure.

Comment thread pkg/controller/revision/revision.go Outdated
var replicaCount int32 = 1
if rev.Spec.ServingState == v1alpha1.RevisionServingStateReserve {
replicaCount = 0
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comment above about sinking this into the createAutoscalerDeployment method for symmetry with checkAndUpdateDeployment.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

}
if err := c.setLabelForGivenRevisions(ctx, route, revMap); err != nil {
return nil, err
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make sure that the docs in docs/specs/... don't have examples with Route labels?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Couldn't find any references to knative.dev/route in labels in the spec.

@mattmoor mattmoor self-assigned this Jun 28, 2018
@mdemirhan mdemirhan requested a review from akyyy June 28, 2018 15:51
mdemirhan added 2 commits June 28, 2018 09:32
# Conflicts:
#	pkg/controller/revision/pod.go
#	pkg/controller/revision/revision.go
#	pkg/controller/revision/revision_test.go
#	pkg/controller/route/route_test.go
Copy link
Copy Markdown
Member

@mattmoor mattmoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one comment, but otherwise LGTM. thanks for changes.

Comment thread pkg/controller/revision/revision.go Outdated
// MakeServingDeployment() would produce.
desiredDeployment := deployment.DeepCopy()
if desiredDeployment.Spec.Replicas == nil {
desiredDeployment.Spec.Replicas = new(int32)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on what you said, I think this should be:

one := 1
desiredDeployment.Spec.Replicas = &one

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Will fix.


// computeRevisionRoutes computes RevisionRoute for a route object. If there is one or more inactive revisions and enableScaleToZero
// is true, a route rule with the activator service as the destination will be added. It returns the revision routes, the inactive
// revision name to which the activator should forward requests to, and error if there is any.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this function, if you search
cond.Reason == "Activating" && cond.Status == corev1.ConditionUnknown,
we're routing traffic to activator when it's Inactive or Activating. I think it's safe to replace "Activating" by "Deploying". Since you're changing revision condition in reconcileDeployment func in revision.go.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed this to "Updating" which captures both activating and deactivating cases. This is a hacky fix that will be addressed correctly as part of removing our reliance on revision conditions.

}
if changed == WasChanged {
logger.Infof("Updated deployment %q", deploymentName)
rev.Status.MarkDeploying("Updating")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to be careful about the revision condition reasons. They're used to route traffic today in route.go.

*desiredDeployment.Spec.Replicas = 0
}

if equality.Semantic.DeepEqual(desiredDeployment.Spec, deployment.Spec) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we only update replicas field for deployments today, it's cheaper to just compare that value? Same comment for the log.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd asked for this, since I'm hoping that isn't true for long :)

@akyyy
Copy link
Copy Markdown
Contributor

akyyy commented Jun 28, 2018

Thanks for taking over this while I'm oncall!

@mdemirhan
Copy link
Copy Markdown
Contributor Author

/hold
Holding this until my manual end to end tests are complete.

@google-prow-robot google-prow-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 28, 2018
@knative-metrics-robot
Copy link
Copy Markdown

The following is the coverage report on pkg/. Say /test pull-knative-serving-go-coverage to run the coverage report again

File Old Coverage New Coverage Delta
pkg/controller/route/route.go 78.7% 80.4% 1.7
pkg/controller/revision/pod.go 95.0% 94.9% -0.1
pkg/controller/revision/revision.go 71.8% 75.0% 3.2

*TestCoverage feature is being tested, do not rely on any info here yet

@mdemirhan
Copy link
Copy Markdown
Contributor Author

/cancel hold

@mdemirhan mdemirhan removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 28, 2018
@mdemirhan
Copy link
Copy Markdown
Contributor Author

That didn't work :) Cancelling the hold as my manual testing seems to be working fine.

Copy link
Copy Markdown
Contributor

@akyyy akyyy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@google-prow-robot google-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 28, 2018
Copy link
Copy Markdown
Member

@mattmoor mattmoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@google-prow-robot
Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mattmoor, mdemirhan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-prow-robot google-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 28, 2018
}

func TestSetLabelNotChangeConfigurationAndRevisionLabelIfLabelExists(t *testing.T) {
func TestSetLabelNotChangeConfigurationLabelIfLabelExists(t *testing.T) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we call this test TestRouteEventDoesNotUpdateConfiguration

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}
logger.Infof("Reconciling deployment diff (-desired, +observed): %v",
cmp.Diff(desiredDeployment.Spec, deployment.Spec, cmpopts.IgnoreUnexported(resource.Quantity{})))
deployment.Spec = desiredDeployment.Spec
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assignment seems unnecessary when we can just use desiredDeployment in the Update call

unless I'm missing something

// Activate the revision. Replicas should increase to 1
rev.Spec.ServingState = v1alpha1.RevisionServingStateActive
updateRevision(t, kubeClient, kubeInformer, elaClient, elaInformer, controller, rev)
d1, d2 = getDeployments()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test could have better readability - maybe not in this PR

ie. I'd actually prefer explicit assertions

assertRevisionDeploymentHasReplicaCount(t, 1)
assertAutoscalingDeploymentHasReplicaCount(t, 1)

@google-prow-robot google-prow-robot merged commit da3028b into knative:master Jun 28, 2018
@mdemirhan mdemirhan deleted the deployfix branch June 28, 2018 21:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants