[WIP] Stabilize 1->0 case#1251
[WIP] Stabilize 1->0 case#1251akyyy wants to merge 13 commits intoknative:masterfrom akyyy:tearingdown
Conversation
|
The following is the coverage report on pkg/. Say
*TestCoverage feature is being tested, do not rely on any info here yet |
|
/retest |
josephburnett
left a comment
There was a problem hiding this comment.
Just noticed a few nits.
/lgtm
| timeout := 500 * time.Millisecond | ||
|
|
||
| i := 0 | ||
| for ; i < maxRetry; i++ { |
There was a problem hiding this comment.
@isdal had a good point. We should probe for iptable programming by hitting the health check endpoint. That way POST requests are actually retried. But that's fine to do as a separate pull request.
| logger.Info("Deleted service") | ||
| } else { | ||
| // Serving state is RevisionServingStateReserve. Delete the revision service and update | ||
| // the dpeloyment replicas to be 0. |
There was a problem hiding this comment.
Nit: spelling: d[e]ployment
| } | ||
| } | ||
|
|
||
| logger.Info("Scale the deployment to 0") |
There was a problem hiding this comment.
Nit: scal[ing] the deployment to 0
| // Serving state is RevisionServingStateReserve. Delete the revision service and update | ||
| // the dpeloyment replicas to be 0. | ||
| if cond := rev.Status.GetCondition(v1alpha1.RevisionConditionReady); cond != nil { | ||
| if cond.Reason != "Inactive" { |
There was a problem hiding this comment.
I think you can combine lines 497 and 498 into a single if statement.
| // the dpeloyment replicas to be 0. | ||
| if cond := rev.Status.GetCondition(v1alpha1.RevisionConditionReady); cond != nil { | ||
| if cond.Reason != "Inactive" { | ||
| if cond.Reason != "Deactivating" { |
There was a problem hiding this comment.
To reduce the amount of indentation here, how about writing this as a short-circuiting check.
if cond.Reason == "Deactivating" {
return nil
}
...
| return err | ||
| } | ||
|
|
||
| logger.Infof("Deactvaing Deployment %q", deploymentName) |
There was a problem hiding this comment.
Nit: spelling: deact[i]va[t]ing
|
The following is the coverage report on pkg/. Say
*TestCoverage feature is being tested, do not rely on any info here yet |
| } | ||
| logger.Info("Deleted service") | ||
| } else { | ||
| // Serving state is RevisionServingStateReserve. Delete the revision service and update |
There was a problem hiding this comment.
I don't see a deletion below, is this comment correct?
| } else { | ||
| // Serving state is RevisionServingStateReserve. Delete the revision service and update | ||
| // the deployment replicas to be 0. | ||
| if cond := rev.Status.GetCondition(v1alpha1.RevisionConditionReady); cond != nil && cond.Reason != "Inactive" { |
There was a problem hiding this comment.
I believe there are now constants for reasons in revision_types.go ?
There was a problem hiding this comment.
+1. There is also a "Deactivating" reason string above that could be a constant.
There was a problem hiding this comment.
The constants are still strings. I don't see them defined in revision_types.go? The issue is still open. #880
We may address that later.
| *deployment.Spec.Replicas = int32(0) | ||
| _, err = dc.Update(deployment) | ||
| if err != nil { | ||
| logger.Errorf("Error deactivating deployment %q: %s", deploymentName, err) |
There was a problem hiding this comment.
return err ?
(or if that isn't desirable we should still return to avoid seeing the log.Infof success here?
| @@ -48,28 +47,28 @@ type activationHandler struct { | |||
| type retryRoundTripper struct{} | |||
|
|
|||
| func (rrt retryRoundTripper) RoundTrip(r *http.Request) (*http.Response, error) { | |||
There was a problem hiding this comment.
Creating a new transport for every transaction might be problematic. Should we not cache this?
|
The following is the coverage report on pkg/. Say
*TestCoverage feature is being tested, do not rely on any info here yet |
|
/lgtm |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: akyyy, josephburnett, tcnghia Assign the PR to them by writing The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
| }) | ||
| } | ||
|
|
||
| func (rs *RevisionStatus) MarkDeactivating() { |
There was a problem hiding this comment.
This needs unit test coverage.
|
New changes are detected. LGTM label has been removed. |
|
|
||
| for k, v := range revision.ObjectMeta.Labels { | ||
| labels[k] = v | ||
| if k != serving.RouteLabelKey { |
There was a problem hiding this comment.
Why isn't the comment below up here?
| Name: fmt.Sprintf("%s-service", cfgrev.Name), | ||
| Namespace: testNamespace, | ||
| Route: []v1alpha2.DestinationWeight{ | ||
| { |
There was a problem hiding this comment.
collapse as we had on the LHS everywhere please.
| // Serving state is RevisionServingStateReserve. Keep the revision service and update | ||
| // the deployment replicas to be 0. | ||
| if cond := rev.Status.GetCondition(v1alpha1.RevisionConditionReady); cond != nil && cond.Reason != "Inactive" { | ||
| if cond.Reason == "Deactivating" { |
There was a problem hiding this comment.
We should not be predicating logic on cond.Reason, to me this is an indicator that our model for the lifecycle of inactive revisions is incomplete. This is clearly an extension of the logic we currently have in here, but we need to prioritize fixing this
There was a problem hiding this comment.
Yes, we want to do this (#645 (comment)) but before we pivot to the new model, I want what we have at head to not throw 500's. This fixes the last of the 500's that happens just during deactivation.
There was a problem hiding this comment.
Yea, that's an issue out of the scope of this pr.
There was a problem hiding this comment.
Ack. Does that mean that cleaning this up is the top priority after this goes in?
There was a problem hiding this comment.
I can take it after this pr and oncall.
There was a problem hiding this comment.
+1. After we fix this 503 issue, the top priority is migrating us to the new model you described and (hopefully) getting rid of serving state.
|
The following is the coverage report on pkg/. Say
*TestCoverage feature is being tested, do not rely on any info here yet |
|
The following is the coverage report on pkg/. Say
*TestCoverage feature is being tested, do not rely on any info here yet |
vaikas
left a comment
There was a problem hiding this comment.
Just a couple of suggestions, in addition to what Matt is requesting.
| logger.Errorf("Failed to get deployment %q", deploymentName) | ||
| return err | ||
| } | ||
|
|
There was a problem hiding this comment.
Nit, you could check the deployment here and see if Replicas is already 0 and then shortcircuit without updating?
| _, err = kubeClient.AppsV1().Deployments(testNamespace).Get(expectedDeploymentName, metav1.GetOptions{}) | ||
|
|
||
| if err != nil { | ||
| t.Fatalf("Expected k8s deployment to be there but it was gone: %s/%s", testNamespace, expectedDeploymentName) |
There was a problem hiding this comment.
Seems like this could fail for another reason as well and the deployment could be there? Perhaps check isNotFound and then fail with this message?
| // A revision is considered inactive (yet) if it's in | ||
| // "Inactive" condition or "Activating" condition. | ||
| // "Inactive" condition, "Activating" or "Deactivating" condition. | ||
| logger.Infof("cond: %v", cond) |
There was a problem hiding this comment.
seems like it's not only the cond, but status that makes this condition happen. I am little confused about this 'double-checking'.
There was a problem hiding this comment.
This could be in part of issue #645
#645 (comment)
I'll fix it in that issue.
| // end up with multiple replica sets. | ||
| if k != serving.RouteLabelKey { | ||
| labels[k] = v | ||
| } |
There was a problem hiding this comment.
(thinking about this some more) I'm not thrilled about the Revision controller needing to know things about the Route controller (these are essentially the only references to route in the directory).
This also feels like a band-aid for the one case we know about now, but doesn't solve the more general problem that label mutations could trigger, right?
There was a problem hiding this comment.
Yes, this is just a temporary solution. I opened issue #1293 and will add a link of this piece of code to the issue.
| err = c.deleteService(ctx, rev) | ||
| if err != nil { | ||
| logger.Error("Failed to delete k8s service", zap.Error(err)) | ||
| if rev.Spec.ServingState == v1alpha1.RevisionServingStateRetired { |
There was a problem hiding this comment.
@josephburnett I thought we just added a comment indicating this state was unused? Should we just delete the code?
There was a problem hiding this comment.
I agree it's cleaner to delete Retired state. I added this comment to issue #645
| // Then create the actual route rules. | ||
| logger.Info("Creating Istio route rules") | ||
| revisionRoutes, err := c.createOrUpdateRouteRules(ctx, route, configMap, revMap) | ||
| revisionRoutes, inactiveRev, err := c.createOrUpdateRouteRules(ctx, route, configMap, revMap) |
There was a problem hiding this comment.
It took some digging for me to determine what inactiveRev was, whereas before the method and return value were fairly self-describing. What I don't get is why there is only one inactive revision (name?) returned from this method. The traffic block could have N revisions, all reserve.
There was a problem hiding this comment.
For now, activator only forward traffic to one inactive revision with the largest traffic weight. More details are here. #882
| return nil, err | ||
| } | ||
| rev.Status.MarkInactive() | ||
| if _, err = revisionClient.Update(rev); err != nil { |
There was a problem hiding this comment.
I didn't notice this before, but we're updating Revision.Status from the Route controller. That makes me even more uncomfortable that predicating on cond.Reason.
@josephburnett I understand wanting to root out 5XXs within our current model, but cleaning this up needs to be a top priority. We've known about this problem for 2+ months now, I'd really like to see a proper fix.
There was a problem hiding this comment.
So I find this aspect of the PR troubling (chatted a bit with @vaikas-google too), and I think we should reconsider our choice of band-aid to fix this.
I'd initially proposed this to remediate this, and it got a bit lost in broader thinking about how we make autoscaling more extensible.
If we are thinking about a nearer term fix, I think we should go with this proposal instead, eliding the aspects around Conditions and Retired for now.
| }, | ||
| Route: []v1alpha2.DestinationWeight{getActivatorDestinationWeight(100)}, | ||
| Route: []v1alpha2.DestinationWeight{ | ||
| { |
There was a problem hiding this comment.
you can collapse this as well.
| // Serving state is RevisionServingStateReserve. Keep the revision service and update | ||
| // the deployment replicas to be 0. | ||
| if cond := rev.Status.GetCondition(v1alpha1.RevisionConditionReady); cond != nil && cond.Reason != "Inactive" { | ||
| if cond.Reason == "Deactivating" { |
There was a problem hiding this comment.
Ack. Does that mean that cleaning this up is the top priority after this goes in?
|
The following is the coverage report on pkg/. Say
*TestCoverage feature is being tested, do not rely on any info here yet |
|
I assigned #645 to me and will work on it right after finishing my oncall and one half done item. |
|
The following is the coverage report on pkg/. Say
*TestCoverage feature is being tested, do not rely on any info here yet |
|
The following is the coverage report on pkg/. Say
*TestCoverage feature is being tested, do not rely on any info here yet |
|
Based on discussion, I'll add a serving state ActiveReserve (this name pending change) which will be used by autoscaler to initiate scale to zero. After the route controller set routerules correspondingly to activator, the route controller will set revision serving state to Reserve. Then the revision controller will tear down k8s resources. This was proposed here. I'll split this pr into two prs. |
|
/test pull-knative-serving-go-coverage |
|
@akyyy: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Signed-off-by: red-hat-konflux-kflux-prd-rh02 <190377777+red-hat-konflux-kflux-prd-rh02[bot]@users.noreply.github.com> Co-authored-by: red-hat-konflux-kflux-prd-rh02[bot] <190377777+red-hat-konflux-kflux-prd-rh02[bot]@users.noreply.github.com>
Fixes #
#1250
Proposed Changes