Fix crash-looping pods take a long time to terminate/clean #14607

delphamk wants to merge 2 commits into knative:main
Conversation
Hi @andrew-delph. Thanks for your PR. I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

[APPROVALNOTIFIER] This PR is NOT APPROVED.

This pull-request has been approved by: andrew-delph. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
/test istio-latest-no-mesh

This is definitely not related to the changes.

/test istio-latest-no-mesh
Codecov Report

Attention: Patch coverage is

```
@@ Coverage Diff @@
##             main   #14607      +/-   ##
==========================================
+ Coverage   84.20%   86.06%   +1.85%
==========================================
  Files         213      197      -16
  Lines       16633    14936    -1697
==========================================
- Hits        14006    12854    -1152
+ Misses       2280     1774     -506
+ Partials      347      308      -39
```

☔ View full report in Codecov by Sentry.
/retest

/retest
```go
routingState := rev.GetRoutingState()
if routingState == v1.RoutingStateActive {
	return autoscalingv1alpha1.ReachabilityReachable
```
If we want to make the RoutingState => Reachability mapping essentially a passthrough, I think we should pull that out into a separate PR to see what implications it has, since it's not clear to me whether this breaks anything.

Another thing to note: with the pod informer, if you remove the changes in this file, crash-looping pods will be marked unhealthy faster, so we toggle reachability to false sooner. In that case I think the autoscaler changes are unnecessary.
> If we want to make the RoutingState => Reachability mapping essentially a passthrough, I think we should pull that out into a separate PR to see what implications it has, since it's not clear to me whether this breaks anything.

Should I still do this?
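To make the passthrough idea discussed above concrete, here is a minimal, self-contained sketch of a RoutingState => Reachability mapping. The type and constant names below are simplified stand-ins, not the real `serving/v1` and `autoscaling/v1alpha1` definitions, and the treatment of non-active states is an assumption.

```go
package main

import "fmt"

// Simplified stand-ins for the knative types discussed in this thread.
type RoutingState string
type Reachability string

const (
	RoutingStateActive  RoutingState = "active"
	RoutingStateReserve RoutingState = "reserve"
	RoutingStatePending RoutingState = "pending"

	ReachabilityReachable   Reachability = "reachable"
	ReachabilityUnreachable Reachability = "unreachable"
	ReachabilityUnknown     Reachability = ""
)

// reachability sketches a pure passthrough: only an actively-routed
// revision is considered reachable; a reserve revision is unreachable;
// anything else is left undetermined. This is an illustrative mapping,
// not the actual knative implementation.
func reachability(rs RoutingState) Reachability {
	switch rs {
	case RoutingStateActive:
		return ReachabilityReachable
	case RoutingStateReserve:
		return ReachabilityUnreachable
	default:
		return ReachabilityUnknown
	}
}

func main() {
	fmt.Println(reachability(RoutingStateActive))
	fmt.Println(reachability(RoutingStateReserve))
}
```

The open question in the thread is exactly what such a direct mapping breaks, which is why pulling it into its own PR was suggested.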
/ok-to-test
Force-pushed 4547514 to f4f6b55
dprotaso left a comment:
Take a look at my comment here (#14656 (comment)). I think it's worth revisiting some of the assumptions that led to the creation of this PR.

Secondly, I noticed this PR introduces a regression: if a service has all its traffic pinned to a specific revision, then when we roll out a new revision it is immediately scaled to zero and considered 'failed'. It should scale to 1 (or initial scale) and then scale down afterwards, since it's unreachable.
```go
cond := pa.Status.GetCondition("Active")
pa.Status.MarkInactive(cond.Reason, cond.Message)
```
Probably simpler to have an if block that prevents setting a PA as inactive if it is already inactive.

Curious: do you know offhand which reasons/messages get overwritten? I'm wondering if it makes sense to pull this into a separate PR if it helps with surfacing error messages.
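The suggested guard could look something like the sketch below. The `paStatus` and `condition` types here are minimal stand-ins for illustration, not the actual knative API; the point is only that an early return preserves the first reason/message instead of overwriting it.

```go
package main

import "fmt"

// Minimal stand-ins for the PA status types; names are illustrative.
type condition struct {
	Status  string
	Reason  string
	Message string
}

type paStatus struct{ active condition }

func (s *paStatus) MarkInactive(reason, message string) {
	s.active = condition{Status: "False", Reason: reason, Message: message}
}

// markInactiveOnce sketches the suggested if block: skip marking the PA
// inactive when it already is, so the original reason and message survive.
func markInactiveOnce(s *paStatus, reason, message string) {
	if s.active.Status == "False" {
		return // already inactive; keep the existing reason/message
	}
	s.MarkInactive(reason, message)
}

func main() {
	s := &paStatus{active: condition{Status: "Unknown"}}
	markInactiveOnce(s, "NoTraffic", "revision is unreachable")
	markInactiveOnce(s, "Queued", "requests are being buffered")
	fmt.Println(s.active.Reason) // the first reason is preserved
}
```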
Thanks for getting back to me!

I think it was being overwritten by changes made in this PR. I'll have to test that again, though. For starters, I will create the PR as you suggested.
I have created this PR for computeActiveCondition: #14940. I think it covers all the same cases.
Force-pushed c60c5cd to 0d10f98
Making changes to the PA active status has become difficult. This change breaks down some of the logic so that it is more readable. No test cases were changed, since it doesn't actually alter the logical cases.
Force-pushed 47347c4 to 7033fe1
@andrew-delph: The following tests failed; say `/retest` to rerun all failed tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
```
@@ -0,0 +1,4 @@
# Deadstart test image
```
I added an image here to help with this.
Going to convert the test to use this image.
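For reference, all the "Deadstart" image discussed above needs to do is fail immediately on startup so the pod crash-loops and the revision never becomes ready. A minimal Go stand-in for that behavior (the actual image in the repo may differ) could be:

```go
package main

import "fmt"

// deadStart models the test image's entrypoint: in the real image the
// process would exit immediately with this non-zero code, so the
// container crash-loops and the revision can never report ready.
func deadStart() int {
	return 1
}

func main() {
	fmt.Println("container exits with code", deadStart())
}
```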
```go
// This test case creates a service which can never reach a ready state.
// The service is then updated with a healthy image, and it is verified that
// the healthy revision is ready and the unhealthy revision is scaled to zero.
func TestDeadStartToHealthy(t *testing.T) {
```
I added a similar test here: #14909

But I was expecting it to fail without any fixes, and it actually passed. So then I realized that we already scale down crashing revisions when they are unreachable.

So we might not need any changes beyond what's already in main, unless you've uncovered a scenario where that isn't the case.
The case here is still happening for me: #14656 (comment)

rev-2 will scale down, but the first revision stays stuck restarting.
1. Unreachable PA becomes inactive: computeActiveCondition() will MarkInactive in the "Queued" case when the PA is Unreachable.
2. Adding DeadStart e2e tests:
   - TestDeadStartToHealthy: creates a service which is never able to transition to a ready state because it exits immediately. The failed revision scales to zero once the service is updated with a healthy revision.
   - TestDeadStartFromHealthy: updates a healthy service with an image that can never reach a ready state. The healthy revision remains Ready, and the DeadStart revision does not scale down until ProgressDeadline is reached.
I'm currently getting an issue when the progress deadline is reached, but I'm looking into it.
Pods that instantly crash do not scale to 0 until the progress deadline is reached for the deployment.
Proposed Changes
If the PA is not ready and unreachable, it is marked inactive instead of queued.
This will scale the deployment to 0 even if metrics cannot be retrieved.
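The effect described in the proposed changes can be sketched as follows. This is an illustrative decision function, not the real KPA scaler; the `desiredScale` name, the `-1` "keep current scale" sentinel, and the parameter shapes are all assumptions made for this example.

```go
package main

import "fmt"

// desiredScale sketches why marking the PA inactive matters: an
// inactive PA is scaled to zero even when the autoscaler has no
// metrics for it, whereas a queued/active PA without metrics is left
// at its current scale (represented here by the -1 sentinel).
func desiredScale(active bool, metricScale int, haveMetrics bool) int {
	if !active {
		return 0 // inactive: scale to zero regardless of metrics
	}
	if !haveMetrics {
		return -1 // no metrics: keep the current scale
	}
	return metricScale
}

func main() {
	// A crash-looping, unreachable revision: no metrics, marked inactive.
	fmt.Println(desiredScale(false, 0, false))
	// A healthy revision with metrics requesting 3 pods.
	fmt.Println(desiredScale(true, 3, true))
}
```

Under the old behavior (PA stuck in "Queued", i.e. still considered active), the crash-looping revision would hit the no-metrics branch and never scale down until the progress deadline fired.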