Initial PR for avoiding http retries by activator #1665
sukhil-suresh wants to merge 2 commits into knative:master
Conversation
Avoid http retries by activator (for forwarding requests to the revision pod) when the user has defined an HTTPGet readiness probe
* TCPSocket and Exec action based user-defined readinessProbes are not supported
* A follow-up PR will address how to better handle the scenario when the user has NOT defined a readinessProbe

Co-authored-by: Shash <shashwathireddy@gmail.com>
Signed-off-by: Shash <shashwathireddy@gmail.com>
* Add autoscaling e2e test for the case where the user has defined an HTTPGet readinessProbe
* Increase the e2e test execution timeout to 20m instead of using the default 10m
We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: pivotal-sukhil-suresh. If they are not already assigned, you can assign the PR to them. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Details: needs approval from an approver in each of these files; approvers can indicate their approval by writing an approval comment.
The following is the coverage report on pkg/.
@pivotal-sukhil-suresh: The following test failed, say
Full PR test history. Your PR dashboard.
Details: instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@adrcunha @jessiezcc @srinivashegde86 @steuhs CC @shashwathi. I had confirmed the e2e tests passed locally before making the PR.
You can see prow test logs by following these instructions: https://github.com/knative/docs/blob/master/community/REVIEWING.md#viewing-test-logs
shashwathi left a comment
Small nitpicks. The rest looks good.
}

func getTestHttpServer(t *testing.T) *httptest.Server {
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
You can consider using ServeMux:
mux := http.NewServeMux()
mux.HandleFunc("/health", func(w http.ResponseWriter, req *http.Request) {
	w.WriteHeader(http.StatusOK)
})
mux.HandleFunc("/", func(w http.ResponseWriter, req *http.Request) {
	w.WriteHeader(http.StatusNotFound)
})
Yes, we could use the mux, but I'm not fully convinced the change is required for this trivial test-server use case (with 2 endpoints); switching to mux does not necessarily make it more readable either. Finally, I loosely based the tests on pkg/controller/revision/resolve_test.go, which seems to be the only other serving-repo test file using test HTTP servers. Let me know what you think...
local options=""
(( EMIT_METRICS )) && options="-emitmetrics"
- report_go_test -v -tags=e2e -count=1 ./test/$1 -dockerrepo gcr.io/knative-tests/test-images/$1 ${options}
+ report_go_test -v -tags=e2e -count=1 -timeout=20m ./test/$1 -dockerrepo gcr.io/knative-tests/test-images/$1 ${options}
I would recommend using an environment variable for the test timeout, defaulting it to 20m, and probably documenting the variable and its usage in the test docs.
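A sketch of that suggestion, assuming a hypothetical E2E_TEST_TIMEOUT variable (the name is an assumption, not an existing knative variable):

```shell
# Hypothetical: allow callers to override the e2e go test timeout,
# defaulting to 20m when the variable is unset or empty.
E2E_TEST_TIMEOUT="${E2E_TEST_TIMEOUT:-20m}"
echo "using go test timeout: ${E2E_TEST_TIMEOUT}"

# It would then be passed through, e.g.:
#   report_go_test -v -tags=e2e -count=1 -timeout="${E2E_TEST_TIMEOUT}" ./test/$1 ...
```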
yes, makes sense. will make the change.
Not an issue anymore. This commit from yesterday increased the timeout to 20 mins - fbfa23d
logger.Infof("Creating a new Route and Configuration")

names := test.ResourceNames{
Looks like there is a test helper function to create the route and config: CreateRouteAndConfig.
TestAutoscaleUpDownUp_WithReadinessProbe intentionally avoids the CreateRouteAndConfig call, since the generated ConfigurationSpec does not have a readinessProbe. We could have altered CreateRouteAndConfig and passed additional params, but that would have meant passing them down the chain of calls. Instead I opted for the approach used by test/e2e/build_test.go.
dprotaso left a comment
I did a quick pass (didn't cover everything). In general you're performing the readiness checks for all HTTP methods, which is unnecessary; GETs, PUTs, HEADs, etc. should be idempotent.
So restrict the change to just HTTP POSTs.
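As a sketch of that restriction, retries could be gated on method idempotency (per RFC 7231); isIdempotent here is a hypothetical helper, not code from the PR:

```go
package main

import (
	"fmt"
	"net/http"
)

// isIdempotent reports whether an HTTP method can be retried safely.
// Per RFC 7231, GET, HEAD, PUT, DELETE, OPTIONS, and TRACE are idempotent;
// POST is not, so only POSTs need the no-retry treatment.
func isIdempotent(method string) bool {
	switch method {
	case http.MethodGet, http.MethodHead, http.MethodPut,
		http.MethodDelete, http.MethodOptions, http.MethodTrace:
		return true
	}
	return false
}

func main() {
	// The transport-selection logic could then use retries only for
	// idempotent methods.
	fmt.Println(isIdempotent("GET"), isIdempotent("POST")) // prints "true false"
}
```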
// Function creates HTTP readiness probe for revision
func createHttpGetProbe(revision *v1alpha1.Revision, endpoint Endpoint) *v1.Probe {
	probe := revision.Spec.Container.ReadinessProbe.DeepCopy()
	probe.HTTPGet.Scheme = "http"
You're making assumptions about the scheme - just confirm it's always http.
You can use the defaulter function to set the default value for this and for the path:
https://github.com/kubernetes/kubernetes/blob/release-1.11/pkg/apis/core/v1/defaults.go#L294
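The referenced defaulter essentially fills in an empty path and scheme. A minimal self-contained sketch of that behavior (the HTTPGetAction struct here is a local stand-in for the Kubernetes core/v1 type, not an import):

```go
package main

import "fmt"

// HTTPGetAction mirrors the two fields of Kubernetes' core/v1 HTTPGetAction
// that the defaulter touches (local stand-in so the sketch is self-contained).
type HTTPGetAction struct {
	Path   string
	Scheme string
}

// applyHTTPGetDefaults mirrors what the linked SetDefaults_HTTPGetAction does
// for these fields: an empty path becomes "/", an empty scheme becomes "HTTP".
func applyHTTPGetDefaults(h *HTTPGetAction) {
	if h.Path == "" {
		h.Path = "/"
	}
	if h.Scheme == "" {
		h.Scheme = "HTTP"
	}
}

func main() {
	a := HTTPGetAction{}
	applyHTTPGetDefaults(&a)
	fmt.Println(a.Path, a.Scheme) // prints "/ HTTP"
}
```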
Yes, will make the change. I did consider using the defaulter functions from the kubernetes library. Can't remember why I backed out of it :-/
const (
	maxRetry = 60
	defaultPeriodSeconds = int32(1 * time.Second)
	defaultTimeoutSeconds = int32(1 * time.Second)
These defaults exist here:
https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/core/v1/defaults.go#L185
probe := createHttpGetProbe(revision, *endpoint)

// Number of seconds after the readiness probes are initiated
time.Sleep(time.Second * int32ToDuration(probe.InitialDelaySeconds))
We essentially know the endpoint is ready, but not necessarily from all nodes. I'm thinking we can skip the initial delay and hope that it'll optimistically work.
We did have some hesitation about using it, but the thinking was that a user directive to apply an initial delay should be respected. The default initial delay is 0 seconds.
As an example, assume an app's actual initial delay is longer than 60 seconds; the endpoint verification would then fail if the default retry interval of 1 second and max retry limit of 60 were used.
local options=""
(( EMIT_METRICS )) && options="-emitmetrics"
- report_go_test -v -tags=e2e -count=1 ./test/$1 -dockerrepo gcr.io/knative-tests/test-images/$1 ${options}
+ report_go_test -v -tags=e2e -count=1 -timeout=20m ./test/$1 -dockerrepo gcr.io/knative-tests/test-images/$1 ${options}
Probably don't want to keep the 20 minute timeout
The 20 min timeout was added because TestAutoscaleUpDownUp_WithReadinessProbe runs the same scenarios as the existing TestAutoscaleUpDownUp but with a readinessProbe defined in the ConfigurationSpec. TestAutoscaleUpDownUp already takes a long time, and doubling it brought the run to about 17 mins, which hit the go test timeout failure (the default go test timeout is 10 mins).
Maybe PR #1670, when merged, will help reduce the time. But for now, the default timeout of 10 mins is no good. Happy to consider alternatives. Suggestions?
Not an issue anymore. This commit from yesterday updated the timeout to 20 mins - fbfa23d
}

var transport http.RoundTripper
if endpoint.IsVerified() {
It's probably still worth using the retryRoundTripper for HTTP GETs
It's not clear why the retry approach is better when the readinessProbe is defined by the user. The readinessProbe is a clear indicator of when the app is ready to receive requests.
For GETs, a request could still fail for whatever reason even if the readiness probe succeeds.
Once the user's application says it's ready, it will be put into service. That's true whether it's the first pod or the 100th. The only reason we're retrying at this level (in the activator) is because we know the network programming is eventually consistent, which matters more for the first pod.
Once we've verified we can reach the service with readiness probing, we should just forward the request. If it fails because of something in the user's application, we should just rely on a higher level retry (or not).
- Port int32
+ FQDN string
+ Port int32
+ Verified VerificationStatus
Verified reads like a state. I'd just call this Status
Fair. Will change
	defaultTimeoutSeconds = int32(1 * time.Second)
)

func verifyRevisionRoutability(revision *v1alpha1.Revision, endpoint *Endpoint, logger *zap.SugaredLogger) {
It's not clear that you're taking a pointer to an Endpoint in order to mutate it. It might be better not to mutate the value and to return a VerificationStatus instead; then make the parameter a non-pointer (i.e. endpoint Endpoint) so you're not seeing &endpoint everywhere.
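A minimal sketch of the suggested refactor; Endpoint and VerificationStatus are simplified stand-ins for the PR's types, and probeOnce is a hypothetical helper standing in for the actual readiness-probe call:

```go
package main

import "fmt"

// VerificationStatus is a simplified stand-in for the status type under review.
type VerificationStatus int

const (
	VerificationPending VerificationStatus = iota
	VerificationSucceeded
	VerificationFailed
)

// Endpoint is a simplified stand-in for the activator's Endpoint type,
// taken by value (no pointer, no mutation).
type Endpoint struct {
	FQDN string
	Port int32
}

// verifyEndpointStatus returns a status instead of mutating its argument,
// as suggested; probeOnce is injected so the sketch stays self-contained.
func verifyEndpointStatus(endpoint Endpoint, probeOnce func(Endpoint) bool) VerificationStatus {
	if probeOnce(endpoint) {
		return VerificationSucceeded
	}
	return VerificationFailed
}

func main() {
	ep := Endpoint{FQDN: "example.svc", Port: 8080}
	status := verifyEndpointStatus(ep, func(Endpoint) bool { return true })
	fmt.Println(status == VerificationSucceeded) // prints "true"
}
```

The caller then sets endpoint.Status from the return value, keeping the verification logic free of side effects.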
Well, I could change the function name to verifyEndpointStatus, which would make it more obvious :) But yeah, I do agree with your suggestion. Will alter.
}

// Function creates HTTP readiness probe for revision
func createHttpGetProbe(revision *v1alpha1.Revision, endpoint Endpoint) *v1.Probe {
You're transforming a (revision, endpoint) into a probe, then in subsequent steps converting the probe into an HTTP GET call/request.
You can go straight from (revision, endpoint) to an http.Request/call.
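A sketch of going straight from the probe/endpoint fields to an *http.Request; the function name and parameters are illustrative, not the PR's actual code:

```go
package main

import (
	"fmt"
	"net/http"
)

// buildProbeRequest skips the intermediate *v1.Probe and builds the readiness
// request directly. The parameters stand in for values read from the
// revision's readiness probe (scheme, path) and the endpoint (host, port).
func buildProbeRequest(scheme, host string, port int32, path string) (*http.Request, error) {
	url := fmt.Sprintf("%s://%s:%d%s", scheme, host, port, path)
	return http.NewRequest(http.MethodGet, url, nil)
}

func main() {
	req, err := buildProbeRequest("http", "10.0.0.1", 8080, "/health")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.URL.String()) // prints "http://10.0.0.1:8080/health"
}
```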
Fair. This approach is a remnant of the initial effort to tackle both HTTPGet and TCPSocket based readiness probes. Support for the TCPSocket probe is currently blocked by issue #1241; more details in this issue comment.
Will update.
	transport = h2cutil.NewTransport()
	}
} else {
	transport = retryRoundTripper{
The PR is meant to address not retrying HTTP POSTs, but retries are still possible here if the endpoint isn't verified.
The PR description refers to this fact. The issue is being resolved with a multi-PR approach:
this initial PR avoids HTTP retries by the activator when the user has defined an HTTPGet readinessProbe; a follow-up PR will default to the queue-proxy health check when the user has not defined a readinessProbe.
To clarify my comment: if the endpoint's VerificationStatus is Failed, execution goes into the else block and uses the retryRoundTripper for HTTP POSTs.
Confirmed that one of the OWNERS has to upload the updated test image that the PR added.
Please have your PR LGTMed by shashwathi and dprotaso. After that I'll review it and upload the new test images so Prow integration tests will pass. Meanwhile, running them locally is the way to go.
retryCount := 1
retryInterval := time.Second * int32ToDuration(probe.PeriodSeconds)

for retryCount = 1; retryCount < maxRetry; retryCount++ {
I would prefer to go with an exponential backoff, e.g. #1814. Let's see if you can reuse @markusthoemmes' retry mechanism, or if his mechanism can be modified for you to use it.
Just merged #1689, which adds activator unit tests. @markusthoemmes is implementing an exponential backoff in #1814, which you should be able to use for your GET readiness probing.
Closing per @dprotaso |
Fixes #1448
Proposed Changes
Adopting a multi-PR approach for the issue (as per discussion)
Release Note
NONE