Initial PR for avoiding http retries by activator by sukhil-suresh · Pull Request #1665 · knative/serving

sukhil-suresh · 2018-07-24T19:30:39Z

Proposed Changes

Adopting multi PR approach for the issue (as per discussion)

This initial PR avoids HTTP retries by activator when the user has defined an HTTPGet readinessProbe
Follow-up PR will address the case when the user has not defined a readinessProbe
NOTE: The e2e test image (autoscale) has been updated - @adrcunha @jessiezcc @srinivashegde86 @steuhs

Release Note
NONE

Avoid http retries by activator (for forwarding request to the revision pod) when user has defined an HTTPGet readiness probe

* TCPSocket and Exec action based user-defined readinessProbe are not supported
* Follow-up PR will address how to better handle the scenario when user has NOT defined a readinessProbe

Co-authored-by: Shash <shashwathireddy@gmail.com>
Signed-off-by: Shash <shashwathireddy@gmail.com>

* Add autoscaling e2e test for the case when user has defined HTTPGet readinessProbe
* Increase timeout for execution of e2e test to 20m instead of using the default 10m

googlebot · 2018-07-24T19:30:42Z

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and have the pull request author add another comment and the bot will run again. If the bot doesn't comment, it means it doesn't think anything has changed.

google-prow-robot · 2018-07-24T19:30:43Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pivotal-sukhil-suresh
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approvers: adrcunha, josephburnett

If they are not already assigned, you can assign the PR to them by writing /assign @adrcunha @josephburnett in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

knative-metrics-robot · 2018-07-24T19:33:01Z

The following is the coverage report on pkg/.
Say /test pull-knative-serving-go-coverage to run the coverage report again

| File | Old Coverage | New Coverage | Delta |
| --- | --- | --- | --- |
| pkg/activator/prober.go | Do not exist | 95.2% | |
| pkg/activator/revision.go | 80.0% | 80.4% | 0.4 |
| pkg/activator/activator.go | Do not exist | 100.0% | |

google-prow-robot · 2018-07-24T20:07:31Z

@pivotal-sukhil-suresh: The following test failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-knative-serving-integration-tests | `01fe640` | link | `/test pull-knative-serving-integration-tests` |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

sukhil-suresh · 2018-07-24T22:56:26Z

@adrcunha @jessiezcc @srinivashegde86 @steuhs CC @shashwathi

I assume the failing pull-knative-serving-integration-tests job refers to the e2e tests? If so, it is probably failing because one of the PR commits updates the autoscale test image. Can you confirm? The new test images may need to be uploaded to the e2e tests Docker repo. I am basing this on the test docs.

I had confirmed the e2e tests passed locally before making the PR.

dprotaso · 2018-07-25T06:42:25Z

You can see prow test logs by following these instructions

https://github.com/knative/docs/blob/master/community/REVIEWING.md#viewing-test-logs

shashwathi

Small nitpicks. The rest looks good.

shashwathi · 2018-07-25T04:09:11Z

+}
+
+func getTestHttpServer(t *testing.T) *httptest.Server {
+	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {


You can consider using Server Mux.

mux := http.NewServeMux()
mux.HandleFunc("/health", func(w http.ResponseWriter, req *http.Request) {
	w.WriteHeader(http.StatusOK)
})
mux.HandleFunc("/", func(w http.ResponseWriter, req *http.Request) {
	w.WriteHeader(http.StatusNotFound)
})

Yes, we could use the mux, but I'm not fully convinced the change is required for this trivial test server with two endpoints. Switching to mux does not necessarily make it more readable either. Finally, the tests are loosely based on pkg/controller/revision/resolve_test.go, which seems to be the only other test file in the serving repo that uses test HTTP servers. Let me know what you think...
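For comparison, here is a minimal, self-contained sketch of the mux-based variant discussed above. The helper names `newTestServer` and `statusOf` are made up for illustration and are not part of the PR; only `net/http` and `net/http/httptest` from the standard library are used.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newTestServer returns a test server that answers 200 on /health and
// 404 on every other path. Name and routes are illustrative only.
func newTestServer() *httptest.Server {
	mux := http.NewServeMux()
	// HandleFunc (not Handle) is the method that accepts a plain function.
	mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// The "/" pattern catches every path no longer pattern matches.
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusNotFound)
	})
	return httptest.NewServer(mux)
}

// statusOf issues a GET and returns the response status code.
func statusOf(url string) (int, error) {
	resp, err := http.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	srv := newTestServer()
	defer srv.Close()
	for _, path := range []string{"/health", "/other"} {
		code, err := statusOf(srv.URL + path)
		fmt.Println(path, code, err)
	}
}
```

With only two endpoints the difference from a single handler with a `switch` on the path is small, which is the point made in the reply.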

shashwathi · 2018-07-25T14:05:22Z

  local options=""
  (( EMIT_METRICS )) && options="-emitmetrics"
-  report_go_test -v -tags=e2e -count=1 ./test/$1 -dockerrepo gcr.io/knative-tests/test-images/$1 ${options}
+  report_go_test -v -tags=e2e -count=1 -timeout=20m ./test/$1 -dockerrepo gcr.io/knative-tests/test-images/$1 ${options}


I would recommend using an environment variable for the test timeout, defaulting to 20m, and documenting the variable and its usage in the test docs.

yes, makes sense. will make the change.

Not an issue anymore. This commit from yesterday increased the timeout to 20 mins - fbfa23d

shashwathi · 2018-07-25T14:10:01Z

+
+	logger.Infof("Creating a new Route and Configuration")
+
+	names := test.ResourceNames{


Looks like there is a test helper function to create the route and config: CreateRouteAndConfig.

The TestAutoscaleUpDownUp_WithReadinessProbe test intentionally avoids the CreateRouteAndConfig call, since the generated ConfigurationSpec does not have a readinessProbe. We could have altered CreateRouteAndConfig to accept additional params, but that would have meant passing them down the chain of calls. Instead we opted for the approach used by test/e2e/build_test.go.

dprotaso

I did a quick pass (didn't cover everything). In general you're performing the readiness checks for all HTTP methods, which is unnecessary: GETs, PUTs, HEADs, etc. should be idempotent.

So restrict the change to just HTTP POSTs
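A rough sketch of how such a restriction could look, assuming the decision is made purely on the request method. `isIdempotent` and `needsReadinessCheck` are hypothetical names, not functions from the PR. The method list follows RFC 7231; note that this sketch treats every non-idempotent method (POST, PATCH, CONNECT) alike, which is slightly broader than "just POSTs".

```go
package main

import (
	"fmt"
	"net/http"
)

// isIdempotent reports whether RFC 7231 defines the method as idempotent,
// meaning a repeated request has the same intended effect as a single one.
func isIdempotent(method string) bool {
	switch method {
	case http.MethodGet, http.MethodHead, http.MethodPut,
		http.MethodDelete, http.MethodOptions, http.MethodTrace:
		return true
	}
	return false
}

// needsReadinessCheck is a hypothetical policy: only requests that are
// unsafe to retry go through the readiness check before being forwarded.
func needsReadinessCheck(method string) bool {
	return !isIdempotent(method)
}

func main() {
	for _, m := range []string{http.MethodGet, http.MethodPut, http.MethodPost} {
		fmt.Println(m, needsReadinessCheck(m))
	}
}
```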

dprotaso · 2018-07-25T06:49:05Z

+// Function creates HTTP readiness probe for revision
+func createHttpGetProbe(revision *v1alpha1.Revision, endpoint Endpoint) *v1.Probe {
+	probe := revision.Spec.Container.ReadinessProbe.DeepCopy()
+	probe.HTTPGet.Scheme = "http"


You're making assumptions on the scheme - just confirm it's always http

You can use the defaulter function to set the default value for this and the path
https://github.com/kubernetes/kubernetes/blob/release-1.11/pkg/apis/core/v1/defaults.go#L294

Yes, will make the change. I did consider using the defaulter functions from the kubernetes library. Can't remember why I backed out of it :-/

dprotaso · 2018-07-25T06:56:43Z

+const (
+	maxRetry              = 60
+	defaultPeriodSeconds  = int32(1 * time.Second)
+	defaultTimeoutSeconds = int32(1 * time.Second)


These defaults exist here:

https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/core/v1/defaults.go#L185
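A side note on the quoted constants, offered as a sketch rather than a finding about the final code: `time.Second` is a `time.Duration` counted in nanoseconds, so `int32(1 * time.Second)` evaluates to 1000000000, not 1. The helper name `secondsToDuration` below is hypothetical; it shows one way to turn a whole-seconds field such as `PeriodSeconds` into a duration.

```go
package main

import (
	"fmt"
	"time"
)

// secondsToDuration converts a value expressed in whole seconds into a
// time.Duration. The name is illustrative and not taken from the PR.
func secondsToDuration(s int32) time.Duration {
	return time.Duration(s) * time.Second
}

func main() {
	// The conversion as written in the quoted hunk: this is the number of
	// nanoseconds in one second, stored in an int32.
	asWritten := int32(1 * time.Second)
	fmt.Println(asWritten)

	// Keeping the constant as a plain count of seconds and converting
	// at the point of use avoids the unit mix-up.
	fmt.Println(secondsToDuration(1))
}
```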

dprotaso · 2018-07-25T07:07:06Z

+	probe := createHttpGetProbe(revision, *endpoint)
+
+	// Number of seconds after the readiness probes are initiated
+	time.Sleep(time.Second * int32ToDuration(probe.InitialDelaySeconds))


We essentially know the endpoint is ready, but not necessarily from all nodes. I'm thinking we can skip the initial delay and hope that it'll optimistically work.

We did have some hesitation in using it. But the thinking was that a user directive to apply an initial delay should be respected. The default initial delay is 0 seconds.

As an example, assume for an app, the actual initial delay is longer than 60 seconds; then the endpoint verification would fail if the default retry interval of 1 second and max retry limit of 60 is used.

dprotaso · 2018-07-25T07:16:36Z

  local options=""
  (( EMIT_METRICS )) && options="-emitmetrics"
-  report_go_test -v -tags=e2e -count=1 ./test/$1 -dockerrepo gcr.io/knative-tests/test-images/$1 ${options}
+  report_go_test -v -tags=e2e -count=1 -timeout=20m ./test/$1 -dockerrepo gcr.io/knative-tests/test-images/$1 ${options}


Probably don't want to keep the 20 minute timeout

The 20 min timeout was added because TestAutoscaleUpDownUp_WithReadinessProbe runs the same scenarios as the previous TestAutoscaleUpDownUp, but with a readinessProbe defined in the ConfigurationSpec. TestAutoscaleUpDownUp already takes a long time; doubling it brought the run to about 17 mins, which ran into the go test timeout. The default timeout for go test is 10 mins.

PR #1670, once merged, may help reduce the time. But for now, the default timeout of 10 mins is not enough. Happy to consider alternatives. Suggestions?

Not an issue anymore. this commit from yesterday updated the timeout to 20 mins - fbfa23d

dprotaso · 2018-07-25T14:50:08Z

 	}
+
+	var transport http.RoundTripper
+	if endpoint.IsVerified() {


It's probably still worth using the retryRoundTripper for HTTP GETs

Not clear as to why the retry approach is better when the readinessProbe is defined by the user. The readinessProbe is a clear indicator of when the app is ready to receive requests.

For GETs a request could still fail for whatever reason even if readiness probe succeeds

Once the user's application says it's ready, it will be put into service. That's true whether it's the first pod or the 100th. The only reason we're retrying at this level (in the activator) is because we know the network programming is eventually consistent, which matters more for the first pod.

Once we've verified we can reach the service with readiness probing, we should just forward the request. If it fails because of something in the user's application, we should just rely on a higher level retry (or not).
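The position argued in this thread can be sketched as follows: forward directly once the endpoint is verified, and wrap the transport in retries only otherwise. This is a simplified stand-in, not the activator's code. `chooseTransport`, `flakyTransport`, and `attemptsUsed` are hypothetical, and the `retryRoundTripper` here is a reduced version written for this example. It retries transport-level errors only and uses a bodiless request, since retrying a request with a body would also require rewinding that body.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"time"
)

// retryRoundTripper retries transport-level failures a fixed number of times.
type retryRoundTripper struct {
	next     http.RoundTripper
	attempts int
	interval time.Duration
}

func (rrt retryRoundTripper) RoundTrip(r *http.Request) (*http.Response, error) {
	var err error
	for i := 0; i < rrt.attempts; i++ {
		var resp *http.Response
		if resp, err = rrt.next.RoundTrip(r); err == nil {
			return resp, nil
		}
		time.Sleep(rrt.interval)
	}
	return nil, err
}

// chooseTransport returns the base transport unchanged for a verified
// endpoint and a retrying wrapper for an unverified one.
func chooseTransport(verified bool, base http.RoundTripper) http.RoundTripper {
	if verified {
		return base
	}
	return retryRoundTripper{next: base, attempts: 5, interval: time.Millisecond}
}

// flakyTransport fails the first `failures` calls, then succeeds.
type flakyTransport struct {
	failures, calls int
}

func (f *flakyTransport) RoundTrip(*http.Request) (*http.Response, error) {
	f.calls++
	if f.calls <= f.failures {
		return nil, errors.New("connection refused")
	}
	return &http.Response{StatusCode: http.StatusOK, Body: http.NoBody}, nil
}

// attemptsUsed sends one bodiless GET through the chosen transport and
// reports how many times the underlying transport was called.
func attemptsUsed(verified bool, failures int) (int, error) {
	flaky := &flakyTransport{failures: failures}
	req, err := http.NewRequest(http.MethodGet, "http://revision.example/", nil)
	if err != nil {
		return 0, err
	}
	_, err = chooseTransport(verified, flaky).RoundTrip(req)
	return flaky.calls, err
}

func main() {
	fmt.Println(attemptsUsed(true, 2))  // verified: one attempt, error surfaces
	fmt.Println(attemptsUsed(false, 2)) // unverified: retried until success
}
```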

dprotaso · 2018-07-25T14:52:47Z

-	Port int32
+	FQDN     string
+	Port     int32
+	Verified VerificationStatus


Verified reads like a state. I'd just call this Status

Fair. Will change

dprotaso · 2018-07-25T15:19:21Z

+	defaultTimeoutSeconds = int32(1 * time.Second)
+)
+
+func verifyRevisionRoutability(revision *v1alpha1.Revision, endpoint *Endpoint, logger *zap.SugaredLogger) {


It's not clear that you're using a reference to an Endpoint in order to mutate it. It might be better to not mutate the value and return a VerificationStatus. Then set the endpoint's type to be a non-reference. ie. endpoint Endpoint so you're not seeing &endpoint everywhere.

well, could change the function name to verifyEndpointStatus, which would make it more obvious :) But, yeah do agree with your suggestion. Will alter.
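A small sketch of the suggested shape, assuming a three-valued status. The constant names and the `reachable` callback are invented for illustration; only `Endpoint`, its fields, and `VerificationStatus` come from the quoted diff, with `Verified` renamed to `Status` as agreed in the earlier thread.

```go
package main

import "fmt"

// VerificationStatus mirrors the type named in the diff; the values are assumed.
type VerificationStatus int

const (
	Unknown VerificationStatus = iota
	Verified
	Failed
)

// Endpoint mirrors the fields in the quoted diff.
type Endpoint struct {
	FQDN   string
	Port   int32
	Status VerificationStatus
}

// verifyEndpoint takes the endpoint by value and returns the outcome
// instead of writing to the caller's copy. reachable stands in for the probe.
func verifyEndpoint(e Endpoint, reachable func(Endpoint) bool) VerificationStatus {
	if reachable(e) {
		return Verified
	}
	return Failed
}

func main() {
	ep := Endpoint{FQDN: "rev.example.svc.cluster.local", Port: 8012}
	// The state change is now explicit at the call site.
	ep.Status = verifyEndpoint(ep, func(Endpoint) bool { return true })
	fmt.Println(ep.Status == Verified)
}
```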

dprotaso · 2018-07-25T15:29:59Z

+}
+
+// Function creates HTTP readiness probe for revision
+func createHttpGetProbe(revision *v1alpha1.Revision, endpoint Endpoint) *v1.Probe {


You're transforming a (revision, endpoint) to a probe. Then in the subsequent steps you're converting the probe to an http get call/request.

You can go straight to a (revision, endpoint) -> http.Request/call

Fair. This approach is a remnant of the initial effort to tackle both HttpGet and TCPSocket based readiness probes. Support for TCPSocket probe is currently blocked by issue #1241. More details in this issue comment.

Will update.
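One possible shape for going straight from the endpoint to a request, without an intermediate probe object. `readinessRequest` and its parameters are hypothetical. The fallbacks follow the Kubernetes probe defaults mentioned in the earlier thread (scheme HTTP, path "/").

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"net/url"
	"strconv"
	"strings"
)

// readinessRequest builds the GET request for a readiness check directly
// from the endpoint address and the user's probe settings. An empty scheme
// falls back to "http" and an empty path to "/".
func readinessRequest(fqdn string, port int32, scheme, path string) (*http.Request, error) {
	if scheme == "" {
		scheme = "http"
	}
	if path == "" {
		path = "/"
	}
	u := url.URL{
		// Probe schemes are written in upper case (HTTP, HTTPS); URLs use lower case.
		Scheme: strings.ToLower(scheme),
		Host:   net.JoinHostPort(fqdn, strconv.Itoa(int(port))),
		Path:   path,
	}
	return http.NewRequest(http.MethodGet, u.String(), nil)
}

func main() {
	req, err := readinessRequest("rev.default.svc.cluster.local", 8012, "", "/health")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL)
}
```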

dprotaso · 2018-07-25T15:36:14Z

+			transport = h2cutil.NewTransport()
+		}
+	} else {
+		transport = retryRoundTripper{


The PR is meant to avoid retrying HTTP POSTs, but a retry is still possible here if the endpoint isn't verified.

The PR comment refers to this fact. This issue is being resolved with a multi PR approach

This initial PR avoids HTTP retries by activator when the user has defined an HTTPGet readinessProbe. The follow up PR will default to queue-proxy health check when user has not defined a readinessProbe

To clarify my comment: if the endpoint VerifiedStatus is Failed, it will go into the else block and use the retryTripper for HTTP POSTs.

sukhil-suresh · 2018-07-25T16:47:45Z

Confirmed that one of the OWNERS has to upload an updated image for test/e2e/test_images/autoscale/ to the e2e docker repo.

The PR added a /health endpoint for the autoscale test app, and the log snippet below shows that the user-container /health request is timing out. Ref: 01fe640#diff-b846825c0707e40e6bf22ac1fa286d29

...
I0724 19:59:24.818] NAMESPACE         LAST SEEN   FIRST SEEN   COUNT     NAME                                                                             KIND                      SUBOBJECT                                                TYPE      REASON                         SOURCE                                                           MESSAGE
...
I0724 19:59:25.167] noodleburg        10m         11m          4         prodtohikpas-00001-deployment-847ff6868-mltqk.154465cc2b7374d4                   Pod                       spec.containers{user-container}                          Warning   Unhealthy                      kubelet, gke-kserving-e2e-cls143-default-pool-51493bb4-7ww6      Readiness probe failed: Get http://10.8.1.29:8012/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
...

adrcunha · 2018-07-25T19:55:23Z

Please have your PR LGTMed by shashwathi and dprotaso. After that I'll review it and upload the new test images so Prow integration tests will pass. Meanwhile, running them locally is the way to go.

josephburnett · 2018-08-10T16:31:35Z

 	}
+
+	var transport http.RoundTripper
+	if endpoint.IsVerified() {


Once the user's application says it's ready, it will be put into service. That's true whether it's the first pod or the 100th. The only reason we're retrying at this level (in the activator) is because we know the network programming is eventually consistent, which matters more for the first pod.

Once we've verified we can reach the service with readiness probing, we should just forward the request. If it fails because of something in the user's application, we should just rely on a higher level retry (or not).

josephburnett · 2018-08-10T16:35:54Z

+	retryCount := 1
+	retryInterval := time.Second * int32ToDuration(probe.PeriodSeconds)
+
+	for retryCount = 1; retryCount < maxRetry; retryCount++ {


I would prefer to go with an exponential backoff. E.g. #1814. Let's see if you can reuse @markusthoemmes' retry mechanism. Or if his mechanism can be modified for you to use it.
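For illustration only, a generic exponential backoff loop could look like the sketch below. This is not the mechanism from #1814, whose API is not shown in this thread. `probeWithBackoff` and `simulate` are hypothetical names, and the sleep function is injected so the loop can be exercised without real waiting.

```go
package main

import (
	"fmt"
	"time"
)

// probeWithBackoff calls check until it succeeds or maxAttempts is reached,
// multiplying the wait by factor after every failed attempt.
func probeWithBackoff(check func() bool, initial time.Duration, factor float64,
	maxAttempts int, sleep func(time.Duration)) (attempts int, ok bool) {
	delay := initial
	for attempts = 1; attempts <= maxAttempts; attempts++ {
		if check() {
			return attempts, true
		}
		// Skip the wait after the final attempt.
		if attempts < maxAttempts {
			sleep(delay)
			delay = time.Duration(float64(delay) * factor)
		}
	}
	return maxAttempts, false
}

// simulate runs the loop against a check that succeeds on call number
// succeedOn, starting at 10ms and doubling. It reports the attempts used
// and the total time that would have been slept.
func simulate(succeedOn, maxAttempts int) (int, time.Duration, bool) {
	calls := 0
	var waited time.Duration
	attempts, ok := probeWithBackoff(
		func() bool { calls++; return calls >= succeedOn },
		10*time.Millisecond, 2, maxAttempts,
		func(d time.Duration) { waited += d },
	)
	return attempts, waited, ok
}

func main() {
	fmt.Println(simulate(4, 10))
}
```

Compared with the fixed one-second interval in the quoted loop, the early attempts come sooner and the later ones are spaced further apart.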

josephburnett · 2018-08-10T17:14:05Z

Just merged #1689 which adds activator unit tests. @markusthoemmes is implementing an exponential backoff in #1814 which you should be able to use for your GET readiness probing.

mattmoor · 2018-09-21T16:53:56Z

Closing per @dprotaso

sukhil-suresh and others added 2 commits July 24, 2018 15:09

Add e2e test for issue knative#1448

01fe640

* Add autoscaling e2e test for the case when user has defined HTTPGet readinessProbe
* Increase timeout for execution of e2e test to 20m instead of using the default 10m

google-prow-robot requested review from jessiezcc and josephburnett July 24, 2018 19:30

google-prow-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jul 24, 2018

sukhil-suresh changed the title ~~1448~~ Initial PR for avoiding http retries by activator Jul 24, 2018

shashwathi reviewed Jul 25, 2018

dprotaso reviewed Jul 25, 2018

mattmoor assigned mdemirhan Jul 26, 2018

josephburnett reviewed Aug 10, 2018

josephburnett mentioned this pull request Aug 10, 2018

Activator unit tests #1689

Merged

mattmoor closed this Sep 21, 2018

