Initial PR for avoiding http retries by activator#1665

Closed
sukhil-suresh wants to merge 2 commits into knative:master from sukhil-suresh:1448

Conversation

@sukhil-suresh
Contributor

@sukhil-suresh sukhil-suresh commented Jul 24, 2018

Fixes #1448

Proposed Changes

Adopting multi PR approach for the issue (as per discussion)

  • This initial PR avoids HTTP retries by activator when the user has defined an HTTPGet readinessProbe
  • Follow-up PR will address the case when the user has not defined a readinessProbe
  • NOTE: The e2e test image (autoscale) has been updated - @adrcunha @jessiezcc @srinivashegde86 @steuhs

Release Note
NONE

sukhil-suresh and others added 2 commits July 24, 2018 15:09
Avoid http retries by activator (for forwarding request to the revision pod)
when user has defined an HTTPGet readiness probe

* TCPSocket and Exec action based user-defined readinessProbe are not supported

* Follow-up PR will address how to better handle the scenario when user
 has NOT defined a readinessProbe

Co-authored-by: Shash <shashwathireddy@gmail.com>
Signed-off-by: Shash <shashwathireddy@gmail.com>
* Add autoscaling e2e test for the case when user has defined
HTTPGet readinessProbe

* Increase timeout for execution of e2e test to 20m instead of
using the default 10m
@googlebot

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and have the pull request author add another comment and the bot will run again. If the bot doesn't comment, it means it doesn't think anything has changed.

@google-prow-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pivotal-sukhil-suresh
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approvers: adrcunha, josephburnett

If they are not already assigned, you can assign the PR to them by writing /assign @adrcunha @josephburnett in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-prow-robot google-prow-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jul 24, 2018
@knative-metrics-robot

The following is the coverage report on pkg/.
Say /test pull-knative-serving-go-coverage to run the coverage report again

| File | Old Coverage | New Coverage | Delta |
| --- | --- | --- | --- |
| pkg/activator/prober.go | Do not exist | 95.2% | |
| pkg/activator/revision.go | 80.0% | 80.4% | 0.4 |
| pkg/activator/activator.go | Do not exist | 100.0% | |

@sukhil-suresh sukhil-suresh changed the title 1448 Initial PR for avoiding http retries by activator Jul 24, 2018
@google-prow-robot

@pivotal-sukhil-suresh: The following test failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-knative-serving-integration-tests | 01fe640 | link | /test pull-knative-serving-integration-tests |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@sukhil-suresh
Contributor Author

@adrcunha @jessiezcc @srinivashegde86 @steuhs CC @shashwathi

Assuming the pull-knative-serving-integration-tests failing job refers to the e2e tests? If so, it is probably failing because of an update to the autoscale test-image (by one of the PR commits). Can you confirm? The new test images may need to be uploaded to the e2e tests Docker repo. I am basing this on the test docs

I had confirmed the e2e tests passed locally before making the PR.

@dprotaso
Member

You can see prow test logs by following these instructions

https://github.com/knative/docs/blob/master/community/REVIEWING.md#viewing-test-logs

Contributor

@shashwathi shashwathi left a comment

Small nitpicks. The rest looks good.

}

func getTestHttpServer(t *testing.T) *httptest.Server {
handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
Contributor

You can consider using Server Mux.

mux := http.NewServeMux()
// Readiness endpoint. HandleFunc (not Handle) accepts a plain function.
mux.HandleFunc("/health", func(w http.ResponseWriter, req *http.Request) {
   w.WriteHeader(http.StatusOK)
})
// Every other path answers 404.
mux.HandleFunc("/", func(w http.ResponseWriter, req *http.Request) {
   w.WriteHeader(http.StatusNotFound)
})

Contributor Author

@sukhil-suresh sukhil-suresh Jul 25, 2018

Yes, could use the mux. But, not fully convinced the change is required for this trivial use case of a test server (with 2 endpoints). Switching to mux does not necessarily make it any more readable either. And finally, loosely based the tests on pkg/controller/revision/resolve_test.go, which seems to be the only other serving repo test file using test HTTP servers. Let me know what you think...
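
For readers weighing the two options in this thread, here is a minimal, self-contained sketch of the mux-based test server. The helper names `newTestServer` and `statusOf` are hypothetical and are not taken from the PR; the sketch only assumes the two endpoints mentioned above (`/health` and everything else).

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newTestServer is a hypothetical helper: /health answers 200 and
// every other path answers 404.
func newTestServer() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// The "/" pattern matches any path not claimed by a more specific pattern.
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusNotFound)
	})
	return httptest.NewServer(mux)
}

// statusOf issues a GET and returns the response status code.
func statusOf(url string) (int, error) {
	resp, err := http.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	srv := newTestServer()
	defer srv.Close()
	for _, path := range []string{"/health", "/other"} {
		code, err := statusOf(srv.URL + path)
		fmt.Println(path, code, err)
	}
}
```

Whether this reads better than a single handler with a path check is a judgment call for a two-endpoint server, which is the point made in the reply above.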

Comment thread test/e2e-tests.sh
local options=""
(( EMIT_METRICS )) && options="-emitmetrics"
- report_go_test -v -tags=e2e -count=1 ./test/$1 -dockerrepo gcr.io/knative-tests/test-images/$1 ${options}
+ report_go_test -v -tags=e2e -count=1 -timeout=20m ./test/$1 -dockerrepo gcr.io/knative-tests/test-images/$1 ${options}
Contributor

@shashwathi shashwathi Jul 25, 2018

I would recommend reading the test timeout from an environment variable that defaults to 20m, and documenting the variable and its usage in the test docs.

Contributor Author

yes, makes sense. will make the change.

Contributor Author

Not an issue anymore. This commit from yesterday increased the timeout to 20 mins - fbfa23d


logger.Infof("Creating a new Route and Configuration")

names := test.ResourceNames{
Contributor

Looks like there is a test helper function, CreateRouteAndConfig, to create the route and config.

Contributor Author

@sukhil-suresh sukhil-suresh Jul 25, 2018

The TestAutoscaleUpDownUp_WithReadinessProbe intentionally avoids the CreateRouteAndConfig call, since the generated ConfigurationSpec does not have a readinessProbe. Could have altered the CreateRouteAndConfig and passed additional params which would have meant passing it down the chain of calls. Instead opted to go with the approach used by test/e2e/build_test.go

Member

@dprotaso dprotaso left a comment

I did a quick pass (didn't cover everything). In general, you're performing the readiness checks for all HTTP methods, which is unnecessary: GET, PUT, HEAD, etc. should be idempotent.

So restrict the change to just HTTP POSTs.
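
As a rough illustration of the distinction drawn here, the sketch below classifies methods by idempotency as defined in RFC 7231 (GET, HEAD, PUT, DELETE, OPTIONS and TRACE are idempotent; POST is not). The function names `isIdempotent` and `needsVerifiedEndpoint` are hypothetical and do not appear in the PR.

```go
package main

import (
	"fmt"
	"net/http"
)

// isIdempotent reports whether RFC 7231 defines the method as idempotent,
// meaning a repeated request has the same intended effect as a single one.
func isIdempotent(method string) bool {
	switch method {
	case http.MethodGet, http.MethodHead, http.MethodPut,
		http.MethodDelete, http.MethodOptions, http.MethodTrace:
		return true
	}
	return false
}

// needsVerifiedEndpoint is a hypothetical policy helper: only requests that
// are unsafe to retry would wait for the endpoint to be verified first.
func needsVerifiedEndpoint(r *http.Request) bool {
	return !isIdempotent(r.Method)
}

func main() {
	for _, m := range []string{http.MethodGet, http.MethodPost} {
		req, err := http.NewRequest(m, "http://example.invalid/", nil)
		if err != nil {
			panic(err)
		}
		fmt.Println(m, "needs verified endpoint:", needsVerifiedEndpoint(req))
	}
}
```

Under this policy a retrying transport would remain acceptable for idempotent methods, and only POST-like requests would depend on the readiness check.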

Comment thread pkg/activator/prober.go
// Function creates HTTP readiness probe for revision
func createHttpGetProbe(revision *v1alpha1.Revision, endpoint Endpoint) *v1.Probe {
probe := revision.Spec.Container.ReadinessProbe.DeepCopy()
probe.HTTPGet.Scheme = "http"
Member

You're making assumptions on the scheme - just confirm it's always http

You can use the defaulter function to set the default value for this and the path
https://github.com/kubernetes/kubernetes/blob/release-1.11/pkg/apis/core/v1/defaults.go#L294

Contributor Author

Yes, will make the change. I did consider using the defaulter functions from the kubernetes library. Can't remember why I backed out of it :-/
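
To make the defaulting idea concrete without pulling in the Kubernetes packages, here is a sketch that uses a local stand-in type. It assumes the linked defaulter fills an empty path with "/" and an empty scheme with "HTTP"; the type `httpGetAction` and the function `applyDefaults` are hypothetical names for illustration only.

```go
package main

import "fmt"

// httpGetAction is a local stand-in for the probe's HTTPGet settings.
// It is NOT the Kubernetes type, only the fields relevant here.
type httpGetAction struct {
	Path   string
	Scheme string
	Port   int32
}

// applyDefaults fills in only the fields the user left empty, so an
// explicitly configured scheme or path is never overwritten.
// Assumption: the defaults are "/" for the path and "HTTP" for the scheme.
func applyDefaults(h *httpGetAction) {
	if h.Path == "" {
		h.Path = "/"
	}
	if h.Scheme == "" {
		h.Scheme = "HTTP"
	}
}

func main() {
	unset := httpGetAction{Port: 8012}
	applyDefaults(&unset)
	fmt.Printf("%+v\n", unset)

	explicit := httpGetAction{Path: "/health", Scheme: "HTTPS", Port: 8012}
	applyDefaults(&explicit)
	fmt.Printf("%+v\n", explicit)
}
```

The difference from the reviewed line `probe.HTTPGet.Scheme = "http"` is that a user-supplied scheme survives instead of being replaced unconditionally.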

Comment thread pkg/activator/prober.go
const (
maxRetry = 60
defaultPeriodSeconds = int32(1 * time.Second)
defaultTimeoutSeconds = int32(1 * time.Second)
Member

Comment thread pkg/activator/prober.go
probe := createHttpGetProbe(revision, *endpoint)

// Number of seconds after the readiness probes are initiated
time.Sleep(time.Second * int32ToDuration(probe.InitialDelaySeconds))
Member

We essentially know the endpoint is ready, but not necessarily from all nodes. I'm thinking we can skip the initial delay and hope that it'll optimistically work.

Contributor Author

We did have some hesitation in using it. But the thinking was that a user directive to apply an initial delay should be respected. The default initial delay is 0 seconds.

As an example, assume for an app, the actual initial delay is longer than 60 seconds; then the endpoint verification would fail if the default retry interval of 1 second and max retry limit of 60 is used.
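
The trade-off described above can be sketched as a small polling loop. This is not the PR's implementation: `waitUntilReady` and its parameters are hypothetical, and the loop here makes exactly `maxRetry` attempts. With a period of 1 second and 60 attempts the probing window is about a minute, so an application that needs longer than that to start only passes if the initial delay is honoured before probing begins.

```go
package main

import (
	"fmt"
	"time"
)

// waitUntilReady sleeps for initialDelay, then calls check up to maxRetry
// times, pausing for period between attempts. It returns the number of
// attempts made and whether check ever succeeded.
func waitUntilReady(check func() bool, initialDelay, period time.Duration, maxRetry int) (int, bool) {
	time.Sleep(initialDelay)
	for attempt := 1; attempt <= maxRetry; attempt++ {
		if check() {
			return attempt, true
		}
		// No need to wait after the final failed attempt.
		if attempt < maxRetry {
			time.Sleep(period)
		}
	}
	return maxRetry, false
}

func main() {
	calls := 0
	// Simulated application that becomes ready on the third probe.
	readyOnThird := func() bool {
		calls++
		return calls >= 3
	}
	attempts, ok := waitUntilReady(readyOnThird, 0, 10*time.Millisecond, 60)
	fmt.Println("attempts:", attempts, "ready:", ok)
}
```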

Comment thread test/e2e-tests.sh
local options=""
(( EMIT_METRICS )) && options="-emitmetrics"
- report_go_test -v -tags=e2e -count=1 ./test/$1 -dockerrepo gcr.io/knative-tests/test-images/$1 ${options}
+ report_go_test -v -tags=e2e -count=1 -timeout=20m ./test/$1 -dockerrepo gcr.io/knative-tests/test-images/$1 ${options}
Member

Probably don't want to keep the 20 minute timeout

Contributor Author

@sukhil-suresh sukhil-suresh Jul 25, 2018

The 20 min timeout was added because the TestAutoscaleUpDownUp_WithReadinessProbe runs the same scenarios as the previous TestAutoscaleUpDownUp but with readinessProbe defined in the ConfigurationSpec. The TestAutoscaleUpDownUp takes a long time and doubling it upped it to about 17 mins and so was running into go test timeout failure. The default timeout for go test is 10 mins

Maybe PR #1670 when merged may help reduce the time. But for now, the default timeout of 10 mins is no good. Happy to consider alternatives. Suggestions?

Contributor Author

Not an issue anymore. this commit from yesterday updated the timeout to 20 mins - fbfa23d

Comment thread cmd/activator/main.go
}

var transport http.RoundTripper
if endpoint.IsVerified() {
Member

It's probably still worth using the retryRoundTripper for HTTP GETs

Contributor Author

Not clear as to why the retry approach is better when the readinessProbe is defined by the user. The readinessProbe is a clear indicator of when the app is ready to receive requests.

Member

@dprotaso dprotaso Jul 26, 2018

For GETs a request could still fail for whatever reason even if readiness probe succeeds

Contributor

Once the user's application says it's ready, it will be put into service. That's true whether it's the first pod or the 100th. The only reason we're retrying at this level (in the activator) is because we know the network programming is eventually consistent, which matters more for the first pod.

Once we've verified we can reach the service with readiness probing, we should just forward the request. If it fails because of something in the user's application, we should just rely on a higher level retry (or not).

Port int32
FQDN string
Port int32
Verified VerificationStatus
Member

Verified reads like a state. I'd just call this Status

Contributor Author

Fair. Will change

Comment thread pkg/activator/prober.go
defaultTimeoutSeconds = int32(1 * time.Second)
)

func verifyRevisionRoutability(revision *v1alpha1.Revision, endpoint *Endpoint, logger *zap.SugaredLogger) {
Member

It's not clear that you're using a reference to an Endpoint in order to mutate it. It might be better to not mutate the value and return a VerificationStatus. Then set the endpoint's type to be a non-reference. ie. endpoint Endpoint so you're not seeing &endpoint everywhere.

Contributor Author

well, could change the function name to verifyEndpointStatus, which would make it more obvious :) But, yeah do agree with your suggestion. Will alter.
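
One possible shape for the non-mutating variant agreed on in this thread is sketched below. The `Endpoint` fields follow the diff quoted earlier, with `Verified` renamed to `Status` as suggested; the constant names, the `verifyEndpoint` function and the injected `probe` callback are assumptions made for illustration.

```go
package main

import "fmt"

// VerificationStatus describes the outcome of probing an endpoint.
// The constant names are hypothetical.
type VerificationStatus int

const (
	Unknown VerificationStatus = iota
	Verified
	Failed
)

// Endpoint mirrors the fields shown in the diff, with the status field
// named Status rather than Verified.
type Endpoint struct {
	FQDN   string
	Port   int32
	Status VerificationStatus
}

// verifyEndpoint takes the endpoint by value and returns a status instead
// of mutating its argument; the caller decides whether to store the result.
func verifyEndpoint(e Endpoint, probe func(Endpoint) bool) VerificationStatus {
	if probe(e) {
		return Verified
	}
	return Failed
}

func main() {
	e := Endpoint{FQDN: "revision.example.svc.cluster.local", Port: 80}
	// The assignment makes the state change visible at the call site.
	e.Status = verifyEndpoint(e, func(Endpoint) bool { return true })
	fmt.Println("verified:", e.Status == Verified)
}
```

Passing the probe as a function also keeps the status logic testable without a network, although that is a side benefit rather than something requested in the review.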

Comment thread pkg/activator/prober.go
}

// Function creates HTTP readiness probe for revision
func createHttpGetProbe(revision *v1alpha1.Revision, endpoint Endpoint) *v1.Probe {
Member

@dprotaso dprotaso Jul 25, 2018

You're transforming a (revision, endpoint) to a probe. Then in the subsequent steps you're converting the probe to an http get call/request.

You can go straight to a (revision, endpoint) -> http.Request/call

Contributor Author

@sukhil-suresh sukhil-suresh Jul 25, 2018

Fair. This approach is a remnant of the initial effort to tackle both HttpGet and TCPSocket based readiness probes. Support for TCPSocket probe is currently blocked by issue #1241. More details in this issue comment.

Will update.
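
A sketch of the direct (revision, endpoint) -> http.Request path suggested above follows. It skips the intermediate probe object and builds the GET request from the endpoint address and the probe path. The names `probeSpec` and `newProbeRequest` are hypothetical, and the hard-coded "http" scheme is an assumption that the earlier scheme discussion would need to settle.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"net/url"
	"strconv"
)

// probeSpec holds the few values taken from the user's HTTPGet
// readinessProbe. It is a hypothetical type for this sketch.
type probeSpec struct {
	Path    string
	Headers map[string]string
}

// newProbeRequest builds the readiness GET request directly from the
// endpoint address and the probe settings.
func newProbeRequest(fqdn string, port int32, spec probeSpec) (*http.Request, error) {
	u := url.URL{
		Scheme: "http", // assumption: plain HTTP inside the cluster
		Host:   net.JoinHostPort(fqdn, strconv.Itoa(int(port))),
		Path:   spec.Path,
	}
	req, err := http.NewRequest(http.MethodGet, u.String(), nil)
	if err != nil {
		return nil, err
	}
	for name, value := range spec.Headers {
		req.Header.Set(name, value)
	}
	return req, nil
}

func main() {
	req, err := newProbeRequest("10.8.1.29", 8012, probeSpec{Path: "/health"})
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
}
```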

Comment thread cmd/activator/main.go
transport = h2cutil.NewTransport()
}
} else {
transport = retryRoundTripper{
Member

The PR is meant to avoid retrying HTTP POSTs, but a retry is still possible here if the endpoint isn't verified.

Contributor Author

The PR comment refers to this fact. This issue is being resolved with a multi PR approach

This initial PR avoids HTTP retries by activator when the user has defined an HTTPGet readinessProbe. The follow up PR will default to queue-proxy health check when user has not defined a readinessProbe

Member

To clarify my comment: if the endpoint's VerifiedStatus is Failed, it will go into the else block and use the retryRoundTripper for HTTP POSTs.

@sukhil-suresh
Contributor Author

Confirmed that one of the OWNERS has to upload an updated image for test/e2e/test_images/autoscale/ to the e2e docker repo.

The PR added a /health endpoint for the autoscale test app and the log snippet below shows that user-container /health request is timing out. Ref: 01fe640#diff-b846825c0707e40e6bf22ac1fa286d29

...
I0724 19:59:24.818] NAMESPACE         LAST SEEN   FIRST SEEN   COUNT     NAME                                                                             KIND                      SUBOBJECT                                                TYPE      REASON                         SOURCE                                                           MESSAGE
...
I0724 19:59:25.167] noodleburg        10m         11m          4         prodtohikpas-00001-deployment-847ff6868-mltqk.154465cc2b7374d4                   Pod                       spec.containers{user-container}                          Warning   Unhealthy                      kubelet, gke-kserving-e2e-cls143-default-pool-51493bb4-7ww6      Readiness probe failed: Get http://10.8.1.29:8012/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
...

@adrcunha
Contributor

Please have your PR LGTMed by shashwathi and dprotaso. After that I'll review it and upload the new test images so Prow integration tests will pass. Meanwhile, running them locally is the way to go.

Comment thread pkg/activator/prober.go
retryCount := 1
retryInterval := time.Second * int32ToDuration(probe.PeriodSeconds)

for retryCount = 1; retryCount < maxRetry; retryCount++ {
Contributor

I would prefer to go with an exponential backoff. E.g. #1814. Let's see if you can reuse @markusthoemmes' retry mechanism. Or if his mechanism can be modified for you to use it.

@josephburnett
Contributor

Just merged #1689 which adds activator unit tests. @markusthoemmes is implementing an exponential backoff in #1814 which you should be able to use for your GET readiness probing.
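
For context on what an exponential backoff would change relative to the fixed-interval loop reviewed above, here is a generic sketch. It is not the retry mechanism from #1814, whose API is not shown in this conversation; `retryWithBackoff`, `errNotReady` and all parameters are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotReady is a hypothetical sentinel used by the example probe.
var errNotReady = errors.New("endpoint not ready")

// retryWithBackoff calls fn until it succeeds or attempts are exhausted,
// doubling the wait after each failure and capping it at maxDelay.
// It returns nil on success or the last error from fn.
func retryWithBackoff(fn func() error, attempts int, initial, maxDelay time.Duration) error {
	if attempts < 1 {
		return errors.New("attempts must be at least 1")
	}
	delay := initial
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i == attempts-1 {
			break // no point sleeping after the final attempt
		}
		time.Sleep(delay)
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
	return err
}

func main() {
	calls := 0
	probe := func() error {
		calls++
		if calls < 3 {
			return errNotReady
		}
		return nil
	}
	err := retryWithBackoff(probe, 5, 10*time.Millisecond, 80*time.Millisecond)
	fmt.Println("calls:", calls, "err:", err)
}
```

Compared with a fixed 1-second period, early attempts happen sooner while later ones are spaced out, which may suit the eventually consistent network programming mentioned earlier in the review.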

@mattmoor
Member

Closing per @dprotaso

@mattmoor mattmoor closed this Sep 21, 2018