Backoff retries in the activator #1814
Conversation
/test pull-knative-serving-integration-tests

1 similar comment

/test pull-knative-serving-integration-tests

/assign @josephburnett
```go
func (rrt *retryRoundTripper) CalculateDelay(retries int, minRetryInterval time.Duration) time.Duration {
	return time.Duration(int(minRetryInterval/time.Millisecond)*retries*retries) * time.Millisecond
}
```
I believe this is quadratic, not exponential. What we want is an aggressive retry during normal activation times, but a quickly growing retry interval thereafter, which is easier to achieve with exponential backoff because of its hockey-stick shape.
In my experience a small base like 1.3 is a good start, with the retry index as the exponent, multiplied by the minimum retry interval.
E.g. `return time.Duration(float64(minRetryInterval) * math.Pow(1.3, float64(retries)))` (note that `^` is XOR in Go, so `math.Pow` is needed for the exponentiation).
It would look something like this. (The actual numbers should be tuned, but the point is to keep the curve low and fast until we leave normal operating conditions.)
Doh. Of course it's quadratic... very much my bad. Thanks for pointing that out, I'll fix accordingly.
Force-pushed from 2c4c6b0 to 65be77e
Force-pushed from 65be77e to 52d5083
The following is the coverage report on pkg/.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: josephburnett, markusthoemmes

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details: Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/retest

2 similar comments

/retest

/retest

/retest

/retest

1 similar comment

/retest


Fixes #1229
Proposed Changes
Added an exponential backoff to the activator's retry logic. In the process, I lowered the initial timeout (we might need to adjust that a bit to hit a sweet spot), and the total time spent retrying is now bounded by the elapsed time of the retries plus the requests themselves.
To determine a good retry interval, the following table can help. Production data on how many retries are actually needed will help tune it further, though.
Regarding tests: I didn't find any for this specific file. I'd love to add some, but will need some guidance on how to do so, if necessary.
Release Note