Backoff algorithm is still not very good

Here's a list of gaps observed in a `querier` trace, between successive `DynamoDB.BatchGetItemPages` calls that errored:

First one: 100ms, 190ms, 180ms, 200ms, 660ms, 2050ms
Next one: 60ms, 140ms, 250ms, 170ms, 440ms
Next one: 10ms, 160ms, 290ms, 600ms, 140ms, 100ms, 3880ms
Next one: 40ms, 60ms, 120ms, 370ms, 1240ms

All of this is consistent with the implementation, which picks a random number within a range that doubles, but it doesn't feel right: the gaps don't grow steadily (in the third trace, 600ms is followed by 140ms and 100ms). I think we should have a base number that doubles, and add a few milliseconds for jitter.

The next issue is that calls that return unprocessed keys, without an error, also increase the backoff. Since we commonly get a number of `ProvisionedThroughputExceeded` errors followed by a number of calls with unprocessed keys, we crank the delay up to tens of seconds.

I think we should reset the backoff as long as there is no error, and eliminate `WaitWithoutCounting()`.

Finally, I would drop the max backoff time from 50 seconds: it's too long compared to the default write context timeout of 1 minute, and also compared to the typical time an end-user will wait for a query. Say 20 seconds?

So the base times would be:
100ms, 200ms, 400ms, 800ms, 1.6s, 3.2s, 6.4s, 12.8s, 20s, 20s, 20s, 20s, 20s, 20s, 20s, 20s, 20s, 20s, 20s

Typical retries within a 1 minute timeout, if they all error: ~9
Total time to get to MaxRetries=20: ~4 minutes of backoff (the delays above sum to 245.5s), plus the time spent in the calls themselves.


Backoff algorithm is still not very good #792
