Here's a list of gaps observed in a querier trace, between DynamoDB.BatchGetItemPages calls that errored:
First one: 100ms, 190ms, 180ms, 200ms, 660ms, 2050ms
Next one: 60ms, 140ms, 250ms, 170ms, 440ms
Next one: 10ms, 160ms, 290ms, 600ms, 140ms, 100ms, 3880ms
Next one: 40ms, 60ms, 120ms, 370ms, 1240ms
All of this is consistent with the implementation, which picks a random delay within a range whose upper bound doubles on each retry, but it doesn't feel right. I think we should have a base delay that doubles, and add a few milliseconds for jitter.
Next issue is that calls that return unprocessed keys, without an error, also increase the backoff. Since we commonly get a number of ProvisionedThroughputExceeded errors followed by a number of calls with unprocessed keys, we crank the delay up to tens of seconds.
I think we should reset the backoff as long as there is no error, and eliminate WaitWithoutCounting().
Finally, I would lower the max backoff time from 50 seconds: it's too long compared to the default write context timeout of 1 minute, and also compared to the typical time an end-user will wait for a query. Say 20 seconds?
So the base times would be:
100ms, 200ms, 400ms, 800ms, 1.6s, 3.2s, 6.4s, 12.8s, 20s, 20s, 20s, 20s, 20s, 20s, 20s, 20s, 20s, 20s, 20s
Typical retries within a 1 minute timeout, if they all error: ~9
Total time to get to MaxRetries=20: ~6 minutes.