
Backoff algorithm is still not very good #792

@bboreham

Description


Here's a list of gaps observed in a querier trace, between DynamoDB.BatchGetItemPages calls that errored:

First one: 100ms, 190ms, 180ms, 200ms, 660ms, 2050ms
Next one: 60ms, 140ms, 250ms, 170ms, 440ms
Next one: 10ms, 160ms, 290ms, 600ms, 140ms, 100ms, 3880ms
Next one: 40ms, 60ms, 120ms, 370ms, 1240ms

All of this is consistent with the implementation, which picks a random number within a range that doubles, but it doesn't feel right. I think we should have a base number that doubles, and add a few milliseconds for jitter.

Next issue is that calls that return unprocessed keys also increase the backoff, even though they are not real errors. Since we commonly get a number of ProvisionedThroughputExceeded errors followed by a number of calls with unprocessed keys, we crank the delay up to tens of seconds.

I think we should reset the backoff as long as there is no error, and eliminate WaitWithoutCounting().

Finally, I would drop the max backoff time from 50 seconds: that's too long compared to the default write context timeout of 1 minute, and also compared to the typical time an end-user will wait for a query. Say 20 seconds?

So the base times would be:
100ms, 200ms, 400ms, 800ms, 1.6s, 3.2s, 6.4s, 12.8s, 20s, 20s, 20s, 20s, 20s, 20s, 20s, 20s, 20s, 20s, 20s

Typical number of retries within a 1-minute timeout, if they all error: ~9.
Total time to reach MaxRetries=20: ~6 minutes.
