Description
What problem are you facing?
Various downstream APIs implement rate limiting in undesirable ways or don't implement it at all. The current crossplane-runtime behavior leads to rapid-fire retries of failing requests. In the vast majority of these cases, the failures are not transient at all, but are the result of an invalid configuration or request.
Downstream API owners understandably get worried when failure counts go up as a result, and naturally ask us to figure out how to limit the number of calls. As a provider developer, I don't really see a clean way to influence this behavior through crossplane-runtime's interface. For instance, in many cases I would opt to disable request re-queuing on failure entirely and simply wait for the next poll interval before doing anything. In other cases, I may want to specify a minimum re-queue duration so that I can have a more frequent poll interval without amplifying failing calls. Ideally, if a re-queued request for a given resource is already in the queue, we would skip that resource at its next poll interval (I'm not even sure whether this is possible right now).
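To make the "minimum re-queue duration" idea concrete, here is a rough sketch of the clamping I have in mind. Nothing in it is crossplane-runtime API: `clampRequeue`, `exampleMinRequeue`, and `exampleBackoffStart` are hypothetical names used only for illustration, and the doubling loop is just a stand-in for whatever backoff the work queue actually applies.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical example values, not crossplane-runtime defaults.
const (
	exampleMinRequeue   = 30 * time.Second
	exampleBackoffStart = 1 * time.Second
)

// clampRequeue is a hypothetical helper. It raises a proposed retry delay
// to at least minRequeue, so a failing resource is never retried sooner
// than the provider author allows.
func clampRequeue(proposed, minRequeue time.Duration) time.Duration {
	if proposed < minRequeue {
		return minRequeue
	}
	return proposed
}

func main() {
	// Simulate a doubling backoff and show the delay that would actually
	// be used once the minimum is enforced.
	delay := exampleBackoffStart
	for attempt := 1; attempt <= 8; attempt++ {
		fmt.Printf("attempt %d: backoff=%v effective=%v\n",
			attempt, delay, clampRequeue(delay, exampleMinRequeue))
		delay *= 2
	}
}
```

With a floor like this, the early retries that currently fire within seconds would all be pushed out to the minimum, while longer backoff delays would pass through unchanged.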
How could Crossplane help solve your problem?
I'm not sure of the best way to solve this within crossplane-runtime, but I do know that information can be passed from a provider into the runtime via the managed.External* structs. These structs allow propagation of connection secrets and some other data that end up flowing through the reconcile loop.
As a "simple" first step, would it be at all feasible to introduce a RequeueOnError boolean into the managed.External* structs that could tell the reconcile loop what to do when it encounters a non-nil error in the response? For instance, for the provider-terraform issue I filed (#193), I would likely opt NOT to retry on error and simply let the natural poll/sync interval handle retries at a later time. Sure, claim owners may get slower feedback, but in most cases that feedback will be pretty static, regardless of the number of times we retry.
I'd also like to know, as a provider developer, whether there are existing patterns that address this concern. We've implemented workarounds through timestamp tracking and other methods to cater to downstream APIs, but this is not ideal and adds complexity to provider code.
I'm also very interested in previous discussions on this, which must have come up, and in whether there's been any movement or ideas on how to address these kinds of challenges.
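A rough sketch of how such a flag might steer the reconcile decision is below. It is self-contained and does not use crossplane-runtime: `Observation`, `Result`, and `resultFor` are hypothetical stand-ins (with `Result` loosely modeled on the `Requeue`/`RequeueAfter` fields of controller-runtime's `reconcile.Result`), and `examplePollInterval` and `errInvalidConfig` are made-up example values.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Observation is a hypothetical stand-in for a managed.External* struct
// carrying the proposed RequeueOnError flag.
type Observation struct {
	RequeueOnError bool
}

// Result is a hypothetical stand-in for what the reconciler returns.
type Result struct {
	Requeue      bool          // retry right away, subject to queue backoff
	RequeueAfter time.Duration // otherwise, wait this long before the next pass
}

// Made-up example values for illustration only.
const examplePollInterval = 10 * time.Minute

var errInvalidConfig = errors.New("invalid configuration")

// resultFor decides what to do after an external call. On error it only
// asks for an immediate requeue when the provider opted in; in every
// other case it waits for the regular poll interval.
func resultFor(err error, o Observation, pollInterval time.Duration) Result {
	if err != nil && o.RequeueOnError {
		return Result{Requeue: true}
	}
	return Result{RequeueAfter: pollInterval}
}

func main() {
	optOut := resultFor(errInvalidConfig, Observation{RequeueOnError: false}, examplePollInterval)
	optIn := resultFor(errInvalidConfig, Observation{RequeueOnError: true}, examplePollInterval)
	fmt.Printf("opt out: %+v\n", optOut)
	fmt.Printf("opt in:  %+v\n", optIn)
}
```

One caveat with a plain boolean: its zero value is false, so a real design would probably need an inverted name or a pointer to keep today's retry-on-error behavior as the default for providers that don't set it.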