Is your feature request related to a problem?
Yes. The least_request load balancing policy can cause a complete TPS drop when a single upstream endpoint hangs. Two factors contribute:
- Long request timeouts (30 seconds or more) make this much worse, because every request sent to the hung endpoint stays outstanding until it times out.
- As shown in LeastRequestLoadBalancer$ReadyPicker.nextChildToUse(), the N_CHOICES selection method randomly picks two endpoints, but the two picks are not guaranteed to be distinct. It may select the same unhealthy endpoint twice instead of comparing two different endpoints.
When this occurs, all traffic is routed to the hung endpoint, causing a full service degradation, which is unacceptable.
Describe the solution you'd like
- Support for the FULL_SCAN mode of the xDS LEAST_REQUEST load balancer policy, which would check all endpoints before picking one.
- Adjust the N_CHOICES algorithm so that it cannot pick the same endpoint twice, for example by recording which endpoints have already been sampled.
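
To make the second suggestion concrete, here is a minimal sketch of sampling without replacement. It is not the grpc-java implementation: `DistinctLeastRequestPicker`, `Endpoint`, and `pick` are hypothetical names used only for illustration, and the in-flight count is modeled as a plain `int`. The sketch uses a partial Fisher-Yates shuffle over an index array, so the sampled candidates are always distinct. When `choiceCount` is at least the number of endpoints, it degenerates into a full scan.

```java
import java.util.List;
import java.util.Random;

/** Illustrative only: least-request pick over distinct random candidates. */
public final class DistinctLeastRequestPicker {

  /** Hypothetical stand-in for an endpoint and its outstanding request count. */
  public static final class Endpoint {
    public final String name;
    public final int inFlight;

    public Endpoint(String name, int inFlight) {
      this.name = name;
      this.inFlight = inFlight;
    }
  }

  /**
   * Samples up to choiceCount distinct endpoints uniformly at random and
   * returns the one with the fewest in-flight requests.
   */
  public static Endpoint pick(List<Endpoint> endpoints, int choiceCount, Random random) {
    if (endpoints.isEmpty()) {
      throw new IllegalArgumentException("no endpoints");
    }
    int n = endpoints.size();
    // Clamp to [1, n]; sampling n distinct endpoints is a full scan.
    int k = Math.min(Math.max(choiceCount, 1), n);

    int[] idx = new int[n];
    for (int i = 0; i < n; i++) {
      idx[i] = i;
    }

    Endpoint best = null;
    for (int i = 0; i < k; i++) {
      // Partial Fisher-Yates: swap a random not-yet-sampled index into slot i.
      int j = i + random.nextInt(n - i);
      int tmp = idx[i];
      idx[i] = idx[j];
      idx[j] = tmp;

      Endpoint sampled = endpoints.get(idx[i]);
      if (best == null || sampled.inFlight < best.inFlight) {
        best = sampled;
      }
    }
    return best;
  }
}
```

With two endpoints and `choiceCount = 2`, this sketch always compares both of them, so an endpoint with a large backlog of outstanding requests is never chosen over an idle one. The index array costs O(n) per pick; a real implementation would probably want something cheaper on the hot path, such as re-drawing on collision.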