Retry the grpc connection when there's an error #503
mattklein123 merged 5 commits into envoyproxy:main
@mattklein123 @renuka-fernando Requesting your review on this PR. Thank you.
Signed-off-by: alekhya.kondapuram <alekhya.kondapuram@salesforce.com>
    p.retryGrpcConn()
    return
Shouldn't we retry the gRPC connection only for connection errors?
When we run the xDS server behind Envoy, the client gets a RESET frame during pod shutdowns or when the server enforces a max connection age, for example "rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: NO_ERROR". This is not treated as a connection failure (it is a RESET). For the same reason, the Envoy -> xDS control plane path does not retry only on connection failures but retries with a backup on any error: https://github.com/envoyproxy/envoy/blob/49425f55aa9212a64b3390909160c41dc22ff349/source/extensions/config_subscription/grpc/grpc_stream.h#L50
This PR just mimics the Envoy behaviour.
Thank you for the approval @renuka-fernando
@mattklein123 can you please take a look and merge this if this looks good to you?
mattklein123 left a comment
Please also add documentation in the README.
Thank you for your review @mattklein123. Updated per comments. Please take a second look when you're free. Thanks!
This PR aims to fix the hot-looping problem described in issue #502.
The problem was that when the xDS server closed the stream, the xDS client tried to NACK the previous response and then went into a hot loop, repeatedly trying to fetch configuration updates from the closed stream. This happens because sotw.isConnErr doesn't return true when the server signals the client with the following error message:
rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: NO_ERROR
Here's the sotw.isConnErr() for reference.
So the idea of this PR is to retry the connection whenever fetching the config returns an error, instead of expecting and handling only a few error codes.
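To make the difference concrete, here is a minimal sketch of the two policies. It is not the actual sotw.isConnErr or the code in this PR: the `code` type, `streamErr`, and both `...ShouldReconnect` functions are hypothetical stand-ins, and the single-code allow-list is made up for illustration. A real client would inspect the error with the gRPC status package instead.

```go
package main

import (
	"errors"
	"fmt"
)

// code is a hypothetical stand-in for a gRPC status code.
type code int

const (
	codeUnavailable code = iota
	codeInternal
)

// streamErr is a hypothetical error type carrying a status code.
type streamErr struct {
	code code
	msg  string
}

func (e *streamErr) Error() string { return e.msg }

// narrowShouldReconnect models an allow-list policy: reconnect only for
// specific codes (just Unavailable here, chosen for illustration).
func narrowShouldReconnect(err error) bool {
	var se *streamErr
	return errors.As(err, &se) && se.code == codeUnavailable
}

// broadShouldReconnect models the policy described above: any error
// returned while fetching config triggers a reconnect.
func broadShouldReconnect(err error) bool {
	return err != nil
}

func main() {
	rst := &streamErr{
		code: codeInternal,
		msg:  "stream terminated by RST_STREAM with error code: NO_ERROR",
	}
	fmt.Println("narrow policy reconnects:", narrowShouldReconnect(rst))
	fmt.Println("broad policy reconnects:", broadShouldReconnect(rst))
}
```

Under the allow-list policy the RST_STREAM error falls through without a reconnect, which is the situation that led to the hot loop; the broad policy reconnects on it.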
This PR also adds exponential backoff between connection retry attempts.
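For readers unfamiliar with the pattern, exponential backoff for reconnects can be sketched roughly as follows. This is an illustration under assumptions, not the code merged in this PR: `backoffDelay`, `retryWithBackoff`, `errStreamReset`, `noSleep`, and all parameter values are hypothetical names and numbers, and the sketch assumes `maxAttempts` is at least 1.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errStreamReset is a hypothetical error used to simulate a failed attempt.
var errStreamReset = errors.New("stream reset")

// noSleep is a no-op sleeper, handy when real waiting is not wanted.
func noSleep(time.Duration) {}

// backoffDelay returns the wait before retry number attempt (0-based):
// base doubled attempt times, capped at max.
func backoffDelay(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	if d > max {
		return max
	}
	return d
}

// retryWithBackoff calls connect until it succeeds or maxAttempts calls
// have failed, waiting an exponentially growing delay between attempts.
// The sleep function is injected so callers can pass time.Sleep or a stub.
func retryWithBackoff(connect func() error, maxAttempts int,
	base, max time.Duration, sleep func(time.Duration)) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = connect(); err == nil {
			return nil
		}
		if attempt < maxAttempts-1 {
			sleep(backoffDelay(attempt, base, max))
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	connect := func() error {
		calls++
		if calls < 3 {
			return errStreamReset // first two attempts fail
		}
		return nil
	}
	// Record the delays instead of sleeping so the example runs instantly.
	var delays []time.Duration
	record := func(d time.Duration) { delays = append(delays, d) }

	err := retryWithBackoff(connect, 5, 100*time.Millisecond, time.Second, record)
	fmt.Println("error:", err, "calls:", calls, "delays:", delays)
}
```

Capping the delay keeps a long outage from producing very long waits, and the growing delay is what prevents a tight reconnect loop against a server that keeps resetting the stream.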