Skip to content

Retry the grpc connection when there's an error#503

Merged
mattklein123 merged 5 commits intoenvoyproxy:mainfrom
akondapuram:main
Feb 23, 2024
Merged

Retry the grpc connection when there's an error#503
mattklein123 merged 5 commits intoenvoyproxy:mainfrom
akondapuram:main

Conversation

@akondapuram
Copy link
Copy Markdown
Contributor

@akondapuram akondapuram commented Feb 9, 2024

This PR aims to fix the hot-looping problem described in Issue#502

The issue here was when the xDS-server closed the stream, the xDS-client tried to NACK the previous response and it went berserk in a hot-loop trying to fetch the configuration updates from the closed stream. This happens because the sotw.isConnErr doesn't return true in this case when the server signals the client with the following error message
rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: NO_ERROR.

Here's the sotw.isConnErr() for reference.

So the idea of this PR is to retry the connection whenever there is an error trying to fetch the config, instead of just expecting and handling just a few error codes.

Also, added exponential backoff for retrying the connection attempts.

@akondapuram
Copy link
Copy Markdown
Contributor Author

@mattklein123 @renuka-fernando Requesting your review on this PR. Thank you.
cc: @ramaraochavali

Signed-off-by: alekhya.kondapuram <alekhya.kondapuram@salesforce.com>
Signed-off-by: alekhya.kondapuram <alekhya.kondapuram@salesforce.com>
Comment thread src/provider/xds_grpc_sotw_provider.go Outdated
Signed-off-by: alekhya.kondapuram <alekhya.kondapuram@salesforce.com>
@akondapuram akondapuram marked this pull request as ready for review February 16, 2024 02:49
Comment on lines +118 to +119
p.retryGrpcConn()
return
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we retry gRPC conn only for connection errors?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we run Xds Server behind Envoy, during pod shutdowns/server enforced max connection age, the client gets RESET frame like "rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: NO_ERROR" which is not treated as connection failure (It is RESET). Envoy -> XDS control plane does not just retry connection failures but retries with a backup on any error for the same reason https://github.com/envoyproxy/envoy/blob/49425f55aa9212a64b3390909160c41dc22ff349/source/extensions/config_subscription/grpc/grpc_stream.h#L50

This PR just mimicks the Envoy behaviour

@akondapuram
Copy link
Copy Markdown
Contributor Author

Thank you for the approval @renuka-fernando
@mattklein123 Could you please take a look when you get some time? Thank you!

@ramaraochavali
Copy link
Copy Markdown

@mattklein123 can you please take a look and merge this if this looks good to you?

@mattklein123 mattklein123 self-assigned this Feb 22, 2024
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also add documentation in the README.

Comment thread src/provider/xds_grpc_sotw_provider.go Outdated
Signed-off-by: alekhya.kondapuram <alekhya.kondapuram@salesforce.com>
@akondapuram
Copy link
Copy Markdown
Contributor Author

Thank you for your review @mattklein123. Updated per comments. Please take a second look when you're free. Thanks!

Comment thread README.md Outdated
Signed-off-by: alekhya.kondapuram <alekhya.kondapuram@salesforce.com>
@mattklein123 mattklein123 merged commit 19f2079 into envoyproxy:main Feb 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants