upstream: API proposal for extensions to outlier detection#31205
wbpcode merged 9 commits into envoyproxy:main
…errors. Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
|
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to |
wbpcode left a comment
Thanks for your contribution. From a quick look at the outlier detection source code, it seems to be well designed for different protocols.
All proxies (tcp proxy, redis proxy, mysql proxy, etc.) could call putResult to report the results of requests to outlier detection (a code for HTTP and a Result for non-HTTP).
If we want to do more in this area, then it's reasonable to create some cross-protocol errors that are used by outlier detection, and then define a configurable mapping from protocol-dependent statuses to these general errors.
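For illustration only, the configurable mapping suggested above could look roughly like the sketch below. Everything in it is an assumption made for the example: `GenericResult`, `StatusRange` and `mapStatus` are hypothetical names, not Envoy's actual types, and the half-open range convention is chosen to mirror `type.v3.Int32Range`.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical cross-protocol categories (illustrative, not Envoy's Result enum).
enum class GenericResult { Success, UpstreamError, LocalOriginError };

// One configurable rule: statuses in [start, end) map to `result`.
struct StatusRange {
  int32_t start; // inclusive
  int32_t end;   // exclusive
  GenericResult result;
};

// Maps a protocol-dependent status (an HTTP code, a Redis error id, ...)
// to a general category. Statuses that match no rule count as success.
GenericResult mapStatus(const std::vector<StatusRange>& rules, int32_t status) {
  for (const auto& rule : rules) {
    if (status >= rule.start && status < rule.end) {
      return rule.result;
    }
  }
  return GenericResult::Success;
}
```

With a single rule `{500, 600, GenericResult::UpstreamError}`, a 503 maps to `UpstreamError` and a 200 falls through to `Success`.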
    // Monitor name.
    string name = 1;
Several monitors can be active within one cluster. The name would be used to indicate/log which monitor was triggered and marked a host as unhealthy.
    // Error bucket for HTTP codes
    // [#not-implemented-hide:]
    message HttpErrorsBucket {
      string name = 1;
      type.v3.Int32Range range = 2;
    }

    // Error bucket for locally originated errors
    // [#not-implemented-hide:]
    message LocalOriginEvents {
    }

    // Error bucket for database errors.
    // Sub-parameters may be added later, like malformed response, error on write, etc.
    // [#not-implemented-hide:]
    message DatabaseTransactions {
    }

    // Union of possible error buckets.
    // [#not-implemented-hide:]
    message ErrorBucket {
      oneof bucket {
        HttpErrorsBucket http_errors = 1;
        LocalOriginEvents local_origin_events = 2;
        DatabaseTransactions database_transactions = 3;
      }
    }
I think these should be part of the API of the individual proxies rather than of outlier detection itself.
That is a good idea. I will try to reshuffle it, but I am not sure how difficult it will be.
|
cc @envoyproxy/api-shepherds I think this is a core API change and others may also want to take a look. cc @htuch @markdroth |
|
The idea of extending the outlier detector has been brewing in my head for many months. Because it is a core API, backwards compatibility is essential. But I am convinced that if this piece of the API were designed today, it would look close to what is proposed here. The other thing is that internally all OD algorithms are executed regardless of whether they are used. For example, even if only consecutive errors are configured, the logic for success_rate and failure_frequency is still executed; it just never kicks in. The proposed API allows for backwards compatibility and co-existence of the "old" config and the new one. @htuch @mattklein123 WDYT about this proposal? I remember that your comments helped me a lot when I was working on separating locally originated errors from 5xx.
|
One thought I had (triggered by your use of "extend") is that maybe OD should be a 1st class API extension point. It would definitely empower folks to do a lot of customization without needing to modify core APIs going forward. In your proposed change, I think that rather than treat the whole OD as an extension, the "monitors" would become pluggable extensions. You could then have actions related to monitors as an additional extension type. Overload manager is a great example of this API design pattern. |
|
@htuch Thanks for sharing your thoughts. Are you referring to this part of the overload manager: https://github.com/envoyproxy/envoy/blob/main/api/envoy/config/overload/v3/overload.proto#L35-L44?
|
Yep, that basic idea. CC @KBaichoo who might have thoughts on how this pattern worked out and whether it would make sense for outlier detection. |
|
Agreed, having a first class extension here seems like it would work well. I think @htuch's suggestion of using a "monitor extension" would allow us to abstract away the particular error into just "the error occurred" or "didn't occur", and then the core could just focus on how many of these occurred. This is similar to how the overload manager's monitors produce a float in the range [0, 1.0] to tell whether overload should kick in. If the sets of algorithms used in outlier detection are the same or similar, then we can get away with a "monitor extension"; otherwise we might need something larger.
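A rough sketch of that split, purely for illustration: the extension only answers "error or not", and protocol-agnostic core logic does the counting. The names `ErrorClassifier`, `Http5xxClassifier` and `ConsecutiveCounter` are hypothetical and are not taken from Envoy or from this PR.

```cpp
#include <cstdint>

// Hypothetical extension interface: classifies a result as error / not error.
class ErrorClassifier {
public:
  virtual ~ErrorClassifier() = default;
  virtual bool isError(int32_t status) const = 0;
};

// Example extension: treats HTTP 5xx as errors.
class Http5xxClassifier : public ErrorClassifier {
public:
  bool isError(int32_t status) const override { return status >= 500 && status < 600; }
};

// Core-side logic: counts consecutive errors without knowing the protocol.
class ConsecutiveCounter {
public:
  ConsecutiveCounter(const ErrorClassifier& classifier, uint32_t threshold)
      : classifier_(classifier), threshold_(threshold) {}

  // Returns true once `threshold` consecutive errors have been seen.
  bool onResult(int32_t status) {
    if (!classifier_.isError(status)) {
      count_ = 0; // any non-error resets the streak
      return false;
    }
    return ++count_ >= threshold_;
  }

private:
  const ErrorClassifier& classifier_;
  uint32_t threshold_;
  uint32_t count_{0};
};
```

Swapping in a different classifier (for example, one that also treats 4xx as errors) would leave the counting logic untouched, which is the point of the abstraction.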
markdroth left a comment
I like @htuch's idea of making OD monitors an extension point.
It is worth noting that gRPC also supports outlier detection (see gRFC A50), so whatever new API gets introduced here is one that we'll eventually need to support as well. @ejona86 @dfawley @murgatroid99
    // Union of possible error buckets.
    // [#not-implemented-hide:]
    message ErrorBucket {
      oneof bucket {
We now prefer not to use oneof in the xDS API, as per https://github.com/envoyproxy/envoy/blob/main/api/STYLE.md.
I think we can do such an abstraction. Thanks for the great idea!
Currently, everything is internally mapped to HTTP codes and then fed to OD. I think that in the future, for gRPC, we can define the meaning of "error" using grpc-status codes, but in the first iteration there will be no difference in how gRPC is treated by OD. Let me work on my prototype and use "typed" config to be plugged into protobuf's
|
/wait |
Moved proto for errors and consecutive errors monitor to envoy/extensions. Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
|
Extension to outlier detection monitor is defined as I still use |
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
markdroth left a comment
> I still use oneof to capture several different types of errors (http, locally originated, database error) under one type. I know that this is discouraged (thanks @markdroth). What is an alternative?
Yeah, we definitely don't want to use oneof, since it makes it incredibly difficult to add new features in a backward-compatible way.
In this particular case, why do we need the ErrorBucket wrapper to begin with? Couldn't we just plug the individual error bucket types into the monitors field directly?

    // Set of passive monitors.
    // [#not-implemented-hide:]
    message Monitor {
I think we should use the existing TypedExtensionConfig message instead of defining our own message here.
I was not aware of TypedExtensionConfig. Will refactor it. Thanks!
The idea is that a monitor (for example, consecutive errors) is configured with a list of error buckets of the same or different types. The best example is a mix of HTTP codes and locally originated errors (resets, timeouts). Those are two different proto messages. I imagined that a monitor simply has a list of buckets: So a monitor with buckets as above will check for consecutive errors which fall into any bucket. The following errors will trigger the monitor: HTTP 503->tcp reset->HTTP 302. In a nutshell, it would be great to have a single field where different proto messages could be used. Since,
|
Oh, okay, I think I see what the confusion is here. I saw that the If Getting back to your question, though,

    message ErrorBuckets {
      repeated HttpErrorsBucket http_errors = 1;
      repeated LocalOriginEvents local_origin_events = 2;
      repeated DatabaseTransactions database_transactions = 3;
    }

Then you can just have a non-repeated
Error types are going to be reused. I will move them to common/v3.
That should work! Thanks a lot @markdroth |
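To make the bucket idea discussed above concrete, here is a minimal sketch of how a consecutive-errors monitor might match mixed events against an `ErrorBuckets`-like structure. It is an illustration under assumptions, not the prototype's code: the C++ types below are hypothetical, the empty `LocalOriginEvents` / `DatabaseTransactions` buckets are modelled as simple on/off flags, and HTTP ranges are treated as half-open like `type.v3.Int32Range`.

```cpp
#include <cstdint>
#include <vector>

enum class EventKind { Http, LocalOrigin, Database };

struct Event {
  EventKind kind;
  int32_t http_code; // only meaningful when kind == EventKind::Http
};

// Half-open range [start, end).
struct HttpRange {
  int32_t start;
  int32_t end;
};

// Hypothetical in-memory counterpart of the ErrorBuckets message.
struct ErrorBuckets {
  std::vector<HttpRange> http_errors;
  bool local_origin_events{false};
  bool database_transactions{false};
};

// True if the event falls into any configured bucket.
bool matchesAnyBucket(const ErrorBuckets& buckets, const Event& event) {
  switch (event.kind) {
  case EventKind::Http:
    for (const auto& range : buckets.http_errors) {
      if (event.http_code >= range.start && event.http_code < range.end) {
        return true;
      }
    }
    return false;
  case EventKind::LocalOrigin:
    return buckets.local_origin_events;
  case EventKind::Database:
    return buckets.database_transactions;
  }
  return false;
}

// True if `events` contains `threshold` consecutive bucket matches.
bool triggers(const ErrorBuckets& buckets, const std::vector<Event>& events,
              uint32_t threshold) {
  uint32_t consecutive = 0;
  for (const auto& event : events) {
    consecutive = matchesAnyBucket(buckets, event) ? consecutive + 1 : 0;
    if (consecutive >= threshold) {
      return true;
    }
  }
  return false;
}
```

Assuming buckets for HTTP 3xx, HTTP 5xx and local-origin events with a threshold of 3, the sequence HTTP 503 -> tcp reset -> HTTP 302 trips the monitor, while an HTTP 200 in the middle resets the streak.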
    Monitors monitors = 24;
    repeated Monitor monitors = 24;
Hi, what role will the monitor take? Will it handle all the logic of outlier detection or just determine if an error is thrown?
And is the repeated necessary? Will all monitors work at the same time, or will just the first valid monitor be used?
If all monitors work at the same time, it would be more complex to handle the outputs of the different monitors. I would prefer to use only TypedExtensionConfig monitor = 24; here, because a single monitor is enough for the vast majority of users. If someone wants to combine multiple monitors, they can implement a composite extension and decide for themselves how to handle the outputs of the different monitors.
It's worth noting that the existing mechanism for configuring outlier detection does allow configuring both failure percentage and success rate at the same time. If the goal here is to move that existing configuration into an extension, then it seems like we should make it possible to configure multiple ejection algorithms.
> Hi, what role will the monitor take? Will it handle all the logic of outlier detection or just determine if an error is thrown?
A monitor will consume the results of interactions with an upstream entity and run the logic specific to that monitor, for example counting consecutive errors, or computing the standard deviation or the frequency of errors.
> And is the repeated necessary? Will all monitors work at the same time, or will just the first valid monitor be used?
The result of a transaction will go to all monitors.
> If all monitors work at the same time, it would be more complex to handle the outputs of the different monitors.
You are right that most users will use a single monitor. But, as @markdroth pointed out, there are already many monitors, so repeated Monitor is a natural extension. Also, I plan to add a monitor which measures response time and ejects slow servers. The combination of one monitor which counts errors and another monitor which measures response time starts to make sense.
I think that the idea of multiple monitors was always there. See this comment: https://github.com/envoyproxy/envoy/blob/release/v1.28/source/common/upstream/outlier_detection_impl.h#L369-L371. But for some reason, instead of adding new, separate monitors with different logic, all algorithms were added to a single monitor. With this new API, I hope to change it and maybe, if there is an agreement, deprecate the old ways of configuring the outlier detector.
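As a sketch of what "the result of a transaction will go to all monitors" could mean in practice, the example below fans one result out to every configured monitor and collects the names of those that tripped. It rests on assumptions: the `Monitor` interface, both monitor classes and `reportResult` are hypothetical, and the response-time monitor is deliberately simplistic (it trips on any single slow response).

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical monitor interface; the real extension interface may differ.
class Monitor {
public:
  virtual ~Monitor() = default;
  virtual const std::string& name() const = 0;
  // Consumes one result; returns true if this monitor wants the host ejected.
  virtual bool putResult(int32_t http_code, uint64_t latency_ms) = 0;
};

// Trips after `threshold` consecutive 5xx responses.
class ConsecutiveErrors : public Monitor {
public:
  ConsecutiveErrors(std::string name, uint32_t threshold)
      : name_(std::move(name)), threshold_(threshold) {}
  const std::string& name() const override { return name_; }
  bool putResult(int32_t http_code, uint64_t) override {
    count_ = (http_code >= 500 && http_code < 600) ? count_ + 1 : 0;
    return count_ >= threshold_;
  }

private:
  std::string name_;
  uint32_t threshold_;
  uint32_t count_{0};
};

// Trips on any response slower than `limit_ms` (simplified on purpose).
class SlowResponses : public Monitor {
public:
  SlowResponses(std::string name, uint64_t limit_ms)
      : name_(std::move(name)), limit_ms_(limit_ms) {}
  const std::string& name() const override { return name_; }
  bool putResult(int32_t, uint64_t latency_ms) override { return latency_ms > limit_ms_; }

private:
  std::string name_;
  uint64_t limit_ms_;
};

// Feeds one result to every monitor and returns the names of those that tripped.
std::vector<std::string> reportResult(std::vector<std::unique_ptr<Monitor>>& monitors,
                                      int32_t http_code, uint64_t latency_ms) {
  std::vector<std::string> tripped;
  for (auto& monitor : monitors) {
    if (monitor->putResult(http_code, latency_ms)) {
      tripped.push_back(monitor->name());
    }
  }
  return tripped;
}
```

Returning the monitor names reflects the earlier point that a name is useful for logging which monitor marked a host as unhealthy; how the detector combines multiple tripped monitors is left open here.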
sgtm if we intend to use the monitors to replace the whole previous implementation.
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
|
I actually have another question about the possible implementation. The initial target of this patch is to support outlier detection for non-HTTP protocols, so what would the extension interface look like? What types of input would the extension require, and what types of output would it provide?
You are almost correct. The initial target is to allow a user to define which HTTP codes should be treated as errors. Right now it is hardcoded to 5xx, but there is demand to treat 4xx codes as errors as well. The other goal is to allow a developer to add new types of errors without needing to map them to HTTP codes. Right now, everything needs to be somehow mapped to HTTP codes. I coded a prototype with 3 types of errors: user-defined HTTP codes, locally originated errors and database errors. I imagine that adding a gRPC error type and feeding the gRPC status into it should also be super easy. At the bottom, at the C++ level, the interface is a basic pure abstract
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
|
@wbpcode nag for stalled PR. (Can't tell if it should be /waited or if it's currently waiting for your reply.) |
wbpcode left a comment
LGTM overall. Only two minor comments.
(And so sorry for the delay, I recently lost lots of pings :(
    message HttpCodes {
      type.v3.Int32Range range = 1;
    }

    // Error bucket for locally originated errors.
    // [#not-implemented-hide:]
    message LocalOriginEvents {
    }

    // Error bucket for database errors.
    // Sub-parameters may be added later, like malformed response, error on write, etc.
    // [#not-implemented-hide:]
    message DatabaseTransactions {
    }
I would make these names simpler: HttpErrors, LocalErrors, DatabaseErrors, etc.
Agreed, the names are corrected now.
    // The % chance that a host is actually ejected. Defaults to 100.
    google.protobuf.UInt32Value enforcing = 3 [(validate.rules).uint32 = {lte: 100}];

    // Error buckets.
Could you add a comment here about what would be treated as an error? I guess any event listed in the buckets would be treated as an error?
This is just a placeholder for now to visualize roughly what the API will look like, and it will not be included in the docs. I plan to update and improve the documentation when I deliver the implementation of the first extension, the consecutive errors monitor.
|
/wait |
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
|
/retest |
wbpcode left a comment
LGTM. Thanks. Look forward to the impl. :)
|
@wbpcode Thanks for approving. I have a prototype working, so it should move forward smoothly. |
Commit Message:
API proposal for extensions to outlier detection
Additional Description:
The goal of outlier detection extensions is to allow:
The design, config snippets, and a link to a working demo (works for non-5xx errors and redis errors) are in https://docs.google.com/document/d/1ZCZSoirVB39eOLdD0VPlsEUING8c23Sq5bzozrv6f4k/edit?usp=drive_link
Risk Level: Low
Testing: N/A
Docs Changes: Not at the moment
Release Notes: Not at the moment
Platform Specific Features: No