upstream: implementation of outlier detection extensions#34154
upstream: implementation of outlier detection extensions#34154cpakulski wants to merge 102 commits intoenvoyproxy:mainfrom
Conversation
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
… monitor. Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
|
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to |
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
|
/retest |
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
|
@wbpcode I see that you have been assigned for API approval, but changes in proto files are not really API changes, but rather changes to event log which is defined in terms of protobufs. Hope it helps! |
|
will the old outlier detection extension be eventually deprecated? (given that the new one lands) |
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
|
/wait @cpakulski please merge main to resolve the conflict @wbpcode is this PR ready for your review again? |
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
|
See https://github.com/envoyproxy/envoy/actions/runs/15151640605/job/42598717694 and #34154 (comment). I will prefer to get a balanced result between the complex router or complex outlier extension. The currenty Result is used for multiple different protocol and there is no reason to abort it. And It's only have very limited statuses (4 different value actually, no much complex than boolean), it should be fine to prapagate this status to all outlier extensions. Then it's unnecessary to add complex proto API into router to select different outlier extension for different events. |
I think it's hard to accpt this API. That means the HTTP router need to aware different monitors of different clusters. And the HCM router configuration will interact with the CDS in some way. IMO, the current 4 statuses And if complex mapping (protocol specific status codes to L7 failure or L7 success) is necessary in the future, we could configure a opaque mapping extension in the outlier detection, and the router could call the mapping extension (if is provided) to generate the L7 status. Then, by this way,
|
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
|
This is a valid point. So, this (current) API moves the complexity towards the router and makes outlier extensions very simple. Also, it makes possible to define more than one http matcher and send matching results to different extension monitors (by using monitor names). I am not really sure if this is a viable scenario though. So, in order to simplify things on the router side the only option is to reduce configuration flexibility (only one http matcher) and make some sort of a filter (which events L4 or L7 should be accepted by an extension). |
@wbpcode I need some time to design/prototype the API. Your point about not specifying clusters at the router is very valid one and I 100% agree. The difficulty with designing the API is where to make the conversion from protocol specific result to protocol independent result. And this needs to be done per cluster, because some clusters may need to treat say http codes range 500-599 as error and other cluster may treat http codes range 400-599 as error or one cluster may want to receive L4 (locally originated events) and other may not want to receive it.
So this is sort of per-cluster configuration, but still keeps outlier detection monitors very simple/protocol independent (understand only Success/Error). The cluster is already specified and this config will send to the outlier extensions attached to Sorry for long message, but this API is very difficult to design, especially that it requires backwards compatibility with existing old "canned" outlier monitors. |
You are welcome! And thanks so much for your contribution and patiance. I do think thing could be simpler. If new API is necessary, as I said, we can add a new field at outlier detection to map the external L7 statuses to - name:exmaple
type: STRICT_DNS
connect_timeout: 0.25s
lb_policy: ROUND_ROBIN
outlier_detection:
external_statuses_policy:
"@type": type.googleapis.com/envoy.extensions.outlier_detection.external_statuses_policy.v3.Http
http_code_ranges:
- start: 500
end: 599
result: ExtOriginRequestFailed
load_assignment:
cluster_name: sds
endpoints:
- lb_endpoints:
- endpoint:
address:
socket_address:
address: exmaple.com
port_value: 80
protocol: TCPOr we can use the exist extension point - name:exmaple
type: STRICT_DNS
connect_timeout: 0.25s
lb_policy: ROUND_ROBIN
typed_extension_protocol_options:
envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
"@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
explicit_http_config:
http2_protocol_options: {}
outlier_detection:
status_ranges:
- start: 500
end: 599
result: ExtOriginRequestFailed
load_assignment:
cluster_name: sds
endpoints:
- lb_endpoints:
- endpoint:
address:
socket_address:
address: exmaple.com
port_value: 80
protocol: TCP
- name: redis
type: STRICT_DNS
connect_timeout: 0.25s
lb_policy: ROUND_ROBIN
typed_extension_protocol_options:
envoy.filters.network.redis_proxy:
"@type": type.googleapis.com/envoy.extensions.filters.network.redis_proxy.v3.RedisProtocolOptions
outlier_detection:
status_ranges: ...Both are fine to me although I prefer the seond way because needn't to define new extension point and could aggregate all protocol-dependent configuration at same position. |
|
But anyway, we can leave it at future PR. At this PR, I think the outlier detection should handle the fixed 4 statuses (L4 + L7) first. (I still insist on that to reuse the We can add a field at |
Commit Message: upstream: unify all outlier report to use putResult Additional Description: To use `putResult` for all outlier detection event reporting. This is related to the #34154. Ideally, the putResult will be used as the unique event reporting entrypoint of outlier detection and the `Result` enum will be used as the input of the custom outlier monitors. This also fixed a bug where the connection success event will be reported twice. Risk Level: low. Testing: n/a Docs Changes: n/a. Release Notes: n/a. Platform Specific Features: [Optional Runtime guard:] [Optional Fixes #Issue] [Optional Fixes commit #PR or SHA] [Optional Deprecated:] [Optional [API Considerations](https://github.com/envoyproxy/envoy/blob/main/api/review_checklist.md):] --------- Signed-off-by: wangbaiping(wbpcode) <wbphub@gmail.com> Signed-off-by: code <wbphub@gmail.com> Co-authored-by: Greg Greenway <ggreenway@apple.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
wbpcode
left a comment
There was a problem hiding this comment.
Thanks for the update. And some comments are added to the API.
| // structure. | ||
| // [#extension-category: envoy.outlier_detection_monitors] | ||
| repeated core.v3.TypedExtensionConfig monitors = 24; | ||
| core.v3.TypedExtensionConfig monitors = 24; |
There was a problem hiding this comment.
Then the name should be monitor. This change LGTM, but I want to say the or/and wrapper would complex to implementation.
There was a problem hiding this comment.
I looked at the implementation of matcher and can reuse some of the code from there. matcher allows creation of various AND + OR hierarchies. I do not think that such scenarios will be used very often, but I have seen requirements for multiple monitors to fail in order to mark a node as unhealthy. Will change to monitor. Thanks.
| message LocallyOriginatedEvents { | ||
| // If enabled, locally originated events will be reported to outlier detection extensions. | ||
| // Defaults to true. | ||
| bool enable = 1; | ||
| } |
There was a problem hiding this comment.
This is unnecessary to be part of HTTP protocol. You only need an EventFilter at the MonitorCommonConfig:
enum Event {
LocalOriginFailure = 0;
LocalOriginSuccess = 1;
ExtOriginFailure = 2;
ExtOriginSuccess = 3;
}
message MonitorCommonConfig {
// The % chance that a host will be actually ejected when an monitor's
// algorithm decides that the host is unhealthy.
// This setting can be used to disable
// ejection or to ramp it up slowly. Defaults to 100.
google.protobuf.UInt32Value enforcing = 1 [(validate.rules).uint32 = {lte: 100}];
// The list of events to accept.
repeated Event take_events = 1;
}
There was a problem hiding this comment.
So we are back to the subject where to send events. I originally used names of monitors. Assuming that in the future there will be hierarchy (AND/OR) of monitors, not all monitors may be interested in all events. For example some may only be interested in locally originated events, others in http events. So, I used names next to the http matcher.
There are few problems with the enum Event.
- Why would a user want to receive only
ExtOriginFailureand not receiveExtOriginSuccess? That would completely break a monitor logic because a monitor would receive only failures and any monitor will fail every time. - it exposes internal naming used in the code. It is simpler to express: here is a matcher and send the result to the following monitors than using Local/ExtOrigin names (IMHO).
- this problem is a bit difficult to understand. It took me a long time to solve when I worked on splitting locally originated and externally originated events. The problem is that if both types of events (local and ext origin) are sent to the same monitor, some locally originated events must be silenced. Let's assume that the monitor type is consecutive errors (3 failures in the row mark a node as unhealthy). A request to the upstream server which returns error 500 will result in two events: LocallyOriginatedSuccess (because Envoy successfully connected to upstream) and ExtFailure (because 500 was returned). In that case the monitor will never fail, because 3 exchanges will result in sending to the monitor: SUCCESS, FAILURE, SUCCESS, FAILURE, SUCCESS, FAILURE. So, in order for this thing to work, LocallyOriginatedSuccess must be suppressed when http events are sent to the same monitor. That is why I added this code:
envoy/source/common/upstream/upstream_impl.cc
Lines 1590 to 1596 in 3c4d116
So, in essence I think it is better to keep such logic in router and make monitors to accept only success/failure (true/false)
There was a problem hiding this comment.
Response to first point: it may never happen but we needn't care. We have provided all possible groups and the users/developers decide how to make the custom monitor works as expected.
Response to second point: no matter which name is used, the users/developers need to aware the four different events clearly because it is core of whole outlier detection.
Response to third point: this is why we need this take_events list. Every monitor could select the events it concern and suppress some events. Then the events after filtered will be converted to bool and be handled by the monitor extension.
If we do this in router, how to handle the case that differ monitor need consume different events? As I said before, it's unacceptable to let router to aware different monitors.
So, a good enough solution is the router will report all events to outlier detection module. The outlier detection module will filter the events based on the monitor's concern (the take_events of every monitor) and convert the filtered events to bool. Finally, the monitor takes the bool as input.
Per you current API, how to support the case where one monitor want to consume only local origin event and another one want to consume the external origin event in an or monitor list? And I also think HttpProtocalOptions should only contains protocol specific things. At here, it is configuration about how to map the external response to ExtOriginSuccess/ExtOriginFailure.
There was a problem hiding this comment.
Response to first point: it may never happen but we needn't care. We have provided all possible groups and the users/developers decide how to make the custom monitor works as expected.
Actually, I do not agree with this. Users should think in terms of HTTP status codes not in terms of internal variables passed to outlier. The problem with this solution is that if a user sends http codes and locally originated events to the same monitor, LocalOriginSuccess must be explicitly disabled. Otherwise the monitor will never fail for consecutive 5xx codes. User does not need to understand those internals. The router should handle that logic.
Response to second point: no matter which name is used, the users/developers need to aware the four different events clearly because it is core of whole outlier detection.
The same as above. Why? All a user should care about is how to define what should be treated as error.
If we do this in router, how to handle the case that differ monitor need consume different events? As I said before, it's unacceptable to let router to aware different monitors.
Agreed, but it is not at the router level anymore, but in cluster's protocol extensions.
Per you current API, how to support the case where one monitor want to consume only local origin event and another one want to consume the external origin event in an or monitor list?
That was the purpose of the monitor names. I dropped it to start with smaller PR footprint, but names could be added later.
Anyways, let's pause for a while. I have another, much simpler idea and will open a new PR soon.
There was a problem hiding this comment.
Actually, I do not agree with this. Users should think in terms of HTTP status codes not in terms of internal variables passed to outlier. The problem with this solution is that if a user sends http codes and locally originated events to the same monitor, LocalOriginSuccess must be explicitly disabled. Otherwise the monitor will never fail for consecutive 5xx codes. User does not need to understand those internals. The router should handle that logic.
Then how the router handle this correctly for various cases like local events only monitor or ext events only monitor? You still need some configuration options and the users still need to understand the backend logic of the options. If you don't want users to know all these events, then you can also creat similar enum to provide LocalOrigin, ExtOrigin, Default, three classes of events. The users only need to know which class of events is need to concern. This may more friendly to users.
You may could provide a good enough default values, but the complexity never disappeared.
The same as above. Why? All a user should care about is how to define what should be treated as error.
Nope. Only the users wants to use more flexible features need to understand it. Like current users also need to understand the split mode of outlier detection. If the users want to understand the split mode clearly, it basically needs to read the code to know which is local events and what is the actual behavior. Actually defining this events in the API is more friendly in some way.
Anyways, let's pause for a while. I have another, much simpler idea and will open a new PR soon.
Sounds great. But keep it small and let we get agreement by one step and on step.
There was a problem hiding this comment.
Then how the router handle this correctly for various cases like local events only monitor or ext events only monitor?
LocalOriginEvents and ExtOriginEvents are not totally independent. If they go to separate monitors, then they are independent and router sends all LocalOriginEvents to one monitor and all ExtOriginEvents to the second monitor. If however, they go to the same monitor, router has to suppress sending ExtOriginSuccess. So, in order to implement this logic in router, router has to somehow know where the events go. I implemented this logic in HttpProtocolOptions.
There was a problem hiding this comment.
LocalOriginEvents and ExtOriginEvents are not totally independent. If they go to separate monitors, then they are independent and router sends all LocalOriginEvents to one monitor and all ExtOriginEvents to the second monitor. If however, they go to the same monitor, router has to suppress sending ExtOriginSuccess. So, in order to implement this logic in router, router has to somehow know where the events go. I implemented this logic in HttpProtocolOptions.
That's why I suggest use an enum to control which event should be accepted. No matter it's event list or event type. The comlexity won't disappear and I don't see why HttpProtocolOptions would be better solution for that. I personally think only the protocol-specific things should be placed in HttpProtocolOptions (for example, mapping response to ext event).
But you refered you have a better idea, we can check it first.
| } | ||
|
|
||
| message OutlierDetection { | ||
| HttpEvents http_events = 1; |
There was a problem hiding this comment.
So, all the responses that mach this will be treat as failure or success? I think this should be document clearly.
And you may needn't the HttpEvents wrapper. You can use the matcher header directly:
config.common.matcher.v3.MatchPredicate failure_matcher = 1;
// or
config.common.matcher.v3.MatchPredicate success_matcher = 1;
There was a problem hiding this comment.
I agree that failure_matcher is a good idea. Thanks. Will change the name.
I used a larger structure message OutlierDetection { on purpose, because I think in the future we may add some extra parameters next to the matcher, like names of monitors where to send the result of matching. It nicely groups the matcher and destination of matching result. If the matcher is separated, it would be harder to add list of destinations. WDYT?
There was a problem hiding this comment.
OutlierDetection message sgtm. But needn't HttpEvents wrapper.
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
|
/wait Waiting for addressing comments. |
|
@cpakulski hey there do you plan to finish up on this PR, its been so much work done on it so its going to be a shame to go on waste. |
|
Will try to pick this up if @cpakulski has no time for this. |
|
I am still planning to finish it. I am on annual vacations now and will go back to this issue in 2-3 weeks. |
Commit Message:
upstream: implementation of outlier detection extensions
Additional Description:
The idea and need for the extensions are described in RFC document: https://docs.google.com/document/d/1ZCZSoirVB39eOLdD0VPlsEUING8c23Sq5bzozrv6f4k/edit?usp=drive_link
In a nutshell, the design decouples types of result reported to outlier detector from an algorithm which marks a host as unhealthy. For example, an outlier detector may be configured to count 3 consecutive errors and type of those errors can be defined by a user. For example:
So, the algorithm does not really care about the exact type of reported result. It is only interested whether reported result should be considered an error or not.
The idea of using user-defined errors can be expanded to database errors. See issue #24215 (I have working prototype for errors reported by Redis).
This design puts extensions on top of already existing outlier detector. It means that the solution is 100% backwards compatible. Previous configs are accepted. But a user may configure outlier detection extension to
The implementation of extensions is built on top of already existing outlier detection structures. This minimizes code changes and re-uses event logger, timers, etc.
The implementation is built on top of already approved API for extensions: #31205
Risk Level: Low (previous configuration still works. Extensions do not have to be configured)
Testing: Added unit tests for new code and tests checking co-existence of "old" outlier and "new" extensions
Docs Changes: Yes. Added.
Release Notes: Yes.
Platform Specific Features: No
Fixes #18789