upstream: API proposal for extensions to outlier detection by cpakulski · Pull Request #31205 · envoyproxy/envoy

cpakulski · 2023-12-06T18:49:10Z

Commit Message:
API proposal for extensions to outlier detection

Additional Description:
The goal of outlier detection extensions is to allow:

- defining HTTP code ranges which should be treated as errors. See "Outlier Detection for non-error status codes" #18789
- using passive health checking for non-HTTP protocols (like Redis, Postgres). See "redis outlier detection when redis is down but Envoy is up" #24215

The design, config snippets, and a link to a working demo (which works for non-5xx errors and Redis errors) are in https://docs.google.com/document/d/1ZCZSoirVB39eOLdD0VPlsEUING8c23Sq5bzozrv6f4k/edit?usp=drive_link

Risk Level: Low
Testing: N/A
Docs Changes: Not at the moment
Release Notes: Not at the moment
Platform Specific Features: No

…errors. Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

repokitteh-read-only · 2023-12-06T18:49:19Z

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @wbpcode
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #31205 was opened by cpakulski.


wbpcode

Thanks for your contribution. From a quick check of the outlier detection source code, it seems to be well designed for different protocols.

All proxies (tcp proxy, redis proxy, mysql proxy, etc.) can call putResult to tell outlier detection the results of requests (a code for HTTP and a Result for non-HTTP).

If we want to do more in this area, then it is reasonable to create some cross-protocol errors that are used by outlier detection, and then define a configurable mapping from protocol-dependent statuses to these general errors.

wbpcode · 2023-12-07T03:32:46Z

+    // Monitor name.
+    string name = 1;


What is the usage of this name?

Several monitors can be active within one cluster. The name would be used to indicate/log which monitor was triggered and marked a host as unhealthy.

wbpcode · 2023-12-07T04:15:45Z

+// Error bucket for HTTP codes
+// [#not-implemented-hide:]
+message HttpErrorsBucket {
+  string name = 1;
+  type.v3.Int32Range range = 2;
+}
+
+// Error bucket for locally originated errors
+// [#not-implemented-hide:]
+message LocalOriginEvents {
+}
+
+// Error bucket for database errors.
+// Sub-parameters may be added later, like malformed response, error on write, etc.
+// [#not-implemented-hide:]
+message DatabaseTransactions {
+}
+
+// Union of possible error buckets.
+// [#not-implemented-hide:]
+message ErrorBucket {
+  oneof bucket {
+    HttpErrorsBucket http_errors = 1;
+    LocalOriginEvents local_origin_events = 2;
+    DatabaseTransactions database_transactions = 3;
+  }
+}


I think these should be part of the API of the different proxies rather than of outlier detection itself.

That is a good idea. I will try to reshuffle it, but I am not sure how difficult it would be.

wbpcode · 2023-12-07T04:16:40Z

cc @envoyproxy/api-shepherds I think this is a core API change and others may also want to take a look.

cc @htuch @markdroth

cpakulski · 2023-12-08T00:45:53Z

The idea of extending the outlier detector has been brewing in my head for many months. Because it is a core API, backwards compatibility is essential. But I am convinced that if this piece of the API were designed today, it would look close to what is proposed here.
I believe that historically the OD mechanism was designed for 5xx errors (so it was very HTTP-specific). Later, locally originated errors (timeouts, resets, etc.) were added and, for simplicity, were treated as 5xx errors. Back in 2019 I added a configuration item to distinguish HTTP codes from locally originated errors: #4822. To do that I had to duplicate most of the config items, such as the number of consecutive errors and the success rate params, and introduce separate counters for HTTP errors and locally originated errors.
The problem is that in the current API the algorithm is tied to the type of error. For example, consecutive_5xx defines an algorithm that counts consecutive errors, and those errors must be HTTP 5xx. consecutive_local_origin is similar, but for locally originated errors.
In this API proposal I would like to separate the algorithm from the type of error. The administrator attaches to an algorithm the types of errors it should count.
To address issue #18789 with the current API, there are two choices. One is to add another set of config items such as consecutive_4xx, enforce_consecutive_4xx, success_rate_4xx, enforce_success_rate_4xx, and so on: about 8-10 items. The other is a boolean flag, "treat_4xx_as_5xx". In the proposed API, all that is needed is another HTTP error bucket with range 400-499.
It is even more cumbersome for database errors. To address #24215, another 8-10 items, such as consecutive_db_error, would have to be introduced. Or we would have to treat DB errors like HTTP 5xx, which is very awkward.
Another example is gRPC. How would one use OD and feed it grpc-status? In the proposed API, we would add a GRPCError bucket type and specify which codes should be treated as errors.
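
The separation of algorithm from error type described above can be sketched in a few lines. This is only an illustration, not Envoy code: the names `ConsecutiveErrors`, `FailurePercentage` and `is_http_error_in` are hypothetical, results are modeled as bare HTTP status integers, and details such as minimum request volume are left out.

```python
from typing import Callable

# Hypothetical classifier type: decides whether one result counts as an error.
ErrorPredicate = Callable[[int], bool]


def is_http_error_in(start: int, end: int) -> ErrorPredicate:
    """Build a classifier for HTTP codes in the inclusive range [start, end]."""
    return lambda code: start <= code <= end


class ConsecutiveErrors:
    """Algorithm: trip after `threshold` errors in a row, whatever 'error' means."""

    def __init__(self, threshold: int, is_error: ErrorPredicate) -> None:
        self.threshold = threshold
        self.is_error = is_error
        self.streak = 0

    def put_result(self, code: int) -> bool:
        # A non-error result resets the streak; an error extends it.
        self.streak = self.streak + 1 if self.is_error(code) else 0
        return self.streak >= self.threshold


class FailurePercentage:
    """Algorithm: trip when the share of errors reaches `percent`."""

    def __init__(self, percent: float, is_error: ErrorPredicate) -> None:
        self.percent = percent
        self.is_error = is_error
        self.total = 0
        self.errors = 0

    def put_result(self, code: int) -> bool:
        self.total += 1
        self.errors += 1 if self.is_error(code) else 0
        return 100.0 * self.errors / self.total >= self.percent


# The same 4xx classifier plugs into either algorithm without new config items.
four_xx = is_http_error_in(400, 499)
consecutive = ConsecutiveErrors(threshold=3, is_error=four_xx)
percentage = FailurePercentage(percent=50.0, is_error=four_xx)
```

With this shape, supporting a new error type means writing one more classifier rather than duplicating every algorithm's parameters.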

The other thing is that internally all OD algorithms are executed regardless of whether they are used or not. For example, even if only consecutive errors are configured, the logic for success_rate and failure_frequency is still executed; it just never kicks in.

The proposed API allows for backwards compatibility and co-existence of "old" config and new one.

@htuch @mattklein123 WDYT about this proposal? I remember that your comments helped me a lot when I was working on separating locally originated errors from 5xx.

htuch · 2023-12-08T06:30:34Z

One thought I had (triggered by your use of "extend") is that maybe OD should be a 1st class API extension point. It would definitely empower folks to do a lot of customization without needing to modify core APIs going forward. In your proposed change, I think that rather than treat the whole OD as an extension, the "monitors" would become pluggable extensions. You could then have actions related to monitors as an additional extension type.

Overload manager is a great example of this API design pattern.

cpakulski · 2023-12-12T22:53:03Z

@htuch Thanks for sharing your thoughts. Are you talking about that part of overload manager: https://github.com/envoyproxy/envoy/blob/main/api/envoy/config/overload/v3/overload.proto#L35-L44?

htuch · 2023-12-13T06:34:58Z

Yep, that basic idea. CC @KBaichoo who might have thoughts on how this pattern worked out and whether it would make sense for outlier detection.

KBaichoo · 2023-12-18T21:17:20Z

Agreed, this seems like having a first class extension here would work well.

I think @htuch's suggestion here of using a "monitor extension" would allow us to abstract away the particular error into just "the error occurred" or "didn't occur", and then the core could just focus on how many of these occurred. This is similar to how the overload manager's monitors produce a float in the range [0, 1.0] to tell whether overload should kick in. If the set of algorithms used in outlier detection is the same or similar, then we can get away with a "monitor extension"; otherwise we might need something larger.

markdroth

I like @htuch's idea of making OD monitors an extension point.

It is worth noting that gRPC also supports outlier detection (see gRFC A50), so whatever new API gets introduced here is one that we'll eventually need to support as well. @ejona86 @dfawley @murgatroid99

markdroth · 2023-12-18T21:47:49Z

+// Union of possible error buckets.
+// [#not-implemented-hide:]
+message ErrorBucket {
+  oneof bucket {


We now prefer not to use oneof in the xDS API, as per https://github.com/envoyproxy/envoy/blob/main/api/STYLE.md.

cpakulski · 2023-12-19T20:58:42Z

If the set of algorithms used in outlier detection is the same or similar, then we can get away with a "monitor extension"; otherwise we might need something larger.

I think we can do such an abstraction. Thanks for the great idea!

It is worth noting that gRPC also supports outlier detection (see gRFC A50), so whatever new API gets introduced here is one that we'll eventually need to support as well.

Currently, everything is internally mapped to HTTP codes and then fed to OD. I think that in the future, for gRPC, we can define the meaning of "error" using grpc-status codes, but in the first iteration there will be no difference in how gRPC is treated by OD.

Let me work on my prototype and use a "typed" config that is plugged into protobuf's Any.

adisuissa · 2024-01-02T14:41:14Z

/wait

Moved proto for errors and consecutive errors monitor to envoy/extensions. Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

cpakulski · 2024-01-09T16:35:15Z

The extension to the outlier detection monitor is now defined as Any. Thanks to @htuch and @KBaichoo for the idea. I also have a working prototype which uses this new proto.

I still use oneof to capture several different types of errors (http, locally originated, database error) under one type. I know that this is discouraged (thanks @markdroth). What is an alternative?

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

markdroth

I still use oneof to capture several different types of errors (http, locally originated, database error) under one type. I know that this is discouraged (thanks @markdroth). What is an alternative?

Yeah, we definitely don't want to use oneof, since it makes it incredibly difficult to add new features in a backward-compatible way.

In this particular case, why do we need the ErrorBucket wrapper to begin with? Couldn't we just plug the individual error bucket types into the monitors field directly?

markdroth · 2024-01-09T19:58:17Z


+  // Set of passive monitors.
+  // [#not-implemented-hide:]
+  message Monitor {


I think we should use the existing TypedExtensionConfig message instead of defining our own message here.

I was not aware of TypedExtensionConfig. Will refactor it. Thanks!

cpakulski · 2024-01-09T20:25:26Z

In this particular case, why do we need the ErrorBucket wrapper to begin with? Couldn't we just plug the individual error bucket types into the monitors field directly?

The idea is that a monitor (for example, consecutive errors) is configured with a list of error buckets of the same or different types. The best example is a mix of HTTP codes and locally originated errors (resets, timeouts). Those are two different proto messages. I imagined that a monitor simply has a list of buckets:
repeated ErrorBuckets buckets
and the user may plug in different types of buckets (same type or different types): HTTP codes, locally originated. As below:

buckets:
  - http_errors:
       range: 500-599
  - http_errors:
       range: 300-325
  - local_origin_events: {}

So a monitor with buckets as above will check for consecutive errors which fall into any bucket. The following errors will trigger the monitor: HTTP 503->tcp reset->HTTP 302.
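
The matching behaviour described here can be sketched as follows. All class and function names are hypothetical, the ranges are treated as inclusive for readability, and the threshold of 3 is an assumption made only so that the three-event sequence above trips the monitor.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class HttpResult:
    code: int


@dataclass(frozen=True)
class LocalOriginEvent:
    kind: str  # e.g. "reset" or "timeout"


Result = Union[HttpResult, LocalOriginEvent]


@dataclass(frozen=True)
class HttpErrorsBucket:
    start: int
    end: int  # inclusive in this sketch

    def matches(self, result: Result) -> bool:
        return isinstance(result, HttpResult) and self.start <= result.code <= self.end


@dataclass(frozen=True)
class LocalOriginBucket:
    def matches(self, result: Result) -> bool:
        return isinstance(result, LocalOriginEvent)


def consecutive_errors_tripped(buckets, results: List[Result], threshold: int) -> bool:
    """Return True once `threshold` results in a row fall into any bucket."""
    streak = 0
    for result in results:
        streak = streak + 1 if any(b.matches(result) for b in buckets) else 0
        if streak >= threshold:
            return True
    return False


buckets = [HttpErrorsBucket(500, 599), HttpErrorsBucket(300, 325), LocalOriginBucket()]
sequence = [HttpResult(503), LocalOriginEvent("reset"), HttpResult(302)]
tripped = consecutive_errors_tripped(buckets, sequence, threshold=3)
```

Each event lands in a different bucket, yet all three extend the same streak, because the monitor only asks whether a result falls into any configured bucket.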

In a nutshell, it would be great to have a single field where different proto messages could be used. Since oneof is not to be used, is using Any/TypedExtensionConfig the only other choice?

markdroth · 2024-01-09T20:54:28Z

Oh, okay, I think I see what the confusion is here. I saw that the ErrorBucket message was defined in a file in a separate subdirectory (error_types), so I assumed that was intended to be a separate extension -- i.e., I thought your intent was that the monitors field could contain either ConsecutiveErrors or ErrorBucket. But it looks like what you're actually intending here is just that ErrorBucket is one of the fields inside the ConsecutiveErrors message.

If ErrorBucket is not going to be used in anything except ConsecutiveErrors, then you could just define it in the same file. Alternatively, if the intent here is just that the ErrorBucket is a reusable message that in the future could be used for extensions other than ConsecutiveErrors, then I suggest moving those types to a directory that does not imply that they are a separate extension. It would probably be fine to put them in api/envoy/extensions/outlier_detection_monitors/v3 or api/envoy/extensions/outlier_detection_monitors/common/v3.

Getting back to your question, though, Any and TypedExtensionConfig don't actually solve the same problem as oneof, and I don't think they're helpful here. How about just moving the repeated down one level, so it looks like this:

message ErrorBuckets {
  repeated HttpErrorsBucket http_errors = 1;
  repeated LocalOriginEvents local_origin_events = 2;
  repeated DatabaseTransactions database_transactions = 3;
}

Then you can just have a non-repeated ErrorBuckets message in ConsecutiveErrors. This would still allow you to group any number of conditions of any of these types into a single bucket.

cpakulski · 2024-01-09T22:52:57Z

If ErrorBucket is not going to be used in anything except ConsecutiveErrors, then you could just define it in the same file. Alternatively, if the intent here is just that the ErrorBucket is a reusable message that in the future could be used for extensions other than ConsecutiveErrors, then I suggest moving those types to a directory that does not imply that they are a separate extension. It would probably be fine to put them in api/envoy/extensions/outlier_detection_monitors/v3 or api/envoy/extensions/outlier_detection_monitors/common/v3.

Error types are going to be reused. I will move them to common/v3.

How about just moving the repeated down one level, so it looks like this:
message ErrorBuckets {
  repeated HttpErrorsBucket http_errors = 1;
  repeated LocalOriginEvents local_origin_events = 2;
  repeated DatabaseTransactions database_transactions = 3;
}

That should work! Thanks a lot @markdroth

wbpcode · 2024-01-10T01:57:41Z

-  Monitors monitors = 24;
+  repeated Monitor monitors = 24;


Hi, what role will the monitor take? Will it handle all the logic of outlier detection, or just determine whether an error is thrown?

And is the repeated necessary? Will all monitors work at the same time, or will just the first valid monitor be used?

If all monitors work at the same time, it would be more complex to handle the outputs of the different monitors. I would prefer to use only TypedExtensionConfig monitor = 24; here, because a single monitor is enough for the absolute majority of users. If someone wants to compose multiple monitors, they can implement that as a composite extension and decide for themselves how to handle the different outputs of the different monitors.

It's worth noting that the existing mechanism for configuring outlier detection does allow configuring both failure percentage and success rate at the same time. If the goal here is to move that existing configuration into an extension, then it seems like we should make it possible to configure multiple ejection algorithms.

Hi, what role will the monitor take? Will it handle all the logic of outlier detection, or just determine whether an error is thrown?

A monitor will consume the results of interactions with an upstream entity and run the logic specific to that monitor: for example, counting consecutive errors, computing a standard deviation, or tracking the frequency of errors.

And is the repeated necessary? Will all monitors work at the same time, or will just the first valid monitor be used?

The result of a transaction will go to all monitors.

If all monitors work at the same time, it would be more complex to handle the outputs of the different monitors.

You are right that most users will use a single monitor. But, as @markdroth pointed out, there are already many monitors, so repeated Monitor is a natural extension. Also, I plan to add a monitor which measures response time and ejects slow servers. The combination of one monitor which counts errors and another monitor which measures response time starts to make sense.
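
A minimal sketch of that fan-out, under stated assumptions: every result is delivered to every monitor, and the caller learns the names of the monitors that tripped (which matches the stated purpose of the monitor name, logging which monitor marked a host as unhealthy). The result shape, the factory functions and the thresholds are hypothetical, and how the core should combine several tripped monitors is deliberately left open.

```python
from typing import Callable, List, Tuple

# Hypothetical monitor shape: a name plus a callable returning True when tripped.
Monitor = Tuple[str, Callable[[dict], bool]]


def make_consecutive_errors(threshold: int) -> Callable[[dict], bool]:
    """Monitor that trips after `threshold` error results in a row."""
    streak = 0

    def put_result(result: dict) -> bool:
        nonlocal streak
        streak = streak + 1 if result["error"] else 0
        return streak >= threshold

    return put_result


def make_slow_response(limit_ms: float) -> Callable[[dict], bool]:
    """Monitor that trips on any response slower than `limit_ms`."""

    def put_result(result: dict) -> bool:
        return result["latency_ms"] > limit_ms

    return put_result


def report(monitors: List[Monitor], result: dict) -> List[str]:
    """Feed one result to every monitor; return names of those that tripped."""
    # A list comprehension (not any()) so no monitor is skipped by short-circuiting.
    return [name for name, put_result in monitors if put_result(result)]


monitors: List[Monitor] = [
    ("errors", make_consecutive_errors(threshold=2)),
    ("latency", make_slow_response(limit_ms=500.0)),
]
```

Calling `report(monitors, {"error": True, "latency_ms": 900.0})` twice in a row would, on the second call, return both names, since the error streak reaches 2 and the latency limit is exceeded.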

I think that the idea of multiple monitors was always there. See this comment: https://github.com/envoyproxy/envoy/blob/release/v1.28/source/common/upstream/outlier_detection_impl.h#L369-L371. But for some reason, instead of adding new, separate monitors with different logic, all algorithms were added to a single monitor. With this new API, I hope to change that and maybe, if there is agreement, deprecate the old ways of configuring the outlier detector.

sgtm if we intend to use the monitors to replace whole previous implementation.

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

wbpcode · 2024-01-11T01:56:05Z

I actually have another question about the possible implementation. The initial target of this patch is to support outlier detection for non-HTTP protocols, so what would the extension interface look like? What type of inputs would the extension require, and what type of outputs would it provide?

cpakulski · 2024-01-11T16:14:37Z

I actually have another question about the possible implementation. The initial target of this patch is to support outlier detection for non-HTTP protocols, so what would the extension interface look like? What type of inputs would the extension require, and what type of outputs would it provide?

You are almost correct. The initial target is to allow a user to define which HTTP codes should be treated as errors. Right now it is hardcoded to 5xx, but there is demand to treat 4xx codes as errors as well. The other goal is to allow a developer to add new types of errors without needing to map them to HTTP codes. Right now, everything needs to be mapped to HTTP codes somehow. I coded a prototype with 3 types of errors: user-defined HTTP codes, locally originated errors and database errors. I imagine that adding a gRPC error type and feeding grpc-status into it should also be very easy.

At the bottom, at the C++ level, the interface is a basic pure abstract Error class. Monitors just consume those errors and check whether they match what the user defined. If there is a match, counters are increased. No assumption is made about a specific type of error.

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

ravenblackx · 2024-01-17T16:25:07Z

@wbpcode nag for stalled PR. (Can't tell if it should be /waited or if it's currently waiting for your reply.)

wbpcode

LGTM overall. Only two minor comments.

(And so sorry for the delay, I recently lost lots of pings :(

wbpcode · 2024-02-06T01:48:17Z

+message HttpCodes {
+  type.v3.Int32Range range = 1;
+}
+
+// Error bucket for locally originated errors.
+// [#not-implemented-hide:]
+message LocalOriginEvents {
+}
+
+// Error bucket for database errors.
+// Sub-parameters may be added later, like malformed response, error on write, etc.
+// [#not-implemented-hide:]
+message DatabaseTransactions {
+}


I would simplify these names to HttpErrors, LocalErrors, DatabaseErrors, etc.

Agree - names are corrected now.

wbpcode · 2024-02-06T01:50:04Z

+  // The % chance that a host is actually ejected. Defaults to 100.
+  google.protobuf.UInt32Value enforcing = 3 [(validate.rules).uint32 = {lte: 100}];
+
+  // Error buckets.


Could you add a comment here about what would be treated as an error? I guess any event that is listed in the buckets would be treated as an error?

This is just a placeholder for now to show roughly what the API will look like, and it will not be included in the docs. I plan to update and improve the documentation when I deliver the implementation of the first extension, the consecutive errors monitor.
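
For the `enforcing` field shown in the snippet above ("The % chance that a host is actually ejected. Defaults to 100", validated as at most 100), a minimal sketch of the implied decision could look like this. The function name `should_eject` and the use of Python's `random` module are assumptions made for illustration and say nothing about how Envoy draws its random value.

```python
import random
from typing import Optional


def should_eject(enforcing: Optional[int], rng: random.Random) -> bool:
    """Decide whether a tripped monitor actually ejects the host."""
    percent = 100 if enforcing is None else enforcing  # an unset field means 100
    if not 0 <= percent <= 100:
        raise ValueError("enforcing must be in [0, 100]")
    # randrange(100) yields 0..99, so percent=100 always ejects and 0 never does.
    return rng.randrange(100) < percent
```

Setting the value below 100 lets an operator observe how often a monitor would trip while only ejecting on a fraction of those occasions.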

wbpcode · 2024-02-06T01:51:24Z

/wait

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

cpakulski · 2024-02-13T13:50:44Z

/retest

wbpcode

LGTM. Thanks. Look forward to the impl. :)

cpakulski · 2024-02-21T15:15:20Z

@wbpcode Thanks for approving. I have a prototype working, so it should move forward smoothly.

cpakulski added 2 commits December 6, 2023 18:29

API for defining HTTP errors, locally originated errors and database …

1282f87

…errors. Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Adjusted next free field.

4612105

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

repokitteh-read-only Bot added the api label Dec 6, 2023

repokitteh-read-only Bot assigned wbpcode Dec 6, 2023

wbpcode reviewed Dec 7, 2023


markdroth reviewed Dec 18, 2023


repokitteh-read-only Bot added the waiting label Jan 2, 2024

Use Any for monitor extensions.

416943e

Moved proto for errors and consecutive errors monitor to envoy/extensions. Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

repokitteh-read-only Bot removed the waiting label Jan 9, 2024

cpakulski added 3 commits January 9, 2024 18:29

Adjusted main api's BUILD file.

1f4e730

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Renamed common to error_types.

cf3df96

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Fixed docs.

d72202e

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

markdroth reviewed Jan 9, 2024


wbpcode reviewed Jan 10, 2024


Used TypedExtensionConfig instead of user-define message.

7b7979e

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Redesign ErrorBucket to avoid using oneof.

50e61b0

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

wbpcode reviewed Feb 6, 2024


repokitteh-read-only Bot added the waiting label Feb 6, 2024

Renamed error buckets.

2dce958

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

repokitteh-read-only Bot removed the waiting label Feb 12, 2024

cpakulski requested a review from wbpcode February 14, 2024 18:41

wbpcode approved these changes Feb 16, 2024


repokitteh-read-only Bot removed the api label Feb 16, 2024

wbpcode merged commit 6e71eb8 into envoyproxy:main Feb 16, 2024

cpakulski mentioned this pull request May 14, 2024

upstream: implementation of outlier detection extensions #34154

Open
