Skip to content

upstream: API proposal for extensions to outlier detection#31205

Merged
wbpcode merged 9 commits intoenvoyproxy:mainfrom
cpakulski:od_ext_api
Feb 16, 2024
Merged

upstream: API proposal for extensions to outlier detection#31205
wbpcode merged 9 commits intoenvoyproxy:mainfrom
cpakulski:od_ext_api

Conversation

@cpakulski
Copy link
Copy Markdown
Contributor

Commit Message:
API proposal for extensions to outlier detection

Additional Description:
The goal of outlier detection extensions is to allow:

Design, config snippets, link to working demo (works for non-5xx errors and redis errors) is in https://docs.google.com/document/d/1ZCZSoirVB39eOLdD0VPlsEUING8c23Sq5bzozrv6f4k/edit?usp=drive_link

Risk Level: Low
Testing: N/A
Docs Changes: Not at the moment
Release Notes: Not at the moment
Platform Specific Features: No

…errors.

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@repokitteh-read-only
Copy link
Copy Markdown

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @wbpcode
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #31205 was opened by cpakulski.

see: more, trace.

Copy link
Copy Markdown
Member

@wbpcode wbpcode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for you contribution. From my quick check to the source code of outlier detection, seems it (source code) is well designed for different protocols.

All proxies (tcp proxy, redis proxy, mysql proxy, etc.) could call the putResult to tell outlier detection the results of requests. (code for HTTP and Result for non-HTTP).

If we want do more things in this areas, then it's reasonable to create some cross-protocol errors that used by the outlier detection. And then, define configurable mapping from protocol-dependent statues to these general errors.

Comment on lines +205 to +206
// Monitor name.
string name = 1;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the usage of this name?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several monitors can be active within one cluster. The name would be used to indicate/log which monitor was triggered and marked a host as unhealthy.

Comment on lines +21 to +47
// Error bucket for HTTP codes
// [#not-implemented-hide:]
message HttpErrorsBucket {
string name = 1;
type.v3.Int32Range range = 2;
}

// Error bucket for locally originated errors
// [#not-implemented-hide:]
message LocalOriginEvents {
}

// Error bucket for database errors.
// Sub-parameters may be added later, like malformed response, error on write, etc.
// [#not-implemented-hide:]
message DatabaseTransactions {
}

// Union of possible error buckets.
// [#not-implemented-hide:]
message ErrorBucket {
oneof bucket {
HttpErrorsBucket http_errors = 1;
LocalOriginEvents local_origin_events = 2;
DatabaseTransactions database_transactions = 3;
}
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these should be part of API of different proxies rather than the outlier detection self.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is good idea. I will try to reshuffle it, but am not sure how difficult it would be.

@wbpcode
Copy link
Copy Markdown
Member

wbpcode commented Dec 7, 2023

cc @envoyproxy/api-shepherds I think this is a core API change and others may also want to take a look.

cc @htuch @markdroth

@cpakulski
Copy link
Copy Markdown
Contributor Author

cpakulski commented Dec 8, 2023

The idea behind extending outlier detector was boiling in my head for many months. Because it is core API, backwards compatibility is essential. But I am convinced that if this piece of API was designed today it would look close to what is proposed here.
I believe that historically the OD mechanism was designed for 5xx errors (so it was very HTTP specific). Later on locally originated errors (timeouts, resets, etc) were added and for simplicity were treated as 5xx errors. Back in 2019 I added configuration item to distinguish HTTP codes from locally originated errors: #4822. In order to do that I had to duplicate most of config items, like number of consecutive errors, success rate params, and introduce separate counters for HTTP errors and locally originated errors.
The problem is that in the current API the algorithm is attached to type of error. For example consecutive_5xx defines algorithm which counts consecutive errors and errors must be HTTP 5xx. consecutive_local_origin is similar but for locally originated errors.
In this API proposal I would like to separate algorithm from type of error. Administrator attaches type of errors which algorithm should count.
In order to address issue #18789, there are two choices with the current API. Add another set of config items like consecutive_4xx, enforce_consecutive_4xx, success_rate_4xx, enforce_success_rate_4xx, ... About 8-10 items. The other choice would be boolean flag "treat_4xx_as_5xx". In the proposed API, all what is needed is another HTTP Error bucket with range 400-499.
It is even more cumbersome for database errors. To address #24215 another 8-10 items would have to be introduced like consecutive_db_error, etc. Or we would have to treat DB errors like HTTP 5xx which is very cumbersome.
Another example is grpc. How to use OD and feed it with grpc-status? In the proposed API, we would add GRPCError bucket type and specify which codes should be treated as errors.

The other thing is that internally all OD algorithms are executed regardless whether they are used or not. For example, even if only consecutive errors are configured, logic for success_rate and failure_frequency is executed, but it just does not kick in.

The proposed API allows for backwards compatibility and co-existence of "old" config and new one.

@htuch @mattklein123 WDYT about this proposal? I remember that your comments helped me a lot when I was working on separating locally originated errors from 5xx.

@htuch
Copy link
Copy Markdown
Member

htuch commented Dec 8, 2023

One thought I had (triggered by your use of "extend") is that maybe OD should be a 1st class API extension point. It would definitely empower folks to do a lot of customization without needing to modify core APIs going forward. In your proposed change, I think that rather than treat the whole OD as an extension, the "monitors" would become pluggable extensions. You could then have actions related to monitors as an additional extension type.

Overload manager is a great example of this API design pattern.

@cpakulski
Copy link
Copy Markdown
Contributor Author

@htuch Thanks for sharing your thoughts. Are you talking about that part of overload manager: https://github.com/envoyproxy/envoy/blob/main/api/envoy/config/overload/v3/overload.proto#L35-L44?

@htuch
Copy link
Copy Markdown
Member

htuch commented Dec 13, 2023

Yep, that basic idea. CC @KBaichoo who might have thoughts on how this pattern worked out and whether it would make sense for outlier detection.

@KBaichoo
Copy link
Copy Markdown
Contributor

Agreed, this seems like having a first class extension here would work well.

I think @htuch's suggestion here of using a "monitor extension" would allow us to abstract away the particular error into just "the error occurred" or "didn't occur" and then the core could could just focus on how many of these occurred. This is similar to how the overload manager monitor's produce a float in the range of [0,1.0] to tell whether overload should kick in. If the set of algorithm's used on outlier detection are the same / similar then we can get away with "monitor extension" otherwise we might need something larger.

Copy link
Copy Markdown
Contributor

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like @htuch's idea of making OD monitors an extension point.

It is worth noting that gRPC also supports outlier detection (see gRFC A50), so whatever new API gets introduced here is one that we'll eventually need to support as well. @ejona86 @dfawley @murgatroid99

// Union of possible error buckets.
// [#not-implemented-hide:]
message ErrorBucket {
oneof bucket {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now prefer not to use oneof in the xDS API, as per https://github.com/envoyproxy/envoy/blob/main/api/STYLE.md.

@cpakulski
Copy link
Copy Markdown
Contributor Author

If the set of algorithm's used on outlier detection are the same / similar then we can get away with "monitor extension" otherwise we might need something larger.

I think we can do such abstraction. Thanks for great idea!

It is worth noting that gRPC also supports outlier detection (see gRFC A50), so whatever new API gets introduced here is one that we'll eventually need to support as well.

Currently, everything is internally mapped to HTTP codes and then fed to OD. I think that in the future, for grpc, we can define meaning of "error" using grpc-status codes, but in the first iteration there will be no difference how grpc is treated by OD.

Let me work on my prototype and use "typed" config to be plugged in into protobuf's Any.

@adisuissa
Copy link
Copy Markdown
Contributor

/wait

Moved proto for errors and consecutive errors monitor to envoy/extensions.

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@cpakulski
Copy link
Copy Markdown
Contributor Author

Extension to outlier detection monitor is defined as Any. Thanks to @htuch and @KBaichoo for idea. I also have a working prototype which uses this new proto.

I still use oneof to capture several different types of errors (http, locally originated, database error) under one type. I know that this is discouraged (thanks @markdroth). What is an alternative?

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Copy link
Copy Markdown
Contributor

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still use oneof to capture several different types of errors (http, locally originated, database error) under one type. I know that this is discouraged (thanks @markdroth). What is an alternative?

Yeah, we definitely don't want to use oneof, since it makes it incredibly difficult to add new features in a backward-compatible way.

In this particular case, why do we need the ErrorBucket wrapper to begin with? Couldn't we just plug the individual error bucket types into the monitors field directly?


// Set of passive monitors.
// [#not-implemented-hide:]
message Monitor {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use the existing TypedExtensionConfig message instead of defining our own message here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not aware of TypedExtensionConfig. Will refactor it. Thanks!

@cpakulski
Copy link
Copy Markdown
Contributor Author

In this particular case, why do we need the ErrorBucket wrapper to begin with? Couldn't we just plug the individual error bucket types into the monitors field directly?

The idea is that a monitor (for example consecutive errors) is configured with a list of error buckets of the same or different types. The best example is mix of http codes and locally originated errors (resets, timeouts). Those are two different proto messages. I imagined that a monitor simply has a list of buckets:
repeated ErrorBuckets buckets
and user may plug-in different types of buckets (same type or different types): http codes, locally originated. As below:

buckets:
  - http_errors:
       range: 500-599
  - http_errors:
       range: 300-325
  - local_origin_events: {}

So a monitor with buckets as above will check for consecutive errors which fall into any bucket. The following errors will trigger the monitor: HTTP 503->tcp reset->HTTP 302.

In a nutshell it would be great to have a single field where different proto messages could be used. Since, oneof is not to be used, is using Any/TypedExtensions the only othe choice?

@markdroth
Copy link
Copy Markdown
Contributor

Oh, okay, I think I see what the confusion is here. I saw that the ErrorBucket message was defined in a file in a separate subdirectory (error_types), so I assumed that was intended to be a separate extension -- i.e., I thought your intent was that the monitors field could contain either ConsecutiveErrors or ErrorBucket. But it looks like what you're actually intending here is just that ErrorBucket is one of the fields inside the ConsecutiveErrors message.

If ErrorBucket is not going to be used in anything except ConsecutiveErrors, then you could just define it in the same file. Alternatively, if the intent here is just that the ErrorBucket is a reusable message that in the future could be used for extensions other than ConsecutiveErrors, then I suggest moving those types to a directory that does not imply that they are a separate extension. It would probably be fine to put them in api/envoy/extensions/outlier_detection_monitors/v3 or api/envoy/extensions/outlier_detection_monitors/common/v3.

Getting back to your question, though, Any and TypedExtensionConfig don't actually solve the same problem as oneof, and I don't think they're helpful here. How about just moving the repeated down one level, so it looks like this:

message ErrorBuckets {
  repeated HttpErrorsBucket http_errors = 1;
  repeated LocalOriginEvents local_origin_events = 2;
  repeated DatabaseTransactions database_transactions = 3;
}

Then you can just have a non-repeated ErrorBuckets message in ConsecutiveErrors. This would still allow you to group any number of conditions of any of these types into a single bucket.

@cpakulski
Copy link
Copy Markdown
Contributor Author

If ErrorBucket is not going to be used in anything except ConsecutiveErrors, then you could just define it in the same file. Alternatively, if the intent here is just that the ErrorBucket is a reusable message that in the future could be used for extensions other than ConsecutiveErrors, then I suggest moving those types to a directory that does not imply that they are a separate extension. It would probably be fine to put them in api/envoy/extensions/outlier_detection_monitors/v3 or api/envoy/extensions/outlier_detection_monitors/common/v3.

Error types are going to be reused. I will move them to common/v3.

How about just moving the repeated down one level, so it looks like this:
message ErrorBuckets {
repeated HttpErrorsBucket http_errors = 1;
repeated LocalOriginEvents local_origin_events = 2;
repeated DatabaseTransactions database_transactions = 3;
}

That should work! Thanks a lot @markdroth

Comment on lines +242 to +182
Monitors monitors = 24;
repeated Monitor monitors = 24;
Copy link
Copy Markdown
Member

@wbpcode wbpcode Jan 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, what role will the monitor take? Will it handle all the logic of outlier detection or just determine if an error is throwed?

And is the repeated necessary? Will all monitors work at same time or just first valid monitor will be used?

If all monitors work at same time, it would be more complex to handle the output of different monitor. I prefer only use TypedExtensionConfig monitor = 24; here because single monitor is enough for absolute majority of users. If some one wants to composite multiple monitors, he can implement it by an composite extension and decide how to handle different outputs of different monitors by him self.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's worth noting that the existing mechanism for configuring outlier detection does allow configuring both failure percentage and success rate at the same time. If the goal here is to move that existing configuration into an extension, then it seems like we should make it possible to configure multiple ejection algorithms.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, what role will the monitor take? Will it handle all the logic of outlier detection or just determine if an error is throwed?

A monitor will consume results of interaction with upstream entity and run logic specific to a monitor. For example, count consecutive errors or standard deviation or frequency of errors.

And is the repeated necessary? Will all monitors work at same time or just first valid monitor will be used?

The result of a transaction will go to all monitors.

If all monitors work at same time, it would be more complex to handle the output of different monitor.

You are right that most users will use a single monitor. But, as @markdroth pointed out, there are already many monitors, so repeated Monitor is a natural extension. Also, I plan to add a monitor which measures response time and eject slow servers. The combination of one monitor which counts errors and another monitor which measures response time starts to make sense.

I think that the idea of multiple monitors was always there. See this comment: https://github.com/envoyproxy/envoy/blob/release/v1.28/source/common/upstream/outlier_detection_impl.h#L369-L371. But for some reason, instead of adding new, separate monitors with different logic, all algorithms were added to a single monitor. With this new API, I hope to change it and maybe, if there is an agreement, deprecate old ways of configuring outlier detector.

Copy link
Copy Markdown
Member

@wbpcode wbpcode Jan 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sgtm if we intend to use the monitors to replace whole previous implementation.

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@wbpcode
Copy link
Copy Markdown
Member

wbpcode commented Jan 11, 2024

I actually have another question about the possible implementation. The initial target of this patch is to support non-HTTP protocols outlier detections, what the extension interface would be like? What type inputs the extension require and what type outputs the extension will provide?

@cpakulski
Copy link
Copy Markdown
Contributor Author

I actually have another question about the possible implementation. The initial target of this patch is to support non-HTTP protocols outlier detections, what the extension interface would be like? What type inputs the extension require and what type outputs the extension will provide?

You are almost correct. The initial target is to allow a user to define which HTTP codes should be treated as errors. Right now it is hardcoded to 5xx. But there is demand to treat 4xx codes as error as well. The other goal is to allow a developer to add new types of errors without a need to map them to HTTP codes. Right now, everything needs to be somehow mapped to HTTP codes. I coded a prototype with 3 types of errors: user-defined HTTP codes, locally originated errors and database errors. I imagine that adding a grpc error type and feed grpc status there should also be super easy.

At the bottom, at C++ level, the interface is basic pure abstract Error class and monitors just consume those errors and match if that is what a user defined. If there is match, counters are increased. No assumption about a specific type of errors.

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@ravenblackx
Copy link
Copy Markdown
Contributor

@wbpcode nag for stalled PR. (Can't tell if it should be /waited or if it's currently waiting for your reply.)

Copy link
Copy Markdown
Member

@wbpcode wbpcode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall. Only two minor comments.

(And so sorry for the delay, I recently lost lots of pings :(

Comment on lines +18 to +31
message HttpCodes {
type.v3.Int32Range range = 1;
}

// Error bucket for locally originated errors.
// [#not-implemented-hide:]
message LocalOriginEvents {
}

// Error bucket for database errors.
// Sub-parameters may be added later, like malformed response, error on write, etc.
// [#not-implemented-hide:]
message DatabaseTransactions {
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will make these more simple by HttpErrors, LocalErrors, DatabaseErrors, etc.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree - names are corrected now.

// The % chance that a host is actually ejected. Defaults to 100.
google.protobuf.UInt32Value enforcing = 3 [(validate.rules).uint32 = {lte: 100}];

// Error buckets.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a comment here about which would be treat as an error? I guess any event that listed in the buckets would be treated as error?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a placeholder for now to visualize how more or less API will look like and will not be included in docs. I plan to update/improve documentation when I deliver implementation of the first extension - consecutive errors monitor.

@wbpcode
Copy link
Copy Markdown
Member

wbpcode commented Feb 6, 2024

/wait

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@cpakulski
Copy link
Copy Markdown
Contributor Author

/retest

@cpakulski cpakulski requested a review from wbpcode February 14, 2024 18:41
Copy link
Copy Markdown
Member

@wbpcode wbpcode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks. Look forward to the impl. :)

@wbpcode wbpcode merged commit 6e71eb8 into envoyproxy:main Feb 16, 2024
@cpakulski
Copy link
Copy Markdown
Contributor Author

@wbpcode Thanks for approving. I have a prototype working, so it should move forward smoothly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants