Skip to content

upstream: implementation of outlier detection extensions#34154

Open
cpakulski wants to merge 102 commits intoenvoyproxy:mainfrom
cpakulski:od_ext_impl
Open

upstream: implementation of outlier detection extensions#34154
cpakulski wants to merge 102 commits intoenvoyproxy:mainfrom
cpakulski:od_ext_impl

Conversation

@cpakulski
Copy link
Copy Markdown
Contributor

@cpakulski cpakulski commented May 14, 2024

Commit Message:
upstream: implementation of outlier detection extensions

Additional Description:
The idea and need for the extensions are described in RFC document: https://docs.google.com/document/d/1ZCZSoirVB39eOLdD0VPlsEUING8c23Sq5bzozrv6f4k/edit?usp=drive_link

In a nutshell, the design decouples types of result reported to outlier detector from an algorithm which marks a host as unhealthy. For example, an outlier detector may be configured to count 3 consecutive errors and type of those errors can be defined by a user. For example:

  • count http codes in range 500-503
  • count http codes in range 401-403
  • count locally originated errors (resets, timeouts)

So, the algorithm does not really care about the exact type of reported result. It is only interested whether reported result should be considered an error or not.
The idea of using user-defined errors can be expanded to database errors. See issue #24215 (I have working prototype for errors reported by Redis).

This design puts extensions on top of already existing outlier detector. It means that the solution is 100% backwards compatible. Previous configs are accepted. But a user may configure outlier detection extension to

  • use "old" outlier detection to react to 5xx errors and add another range of HTTP errors (say 4xx)
  • disable "old" outlier detection and configure everything using extensions

The implementation of extensions is built on top of already existing outlier detection structures. This minimizes code changes and re-uses event logger, timers, etc.

The implementation is built on top of already approved API for extensions: #31205

Risk Level: Low (previous configuration still works. Extensions do not have to be configured)
Testing: Added unit tests for new code and tests checking co-existence of "old" outlier and "new" extensions
Docs Changes: Yes. Added.
Release Notes: Yes.
Platform Specific Features: No
Fixes #18789

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
… monitor.

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@repokitteh-read-only
Copy link
Copy Markdown

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #34154 was opened by cpakulski.

see: more, trace.

@repokitteh-read-only
Copy link
Copy Markdown

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @wbpcode
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #34154 was opened by cpakulski.

see: more, trace.

cpakulski added 9 commits May 14, 2024 21:05
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@cpakulski
Copy link
Copy Markdown
Contributor Author

/retest

@cpakulski cpakulski changed the title WIP upstream: implementation of outlier detection extensions upstream: implementation of outlier detection extensions May 17, 2024
@cpakulski cpakulski marked this pull request as ready for review May 17, 2024 19:13
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@cpakulski
Copy link
Copy Markdown
Contributor Author

@wbpcode I see that you have been assigned for API approval, but changes in proto files are not really API changes, but rather changes to event log which is defined in terms of protobufs. Hope it helps!

@nezdolik
Copy link
Copy Markdown
Member

will the old outlier detection extension be eventually deprecated? (given that the new one lands)

cpakulski added 3 commits May 12, 2025 20:41
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@tyxia
Copy link
Copy Markdown
Member

tyxia commented May 20, 2025

/wait

@cpakulski please merge main to resolve the conflict

@wbpcode is this PR ready for your review again?

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@wbpcode
Copy link
Copy Markdown
Member

wbpcode commented May 21, 2025

See https://github.com/envoyproxy/envoy/actions/runs/15151640605/job/42598717694 and #34154 (comment).

I will prefer to get a balanced result between the complex router or complex outlier extension. The currenty Result is used for multiple different protocol and there is no reason to abort it. And It's only have very limited statuses (4 different value actually, no much complex than boolean), it should be fine to prapagate this status to all outlier extensions. Then it's unnecessary to add complex proto API into router to select different outlier extension for different events.

@wbpcode
Copy link
Copy Markdown
Member

wbpcode commented May 22, 2025

Only HTTP modules will set the HTTP code, so, I think others won't be effected by this PR?
HTTP modules so far used putHttpResponseCode directly, but now they do it via ExtOriginRequestSuccess/ExtOriginRequestFailure. I think you are right and it should be fine. Please update the comment https://github.com/envoyproxy/envoy/blob/main/envoy/upstream/outlier_detection.h#L25-L27
Note, the router's configuration is works for whole HCM, that means it may shared by multiple upstreams/clusters, then, how we can handle the case where one cluster want to handle L4/L7 errors separately and another one doesn't? So, I will strong inclined let the monitors to handle both L4/L7 errors.
Actually, timeout/reset, etc are all handled as same status (L4 failure) for now.
Correct. This case is handled even with bools. Let's continue this discussion in the other PR, so we do not loose that valuable feedback. But in general, under router there is a matcher and a list of destinations where you want to send the result of matching:
match:
http_code:
start: 500
end: 599
send_to:
- monitor in cluster1
- monitor in cluster3
So, if you do not want to send something to say cluster2, you do not configure it under matcher.

I think it's hard to accpt this API. That means the HTTP router need to aware different monitors of different clusters. And the HCM router configuration will interact with the CDS in some way.

IMO, the current 4 statuses Result could be used as the only input of the monitor. Then that should could cover most current requirements.

And if complex mapping (protocol specific status codes to L7 failure or L7 success) is necessary in the future, we could configure a opaque mapping extension in the outlier detection, and the router could call the mapping extension (if is provided) to generate the L7 status.

Then, by this way,

  1. Low complexity in most cases and features.
  2. Extensible for all protocols. The routers of different protocols could cast the mapping extension to actual type and run the mapping.
  3. All outlier detection configuration are configured only on the cluster and the LDS and CDS won't interact with each other.

cpakulski added 2 commits May 22, 2025 23:02
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@cpakulski
Copy link
Copy Markdown
Contributor Author

I think it's hard to accpt this API. That means the HTTP router need to aware different monitors of different clusters. And the HCM router configuration will interact with the CDS in some way.

This is a valid point. So, this (current) API moves the complexity towards the router and makes outlier extensions very simple. Also, it makes possible to define more than one http matcher and send matching results to different extension monitors (by using monitor names). I am not really sure if this is a viable scenario though.

So, in order to simplify things on the router side the only option is to reduce configuration flexibility (only one http matcher) and make some sort of a filter (which events L4 or L7 should be accepted by an extension).

@cpakulski
Copy link
Copy Markdown
Contributor Author

I think it's hard to accpt this API. That means the HTTP router need to aware different monitors of different clusters. And the HCM router configuration will interact with the CDS in some way.

@wbpcode I need some time to design/prototype the API. Your point about not specifying clusters at the router is very valid one and I 100% agree. The difficulty with designing the API is where to make the conversion from protocol specific result to protocol independent result. And this needs to be done per cluster, because some clusters may need to treat say http codes range 500-599 as error and other cluster may treat http codes range 400-599 as error or one cluster may want to receive L4 (locally originated events) and other may not want to receive it.

  • in my original API the conversion happened at outlier detection monitors. So, router always sends HTTP specific results and an outlier monitor extension specifies which http range it wants to treat as error. The outlier extension may also specify that it wants to receive L4 results (locally originated events). That config was per-cluster at the cluster level and is similar to already existing "canned" outlier detection monitors. The only problem was that it requires at the API (proto) level different types for different result types. I defined 3 types: http codes, locally originated event and database event. I think the problem was that each new filter type would require a new type of an object which the filter would report to outlier detection extensions. I am thinking now that maybe we can add a generic error called for example FilterResult or ApplicationResult or simply GenericResult. That would be sort of boolean with Success/Error values and could be used by all modules which do not require subfiltering (by subfiltering I mean checking for example if a reported code belongs to a specified HTTP code range). Maybe database modules do not need that subfiltering. If a redis filter reports error when receives response from the database it can use GenericResult(Error) and does not need to report what failed.
  • in the latest API I made outlier detection monitors very simple: they understand only boolean values: error or success. But it requires that router makes a conversion from protocol specific value to Success/Error. As we agreed above this must be done per cluster and configuring it at router makes API very ugly. The only other option would be to move outlier config to HCM, where routes are defined:
           virtual_hosts:
             - name: local_service
               domains: ["*"]
               routes:
               - match:
                   prefix: "/"
                 route:
                   cluster: some_service
                   outlier_detection:
                      - http:
                        match:
                           http_code:
                              start: 500
                              end: 599
                     - locally_originated: true

So this is sort of per-cluster configuration, but still keeps outlier detection monitors very simple/protocol independent (understand only Success/Error). The cluster is already specified and this config will send to the outlier extensions attached to some_service cluster all L4 locally originated events and report codes 5xx to the outlier monitors extensions as errors.

Sorry for long message, but this API is very difficult to design, especially that it requires backwards compatibility with existing old "canned" outlier monitors.

@wbpcode
Copy link
Copy Markdown
Member

wbpcode commented May 26, 2025

Sorry for long message, but this API is very difficult to design, especially that it requires backwards compatibility with existing old "canned" outlier monitors.

You are welcome! And thanks so much for your contribution and patiance.

I do think thing could be simpler. If new API is necessary, as I said, we can add a new field at outlier detection to map the external L7 statuses to Result/boolean. By this way, we need to define a new extension point for external statues mapping. But it's optional because in most cases the default mapping that hard coded in the router filter is enough.
For example:

  - name:exmaple
    type: STRICT_DNS
    connect_timeout: 0.25s
    lb_policy: ROUND_ROBIN
    outlier_detection:
      external_statuses_policy:
        "@type": type.googleapis.com/envoy.extensions.outlier_detection.external_statuses_policy.v3.Http
        http_code_ranges:
        - start: 500
           end: 599
           result: ExtOriginRequestFailed
    load_assignment:
      cluster_name: sds
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: exmaple.com
                port_value: 80
                protocol: TCP

Or we can use the exist extension point typed_extension_protocol_options and configure the protocol dependent outlier detection configuration at that typed_extension_protocol_options. For example:

  - name:exmaple
    type: STRICT_DNS
    connect_timeout: 0.25s
    lb_policy: ROUND_ROBIN
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        explicit_http_config:
          http2_protocol_options: {}
        outlier_detection:
          status_ranges:
          - start: 500
             end: 599
             result: ExtOriginRequestFailed
    load_assignment:
      cluster_name: sds
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: exmaple.com
                port_value: 80
                protocol: TCP


  - name: redis
    type: STRICT_DNS
    connect_timeout: 0.25s
    lb_policy: ROUND_ROBIN
    typed_extension_protocol_options:
      envoy.filters.network.redis_proxy:
        "@type": type.googleapis.com/envoy.extensions.filters.network.redis_proxy.v3.RedisProtocolOptions
        outlier_detection:
          status_ranges: ...

Both are fine to me although I prefer the seond way because needn't to define new extension point and could aggregate all protocol-dependent configuration at same position.

@wbpcode
Copy link
Copy Markdown
Member

wbpcode commented May 26, 2025

But anyway, we can leave it at future PR. At this PR, I think the outlier detection should handle the fixed 4 statuses (L4 + L7) first. (I still insist on that to reuse the Result. As I said before, it actually only contains 4 statuses. The timeout will handled as same as LocalOriginFailed, we can simplify it in the future to remove timeout, etc.) These 4 status could handle most of cases for most protocols.

We can add a field at MonitorCommonConfig to indicate if the monitor only handle local errors or external errors to support the similar behavior of split_external_local_origin_errors. In the specific implementation, we could let the common code of the outlier detection to filter the statues based on the MonitorCommonConfig and the monitor self could finally take only true or false.

wbpcode added a commit that referenced this pull request May 29, 2025
Commit Message: upstream: unify all outlier report to use putResult
Additional Description:

To use `putResult` for all outlier detection event reporting. This is
related to the #34154. Ideally,
the putResult will be used as the unique event reporting entrypoint of
outlier detection and the `Result` enum will be used as the input of the
custom outlier monitors.

This also fixed a bug where the connection success event will be
reported twice.

Risk Level: low.
Testing: n/a
Docs Changes: n/a.
Release Notes: n/a.
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional [API
Considerations](https://github.com/envoyproxy/envoy/blob/main/api/review_checklist.md):]

---------

Signed-off-by: wangbaiping(wbpcode) <wbphub@gmail.com>
Signed-off-by: code <wbphub@gmail.com>
Co-authored-by: Greg Greenway <ggreenway@apple.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Copy link
Copy Markdown
Member

@wbpcode wbpcode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. And some comments are added to the API.

// structure.
// [#extension-category: envoy.outlier_detection_monitors]
repeated core.v3.TypedExtensionConfig monitors = 24;
core.v3.TypedExtensionConfig monitors = 24;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then the name should be monitor. This change LGTM, but I want to say the or/and wrapper would complex to implementation.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at the implementation of matcher and can reuse some of the code from there. matcher allows creation of various AND + OR hierarchies. I do not think that such scenarios will be used very often, but I have seen requirements for multiple monitors to fail in order to mark a node as unhealthy. Will change to monitor. Thanks.

Comment on lines +184 to +188
message LocallyOriginatedEvents {
// If enabled, locally originated events will be reported to outlier detection extensions.
// Defaults to true.
bool enable = 1;
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unnecessary to be part of HTTP protocol. You only need an EventFilter at the MonitorCommonConfig:

enum Event {
  LocalOriginFailure = 0;
  LocalOriginSuccess = 1;
  ExtOriginFailure = 2;
  ExtOriginSuccess = 3;
}

message MonitorCommonConfig {
  // The % chance that a host will be actually ejected when an monitor's
  // algorithm decides that the host is unhealthy.
  // This setting can be used to disable
  // ejection or to ramp it up slowly. Defaults to 100.
  google.protobuf.UInt32Value enforcing = 1 [(validate.rules).uint32 = {lte: 100}];

  // The list of events to accept.
  repeated Event take_events = 1;
}


Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we are back to the subject where to send events. I originally used names of monitors. Assuming that in the future there will be hierarchy (AND/OR) of monitors, not all monitors may be interested in all events. For example some may only be interested in locally originated events, others in http events. So, I used names next to the http matcher.

There are few problems with the enum Event.

  • Why would a user want to receive only ExtOriginFailure and not receive ExtOriginSuccess? That would completely break a monitor logic because a monitor would receive only failures and any monitor will fail every time.
  • it exposes internal naming used in the code. It is simpler to express: here is a matcher and send the result to the following monitors than using Local/ExtOrigin names (IMHO).
  • this problem is a bit difficult to understand. It took me a long time to solve when I worked on splitting locally originated and externally originated events. The problem is that if both types of events (local and ext origin) are sent to the same monitor, some locally originated events must be silenced. Let's assume that the monitor type is consecutive errors (3 failures in the row mark a node as unhealthy). A request to the upstream server which returns error 500 will result in two events: LocallyOriginatedSuccess (because Envoy successfully connected to upstream) and ExtFailure (because 500 was returned). In that case the monitor will never fail, because 3 exchanges will result in sending to the monitor: SUCCESS, FAILURE, SUCCESS, FAILURE, SUCCESS, FAILURE. So, in order for this thing to work, LocallyOriginatedSuccess must be suppressed when http events are sent to the same monitor. That is why I added this code:
    if (result == Upstream::Outlier::Result::LocalOriginConnectSuccess) {
    if (!http_protocol_options_->outlier_detection_http_events_matcher_.empty()) {
    return absl::nullopt;
    } else {
    return absl::optional<bool>(true);
    }
    }

So, in essence I think it is better to keep such logic in router and make monitors to accept only success/failure (true/false)

Copy link
Copy Markdown
Member

@wbpcode wbpcode Jun 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Response to first point: it may never happen but we needn't care. We have provided all possible groups and the users/developers decide how to make the custom monitor works as expected.

Response to second point: no matter which name is used, the users/developers need to aware the four different events clearly because it is core of whole outlier detection.

Response to third point: this is why we need this take_events list. Every monitor could select the events it concern and suppress some events. Then the events after filtered will be converted to bool and be handled by the monitor extension.


If we do this in router, how to handle the case that differ monitor need consume different events? As I said before, it's unacceptable to let router to aware different monitors.

So, a good enough solution is the router will report all events to outlier detection module. The outlier detection module will filter the events based on the monitor's concern (the take_events of every monitor) and convert the filtered events to bool. Finally, the monitor takes the bool as input.


Per you current API, how to support the case where one monitor want to consume only local origin event and another one want to consume the external origin event in an or monitor list? And I also think HttpProtocalOptions should only contains protocol specific things. At here, it is configuration about how to map the external response to ExtOriginSuccess/ExtOriginFailure.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Response to first point: it may never happen but we needn't care. We have provided all possible groups and the users/developers decide how to make the custom monitor works as expected.

Actually, I do not agree with this. Users should think in terms of HTTP status codes not in terms of internal variables passed to outlier. The problem with this solution is that if a user sends http codes and locally originated events to the same monitor, LocalOriginSuccess must be explicitly disabled. Otherwise the monitor will never fail for consecutive 5xx codes. User does not need to understand those internals. The router should handle that logic.

Response to second point: no matter which name is used, the users/developers need to aware the four different events clearly because it is core of whole outlier detection.

The same as above. Why? All a user should care about is how to define what should be treated as error.

If we do this in router, how to handle the case that differ monitor need consume different events? As I said before, it's unacceptable to let router to aware different monitors.

Agreed, but it is not at the router level anymore, but in cluster's protocol extensions.

Per you current API, how to support the case where one monitor want to consume only local origin event and another one want to consume the external origin event in an or monitor list?

That was the purpose of the monitor names. I dropped it to start with smaller PR footprint, but names could be added later.

Anyways, let's pause for a while. I have another, much simpler idea and will open a new PR soon.

Copy link
Copy Markdown
Member

@wbpcode wbpcode Jun 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I do not agree with this. Users should think in terms of HTTP status codes not in terms of internal variables passed to outlier. The problem with this solution is that if a user sends http codes and locally originated events to the same monitor, LocalOriginSuccess must be explicitly disabled. Otherwise the monitor will never fail for consecutive 5xx codes. User does not need to understand those internals. The router should handle that logic.

Then how the router handle this correctly for various cases like local events only monitor or ext events only monitor? You still need some configuration options and the users still need to understand the backend logic of the options. If you don't want users to know all these events, then you can also creat similar enum to provide LocalOrigin, ExtOrigin, Default, three classes of events. The users only need to know which class of events is need to concern. This may more friendly to users.

You may could provide a good enough default values, but the complexity never disappeared.

The same as above. Why? All a user should care about is how to define what should be treated as error.

Nope. Only the users wants to use more flexible features need to understand it. Like current users also need to understand the split mode of outlier detection. If the users want to understand the split mode clearly, it basically needs to read the code to know which is local events and what is the actual behavior. Actually defining this events in the API is more friendly in some way.

Anyways, let's pause for a while. I have another, much simpler idea and will open a new PR soon.

Sounds great. But keep it small and let we get agreement by one step and on step.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then how the router handle this correctly for various cases like local events only monitor or ext events only monitor?

LocalOriginEvents and ExtOriginEvents are not totally independent. If they go to separate monitors, then they are independent and router sends all LocalOriginEvents to one monitor and all ExtOriginEvents to the second monitor. If however, they go to the same monitor, router has to suppress sending ExtOriginSuccess. So, in order to implement this logic in router, router has to somehow know where the events go. I implemented this logic in HttpProtocolOptions.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LocalOriginEvents and ExtOriginEvents are not totally independent. If they go to separate monitors, then they are independent and router sends all LocalOriginEvents to one monitor and all ExtOriginEvents to the second monitor. If however, they go to the same monitor, router has to suppress sending ExtOriginSuccess. So, in order to implement this logic in router, router has to somehow know where the events go. I implemented this logic in HttpProtocolOptions.

That's why I suggest use an enum to control which event should be accepted. No matter it's event list or event type. The comlexity won't disappear and I don't see why HttpProtocolOptions would be better solution for that. I personally think only the protocol-specific things should be placed in HttpProtocolOptions (for example, mapping response to ext event).

But you refered you have a better idea, we can check it first.

}

message OutlierDetection {
HttpEvents http_events = 1;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, all the responses that mach this will be treat as failure or success? I think this should be document clearly.

And you may needn't the HttpEvents wrapper. You can use the matcher header directly:

config.common.matcher.v3.MatchPredicate failure_matcher = 1;

// or

config.common.matcher.v3.MatchPredicate success_matcher = 1;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that failure_matcher is a good idea. Thanks. Will change the name.

I used a larger structure message OutlierDetection { on purpose, because I think in the future we may add some extra parameters next to the matcher, like names of monitors where to send the result of matching. It nicely groups the matcher and destination of matching result. If the matcher is separated, it would be harder to add list of destinations. WDYT?

Copy link
Copy Markdown
Member

@wbpcode wbpcode Jun 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OutlierDetection message sgtm. But needn't HttpEvents wrapper.

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@botengyao
Copy link
Copy Markdown
Member

/wait

Waiting for addressing comments.

@KirilNN
Copy link
Copy Markdown

KirilNN commented Jul 30, 2025

@cpakulski hey there do you plan to finish up on this PR, its been so much work done on it so its going to be a shame to go on waste.

@wbpcode
Copy link
Copy Markdown
Member

wbpcode commented Jul 30, 2025

Will try to pick this up if @cpakulski has no time for this.

@cpakulski
Copy link
Copy Markdown
Contributor Author

I am still planning to finish it. I am on annual vacations now and will go back to this issue in 2-3 weeks.

@cpakulski
Copy link
Copy Markdown
Contributor Author

Working on it again. Probably will finish #39947 first and then we can decide how to trim this PR as some sections are not necessary after #39947.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api no stalebot Disables stalebot from closing an issue waiting

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Outlier Detection for non-error status codes

9 participants