upstream: outlier detection for non 5xx codes by cpakulski · Pull Request #39947 · envoyproxy/envoy

cpakulski · 2025-06-18T20:53:48Z

Commit Message: outlier detection for non 5xx codes
Additional Description:
Added a cluster-level, HTTP-specific option to define an HTTP matcher. When a response matches the defined matcher, the response is treated as an error, regardless of the status code, and is reported to outlier detection as a 5xx code. If a response does not match the matcher, it is treated as a success and forwarded to outlier detection as code 200.
Risk Level: Low (no change unless matcher is defined)
Testing: Added unit and integration tests
Docs Changes: yes
Release Notes: yes
Platform Specific Features: n/a
Fixes #18789

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

repokitteh-read-only · 2025-06-18T20:53:54Z

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #39947 was opened by cpakulski.


repokitteh-read-only · 2025-06-18T20:53:59Z

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @markdroth
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #39947 was opened by cpakulski.


wbpcode

Thanks for the update. I didn't get why we need a map_to. Why not return a bool from a single failure_matcher? Which scenario can a single failure_matcher not cover?

wbpcode · 2025-06-19T02:00:20Z

+
+  message HttpEvents {
+    // Matcher for response headers.
+    config.common.matcher.v3.MatchPredicate match = 1 [(validate.rules).message = {required: true}];
+
+    // Code which should be reported to the outlier detection if response headers matched the matcher.
+    envoy.type.v3.StatusCode map_to = 2 [(validate.rules).enum = {defined_only: true}];
+  }
+
+  // List of matchers for response headers along with codes to be reported to outlier detection.
+  // The first matcher which returns matching success is applied.
+  repeated HttpEvents outlier_detection = 8;


Why do we need a list of HttpEvents? I think a single matcher should be enough to identify an error.

message OutlierDetection {
  config.common.matcher.v3.MatchPredicate failure_matcher = 1 [(validate.rules).message = {required: true}];
}

OutlierDetection outlier_detection = 8;

Applied your suggestion. Thanks! It is now a single matcher. If a code matches the matcher, it will be treated as an error.

wbpcode · 2025-06-19T02:02:45Z

+  virtual absl::optional<uint64_t>
+  processHttpForOutlierDetection(Http::ResponseHeaderMap& reponse) const PURE;


Suggested change

-  virtual absl::optional<uint64_t>
-  processHttpForOutlierDetection(Http::ResponseHeaderMap& reponse) const PURE;
+  virtual absl::optional<bool>
+  checkFailureResponse(Http::ResponseHeaderMap& reponse) const PURE;

Returning an absl::optional<bool> makes more sense. It's unnecessary to return the HTTP code.

Converted to optional<bool>. I am not sure about the name of the method, though. processHttpForOutlierDetection is very descriptive about what it does and what its purpose is.


wbpcode · 2025-06-20T02:22:21Z

-    // Code which should be reported to the outlier detection if response headers matched the matcher.
-    envoy.type.v3.StatusCode map_to = 2 [(validate.rules).enum = {defined_only: true}];
+    // Determines if matching indicate error or non-error. Defaults to false (error).
+    bool success_on_match = 2;


If we agree the outlier detection matcher here should generate a boolean result, then:

The success_on_match flag seems to make no sense. You only need to define that a match signifies an error. This would simplify the API.

We also don't need repeated HttpEvents, because MatchPredicate itself supports or_match, and_match, and not_match. A single config.common.matcher.v3.MatchPredicate is good enough for this task.

This PR assumes that the "old" outlier detection is used and that this API is an overlay on top of the default behaviour, which sends all HTTP status codes to outlier detection.

success_on_match is needed to address the use case described in "Enhance Outlier Detection with Selective 5xx Error Exclusion" (#38311). The default is false (a match means error), so users can define only the matcher, which is more intuitive.

The same applies to repeated HttpEvents. You are right that a single matcher would be enough to define what should be treated as an error, but if two overlays are required, one to define 4xx as errors and a second to skip 502, 503 and 504 as errors, two matchers are needed.

success_on_match is needed to address use case described in #38311. The default is false (matcher means error), so users can define only matcher, which is more intuitive.

I understand the requirement, but why is the bool flag necessary? Maybe you can use a not_match to cover the case?

similar in regards to repeated HttpEvents. You are right that a single matcher would be enough to define what should be treated as error, but if two overlays are required, one to define 4xx as errors, and second to skip 502, 503 and 504 as errors, two matchers are needed.

The MatchPredicate should cover that case.

failure_matcher:
  or_match:
    rules:
    - http_response_headers_match:
        headers:
        - name: ":status"
          string_match:
            safe_regex:
              regex: "4.."
    - and_match:
        rules:
        - http_response_headers_match:
            headers:
            - name: ":status"
              string_match:
                safe_regex:
                  regex: "5.."
        - not_match:
            http_response_headers_match:
              headers:
              - name: ":status"
                string_match:
                  safe_regex:
                    regex: "502|503|504"

I prefer xds.type.matcher.v3.Matcher rather than config.common.matcher.v3.MatchPredicate because xds.type.matcher.v3.Matcher supports CEL and makes it easier to support various logical operators. (But MatchPredicate should also be fine.)
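For readers who want to check the predicate's logic: it reads as "4xx, or 5xx that is not 502, 503 or 504". The following is a minimal C++ sketch of that boolean structure only. It uses std::regex rather than Envoy's matcher classes, the function name isFailureStatus is hypothetical, and it assumes the regexes must match the whole :status value.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical stand-in for the predicate above; not Envoy's matcher code.
// Assumption: each regex must match the whole ":status" value, which
// std::regex_match enforces.
inline bool isFailureStatus(const std::string& status) {
  static const std::regex client_error("4..");
  static const std::regex server_error("5..");
  static const std::regex gateway_error("502|503|504");

  const bool is_4xx = std::regex_match(status, client_error);
  const bool is_5xx = std::regex_match(status, server_error);
  const bool is_gateway = std::regex_match(status, gateway_error);

  // or_match(4xx, and_match(5xx, not_match(502|503|504)))
  return is_4xx || (is_5xx && !is_gateway);
}

// Usage checks covering each branch of the predicate.
inline void exampleUsage() {
  assert(isFailureStatus("404"));  // 4xx is a failure
  assert(isFailureStatus("500"));  // non-gateway 5xx is a failure
  assert(!isFailureStatus("503")); // gateway codes are excluded
  assert(!isFailureStatus("200")); // everything else is a success
}
```

With this structure a single matcher covers both "add 4xx" and "skip 502, 503 and 504" without a second overlay.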

But may be you can use a not_match cover the case?

Let me try. If it is possible it would be great!

It's not very hard to accept a failure_matcher and a success_matcher. But could you kindly explain more about why? In my mind, if a single failure_matcher is configured, the failure_matcher takes over the determination of failure/success. If the failure_matcher is not configured, the legacy behavior is used.

It would be very easy to explain the new behavior to our users if there is only a single matcher, and a single matcher is flexible enough. If there are multiple matchers, we always need to consider the case where a code matches multiple matchers and what the behavior should be.
So, what's the benefit of the added complexity of multiple matchers?

it's not very hard to accept a failure_matcher and a success_matcher. But could you kindly explain more about why? In my mind, if single failure_matcher is configured, that means the failure_matcher will take over the determination of the failure/success. Or if the failure_matcher is not configured, the legacy behavior will be used.

I think it is possible to use only failure_matcher but it will be a bit more difficult to understand from users' point of view. I envisioned that failure_matcher will be used only to add additional errors to already existing ones (5xx), not completely redefine what codes are treated as errors. (And success_matcher would be used to remove status codes from default error codes). So, if a user wants to treat 4xx as errors, the user needs to define failure_matcher for 4xx only. If the failure_matcher decides what is error and what is not error, then the user must define matcher for 4xx and 5xx, otherwise 5xx will be treated as success.

So, in essence there are two approaches here:

1. Always use legacy behaviour and use failure_matcher and success_matcher to add or remove status codes from the default set.

2. Use only failure_matcher to completely overwrite legacy behaviour. If failure_matcher is not specified, legacy behaviour takes place (only 5xx are treated as errors). If a failure_matcher is defined, only codes matching the matcher will be treated as errors.
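The difference between the two approaches can be sketched in a few lines of C++. This is an illustration only: the names (StatusMatcher, overlayIsError, overrideIsError, is4xx) are hypothetical, matchers are reduced to functions over the status code, and the precedence of success_matcher over failure_matcher in the overlay variant is an assumption, since the thread does not define it.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical matcher type: an empty std::function means "not configured".
using StatusMatcher = std::function<bool(uint64_t)>;

// Legacy behaviour described in the thread: only 5xx counts as an error.
inline bool legacyIsError(uint64_t code) { return code >= 500; }

// Approach 1 (overlay): matchers add to or remove from the legacy set.
// Assumption: success_matcher wins when both matchers match.
inline bool overlayIsError(uint64_t code, const StatusMatcher& failure_matcher,
                           const StatusMatcher& success_matcher) {
  if (success_matcher && success_matcher(code)) {
    return false;
  }
  if (failure_matcher && failure_matcher(code)) {
    return true;
  }
  return legacyIsError(code);
}

// Approach 2 (override): a configured failure_matcher fully replaces the
// legacy rule; without it the legacy rule applies.
inline bool overrideIsError(uint64_t code, const StatusMatcher& failure_matcher) {
  return failure_matcher ? failure_matcher(code) : legacyIsError(code);
}

// Example matcher: a user who only says "treat 4xx as errors".
inline bool is4xx(uint64_t code) { return code >= 400 && code < 500; }

inline void exampleUsage() {
  assert(overlayIsError(503, is4xx, nullptr)); // overlay: 5xx stays an error
  assert(!overrideIsError(503, is4xx));        // override: 5xx becomes a success
  assert(overrideIsError(503, nullptr));       // no matcher: legacy rule
}
```

The 503 case shows the trade-off discussed here: under the override approach a matcher for 4xx alone silently turns 5xx into successes, so the user has to list 5xx explicitly.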

I think it is possible to use only failure_matcher but it will be a bit more difficult to understand from users' point of view. I envisioned that failure_matcher will be used only to add additional errors to already existing ones (5xx), not completely redefine what codes are treated as errors. (And success_matcher would be used to remove status codes from default error codes). So, if a user wants to treat 4xx as errors, the user needs to define failure_matcher for 4xx only. If the failure_matcher decides what is error and what is not error, then the user must define matcher for 4xx and 5xx, otherwise 5xx will be treated as success.

I think this is where our perspectives differ. IMO, multiple matchers bring much more complexity, because users need to be aware of each matcher's result and the original codes to figure out the final success/failure result.

I prefer the second approach in your list. Simply determining the result based on either the legacy codes or the new matcher would be more intuitive in most cases.

I will implement a single matcher and see how it goes with the rest of the logic. Thanks.

I can convert the matcher to xds.type.matcher.v3.Matcher if required. I have no opinion which one is better or faster. Matching happens on each response, so performance is important.

wbpcode · 2025-06-20T02:25:26Z

-    absl::optional<uint64_t> new_code = cluster_->processHttpForOutlierDetection(*headers);
-    if (new_code.has_value()) {
-      put_result_code = new_code.value();
+    absl::optional<bool> matched = cluster_->processHttpForOutlierDetection(*headers);
+    if (matched.has_value()) {
+      // Outlier detector distinguishes only two values:
+      // Anything >= 500 is error.
+      // Anything < 500 is success.
+      put_result_code = matched.value() ? 500 : 200;


This change will eventually break consecutive_gateway_failure, which only consumes 502, 503 and 504. We may want to let only the new monitor consume the matched result.

It will not break consecutive_gateway_failure, because if the original code was 5xx and it matches the matcher, it will be forwarded in its original form. Non-5xx codes that match will be forwarded to outlier detection as 500, but 502, for example, will be forwarded as 502.
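The mapping described in this reply can be written down as a small function. This is a sketch under stated assumptions, not the code from router.cc: the helper name codeForOutlierDetection is hypothetical, std::optional stands in for absl::optional so the snippet compiles on its own, and the "no match is reported as 200" rule is taken from the PR description.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical helper for illustration; the name and signature are not Envoy's.
//   nullopt -> no matcher configured, keep the original code (legacy behaviour)
//   true    -> response matched the matcher, report an error
//   false   -> response did not match, report a success
inline uint64_t codeForOutlierDetection(std::optional<bool> matched, uint64_t original_code) {
  if (!matched.has_value()) {
    return original_code;
  }
  if (!matched.value()) {
    return 200;
  }
  // Keep genuine 5xx codes intact so that code-specific detectors such as
  // consecutive_gateway_failure still see 502/503/504.
  return original_code >= 500 ? original_code : 500;
}

// Usage checks for the cases mentioned in the thread.
inline void exampleUsage() {
  assert(codeForOutlierDetection(true, 404) == 500);         // non-5xx match -> 500
  assert(codeForOutlierDetection(true, 502) == 502);         // 5xx match keeps its code
  assert(codeForOutlierDetection(false, 503) == 200);        // no match -> success
  assert(codeForOutlierDetection(std::nullopt, 404) == 404); // no matcher -> unchanged
}
```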

wbpcode · 2025-06-20T02:27:00Z

+  virtual absl::optional<bool>
  processHttpForOutlierDetection(Http::ResponseHeaderMap& reponse) const PURE;


Rather than adding such a new method to ClusterInfo, I think it would be better to expose httpProtocolOptions() here.

That is probably a good idea. ClusterInfo should be just a config repository without methods that contain logic.

I tried to do it (expose httpProtocolOptions), but it requires including certain header files (from source/extensions), which is probably not a good idea in envoy/upstream/upstream.h. The other option would be to use a forward declaration, like

class httpProtocolOptions;

It is true that so far ClusterInfo has been purely a repository of config options and this method (processHttpForOutlierDetection) adds some logic to it, but on the other hand it hides the implementation and provides a nice interface without exposing details.

WDYT?
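For context, the forward-declaration option mentioned above would look roughly like the sketch below. All class names are hypothetical stand-ins (they are not the real Envoy types), and the three sections that would normally live in separate files are shown in one translation unit so that the snippet compiles on its own.

```cpp
#include <cassert>

// --- Would live in the public interface header ---
// Forward declaration only: the interface header does not include the
// extension's header, it just names the type.
class HttpProtocolOptionsSketch;

class ClusterInfoSketch {
public:
  virtual ~ClusterInfoSketch() = default;
  // Pure config accessor; callers include the extension header themselves
  // if they need to look inside the returned object.
  virtual const HttpProtocolOptionsSketch* httpProtocolOptions() const = 0;
};

// --- Would live in the extension's own header ---
class HttpProtocolOptionsSketch {
public:
  explicit HttpProtocolOptionsSketch(bool has_failure_matcher)
      : has_failure_matcher_(has_failure_matcher) {}
  bool hasFailureMatcher() const { return has_failure_matcher_; }

private:
  bool has_failure_matcher_;
};

// --- Would live in the implementation ---
class ClusterInfoImplSketch : public ClusterInfoSketch {
public:
  explicit ClusterInfoImplSketch(bool has_failure_matcher) : options_(has_failure_matcher) {}
  const HttpProtocolOptionsSketch* httpProtocolOptions() const override { return &options_; }

private:
  HttpProtocolOptionsSketch options_;
};

inline void exampleUsage() {
  ClusterInfoImplSketch cluster(true);
  const ClusterInfoSketch& info = cluster;
  assert(info.httpProtocolOptions()->hasFailureMatcher());
}
```

The trade-off is the one raised in the comment: the accessor keeps the interface free of logic, while a dedicated method hides the options type from callers entirely.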


wbpcode · 2025-06-25T02:50:21Z

+  // If specified, only responses matching the matcher will be treated by outlier detection as errors.
+  // If not specified, only 5xx codes are treated by outlier detection as errors.
+  config.common.matcher.v3.MatchPredicate outlier_detection_error_matcher = 8;


I actually prefer your previous design of adding a new OutlierDetection message. We could add new fields there in the future if we become aware of more feature requirements. (If I didn't present this clearly before, sorry! 😞)

message OutlierDetection { config.common.matcher.v3.MatchPredicate failure_matcher = 1; }

Used the previous design, so we can extend the outlier detection proto without adding new fields to the parent message.

wbpcode

Only one minor comment on the API. Once we agree on the API, completing the review of the code implementation should be pretty quick. I think maybe we can land this PR before this weekend or next Tuesday. :)


cpakulski · 2025-09-04T02:01:39Z

/retest

cpakulski · 2025-09-11T14:25:26Z

@wbpcode I know it has been a long time since you last reviewed it. I apologize for the delay. I addressed most of your comments. I think we reached an agreement on the API :-)). If the rest of the code looks good, I will add integration and regression tests and docs, and it should be ready for final review.


cpakulski · 2025-09-25T01:52:06Z

/retest

cpakulski · 2025-09-25T17:01:36Z

@wbpcode Thanks for approving the API. I added integration tests and updated docs. I believe that it is ready for another review (moving it out of draft).

KBaichoo · 2025-09-30T16:37:16Z

Seems like this needs a maintainer reviewer (wbpcode did the API shepherd review).

/assign @paul-r-gall


cpakulski · 2025-10-01T18:47:12Z

/retest

cpakulski · 2025-10-01T22:27:47Z

Release notes are still needed. I will add them once the PR is ready to merge.

paul-r-gall · 2025-10-02T14:26:50Z

Thanks, I'll approve once you add release notes!


cpakulski · 2025-10-02T17:57:17Z

Thanks @paul-r-gall . CI fails now and I am investigating if my changes cause those errors.


cpakulski · 2025-10-02T20:40:46Z

/retest

cpakulski · 2025-10-03T00:47:19Z

@paul-r-gall CI passes now. I had to make a minor adjustment to a namespace in one of the tests.

cpakulski · 2025-10-03T12:38:54Z

Thanks a lot @paul-r-gall!

Added ability to report a different code to outlier detection.

9ec0297

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

repokitteh-read-only Bot added the api label Jun 18, 2025

repokitteh-read-only Bot assigned markdroth Jun 18, 2025

wbpcode reviewed Jun 19, 2025


Changed api not to use http codes.

719e9b9

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

wbpcode reviewed Jun 20, 2025


Single error matcher.

c8bf5c8

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

wbpcode reviewed Jun 25, 2025


wbpcode assigned wbpcode and unassigned markdroth Jun 25, 2025

Adjusted api. Added unit tests.

b7c1bf6

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

cpakulski added the "no stalebot" label (disables stalebot from closing an issue) Jul 1, 2025

mathetake removed the "no stalebot" label (disables stalebot from closing an issue) Aug 18, 2025

Merge remote-tracking branch 'upstream/main' into od_code_map

f2b6973

Signed-off-by: Christoph Pakulski <christoph@tetrate.io>

cpakulski mentioned this pull request Aug 31, 2025

upstream: implementation of outlier detection extensions #34154

Open

cpakulski added 5 commits September 3, 2025 16:19

Added unit test to router.

05c6aec

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Corrected check for empty matcher.

ee98d59

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Corrected proto.

20a0543

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Merge remote-tracking branch 'upstream/main' into od_code_map

87c7f22

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Resolve namespace conflict.

f9f7a85

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

cpakulski requested a review from wbpcode September 11, 2025 14:25

cpakulski added 2 commits September 22, 2025 17:29

Added integration tests.

0ac18ee

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Test comment.

850f8c6

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Updated docs.

fadeda2

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

cpakulski marked this pull request as ready for review September 25, 2025 17:02

cpakulski requested a review from mattklein123 as a code owner September 25, 2025 17:02

repokitteh-read-only Bot assigned paul-r-gall Sep 30, 2025

paul-r-gall reviewed Sep 30, 2025


Comment thread source/common/router/router.cc Outdated

Comment thread source/extensions/common/matcher/matcher.h

Comment thread source/common/router/router.cc Outdated

cpakulski added 2 commits October 1, 2025 16:11

Renamed local variable.

e6058f2

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Merge remote-tracking branch 'upstream/main' into od_code_map

7ed17f6

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

cpakulski requested a review from paul-r-gall October 1, 2025 22:20

cpakulski added 2 commits October 2, 2025 16:08

Added release note.

2630a8e

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Merge remote-tracking branch 'upstream/main' into od_code_map

bc0f26c

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

paul-r-gall previously approved these changes Oct 2, 2025


Resolve namespace conflict.

035aa40

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

cpakulski dismissed paul-r-gall’s stale review via 035aa40 October 2, 2025 18:29

cpakulski requested review from agrawroh, botengyao and yanavlasov as code owners October 2, 2025 18:29

paul-r-gall approved these changes Oct 3, 2025


paul-r-gall merged commit 2a5978a into envoyproxy:main Oct 3, 2025
26 checks passed

cpakulski mentioned this pull request Oct 21, 2025

test: enable parameterized test for HttpProtocolIntegrationTest #41635

Closed

cpakulski mentioned this pull request Oct 29, 2025

Enhance Outlier Detection with Selective 5xx Error Exclusion #38311

Closed
