upstream: outlier detection for non 5xx codes #39947

Merged: paul-r-gall merged 18 commits into envoyproxy:main from cpakulski:od_code_map on Oct 3, 2025

@cpakulski (Contributor) commented Jun 18, 2025

Commit Message: outlier detection for non 5xx codes
Additional Description:
Added an HTTP-specific cluster option that defines an HTTP matcher. When a response matches the matcher, it is treated as an error regardless of its status code, and the error is reported to outlier detection as a 5xx code. A response that does not match the matcher is treated as a success and reported to outlier detection as code 200.
Risk Level: Low (no change unless matcher is defined)
Testing: Added unit and integration tests
Docs Changes: yes
Release Notes: yes
Platform Specific Features: n/a
Fixes #18789

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@repokitteh-read-only

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #39947 was opened by cpakulski.

@repokitteh-read-only

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @markdroth
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #39947 was opened by cpakulski.

@wbpcode (Member) left a comment

Thanks for the update. I don't understand why we need map_to. Why not have a single failure_matcher return a bool? Which scenario can a single failure_matcher not cover?

Comment on lines +179 to +190

message HttpEvents {
  // Matcher for response headers.
  config.common.matcher.v3.MatchPredicate match = 1 [(validate.rules).message = {required: true}];

  // Code which should be reported to the outlier detection if response headers matched the matcher.
  envoy.type.v3.StatusCode map_to = 2 [(validate.rules).enum = {defined_only: true}];
}

// List of matchers for response headers along with codes to be reported to outlier detection.
// The first matcher which returns matching success is applied.
repeated HttpEvents outlier_detection = 8;
Member

Why do we need a list of HttpEvents? I think a single matcher should be enough to match an error:

message OutlierDetection {
  config.common.matcher.v3.MatchPredicate failure_matcher = 1 [(validate.rules).message = {required: true}];
}

OutlierDetection outlier_detection = 8;

Contributor Author

Applied your suggestion. Thanks! It is now a single matcher. If a response matches the matcher, it will be treated as an error.

Comment thread envoy/upstream/upstream.h Outdated
Comment on lines +1016 to +1017
virtual absl::optional<uint64_t>
processHttpForOutlierDetection(Http::ResponseHeaderMap& reponse) const PURE;
Member

Suggested change
- virtual absl::optional<uint64_t>
- processHttpForOutlierDetection(Http::ResponseHeaderMap& reponse) const PURE;
+ virtual absl::optional<bool>
+ checkFailureResponse(Http::ResponseHeaderMap& reponse) const PURE;

Returning an absl::optional<bool> makes more sense. It's unnecessary to return an HTTP code.

Contributor Author

Agree.

Contributor Author

Converted to optional<bool>. I am not sure about the name of the method, though: processHttpForOutlierDetection is very descriptive about what it does and what its purpose is.

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Comment on lines +184 to +185
- // Code which should be reported to the outlier detection if response headers matched the matcher.
- envoy.type.v3.StatusCode map_to = 2 [(validate.rules).enum = {defined_only: true}];
+ // Determines if matching indicate error or non-error. Defaults to false (error).
+ bool success_on_match = 2;
Member

If we agree that the outlier detection matcher here should generate a boolean result, then:

  1. success_on_match seems unnecessary. You only need to define that a match signifies an error. This would simplify the API.
  2. We also don't need repeated HttpEvents, because MatchPredicate itself supports or_match, and_match, and not_match. A single config.common.matcher.v3.MatchPredicate is good enough for this task.

Contributor Author

This PR assumes that the "old" outlier detection is used and that this API is an overlay on top of the default behaviour, which sends all HTTP status codes to outlier detection.

  • success_on_match is needed to address the use case described in Enhance Outlier Detection with Selective 5xx Error Exclusion #38311. The default is false (a match means error), so users can define only the matcher, which is more intuitive.
  • The same applies to repeated HttpEvents. You are right that a single matcher would be enough to define what should be treated as an error, but if two overlays are required, one to define 4xx as errors and a second to skip 502, 503 and 504 as errors, two matchers are needed.

@wbpcode (Member) commented Jun 20, 2025

> success_on_match is needed to address the use case described in #38311. The default is false (a match means error), so users can define only the matcher, which is more intuitive.

I understand the requirement, but why is the bool flag necessary? But maybe you can use a not_match to cover the case?

> The same applies to repeated HttpEvents. You are right that a single matcher would be enough to define what should be treated as an error, but if two overlays are required, one to define 4xx as errors and a second to skip 502, 503 and 504 as errors, two matchers are needed.

The MatchPredicate should be able to cover that case:

failure_matcher:
  or_match:
    rules:
    - http_response_headers_match:
        headers:
        - name: ":status"
          string_match:
            safe_regex:
              regex: "4.."
    - and_match:
        rules:
        - http_response_headers_match:
            headers:
            - name: ":status"
              string_match:
                safe_regex:
                  regex: "5.."
        - not_match:
            http_response_headers_match:
              headers:
              - name: ":status"
                string_match:
                  safe_regex:
                    regex: "502|503|504"
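
As a sanity check on the predicate above, here is a minimal C++ sketch that mirrors its boolean structure on a plain numeric status code. It is not Envoy's matcher implementation: `isFailure` and `checkIsFailure` are hypothetical names, and the sketch assumes each regex is matched against the whole three-digit `:status` value.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not Envoy code) mirroring the predicate above:
//   4xx OR (5xx AND NOT (502 | 503 | 504))
bool isFailure(uint32_t status) {
  const bool is_4xx = status >= 400 && status <= 499;
  const bool is_5xx = status >= 500 && status <= 599;
  const bool is_gateway_error = status == 502 || status == 503 || status == 504;
  return is_4xx || (is_5xx && !is_gateway_error);
}

// Spot checks of the truth table.
void checkIsFailure() {
  assert(isFailure(404));   // 4xx counts as a failure
  assert(isFailure(500));   // 5xx other than the gateway errors counts as a failure
  assert(!isFailure(503));  // 502/503/504 are excluded by the not_match branch
  assert(!isFailure(200));  // everything else is a success
}
```

Under these assumptions, one matcher expresses both "add 4xx" and "skip 502, 503, 504" without a second overlay.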

@wbpcode (Member) commented Jun 20, 2025

I prefer xds.type.matcher.v3.Matcher rather than config.common.matcher.v3.MatchPredicate, because xds.type.matcher.v3.Matcher supports CEL and makes it easier to support various logical operators. (But MatchPredicate should also be fine.)

Contributor Author

> But maybe you can use a not_match to cover the case?

Let me try. If it is possible it would be great!

Member

It's not very hard to accept both a failure_matcher and a success_matcher, but could you kindly explain more about why? In my mind, if a single failure_matcher is configured, the failure_matcher takes over the determination of failure/success. If the failure_matcher is not configured, the legacy behavior is used.

It would be very easy to explain the new behavior to our users if there is only a single matcher, and a single matcher is flexible enough. If there are multiple matchers, we always need to consider the case where a code matches multiple matchers and what the behavior should be.
So, what's the benefit of the added complexity of multiple matchers?

@cpakulski (Contributor Author) commented Jun 23, 2025

> It's not very hard to accept both a failure_matcher and a success_matcher, but could you kindly explain more about why? In my mind, if a single failure_matcher is configured, the failure_matcher takes over the determination of failure/success. If the failure_matcher is not configured, the legacy behavior is used.

I think it is possible to use only failure_matcher but it will be a bit more difficult to understand from users' point of view. I envisioned that failure_matcher will be used only to add additional errors to already existing ones (5xx), not completely redefine what codes are treated as errors. (And success_matcher would be used to remove status codes from default error codes). So, if a user wants to treat 4xx as errors, the user needs to define failure_matcher for 4xx only. If the failure_matcher decides what is error and what is not error, then the user must define matcher for 4xx and 5xx, otherwise 5xx will be treated as success.

So, in essence there are two approaches here:

  • always use legacy behaviour and use failure_matcher and success_matcher to add or remove status codes from default set.
  • use only failure_matcher to completely overwrite legacy behavior. If failure_matcher is not specified, legacy behaviour takes place (only 5xx are treated as errors). If a failure_matcher is defined, only codes matching the matcher will be treated as error.
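
The difference between the two approaches can be sketched in a few lines of C++. This is an illustration only, not the PR's code: `Matcher`, `isErrorOverlay`, and `isErrorOverride` are hypothetical names, the matchers work on a bare status code rather than response headers, and the rule that a success match wins over a failure match in the overlay variant is an assumption made just to get a deterministic result.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical matcher type: an empty std::function means "not configured".
using Matcher = std::function<bool(uint32_t)>;

// Approach 1 (overlay): start from the legacy rule (>= 500 is an error),
// then add codes via failure_matcher and remove codes via success_matcher.
// Assumption: when both match, the success matcher wins.
bool isErrorOverlay(uint32_t status, const Matcher& failure_matcher,
                    const Matcher& success_matcher) {
  bool error = status >= 500;
  if (failure_matcher && failure_matcher(status)) {
    error = true;
  }
  if (success_matcher && success_matcher(status)) {
    error = false;
  }
  return error;
}

// Approach 2 (override): a configured failure_matcher alone decides;
// the legacy rule applies only when no matcher is configured.
bool isErrorOverride(uint32_t status, const Matcher& failure_matcher) {
  if (!failure_matcher) {
    return status >= 500;
  }
  return failure_matcher(status);
}

// With a 4xx-only matcher, 503 stays an error under the overlay approach
// but becomes a success under the override approach.
void compareApproaches() {
  const Matcher is_4xx = [](uint32_t s) { return s >= 400 && s < 500; };
  assert(isErrorOverlay(503, is_4xx, Matcher{}));
  assert(!isErrorOverride(503, is_4xx));
}
```

The 503 case is exactly the one described above: with the override semantics, a user who wants 4xx and 5xx treated as errors has to match both.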

Member

> I think it is possible to use only failure_matcher but it will be a bit more difficult to understand from users' point of view. I envisioned that failure_matcher will be used only to add additional errors to already existing ones (5xx), not completely redefine what codes are treated as errors. (And success_matcher would be used to remove status codes from default error codes). So, if a user wants to treat 4xx as errors, the user needs to define failure_matcher for 4xx only. If the failure_matcher decides what is error and what is not error, then the user must define matcher for 4xx and 5xx, otherwise 5xx will be treated as success.

I think this is where our perspectives differ. IMO, multiple matchers bring much more complexity, because users need to be aware of the results of multiple matchers and the original codes to figure out the final success/failure result.

I prefer the second approach in your list. Simply determining the result based on either the legacy codes or the new matcher would be more intuitive in most cases.

Contributor Author

I will implement a single matcher and see how it goes with the rest of the logic. Thanks.

Contributor Author

I can convert the matcher to xds.type.matcher.v3.Matcher if required. I have no opinion which one is better or faster. Matching happens on each response, so performance is important.

Comment thread source/common/router/router.cc Outdated
Comment on lines +1639 to +1644
- absl::optional<uint64_t> new_code = cluster_->processHttpForOutlierDetection(*headers);
- if (new_code.has_value()) {
-   put_result_code = new_code.value();
+ absl::optional<bool> matched = cluster_->processHttpForOutlierDetection(*headers);
+ if (matched.has_value()) {
+   // Outlier detector distinguishes only two values:
+   // Anything >= 500 is error.
+   // Anything < 500 is success.
+   put_result_code = matched.value() ? 500 : 200;
Member

This change will end up breaking consecutive_gateway_failure, which only consumes 502, 503, and 504. We may want to let only the new monitor consume the matched result.

Contributor Author

> This change will end up breaking consecutive_gateway_failure, which only consumes 502, 503, and 504. We may want to let only the new monitor consume the matched result.

It will not break consecutive_gateway_failure, because if the original code was 5xx and it matches the matcher, it will be forwarded in its original form. Non-5xx codes will be forwarded to outlier detection as 500, but 502, for example, will be forwarded as 502.
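
The mapping described in this reply can be sketched as a small pure function. This is a reading of the comment, not the merged router code: `codeForOutlierDetection` is a hypothetical name, an empty optional stands for "no matcher configured", and reporting non-matching responses as 200 follows the PR description.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical helper (not the actual router code).
// `matched` is empty when no matcher is configured, otherwise it holds the
// matcher's verdict for this response.
uint64_t codeForOutlierDetection(uint64_t upstream_code, std::optional<bool> matched) {
  if (!matched.has_value()) {
    return upstream_code;  // legacy behaviour: report the code unchanged
  }
  if (!matched.value()) {
    return 200;  // not matched: report as success
  }
  // Matched: keep an original 5xx so that gateway-failure detection still
  // sees 502/503/504; report any other matched code as a generic 500.
  return upstream_code >= 500 ? upstream_code : 500;
}

void checkCodeMapping() {
  assert(codeForOutlierDetection(502, true) == 502);
  assert(codeForOutlierDetection(404, true) == 500);
  assert(codeForOutlierDetection(503, false) == 200);
  assert(codeForOutlierDetection(404, std::nullopt) == 404);
}
```

Keeping the original 5xx value is what lets a detector that only reacts to 502, 503, and 504 keep working when those codes match.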

Comment thread envoy/upstream/upstream.h Outdated
Comment on lines 1016 to 1017
virtual absl::optional<bool>
processHttpForOutlierDetection(Http::ResponseHeaderMap& reponse) const PURE;
Member

Rather than adding such a new method to ClusterInfo, I think it would be better to expose httpProtocolOptions() here.

Contributor Author

That is probably a good idea. ClusterInfo should be just a config repository without methods with logic.

Contributor Author

I tried to do it (expose httpProtocolOptions), but it requires including certain header files (from source/extensions), which is probably not a good idea in envoy/upstream/upstream.h. Another option would be to use a forward declaration, like

class httpProtocolOptions;

It is true that so far ClusterInfo has been purely a repository of config options and this method (processHttpForOutlierDetection) adds some logic to it, but on the other hand it hides the implementation and provides a nice interface without exposing details.

WDYT?
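
For reference, the forward-declaration option could look roughly like the sketch below. All classes here are simplified, hypothetical stand-ins rather than Envoy's real types. The point is only that an interface can return a pointer to a type it has merely forward-declared, so the interface header does not need to include the header that defines it.

```cpp
#include <cassert>
#include <memory>

// --- What the interface header could contain ---
class HttpProtocolOptions;  // forward declaration only, no include needed

class ClusterInfo {
public:
  virtual ~ClusterInfo() = default;
  // Returning a pointer to an incomplete type is fine in a declaration.
  virtual std::shared_ptr<const HttpProtocolOptions> httpProtocolOptions() const = 0;
};

// --- What an implementation file could contain ---
class HttpProtocolOptions {
public:
  explicit HttpProtocolOptions(bool has_failure_matcher)
      : has_failure_matcher_(has_failure_matcher) {}
  bool hasFailureMatcher() const { return has_failure_matcher_; }

private:
  bool has_failure_matcher_;
};

class ClusterInfoImpl : public ClusterInfo {
public:
  explicit ClusterInfoImpl(bool has_failure_matcher)
      : options_(std::make_shared<const HttpProtocolOptions>(has_failure_matcher)) {}
  std::shared_ptr<const HttpProtocolOptions> httpProtocolOptions() const override {
    return options_;
  }

private:
  std::shared_ptr<const HttpProtocolOptions> options_;
};

// A caller that needs the options' members must see the full definition.
void checkForwardDeclaredAccess() {
  ClusterInfoImpl impl(true);
  const ClusterInfo& info = impl;
  assert(info.httpProtocolOptions()->hasFailureMatcher());
}
```

The trade-off is the one raised above: callers such as the router would then depend on the concrete options type, whereas a method on ClusterInfo keeps that detail hidden.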

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Comment on lines +179 to +181
// If specified, only responses matching the matcher will be treated by outlier detection as errors.
// If not specified, only 5xx codes are treated by outlier detection as errors.
config.common.matcher.v3.MatchPredicate outlier_detection_error_matcher = 8;
Member

I actually prefer your previous design of adding a new OutlierDetection message. We could add new fields there in the future if we become aware of more feature requirements. (If I didn't present this clearly before, sorry! 😞 )

message OutlierDetection {
  config.common.matcher.v3.MatchPredicate failure_matcher = 1;
}

Contributor Author

Went back to the previous design, so we can extend the outlier-detection proto without adding new fields to the parent message.

@wbpcode (Member) left a comment

Only one minor comment on the API. Once we agree on the API, it should be pretty quick to complete the review of the code implementation. I think maybe we can land this PR before this weekend or next Tuesday. :)

@wbpcode wbpcode assigned wbpcode and unassigned markdroth Jun 25, 2025
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@cpakulski cpakulski added the "no stalebot" label Jul 1, 2025
@mathetake mathetake removed the "no stalebot" label Aug 18, 2025
Signed-off-by: Christoph Pakulski <christoph@tetrate.io>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@cpakulski
Contributor Author

/retest

@cpakulski
Contributor Author

@wbpcode I know it has been a long time since your last review. I apologize for the delay. I addressed most of your comments. I think we reached an agreement on the API :-)). If the rest of the code looks good, I will add integration and regression tests and docs, and it should be ready for final review.

@cpakulski cpakulski requested a review from wbpcode September 11, 2025 14:25
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@cpakulski
Contributor Author

/retest

@cpakulski
Contributor Author

@wbpcode Thanks for approving the API. I added integration tests and updated docs. I believe that it is ready for another review (moving it out of draft).

@cpakulski cpakulski marked this pull request as ready for review September 25, 2025 17:02
@KBaichoo
Contributor

Seems like this needs a maintainer reviewer (wbpcode did the API shepherd review).

/assign @paul-r-gall

Comment thread source/common/router/router.cc Outdated
Comment thread source/extensions/common/matcher/matcher.h
Comment thread source/common/router/router.cc Outdated
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@cpakulski
Contributor Author

/retest

@cpakulski cpakulski requested a review from paul-r-gall October 1, 2025 22:20
@cpakulski
Contributor Author

Release notes are still needed. I will add them once the PR is ready for merge.

@paul-r-gall
Contributor

Thanks, I'll approve once you add release notes!

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
paul-r-gall previously approved these changes Oct 2, 2025
@cpakulski
Contributor Author

Thanks @paul-r-gall. CI fails now and I am investigating whether my changes cause those errors.

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@cpakulski
Contributor Author

/retest

@cpakulski
Contributor Author

@paul-r-gall CI passes now. I had to make a minor adjustment to a namespace in one of the tests.

@paul-r-gall paul-r-gall merged commit 2a5978a into envoyproxy:main Oct 3, 2025
26 checks passed
@cpakulski
Contributor Author

Thanks a lot @paul-r-gall!

Development

Successfully merging this pull request may close these issues.

Outlier Detection for non-error status codes

6 participants