upstream: outlier detection for non 5xx codes#39947
upstream: outlier detection for non 5xx codes#39947paul-r-gall merged 18 commits intoenvoyproxy:mainfrom
Conversation
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
|
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to |
wbpcode
left a comment
There was a problem hiding this comment.
Thanks for the update. I didn't get why we need a map_to. Why not return a bool by single failure_matcher. Which scenario is the single failure_matcher couldn't cover?
|
|
||
| message HttpEvents { | ||
| // Matcher for response headers. | ||
| config.common.matcher.v3.MatchPredicate match = 1 [(validate.rules).message = {required: true}]; | ||
|
|
||
| // Code which should be reported to the outlier detection if response headers matched the matcher. | ||
| envoy.type.v3.StatusCode map_to = 2 [(validate.rules).enum = {defined_only: true}]; | ||
| } | ||
|
|
||
| // List of matchers for response headers along with codes to be reported to outlier detection. | ||
| // The first matcher which returns matching success is applied. | ||
| repeated HttpEvents outlier_detection = 8; |
There was a problem hiding this comment.
Why we need a list of HttpEvents? I think single match should enough to match an error?
message OutlierDetection {
config.common.matcher.v3.MatchPredicate failure_matcher = 1 [(validate.rules).message = {required: true}];
}
OutlierDetection outlier_detection = 8;
There was a problem hiding this comment.
Applied your suggestion. Thanks! It is now a single matcher. If a code matches the matcher, it will be treated as error.
| virtual absl::optional<uint64_t> | ||
| processHttpForOutlierDetection(Http::ResponseHeaderMap& reponse) const PURE; |
There was a problem hiding this comment.
| virtual absl::optional<uint64_t> | |
| processHttpForOutlierDetection(Http::ResponseHeaderMap& reponse) const PURE; | |
| virtual absl::optional<bool> | |
| checkFailureResponse(Http::ResponseHeaderMap& reponse) const PURE; |
return an absl::optional<bool> make more sense. It's unecessary to return http code.
There was a problem hiding this comment.
Converted to optional<bool>. I am not sure about the name of the method though. processHttpForOutlierDetection is very descriptive about what it does and what is its purpose.
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
| // Code which should be reported to the outlier detection if response headers matched the matcher. | ||
| envoy.type.v3.StatusCode map_to = 2 [(validate.rules).enum = {defined_only: true}]; | ||
| // Determines if matching indicate error or non-error. Defaults to false (error). | ||
| bool success_on_match = 2; |
There was a problem hiding this comment.
If we agree the outlier detection matcher here should generate a boolean result, then:
- the
success_on_matchseem make no sense. You only need to define that if match then signify error. This would simlify the API. - We also needn't repeated
HttpEvents. Because theMatchPredicateself support or_match, and_match, not_match. Singleconfig.common.matcher.v3.MatchPredicateis good enough for this task.
There was a problem hiding this comment.
This PR assumes that "old" outlier is used and this API is overlay on top of the default behaviour which sends all HTTP status codes to outlier detection.
success_on_matchis needed to address use case described in Enhance Outlier Detection with Selective 5xx Error Exclusion #38311. The default is false (matcher means error), so users can define only matcher, which is more intuitive.- similar in regards to
repeated HttpEvents. You are right that a single matcher would be enough to define what should be treated as error, but if two overlays are required, one to define 4xx as errors, and second to skip 502, 503 and 504 as errors, two matchers are needs.
There was a problem hiding this comment.
success_on_match is needed to address use case described in #38311. The default is false (matcher means error), so users can define only matcher, which is more intuitive.
I understand the requirement but why why the bool flag is necessary? But may be you can use a not_match cover the case?
similar in regards to repeated HttpEvents. You are right that a single matcher would be enough to define what should be treated as error, but if two overlays are required, one to define 4xx as errors, and second to skip 502, 503 and 504 as errors, two matchers are needs.
The MatchPredicate should could cover that case.
failure_matcher:
or_match:
rules:
- http_response_headers_match:
headers:
- name: ":status"
string_match:
safe_regex:
regex: "4.."
- and_match:
rules:
- http_response_headers_match:
headers:
- name: ":status"
string_match:
safe_regex:
regex: "5.."
- not_match:
http_response_headers_match:
headers:
- name: ":status"
string_match:
safe_regex:
regex: "502|503|504"
There was a problem hiding this comment.
I prefer the xds.type.matcher.v3.Matcher rather then config.common.matcher.v3.MatchPredicate because xds.type.matcher.v3.Matcher support CEL and is easier to support various logical operators. (But MatchPredicate should also fine)
There was a problem hiding this comment.
But may be you can use a not_match cover the case?
Let me try. If it is possible it would be great!
There was a problem hiding this comment.
it's not very hard to accept a failure_matcher and a success_matcher. But could you kindly explain more about why? In my mind, if single failure_matcher is configured, that means the failure_matcher will take over the determination of the failure/success. Or if the failure_matcher is not configured, the legacy behavior will be used.
It would be very easy to explain the new behavior to our users if there is only single matcher. And the single matcher is flexible enough. If there are multiple matchers, we always need to consider the case where a code matched multiple matchers and what the behavior should be.
So, what's benifit of the new comlexity of multiple matcher?
There was a problem hiding this comment.
it's not very hard to accept a failure_matcher and a success_matcher. But could you kindly explain more about why? In my mind, if single failure_matcher is configured, that means the failure_matcher will take over the determination of the failure/success. Or if the failure_matcher is not configured, the legacy behavior will be used.
I think it is possible to use only failure_matcher but it will be a bit more difficult to understand from users' point of view. I envisioned that failure_matcher will be used only to add additional errors to already existing ones (5xx), not completely redefine what codes are treated as errors. (And success_matcher would be used to remove status codes from default error codes). So, if a user wants to treat 4xx as errors, the user needs to define failure_matcher for 4xx only. If the failure_matcher decides what is error and what is not error, then the user must define matcher for 4xx and 5xx, otherwise 5xx will be treated as success.
So, in essence there are two approaches here:
- always use legacy behaviour and use
failure_matcherandsuccess_matcherto add or remove status codes from default set. - use only
failure_matcherto completely overwrite legacy behavior. Iffailure_matcheris not specified, legacy behaviour takes place (only 5xx are treated as errors). If afailure_matcheris defined, only codes matching the matcher will be treated as error.
There was a problem hiding this comment.
I think it is possible to use only failure_matcher but it will be a bit more difficult to understand from users' point of view. I envisioned that failure_matcher will be used only to add additional errors to already existing ones (5xx), not completely redefine what codes are treated as errors. (And success_matcher would be used to remove status codes from default error codes). So, if a user wants to treat 4xx as errors, the user needs to define failure_matcher for 4xx only. If the failure_matcher decides what is error and what is not error, then the user must define matcher for 4xx and 5xx, otherwise 5xx will be treated as success.
I think this is one of our differenet perspectives. IMO, the multiple matcher bring much more complexity because the users to aware the multiple matcher's result and the original codes to figure out the final success/failure result.
I will prefer the second approach in your list. Simplely derterming the result based on legacy codes or new matcher would be more intuitive in most cases.
There was a problem hiding this comment.
I will implement a single matcher and see how it goes with the rest of the logic. Thanks.
There was a problem hiding this comment.
I can convert the matcher to xds.type.matcher.v3.Matcher if required. I have no opinion which one is better or faster. Matching happens on each response, so performance is important.
| absl::optional<uint64_t> new_code = cluster_->processHttpForOutlierDetection(*headers); | ||
| if (new_code.has_value()) { | ||
| put_result_code = new_code.value(); | ||
| absl::optional<bool> matched = cluster_->processHttpForOutlierDetection(*headers); | ||
| if (matched.has_value()) { | ||
| // Outlier detector distinguishes only two values: | ||
| // Anything >= 500 is error. | ||
| // Anything < 500 is success. | ||
| put_result_code = matched.value() ? 500 : 200; |
There was a problem hiding this comment.
This change will finally break the consecutive_gateway_failure which only consume 502, 503, 504. We may only let the new monitor to consumer the matched result.
There was a problem hiding this comment.
This change will finally break the consecutive_gateway_failure which only consume 502, 503, 504. We may only let the new monitor to consumer the matched result.
It will not break consecutive_gateway_failure, because if the original code was 5xx and it matches the matcher, it will be forwarded in its original form. Non-5xx codes will be forwarded to outlier as 500, but 502 for example will be forwarded as 502.
| virtual absl::optional<bool> | ||
| processHttpForOutlierDetection(Http::ResponseHeaderMap& reponse) const PURE; |
There was a problem hiding this comment.
Rather then adding such new method at ClusterInfo. I think it would be better to expose the httpProtocolOptions() here.
There was a problem hiding this comment.
That is probably a good idea. ClusterInfo should be just a config repository without methods with logic.
There was a problem hiding this comment.
I tried to do it (expose httpProtocolOptions), but it requires to include a certain header files (from source/extensions) which is probably not a good idea to do in envoy/upstream/upstream.h. Other option would be to use forward declaration, like
class httpProtocolOptions;
It is true that so far ClusterInfo was purely repository of config options and this method (processHttpForOutlierDetection) adds some logic to it, but on the other side it hides implementation and provides a nice interface without exposing details.
WDYT?
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
| // If specified, only responses matching the matcher will be treated by outlier detection as errors. | ||
| // If not specified, only 5xx codes are treated by outlier detection as errors. | ||
| config.common.matcher.v3.MatchPredicate outlier_detection_error_matcher = 8; |
There was a problem hiding this comment.
I actually prefer you previous design about add a new OutlierDetection message. We may could add new field at there in the future if we aware more feature requirements. (If I didn't present this clearly before, sorry! 😞 )
message OutlierDetection {
config.common.matcher.v3.MatchPredicate failure_matcher = 1;
}
There was a problem hiding this comment.
Used a previous design, so we can extend the proto related to outlier without adding new fields to parent message.
wbpcode
left a comment
There was a problem hiding this comment.
Only one minor comment to the API. Once we get agreement to the API, that should pretty quick to complete the review the code implemention. I think maybe we can land this PR before this weekend or next Tuesday. :)
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <christoph@tetrate.io>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
|
/retest |
|
@wbpcode I know it has been a long time since you reviewed it last time. I apologize for delay. I addressed most of your comments. I think we reached an agreement on API :-)). If the rest of the code looks good I will add integration and regression tests, docs and it should be ready for final review. |
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
|
/retest |
|
@wbpcode Thanks for approving the API. I added integration tests and updated docs. I believe that it is ready for another review (moving it out of draft). |
|
seems like this needs a maintainer reviewer (wbpcode did the api shepards) /assign @paul-r-gall |
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
|
/retest |
|
release notes are still needed. I will add them once the PR is ready for merge. |
|
Thanks, I'll approve once you add release notes! |
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
|
Thanks @paul-r-gall . CI fails now and I am investigating if my changes cause those errors. |
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
|
/retest |
|
@paul-r-gall . CI passes now. I had to do minor adjustment to a namespace in one of the tests. |
|
Thanks a lot @paul-r-gall! |
Commit Message: outlier detection for non 5xx codes
Additional Description:
Added to a cluster HTTP a specific option to define an http matcher. When a response matches the defined matcher, the response is treated as error, regardless of the status code. The error is reported to outlier detection as 5xx code. If a response does not match the matcher it is treated as success and forwarded to outlier detection as code 200.
Risk Level: Low (no change unless matcher is defined)
Testing: Added unit and integration tests
Docs Changes: yes
Release Notes: yes
Platform Specific Features: n/a
Fixes #18789