upstream: unify all outlier report to use putResult#39440
upstream: unify all outlier report to use putResult#39440wbpcode merged 8 commits intoenvoyproxy:mainfrom
Conversation
Signed-off-by: wangbaiping(wbpcode) <wbphub@gmail.com>
b884fb7 to
9bfabe7
Compare
| putHttpResponseCode(enumToInt(Http::Code::OK)); | ||
| localOriginNoFailure(); | ||
| putHttpResponseCode(code.has_value() ? code.value() : enumToInt(Http::Code::OK)); |
There was a problem hiding this comment.
Note, the onPoolReady will call the putResult(Result::LocalOriginConnectSuccess), so, the localOriginNoFailure() should be removed to avoid repeated connect event.
There was a problem hiding this comment.
Can you describe how does the repeated connect event happen? Does it mean that unit tests do not cover that case?
There was a problem hiding this comment.
In the previous implementation, the double/thrift proxy will call the putResult(Result::LocalOriginConnectSuccess) when pool is ready (onPoolReady). And when the request is successed, the putResult(Result::ExtOriginRequestSuccess) will be called.
If the split_external_local_origin_errors is set to true, then putResult() will actually call the putResultWithLocalExternalSplit. In that case, the localOriginNoFailure() will actually be called twice for same request. (one from putResult(Result::LocalOriginConnectSuccess) and another one from putResult(Result::ExtOriginRequestSuccess))
There was a problem hiding this comment.
If we want to use the extension to replace the old behavior completely, then to cover that requirement, the consecutive errors extension also need to provide three different modes: LocalOrigin, ExternalOrigin, Composite/Default?
There was a problem hiding this comment.
Thanks for explaining why do you experience repeated event.
In the new extensions, a user may define 2 extensions and send LocalOrigin to one extension and ExternalOrigin/HttpCode to the second extension. In this model those two types of events are counter separately.
But a user may also define only one extension and send both LocalOrigin and ExternalOrigin/HttpCode to the same extension. In this model a user says that it does not really care about types of errors: if a server returns 5xx or connection to a server fails, it should be counted as error.
So in other words an extension does not need to understand Result enum (LocalOrigin, ExternalOrigin, etc) only whether the module reports success or error (boolean).
There was a problem hiding this comment.
But the Result/event won't disappear? The router need to map the Result/event to specific extension or multiple extensions + bool? I think this mapping self is more complex?
I actually think the Result emum is very suitable choice which actually only contains four statuses: l4 success, l4 failure, l7 success, l7 failure and could cover most protocols outlier detections?
I think that Result may stay because it is used by L4 to inform the router (or Redis, etc) that there was a problem with the connection. It also uses several values like timeout, upstream reset, etc. But the outlier monitor extensions do not really need to know the details. Extensions only care if it was a success or failure. So there must be a mapping somewhere from Result to boolean.
Anyways, we can continue this discussion on my other PR.
There was a problem hiding this comment.
Another comment similar to one above for ExtOriginRequestFailed. There is an underlying assumption that all non-http modules map ExtOriginRequestSuccess to HTTP code 200. So, if you specify code other than 200, consecutive5xx, failure frequency, success rate monitors may stop working properly, albeit most code simply assumes than anything different than 5xx is success.
There was a problem hiding this comment.
I think that Result may stay because it is used by L4 to inform the router (or Redis, etc) that there was a problem with the connection. It also uses several values like timeout, upstream reset, etc. But the outlier monitor extensions do not really need to know the details. Extensions only care if it was a success or failure. So there must be a mapping somewhere from Result to boolean.
Anyways, we can continue this discussion on my other PR.
I think this is a problem about where the complexity should be. If the extension monitor only handle single bool, that means if we want to support the current feature about handle L4/L7 errors separately, then we need to configure mapping rules to map specific events to specific monitor at the router configuration, and at the upstream configuration, we also need to configure multiple different monitors to handle different errors.
Note, the router's configuration is works for whole HCM, that means it may shared by multiple upstreams/clusters, then, how we can handle the case where one cluster want to handle L4/L7 errors separately and another one doesn't? So, I will strong inclined let the monitors to handle both L4/L7 errors.
Actually, timeout/reset, etc are all handled as same status (L4 failure) for now.
There was a problem hiding this comment.
Another comment similar to one above for ExtOriginRequestFailed. There is an underlying assumption that all non-http modules map ExtOriginRequestSuccess to HTTP code 200. So, if you specify code other than 200, consecutive5xx, failure frequency, success rate monitors may stop working properly, albeit most code simply assumes than anything different than 5xx is success.
Only HTTP modules will set the HTTP code, so, I think others won't be effected by this PR?
There was a problem hiding this comment.
Only HTTP modules will set the HTTP code, so, I think others won't be effected by this PR?
HTTP modules so far used putHttpResponseCode directly, but now they do it via ExtOriginRequestSuccess/ExtOriginRequestFailure. I think you are right and it should be fine. Please update the comment https://github.com/envoyproxy/envoy/blob/main/envoy/upstream/outlier_detection.h#L25-L27
Note, the router's configuration is works for whole HCM, that means it may shared by multiple upstreams/clusters, then, how we can handle the case where one cluster want to handle L4/L7 errors separately and another one doesn't? So, I will strong inclined let the monitors to handle both L4/L7 errors.
Actually, timeout/reset, etc are all handled as same status (L4 failure) for now.
Correct. This case is handled even with bools. Let's continue this discussion in the other PR, so we do not loose that valuable feedback. But in general, under router there is a matcher and a list of destinations where you want to send the result of matching:
match:
http_code:
start: 500
end: 599
send_to:
- monitor in cluster1
- monitor in cluster3
So, if you do not want to send something to say cluster2, you do not configure it under matcher.
|
/retest |
… branch 'main' of https://github.com/envoyproxy/envoy into dev-unify-outlier
|
/retest |
|
/retest |
ggreenway
left a comment
There was a problem hiding this comment.
I've never looked closely at the outlier detector before. Can you explain to a newbie the behavior change being made here, why it is correct, and how that reflects in the changed test expectations?
/wait
cpakulski
left a comment
There was a problem hiding this comment.
Few comments regarding logic.
| case Result::ExtOriginRequestFailed: | ||
| // map it to http code and call http handler. | ||
| return putHttpResponseCode(enumToInt(Http::Code::ServiceUnavailable)); | ||
| putHttpResponseCode(code.has_value() ? code.value() |
There was a problem hiding this comment.
Technically/logically it is correct to read http code from code, but there is underlying assumption in the outlier detection that all failures generated by non-http modules map those failures to HTTP code 500. If you specify code other than 500, the "canned" outlier detection modules like consecutive 5xx, failure frequency and success rate will stop to function properly. My preference is to leave putHttpResponseCode(enumToInt(Http::Code::ServiceUnavailable)); as it clearly states that such mapping takes place.
There was a problem hiding this comment.
This is necessary because HTTP use the code rather than the status. And only HTTP modules will set the HTTP code, so, I think others won't be effected by this PR?
We may migrate features to extension monitor in the future gradually and use the Result only as input, but for now, we still need this behavior
There was a problem hiding this comment.
OK. You can leave it as is, but if you call putResult(Result::ExtOriginRequestFailed, 400), you indicate failure but outlier detection monitor like consecutive_5xx will treat is as success. So, in order for outlier detection to work properly, you must always call putResult(Result::ExtOriginRequestFailed, code) with code 500. If someone in the future uses a different HTTP code, it may lead to lengthly debugging session why things do not work.
| putHttpResponseCode(enumToInt(Http::Code::OK)); | ||
| localOriginNoFailure(); | ||
| putHttpResponseCode(code.has_value() ? code.value() : enumToInt(Http::Code::OK)); |
There was a problem hiding this comment.
Another comment similar to one above for ExtOriginRequestFailed. There is an underlying assumption that all non-http modules map ExtOriginRequestSuccess to HTTP code 200. So, if you specify code other than 200, consecutive5xx, failure frequency, success rate monitors may stop working properly, albeit most code simply assumes than anything different than 5xx is success.
Co-authored-by: Greg Greenway <ggreenway@apple.com> Signed-off-by: code <wbphub@gmail.com>
Signed-off-by: wangbaiping(wbpcode) <wbphub@gmail.com>
|
/assign-from @envoyproxy/senior-maintainers re-assign this PR because @ggreenway will OOO for a while :) |
|
@envoyproxy/senior-maintainers assignee is @mattklein123 |
|
/retest |
|
Will wait for @cpakulski approval then I will look. /wait-any |
cpakulski
left a comment
There was a problem hiding this comment.
Looks good. Thanks. Please update the indicated code comment.
| return putHttpResponseCode(enumToInt(Http::Code::ServiceUnavailable)); | ||
| putHttpResponseCode(code.value_or(enumToInt(Http::Code::ServiceUnavailable))); | ||
| break; | ||
| // EXT_ORIGIN_REQUEST_SUCCESS is used to report that transaction with non-http server was |
There was a problem hiding this comment.
This comment is not accurate anymore, because router, which is http-specific module, also uses ExtOriginRequestSuccess. You can just rephrase it to something like:
EXT_ORIGIN_REQUEST_SUCCESS is used to report that transaction with a server was completed successfully.
Signed-off-by: wangbaiping(wbpcode) <wbphub@gmail.com>
|
/retest |
cpakulski
left a comment
There was a problem hiding this comment.
It looks good now. Thanks!
@mattklein123 passing it on to you. Thank you.
There was a problem hiding this comment.
Pull Request Overview
This PR replaces all direct calls to putHttpResponseCode with the unified putResult API across production and test code, and removes the deprecated interface method.
- Removed
putHttpResponseCodein favor ofputResultin mocks and null implementations - Updated outlier detection implementation and interface to deprecate the old API
- Refactored tests to use
putResultwith the appropriateResultenum and optional code
Reviewed Changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| test/mocks/upstream/host.h | Removed deprecated MOCK_METHOD(putHttpResponseCode) |
| test/extensions/clusters/logical_dns/logical_dns_cluster_test.cc | Switched test calls from putHttpResponseCode to putResult |
| test/common/upstream/upstream_impl_test.cc | Updated outlier detector tests to use putResult |
| test/common/upstream/outlier_detection_impl_test.cc | Updated helper loadRq and assertions to use putResult |
| test/common/router/router_upstream_log_test.cc | Replaced putHttpResponseCode expectations |
| test/common/router/router_upstream_filter_test.cc | Replaced putHttpResponseCode expectation |
| test/common/router/router_test_base.cc | Updated call to outlier detector in base tests |
| test/common/router/router_test.cc | Swapped putHttpResponseCode for putResult in multiple tests |
| test/common/router/router_2_test.cc | Updated suppressed-header test to use putResult |
| source/common/upstream/upstream_impl.h | Removed null impl of putHttpResponseCode |
| source/common/upstream/outlier_detection_impl.h | Moved and declared putHttpResponseCode alongside putResult |
| source/common/upstream/outlier_detection_impl.cc | Adjusted putResultWithLocalExternalSplit to call putHttpResponseCode internally |
| source/common/router/router.cc | Changed filter logic to call putResult instead of putHttpResponseCode |
| envoy/upstream/outlier_detection.h | Removed pure-virtual putHttpResponseCode from the interface |
Per offline communication, ggreenway have passed this to ther maintainers.
Commit Message: upstream: unify all outlier report to use putResult
Additional Description:
To use
putResultfor all outlier detection event reporting. This is related to the #34154. Ideally, the putResult will be used as the unique event reporting entrypoint of outlier detection and theResultenum will be used as the input of the custom outlier monitors.This also fixed a bug where the connection success event will be reported twice.
Risk Level: low.
Testing: n/a
Docs Changes: n/a.
Release Notes: n/a.
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]