Skip to content

upstream: unify all outlier report to use putResult#39440

Merged
wbpcode merged 8 commits intoenvoyproxy:mainfrom
wbpcode:dev-unify-outlier
May 29, 2025
Merged

upstream: unify all outlier report to use putResult#39440
wbpcode merged 8 commits intoenvoyproxy:mainfrom
wbpcode:dev-unify-outlier

Conversation

@wbpcode
Copy link
Copy Markdown
Member

@wbpcode wbpcode commented May 12, 2025

Commit Message: upstream: unify all outlier report to use putResult
Additional Description:

To use putResult for all outlier detection event reporting. This is related to the #34154. Ideally, the putResult will be used as the unique event reporting entrypoint of outlier detection and the Result enum will be used as the input of the custom outlier monitors.

This also fixed a bug where the connection success event will be reported twice.

Risk Level: low.
Testing: n/a
Docs Changes: n/a.
Release Notes: n/a.
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]

Signed-off-by: wangbaiping(wbpcode) <wbphub@gmail.com>
Comment on lines +168 to +169
putHttpResponseCode(enumToInt(Http::Code::OK));
localOriginNoFailure();
putHttpResponseCode(code.has_value() ? code.value() : enumToInt(Http::Code::OK));
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note, the onPoolReady will call the putResult(Result::LocalOriginConnectSuccess), so, the localOriginNoFailure() should be removed to avoid repeated connect event.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you describe how does the repeated connect event happen? Does it mean that unit tests do not cover that case?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the previous implementation, the double/thrift proxy will call the putResult(Result::LocalOriginConnectSuccess) when pool is ready (onPoolReady). And when the request is successed, the putResult(Result::ExtOriginRequestSuccess) will be called.

If the split_external_local_origin_errors is set to true, then putResult() will actually call the putResultWithLocalExternalSplit. In that case, the localOriginNoFailure() will actually be called twice for same request. (one from putResult(Result::LocalOriginConnectSuccess) and another one from putResult(Result::ExtOriginRequestSuccess))

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to use the extension to replace the old behavior completely, then to cover that requirement, the consecutive errors extension also need to provide three different modes: LocalOrigin, ExternalOrigin, Composite/Default?

Copy link
Copy Markdown
Contributor

@cpakulski cpakulski May 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining why do you experience repeated event.

In the new extensions, a user may define 2 extensions and send LocalOrigin to one extension and ExternalOrigin/HttpCode to the second extension. In this model those two types of events are counter separately.
But a user may also define only one extension and send both LocalOrigin and ExternalOrigin/HttpCode to the same extension. In this model a user says that it does not really care about types of errors: if a server returns 5xx or connection to a server fails, it should be counted as error.

So in other words an extension does not need to understand Result enum (LocalOrigin, ExternalOrigin, etc) only whether the module reports success or error (boolean).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the Result/event won't disappear? The router need to map the Result/event to specific extension or multiple extensions + bool? I think this mapping self is more complex?

I actually think the Result emum is very suitable choice which actually only contains four statuses: l4 success, l4 failure, l7 success, l7 failure and could cover most protocols outlier detections?

I think that Result may stay because it is used by L4 to inform the router (or Redis, etc) that there was a problem with the connection. It also uses several values like timeout, upstream reset, etc. But the outlier monitor extensions do not really need to know the details. Extensions only care if it was a success or failure. So there must be a mapping somewhere from Result to boolean.

Anyways, we can continue this discussion on my other PR.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another comment similar to one above for ExtOriginRequestFailed. There is an underlying assumption that all non-http modules map ExtOriginRequestSuccess to HTTP code 200. So, if you specify code other than 200, consecutive5xx, failure frequency, success rate monitors may stop working properly, albeit most code simply assumes than anything different than 5xx is success.

Copy link
Copy Markdown
Member Author

@wbpcode wbpcode May 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that Result may stay because it is used by L4 to inform the router (or Redis, etc) that there was a problem with the connection. It also uses several values like timeout, upstream reset, etc. But the outlier monitor extensions do not really need to know the details. Extensions only care if it was a success or failure. So there must be a mapping somewhere from Result to boolean.
Anyways, we can continue this discussion on my other PR.

I think this is a problem about where the complexity should be. If the extension monitor only handle single bool, that means if we want to support the current feature about handle L4/L7 errors separately, then we need to configure mapping rules to map specific events to specific monitor at the router configuration, and at the upstream configuration, we also need to configure multiple different monitors to handle different errors.

Note, the router's configuration is works for whole HCM, that means it may shared by multiple upstreams/clusters, then, how we can handle the case where one cluster want to handle L4/L7 errors separately and another one doesn't? So, I will strong inclined let the monitors to handle both L4/L7 errors.
Actually, timeout/reset, etc are all handled as same status (L4 failure) for now.

Copy link
Copy Markdown
Member Author

@wbpcode wbpcode May 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another comment similar to one above for ExtOriginRequestFailed. There is an underlying assumption that all non-http modules map ExtOriginRequestSuccess to HTTP code 200. So, if you specify code other than 200, consecutive5xx, failure frequency, success rate monitors may stop working properly, albeit most code simply assumes than anything different than 5xx is success.

Only HTTP modules will set the HTTP code, so, I think others won't be effected by this PR?

Copy link
Copy Markdown
Contributor

@cpakulski cpakulski May 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only HTTP modules will set the HTTP code, so, I think others won't be effected by this PR?

HTTP modules so far used putHttpResponseCode directly, but now they do it via ExtOriginRequestSuccess/ExtOriginRequestFailure. I think you are right and it should be fine. Please update the comment https://github.com/envoyproxy/envoy/blob/main/envoy/upstream/outlier_detection.h#L25-L27

Note, the router's configuration is works for whole HCM, that means it may shared by multiple upstreams/clusters, then, how we can handle the case where one cluster want to handle L4/L7 errors separately and another one doesn't? So, I will strong inclined let the monitors to handle both L4/L7 errors.
Actually, timeout/reset, etc are all handled as same status (L4 failure) for now.

Correct. This case is handled even with bools. Let's continue this discussion in the other PR, so we do not loose that valuable feedback. But in general, under router there is a matcher and a list of destinations where you want to send the result of matching:

 match: 
       http_code:
           start: 500
           end: 599
      send_to:
       - monitor in cluster1
       - monitor in cluster3

So, if you do not want to send something to say cluster2, you do not configure it under matcher.

@wbpcode
Copy link
Copy Markdown
Member Author

wbpcode commented May 16, 2025

/retest

@wbpcode
Copy link
Copy Markdown
Member Author

wbpcode commented May 16, 2025

/retest

@wbpcode
Copy link
Copy Markdown
Member Author

wbpcode commented May 21, 2025

/retest

Copy link
Copy Markdown
Member

@ggreenway ggreenway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've never looked closely at the outlier detector before. Can you explain to a newbie the behavior change being made here, why it is correct, and how that reflects in the changed test expectations?

/wait

Comment thread source/common/router/router.cc Outdated
Comment thread source/common/router/router.cc Outdated
Comment thread source/common/upstream/outlier_detection_impl.cc Outdated
Comment thread source/common/upstream/outlier_detection_impl.cc Outdated
Comment thread test/common/router/router_test.cc
Copy link
Copy Markdown
Contributor

@cpakulski cpakulski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few comments regarding logic.

Comment thread test/common/router/router_test.cc
case Result::ExtOriginRequestFailed:
// map it to http code and call http handler.
return putHttpResponseCode(enumToInt(Http::Code::ServiceUnavailable));
putHttpResponseCode(code.has_value() ? code.value()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically/logically it is correct to read http code from code, but there is underlying assumption in the outlier detection that all failures generated by non-http modules map those failures to HTTP code 500. If you specify code other than 500, the "canned" outlier detection modules like consecutive 5xx, failure frequency and success rate will stop to function properly. My preference is to leave putHttpResponseCode(enumToInt(Http::Code::ServiceUnavailable)); as it clearly states that such mapping takes place.

Copy link
Copy Markdown
Member Author

@wbpcode wbpcode May 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is necessary because HTTP use the code rather than the status. And only HTTP modules will set the HTTP code, so, I think others won't be effected by this PR?

We may migrate features to extension monitor in the future gradually and use the Result only as input, but for now, we still need this behavior

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. You can leave it as is, but if you call putResult(Result::ExtOriginRequestFailed, 400), you indicate failure but outlier detection monitor like consecutive_5xx will treat is as success. So, in order for outlier detection to work properly, you must always call putResult(Result::ExtOriginRequestFailed, code) with code 500. If someone in the future uses a different HTTP code, it may lead to lengthly debugging session why things do not work.

Comment on lines +168 to +169
putHttpResponseCode(enumToInt(Http::Code::OK));
localOriginNoFailure();
putHttpResponseCode(code.has_value() ? code.value() : enumToInt(Http::Code::OK));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another comment similar to one above for ExtOriginRequestFailed. There is an underlying assumption that all non-http modules map ExtOriginRequestSuccess to HTTP code 200. So, if you specify code other than 200, consecutive5xx, failure frequency, success rate monitors may stop working properly, albeit most code simply assumes than anything different than 5xx is success.

Co-authored-by: Greg Greenway <ggreenway@apple.com>
Signed-off-by: code <wbphub@gmail.com>
@ggreenway ggreenway removed their assignment May 21, 2025
Signed-off-by: wangbaiping(wbpcode) <wbphub@gmail.com>
@wbpcode
Copy link
Copy Markdown
Member Author

wbpcode commented May 22, 2025

/assign-from @envoyproxy/senior-maintainers re-assign this PR because @ggreenway will OOO for a while :)

@repokitteh-read-only
Copy link
Copy Markdown

@envoyproxy/senior-maintainers assignee is @mattklein123
re-assign assignee is @None
this assignee is @None
PR assignee is @None
because assignee is @None
@ggreenway assignee is @None
will assignee is @None
OOO assignee is @None
for assignee is @None
a assignee is @None
while assignee is @None
:) assignee is @None

🐱

Caused by: a #39440 (comment) was created by @wbpcode.

see: more, trace.

@wbpcode
Copy link
Copy Markdown
Member Author

wbpcode commented May 22, 2025

/retest

@mattklein123
Copy link
Copy Markdown
Member

Will wait for @cpakulski approval then I will look.

/wait-any

Copy link
Copy Markdown
Contributor

@cpakulski cpakulski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks. Please update the indicated code comment.

return putHttpResponseCode(enumToInt(Http::Code::ServiceUnavailable));
putHttpResponseCode(code.value_or(enumToInt(Http::Code::ServiceUnavailable)));
break;
// EXT_ORIGIN_REQUEST_SUCCESS is used to report that transaction with non-http server was
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is not accurate anymore, because router, which is http-specific module, also uses ExtOriginRequestSuccess. You can just rephrase it to something like:
EXT_ORIGIN_REQUEST_SUCCESS is used to report that transaction with a server was completed successfully.

Signed-off-by: wangbaiping(wbpcode) <wbphub@gmail.com>
@wbpcode
Copy link
Copy Markdown
Member Author

wbpcode commented May 27, 2025

/retest

Copy link
Copy Markdown
Contributor

@cpakulski cpakulski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good now. Thanks!

@mattklein123 passing it on to you. Thank you.

@wbpcode wbpcode requested a review from Copilot May 28, 2025 10:07
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR replaces all direct calls to putHttpResponseCode with the unified putResult API across production and test code, and removes the deprecated interface method.

  • Removed putHttpResponseCode in favor of putResult in mocks and null implementations
  • Updated outlier detection implementation and interface to deprecate the old API
  • Refactored tests to use putResult with the appropriate Result enum and optional code

Reviewed Changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
test/mocks/upstream/host.h Removed deprecated MOCK_METHOD(putHttpResponseCode)
test/extensions/clusters/logical_dns/logical_dns_cluster_test.cc Switched test calls from putHttpResponseCode to putResult
test/common/upstream/upstream_impl_test.cc Updated outlier detector tests to use putResult
test/common/upstream/outlier_detection_impl_test.cc Updated helper loadRq and assertions to use putResult
test/common/router/router_upstream_log_test.cc Replaced putHttpResponseCode expectations
test/common/router/router_upstream_filter_test.cc Replaced putHttpResponseCode expectation
test/common/router/router_test_base.cc Updated call to outlier detector in base tests
test/common/router/router_test.cc Swapped putHttpResponseCode for putResult in multiple tests
test/common/router/router_2_test.cc Updated suppressed-header test to use putResult
source/common/upstream/upstream_impl.h Removed null impl of putHttpResponseCode
source/common/upstream/outlier_detection_impl.h Moved and declared putHttpResponseCode alongside putResult
source/common/upstream/outlier_detection_impl.cc Adjusted putResultWithLocalExternalSplit to call putHttpResponseCode internally
source/common/router/router.cc Changed filter logic to call putResult instead of putHttpResponseCode
envoy/upstream/outlier_detection.h Removed pure-virtual putHttpResponseCode from the interface

Comment thread source/common/upstream/outlier_detection_impl.h
@wbpcode wbpcode dismissed ggreenway’s stale review May 29, 2025 12:55

Per offline communication, ggreenway have passed this to ther maintainers.

@wbpcode wbpcode merged commit 0672c68 into envoyproxy:main May 29, 2025
25 checks passed
@wbpcode wbpcode deleted the dev-unify-outlier branch May 29, 2025 13:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants