upstream: gradually decrease outlier detector's ejection time multiplier when host stays healthy #14235
when a node stays healthy. Signed-off-by: Christoph Pakulski <christoph@tetrate.io>
snowp
left a comment
Thanks for working on this, a few comments to get you started. I imagine we want this to be guarded by a configuration flag as per the linked issue?
| for longer and longer periods if they continue to fail. | ||
| multiplied by the number of times the host has been ejected in a row. This causes hosts to get ejected | ||
| for longer and longer periods if they continue to fail. When the host becomes healthy, the ejection time | ||
| multiplier decreases with time. Eventually, if the host stays healthy for long time, |
How long? This could use more detail on how the multiplier decreases.
Added details to documentation.
| ASSERT(!host_.lock()->healthFlagGet(Host::HealthFlag::FAILED_OUTLIER_CHECK)); | ||
| host_.lock()->healthFlagSet(Host::HealthFlag::FAILED_OUTLIER_CHECK); | ||
| num_ejections_++; | ||
| eject_time_backoff_++; |
Should this be capped at some max value? During prolonged outages this would likely get really high, right?
Correct. Thanks for pointing this out. I added max_ejection_time config to cap it.
| // Test verifies that ejection time increases each time the node is ejected, | ||
| // and decreases when node stays healthy. | ||
| #define EJECT_PERIOD_TICK(x) x * 10000 |
Not sure what this adds, but in general prefer functions over macros for this kinda stuff
It is just a helper. Instead of moving time using 10000, 20000, 30000, ... 300000, it allows using 1, 2, 3, ... 30.
Converted it to a constexpr function.
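For readers following along, a minimal sketch of what such a constexpr helper could look like. The name `ejectPeriodTicks` and the return type are assumptions for illustration; the function actually committed in the PR is not shown in this thread.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper (name and signature are assumptions, not the PR's code).
// One tick stands for 10000 ms of simulated time, the same factor the
// EJECT_PERIOD_TICK macro used.
constexpr std::chrono::milliseconds ejectPeriodTicks(uint32_t ticks) {
  return std::chrono::milliseconds(static_cast<int64_t>(ticks) * 10000);
}

// Evaluated at compile time, like the macro, but type-checked.
static_assert(ejectPeriodTicks(3).count() == 30000, "3 ticks should be 30000 ms");
```

One concrete reason to prefer the function: the macro body `x * 10000` is not parenthesized, so `EJECT_PERIOD_TICK(1 + 2)` expands to `1 + 2 * 10000`, which is 20001 rather than 30000. The function evaluates its argument first and has no such pitfall.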
Signed-off-by: Christoph Pakulski <christoph@tetrate.io>
|
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to |
Signed-off-by: Christoph Pakulski <christoph@tetrate.io>
mattklein123
left a comment
Thanks, API LGTM with a small comment.
/wait
| // See the :ref:`architecture overview <arch_overview_outlier_detection>` for | ||
| // more information on outlier detection. | ||
| // [#next-free-field: 21] | ||
| // [#next-free-field: 22] |
This needs a release note. Technically this is a breaking behavior change. I agree it's for the best, but we should call it out carefully in the release note.
Signed-off-by: Christoph Pakulski <christoph@tetrate.io>
|
Needs main merge, thanks. /wait |
Signed-off-by: Christoph Pakulski <christoph@tetrate.io>
| * kill_request: enable a way to configure kill header name in KillRequest proto. | ||
| * memory: enable new tcmalloc with restartable sequences for aarch64 builds. | ||
| * mongo proxy metrics: swapped network connection remote and local closed counters previously set reversed (`cx_destroy_local_with_active_rq` and `cx_destroy_remote_with_active_rq`). | ||
| * outlier detection: added :ref:`max_ejection_time <envoy_v3_api_field_config.cluster.v3.OutlierDetection.max_ejection_time>` to limit ejection time growth when node stays unhealthy for extended period of time. |
Can you make it more clear that the old behavior was unlimited growth and the new default is 5 minutes, etc.? Thank you.
/wait
Expanded the release note.
Signed-off-by: Christoph Pakulski <christoph@tetrate.io>
In tests, used chrono::seconds to move simulated time forward. Signed-off-by: Christoph Pakulski <christoph@tetrate.io>
snowp
left a comment
Thanks, this is close to ready, just one comment.
| }; | ||
| // Names used in runtime configuration. | ||
| const std::string max_ejection_percent_runtime = "outlier_detection.max_ejection_percent"; |
I would use constexpr absl::string_view here and make all the names use ProperCase to reflect that they're string constants
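A minimal sketch of the suggested form, for illustration only. It uses `std::string_view` (C++17) as a stand-in for `absl::string_view` so that it compiles without Abseil, and the constant name is an assumption; only the key string is taken from the diff line above.

```cpp
#include <string_view>

// std::string_view stands in for absl::string_view here (assumption, so the
// sketch builds without Abseil). The ProperCase name is illustrative.
constexpr std::string_view MaxEjectionPercentRuntime =
    "outlier_detection.max_ejection_percent";

// A constexpr view is usable in constant expressions, unlike const std::string.
static_assert(!MaxEjectionPercentRuntime.empty(), "runtime key must not be empty");
```

Compared with `const std::string`, a constexpr view needs no constructor call at static-initialization time and no heap allocation; it only refers to the string literal.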
…iew. Signed-off-by: Christoph Pakulski <christoph@tetrate.io>
|
max_ejection_time with default 30s may break compatibility |
|
@cpakulski Why is the default value enforced for "max_ejection_time" instead of retaining the old behaviour when "max_ejection_time" is not provided? I understand the intent of adding max_ejection_time, and it is a good change. However, it breaks people (see istio/istio#30181) if base_ejection_time is set above 5m (the default value of max_ejection_time). How about
WDYT? |
|
@ramaraochavali What exactly is the problem? Are the nodes ejected prematurely? |
|
No. The problem is that the config is rejected by Envoy if there is an existing outlier detection setting with base_ejection_time > 5m. Istio consumers can create such a configuration, and when they upgrade, their config will get rejected. |
|
Let me check, but it should not be rejected. If base_ejection_time > max_ejection_time, or max_ejection_time is not specified, the ejection time should be capped at base_ejection_time and never increase. |
|
@cpakulski More details: this is the error we get when the config is rejected: "outlier detector's max_ejection_time cannot be smaller than base_ejection_time". This validation is failing. I think we should skip that validation if max_ejection_time is not set. Since you are defaulting it here, it will always have a value by the time it gets there. You should probably capture whether it is set in a boolean and skip this check when it is not. |
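To make the suggestion concrete, here is a rough sketch of validating only when max_ejection_time was explicitly set. Everything in it is hypothetical: the struct, the function name, and the fallback used when the field is unset are assumptions for illustration, and it is not necessarily how the eventual fix was implemented. Only the error text and the 5 minute default come from this thread.

```cpp
#include <algorithm>
#include <chrono>
#include <optional>
#include <stdexcept>

// Hypothetical config holder; field names mirror the options discussed here.
struct OutlierConfig {
  std::chrono::milliseconds base_ejection_time;
  std::optional<std::chrono::milliseconds> max_ejection_time; // empty when unset
};

// Default cap of 5 minutes, as stated in the thread.
constexpr std::chrono::milliseconds DefaultMaxEjectionTime{300000};

// Returns the cap to apply. The base/max validation runs only when the user
// explicitly set max_ejection_time.
std::chrono::milliseconds effectiveMaxEjectionTime(const OutlierConfig& config) {
  if (config.max_ejection_time.has_value()) {
    if (*config.max_ejection_time < config.base_ejection_time) {
      throw std::invalid_argument(
          "outlier detector's max_ejection_time cannot be smaller than base_ejection_time");
    }
    return *config.max_ejection_time;
  }
  // ASSUMPTION: when unset, never cap below base_ejection_time, so existing
  // configs with a large base value keep loading.
  return std::max(DefaultMaxEjectionTime, config.base_ejection_time);
}
```

Using `std::optional` instead of a separate boolean keeps the "was it set" information attached to the value, so the default cannot silently overwrite it before validation runs.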
|
I am working on the fix and would like to keep the logic fairly clean. First I will add a unit test that fails and then correct it. |
|
The issue is fixed in #14962. |
|
@cpakulski Thank you so much. |
Commit Message:
gradually decrease outlier detector's ejection time multiplier when node stays healthy
Additional Description:
Each time a host is ejected by the outlier detector, the ejection time is increased. That increased ejection time stays with the host and is never decreased. This PR adds logic to gradually decrease the ejection time when the node stays healthy.
Risk Level: Low
Testing: Added unit test.
Docs Changes: Updated outlier detector section.
Release Notes: No
Platform Specific Features:
Fixes #6016
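The behavior described in this PR (ejection time grows with consecutive ejections, is capped by max_ejection_time, and shrinks while the host stays healthy) can be sketched roughly as follows. This is a simplified model, not Envoy's implementation: the class and method names are hypothetical, and two details are assumptions labeled in the comments, namely that the multiplier drops by one per healthy check and that it stops growing once the cap is reached.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>

// Hypothetical model of one host's ejection backoff (not Envoy's actual class).
class EjectionBackoff {
public:
  EjectionBackoff(std::chrono::milliseconds base, std::chrono::milliseconds max)
      : base_(base), max_(max) {}

  // Called when the host is ejected. Returns how long it stays ejected.
  std::chrono::milliseconds onEject() {
    // ASSUMPTION: stop growing the multiplier once the cap is reached, so a
    // long outage does not leave a huge value that takes forever to decay.
    if (base_ * multiplier_ < max_) {
      ++multiplier_;
    }
    const std::chrono::milliseconds uncapped = base_ * multiplier_;
    return std::min(uncapped, max_);
  }

  // Called on each periodic check during which the host was healthy.
  // ASSUMPTION: the multiplier decreases by one per healthy check.
  void onHealthyInterval() {
    if (multiplier_ > 0) {
      --multiplier_;
    }
  }

  uint32_t multiplier() const { return multiplier_; }

private:
  const std::chrono::milliseconds base_;
  const std::chrono::milliseconds max_;
  uint32_t multiplier_{0};
};
```

With a base of 30 seconds and a cap of 5 minutes, consecutive ejections in this model last 30s, 60s, 90s and so on until they level off at 300s, and each healthy check afterwards steps the multiplier back down toward zero.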