Skip to content

upstream: gradually decrease outlier detector's ejection time multiplier when host stays healthy#14235

Merged
snowp merged 8 commits intoenvoyproxy:masterfrom
cpakulski:issue/6016
Dec 16, 2020
Merged

upstream: gradually decrease outlier detector's ejection time multiplier when host stays healthy#14235
snowp merged 8 commits intoenvoyproxy:masterfrom
cpakulski:issue/6016

Conversation

@cpakulski
Copy link
Copy Markdown
Contributor

Commit Message:
gradually decrease outlier detector's ejection time multiplier when node stays healthy

Additional Description:
Each time a host is ejected by outlier detector, the ejection time is increased. That increased ejection time stays with the host and is never decreased. This PR adds logic to gradually decrease the ejection time when node stays healthy.

Risk Level: Low
Testing: Added unit test.
Docs Changes: Updated outlier detector section.
Release Notes: No
Platform Specific Features:
Fixes #6016

when a node stays healthy.

Signed-off-by: Christoph Pakulski <christoph@tetrate.io>
@cpakulski cpakulski marked this pull request as draft December 1, 2020 22:20
@cpakulski
Copy link
Copy Markdown
Contributor Author

/retest

@repokitteh-read-only
Copy link
Copy Markdown

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #14235 (comment) was created by @cpakulski.

see: more, trace.

@cpakulski cpakulski marked this pull request as ready for review December 2, 2020 04:07
@cpakulski cpakulski changed the title upstream: gradually decrease outlier detector's ejection time multiplier upstream: gradually decrease outlier detector's ejection time multiplier when host stays healthy Dec 2, 2020
Copy link
Copy Markdown
Contributor

@snowp snowp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this, a few comments to get you started. I imagine we want this to be guarded by a configuration flag as per the linked issue?

for longer and longer periods if they continue to fail.
multiplied by the number of times the host has been ejected in a row. This causes hosts to get ejected
for longer and longer periods if they continue to fail. When the host becomes healthy, the ejection time
multiplier decreases with time. Eventually, if the host stays healthy for long time,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how long? could need some more detail on how the multiplier decreases

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added details to documentation.

ASSERT(!host_.lock()->healthFlagGet(Host::HealthFlag::FAILED_OUTLIER_CHECK));
host_.lock()->healthFlagSet(Host::HealthFlag::FAILED_OUTLIER_CHECK);
num_ejections_++;
eject_time_backoff_++;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be capped at some max value? During prolonged outages this would likely get really high right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. Thanks for pointing this out. I added max_ejection_time config to cap it.


// Test verifies that ejection time increases each time the node is ejected,
// and decreases when node stays healthy.
#define EJECT_PERIOD_TICK(x) x * 10000
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what this adds, but in general prefer functions over macros for this kinda stuff

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is just a helper. Instead of moving time using 10000, 20000, 30000, ..... 300000, this allows to use 1, 2,3, ... 30.
Converted it to constexpr function.

@snowp snowp added the waiting label Dec 3, 2020
Signed-off-by: Christoph Pakulski <christoph@tetrate.io>
@repokitteh-read-only
Copy link
Copy Markdown

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/envoy/.
API shepherd assignee is @mattklein123
CC @envoyproxy/api-watchers: FYI only for changes made to api/envoy/.

🐱

Caused by: #14235 was synchronize by cpakulski.

see: more, trace.

Signed-off-by: Christoph Pakulski <christoph@tetrate.io>
@cpakulski cpakulski requested a review from snowp December 9, 2020 02:27
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks API LGTM with small comment.

/wait

// See the :ref:`architecture overview <arch_overview_outlier_detection>` for
// more information on outlier detection.
// [#next-free-field: 21]
// [#next-free-field: 22]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a release note. Technically this is a breaking behavior change but I agree it's for the best, but we should call it out carefully in the release note.

Signed-off-by: Christoph Pakulski <christoph@tetrate.io>
@mattklein123
Copy link
Copy Markdown
Member

Needs main merge, thanks.

/wait

Signed-off-by: Christoph Pakulski <christoph@tetrate.io>
Comment thread docs/root/version_history/current.rst Outdated
* kill_request: enable a way to configure kill header name in KillRequest proto.
* memory: enable new tcmalloc with restartable sequences for aarch64 builds.
* mongo proxy metrics: swapped network connection remote and local closed counters previously set reversed (`cx_destroy_local_with_active_rq` and `cx_destroy_remote_with_active_rq`).
* outlier detection: added :ref:`max_ejection_time <envoy_v3_api_field_config.cluster.v3.OutlierDetection.max_ejection_time>` to limit ejection time growth when node stays unhealthy for extended period of time.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make it more clear that the old behavior was unlimited growth and the new default is 5 minutes, etc.? Thank you.

/wait

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expanded the release note.

Signed-off-by: Christoph Pakulski <christoph@tetrate.io>
mattklein123
mattklein123 previously approved these changes Dec 11, 2020
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@snowp snowp added the waiting label Dec 14, 2020
In tests, used chrono::seconds to move simulated time forward.

Signed-off-by: Christoph Pakulski <christoph@tetrate.io>
@cpakulski
Copy link
Copy Markdown
Contributor Author

/retest

@repokitteh-read-only
Copy link
Copy Markdown

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #14235 (comment) was created by @cpakulski.

see: more, trace.

Copy link
Copy Markdown
Contributor

@snowp snowp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks this is close to ready, just one comment

};

// Names used in runtime configuration.
const std::string max_ejection_percent_runtime = "outlier_detection.max_ejection_percent";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use constexpr absl::string_view here and make all the names use ProperCase to reflect that they're string constants

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Thanks.

@snowp snowp added the waiting label Dec 16, 2020
…iew.

Signed-off-by: Christoph Pakulski <christoph@tetrate.io>
@cpakulski
Copy link
Copy Markdown
Contributor Author

/retest

@repokitteh-read-only
Copy link
Copy Markdown

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #14235 (comment) was created by @cpakulski.

see: more, trace.

@cpakulski cpakulski requested a review from snowp December 16, 2020 21:31
Copy link
Copy Markdown
Contributor

@snowp snowp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@snowp snowp merged commit 8b1569a into envoyproxy:master Dec 16, 2020
@cpakulski cpakulski deleted the issue/6016 branch December 16, 2020 22:41
@hzxuzhonghu
Copy link
Copy Markdown
Contributor

max_ejection_time with default 30s may break compatibility

FYI istio/istio#30181

@ramaraochavali
Copy link
Copy Markdown
Contributor

@cpakulski Why is the default value enforced for "max_ejection_time" instead of retaining the old behaviour if "max_ejection_time" is not provided. I understand the intent of adding the max_ejection_time and it is a good change. However it breaks people (see istio/istio#30181) if base_ejection_time is set above 5m (the default value of max_ejection_time)

How about

  • Enforcing max_ejection_time only when it is explicitly provided or base_ejection_time < 5m (the default max_ejection_time if not provided) otherwise leave the current behaviour as is.

WDYT?
cc: @mattklein123 for your thoughts as well?

@cpakulski
Copy link
Copy Markdown
Contributor Author

@ramaraochavali What is exactly the problem? Are the nodes ejected prematurely?

@ramaraochavali
Copy link
Copy Markdown
Contributor

ramaraochavali commented Feb 4, 2021

No. the problem is config is rejected by Envoy if there is an existing outlier detection setting with base_ejection_time > 5 m. Istio consumers can create such configuration and when they upgrade their config will get rejected.

@cpakulski
Copy link
Copy Markdown
Contributor Author

Let me check, but it should not be rejected. The max ejection time should be capped at base_ejection_time and never increase (if base_ejection_time > max_ejection_time) or max_ejection_time is not specified.

@ramaraochavali
Copy link
Copy Markdown
Contributor

@cpakulski More details, This is the error we are getting "outlier detector's max_ejection_time cannot be smaller than base_ejection_time" when it is rejected.

This validation is failing

if (detector->config().maxEjectionTimeMs() < detector->config().baseEjectionTimeMs()) {

I think we should skip that validation if max_ejection_time is not set.
Since you get defaulting it here
PROTOBUF_GET_MS_OR_DEFAULT(config, max_ejection_time, DEFAULT_MAX_EJECTION_TIME_MS))) {}
by the time it gets there it will always have value. Probably you should capture if it is set or not in a boolean and skip this

@cpakulski
Copy link
Copy Markdown
Contributor Author

I am working on the fix and would like to keep the logic fairly clean. First I will add unit test which fails and then correct it.

@cpakulski
Copy link
Copy Markdown
Contributor Author

The issue is fixed in #14962.

@ramaraochavali
Copy link
Copy Markdown
Contributor

@cpakulski Thank you so much.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[feature request]OutlierDetection: reset or cap number of times host has been ejected.

5 participants