37 changes: 34 additions & 3 deletions docs/configuration/cluster_manager/cluster.rst
@@ -138,9 +138,40 @@ dns_refresh_rate_ms

outlier_detection
*(optional, object)* If specified, outlier detection will be enabled for this upstream cluster.
Currently the presence of the empty object enables it and there are no options. See the
:ref:`architecture overview <arch_overview_outlier_detection>` for more information on outlier
detection.
See the :ref:`architecture overview <arch_overview_outlier_detection>` for more information on outlier
detection. The following configuration values are supported:

.. _config_cluster_manager_cluster_outlier_detection_consecutive_5xx:

consecutive_5xx
The number of consecutive 5xx responses before a consecutive 5xx ejection occurs. Defaults to 5.

.. _config_cluster_manager_cluster_outlier_detection_interval_ms:

interval_ms
The time interval between ejection analysis sweeps. This can result in both new ejections as well
as hosts being returned to service. Defaults to 10000ms or 10s.

.. _config_cluster_manager_cluster_outlier_detection_base_ejection_time_ms:

base_ejection_time_ms
The base time that a host is ejected for. The real time is equal to the base time multiplied by
the number of times the host has been ejected. Defaults to 30000ms or 30s.

.. _config_cluster_manager_cluster_outlier_detection_max_ejection_percent:

max_ejection_percent
The maximum % of an upstream cluster that can be ejected due to outlier detection. Defaults to 10%.

.. _config_cluster_manager_cluster_outlier_detection_enforcing:

enforcing
The % chance that a host will actually be ejected when an outlier status is detected. This setting
can be used to disable ejection or to ramp it up slowly. Defaults to 100.

Each of the above configuration values can be overridden via
:ref:`runtime values <config_cluster_manager_cluster_runtime_outlier_detection>`.
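
To make the interaction between base_ejection_time_ms and max_ejection_percent concrete, here is a minimal sketch of the arithmetic described above. The helper names `ejectionTimeMs` and `canEject` are hypothetical and exist only for illustration; they are not functions in this change. The guard for an empty cluster is also an assumption added for the sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: the ejection duration is the base time multiplied by
// the number of times the host has been ejected.
uint64_t ejectionTimeMs(uint64_t base_ejection_time_ms, uint64_t num_ejections) {
  return base_ejection_time_ms * num_ejections;
}

// Hypothetical helper: a new ejection is allowed only while the percentage of
// currently ejected hosts is strictly below max_ejection_percent (capped at 100).
bool canEject(uint64_t ejected_hosts, uint64_t total_hosts, uint64_t max_ejection_percent) {
  if (total_hosts == 0) {
    return false; // Assumption for this sketch: nothing to eject in an empty cluster.
  }
  const uint64_t capped = std::min<uint64_t>(100, max_ejection_percent);
  const double ejected_percent = 100.0 * ejected_hosts / total_hosts;
  return ejected_percent < static_cast<double>(capped);
}
```

With the defaults (30000ms base time, 10% maximum), a host ejected for the third time would stay out for 90000ms, and in a 10-host cluster a second ejection would be refused while one host is already ejected.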


.. toctree::
:hidden:
27 changes: 17 additions & 10 deletions docs/configuration/cluster_manager/cluster_runtime.rst
@@ -29,26 +29,33 @@ Outlier detection
-----------------

See the outlier detection :ref:`architecture overview <arch_overview_outlier_detection>` for more
information on outlier detection.
information on outlier detection. The runtime parameters supported by outlier detection are the
same as the :ref:`static configuration parameters <config_cluster_manager_cluster_outlier_detection>`, namely:

outlier_detection.consecutive_5xx
The number of consecutive 5xx responses before a consecutive 5xx ejection occurs. Defaults to 5.
:ref:`consecutive_5xx
<config_cluster_manager_cluster_outlier_detection_consecutive_5xx>`
setting in outlier detection

outlier_detection.interval_ms
The time interval between ejection analysis sweeps. This can result in both new ejections as well
as hosts being returned to service. Defaults to 10000ms or 10s.
:ref:`interval_ms
<config_cluster_manager_cluster_outlier_detection_interval_ms>`
setting in outlier detection

outlier_detection.base_ejection_time_ms
The base time that a host is ejected for. The real time is equal to the base time multiplied by
the number of times the host has been ejected. Defaults to 30000ms or 30s.
:ref:`base_ejection_time_ms
<config_cluster_manager_cluster_outlier_detection_base_ejection_time_ms>`
setting in outlier detection

outlier_detection.max_ejection_percent
The maximum % of an upstream cluster that can be ejected due to outlier detection. Defaults to
10%.
:ref:`max_ejection_percent
<config_cluster_manager_cluster_outlier_detection_max_ejection_percent>`
setting in outlier detection

outlier_detection.enforcing
The % chance that a host will be actually ejected when an outlier status is detected. This setting
can be used to disable ejection or to ramp it up slowly. Defaults to 100.
:ref:`enforcing
<config_cluster_manager_cluster_outlier_detection_enforcing>`
setting in outlier detection
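
The relationship between runtime values and static configuration can be sketched as a lookup with a fallback: the runtime value is used when one is set, otherwise the statically configured value applies. The `FakeSnapshot` class and `effectiveIntervalMs` function below are hypothetical stand-ins written for this example and are not the real runtime API; only the key name `outlier_detection.interval_ms` comes from the documentation above.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for a runtime snapshot: a plain key/value map of overrides.
class FakeSnapshot {
public:
  void set(const std::string& key, uint64_t value) { values_[key] = value; }

  // Returns the runtime override if one exists, otherwise the supplied default.
  uint64_t getInteger(const std::string& key, uint64_t default_value) const {
    const auto it = values_.find(key);
    return it == values_.end() ? default_value : it->second;
  }

private:
  std::map<std::string, uint64_t> values_;
};

// Effective sweep interval: the runtime value wins, the static configuration
// value is passed in as the fallback.
uint64_t effectiveIntervalMs(const FakeSnapshot& snapshot, uint64_t configured_interval_ms) {
  return snapshot.getInteger("outlier_detection.interval_ms", configured_interval_ms);
}
```

Passing the configured value as the default keeps a single lookup path, so a runtime override can be added or removed without touching the static configuration.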

Core
----
8 changes: 4 additions & 4 deletions docs/intro/arch_overview/outlier.rst
@@ -20,14 +20,14 @@ ejection algorithm works as follows:

#. A host is determined to be an outlier.
#. Envoy checks to make sure the number of ejected hosts is below the allowed threshold (specified
via the :ref:`outlier_detection.max_ejection_percent
<config_cluster_manager_cluster_runtime_outlier_detection>` runtime value).
via the :ref:`outlier_detection.max_ejection_percent
<config_cluster_manager_cluster_outlier_detection>` setting).
If the number of ejected hosts is above the threshold the host is not ejected.
#. The host is ejected for some number of milliseconds. Ejection means that the host is marked
unhealthy and will not be used during load balancing unless the load balancer is in a
:ref:`panic <arch_overview_load_balancing_panic_threshold>` scenario. The number of milliseconds
is equal to the :ref:`outlier_detection.base_ejection_time_ms
<config_cluster_manager_cluster_runtime_outlier_detection>` runtime value
<config_cluster_manager_cluster_outlier_detection>` value
multiplied by the number of times the host has been ejected. This causes hosts to get ejected
for longer and longer periods if they continue to fail.
#. An ejected host will automatically be brought back into service after the ejection time has
@@ -46,7 +46,7 @@ If an upstream host returns some number of consecutive 5xx, it will be ejected.
case a 5xx means an actual 5xx response code, or an event that would cause the HTTP router to return
one on the upstream's behalf (reset, connection failure, etc.). The number of consecutive 5xx
required for ejection is controlled by the :ref:`outlier_detection.consecutive_5xx
<config_cluster_manager_cluster_runtime_outlier_detection>` runtime value.
<config_cluster_manager_cluster_outlier_detection>` value.
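
The consecutive 5xx rule can be illustrated with a small per-host counter. This is a sketch under stated assumptions, not the detector's actual code: the class name `Consecutive5xxTracker` is hypothetical, and treating codes 500 through 599 as 5xx is an assumption made for the example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-host tracker for the consecutive 5xx rule.
class Consecutive5xxTracker {
public:
  explicit Consecutive5xxTracker(uint64_t threshold) : threshold_(threshold) {}

  // Records a response code. Returns true exactly when the run of consecutive
  // 5xx responses reaches the threshold; any non-5xx response resets the run.
  bool onResponse(uint64_t response_code) {
    if (response_code >= 500 && response_code < 600) { // Assumed 5xx range.
      return ++consecutive_5xx_ == threshold_;
    }
    consecutive_5xx_ = 0;
    return false;
  }

private:
  const uint64_t threshold_;
  uint64_t consecutive_5xx_{0};
};
```

Because the comparison is an equality check, the tracker signals only once per run: further 5xx responses after the threshold do not signal again until a non-5xx response resets the counter.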

Ejection event logging
----------------------
32 changes: 31 additions & 1 deletion source/common/json/config_schemas.cc
@@ -1061,7 +1061,37 @@ const std::string Json::Schema::CLUSTER_SCHEMA(R"EOF(
"minimum" : 0,
"exclusiveMinimum" : true
},
"outlier_detection" : {"type" : "object"}
"outlier_detection" : {
"type" : "object",
"properties" : {
"consecutive_5xx" : {
"type" : "integer",
"minimum" : 0,
"exclusiveMinimum" : true
},
"interval_ms" : {
"type" : "integer",
"minimum" : 0,
"exclusiveMinimum" : true
},
"base_ejection_time_ms" : {
"type" : "integer",
"minimum" : 0,
"exclusiveMinimum" : true
},
"max_ejection_percent" : {
"type" : "integer",
"minimum" : 0,
"maximum" : 100
},
"enforcing" : {
"type" : "integer",
"minimum" : 0,
"maximum" : 100
}
},
"additionalProperties" : false
}
},
"required" : ["name", "type", "connect_timeout_ms", "lb_type"],
"additionalProperties" : false
49 changes: 29 additions & 20 deletions source/common/upstream/outlier_detection_impl.cc
@@ -14,11 +14,9 @@ DetectorPtr DetectorImplFactory::createForCluster(Cluster& cluster,
Event::Dispatcher& dispatcher,
Runtime::Loader& runtime,
EventLoggerPtr event_logger) {
// Right now we don't support any configuration but in order to make the config backwards
// compatible we just look for an empty object.
if (cluster_config.hasObject("outlier_detection")) {
return DetectorImpl::create(cluster, dispatcher, runtime, ProdSystemTimeSource::instance_,
event_logger);
return DetectorImpl::create(cluster, *cluster_config.getObject("outlier_detection"), dispatcher,
runtime, ProdSystemTimeSource::instance_, event_logger);
} else {
return nullptr;
}
@@ -44,18 +42,28 @@ void DetectorHostSinkImpl::putHttpResponseCode(uint64_t response_code) {
}

if (++consecutive_5xx_ ==
detector->runtime().snapshot().getInteger("outlier_detection.consecutive_5xx", 5)) {
detector->runtime().snapshot().getInteger("outlier_detection.consecutive_5xx",
detector->config().consecutive5xx())) {
detector->onConsecutive5xx(host_.lock());
}
} else {
consecutive_5xx_ = 0;
}
}

DetectorImpl::DetectorImpl(const Cluster& cluster, Event::Dispatcher& dispatcher,
Runtime::Loader& runtime, SystemTimeSource& time_source,
EventLoggerPtr event_logger)
: dispatcher_(dispatcher), runtime_(runtime), time_source_(time_source),
DetectorConfig::DetectorConfig(const Json::Object& json_config)
: interval_ms_(static_cast<uint64_t>(json_config.getInteger("interval_ms", 10000))),
base_ejection_time_ms_(
static_cast<uint64_t>(json_config.getInteger("base_ejection_time_ms", 30000))),
consecutive_5xx_(static_cast<uint64_t>(json_config.getInteger("consecutive_5xx", 5))),
max_ejection_percent_(
static_cast<uint64_t>(json_config.getInteger("max_ejection_percent", 10))),
enforcing_(static_cast<uint64_t>(json_config.getInteger("enforcing", 100))) {}

DetectorImpl::DetectorImpl(const Cluster& cluster, const Json::Object& json_config,
Event::Dispatcher& dispatcher, Runtime::Loader& runtime,
SystemTimeSource& time_source, EventLoggerPtr event_logger)
: config_(json_config), dispatcher_(dispatcher), runtime_(runtime), time_source_(time_source),
stats_(generateStats(cluster.info()->statsScope())),
interval_timer_(dispatcher.createTimer([this]() -> void { onIntervalTimer(); })),
event_logger_(event_logger) {}
@@ -69,13 +77,12 @@ DetectorImpl::~DetectorImpl() {
}
}

std::shared_ptr<DetectorImpl> DetectorImpl::create(const Cluster& cluster,
Event::Dispatcher& dispatcher,
Runtime::Loader& runtime,
SystemTimeSource& time_source,
EventLoggerPtr event_logger) {
std::shared_ptr<DetectorImpl>
DetectorImpl::create(const Cluster& cluster, const Json::Object& json_config,
Event::Dispatcher& dispatcher, Runtime::Loader& runtime,
SystemTimeSource& time_source, EventLoggerPtr event_logger) {
std::shared_ptr<DetectorImpl> detector(
new DetectorImpl(cluster, dispatcher, runtime, time_source, event_logger));
new DetectorImpl(cluster, json_config, dispatcher, runtime, time_source, event_logger));
detector->initialize(cluster);
return detector;
}
@@ -114,16 +121,17 @@ void DetectorImpl::addHostSink(HostPtr host) {

void DetectorImpl::armIntervalTimer() {
interval_timer_->enableTimer(std::chrono::milliseconds(
runtime_.snapshot().getInteger("outlier_detection.interval_ms", 10000)));
runtime_.snapshot().getInteger("outlier_detection.interval_ms", config_.intervalMs())));
}

void DetectorImpl::checkHostForUneject(HostPtr host, DetectorHostSinkImpl* sink, SystemTime now) {
if (!host->healthFlagGet(Host::HealthFlag::FAILED_OUTLIER_CHECK)) {
return;
}

std::chrono::milliseconds base_eject_time = std::chrono::milliseconds(
runtime_.snapshot().getInteger("outlier_detection.base_ejection_time_ms", 30000));
std::chrono::milliseconds base_eject_time =
std::chrono::milliseconds(runtime_.snapshot().getInteger(
"outlier_detection.base_ejection_time_ms", config_.baseEjectionTimeMs()));
ASSERT(sink->numEjections() > 0)
if ((base_eject_time * sink->numEjections()) <= (now - sink->lastEjectionTime().value())) {
stats_.ejections_active_.dec();
@@ -139,11 +147,12 @@ void DetectorImpl::checkHostForUneject(HostPtr host, DetectorHostSinkImpl* sink,

void DetectorImpl::ejectHost(HostPtr host, EjectionType type) {
uint64_t max_ejection_percent = std::min<uint64_t>(
100, runtime_.snapshot().getInteger("outlier_detection.max_ejection_percent", 10));
100, runtime_.snapshot().getInteger("outlier_detection.max_ejection_percent",
config_.maxEjectionPercent()));
double ejected_percent = 100.0 * stats_.ejections_active_.value() / host_sinks_.size();
if (ejected_percent < max_ejection_percent) {
stats_.ejections_total_.inc();
if (runtime_.snapshot().featureEnabled("outlier_detection.enforcing", 100)) {
if (runtime_.snapshot().featureEnabled("outlier_detection.enforcing", config_.enforcing())) {
stats_.ejections_active_.inc();
host_sinks_[host]->eject(time_source_.currentSystemTime());
runCallbacks(host);
33 changes: 28 additions & 5 deletions source/common/upstream/outlier_detection_impl.h
@@ -84,27 +84,49 @@ struct DetectionStats {
ALL_OUTLIER_DETECTION_STATS(GENERATE_COUNTER_STRUCT, GENERATE_GAUGE_STRUCT)
};

/**
* Configuration for the outlier detection.
*/
class DetectorConfig {
public:
DetectorConfig(const Json::Object& json_config);

uint64_t intervalMs() { return interval_ms_; }
uint64_t baseEjectionTimeMs() { return base_ejection_time_ms_; }
uint64_t consecutive5xx() { return consecutive_5xx_; }
uint64_t maxEjectionPercent() { return max_ejection_percent_; }
uint64_t enforcing() { return enforcing_; }

private:
const uint64_t interval_ms_;
const uint64_t base_ejection_time_ms_;
const uint64_t consecutive_5xx_;
const uint64_t max_ejection_percent_;
const uint64_t enforcing_;
};

/**
* An implementation of an outlier detector. In the future we may support multiple outlier detection
* implementations with different configuration. For now, as we iterate everything is contained
* within this implementation.
*/
class DetectorImpl : public Detector, public std::enable_shared_from_this<DetectorImpl> {
public:
static std::shared_ptr<DetectorImpl> create(const Cluster& cluster, Event::Dispatcher& dispatcher,
Runtime::Loader& runtime,
SystemTimeSource& time_source,
EventLoggerPtr event_logger);
static std::shared_ptr<DetectorImpl>
create(const Cluster& cluster, const Json::Object& json_config, Event::Dispatcher& dispatcher,
Runtime::Loader& runtime, SystemTimeSource& time_source, EventLoggerPtr event_logger);
~DetectorImpl();

void onConsecutive5xx(HostPtr host);
Runtime::Loader& runtime() { return runtime_; }
DetectorConfig& config() { return config_; }

// Upstream::Outlier::Detector
void addChangedStateCb(ChangeStateCb cb) override { callbacks_.push_back(cb); }

private:
DetectorImpl(const Cluster& cluster, Event::Dispatcher& dispatcher, Runtime::Loader& runtime,
DetectorImpl(const Cluster& cluster, const Json::Object& json_config,
Event::Dispatcher& dispatcher, Runtime::Loader& runtime,
SystemTimeSource& time_source, EventLoggerPtr event_logger);

void addHostSink(HostPtr host);
@@ -117,6 +139,7 @@ class DetectorImpl : public Detector, public std::enable_shared_from_this<Detect
void onIntervalTimer();
void runCallbacks(HostPtr host);

DetectorConfig config_;
Event::Dispatcher& dispatcher_;
Runtime::Loader& runtime_;
SystemTimeSource& time_source_;