Merged
30 changes: 26 additions & 4 deletions docs/configuration/cluster_manager/cluster.rst
Expand Up @@ -168,11 +168,33 @@ outlier_detection
max_ejection_percent
The maximum % of an upstream cluster that can be ejected due to outlier detection. Defaults to 10%.

.. _config_cluster_manager_cluster_outlier_detection_enforcing:
.. _config_cluster_manager_cluster_outlier_detection_enforcing_consecutive_5xx:

enforcing
The % chance that a host will be actually ejected when an outlier status is detected. This setting
can be used to disable ejection or to ramp it up slowly. Defaults to 100.
enforcing_consecutive_5xx
The % chance that a host will be actually ejected when an outlier status is detected through
consecutive 5xx. This setting can be used to disable ejection or to ramp it up slowly. Defaults to 100.

.. _config_cluster_manager_cluster_outlier_detection_enforcing_success_rate:

enforcing_success_rate
The % chance that a host will be actually ejected when an outlier status is detected through
success rate statistics. This setting can be used to disable ejection or to ramp it up slowly.
Defaults to 100.

.. _config_cluster_manager_cluster_outlier_detection_success_rate_minimum_hosts:

success_rate_minimum_hosts
The minimum number of hosts in a cluster that must have enough request volume for success rate outlier
detection to run. If fewer hosts than this setting meet the request volume requirement, outlier detection
via success rate statistics is not performed for any host in the cluster. Defaults to 5.

.. _config_cluster_manager_cluster_outlier_detection_success_rate_request_volume:

success_rate_request_volume
The minimum number of total requests that must be collected in one interval
(as defined by :ref:`interval_ms <config_cluster_manager_cluster_outlier_detection_interval_ms>` above)
to include this host in success rate based outlier detection. If the volume is lower than this setting,
outlier detection via success rate statistics is not performed for that host. Defaults to 100.

Each of the above configuration values can be overridden via
:ref:`runtime values <config_cluster_manager_cluster_runtime_outlier_detection>`.
21 changes: 18 additions & 3 deletions docs/configuration/cluster_manager/cluster_runtime.rst
Expand Up @@ -52,9 +52,24 @@ outlier_detection.max_ejection_percent
<config_cluster_manager_cluster_outlier_detection_max_ejection_percent>`
setting in outlier detection

outlier_detection.enforcing
:ref:`enforcing
<config_cluster_manager_cluster_outlier_detection_enforcing>`
outlier_detection.enforcing_consecutive_5xx
:ref:`enforcing_consecutive_5xx
<config_cluster_manager_cluster_outlier_detection_enforcing_consecutive_5xx>`
setting in outlier detection

outlier_detection.enforcing_success_rate
:ref:`enforcing_success_rate
<config_cluster_manager_cluster_outlier_detection_enforcing_success_rate>`
setting in outlier detection

outlier_detection.success_rate_minimum_hosts
:ref:`success_rate_minimum_hosts
<config_cluster_manager_cluster_outlier_detection_success_rate_minimum_hosts>`
setting in outlier detection

outlier_detection.success_rate_request_volume
:ref:`success_rate_request_volume
<config_cluster_manager_cluster_outlier_detection_success_rate_request_volume>`
setting in outlier detection

Core
14 changes: 13 additions & 1 deletion docs/intro/arch_overview/outlier.rst
Expand Up @@ -46,7 +46,19 @@ If an upstream host returns some number of consecutive 5xx, it will be ejected.
case a 5xx means an actual 5xx response code, or an event that would cause the HTTP router to return
one on the upstream's behalf (reset, connection failure, etc.). The number of consecutive 5xx
required for ejection is controlled by the :ref:`outlier_detection.consecutive_5xx
<config_cluster_manager_cluster_outlier_detection>` value.
<config_cluster_manager_cluster_outlier_detection_consecutive_5xx>` value.

Success Rate
^^^^^^^^^^^^

Success rate based outlier ejection aggregates success rate data from every host in a cluster and then, at
given intervals, ejects hosts based on statistical outlier detection. Success rate outlier ejection will not be
calculated for a host if its request volume over the aggregation interval is less than the
:ref:`outlier_detection.success_rate_request_volume<config_cluster_manager_cluster_outlier_detection_success_rate_request_volume>`
value. Moreover, detection will not be performed for a cluster if the number of hosts
with the minimum required request volume in an interval is less than the
:ref:`outlier_detection.success_rate_minimum_hosts<config_cluster_manager_cluster_outlier_detection_success_rate_minimum_hosts>`
value.

Ejection event logging
----------------------
2 changes: 1 addition & 1 deletion include/envoy/upstream/outlier_detection.h
Expand Up @@ -52,7 +52,7 @@ class DetectorHostSink {

typedef std::unique_ptr<DetectorHostSink> DetectorHostSinkPtr;

enum class EjectionType { Consecutive5xx };
enum class EjectionType { Consecutive5xx, SuccessRate };

/**
* Sink for outlier detection event logs.
17 changes: 16 additions & 1 deletion source/common/json/config_schemas.cc
Expand Up @@ -1089,6 +1089,16 @@ const std::string Json::Schema::CLUSTER_SCHEMA(R"EOF(
"minimum" : 0,
"exclusiveMinimum" : true
},
"success_rate_minimum_hosts" : {
"type" : "integer",
"minimum" : 0,
"exclusiveMinimum" : true
},
"success_rate_request_volume" : {
"type" : "integer",
"minimum" : 0,
"exclusiveMinimum" : true
},
"interval_ms" : {
"type" : "integer",
"minimum" : 0,
Expand All @@ -1104,7 +1114,12 @@ const std::string Json::Schema::CLUSTER_SCHEMA(R"EOF(
"minimum" : 0,
"maximum" : 100
},
"enforcing" : {
"enforcing_consecutive_5xx" : {
"type" : "integer",
"minimum" : 0,
"maximum" : 100
},
"enforcing_success_rate" : {
"type" : "integer",
"minimum" : 0,
"maximum" : 100
134 changes: 131 additions & 3 deletions source/common/upstream/outlier_detection_impl.cc
Expand Up @@ -33,7 +33,12 @@ void DetectorHostSinkImpl::uneject(SystemTime unejection_time) {
last_unejection_time_.value(unejection_time);
}

void DetectorHostSinkImpl::updateCurrentSuccessRateBucket() {
success_rate_accumulator_bucket_.store(success_rate_accumulator_.updateCurrentWriter());
}

void DetectorHostSinkImpl::putHttpResponseCode(uint64_t response_code) {
success_rate_accumulator_bucket_.load()->total_request_counter_++;
if (Http::CodeUtility::is5xx(response_code)) {
std::shared_ptr<DetectorImpl> detector = detector_.lock();
if (!detector) {
Expand All @@ -47,6 +52,7 @@ void DetectorHostSinkImpl::putHttpResponseCode(uint64_t response_code) {
detector->onConsecutive5xx(host_.lock());
}
} else {
success_rate_accumulator_bucket_.load()->success_request_counter_++;
consecutive_5xx_ = 0;
}
}
Expand All @@ -58,7 +64,14 @@ DetectorConfig::DetectorConfig(const Json::Object& json_config)
consecutive_5xx_(static_cast<uint64_t>(json_config.getInteger("consecutive_5xx", 5))),
max_ejection_percent_(
static_cast<uint64_t>(json_config.getInteger("max_ejection_percent", 10))),
enforcing_(static_cast<uint64_t>(json_config.getInteger("enforcing", 100))) {}
success_rate_minimum_hosts_(
static_cast<uint64_t>(json_config.getInteger("success_rate_minimum_hosts", 5))),
success_rate_request_volume_(
static_cast<uint64_t>(json_config.getInteger("success_rate_request_volume", 100))),
enforcing_consecutive_5xx_(
static_cast<uint64_t>(json_config.getInteger("enforcing_consecutive_5xx", 100))),
enforcing_success_rate_(
static_cast<uint64_t>(json_config.getInteger("enforcing_success_rate", 100))) {}

DetectorImpl::DetectorImpl(const Cluster& cluster, const Json::Object& json_config,
Event::Dispatcher& dispatcher, Runtime::Loader& runtime,
Expand Up @@ -146,14 +159,27 @@ void DetectorImpl::checkHostForUneject(HostSharedPtr host, DetectorHostSinkImpl*
}
}

bool DetectorImpl::enforceEjection(EjectionType type) {
switch (type) {
case EjectionType::Consecutive5xx:
return runtime_.snapshot().featureEnabled("outlier_detection.enforcing_consecutive_5xx",
config_.enforcingConsecutive5xx());
case EjectionType::SuccessRate:
return runtime_.snapshot().featureEnabled("outlier_detection.enforcing_success_rate",
config_.enforcingSuccessRate());
}

NOT_REACHED;
}
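
The per-type enforcement above hands a percentage to `featureEnabled`. As a rough illustration of how such a percentage gate can behave, here is a minimal sketch. The name `sketchEnforce` and the "random value modulo 100" rule are assumptions made for illustration; the diff does not show how `featureEnabled` is implemented.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper, not part of the diff: returns true when an ejection
// should be enforced, given a configured percentage and a caller-supplied
// random value. Assumes a "random_value % 100 < percent" rule.
bool sketchEnforce(uint64_t enforcing_percent, uint64_t random_value) {
  // Clamp so that values above 100 behave like 100 (always enforce).
  const uint64_t percent = std::min<uint64_t>(enforcing_percent, 100);
  return (random_value % 100) < percent;
}
```

Under this assumption, a setting of 0 never enforces and a setting of 100 always enforces, which matches the documented use of the `enforcing_*` settings to disable ejection or ramp it up slowly.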

void DetectorImpl::ejectHost(HostSharedPtr host, EjectionType type) {
uint64_t max_ejection_percent = std::min<uint64_t>(
100, runtime_.snapshot().getInteger("outlier_detection.max_ejection_percent",
config_.maxEjectionPercent()));
double ejected_percent = 100.0 * stats_.ejections_active_.value() / host_sinks_.size();
if (ejected_percent < max_ejection_percent) {
stats_.ejections_total_.inc();
if (runtime_.snapshot().featureEnabled("outlier_detection.enforcing", config_.enforcing())) {
if (enforceEjection(type)) {
stats_.ejections_active_.inc();
host_sinks_[host]->eject(time_source_.currentSystemTime());
runCallbacks(host);
Expand Down Expand Up @@ -208,12 +234,93 @@ void DetectorImpl::onConsecutive5xxWorker(HostSharedPtr host) {
ejectHost(host, EjectionType::Consecutive5xx);
}

// The canonical factor for outlier detection in normal distributions is 2. However, host
// success rates are intuitively a distribution with negative skew, with most of the mass around
// 100 and a left tail. Therefore, a more aggressive (lower) factor is needed to detect
// outliers.
const double Utility::SUCCESS_RATE_STDEV_FACTOR = 1.9;

double Utility::successRateEjectionThreshold(
double success_rate_sum, const std::vector<HostSuccessRatePair>& valid_success_rate_hosts) {
// This function is using mean and standard deviation as statistical measures for outlier
// detection. First the mean is calculated by dividing the sum of success rate data over the
// number of data points. Then variance is calculated by taking the mean of the
// squared difference of data points to the mean of the data. Then standard deviation is
// calculated by taking the square root of the variance. Then the outlier threshold is
// calculated as the difference between the mean and the product of the standard
// deviation and a constant factor.
//
// For example with a data set that looks like success_rate_data = {50, 100, 100, 100, 100} the
// math would work as follows:
// success_rate_sum = 450
// mean = 90
// variance = 400
// stdev = 20
// threshold returned = 52
double mean = success_rate_sum / valid_success_rate_hosts.size();
double variance = 0;
std::for_each(valid_success_rate_hosts.begin(), valid_success_rate_hosts.end(),
[&variance, mean](HostSuccessRatePair v) {
variance += std::pow(v.success_rate_ - mean, 2);
});
variance /= valid_success_rate_hosts.size();
double stdev = std::sqrt(variance);

return mean - (SUCCESS_RATE_STDEV_FACTOR * stdev);
}
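
The arithmetic in the comment above can be checked in isolation. The following is a minimal standalone sketch that operates on plain `double` values rather than `HostSuccessRatePair`; the function name `sketchEjectionThreshold` and its signature are hypothetical, and the factor is passed in rather than read from `SUCCESS_RATE_STDEV_FACTOR`.

```cpp
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Hypothetical standalone version of the threshold calculation:
// mean - factor * population standard deviation.
double sketchEjectionThreshold(const std::vector<double>& success_rates, double stdev_factor) {
  assert(!success_rates.empty()); // The caller must supply at least one data point.
  const double count = static_cast<double>(success_rates.size());
  const double mean =
      std::accumulate(success_rates.begin(), success_rates.end(), 0.0) / count;

  // Population variance: mean of squared differences from the mean.
  double variance = 0.0;
  for (double rate : success_rates) {
    variance += (rate - mean) * (rate - mean);
  }
  variance /= count;

  return mean - stdev_factor * std::sqrt(variance);
}
```

With the data set from the comment, `{50, 100, 100, 100, 100}`, and a factor of 1.9, the mean is 90, the standard deviation is 20, and the threshold is 52, so only the host at 50 falls below it.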

void DetectorImpl::processSuccessRateEjections() {
uint64_t success_rate_minimum_hosts = runtime_.snapshot().getInteger(
"outlier_detection.success_rate_minimum_hosts", config_.successRateMinimumHosts());
uint64_t success_rate_request_volume = runtime_.snapshot().getInteger(
"outlier_detection.success_rate_request_volume", config_.successRateRequestVolume());
std::vector<HostSuccessRatePair> valid_success_rate_hosts;
double success_rate_sum = 0;

// Exit early if there are not enough hosts.
if (host_sinks_.size() < success_rate_minimum_hosts) {
return;
}

// Reserve the upper bound of the vector size to avoid reallocation.
valid_success_rate_hosts.reserve(host_sinks_.size());

for (const auto& host : host_sinks_) {
host.second->updateCurrentSuccessRateBucket();
// Don't do work if the host is already ejected.
if (!host.first->healthFlagGet(Host::HealthFlag::FAILED_OUTLIER_CHECK)) {
Optional<double> host_success_rate =
host.second->successRateAccumulator().getSuccessRate(success_rate_request_volume);

if (host_success_rate.valid()) {
valid_success_rate_hosts.emplace_back(
HostSuccessRatePair(host.first, host_success_rate.value()));
success_rate_sum += host_success_rate.value();
}
}
}

if (valid_success_rate_hosts.size() >= success_rate_minimum_hosts) {
double ejection_threshold =
Utility::successRateEjectionThreshold(success_rate_sum, valid_success_rate_hosts);
for (const auto& host_success_rate_pair : valid_success_rate_hosts) {
if (host_success_rate_pair.success_rate_ < ejection_threshold) {
stats_.ejections_success_rate_.inc();
ejectHost(host_success_rate_pair.host_, EjectionType::SuccessRate);
}
}
}
}
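
The control flow of `processSuccessRateEjections()` can be sketched without any Envoy types. Everything named below (`HostSample`, `sketchSuccessRateOutliers`, the string host names) is hypothetical, and the sketch leaves out the skip for already ejected hosts, the stats counters, and the `max_ejection_percent` check that `ejectHost()` applies.

```cpp
#include <cmath>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical per-host counters for one completed interval.
struct HostSample {
  std::string name;
  uint64_t success;
  uint64_t total;
};

// Returns the names of hosts whose success rate falls below
// mean - stdev_factor * stdev, computed over the hosts with enough volume.
std::vector<std::string> sketchSuccessRateOutliers(const std::vector<HostSample>& hosts,
                                                   uint64_t minimum_hosts,
                                                   uint64_t request_volume,
                                                   double stdev_factor) {
  std::vector<std::string> outliers;
  if (hosts.size() < minimum_hosts) {
    return outliers; // Not enough hosts in the cluster at all.
  }

  std::vector<std::pair<std::string, double>> valid;
  valid.reserve(hosts.size());
  double sum = 0.0;
  for (const auto& host : hosts) {
    // Hosts below the request volume (or with no traffic) are ignored.
    if (host.total == 0 || host.total < request_volume) {
      continue;
    }
    const double rate = host.success * 100.0 / host.total;
    valid.emplace_back(host.name, rate);
    sum += rate;
  }
  if (valid.empty() || valid.size() < minimum_hosts) {
    return outliers; // Too few hosts with enough volume in this interval.
  }

  const double mean = sum / valid.size();
  double variance = 0.0;
  for (const auto& entry : valid) {
    variance += (entry.second - mean) * (entry.second - mean);
  }
  variance /= valid.size();
  const double threshold = mean - stdev_factor * std::sqrt(variance);

  for (const auto& entry : valid) {
    if (entry.second < threshold) {
      outliers.push_back(entry.first);
    }
  }
  return outliers;
}
```

Note the two separate minimum-hosts checks: one against the cluster size before any work is done, and one against the number of hosts that actually met the request volume.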

void DetectorImpl::onIntervalTimer() {
SystemTime now = time_source_.currentSystemTime();

for (auto host : host_sinks_) {
checkHostForUneject(host.first, host.second, now);
}

processSuccessRateEjections();

armIntervalTimer();
}

Expand Down Expand Up @@ -268,9 +375,11 @@ std::string EventLoggerImpl::typeToString(EjectionType type) {
switch (type) {
case EjectionType::Consecutive5xx:
return "5xx";
case EjectionType::SuccessRate:
return "SuccessRate";
}

NOT_IMPLEMENTED;
NOT_REACHED;
}

int EventLoggerImpl::secsSinceLastAction(const Optional<SystemTime>& lastActionTime,
Expand All @@ -281,5 +390,24 @@ int EventLoggerImpl::secsSinceLastAction(const Optional<SystemTime>& lastActionT
return -1;
}

SuccessRateAccumulatorBucket* SuccessRateAccumulator::updateCurrentWriter() {
// Right now current is being written to and backup is not. Flush the backup and swap.
backup_success_rate_bucket_->success_request_counter_ = 0;
backup_success_rate_bucket_->total_request_counter_ = 0;

current_success_rate_bucket_.swap(backup_success_rate_bucket_);

return current_success_rate_bucket_.get();
}

Optional<double> SuccessRateAccumulator::getSuccessRate(uint64_t success_rate_request_volume) {
if (backup_success_rate_bucket_->total_request_counter_ < success_rate_request_volume) {
return Optional<double>();
}

return Optional<double>(backup_success_rate_bucket_->success_request_counter_ * 100.0 /
backup_success_rate_bucket_->total_request_counter_);
}
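
The two-bucket scheme used by `SuccessRateAccumulator` can be illustrated with a small single-threaded sketch. The class and member names below are hypothetical, the atomic bucket pointer held by the host sink is omitted, and a `bool` plus an output parameter stands in for the `Optional<double>` return type.

```cpp
#include <cstdint>
#include <memory>

// Hypothetical counters for one interval.
struct SketchBucket {
  uint64_t success = 0;
  uint64_t total = 0;
};

// Single-threaded sketch: writers update the current bucket while readers
// look at the backup bucket, which holds the last completed interval.
class SketchAccumulator {
public:
  SketchAccumulator() : current_(new SketchBucket()), backup_(new SketchBucket()) {}

  // Record one response in the bucket currently being written.
  void record(bool success) {
    current_->total++;
    if (success) {
      current_->success++;
    }
  }

  // Called once per interval: clear the stale backup, then swap so that the
  // just-finished interval becomes readable and writes start from zero.
  void rotate() {
    backup_->success = 0;
    backup_->total = 0;
    current_.swap(backup_);
  }

  // Writes the success rate (0 to 100) of the last completed interval into
  // rate_out and returns true, or returns false if the volume was too low.
  bool successRate(uint64_t request_volume, double& rate_out) const {
    if (backup_->total == 0 || backup_->total < request_volume) {
      return false;
    }
    rate_out = backup_->success * 100.0 / backup_->total;
    return true;
  }

private:
  std::unique_ptr<SketchBucket> current_;
  std::unique_ptr<SketchBucket> backup_;
};
```

A typical sequence is: record responses during an interval, call `rotate()` when the interval timer fires, then read the rate for the interval that just ended. A second `rotate()` with no traffic in between leaves nothing to report.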

} // Outlier
} // Upstream