Skip to content

outlier: success rate detection#546

Merged
junr03 merged 34 commits intomasterfrom
sr-outlier-detection
Mar 23, 2017
Merged

outlier: success rate detection#546
junr03 merged 34 commits intomasterfrom
sr-outlier-detection

Conversation

@junr03
Copy link
Copy Markdown
Member

@junr03 junr03 commented Mar 8, 2017

WIP

To-do:

@rshriram
Copy link
Copy Markdown
Member

rshriram commented Mar 8, 2017

what does sr mean in this context?

@junr03
Copy link
Copy Markdown
Member Author

junr03 commented Mar 8, 2017

@rshriram success rate

@junr03 junr03 changed the title outlier: sr detection outlier: success rate detection Mar 8, 2017
for (auto host : valid_sr_hosts) {
if (host.second < mean - (2 * stdev)) {
stats_.ejections_consecutive_5xx_.inc();
ejectHost(host.first, EjectionType::SuccessRate);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the max #hosts we could potentially eject here? (worry about to eject too many). I'll look into more details, but wondering if you have an answer to this

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there is a max ejection percentage completed in the ejectHost function

struct SRAccumulatorBucket {
std::atomic<uint64_t> success_rq_counter_;
std::atomic<uint64_t> total_rq_counter_;
};
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: new line everywhere in various files


class DetectorImpl;

class SRAccumulatorImpl {
Copy link
Copy Markdown
Contributor

@ccaraman ccaraman Mar 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comment about the class. I always forget what SR means. Please expand on why we bucket and swap.

public:
SRAccumulatorImpl()
: current_sr_bucket_(new SRAccumulatorBucket()),
backup_sr_bucket_(new SRAccumulatorBucket()){};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit spacing before {}


typedef std::shared_ptr<Detector> DetectorPtr;

struct SRAccumulatorBucket {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would either move this struct into the class where you are using it or add a comment about what its intents are.

checkHostForUneject(host.first, host.second, now);

// Success Rate Outlier Detection
// First swap out the current bucket been written to, to keep data valid
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix comment.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by fix?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing full stop.

: current_sr_bucket_(new SRAccumulatorBucket()),
backup_sr_bucket_(new SRAccumulatorBucket()){};
SRAccumulatorBucket* getCurrentWriter();
Optional<double> getSR(uint64_t rq_volume_threshold);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Expand to getSuccessRate and add comment.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel that I have used SR everywhere, and expanding it here (while useful, and does not make the name too long) would be confusing.

stdev = std::sqrt(stdev);

for (auto host : valid_sr_hosts) {
if (host.second < mean - (2 * stdev)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would we ever want to change how many stdevs away we allow?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially, but I am not sure if we would like to make this configurable. In general for this statistics approach I am more curious to deploy non-enforcing and see how it behaves. My intuition is that mean and stdev is going to be sensitive, but I want to see data.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good to know. I would make the 2 a static const member var that way the name could indicate its use.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point.


for (auto host : valid_sr_hosts) {
if (host.second < mean - (2 * stdev)) {
stats_.ejections_consecutive_5xx_.inc();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wrong stat?

}

SRAccumulatorBucket* SRAccumulatorImpl::getCurrentWriter() {
// Right now current_ is being written to and backup_ is not. Flush the backup and swap
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add . to the sentence.

config_.rqVolumeThreshold())) {

// Calculate the statistics (mean, stdev). We are using mean to detect outliers.
double mean = sr_sum / sr_data.size();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move stats functions into separate math utils or something

uint64_t maxEjectionPercent() { return max_ejection_percent_; }
uint64_t enforcing() { return enforcing_; }
uint64_t significantHostThreshold() { return significant_host_threshold_; }
uint64_t rqVolumeThreshold() { return rq_volume_threshold_; }
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add comments on this as well? e.g. what is significantHostThreshold etc

Copy link
Copy Markdown
Member Author

@junr03 junr03 Mar 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll document this in the docs

@@ -209,8 +219,51 @@ void DetectorImpl::onConsecutive5xxWorker(HostPtr host) {

void DetectorImpl::onIntervalTimer() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it's a good time to start thinking about small test setup where you can feed upstream hosts params and run ejection on that group of hosts, similar to the idea (https://github.com/lyft/envoy/blob/master/test/common/upstream/load_balancer_simulation_test.cc#L22) of what we are doing to validate how AZ routing works.

With more complicated outlier detection algos we have to have a way to run simulation.

std::vector<double> sr_data;
double sr_sum = 0;
// reserve upper bound of vector size to avoid reallocation.
sr_data.reserve(host_sinks_.size());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we do the check for hosts_sinks_.size() > significant_host_threshold here too. only reserve if it we need to.


.. _config_cluster_manager_cluster_outlier_detection_enforcing_sr:

enforcing_sr
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/enforcing_sr/enforcing_success_rate


.. _config_cluster_manager_cluster_outlier_detection_rq_volume_threshold:

rq_volume_threshold:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/rq_volume_threshold/request_volume_threshold


.. _config_cluster_manager_cluster_outlier_detection_significant_host_threshold:

significant_host_threshold
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not clear this is related to request volume and success rate. Use a more specific name.

Comment thread docs/intro/arch_overview/outlier.rst Outdated
Success Rate
^^^^^^^^^^^^

Success Rate based outlier ejection aggregates success rate data from every host in a cluster, and at a given
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

run on

Comment thread source/common/json/config_schemas.cc Outdated
"minimum" : 0,
"exclusiveMinimum" : true
},
"rq_volume_threshold" : {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

request_volume_threshold

}

void DetectorHostSinkImpl::updateCurrentSRBucket() {
sr_accumulator_bucket_.store(sr_accumulator_.getCurrentWriter());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everywhere in this entire change where you do just SR switch to some variant of "Success rate"

const double Utility::SR_STDEV_FACTOR = 1.9;

double Utility::srEjectionThreshold(double sr_sum, std::vector<double>& sr_data) {
double mean = sr_sum / sr_data.size();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comment in this function on what all of this math is actually doing, with an example.


const double Utility::SR_STDEV_FACTOR = 1.9;

double Utility::srEjectionThreshold(double sr_sum, std::vector<double>& sr_data) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const std::vector&

checkHostForUneject(host.first, host.second, now);

// Success Rate Outlier Detection.
// First swap out the current bucket been written to, to keep data valid.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

grammar


void DetectorImpl::onIntervalTimer() {
SystemTime now = time_source_.currentSystemTime();
std::unordered_map<HostPtr, double> valid_sr_hosts;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move all success rate interval handling into dedicated function

class Utility {
public:
/**
* This function returns the Success Rate trheshold for Success Rate outlier detection. If a
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo

* stats. This implementation has a fixed window size of time, and thus only needs a
* bucket to write to, and a bucket to accumulate/run stats over.
*/
class SRAccumulatorImpl {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove Impl, not implementing an interface

: current_sr_bucket_(new SRAccumulatorBucket()),
backup_sr_bucket_(new SRAccumulatorBucket()) {}

SRAccumulatorBucket* getCurrentWriter();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is poorly named, as it does the swap also. Make name more descriptive and add comment.

@mattklein123
Copy link
Copy Markdown
Member

Another high level comment: It's unclear to me how we are going to usefully debug this change during rollout. You probably need logging changes, potentially the ability to dump current SR from /clusters output, etc. That needs thinking.

Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, much easier to understand. some more comments.

const double Utility::SUCCESS_RATE_STDEV_FACTOR = 1.9;

double Utility::successRateEjectionThreshold(double success_rate_sum,
std::vector<double>& success_rate_data) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const std::vector& (I think I already commented on this but not sure)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You did comment on this already. I missed it writing the big comment after the function signature.

ejectHost(host, EjectionType::Consecutive5xx);
}

const double Utility::SUCCESS_RATE_STDEV_FACTOR = 1.9;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment, how was this chosen?

if (host_sinks_.size() >= success_rate_minimum_hosts) {
// reserve upper bound of vector size to avoid reallocation.
success_rate_data.reserve(host_sinks_.size());
}
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 Mar 23, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do an early return if you don't have enough hosts.

host.second->updateCurrentSuccessRateBucket();

// If there are not enough hosts to begin with, don't do the work.
if (host_sinks_.size() >= success_rate_minimum_hosts) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do an early return if you don't have enough hosts.

}

void DetectorImpl::successRateEjections() {
std::unordered_map<HostSharedPtr, double> valid_success_rate_hosts;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't actually need map lookup here, an std::list or reserved std::vector containing a struct with host and double will be more efficient.

for (auto host : valid_success_rate_hosts) {
if (host.second < ejection_threshold) {
Utility::successRateEjectionThreshold(success_rate_sum, valid_success_rate_hosts);
for (auto tuple : valid_success_rate_hosts) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const auto&.
on a side note, I'd make it more readable with some simple struct introduction with named fields instead of get<1>, get<0>

return Optional<double>();
}

return Optional<double>(backup_success_rate_bucket_->success_request_counter_ * 100 /
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are losing precision here, should probably sr_counter_ * 100.0 / total_coutner_

* @return a valid Optional<double> with the success rate. If there were not enough requests, an
* invalid Optional<double> is returned.
*/
Optional<double> getSuccessRate(uint64_t success_rate_request_volume);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

param name is incorrect... request_volume_threshold

};

/**
* Utilities for Outlier Detection
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Utilities for Outlier Detection.

@junr03 junr03 force-pushed the sr-outlier-detection branch from 4321114 to 607512c Compare March 23, 2017 17:42
checkHostForUneject(host.first, host.second, now);
}

successRateEjections();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I would rename processSuccessRateEjections()

Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good after name nit

@junr03 junr03 merged commit 6fbc857 into master Mar 23, 2017
@junr03 junr03 deleted the sr-outlier-detection branch March 23, 2017 18:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants