Conversation
|
what does sr mean in this context? |
|
@rshriram success rate |
| for (auto host : valid_sr_hosts) { | ||
| if (host.second < mean - (2 * stdev)) { | ||
| stats_.ejections_consecutive_5xx_.inc(); | ||
| ejectHost(host.first, EjectionType::SuccessRate); |
There was a problem hiding this comment.
what's the max #hosts we could potentially eject here? (worry about to eject too many). I'll look into more details, but wondering if you have an answer to this
There was a problem hiding this comment.
Yes, there is a max ejection percentage completed in the ejectHost function
| struct SRAccumulatorBucket { | ||
| std::atomic<uint64_t> success_rq_counter_; | ||
| std::atomic<uint64_t> total_rq_counter_; | ||
| }; |
There was a problem hiding this comment.
nit: new line everywhere in various files
|
|
||
| class DetectorImpl; | ||
|
|
||
| class SRAccumulatorImpl { |
There was a problem hiding this comment.
Add comment about the class. I always forget what SR means. Please expand on why we bucket and swap.
| public: | ||
| SRAccumulatorImpl() | ||
| : current_sr_bucket_(new SRAccumulatorBucket()), | ||
| backup_sr_bucket_(new SRAccumulatorBucket()){}; |
|
|
||
| typedef std::shared_ptr<Detector> DetectorPtr; | ||
|
|
||
| struct SRAccumulatorBucket { |
There was a problem hiding this comment.
I would either move this struct into the class where you are using it or add a comment about what its intents are.
| checkHostForUneject(host.first, host.second, now); | ||
|
|
||
| // Success Rate Outlier Detection | ||
| // First swap out the current bucket been written to, to keep data valid |
| : current_sr_bucket_(new SRAccumulatorBucket()), | ||
| backup_sr_bucket_(new SRAccumulatorBucket()){}; | ||
| SRAccumulatorBucket* getCurrentWriter(); | ||
| Optional<double> getSR(uint64_t rq_volume_threshold); |
There was a problem hiding this comment.
nit: Expand to getSuccessRate and add comment.
There was a problem hiding this comment.
I feel that I have used SR everywhere, and expanding it here (while useful, and does not make the name too long) would be confusing.
| stdev = std::sqrt(stdev); | ||
|
|
||
| for (auto host : valid_sr_hosts) { | ||
| if (host.second < mean - (2 * stdev)) { |
There was a problem hiding this comment.
would we ever want to change how many stdevs away we allow?
There was a problem hiding this comment.
Potentially, but I am not sure if we would like to make this configurable. In general for this statistics approach I am more curious to deploy non-enforcing and see how it behaves. My intuition is that mean and stdev is going to be sensitive, but I want to see data.
There was a problem hiding this comment.
good to know. I would make the 2 a static const member var that way the name could indicate its use.
|
|
||
| for (auto host : valid_sr_hosts) { | ||
| if (host.second < mean - (2 * stdev)) { | ||
| stats_.ejections_consecutive_5xx_.inc(); |
| } | ||
|
|
||
| SRAccumulatorBucket* SRAccumulatorImpl::getCurrentWriter() { | ||
| // Right now current_ is being written to and backup_ is not. Flush the backup and swap |
| config_.rqVolumeThreshold())) { | ||
|
|
||
| // Calculate the statistics (mean, stdev). We are using mean to detect outliers. | ||
| double mean = sr_sum / sr_data.size(); |
There was a problem hiding this comment.
can you move stats functions into separate math utils or something
| uint64_t maxEjectionPercent() { return max_ejection_percent_; } | ||
| uint64_t enforcing() { return enforcing_; } | ||
| uint64_t significantHostThreshold() { return significant_host_threshold_; } | ||
| uint64_t rqVolumeThreshold() { return rq_volume_threshold_; } |
There was a problem hiding this comment.
can you add comments on this as well? e.g. what is significantHostThreshold etc
There was a problem hiding this comment.
I'll document this in the docs
| @@ -209,8 +219,51 @@ void DetectorImpl::onConsecutive5xxWorker(HostPtr host) { | |||
|
|
|||
| void DetectorImpl::onIntervalTimer() { | |||
There was a problem hiding this comment.
maybe it's a good time to start thinking about small test setup where you can feed upstream hosts params and run ejection on that group of hosts, similar to the idea (https://github.com/lyft/envoy/blob/master/test/common/upstream/load_balancer_simulation_test.cc#L22) of what we are doing to validate how AZ routing works.
With more complicated outlier detection algos we have to have a way to run simulation.
| std::vector<double> sr_data; | ||
| double sr_sum = 0; | ||
| // reserve upper bound of vector size to avoid reallocation. | ||
| sr_data.reserve(host_sinks_.size()); |
There was a problem hiding this comment.
nit: can we do the check for hosts_sinks_.size() > significant_host_threshold here too. only reserve if it we need to.
|
|
||
| .. _config_cluster_manager_cluster_outlier_detection_enforcing_sr: | ||
|
|
||
| enforcing_sr |
There was a problem hiding this comment.
s/enforcing_sr/enforcing_success_rate
|
|
||
| .. _config_cluster_manager_cluster_outlier_detection_rq_volume_threshold: | ||
|
|
||
| rq_volume_threshold: |
There was a problem hiding this comment.
s/rq_volume_threshold/request_volume_threshold
|
|
||
| .. _config_cluster_manager_cluster_outlier_detection_significant_host_threshold: | ||
|
|
||
| significant_host_threshold |
There was a problem hiding this comment.
not clear this is related to request volume and success rate. Use a more specific name.
| Success Rate | ||
| ^^^^^^^^^^^^ | ||
|
|
||
| Success Rate based outlier ejection aggregates success rate data from every host in a cluster, and at a given |
| "minimum" : 0, | ||
| "exclusiveMinimum" : true | ||
| }, | ||
| "rq_volume_threshold" : { |
| } | ||
|
|
||
| void DetectorHostSinkImpl::updateCurrentSRBucket() { | ||
| sr_accumulator_bucket_.store(sr_accumulator_.getCurrentWriter()); |
There was a problem hiding this comment.
Everywhere in this entire change where you do just SR switch to some variant of "Success rate"
| const double Utility::SR_STDEV_FACTOR = 1.9; | ||
|
|
||
| double Utility::srEjectionThreshold(double sr_sum, std::vector<double>& sr_data) { | ||
| double mean = sr_sum / sr_data.size(); |
There was a problem hiding this comment.
Add comment in this function on what all of this math is actually doing, with an example.
|
|
||
| const double Utility::SR_STDEV_FACTOR = 1.9; | ||
|
|
||
| double Utility::srEjectionThreshold(double sr_sum, std::vector<double>& sr_data) { |
| checkHostForUneject(host.first, host.second, now); | ||
|
|
||
| // Success Rate Outlier Detection. | ||
| // First swap out the current bucket been written to, to keep data valid. |
|
|
||
| void DetectorImpl::onIntervalTimer() { | ||
| SystemTime now = time_source_.currentSystemTime(); | ||
| std::unordered_map<HostPtr, double> valid_sr_hosts; |
There was a problem hiding this comment.
Move all success rate interval handling into dedicated function
| class Utility { | ||
| public: | ||
| /** | ||
| * This function returns the Success Rate trheshold for Success Rate outlier detection. If a |
| * stats. This implementation has a fixed window size of time, and thus only needs a | ||
| * bucket to write to, and a bucket to accumulate/run stats over. | ||
| */ | ||
| class SRAccumulatorImpl { |
There was a problem hiding this comment.
remove Impl, not implementing an interface
| : current_sr_bucket_(new SRAccumulatorBucket()), | ||
| backup_sr_bucket_(new SRAccumulatorBucket()) {} | ||
|
|
||
| SRAccumulatorBucket* getCurrentWriter(); |
There was a problem hiding this comment.
this is poorly named, as it does the swap also. Make name more descriptive and add comment.
|
Another high level comment: It's unclear to me how we are going to usefully debug this change during rollout. You probably need logging changes, potentially the ability to dump current SR from /clusters output, etc. That needs thinking. |
mattklein123
left a comment
There was a problem hiding this comment.
thanks, much easier to understand. some more comments.
| const double Utility::SUCCESS_RATE_STDEV_FACTOR = 1.9; | ||
|
|
||
| double Utility::successRateEjectionThreshold(double success_rate_sum, | ||
| std::vector<double>& success_rate_data) { |
There was a problem hiding this comment.
const std::vector& (I think I already commented on this but not sure)
There was a problem hiding this comment.
You did comment on this already. I missed it writing the big comment after the function signature.
| ejectHost(host, EjectionType::Consecutive5xx); | ||
| } | ||
|
|
||
| const double Utility::SUCCESS_RATE_STDEV_FACTOR = 1.9; |
There was a problem hiding this comment.
comment, how was this chosen?
| if (host_sinks_.size() >= success_rate_minimum_hosts) { | ||
| // reserve upper bound of vector size to avoid reallocation. | ||
| success_rate_data.reserve(host_sinks_.size()); | ||
| } |
There was a problem hiding this comment.
Do an early return if you don't have enough hosts.
| host.second->updateCurrentSuccessRateBucket(); | ||
|
|
||
| // If there are not enough hosts to begin with, don't do the work. | ||
| if (host_sinks_.size() >= success_rate_minimum_hosts) { |
There was a problem hiding this comment.
Do an early return if you don't have enough hosts.
| } | ||
|
|
||
| void DetectorImpl::successRateEjections() { | ||
| std::unordered_map<HostSharedPtr, double> valid_success_rate_hosts; |
There was a problem hiding this comment.
You don't actually need map lookup here, an std::list or reserved std::vector containing a struct with host and double will be more efficient.
| for (auto host : valid_success_rate_hosts) { | ||
| if (host.second < ejection_threshold) { | ||
| Utility::successRateEjectionThreshold(success_rate_sum, valid_success_rate_hosts); | ||
| for (auto tuple : valid_success_rate_hosts) { |
There was a problem hiding this comment.
const auto&.
on a side note, I'd make it more readable with some simple struct introduction with named fields instead of get<1>, get<0>
| return Optional<double>(); | ||
| } | ||
|
|
||
| return Optional<double>(backup_success_rate_bucket_->success_request_counter_ * 100 / |
There was a problem hiding this comment.
you are losing precision here, should probably sr_counter_ * 100.0 / total_coutner_
| * @return a valid Optional<double> with the success rate. If there were not enough requests, an | ||
| * invalid Optional<double> is returned. | ||
| */ | ||
| Optional<double> getSuccessRate(uint64_t success_rate_request_volume); |
There was a problem hiding this comment.
param name is incorrect... request_volume_threshold
| }; | ||
|
|
||
| /** | ||
| * Utilities for Outlier Detection |
There was a problem hiding this comment.
nit: Utilities for Outlier Detection.
4321114 to
607512c
Compare
| checkHostForUneject(host.first, host.second, now); | ||
| } | ||
|
|
||
| successRateEjections(); |
There was a problem hiding this comment.
nit: I would rename processSuccessRateEjections()
mattklein123
left a comment
There was a problem hiding this comment.
looks good after name nit
WIP
To-do: