Add "downstream_nodes" optional config settings to the fault filter#1047
Conversation
| "upstream_cluster" : "...", | ||
| "headers" : [] | ||
| "headers" : [], | ||
| "match_downstream_cluster": false, |
| * http.fault.<downstream-cluster>.abort.http_status | ||
| * http.fault.<downstream-cluster>.delay.fixed_delay_percent | ||
| * http.fault.<downstream-cluster>.delay.fixed_duration_ms | ||
| In case the following settings are not found in the runtime it defaults to the config settings. |
|
|
||
| delays_injected, Counter, Total requests that were delayed | ||
| aborts_injected, Counter, Total requests that were aborted | ||
| <downstream-cluster>.delays_injected, Counter, Total delayed requests for the given downstream cluster |
There was a problem hiding this comment.
Keep sentence consistent with above wording? Total requests that were delayed for the given downstream cluster. Same below.
There was a problem hiding this comment.
that makes a table (when rendered) expand a lot, i prefer to keep the sentence shorter.
|
|
||
| .. _config_http_filters_fault_injection_match_downstream_cluster: | ||
|
|
||
| match_downstream_cluster: |
There was a problem hiding this comment.
I got hung up on the logic below. I think the original cause was this name. I think we should change this name, and related function to something else: downstream_cluster_specific_settings or something like that.
| config_->delayPercent())) { | ||
| uint64_t duration_ms = config_->runtime().snapshot().getInteger( | ||
| "fault.http.delay.fixed_duration_ms", config_->delayDuration()); | ||
| if (config_->matchDownstreamCluster()) { |
There was a problem hiding this comment.
As per comment above I think this name confuses the point of the flag. Which is: get specific cluster settings.
|
@htuch @mattklein123 can i get the review |
|
|
||
| .. _config_http_filters_fault_injection_downstream_cluster: | ||
|
|
||
| downstream_cluster_specific_settings: |
There was a problem hiding this comment.
In general it seems odd to me that this needs to be specified at all. I would just always look for downstream cluster variants if the header is set, and use the default settings as the defaults for the lookups. Is there any particular reason to have this be configurable separately?
There was a problem hiding this comment.
This allows us to handle cases when we'd like to get settings based on other input params (not only downstream cluster) and make decisions based only on those, having it as a config setting directly expresses the intent.
I'd vote for explicit control via configuration and avoid unnecessary lookups (not everyone wants to do downstream cluster specific conf).
There was a problem hiding this comment.
The defaults for the lookups for the downstream specific cluster runtime lookups can be taken from the main config. So if they aren't present, the behavior will be the same. I really don't see people wanting to do 2 runtime lookups to get a default out of runtime, then override with a cluster specific one. So I think we can collapse this logic and make it simpler. Let's go ahead and do that.
| // nodes. If not set then inject for all. | ||
| Runtime::Loader& runtime_; | ||
| FaultFilterStats stats_; | ||
| std::string stats_prefix_; |
| Event::TimerPtr delay_timer_; | ||
| std::string downstream_cluster_{}; | ||
|
|
||
| std::string delay_percent_key_{"fault.http.delay.fixed_delay_percent"}; |
There was a problem hiding this comment.
those could not be const in current impl because it depends on whether downstream cluster specific settings are used of not.
There was a problem hiding this comment.
OK, well per above comment, if we are going to generalize and just do one runtime lookup, these don't need to be local variables anyway. They can just all be local variables inside the lookup function.
|
@RomanDzhabarov in thinking about this more, I think we still probably want a runtime generalizable way of doing global injection vs. cluster specific injection. Perhaps we should first look up cluster specific, and if no fault/delay is specified, look for global? In that case, the default runtime global keys can be static/const. I think that would be better and give us full control? |
|
Yes, I agree we should use downstream runtime settings if present, otherwise use global runtime settings if present or use config settings. |
rshriram
left a comment
There was a problem hiding this comment.
At high level, the logic for downstream specific fault injection is fine. But do we need to have it as a first class entity in the configuration? Why can't we match the nodes using headers alone?
| not set, faults are injected for all downstream nodes. Downstream node name is taken from | ||
| :ref:`the HTTP x-envoy-downstream-service-node <config_http_conn_man_headers_downstream-service-node>` | ||
| header and compared against downstream_nodes list. | ||
|
|
There was a problem hiding this comment.
Why can't this be done as header match? Does it have to be a specific option?
Same applies to downstream_cluster as well.
There was a problem hiding this comment.
I want to make "downstream-service-node" runtime controllable (so i can switch between some nodes to global easily), which should be very easy to add on top of the current functionality.
arguably we probably do not need runtime control over all headers match and it's going to be a bigger change. Let me know your thoughts.
There was a problem hiding this comment.
That is perfectly fine. But what I was driving at was, do we need to add a config element for this?
Why can't this be done in the followign way:
fault_filter1:
headers:
x-envoy-downstream-service-node: canary1
delay:
duration_ms: 1000
fault_filter2:
headers:
x-envoy-downstream-service-node: canary2
delay:
duration_ms: 1000
The logic from lines 96-110 in fault_filter.cc can stay as it is [the one that checks if the downstream cluster name in the header is present as part of the runtime with overrides for delay and abort]
Make sense? Essentially piggyback on the header match itself, instead of adding a new config element. If there is a header match for X-Envoy-downstream-cluster, then pull it out, check with runtime, etc.
There was a problem hiding this comment.
I do not see how the schema above when we have two filters with different nodes in the "x-envoy-downstream-service-node" allows us to switch to the mode where we do not care about downstream node at all.
For example, I'm running a test where I inject failures on "canary" only node and now I want to switch to the mode where I do not care about downstream-node at all. How would I do that in the current header matching logic?
There was a problem hiding this comment.
Ah okay. Now I get it. I think this explanation could go into the docs or somewhere, to clarify the use case.
There was a problem hiding this comment.
yup, i'll update it!
There was a problem hiding this comment.
once this PR is in, I'll shoot another one with the runtime support and doc update (tomorrow/monday)
There was a problem hiding this comment.
I suspect at some point this is going to need to be made regex capable to be generally useful. I would add a TODO for that. I think it's fine to do this in v1. We can possibly add when we add runtime support.
| } | ||
|
|
||
| if (!matchesDownstreamNodes(headers)) { | ||
| return FilterHeadersStatus::Continue; |
There was a problem hiding this comment.
This is what I was worried about. We are doing two header match passes. We could fold this into a single header match, can't we? It makes the configuration simple as well.
With this PR, I can do match based on downstreams either using the header or using the downstream_nodes option. Instead of supporting multiple ways for doing a single thing, I would suggest providing a single well defined way of achieving X, whatever it is.
There was a problem hiding this comment.
I think I agree with this. @RomanDzhabarov for our use, can't we just use header match and then just add runtime control to whether we match on headers or not? Seems like there should be some compromise here that doesn't do two separate things.
There was a problem hiding this comment.
Alright we talked offline, will just do the full runtime solution and de-dup.
There was a problem hiding this comment.
Nevermind, talked again. This is kind of a far reaching change. To unblock let's go with this and we will add TODO to generalize and clean this up.
| uint64_t duration_ms = delayDuration(); | ||
|
|
||
| // Delay only if the duration is >0ms | ||
| if (0 != duration_ms) { |
There was a problem hiding this comment.
If you are going to provide a isDelayEnabled() utility function, I suggest folding this condition into that utility. Saves an unnecessary if check and makes code simple.
|
@mattklein123 addressed comments. |
| const std::string& upstreamCluster() { return upstream_cluster_; } | ||
| Runtime::Loader& runtime() { return runtime_; } | ||
| FaultFilterStats& stats() { return stats_; } | ||
| std::unordered_set<std::string>& downstreamNodes() { return downstream_nodes_; } |
| not set, faults are injected for all downstream nodes. Downstream node name is taken from | ||
| :ref:`the HTTP x-envoy-downstream-service-node <config_http_conn_man_headers_downstream-service-node>` | ||
| header and compared against downstream_nodes list. | ||
|
|
There was a problem hiding this comment.
I suspect at some point this is going to need to be made regex capable to be generally useful. I would add a TODO for that. I think it's fine to do this in v1. We can possibly add when we add runtime support.
| } | ||
|
|
||
| if (!matchesDownstreamNodes(headers)) { | ||
| return FilterHeadersStatus::Continue; |
There was a problem hiding this comment.
I think I agree with this. @RomanDzhabarov for our use, can't we just use header match and then just add runtime control to whether we match on headers or not? Seems like there should be some compromise here that doesn't do two separate things.
Signed-off-by: Alan Chiu <achiu@lyft.com> For an explanation of how to fill out the fields, please see the relevant section in [PULL_REQUESTS.md](https://github.com/envoyproxy/envoy/blob/master/PULL_REQUESTS.md) Description: release: 0.3.1.08242020 Risk Level: low Testing: ci Docs Changes: n/a Release Notes: n/a [Optional Fixes #Issue] [Optional Deprecated:] Signed-off-by: JP Simard <jp@jpsim.com>
Signed-off-by: Alan Chiu <achiu@lyft.com> For an explanation of how to fill out the fields, please see the relevant section in [PULL_REQUESTS.md](https://github.com/envoyproxy/envoy/blob/master/PULL_REQUESTS.md) Description: release: 0.3.1.08242020 Risk Level: low Testing: ci Docs Changes: n/a Release Notes: n/a [Optional Fixes #Issue] [Optional Deprecated:] Signed-off-by: JP Simard <jp@jpsim.com>
**Description** Follow up on #1020 Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
Support "downstream_nodes" to allow injections only for specific downstream nodes, one of the use cases is to support fault injections for a canary downstream node only.
Allow reading downstream specific settings for faults.
@lyft/network-team