A33: Client-Side Fault Injection#201
Conversation
|
|
||
| 1. Fixed delay before forwarding the request; | ||
| 1. Overridden gRPC status code; | ||
| 1. Traffic matcher (upstream cluster name, downstream cluster name, and headers); |
There was a problem hiding this comment.
There isn't any detail in here as to what is covered by this part of the feature.
In the envoy docs, I see:
"upstream_cluster": "...",
"headers": [],
"downstream_nodes": [],
upstream_cluster makes sense - If set, we'd match on the cluster chosen by the config selector.
headers seems straightforward based on the envoy docs: "A match will happen if all the headers in the config are present in the request with the same values (or based on presence if the value field is not in the config)."
downstream_nodes, though, I'm not sure what to do with. Is this what you meant by "downstream cluster name"? For this it sounds like we would just match the value(s? - what if there are multiple) of the header "x-envoy-downstream-service-node" with the settings in "downstream_nodes" in the fault injection config.
But where does this header come from? It looks like it's set by envoy, not the user. Based on "the value is taken from the --service-node option" [ref]. We aren't setting this today as far as I'm aware, so does it make sense to support this feature?
There was a problem hiding this comment.
For the traffic matcher, I updated the design doc that I think that only header-matching is common enough. Both the upstream cluster or downstream nodes features will postponed until users are asking for them.
The headers (or our metadata) can be set either by upstream Envoy or the application. I'm not sure how much will that make.
There was a problem hiding this comment.
The headers (or our metadata) can be set either by upstream Envoy or the application.
The ones for the header matchers may be set by the application. But the one for the "downstream node" feature (which I now see we aren't supporting here) seems to be set only by Envoy, per the ref above. I'm not clear on what else this may be used for, and I am wondering if setting it is something we need to do in gRPC (or will need to do soon for other reasons).
There was a problem hiding this comment.
To support this the downstream node matching, I think we will need to add this header injection as well. Technically, gRPC's XdsClient can add x-envoy-downstream-service-node or service-node header with it's Node info for all processed requests?
But, for this proposal, I hope to focus on the most common usages of the fault injection filter, which is delay, abort, match header, and max active faults.
| // delay feature is disabled and delay_fraction will be ignored. | ||
| google.protobuf.Duration delay = 1; | ||
|
|
||
| // The possibility that this RPC will be delayed. |
There was a problem hiding this comment.
Should it say probability?
There was a problem hiding this comment.
You are right, probability is more accurate here. Updated.
There was a problem hiding this comment.
Even probability doesn't quite sound right in this context. What we are really talking about is, for example, out of 1 million RPCs delay will be introduced for these many RPCs (delay_fraction). If this is done on a random basis then probability applies to a given RPC being delayed because of this feature. However the number being described here is the "fraction" of RPCs (out of a million) that will be delayed by this feature.
markdroth
left a comment
There was a problem hiding this comment.
Thanks for doing this, Lidi. My comments here are mostly about overall organization and updates based on the HTTP filter design.
Please let me know if you have any questions. Thanks!
lidizheng
left a comment
There was a problem hiding this comment.
@markdroth Thanks for the advices! I have visited all comments, but would like to discuss whether to keep mentioning retry or not in the comment threads.
markdroth
left a comment
There was a problem hiding this comment.
This looks pretty good! Just a few comments remaining.
|
This looks great! Just one more thing: Please add a small section about the env var guard, similar to what I added in the HTTP filter gRFC in 7eaa156. |
|
@markdroth The env guard section is added. It also mentioned A39 since this flag will flip both features on. |
| Envoy implements fault injection is an `HTTPFilter`, which can choose to operate on bytes or HTTP messages or events. It is unclear if gRPC can perform the precise same behavior, since gRPC generally operates on proto messages (only Core has visibility to bytes, not Java/Golang). More importantly, there is not enough consensus for if the timeout timer should start at the beginning of an RPC, or the receipt of the first bytes from peer, or after the entire message is received. | ||
|
|
||
|
|
||
| ### Environment variable protection |
There was a problem hiding this comment.
This shouldn't be in the "Alternatives" section. Please move it up to the "Proposal" section, probably line 120.
There was a problem hiding this comment.
Updated. Good catch!
markdroth
left a comment
There was a problem hiding this comment.
One minor change left, but otherwise this looks great! I'll go ahead and approve.
Thanks for doing this!
| According to the [Envoy implementation](https://github.com/envoyproxy/envoy/blob/main/source/extensions/filters/http/fault/fault_filter.cc#L222), each feature of fault injection is checked with a different random number. Compared to using a single random number for all features, this affects the probability of an RPC being affected by fault injection. For example, the fault injection has 20% chance to delay and 5% chance to abort. Under the single random number solution, only 20% RPC will be affected. But if two features are independent, 24% RPC will be affected. So, **this doc suggests to make each feature independent by using a different random number each time.** | ||
|
|
||
|
|
||
| ### Environment Variable Protection |
There was a problem hiding this comment.
Please make changes similar to what I did in af7b012 to reflect the fact that this is just temporary.
There was a problem hiding this comment.
Done. Using the same words as your commit.
|
@dfawley @dapengzhang0 Added a section about the fault injection header percentage is capped by the effective fault injection filter config. Link to the implementation PR envoyproxy/envoy#10724 |
gRFC: https://github.com/lidizheng/proposal/blob/fault-injection/A33-Fault-Injection.md