xds: add support for TTLs #13201
Signed-off-by: Bill Gallagher <bgallagher@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
htuch left a comment
Glad to see this landing! Flushing some nits/comments.
I think this is ready for another pass (the failing CI task is GCC running out of disk space).
Talking with @kyessenov over at envoyproxy/go-control-plane#359, he suggested that maybe TTLs should be fully client-driven and not rely on the server to periodically push the same version to keep the resource alive. The way this PR works right now, if we waited for the server to push a new version we'd flap between adding and removing the resource, as there is no delay between the TTL expiration and the resource removal. To get this working without server involvement, we could have Envoy either re-request some time before the TTL with an empty version, or delay the removal of the resources by fetch_timeout after the TTL expires to allow for an xDS push to cancel the removal. This would increase the complexity of the TTL implementation in Envoy, but would make this just work for existing control planes. Thoughts? @mattklein123 @wgallagher @htuch
I don't understand this part. Why would there be flapping as long as we get the same resource / version before the TTL expires? I thought the original idea was to respond with the same resource version but empty resources, to avoid a large amount of data on the wire.
This seems potentially like an OK approach, but then I think we need to have a clear "not modified" response from the server to avoid sending the same data over and over again? How would this work?
Sorry, I meant this is the behavior today if we enabled TTLs on the server without pushing heartbeats. The TTL timer would fire, we'd call onConfigUpdate without the resource(s), forget about them and do an xDS push, which the server would respond to, and eventually Envoy would add back the resource(s). I think I had missed the part about wanting to send a heartbeat response that only contained the version, which makes server awareness of TTLs even more important. I'll make sure we test this case in this PR.
@snowp I think that since TTLs are opt-in, this won't affect existing management servers. Those that opt in should send heartbeats or expect bad things to happen. I think if we want to go with a more elaborate client-driven scheme, we basically recreate etags, which is an option, but has somewhat high complexity and still requires some server support. We can always add this in later as well, I think, if we adopt the current TTL semantics, since Envoy can re-request some delta away from expiration. Please make sure we cover this in depth in https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol.
+1 this is my feeling also. I think if the server drives it, it's simpler, especially if we don't resend the full resource, and just the same resource version to force the heartbeat.
I'm a little wary of pushing more complexity on the control plane. It seems that SotW TTL comes for free, since the control plane can delete un-referenced resources with any update. Is this primarily an issue with delta xDS? It's not easy to keep up with the protocol changes (Istio is still on SotW, and go-control-plane is just starting its delta implementation).
@mattklein123 @wgallagher either of you wanna take a look at this as a second reviewer?
I can take a look.
mattklein123 left a comment
Thanks LGTM at a high level with one question. Very excited to see this feature about to land.
/wait-any
google.protobuf.Any resource = 2;

// Time-to-live value for the resource. For each resource, a timer is started. The timer is reset each time the
// resource is received with a new TTL. If the resource is received with no TTL set, the timer is removed for the
Where did we land with more efficiently updating the TTL without requiring full wire transmission of the resource which could be large? I thought we had discussed allowing an empty resource, but same version and TTL to update the TTL, kind of like an etag?
My concern with not doing this now is it may be more confusing to retrofit this later. Is it worth considering doing this optimization now? As a follow up later if needed? Or do we not think it will be needed at all?
@mattklein123 The problem is that VHDS uses empty messages already: #13201 (comment).
This means that we'll have to do some work to untangle this bit of tech debt, as otherwise we don't have a way of disambiguating a heartbeat response and a valid VHDS update. Some options are mentioned in the linked thread, but none of them seem amazing to me.
It would be pretty nice to have these special updates for SotW control planes (e.g. go-control-plane), as special heartbeat responses would allow us to respond with just the TTL'd resources in the response without worrying about whether the resource type is a collection (CDS/LDS) or not.
I think we could just add an additional bool that clearly indicates a heartbeat response? I don't love it as it should be a oneof, etc. but I think it would work. If we go this route I guess this could be done later?
I think adding a bool is conceptually the easiest thing, though it leaves the API messier than it could be if we could somehow get rid of the VHDS case. Would we be worried about clients that don't know about this field interpreting a heartbeat update as a real update? I guess we could use a client capability to guard this, though I'm not sure if most control planes check this in practice.
@htuch @markdroth in case either of you have more opinions here
Rather than adding a bool to the API, I would prefer to just special-case the VHDS behavior for now, with the understanding that we'll eliminate that special case as part of migrating VHDS from aliases to the new udpa naming scheme.
I'm taking a look at what it would take to intercept all heartbeat responses except the VHDS ones in the mux impls. If that isn't too bad, then I'll just include that in this PR.
OK sounds good. Thanks for looking into this. At minimum, if we don't implement it now, add some TODOs?
/wait
Alright I pushed the relevant changes with a few tests.
I realized that the only reason why the VHDS thing is an issue is that the delta code will trigger updates even if the version of the updated resources didn't change. It might be okay to say that if the resource is empty && the version didn't change, we won't trigger the subscription callbacks. This would break VHDS implementations that don't increment the version when they send delta updates, but it might be that those are rare enough that we can get away with it?
> This would break VHDS implementations that don't increment the version when they send delta updates, but it might be that those are rare enough that we can get away with it?
Add a runtime flag and release note and if no one complains we can just remove it later?
I think it's a violation of the spec to send new contents without changing the version. I'd be totally fine breaking any such usage, even without a flag.
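A minimal sketch of the rule this thread converges on: an empty resource whose version matches the one already held only refreshes the TTL, and anything else counts as a real update. The `Resource`, `UpdateKind`, and `VersionTracker` names are hypothetical stand-ins for illustration, not Envoy's actual mux types.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical, simplified stand-in for the xDS Resource wrapper.
struct Resource {
  std::string name;
  std::string version;
  bool has_body; // false models a Resource whose `resource` field is unset
};

enum class UpdateKind { Heartbeat, RealUpdate };

// Remembers the last accepted version for each resource name.
class VersionTracker {
public:
  // A response is a heartbeat only when it carries no body AND repeats the
  // version already held. Everything else is treated as a real update.
  UpdateKind classify(const Resource& r) {
    const auto it = versions_.find(r.name);
    const bool same_version = it != versions_.end() && it->second == r.version;
    if (!r.has_body && same_version) {
      return UpdateKind::Heartbeat; // refresh the TTL only, skip the callbacks
    }
    versions_[r.name] = r.version;
    return UpdateKind::RealUpdate;
  }

private:
  std::map<std::string, std::string> versions_;
};
```

Under this sketch, a VHDS-style empty update that does bump the version is still delivered as a real update; only an empty update that reuses the version is swallowed, which is the case described above as a spec violation.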
mattklein123 left a comment
Thanks this LGTM. I will defer to @htuch @markdroth for final approval.
markdroth left a comment
This looks really good! My remaining comments here are all minor things, mostly related to the docs.
// Time-to-live value for the resource. For each resource, a timer is started. The timer is reset each time the
// resource is received with a new TTL. If the resource is received with no TTL set, the timer is removed for the
// resource. Upon expiration of the timer, the configuration for the resource will be removed.
Should we add a comment here documenting the fact that TTLs can be updated/removed by re-sending the Resource with the same version but without populating the resource field itself?
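The timer semantics in the proto comment above can be sketched roughly as follows. This is only an illustration under stated assumptions: `TtlTable` and its methods are hypothetical names, and time is passed in explicitly instead of using Envoy's dispatcher timers.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch of per-resource TTL bookkeeping.
class TtlTable {
public:
  using Ms = std::chrono::milliseconds;

  // Called whenever a resource (or a heartbeat for it) arrives at time `now`.
  // has_ttl == false models a response with the TTL field unset.
  void onResource(const std::string& name, bool has_ttl, Ms ttl, Ms now) {
    if (has_ttl) {
      deadlines_[name] = now + ttl; // start or reset the timer
    } else {
      deadlines_.erase(name); // no TTL set: the timer is removed
    }
  }

  // Returns the names whose timers have fired by `now` and forgets them.
  // The caller would then remove the configuration for those resources.
  std::vector<std::string> expire(Ms now) {
    std::vector<std::string> expired;
    for (auto it = deadlines_.begin(); it != deadlines_.end();) {
      if (it->second <= now) {
        expired.push_back(it->first);
        it = deadlines_.erase(it);
      } else {
        ++it;
      }
    }
    return expired;
  }

private:
  std::map<std::string, Ms> deadlines_;
};
```

A heartbeat maps onto `onResource` with the same name and a fresh TTL, so the deadline moves forward without the resource body being resent.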
new TTL. To remove the TTL, the management server resends the resource with the TTL field unset.

To allow for lightweight TTL updates ("heartbeats"), a response can be sent that provides a
:ref:`Resource <envoy_api_msg_Resource>` with the resource unset and version matching the
In the phrase "with the resource unset", let's make the word "resource" a link to the specific field in the Resource message, just to make sure this is unambiguous.
To allow for lightweight TTL updates ("heartbeats"), a response can be sent that provides a
:ref:`Resource <envoy_api_msg_Resource>` with the resource unset and version matching the
clients version can be used to update the TTL. These resources will not be treated as resource
Suggest changing "clients version" to "most recently sent version".
the xDS change, Envoy will remove the resource after a TTL specified by the server. See the
:ref:`protocol documentation <xds_protocol_ttl>` for more information.

Currently the behavior when a TTL expires is that the resource is *expired* (as opposed to reverted to the
Maybe change "the resource is expired" to "the resource is removed"? I'm thinking that might make the meaning a bit clearer.
within the response, while for SotW xDS the server may wrap individual resources listed in the response within a
:ref:`Resource <envoy_api_msg_Resource>` in order to specify a TTL value.

The server can refresh or modify the TTL by issuing another response for the same version. Note that the entire resource
I think the second sentence here needs to be removed.
Event::Dispatcher& dispatcher)
// TODO(snowp): Hard coding VHDS here is temporary until we can move it away from relying on
// empty resources as updates.
: supports_heartbeats_(!absl::StrContains(type_url, "VirtualHost")),
Maybe specify the full type_url string instead of doing a substring match here, just in case we ever add a new resource type that happens to have "VirtualHost" as a substring?
Yeah that's a good call, will do
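A small sketch of the difference between the two checks, using only the standard library. The type URL constant is an assumption (the v3 VirtualHost name) and should be checked against the API version in use; both function names are hypothetical.

```cpp
#include <cassert>
#include <string>

// Assumed type URL for the v3 VirtualHost resource; verify before relying on it.
const char kVirtualHostTypeUrl[] = "type.googleapis.com/envoy.config.route.v3.VirtualHost";

// Substring match, as in the snippet under review: every type URL that
// contains "VirtualHost" loses heartbeat support.
bool supportsHeartbeatsSubstring(const std::string& type_url) {
  return type_url.find("VirtualHost") == std::string::npos;
}

// Exact match, as suggested in review: only the VHDS type URL is special-cased.
bool supportsHeartbeatsExact(const std::string& type_url) {
  return type_url != kVirtualHostTypeUrl;
}
```

With a made-up type such as `type.googleapis.com/example.VirtualHostGroup`, the substring check disables heartbeats while the exact check leaves them enabled, which is the scenario the review comment guards against.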
mattklein123 left a comment
LGTM with small typo. Feel free to fix in a follow up if you want.
// light-weight "heartbeat" updates to keep a resource with a TTL alive.
//
// The TTL feature is meant to support configurations that should be removed in the event of
// a management server // failure. For example, the feature may be used for fault injection
Yea will fix in follow up just to land this, thanks!
Adds support for TTLs on both Delta and SotW xDS: for Delta, we provide per-resource TTLs, while SotW has a per-API TTL. This allows the server to direct Envoy to remove the resources in the case of control plane unavailability.
Additional Description:
Risk Level: Medium/Low, new feature
Testing: UTs
Docs Changes: Updated XDS doc
Release Notes: Added
Fixes #7868