Skip to content

xds: add support for TTLs#13201

Merged
snowp merged 73 commits intoenvoyproxy:masterfrom
snowp:bgallagher/ttl
Oct 30, 2020
Merged

xds: add support for TTLs#13201
snowp merged 73 commits intoenvoyproxy:masterfrom
snowp:bgallagher/ttl

Conversation

@snowp
Copy link
Copy Markdown
Contributor

@snowp snowp commented Sep 21, 2020

Adds support for TTL on both Delta and SOTW xDS: for Delta, we provide per resource TTLs, while SOTW has a per
API TTL. This allows the server to direct Envoy to remove the resources in the case of control plane unavailability.

Additional Description:
Risk Level: Medium/Low, new feature
Testing: UTs
Docs Changes: Updated XDS doc
Release Notes: Added
Fixes #7868

Bill Gallagher and others added 8 commits September 18, 2020 10:31
Signed-off-by: Bill Gallagher <bgallagher@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
@repokitteh-read-only
Copy link
Copy Markdown

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/envoy/.
CC @envoyproxy/api-watchers: FYI only for changes made to api/envoy/.

🐱

Caused by: #13201 was opened by snowp.

see: more, trace.

Copy link
Copy Markdown
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Glad to see this landing! Flushing some nits/comments..

Comment thread api/xds_protocol.rst Outdated
Comment thread api/xds_protocol.rst Outdated
Comment thread docs/root/api/client_features.rst Outdated
Comment thread source/common/config/delta_subscription_state.cc Outdated
Snow Pettersen added 9 commits September 23, 2020 19:36
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
@snowp
Copy link
Copy Markdown
Contributor Author

snowp commented Sep 24, 2020

I think this is ready for another pass (failing CI task is GCC running out of disk space)

@snowp
Copy link
Copy Markdown
Contributor Author

snowp commented Sep 24, 2020

Talking with @kyessenov over at envoyproxy/go-control-plane#359 he suggested that maybe TTLs should be fully client driven and not rely on the server to periodically push the same version to keep it alive.

The way this PR is done right now, if we waited for the server to push a new version we'd flap between adding/removing the resource as there is no delay between the TTL expiration and the resource removal.

To get this working without server involvement we could have Envoy either re-request some time before the TTL with an empty version or delay the removal of the resources by fetch_timeout after the TTL expires to allow for a xDS push to cancel the removal. This would increase the complexity of the TTL implementation in Envoy, but would make this just work for existing control planes.

Thoughts? @mattklein123 @wgallagher @htuch

@mattklein123
Copy link
Copy Markdown
Member

The way this PR is done right now, if we waited for the server to push a new version we'd flap between adding/removing the resource as there is no delay between the TTL expiration and the resource removal.

I don't understand this part. Why would there be flapping as long as we get the same resource / version before the TTL expires? I thought the original idea was to response with the same resource version but empty resources to avoid large amount of data on the wire.

To get this working without server involvement we could have Envoy either re-request some time before the TTL with an empty version or delay the removal of the resources by fetch_timeout after the TTL expires to allow for a xDS push to cancel the removal. This would increase the complexity of the TTL implementation in Envoy, but would make this just work for existing control planes.

This seems potentially like an OK approach, but then I think we need to have a clear "not modified" response from the server to avoid sending the same data over and over again? How would this work?

@snowp
Copy link
Copy Markdown
Contributor Author

snowp commented Sep 24, 2020

I don't understand this part. Why would there be flapping as long as we get the same resource / version before the TTL expires? I thought the original idea was to response with the same resource version but empty resources to avoid large amount of data on the wire.

Sorry I meant this is the behavior today if we enabled TTLs on the server without pushing heartbeats. The TTL timer would fire, we'd call onConfigUpdate without the resource(s), forget about the them and do a xDS push, which the server would respond to and eventually Envoy would add back the resource(s).

I think I had missed the part about wanting to send a heartbeat response that only contained the version, which makes server awareness of TTLs even more important. I'll make sure we test this case in this PR.

@htuch
Copy link
Copy Markdown
Member

htuch commented Sep 24, 2020

@snowp I think this that since TTLs are opt-in, this won't affect existing management servers. Those that opt-in should send heartbeats or expect bad things to happen. I think if we want to go with a more elaborate client-driven scheme, we basically recreate etags, which is an option, but has somewhat high complexity and still requires some server support. We can always add this in later as well I think if we adopt the current TTL semantics, since Envoy can re-request some delta away from expiration.

Please make sure we cover this in depth in https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol.

@mattklein123
Copy link
Copy Markdown
Member

. I think if we want to go with a more elaborate client-driven scheme, we basically recreate etags, which is an option, but has somewhat high complexity and still requires some server support.

+1 this is my feeling also. I think if the server drives it, it's simpler, especially if we don't resend full resource, and just the same resource version to force the heartbeat.

@kyessenov
Copy link
Copy Markdown
Contributor

I'm a little wary of pushing more complexity on the control plane. It seems that SotW TTL comes for free, since control plane can delete un-referenced resources with any update. Is this primarily an issue with delta xDS? It's not easy to keep up with the protocol changes (istio is still on SotW, and go-control-plane is just starting delta implementation).

@snowp
Copy link
Copy Markdown
Contributor Author

snowp commented Oct 26, 2020

@mattklein123 @wgallagher either of you wanna take a look at this as a second reviewer?

@mattklein123
Copy link
Copy Markdown
Member

I can take a look.

@mattklein123 mattklein123 self-assigned this Oct 27, 2020
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks LGTM at a high level with one question. Very excited to see this feature about to land.

/wait-any

google.protobuf.Any resource = 2;

// Time-to-live value for the resource. For each resource, a timer is started. The timer is reset each time the
// resource is received with a new TTL. If the resource is received with no TTL set, the timer is removed for the
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where did we land with more efficiently updating the TTL without requiring full wire transmission of the resource which could be large? I thought we had discussed allowing an empty resource, but same version and TTL to update the TTL, kind of like an etag?

My concern with not doing this now is it may be more confusing to retrofit this later. Is it worth considering doing this optimization now? As a follow up later if needed? Or do we not think it will be needed at all?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mattklein123 The problem is that VHDS uses empty messages already: #13201 (comment).

This means that we'll have to do some work to untangle this bit of tech debt, as otherwise we don't have a way of disambiguating a heartbeat response and a valid VHDS update. Some options are mentioned in the linked thread, but none of them seem amazing to me.

It would be pretty nice to have these special updates for SotW control planes (e.g. go-control-plane), as special heartbeat responses would allow us to respond with just the TTL'd resources in the response without worrying about whether the resource type is a collection (CDS/LDS) or not.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could just add an additional bool that clearly indicates a heartbeat response? I don't love it as it should be a oneof, etc. but I think it would work. If we go this route I guess this could be done later?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding a bool is conceptually the easiest thing, though it leaves the API messier than it could be if we could somehow get rid of the VHDS case. Would we be worried about clients that don't know about this field interpreting a heartbeat update as a real update? I guess we could use a client capability to guard this, though I'm not sure if most control plane check this in practice.

@htuch @markdroth in case either of you have more opinions here

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than adding a bool to the API, I would prefer to just special-cae the VHDS behavior for now, with the understanding that we'll eliminate that special case as part of migrating VHDS from aliases to the new udpa naming scheme.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm taking a look at what it would take to intercept all heartbeat responses except the VHDS ones in the mux impls, if that isn't too bad then I'll just include that in this PR.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK sounds good. Thanks for looking into this. At minimum if we don't implement now add some TODOs?

/wait

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright I pushed the relevant changes with a few tests.

I realized that the only reason why the VHDS thing is an issue is that the delta code will trigger updates even if the version of the updated resources didn't change. It might be okay to say that if resource empty && version didnt change we won't trigger the subscription callbacks. This would break VHDS implementations that don't increment the version when it sends delta updates, but it might be that those are rare enough that we can get away with it?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would break VHDS implementations that don't increment the version when it sends delta updates, but it might be that those are rare enough that we can get away with it?

Add a runtime flag and release note and if no one complains we can just remove it later?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a violation of the spec to send new contents without changing the version. I'd be totally fine breaking any such usage, even without a flag.

Snow Pettersen added 3 commits October 27, 2020 22:01
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Snow Pettersen added 5 commits October 27, 2020 23:29
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
Signed-off-by: Snow Pettersen <snowp@lyft.com>
mattklein123
mattklein123 previously approved these changes Oct 28, 2020
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks this LGTM. I will defer to @htuch @markdroth for final approval.

markdroth
markdroth previously approved these changes Oct 30, 2020
Copy link
Copy Markdown
Contributor

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good! My remaining comments here are all minor things, mostly related to the docs.


// Time-to-live value for the resource. For each resource, a timer is started. The timer is reset each time the
// resource is received with a new TTL. If the resource is received with no TTL set, the timer is removed for the
// resource. Upon expiration of the timer, the configuration for the resource will be removed.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a comment here documenting the fact that TTLs can be updated/removed by re-sending the Resource with the same version but without populating the resource field itself?

Comment thread api/xds_protocol.rst Outdated
new TTL. To remove the TTL, the management server resends the resource with the TTL field unset.

To allow for lightweight TTL updates ("heartbeats"), a response can be sent that provides a
:ref:`Resource <envoy_api_msg_Resource>` with the resource unset and version matching the
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the phrase "with the resource unset", let's make the word "resource" a link to the specific field in the Resource message, just to make sure this in unambiguous.

Comment thread api/xds_protocol.rst Outdated

To allow for lightweight TTL updates ("heartbeats"), a response can be sent that provides a
:ref:`Resource <envoy_api_msg_Resource>` with the resource unset and version matching the
clients version can be used to update the TTL. These resources will not be treated as resource
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest changing "clients version" to "most recently sent version".

the xDS change, Envoy will remove the resource after a TTL specified by the server. See the
:ref:`protocol documentation <xds_protocol_ttl>` for more information.

Currently the behavior when a TTL expires is that the resource is *expired* (as opposed to reverted to the
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe change "the resource is expired" to "the resource is removed"? I'm thinking that might make the meaning a bit clearer.

within the response, while for SotW xDS the server may wrap individual resources listed in the response within a
:ref:`Resource <envoy_api_msg_Resource>` in order to specify a TTL value.

The server can refresh or modify the TTL by issuing another response for the same version. Note that the entire resource
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the second sentence here needs to be removed.

Event::Dispatcher& dispatcher)
// TODO(snowp): Hard coding VHDS here is temporary until we can move it away from relying on
// empty resources as updates.
: supports_heartbeats_(!absl::StrContains(type_url, "VirtualHost")),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe specify the full type_url string instead of doing a substring match here, just in case we ever add a new resource type that happens to have "VirtualHost" as a substring?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's a good call, will do

Signed-off-by: Snow Pettersen <snowp@lyft.com>
@snowp snowp dismissed stale reviews from markdroth and mattklein123 via 17820c0 October 30, 2020 16:16
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with small typo. Feel free to fix in a follow up if you want.

// light-weight "heartbeat" updates to keep a resource with a TTL alive.
//
// The TTL feature is meant to support configurations that should be removed in the event of
// a management server // failure. For example, the feature may be used for fault injection
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo here, double "//"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea will fix in follow up just to land this, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add TTL to v2 xDS

8 participants