Skip to content

dns: Create new field for delayed DNS resolution#35479

Closed
keithmattix wants to merge 4 commits intoenvoyproxy:mainfrom
keithmattix:dns/on-demand/add-delayed-dns-field
Closed

dns: Create new field for delayed DNS resolution#35479
keithmattix wants to merge 4 commits intoenvoyproxy:mainfrom
keithmattix:dns/on-demand/add-delayed-dns-field

Conversation

@keithmattix
Copy link
Copy Markdown
Contributor

@keithmattix keithmattix commented Jul 29, 2024

Part of #20562
Commit Message: Create new field for on-demand DNS resolution
Additional Description: This is the first step in providing a sort of on-demand DNS resolution for Envoy. I chose a cluster-level enum field over changes in dns_resolution configuration because the enqueueing behavior of the async DNS requests seem to be a property of the cluster and not any specific DNS resolver (e.g. c-ares). In general, I'm looking to get a thumbs up on the API design before diving into the implementation.
Risk Level: Low
Testing:
Docs Changes: Should be autogenerated from the proto.
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]

@repokitteh-read-only
Copy link
Copy Markdown

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @markdroth
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #35479 was opened by keithmattix.

see: more, trace.

This is the first step in providing a sort of on-demand DNS resolution
for Envoy. I chose a cluster-level enum field over changes in
dns_resolution configuration because the enqueueing behavior
of the async DNS requests seem to be a property of the cluster and not
any specific DNS resolver (e.g. c-ares).

Signed-off-by: Keith Mattix II <keithmattix@microsoft.com>
@keithmattix keithmattix changed the title Create new field for delayed DNS resolution dns: Create new field for delayed DNS resolution Jul 29, 2024
@keithmattix keithmattix force-pushed the dns/on-demand/add-delayed-dns-field branch from d1ca53e to 915fddc Compare July 29, 2024 20:37
Signed-off-by: Keith Mattix II <keithmattix@microsoft.com>
@soulxu
Copy link
Copy Markdown
Member

soulxu commented Jul 30, 2024

/assign @ggreenway

@keithmattix
Copy link
Copy Markdown
Contributor Author

/retest

Copy link
Copy Markdown
Member

@ggreenway ggreenway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/wait-any

Comment thread api/envoy/config/cluster/v3/cluster.proto Outdated
Comment thread api/envoy/config/cluster/v3/cluster.proto Outdated
@ggreenway
Copy link
Copy Markdown
Member

As I commented in the linked issue, it probably makes sense to make this applicable to other service discovery types as well (such as EDS). Both the behavior of not trying to fetch results until it's used, and the future mode of stopping service discovery if the cluster remains unused for a period of time.

@keithmattix
Copy link
Copy Markdown
Contributor Author

keithmattix commented Jul 31, 2024

As I commented in the linked issue, it probably makes sense to make this applicable to other service discovery types as well (such as EDS). Both the behavior of not trying to fetch results until it's used, and the future mode of stopping service discovery if the cluster remains unused for a period of time.

Hm, I understand the thrust of the idea, but IMO, DNS and xDS are different enough service discovery mechanisms to have this behavior split out. On-demand EDS is a far more invasive change IMO, but I could be wrong.

EDIT Looking back at the cluster type, I would actually expect any sort of EDS config to live within eds_cluster_config

@ggreenway
Copy link
Copy Markdown
Member

I'll defer to @markdroth as the api-shepherd to review the API changes, and determine whether other discovery types should be governed by the same config or not.

@ggreenway ggreenway removed their assignment Aug 2, 2024
@RyanTheOptimist
Copy link
Copy Markdown
Contributor

@markdroth friendly ping

@RyanTheOptimist
Copy link
Copy Markdown
Contributor

/assign-from @envoyproxy/api-shepherds

@repokitteh-read-only
Copy link
Copy Markdown

@envoyproxy/api-shepherds assignee is @htuch

🐱

Caused by: a #35479 (comment) was created by @RyanTheOptimist.

see: more, trace.

Signed-off-by: Keith Mattix II <keithmattix@microsoft.com>
@keithmattix keithmattix force-pushed the dns/on-demand/add-delayed-dns-field branch from 41c62e4 to c650a26 Compare August 9, 2024 16:49
Signed-off-by: Keith Mattix II <keithmattix@microsoft.com>
@keithmattix
Copy link
Copy Markdown
Contributor Author

/retest

@keithmattix keithmattix requested review from ggreenway and htuch August 12, 2024 18:39
Copy link
Copy Markdown
Contributor

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for not replying earlier; I've been on vacation for the last few weeks.

Please let me know if you have any questions. Thanks!

// the DNS resolution will affect the first connection's RTT. Consequently, failure to resolve DNS before the connect timeout
// will result in a connection reset. This mode is useful for scenarios where large numbers of clusters must be sent to Envoy for config
// practicality reasons, but traffic distribution is not uniform across all clusters.
DelayedDns delayed_dns = 58;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our long-term goal in xDS is to move most custom functionality like this into extension points instead of continually adding first-class fields for one-off features to the top-level API. Given that, I don't think we want to add yet another new DNS-specific field to the Cluster proto itself. Instead, I suggest taking this opportunity to move some of the cluster types into extensions. Specifically, I think we should do the following:

  1. Create an extension proto in extensions/clusters/ for DNS clusters. This proto should have all of the existing fields from the Cluster proto that control DNS behavior. It can also have a field to toggle between STRICT_DNS and LOGICAL_DNS behavior (or, alternatively, we could make them two different extensions, although we'd wind up duplicating a lot of the fields that way).
  2. Change the cluster resource validation code to automatically convert the legacy STRICT_DNS and LOGICAL_DNS enum values to the new extension.
  3. Add this new field in the new extension proto instead of adding it here in the Cluster proto.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that approach sounds reasonable to me! Is there somewhere I can read about the difference between extensions vs main protos and the benefit of keeping these things in extensions?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know that we've explicitly written it down anywhere, but the main impetus is just to keep the API tractable. We can't keep adding every one-off knob that everyone wants as first-class fields, because the API will quickly become too complex for anyone to understand. Instead, we want to have a very simple overall structure with individual features in extension points, so that people that don't need a particular extension never need to see or care about the fields related to that extension.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack makes perfect sense. I'll keep this PR parked and create a new one like extensions/clusters/dns.proto

@adisuissa
Copy link
Copy Markdown
Contributor

/wait

@Stevenjin8
Copy link
Copy Markdown
Contributor

@markdroth I'm taking this issue over for @keithmattix. I'm a little confused by

Change the cluster resource validation code to automatically convert the legacy STRICT_DNS and LOGICAL_DNS enum values to the new extension.

My initial understanding is that I would create a Envoy::Config::ConfigValidator. In the validate(const Server::Instance& server, const std::vector<Envoy::Config::DecodedResourcePtr>& resources) method, do something like

cluster::v3::Config config = dynamic_cast<cluster::v3::Config>(resources)
if (config.type == STRICT_DNS or LOGICAL_DNS) {
   config.dns_cluster.respect_dns_ttl = config.respect_dns_ttl
   // etc for other fields, and raising errors if there are inconsistencies.
}

But the issue that the resources is const, so I can't actually change it.

So my question is, is my understanding correct? or is there another validation point that you are referring to?

@markdroth
Copy link
Copy Markdown
Contributor

I'm not the right person to answer how this should work in terms of the Envoy implementation.

From the xDS API standpoint, the intent here is that the "real" way to configure this should be via a new-style extension, and support for the old-style enum is present only for backward compatibility. The idea is that when the client sees a proto using the old-style enum, it should effectively change the received config from the old-style enum to the new style extension, and then process it as if it had actually received the new-style extension. That translation is really just there for backward compatibility, but it allows us to have only one implementation of actually supporting the functionality.

I know that @wbpcode previously did something like this for moving LB policies from an enum to an extension point, so he can probably advise in terms of the best way to do this in the Envoy implementation.

@Stevenjin8
Copy link
Copy Markdown
Contributor

going to finish this in # #36353

@markdroth markdroth mentioned this pull request Oct 4, 2024
5 tasks
@github-actions
Copy link
Copy Markdown

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions Bot added the stale stalebot believes this issue/PR has not been touched recently label Oct 26, 2024
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Nov 2, 2024

This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions Bot closed this Nov 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api stale stalebot believes this issue/PR has not been touched recently waiting

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants