Skip to content

DNS: allow propagating DNS responses with no records back to caller (#20890)#21027

Merged
yanavlasov merged 21 commits intoenvoyproxy:mainfrom
wanlill:enodata-pr
May 6, 2022
Merged

DNS: allow propagating DNS responses with no records back to caller (#20890)#21027
yanavlasov merged 21 commits intoenvoyproxy:mainfrom
wanlill:enodata-pr

Conversation

@wanlill
Copy link
Copy Markdown
Contributor

@wanlill wanlill commented Apr 26, 2022

Currently ares resolver treats ARES_ENODATA and ARES_ENOTFOUND as failures and won't propagate the results back to the caller, this makes callers like StrictDnsClusterImpl not able to get a chance to detect the change when all records for a name are gone.

Signed-off-by: Wanli Li wanlil@netflix.com

Commit Message: DNS: allow propagating DNS responses with no records back to caller (#20890)
Additional Description: Currently ares resolver treats ARES_ENODATA and ARES_ENOTFOUND as failures and won't propagate the results back to the caller, this makes callers like strict_dns cluster not able to get a chance to detect the change when all records for a name are gone.
Risk Level: High, I did realized that some buggy resolver/recursor might return nxdomain or nodata on failure and this may result in callers like StrictDnsClusterImpl removing all the instances. LogicalDnsCluster and RedisCluster already have checks against success response with no records but we'll change the DNS behavior that can be observed by callers which don't have this check (e.g. StrictDnsClusterImpl, in a good way, this is exactly what initiated this change), some of these callers might don't want this behavior change.
Testing: New unit tests added to dns_impl_test.cc
Docs Changes: current.rst changed, added to minor changes section.
Release Notes:
Platform Specific Features:
Runtime guard: envoy.reloadable_features.cares_accept_nodata controls individual dns resolution request
Fixes: #20890

…nvoyproxy#20890)

Currently ares resolver treats ARES_ENODATA and ARES_ENOTFOUND as failures and won't propagate the results back to the caller,
this makes callers like strict_dns cluster not able to get a chance to detect the change when all records for a name are gone.

Signed-off-by: Wanli Li <wanlil@netflix.com>
@repokitteh-read-only
Copy link
Copy Markdown

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @lizan
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #21027 was opened by wanlill.

see: more, trace.

Signed-off-by: Wanli Li <wanlil@netflix.com>
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this. An API comment to get started.

/wait

Comment on lines +43 to +44
// Treat ARES_ENODATA and ARES_ENOTFOUND as valid responses and propagate the "no records" result back to the caller.
bool accept_nodata = 5;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this is the way it should work by default and this is a bug, however, this is a high risk change, and I think we should runtime guard it (feature flag). WDYT?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, this is a high risk change.

I wanted to add a runtime guard but then decided to make it a configurable thing by changing the proto and make it default off because:

  1. If this breaks someone then it probably implies their resolver/recursive might not behave correctly and it is probably a long term thing to fix the dns infra, meanwhile they will need to have a way to turn it off.
  2. IMO runtime guard is a temp thing and we'll flip it on and remove the guard at some time.

That said, I'm totally fine to add a guard and make it effective only if both the guard and the config are turned on, do we want that or just a runtime guard w/o the config option (i.e. change the default behavior but guard it with a flag)?

Copy link
Copy Markdown
Member

@mattklein123 mattklein123 Apr 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would change the default behavior, clearly document it, and add the runtime flag for this behavior change. We can add a config option later if someone complains about it because they have to flip the flag. APIs are forever and I would rather not add a new API unless we really need to. Thank you!

/wait

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM, thanks for the clarification! Can you tell me a little bit more about the "config option layer" you mentioned above? I'm afraid this is something I'm not familiar with (unless it means the proto change we've already been doing here 😀 )

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry s/layer/later :)

Signed-off-by: Wanli Li <wanlil@netflix.com>
Signed-off-by: Wanli Li <wanlil@netflix.com>
…ard flag on and off cases

Signed-off-by: Wanli Li <wanlil@netflix.com>
Copy link
Copy Markdown
Contributor

@yanavlasov yanavlasov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/wait

dual_resolution_ = true;
}

accept_nodata_ = Runtime::runtimeFeatureEnabled("envoy.reloadable_features.cares_accept_nodata");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move this into the constructor initializer list.

const DnsLookupFamily dns_lookup_family_;
// Queried for at construction time.
const AvailableInterfaces available_interfaces_;
bool accept_nodata_;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please mark it const and provide default initializer.

Signed-off-by: Wanli Li <wanlil@netflix.com>
Signed-off-by: Wanli Li <wanlil@netflix.com>
Signed-off-by: Wanli Li <wanlil@netflix.com>
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks generally LGTM with small comments.

/wait

// that the first lookup failed to return any addresses. Note that DnsLookupFamily::All issues
// both lookups concurrently so there is no need to fire a second lookup here.
if (dns_lookup_family_ == DnsLookupFamily::Auto) {
family_ = AF_INET;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this change and the one below needed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are not needed, I just found it's a bit confusing while debugging it since the lookup family_ has changed for the second request, happy to revert this.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK got it, makes sense. Fine to keep.

if (status != ARES_SUCCESS) {
ENVOY_LOG_EVENT(debug, "cares_resolution_failure",
"dns resolution for {} failed with c-ares status {}", dns_name_, status);
if (!accept_nodata_ || !isResponseWithNoRecords(status)) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you can merge this if into the one above.

const DnsLookupFamily dns_lookup_family_;
// Queried for at construction time.
const AvailableInterfaces available_interfaces_;
const bool accept_nodata_{false};
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

initializer not needed as you initialize in the constructor.

}

class DnsImplTest : public testing::TestWithParam<Address::IpVersion> {
std::vector<std::tuple<Address::IpVersion, bool>> paramGenerator() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to parameterize every test. Can you just fix all the tests for the new behavior and then add a specific test with a reverted scoped runtime to make sure the revert behavior works also?

@repokitteh-read-only
Copy link
Copy Markdown

Retrying Azure Pipelines:
Check envoy-presubmit didn't fail.

🐱

Caused by: a #21027 (comment) was created by @wanlill.

see: more, trace.

@wanlill wanlill requested a review from yanavlasov May 2, 2022 23:00
// Treat `ARES_ENODATA` or `ARES_ENOTFOUND` here as success to populate back the
// "empty records" response.
pending_response_.status_ = ResolutionStatus::Success;
ASSERT(addrinfo == nullptr);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure the caller always set addrinfo to be nullptr when the status is ARES_ENODATA or ARES_ENOTFOUND?

Others LGTM.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so, otherwise the current DnsImpl should have been leaking memory in the non-success callback branch, c-ares doc and code for reference.

@yanavlasov
Copy link
Copy Markdown
Contributor

/retest

@repokitteh-read-only
Copy link
Copy Markdown

Retrying Azure Pipelines:
Check envoy-presubmit didn't fail.

🐱

Caused by: a #21027 (comment) was created by @yanavlasov.

see: more, trace.

@yanavlasov
Copy link
Copy Markdown
Contributor

I think something got stuck on AZP. Can you add an empty commit or merge main to restart CI please? I will approve and merge after that.

@yanavlasov
Copy link
Copy Markdown
Contributor

/wait

Signed-off-by: Wanli Li <wanlil@netflix.com>
Signed-off-by: Wanli Li <wanlil@netflix.com>
@yanjunxiang-google
Copy link
Copy Markdown
Contributor

LGTM. Defer to @yanavlasov for approval and merge.

Signed-off-by: Wanli Li <wanlil@netflix.com>
@wanlill
Copy link
Copy Markdown
Contributor Author

wanlill commented May 5, 2022

@yanavlasov friendly ping, test is green, can you help to take another look?

@yanavlasov
Copy link
Copy Markdown
Contributor

Ok sorry for the delay. One file has merge conflict. Can yo resolve please? You can ping me on Slack and I will enable automerge.

@yanavlasov
Copy link
Copy Markdown
Contributor

/wait

@wanlill
Copy link
Copy Markdown
Contributor Author

wanlill commented May 5, 2022

/retest

@repokitteh-read-only
Copy link
Copy Markdown

Retrying Azure Pipelines:
Check envoy-presubmit isn't fully completed, but will still attempt retrying.
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #21027 (comment) was created by @wanlill.

see: more, trace.

@wanlill
Copy link
Copy Markdown
Contributor Author

wanlill commented May 5, 2022

/retest

@repokitteh-read-only
Copy link
Copy Markdown

Retrying Azure Pipelines:
Check envoy-presubmit isn't fully completed, but will still attempt retrying.
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #21027 (comment) was created by @wanlill.

see: more, trace.

@wanlill
Copy link
Copy Markdown
Contributor Author

wanlill commented May 5, 2022

/retest

@repokitteh-read-only
Copy link
Copy Markdown

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #21027 (comment) was created by @wanlill.

see: more, trace.

@yanavlasov yanavlasov merged commit 9160479 into envoyproxy:main May 6, 2022
ravenblackx pushed a commit to ravenblackx/envoy that referenced this pull request Jun 8, 2022
…nvoyproxy#20890) (envoyproxy#21027)

DNS: allow propagating DNS responses with no records back to caller (envoyproxy#20890)

Currently ares resolver treats ARES_ENODATA and ARES_ENOTFOUND as failures and won't propagate the results back to the caller,
this makes callers like strict_dns cluster not able to get a chance to detect the change when all records for a name are gone.

Signed-off-by: Wanli Li <wanlil@netflix.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

STRICT_DNS clusters are incorrectly handling DNS responses with no records

5 participants