DNS: allow propagating DNS responses with no records back to caller (#20890) #21027
yanavlasov merged 21 commits into envoyproxy:main
…nvoyproxy#20890) Currently the ares resolver treats ARES_ENODATA and ARES_ENOTFOUND as failures and won't propagate the results back to the caller. This leaves callers like the strict_dns cluster with no chance to detect the change when all records for a name are gone.
Signed-off-by: Wanli Li <wanlil@netflix.com>
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to |
Signed-off-by: Wanli Li <wanlil@netflix.com>
mattklein123
left a comment
Thanks for working on this. An API comment to get started.
/wait
// Treat ARES_ENODATA and ARES_ENOTFOUND as valid responses and propagate the "no records" result back to the caller.
bool accept_nodata = 5;
IMO this is the way it should work by default and this is a bug, however, this is a high risk change, and I think we should runtime guard it (feature flag). WDYT?
+1, this is a high risk change.
I wanted to add a runtime guard but then decided to make it a configurable thing by changing the proto and make it default off because:
- If this breaks someone then it probably implies their resolver/recursive might not behave correctly and it is probably a long term thing to fix the dns infra, meanwhile they will need to have a way to turn it off.
- IMO runtime guard is a temp thing and we'll flip it on and remove the guard at some time.
That said, I'm totally fine to add a guard and make it effective only if both the guard and the config are turned on, do we want that or just a runtime guard w/o the config option (i.e. change the default behavior but guard it with a flag)?
I think I would change the default behavior, clearly document it, and add the runtime flag for this behavior change. We can add a config option later if someone complains about it because they have to flip the flag. APIs are forever and I would rather not add a new API unless we really need to. Thank you!
/wait
SGTM, thanks for the clarification! Can you tell me a little bit more about the "config option layer" you mentioned above? I'm afraid this is something I'm not familiar with (unless it means the proto change we've already been doing here 😀 )
Signed-off-by: Wanli Li <wanlil@netflix.com>
Signed-off-by: Wanli Li <wanlil@netflix.com>
…ard flag on and off cases Signed-off-by: Wanli Li <wanlil@netflix.com>
dual_resolution_ = true;
}

accept_nodata_ = Runtime::runtimeFeatureEnabled("envoy.reloadable_features.cares_accept_nodata");
Please move this into the constructor initializer list.
const DnsLookupFamily dns_lookup_family_;
// Queried for at construction time.
const AvailableInterfaces available_interfaces_;
bool accept_nodata_;
Please mark it const and provide default initializer.
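The two review notes above (initialize in the constructor initializer list, mark the member `const`) could look roughly like the sketch below. This is an illustration only, not Envoy's code: `FakeRuntime` and `Resolver` are hypothetical stand-ins, and the default of `true` is an assumption based on the plan to change the default behavior. Only the flag name comes from this PR.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a runtime feature lookup; not Envoy's Runtime API.
struct FakeRuntime {
  std::map<std::string, bool> overrides;

  bool featureEnabled(const std::string& name, bool default_value) const {
    auto it = overrides.find(name);
    return it == overrides.end() ? default_value : it->second;
  }
};

// Hypothetical resolver: the flag is read once, in the initializer list,
// which is what allows the member to be declared const.
class Resolver {
public:
  explicit Resolver(const FakeRuntime& runtime)
      : accept_nodata_(runtime.featureEnabled(
            "envoy.reloadable_features.cares_accept_nodata", /*default_value=*/true)) {}

  bool acceptNodata() const { return accept_nodata_; }

private:
  const bool accept_nodata_;
};
```

A consequence of this shape is that flipping the flag later does not affect a resolver that already exists; only newly constructed resolvers see the new value.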
Signed-off-by: Wanli Li <wanlil@netflix.com>
Signed-off-by: Wanli Li <wanlil@netflix.com>
Signed-off-by: Wanli Li <wanlil@netflix.com>
mattklein123
left a comment
Thanks generally LGTM with small comments.
/wait
// that the first lookup failed to return any addresses. Note that DnsLookupFamily::All issues
// both lookups concurrently so there is no need to fire a second lookup here.
if (dns_lookup_family_ == DnsLookupFamily::Auto) {
  family_ = AF_INET;
Why is this change and the one below needed?
They are not needed, I just found it's a bit confusing while debugging it since the lookup family_ has changed for the second request, happy to revert this.
OK got it, makes sense. Fine to keep.
if (status != ARES_SUCCESS) {
  ENVOY_LOG_EVENT(debug, "cares_resolution_failure",
                  "dns resolution for {} failed with c-ares status {}", dns_name_, status);
  if (!accept_nodata_ || !isResponseWithNoRecords(status)) {
nit: you can merge this if into the one above.
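For reference, the merged condition could be sketched as a single predicate like the one below. The `Status` and `Resolution` enums are stand-ins invented for this example (the real status values come from the c-ares headers), and `classify` is a hypothetical helper; only the name `isResponseWithNoRecords` appears in the diff.

```cpp
// Stand-in status codes for illustration; real code uses the c-ares constants.
enum class Status { Success, NoData, NotFound, ConnRefused };

enum class Resolution { Success, Failure };

// True for the two "name resolved, but no records" outcomes.
bool isResponseWithNoRecords(Status status) {
  return status == Status::NoData || status == Status::NotFound;
}

// Single merged check: a non-success status is a failure unless the
// "accept nodata" behavior is enabled and the status means "no records".
Resolution classify(Status status, bool accept_nodata) {
  if (status != Status::Success && (!accept_nodata || !isResponseWithNoRecords(status))) {
    return Resolution::Failure;
  }
  return Resolution::Success;
}
```

Note that in the diff the log statement sits inside the outer `if`, so merging the two conditions also changes when that log line is emitted.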
const DnsLookupFamily dns_lookup_family_;
// Queried for at construction time.
const AvailableInterfaces available_interfaces_;
const bool accept_nodata_{false};
initializer not needed as you initialize in the constructor.
}

class DnsImplTest : public testing::TestWithParam<Address::IpVersion> {
std::vector<std::tuple<Address::IpVersion, bool>> paramGenerator() {
I don't think we need to parameterize every test. Can you just fix all the tests for the new behavior and then add a specific test with a reverted scoped runtime to make sure the revert behavior works also?
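The "reverted scoped runtime" idea suggested here can be illustrated with a small RAII guard. Everything in this sketch is hypothetical (`flagTable`, `flagEnabled`, `ScopedFlagOverride`); it is not Envoy's test utility, just the general pattern of overriding a flag for one test and restoring it afterwards. Flags default to enabled in this sketch, which is an assumption.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical process-wide flag table; absent flags count as enabled.
std::map<std::string, bool>& flagTable() {
  static std::map<std::string, bool> table;
  return table;
}

bool flagEnabled(const std::string& name) {
  auto it = flagTable().find(name);
  return it == flagTable().end() ? true : it->second;
}

// RAII guard: overrides a flag for the current scope and restores the
// previous state (including "not set") when the scope ends.
class ScopedFlagOverride {
public:
  ScopedFlagOverride(std::string name, bool value) : name_(std::move(name)) {
    auto it = flagTable().find(name_);
    had_previous_ = it != flagTable().end();
    if (had_previous_) {
      previous_ = it->second;
    }
    flagTable()[name_] = value;
  }

  ~ScopedFlagOverride() {
    if (had_previous_) {
      flagTable()[name_] = previous_;
    } else {
      flagTable().erase(name_);
    }
  }

  ScopedFlagOverride(const ScopedFlagOverride&) = delete;
  ScopedFlagOverride& operator=(const ScopedFlagOverride&) = delete;

private:
  std::string name_;
  bool had_previous_{false};
  bool previous_{false};
};
```

With a guard like this, the bulk of the tests exercise the new default, and a single dedicated test wraps its body in an override set to `false` to cover the revert path.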
Retrying Azure Pipelines:
// Treat `ARES_ENODATA` or `ARES_ENOTFOUND` here as success to populate back the
// "empty records" response.
pending_response_.status_ = ResolutionStatus::Success;
ASSERT(addrinfo == nullptr);
Are we sure the caller always set addrinfo to be nullptr when the status is ARES_ENODATA or ARES_ENOTFOUND?
Others LGTM.
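One way to make the question above less critical is to build the result list so that a null pointer simply produces an empty list. The sketch below shows that pattern with a hypothetical `AddrNode` chain standing in for the resolver's address list; it is an alternative illustration and does not describe what the PR does.

```cpp
#include <string>
#include <vector>

// Hypothetical singly linked result list, standing in for an addrinfo chain.
struct AddrNode {
  std::string address;
  const AddrNode* next;
};

// Collects every address in the chain; a null head yields an empty list,
// so the "no records" case needs no special handling by the caller.
std::vector<std::string> collectAddresses(const AddrNode* head) {
  std::vector<std::string> out;
  for (const AddrNode* node = head; node != nullptr; node = node->next) {
    out.push_back(node->address);
  }
  return out;
}
```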
/retest
I think something got stuck on AZP. Can you add an empty commit or merge main to restart CI please? I will approve and merge after that.
/wait
Signed-off-by: Wanli Li <wanlil@netflix.com>
LGTM. Defer to @yanavlasov for approval and merge.
Signed-off-by: Wanli Li <wanlil@netflix.com>
@yanavlasov friendly ping, test is green, can you help to take another look?
Ok sorry for the delay. One file has a merge conflict. Can you resolve it please? You can ping me on Slack and I will enable automerge.
/wait
Commit Message: DNS: allow propagating DNS responses with no records back to caller (#20890)
Additional Description: Currently the ares resolver treats `ARES_ENODATA` and `ARES_ENOTFOUND` as failures and won't propagate the results back to the caller. This leaves callers like the strict_dns cluster with no chance to detect the change when all records for a name are gone.
Risk Level: High. I realize that some buggy resolvers/recursors might return nxdomain or nodata on failure, and this may result in callers like `StrictDnsClusterImpl` removing all the instances. `LogicalDnsCluster` and `RedisCluster` already check for a success response with no records, but this changes the DNS behavior observed by callers that don't have this check (e.g. `StrictDnsClusterImpl`; in a good way, since this is exactly what initiated this change). Some of these callers might not want this behavior change.
Testing: New unit tests added to dns_impl_test.cc
Docs Changes: current.rst changed, added to minor changes section.
Release Notes:
Platform Specific Features:
Runtime guard: `envoy.reloadable_features.cares_accept_nodata` controls individual dns resolution request
Fixes: #20890
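To make the risk described above concrete, here is a minimal sketch of how a caller's view changes once an empty answer is reported as a success. `HostTracker` is a hypothetical class written for this example, not `StrictDnsClusterImpl`; it only assumes that the caller ignores failures and replaces its host list on success.

```cpp
#include <string>
#include <vector>

enum class ResolutionStatus { Success, Failure };

// Hypothetical caller that keeps a host list in sync with DNS results.
class HostTracker {
public:
  void onResolve(ResolutionStatus status, const std::vector<std::string>& addresses) {
    if (status == ResolutionStatus::Failure) {
      return; // Failure: keep the last known hosts.
    }
    hosts_ = addresses; // Success: adopt the answer, even if it is empty.
  }

  const std::vector<std::string>& hosts() const { return hosts_; }

private:
  std::vector<std::string> hosts_;
};
```

Under the old behavior a "no records" answer arrives as a failure and the stale hosts stay in place; under the new behavior it arrives as a success with an empty list and the hosts are cleared. That is the intended fix, and also the reason the change is considered high risk when a resolver returns such answers spuriously.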