Skip to content

dns_cache: unconditionally floor failure backoff at min_refresh_interval_#134

Draft
Copilot wants to merge 2 commits into
flake-dfp-nofrom
copilot/flak-dfp-no
Draft

dns_cache: unconditionally floor failure backoff at min_refresh_interval_#134
Copilot wants to merge 2 commits into
flake-dfp-nofrom
copilot/flak-dfp-no

Conversation

Copy link
Copy Markdown

Copilot AI commented Apr 23, 2026

Applies review feedback from @agrawroh on envoyproxy#44570: simplify the failure backoff capping logic in finishResolve by removing the conditional guard around the min_refresh_interval_ floor.

Changes

  • dns_cache_impl.cc: Drop the uncapped_refresh_interval local variable and the if (refresh_interval < uncapped_refresh_interval) guard; replace with a single unconditional std::max floor:
// Before
const auto uncapped_refresh_interval = refresh_interval;
// ... TTL cap logic ...
if (refresh_interval < uncapped_refresh_interval) {
  const auto min_refresh_ms =
      std::chrono::duration_cast<std::chrono::milliseconds>(min_refresh_interval_);
  refresh_interval = std::max(refresh_interval, min_refresh_ms);
}

// After
// ... TTL cap logic ...
refresh_interval = std::max(refresh_interval,
    std::chrono::duration_cast<std::chrono::milliseconds>(min_refresh_interval_));
  • Update the comment to drop the "if and only if" qualifier, reflecting that the floor now applies unconditionally.
Original prompt

Address review feedback from @agrawroh on PR envoyproxy#44570 ("shared_pool: fix leak when last reference drops on worker thread during dispatcher teardown"). The branch to work on is flake-dfp-no on this fork (phlax/envoy). Push commits directly to this branch so the upstream PR updates — do not open a new PR.

The files currently touched by the PR are:

  • source/common/shared_pool/shared_pool.h
  • source/extensions/common/dynamic_forward_proxy/dns_cache_impl.cc
  • test/extensions/common/dynamic_forward_proxy/dns_cache_impl_test.cc
  • changelogs/current.yaml

Please apply the following changes. Do not change unrelated code.


1. source/extensions/common/dynamic_forward_proxy/dns_cache_impl.cc — simplify and fix invariant

The current diff in finishResolve (around line 570) looks like:

const auto uncapped_refresh_interval = refresh_interval;
if (elapsed >= host_ttl_) {
  refresh_interval = std::chrono::milliseconds(0);
} else {
  const auto until_eviction =
      std::chrono::duration_cast<std::chrono::milliseconds>(host_ttl_ - elapsed);
  refresh_interval = std::min(refresh_interval, until_eviction);
}
// If (and only if) the host_ttl cap above reduced the failure backoff, floor the
// resulting interval at min_refresh_interval_ ...
if (refresh_interval < uncapped_refresh_interval) {
  const auto min_refresh_ms =
      std::chrono::duration_cast<std::chrono::milliseconds>(min_refresh_interval_);
  refresh_interval = std::max(refresh_interval, min_refresh_ms);
}
primary_host_info->refresh_timer_->enableTimer(refresh_interval);
ENVOY_LOG(debug, "DNS refresh rate reset for host '{}', (failure) refresh rate {} ms", host,
          refresh_interval.count());

Reviewer feedback (agrawroh):

  • Rename the local uncapped_refresh_intervalraw_backoff_ms.
  • Unconditionally floor at min_refresh_interval_. Replace the whole conditional floor block with a single line:
    refresh_interval = std::max(refresh_interval,
        std::chrono::duration_cast<std::chrono::milliseconds>(min_refresh_interval_));
    If min_refresh_interval_ is already std::chrono::milliseconds, drop the cast.
  • Preserve the eviction invariant that the previous PR added (a failed resolve's re-arm fires no later than the host becomes eligible for eviction). Reviewer's preferred option (Newer release available build_bazel_rules_apple: 0.31.3 (current: 0.31.2) #2): change > to >= in onReResolveAlarm in the same file — find the branch that compares elapsed time against host_ttl_ for eviction and flip the strict > to >= so a re-arm exactly at the TTL boundary still evicts. Do NOT restore the +1ms magic offsets. Keep refresh_interval = std::chrono::milliseconds(0); in the past-TTL branch and keep until_eviction with no +1ms.
  • Update the debug log to include both the raw (pre-cap) and the armed (post-cap+floor) interval:
    ENVOY_LOG(debug,
              "DNS refresh rate reset for host '{}', (failure) raw={} ms armed={} ms",
              host, raw_backoff_ms.count(), refresh_interval.count());
  • Tidy the block comment — since the floor is now unconditional, drop the "if and only if the cap kicked in" language. Trim to: the cap bounds eviction delay; the floor prevents arming sub-dns_min_refresh_rate alarms (including the nextBackOffMs() == 0 case) which could kick rapid-fire resolves and race with dispatcher/resolver teardown.

2. test/extensions/common/dynamic_forward_proxy/dns_cache_impl_test.cc — add tests

a. A test with a seeded random generator that forces nextBackOffMs() == 0 and asserts the refresh alarm is armed at min_refresh_interval_ (the existing fixture around ResolveFailureBackoffCappedByHostTtl uses dns_min_refresh_rate = 1s). Wire in a NiceMock<Random::MockRandomGenerator> that feeds the failure backoff strategy if one isn't already there; set ON_CALL(random_, random()).WillByDefault(Return(0)); so the jittered exponential backoff returns 0. The test should configure the same refresh/ttl/min as the existing test, trigger a DNS failure, and assert enableTimer is called with 1000ms (the min_refresh_interval_), NOT 0. Name it ResolveFailureBackoffZeroFlooredByMinRefresh.

b. A test for the past-TTL branch (elapsed >= host_ttl_). Advance sim time past host_ttl before the failure; assert enableTimer is called with min_refresh_interval_ (not 1ms, not 0). Name it ResolveFailurePastHostTtlFlooredByMinRefresh. Keep style consistent with ResolveFailureBackoffCappedByHostTtl.

Also update the existing ResolveFailureBackoffCappedByHostTtl test's comment to match the new unconditional-floor semantics if it's still slightly stale.

3. source/common/shared_pool/shared_pool.h and a new test for the leak fix

Reviewer: "Could we please also add a regression test for the leak fix exercising the dispatcher teardown-drops-callback path?"

Find the existing shared_pool unit tests under test/common/shared_pool/ (likely shared_pool_test.cc). Add a new test exercisin...

This pull request was created from Copilot chat.

Copilot AI changed the title [WIP] Address review feedback on shared pool leak fix dns_cache: unconditionally floor failure backoff at min_refresh_interval_ Apr 23, 2026
Copilot AI requested a review from phlax April 23, 2026 08:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants