shared_pool: fix leak when last reference drops on worker thread during dispatcher teardown#44570
Conversation
doesn't fix - I can still repro with this PR applied

appears it does fix now

I have left bot comments in the code for purposes of review - happy to remove if they are not helpful
Force-pushed 713497f to aeda177
Signed-off-by: Ryan Northey <ryan@synca.io>
- area: http2
  change: |
    Apply nghttp2 CVE-2026-27135 patch.
- area: shared_pool
Wondering how relevant this amount of information is for users. Maybe squash all changes like this into a single changelog entry like `Fixed memory leaks`?
I'm reluctant to change other entries - but if you can suggest a terser changelog, happy to update
/retest ext proc
agrawroh
left a comment
Left some feedback, looks good otherwise.
// resolves and race with dispatcher/resolver teardown in integration tests (observed
// as a LeakSanitizer leak in proxy_filter_integration_test DoubleResolution). When the
// cap does not kick in, the user-configured failure backoff is passed through unchanged.
if (refresh_interval < uncapped_refresh_interval) {
Please use `<=` in the guard:
`if (refresh_interval <= uncapped_refresh_interval) { ... }`
Or apply the floor unconditionally:
`refresh_interval = std::max(refresh_interval, min_refresh_interval_);`
Also, add a test with a seeded random generator that forces `nextBackOffMs() == 0`.
   if (elapsed >= host_ttl_) {
-    refresh_interval = std::chrono::milliseconds(1);
+    refresh_interval = std::chrono::milliseconds(0);
   } else {
     const auto until_eviction =
-        std::chrono::duration_cast<std::chrono::milliseconds>(host_ttl_ - elapsed) +
-        std::chrono::milliseconds(1);
+        std::chrono::duration_cast<std::chrono::milliseconds>(host_ttl_ - elapsed);
     refresh_interval = std::min(refresh_interval, until_eviction);
   }
In the previous PR we specifically added the guarantee that a failed resolve re-arms the alarm so eviction happens no later than when the host becomes eligible for eviction. Now, this PR is breaking that invariant.
You should pick one:
- Restore `+1ms` on `until_eviction` in the else branch and pin the past-TTL branch at `std::chrono::milliseconds(1)`. Drop the `min_refresh_interval_` floor in favor of a 1ms floor. Also update the test assertions, as the current `enableTimer(1000ms)` expectation in `dns_cache_impl_test.cc` only works because of the current floor shape.
- Flip `dns_cache_impl.cc` from `>` to `>=`. This localizes the invariant inside `onReResolveAlarm` and removes the need for any `+1ms` magic offset. It also matches the style we use elsewhere.
I asked the bot about this - it said that both were required/a good idea for some reason - unfortunately Copilot is now down, so unable to follow up immediately
std::chrono::milliseconds refresh_interval(
    primary_host_info->failure_backoff_strategy_->nextBackOffMs());
const auto uncapped_refresh_interval = refresh_interval;
if (elapsed >= host_ttl_) {
We should add a test for this case.
Fixed a leak in ``ObjectSharedPool`` when the last reference to a pooled object is released
on a non-main thread after the main dispatcher has started teardown. The queued cross-thread
deletion now carries ownership, so the object is freed whether or not the dispatcher runs
the queued callback. Most visibly this fixes a LeakSanitizer leak in
``StrictDnsClusterImpl`` involving the ``Locality`` shared pool during server shutdown.
Could we please also add a regression test for the leak fix exercising the dispatcher teardown-drops-callback path?
ENVOY_LOG(debug, "DNS refresh rate reset for host '{}', (failure) refresh rate {} ms", host,
          refresh_interval.count());
Let's log both? WDYT?
ENVOY_LOG(debug, "DNS refresh rate reset for host '{}', (failure) raw={} ms armed={} ms",
host, uncapped_refresh_interval.count(), refresh_interval.count());
const auto min_refresh_ms =
    std::chrono::duration_cast<std::chrono::milliseconds>(min_refresh_interval_);
refresh_interval = std::max(refresh_interval, min_refresh_ms);
You can just do:
refresh_interval = std::max(refresh_interval, min_refresh_interval_);
const auto elapsed = now - primary_host_info->host_info_->lastUsedTime();
std::chrono::milliseconds refresh_interval(
    primary_host_info->failure_backoff_strategy_->nextBackOffMs());
const auto uncapped_refresh_interval = refresh_interval;
`uncapped_refresh_interval` => `raw_backoff_ms`
…_interval_ floor Signed-off-by: Ryan Northey <ryan@synca.io>
…ete leak Agent-Logs-Url: https://github.com/phlax/envoy/sessions/d289657d-320a-49dc-9726-10d554189a29 Co-authored-by: phlax <454682+phlax@users.noreply.github.com>
…sh_rate Agent-Logs-Url: https://github.com/phlax/envoy/sessions/ba248f4d-3f4e-4efd-a832-57155165a8f1 Co-authored-by: phlax <454682+phlax@users.noreply.github.com>
…_rate Agent-Logs-Url: https://github.com/phlax/envoy/sessions/798a1ee4-459e-4f76-8dda-8bbeef41e639 Co-authored-by: phlax <454682+phlax@users.noreply.github.com>
@agrawroh hopefully all feedback is addressed - same as above, I left in all the bot comments for the sake of review, but happy to clean those up if they're not more generally useful
Fixes DFP ASAN flake