shared_pool: fix leak when last reference drops on worker thread during dispatcher teardown#44570

Merged
phlax merged 10 commits into
envoyproxy:mainfrom
phlax:flake-dfp-no
Apr 23, 2026

@phlax
Member

@phlax phlax commented Apr 22, 2026

Fixes the dfp ASan flake.

@phlax phlax added this to the 1.38.0 milestone Apr 22, 2026
@phlax phlax changed the title test/dfp: Fix asan flake [WIP] test/dfp: Fix asan flake Apr 22, 2026
@phlax phlax marked this pull request as draft April 22, 2026 09:50
@phlax
Member Author

phlax commented Apr 22, 2026

Doesn't fix it: I can still repro with this PR applied.

@phlax phlax changed the title [WIP] test/dfp: Fix asan flake shared_pool: fix leak when last reference drops on worker thread during dispatcher teardown Apr 22, 2026
@phlax phlax marked this pull request as ready for review April 22, 2026 10:45
@phlax
Member Author

phlax commented Apr 22, 2026

It appears this does fix it now.

@phlax
Member Author

phlax commented Apr 22, 2026

I have left the bot comments in the code for purposes of review; happy to remove them if they are not helpful.

@phlax phlax force-pushed the flake-dfp-no branch 2 times, most recently from 713497f to aeda177 Compare April 22, 2026 10:55
Signed-off-by: Ryan Northey <ryan@synca.io>
Comment thread changelogs/current.yaml
- area: http2
  change: |
    Apply nghttp2 CVE-2026-27135 patch.
- area: shared_pool
Member

Wondering how relevant this amount of information is for users. Maybe squash all changes like this into a single changelog entry, e.g. "Fixed memory leaks"?

Member Author

I'm reluctant to change other entries, but if you can suggest a terser changelog I'm happy to update.

@phlax
Member Author

phlax commented Apr 22, 2026

/retest ext proc

Member

@agrawroh agrawroh left a comment

Left some feedback, looks good otherwise.

// resolves and race with dispatcher/resolver teardown in integration tests (observed
// as a LeakSanitizer leak in proxy_filter_integration_test DoubleResolution). When the
// cap does not kick in, the user-configured failure backoff is passed through unchanged.
if (refresh_interval < uncapped_refresh_interval) {
Member

Please use <= in the guard.

if (refresh_interval <= uncapped_refresh_interval) { ... }

OR apply the floor unconditionally,

refresh_interval = std::max(refresh_interval, min_refresh_interval_);

Also, add a test with a seeded random generator that forces nextBackOffMs() == 0.
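
For illustration, here is a minimal standalone sketch of why the strict guard matters when the backoff is zero. Everything in it is hypothetical: `FixedBackOff` is a stub standing in for a backoff strategy, the helper names are invented, and the guarded body is assumed to be the min-refresh floor, which this thread does not show in full. It is not Envoy code.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>

using Ms = std::chrono::milliseconds;

// Hypothetical stub standing in for a backoff strategy; it returns a fixed
// value so the zero-backoff case can be forced deterministically.
struct FixedBackOff {
  std::uint64_t value_ms;
  std::uint64_t nextBackOffMs() const { return value_ms; }
};

// Floor applied only when the TTL cap shortened the interval (strict '<').
// Assumption: the guarded body is the min-refresh floor.
Ms floorWithStrictGuard(Ms raw_backoff, Ms capped, Ms min_refresh) {
  assert(capped <= raw_backoff); // capping can only shorten the interval
  Ms interval = capped;
  if (interval < raw_backoff) {
    interval = std::max(interval, min_refresh);
  }
  return interval;
}

// Floor applied unconditionally, as in the second suggestion above.
Ms floorUnconditionally(Ms capped, Ms min_refresh) {
  return std::max(capped, min_refresh);
}
```

With a raw backoff of 0 ms the cap cannot shorten anything, so the strict guard is skipped and the interval stays at 0 ms, while the unconditional form still returns the floor.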

Comment on lines 574 to 580
  if (elapsed >= host_ttl_) {
-   refresh_interval = std::chrono::milliseconds(1);
+   refresh_interval = std::chrono::milliseconds(0);
  } else {
    const auto until_eviction =
-       std::chrono::duration_cast<std::chrono::milliseconds>(host_ttl_ - elapsed) +
-       std::chrono::milliseconds(1);
+       std::chrono::duration_cast<std::chrono::milliseconds>(host_ttl_ - elapsed);
    refresh_interval = std::min(refresh_interval, until_eviction);
  }
Member

In the previous PR we specifically added the guarantee that a failed resolve re-arms the alarm so that the alarm fires no later than the moment the host becomes eligible for eviction. This PR breaks that invariant.

You should pick one:

  1. Restore +1ms on until_eviction in the else branch and pin the past-TTL branch at std::chrono::milliseconds(1). Drop the min_refresh_interval_ floor in favor of a 1ms floor. Also update the test assertions as the current enableTimer(1000ms) expectation in dns_cache_impl_test.cc only works because of the current floor shape.
  2. Flip dns_cache_impl.cc from > to >=. This localizes the invariant inside onReResolveAlarm and removes the need for any +1ms magic offset. It also matches the style we use elsewhere.
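
To make the first option concrete, here is a rough standalone sketch of the arithmetic it describes. The function names are hypothetical, and the strict `>` eviction check is an assumption inferred from the second option's "from > to >=" wording. It is not the code in dns_cache_impl.cc.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

using Ms = std::chrono::milliseconds;

// Option 1 as described above: +1ms on until_eviction, 1ms when already
// past the TTL, and a 1ms floor instead of the min-refresh floor.
Ms armedIntervalOption1(Ms raw_backoff, Ms elapsed, Ms host_ttl) {
  assert(elapsed >= Ms::zero() && host_ttl >= Ms::zero());
  Ms interval = raw_backoff;
  if (elapsed >= host_ttl) {
    interval = Ms(1);
  } else {
    const Ms until_eviction = (host_ttl - elapsed) + Ms(1);
    interval = std::min(interval, until_eviction);
  }
  return std::max(interval, Ms(1));
}

// Assumed shape of the eviction check before any flip to '>='.
bool eligibleForEvictionStrict(Ms elapsed, Ms host_ttl) {
  return elapsed > host_ttl;
}
```

For example, with a 1000 ms TTL, 400 ms elapsed and a 5000 ms backoff, the alarm is armed for 601 ms, so when it fires the elapsed time is 1001 ms and the strict check passes. Without the +1 ms the alarm would fire at exactly 1000 ms and the strict check would not evict yet, which is the gap the second option closes by using `>=` instead.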

Member Author

I asked the bot about this. It said that both were required, or at least a good idea, for some reason. Unfortunately Copilot is now down, so I'm unable to follow up immediately.

std::chrono::milliseconds refresh_interval(
    primary_host_info->failure_backoff_strategy_->nextBackOffMs());
const auto uncapped_refresh_interval = refresh_interval;
if (elapsed >= host_ttl_) {
Member

We should add a test for this case.

Comment thread changelogs/current.yaml
Comment on lines +407 to +411
Fixed a leak in ``ObjectSharedPool`` when the last reference to a pooled object is released
on a non-main thread after the main dispatcher has started teardown. The queued cross-thread
deletion now carries ownership, so the object is freed whether or not the dispatcher runs
the queued callback. Most visibly this fixes a LeakSanitizer leak in
``StrictDnsClusterImpl`` involving the ``Locality`` shared pool during server shutdown.
Member

Could we please also add a regression test for the leak fix exercising the dispatcher teardown-drops-callback path?
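
As a rough illustration of what "the queued deletion carries ownership" means in the changelog text, and of the dropped-callback path such a test would exercise, here is a toy sketch. `Tracked`, `ToyDispatcher` and both `post...` helpers are hypothetical names. This is not Envoy's `ObjectSharedPool` or dispatcher, and the actual fix is not shown in this thread.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical pooled object that counts live instances.
struct Tracked {
  static int live;
  Tracked() { ++live; }
  ~Tracked() { --live; }
  Tracked(const Tracked&) = delete;
  Tracked& operator=(const Tracked&) = delete;
};
int Tracked::live = 0;

// Toy stand-in for a dispatcher: callbacks are queued, and teardown may
// discard them without ever running them.
struct ToyDispatcher {
  std::vector<std::function<void()>> queue;
  void post(std::function<void()> cb) { queue.push_back(std::move(cb)); }
  void teardownWithoutRunning() { queue.clear(); }
};

// Leak-prone shape: the callback holds a raw pointer, so the object is
// freed only if the callback actually runs.
void postRawDelete(ToyDispatcher& dispatcher, Tracked* obj) {
  assert(obj != nullptr);
  dispatcher.post([obj] { delete obj; });
}

// Ownership-carrying shape: the callback owns the object, so destroying the
// queued callback frees it even if the callback never runs.
void postOwningDelete(ToyDispatcher& dispatcher, std::unique_ptr<Tracked> obj) {
  std::shared_ptr<Tracked> owned(std::move(obj));
  dispatcher.post([owned]() mutable { owned.reset(); });
}
```

A `std::shared_ptr` is captured here only because `std::function` requires a copyable callable; a move-only callback type would allow capturing the `std::unique_ptr` directly. A regression test along these lines would post the deletion, tear the dispatcher down without running the queue, and assert that no instance is still alive.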

Comment on lines 593 to 594
ENVOY_LOG(debug, "DNS refresh rate reset for host '{}', (failure) refresh rate {} ms", host,
          refresh_interval.count());
Member

Let's log both? WDYT?

ENVOY_LOG(debug, "DNS refresh rate reset for host '{}', (failure) raw={} ms armed={} ms",
    host, uncapped_refresh_interval.count(), refresh_interval.count());

Comment on lines +588 to +590
const auto min_refresh_ms =
    std::chrono::duration_cast<std::chrono::milliseconds>(min_refresh_interval_);
refresh_interval = std::max(refresh_interval, min_refresh_ms);
Member

You can just do:

refresh_interval = std::max(refresh_interval, min_refresh_interval_);
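
A small sketch of when the cast can be dropped, assuming `min_refresh_interval_` is already a `std::chrono::milliseconds` (the thread does not show its declaration). The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

using Ms = std::chrono::milliseconds;

// Both operands are milliseconds, so std::max deduces a single type and no
// duration_cast is needed.
Ms floorSameType(Ms refresh_interval, Ms min_refresh_interval) {
  return std::max(refresh_interval, min_refresh_interval);
}

// Hypothetical case where the floor is stored in seconds. A plain
// std::max(a, b) would not compile because the argument types differ, so the
// template argument is spelled out and the seconds value converts implicitly
// (seconds to milliseconds is lossless).
Ms floorMixedTypes(Ms refresh_interval, std::chrono::seconds min_refresh_interval) {
  assert(min_refresh_interval >= std::chrono::seconds::zero());
  return std::max<Ms>(refresh_interval, min_refresh_interval);
}
```

If the member were a coarser duration type, the explicit template argument (or the existing `duration_cast`) would still be required.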

const auto elapsed = now - primary_host_info->host_info_->lastUsedTime();
std::chrono::milliseconds refresh_interval(
    primary_host_info->failure_backoff_strategy_->nextBackOffMs());
const auto uncapped_refresh_interval = refresh_interval;
Member

uncapped_refresh_interval => raw_backoff_ms

Copilot AI and others added 2 commits April 23, 2026 09:55
…_interval_ floor

Signed-off-by: Ryan Northey <ryan@synca.io>
Signed-off-by: Ryan Northey <ryan@synca.io>
Copilot AI and others added 2 commits April 23, 2026 10:56
Signed-off-by: Ryan Northey <ryan@synca.io>
phlax added 3 commits April 23, 2026 11:14
Signed-off-by: Ryan Northey <ryan@synca.io>
Signed-off-by: Ryan Northey <ryan@synca.io>
Signed-off-by: Ryan Northey <ryan@synca.io>
@phlax phlax requested a review from agrawroh April 23, 2026 11:27
@phlax
Member Author

phlax commented Apr 23, 2026

@agrawroh hopefully all feedback is addressed.

Same as above: I left in all the bot comments for the sake of reviewing, but I'm happy to clean those up if they are not more generally useful.

@phlax phlax merged commit c95e4b5 into envoyproxy:main Apr 23, 2026
28 of 29 checks passed
4 participants