shared_pool: fix leak when last reference drops on worker thread during dispatcher teardown#44570
Conversation
doesn't fix - I can still repro with this PR applied

appears it does fix now

I have left bot comments in the code for purposes of review - happy to remove if they are not helpful
Force-pushed 713497f to aeda177
Signed-off-by: Ryan Northey <ryan@synca.io>
- area: http2
  change: |
    Apply nghttp2 CVE-2026-27135 patch.
- area: shared_pool
Wondering how relevant this amount of information is for users. Maybe squash all changes like this into a single changelog entry like `Fixed memory leaks`?
I'm reluctant to change other entries - but if you can suggest a terser changelog, happy to update
/retest ext proc
agrawroh
left a comment
Left some feedback, looks good otherwise.
// resolves and race with dispatcher/resolver teardown in integration tests (observed
// as a LeakSanitizer leak in proxy_filter_integration_test DoubleResolution). When the
// cap does not kick in, the user-configured failure backoff is passed through unchanged.
if (refresh_interval < uncapped_refresh_interval) {
Please use `<=` in the guard:
`if (refresh_interval <= uncapped_refresh_interval) { ... }`
Or apply the floor unconditionally:
`refresh_interval = std::max(refresh_interval, min_refresh_interval_);`
Also, add a test with a seeded random generator that forces `nextBackOffMs() == 0`.
   if (elapsed >= host_ttl_) {
-    refresh_interval = std::chrono::milliseconds(1);
+    refresh_interval = std::chrono::milliseconds(0);
   } else {
     const auto until_eviction =
-        std::chrono::duration_cast<std::chrono::milliseconds>(host_ttl_ - elapsed) +
-        std::chrono::milliseconds(1);
+        std::chrono::duration_cast<std::chrono::milliseconds>(host_ttl_ - elapsed);
     refresh_interval = std::min(refresh_interval, until_eviction);
   }
In the previous PR we specifically added the guarantee that a failed resolve re-arms the alarm so eviction happens no later than when the host becomes eligible for eviction. Now, this PR is breaking that invariant.
You should pick one:
- Restore `+1ms` on `until_eviction` in the else branch and pin the past-TTL branch at `std::chrono::milliseconds(1)`. Drop the `min_refresh_interval_` floor in favor of a 1ms floor. Also update the test assertions, as the current `enableTimer(1000ms)` expectation in `dns_cache_impl_test.cc` only works because of the current floor shape.
- Flip `dns_cache_impl.cc` from `>` to `>=`. This localizes the invariant inside `onReResolveAlarm` and removes the need for any `+1ms` magic offset. It also matches the style we use elsewhere.
I asked the bot about this - it said that both were required/a good idea for some reason - unfortunately Copilot is now down, so unable to follow up immediately
std::chrono::milliseconds refresh_interval(
    primary_host_info->failure_backoff_strategy_->nextBackOffMs());
const auto uncapped_refresh_interval = refresh_interval;
if (elapsed >= host_ttl_) {
We should add a test for this case.
Fixed a leak in ``ObjectSharedPool`` when the last reference to a pooled object is released
on a non-main thread after the main dispatcher has started teardown. The queued cross-thread
deletion now carries ownership, so the object is freed whether or not the dispatcher runs
the queued callback. Most visibly this fixes a LeakSanitizer leak in
``StrictDnsClusterImpl`` involving the ``Locality`` shared pool during server shutdown.
Could we please also add a regression test for the leak fix exercising the dispatcher teardown-drops-callback path?
ENVOY_LOG(debug, "DNS refresh rate reset for host '{}', (failure) refresh rate {} ms", host,
          refresh_interval.count());
Let's log both? WDYT?
ENVOY_LOG(debug, "DNS refresh rate reset for host '{}', (failure) raw={} ms armed={} ms",
host, uncapped_refresh_interval.count(), refresh_interval.count());
const auto min_refresh_ms =
    std::chrono::duration_cast<std::chrono::milliseconds>(min_refresh_interval_);
refresh_interval = std::max(refresh_interval, min_refresh_ms);
You can just do:
refresh_interval = std::max(refresh_interval, min_refresh_interval_);
const auto elapsed = now - primary_host_info->host_info_->lastUsedTime();
std::chrono::milliseconds refresh_interval(
    primary_host_info->failure_backoff_strategy_->nextBackOffMs());
const auto uncapped_refresh_interval = refresh_interval;
`uncapped_refresh_interval` => `raw_backoff_ms`
…_interval_ floor Signed-off-by: Ryan Northey <ryan@synca.io>
…ete leak Agent-Logs-Url: https://github.com/phlax/envoy/sessions/d289657d-320a-49dc-9726-10d554189a29 Co-authored-by: phlax <454682+phlax@users.noreply.github.com>
…sh_rate Agent-Logs-Url: https://github.com/phlax/envoy/sessions/ba248f4d-3f4e-4efd-a832-57155165a8f1 Co-authored-by: phlax <454682+phlax@users.noreply.github.com>
…_rate Agent-Logs-Url: https://github.com/phlax/envoy/sessions/798a1ee4-459e-4f76-8dda-8bbeef41e639 Co-authored-by: phlax <454682+phlax@users.noreply.github.com>
@agrawroh hopefully all feedback is addressed - same as above, I left in all the bot comments for the sake of review, but happy to clean those up if they're not more generally useful
Fixes DFP ASAN flake