upstream: coalesce load balancer rebuilds during batch host updates#43346
Conversation
10ac01a to
ca9dd67
Compare
bdd832d to
d894e3d
Compare
During EDS batch updates that modify multiple priorities, all load balancers in the inheritance chain (LoadBalancerBase, ZoneAwareLoadBalancerBase, EdfLoadBalancerBase) do expensive per-priority work for every individual priority update via PriorityUpdateCb. This includes: - LoadBalancerBase: recalculating per-priority health state and panic mode - ZoneAwareLoadBalancerBase: rebuilding locality-weighted routing structures - EdfLoadBalancerBase: rebuilding EDF schedulers at O(n log n) per HostsSource For clusters with ~5k endpoints undergoing bulk IP changes during rollouts, this causes significant CPU spikes on the main thread as each priority update triggers redundant recalculations across the full LB stack. The fix leverages the existing MemberUpdateCb which fires once after the entire batch completes (unlike PriorityUpdateCb which fires per priority). Instead of doing work immediately in PriorityUpdateCb, each LB level now marks the priority as dirty. The actual work happens in MemberUpdateCb, coalescing all dirty priorities into a single pass. Signed-off-by: William Dauchy <william.dauchy@datadoghq.com>
|
Hi @nezdolik, could we borrow some your EDS expertise for this PR's review as well? Thanks! |
|
/retest transients |
4 similar comments
|
/retest transients |
|
/retest transients |
|
/retest transients |
|
/retest transients |
botengyao
left a comment
There was a problem hiding this comment.
lgtm in high level, thanks for the improvment!
Merging main will pass the CI, also could you also add a changelog?
/wait
| priority_update_cb_ = priority_set_.addPriorityUpdateCb( | ||
| [this](uint32_t, const HostVector&, const HostVector&) { refresh(); }); | ||
| member_update_cb_ = | ||
| priority_set_.addMemberUpdateCb([this](const HostVector&, const HostVector&) { refresh(); }); |
There was a problem hiding this comment.
refresh() reads per_priority_load_ and per_priority_panic_ from LoadBalancerBase. So there are 2 callbacks that are in order, but it could break for future's refactor. Could we make this ordering-independent by pulling the dirty-priority processing into a protected idempotent helper, and calling it both from both?
There was a problem hiding this comment.
good point, you're right that relying on callback registration order is fragile. I've added a processDirtyPriorities() protected method on LoadBalancerBase that's idempotent (no-op if dirty set is empty). ThreadAwareLoadBalancerBase now calls it before refresh() in its own MemberUpdateCb, so even if the ordering changes in a future refactor, it will always have up-to-date per_priority_load_ and per_priority_panic_ state.
| total_healthy_hosts_); | ||
| recalculatePerPriorityPanic(); | ||
| stashed_random_.clear(); | ||
| dirty_priorities_.insert(priority); |
There was a problem hiding this comment.
I would suggest adding a runtime guard for this.
There was a problem hiding this comment.
done, added envoy.reloadable_features.coalesce_lb_rebuilds_on_batch_update with a changelog entry. when the guard is off, the original behavior is preserved (work done inline in PriorityUpdateCb). the flag is checked once at LB construction time so there's no per-update overhead.
|
/retest transients |
|
CC @envoyproxy/coverage-shephards: FYI only for changes made to |
|
/retest transients |
botengyao
left a comment
There was a problem hiding this comment.
lgtm in high level, thanks for the great contribution!
| source/extensions/internal_redirect/safe_cross_scheme: 81.3 | ||
| source/extensions/internal_redirect/allow_listed_routes: 85.7 | ||
| source/extensions/internal_redirect/previous_routes: 89.3 | ||
| source/extensions/load_balancing_policies/common: 96.3 |
There was a problem hiding this comment.
what can we do to fill the 0.3?
There was a problem hiding this comment.
for this patch I honestly don't know, that's why I ended up doing this. I think load balancing policies could deserve more tests, but do we want to add them in this pr?
There was a problem hiding this comment.
ok I will try a last stretch
There was a problem hiding this comment.
now fixed, sorry for the lazyness I had earlier :))
nezdolik
left a comment
There was a problem hiding this comment.
Thanks for great improvement, had just one question
| } | ||
|
|
||
| if (locality_weighted_balancing_) { | ||
| rebuildLocalityWrrForPriority(priority); |
There was a problem hiding this comment.
is there any co-dependency between regenerateLocalityRoutingStructures() and rebuildLocalityWrrForPriority() in order of invocation? E.g. in original code regenerateLocalityRoutingStructures() is invoked before rebuildLocalityWrrForPriority().
There was a problem hiding this comment.
good catch. there's no data co-dependency between the two, regenerateLocalityRoutingStructures() writes to locality_routing_state_/residual_capacity_ while rebuildLocalityWrrForPriority() writes to locality_wrr_, so they don't read each other's output. that said, you're right the original code had regenerateLocalityRoutingStructures() first, so I've restored that ordering in both the new and fallback paths to stay consistent with the original behavior.
…43697) Fixes a crash introduced by #43346. When a thread-aware load balancer is initialized mid-batch, the per-priority panic tracking vectors have not been resized yet because the batch member update callback hasn't fired. This causes an out-of-bounds read during the LB refresh loop. 1. Process any dirty priorities to properly size vectors if the load balancer is initialized mid-batch. 2. Add a bounds check `ASSERT` to prevent silent out-of-bounds bit vector reads. 3. Add an initialization regression test to prevent this pattern from breaking in the future. Commit Message: Additional Description: Risk Level: low (already guarded by `envoy.reloadable_features.coalesce_lb_rebuilds_on_batch_update`). Testing: Docs Changes: Release Notes: Signed-off-by: Boteng Yao <boteng@google.com>
…nvoyproxy#43346) Commit Message: During EDS batch updates that modify multiple priorities, all load balancers in the inheritance chain (LoadBalancerBase, ZoneAwareLoadBalancerBase, EdfLoadBalancerBase) do expensive per-priority work for every individual priority update via PriorityUpdateCb. This includes: - LoadBalancerBase: recalculating per-priority health state and panic mode - ZoneAwareLoadBalancerBase: rebuilding locality-weighted routing structures - EdfLoadBalancerBase: rebuilding EDF schedulers at O(n log n) per HostsSource For clusters with ~5k endpoints undergoing bulk IP changes during rollouts, this causes significant CPU spikes on the main thread as each priority update triggers redundant recalculations across the full LB stack. The fix leverages the existing MemberUpdateCb which fires once after the entire batch completes (unlike PriorityUpdateCb which fires per priority). Instead of doing work immediately in PriorityUpdateCb, each LB level now marks the priority as dirty. The actual work happens in MemberUpdateCb, coalescing all dirty priorities into a single pass. ThreadAwareLoadBalancerBase (used by RingHash and Maglev) is also updated to call refresh() from MemberUpdateCb instead of PriorityUpdateCb, ensuring it reads per-priority state after LoadBalancerBase has processed dirty priorities. Additionally, MockPrioritySet callback ordering is fixed to match real PrioritySetImpl behavior (PriorityUpdateCb fires before MemberUpdateCb). Additional Description: Risk Level: medium Testing: unit tests updated, new test for batch callback behavior Docs Changes: n/a Release Notes: n/a Platform Specific Features: n/a [Optional Runtime guard:] [Optional Fixes #Issue] [Optional Fixes commit #PR or SHA] [Optional Deprecated:] [Optional [API Considerations](https://github.com/envoyproxy/envoy/blob/main/api/review_checklist.md):] --------- Signed-off-by: William Dauchy <william.dauchy@datadoghq.com> Signed-off-by: bjmask <11672696+bjmask@users.noreply.github.com>
…nvoyproxy#43697) Fixes a crash introduced by envoyproxy#43346. When a thread-aware load balancer is initialized mid-batch, the per-priority panic tracking vectors have not been resized yet because the batch member update callback hasn't fired. This causes an out-of-bounds read during the LB refresh loop. 1. Process any dirty priorities to properly size vectors if the load balancer is initialized mid-batch. 2. Add a bounds check `ASSERT` to prevent silent out-of-bounds bit vector reads. 3. Add an initialization regression test to prevent this pattern from breaking in the future. Commit Message: Additional Description: Risk Level: low (already guarded by `envoy.reloadable_features.coalesce_lb_rebuilds_on_batch_update`). Testing: Docs Changes: Release Notes: Signed-off-by: Boteng Yao <boteng@google.com> Signed-off-by: bjmask <11672696+bjmask@users.noreply.github.com>
…nvoyproxy#43346) Commit Message: During EDS batch updates that modify multiple priorities, all load balancers in the inheritance chain (LoadBalancerBase, ZoneAwareLoadBalancerBase, EdfLoadBalancerBase) do expensive per-priority work for every individual priority update via PriorityUpdateCb. This includes: - LoadBalancerBase: recalculating per-priority health state and panic mode - ZoneAwareLoadBalancerBase: rebuilding locality-weighted routing structures - EdfLoadBalancerBase: rebuilding EDF schedulers at O(n log n) per HostsSource For clusters with ~5k endpoints undergoing bulk IP changes during rollouts, this causes significant CPU spikes on the main thread as each priority update triggers redundant recalculations across the full LB stack. The fix leverages the existing MemberUpdateCb which fires once after the entire batch completes (unlike PriorityUpdateCb which fires per priority). Instead of doing work immediately in PriorityUpdateCb, each LB level now marks the priority as dirty. The actual work happens in MemberUpdateCb, coalescing all dirty priorities into a single pass. ThreadAwareLoadBalancerBase (used by RingHash and Maglev) is also updated to call refresh() from MemberUpdateCb instead of PriorityUpdateCb, ensuring it reads per-priority state after LoadBalancerBase has processed dirty priorities. Additionally, MockPrioritySet callback ordering is fixed to match real PrioritySetImpl behavior (PriorityUpdateCb fires before MemberUpdateCb). Additional Description: Risk Level: medium Testing: unit tests updated, new test for batch callback behavior Docs Changes: n/a Release Notes: n/a Platform Specific Features: n/a [Optional Runtime guard:] [Optional Fixes #Issue] [Optional Fixes commit #PR or SHA] [Optional Deprecated:] [Optional [API Considerations](https://github.com/envoyproxy/envoy/blob/main/api/review_checklist.md):] --------- Signed-off-by: William Dauchy <william.dauchy@datadoghq.com>
…nvoyproxy#43697) Fixes a crash introduced by envoyproxy#43346. When a thread-aware load balancer is initialized mid-batch, the per-priority panic tracking vectors have not been resized yet because the batch member update callback hasn't fired. This causes an out-of-bounds read during the LB refresh loop. 1. Process any dirty priorities to properly size vectors if the load balancer is initialized mid-batch. 2. Add a bounds check `ASSERT` to prevent silent out-of-bounds bit vector reads. 3. Add an initialization regression test to prevent this pattern from breaking in the future. Commit Message: Additional Description: Risk Level: low (already guarded by `envoy.reloadable_features.coalesce_lb_rebuilds_on_batch_update`). Testing: Docs Changes: Release Notes: Signed-off-by: Boteng Yao <boteng@google.com>
…nvoyproxy#43346) Commit Message: During EDS batch updates that modify multiple priorities, all load balancers in the inheritance chain (LoadBalancerBase, ZoneAwareLoadBalancerBase, EdfLoadBalancerBase) do expensive per-priority work for every individual priority update via PriorityUpdateCb. This includes: - LoadBalancerBase: recalculating per-priority health state and panic mode - ZoneAwareLoadBalancerBase: rebuilding locality-weighted routing structures - EdfLoadBalancerBase: rebuilding EDF schedulers at O(n log n) per HostsSource For clusters with ~5k endpoints undergoing bulk IP changes during rollouts, this causes significant CPU spikes on the main thread as each priority update triggers redundant recalculations across the full LB stack. The fix leverages the existing MemberUpdateCb which fires once after the entire batch completes (unlike PriorityUpdateCb which fires per priority). Instead of doing work immediately in PriorityUpdateCb, each LB level now marks the priority as dirty. The actual work happens in MemberUpdateCb, coalescing all dirty priorities into a single pass. ThreadAwareLoadBalancerBase (used by RingHash and Maglev) is also updated to call refresh() from MemberUpdateCb instead of PriorityUpdateCb, ensuring it reads per-priority state after LoadBalancerBase has processed dirty priorities. Additionally, MockPrioritySet callback ordering is fixed to match real PrioritySetImpl behavior (PriorityUpdateCb fires before MemberUpdateCb). Additional Description: Risk Level: medium Testing: unit tests updated, new test for batch callback behavior Docs Changes: n/a Release Notes: n/a Platform Specific Features: n/a [Optional Runtime guard:] [Optional Fixes #Issue] [Optional Fixes commit #PR or SHA] [Optional Deprecated:] [Optional [API Considerations](https://github.com/envoyproxy/envoy/blob/main/api/review_checklist.md):] --------- Signed-off-by: William Dauchy <william.dauchy@datadoghq.com>
…nvoyproxy#43697) Fixes a crash introduced by envoyproxy#43346. When a thread-aware load balancer is initialized mid-batch, the per-priority panic tracking vectors have not been resized yet because the batch member update callback hasn't fired. This causes an out-of-bounds read during the LB refresh loop. 1. Process any dirty priorities to properly size vectors if the load balancer is initialized mid-batch. 2. Add a bounds check `ASSERT` to prevent silent out-of-bounds bit vector reads. 3. Add an initialization regression test to prevent this pattern from breaking in the future. Commit Message: Additional Description: Risk Level: low (already guarded by `envoy.reloadable_features.coalesce_lb_rebuilds_on_batch_update`). Testing: Docs Changes: Release Notes: Signed-off-by: Boteng Yao <boteng@google.com>
…nvoyproxy#43346) Commit Message: During EDS batch updates that modify multiple priorities, all load balancers in the inheritance chain (LoadBalancerBase, ZoneAwareLoadBalancerBase, EdfLoadBalancerBase) do expensive per-priority work for every individual priority update via PriorityUpdateCb. This includes: - LoadBalancerBase: recalculating per-priority health state and panic mode - ZoneAwareLoadBalancerBase: rebuilding locality-weighted routing structures - EdfLoadBalancerBase: rebuilding EDF schedulers at O(n log n) per HostsSource For clusters with ~5k endpoints undergoing bulk IP changes during rollouts, this causes significant CPU spikes on the main thread as each priority update triggers redundant recalculations across the full LB stack. The fix leverages the existing MemberUpdateCb which fires once after the entire batch completes (unlike PriorityUpdateCb which fires per priority). Instead of doing work immediately in PriorityUpdateCb, each LB level now marks the priority as dirty. The actual work happens in MemberUpdateCb, coalescing all dirty priorities into a single pass. ThreadAwareLoadBalancerBase (used by RingHash and Maglev) is also updated to call refresh() from MemberUpdateCb instead of PriorityUpdateCb, ensuring it reads per-priority state after LoadBalancerBase has processed dirty priorities. Additionally, MockPrioritySet callback ordering is fixed to match real PrioritySetImpl behavior (PriorityUpdateCb fires before MemberUpdateCb). Additional Description: Risk Level: medium Testing: unit tests updated, new test for batch callback behavior Docs Changes: n/a Release Notes: n/a Platform Specific Features: n/a [Optional Runtime guard:] [Optional Fixes #Issue] [Optional Fixes commit #PR or SHA] [Optional Deprecated:] [Optional [API Considerations](https://github.com/envoyproxy/envoy/blob/main/api/review_checklist.md):] --------- Signed-off-by: William Dauchy <william.dauchy@datadoghq.com>
…nvoyproxy#43697) Fixes a crash introduced by envoyproxy#43346. When a thread-aware load balancer is initialized mid-batch, the per-priority panic tracking vectors have not been resized yet because the batch member update callback hasn't fired. This causes an out-of-bounds read during the LB refresh loop. 1. Process any dirty priorities to properly size vectors if the load balancer is initialized mid-batch. 2. Add a bounds check `ASSERT` to prevent silent out-of-bounds bit vector reads. 3. Add an initialization regression test to prevent this pattern from breaking in the future. Commit Message: Additional Description: Risk Level: low (already guarded by `envoy.reloadable_features.coalesce_lb_rebuilds_on_batch_update`). Testing: Docs Changes: Release Notes: Signed-off-by: Boteng Yao <boteng@google.com>
…nvoyproxy#43346) Commit Message: During EDS batch updates that modify multiple priorities, all load balancers in the inheritance chain (LoadBalancerBase, ZoneAwareLoadBalancerBase, EdfLoadBalancerBase) do expensive per-priority work for every individual priority update via PriorityUpdateCb. This includes: - LoadBalancerBase: recalculating per-priority health state and panic mode - ZoneAwareLoadBalancerBase: rebuilding locality-weighted routing structures - EdfLoadBalancerBase: rebuilding EDF schedulers at O(n log n) per HostsSource For clusters with ~5k endpoints undergoing bulk IP changes during rollouts, this causes significant CPU spikes on the main thread as each priority update triggers redundant recalculations across the full LB stack. The fix leverages the existing MemberUpdateCb which fires once after the entire batch completes (unlike PriorityUpdateCb which fires per priority). Instead of doing work immediately in PriorityUpdateCb, each LB level now marks the priority as dirty. The actual work happens in MemberUpdateCb, coalescing all dirty priorities into a single pass. ThreadAwareLoadBalancerBase (used by RingHash and Maglev) is also updated to call refresh() from MemberUpdateCb instead of PriorityUpdateCb, ensuring it reads per-priority state after LoadBalancerBase has processed dirty priorities. Additionally, MockPrioritySet callback ordering is fixed to match real PrioritySetImpl behavior (PriorityUpdateCb fires before MemberUpdateCb). Additional Description: Risk Level: medium Testing: unit tests updated, new test for batch callback behavior Docs Changes: n/a Release Notes: n/a Platform Specific Features: n/a [Optional Runtime guard:] [Optional Fixes #Issue] [Optional Fixes commit #PR or SHA] [Optional Deprecated:] [Optional [API Considerations](https://github.com/envoyproxy/envoy/blob/main/api/review_checklist.md):] --------- Signed-off-by: William Dauchy <william.dauchy@datadoghq.com>
…nvoyproxy#43697) Fixes a crash introduced by envoyproxy#43346. When a thread-aware load balancer is initialized mid-batch, the per-priority panic tracking vectors have not been resized yet because the batch member update callback hasn't fired. This causes an out-of-bounds read during the LB refresh loop. 1. Process any dirty priorities to properly size vectors if the load balancer is initialized mid-batch. 2. Add a bounds check `ASSERT` to prevent silent out-of-bounds bit vector reads. 3. Add an initialization regression test to prevent this pattern from breaking in the future. Commit Message: Additional Description: Risk Level: low (already guarded by `envoy.reloadable_features.coalesce_lb_rebuilds_on_batch_update`). Testing: Docs Changes: Release Notes: Signed-off-by: Boteng Yao <boteng@google.com>
Commit Message:
During EDS batch updates that modify multiple priorities, all load
balancers in the inheritance chain (LoadBalancerBase, ZoneAwareLoadBalancerBase,
EdfLoadBalancerBase) do expensive per-priority work for every individual
priority update via PriorityUpdateCb. This includes:
For clusters with ~5k endpoints undergoing bulk IP changes during rollouts,
this causes significant CPU spikes on the main thread as each priority
update triggers redundant recalculations across the full LB stack.
The fix leverages the existing MemberUpdateCb which fires once after the
entire batch completes (unlike PriorityUpdateCb which fires per priority).
Instead of doing work immediately in PriorityUpdateCb, each LB level now
marks the priority as dirty. The actual work happens in MemberUpdateCb,
coalescing all dirty priorities into a single pass.
ThreadAwareLoadBalancerBase (used by RingHash and Maglev) is also updated
to call refresh() from MemberUpdateCb instead of PriorityUpdateCb, ensuring
it reads per-priority state after LoadBalancerBase has processed dirty
priorities.
Additionally, MockPrioritySet callback ordering is fixed to match real
PrioritySetImpl behavior (PriorityUpdateCb fires before MemberUpdateCb).
Additional Description:
Risk Level: medium
Testing: unit tests updated, new test for batch callback behavior
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]