upstream: coalesce load balancer rebuilds during batch host updates #43346

Merged
botengyao merged 8 commits into envoyproxy:main from wdauchy:eds_batch on Feb 23, 2026

Conversation

@wdauchy
Contributor

@wdauchy wdauchy commented Feb 5, 2026

Commit Message:
During EDS batch updates that modify multiple priorities, all load
balancers in the inheritance chain (LoadBalancerBase, ZoneAwareLoadBalancerBase,
EdfLoadBalancerBase) do expensive per-priority work for every individual
priority update via PriorityUpdateCb. This includes:

  • LoadBalancerBase: recalculating per-priority health state and panic mode
  • ZoneAwareLoadBalancerBase: rebuilding locality-weighted routing structures
  • EdfLoadBalancerBase: rebuilding EDF schedulers at O(n log n) per HostsSource

For clusters with ~5k endpoints undergoing bulk IP changes during rollouts,
this causes significant CPU spikes on the main thread as each priority
update triggers redundant recalculations across the full LB stack.

The fix leverages the existing MemberUpdateCb which fires once after the
entire batch completes (unlike PriorityUpdateCb which fires per priority).
Instead of doing work immediately in PriorityUpdateCb, each LB level now
marks the priority as dirty. The actual work happens in MemberUpdateCb,
coalescing all dirty priorities into a single pass.
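
The deferral described above can be illustrated with a minimal, self-contained sketch. It is not Envoy's implementation: `CoalescingLb`, `rebuildPriority`, `recalculateGlobalState`, and `simulateBatch` are hypothetical names, and the counters only stand in for the expensive per-priority and cross-priority work.

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical stand-in for one load balancer level; not Envoy's real class.
class CoalescingLb {
public:
  // Analogue of PriorityUpdateCb: fires per priority, only marks it dirty.
  void onPriorityUpdate(uint32_t priority) { dirty_priorities_.insert(priority); }

  // Analogue of MemberUpdateCb: fires once after the batch, does the real work.
  void onMemberUpdate() {
    if (dirty_priorities_.empty()) {
      return;
    }
    for (uint32_t priority : dirty_priorities_) {
      rebuildPriority(priority);
    }
    recalculateGlobalState();
    dirty_priorities_.clear();
  }

  int priorityRebuilds() const { return priority_rebuilds_; }
  int globalRecalculations() const { return global_recalculations_; }

private:
  // Placeholders for the expensive work (scheduler rebuilds, panic recalculation).
  void rebuildPriority(uint32_t /*priority*/) { ++priority_rebuilds_; }
  void recalculateGlobalState() { ++global_recalculations_; }

  std::set<uint32_t> dirty_priorities_;
  int priority_rebuilds_{0};
  int global_recalculations_{0};
};

// Simulates one batch: one PriorityUpdateCb per entry, then a single MemberUpdateCb.
inline CoalescingLb simulateBatch(const std::vector<uint32_t>& updated_priorities) {
  CoalescingLb lb;
  for (uint32_t priority : updated_priorities) {
    lb.onPriorityUpdate(priority);
  }
  lb.onMemberUpdate();
  return lb;
}
```

In this sketch a batch touching priorities 0, 1 and 2 triggers three per-priority rebuilds but only one cross-priority recalculation, instead of one recalculation per priority update.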

ThreadAwareLoadBalancerBase (used by RingHash and Maglev) is also updated
to call refresh() from MemberUpdateCb instead of PriorityUpdateCb, ensuring
it reads per-priority state after LoadBalancerBase has processed dirty
priorities.

Additionally, MockPrioritySet callback ordering is fixed to match real
PrioritySetImpl behavior (PriorityUpdateCb fires before MemberUpdateCb).

Additional Description:
Risk Level: medium
Testing: unit tests updated, new test for batch callback behavior
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]

@repokitteh-read-only

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #43346 was opened by wdauchy.

see: more, trace.

@wdauchy wdauchy force-pushed the eds_batch branch 9 times, most recently from 10ac01a to ca9dd67 on February 6, 2026 19:23
@wdauchy wdauchy force-pushed the eds_batch branch 2 times, most recently from bdd832d to d894e3d on February 13, 2026 19:16
Signed-off-by: William Dauchy <william.dauchy@datadoghq.com>
@wdauchy wdauchy changed the title from "upstream: defer PriorityUpdateCb during batch host updates" to "upstream: coalesce load balancer rebuilds during batch host updates" Feb 14, 2026
@wdauchy wdauchy marked this pull request as ready for review February 14, 2026 14:25
@botengyao
Member

Hi @nezdolik, could we borrow some of your EDS expertise for this PR's review as well? Thanks!

@wdauchy
Contributor Author

wdauchy commented Feb 19, 2026

/retest transients

Member

@botengyao botengyao left a comment

LGTM at a high level, thanks for the improvement!

Merging main should get CI passing. Could you also add a changelog entry?

/wait

priority_update_cb_ = priority_set_.addPriorityUpdateCb(
[this](uint32_t, const HostVector&, const HostVector&) { refresh(); });
member_update_cb_ =
priority_set_.addMemberUpdateCb([this](const HostVector&, const HostVector&) { refresh(); });
Member

refresh() reads per_priority_load_ and per_priority_panic_ from LoadBalancerBase. The two callbacks currently fire in the right order, but that could break in a future refactor. Could we make this ordering-independent by pulling the dirty-priority processing into a protected idempotent helper and calling it from both?

Contributor Author

good point, you're right that relying on callback registration order is fragile. I've added a processDirtyPriorities() protected method on LoadBalancerBase that's idempotent (no-op if dirty set is empty). ThreadAwareLoadBalancerBase now calls it before refresh() in its own MemberUpdateCb, so even if the ordering changes in a future refactor, it will always have up-to-date per_priority_load_ and per_priority_panic_ state.
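
A minimal sketch of the idempotent-helper idea from this thread, under stated assumptions: `LoadBalancerBaseSketch`, `ThreadAwareSketch`, and `runDerivedFirst` are hypothetical simplifications, and only `processDirtyPriorities()`, `refresh()`, and `per_priority_load_` echo names mentioned in the discussion. The real classes do considerably more.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical simplification of the base class; not Envoy's real code.
class LoadBalancerBaseSketch {
public:
  void onPriorityUpdate(uint32_t priority) { dirty_priorities_.insert(priority); }

  // Base-level MemberUpdateCb analogue.
  void onBaseMemberUpdate() { processDirtyPriorities(); }

  int processingPasses() const { return processing_passes_; }

protected:
  // Idempotent: returns immediately when nothing is dirty, so callers may
  // invoke it defensively regardless of callback ordering.
  void processDirtyPriorities() {
    if (dirty_priorities_.empty()) {
      return;
    }
    const std::size_t needed = *dirty_priorities_.rbegin() + 1;
    if (per_priority_load_.size() < needed) {
      per_priority_load_.resize(needed, 0);
    }
    for (uint32_t priority : dirty_priorities_) {
      per_priority_load_[priority] = 100; // placeholder for the real recalculation
    }
    dirty_priorities_.clear();
    ++processing_passes_;
  }

  std::vector<uint32_t> per_priority_load_;

private:
  std::set<uint32_t> dirty_priorities_;
  int processing_passes_{0};
};

class ThreadAwareSketch : public LoadBalancerBaseSketch {
public:
  // Derived MemberUpdateCb analogue: bring base state up to date, then refresh.
  void onDerivedMemberUpdate() {
    processDirtyPriorities();
    refresh();
  }

  std::size_t prioritiesSeenByRefresh() const { return priorities_seen_; }

private:
  void refresh() { priorities_seen_ = per_priority_load_.size(); }
  std::size_t priorities_seen_{0};
};

// Runs the derived callback before the base one, i.e. the unfavorable order.
inline ThreadAwareSketch runDerivedFirst() {
  ThreadAwareSketch lb;
  lb.onPriorityUpdate(0);
  lb.onPriorityUpdate(2);
  lb.onDerivedMemberUpdate();
  lb.onBaseMemberUpdate(); // no-op: the dirty set is already empty
  return lb;
}
```

Even with the derived callback running first, `refresh()` sees state for all three priorities, and the dirty set is processed exactly once.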

total_healthy_hosts_);
recalculatePerPriorityPanic();
stashed_random_.clear();
dirty_priorities_.insert(priority);
Member

I would suggest adding a runtime guard for this.

Contributor Author

done, added envoy.reloadable_features.coalesce_lb_rebuilds_on_batch_update with a changelog entry. when the guard is off, the original behavior is preserved (work done inline in PriorityUpdateCb). the flag is checked once at LB construction time so there's no per-update overhead.
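
The guard pattern described here (read the flag once at construction, then branch on a cached bool) might look roughly like the sketch below. `RuntimeFlags`, `flagEnabled`, `GuardedLb`, and `rebuildsBeforeBatchEnd` are hypothetical stand-ins; Envoy's actual runtime API is not used, and treating a missing flag as disabled is an assumption of the sketch only.

```cpp
#include <cstdint>
#include <set>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a runtime feature lookup; not Envoy's runtime API.
using RuntimeFlags = std::unordered_map<std::string, bool>;

inline bool flagEnabled(const RuntimeFlags& flags, const std::string& name) {
  const auto it = flags.find(name);
  return it != flags.end() && it->second; // sketch assumption: missing means off
}

class GuardedLb {
public:
  // The flag is read once here, so the update path only tests a cached bool.
  explicit GuardedLb(const RuntimeFlags& flags)
      : coalesce_(flagEnabled(
            flags, "envoy.reloadable_features.coalesce_lb_rebuilds_on_batch_update")) {}

  // PriorityUpdateCb analogue.
  void onPriorityUpdate(uint32_t priority) {
    if (coalesce_) {
      dirty_priorities_.insert(priority); // guarded path: defer the work
    } else {
      rebuild(priority); // fallback: original inline behavior
    }
  }

  // MemberUpdateCb analogue; the dirty set is always empty on the fallback path.
  void onMemberUpdate() {
    for (uint32_t priority : dirty_priorities_) {
      rebuild(priority);
    }
    dirty_priorities_.clear();
  }

  int rebuilds() const { return rebuilds_; }

private:
  void rebuild(uint32_t /*priority*/) { ++rebuilds_; }

  const bool coalesce_;
  std::set<uint32_t> dirty_priorities_;
  int rebuilds_{0};
};

// Returns how many rebuilds ran before the batch-end callback would fire.
inline int rebuildsBeforeBatchEnd(bool guard_on) {
  const RuntimeFlags flags{
      {"envoy.reloadable_features.coalesce_lb_rebuilds_on_batch_update", guard_on}};
  GuardedLb lb(flags);
  lb.onPriorityUpdate(0);
  lb.onPriorityUpdate(1);
  return lb.rebuilds();
}
```

With the guard on, no rebuild happens until `onMemberUpdate()`; with it off, each priority update rebuilds inline as before.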

@wdauchy
Contributor Author

wdauchy commented Feb 22, 2026

/retest transients

Signed-off-by: William Dauchy <william.dauchy@datadoghq.com>
@repokitteh-read-only

CC @envoyproxy/coverage-shephards: FYI only for changes made to (test/coverage.yaml).
envoyproxy/coverage-shephards assignee is @RyanTheOptimist

🐱

Caused by: #43346 was synchronize by wdauchy.

see: more, trace.

@wdauchy
Contributor Author

wdauchy commented Feb 22, 2026

/retest transients

@wdauchy wdauchy requested a review from botengyao February 22, 2026 16:33
Member

@botengyao botengyao left a comment

LGTM at a high level, thanks for the great contribution!

Comment thread test/coverage.yaml
source/extensions/internal_redirect/safe_cross_scheme: 81.3
source/extensions/internal_redirect/allow_listed_routes: 85.7
source/extensions/internal_redirect/previous_routes: 89.3
source/extensions/load_balancing_policies/common: 96.3
Member

what can we do to fill the 0.3?

Contributor Author

for this patch I honestly don't know, that's why I ended up doing this. I think load balancing policies could deserve more tests, but do we want to add them in this pr?

Contributor Author

ok I will try a last stretch

Contributor Author

now fixed, sorry for the laziness I had earlier :))

Signed-off-by: William Dauchy <william.dauchy@datadoghq.com>
Member

@nezdolik nezdolik left a comment

Thanks for the great improvement, I had just one question.

}

if (locality_weighted_balancing_) {
rebuildLocalityWrrForPriority(priority);
Member

is there any co-dependency between regenerateLocalityRoutingStructures() and rebuildLocalityWrrForPriority() in order of invocation? E.g. in original code regenerateLocalityRoutingStructures() is invoked before rebuildLocalityWrrForPriority().

Contributor Author

@wdauchy wdauchy Feb 23, 2026

good catch. there's no data co-dependency between the two, regenerateLocalityRoutingStructures() writes to locality_routing_state_/residual_capacity_ while rebuildLocalityWrrForPriority() writes to locality_wrr_, so they don't read each other's output. that said, you're right the original code had regenerateLocalityRoutingStructures() first, so I've restored that ordering in both the new and fallback paths to stay consistent with the original behavior.

Signed-off-by: William Dauchy <william.dauchy@datadoghq.com>
@wdauchy wdauchy requested a review from botengyao February 23, 2026 12:16
Member

@botengyao botengyao left a comment

lgtm, thanks!

@botengyao botengyao merged commit c0abf38 into envoyproxy:main Feb 23, 2026
28 checks passed
botengyao added a commit that referenced this pull request Mar 2, 2026
…43697)

Fixes a crash introduced by #43346. 
When a thread-aware load balancer is initialized mid-batch, the
per-priority panic tracking vectors have not been resized yet because
the batch member update callback hasn't fired. This causes an
out-of-bounds read during the LB refresh loop.

1. Process any dirty priorities to properly size vectors if the load
balancer is initialized mid-batch.
2. Add a bounds check `ASSERT` to prevent silent out-of-bounds bit
vector reads.
3. Add an initialization regression test to prevent this pattern from
breaking in the future.
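
The first two steps above can be sketched as follows. This is an illustration under assumptions, not the actual fix: `MidBatchInitSketch` and `initializeMidBatch` are hypothetical names, `std::vector<bool>` stands in for the per-priority panic tracking, and the standard `assert` stands in for Envoy's `ASSERT` macro.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical model of a thread-aware LB that can be initialized mid-batch.
class MidBatchInitSketch {
public:
  // PriorityUpdateCb analogue: the priority now exists, but the per-priority
  // state vector is not resized until dirty priorities are processed.
  void onPriorityUpdate(uint32_t priority) {
    num_priorities_ = std::max<std::size_t>(num_priorities_, priority + 1);
    dirty_priorities_.insert(priority);
  }

  // May run before the MemberUpdateCb analogue has fired for the batch.
  // Returns the number of priorities the refresh loop visited.
  std::size_t initialize() {
    processDirtyPriorities(); // step 1: size the vectors before reading them
    return refresh();
  }

private:
  void processDirtyPriorities() {
    if (dirty_priorities_.empty()) {
      return;
    }
    per_priority_panic_.resize(num_priorities_, false);
    dirty_priorities_.clear();
  }

  std::size_t refresh() {
    std::size_t visited = 0;
    for (std::size_t priority = 0; priority < num_priorities_; ++priority) {
      // Step 2: fail loudly in debug builds rather than read out of bounds.
      assert(priority < per_priority_panic_.size());
      const bool in_panic = per_priority_panic_[priority];
      (void)in_panic; // the real loop would use this to pick hosts
      ++visited;
    }
    return visited;
  }

  std::size_t num_priorities_{0};
  std::set<uint32_t> dirty_priorities_;
  std::vector<bool> per_priority_panic_;
};

// Marks `count` priorities dirty and initializes without a batch-end callback.
inline std::size_t initializeMidBatch(uint32_t count) {
  MidBatchInitSketch lb;
  for (uint32_t priority = 0; priority < count; ++priority) {
    lb.onPriorityUpdate(priority);
  }
  return lb.initialize();
}
```

Without the `processDirtyPriorities()` call in `initialize()`, the loop in this sketch would index an empty vector, which mirrors the out-of-bounds read described above.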

Commit Message:
Additional Description:
Risk Level: low (already guarded by
`envoy.reloadable_features.coalesce_lb_rebuilds_on_batch_update`).
Testing:
Docs Changes:
Release Notes:

Signed-off-by: Boteng Yao <boteng@google.com>
bmjask pushed a commit to bmjask/envoy that referenced this pull request Mar 14, 2026
…nvoyproxy#43346)

bmjask pushed a commit to bmjask/envoy that referenced this pull request Mar 14, 2026
…nvoyproxy#43697)

bvandewalle pushed a commit to bvandewalle/envoy that referenced this pull request Mar 17, 2026
…nvoyproxy#43346)

bvandewalle pushed a commit to bvandewalle/envoy that referenced this pull request Mar 17, 2026
…nvoyproxy#43697)

fishcakez pushed a commit to fishcakez/envoy that referenced this pull request Mar 25, 2026
…nvoyproxy#43346)

fishcakez pushed a commit to fishcakez/envoy that referenced this pull request Mar 25, 2026
…nvoyproxy#43697)

henrymwang pushed a commit to DataDog/envoy that referenced this pull request Apr 13, 2026
…nvoyproxy#43346)

henrymwang pushed a commit to DataDog/envoy that referenced this pull request Apr 13, 2026
…nvoyproxy#43697)

@phlax phlax added this to the 1.38.0 milestone Apr 16, 2026
krinkinmu pushed a commit to grnmeira/envoy that referenced this pull request Apr 20, 2026
…nvoyproxy#43346)

krinkinmu pushed a commit to grnmeira/envoy that referenced this pull request Apr 20, 2026
…nvoyproxy#43697)
