fix(prefill-router): persist endpoint after handshake so decode rebuild reactivates#8965
Conversation
…ld reactivates When all decode pods in a namespace temporarily go away (e.g., the DYN_HEALTH_CHECK_ENABLED canary livenessProbe kills two replicas in quick succession, or a manual rollout deletes the decode pods), the watcher tears down the decode WorkerSet via remove_worker_set + remove_prefill_activator. Once the new decode pods register fresh instance_ids in ETCD, the watcher rebuilds the WorkerSet. The new PrefillRouter calls register_prefill_router, which now finds the activator map empty and falls into the None branch, returning a fresh DecodeWaiting(tx). Prefill workers haven't changed, so no second activate_prefill_router call ever fires. The new PrefillRouter sits on that oneshot::Receiver forever; prefill_router OnceLock stays None; PrefillRouter::generate falls back to aggregated mode and forwards bare requests to decode workers, which reject them with "Disaggregated params are required for decode mode" until the frontend pod itself is restarted. The fix mirrors what the existing prefill-rejoin path at lines 740-770 already does: persist the prefill endpoint as PrefillReady(rx) after a successful handshake so future decode WorkerSet rebuilds find it and activate immediately. Two changes: 1. DecodeWaiting branch: after waking the original receiver, also store a fresh PrefillReady(rx) carrying a clone of the endpoint. Endpoint is #[derive(Clone)]. 2. PrefillReady branch: previously errored with "already activated". Change to refresh the cached endpoint and reactivate any deactivated decode-side router. This handles two cases: (a) duplicate activate_prefill_router calls, (b) prefill rejoin after the DecodeWaiting branch's persist step has left a PrefillReady entry. The None branch (lines 740-789) is unchanged and still handles the case where the activator map was cleared between the original handshake and a prefill rejoin. Manually reproduced on dynamo-trtllm:head-tot-rc13-2026-04-29-x86-b200 in a 1P+2D Kubernetes deployment. After "kubectl delete pod" of both decode pods, the post-restart inference success rate is 0/10 (HTTP 500 ValueError: Disaggregated params required for decode mode). With this patch the same scenario should return to 1/1 readiness with inference probes succeeding immediately after the new decode pods register their instance_ids. Verification: - An integration test in tests/fault_tolerance/ should drive the full end-to-end scenario (real engine, kill decode rank, verify inference succeeds after rebuild) — left for a follow-up since the existing test_canary_rank_pause.py harness only covers the canary's pause detection and not the WorkerSet-rebuild routing path. - Rust unit tests for the new PrefillReady-branch behavior are also follow-up work; they need a way to construct a stub Endpoint without a DistributedRuntime.
Companion to the activate_prefill_router persist-on-handshake fix in
the previous commit. That commit caches the prefill endpoint as
PrefillReady(rx) in prefill_router_activators after a successful
handshake so future decode WorkerSet rebuilds find it and activate
immediately. But the watcher's component-teardown path
(handle_delete_helper) was clearing prefill_router_activators on ANY
WorkerSet removal — including decode-component removal — which wiped
the cache exactly when we needed it.
Empirically verified on a 1P+2D Kubernetes deployment with the
activate_prefill_router fix alone: after `kubectl delete pod` of both
decode pods, the frontend log shows the cached PrefillReady was set
during cold start ("Activated prefill router for decode WorkerSet")
and then cleared by the watcher ("Removed WorkerSet (no remaining
instances in namespace)"). Post-rebuild inference returns 0/10 with
"Disaggregated params required for decode mode" — same as without the
fix.
Move remove_prefill_activator inside the `supports_prefill()` branch
so it only fires when the prefill component is gone (where the cached
endpoint is genuinely stale). Decode-component teardown leaves the
cache intact; the next handle_add that creates a new decode WorkerSet
calls register_prefill_router and finds PrefillReady, activating
without needing prefill workers to re-register.
Both the activator cleanup AND the existing decode-side
deactivate_prefill_router_for_decode call are now under the same
prefill-teardown guard, which is the correct semantic — neither
should fire on decode-side removal.
…persists The previous v2 fix (commit cde9b57 + ec33872) only kept PrefillReady alive across decode rebuilds when decode happened to register first at cold start. When prefill won the race instead, the existing register_prefill_router PrefillReady-arm consumed the cached oneshot::Receiver and emptied the map, leaving the same buggy state. Empirically reproduced on -v2 image: frontend log shows "Stored prefill endpoint for future decode WorkerSet registration" (the prefill-first path) at cold start, then post-rebuild requests still return 0/10 with the disagg-params error. The right fix is to cache the Endpoint directly instead of wrapping it in a oneshot::Receiver: enum PrefillActivationState { - PrefillReady(oneshot::Receiver<Endpoint>), + PrefillReady(Endpoint), } The Endpoint is Clone (#[derive(Debug, Clone)]), so register_prefill_router synthesizes a fresh oneshot::channel and primes the receiver with a clone of the cached endpoint, then re-inserts PrefillReady with the endpoint so the next call finds it again. activate_prefill_router inserts PrefillReady(endpoint) directly without the oneshot dance. Both race winners now leave a durable PrefillReady cache: - Decode arrives first → activate_prefill_router DecodeWaiting branch wakes original receiver, then inserts PrefillReady(endpoint) - Prefill arrives first → activate_prefill_router None branch inserts PrefillReady(endpoint); decode's register_prefill_router consumes, hands out fresh rx with clone, AND re-inserts PrefillReady(endpoint) Decode rebuild (next register_prefill_router for the same key) finds PrefillReady regardless of cold-start race, hands out a fresh rx, and re-caches. Prefill teardown still clears the cache via remove_prefill_activator (in the watcher's prefill-only branch from the previous commit). Companion to ec33872 (watcher: scope remove_prefill_activator to prefill teardown). Both commits together fix the "all decodes restart → frontend stuck in aggregated mode" wedge.
WalkthroughRefactors the prefill/decode rendezvous state machine from storing Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~40 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@lib/llm/src/discovery/model_manager.rs`:
- Around line 49-56: The PrefillActivationState enum's PrefillReady(Endpoint)
variant must be boxed to satisfy clippy::large_enum_variant; change it to
PrefillReady(Box<Endpoint>) in the enum definition (PrefillActivationState) and
update all places that construct or pattern-match it: when destructuring
PrefillReady(endpoint) dereference the box (e.g., *endpoint) to access Endpoint,
when sending or cloning use (*endpoint).clone() or clone the inner value, and
when constructing PrefillReady wrap the endpoint with Box::new(...) (e.g.,
PrefillReady(Box::new(endpoint.clone()))). Ensure every usage site that
previously assumed owning Endpoint is adjusted to handle Box<Endpoint>.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b96ab0fc-15c7-4fd0-b6ad-b5c3d592177f
📒 Files selected for processing (2)
lib/llm/src/discovery/model_manager.rslib/llm/src/discovery/watcher.rs
…arge_enum_variant CodeRabbit / CI Clippy flagged `PrefillReady(Endpoint)` as creating a large size disparity vs `DecodeWaiting(oneshot::Sender<Endpoint>)`. Box the cached Endpoint so the enum variants stay balanced. Touches: - enum definition: `PrefillReady(Endpoint)` → `PrefillReady(Box<Endpoint>)` - register_prefill_router PrefillReady arm: `endpoint.clone()` → `(*endpoint).clone()` to dereference the Box for the oneshot::Sender (which still carries Endpoint, not Box<Endpoint>) - 4 construction sites: wrap with `Box::new(endpoint)` - destructure-and-reinsert path keeps the Box (no change needed) Logic is unchanged; only the storage indirection changes. Behavior verified end-to-end on -v3 image (commits 1+2+3) was 10/10. This follow-up satisfies the lint without touching any state-machine code paths.
CI rust-tests cargo-fmt check flagged the four `.insert(key, PrefillActivationState::PrefillReady(Box::new(endpoint)));` lines as too long for the single-line form. Apply rustfmt's preferred multi-line wrap. Pure formatting; no behavior change.
…ardown
Companion to commit 2 ("watcher: scope remove_prefill_activator to prefill
teardown only"). That commit deliberately stopped clearing activator state on
decode teardown so the cached `PrefillReady` survives across decode rebuilds.
The unintended consequence: a `DecodeWaiting(sender)` entry left over from a
decode that registered before any prefill also persists, even though its
`oneshot::Receiver` was dropped along with the PrefillRouter inside the
removed WorkerSet.
The next decode rebuild's `register_prefill_router` then finds the stale
`DecodeWaiting`, hits the `Some(DecodeWaiting)` arm at model_manager.rs:737,
emits `Decode WorkerSet already registered for this prefill router`, and
returns `None`. The rebuilt WorkerSet ends up with no PrefillRouter at all.
When prefill finally registers, `activate_prefill_router` wakes the orphaned
receiver and activates the *old* router (still kept alive by the spawn
task's Arc), so logs *look* like success ("Activating prefill router" /
"Prefill router activated successfully") while the rebuilt WorkerSet stays
empty. Inference requests bypass the missing PrefillRouter and decode
workers reject them with "Disaggregated params required for decode mode".
Reproducer:
1. Start frontend; decode registers; prefill held back (e.g., invalid
image, replicas=0, impossible nodeSelector). Activator → DecodeWaiting.
2. Force-delete the decode pod. Watcher tears down the WorkerSet. With
commit 2 alone, DecodeWaiting stays in the map (stale).
3. Replacement decode pod registers. register_prefill_router finds stale
DecodeWaiting → returns None → rebuilt WorkerSet has no PrefillRouter.
Frontend log: ERROR "Decode WorkerSet already registered for this
prefill router".
4. Allow prefill to register. activate_prefill_router activates the
orphaned old router; rebuilt WorkerSet still empty.
5. Inference: HTTP 500 "Disaggregated params are required for decode mode".
Verified end-to-end on `head-tot-rc13-pr8965-2026-05-01-x86-b200`
(d104, 2026-05-01): 5/5 inference probes return the expected disagg-params
500 with the predicted log signature.
Fix:
- Add `ModelManager::remove_decode_prefill_waiter` that atomically removes
the activator entry only if it's `DecodeWaiting`. `PrefillReady` cache
entries are preserved.
- Watcher's `handle_delete_helper` now calls `remove_decode_prefill_waiter`
on decode-component teardown (when `!supports_prefill()`), preserving
PR 8965's primary contribution while plugging the stale-waiter gap.
State-machine invariants are now:
DecodeWaiting → "a live decode router is waiting on the receiver"
PrefillReady → "a prefill endpoint is cached for future decode rebuilds"
…code-waiter cleanup Addresses two correctness issues raised by dynamo-ops review on PR 8965. 1. **Race on PrefillReady re-insert** (model_manager.rs) Both `register_prefill_router` and `activate_prefill_router` previously used a `remove → process → insert` pattern on `prefill_router_activators`. Between the `remove` and the `insert` the entry is gone from the map, so a concurrent `remove_prefill_activator` call (called by the watcher on prefill-component teardown) hits the empty map, skips its cleanup, and we then re-insert a stale `PrefillReady` for a prefill that's already gone. The next decode rebuild's lookup finds the resurrected cache and activates a `PrefillRouter` against a dead endpoint. Refactor both functions to use DashMap's `Entry` API. The shard lock is held for the duration of the OccupiedEntry, so any concurrent `remove_prefill_activator` serializes after us and observes the entry it needs to clear. The PrefillReady-consume path in `register_prefill_router` now reads-and-clones in place without removing; the DecodeWaiting/PrefillReady → PrefillReady transition in `activate_prefill_router` uses `OccupiedEntry::insert(...)` for an atomic value swap that returns the old value. 2. **Decode-waiter cleanup gated on `removed.is_some()`** (watcher.rs) `handle_delete_helper`'s decode branch only called `remove_decode_prefill_waiter` when `remove_worker_set` returned `Some(_)`. If decode registered (creating a `DecodeWaiting` activator entry via `register_prefill_router`) but `handle_add_helper` failed later (e.g., on `kv_chooser_for`, monitor setup, or `build_routed_pipeline`) before `add_worker_set`, no WorkerSet ever landed in the manager — and on later teardown `remove_worker_set` returns `None`, leaving the stale `DecodeWaiting` entry orphaned in the activator map. Drop the `removed.is_some()` guard. The helper itself is state-safe (uses `DashMap::remove_if(|_, v| matches!(v, DecodeWaiting(_)))`) so calling it on a key that's vacant or holds `PrefillReady` is a no-op. The prefill-teardown branch keeps its `removed.is_some()` guard because `remove_prefill_activator` blanket-removes both states, and calling it without a real WorkerSet teardown could orphan a live decode-side `DecodeWaiting`. Behavior is unchanged for all four reproducers (ai-dynamo#1 burst-sweep canary OFF, ai-dynamo#2 Pattern 1 multi-decode restart, ai-dynamo#3 Pattern 2 stale-DecodeWaiting, ai-dynamo#4 burst-sweep canary ON); will re-verify on a fresh build.
CI rust-tests caught a one-liner fmt nit at model_manager.rs:765 — rustfmt prefers the let-binding on a single line at this width. Pure cosmetic.
…decode-rebuild-stall Signed-off-by: yifjiang <19356972+yifjiang@users.noreply.github.com>
0947b7b to
a2e9e75
Compare
Signed-off-by: yifjiang <19356972+yifjiang@users.noreply.github.com>
Signed-off-by: yifjiang <19356972+yifjiang@users.noreply.github.com>
Signed-off-by: yifjiang <19356972+yifjiang@users.noreply.github.com>
Signed-off-by: yifjiang <19356972+yifjiang@users.noreply.github.com>
…decode-rebuild-stall Signed-off-by: yifjiang <19356972+yifjiang@users.noreply.github.com> # Conflicts: # docs/backends/vllm/vllm-omni.md
Signed-off-by: yifjiang <19356972+yifjiang@users.noreply.github.com>
PeaBrane
left a comment
There was a problem hiding this comment.
LGTM. The state-machine change looks right to me: the initial decode-first and prefill-first rendezvous paths still work, and keeping the prefill endpoint cached across decode WorkerSet rebuilds addresses the stall without adding request-path overhead.
One follow-up I’d like to see, if practical: a regression/integration test for decode-first activation followed by a decode WorkerSet rebuild while prefill stays alive, asserting that the rebuilt decode side activates immediately from the cached PrefillReady endpoint. The new DecodeWaiting cleanup tests cover one half of the lifecycle; this would lock down the core behavior this PR is fixing.
Summary
Fixes a routing-chain stall in the disaggregated frontend that turns transient decode-pod restarts into permanent inference outages. After all decode pods in a namespace briefly disappear (canary livenessProbe kills two replicas in quick succession, or a manual rollout), every subsequent inference request returns:
…until the frontend pod itself is restarted. Pods stay 1/1 Running, prefill workers are clean, but routing has degenerated to aggregated mode (bare requests forwarded to decode workers) and stays that way.
The PR also addresses a corner case discovered during fix iteration: decode registers before prefill, then is removed before prefill ever arrives — leaving a stale
DecodeWaitingentry that orphans the next decode rebuild.Two patterns, both fixed
Pattern 1 — decode-rebuild stall after handshake completes
Post-handshake the activator map was empty, so the rebuild's
register_prefill_routerfell into theNonebranch, created a freshDecodeWaiting(tx), and waited forever (prefill workers hadn't changed; nobody callsactivate_prefill_routeragain).PrefillRouter::generatefalls back to aggregated mode and forwards bare requests to decode → trtllm 500s, vllm/sglang silent agg-mode degradation.Pattern 2 — stale DecodeWaiting after decode-first race
How to reproduce
Both patterns can be exercised on a single Kubernetes namespace running 1 prefill + 1–2 decode workers + a frontend. No special engine config required.
Pattern 1 reproducer (decode-rebuild stall)
Setup: 1P + 2D + canary livenessProbe wired to
/health. Required env on each worker:DYN_HEALTH_CHECK_ENABLED=true,DYN_CANARY_WAIT_TIME=10,DYN_HEALTH_CHECK_REQUEST_TIMEOUT=3, pluslivenessProbe: {httpGet: {path: /health, port: system}, failureThreshold: 6, periodSeconds: 10}. Apply the DGD and wait for all 4 pods Ready, then 60 s for worker discovery to settle.In production this trigger fires when the canary livenessProbe kills both decode replicas in close succession (we observed ~1.5 min apart on a real incident), bringing the namespace's decode count to 0 briefly.
Pattern 2 reproducer (decode-first race / stale DecodeWaiting)
Setup: 1P + 1D + frontend. Make the PREFILL image temporarily invalid (e.g.,
nvcr.io/.../dynamo-trtllm:DOESNOTEXIST) so it never registers; decode + frontend images are real. Apply the DGD and wait for frontend + decode Ready (prefill stays inImagePullBackOff). The activator now holdsDecodeWaiting— decode arrived first, is waiting for prefill to register.Frontend log signatures
Activated prefill router+Activating prefill routerRemoved WorkerSet (no remaining instances in namespace)Activating prefill routerever)Activating prefill router+Prefill router activated successfully(cache hand-off)No prefill endpoint for namespace yet, storing senderRemoved WorkerSetERROR Decode WorkerSet already registered for this prefill router— rebuilt WorkerSet has no PrefillRouterRemoved stale DecodeWaiting activator on decode WorkerSet teardown(debug)Activating prefill routerafter prefill arrivesFix (8 commits)
cde9b578a6activate_prefill_routerDecodeWaiting branch: persistPrefillReadyafter waking the original receiverec33872bdbremove_prefill_activatorto prefill-component teardown only07142a7d6aPrefillReady(oneshot::Receiver<Endpoint>)→PrefillReady(Endpoint); consumer hands out fresh receiver synthesized from cache and re-inserts atomicallyd4088241c0PrefillReady(Endpoint)→PrefillReady(Box<Endpoint>)forclippy::large_enum_variant3bc7891affef1dadcdf5remove_decode_prefill_waiterhelper + watcher decode-teardown call — Pattern 2 fix47b6f0b91cEntryAPI for atomic per-key state transitions; ungate decode-waiter cleanup onremoved.is_some()(addresses dynamo-ops review)96ecac2204State-machine invariants are now unambiguous:
DecodeWaiting(sender)= a live decode router is waiting on the receiver — cleared on decode teardownPrefillReady(Box<Endpoint>)= a prefill endpoint is cached for future decode rebuilds — cleared only on prefill teardownVerification
End-to-end verified on
head-tot-rc13-pr8965-2026-05-02-x86-b200:Without these fixes, Patterns 1 and 2 produce 100% HTTP 500
Disaggregated params required for decode mode(trtllm) or silent agg-mode degradation (vllm/sglang). The latest two commits (entry-API atomicity + ungated cleanup) are logic-equivalent to the prior verified head; re-verification on a fresh image build pending.Test plan
PrefillReady-branch refresh-and-reactivate behavior. Currently blocked on the existing test-mod note ("activate_prefill_router requires an Endpoint, so we test the registration state machine and cleanup only" —model_manager.rs:1094-1095). A small#[cfg(test)]shim that constructs a stubEndpointwould unblock this.tests/fault_tolerance/mirroringtest_canary_rank_pause.py: spin up a real engine + decode worker, kill the decode rank, wait for restart, send an inference request, assert it returns 200.--enforce-disaggto confirm the existing strict-mode error path still fires correctly when prefill is actually gone (this PR shouldn't change that).Related
The rebuild trigger pattern was first observed in production with
head-tot-rc13-pr13495on the GB200 NVCF disagg shadow function: canary livenessProbe killed both decode replicas within ~1.5 minutes of each other, bringing the namespace's decode count to 0 briefly. After the new pods came up healthy, every subsequent request hit this bug for the next 75 minutes until the function was redeployed via NVCF version cutover.Pre-fix mitigations (without
kubectlaccess in NVCF): raiseDYN_HEALTH_CHECK_REQUEST_TIMEOUTabove prefill P99 (≥ 130 s for the prod 480B), or drop the canary livenessProbe entirely. Both remove the trigger but don't fix the underlying state-machine gap.🤖 Generated with Claude Code
Summary by CodeRabbit