Skip to content

fix(prefill-router): persist endpoint after handshake so decode rebuild reactivates#8965

Merged
yifjiang merged 15 commits into
ai-dynamo:mainfrom
yifjiang:fix/prefill-router-decode-rebuild-stall
May 7, 2026
Merged

fix(prefill-router): persist endpoint after handshake so decode rebuild reactivates#8965
yifjiang merged 15 commits into
ai-dynamo:mainfrom
yifjiang:fix/prefill-router-decode-rebuild-stall

Conversation

@yifjiang
Copy link
Copy Markdown
Contributor

@yifjiang yifjiang commented May 1, 2026

Summary

Fixes a routing-chain stall in the disaggregated frontend that turns transient decode-pod restarts into permanent inference outages. After all decode pods in a namespace briefly disappear (canary livenessProbe kills two replicas in quick succession, or a manual rollout), every subsequent inference request returns:

HTTP 500 ValueError: Disaggregated params are required for decode mode

…until the frontend pod itself is restarted. Pods stay 1/1 Running, prefill workers are clean, but routing has degenerated to aggregated mode (bare requests forwarded to decode workers) and stays that way.

The PR also addresses a corner case discovered during fix iteration: decode registers before prefill, then is removed before prefill ever arrives — leaving a stale DecodeWaiting entry that orphans the next decode rebuild.

Two patterns, both fixed

Pattern 1 — decode-rebuild stall after handshake completes

decode registers → prefill registers → handshake completes (map empty)
→ all decode pods removed (canary kills both, or kubectl delete)
→ new decode pods register → rebuild's PrefillRouter stalls in DecodeWaiting forever

Post-handshake the activator map was empty, so the rebuild's register_prefill_router fell into the None branch, created a fresh DecodeWaiting(tx), and waited forever (prefill workers hadn't changed; nobody calls activate_prefill_router again). PrefillRouter::generate falls back to aggregated mode and forwards bare requests to decode → trtllm 500s, vllm/sglang silent agg-mode degradation.

Pattern 2 — stale DecodeWaiting after decode-first race

decode registers first → activator stores DecodeWaiting(sender)
→ decode WorkerSet removed before prefill ever arrives
→ activator state preserved on decode teardown (good for Pattern 1's PrefillReady cache)
  …but DecodeWaiting's receiver is now dead (held by the dropped PrefillRouter)
→ new decode pod registers → register_prefill_router sees stale DecodeWaiting → returns None
→ rebuilt WorkerSet has no PrefillRouter

How to reproduce

Both patterns can be exercised on a single Kubernetes namespace running 1 prefill + 1–2 decode workers + a frontend. No special engine config required.

Pattern 1 reproducer (decode-rebuild stall)

Setup: 1P + 2D + canary livenessProbe wired to /health. Required env on each worker: DYN_HEALTH_CHECK_ENABLED=true, DYN_CANARY_WAIT_TIME=10, DYN_HEALTH_CHECK_REQUEST_TIMEOUT=3, plus livenessProbe: {httpGet: {path: /health, port: system}, failureThreshold: 6, periodSeconds: 10}. Apply the DGD and wait for all 4 pods Ready, then 60 s for worker discovery to settle.

# Pre-trigger sanity probe
curl -X POST http://<frontend>/v1/chat/completions -d '{...}'    # → HTTP 200

# Trigger: kubectl-delete BOTH decodes simultaneously
kubectl delete pod decodeworker-0 decodeworker-1 --grace-period=0 --force

# Wait ~2 min for replacement decodes to be Ready, then probe again:
curl -X POST http://<frontend>/v1/chat/completions -d '{...}'
# without fix: HTTP 500 ValueError: Disaggregated params are required for decode mode
# with fix:    HTTP 200 ✓

In production this trigger fires when the canary livenessProbe kills both decode replicas in close succession (we observed ~1.5 min apart on a real incident), bringing the namespace's decode count to 0 briefly.

Pattern 2 reproducer (decode-first race / stale DecodeWaiting)

Setup: 1P + 1D + frontend. Make the PREFILL image temporarily invalid (e.g., nvcr.io/.../dynamo-trtllm:DOESNOTEXIST) so it never registers; decode + frontend images are real. Apply the DGD and wait for frontend + decode Ready (prefill stays in ImagePullBackOff). The activator now holds DecodeWaiting — decode arrived first, is waiting for prefill to register.

# Trigger: kubectl-delete the decode pod while still DecodeWaiting
kubectl delete pod decodeworker-0 --grace-period=0 --force

# Wait for replacement decode Ready. Now patch the prefill image to valid:
kubectl patch dgd <name> --type=json -p \
  '[{"op":"replace","path":"/spec/services/PrefillWorker/extraPodSpec/mainContainer/image","value":"nvcr.io/.../dynamo-trtllm:<real-tag>"}]'

# Wait for prefill Ready, then probe:
curl -X POST http://<frontend>/v1/chat/completions -d '{...}'
# without fix: HTTP 500 ValueError: Disaggregated params are required for decode mode
# with fix:    HTTP 200 ✓

Frontend log signatures

State Cold-start handshake After teardown Recovery
Without fix (P1) Activated prefill router + Activating prefill router Removed WorkerSet (no remaining instances in namespace) (silent — no second Activating prefill router ever)
With fix (P1) same same Activating prefill router + Prefill router activated successfully (cache hand-off)
Without fix (P2) No prefill endpoint for namespace yet, storing sender Removed WorkerSet ERROR Decode WorkerSet already registered for this prefill router — rebuilt WorkerSet has no PrefillRouter
With fix (P2) same Removed stale DecodeWaiting activator on decode WorkerSet teardown (debug) second Activating prefill router after prefill arrives

Fix (8 commits)

Commit What
cde9b578a6 activate_prefill_router DecodeWaiting branch: persist PrefillReady after waking the original receiver
ec33872bdb watcher: scope remove_prefill_activator to prefill-component teardown only
07142a7d6a refactor PrefillReady(oneshot::Receiver<Endpoint>)PrefillReady(Endpoint); consumer hands out fresh receiver synthesized from cache and re-inserts atomically
d4088241c0 PrefillReady(Endpoint)PrefillReady(Box<Endpoint>) for clippy::large_enum_variant
3bc7891aff cargo fmt for boxed inserts
ef1dadcdf5 add remove_decode_prefill_waiter helper + watcher decode-teardown call — Pattern 2 fix
47b6f0b91c refactor both functions to Entry API for atomic per-key state transitions; ungate decode-waiter cleanup on removed.is_some() (addresses dynamo-ops review)
96ecac2204 cargo fmt for entry-API refactor

State-machine invariants are now unambiguous:

  • DecodeWaiting(sender) = a live decode router is waiting on the receiver — cleared on decode teardown
  • PrefillReady(Box<Endpoint>) = a prefill endpoint is cached for future decode rebuilds — cleared only on prefill teardown

Verification

End-to-end verified on head-tot-rc13-pr8965-2026-05-02-x86-b200:

# Reproducer Result
1 burst-sweep, canary OFF 3/3 RECOVERY at T+30 / 90 / 120 s ✅
2 multi-decode restart (Pattern 1) 10 / 10 ✅
3 decode-first race (Pattern 2) 5 / 5 ✅
4 burst-sweep, canary ON 3/3 RECOVERY at T+60 / 120 / 180 s ✅

Without these fixes, Patterns 1 and 2 produce 100% HTTP 500 Disaggregated params required for decode mode (trtllm) or silent agg-mode degradation (vllm/sglang). The latest two commits (entry-API atomicity + ungated cleanup) are logic-equivalent to the prior verified head; re-verification on a fresh image build pending.

Test plan

  • Manual reproduction on a real Kubernetes cluster across all four reproducers — confirms each fires before the fix and recovers after.
  • Unit test for the new PrefillReady-branch refresh-and-reactivate behavior. Currently blocked on the existing test-mod note ("activate_prefill_router requires an Endpoint, so we test the registration state machine and cleanup only" — model_manager.rs:1094-1095). A small #[cfg(test)] shim that constructs a stub Endpoint would unblock this.
  • Integration test in tests/fault_tolerance/ mirroring test_canary_rank_pause.py: spin up a real engine + decode worker, kill the decode rank, wait for restart, send an inference request, assert it returns 200.
  • Run with --enforce-disagg to confirm the existing strict-mode error path still fires correctly when prefill is actually gone (this PR shouldn't change that).

Related

The rebuild trigger pattern was first observed in production with head-tot-rc13-pr13495 on the GB200 NVCF disagg shadow function: canary livenessProbe killed both decode replicas within ~1.5 minutes of each other, bringing the namespace's decode count to 0 briefly. After the new pods came up healthy, every subsequent request hit this bug for the next 75 minutes until the function was redeployed via NVCF version cutover.

Pre-fix mitigations (without kubectl access in NVCF): raise DYN_HEALTH_CHECK_REQUEST_TIMEOUT above prefill P99 (≥ 130 s for the prod 480B), or drop the canary livenessProbe entirely. Both remove the trigger but don't fix the underlying state-machine gap.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Refactor
    • Optimized internal routing state machine to improve endpoint caching efficiency and reduce unnecessary re-registrations during system operations.
    • Enhanced worker set deletion cleanup logic to ensure more precise resource management and system stability.

…ld reactivates

When all decode pods in a namespace temporarily go away (e.g., the
DYN_HEALTH_CHECK_ENABLED canary livenessProbe kills two replicas in
quick succession, or a manual rollout deletes the decode pods), the
watcher tears down the decode WorkerSet via remove_worker_set +
remove_prefill_activator. Once the new decode pods register fresh
instance_ids in ETCD, the watcher rebuilds the WorkerSet. The new
PrefillRouter calls register_prefill_router, which now finds the
activator map empty and falls into the None branch, returning a fresh
DecodeWaiting(tx). Prefill workers haven't changed, so no second
activate_prefill_router call ever fires. The new PrefillRouter sits on
that oneshot::Receiver forever; prefill_router OnceLock stays None;
PrefillRouter::generate falls back to aggregated mode and forwards bare
requests to decode workers, which reject them with "Disaggregated
params are required for decode mode" until the frontend pod itself is
restarted.

The fix mirrors what the existing prefill-rejoin path at lines 740-770
already does: persist the prefill endpoint as PrefillReady(rx) after a
successful handshake so future decode WorkerSet rebuilds find it and
activate immediately. Two changes:

1. DecodeWaiting branch: after waking the original receiver, also store
   a fresh PrefillReady(rx) carrying a clone of the endpoint. Endpoint
   is #[derive(Clone)].

2. PrefillReady branch: previously errored with "already activated".
   Change to refresh the cached endpoint and reactivate any deactivated
   decode-side router. This handles two cases: (a) duplicate
   activate_prefill_router calls, (b) prefill rejoin after the
   DecodeWaiting branch's persist step has left a PrefillReady entry.
   The None branch (lines 740-789) is unchanged and still handles the
   case where the activator map was cleared between the original
   handshake and a prefill rejoin.

Manually reproduced on dynamo-trtllm:head-tot-rc13-2026-04-29-x86-b200
in a 1P+2D Kubernetes deployment. After "kubectl delete pod" of both
decode pods, the post-restart inference success rate is 0/10
(HTTP 500 ValueError: Disaggregated params required for decode mode).
With this patch the same scenario should return to 1/1 readiness with
inference probes succeeding immediately after the new decode pods
register their instance_ids.

Verification:
- An integration test in tests/fault_tolerance/ should drive the full
  end-to-end scenario (real engine, kill decode rank, verify inference
  succeeds after rebuild) — left for a follow-up since the existing
  test_canary_rank_pause.py harness only covers the canary's pause
  detection and not the WorkerSet-rebuild routing path.
- Rust unit tests for the new PrefillReady-branch behavior are also
  follow-up work; they need a way to construct a stub Endpoint without
  a DistributedRuntime.
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 1, 2026

@github-actions github-actions Bot added fix external-contribution Pull request is from an external contributor labels May 1, 2026
Companion to the activate_prefill_router persist-on-handshake fix in
the previous commit. That commit caches the prefill endpoint as
PrefillReady(rx) in prefill_router_activators after a successful
handshake so future decode WorkerSet rebuilds find it and activate
immediately. But the watcher's component-teardown path
(handle_delete_helper) was clearing prefill_router_activators on ANY
WorkerSet removal — including decode-component removal — which wiped
the cache exactly when we needed it.

Empirically verified on a 1P+2D Kubernetes deployment with the
activate_prefill_router fix alone: after `kubectl delete pod` of both
decode pods, the frontend log shows the cached PrefillReady was set
during cold start ("Activated prefill router for decode WorkerSet")
and then cleared by the watcher ("Removed WorkerSet (no remaining
instances in namespace)"). Post-rebuild inference returns 0/10 with
"Disaggregated params required for decode mode" — same as without the
fix.

Move remove_prefill_activator inside the `supports_prefill()` branch
so it only fires when the prefill component is gone (where the cached
endpoint is genuinely stale). Decode-component teardown leaves the
cache intact; the next handle_add that creates a new decode WorkerSet
calls register_prefill_router and finds PrefillReady, activating
without needing prefill workers to re-register.

Both the activator cleanup AND the existing decode-side
deactivate_prefill_router_for_decode call are now under the same
prefill-teardown guard, which is the correct semantic — neither
should fire on decode-side removal.
…persists

The previous v2 fix (commit cde9b57 + ec33872) only kept
PrefillReady alive across decode rebuilds when decode happened to
register first at cold start. When prefill won the race instead, the
existing register_prefill_router PrefillReady-arm consumed the cached
oneshot::Receiver and emptied the map, leaving the same buggy state.
Empirically reproduced on -v2 image: frontend log shows "Stored
prefill endpoint for future decode WorkerSet registration" (the
prefill-first path) at cold start, then post-rebuild requests still
return 0/10 with the disagg-params error.

The right fix is to cache the Endpoint directly instead of wrapping
it in a oneshot::Receiver:

  enum PrefillActivationState {
-     PrefillReady(oneshot::Receiver<Endpoint>),
+     PrefillReady(Endpoint),
  }

The Endpoint is Clone (#[derive(Debug, Clone)]), so register_prefill_router
synthesizes a fresh oneshot::channel and primes the receiver with a
clone of the cached endpoint, then re-inserts PrefillReady with the
endpoint so the next call finds it again. activate_prefill_router
inserts PrefillReady(endpoint) directly without the oneshot dance.

Both race winners now leave a durable PrefillReady cache:

- Decode arrives first → activate_prefill_router DecodeWaiting branch
  wakes original receiver, then inserts PrefillReady(endpoint)
- Prefill arrives first → activate_prefill_router None branch inserts
  PrefillReady(endpoint); decode's register_prefill_router consumes,
  hands out fresh rx with clone, AND re-inserts PrefillReady(endpoint)

Decode rebuild (next register_prefill_router for the same key) finds
PrefillReady regardless of cold-start race, hands out a fresh rx, and
re-caches. Prefill teardown still clears the cache via
remove_prefill_activator (in the watcher's prefill-only branch from
the previous commit).

Companion to ec33872 (watcher: scope remove_prefill_activator to
prefill teardown). Both commits together fix the
"all decodes restart → frontend stuck in aggregated mode" wedge.
@pull-request-size pull-request-size Bot added size/L and removed size/M labels May 1, 2026
@yifjiang yifjiang marked this pull request as ready for review May 1, 2026 17:31
@yifjiang yifjiang requested a review from a team as a code owner May 1, 2026 17:31
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 1, 2026

Walkthrough

Refactors the prefill/decode rendezvous state machine from storing oneshot::Receiver to caching actual Endpoint objects. Updates deletion cleanup logic to distinguish whether a WorkerSet was removed and whether the removed component is the PREFILL component, applying cache clearing only when appropriate.

Changes

Cohort / File(s) Summary
State Machine Refactoring
lib/llm/src/discovery/model_manager.rs
Refactors prefill/decode rendezvous from storing oneshot::Receiver to caching Endpoint objects. register_prefill_router now creates fresh oneshot channels on cache hits and reinserts cached endpoints. activate_prefill_router caches endpoints after DecodeWaiting handshake and refreshes cached endpoint with potential deactivated decode-side reactivation. Removes oneshot sender/channel plumbing in non-cached paths.
Cleanup Logic Refinement
lib/llm/src/discovery/watcher.rs
Updates deletion cleanup to distinguish whether a WorkerSet was actually removed and whether the removed component is PREFILL. Activator cache is now cleared only when both conditions are met (deletion occurred and component supports prefill), while preserving decode-side deactivation behavior.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title 'fix(prefill-router): persist endpoint after handshake so decode rebuild reactivates' accurately summarizes the main change: persisting the prefill endpoint after handshake to enable decode rebuild reactivation, directly addressing the core fix for the routing-chain stall.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The PR description is comprehensive and follows the template structure with all required sections: Overview (Summary), Details (detailed pattern analysis, reproducers, fix breakdown, verification), and Related Issues references. It includes root cause, reproduction steps, verification results, and a thorough test plan.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@lib/llm/src/discovery/model_manager.rs`:
- Around line 49-56: The PrefillActivationState enum's PrefillReady(Endpoint)
variant must be boxed to satisfy clippy::large_enum_variant; change it to
PrefillReady(Box<Endpoint>) in the enum definition (PrefillActivationState) and
update all places that construct or pattern-match it: when destructuring
PrefillReady(endpoint) dereference the box (e.g., *endpoint) to access Endpoint,
when sending or cloning use (*endpoint).clone() or clone the inner value, and
when constructing PrefillReady wrap the endpoint with Box::new(...) (e.g.,
PrefillReady(Box::new(endpoint.clone()))). Ensure every usage site that
previously assumed owning Endpoint is adjusted to handle Box<Endpoint>.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b96ab0fc-15c7-4fd0-b6ad-b5c3d592177f

📥 Commits

Reviewing files that changed from the base of the PR and between 67dc080 and 07142a7.

📒 Files selected for processing (2)
  • lib/llm/src/discovery/model_manager.rs
  • lib/llm/src/discovery/watcher.rs

Comment thread lib/llm/src/discovery/model_manager.rs
…arge_enum_variant

CodeRabbit / CI Clippy flagged `PrefillReady(Endpoint)` as creating a
large size disparity vs `DecodeWaiting(oneshot::Sender<Endpoint>)`.
Box the cached Endpoint so the enum variants stay balanced.

Touches:
- enum definition: `PrefillReady(Endpoint)` → `PrefillReady(Box<Endpoint>)`
- register_prefill_router PrefillReady arm: `endpoint.clone()` →
  `(*endpoint).clone()` to dereference the Box for the oneshot::Sender
  (which still carries Endpoint, not Box<Endpoint>)
- 4 construction sites: wrap with `Box::new(endpoint)`
- destructure-and-reinsert path keeps the Box (no change needed)

Logic is unchanged; only the storage indirection changes. Behavior
verified end-to-end on -v3 image (commits 1+2+3) was 10/10. This
follow-up satisfies the lint without touching any state-machine code
paths.
yifjiang

This comment was marked as outdated.

CI rust-tests cargo-fmt check flagged the four
`.insert(key, PrefillActivationState::PrefillReady(Box::new(endpoint)));`
lines as too long for the single-line form. Apply rustfmt's preferred
multi-line wrap. Pure formatting; no behavior change.
…ardown

Companion to commit 2 ("watcher: scope remove_prefill_activator to prefill
teardown only"). That commit deliberately stopped clearing activator state on
decode teardown so the cached `PrefillReady` survives across decode rebuilds.
The unintended consequence: a `DecodeWaiting(sender)` entry left over from a
decode that registered before any prefill also persists, even though its
`oneshot::Receiver` was dropped along with the PrefillRouter inside the
removed WorkerSet.

The next decode rebuild's `register_prefill_router` then finds the stale
`DecodeWaiting`, hits the `Some(DecodeWaiting)` arm at model_manager.rs:737,
emits `Decode WorkerSet already registered for this prefill router`, and
returns `None`. The rebuilt WorkerSet ends up with no PrefillRouter at all.
When prefill finally registers, `activate_prefill_router` wakes the orphaned
receiver and activates the *old* router (still kept alive by the spawn
task's Arc), so logs *look* like success ("Activating prefill router" /
"Prefill router activated successfully") while the rebuilt WorkerSet stays
empty. Inference requests bypass the missing PrefillRouter and decode
workers reject them with "Disaggregated params required for decode mode".

Reproducer:
  1. Start frontend; decode registers; prefill held back (e.g., invalid
     image, replicas=0, impossible nodeSelector). Activator → DecodeWaiting.
  2. Force-delete the decode pod. Watcher tears down the WorkerSet. With
     commit 2 alone, DecodeWaiting stays in the map (stale).
  3. Replacement decode pod registers. register_prefill_router finds stale
     DecodeWaiting → returns None → rebuilt WorkerSet has no PrefillRouter.
     Frontend log: ERROR "Decode WorkerSet already registered for this
     prefill router".
  4. Allow prefill to register. activate_prefill_router activates the
     orphaned old router; rebuilt WorkerSet still empty.
  5. Inference: HTTP 500 "Disaggregated params are required for decode mode".

Verified end-to-end on `head-tot-rc13-pr8965-2026-05-01-x86-b200`
(d104, 2026-05-01): 5/5 inference probes return the expected disagg-params
500 with the predicted log signature.

Fix:
  - Add `ModelManager::remove_decode_prefill_waiter` that atomically removes
    the activator entry only if it's `DecodeWaiting`. `PrefillReady` cache
    entries are preserved.
  - Watcher's `handle_delete_helper` now calls `remove_decode_prefill_waiter`
    on decode-component teardown (when `!supports_prefill()`), preserving
    PR 8965's primary contribution while plugging the stale-waiter gap.

State-machine invariants are now:
  DecodeWaiting → "a live decode router is waiting on the receiver"
  PrefillReady  → "a prefill endpoint is cached for future decode rebuilds"
Comment thread lib/llm/src/discovery/model_manager.rs Outdated
Comment thread lib/llm/src/discovery/watcher.rs Outdated
…code-waiter cleanup

Addresses two correctness issues raised by dynamo-ops review on PR 8965.

1. **Race on PrefillReady re-insert** (model_manager.rs)

   Both `register_prefill_router` and `activate_prefill_router` previously
   used a `remove → process → insert` pattern on
   `prefill_router_activators`. Between the `remove` and the `insert` the
   entry is gone from the map, so a concurrent `remove_prefill_activator`
   call (called by the watcher on prefill-component teardown) hits the
   empty map, skips its cleanup, and we then re-insert a stale
   `PrefillReady` for a prefill that's already gone. The next decode
   rebuild's lookup finds the resurrected cache and activates a
   `PrefillRouter` against a dead endpoint.

   Refactor both functions to use DashMap's `Entry` API. The shard lock is
   held for the duration of the OccupiedEntry, so any concurrent
   `remove_prefill_activator` serializes after us and observes the entry
   it needs to clear. The PrefillReady-consume path in
   `register_prefill_router` now reads-and-clones in place without
   removing; the DecodeWaiting/PrefillReady → PrefillReady transition in
   `activate_prefill_router` uses `OccupiedEntry::insert(...)` for an
   atomic value swap that returns the old value.

2. **Decode-waiter cleanup gated on `removed.is_some()`** (watcher.rs)

   `handle_delete_helper`'s decode branch only called
   `remove_decode_prefill_waiter` when `remove_worker_set` returned
   `Some(_)`. If decode registered (creating a `DecodeWaiting` activator
   entry via `register_prefill_router`) but `handle_add_helper` failed
   later (e.g., on `kv_chooser_for`, monitor setup, or
   `build_routed_pipeline`) before `add_worker_set`, no WorkerSet ever
   landed in the manager — and on later teardown
   `remove_worker_set` returns `None`, leaving the stale `DecodeWaiting`
   entry orphaned in the activator map.

   Drop the `removed.is_some()` guard. The helper itself is state-safe
   (uses `DashMap::remove_if(|_, v| matches!(v, DecodeWaiting(_)))`) so
   calling it on a key that's vacant or holds `PrefillReady` is a no-op.
   The prefill-teardown branch keeps its `removed.is_some()` guard
   because `remove_prefill_activator` blanket-removes both states, and
   calling it without a real WorkerSet teardown could orphan a live
   decode-side `DecodeWaiting`.

Behavior is unchanged for all four reproducers (ai-dynamo#1 burst-sweep canary
OFF, ai-dynamo#2 Pattern 1 multi-decode restart, ai-dynamo#3 Pattern 2 stale-DecodeWaiting,
ai-dynamo#4 burst-sweep canary ON); will re-verify on a fresh build.
CI rust-tests caught a one-liner fmt nit at model_manager.rs:765 — rustfmt
prefers the let-binding on a single line at this width. Pure cosmetic.
…decode-rebuild-stall

Signed-off-by: yifjiang <19356972+yifjiang@users.noreply.github.com>
@yifjiang yifjiang force-pushed the fix/prefill-router-decode-rebuild-stall branch from 0947b7b to a2e9e75 Compare May 5, 2026 16:38
Signed-off-by: yifjiang <19356972+yifjiang@users.noreply.github.com>
Signed-off-by: yifjiang <19356972+yifjiang@users.noreply.github.com>
@yifjiang yifjiang requested a review from a team as a code owner May 5, 2026 16:48
Signed-off-by: yifjiang <19356972+yifjiang@users.noreply.github.com>
@github-actions github-actions Bot added the documentation Improvements or additions to documentation label May 5, 2026
Signed-off-by: yifjiang <19356972+yifjiang@users.noreply.github.com>
…decode-rebuild-stall

Signed-off-by: yifjiang <19356972+yifjiang@users.noreply.github.com>

# Conflicts:
#	docs/backends/vllm/vllm-omni.md
Signed-off-by: yifjiang <19356972+yifjiang@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@PeaBrane PeaBrane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The state-machine change looks right to me: the initial decode-first and prefill-first rendezvous paths still work, and keeping the prefill endpoint cached across decode WorkerSet rebuilds addresses the stall without adding request-path overhead.

One follow-up I’d like to see, if practical: a regression/integration test for decode-first activation followed by a decode WorkerSet rebuild while prefill stays alive, asserting that the rebuilt decode side activates immediately from the cached PrefillReady endpoint. The new DecodeWaiting cleanup tests cover one half of the lifecycle; this would lock down the core behavior this PR is fixing.

@yifjiang yifjiang merged commit 7ae69a3 into ai-dynamo:main May 7, 2026
159 of 162 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

actions documentation Improvements or additions to documentation external-contribution Pull request is from an external contributor fix size/L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants