feat(multimodal): move MM routing into vLLM frontend processor by krishung5 · Pull Request #8065 · ai-dynamo/dynamo

krishung5 · 2026-04-10T15:50:37Z

Overview

Moves multimodal-aware KV routing from the separate MM Router Worker process into the frontend's vLLM processor, eliminating the extra network hop and redundant image processing for multimodal requests.

Architecture

Client -> Frontend (vLLM processor + KV routing) -> Backend Worker 1..N
          (image decode, HF processor,                (SHM read ~2ms,
           SHM write ~2ms, KV route)                   skip HF processor,
                                                       inference)

The frontend runs the full vLLM process_inputs() pipeline, then transfers pre-processed mm_kwargs to the backend via shared memory. The backend skips image download and HF processing entirely.

Transfer Modes

Controlled by the DYNAMO_MM_TRANSFER env var; the separate DYNAMO_DISABLE_NIXL_MM env var disables transfer entirely:

| Value | Behavior | Use case |
|---|---|---|
| `shm` (default) | Use shared memory (~2ms). If backend can't read the segment (cross-node), falls back to URL reprocessing. | Same-node deployments |
| `nixl` | Use NIXL RDMA | Cross-node deployments |
| `DYNAMO_DISABLE_NIXL_MM=1` | Disable all mm_kwargs transfer. Backend re-processes images from URLs. | Debugging / fallback |

Fallback behavior: When SHM transfer fails (e.g., cross-node where /dev/shm is not shared), the backend falls back to downloading and processing images from URLs. The HF processor is not skipped in this case. A follow-up PR will explore making NIXL or the TCP request plane the cross-node default.
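The mode selection described above can be sketched as a small pure function. This is only an illustration: `select_mm_transfer` is a hypothetical name, and rejecting unknown values with `ValueError` is an assumption, since the PR description does not say how invalid values are handled.

```python
import os
from typing import Mapping, Optional


def select_mm_transfer(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return "shm", "nixl", or None when mm_kwargs transfer is disabled."""
    # DYNAMO_DISABLE_NIXL_MM=1 turns off all mm_kwargs transfer;
    # the backend then re-processes images from URLs.
    if env.get("DYNAMO_DISABLE_NIXL_MM") == "1":
        return None
    # shm is the documented default.
    mode = env.get("DYNAMO_MM_TRANSFER", "shm").strip().lower()
    if mode not in ("shm", "nixl"):
        # Assumption: fail loudly on unknown values instead of guessing.
        raise ValueError(f"unsupported DYNAMO_MM_TRANSFER value: {mode!r}")
    return mode


default_mode = select_mm_transfer({})
cross_node_mode = select_mm_transfer({"DYNAMO_MM_TRANSFER": "nixl"})
disabled_mode = select_mm_transfer({"DYNAMO_DISABLE_NIXL_MM": "1"})
```

Passing the environment as a mapping keeps the function easy to test without mutating `os.environ`.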

Key Changes

Frontend (vllm_processor.py):

_prepare_mm_routing() extracts mm_routing_info from vLLM's process_inputs() output and prepares SHM/NIXL transfer
Transfer mode selected via DYNAMO_MM_TRANSFER env var (shm default, nixl for cross-node)
Only suppress URL fallback when ALL features transferred (partial transfer safety)
SHM handle cleanup in try/finally to prevent leaks on error
Top-level imports for MmKwargsSender/MmKwargsShmSender (no lazy imports)
preprocess_workers != 0 guard restored (pool not supported for vllm processor)
Forwards upstream mm_processor_kwargs through both KV and non-KV router paths
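The partial-transfer rule in the list above (suppress the URL fallback only when every feature was transferred) might look roughly like this. All names here are hypothetical stand-ins, not the code in vllm_processor.py, and keeping the fallback when there is nothing to transfer is an assumption.

```python
from types import SimpleNamespace
from typing import Any, Dict, Sequence


def all_features_transferred(mm_features: Sequence[Any], transferred_count: int) -> bool:
    """True only if every feature that carries data was handed to the transport."""
    expected = sum(1 for feat in mm_features if feat.data is not None)
    # Assumption: with nothing to transfer, keep the URL fallback (conservative).
    return expected > 0 and transferred_count == expected


def build_backend_request(mm_features, transferred_count, image_urls) -> Dict[str, Any]:
    request: Dict[str, Any] = {"extra_args": {}}
    if all_features_transferred(mm_features, transferred_count):
        request["extra_args"]["mm_transferred"] = True  # hypothetical flag
    else:
        # Partial or failed transfer: keep the URLs so the backend can re-process.
        request["multi_modal_data"] = {"image_url": list(image_urls)}
    return request


features = [SimpleNamespace(data=b"img0"), SimpleNamespace(data=b"img1")]
urls = ["http://example.com/a.png", "http://example.com/b.png"]
full = build_backend_request(features, 2, urls)
partial = build_backend_request(features, 1, urls)
```

Here `full` carries no raw image URLs, while `partial` still does, so a backend that received only one of two items can rebuild the request from scratch.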

Backend (handlers.py):

_try_receive_mm_kwargs_nixl() receives pre-processed mm_kwargs via SHM or NIXL (mutually exclusive, determined by which metadata key is present in extra_args)
_receive_mm_kwargs_shm() reads from shared memory (~2ms)
Calls inject_into_mm_cache() directly (public API from vLLM upstream)
isinstance(MultiModalKwargsItem) validation on both NIXL and SHM pickle.loads paths
Bails out of fast path if expanded_token_ids missing (prevents placeholder misalignment)
Incorporates upstream mm_processor_kwargs and use_audio_in_video support

Transfer (mm_kwargs_transfer.py):

MmKwargsShmSender / MmKwargsShmReceiver — POSIX shared memory transfer
MmKwargsSender / MmKwargsReceiver — NIXL RDMA with pre-registered descriptor pool
NIXL receiver preserves spec order (not completion order) for multi-image requests
SHM cleanup: only silence FileNotFoundError, log other exceptions
SHM receiver: let exceptions propagate instead of silent swallow
NIXL _acquire_descriptor: explicit RuntimeError for None _data_ref (not assert)
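For readers unfamiliar with POSIX shared memory in Python, the SHM behavior listed above can be approximated with the standard library's `multiprocessing.shared_memory`. This is a minimal sketch, not the MmKwargsShmSender / MmKwargsShmReceiver implementation: the function names and the `{"name", "size"}` metadata shape are assumptions.

```python
import logging
import pickle
from multiprocessing import shared_memory
from typing import Any, Dict, List, Tuple

logger = logging.getLogger(__name__)


def shm_send(obj: Any) -> Tuple[shared_memory.SharedMemory, Dict[str, Any]]:
    """Pickle obj into a new segment; return the handle and wire metadata."""
    payload = pickle.dumps(obj)
    handle = shared_memory.SharedMemory(create=True, size=len(payload))
    handle.buf[: len(payload)] = payload
    return handle, {"name": handle.name, "size": len(payload)}


def shm_receive(meta: Dict[str, Any]) -> Any:
    """Attach to an existing segment, copy out exactly `size` bytes, detach."""
    # Errors (e.g. FileNotFoundError across nodes) propagate to the caller,
    # which can then fall back to URL reprocessing.
    seg = shared_memory.SharedMemory(name=meta["name"], create=False)
    try:
        # The OS may round the segment up to a page, so slice to the real size.
        data = bytes(seg.buf[: meta["size"]])
    finally:
        seg.close()
    # pickle.loads is only safe for payloads from a trusted peer.
    return pickle.loads(data)


def shm_cleanup(handles: List[shared_memory.SharedMemory]) -> None:
    """Close and unlink every handle, even if some of them fail."""
    for handle in handles:
        try:
            handle.close()
            handle.unlink()
        except FileNotFoundError:
            pass  # already unlinked: nothing left to release
        except Exception:
            logger.exception("failed to clean up SHM segment %s", handle.name)


handle, meta = shm_send({"pixel_values": [0.1, 0.2, 0.3]})
try:
    received = shm_receive(meta)
finally:
    shm_cleanup([handle])
```

The sender owns the segment and unlinks it in a `finally` block, which mirrors the try/finally cleanup mentioned for the frontend; the receiver only attaches and closes.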

New files:

routing_utils.py — model-agnostic mm_features to block_mm_infos conversion
media_connector.py — DynamoMediaConnector wraps ImageLoader with LRU cache

Tests:

30 e2e tests: 10 scenarios × 3 transfer modes (shm, nixl, disabled)
- Text-only routing, single/multi-image, HTTP/data URI, staircase, swapped order, HTTP vs data URI equivalence
15 unit tests:
- SHM roundtrip, multi-image ordering, NIXL mock ordering, partial transfer, cleanup
- SHM cleanup error handling (FileNotFoundError, other exceptions, all-handles-attempted)
- NIXL descriptor validation (RuntimeError on None _data_ref)

Upstream vLLM Dependency

This PR depends on vllm-project/vllm#39502, which exposes InputProcessor.inject_into_mm_cache() as a public API for injecting pre-processed mm_kwargs into the processor cache. Until that PR is merged, apply the patch:

SITE_PACKAGES_ROOT="$(python3 -c 'import pathlib, vllm; print(pathlib.Path(vllm.__file__).resolve().parent.parent)')"
cd "$SITE_PACKAGES_ROOT"
curl -sL https://github.com/vllm-project/vllm/pull/39502.diff | python3 -c '
import sys
chunks = sys.stdin.read().split("diff --git ")
filtered = [c for c in chunks if c.startswith("a/vllm/")]
print("".join("diff --git " + c for c in filtered), end="")
' > /tmp/vllm_pr39502_vllm_only.diff
patch --dry-run -p1 < /tmp/vllm_pr39502_vllm_only.diff
patch -p1 < /tmp/vllm_pr39502_vllm_only.diff
cd -
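The inline Python filter in the script above keeps only the diff chunks that touch files under vllm/, so the patch applies cleanly against an installed package that has no tests/ or docs/ directories. The same logic as a standalone function, shown on a made-up two-file diff (the function name and sample file names are illustrative only):

```python
def keep_vllm_chunks(diff_text: str) -> str:
    """Drop every per-file chunk whose path does not start with a/vllm/."""
    # Each per-file section of a git diff begins with "diff --git a/<path> b/<path>".
    chunks = diff_text.split("diff --git ")
    kept = [chunk for chunk in chunks if chunk.startswith("a/vllm/")]
    # Re-attach the header prefix that split() removed.
    return "".join("diff --git " + chunk for chunk in kept)


sample = (
    "diff --git a/tests/test_cache.py b/tests/test_cache.py\n"
    "+assert True\n"
    "diff --git a/vllm/inputs.py b/vllm/inputs.py\n"
    "+def inject():\n"
)
filtered = keep_vllm_chunks(sample)
```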

Where should the reviewer start?

components/src/dynamo/frontend/vllm_processor.py — main integration: _prepare_mm_routing() and transfer mode selection
components/src/dynamo/vllm/handlers.py — backend SHM/NIXL receive and inject_into_mm_cache() call
components/src/dynamo/common/multimodal/mm_kwargs_transfer.py — SHM and NIXL sender/receiver
tests/mm_router/test_vllm_mm_router_e2e.py — 30 e2e tests (10 scenarios × 3 transfer modes)
components/src/dynamo/common/tests/multimodal/test_mm_kwargs_transfer.py — 15 unit tests

Summary by CodeRabbit

New Features
- Frontend now directly handles vLLM processor and KV-aware multimodal routing
- Added NIXL wire protocol for efficient pre-processed multimodal data transfer
- New media connector with integrated image caching capabilities
Removals
- Removed dedicated MM router worker component; functionality consolidated into frontend
Tests
- Added comprehensive tests for media connector, NIXL transfer, and routing utilities
Documentation
- Updated multimodal KV routing workflow documentation with new architecture

- Add 18 unit tests for new multimodal routing modules:
  - test_routing_utils.py: block_mm_infos and routing_info_from_features
  - test_media_connector.py: ImageLoader LRU cache integration
  - test_mm_kwargs_transfer.py: metadata serialization and sender logic
- Move launch script to examples/backends/vllm/launch/agg_multimodal_router.sh (alongside existing agg_multimodal.sh)
- Rename _externally_processed -> externally_processed in handlers.py

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Krish Hung <krishung5@gmail.com>

- Remove examples/backends/vllm/mm_router_worker/ (replaced by frontend vLLM processor with in-process KV router)
- Remove tests/mm_router/test_vllm_mm_router_e2e.py (tested the removed mm_router_worker architecture)
- Update docs/features/multimodal/multimodal-kv-routing.md to describe the new frontend routing approach for vLLM, keeping TRT-LLM's mm_router_worker architecture as-is

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Krish Hung <krishung5@gmail.com>

Validates multimodal KV-aware routing end-to-end: Frontend (vLLM processor + KvRouter) → vLLM backend worker. Tests cover text-only overlap, repeated same/different images, staircase image counts, swapped image order, HTTP image URLs, and HTTP vs data URI parity.

The receiver stored all pickled kwargs items under the same dict key, so only the last image survived. vLLM then hit IndexError when accessing mm_kwargs for the 2nd/3rd image. Accumulate items as a list and build mm_kwargs with all of them.
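The overwrite bug described above is easy to reproduce in isolation. The sketch below contrasts the two collection strategies; only the `__pickled_kwargs_item__` key comes from the PR, while the function names and the `(name, payload)` input shape are assumptions made for illustration.

```python
from typing import Any, Dict, Iterable, Tuple

PICKLED_KEY = "__pickled_kwargs_item__"


def collect_overwriting(reads: Iterable[Tuple[str, bytes]]) -> Dict[str, Any]:
    """Buggy variant: every item lands on the same key, so the last one wins."""
    results: Dict[str, Any] = {}
    for name, payload in reads:
        results[name] = payload
    return results


def collect_accumulating(reads: Iterable[Tuple[str, bytes]]) -> Dict[str, Any]:
    """Fixed variant: pickled items are appended to a list under the shared key."""
    results: Dict[str, Any] = {}
    for name, payload in reads:
        if name == PICKLED_KEY:
            results.setdefault(name, []).append(payload)
        else:
            results[name] = payload
    return results


reads = [(PICKLED_KEY, b"image-0"), (PICKLED_KEY, b"image-1"), (PICKLED_KEY, b"image-2")]
buggy = collect_overwriting(reads)
fixed = collect_accumulating(reads)
```

With three images, `buggy` holds a single payload, which matches the IndexError on the second and third image, while `fixed` holds all three in arrival order.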

Allows packing all workers onto GPU 0 for functional testing on single-GPU machines via SINGLE_GPU=true environment variable.

github-actions · 2026-04-10T15:55:05Z

🌿 Fern Docs Preview: https://nvidia-preview-093d9d0d-3e3a-4e2f-864b-db045825d36d.docs.buildwithfern.com/dynamo/dev

coderabbitai · 2026-04-10T16:11:11Z

Walkthrough

Added new multimodal (MM) infrastructure (media connector, NIXL-based kwargs transfer, routing utilities) to support frontend-driven MM processing and KV cache routing. Removed standalone MM router worker. Integrated MM routing into frontend vLLM processor and backend handlers with NIXL tensor transfer support.

Changes

| Cohort / File(s) | Summary |
|---|---|
| MM Core Infrastructure `components/src/dynamo/common/multimodal/media_connector.py`, `components/src/dynamo/common/multimodal/mm_kwargs_transfer.py`, `components/src/dynamo/common/multimodal/routing_utils.py` | New modules for vLLM media connector registration, NIXL-based MM kwargs sender/receiver with Pydantic metadata models, and KV cache block routing metadata generation from vLLM MM features. |
| MM Tests `components/src/dynamo/common/tests/multimodal/test_media_connector.py`, `components/src/dynamo/common/tests/multimodal/test_mm_kwargs_transfer.py`, `components/src/dynamo/common/tests/multimodal/test_routing_utils.py` | Unit tests covering ImageLoader LRU cache behavior, MM kwargs transfer metadata serialization, MM sender/receiver edge cases, and routing block/feature-based metadata construction. |
| Frontend Integration `components/src/dynamo/frontend/vllm_processor.py` | Added `block_size` parameter, `_prepare_mm_routing()` method for NIXL MM kwargs transfer registration, NVTX instrumentation, and KV router integration passing pre-processed MM metadata and routing info instead of raw multi-modal data. |
| Backend Integration `components/src/dynamo/vllm/handlers.py` | Added `_try_receive_mm_kwargs_nixl()` for fast-path pre-rendered MM input in decode worker, NIXL completion handling, backend hashing preference logic, and NVTX instrumentation around MM data extraction and build flows. |
| Documentation & Examples `docs/features/multimodal/multimodal-kv-routing.md`, `examples/backends/vllm/launch/agg_multimodal_router.sh` | Updated MM KV routing documentation to reflect frontend-led MM processing flow; added new launch script for vLLM multimodal router demo with worker/frontend health checks and environment variable overrides. |
| Deprecated MM Router Worker `examples/backends/vllm/mm_router_worker/*` | Removed entire MM router worker module (README, handler, processor, router script, launch script, `__init__.py`, `__main__.py`) and related public re-exports. |
| E2E Test Update `tests/mm_router/test_vllm_mm_router_e2e.py` | Updated topology to remove `VLLMMMRouterWorkerProcess`, changed vLLM worker/frontend configuration for KV routing mode with `--dyn-chat-processor vllm` and `--router-mode kv`, adjusted memory/block-size parameters. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 55.77% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The PR title 'feat(multimodal): move MM routing into vLLM frontend processor' accurately describes the main architectural change of moving MM routing from a separate worker into the frontend processor. |
| Description check | ✅ Passed | The PR description comprehensively covers the overview, architecture, transfer modes, key changes, files involved, dependencies, and reviewer guidance. All required template sections are present and well-detailed. |


coderabbitai

Actionable comments posted: 11

🧹 Nitpick comments (1)

components/src/dynamo/common/tests/multimodal/test_mm_kwargs_transfer.py (1)
79-99: Strengthen this sender test and remove the inline import.

This block only locks in the empty/skip branches, but the sender bug fixed in this PR was the multi-item overwrite path. Adding a two-feature prepare() case here would protect that regression and lets you reuse the existing module-scope MmKwargsSender import.

As per coding guidelines, "keep all imports at module top (flag any imports inside functions/classes)".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/common/tests/multimodal/test_mm_kwargs_transfer.py`
around lines 79 - 99, Update the tests to remove the inline import and add a new
case that covers the multi-item overwrite path: delete the from-dynamo... import
inside test_prepare_with_no_data_returns_none and rely on the module-level
MmKwargsSender import, and add a new async test (e.g.,
test_prepare_with_two_features_preserves_both) which creates two MagicMock
features with non-None data and modality="image", calls
MmKwargsSender().prepare([feat1, feat2], modality="image"), and asserts that
returned meta is not None and that both futures (or their corresponding entries)
are present and distinct so the previous multi-item overwrite bug is prevented.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/src/dynamo/common/multimodal/media_connector.py`:
- Around line 75-80: The cache hit path in fetch_image returns the cached PIL
Image directly and ignores the requested image_mode, causing inconsistent
behavior versus a miss; update the cache-hit branch in fetch_image (use key =
image_url.lower(), self._image_loader._image_cache and cache[key]) to check the
requested image_mode and, if provided and different from cache[key].mode, return
a converted image (use PIL Image.convert) rather than the cached object (do not
mutate cache); otherwise return the cached image as before, and still call
cache.move_to_end(key) to preserve LRU behavior.
- Around line 96-103: The current blanket except in
DynamoMediaConnector.fetch_image_async hides real errors; change it to only
catch the expected unsupported-source exception(s) (e.g., catch ValueError from
ImageLoader as e) and perform the debug log + return await
super().fetch_image_async(image_url, image_mode=image_mode) inside that handler;
for any other exception, log it (including the exception) and re-raise so
transient fetch/decode errors are not swallowed. Ensure you reference the same
image_url slicing and the call to super().fetch_image_async when implementing
the narrowed except and re-raise behavior.

In `@components/src/dynamo/common/multimodal/mm_kwargs_transfer.py`:
- Around line 200-223: The concurrent reads append pickled items in completion
order and can scramble the original metadata.tensor_specs order; modify the loop
that creates read tasks (around metadata.tensor_specs, _do_read, read_tasks, and
the "__pickled_kwargs_item__" handling) to capture the spec index and
pre-allocate a results list (or mapping) sized to metadata.tensor_specs for the
"__pickled_kwargs_item__" key, then assign each completed read into
results[name][index] instead of appending so items preserve the original spec
order; ensure non-pickled names continue to set results[name] = t as before and
then await asyncio.gather(*read_tasks).
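The ordering fix suggested in the comment above (pre-allocate by spec index instead of appending on completion) can be demonstrated with plain asyncio. In this sketch, `asyncio.sleep` stands in for the RDMA read, and the `(payload, delay)` spec shape is an assumption made purely for illustration.

```python
import asyncio
from typing import List, Optional, Sequence, Tuple


async def read_in_spec_order(specs: Sequence[Tuple[bytes, float]]) -> List[Optional[bytes]]:
    """Run all reads concurrently but store each result at its spec index."""
    results: List[Optional[bytes]] = [None] * len(specs)

    async def _do_read(index: int, payload: bytes, delay: float) -> None:
        await asyncio.sleep(delay)  # stand-in for a NIXL read of varying latency
        results[index] = payload    # slot assignment, not append

    await asyncio.gather(*(_do_read(i, p, d) for i, (p, d) in enumerate(specs)))
    return results


# The first image is the slowest to arrive, yet it must stay first.
ordered = asyncio.run(
    read_in_spec_order([(b"image-0", 0.03), (b"image-1", 0.0), (b"image-2", 0.01)])
)
```

Appending inside `_do_read` would instead yield completion order, which breaks the pairing between items and their placeholders for multi-image requests.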

In `@components/src/dynamo/common/multimodal/routing_utils.py`:
- Around line 47-54: The overlap check in constructing mm_objects uses an
inclusive end bound and should treat image_ranges as [img_start, img_end)
(exclusive end); change the condition in the list comprehension that builds
mm_objects (referencing mm_hashes, image_ranges, block_start, block_end,
img_start, img_end and the mm_objects variable) from "if block_end > img_start
and block_start <= img_end" to use an exclusive end comparison (e.g., ensure
block_start < img_end rather than <=) so a block starting exactly at img_end is
not considered overlapping; update the predicate accordingly to prevent
overstating cache overlap and misrouting.
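The half-open interval check recommended above can be written as a symmetric predicate. The following is an illustrative sketch rather than the code in routing_utils.py: the function names and the per-block list-of-hashes output are assumptions.

```python
from typing import List, Sequence, Tuple


def block_overlaps_image(block_start: int, block_end: int, img_start: int, img_end: int) -> bool:
    """Both ranges are half-open: [start, end)."""
    return block_start < img_end and img_start < block_end


def mm_hashes_per_block(
    num_tokens: int,
    block_size: int,
    mm_hashes: Sequence[str],
    image_ranges: Sequence[Tuple[int, int]],
) -> List[List[str]]:
    """For each KV block, list the hashes of the images whose tokens it contains."""
    infos: List[List[str]] = []
    for block_start in range(0, num_tokens, block_size):
        block_end = min(block_start + block_size, num_tokens)
        infos.append(
            [
                mm_hash
                for mm_hash, (img_start, img_end) in zip(mm_hashes, image_ranges)
                if block_overlaps_image(block_start, block_end, img_start, img_end)
            ]
        )
    return infos


# One image occupying tokens [4, 8) in a 12-token prompt with 4-token blocks.
per_block = mm_hashes_per_block(12, 4, ["img0"], [(4, 8)])
```

With the inclusive comparison (`block_start <= img_end`), the third block, which starts exactly at token 8, would also be tagged with `img0`.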

In `@components/src/dynamo/frontend/vllm_processor.py`:
- Around line 193-207: The code currently sets nixl_transferred=True whenever
nixl_meta is non-None, which can mark a partial or wrong-modality transfer as
successful; update the block handling the result of
self._mm_kwargs_sender.prepare(vllm_preproc.mm_features, modality="image") so
that you only mark nixl_transferred=True when (1) the returned nixl_meta
indicates the same supported modality you intended (not just any non-None), and
(2) every feature that had data in vllm_preproc.mm_features was actually
included in nixl_meta.tensor_specs (i.e., count/IDs of transferred tensors
equals the count/IDs of non-None features). If the check fails, leave URL
fallback enabled and do not set nixl_transferred; apply the same guarded logic
in the analogous handling around lines 404-410 to avoid dropping
multi_modal_data on partial/mismatched transfers.

In `@components/src/dynamo/vllm/handlers.py`:
- Around line 1265-1274: The current fallback uses request["token_ids"] when
expanded_token_ids is missing, which misaligns mm_placeholders and transferred
kwargs; instead, when extra_args.get("expanded_token_ids") is falsy, do not set
expanded_token_ids to request["token_ids"] and instead abort the NIXL fast-path
(e.g., return/raise a signal so the normal processor path rebuilds the prompt).
Update the logic around expanded_token_ids in the handler so that the code that
relies on mm_placeholders uses only truly expanded_token_ids and that absence of
expanded_token_ids triggers exiting the fast path rather than falling back to
request["token_ids"].

In `@docs/features/multimodal/multimodal-kv-routing.md`:
- Around line 86-94: Add a row for the SINGLE_GPU environment variable to the
"Key environment variables" table: document the variable name `SINGLE_GPU`, set
its default to `false` (or `0`) and add a short description like "Enable
single-GPU packing mode for the launcher (use on 1‑GPU machines); accepts
true/false or 1/0" so readers know how to enable the single‑GPU path and what
values are supported; ensure the description matches the launcher behavior for
SINGLE_GPU.

In `@examples/backends/vllm/launch/agg_multimodal_router.sh`:
- Around line 17-18: The script currently sets strict mode with "set -euo
pipefail" but lacks the repo-standard process-group cleanup; immediately after
that line add the top-level trap 'echo Cleaning up...; kill 0' EXIT so all child
processes (not just direct parents) are killed on Ctrl+C or exit; replace or
remove the ad-hoc PID-list cleanup logic later in the script (references to the
current PID kill block around lines 74-84) so the single trap handles cleanup
consistently.
- Around line 28-29: Replace hard-coded default ports by allocating free ports
at runtime: for HTTP_PORT, KV_EVENTS_PORT_BASE and any worker system ports in
this script (e.g., the variables at the other commented locations) use the
repo's alloc_port helper when the env var is unset (i.e., set
HTTP_PORT="${HTTP_PORT:-$(alloc_port)}" and similarly for KV_EVENTS_PORT_BASE
and worker port vars), preserving the ability for callers to override by
exporting env vars; keep BLOCK_SIZE as-is but update the port assignments around
the references to HTTP_PORT, KV_EVENTS_PORT_BASE and the worker ports so they
export the dynamically allocated values instead of static defaults.
- Around line 49-50: The summary prints health URLs using VLLM_SYSTEM_PORT_BASE
+ i but the actual workers are started on 18079 + i*2, so override of
VLLM_SYSTEM_PORT_BASE has no effect; fix by deriving worker ports from
VLLM_SYSTEM_PORT_BASE everywhere: replace literal 18079 in the worker launch
logic with $((VLLM_SYSTEM_PORT_BASE - 2)) (so
worker_port=$((VLLM_SYSTEM_PORT_BASE - 2 + i*2)) or equivalent), and update the
summary/banner health URL computation to use the same expression
($((VLLM_SYSTEM_PORT_BASE - 2 + i*2))) instead of VLLM_SYSTEM_PORT_BASE + i;
apply the same change to the other occurrences mentioned (around the other
summaries/prints at the noted sections).

In `@tests/mm_router/test_vllm_mm_router_e2e.py`:
- Around line 184-187: Replace the fixed time.sleep with a real readiness check:
after entering the VLLMWorkerProcess context, poll the worker's health endpoint
(use vllm_port) or call an existing readiness helper (e.g.,
VLLMWorkerProcess.wait_until_ready or a project helper like
wait_for_service_health) until it reports healthy, then start FrontendProcess;
if no helper exists, use the provided test fixtures
(runtime_services_dynamic_ports / start_services_with_http / ManagedProcess)
instead of sleeping to wait for the worker to be ready before creating
FrontendProcess.
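If no project helper is available, the readiness check suggested in the last comment above could be a small polling loop. This is a generic sketch: `wait_until_ready` and `http_health_probe` are hypothetical names, not helpers from the Dynamo test suite, and the health URL path is left to the caller.

```python
import time
import urllib.request
from typing import Callable


def wait_until_ready(probe: Callable[[], bool], timeout: float = 60.0, interval: float = 0.5) -> None:
    """Poll probe() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if probe():
                return
        except OSError:
            pass  # connection refused / URLError: the service is not up yet
        if time.monotonic() >= deadline:
            raise TimeoutError(f"service not ready after {timeout:.1f}s")
        time.sleep(interval)


def http_health_probe(url: str) -> Callable[[], bool]:
    """Build a probe that reports ready when the URL answers with HTTP 200."""
    def probe() -> bool:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    return probe
```

A test would call something like `wait_until_ready(http_health_probe(health_url))` after starting the worker and before creating the frontend, so slow model loads no longer race a fixed sleep.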



📥 Commits

Reviewing files that changed from the base of the PR and between 6c877e4 and 5431c5d.

📒 Files selected for processing (18)

components/src/dynamo/common/multimodal/media_connector.py
components/src/dynamo/common/multimodal/mm_kwargs_transfer.py
components/src/dynamo/common/multimodal/routing_utils.py
components/src/dynamo/common/tests/multimodal/test_media_connector.py
components/src/dynamo/common/tests/multimodal/test_mm_kwargs_transfer.py
components/src/dynamo/common/tests/multimodal/test_routing_utils.py
components/src/dynamo/frontend/vllm_processor.py
components/src/dynamo/vllm/handlers.py
docs/features/multimodal/multimodal-kv-routing.md
examples/backends/vllm/launch/agg_multimodal_router.sh
examples/backends/vllm/mm_router_worker/README.md
examples/backends/vllm/mm_router_worker/__init__.py
examples/backends/vllm/mm_router_worker/__main__.py
examples/backends/vllm/mm_router_worker/handler.py
examples/backends/vllm/mm_router_worker/launch.sh
examples/backends/vllm/mm_router_worker/mm_processor.py
examples/backends/vllm/mm_router_worker/mm_router_worker.py
tests/mm_router/test_vllm_mm_router_e2e.py

💤 Files with no reviewable changes (7)

examples/backends/vllm/mm_router_worker/__main__.py
examples/backends/vllm/mm_router_worker/__init__.py
examples/backends/vllm/mm_router_worker/README.md
examples/backends/vllm/mm_router_worker/launch.sh
examples/backends/vllm/mm_router_worker/handler.py
examples/backends/vllm/mm_router_worker/mm_router_worker.py
examples/backends/vllm/mm_router_worker/mm_processor.py

Adds --dyn-preprocess-workers N to parallelize process_inputs() across N worker processes, each with its own GIL. Workers register mm_kwargs with NIXL directly in their own address space — only lightweight metadata (~330 bytes) crosses the process boundary, while the backend reads the 3.7MB pickled tensors via RDMA from worker memory. Default is 0 (serial path unchanged). Activated with e.g. --dyn-preprocess-workers 4 or PREPROCESS_WORKERS=4 in the launch script.

Threads share memory — eliminates all inter-process overhead (pickle, pipe, per-worker NIXL connectors, worker globals). CPU-bound numpy/PIL operations in process_inputs() release the GIL, allowing threads to overlap. NIXL registration stays in the main async event loop.
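The thread-based variant described above can be pictured as follows. This is a rough sketch under stated assumptions: `process_inputs` here is a trivial stand-in for vLLM's preprocessing, and appending to a list stands in for NIXL registration on the event-loop thread.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, List, Tuple


def process_inputs(request_id: int) -> Dict[str, Any]:
    """Stand-in for the CPU-bound preprocessing that runs on a pool thread."""
    return {"request_id": request_id, "mm_kwargs": bytes([request_id % 256]) * 4}


async def handle(pool: ThreadPoolExecutor, request_id: int, registered: List[int]) -> Dict[str, Any]:
    loop = asyncio.get_running_loop()
    # Preprocessing is offloaded; threads share memory, so nothing is pickled.
    preprocessed = await loop.run_in_executor(pool, process_inputs, request_id)
    # Back on the event-loop thread: the registration step would happen here.
    registered.append(preprocessed["request_id"])
    return preprocessed


async def serve(num_requests: int, workers: int = 4) -> Tuple[List[Dict[str, Any]], List[int]]:
    registered: List[int] = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = await asyncio.gather(
            *(handle(pool, i, registered) for i in range(num_requests))
        )
    return list(results), registered


results, registered = asyncio.run(serve(8))
```

Because `registered` is only touched from coroutines on the loop thread, it needs no lock, which is the same reason for keeping registration in the main event loop.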

Register tensors directly with NIXL instead of pickling the entire MultiModalKwargsItem. Field metadata (type, batch_size, keep_on_cpu) is serialized as lightweight fields in TensorTransferSpec (~200 bytes) while tensor data transfers via RDMA zero-copy. This removes the 30-50ms pickle.dumps() per image that was the dominant NIXL overhead, reducing frontend per-request cost from ~50-70ms to ~5-10ms for the NIXL path.

This reverts commit 08c03b4.

Follow-up on furionw's review comment: the previous refactor unified the NIXL and SHM senders under MmKwargsSender(ABC), but vllm_processor still wrapped each `sender.prepare()` call with an NVTX annotation.

Push the NVTX annotation into the base class via the template-method pattern. Subclasses declare their own class-level `_nvtx_label` / `_nvtx_color` attrs and implement `_prepare()`; the base's concrete `prepare()` wraps `_prepare()` with `_nvtx.annotate(...)` so the callsite in vllm_processor doesn't need to know which transport is in use.

- MmKwargsSender: add `_nvtx_label`/`_nvtx_color` class attrs, make `prepare()` concrete (template), add abstract `_prepare()`.
- MmKwargsNixlSender: override class attrs, rename `prepare` -> `_prepare`, drop the now-redundant outer `mm_nixl:sender_prepare` range.
- MmKwargsShmSender: override class attrs, rename `prepare` -> `_prepare`, drop the now-redundant outer `mm_shm:sender_prepare` range.
- vllm_processor: drop the `_nvtx.annotate(nvtx_label, color=nvtx_color)` wrap at the callsite; the sender owns it now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`prompt` is assigned from two branches with different types: the pre-rendered MultiModalInput dict on the fast path, and the TokensPrompt/EmbedsPrompt/None tuple from _build_prompt_from_request. mypy inferred the variable's type from the first assignment and flagged the tuple unpack as incompatible. Declare `prompt: Any` above the if/else so both branches type-check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…fe-core # Conflicts: # components/src/dynamo/vllm/handlers.py

The cleanup trap on EXIT/INT/TERM does `kill 0`, which sends SIGTERM to every process in the current process group — including the script itself. When the script catches that SIGTERM it re-enters the trap, prints "Cleaning up..." again, and fires `kill 0` again, spinning in an infinite loop that ends in a segfault when bash trips over itself. Clear the trap inside the handler (`trap - EXIT INT TERM`) so the second SIGTERM is a no-op. Reproduces on the agg_multimodal_router.sh launch any time Ctrl-C or a parent signal kills the script. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Matches the existing MmKwargsNixlSender / MmKwargsShmSender / MmKwargsShmReceiver naming convention. Addresses review comment on PR #8065. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous _try_receive_mm_kwargs_nixl dispatched to _receive_mm_kwargs_shm internally, so its name was misleading and the two receive branches had substantial duplicate logic (unpickle + validate + build EngineInput + inject). Renames the entry point to _try_receive_mm_kwargs (transport-agnostic) and factors the shared post-receive flow into a single _receive_mm_kwargs helper parameterized by transport kind. The transport-specific bits (receiver acquisition, RDMA-read vs shm-open, metadata validation) are now small branches; everything downstream is shared. Addresses review comment on PR #8065. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…SHM wire

Three related changes that mirror the existing MmKwargsSender ABC pattern across the rest of the transfer layer, addressing follow-up review feedback on PR #8065.

1. Pydantic SHM metadata
   Added MmKwargsShmItem + MmKwargsShmTransferMetadata Pydantic models so the SHM wire format matches NIXL's MmKwargsTransferMetadata. Sender now emits metadata.model_dump(); receiver accepts the validated Pydantic object; handlers.py validates on entry.
2. MmKwargsReceiver ABC + NVTX template method
   Mirror of the existing MmKwargsSender ABC. Base class owns the per-request NVTX annotation via class-level _nvtx_label / _nvtx_color attrs; subclasses implement async _receive(metadata). Both MmKwargsNixlReceiver and MmKwargsShmReceiver now inherit. handlers.py's _receive_mm_kwargs drops its is_nixl / string-tag dispatch and takes a MmKwargsReceiver instance directly.
3. Sender _prepare() scaffold dedup
   The feature-iteration / mm_hash collection / None-data skip / pickle-dumps loop was duplicated across NIXL and SHM senders. Moved into MmKwargsSender.prepare() as a template method that delegates transport-specific work to two small hooks:
   - _encode_item(idx, pickled) -> (encoded_item, cleanup_item)
   - _assemble_extra_args(modality, encoded_items, mm_hashes)
   Each subclass now implements only its transport-unique bits.

Tested: 144/144 unit tests pass, 30/30 e2e tests pass (all 10 scenarios across shm / nixl / disabled transport modes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…fe-core

krishung5 · 2026-04-22T11:56:18Z

@furionw Thanks for the comments. I made some changes and hope this makes the NIXL and SHM transfer paths more consolidated and structured. Lmk what you think!

High-level structure of the multimodal mm_kwargs transfer layer after
the three follow-up review refactors (Pydantic SHM metadata,
MmKwargsReceiver ABC, sender _prepare() scaffold dedup).

Wire protocol

Both transports now use Pydantic BaseModel for their metadata. SHM is
no longer a raw dict.

| Transport | Pydantic class | `extra_args` key | Payload location |
|---|---|---|---|
| NIXL | `MmKwargsTransferMetadata` | `mm_kwargs_nixl` | pre-registered descriptor pool |
| SHM | `MmKwargsShmTransferMetadata` | `mm_kwargs_shm` | `/dev/shm` segments |

Sender side (frontend)

MmKwargsSender (ABC)
│
├─ async prepare(mm_features, modality)          ← template method, owns NVTX
│    │
│    ├─ if not self._is_available() or not mm_features:
│    │      return None, []
│    │
│    ├─ for feat in mm_features:
│    │      • collect feat.mm_hash
│    │      • skip feat.data is None
│    │      • pickle feat.data       (NVTX via _pickle_nvtx_label)
│    │      • _encode_item(idx, pickled)   ← hook
│    │           → (encoded_item, cleanup_item)
│    │
│    └─ _assemble_extra_args(modality, encoded_items, mm_hashes)  ← hook
│           → {"mm_kwargs_<transport>": <pydantic-dumped dict>}
│
└─ abstract cleanup(items)

Concrete senders

| Sender | `_encode_item` | `_assemble_extra_args` | `cleanup` |
|---|---|---|---|
| `MmKwargsNixlSender` | Register pickled bytes as NIXL descriptor → `(TensorTransferSpec, completion_future)` | Wrap specs in `MmKwargsTransferMetadata` → `{"mm_kwargs_nixl": ...}` | `await asyncio.gather(*completion_futures)` |
| `MmKwargsShmSender` | Create `shm.SharedMemory`, write pickled bytes → `(MmKwargsShmItem, shm_handle)` | Wrap items in `MmKwargsShmTransferMetadata` → `{"mm_kwargs_shm": ...}` | `handle.close(); handle.unlink()` |
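To make the sender-side template method concrete, here is a self-contained sketch with a toy in-memory transport. It follows the outline above but is not the real MmKwargsSender: `annotate` is a no-op stand-in for the NVTX range, `InMemorySender` and the `mm_kwargs_mem` key are invented for illustration, and the `(extra_args, cleanup_items)` return shape of `prepare()` is an assumption inferred from the `return None, []` early exit.

```python
import asyncio
import pickle
from abc import ABC, abstractmethod
from contextlib import contextmanager
from types import SimpleNamespace
from typing import Any, Dict, Iterator, List, Optional, Sequence, Tuple


@contextmanager
def annotate(label: str, color: str) -> Iterator[None]:
    """No-op stand-in for an NVTX range."""
    yield


class SenderSketch(ABC):
    _nvtx_label = "mm:sender_prepare"
    _nvtx_color = "blue"

    async def prepare(self, mm_features: Sequence[Any], modality: str) -> Tuple[Optional[Dict[str, Any]], List[Any]]:
        """Template method: shared loop here, transport details in the hooks."""
        with annotate(self._nvtx_label, self._nvtx_color):
            if not self._is_available() or not mm_features:
                return None, []
            mm_hashes, encoded_items, cleanup_items = [], [], []
            for idx, feat in enumerate(mm_features):
                mm_hashes.append(feat.mm_hash)
                if feat.data is None:
                    continue  # nothing to send for this feature
                item, cleanup_item = self._encode_item(idx, pickle.dumps(feat.data))
                encoded_items.append(item)
                cleanup_items.append(cleanup_item)
            return self._assemble_extra_args(modality, encoded_items, mm_hashes), cleanup_items

    def _is_available(self) -> bool:
        return True

    @abstractmethod
    def _encode_item(self, idx: int, pickled: bytes) -> Tuple[Any, Any]: ...

    @abstractmethod
    def _assemble_extra_args(self, modality: str, encoded_items: List[Any], mm_hashes: List[str]) -> Dict[str, Any]: ...

    @abstractmethod
    async def cleanup(self, items: List[Any]) -> None: ...


class InMemorySender(SenderSketch):
    """Toy transport that keeps payloads in a dict instead of SHM or NIXL."""
    _nvtx_label = "mm_mem:sender_prepare"

    def __init__(self) -> None:
        self.store: Dict[str, bytes] = {}

    def _encode_item(self, idx: int, pickled: bytes) -> Tuple[Any, Any]:
        key = f"item-{idx}"
        self.store[key] = pickled
        return {"key": key, "size": len(pickled)}, key

    def _assemble_extra_args(self, modality, encoded_items, mm_hashes):
        return {"mm_kwargs_mem": {"modality": modality, "items": encoded_items, "mm_hashes": mm_hashes}}

    async def cleanup(self, items: List[Any]) -> None:
        for key in items:
            self.store.pop(key, None)


sender = InMemorySender()
feats = [
    SimpleNamespace(mm_hash="h0", data={"a": 1}),
    SimpleNamespace(mm_hash="h1", data=None),
    SimpleNamespace(mm_hash="h2", data=[1, 2]),
]
extra_args, cleanup_items = asyncio.run(sender.prepare(feats, "image"))
```

Adding another transport then means writing only the two hooks and `cleanup`, which is the deduplication the refactor is after.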

Receiver side (backend)

MmKwargsReceiver (ABC)
│
├─ async receive(metadata)            ← template method, owns NVTX
│      └─ _receive(metadata)          ← hook
│
└─ abstract _receive(metadata) → dict[str, Any]
        Result is always {"__pickled_kwargs_item__": list[bytes]}

Concrete receivers

| Receiver | `_receive` behavior | Lifecycle |
|---|---|---|
| `MmKwargsNixlReceiver` | Acquire descriptor from pool → NIXL RDMA READ → release back to pool | Long-lived (pool is expensive to set up) |
| `MmKwargsShmReceiver` | `SharedMemory(name, create=False)` → read `[:size]` → close | Throwaway per request |

Handler dispatch (vllm/handlers.py)

async def _try_receive_mm_kwargs(self, request) -> EngineInput | None:
    # SHM path first (same-node, ~1.5 ms).
    if shm_meta_raw := extra_args.get("mm_kwargs_shm"):
        meta = MmKwargsShmTransferMetadata.model_validate(shm_meta_raw)
        return await self._receive_mm_kwargs(
            extra_args, "shm", MmKwargsShmReceiver(), meta,
        )

    # NIXL path (cross-node fallback).
    if nixl_meta_raw := extra_args.get("mm_kwargs_nixl"):
        meta = MmKwargsTransferMetadata.model_validate(nixl_meta_raw)
        if self._mm_kwargs_receiver is None:
            self._mm_kwargs_receiver = MmKwargsNixlReceiver()   # pooled
        return await self._receive_mm_kwargs(
            extra_args, "nixl", self._mm_kwargs_receiver, meta,
        )

    return None

_receive_mm_kwargs() is now a single, shared, transport-agnostic path:

async def _receive_mm_kwargs(self, extra_args, transport, receiver, metadata):
    # 1. validate mm_hashes + mm_placeholders
    # 2. results = await receiver.receive(metadata)    ← polymorphic
    # 3. unpickle + validate each MultiModalKwargsItem
    # 4. validate expanded_token_ids
    # 5. build engine_input dict
    # 6. inject_into_mm_cache
    # 7. return engine_input (or None on failure)
    ...  # body elided; the numbered comments summarize the steps
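
Steps 3 and 7 (unpickle, validate, return `None` on failure) might look roughly like the helper below. It is a sketch only: `unpack_items` is a hypothetical name, the `expected_count` check is an assumed validation, and `dict` stands in for `MultiModalKwargsItem`.

```python
import pickle
from typing import Any, Optional

PICKLED_KEY = "__pickled_kwargs_item__"


def unpack_items(
    results: dict,
    expected_count: int,
    item_type: type = dict,  # stand-in for MultiModalKwargsItem
) -> Optional[list]:
    """Unpickle each blob and type-check it; return None if anything is off."""
    blobs = results.get(PICKLED_KEY)
    if not isinstance(blobs, list) or len(blobs) != expected_count:
        return None
    items: list = []
    for blob in blobs:
        try:
            item: Any = pickle.loads(blob)
        except Exception:
            # A corrupt or truncated payload means the caller should fall back.
            return None
        if not isinstance(item, item_type):
            return None
        items.append(item)
    return items
```

Returning `None` rather than raising lets the handler fall through to its fallback path. Note that `pickle.loads` is only appropriate here because the bytes come from a trusted frontend.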

furionw

Thank you!

Add the vllm-project/vllm#39502 patch instructions to the MM KV routing guide so users who install Dynamo find the setup steps in the checked-in docs (previously only in the PR description). Clarify the Overview to cover both vLLM and TRT-LLM rather than sounding vLLM-only, and scope Transfer Mode Details to vLLM (TRT-LLM backends re-run their own preprocessing and are unaffected by DYNAMO_MM_TRANSFER). Drop the stale "Known Limitations" bullet that referenced the same upstream API; the new Prerequisites section is now the authoritative pointer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…fe-core # Conflicts: # components/src/dynamo/vllm/handlers.py

furionw · 2026-04-24T01:24:38Z

Thanks for working on these! Just want to make sure you two are aware of this parallel work and can coordinate:

We're removing the standalone MM router in "feat(multimodal): move MM routing into vLLM frontend processor" (#8065).
We're adding approx routing support to the standalone MM router in "fix: support approx routing in mm router" (#8135).

krishung5 and others added 7 commits April 10, 2026 02:27

Handle MM routing in vLLM processor frontend

9623ebc

Rename _externally_processed to externally_processed

b096b03

Add SINGLE_GPU support to agg_multimodal_router launch script

5431c5d

Allows packing all workers onto GPU 0 for functional testing on single-GPU machines via SINGLE_GPU=true environment variable.

krishung5 requested a review from a team April 10, 2026 15:50

krishung5 requested review from a team as code owners April 10, 2026 15:50

pull-request-size Bot added the size/XXL label Apr 10, 2026

github-actions Bot added the feat, documentation, backend::vllm, frontend, and multimodal labels Apr 10, 2026

coderabbitai Bot reviewed Apr 10, 2026

copy-pr-bot Bot temporarily deployed to GITLAB April 13, 2026 11:00 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB April 13, 2026 11:01 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB April 13, 2026 15:57 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB April 13, 2026 15:58 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB April 13, 2026 18:30 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB April 13, 2026 19:15 Inactive

Revert "Eliminate pickle overhead in NIXL mm_kwargs transfer"

aec3a42

This reverts commit 08c03b4.

copy-pr-bot Bot temporarily deployed to GITLAB April 17, 2026 09:54 Inactive

remove dead sync fetch_image path from DynamoMediaConnector

47851f8

copy-pr-bot Bot temporarily deployed to GITLAB April 17, 2026 10:11 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB April 17, 2026 10:12 Inactive

furionw reviewed Apr 17, 2026

Comment thread components/src/dynamo/frontend/vllm_processor.py

krishung5 and others added 5 commits April 20, 2026 10:03

Merge origin/main into krish/mm-router-vllm-fe-core

6c9afb4

Merge remote-tracking branch 'origin/main' into krish/mm-router-vllm-…

8b7d412

…fe-core # Conflicts: # components/src/dynamo/vllm/handlers.py

copy-pr-bot Bot temporarily deployed to GITLAB April 21, 2026 09:43 Inactive

furionw reviewed Apr 21, 2026

Comment thread components/src/dynamo/common/multimodal/mm_kwargs_transfer.py Outdated

furionw reviewed Apr 21, 2026

Comment thread components/src/dynamo/vllm/handlers.py Outdated

furionw reviewed Apr 21, 2026

Comment thread components/src/dynamo/vllm/handlers.py Outdated

krishung5 and others added 4 commits April 22, 2026 03:57

refactor(multimodal): rename MmKwargsReceiver -> MmKwargsNixlReceiver

c2987f6

Matches the existing MmKwargsNixlSender / MmKwargsShmSender / MmKwargsShmReceiver naming convention. Addresses review comment on PR #8065.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'origin/main' into krish/mm-router-vllm-…

639a041

…fe-core

copy-pr-bot Bot temporarily deployed to GITLAB April 22, 2026 11:46 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB April 22, 2026 11:52 Inactive

furionw approved these changes Apr 22, 2026

krishung5 and others added 2 commits April 22, 2026 23:24

Merge remote-tracking branch 'origin/main' into krish/mm-router-vllm-…

e5c4b60

…fe-core # Conflicts: # components/src/dynamo/vllm/handlers.py

copy-pr-bot Bot temporarily deployed to GITLAB April 23, 2026 06:29 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB April 23, 2026 06:50 Inactive

furionw mentioned this pull request Apr 24, 2026

fix: support approx routing in mm router #8135

Open

krishung5 merged commit 17f701a into main Apr 24, 2026
88 checks passed
