[Feat][Router] Add disaggregated prefill orchestrated routing #777
Conversation
Summary of Changes: Hello @yahavb, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a significant enhancement to the vLLM router by adding support for router-orchestrated disaggregated prefill. This new routing logic allows the system to separate the compute-intensive prefill phase from the memory-bound decode phase, with the router intelligently managing the flow of requests and KV cache metadata between dedicated prefill and decode pods. This change improves resource utilization, enables independent scaling of prefill and decode components, and offers greater flexibility for various backend implementations beyond LMCache.
Code Review
This pull request introduces a new routing logic for disaggregated prefill, which is a valuable feature. The overall structure is good, but there are several areas for improvement regarding correctness, performance, and maintainability. Key issues include duplicated and dead code for the orchestration logic, a deviation from the proposal where max_tokens=1 is not set for prefill requests, inefficient creation of aiohttp.ClientSession for each request, and incorrect handling of streaming responses which buffers the entire response in memory. I've provided specific comments and suggestions to address these points, which should lead to a more robust, performant, and maintainable implementation.
```python
async def handle_orchestrated_request(
    self,
    endpoints: List[EndpointInfo],
    request_json: Dict,
    request_path: str,
    aiohttp_client,
):
    """
    Orchestrate the full Prefill → Decode flow.

    Args:
        endpoints: List of available endpoints
        request_json: The original request body
        request_path: The API path (e.g., /v1/chat/completions)
        aiohttp_client: The aiohttp client session for making HTTP requests

    Returns:
        An async generator that yields the streaming response from decode
    """
    import aiohttp
    import json

    prefiller_endpoints, decoder_endpoints = self._find_endpoints(endpoints)

    # Select endpoints (simple first-available for now, can add load balancing later)
    prefill_url = prefiller_endpoints[0].url
    decode_url = decoder_endpoints[0].url

    request_id = str(uuid.uuid4())
    logger.info(f"[{request_id}] Starting orchestrated disaggregated inference")
    logger.info(f"[{request_id}] Prefill endpoint: {prefill_url}")
    logger.info(f"[{request_id}] Decode endpoint: {decode_url}")

    # Step 1: Send request to Prefill
    prefill_api_url = f"{prefill_url}{request_path}"
    logger.info(f"[{request_id}] Sending prefill request to {prefill_api_url}")

    try:
        async with aiohttp.ClientSession() as session:
            # Call Prefill
            async with session.post(
                prefill_api_url,
                json=request_json,
                headers={"Content-Type": "application/json", "X-Request-ID": request_id},
                timeout=aiohttp.ClientTimeout(total=300),  # 5 min timeout for prefill
            ) as prefill_resp:
                if prefill_resp.status != 200:
                    error_text = await prefill_resp.text()
                    logger.error(f"[{request_id}] Prefill failed with status {prefill_resp.status}: {error_text}")
                    yield json.dumps({"error": f"Prefill failed: {error_text}"}).encode()
                    return

                prefill_data = await prefill_resp.json()
                logger.info(f"[{request_id}] Prefill completed successfully")
                logger.debug(f"[{request_id}] Prefill response: {prefill_data}")

            # Step 2: Add prefill metadata and send to Decode
            decode_request = request_json.copy()
            decode_request["disagg_prefill_resp"] = prefill_data

            decode_api_url = f"{decode_url}{request_path}"
            logger.info(f"[{request_id}] Sending decode request to {decode_api_url}")

            # Check if streaming is requested
            is_streaming = request_json.get("stream", False)

            async with session.post(
                decode_api_url,
                json=decode_request,
                headers={"Content-Type": "application/json", "X-Request-ID": request_id},
                timeout=aiohttp.ClientTimeout(total=600),  # 10 min timeout for decode
            ) as decode_resp:
                if decode_resp.status != 200:
                    error_text = await decode_resp.text()
                    logger.error(f"[{request_id}] Decode failed with status {decode_resp.status}: {error_text}")
                    yield json.dumps({"error": f"Decode failed: {error_text}"}).encode()
                    return

                # Stream the decode response back to client
                if is_streaming:
                    async for chunk in decode_resp.content.iter_any():
                        if chunk:
                            yield chunk
                else:
                    response_data = await decode_resp.read()
                    yield response_data

            logger.info(f"[{request_id}] Decode completed successfully")

    except aiohttp.ClientError as e:
        logger.error(f"[{request_id}] HTTP error during orchestrated request: {e}")
        yield json.dumps({"error": f"HTTP error: {str(e)}"}).encode()
    except Exception as e:
        logger.error(f"[{request_id}] Unexpected error during orchestrated request: {e}")
        yield json.dumps({"error": f"Unexpected error: {str(e)}"}).encode()
```
This handle_orchestrated_request method appears to be dead code. The actual orchestration logic is implemented in route_orchestrated_disaggregated_request in src/vllm_router/services/request_service/request.py.
This duplicated logic can lead to confusion and maintenance issues. Additionally, this method contains some anti-patterns:
- Local imports of `aiohttp` and `json`.
- It accepts an `aiohttp_client` argument but creates a new `aiohttp.ClientSession` instead of using the provided one.

It's recommended to remove this method to keep the codebase clean and maintain a single source of truth for the orchestration logic.
Fixed in commit 2e4d2d2. Removed the entire handle_orchestrated_request method. The orchestration logic now lives only in route_orchestrated_disaggregated_request() in request.py.
```python
async with aiohttp.ClientSession() as session:
    # Send to Prefill
    async with session.post(
        prefill_api_url,
        json=request_json,
```
There are two issues with the prefill request handling:
1. `max_tokens` not set: The proposal document specifies sending the prefill request with `max_tokens=1` to optimize the prefill step. The current implementation sends the original request, which could lead to unnecessary token generation.
2. Inefficient session creation: A new `aiohttp.ClientSession` is created for each request. This is inefficient; the application's shared client session at `request.app.state.aiohttp_client_wrapper` should be used instead.

Please address both points for a more correct and performant implementation.
```python
prefill_request_json = request_json.copy()
prefill_request_json["max_tokens"] = 1

async with request.app.state.aiohttp_client_wrapper() as session:
    # Send to Prefill
    async with session.post(
        prefill_api_url,
        json=prefill_request_json,
```
Fixed in commit 2e4d2d2:
- Prefill request now uses `max_tokens=1` (and `max_completion_tokens=1` if present)
- Changed from `aiohttp.ClientSession()` to `request.app.state.aiohttp_client_wrapper()`
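For reference, a minimal sketch of the described change (illustrative; the exact code in commit 2e4d2d2 may differ):

```python
# Cap token generation during prefill; the decode service produces the real output.
prefill_request = request_json.copy()
prefill_request["max_tokens"] = 1
if "max_completion_tokens" in prefill_request:
    prefill_request["max_completion_tokens"] = 1
```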
```python
# For streaming, collect all chunks first, then yield
chunks = []
async for chunk in decode_resp.content.iter_any():
    if chunk:
        chunks.append(chunk)

logger.info(f"[{request_id}] Decode streaming completed, collected {len(chunks)} chunks")

async def generate_stream():
    for chunk in chunks:
        yield chunk
```
The current implementation for handling streaming responses collects all chunks into a list before sending them to the client. This approach buffers the entire response in memory, which defeats the purpose of streaming and can lead to high memory consumption and increased time to first token for the client.
The response chunks should be yielded to the client as soon as they are received from the decode service.
Note that with this change, the completion log on lines 508-510 will still be executed before the stream is fully sent. For correct logging, you might consider moving the completion log inside the generator, within a finally block, to ensure it's executed after the stream is consumed.
```python
# For streaming, yield chunks as they arrive
async def generate_stream():
    async for chunk in decode_resp.content.iter_any():
        if chunk:
            yield chunk
```
Fixed in commit 2e4d2d2. Streaming now yields chunks as they arrive, using a generator with a `finally` block for completion logging:

```python
async def generate_stream():
    try:
        async for chunk in decode_resp.content.iter_any():
            if chunk:
                yield chunk
    finally:
        logger.info(f"[{request_id}] Orchestrated streaming request completed...")
```
```diff
 elif isinstance(
-    request.app.state.router, (KvawareRouter, PrefixAwareRouter, SessionRouter)
+    request.app.state.router, (KvawareRouter, PrefixAwareRouter, SessionRouter, DisaggregatedPrefillOrchestratedRouter)
```
DisaggregatedPrefillOrchestratedRouter is already handled in a dedicated if block earlier in this function (lines 185-189). Including it here is redundant and this branch will never be taken for this router type. Please remove it from this isinstance check to improve clarity.
```diff
-    request.app.state.router, (KvawareRouter, PrefixAwareRouter, SessionRouter, DisaggregatedPrefillOrchestratedRouter)
+    request.app.state.router, (KvawareRouter, PrefixAwareRouter, SessionRouter)
```
Fixed in commit 2e4d2d2. Removed DisaggregatedPrefillOrchestratedRouter from the tuple - it's already handled by the dedicated if block earlier.
```python
router = request.app.state.router
prefill_labels = router.prefill_model_labels
decode_labels = router.decode_model_labels

prefiller_endpoints = [e for e in endpoints if e.model_label in prefill_labels]
decoder_endpoints = [e for e in endpoints if e.model_label in decode_labels]

if not prefiller_endpoints:
    logger.error(f"[{request_id}] No prefill endpoints found with labels {prefill_labels}")
    return JSONResponse(
        status_code=503,
        content={"error": f"No prefill endpoints available with labels {prefill_labels}"},
        headers={"X-Request-Id": request_id},
    )

if not decoder_endpoints:
    logger.error(f"[{request_id}] No decode endpoints found with labels {decode_labels}")
    return JSONResponse(
        status_code=503,
        content={"error": f"No decode endpoints available with labels {decode_labels}"},
        headers={"X-Request-Id": request_id},
    )
```
The logic for finding prefill and decode endpoints is duplicated here. The DisaggregatedPrefillOrchestratedRouter class already has a _find_endpoints method that encapsulates this logic, including error handling.
To avoid code duplication and adhere to better design principles (keeping routing logic within the router), you should call the router's method here.
This also applies to the error handling for when endpoints are not found.
```python
router = request.app.state.router
try:
    prefiller_endpoints, decoder_endpoints = router._find_endpoints(endpoints)
except ValueError as e:
    logger.error(f"[{request_id}] Endpoint discovery failed: {e}")
    return JSONResponse(
        status_code=503,
        content={"error": str(e)},
        headers={"X-Request-Id": request_id},
    )
```
Fixed in commit 2e4d2d2. Now using router._find_endpoints(endpoints) with try/except ValueError for error handling.
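For reference, a minimal sketch of what such a `_find_endpoints` helper could look like, inferred from the label-matching logic shown in the diff above (the actual method in `routing_logic.py` may differ):

```python
from typing import List, Tuple

class DisaggregatedPrefillOrchestratedRouter:
    def _find_endpoints(
        self, endpoints: List["EndpointInfo"]
    ) -> Tuple[List["EndpointInfo"], List["EndpointInfo"]]:
        """Split endpoints into prefill and decode pools by model label."""
        prefill = [e for e in endpoints if e.model_label in self.prefill_model_labels]
        decode = [e for e in endpoints if e.model_label in self.decode_model_labels]
        if not prefill:
            raise ValueError(
                f"No prefill endpoints available with labels {self.prefill_model_labels}"
            )
        if not decode:
            raise ValueError(
                f"No decode endpoints available with labels {self.decode_model_labels}"
            )
        return prefill, decode
```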
| "Content-Type": "application/json", | ||
| "X-Request-Id": request_id, | ||
| }, | ||
| timeout=aiohttp.ClientTimeout(total=300) |
Acknowledged. Keeping hardcoded timeouts (300s prefill, 600s decode) for now as reasonable defaults.
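For context, a hypothetical way to surface these timeouts as configuration rather than constants (the argument names here are illustrative, not from the PR):

```python
import aiohttp

class DisaggregatedPrefillOrchestratedRouter:
    def __init__(self, prefill_timeout_s: float = 300.0, decode_timeout_s: float = 600.0):
        # Defaults mirror the PR's hardcoded values (5 min prefill, 10 min decode).
        self.prefill_timeout = aiohttp.ClientTimeout(total=prefill_timeout_s)
        self.decode_timeout = aiohttp.ClientTimeout(total=decode_timeout_s)
```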
Implements support for disaggregated prefill as outlined in the 2025 Q1 roadmap. This enables prefill/decode disaggregation with router-orchestrated KV cache transfer. Closes vllm-project#26 Signed-off-by: Yahav <yahavb@amazon.com>
Force-pushed from aa54dff to 4ac2c59
…mpatibility Signed-off-by: Yahav <yahavb@amazon.com>
…rchestrated routing
- Remove dead code (handle_orchestrated_request method in routing_logic.py)
- Fix prefill request to use max_tokens=1 per proposal spec
- Use shared aiohttp client instead of creating new session per request
- Fix streaming to yield chunks immediately (true streaming)
- Remove redundant isinstance check for DisaggregatedPrefillOrchestratedRouter
- Use router's _find_endpoints method to avoid code duplication

Signed-off-by: Yahav <yahavb@amazon.com>
Implements support for disaggregated prefill as outlined in the 2025 Q1 roadmap. This enables prefill/decode disaggregation with router-orchestrated KV cache transfer.
Closes #26
Summary
This PR implements disaggregated prefill routing - a feature listed in the 2025 Q1 roadmap. It adds a new routing algorithm `disaggregated_prefill_orchestrated` that enables prefill/decode disaggregation with router-orchestrated KV cache transfer.

See full proposal: proposals/disaggregated-prefill-orchestrated-routing.md
Motivation
This complements LMCache-based disaggregated inference by supporting backends with custom `kv_connector` implementations:

| Approach | KV Transfer | Use Case |
|----------|-------------|----------|
| LMCache-based DI | LMCache + NIXL | GPU clusters with LMCache |
| Router-orchestrated DI (this PR) | vLLM native `kv_transfer_config` | Any backend with `kv_connector` |

Changes
| File | Change |
|------|--------|
| `routing_logic.py` | New `DisaggregatedPrefillOrchestratedRouter` class |
| `parser.py` | New `--prefill-model-labels`, `--decode-model-labels` arguments |
| `request.py` | New `route_orchestrated_disaggregated_request()` function |

Usage
```bash
python -m vllm_router.app \
  --routing-logic=disaggregated_prefill_orchestrated \
  --service-discovery=k8s \
  --k8s-label-selector="app in (prefill,decode)" \
  --prefill-model-labels=prefill \
  --decode-model-labels=decode
```
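For illustration, a client request through the router is an ordinary OpenAI-compatible call; the prefill/decode orchestration is transparent to the caller. The host, port, and model name below are placeholders:

```python
import json
import urllib.request

# Placeholder router address and model name; substitute your deployment's values.
payload = {
    "model": "facebook/opt-125m",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    # The router performs prefill, attaches the prefill metadata to the decode
    # request (as "disagg_prefill_resp"), and returns the decode response here.
    body = json.load(resp)
    print(body["choices"][0]["message"]["content"])
```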
Tested

✅ End-to-end tested with prefill/decode pods on EKS
FIX #26
Checklist:
- The PR title starts with the appropriate prefix(es) ([Feat][Router])
- Pass all the pre-commit checks (uv run pre-commit run --all-files)
- Sign off commits with -s when doing git commit
- Classify the PR with the appropriate prefixes, such as [Bugfix], [Feat], and [CI].

Detailed Checklist (Click to Expand)
Thank you for your contribution to production-stack! Before submitting the pull request, please ensure the PR meets the following criteria. This helps us maintain the code quality and improve the efficiency of the review process.
PR Title and Classification
Please try to classify PRs for easy understanding of the type of changes. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:
- `[Bugfix]` for bug fixes.
- `[CI/Build]` for build or continuous integration improvements.
- `[Doc]` for documentation fixes and improvements.
- `[Feat]` for new features in the cluster (e.g., autoscaling, disaggregated prefill, etc.).
- `[Router]` for changes to the `vllm_router` (e.g., routing algorithm, router observability, etc.).
- `[Misc]` for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.
Code Quality
The PR needs to meet the following code quality standards:
- Use `pre-commit` to format your code. See `README.md` for installation.

DCO and Signed-off-by
When contributing changes to this project, you must agree to the DCO. Commits must include a `Signed-off-by:` header which certifies agreement with the terms of the DCO.

Using `-s` with `git commit` will automatically add this header.

What to Expect for the Reviews
We aim to address all PRs in a timely manner. If no one reviews your PR within 5 days, please @-mention one of YuhanLiu11, Shaoting-Feng, or ApostaC.