[Optimize] Robust stability for PD deployment #5338
base: develop

Conversation
Pull request overview
This PR enhances the robustness and stability of PD (Prefill-Decode) disaggregated deployment by adding comprehensive error handling, health monitoring, and timeout mechanisms to prevent resource leaks and service disruptions.
Key Changes:
- Added communication status tracking for P-D and D-P message exchanges with proper error handling and retry logic
- Implemented health check mechanism for token processor to detect cache messenger process failures
- Added timeout-based resource reclamation in D instance to prevent block leakage when P fails to send first token
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| `fastdeploy/splitwise/splitwise_connector.py` | Modified message sending functions to return success status, configured ZMQ immediate mode to prevent message caching (see the sketch after this table), added error tracking |
| `fastdeploy/splitwise/internal_adapter_utils.py` | Added health check command handler to monitor token processor health status |
| `fastdeploy/output/token_processor.py` | Implemented health monitoring with timestamps tracking the batch processing lifecycle, repositioned resource manager logging |
| `fastdeploy/envs.py` | Added timeout configuration for the first token from P (300s) and token processor health check (120s), reduced prefill wait time from 30s to 5s |
| `fastdeploy/engine/sched/resource_manager_v1.py` | Added background thread to monitor and recycle preallocated resources that time out waiting for the first token, implemented duplicate request detection |
| `fastdeploy/engine/common_engine.py` | Added duplicate request ID validation, communication failure handling with error reporting to the scheduler, fixed error handling for D resource allocation failures |
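For readers unfamiliar with the ZMQ immediate-mode option mentioned for `splitwise_connector.py`, here is a minimal, self-contained pyzmq sketch of the idea; the socket type, address, and message are illustrative and not taken from the connector code:

```python
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.DEALER)
# With ZMQ_IMMEDIATE set, messages are only queued on completed connections,
# so sending to a peer that is down fails fast instead of piling up locally.
sock.setsockopt(zmq.IMMEDIATE, 1)
sock.connect("tcp://127.0.0.1:18901")  # illustrative address; nothing is listening
try:
    sock.send_json({"cmd": "ping"}, flags=zmq.NOBLOCK)
except zmq.Again:
    # No completed connection yet: the caller can retry or report the failure.
    pass
sock.close(0)
ctx.term()
```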
```python
                self._close_connection(addr)

        except Exception as e:
            self.logger.error(f"Message preparation failed: {e}")
```
Copilot AI (Dec 2, 2025):
When an exception occurs in the outer try block (line 176-177), the function returns is_successful which is still True. The function should set is_successful = False before returning, or return False explicitly when an exception is caught.
Suggested change:
```diff
             self.logger.error(f"Message preparation failed: {e}")
+            is_successful = False
```
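For reference, a standalone sketch of the success-flag pattern this suggestion completes; `send_message`, the `sock` argument, and the message loop are illustrative rather than the actual connector method:

```python
import logging

logger = logging.getLogger("splitwise_connector_sketch")

def send_message(sock, messages) -> bool:
    """Sketch: the outer handler must flip the flag, or failures look like successes."""
    is_successful = True
    try:
        for msg in messages:
            sock.send_json(msg)
    except Exception as e:
        logger.error(f"Message preparation failed: {e}")
        is_successful = False  # propagate the failure to the caller
    return is_successful
```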
```python
        try:
            with self.lock:
                need_recycle_request_ids = []
                for request_id, timestamp in self.preallocated_requests_timestamp.items():
```
Copilot AI (Dec 2, 2025):
Iterating over self.preallocated_requests_timestamp.items() while potentially modifying it in another thread can cause a RuntimeError ("dictionary changed size during iteration"). Even though this is inside a lock, the dictionary could be modified by other methods like add_prefilled_request (line 1082) after the lock is released but before the next iteration of the while loop. Consider using list(self.preallocated_requests_timestamp.items()) to create a snapshot for iteration.
Suggested change:
```diff
-                for request_id, timestamp in self.preallocated_requests_timestamp.items():
+                for request_id, timestamp in list(self.preallocated_requests_timestamp.items()):
```
| """ | ||
| with self.health_lock: | ||
| if self.timestamp_for_alive_after_handle_batch is None: # has entered handle batch | ||
| if time.time() - self.timestamp_for_alive_before_handle_batch > envs.FD_TOKEN_PROCESSOR_HEALTH_TIMEOUT: |
Copilot AI (Dec 2, 2025):
The health check has a potential race condition. If timestamp_for_alive_before_handle_batch is None when checked, the subtraction time.time() - self.timestamp_for_alive_before_handle_batch will raise a TypeError. This can happen before the first batch is processed or when exiting batch processing. Add a null check: if self.timestamp_for_alive_before_handle_batch is not None and time.time() - self.timestamp_for_alive_before_handle_batch > ...
Suggested change:
```diff
-                if time.time() - self.timestamp_for_alive_before_handle_batch > envs.FD_TOKEN_PROCESSOR_HEALTH_TIMEOUT:
+                if (
+                    self.timestamp_for_alive_before_handle_batch is not None
+                    and time.time() - self.timestamp_for_alive_before_handle_batch > envs.FD_TOKEN_PROCESSOR_HEALTH_TIMEOUT
+                ):
```
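The same guard, restated as a small self-contained helper with the two timestamps passed in explicitly (the function name and signature are illustrative):

```python
import time

def token_processor_is_healthy(ts_before_batch, ts_after_batch, timeout_s) -> bool:
    # ts_after_batch is None while a batch is being handled; ts_before_batch is
    # None until the first batch ever arrives, so check it before subtracting.
    if ts_after_batch is None:
        if ts_before_batch is not None and time.time() - ts_before_batch > timeout_s:
            return False  # stuck inside handle_batch for longer than the timeout
    return True
```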
```python
        for task in tasks:
            if self.resource_manager.has_existed_request(task.request_id):
                self.llm_logger.error(
                    f"request_id: {task.request_id} has been added to scheduler, recieved requests with same request_id."
```
Copilot AI (Dec 2, 2025):
Typo: "recieved" should be "received"
| f"request_id: {task.request_id} has been added to scheduler, recieved requests with same request_id." | |
| f"request_id: {task.request_id} has been added to scheduler, received requests with same request_id." |
```python
                is_successful = self.split_connector.send_splitwise_tasks([task], task.idx)
                if not is_successful:  # Send request for block ids to D failed
```
Copilot AI (Dec 2, 2025):
The return value type of send_splitwise_tasks is being checked as a boolean, but the function actually returns a dictionary. In Python, non-empty dictionaries are truthy, which means if not is_successful will be False even when all send operations failed. This check should verify that all operations succeeded, e.g., if not is_successful or not all(is_successful.values()):
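Spliced into the caller shown above, the suggested check might look like this (assuming `send_splitwise_tasks` keeps its per-request dictionary return):

```python
                is_successful = self.split_connector.send_splitwise_tasks([task], task.idx)
                # An empty dict or any per-request False means the block-id request
                # did not reach D, so treat both cases as a failure.
                if not is_successful or not all(is_successful.values()):
                    self.llm_logger.error(f"{task.request_id} send request for block ids to D failed.")
```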
```python
        for task in tasks:
            if self.resource_manager.has_existed_request(task.request_id):
                self.llm_logger.error(
                    f"request_id: {task.request_id} has been added to scheduler, recieved requests with same request_id."
```
Copilot AI (Dec 2, 2025):
Typo: "recieved" should be "received"
| f"request_id: {task.request_id} has been added to scheduler, recieved requests with same request_id." | |
| f"request_id: {task.request_id} has been added to scheduler, received requests with same request_id." |
```diff
@@ -150,7 +150,11 @@
     "FD_ENABLE_PDL": lambda: int(os.getenv("FD_ENABLE_PDL", "1")),
     # "Number of tokens in the group for Mixture of Experts (MoE) computation processing on HPU"
     "FD_HPU_CHUNK_SIZE": lambda: int(os.getenv("FD_HPU_CHUNK_SIZE", "64")),
```
Copilot AI (Dec 2, 2025):
[nitpick] The default value for FD_PREFILL_WAIT_DECODE_RESOURCE_SECONDS was changed from 30 seconds to 5 seconds. This is a significant behavior change that reduces the wait time by 83%. Consider documenting the rationale for this change in the PR description, as it may impact production deployments and timeout behavior.
| "FD_HPU_CHUNK_SIZE": lambda: int(os.getenv("FD_HPU_CHUNK_SIZE", "64")), | |
| "FD_HPU_CHUNK_SIZE": lambda: int(os.getenv("FD_HPU_CHUNK_SIZE", "64")), | |
| # [IMPORTANT] Default value changed from 30 seconds to 5 seconds (reduced by 83%). | |
| # Rationale: Shorter wait time improves responsiveness and aligns with observed resource decode times. | |
| # WARNING: This may impact production deployments and timeout behavior. Adjust via environment variable if needed. |
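If the shorter 5 s default turns out to be too aggressive for a given deployment, the previous behavior can be restored through the environment variable before the FastDeploy processes start, for example:

```python
import os

# Restore the pre-change 30 s wait for decode resources (sketch; in practice this
# is set in the launch environment rather than in library code).
os.environ["FD_PREFILL_WAIT_DECODE_RESOURCE_SECONDS"] = "30"
```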
```python
                    request = self.requests[request_id]
                    self.prerelease_resource(request)
```
Copilot AI (Dec 2, 2025):
Potential KeyError if request_id has already been removed from self.requests between the dictionary iteration and this access. The code should check if the request_id exists: if request_id in self.requests: request = self.requests[request_id]; self.prerelease_resource(request). Otherwise, the monitoring thread could crash when a request is removed by another thread.
Suggested change:
```diff
-                    request = self.requests[request_id]
-                    self.prerelease_resource(request)
+                    if request_id in self.requests:
+                        request = self.requests[request_id]
+                        self.prerelease_resource(request)
```
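Combining this guard with the earlier snapshot suggestion, one pass of the monitoring logic could look like the following standalone sketch; the function signature, the injected callbacks, and the removal of the timestamp entry are assumptions rather than the actual thread body:

```python
import time

def recycle_timed_out_preallocations(preallocated_ts, requests, prerelease, timeout_s, lock):
    # preallocated_ts: {request_id: timestamp}; requests: {request_id: request}.
    with lock:
        expired = [
            request_id
            for request_id, ts in list(preallocated_ts.items())  # snapshot for safe iteration
            if time.time() - ts > timeout_s
        ]
        for request_id in expired:
            if request_id in requests:  # may already be removed; avoid KeyError
                prerelease(requests[request_id])
            preallocated_ts.pop(request_id, None)  # drop the stale timestamp either way
```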
```diff
-                    self.split_connector.send_cache_info_to_prefill([task])
+                    send_info_to_p_successful = self.split_connector.send_cache_info_to_prefill([task])
+                    if send_info_to_p_successful is False:
+                        self.resource_manager.pre_recycle_resource(task.request_id)
```
Copilot AI (Dec 2, 2025):
Calling pre_recycle_resource without checking if the request was actually allocated. At this point (line 1182-1184), is_success is False, which means preallocate_resource_in_d was never called or failed. Recycling resources that were never allocated could lead to corruption of the resource manager state. This code block should only recycle if resources were actually preallocated.
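A sketch of the guard this comment asks for; `has_preallocated_resource` is an illustrative name for whatever check the resource manager can provide, not an existing API:

```python
                    send_info_to_p_successful = self.split_connector.send_cache_info_to_prefill([task])
                    if not send_info_to_p_successful:
                        # Only roll back blocks that were actually preallocated for this request.
                        if self.resource_manager.has_preallocated_resource(task.request_id):
                            self.resource_manager.pre_recycle_resource(task.request_id)
```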
```python
            splitwise_task_send_status[task.request_id] = is_successful
            task.disaggregate_info["cache_info"] = decode_diagg
            task.disaggregate_info["role"] = "prefill"
        return splitwise_task_send_status
```
Copilot AI (Dec 2, 2025):
The function returns a dictionary mapping request IDs to success status, but callers expect a boolean. This API design mismatch causes incorrect behavior - an empty dictionary evaluates to False, but a non-empty dictionary always evaluates to True regardless of success status. Consider either: (1) returning a single boolean value (e.g., all(splitwise_task_send_status.values()) if the dict is not empty, else True), or (2) updating all callers to properly handle the dictionary return type.
Suggested change:
```diff
-        return splitwise_task_send_status
+        return all(splitwise_task_send_status.values()) if splitwise_task_send_status else True
```
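Of the two options, returning `all(splitwise_task_send_status.values())` keeps the existing boolean-style callers working but discards per-request detail, while keeping the dictionary return and fixing the callers (as in the earlier `send_splitwise_tasks` comment) preserves which specific request failed; either way, the return type and every call site need to agree.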
```python
        need_delete_tasks = []
        if envs.FD_OFFLINE_PERF_TEST_FOR_PD:
            for task in tasks:
                if self.resource_manager.has_existed_request(task.request_id):
```
Suggest checking for this earlier, for example before the request is inserted into the scheduler.
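For illustration, a sketch of that earlier check, filtering duplicates before anything is handed to the scheduler; the `put_requests` call is an illustrative stand-in for the actual insertion path:

```python
        deduped_tasks = []
        for task in tasks:
            # Reject duplicates before they ever reach the scheduler.
            if self.resource_manager.has_existed_request(task.request_id):
                self.llm_logger.error(
                    f"request_id: {task.request_id} already exists, dropping duplicate request."
                )
                continue
            deduped_tasks.append(task)
        self.scheduler.put_requests(deduped_tasks)  # illustrative insertion call
```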
```python
                is_successful = self.split_connector.send_splitwise_tasks([task], task.idx)
                if not is_successful:  # Send request for block ids to D failed
                    self.llm_logger.error(f"{task.request_id} send request for block ids to D failed.")
                    self.scheduler.put_results(
```
These error messages reported to the scheduler could be extracted into a simple helper function, since the same pattern is used in several places.
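A sketch of the shared helper being suggested; the shape of the result object passed to `put_results` is not visible in this diff, so `_build_error_result` is a hypothetical stand-in for whatever the existing call sites construct:

```python
    def _report_failure_to_scheduler(self, request_id: str, error_msg: str) -> None:
        # One place to log a per-request failure and surface it to the scheduler,
        # instead of repeating the same lines at every call site.
        self.llm_logger.error(f"{request_id} {error_msg}")
        self.scheduler.put_results([self._build_error_result(request_id, error_msg)])
```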
```diff
                     self.llm_logger.info(f"Resource available, processing task {task.request_id}")
-                    self.split_connector.send_cache_info_to_prefill([task])
+                    send_info_to_p_successful = self.split_connector.send_cache_info_to_prefill([task])
+                    if send_info_to_p_successful is False:
```
Prefer a plain truthiness check here, e.g. `if not send_info_to_p_successful:`, rather than comparing with `is False`.
Thanks for your contribution! |
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- Add at least one tag to the PR title, chosen from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax].
- Run `pre-commit` before commit.
- If the PR targets the `release` branch, make sure the PR has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.