
Conversation

@rainyfly (Collaborator) commented Dec 2, 2025

Motivation

  1. When a P or D instance receives a request whose req_id matches one it is already processing, reject it and log an error, so that duplicate req_id requests cannot corrupt service resources.
  2. When P fails to communicate with D, or D fails to communicate with P, the affected request fails and an error message is returned.
  3. When D has not received the first token from P for a long time (e.g., because P crashed or went offline), D's blocks may leak, so a timeout-based reclamation mechanism is triggered.
  4. When the P cache messager process dies or malfunctions, the service cannot handle requests and gets stuck in the token processor; add a corresponding detection mechanism.
  5. ZMQ uses immediate-delivery mode and does not cache messages, so an instance that restarts never receives and processes stale messages (a minimal sketch follows this list).
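
For point 5, a minimal sketch of enabling ZMQ immediate delivery with pyzmq; the socket type and endpoint here are illustrative assumptions, not the PR's actual wiring:

    import zmq

    ctx = zmq.Context()
    sock = ctx.socket(zmq.DEALER)
    # ZMQ_IMMEDIATE: queue messages only on completed connections, so sends
    # fail fast instead of being buffered for a peer that is down.
    sock.setsockopt(zmq.IMMEDIATE, 1)
    # ZMQ_LINGER = 0: discard pending messages on close instead of lingering,
    # so a restarted peer never drains stale messages.
    sock.setsockopt(zmq.LINGER, 0)
    sock.connect("tcp://decode-instance:8201")  # illustrative endpoint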

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code; run pre-commit before committing.
  • Add unit tests. If no unit tests are added, please explain the reason in this PR.
  • Provide accuracy results.
  • If the current PR targets a release branch, make sure it has first been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings December 2, 2025 11:59

Copilot AI (Contributor) left a comment

Pull request overview

This PR enhances the robustness and stability of PD (Prefill-Decode) disaggregated deployment by adding comprehensive error handling, health monitoring, and timeout mechanisms to prevent resource leaks and service disruptions.

Key Changes:

  • Added communication status tracking for P-D and D-P message exchanges with proper error handling and retry logic
  • Implemented health check mechanism for token processor to detect cache messenger process failures
  • Added timeout-based resource reclamation in D instance to prevent block leakage when P fails to send first token

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 14 comments.

Summary per file:

  • fastdeploy/splitwise/splitwise_connector.py: Modified message-sending functions to return a success status, configured ZMQ immediate mode to prevent message caching, and added error tracking
  • fastdeploy/splitwise/internal_adapter_utils.py: Added a health-check command handler to monitor token processor health status
  • fastdeploy/output/token_processor.py: Implemented health monitoring with timestamps tracking the batch-processing lifecycle; repositioned resource manager logging
  • fastdeploy/envs.py: Added timeout configuration for the first token from P (300 s) and the token processor health check (120 s); reduced the prefill wait time from 30 s to 5 s
  • fastdeploy/engine/sched/resource_manager_v1.py: Added a background thread to monitor and recycle preallocated resources that time out waiting for the first token; implemented duplicate-request detection
  • fastdeploy/engine/common_engine.py: Added duplicate request ID validation and communication-failure handling with error reporting to the scheduler; fixed error handling for D resource-allocation failures

        self._close_connection(addr)

except Exception as e:
    self.logger.error(f"Message preparation failed: {e}")

Copilot AI Dec 2, 2025

When an exception occurs in the outer try block (lines 176-177), the function returns is_successful, which is still True. The function should set is_successful = False before returning, or return False explicitly when an exception is caught.

Suggested change
self.logger.error(f"Message preparation failed: {e}")
self.logger.error(f"Message preparation failed: {e}")
is_successful = False

try:
    with self.lock:
        need_recycle_request_ids = []
        for request_id, timestamp in self.preallocated_requests_timestamp.items():

Copilot AI Dec 2, 2025

Iterating over self.preallocated_requests_timestamp.items() while potentially modifying it in another thread can cause a RuntimeError ("dictionary changed size during iteration"). Even though this is inside a lock, the dictionary could be modified by other methods like add_prefilled_request (line 1082) after the lock is released but before the next iteration of the while loop. Consider using list(self.preallocated_requests_timestamp.items()) to create a snapshot for iteration.

Suggested change
for request_id, timestamp in self.preallocated_requests_timestamp.items():
for request_id, timestamp in list(self.preallocated_requests_timestamp.items()):

"""
with self.health_lock:
if self.timestamp_for_alive_after_handle_batch is None: # has entered handle batch
if time.time() - self.timestamp_for_alive_before_handle_batch > envs.FD_TOKEN_PROCESSOR_HEALTH_TIMEOUT:

Copilot AI Dec 2, 2025

The health check has a potential race condition. If timestamp_for_alive_before_handle_batch is None when checked, the subtraction time.time() - self.timestamp_for_alive_before_handle_batch will raise a TypeError. This can happen before the first batch is processed or when exiting batch processing. Add a null check: if self.timestamp_for_alive_before_handle_batch is not None and time.time() - self.timestamp_for_alive_before_handle_batch > ...

Suggested change
if time.time() - self.timestamp_for_alive_before_handle_batch > envs.FD_TOKEN_PROCESSOR_HEALTH_TIMEOUT:
if (
self.timestamp_for_alive_before_handle_batch is not None
and time.time() - self.timestamp_for_alive_before_handle_batch > envs.FD_TOKEN_PROCESSOR_HEALTH_TIMEOUT
):

for task in tasks:
    if self.resource_manager.has_existed_request(task.request_id):
        self.llm_logger.error(
            f"request_id: {task.request_id} has been added to scheduler, recieved requests with same request_id."

Copilot AI Dec 2, 2025

Typo: "recieved" should be "received"

Suggested change
f"request_id: {task.request_id} has been added to scheduler, recieved requests with same request_id."
f"request_id: {task.request_id} has been added to scheduler, received requests with same request_id."

Comment on lines +735 to +736
is_successful = self.split_connector.send_splitwise_tasks([task], task.idx)
if not is_successful: # Send request for block ids to D failed

Copilot AI Dec 2, 2025

The return value type of send_splitwise_tasks is being checked as a boolean, but the function actually returns a dictionary. In Python, non-empty dictionaries are truthy, which means if not is_successful will be False even when all send operations failed. This check should verify that all operations succeeded, e.g., if not is_successful or not all(is_successful.values()):
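
A minimal illustration of the pitfall, with hypothetical statuses:

    statuses = {"req-1": False, "req-2": False}  # every send failed
    if not statuses:                  # never true: a non-empty dict is truthy
        print("unreachable")
    if not all(statuses.values()):    # correctly flags the failures
        print("at least one send failed")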

for task in tasks:
    if self.resource_manager.has_existed_request(task.request_id):
        self.llm_logger.error(
            f"request_id: {task.request_id} has been added to scheduler, recieved requests with same request_id."

Copilot AI Dec 2, 2025

Typo: "recieved" should be "received"

Suggested change
f"request_id: {task.request_id} has been added to scheduler, recieved requests with same request_id."
f"request_id: {task.request_id} has been added to scheduler, received requests with same request_id."

@@ -150,7 +150,11 @@
"FD_ENABLE_PDL": lambda: int(os.getenv("FD_ENABLE_PDL", "1")),
# "Number of tokens in the group for Mixture of Experts (MoE) computation processing on HPU"
"FD_HPU_CHUNK_SIZE": lambda: int(os.getenv("FD_HPU_CHUNK_SIZE", "64")),

Copilot AI Dec 2, 2025

[nitpick] The default value for FD_PREFILL_WAIT_DECODE_RESOURCE_SECONDS was changed from 30 seconds to 5 seconds. This is a significant behavior change that reduces the wait time by 83%. Consider documenting the rationale for this change in the PR description, as it may impact production deployments and timeout behavior.

Suggested change
"FD_HPU_CHUNK_SIZE": lambda: int(os.getenv("FD_HPU_CHUNK_SIZE", "64")),
"FD_HPU_CHUNK_SIZE": lambda: int(os.getenv("FD_HPU_CHUNK_SIZE", "64")),
# [IMPORTANT] Default value changed from 30 seconds to 5 seconds (reduced by 83%).
# Rationale: Shorter wait time improves responsiveness and aligns with observed resource decode times.
# WARNING: This may impact production deployments and timeout behavior. Adjust via environment variable if needed.

Comment on lines +244 to +245
request = self.requests[request_id]
self.prerelease_resource(request)

Copilot AI Dec 2, 2025

Potential KeyError if request_id has already been removed from self.requests between the dictionary iteration and this access. The code should check if the request_id exists: if request_id in self.requests: request = self.requests[request_id]; self.prerelease_resource(request). Otherwise, the monitoring thread could crash when a request is removed by another thread.

Suggested change
request = self.requests[request_id]
self.prerelease_resource(request)
if request_id in self.requests:
request = self.requests[request_id]
self.prerelease_resource(request)

self.split_connector.send_cache_info_to_prefill([task])
send_info_to_p_successful = self.split_connector.send_cache_info_to_prefill([task])
if send_info_to_p_successful is False:
self.resource_manager.pre_recycle_resource(task.request_id)

Copilot AI Dec 2, 2025

Calling pre_recycle_resource without checking if the request was actually allocated. At this point (line 1182-1184), is_success is False, which means preallocate_resource_in_d was never called or failed. Recycling resources that were never allocated could lead to corruption of the resource manager state. This code block should only recycle if resources were actually preallocated.
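
A hedged sketch of the guard this implies; resources_preallocated is a hypothetical flag, not a name from the PR:

    # Hypothetical: only recycle resources that were actually preallocated,
    # i.e. set the flag True only after preallocate_resource_in_d succeeds.
    if resources_preallocated:
        self.resource_manager.pre_recycle_resource(task.request_id)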

splitwise_task_send_status[task.request_id] = is_successful
task.disaggregate_info["cache_info"] = decode_diagg
task.disaggregate_info["role"] = "prefill"
return splitwise_task_send_status

Copilot AI Dec 2, 2025

The function returns a dictionary mapping request IDs to success status, but callers expect a boolean. This API design mismatch causes incorrect behavior - an empty dictionary evaluates to False, but a non-empty dictionary always evaluates to True regardless of success status. Consider either: (1) returning a single boolean value (e.g., all(splitwise_task_send_status.values()) if the dict is not empty, else True), or (2) updating all callers to properly handle the dictionary return type.

Suggested change
return splitwise_task_send_status
return all(splitwise_task_send_status.values()) if splitwise_task_send_status else True

need_delete_tasks = []
if envs.FD_OFFLINE_PERF_TEST_FOR_PD:
for task in tasks:
if self.resource_manager.has_existed_request(task.request_id):
Collaborator commented:

Suggest catching this problem earlier, e.g., checking before the request is inserted into the scheduler.
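
A hedged sketch of the earlier check; put_requests is a hypothetical insertion call, only has_existed_request appears in this PR:

    # Hypothetical: reject duplicates before they ever reach the scheduler.
    for task in tasks:
        if self.resource_manager.has_existed_request(task.request_id):
            self.llm_logger.error(f"request_id: {task.request_id} already in flight, rejecting duplicate.")
            continue
        self.scheduler.put_requests([task])  # hypothetical insertion point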

is_successful = self.split_connector.send_splitwise_tasks([task], task.idx)
if not is_successful:  # Send request for block ids to D failed
    self.llm_logger.error(f"{task.request_id} send request for block ids to D failed.")
    self.scheduler.put_results(
Collaborator commented:

These error messages sent to the scheduler could be factored into a small helper function; the same pattern appears in several places.
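
A hedged sketch of such a helper; the result construction is an assumption, only llm_logger and scheduler.put_results appear in the snippets above:

    # Hypothetical helper; mirror whatever the existing call sites
    # actually pass to scheduler.put_results.
    def _report_error_to_scheduler(self, request_id, error_msg):
        self.llm_logger.error(f"{request_id} {error_msg}")
        error_result = build_error_result(request_id, error_msg)  # build_error_result is hypothetical
        self.scheduler.put_results([error_result])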

self.llm_logger.info(f"Resource available, processing task {task.request_id}")
self.split_connector.send_cache_info_to_prefill([task])
send_info_to_p_successful = self.split_connector.send_cache_info_to_prefill([task])
if send_info_to_p_successful is False:
Collaborator commented:

Prefer the idiomatic "if not bool_value:" over the explicit "is False" comparison.

@paddle-bot commented Dec 2, 2025

Thanks for your contribution!

heavengate added a commit that referenced this pull request Dec 15, 2025

* [Optimize] Robust stability for PD deployment

Co-authored-by: Kaipeng Deng <dengkaipeng@baidu.com>