Skip to content

Conversation

@qwes5s5
Copy link
Collaborator

@qwes5s5 qwes5s5 commented Oct 30, 2025

Motivation

目前推理阶段缺乏细粒度的时间打点数据,无法支撑对推理内部阶段的耗时分布查询。因此,需要对 FastDeploy 推理阶段进行细化划分,并增加日志打点。

此外,现有日志系统存在以下问题:

  • 日志仅包含时间、进程、消息等预定义信息,缺乏上下文链路信息,无法快速定位所属的 Trace 与 Span。
  • 系统可观测性不足,故障排查效率低。
  • 当前日志格式固定,仅支持 msg 参数,缺乏扩展能力。

为了解决无法快速定位的问题,引入 OpenTelemetry LoggingInstrumentor 工具,将 日志(Logs) 与 追踪(Traces) 关联起来,从而提升系统的可观测性与调试能力。

Modifications

1. 新增 Trace Logger

  • 新建 trace_logger,并提供获取函数 get_trace_logger。
  • 后续所有 FastDeploy 各阶段耗时打点 均通过此 logger 进行。
  • 日志输出至独立文件 trace_logger.log。

2. 新增自定义 Formatter

  • 创建新的格式化器 CustomColoredFormatter,与 RL Profiler 格式兼容。
  • 支持日志扩展字段 extra={"attributes": {...}}。
  • 目前在打印时会添加以下关键信息:
    • request_id
    • user_id
    • event
    • stage
  • 有助于后续日志聚合与性能分析。

3. 引入 LoggingInstrumentor

  • 接入 OpenTelemetry LoggingInstrumentor。
  • 当系统开启 Trace 时,会自动在日志中注入:
    • otel_trace_id
    • otel_span_id
  • 修改 Formatter 逻辑,在日志中体现 Trace 相关字段。

4. FastDeploy 阶段划分与打点

在 FastDeploy 各主要阶段插入日志打点,以支持耗时分析与追踪。
打点事件与阶段对应表:

事件描述 event 名称 stage 名称
请求前处理开始 PREPROCESSING_START PREPROCESSING
请求前处理结束 PREPROCESSING_END PREPROCESSING
请求调度开始 REQUEST_SCHEDULE_START SCHEDULE
请求排队开始 REQUEST_QUEUE_START SCHEDULE
请求排队结束 REQUEST_QUEUE_END SCHEDULE
资源分配开始 RESOURCE_ALLOCATE_START SCHEDULE
资源分配结束 RESOURCE_ALLOCATE_END SCHEDULE
请求调度结束 REQUEST_SCHEDULE_END SCHEDULE
开始推理 INFERENCE_START PREFILL
首 Token 生成 FIRST_TOKEN_GENERATED PREFILL
Decode 开始 DECODE_START DECODE
推理完成 INFERENCE_END DECODE
后处理开始 POSTPROCESSING_START POSTPROCESSING
后处理结束 POSTPROCESSING_END POSTPROCESSING

5. 打点工具类实现

为了规范化和自动化日志追踪信息的记录,定义了以下核心组件:
核心枚举类 (Enums)
这些枚举定义了 FastDeploy 请求处理流程中的标准打点事件和阶段,是实现细粒度追踪的基础。

LoggingEventName:

  • 作用: 定义了请求处理流程中所有细粒度的打点事件(如 PREPROCESSING_START, INFERENCE_END)。
  • 定位: 最小的耗时分析单元。

StageName:

  • 作用: 定义了请求处理流程中所有宏观的主阶段(如 PREPROCESSING, SCHEDULE, DECODE)。
  • 定位: 高层级的流程划分和性能概览。

EVENT_TO_STAGE_MAP:

  • 作用: 将每一个细粒度的 LoggingEventName 映射到其所属的 StageName。
  • 价值: 确保日志在记录 Event 时,能够自动附加其所属的 Stage 信息,从而简化打点流程并便于后续的日志查询和聚合分析。

trace_logger打印函数 (print)

  • 这是实际执行日志打点的工具函数。它封装了复杂的日志 extra 字段构造逻辑,确保日志记录的简洁性和上下文完整性。

Usage or Command

打点示例:

trace_logger.info(
    "preprocess end",
    extra={
        "attributes": {
            "request_id": f"{request_id}",
            "user_id": f"{getattr(request, 'user', '')}",
            "event": "POSTPROCESSING_END",
            "stage": "POSTPROCESSING",
        }
    },
)

Accuracy Tests

打印示例(未开启trace):

[2025-10-30 06:33:47,246] [INFO    ] (engine_client.py:add_requests:175) [thread=140634757722560] [thread_name=MainThread] [timestamp=1761806027246] [request_id=chatcmpl-114514-29ff96da-3825-41a1-acd8-601890942bef] [user_id=114514] [event=PREPROCESSING_START] [stage=PREPROCESSING] preprocess is started

打印示例(开启 Trace):

[2025-10-30 07:47:28,661] [INFO    ] (engine_client.py:add_requests:175) [otel_trace_id=cea88436e433abb227589130de85a620] [otel_span_id=f0e8826f89e7f716] [thread=139858837144000] [thread_name=MainThread] [timestamp=1761810448661] [request_id=chatcmpl-b6d68f14-4477-4d9d-a6de-bacc60ea782a] [event=PREPROCESSING_START] [stage=PREPROCESSING] preprocess is started

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot
Copy link

paddle-bot bot commented Oct 30, 2025

Thanks for your contribution!

@paddle-bot paddle-bot bot added the contributor External developers label Oct 30, 2025
@qwes5s5 qwes5s5 force-pushed the new_add_trace_logger branch 5 times, most recently from a7a07eb to afc83d9 Compare November 4, 2025 05:19
@qwes5s5
Copy link
Collaborator Author

qwes5s5 commented Nov 4, 2025

 /re-run run_tests_with_coverage

@qwes5s5
Copy link
Collaborator Author

qwes5s5 commented Nov 4, 2025

/re-run run_tests_with_coverage

EmmonsCurse
EmmonsCurse previously approved these changes Nov 4, 2025
@qwes5s5 qwes5s5 force-pushed the new_add_trace_logger branch 2 times, most recently from 02155ff to 67e6730 Compare November 10, 2025 03:30
@qwes5s5 qwes5s5 force-pushed the new_add_trace_logger branch from cd4e314 to a25b7c9 Compare November 10, 2025 04:12
@qwes5s5 qwes5s5 force-pushed the new_add_trace_logger branch from f21552e to 126f89c Compare November 10, 2025 06:23
@qwes5s5
Copy link
Collaborator Author

qwes5s5 commented Nov 10, 2025

/re-run ci_xpu

sunlei1024
sunlei1024 previously approved these changes Nov 10, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds comprehensive tracing and logging capabilities to the FastDeploy system for tracking request lifecycle events and improving observability.

  • Introduces a new trace logging module with event-based logging across preprocessing, scheduling, inference, and postprocessing stages
  • Adds OpenTelemetry logging instrumentation support to integrate with distributed tracing systems
  • Implements a custom log formatter that supports structured attributes and OTEL span/trace ID injection

Reviewed Changes

Copilot reviewed 19 out of 21 changed files in this pull request and generated 19 comments.

Show a summary per file
File Description
fastdeploy/trace/trace_logger.py New trace logging function that records events with request context
fastdeploy/trace/constants.py Defines logging event names, stage names, and event-to-stage mappings
fastdeploy/logger/formatters.py Adds CustomFormatter with attribute expansion and OTEL field support
fastdeploy/logger/logger.py Implements get_trace_logger method for creating trace-specific loggers
fastdeploy/utils.py Initializes global trace_logger instance
fastdeploy/metrics/trace_util.py Integrates OpenTelemetry logging instrumentation
fastdeploy/output/token_processor.py Adds trace logging for token generation events
fastdeploy/entrypoints/engine_client.py Adds trace logging at preprocessing start
fastdeploy/engine/common_engine.py Adds trace logging throughout scheduling and resource allocation
fastdeploy/entrypoints/openai/serving_completion.py Adds trace logging for completion endpoint postprocessing end
fastdeploy/entrypoints/openai/serving_chat.py Adds trace logging for chat endpoint postprocessing end
requirements*.txt Adds opentelemetry-instrumentation-logging dependency
tests/trace/test_trace_logger.py Tests for trace_print function
tests/trace/test_constants.py Tests for event and stage enumerations
tests/output/test_token_processor_trace_print.py Tests trace logging in token processor
tests/logger/test_logger.py Tests for get_trace_logger method
tests/logger/test_formatters.py Extensive tests for CustomFormatter and ColoredFormatter
Comments suppressed due to low confidence (3)

fastdeploy/logger/formatters.py:59

  • Except block directly handles BaseException.
        except:

fastdeploy/logger/formatters.py:125

  • Except block directly handles BaseException.
        except:

fastdeploy/trace/trace_logger.py:23

  • Except block directly handles BaseException.
    except:

@qwes5s5 qwes5s5 force-pushed the new_add_trace_logger branch 2 times, most recently from 6dd3637 to 5c49942 Compare November 13, 2025 15:51
@qwes5s5 qwes5s5 force-pushed the new_add_trace_logger branch from 24573ae to 845b899 Compare November 14, 2025 03:13
@qwes5s5
Copy link
Collaborator Author

qwes5s5 commented Nov 14, 2025

/re-run run_tests_with_coverage
/re-run run_tests_logprob
/re-run run_ce_cases
/re-run ci_xpu

@qwes5s5
Copy link
Collaborator Author

qwes5s5 commented Nov 14, 2025

/re-run run_ce_cases

@qwes5s5
Copy link
Collaborator Author

qwes5s5 commented Nov 14, 2025

/re-run run_tests_with_coverage

@qwes5s5
Copy link
Collaborator Author

qwes5s5 commented Nov 14, 2025

/re-run run_tests_logprob

1 similar comment
@qwes5s5
Copy link
Collaborator Author

qwes5s5 commented Nov 14, 2025

/re-run run_tests_logprob

@qwes5s5
Copy link
Collaborator Author

qwes5s5 commented Nov 14, 2025

/re-run run_tests_with_coverage

@qwes5s5
Copy link
Collaborator Author

qwes5s5 commented Nov 14, 2025

/re-run run_tests_with_coverage
/re-run run_tests_logprob

@Jiang-Jia-Jun Jiang-Jia-Jun merged commit 36216e6 into PaddlePaddle:develop Nov 17, 2025
13 of 15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contributor External developers

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants