[XPU] [CI] Xpu Ci Refactor #5252

plusNew001 · 2025-11-26T12:42:17Z

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

- Add at least one tag to the PR title.
  - Tag list: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
  - You can add new tags based on the PR content, but the semantics must be clear.
- Format your code and run pre-commit before committing.
- Add unit tests. If there are none, state the reason in this PR.
- Provide accuracy results.
- If the current PR targets the release branch, make sure it has first been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2025-11-26T12:42:24Z

Thanks for your contribution!

CLAassistant · 2025-11-26T12:42:27Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ plusNew001
❌ suijiaxin

suijiaxin seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

Copilot

Pull request overview

This PR refactors the XPU CI testing infrastructure from a monolithic bash script to a pytest-based framework. The refactoring improves test maintainability, modularity, and extensibility by separating test cases into individual files and extracting common functionality into reusable helper functions.

- Migrates from `scripts/run_ci_xpu.sh` to a pytest-based framework with `scripts/run_xpu_ci_pytest.sh`
- Creates modular test files for different scenarios (V1 mode, W4A8 quantization, VL model, EP parallel configurations)
- Centralizes common utilities in `conftest.py` for process management, health checks, and environment setup

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 34 comments.

| File | Description |
| --- | --- |
| `tests/xpu_ci_pytest/conftest.py` | Core infrastructure providing fixtures, server management, health checks, and EP environment setup functions |
| `tests/xpu_ci_pytest/test_w4a8.py` | Test case for W4A8 quantization mode with ERNIE-4.5-300B model |
| `tests/xpu_ci_pytest/test_vl_model.py` | Test case for vision-language model (ERNIE-4.5-VL-28B) with image input |
| `tests/xpu_ci_pytest/test_v1_mode.py` | Test case for V1 mode with wint4 quantization and prefix caching |
| `tests/xpu_ci_pytest/test_ep4tp4_online.py` | Test case for Expert Parallel (EP=4) with Tensor Parallel (TP=4) online service |
| `tests/xpu_ci_pytest/test_ep4tp4_all2all.py` | Test case for EP4TP4 configuration with all2all communication enabled |
| `tests/xpu_ci_pytest/test_ep4tp1_online.py` | Test case for EP4TP1 configuration with data parallel enabled |
| `tests/xpu_ci_pytest/README.md` | Comprehensive documentation covering usage, architecture, and how to add new tests |
| `scripts/run_xpu_ci_pytest.sh` | Main CI entry script that sets up environment and executes pytest tests |
| `.github/workflows/ci_xpu.yml` | GitHub workflow updated to use new pytest-based script |

Copilot · 2025-11-26T12:53:51Z

tests/xpu_ci/conftest.py

+    """Get the MODEL_PATH environment variable"""
+    model_path = os.getenv("MODEL_PATH")
+    if not model_path:
+        raise ValueError("MODEL_PATH environment variable is not set")


The error message is generic and doesn't provide helpful context. Consider improving it to help users understand what went wrong:

    if not model_path:
        raise ValueError(
            "MODEL_PATH environment variable is not set. "
            "Please set it to the directory containing your models, e.g., "
            "export MODEL_PATH=/path/to/models"
        )

Suggested change

-        raise ValueError("MODEL_PATH environment variable is not set")
+        raise ValueError(
+            "MODEL_PATH environment variable is not set.\n"
+            "Please set it to the directory containing your models before running tests.\n"
+            "Example: export MODEL_PATH=/path/to/models"
+        )
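
As a rough illustration of the idea, the check could be factored into a small reusable helper. This is only a sketch: the name `require_env` and its `hint` parameter are hypothetical and do not come from the PR's `conftest.py`; only the `MODEL_PATH` variable name is taken from the diff above.

```python
import os


def require_env(name: str, hint: str = "") -> str:
    """Return the value of environment variable `name`, or raise with guidance.

    Hypothetical helper for illustration; not part of the PR.
    """
    value = os.getenv(name)
    if not value:
        # Treat both "unset" and "empty" as missing, like the original check.
        message = f"{name} environment variable is not set."
        if hint:
            message += f"\n{hint}"
        raise ValueError(message)
    return value


if __name__ == "__main__":
    os.environ.pop("MODEL_PATH", None)
    try:
        require_env("MODEL_PATH", "Example: export MODEL_PATH=/path/to/models")
    except ValueError as err:
        print(err)
```

Keeping the hint as a parameter would let other required variables reuse the same helper with their own guidance text.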

Copilot · 2025-11-26T12:53:52Z

tests/xpu_ci/conftest.py

+        shell=True
+    )
+    if result.returncode != 0:
+        print("Failed to download xDeepEP")
+        return False
+
+    print("Extracting xDeepEP...")
+    result = subprocess.run("tar -xzf xDeepEP.tar.gz", shell=True)
+    if result.returncode != 0:
+        print("Failed to extract xDeepEP")
+        return False
+
+    print("Building xDeepEP...")
+    result = subprocess.run("cd xDeepEP && bash build.sh && cd -", shell=True)
+    if result.returncode != 0:
+        print("Failed to build xDeepEP")


The error messages in the `download_and_build_xdeepep()` function don't provide actionable information. They should include:

- The actual error from the subprocess
- Suggestions on what the user can do to fix it

    print("Downloading xDeepEP...")
    result = subprocess.run(
        "wget -q https://paddle-qa.bj.bcebos.com/xpu_third_party/xDeepEP.tar.gz",
        shell=True,
        capture_output=True,
        text=True
    )
    if result.returncode != 0:
        print(f"Failed to download xDeepEP: {result.stderr}")
        print("Please check the network connection or download the file manually")
        return False

Apply similar improvements to the extraction and build error messages.

Suggested change

-        shell=True
-    )
-    if result.returncode != 0:
-        print("Failed to download xDeepEP")
-        return False
-
-    print("Extracting xDeepEP...")
-    result = subprocess.run("tar -xzf xDeepEP.tar.gz", shell=True)
-    if result.returncode != 0:
-        print("Failed to extract xDeepEP")
-        return False
-
-    print("Building xDeepEP...")
-    result = subprocess.run("cd xDeepEP && bash build.sh && cd -", shell=True)
-    if result.returncode != 0:
-        print("Failed to build xDeepEP")
+        shell=True,
+        capture_output=True,
+        text=True
+    )
+    if result.returncode != 0:
+        print(f"Failed to download xDeepEP: {result.stderr.strip()}")
+        print("Suggestion: check the network connection, or manually download https://paddle-qa.bj.bcebos.com/xpu_third_party/xDeepEP.tar.gz to the current directory, and make sure the wget command is available.")
+        return False
+
+    print("Extracting xDeepEP...")
+    result = subprocess.run("tar -xzf xDeepEP.tar.gz", shell=True, capture_output=True, text=True)
+    if result.returncode != 0:
+        print(f"Failed to extract xDeepEP: {result.stderr.strip()}")
+        print("Suggestion: check that xDeepEP.tar.gz is complete, that there is enough disk space, and that the tar command is available.")
+        return False
+
+    print("Building xDeepEP...")
+    result = subprocess.run("cd xDeepEP && bash build.sh && cd -", shell=True, capture_output=True, text=True)
+    if result.returncode != 0:
+        print(f"Failed to build xDeepEP: {result.stderr.strip()}")
+        print("Suggestion: check that the build dependencies (e.g. gcc, make) are installed, and review the error output above to locate the problem.")
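
Since the three steps repeat the same run, check, and report pattern, one possible way to avoid the duplication is a shared helper. The sketch below is an assumption, not the PR's code: `run_step` is a hypothetical name, and it passes the command as an argument list rather than using `shell=True` as the original does. The demo commands use the current Python interpreter as a stand-in for `wget`, `tar`, and `build.sh`.

```python
import subprocess
import sys


def run_step(description: str, command: list[str], suggestion: str) -> bool:
    """Run one setup step and report stderr plus a hint if it fails.

    Hypothetical helper for illustration; not part of the PR.
    """
    print(f"{description}...")
    # capture_output=True collects stdout/stderr; text=True decodes them to str.
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"{description} failed (exit code {result.returncode}): {result.stderr.strip()}")
        print(f"Suggestion: {suggestion}")
        return False
    return True


if __name__ == "__main__":
    # Stand-in commands so the sketch runs without network access or a build tree.
    ok = run_step("Succeeding step", [sys.executable, "-c", "pass"], "nothing to do")
    failed = run_step(
        "Failing step",
        [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(3)"],
        "inspect the error output above",
    )
    print(ok, failed)
```

With a helper like this, each of the download, extract, and build steps would reduce to a single call that carries its own suggestion text.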

Copilot · 2025-11-26T12:53:52Z

scripts/run_xpu_ci_pytest.sh

+echo "============================Install test dependencies============================"
+python -m pip install openai -U
+python -m pip uninstall -y triton
+python -m pip install triton==3.3.0


The triton version 3.3.0 is hardcoded. Consider making it configurable or documenting why this specific version is required:

    TRITON_VERSION="${TRITON_VERSION:-3.3.0}"
    python -m pip uninstall -y triton
    python -m pip install "triton==${TRITON_VERSION}"

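Suggested change

-python -m pip install triton==3.3.0
+# The Triton version can be configured via the TRITON_VERSION environment variable and defaults to 3.3.0. Adjust it to match the project's compatibility requirements.
+TRITON_VERSION="${TRITON_VERSION:-3.3.0}"
+python -m pip install "triton==${TRITON_VERSION}"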
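
For readers less familiar with the shell's `${VAR:-default}` expansion, the same "environment override with a default" logic can be sketched in Python. The function name `triton_install_command` is hypothetical; the sketch only builds the command and does not run pip.

```python
import os
import sys


def triton_install_command(default_version: str = "3.3.0") -> list[str]:
    """Build the pip command, letting TRITON_VERSION override the default pin.

    Hypothetical helper for illustration; not part of the PR.
    """
    # `or` falls back when the variable is unset OR empty, which mirrors
    # the behaviour of ${TRITON_VERSION:-3.3.0} in the shell script.
    version = os.environ.get("TRITON_VERSION") or default_version
    return [sys.executable, "-m", "pip", "install", f"triton=={version}"]


if __name__ == "__main__":
    print(" ".join(triton_install_command()))
```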

Copilot · 2025-11-26T12:53:52Z

tests/xpu_ci_pytest/README.md

+
+### 步骤3: 添加到CI流程
+
+在 `scripts/run_xpu_ci_pytest.sh` 文件的pytest命令中添加新的测试文件:


Minor style improvement: add spaces around "pytest" for better readability in Chinese text.

Suggested change

-在 `scripts/run_xpu_ci_pytest.sh` 文件的pytest命令中添加新的测试文件:
+在 `scripts/run_xpu_ci_pytest.sh` 文件的 pytest 命令中添加新的测试文件:

Copilot · 2025-11-26T12:53:53Z

tests/xpu_ci/README.md

+## FAQ (常见问题)
+
+### 1. How do I debug a single test?
+
+```bash
+# Use pytest's debugging options
+python -m pytest -v -s --pdb tests/xpu_ci_pytest/test_xxx.py
+
+# Or add a breakpoint directly in the Python test code
+import pdb; pdb.set_trace()
+```
+
+### 2. How do I view the server logs?
+
+When a test fails, the contents of `server.log` and `log/workerlog.0` are printed automatically.
+You can also inspect them manually while the tests are running:
+
+```bash
+tail -f server.log
+tail -f log/workerlog.0
+```
+
+### 3. How do I skip a test?
+
+```python
+@pytest.mark.skip(reason="Temporarily skip this test")
+def test_example(xpu_env):
+    pass
+```
+
+### 4. How do I add a timeout?
+
+```python
+@pytest.mark.timeout(300)  # 5-minute timeout
+def test_example(xpu_env):
+    pass
+```


The "常见问题" (FAQ) section is helpful, but it's missing some important troubleshooting scenarios:

- What to do when the health check times out
- How to handle port conflicts
- What to do when models are not found
- How to run tests in parallel (or why not to)
- How to handle XPU device issues

Consider adding these common scenarios to help users debug issues more effectively.
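
To make one of these scenarios concrete, a port-conflict entry could point users at a quick pre-flight check like the sketch below. It is an assumption for illustration only: `is_port_free` is a hypothetical name, and the ports actually used by the PR's tests are not shown in this review, so `8188` is just a placeholder.

```python
import socket


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if `port` can be bound on `host` right now.

    Hypothetical helper for illustration; not part of the PR.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use": another process holds the port.
            return False
    return True


if __name__ == "__main__":
    port = 8188  # placeholder port, not taken from the PR
    state = "free" if is_port_free(port) else "already in use"
    print(f"Port {port} is {state}")
```

The result is only a snapshot: another process can still grab the port between the check and the server start, so a check like this helps with diagnosis rather than guaranteeing a clean start.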

Copilot · 2025-11-26T12:53:59Z

tests/xpu_ci/test_ep4tp4_online.py

+- Features: disable-sequence-parallel-moe
+"""
+
+import os


Import of 'os' is not used.

Suggested change

-import os

Copilot · 2025-11-26T12:54:00Z

tests/xpu_ci/test_v1_mode.py

+- Features: enable-prefix-caching, enable-chunked-prefill
+"""
+
+import os


Import of 'os' is not used.

Suggested change

-import os

Copilot · 2025-11-26T12:54:00Z

tests/xpu_ci/test_vl_model.py

+- Features: reasoning-parser, tool-call-parser, enable-chunked-prefill
+"""
+
+import os


Import of 'os' is not used.

Suggested change

-import os

Copilot · 2025-11-26T12:54:00Z

tests/xpu_ci/test_w4a8.py

+- Tensor Parallel: 4
+"""
+
+import os


Import of 'os' is not used.

Suggested change

-import os

Copilot · 2025-11-26T12:54:01Z

tests/xpu_ci/conftest.py

+                    model_id = data["data"][0].get("id", "unknown")
+                    print(f"\nModel ready! Model ID: {model_id}, total time: {elapsed} s")
+                    return True
+        except (json.JSONDecodeError, Exception) as e:


'except' clause does nothing but pass and there is no explanatory comment.

Suggested change

-        except (json.JSONDecodeError, Exception) as e:
+        except (json.JSONDecodeError, Exception) as e:
+            # Ignore the exception: the service may not be ready yet or may have returned non-JSON data, so keep retrying
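
For context, the retry-on-exception pattern discussed here can be sketched as a generic polling loop. This is not the PR's health check: `wait_until_ready` and its `probe` callable are hypothetical, and the sketch additionally remembers the last exception so that a timeout can report why the service never became ready, instead of discarding the error silently.

```python
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool], timeout: float = 5.0, interval: float = 0.1) -> bool:
    """Poll `probe` until it returns True or `timeout` seconds elapse.

    Hypothetical helper for illustration; not part of the PR.
    """
    deadline = time.monotonic() + timeout
    last_error = None
    while time.monotonic() < deadline:
        try:
            if probe():
                return True
        except Exception as exc:
            # The service may not be up yet, or may have returned malformed
            # data; remember the reason and retry after a short sleep.
            last_error = exc
        time.sleep(interval)
    if last_error is not None:
        print(f"Not ready after {timeout} s; last error: {last_error!r}")
    return False


if __name__ == "__main__":
    attempts = {"count": 0}

    def fake_probe() -> bool:
        # Simulates a server that fails twice before becoming healthy.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise ConnectionError("server not up yet")
        return True

    print(wait_until_ready(fake_probe, timeout=2.0, interval=0.01))
```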

codecov-commenter · 2025-12-02T06:02:15Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@69e003a). Learn more about missing BASE report.

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #5252   +/-   ##
==========================================
  Coverage           ?   59.17%           
==========================================
  Files              ?      324           
  Lines              ?    39932           
  Branches           ?     6033           
==========================================
  Hits               ?    23631           
  Misses             ?    14436           
  Partials           ?     1865

Flag	Coverage Δ
GPU	`59.17% <ø> (?)`

suijiaxin added 3 commits November 25, 2025 19:14

add xpu ci

01b7a59

add case

53e211f

add case

879da08

Copilot AI review requested due to automatic review settings November 26, 2025 12:42

Copilot started reviewing on behalf of plusNew001 November 26, 2025 12:42

Copilot finished reviewing on behalf of plusNew001 November 26, 2025 12:46

Copilot AI reviewed Nov 26, 2025

root and others added 10 commits November 27, 2025 11:03

fix ci bug

46d9641

Merge branch 'xpu-ci-refactor2' of https://github.com/plusNew001/Fast…

cdcc1a7

…Deploy into xpu-ci-refactor2

Update Docker image tag to 'latest' in CI workflow

b5b7134

Fix set -e usage in run_xpu_ci_pytest.sh

125fa79

Merge branch 'develop' into xpu-ci-refactor2

787aa16

add pd case

57efd70

add case

af70e5c

Configure pip to use Tsinghua mirror for dependencies

9e7c60e

Set the global pip index URL to Tsinghua mirror.

fix ci bug

98d6f40

Merge branch 'develop' into xpu-ci-refactor2

2709a30

root and others added 3 commits December 2, 2025 06:49

fix bug

39051cc

fix bug

5767bc0

Merge branch 'develop' into xpu-ci-refactor2

e082d3b

EmmonsCurse approved these changes Dec 2, 2025

plusNew001 merged commit 8e0f4df into PaddlePaddle:develop Dec 2, 2025
13 of 17 checks passed
