
Conversation

@plusNew001 (Collaborator)

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code; run pre-commit before committing.
  • Add unit tests. If there are no unit tests, explain why in this PR.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings November 26, 2025 12:42
paddle-bot bot commented Nov 26, 2025

Thanks for your contribution!

CLAassistant commented Nov 26, 2025

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ plusNew001
❌ suijiaxin


suijiaxin does not appear to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

Copilot AI (Contributor) left a comment

Pull request overview

This PR refactors the XPU CI testing infrastructure from a monolithic bash script to a pytest-based framework. The refactoring improves test maintainability, modularity, and extensibility by separating test cases into individual files and extracting common functionality into reusable helper functions.

  • Migrates from scripts/run_ci_xpu.sh to pytest-based framework with scripts/run_xpu_ci_pytest.sh
  • Creates modular test files for different scenarios (V1 mode, W4A8 quantization, VL model, EP parallel configurations)
  • Centralizes common utilities in conftest.py for process management, health checks, and environment setup (see the sketch after this list)
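
To make the structure described above concrete, here is a minimal, hypothetical sketch of the conftest.py pattern the overview refers to: a module-scoped fixture that launches a server, polls a health endpoint, and tears the process down. The helper names (wait_for_health, api_server), the launch command, and the port are assumptions for illustration, not the actual code in this PR.

```python
# Illustrative only: a conftest.py-style fixture for server lifecycle management.
import subprocess
import time

import pytest
import requests


def wait_for_health(url: str, timeout: int = 600) -> bool:
    """Poll a health endpoint until it responds with HTTP 200 or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=5).status_code == 200:
                return True
        except requests.RequestException:
            pass  # server not up yet, keep polling
        time.sleep(5)
    return False


@pytest.fixture(scope="module")
def api_server():
    """Start the server under test, yield its base URL, then tear it down."""
    proc = subprocess.Popen(["bash", "start_server.sh"])  # hypothetical launch command
    base_url = "http://127.0.0.1:8188"  # hypothetical port
    try:
        assert wait_for_health(f"{base_url}/health"), "server failed its health check"
        yield base_url
    finally:
        proc.terminate()
        proc.wait(timeout=30)
```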

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 34 comments.

Summary per file:

| File | Description |
| --- | --- |
| `tests/xpu_ci_pytest/conftest.py` | Core infrastructure providing fixtures, server management, health checks, and EP environment setup functions |
| `tests/xpu_ci_pytest/test_w4a8.py` | Test case for W4A8 quantization mode with ERNIE-4.5-300B model |
| `tests/xpu_ci_pytest/test_vl_model.py` | Test case for vision-language model (ERNIE-4.5-VL-28B) with image input |
| `tests/xpu_ci_pytest/test_v1_mode.py` | Test case for V1 mode with wint4 quantization and prefix caching |
| `tests/xpu_ci_pytest/test_ep4tp4_online.py` | Test case for Expert Parallel (EP=4) with Tensor Parallel (TP=4) online service |
| `tests/xpu_ci_pytest/test_ep4tp4_all2all.py` | Test case for EP4TP4 configuration with all2all communication enabled |
| `tests/xpu_ci_pytest/test_ep4tp1_online.py` | Test case for EP4TP1 configuration with data parallel enabled |
| `tests/xpu_ci_pytest/README.md` | Comprehensive documentation covering usage, architecture, and how to add new tests |
| `scripts/run_xpu_ci_pytest.sh` | Main CI entry script that sets up environment and executes pytest tests |
| `.github/workflows/ci_xpu.yml` | GitHub workflow updated to use new pytest-based script |

"""获取MODEL_PATH环境变量"""
model_path = os.getenv("MODEL_PATH")
if not model_path:
raise ValueError("MODEL_PATH environment variable is not set")
Copilot AI commented Nov 26, 2025

The error message is generic and doesn't provide helpful context. Consider improving it to help users understand what went wrong:

```python
if not model_path:
    raise ValueError(
        "MODEL_PATH environment variable is not set. "
        "Please set it to the directory containing your models, e.g., "
        "export MODEL_PATH=/path/to/models"
    )
```

Suggested change:

```python
raise ValueError(
    "MODEL_PATH environment variable is not set.\n"
    "Please set it to the directory containing your models before running tests.\n"
    "Example: export MODEL_PATH=/path/to/models"
)
```
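
As additional context, a self-contained sketch of how the improved check could be exposed to tests as a pytest fixture is shown below. The fixture name model_path and the session scope are assumptions for illustration, not the PR's actual conftest.py API.

```python
# Sketch only: wrapping the improved MODEL_PATH check in a pytest fixture.
import os

import pytest


@pytest.fixture(scope="session")
def model_path() -> str:
    """Return MODEL_PATH or fail fast with an actionable message."""
    path = os.getenv("MODEL_PATH")
    if not path:
        raise ValueError(
            "MODEL_PATH environment variable is not set.\n"
            "Please set it to the directory containing your models before running tests.\n"
            "Example: export MODEL_PATH=/path/to/models"
        )
    return path
```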

Comment on lines 323 to 338
```python
        shell=True
    )
    if result.returncode != 0:
        print("下载xDeepEP失败")
        return False

    print("解压xDeepEP...")
    result = subprocess.run("tar -xzf xDeepEP.tar.gz", shell=True)
    if result.returncode != 0:
        print("解压xDeepEP失败")
        return False

    print("编译xDeepEP...")
    result = subprocess.run("cd xDeepEP && bash build.sh && cd -", shell=True)
    if result.returncode != 0:
        print("编译xDeepEP失败")
```
Copilot AI commented Nov 26, 2025

The error messages in the download_and_build_xdeepep() function don't provide actionable information. They should include:

  1. The actual error from subprocess
  2. Suggestions on what the user can do to fix it

```python
print("下载xDeepEP...")
result = subprocess.run(
    "wget -q https://paddle-qa.bj.bcebos.com/xpu_third_party/xDeepEP.tar.gz",
    shell=True,
    capture_output=True,
    text=True
)
if result.returncode != 0:
    print(f"下载xDeepEP失败: {result.stderr}")
    print("请检查网络连接或手动下载文件")
    return False
```

Apply similar improvements to the extraction and build error messages.

Suggested change:

```python
        shell=True,
        capture_output=True,
        text=True
    )
    if result.returncode != 0:
        print(f"下载xDeepEP失败: {result.stderr.strip()}")
        print("建议: 请检查网络连接,或手动下载 https://paddle-qa.bj.bcebos.com/xpu_third_party/xDeepEP.tar.gz 到当前目录,并确保 wget 命令可用。")
        return False

    print("解压xDeepEP...")
    result = subprocess.run("tar -xzf xDeepEP.tar.gz", shell=True, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"解压xDeepEP失败: {result.stderr.strip()}")
        print("建议: 请检查 xDeepEP.tar.gz 文件是否完整,磁盘空间是否足够,并确保 tar 命令可用。")
        return False

    print("编译xDeepEP...")
    result = subprocess.run("cd xDeepEP && bash build.sh && cd -", shell=True, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"编译xDeepEP失败: {result.stderr.strip()}")
        print("建议: 请检查编译依赖是否齐全(如 gcc、make),并查看上方错误信息定位问题。")
```

echo "============================安装测试依赖============================"
python -m pip install openai -U
python -m pip uninstall -y triton
python -m pip install triton==3.3.0
Copilot AI commented Nov 26, 2025

The triton version 3.3.0 is hardcoded. Consider making it configurable or documenting why this specific version is required:

```bash
TRITON_VERSION="${TRITON_VERSION:-3.3.0}"
python -m pip uninstall -y triton
python -m pip install "triton==${TRITON_VERSION}"
```

Suggested change:

```bash
# Triton版本可通过环境变量TRITON_VERSION配置,默认为3.3.0。请根据项目兼容性需求调整。
TRITON_VERSION="${TRITON_VERSION:-3.3.0}"
python -m pip install "triton==${TRITON_VERSION}"
```


### 步骤3: 添加到CI流程

在 `scripts/run_xpu_ci_pytest.sh` 文件的pytest命令中添加新的测试文件:
Copilot AI commented Nov 26, 2025

Minor style improvement: add spaces around "pytest" for better readability in Chinese text.

Suggested change:

Before: `scripts/run_xpu_ci_pytest.sh` 文件的pytest命令中添加新的测试文件:
After: `scripts/run_xpu_ci_pytest.sh` 文件的 pytest 命令中添加新的测试文件:

Comment on lines 300 to 336
## 常见问题

### 1. 如何调试单个测试?

```bash
# 使用pytest的调试选项
python -m pytest -v -s --pdb tests/xpu_ci_pytest/test_xxx.py

# 或者直接在代码中添加断点
import pdb; pdb.set_trace()
```

### 2. 如何查看服务器日志?

测试失败时会自动打印 `server.log` 和 `log/workerlog.0` 的内容。
你也可以在测试运行时手动查看:

```bash
tail -f server.log
tail -f log/workerlog.0
```

### 3. 如何跳过某个测试?

```python
@pytest.mark.skip(reason="暂时跳过此测试")
def test_example(xpu_env):
    pass
```

### 4. 如何添加超时控制?

```python
@pytest.mark.timeout(300) # 5分钟超时
def test_example(xpu_env):
    pass
```
Copilot AI commented Nov 26, 2025

The "常见问题" (FAQ) section is helpful, but it's missing some important troubleshooting scenarios:

  1. What to do when health check times out
  2. How to handle port conflicts
  3. What to do when models are not found
  4. How to run tests in parallel (or why not to)
  5. How to handle XPU device issues

Consider adding these common scenarios to help users debug issues more effectively; a sketch of one such check (detecting port conflicts) follows.
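
As one illustration of the port-conflict scenario, a small standalone check like the sketch below could back such a FAQ entry. The port number 8188 and the helper name port_in_use are assumptions for this example, not values taken from the PR.

```python
# Illustrative sketch: check whether the service port is already occupied
# before starting the test server.
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    port = 8188  # hypothetical service port
    if port_in_use(port):
        print(f"Port {port} is already in use; stop the stale server or pick another port.")
    else:
        print(f"Port {port} is free.")
```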

```python
- 特性: disable-sequence-parallel-moe
"""

import os
```
Copilot AI commented Nov 26, 2025

Import of 'os' is not used.

Suggested change: remove the unused `import os`.

```python
- 特性: enable-prefix-caching, enable-chunked-prefill
"""

import os
```
Copilot AI commented Nov 26, 2025

Import of 'os' is not used.

Suggested change: remove the unused `import os`.

```python
- 特性: reasoning-parser, tool-call-parser, enable-chunked-prefill
"""

import os
```
Copilot AI commented Nov 26, 2025

Import of 'os' is not used.

Suggested change: remove the unused `import os`.

```python
- Tensor Parallel: 4
"""

import os
```
Copilot AI commented Nov 26, 2025

Import of 'os' is not used.

Suggested change: remove the unused `import os`.

```python
            model_id = data["data"][0].get("id", "unknown")
            print(f"\n模型就绪!模型ID: {model_id}, 总耗时 {elapsed} 秒")
            return True
        except (json.JSONDecodeError, Exception) as e:
```
Copilot AI commented Nov 26, 2025

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change:

```python
        except (json.JSONDecodeError, Exception) as e:
            # 忽略异常,可能是服务未就绪或返回非JSON数据,继续重试
```
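
For context, here is a minimal sketch of the kind of readiness-polling loop this except clause sits in, with the suggested explanatory comment applied. The function name, endpoint, and timeouts are assumptions for illustration, not the PR's actual implementation.

```python
# Sketch only: a readiness poll that retries until the server lists a model.
import json
import time

import requests


def wait_for_model_ready(base_url: str, timeout: int = 900) -> bool:
    """Poll /v1/models until a model is reported or the timeout expires."""
    start = time.time()
    while time.time() - start < timeout:
        try:
            resp = requests.get(f"{base_url}/v1/models", timeout=5)
            data = json.loads(resp.text)
            model_id = data["data"][0].get("id", "unknown")
            elapsed = int(time.time() - start)
            print(f"\nModel ready! id: {model_id}, elapsed {elapsed}s")
            return True
        except Exception:
            # Ignore errors: the service may not be up yet or may return
            # non-JSON output; keep retrying until the timeout expires.
            pass
        time.sleep(5)
    return False
```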

codecov-commenter commented Dec 2, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@69e003a). Learn more about missing BASE report.

Additional details and impacted files
```
@@            Coverage Diff             @@
##             develop    #5252   +/-   ##
==========================================
  Coverage           ?   59.17%
==========================================
  Files              ?      324
  Lines              ?    39932
  Branches           ?     6033
==========================================
  Hits               ?    23631
  Misses             ?    14436
  Partials           ?     1865
```
| Flag | Coverage Δ |
| --- | --- |
| GPU | 59.17% <ø> (?) |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

plusNew001 merged commit 8e0f4df into PaddlePaddle:develop Dec 2, 2025
13 of 17 checks passed
