[XPU] [CI] Xpu Ci Refactor #5252
Conversation
Thanks for your contribution!

suijiaxin seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
Pull request overview
This PR refactors the XPU CI testing infrastructure from a monolithic bash script to a pytest-based framework. The refactoring improves test maintainability, modularity, and extensibility by separating test cases into individual files and extracting common functionality into reusable helper functions.
- Migrates from `scripts/run_ci_xpu.sh` to a pytest-based framework with `scripts/run_xpu_ci_pytest.sh`
- Creates modular test files for different scenarios (V1 mode, W4A8 quantization, VL model, EP parallel configurations)
- Centralizes common utilities in `conftest.py` for process management, health checks, and environment setup
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 34 comments.
| File | Description |
|---|---|
| `tests/xpu_ci_pytest/conftest.py` | Core infrastructure providing fixtures, server management, health checks, and EP environment setup functions |
| `tests/xpu_ci_pytest/test_w4a8.py` | Test case for W4A8 quantization mode with ERNIE-4.5-300B model |
| `tests/xpu_ci_pytest/test_vl_model.py` | Test case for vision-language model (ERNIE-4.5-VL-28B) with image input |
| `tests/xpu_ci_pytest/test_v1_mode.py` | Test case for V1 mode with wint4 quantization and prefix caching |
| `tests/xpu_ci_pytest/test_ep4tp4_online.py` | Test case for Expert Parallel (EP=4) with Tensor Parallel (TP=4) online service |
| `tests/xpu_ci_pytest/test_ep4tp4_all2all.py` | Test case for EP4TP4 configuration with all2all communication enabled |
| `tests/xpu_ci_pytest/test_ep4tp1_online.py` | Test case for EP4TP1 configuration with data parallel enabled |
| `tests/xpu_ci_pytest/README.md` | Comprehensive documentation covering usage, architecture, and how to add new tests |
| `scripts/run_xpu_ci_pytest.sh` | Main CI entry script that sets up environment and executes pytest tests |
| `.github/workflows/ci_xpu.yml` | GitHub workflow updated to use new pytest-based script |
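The process-management role that the overview attributes to `conftest.py` can be illustrated with a small sketch. It is not taken from the PR: `managed_process`, its parameters, and the log handling are hypothetical names chosen for illustration, and it assumes the server runs as a child process whose output is written to a log file.

```python
import contextlib
import subprocess


@contextlib.contextmanager
def managed_process(args, log_path, stop_timeout=10.0):
    """Start a child process and make sure it is stopped on exit.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    with open(log_path, "w") as log_file:
        # Send stdout and stderr to one log file so failures can be inspected later.
        proc = subprocess.Popen(args, stdout=log_file, stderr=subprocess.STDOUT)
        try:
            yield proc
        finally:
            if proc.poll() is None:  # the process is still running
                proc.terminate()  # request a graceful stop first
                try:
                    proc.wait(timeout=stop_timeout)
                except subprocess.TimeoutExpired:
                    proc.kill()  # force the stop if the request is ignored
                    proc.wait()
```

Wrapping the launch in a context manager means the process is cleaned up even when an assertion inside the `with` block fails, which is the property a per-test server fixture usually needs.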
    """Get the MODEL_PATH environment variable."""
    model_path = os.getenv("MODEL_PATH")
    if not model_path:
        raise ValueError("MODEL_PATH environment variable is not set")
Copilot AI, Nov 26, 2025
The error message is generic and doesn't provide helpful context. Consider improving it to help users understand what went wrong:

    if not model_path:
        raise ValueError(
            "MODEL_PATH environment variable is not set. "
            "Please set it to the directory containing your models, e.g., "
            "export MODEL_PATH=/path/to/models"
        )

Suggested change:

    -     raise ValueError("MODEL_PATH environment variable is not set")
    +     raise ValueError(
    +         "MODEL_PATH environment variable is not set.\n"
    +         "Please set it to the directory containing your models before running tests.\n"
    +         "Example: export MODEL_PATH=/path/to/models"
    +     )
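If several required variables need the same treatment, the pattern in this suggestion could be factored into one helper. The following is a minimal sketch under that assumption; `require_env` and its `hint` parameter are hypothetical names, not part of the PR.

```python
import os


def require_env(name, hint):
    """Return the value of environment variable `name`, or raise with guidance.

    Hypothetical helper: `hint` is free-form text telling the user how to fix it.
    """
    value = os.getenv(name)
    if not value:  # covers both "unset" and "set to an empty string"
        raise ValueError(f"{name} environment variable is not set.\n{hint}")
    return value
```

A call such as `require_env("MODEL_PATH", "Example: export MODEL_PATH=/path/to/models")` would then produce the same kind of message as the suggested change.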
tests/xpu_ci/conftest.py
Outdated
        shell=True
    )
    if result.returncode != 0:
        print("Failed to download xDeepEP")
        return False

    print("Extracting xDeepEP...")
    result = subprocess.run("tar -xzf xDeepEP.tar.gz", shell=True)
    if result.returncode != 0:
        print("Failed to extract xDeepEP")
        return False

    print("Building xDeepEP...")
    result = subprocess.run("cd xDeepEP && bash build.sh && cd -", shell=True)
    if result.returncode != 0:
        print("Failed to build xDeepEP")
Copilot AI, Nov 26, 2025
The error messages in the `download_and_build_xdeepep()` function don't provide actionable information. They should include:
- The actual error from subprocess
- Suggestions on what the user can do to fix it

    print("Downloading xDeepEP...")
    result = subprocess.run(
        "wget -q https://paddle-qa.bj.bcebos.com/xpu_third_party/xDeepEP.tar.gz",
        shell=True,
        capture_output=True,
        text=True
    )
    if result.returncode != 0:
        print(f"Failed to download xDeepEP: {result.stderr}")
        print("Please check the network connection or download the file manually")
        return False

Apply similar improvements to the extraction and build error messages.
Suggested change:

    -     shell=True
    - )
    - if result.returncode != 0:
    -     print("Failed to download xDeepEP")
    -     return False
    - print("Extracting xDeepEP...")
    - result = subprocess.run("tar -xzf xDeepEP.tar.gz", shell=True)
    - if result.returncode != 0:
    -     print("Failed to extract xDeepEP")
    -     return False
    - print("Building xDeepEP...")
    - result = subprocess.run("cd xDeepEP && bash build.sh && cd -", shell=True)
    - if result.returncode != 0:
    -     print("Failed to build xDeepEP")
    +     shell=True,
    +     capture_output=True,
    +     text=True
    + )
    + if result.returncode != 0:
    +     print(f"Failed to download xDeepEP: {result.stderr.strip()}")
    +     print("Suggestion: check the network connection, or manually download https://paddle-qa.bj.bcebos.com/xpu_third_party/xDeepEP.tar.gz to the current directory, and make sure the wget command is available.")
    +     return False
    + print("Extracting xDeepEP...")
    + result = subprocess.run("tar -xzf xDeepEP.tar.gz", shell=True, capture_output=True, text=True)
    + if result.returncode != 0:
    +     print(f"Failed to extract xDeepEP: {result.stderr.strip()}")
    +     print("Suggestion: check that xDeepEP.tar.gz is intact, that there is enough disk space, and that the tar command is available.")
    +     return False
    + print("Building xDeepEP...")
    + result = subprocess.run("cd xDeepEP && bash build.sh && cd -", shell=True, capture_output=True, text=True)
    + if result.returncode != 0:
    +     print(f"Failed to build xDeepEP: {result.stderr.strip()}")
    +     print("Suggestion: check that the build dependencies (e.g. gcc, make) are installed, and review the error output above to locate the problem.")
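The three steps in this suggestion repeat the same run, check, and report pattern, so one way to apply it uniformly is a single helper. The sketch below is an assumption about how that could look, not the PR's code: `run_step` and `download_and_build` are hypothetical names, and the commands are passed as argument lists with `cwd=` instead of `shell=True` and `cd`.

```python
import subprocess


def run_step(description, args, cwd=None):
    """Run one command; on failure print the captured stderr and return False."""
    print(f"{description}...")
    try:
        result = subprocess.run(args, cwd=cwd, capture_output=True, text=True)
    except FileNotFoundError as exc:
        # Raised when the executable (or the cwd directory) does not exist.
        print(f"{description} could not start: {exc}")
        return False
    if result.returncode != 0:
        print(f"{description} failed (exit code {result.returncode}): {result.stderr.strip()}")
        return False
    return True


def download_and_build(url, archive="xDeepEP.tar.gz", source_dir="xDeepEP"):
    """Hypothetical driver: run the three steps in order, stopping at the first failure."""
    steps = [
        ("Downloading xDeepEP", ["wget", "-q", url], None),
        ("Extracting xDeepEP", ["tar", "-xzf", archive], None),
        ("Building xDeepEP", ["bash", "build.sh"], source_dir),
    ]
    # all() short-circuits, so later steps are skipped after a failure.
    return all(run_step(desc, args, cwd) for desc, args, cwd in steps)
```

Passing `cwd=source_dir` replaces the `cd xDeepEP && ... && cd -` idiom, and argument lists avoid shell quoting issues; step-specific suggestions like those in the review could be added as a fourth tuple element.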
    echo "============================Installing test dependencies============================"
    python -m pip install openai -U
    python -m pip uninstall -y triton
    python -m pip install triton==3.3.0
Copilot AI, Nov 26, 2025
The triton version 3.3.0 is hardcoded. Consider making it configurable or documenting why this specific version is required:

    TRITON_VERSION="${TRITON_VERSION:-3.3.0}"
    python -m pip uninstall -y triton
    python -m pip install "triton==${TRITON_VERSION}"

Suggested change:

    - python -m pip install triton==3.3.0
    + # The Triton version can be set via the TRITON_VERSION environment variable (default: 3.3.0). Adjust it to match the project's compatibility requirements.
    + TRITON_VERSION="${TRITON_VERSION:-3.3.0}"
    + python -m pip install "triton==${TRITON_VERSION}"
tests/xpu_ci_pytest/README.md
Outdated
    ### 步骤3: 添加到CI流程

    在 `scripts/run_xpu_ci_pytest.sh` 文件的pytest命令中添加新的测试文件:
Copilot AI, Nov 26, 2025
Minor style improvement: add spaces around "pytest" for better readability in Chinese text.
Suggested change:

    - 在 `scripts/run_xpu_ci_pytest.sh` 文件的pytest命令中添加新的测试文件:
    + 在 `scripts/run_xpu_ci_pytest.sh` 文件的 pytest 命令中添加新的测试文件:
    ## FAQ

    ### 1. How do I debug a single test?

    ```bash
    # Use pytest's debugging options
    python -m pytest -v -s --pdb tests/xpu_ci_pytest/test_xxx.py

    # Or add a breakpoint directly in the code
    import pdb; pdb.set_trace()
    ```

    ### 2. How do I view the server logs?

    When a test fails, the contents of `server.log` and `log/workerlog.0` are printed automatically.
    You can also inspect them manually while the tests are running:

    ```bash
    tail -f server.log
    tail -f log/workerlog.0
    ```

    ### 3. How do I skip a test?

    ```python
    @pytest.mark.skip(reason="Temporarily skip this test")
    def test_example(xpu_env):
        pass
    ```

    ### 4. How do I add a timeout?

    ```python
    @pytest.mark.timeout(300)  # 5-minute timeout
    def test_example(xpu_env):
        pass
    ```
Copilot AI, Nov 26, 2025
The "常见问题" (FAQ) section is helpful, but it's missing some important troubleshooting scenarios:
- What to do when health check times out
- How to handle port conflicts
- What to do when models are not found
- How to run tests in parallel (or why not to)
- How to handle XPU device issues
Consider adding these common scenarios to help users debug issues more effectively.
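For the port-conflict scenario in the list above, an FAQ entry could include a quick pre-flight check. The following is a minimal sketch using only the standard library; `is_port_free` and `find_free_port` are hypothetical helper names, and the check assumes the server binds on the local IPv4 loopback address.

```python
import socket


def is_port_free(port, host="127.0.0.1"):
    """Return True if `port` can be bound on `host`, i.e. no other socket holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True


def find_free_port(host="127.0.0.1"):
    """Ask the operating system for a currently unused port number."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.bind((host, 0))  # port 0 lets the OS choose
        return probe.getsockname()[1]
```

Both helpers are inherently racy: another process can take the port between the check and the server start, so they help diagnose a conflict but do not prevent one.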
tests/xpu_ci/test_ep4tp4_online.py
Outdated
    - Features: disable-sequence-parallel-moe
    """

    import os
Copilot AI, Nov 26, 2025
Import of 'os' is not used.

Suggested change:

    - import os
tests/xpu_ci/test_v1_mode.py
Outdated
    - Features: enable-prefix-caching, enable-chunked-prefill
    """

    import os
Copilot AI, Nov 26, 2025
Import of 'os' is not used.

Suggested change:

    - import os
tests/xpu_ci/test_vl_model.py
Outdated
    - Features: reasoning-parser, tool-call-parser, enable-chunked-prefill
    """

    import os
Copilot AI, Nov 26, 2025
Import of 'os' is not used.

Suggested change:

    - import os
tests/xpu_ci/test_w4a8.py
Outdated
    - Tensor Parallel: 4
    """

    import os
Copilot AI, Nov 26, 2025
Import of 'os' is not used.

Suggested change:

    - import os
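The same finding recurs in four test files, so it may be worth checking for it mechanically. Linters such as flake8 report unused imports as F401; the sketch below only illustrates the idea with the standard `ast` module. `unused_imports` is a hypothetical function, and it deliberately ignores cases a real linter handles, such as `__all__`, star imports, and names used only inside strings.

```python
import ast


def unused_imports(source):
    """Return the sorted names that are imported but never referenced in `source`."""
    tree = ast.parse(source)
    imported = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # `import a.b` binds the top-level name `a`.
                imported.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported.add(alias.asname or alias.name)
    # Every bare identifier, including the base of attribute access like `os.path`.
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return sorted(imported - used)
```

Running it over a file's text, for example `unused_imports(open(path).read())`, lists candidates to delete, which still deserve a manual look before removal.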
tests/xpu_ci/conftest.py
Outdated
            model_id = data["data"][0].get("id", "unknown")
            print(f"\nModel ready! Model ID: {model_id}, total time {elapsed} seconds")
            return True
        except (json.JSONDecodeError, Exception) as e:
Copilot AI, Nov 26, 2025
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:

    - except (json.JSONDecodeError, Exception) as e:
    + except (json.JSONDecodeError, Exception) as e:
    +     # Ignore the exception: the service may not be ready yet or may have returned non-JSON data; keep retrying
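Beyond adding a comment, the retry loop could catch only the errors that are expected during startup and keep the last one for the timeout message. The sketch below is an illustration under stated assumptions, not the PR's implementation: `fetch_json` and `wait_until_ready` are hypothetical names, the URL is supplied by the caller, and the response is assumed to be a JSON object with a `data` list whose entries carry an `id`, as in the excerpt above.

```python
import json
import time
import urllib.request


def fetch_json(url, timeout=5.0):
    """Fetch `url` and decode the response body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_until_ready(url, total_timeout=600.0, interval=5.0, fetch=fetch_json):
    """Poll `url` until it lists at least one model, or give up after `total_timeout` seconds."""
    start = time.monotonic()
    last_error = None
    while time.monotonic() - start < total_timeout:
        try:
            data = fetch(url)
            models = data.get("data") if isinstance(data, dict) else None
            if models:
                model_id = models[0].get("id", "unknown")
                print(f"Model ready: {model_id} after {time.monotonic() - start:.0f}s")
                return True
        except (OSError, json.JSONDecodeError) as exc:
            # Expected while the server starts: connection refused, timeouts
            # (URLError is an OSError subclass) or a non-JSON body. Retry.
            last_error = exc
        time.sleep(interval)
    print(f"Health check timed out after {total_timeout}s; last error: {last_error!r}")
    return False
```

Because `fetch` is injectable, the loop can be exercised without a running server, and unexpected exceptions such as programming errors propagate instead of being silently retried.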
…Deploy into xpu-ci-refactor2
Set the global pip index URL to Tsinghua mirror.
Codecov Report
✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

    @@           Coverage Diff            @@
    ##           develop    #5252   +/-   ##
    ==========================================
      Coverage         ?   59.17%
    ==========================================
      Files            ?      324
      Lines            ?    39932
      Branches         ?     6033
    ==========================================
      Hits             ?    23631
      Misses           ?    14436
      Partials         ?     1865
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- Tag list: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run `pre-commit` before commit.
- If the PR targets the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.