From 88cacd0188d7817ee8183b1c41d276683e59ce50 Mon Sep 17 00:00:00 2001
From: addsubmuldiv
Date: Thu, 5 Feb 2026 07:07:59 +0000
Subject: [PATCH 1/6] docs for npu support

---
 docs/source/GetStarted/NPU-setup.md    | 344 +++++++++++++++++++++++++
 docs/source_en/GetStarted/NPU-setup.md | 344 +++++++++++++++++++++++++
 2 files changed, 688 insertions(+)
 create mode 100644 docs/source/GetStarted/NPU-setup.md
 create mode 100644 docs/source_en/GetStarted/NPU-setup.md

diff --git a/docs/source/GetStarted/NPU-setup.md b/docs/source/GetStarted/NPU-setup.md
new file mode 100644
index 00000000..a7923fd6
--- /dev/null
+++ b/docs/source/GetStarted/NPU-setup.md
@@ -0,0 +1,344 @@
# NPU(昇腾)开箱指南

本文档介绍如何在华为昇腾 NPU 环境下安装和使用 Twinkle 框架。

## 环境要求

在开始之前,请确保您的系统满足以下要求:

| 组件 | 版本要求 | 说明 |
|------|---------|------|
| Python | >= 3.11, < 3.13 | Twinkle 框架要求 |
| 昇腾固件驱动(HDK) | 推荐最新版本 | 硬件驱动和固件 |
| CANN 工具包 | 8.3.RC1 或更高 | 异构计算架构 |
| PyTorch | 2.7.1 | 深度学习框架 |
| torch_npu | 2.7.1 | 昇腾 PyTorch 适配插件 |

**重要说明**:
- torch 和 torch_npu 版本**必须完全一致**(例如都为 2.7.1)
- 推荐使用 Python 3.11 以获得最佳兼容性
- CANN 工具包需要约 10GB+ 磁盘空间

## 支持的硬件

Twinkle 当前支持以下昇腾 NPU 设备:

- 昇腾 910 系列
- 其他兼容的昇腾加速卡

## 安装步骤

### 1. 安装 NPU 环境(驱动、CANN、torch_npu)

NPU 环境的安装包括昇腾驱动、CANN 工具包、PyTorch 和 torch_npu。

**📖 完整安装教程**:[torch_npu 官方安装指南](https://gitcode.com/Ascend/pytorch/overview)

该文档包含:
- 昇腾驱动(HDK)安装步骤
- CANN 工具包安装步骤
- PyTorch 和 torch_npu 安装步骤
- 版本配套说明

**推荐版本配置**:
- Python: 3.11
- PyTorch: 2.7.1
- torch_npu: 2.7.1
- CANN: 8.3.RC1 或更高

### 2. 安装 Twinkle

NPU 环境配置完成后,从源码安装 Twinkle 框架:

```bash
git clone https://github.com/modelscope/twinkle.git
cd twinkle
pip install -e ".[transformers,ray]"
```

### 3. 验证安装

创建测试脚本 `verify_npu.py`:

```python
import torch
import torch_npu

print(f"PyTorch version: {torch.__version__}")
print(f"torch_npu version: {torch_npu.__version__}")
print(f"NPU available: {torch.npu.is_available()}")
print(f"NPU device count: {torch.npu.device_count()}")

if torch.npu.is_available():
    print(f"Current NPU device: {torch.npu.current_device()}")
    print(f"NPU device name: {torch.npu.get_device_name(0)}")

    # 简单测试
    x = torch.randn(3, 3).npu()
    y = torch.randn(3, 3).npu()
    z = x + y
    print(f"NPU computation test passed: {z.shape}")
```

运行验证:

```bash
python verify_npu.py
```

如果输出显示 `NPU available: True` 且没有报错,说明安装成功!

**注意**:目前 Twinkle 暂未提供 NPU 的 Docker 镜像,建议使用手动安装方式。如需容器化部署,请参考昇腾社区的官方镜像。

## 快速开始

**重要提示**:以下示例均来自 `cookbook/` 目录,已在实际 NPU 环境中验证通过。建议直接运行 cookbook 中的脚本,而不是复制粘贴代码片段。

### SFT LoRA 微调

已验证的 4 卡 DP+FSDP 训练示例:

**示例路径**:[cookbook/sft/lora_npu.py](https://github.com/modelscope/twinkle/blob/main/cookbook/sft/lora_npu.py)

**运行方式**:
```bash
# 指定使用 4 张 NPU 卡
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3

# 运行训练
python cookbook/sft/lora_npu.py
```

**示例特性**:
- ✅ Ray 分布式模式
- ✅ DP + FSDP 混合并行(2x2)
- ✅ LoRA 微调
- ✅ 完整的数据加载和训练循环

### GRPO 强化学习训练

已验证的多卡 GRPO 训练示例:

**示例路径**:[cookbook/grpo/lora_npu.py](https://github.com/modelscope/twinkle/blob/main/cookbook/grpo/lora_npu.py)

**运行方式**:
```bash
# 指定使用 8 张 NPU 卡
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

# 运行训练
python cookbook/grpo/lora_npu.py
```

**示例特性**:
- ✅ Actor-Critic 架构
- ✅ 支持 Reference Model
- ✅ 可选 TorchSampler 或 VLLMSampler
- ✅ 完整的 RL 训练流程

### 更多示例

查看 `cookbook/remote/tinker/ascend/` 目录了解远程训练服务端配置。

## NPU 特定配置

### 环境变量

Twinkle 在 NPU 环境下会自动识别和使用以下环境变量:

| 环境变量 | 说明 | 示例 |
|---------|------|------|
| `ASCEND_RT_VISIBLE_DEVICES` | 指定可见的 NPU 设备 | `0,1,2,3` |
| `TRUST_REMOTE_CODE` | 允许加载远程代码 | `1` |
| `TWINKLE_SEED` | 随机种子 | `42` |
| `TWINKLE_FULL_DETERMINISM` | 完全确定性训练 | `1` |

### 设备网格配置

在 NPU 环境下,需要指定 `device_type='npu'`。以下是**已验证**的配置示例:

```python
from twinkle import DeviceMesh

# 单卡
device_mesh = DeviceMesh.from_sizes(dp_size=1, device_type='npu')

# 2 卡 DP
device_mesh = DeviceMesh.from_sizes(dp_size=2, device_type='npu')

# 4 卡 DP + FSDP (2x2) - 已验证
device_mesh = DeviceMesh.from_sizes(dp_size=2, fsdp_size=2, device_type='npu')
```

**注意**:TP/PP/EP 等高级并行策略暂无 NPU 验证,配置方式请参考 GPU 文档。

### 设备组配置

使用 Ray 模式时,需要在 DeviceGroup 中指定设备类型:

```python
from twinkle.infra import DeviceGroup

device_groups = [
    DeviceGroup(
        name='actor',
        ranks=[0, 1, 2, 3, 4, 5],  # Actor 使用 6 张卡
        device_type='npu',
    ),
    DeviceGroup(
        name='ref',
        ranks=[6, 7],  # Reference 模型使用 2 张卡
        device_type='npu',
    ),
]
```

## 并行策略

Twinkle 在 NPU 上目前支持以下**经过验证**的并行策略:

| 并行类型 | 说明 | NPU 支持 | 验证状态 |
|---------|------|---------|---------|
| DP (Data Parallel) | 数据并行 | ✅ | 已验证(见 cookbook/sft/lora_npu.py) |
| FSDP (Fully Sharded Data Parallel) | 完全分片数据并行 | ✅ | 已验证(见 cookbook/sft/lora_npu.py) |
| TP (Tensor Parallel) | 张量并行(Megatron) | 🚧 | 待验证 |
| PP (Pipeline Parallel) | 流水线并行(Megatron) | 🚧 | 待验证 |
| CP (Context Parallel) | 上下文并行 | 🚧 | 待验证 |
| EP (Expert Parallel) | 专家并行(MoE) | 🚧 | 待验证 |

**图例说明**:
- ✅ 已验证:有实际运行示例代码
- 🚧 待验证:理论上支持但暂无 NPU 验证示例
- ❌ 不支持:当前版本不可用

### DP + FSDP 示例(已验证)

以下示例来自 `cookbook/sft/lora_npu.py`,在实际 NPU 环境中验证通过:

```python
import numpy as np
from twinkle import DeviceMesh

# 4 卡:DP=2, FSDP=2
device_mesh = DeviceMesh(
    device_type='npu',
    mesh=np.array([[0, 1], [2, 3]]),
    mesh_dim_names=('dp', 'fsdp')
)
```

**注意**:Megatron 后端(TP/PP/EP)在 NPU 上的支持正在开发中,暂无可用示例。如需使用这些高级并行策略,请先在 GPU 环境下验证,或关注项目更新。

## 常见问题

### 1. torch_npu 版本不匹配

**问题**:安装 torch_npu 后出现版本不兼容警告或错误。

**解决方案**:
- 确保 torch 和 torch_npu 版本完全一致
- 检查 CANN 版本是否与 torch_npu 兼容

```bash
# 查看当前版本
python -c "import torch; import torch_npu; print(torch.__version__, torch_npu.__version__)"

# 重新安装匹配版本
pip uninstall torch torch_npu -y
pip install torch==2.7.1
pip install torch_npu-2.7.1-cp311-cp311-linux_aarch64.whl
```

### 2. CANN 工具包版本问题

**问题**:CANN 版本与 torch_npu 不兼容。

**解决方案**:
- 参考[昇腾社区版本配套表](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/80RC1alpha002/softwareinstall/instg/atlasdeploy_03_0015.html)
- 安装对应版本的 CANN 工具包

## 功能支持情况

基于实际代码验证的功能支持矩阵:

| 功能 | GPU | NPU | 验证示例 | 说明 |
|------|-----|-----|---------|------|
| SFT + LoRA | ✅ | ✅ | cookbook/sft/lora_npu.py | 已验证可用 |
| GRPO | ✅ | ✅ | cookbook/grpo/lora_npu.py | 已验证可用 |
| DP 并行 | ✅ | ✅ | cookbook/sft/lora_npu.py | 已验证可用 |
| FSDP 并行 | ✅ | ✅ | cookbook/sft/lora_npu.py | 已验证可用 |
| Ray 分布式 | ✅ | ✅ | cookbook/sft/lora_npu.py | 已验证可用 |
| TorchSampler | ✅ | ✅ | cookbook/grpo/lora_npu.py | 已验证可用 |
| 全量微调 | ✅ | 🚧 | - | 理论支持,待验证 |
| QLoRA | ✅ | ❌ | - | 量化算子暂不支持 |
| DPO | ✅ | 🚧 | - | 理论支持,待验证 |
| Megatron TP/PP | ✅ | 🚧 | - | 待适配和验证 |
| VLLMSampler | ✅ | 🚧 | - | 需 vLLM-Ascend,待验证 |
| Flash Attention | ✅ | ⚠️ | - | 部分算子不支持 |

**图例说明**:
- ✅ **已验证**:有实际运行示例,确认可用
- 🚧 **待验证**:理论上支持但暂无 NPU 环境验证
- ⚠️ **部分支持**:可用但有限制或性能差异
- ❌ **不支持**:当前版本不可用

**使用建议**:
1. 优先使用标记为"已验证"的功能,稳定性有保障
2. "待验证"功能可以尝试,但可能遇到兼容性问题
3. 遇到问题时,参考对应的示例代码进行配置

## 示例代码

Twinkle 提供了以下经过验证的 NPU 训练示例:

### SFT 训练
- **4 卡 DP+FSDP LoRA 微调**:[cookbook/sft/lora_npu.py](https://github.com/modelscope/twinkle/blob/main/cookbook/sft/lora_npu.py)
  - 使用 Ray 模式进行分布式训练
  - 演示 DP + FSDP 混合并行
  - 包含完整的数据加载和训练循环

### GRPO 训练
- **多卡 GRPO RL 训练**:[cookbook/grpo/lora_npu.py](https://github.com/modelscope/twinkle/blob/main/cookbook/grpo/lora_npu.py)
  - Actor-Critic 架构
  - 支持参考模型(Reference Model)
  - 可选 TorchSampler 或 VLLMSampler

### 远程训练(Tinker 协议)
- **服务端配置**:[cookbook/remote/tinker/ascend/](https://github.com/modelscope/twinkle/tree/main/cookbook/remote/tinker/ascend)
  - 提供 HTTP API 接口
  - 支持远程训练和推理
  - 适用于生产环境部署

**运行示例**:
```bash
# SFT 训练
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
python cookbook/sft/lora_npu.py

# GRPO 训练
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python cookbook/grpo/lora_npu.py
```

## 参考资源

- [昇腾社区官网](https://www.hiascend.com/)
- [CANN 软件安装指南](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/80RC1alpha002/softwareinstall/instg/atlasdeploy_03_0001.html)
- [torch_npu GitHub](https://github.com/Ascend/pytorch)
- [Twinkle GitHub](https://github.com/modelscope/twinkle)
- [Twinkle 文档](https://twinkle.readthedocs.io/)

## 获取帮助

如果您在使用过程中遇到问题:

1. **查看日志**:设置环境变量 `ASCEND_GLOBAL_LOG_LEVEL=1` 获取详细日志
2. **提交 Issue**:[Twinkle GitHub Issues](https://github.com/modelscope/twinkle/issues)
3. **社区讨论**:[昇腾社区论坛](https://www.hiascend.com/forum)

## 下一步

- 📖 阅读 [快速开始](Quick-start.md) 了解更多训练示例
- 📖 阅读 [安装指南](Installation.md) 了解其他平台的安装
- 🚀 浏览 `cookbook/` 目录查看完整示例代码
- 💡 查看 [Twinkle 文档](https://twinkle.readthedocs.io/) 了解高级功能

diff --git a/docs/source_en/GetStarted/NPU-setup.md b/docs/source_en/GetStarted/NPU-setup.md
new file mode 100644
index 00000000..2c16f79b
--- /dev/null
+++ b/docs/source_en/GetStarted/NPU-setup.md
@@ -0,0 +1,344 @@
# NPU (Ascend) Setup Guide

This guide explains how to install and use the Twinkle framework on Huawei Ascend NPU environments.

## Requirements

Before starting, ensure your system meets the following requirements:

| Component | Version Requirement | Notes |
|-----------|-------------------|-------|
| Python | >= 3.11, < 3.13 | Required by Twinkle framework |
| Ascend Firmware Driver (HDK) | Latest recommended | Hardware driver and firmware |
| CANN Toolkit | 8.3.RC1 or higher | Heterogeneous Computing Architecture |
| PyTorch | 2.7.1 | Deep learning framework |
| torch_npu | 2.7.1 | Ascend PyTorch adapter |

**Important Notes**:
- PyTorch and torch_npu versions **must match exactly** (e.g., both 2.7.1)
- Python 3.11 is recommended for best compatibility
- CANN toolkit requires approximately 10GB+ disk space

## Supported Hardware

Twinkle currently supports the following Ascend NPU devices:

- Ascend 910 series
- Other compatible Ascend accelerators

## Installation

### 1. Install NPU Environment (Driver, CANN, torch_npu)

NPU environment installation includes the Ascend driver, CANN toolkit, PyTorch, and torch_npu.

**📖 Complete Installation Guide**: [torch_npu Official Installation Guide](https://gitcode.com/Ascend/pytorch/overview)

The guide covers:
- Ascend driver (HDK) installation steps
- CANN toolkit installation steps
- PyTorch and torch_npu installation steps
- Version compatibility instructions

**Recommended Version Configuration**:
- Python: 3.11
- PyTorch: 2.7.1
- torch_npu: 2.7.1
- CANN: 8.3.RC1 or higher

### 2. Install Twinkle

After the NPU environment is configured, install the Twinkle framework from source:

```bash
git clone https://github.com/modelscope/twinkle.git
cd twinkle
pip install -e ".[transformers,ray]"
```

### 3. Verify Installation

Create test script `verify_npu.py`:

```python
import torch
import torch_npu

print(f"PyTorch version: {torch.__version__}")
print(f"torch_npu version: {torch_npu.__version__}")
print(f"NPU available: {torch.npu.is_available()}")
print(f"NPU device count: {torch.npu.device_count()}")

if torch.npu.is_available():
    print(f"Current NPU device: {torch.npu.current_device()}")
    print(f"NPU device name: {torch.npu.get_device_name(0)}")

    # Simple test
    x = torch.randn(3, 3).npu()
    y = torch.randn(3, 3).npu()
    z = x + y
    print(f"NPU computation test passed: {z.shape}")
```

Run verification:

```bash
python verify_npu.py
```

If the output shows `NPU available: True` without errors, the installation is successful!

**Note**: Twinkle does not currently provide NPU Docker images. Manual installation is recommended. For containerized deployment, please refer to official Ascend Community images.

## Quick Start

**Important**: All examples below are from the `cookbook/` directory and have been verified on actual NPU environments. We recommend running the scripts directly from the cookbook rather than copying code snippets.
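All of the cookbook runs above select their cards through `ASCEND_RT_VISIBLE_DEVICES`, so a quick sanity check of that variable before launching can catch a typo before a multi-card job fails mid-startup. The helper below is a minimal sketch and is **not** part of Twinkle — `parse_visible_npus` is a name made up here for illustration; it only mirrors the comma-separated convention shown in the examples:

```python
import os


def parse_visible_npus(raw: str) -> list[int]:
    """Parse an ASCEND_RT_VISIBLE_DEVICES-style string into device indices.

    Raises ValueError on malformed or duplicate entries so a typo
    fails fast, before the training job is launched.
    """
    entries = [part.strip() for part in raw.split(",") if part.strip()]
    ids = [int(part) for part in entries]  # ValueError on e.g. "0,x,2"
    if len(set(ids)) != len(ids):
        raise ValueError(f"duplicate device ids in {raw!r}")
    return ids


if __name__ == "__main__":
    # Default mirrors the 4-card SFT example; override via the environment.
    raw = os.environ.get("ASCEND_RT_VISIBLE_DEVICES", "0,1,2,3")
    print(f"visible NPUs: {parse_visible_npus(raw)}")
```

For example, the 4-card SFT run exports `0,1,2,3`, so `parse_visible_npus("0,1,2,3")` returns `[0, 1, 2, 3]`; an unset variable (all devices visible) is a policy decision left to the caller.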
### SFT LoRA Fine-tuning

Verified 4-card DP+FSDP training example:

**Example Path**: [cookbook/sft/lora_npu.py](https://github.com/modelscope/twinkle/blob/main/cookbook/sft/lora_npu.py)

**How to Run**:
```bash
# Specify 4 NPU cards
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3

# Run training
python cookbook/sft/lora_npu.py
```

**Example Features**:
- ✅ Ray distributed mode
- ✅ DP + FSDP hybrid parallelism (2x2)
- ✅ LoRA fine-tuning
- ✅ Complete data loading and training loop

### GRPO Reinforcement Learning Training

Verified multi-card GRPO training example:

**Example Path**: [cookbook/grpo/lora_npu.py](https://github.com/modelscope/twinkle/blob/main/cookbook/grpo/lora_npu.py)

**How to Run**:
```bash
# Specify 8 NPU cards
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

# Run training
python cookbook/grpo/lora_npu.py
```

**Example Features**:
- ✅ Actor-Critic architecture
- ✅ Supports Reference Model
- ✅ Optional TorchSampler or VLLMSampler
- ✅ Complete RL training workflow

### More Examples

Check the `cookbook/remote/tinker/ascend/` directory for remote training server configurations.

## NPU-Specific Configuration

### Environment Variables

Twinkle automatically recognizes and uses the following environment variables in NPU environments:

| Environment Variable | Description | Example |
|---------------------|-------------|---------|
| `ASCEND_RT_VISIBLE_DEVICES` | Specify visible NPU devices | `0,1,2,3` |
| `TRUST_REMOTE_CODE` | Allow loading remote code | `1` |
| `TWINKLE_SEED` | Random seed | `42` |
| `TWINKLE_FULL_DETERMINISM` | Fully deterministic training | `1` |

### Device Mesh Configuration

In NPU environments, specify `device_type='npu'`. Here are **verified** configuration examples:

```python
from twinkle import DeviceMesh

# Single card
device_mesh = DeviceMesh.from_sizes(dp_size=1, device_type='npu')

# 2-card DP
device_mesh = DeviceMesh.from_sizes(dp_size=2, device_type='npu')

# 4-card DP + FSDP (2x2) - Verified
device_mesh = DeviceMesh.from_sizes(dp_size=2, fsdp_size=2, device_type='npu')
```

**Note**: Advanced parallelism strategies like TP/PP/EP have not been verified on NPU. Please refer to GPU documentation for configuration details.

### Device Group Configuration

When using Ray mode, specify device type in DeviceGroup:

```python
from twinkle.infra import DeviceGroup

device_groups = [
    DeviceGroup(
        name='actor',
        ranks=[0, 1, 2, 3, 4, 5],  # Actor uses 6 cards
        device_type='npu',
    ),
    DeviceGroup(
        name='ref',
        ranks=[6, 7],  # Reference model uses 2 cards
        device_type='npu',
    ),
]
```

## Parallelism Strategies

Currently **verified** parallelism strategies on Twinkle NPU:

| Parallel Type | Description | NPU Support | Verification Status |
|--------------|-------------|-------------|-------------------|
| DP (Data Parallel) | Data parallelism | ✅ | Verified (see cookbook/sft/lora_npu.py) |
| FSDP (Fully Sharded Data Parallel) | Fully sharded data parallelism | ✅ | Verified (see cookbook/sft/lora_npu.py) |
| TP (Tensor Parallel) | Tensor parallelism (Megatron) | 🚧 | To be verified |
| PP (Pipeline Parallel) | Pipeline parallelism (Megatron) | 🚧 | To be verified |
| CP (Context Parallel) | Context parallelism | 🚧 | To be verified |
| EP (Expert Parallel) | Expert parallelism (MoE) | 🚧 | To be verified |

**Legend**:
- ✅ Verified: Has working example code
- 🚧 To be verified: Theoretically supported but no NPU validation
- ❌ Not supported: Currently unavailable

### DP + FSDP Example (Verified)

The following example is from `cookbook/sft/lora_npu.py`, verified on actual NPU environment:

```python
import numpy as np
from twinkle import DeviceMesh

# 4 cards: DP=2, FSDP=2
device_mesh = DeviceMesh(
    device_type='npu',
    mesh=np.array([[0, 1], [2, 3]]),
    mesh_dim_names=('dp', 'fsdp')
)
```

**Note**: Megatron backend (TP/PP/EP) support on NPU is under development with no available examples yet. If you need these advanced parallelism strategies, please validate on GPU environment first or follow project updates.

## Common Issues

### 1. torch_npu Version Mismatch

**Problem**: Version incompatibility warnings or errors after installing torch_npu.

**Solution**:
- Ensure torch and torch_npu versions match exactly
- Check CANN version compatibility with torch_npu

```bash
# Check current versions
python -c "import torch; import torch_npu; print(torch.__version__, torch_npu.__version__)"

# Reinstall matching versions
pip uninstall torch torch_npu -y
pip install torch==2.7.1
pip install torch_npu-2.7.1-cp311-cp311-linux_aarch64.whl
```

### 2. CANN Toolkit Version Issues

**Problem**: CANN version incompatible with torch_npu.

**Solution**:
- Refer to [Ascend Community version compatibility table](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/80RC1alpha002/softwareinstall/instg/atlasdeploy_03_0015.html)
- Install matching CANN toolkit version

## Feature Support Matrix

Feature support matrix based on actual code verification:

| Feature | GPU | NPU | Verification Example | Notes |
|---------|-----|-----|---------------------|-------|
| SFT + LoRA | ✅ | ✅ | cookbook/sft/lora_npu.py | Verified and working |
| GRPO | ✅ | ✅ | cookbook/grpo/lora_npu.py | Verified and working |
| DP Parallel | ✅ | ✅ | cookbook/sft/lora_npu.py | Verified and working |
| FSDP Parallel | ✅ | ✅ | cookbook/sft/lora_npu.py | Verified and working |
| Ray Distributed | ✅ | ✅ | cookbook/sft/lora_npu.py | Verified and working |
| TorchSampler | ✅ | ✅ | cookbook/grpo/lora_npu.py | Verified and working |
| Full Fine-tuning | ✅ | 🚧 | - | Theoretically supported, to be verified |
| QLoRA | ✅ | ❌ | - | Quantization operators not supported |
| DPO | ✅ | 🚧 | - | Theoretically supported, to be verified |
| Megatron TP/PP | ✅ | 🚧 | - | Under adaptation and verification |
| VLLMSampler | ✅ | 🚧 | - | Requires vLLM-Ascend, to be verified |
| Flash Attention | ✅ | ⚠️ | - | Some operators unsupported |

**Legend**:
- ✅ **Verified**: Has working examples, confirmed available
- 🚧 **To be verified**: Theoretically supported but no NPU validation
- ⚠️ **Partial support**: Available but with limitations or performance differences
- ❌ **Not supported**: Currently unavailable

**Usage Recommendations**:
1. Prioritize "Verified" features for stable production use
2. "To be verified" features can be tried but may have compatibility issues
3. Refer to corresponding example code when encountering problems

## Example Code

Twinkle provides the following verified NPU training examples:

### SFT Training
- **4-card DP+FSDP LoRA fine-tuning**: [cookbook/sft/lora_npu.py](https://github.com/modelscope/twinkle/blob/main/cookbook/sft/lora_npu.py)
  - Uses Ray mode for distributed training
  - Demonstrates DP + FSDP hybrid parallelism
  - Includes complete data loading and training loop

### GRPO Training
- **Multi-card GRPO RL training**: [cookbook/grpo/lora_npu.py](https://github.com/modelscope/twinkle/blob/main/cookbook/grpo/lora_npu.py)
  - Actor-Critic architecture
  - Supports Reference Model
  - Optional TorchSampler or VLLMSampler

### Remote Training (Tinker Protocol)
- **Server Configuration**: [cookbook/remote/tinker/ascend/](https://github.com/modelscope/twinkle/tree/main/cookbook/remote/tinker/ascend)
  - Provides HTTP API interface
  - Supports remote training and inference
  - Suitable for production deployment

**Running Examples**:
```bash
# SFT Training
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
python cookbook/sft/lora_npu.py

# GRPO Training
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python cookbook/grpo/lora_npu.py
```

## References

- [Ascend Community Official Website](https://www.hiascend.com/)
- [CANN Software Installation Guide](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/80RC1alpha002/softwareinstall/instg/atlasdeploy_03_0001.html)
- [torch_npu GitHub](https://github.com/Ascend/pytorch)
- [Twinkle GitHub](https://github.com/modelscope/twinkle)
- [Twinkle Documentation](https://twinkle.readthedocs.io/)

## Getting Help

If you encounter problems during usage:

1. **Check logs**: Set environment variable `ASCEND_GLOBAL_LOG_LEVEL=1` for detailed logs
2. **Submit Issue**: [Twinkle GitHub Issues](https://github.com/modelscope/twinkle/issues)
3. **Community Discussion**: [Ascend Community Forum](https://www.hiascend.com/forum)

## Next Steps

- 📖 Read [Quick Start](Quick-start.md) for more training examples
- 📖 Read [Installation Guide](Installation.md) for other platform installations
- 🚀 Browse `cookbook/` directory for complete example code
- 💡 Check [Twinkle Documentation](https://twinkle.readthedocs.io/) for advanced features

From 256ce764895f92fb8b3ccd248fb217a253741663 Mon Sep 17 00:00:00 2001
From: addsubmuldiv
Date: Thu, 5 Feb 2026 07:12:52 +0000
Subject: [PATCH 2/6] docs update

---
 docs/source/GetStarted/NPU-setup.md    | 53 --------------------------
 docs/source_en/GetStarted/NPU-setup.md | 53 --------------------------
 2 files changed, 106 deletions(-)

diff --git a/docs/source/GetStarted/NPU-setup.md b/docs/source/GetStarted/NPU-setup.md
index a7923fd6..de2b3a89 100644
--- a/docs/source/GetStarted/NPU-setup.md
+++ b/docs/source/GetStarted/NPU-setup.md
@@ -140,59 +140,6 @@ python cookbook/grpo/lora_npu.py
 
 查看 `cookbook/remote/tinker/ascend/` 目录了解远程训练服务端配置。
 
-## NPU 特定配置
-
-### 环境变量
-
-Twinkle 在 NPU 环境下会自动识别和使用以下环境变量:
-
-| 环境变量 | 说明 | 示例 |
-|---------|------|------|
-| `ASCEND_RT_VISIBLE_DEVICES` | 指定可见的 NPU 设备 | `0,1,2,3` |
-| `TRUST_REMOTE_CODE` | 允许加载远程代码 | `1` |
-| `TWINKLE_SEED` | 随机种子 | `42` |
-| `TWINKLE_FULL_DETERMINISM` | 完全确定性训练 | `1` |
-
-### 设备网格配置
-
-在 NPU 环境下,需要指定 `device_type='npu'`。以下是**已验证**的配置示例:
-
-```python
-from twinkle import DeviceMesh
-
-# 单卡
-device_mesh = DeviceMesh.from_sizes(dp_size=1, device_type='npu')
-
-# 2 卡 DP
-device_mesh = DeviceMesh.from_sizes(dp_size=2, device_type='npu')
-
-# 4 卡 DP + FSDP (2x2) - 已验证
-device_mesh = DeviceMesh.from_sizes(dp_size=2, fsdp_size=2, device_type='npu')
-```
-
-**注意**:TP/PP/EP 等高级并行策略暂无 NPU 验证,配置方式请参考 GPU 文档。
-
-### 设备组配置
-
-使用 Ray 模式时,需要在 DeviceGroup 中指定设备类型:
-
-```python
-from twinkle.infra import DeviceGroup
-
-device_groups = [
-    DeviceGroup(
-        name='actor',
-        ranks=[0, 1, 2, 3, 4, 5],  # Actor 使用 6 张卡
-        device_type='npu',
-    ),
-    DeviceGroup(
-        name='ref',
-        ranks=[6, 7],  # Reference 模型使用 2 张卡
-        device_type='npu',
-    ),
-]
-```
-
 ## 并行策略
 
 Twinkle 在 NPU 上目前支持以下**经过验证**的并行策略:

diff --git a/docs/source_en/GetStarted/NPU-setup.md b/docs/source_en/GetStarted/NPU-setup.md
index 2c16f79b..03bbee83 100644
--- a/docs/source_en/GetStarted/NPU-setup.md
+++ b/docs/source_en/GetStarted/NPU-setup.md
@@ -140,59 +140,6 @@ python cookbook/grpo/lora_npu.py
 
 Check the `cookbook/remote/tinker/ascend/` directory for remote training server configurations.
 
-## NPU-Specific Configuration
-
-### Environment Variables
-
-Twinkle automatically recognizes and uses the following environment variables in NPU environments:
-
-| Environment Variable | Description | Example |
-|---------------------|-------------|---------|
-| `ASCEND_RT_VISIBLE_DEVICES` | Specify visible NPU devices | `0,1,2,3` |
-| `TRUST_REMOTE_CODE` | Allow loading remote code | `1` |
-| `TWINKLE_SEED` | Random seed | `42` |
-| `TWINKLE_FULL_DETERMINISM` | Fully deterministic training | `1` |
-
-### Device Mesh Configuration
-
-In NPU environments, specify `device_type='npu'`. Here are **verified** configuration examples:
-
-```python
-from twinkle import DeviceMesh
-
-# Single card
-device_mesh = DeviceMesh.from_sizes(dp_size=1, device_type='npu')
-
-# 2-card DP
-device_mesh = DeviceMesh.from_sizes(dp_size=2, device_type='npu')
-
-# 4-card DP + FSDP (2x2) - Verified
-device_mesh = DeviceMesh.from_sizes(dp_size=2, fsdp_size=2, device_type='npu')
-```
-
-**Note**: Advanced parallelism strategies like TP/PP/EP have not been verified on NPU. Please refer to GPU documentation for configuration details.
-
-### Device Group Configuration
-
-When using Ray mode, specify device type in DeviceGroup:
-
-```python
-from twinkle.infra import DeviceGroup
-
-device_groups = [
-    DeviceGroup(
-        name='actor',
-        ranks=[0, 1, 2, 3, 4, 5],  # Actor uses 6 cards
-        device_type='npu',
-    ),
-    DeviceGroup(
-        name='ref',
-        ranks=[6, 7],  # Reference model uses 2 cards
-        device_type='npu',
-    ),
-]
-```
-
 ## Parallelism Strategies
 
 Currently **verified** parallelism strategies on Twinkle NPU:

From ba89de41b55b4809353bd857760bb093e30dfa66 Mon Sep 17 00:00:00 2001
From: addsubmuldiv
Date: Thu, 5 Feb 2026 07:22:02 +0000
Subject: [PATCH 3/6] update

---
 docs/source/GetStarted/NPU-setup.md    | 18 ++++++++++++++++--
 docs/source_en/GetStarted/NPU-setup.md | 18 ++++++++++++++++--
 2 files changed, 32 insertions(+), 4 deletions(-)

diff --git a/docs/source/GetStarted/NPU-setup.md b/docs/source/GetStarted/NPU-setup.md
index de2b3a89..60f4dc18 100644
--- a/docs/source/GetStarted/NPU-setup.md
+++ b/docs/source/GetStarted/NPU-setup.md
@@ -56,7 +56,21 @@ cd twinkle
 pip install -e ".[transformers,ray]"
 ```
 
-### 3. 验证安装
+### 3. 安装 vLLM 和 vLLM-Ascend(可选)
+
+如果需要使用 VLLMSampler 进行高效推理,可以安装 vLLM 和 vLLM-Ascend:
+
+```bash
+# 安装 vLLM
+pip install vllm
+
+# 安装 vLLM-Ascend(昇腾适配版本)
+# 请参考官方文档:https://github.com/vllm-project/vllm
+```
+
+**注意**:vLLM-Ascend 的安装可能需要特定的 CANN 版本配套,请参考 vLLM-Ascend 官方文档进行安装。
+
+### 4. 验证安装
 
 创建测试脚本 `verify_npu.py`:
 
@@ -220,7 +234,7 @@ pip install torch_npu-2.7.1-cp311-cp311-linux_aarch64.whl
 | QLoRA | ✅ | ❌ | - | 量化算子暂不支持 |
 | DPO | ✅ | 🚧 | - | 理论支持,待验证 |
 | Megatron TP/PP | ✅ | 🚧 | - | 待适配和验证 |
-| VLLMSampler | ✅ | 🚧 | - | 需 vLLM-Ascend,待验证 |
+| VLLMSampler | ✅ | ✅ | cookbook/grpo/lora_npu.py | 已验证可用 |
 | Flash Attention | ✅ | ⚠️ | - | 部分算子不支持 |

diff --git a/docs/source_en/GetStarted/NPU-setup.md b/docs/source_en/GetStarted/NPU-setup.md
index 03bbee83..c0404209 100644
--- a/docs/source_en/GetStarted/NPU-setup.md
+++ b/docs/source_en/GetStarted/NPU-setup.md
@@ -56,7 +56,21 @@ cd twinkle
 pip install -e ".[transformers,ray]"
 ```
 
-### 3. Verify Installation
+### 3. Install vLLM and vLLM-Ascend (Optional)
+
+If you need to use VLLMSampler for efficient inference, you can install vLLM and vLLM-Ascend:
+
+```bash
+# Install vLLM
+pip install vllm
+
+# Install vLLM-Ascend (Ascend adaptation version)
+# Please refer to the official documentation: https://github.com/vllm-project/vllm
+```
+
+**Note**: vLLM-Ascend installation may require specific CANN version compatibility. Please refer to the official vLLM-Ascend documentation for installation instructions.
+
+### 4. Verify Installation
 
 Create test script `verify_npu.py`:
 
@@ -220,7 +234,7 @@ Feature support matrix based on actual code verification:
 | QLoRA | ✅ | ❌ | - | Quantization operators not supported |
 | DPO | ✅ | 🚧 | - | Theoretically supported, to be verified |
 | Megatron TP/PP | ✅ | 🚧 | - | Under adaptation and verification |
-| VLLMSampler | ✅ | 🚧 | - | Requires vLLM-Ascend, to be verified |
+| VLLMSampler | ✅ | ✅ | cookbook/grpo/lora_npu.py | Verified and working |
 | Flash Attention | ✅ | ⚠️ | - | Some operators unsupported |

From 39e3404c0e1d7810d4cdf49d5c66054247ca173e Mon Sep 17 00:00:00 2001
From: addsubmuldiv
Date: Thu, 5 Feb 2026 07:25:32 +0000
Subject: [PATCH 4/6] update

---
 docs/source/GetStarted/NPU-setup.md    | 17 +++++++++++------
 docs/source_en/GetStarted/NPU-setup.md | 17 +++++++++++------
 2 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/docs/source/GetStarted/NPU-setup.md b/docs/source/GetStarted/NPU-setup.md
index 60f4dc18..f15a7cd2 100644
--- a/docs/source/GetStarted/NPU-setup.md
+++ b/docs/source/GetStarted/NPU-setup.md
@@ -58,17 +58,22 @@ pip install -e ".[transformers,ray]"
 
 ### 3. 安装 vLLM 和 vLLM-Ascend(可选)
 
-如果需要使用 VLLMSampler 进行高效推理,可以安装 vLLM 和 vLLM-Ascend:
+如果需要使用 VLLMSampler 进行高效推理,可以安装 vLLM 和 vLLM-Ascend。
+
+**安装步骤**(参考 [Swift 文档](https://swift.readthedocs.io/zh-cn/latest/BestPractices/NPU-support.html)):
 
 ```bash
-# 安装 vLLM
-pip install vllm
+# 第一步:安装 vLLM
+pip install vllm==0.11.0
 
-# 安装 vLLM-Ascend(昇腾适配版本)
-# 请参考官方文档:https://github.com/vllm-project/vllm
+# 第二步:安装 vLLM-Ascend
+pip install vllm-ascend==0.11.0rc3
 ```
 
-**注意**:vLLM-Ascend 的安装可能需要特定的 CANN 版本配套,请参考 vLLM-Ascend 官方文档进行安装。
+**注意事项**:
+- 按照上述顺序安装,忽略可能的依赖冲突提示
+- 安装前确保已激活 CANN 环境:`source /usr/local/Ascend/ascend-toolkit/set_env.sh`
+- 推荐使用的版本为 vLLM 0.11.0 和 vLLM-Ascend 0.11.0rc3
 
 ### 4. 验证安装

diff --git a/docs/source_en/GetStarted/NPU-setup.md b/docs/source_en/GetStarted/NPU-setup.md
index c0404209..acd3c9a7 100644
--- a/docs/source_en/GetStarted/NPU-setup.md
+++ b/docs/source_en/GetStarted/NPU-setup.md
@@ -58,17 +58,22 @@ pip install -e ".[transformers,ray]"
 
 ### 3. Install vLLM and vLLM-Ascend (Optional)
 
-If you need to use VLLMSampler for efficient inference, you can install vLLM and vLLM-Ascend:
+If you need to use VLLMSampler for efficient inference, you can install vLLM and vLLM-Ascend.
+
+**Installation Steps** (Reference: [Swift Documentation](https://swift.readthedocs.io/zh-cn/latest/BestPractices/NPU-support.html)):
 
 ```bash
-# Install vLLM
-pip install vllm
+# Step 1: Install vLLM
+pip install vllm==0.11.0
 
-# Install vLLM-Ascend (Ascend adaptation version)
-# Please refer to the official documentation: https://github.com/vllm-project/vllm
+# Step 2: Install vLLM-Ascend
+pip install vllm-ascend==0.11.0rc3
 ```
 
-**Note**: vLLM-Ascend installation may require specific CANN version compatibility. Please refer to the official vLLM-Ascend documentation for installation instructions.
+**Important Notes**:
+- Follow the installation order above and ignore potential dependency conflict warnings
+- Ensure CANN environment is activated before installation: `source /usr/local/Ascend/ascend-toolkit/set_env.sh`
+- Recommended versions are vLLM 0.11.0 and vLLM-Ascend 0.11.0rc3
 
 ### 4. Verify Installation
 
 Create test script `verify_npu.py`:

From 478e538d5bde3c0346d816155fac24fe48f91344 Mon Sep 17 00:00:00 2001
From: addsubmuldiv
Date: Thu, 5 Feb 2026 07:35:33 +0000
Subject: [PATCH 5/6] update

---
 docs/source/GetStarted/NPU-setup.md    | 4 ++--
 docs/source_en/GetStarted/NPU-setup.md | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/source/GetStarted/NPU-setup.md b/docs/source/GetStarted/NPU-setup.md
index f15a7cd2..e3cac951 100644
--- a/docs/source/GetStarted/NPU-setup.md
+++ b/docs/source/GetStarted/NPU-setup.md
@@ -177,7 +177,7 @@ Twinkle 在 NPU 上目前支持以下**经过验证**的并行策略:
 - 🚧 待验证:理论上支持但暂无 NPU 验证示例
 - ❌ 不支持:当前版本不可用
 
-### DP + FSDP 示例(已验证)
+### DP + FSDP 示例
 
 以下示例来自 `cookbook/sft/lora_npu.py`,在实际 NPU 环境中验证通过:
 
@@ -235,11 +235,11 @@ pip install torch_npu-2.7.1-cp311-cp311-linux_aarch64.whl
 | FSDP 并行 | ✅ | ✅ | cookbook/sft/lora_npu.py | 已验证可用 |
 | Ray 分布式 | ✅ | ✅ | cookbook/sft/lora_npu.py | 已验证可用 |
 | TorchSampler | ✅ | ✅ | cookbook/grpo/lora_npu.py | 已验证可用 |
+| VLLMSampler | ✅ | ✅ | cookbook/grpo/lora_npu.py | 已验证可用 |
 | 全量微调 | ✅ | 🚧 | - | 理论支持,待验证 |
 | QLoRA | ✅ | ❌ | - | 量化算子暂不支持 |
 | DPO | ✅ | 🚧 | - | 理论支持,待验证 |
 | Megatron TP/PP | ✅ | 🚧 | - | 待适配和验证 |
-| VLLMSampler | ✅ | ✅ | cookbook/grpo/lora_npu.py | 已验证可用 |
 | Flash Attention | ✅ | ⚠️ | - | 部分算子不支持 |

diff --git a/docs/source_en/GetStarted/NPU-setup.md b/docs/source_en/GetStarted/NPU-setup.md
index acd3c9a7..61d3503f 100644
--- a/docs/source_en/GetStarted/NPU-setup.md
+++ b/docs/source_en/GetStarted/NPU-setup.md
@@ -177,7 +177,7 @@ Currently **verified** parallelism strategies on Twinkle NPU:
 - 🚧 To be verified: Theoretically supported but no NPU validation
 - ❌ Not supported: Currently unavailable
 
-### DP + FSDP Example (Verified)
+### DP + FSDP Example
 
 The following example is from `cookbook/sft/lora_npu.py`, verified on actual NPU environment:
 
@@ -235,11 +235,11 @@ Feature support matrix based on actual code verification:
 | FSDP Parallel | ✅ | ✅ | cookbook/sft/lora_npu.py | Verified and working |
 | Ray Distributed | ✅ | ✅ | cookbook/sft/lora_npu.py | Verified and working |
 | TorchSampler | ✅ | ✅ | cookbook/grpo/lora_npu.py | Verified and working |
+| VLLMSampler | ✅ | ✅ | cookbook/grpo/lora_npu.py | Verified and working |
 | Full Fine-tuning | ✅ | 🚧 | - | Theoretically supported, to be verified |
 | QLoRA | ✅ | ❌ | - | Quantization operators not supported |
 | DPO | ✅ | 🚧 | - | Theoretically supported, to be verified |
 | Megatron TP/PP | ✅ | 🚧 | - | Under adaptation and verification |
-| VLLMSampler | ✅ | ✅ | cookbook/grpo/lora_npu.py | Verified and working |
 | Flash Attention | ✅ | ⚠️ | - | Some operators unsupported |

From 76d9f10c71a6b5253b6f047204d70aa77fca8fd5 Mon Sep 17 00:00:00 2001
From: addsubmuldiv
Date: Thu, 5 Feb 2026 07:39:41 +0000
Subject: [PATCH 6/6] update

---
 docs/source/GetStarted/NPU-setup.md    | 2 +-
 docs/source_en/GetStarted/NPU-setup.md | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/GetStarted/NPU-setup.md b/docs/source/GetStarted/NPU-setup.md
index e3cac951..000d38f8 100644
--- a/docs/source/GetStarted/NPU-setup.md
+++ b/docs/source/GetStarted/NPU-setup.md
@@ -60,7 +60,7 @@ pip install -e ".[transformers,ray]"
 
 如果需要使用 VLLMSampler 进行高效推理,可以安装 vLLM 和 vLLM-Ascend。
 
-**安装步骤**(参考 [Swift 文档](https://swift.readthedocs.io/zh-cn/latest/BestPractices/NPU-support.html)):
+**安装步骤**:
 
 ```bash
 # 第一步:安装 vLLM
 pip install vllm==0.11.0

diff --git a/docs/source_en/GetStarted/NPU-setup.md b/docs/source_en/GetStarted/NPU-setup.md
index 61d3503f..78a70c83 100644
--- a/docs/source_en/GetStarted/NPU-setup.md
+++ b/docs/source_en/GetStarted/NPU-setup.md
@@ -60,7 +60,7 @@ pip install -e ".[transformers,ray]"
 
 If you need to use VLLMSampler for efficient inference, you can install vLLM and vLLM-Ascend.
 
-**Installation Steps** (Reference: [Swift Documentation](https://swift.readthedocs.io/zh-cn/latest/BestPractices/NPU-support.html)):
+**Installation Steps**:
 
 ```bash
 # Step 1: Install vLLM