diff --git a/docs/source/GetStarted/NPU-setup.md b/docs/source/GetStarted/NPU-setup.md new file mode 100644 index 00000000..000d38f8 --- /dev/null +++ b/docs/source/GetStarted/NPU-setup.md @@ -0,0 +1,310 @@ +# NPU(昇腾)开箱指南 + +本文档介绍如何在华为昇腾 NPU 环境下安装和使用 Twinkle 框架。 + +## 环境要求 + +在开始之前,请确保您的系统满足以下要求: + +| 组件 | 版本要求 | 说明 | +|------|---------|------| +| Python | >= 3.11, < 3.13 | Twinkle 框架要求 | +| 昇腾固件驱动(HDK) | 推荐最新版本 | 硬件驱动和固件 | +| CANN 工具包 | 8.3.RC1 或更高 | 异构计算架构 | +| PyTorch | 2.7.1 | 深度学习框架 | +| torch_npu | 2.7.1 | 昇腾 PyTorch 适配插件 | + +**重要说明**: +- torch 和 torch_npu 版本**必须完全一致**(例如都为 2.7.1) +- 推荐使用 Python 3.11 以获得最佳兼容性 +- CANN 工具包需要约 10GB+ 磁盘空间 + +## 支持的硬件 + +Twinkle 当前支持以下昇腾 NPU 设备: + +- 昇腾 910 系列 +- 其他兼容的昇腾加速卡 + +## 安装步骤 + +### 1. 安装 NPU 环境(驱动、CANN、torch_npu) + +NPU 环境的安装包括昇腾驱动、CANN 工具包、PyTorch 和 torch_npu。 + +**📖 完整安装教程**:[torch_npu 官方安装指南](https://gitcode.com/Ascend/pytorch/overview) + +该文档包含: +- 昇腾驱动(HDK)安装步骤 +- CANN 工具包安装步骤 +- PyTorch 和 torch_npu 安装步骤 +- 版本配套说明 + +**推荐版本配置**: +- Python: 3.11 +- PyTorch: 2.7.1 +- torch_npu: 2.7.1 +- CANN: 8.3.RC1 或更高 + +### 2. 安装 Twinkle + +NPU 环境配置完成后,从源码安装 Twinkle 框架: + +```bash +git clone https://github.com/modelscope/twinkle.git +cd twinkle +pip install -e ".[transformers,ray]" +``` + +### 3. 安装 vLLM 和 vLLM-Ascend(可选) + +如果需要使用 VLLMSampler 进行高效推理,可以安装 vLLM 和 vLLM-Ascend。 + +**安装步骤**: + +```bash +# 第一步:安装 vLLM +pip install vllm==0.11.0 + +# 第二步:安装 vLLM-Ascend +pip install vllm-ascend==0.11.0rc3 +``` + +**注意事项**: +- 按照上述顺序安装,忽略可能的依赖冲突提示 +- 安装前确保已激活 CANN 环境:`source /usr/local/Ascend/ascend-toolkit/set_env.sh` +- 推荐使用的版本为 vLLM 0.11.0 和 vLLM-Ascend 0.11.0rc3 + +### 4. 验证安装 + +创建测试脚本 `verify_npu.py`: + +```python +import torch +import torch_npu + +print(f"PyTorch version: {torch.__version__}") +print(f"torch_npu version: {torch_npu.__version__}") +print(f"NPU available: {torch.npu.is_available()}") +print(f"NPU device count: {torch.npu.device_count()}") + +if torch.npu.is_available(): + print(f"Current NPU device: {torch.npu.current_device()}") + print(f"NPU device name: {torch.npu.get_device_name(0)}") + + # 简单测试 + x = torch.randn(3, 3).npu() + y = torch.randn(3, 3).npu() + z = x + y + print(f"NPU computation test passed: {z.shape}") +``` + +运行验证: + +```bash +python verify_npu.py +``` + +如果输出显示 `NPU available: True` 且没有报错,说明安装成功! + +**注意**:目前 Twinkle 暂未提供 NPU 的 Docker 镜像,建议使用手动安装方式。如需容器化部署,请参考昇腾社区的官方镜像。 + +## 快速开始 + +**重要提示**:以下示例均来自 `cookbook/` 目录,已在实际 NPU 环境中验证通过。建议直接运行 cookbook 中的脚本,而不是复制粘贴代码片段。 + +### SFT LoRA 微调 + +已验证的 4 卡 DP+FSDP 训练示例: + +**示例路径**:[cookbook/sft/lora_npu.py](https://github.com/modelscope/twinkle/blob/main/cookbook/sft/lora_npu.py) + +**运行方式**: +```bash +# 指定使用 4 张 NPU 卡 +export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 + +# 运行训练 +python cookbook/sft/lora_npu.py +``` + +**示例特性**: +- ✅ Ray 分布式模式 +- ✅ DP + FSDP 混合并行(2x2) +- ✅ LoRA 微调 +- ✅ 完整的数据加载和训练循环 + +### GRPO 强化学习训练 + +已验证的多卡 GRPO 训练示例: + +**示例路径**:[cookbook/grpo/lora_npu.py](https://github.com/modelscope/twinkle/blob/main/cookbook/grpo/lora_npu.py) + +**运行方式**: +```bash +# 指定使用 8 张 NPU 卡 +export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 + +# 运行训练 +python cookbook/grpo/lora_npu.py +``` + +**示例特性**: +- ✅ Actor-Critic 架构 +- ✅ 支持 Reference Model +- ✅ 可选 TorchSampler 或 VLLMSampler +- ✅ 完整的 RL 训练流程 + +### 更多示例 + +查看 `cookbook/remote/tinker/ascend/` 目录了解远程训练服务端配置。 + +## 并行策略 + +Twinkle 在 NPU 上目前支持以下**经过验证**的并行策略: + +| 并行类型 | 说明 | NPU 支持 | 验证状态 | +|---------|------|---------|---------| +| DP (Data Parallel) | 数据并行 | ✅ | 已验证(见 cookbook/sft/lora_npu.py) | +| FSDP (Fully Sharded Data Parallel) | 完全分片数据并行 | ✅ | 已验证(见 cookbook/sft/lora_npu.py) | +| TP (Tensor Parallel) | 张量并行(Megatron) | 🚧 | 待验证 | +| PP (Pipeline Parallel) | 流水线并行(Megatron) | 🚧 | 待验证 | +| CP (Context Parallel) | 上下文并行 | 🚧 | 待验证 | +| EP (Expert Parallel) | 专家并行(MoE) | 🚧 | 待验证 | + +**图例说明**: +- ✅ 已验证:有实际运行示例代码 +- 🚧 待验证:理论上支持但暂无 NPU 验证示例 +- ❌ 不支持:当前版本不可用 + +### DP + FSDP 示例 + +以下示例来自 `cookbook/sft/lora_npu.py`,在实际 NPU 环境中验证通过: + +```python +import numpy as np +from twinkle import DeviceMesh + +# 4 卡:DP=2, FSDP=2 +device_mesh = DeviceMesh( + device_type='npu', + mesh=np.array([[0, 1], [2, 3]]), + mesh_dim_names=('dp', 'fsdp') +) +``` + +**注意**:Megatron 后端(TP/PP/EP)在 NPU 上的支持正在开发中,暂无可用示例。如需使用这些高级并行策略,请先在 GPU 环境下验证,或关注项目更新。 + +## 常见问题 + +### 1. torch_npu 版本不匹配 + +**问题**:安装 torch_npu 后出现版本不兼容警告或错误。 + +**解决方案**: +- 确保 torch 和 torch_npu 版本完全一致 +- 检查 CANN 版本是否与 torch_npu 兼容 + +```bash +# 查看当前版本 +python -c "import torch; import torch_npu; print(torch.__version__, torch_npu.__version__)" + +# 重新安装匹配版本 +pip uninstall torch torch_npu -y +pip install torch==2.7.1 +pip install torch_npu-2.7.1-cp311-cp311-linux_aarch64.whl +``` + +### 2. CANN 工具包版本问题 + +**问题**:CANN 版本与 torch_npu 不兼容。 + +**解决方案**: +- 参考[昇腾社区版本配套表](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/80RC1alpha002/softwareinstall/instg/atlasdeploy_03_0015.html) +- 安装对应版本的 CANN 工具包 + +## 功能支持情况 + +基于实际代码验证的功能支持矩阵: + +| 功能 | GPU | NPU | 验证示例 | 说明 | +|------|-----|-----|---------|------| +| SFT + LoRA | ✅ | ✅ | cookbook/sft/lora_npu.py | 已验证可用 | +| GRPO | ✅ | ✅ | cookbook/grpo/lora_npu.py | 已验证可用 | +| DP 并行 | ✅ | ✅ | cookbook/sft/lora_npu.py | 已验证可用 | +| FSDP 并行 | ✅ | ✅ | cookbook/sft/lora_npu.py | 已验证可用 | +| Ray 分布式 | ✅ | ✅ | cookbook/sft/lora_npu.py | 已验证可用 | +| TorchSampler | ✅ | ✅ | cookbook/grpo/lora_npu.py | 已验证可用 | +| VLLMSampler | ✅ | ✅ | cookbook/grpo/lora_npu.py | 已验证可用 | +| 全量微调 | ✅ | 🚧 | - | 理论支持,待验证 | +| QLoRA | ✅ | ❌ | - | 量化算子暂不支持 | +| DPO | ✅ | 🚧 | - | 理论支持,待验证 | +| Megatron TP/PP | ✅ | 🚧 | - | 待适配和验证 | +| Flash Attention | ✅ | ⚠️ | - | 部分算子不支持 | + +**图例说明**: +- ✅ **已验证**:有实际运行示例,确认可用 +- 🚧 **待验证**:理论上支持但暂无 NPU 环境验证 +- ⚠️ **部分支持**:可用但有限制或性能差异 +- ❌ **不支持**:当前版本不可用 + +**使用建议**: +1. 优先使用标记为"已验证"的功能,稳定性有保障 +2. "待验证"功能可以尝试,但可能遇到兼容性问题 +3. 遇到问题时,参考对应的示例代码进行配置 + +## 示例代码 + +Twinkle 提供了以下经过验证的 NPU 训练示例: + +### SFT 训练 +- **4 卡 DP+FSDP LoRA 微调**:[cookbook/sft/lora_npu.py](https://github.com/modelscope/twinkle/blob/main/cookbook/sft/lora_npu.py) + - 使用 Ray 模式进行分布式训练 + - 演示 DP + FSDP 混合并行 + - 包含完整的数据加载和训练循环 + +### GRPO 训练 +- **多卡 GRPO RL 训练**:[cookbook/grpo/lora_npu.py](https://github.com/modelscope/twinkle/blob/main/cookbook/grpo/lora_npu.py) + - Actor-Critic 架构 + - 支持参考模型(Reference Model) + - 可选 TorchSampler 或 VLLMSampler + +### 远程训练(Tinker 协议) +- **服务端配置**:[cookbook/remote/tinker/ascend/](https://github.com/modelscope/twinkle/tree/main/cookbook/remote/tinker/ascend) + - 提供 HTTP API 接口 + - 支持远程训练和推理 + - 适用于生产环境部署 + +**运行示例**: +```bash +# SFT 训练 +export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 +python cookbook/sft/lora_npu.py + +# GRPO 训练 +export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 +python cookbook/grpo/lora_npu.py +``` + +## 参考资源 + +- [昇腾社区官网](https://www.hiascend.com/) +- [CANN 软件安装指南](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/80RC1alpha002/softwareinstall/instg/atlasdeploy_03_0001.html) +- [torch_npu GitHub](https://github.com/Ascend/pytorch) +- [Twinkle GitHub](https://github.com/modelscope/twinkle) +- [Twinkle 文档](https://twinkle.readthedocs.io/) + +## 获取帮助 + +如果您在使用过程中遇到问题: + +1. **查看日志**:设置环境变量 `ASCEND_GLOBAL_LOG_LEVEL=1` 获取详细日志 +2. **提交 Issue**:[Twinkle GitHub Issues](https://github.com/modelscope/twinkle/issues) +3. **社区讨论**:[昇腾社区论坛](https://www.hiascend.com/forum) + +## 下一步 + +- 📖 阅读 [快速开始](Quick-start.md) 了解更多训练示例 +- 📖 阅读 [安装指南](Installation.md) 了解其他平台的安装 +- 🚀 浏览 `cookbook/` 目录查看完整示例代码 +- 💡 查看 [Twinkle 文档](https://twinkle.readthedocs.io/) 了解高级功能 diff --git a/docs/source_en/GetStarted/NPU-setup.md b/docs/source_en/GetStarted/NPU-setup.md new file mode 100644 index 00000000..78a70c83 --- /dev/null +++ b/docs/source_en/GetStarted/NPU-setup.md @@ -0,0 +1,310 @@ +# NPU (Ascend) Setup Guide + +This guide explains how to install and use the Twinkle framework on Huawei Ascend NPU environments. + +## Requirements + +Before starting, ensure your system meets the following requirements: + +| Component | Version Requirement | Notes | +|-----------|-------------------|-------| +| Python | >= 3.11, < 3.13 | Required by Twinkle framework | +| Ascend Firmware Driver (HDK) | Latest recommended | Hardware driver and firmware | +| CANN Toolkit | 8.3.RC1 or higher | Heterogeneous Computing Architecture | +| PyTorch | 2.7.1 | Deep learning framework | +| torch_npu | 2.7.1 | Ascend PyTorch adapter | + +**Important Notes**: +- PyTorch and torch_npu versions **must match exactly** (e.g., both 2.7.1) +- Python 3.11 is recommended for best compatibility +- CANN toolkit requires approximately 10GB+ disk space + +## Supported Hardware + +Twinkle currently supports the following Ascend NPU devices: + +- Ascend 910 series +- Other compatible Ascend accelerators + +## Installation + +### 1. Install NPU Environment (Driver, CANN, torch_npu) + +NPU environment installation includes Ascend driver, CANN toolkit, PyTorch, and torch_npu. + +**📖 Complete Installation Guide**: [torch_npu Official Installation Guide](https://gitcode.com/Ascend/pytorch/overview) + +The guide covers: +- Ascend driver (HDK) installation steps +- CANN toolkit installation steps +- PyTorch and torch_npu installation steps +- Version compatibility instructions + +**Recommended Version Configuration**: +- Python: 3.11 +- PyTorch: 2.7.1 +- torch_npu: 2.7.1 +- CANN: 8.3.RC1 or higher + +### 2. Install Twinkle + +After NPU environment is configured, install Twinkle framework from source: + +```bash +git clone https://github.com/modelscope/twinkle.git +cd twinkle +pip install -e ".[transformers,ray]" +``` + +### 3. Install vLLM and vLLM-Ascend (Optional) + +If you need to use VLLMSampler for efficient inference, you can install vLLM and vLLM-Ascend. + +**Installation Steps**: + +```bash +# Step 1: Install vLLM +pip install vllm==0.11.0 + +# Step 2: Install vLLM-Ascend +pip install vllm-ascend==0.11.0rc3 +``` + +**Important Notes**: +- Follow the installation order above and ignore potential dependency conflict warnings +- Ensure CANN environment is activated before installation: `source /usr/local/Ascend/ascend-toolkit/set_env.sh` +- Recommended versions are vLLM 0.11.0 and vLLM-Ascend 0.11.0rc3 + +### 4. Verify Installation + +Create test script `verify_npu.py`: + +```python +import torch +import torch_npu + +print(f"PyTorch version: {torch.__version__}") +print(f"torch_npu version: {torch_npu.__version__}") +print(f"NPU available: {torch.npu.is_available()}") +print(f"NPU device count: {torch.npu.device_count()}") + +if torch.npu.is_available(): + print(f"Current NPU device: {torch.npu.current_device()}") + print(f"NPU device name: {torch.npu.get_device_name(0)}") + + # Simple test + x = torch.randn(3, 3).npu() + y = torch.randn(3, 3).npu() + z = x + y + print(f"NPU computation test passed: {z.shape}") +``` + +Run verification: + +```bash +python verify_npu.py +``` + +If output shows `NPU available: True` without errors, installation is successful! + +**Note**: Twinkle does not currently provide NPU Docker images. Manual installation is recommended. For containerized deployment, please refer to official Ascend Community images. + +## Quick Start + +**Important**: All examples below are from the `cookbook/` directory and have been verified on actual NPU environments. We recommend running scripts directly from cookbook rather than copying code snippets. + +### SFT LoRA Fine-tuning + +Verified 4-card DP+FSDP training example: + +**Example Path**: [cookbook/sft/lora_npu.py](https://github.com/modelscope/twinkle/blob/main/cookbook/sft/lora_npu.py) + +**How to Run**: +```bash +# Specify 4 NPU cards +export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 + +# Run training +python cookbook/sft/lora_npu.py +``` + +**Example Features**: +- ✅ Ray distributed mode +- ✅ DP + FSDP hybrid parallelism (2x2) +- ✅ LoRA fine-tuning +- ✅ Complete data loading and training loop + +### GRPO Reinforcement Learning Training + +Verified multi-card GRPO training example: + +**Example Path**: [cookbook/grpo/lora_npu.py](https://github.com/modelscope/twinkle/blob/main/cookbook/grpo/lora_npu.py) + +**How to Run**: +```bash +# Specify 8 NPU cards +export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 + +# Run training +python cookbook/grpo/lora_npu.py +``` + +**Example Features**: +- ✅ Actor-Critic architecture +- ✅ Supports Reference Model +- ✅ Optional TorchSampler or VLLMSampler +- ✅ Complete RL training workflow + +### More Examples + +Check the `cookbook/remote/tinker/ascend/` directory for remote training server configurations. + +## Parallelism Strategies + +Currently **verified** parallelism strategies on Twinkle NPU: + +| Parallel Type | Description | NPU Support | Verification Status | +|--------------|-------------|-------------|-------------------| +| DP (Data Parallel) | Data parallelism | ✅ | Verified (see cookbook/sft/lora_npu.py) | +| FSDP (Fully Sharded Data Parallel) | Fully sharded data parallelism | ✅ | Verified (see cookbook/sft/lora_npu.py) | +| TP (Tensor Parallel) | Tensor parallelism (Megatron) | 🚧 | To be verified | +| PP (Pipeline Parallel) | Pipeline parallelism (Megatron) | 🚧 | To be verified | +| CP (Context Parallel) | Context parallelism | 🚧 | To be verified | +| EP (Expert Parallel) | Expert parallelism (MoE) | 🚧 | To be verified | + +**Legend**: +- ✅ Verified: Has working example code +- 🚧 To be verified: Theoretically supported but no NPU validation +- ❌ Not supported: Currently unavailable + +### DP + FSDP Example + +The following example is from `cookbook/sft/lora_npu.py`, verified on actual NPU environment: + +```python +import numpy as np +from twinkle import DeviceMesh + +# 4 cards: DP=2, FSDP=2 +device_mesh = DeviceMesh( + device_type='npu', + mesh=np.array([[0, 1], [2, 3]]), + mesh_dim_names=('dp', 'fsdp') +) +``` + +**Note**: Megatron backend (TP/PP/EP) support on NPU is under development with no available examples yet. If you need these advanced parallelism strategies, please validate on GPU environment first or follow project updates. + +## Common Issues + +### 1. torch_npu Version Mismatch + +**Problem**: Version incompatibility warnings or errors after installing torch_npu. + +**Solution**: +- Ensure torch and torch_npu versions match exactly +- Check CANN version compatibility with torch_npu + +```bash +# Check current versions +python -c "import torch; import torch_npu; print(torch.__version__, torch_npu.__version__)" + +# Reinstall matching versions +pip uninstall torch torch_npu -y +pip install torch==2.7.1 +pip install torch_npu-2.7.1-cp311-cp311-linux_aarch64.whl +``` + +### 2. CANN Toolkit Version Issues + +**Problem**: CANN version incompatible with torch_npu. + +**Solution**: +- Refer to [Ascend Community version compatibility table](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/80RC1alpha002/softwareinstall/instg/atlasdeploy_03_0015.html) +- Install matching CANN toolkit version + +## Feature Support Matrix + +Feature support matrix based on actual code verification: + +| Feature | GPU | NPU | Verification Example | Notes | +|---------|-----|-----|---------------------|-------| +| SFT + LoRA | ✅ | ✅ | cookbook/sft/lora_npu.py | Verified and working | +| GRPO | ✅ | ✅ | cookbook/grpo/lora_npu.py | Verified and working | +| DP Parallel | ✅ | ✅ | cookbook/sft/lora_npu.py | Verified and working | +| FSDP Parallel | ✅ | ✅ | cookbook/sft/lora_npu.py | Verified and working | +| Ray Distributed | ✅ | ✅ | cookbook/sft/lora_npu.py | Verified and working | +| TorchSampler | ✅ | ✅ | cookbook/grpo/lora_npu.py | Verified and working | +| VLLMSampler | ✅ | ✅ | cookbook/grpo/lora_npu.py | Verified and working | +| Full Fine-tuning | ✅ | 🚧 | - | Theoretically supported, to be verified | +| QLoRA | ✅ | ❌ | - | Quantization operators not supported | +| DPO | ✅ | 🚧 | - | Theoretically supported, to be verified | +| Megatron TP/PP | ✅ | 🚧 | - | Under adaptation and verification | +| Flash Attention | ✅ | ⚠️ | - | Some operators unsupported | + +**Legend**: +- ✅ **Verified**: Has working examples, confirmed available +- 🚧 **To be verified**: Theoretically supported but no NPU validation +- ⚠️ **Partial support**: Available but with limitations or performance differences +- ❌ **Not supported**: Currently unavailable + +**Usage Recommendations**: +1. Prioritize "Verified" features for stable production use +2. "To be verified" features can be tried but may have compatibility issues +3. Refer to corresponding example code when encountering problems + +## Example Code + +Twinkle provides the following verified NPU training examples: + +### SFT Training +- **4-card DP+FSDP LoRA fine-tuning**: [cookbook/sft/lora_npu.py](https://github.com/modelscope/twinkle/blob/main/cookbook/sft/lora_npu.py) + - Uses Ray mode for distributed training + - Demonstrates DP + FSDP hybrid parallelism + - Includes complete data loading and training loop + +### GRPO Training +- **Multi-card GRPO RL training**: [cookbook/grpo/lora_npu.py](https://github.com/modelscope/twinkle/blob/main/cookbook/grpo/lora_npu.py) + - Actor-Critic architecture + - Supports Reference Model + - Optional TorchSampler or VLLMSampler + +### Remote Training (Tinker Protocol) +- **Server Configuration**: [cookbook/remote/tinker/ascend/](https://github.com/modelscope/twinkle/tree/main/cookbook/remote/tinker/ascend) + - Provides HTTP API interface + - Supports remote training and inference + - Suitable for production deployment + +**Running Examples**: +```bash +# SFT Training +export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 +python cookbook/sft/lora_npu.py + +# GRPO Training +export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 +python cookbook/grpo/lora_npu.py +``` + +## References + +- [Ascend Community Official Website](https://www.hiascend.com/) +- [CANN Software Installation Guide](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/80RC1alpha002/softwareinstall/instg/atlasdeploy_03_0001.html) +- [torch_npu GitHub](https://github.com/Ascend/pytorch) +- [Twinkle GitHub](https://github.com/modelscope/twinkle) +- [Twinkle Documentation](https://twinkle.readthedocs.io/) + +## Getting Help + +If you encounter problems during usage: + +1. **Check logs**: Set environment variable `ASCEND_GLOBAL_LOG_LEVEL=1` for detailed logs +2. **Submit Issue**: [Twinkle GitHub Issues](https://github.com/modelscope/twinkle/issues) +3. **Community Discussion**: [Ascend Community Forum](https://www.hiascend.com/forum) + +## Next Steps + +- 📖 Read [Quick Start](Quick-start.md) for more training examples +- 📖 Read [Installation Guide](Installation.md) for other platform installations +- 🚀 Browse `cookbook/` directory for complete example code +- 💡 Check [Twinkle Documentation](https://twinkle.readthedocs.io/) for advanced features