From 88cacd0188d7817ee8183b1c41d276683e59ce50 Mon Sep 17 00:00:00 2001
From: addsubmuldiv
Date: Thu, 5 Feb 2026 07:07:59 +0000
Subject: [PATCH 1/6] docs for npu support

---
 docs/source/GetStarted/NPU-setup.md    | 344 +++++++++++++++++++++++++
 docs/source_en/GetStarted/NPU-setup.md | 344 +++++++++++++++++++++++++
 2 files changed, 688 insertions(+)
 create mode 100644 docs/source/GetStarted/NPU-setup.md
 create mode 100644 docs/source_en/GetStarted/NPU-setup.md

diff --git a/docs/source/GetStarted/NPU-setup.md b/docs/source/GetStarted/NPU-setup.md
new file mode 100644
index 00000000..a7923fd6
--- /dev/null
+++ b/docs/source/GetStarted/NPU-setup.md
@@ -0,0 +1,344 @@
# NPU(昇腾)开箱指南

本文档介绍如何在华为昇腾 NPU 环境下安装和使用 Twinkle 框架。

## 环境要求

在开始之前,请确保您的系统满足以下要求:

| 组件 | 版本要求 | 说明 |
|------|---------|------|
| Python | >= 3.11, < 3.13 | Twinkle 框架要求 |
| 昇腾固件驱动(HDK) | 推荐最新版本 | 硬件驱动和固件 |
| CANN 工具包 | 8.3.RC1 或更高 | 异构计算架构 |
| PyTorch | 2.7.1 | 深度学习框架 |
| torch_npu | 2.7.1 | 昇腾 PyTorch 适配插件 |

**重要说明**:
- torch 和 torch_npu 版本**必须完全一致**(例如都为 2.7.1)
- 推荐使用 Python 3.11 以获得最佳兼容性
- CANN 工具包需要约 10GB+ 磁盘空间

## 支持的硬件

Twinkle 当前支持以下昇腾 NPU 设备:

- 昇腾 910 系列
- 其他兼容的昇腾加速卡

## 安装步骤

### 1. 安装 NPU 环境(驱动、CANN、torch_npu)

NPU 环境的安装包括昇腾驱动、CANN 工具包、PyTorch 和 torch_npu。

**📖 完整安装教程**:[torch_npu 官方安装指南](https://gitcode.com/Ascend/pytorch/overview)

该文档包含:
- 昇腾驱动(HDK)安装步骤
- CANN 工具包安装步骤
- PyTorch 和 torch_npu 安装步骤
- 版本配套说明

**推荐版本配置**:
- Python: 3.11
- PyTorch: 2.7.1
- torch_npu: 2.7.1
- CANN: 8.3.RC1 或更高

### 2. 安装 Twinkle

NPU 环境配置完成后,从源码安装 Twinkle 框架:

```bash
git clone https://github.com/modelscope/twinkle.git
cd twinkle
pip install -e ".[transformers,ray]"
```

### 3. 验证安装

创建测试脚本 `verify_npu.py`:

```python
import torch
import torch_npu

print(f"PyTorch version: {torch.__version__}")
print(f"torch_npu version: {torch_npu.__version__}")
print(f"NPU available: {torch.npu.is_available()}")
print(f"NPU device count: {torch.npu.device_count()}")

if torch.npu.is_available():
    print(f"Current NPU device: {torch.npu.current_device()}")
    print(f"NPU device name: {torch.npu.get_device_name(0)}")

    # 简单测试
    x = torch.randn(3, 3).npu()
    y = torch.randn(3, 3).npu()
    z = x + y
    print(f"NPU computation test passed: {z.shape}")
```

运行验证:

```bash
python verify_npu.py
```

如果输出显示 `NPU available: True` 且没有报错,说明安装成功!

**注意**:目前 Twinkle 暂未提供 NPU 的 Docker 镜像,建议使用手动安装方式。如需容器化部署,请参考昇腾社区的官方镜像。

## 快速开始

**重要提示**:以下示例均来自 `cookbook/` 目录,已在实际 NPU 环境中验证通过。建议直接运行 cookbook 中的脚本,而不是复制粘贴代码片段。

### SFT LoRA 微调

已验证的 4 卡 DP+FSDP 训练示例:

**示例路径**:[cookbook/sft/lora_npu.py](https://github.com/modelscope/twinkle/blob/main/cookbook/sft/lora_npu.py)

**运行方式**:
```bash
# 指定使用 4 张 NPU 卡
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3

# 运行训练
python cookbook/sft/lora_npu.py
```

**示例特性**:
- ✅ Ray 分布式模式
- ✅ DP + FSDP 混合并行(2x2)
- ✅ LoRA 微调
- ✅ 完整的数据加载和训练循环

### GRPO 强化学习训练

已验证的多卡 GRPO 训练示例:

**示例路径**:[cookbook/grpo/lora_npu.py](https://github.com/modelscope/twinkle/blob/main/cookbook/grpo/lora_npu.py)

**运行方式**:
```bash
# 指定使用 8 张 NPU 卡
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

# 运行训练
python cookbook/grpo/lora_npu.py
```

**示例特性**:
- ✅ Actor-Critic 架构
- ✅ 支持 Reference Model
- ✅ 可选 TorchSampler 或 VLLMSampler
- ✅ 完整的 RL 训练流程

### 更多示例

查看 `cookbook/remote/tinker/ascend/` 目录了解远程训练服务端配置。

## NPU 特定配置

### 环境变量

Twinkle 在 NPU 环境下会自动识别和使用以下环境变量:

| 环境变量 | 说明 | 示例 |
|---------|------|------|
| `ASCEND_RT_VISIBLE_DEVICES` | 指定可见的 NPU 设备 | `0,1,2,3` |
| `TRUST_REMOTE_CODE` | 允许加载远程代码 | `1` |
| `TWINKLE_SEED` | 随机种子 | `42` |
| `TWINKLE_FULL_DETERMINISM` | 完全确定性训练 | `1` |

### 设备网格配置

在 NPU 环境下,需要指定 `device_type='npu'`。以下是**已验证**的配置示例:

```python
from twinkle import DeviceMesh

# 单卡
device_mesh = DeviceMesh.from_sizes(dp_size=1, device_type='npu')

# 2 卡 DP
device_mesh = DeviceMesh.from_sizes(dp_size=2, device_type='npu')

# 4 卡 DP + FSDP (2x2) - 已验证
device_mesh = DeviceMesh.from_sizes(dp_size=2, fsdp_size=2, device_type='npu')
```

**注意**:TP/PP/EP 等高级并行策略暂无 NPU 验证,配置方式请参考 GPU 文档。

### 设备组配置

使用 Ray 模式时,需要在 DeviceGroup 中指定设备类型:

```python
from twinkle.infra import DeviceGroup

device_groups = [
    DeviceGroup(
        name='actor',
        ranks=[0, 1, 2, 3, 4, 5],  # Actor 使用 6 张卡
        device_type='npu',
    ),
    DeviceGroup(
        name='ref',
        ranks=[6, 7],  # Reference 模型使用 2 张卡
        device_type='npu',
    ),
]
```

## 并行策略

Twinkle 在 NPU 上目前支持以下**经过验证**的并行策略:

| 并行类型 | 说明 | NPU 支持 | 验证状态 |
|---------|------|---------|---------|
| DP (Data Parallel) | 数据并行 | ✅ | 已验证(见 cookbook/sft/lora_npu.py) |
| FSDP (Fully Sharded Data Parallel) | 完全分片数据并行 | ✅ | 已验证(见 cookbook/sft/lora_npu.py) |
| TP (Tensor Parallel) | 张量并行(Megatron) | 🚧 | 待验证 |
| PP (Pipeline Parallel) | 流水线并行(Megatron) | 🚧 | 待验证 |
| CP (Context Parallel) | 上下文并行 | 🚧 | 待验证 |
| EP (Expert Parallel) | 专家并行(MoE) | 🚧 | 待验证 |

**图例说明**:
- ✅ 已验证:有实际运行示例代码
- 🚧 待验证:理论上支持但暂无 NPU 验证示例
- ❌ 不支持:当前版本不可用

### DP + FSDP 示例(已验证)

以下示例来自 `cookbook/sft/lora_npu.py`,在实际 NPU 环境中验证通过:

```python
import numpy as np
from twinkle import DeviceMesh

# 4 卡:DP=2, FSDP=2
device_mesh = DeviceMesh(
    device_type='npu',
    mesh=np.array([[0, 1], [2, 3]]),
    mesh_dim_names=('dp', 'fsdp')
)
```

**注意**:Megatron 后端(TP/PP/EP)在 NPU 上的支持正在开发中,暂无可用示例。如需使用这些高级并行策略,请先在 GPU 环境下验证,或关注项目更新。

## 常见问题

### 1. torch_npu 版本不匹配

**问题**:安装 torch_npu 后出现版本不兼容警告或错误。

**解决方案**:
- 确保 torch 和 torch_npu 版本完全一致
- 检查 CANN 版本是否与 torch_npu 兼容

```bash
# 查看当前版本
python -c "import torch; import torch_npu; print(torch.__version__, torch_npu.__version__)"

# 重新安装匹配版本
pip uninstall torch torch_npu -y
pip install torch==2.7.1
pip install torch_npu-2.7.1-cp311-cp311-linux_aarch64.whl
```

### 2. CANN 工具包版本问题

**问题**:CANN 版本与 torch_npu 不兼容。

**解决方案**:
- 参考[昇腾社区版本配套表](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/80RC1alpha002/softwareinstall/instg/atlasdeploy_03_0015.html)
- 安装对应版本的 CANN 工具包

## 功能支持情况

基于实际代码验证的功能支持矩阵:

| 功能 | GPU | NPU | 验证示例 | 说明 |
|------|-----|-----|---------|------|
| SFT + LoRA | ✅ | ✅ | cookbook/sft/lora_npu.py | 已验证可用 |
| GRPO | ✅ | ✅ | cookbook/grpo/lora_npu.py | 已验证可用 |
| DP 并行 | ✅ | ✅ | cookbook/sft/lora_npu.py | 已验证可用 |
| FSDP 并行 | ✅ | ✅ | cookbook/sft/lora_npu.py | 已验证可用 |
| Ray 分布式 | ✅ | ✅ | cookbook/sft/lora_npu.py | 已验证可用 |
| TorchSampler | ✅ | ✅ | cookbook/grpo/lora_npu.py | 已验证可用 |
| 全量微调 | ✅ | 🚧 | - | 理论支持,待验证 |
| QLoRA | ✅ | ❌ | - | 量化算子暂不支持 |
| DPO | ✅ | 🚧 | - | 理论支持,待验证 |
| Megatron TP/PP | ✅ | 🚧 | - | 待适配和验证 |
| VLLMSampler | ✅ | 🚧 | - | 需 vLLM-Ascend,待验证 |
| Flash Attention | ✅ | ⚠️ | - | 部分算子不支持 |

**图例说明**:
- ✅ **已验证**:有实际运行示例,确认可用
- 🚧 **待验证**:理论上支持但暂无 NPU 环境验证
- ⚠️ **部分支持**:可用但有限制或性能差异
- ❌ **不支持**:当前版本不可用

**使用建议**:
1. 优先使用标记为"已验证"的功能,稳定性有保障
2. "待验证"功能可以尝试,但可能遇到兼容性问题
3. 遇到问题时,参考对应的示例代码进行配置

## 示例代码

Twinkle 提供了以下经过验证的 NPU 训练示例:

### SFT 训练
- **4 卡 DP+FSDP LoRA 微调**:[cookbook/sft/lora_npu.py](https://github.com/modelscope/twinkle/blob/main/cookbook/sft/lora_npu.py)
  - 使用 Ray 模式进行分布式训练
  - 演示 DP + FSDP 混合并行
  - 包含完整的数据加载和训练循环

### GRPO 训练
- **多卡 GRPO RL 训练**:[cookbook/grpo/lora_npu.py](https://github.com/modelscope/twinkle/blob/main/cookbook/grpo/lora_npu.py)
  - Actor-Critic 架构
  - 支持参考模型(Reference Model)
  - 可选 TorchSampler 或 VLLMSampler

### 远程训练(Tinker 协议)
- **服务端配置**:[cookbook/remote/tinker/ascend/](https://github.com/modelscope/twinkle/tree/main/cookbook/remote/tinker/ascend)
  - 提供 HTTP API 接口
  - 支持远程训练和推理
  - 适用于生产环境部署

**运行示例**:
```bash
# SFT 训练
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
python cookbook/sft/lora_npu.py

# GRPO 训练
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python cookbook/grpo/lora_npu.py
```

## 参考资源

- [昇腾社区官网](https://www.hiascend.com/)
- [CANN 软件安装指南](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/80RC1alpha002/softwareinstall/instg/atlasdeploy_03_0001.html)
- [torch_npu GitHub](https://github.com/Ascend/pytorch)
- [Twinkle GitHub](https://github.com/modelscope/twinkle)
- [Twinkle 文档](https://twinkle.readthedocs.io/)

## 获取帮助

如果您在使用过程中遇到问题:

1. **查看日志**:设置环境变量 `ASCEND_GLOBAL_LOG_LEVEL=1` 获取详细日志
2. **提交 Issue**:[Twinkle GitHub Issues](https://github.com/modelscope/twinkle/issues)
3. **社区讨论**:[昇腾社区论坛](https://www.hiascend.com/forum)

## 下一步

- 📖 阅读 [快速开始](Quick-start.md) 了解更多训练示例
- 📖 阅读 [安装指南](Installation.md) 了解其他平台的安装
- 🚀 浏览 `cookbook/` 目录查看完整示例代码
- 💡 查看 [Twinkle 文档](https://twinkle.readthedocs.io/) 了解高级功能

diff --git a/docs/source_en/GetStarted/NPU-setup.md b/docs/source_en/GetStarted/NPU-setup.md
new file mode 100644
index 00000000..2c16f79b
--- /dev/null
+++ b/docs/source_en/GetStarted/NPU-setup.md
@@ -0,0 +1,344 @@
# NPU (Ascend) Setup Guide

This guide explains how to install and use the Twinkle framework on Huawei Ascend NPU environments.

## Requirements

Before starting, ensure your system meets the following requirements:

| Component | Version Requirement | Notes |
|-----------|-------------------|-------|
| Python | >= 3.11, < 3.13 | Required by Twinkle framework |
| Ascend Firmware Driver (HDK) | Latest recommended | Hardware driver and firmware |
| CANN Toolkit | 8.3.RC1 or higher | Heterogeneous Computing Architecture |
| PyTorch | 2.7.1 | Deep learning framework |
| torch_npu | 2.7.1 | Ascend PyTorch adapter |

**Important Notes**:
- PyTorch and torch_npu versions **must match exactly** (e.g., both 2.7.1)
- Python 3.11 is recommended for best compatibility
- CANN toolkit requires approximately 10GB+ disk space

## Supported Hardware

Twinkle currently supports the following Ascend NPU devices:

- Ascend 910 series
- Other compatible Ascend accelerators

## Installation

### 1. Install NPU Environment (Driver, CANN, torch_npu)

NPU environment installation includes the Ascend driver, CANN toolkit, PyTorch, and torch_npu.

**📖 Complete Installation Guide**: [torch_npu Official Installation Guide](https://gitcode.com/Ascend/pytorch/overview)

The guide covers:
- Ascend driver (HDK) installation steps
- CANN toolkit installation steps
- PyTorch and torch_npu installation steps
- Version compatibility instructions

**Recommended Version Configuration**:
- Python: 3.11
- PyTorch: 2.7.1
- torch_npu: 2.7.1
- CANN: 8.3.RC1 or higher

### 2. Install Twinkle

After the NPU environment is configured, install the Twinkle framework from source:

```bash
git clone https://github.com/modelscope/twinkle.git
cd twinkle
pip install -e ".[transformers,ray]"
```

### 3. Verify Installation

Create test script `verify_npu.py`:

```python
import torch
import torch_npu

print(f"PyTorch version: {torch.__version__}")
print(f"torch_npu version: {torch_npu.__version__}")
print(f"NPU available: {torch.npu.is_available()}")
print(f"NPU device count: {torch.npu.device_count()}")

if torch.npu.is_available():
    print(f"Current NPU device: {torch.npu.current_device()}")
    print(f"NPU device name: {torch.npu.get_device_name(0)}")

    # Simple test
    x = torch.randn(3, 3).npu()
    y = torch.randn(3, 3).npu()
    z = x + y
    print(f"NPU computation test passed: {z.shape}")
```

Run verification:

```bash
python verify_npu.py
```

If the output shows `NPU available: True` without errors, the installation is successful!

**Note**: Twinkle does not currently provide NPU Docker images. Manual installation is recommended. For containerized deployment, please refer to official Ascend Community images.

## Quick Start

**Important**: All examples below are from the `cookbook/` directory and have been verified on actual NPU environments. We recommend running the scripts directly from the cookbook rather than copying code snippets.
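All of the cookbook runs above select their cards through `ASCEND_RT_VISIBLE_DEVICES`, so a quick sanity check of that variable before launching can catch a typo before a multi-card job fails mid-startup. The helper below is a minimal sketch and is **not** part of Twinkle — `parse_visible_npus` is a name made up here for illustration; it only mirrors the comma-separated convention shown in the examples:

```python
import os


def parse_visible_npus(raw: str) -> list[int]:
    """Parse an ASCEND_RT_VISIBLE_DEVICES-style string into device indices.

    Raises ValueError on malformed or duplicate entries so a typo
    fails fast, before the training job is launched.
    """
    entries = [part.strip() for part in raw.split(",") if part.strip()]
    ids = [int(part) for part in entries]  # ValueError on e.g. "0,x,2"
    if len(set(ids)) != len(ids):
        raise ValueError(f"duplicate device ids in {raw!r}")
    return ids


if __name__ == "__main__":
    # Default mirrors the 4-card SFT example; override via the environment.
    raw = os.environ.get("ASCEND_RT_VISIBLE_DEVICES", "0,1,2,3")
    print(f"visible NPUs: {parse_visible_npus(raw)}")
```

For example, the 4-card SFT run exports `0,1,2,3`, so `parse_visible_npus("0,1,2,3")` returns `[0, 1, 2, 3]`; an unset variable (all devices visible) is a policy decision left to the caller.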
### SFT LoRA Fine-tuning

Verified 4-card DP+FSDP training example:

**Example Path**: [cookbook/sft/lora_npu.py](https://github.com/modelscope/twinkle/blob/main/cookbook/sft/lora_npu.py)

**How to Run**:
```bash
# Specify 4 NPU cards
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3

# Run training
python cookbook/sft/lora_npu.py
```

**Example Features**:
- ✅ Ray distributed mode
- ✅ DP + FSDP hybrid parallelism (2x2)
- ✅ LoRA fine-tuning
- ✅ Complete data loading and training loop

### GRPO Reinforcement Learning Training

Verified multi-card GRPO training example:

**Example Path**: [cookbook/grpo/lora_npu.py](https://github.com/modelscope/twinkle/blob/main/cookbook/grpo/lora_npu.py)

**How to Run**:
```bash
# Specify 8 NPU cards
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

# Run training
python cookbook/grpo/lora_npu.py
```

**Example Features**:
- ✅ Actor-Critic architecture
- ✅ Supports Reference Model
- ✅ Optional TorchSampler or VLLMSampler
- ✅ Complete RL training workflow

### More Examples

Check the `cookbook/remote/tinker/ascend/` directory for remote training server configurations.

## NPU-Specific Configuration

### Environment Variables

Twinkle automatically recognizes and uses the following environment variables in NPU environments:

| Environment Variable | Description | Example |
|---------------------|-------------|---------|
| `ASCEND_RT_VISIBLE_DEVICES` | Specify visible NPU devices | `0,1,2,3` |
| `TRUST_REMOTE_CODE` | Allow loading remote code | `1` |
| `TWINKLE_SEED` | Random seed | `42` |
| `TWINKLE_FULL_DETERMINISM` | Fully deterministic training | `1` |

### Device Mesh Configuration

In NPU environments, specify `device_type='npu'`. Here are **verified** configuration examples:

```python
from twinkle import DeviceMesh

# Single card
device_mesh = DeviceMesh.from_sizes(dp_size=1, device_type='npu')

# 2-card DP
device_mesh = DeviceMesh.from_sizes(dp_size=2, device_type='npu')

# 4-card DP + FSDP (2x2) - Verified
device_mesh = DeviceMesh.from_sizes(dp_size=2, fsdp_size=2, device_type='npu')
```

**Note**: Advanced parallelism strategies like TP/PP/EP have not been verified on NPU. Please refer to GPU documentation for configuration details.

### Device Group Configuration

When using Ray mode, specify device type in DeviceGroup:

```python
from twinkle.infra import DeviceGroup

device_groups = [
    DeviceGroup(
        name='actor',
        ranks=[0, 1, 2, 3, 4, 5],  # Actor uses 6 cards
        device_type='npu',
    ),
    DeviceGroup(
        name='ref',
        ranks=[6, 7],  # Reference model uses 2 cards
        device_type='npu',
    ),
]
```

## Parallelism Strategies

Currently **verified** parallelism strategies on Twinkle NPU:

| Parallel Type | Description | NPU Support | Verification Status |
|--------------|-------------|-------------|-------------------|
| DP (Data Parallel) | Data parallelism | ✅ | Verified (see cookbook/sft/lora_npu.py) |
| FSDP (Fully Sharded Data Parallel) | Fully sharded data parallelism | ✅ | Verified (see cookbook/sft/lora_npu.py) |
| TP (Tensor Parallel) | Tensor parallelism (Megatron) | 🚧 | To be verified |
| PP (Pipeline Parallel) | Pipeline parallelism (Megatron) | 🚧 | To be verified |
| CP (Context Parallel) | Context parallelism | 🚧 | To be verified |
| EP (Expert Parallel) | Expert parallelism (MoE) | 🚧 | To be verified |

**Legend**:
- ✅ Verified: Has working example code
- 🚧 To be verified: Theoretically supported but no NPU validation
- ❌ Not supported: Currently unavailable

### DP + FSDP Example (Verified)

The following example is from `cookbook/sft/lora_npu.py`, verified on actual NPU environment:

```python
import numpy as np
from twinkle import DeviceMesh

# 4 cards: DP=2, FSDP=2
device_mesh = DeviceMesh(
    device_type='npu',
    mesh=np.array([[0, 1], [2, 3]]),
    mesh_dim_names=('dp', 'fsdp')
)
```

**Note**: Megatron backend (TP/PP/EP) support on NPU is under development with no available examples yet. If you need these advanced parallelism strategies, please validate on GPU environment first or follow project updates.

## Common Issues

### 1. torch_npu Version Mismatch

**Problem**: Version incompatibility warnings or errors after installing torch_npu.

**Solution**:
- Ensure torch and torch_npu versions match exactly
- Check CANN version compatibility with torch_npu

```bash
# Check current versions
python -c "import torch; import torch_npu; print(torch.__version__, torch_npu.__version__)"

# Reinstall matching versions
pip uninstall torch torch_npu -y
pip install torch==2.7.1
pip install torch_npu-2.7.1-cp311-cp311-linux_aarch64.whl
```

### 2. CANN Toolkit Version Issues

**Problem**: CANN version incompatible with torch_npu.

**Solution**:
- Refer to [Ascend Community version compatibility table](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/80RC1alpha002/softwareinstall/instg/atlasdeploy_03_0015.html)
- Install matching CANN toolkit version

## Feature Support Matrix

Feature support matrix based on actual code verification:

| Feature | GPU | NPU | Verification Example | Notes |
|---------|-----|-----|---------------------|-------|
| SFT + LoRA | ✅ | ✅ | cookbook/sft/lora_npu.py | Verified and working |
| GRPO | ✅ | ✅ | cookbook/grpo/lora_npu.py | Verified and working |
| DP Parallel | ✅ | ✅ | cookbook/sft/lora_npu.py | Verified and working |
| FSDP Parallel | ✅ | ✅ | cookbook/sft/lora_npu.py | Verified and working |
| Ray Distributed | ✅ | ✅ | cookbook/sft/lora_npu.py | Verified and working |
| TorchSampler | ✅ | ✅ | cookbook/grpo/lora_npu.py | Verified and working |
| Full Fine-tuning | ✅ | 🚧 | - | Theoretically supported, to be verified |
| QLoRA | ✅ | ❌ | - | Quantization operators not supported |
| DPO | ✅ | 🚧 | - | Theoretically supported, to be verified |
| Megatron TP/PP | ✅ | 🚧 | - | Under adaptation and verification |
| VLLMSampler | ✅ | 🚧 | - | Requires vLLM-Ascend, to be verified |
| Flash Attention | ✅ | ⚠️ | - | Some operators unsupported |

**Legend**:
- ✅ **Verified**: Has working examples, confirmed available
- 🚧 **To be verified**: Theoretically supported but no NPU validation
- ⚠️ **Partial support**: Available but with limitations or performance differences
- ❌ **Not supported**: Currently unavailable

**Usage Recommendations**:
1. Prioritize "Verified" features for stable production use
2. "To be verified" features can be tried but may have compatibility issues
3. Refer to corresponding example code when encountering problems

## Example Code

Twinkle provides the following verified NPU training examples:

### SFT Training
- **4-card DP+FSDP LoRA fine-tuning**: [cookbook/sft/lora_npu.py](https://github.com/modelscope/twinkle/blob/main/cookbook/sft/lora_npu.py)
  - Uses Ray mode for distributed training
  - Demonstrates DP + FSDP hybrid parallelism
  - Includes complete data loading and training loop

### GRPO Training
- **Multi-card GRPO RL training**: [cookbook/grpo/lora_npu.py](https://github.com/modelscope/twinkle/blob/main/cookbook/grpo/lora_npu.py)
  - Actor-Critic architecture
  - Supports Reference Model
  - Optional TorchSampler or VLLMSampler

### Remote Training (Tinker Protocol)
- **Server Configuration**: [cookbook/remote/tinker/ascend/](https://github.com/modelscope/twinkle/tree/main/cookbook/remote/tinker/ascend)
  - Provides HTTP API interface
  - Supports remote training and inference
  - Suitable for production deployment

**Running Examples**:
```bash
# SFT Training
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
python cookbook/sft/lora_npu.py

# GRPO Training
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python cookbook/grpo/lora_npu.py
```

## References

- [Ascend Community Official Website](https://www.hiascend.com/)
- [CANN Software Installation Guide](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/80RC1alpha002/softwareinstall/instg/atlasdeploy_03_0001.html)
- [torch_npu GitHub](https://github.com/Ascend/pytorch)
- [Twinkle GitHub](https://github.com/modelscope/twinkle)
- [Twinkle Documentation](https://twinkle.readthedocs.io/)

## Getting Help

If you encounter problems during usage:

1. **Check logs**: Set environment variable `ASCEND_GLOBAL_LOG_LEVEL=1` for detailed logs
2. **Submit Issue**: [Twinkle GitHub Issues](https://github.com/modelscope/twinkle/issues)
3. **Community Discussion**: [Ascend Community Forum](https://www.hiascend.com/forum)

## Next Steps

- 📖 Read [Quick Start](Quick-start.md) for more training examples
- 📖 Read [Installation Guide](Installation.md) for other platform installations
- 🚀 Browse `cookbook/` directory for complete example code
- 💡 Check [Twinkle Documentation](https://twinkle.readthedocs.io/) for advanced features

From 256ce764895f92fb8b3ccd248fb217a253741663 Mon Sep 17 00:00:00 2001
From: addsubmuldiv
Date: Thu, 5 Feb 2026 07:12:52 +0000
Subject: [PATCH 2/6] docs update

---
 docs/source/GetStarted/NPU-setup.md    | 53 --------------------------
 docs/source_en/GetStarted/NPU-setup.md | 53 --------------------------
 2 files changed, 106 deletions(-)

diff --git a/docs/source/GetStarted/NPU-setup.md b/docs/source/GetStarted/NPU-setup.md
index a7923fd6..de2b3a89 100644
--- a/docs/source/GetStarted/NPU-setup.md
+++ b/docs/source/GetStarted/NPU-setup.md
@@ -140,59 +140,6 @@ python cookbook/grpo/lora_npu.py
 
 查看 `cookbook/remote/tinker/ascend/` 目录了解远程训练服务端配置。
 
-## NPU 特定配置
-
-### 环境变量
-
-Twinkle 在 NPU 环境下会自动识别和使用以下环境变量:
-
-| 环境变量 | 说明 | 示例 |
-|---------|------|------|
-| `ASCEND_RT_VISIBLE_DEVICES` | 指定可见的 NPU 设备 | `0,1,2,3` |
-| `TRUST_REMOTE_CODE` | 允许加载远程代码 | `1` |
-| `TWINKLE_SEED` | 随机种子 | `42` |
-| `TWINKLE_FULL_DETERMINISM` | 完全确定性训练 | `1` |
-
-### 设备网格配置
-
-在 NPU 环境下,需要指定 `device_type='npu'`。以下是**已验证**的配置示例:
-
-```python
-from twinkle import DeviceMesh
-
-# 单卡
-device_mesh = DeviceMesh.from_sizes(dp_size=1, device_type='npu')
-
-# 2 卡 DP
-device_mesh = DeviceMesh.from_sizes(dp_size=2, device_type='npu')
-
-# 4 卡 DP + FSDP (2x2) - 已验证
-device_mesh = DeviceMesh.from_sizes(dp_size=2, fsdp_size=2, device_type='npu')
-```
-
-**注意**:TP/PP/EP 等高级并行策略暂无 NPU 验证,配置方式请参考 GPU 文档。
-
-### 设备组配置
-
-使用 Ray 模式时,需要在 DeviceGroup 中指定设备类型:
-
-```python
-from twinkle.infra import DeviceGroup
-
-device_groups = [
-    DeviceGroup(
-        name='actor',
-        ranks=[0, 1, 2, 3, 4, 5],  # Actor 使用 6 张卡
-        device_type='npu',
-    ),
-    DeviceGroup(
-        name='ref',
-        ranks=[6, 7],  # Reference 模型使用 2 张卡
-        device_type='npu',
-    ),
-]
-```
-
 ## 并行策略
 
 Twinkle 在 NPU 上目前支持以下**经过验证**的并行策略:

diff --git a/docs/source_en/GetStarted/NPU-setup.md b/docs/source_en/GetStarted/NPU-setup.md
index 2c16f79b..03bbee83 100644
--- a/docs/source_en/GetStarted/NPU-setup.md
+++ b/docs/source_en/GetStarted/NPU-setup.md
@@ -140,59 +140,6 @@ python cookbook/grpo/lora_npu.py
 
 Check the `cookbook/remote/tinker/ascend/` directory for remote training server configurations.
 
-## NPU-Specific Configuration
-
-### Environment Variables
-
-Twinkle automatically recognizes and uses the following environment variables in NPU environments:
-
-| Environment Variable | Description | Example |
-|---------------------|-------------|---------|
-| `ASCEND_RT_VISIBLE_DEVICES` | Specify visible NPU devices | `0,1,2,3` |
-| `TRUST_REMOTE_CODE` | Allow loading remote code | `1` |
-| `TWINKLE_SEED` | Random seed | `42` |
-| `TWINKLE_FULL_DETERMINISM` | Fully deterministic training | `1` |
-
-### Device Mesh Configuration
-
-In NPU environments, specify `device_type='npu'`. Here are **verified** configuration examples:
-
-```python
-from twinkle import DeviceMesh
-
-# Single card
-device_mesh = DeviceMesh.from_sizes(dp_size=1, device_type='npu')
-
-# 2-card DP
-device_mesh = DeviceMesh.from_sizes(dp_size=2, device_type='npu')
-
-# 4-card DP + FSDP (2x2) - Verified
-device_mesh = DeviceMesh.from_sizes(dp_size=2, fsdp_size=2, device_type='npu')
-```
-
-**Note**: Advanced parallelism strategies like TP/PP/EP have not been verified on NPU. Please refer to GPU documentation for configuration details.
-
-### Device Group Configuration
-
-When using Ray mode, specify device type in DeviceGroup:
-
-```python
-from twinkle.infra import DeviceGroup
-
-device_groups = [
-    DeviceGroup(
-        name='actor',
-        ranks=[0, 1, 2, 3, 4, 5],  # Actor uses 6 cards
-        device_type='npu',
-    ),
-    DeviceGroup(
-        name='ref',
-        ranks=[6, 7],  # Reference model uses 2 cards
-        device_type='npu',
-    ),
-]
-```
-
 ## Parallelism Strategies
 
 Currently **verified** parallelism strategies on Twinkle NPU:

From ba89de41b55b4809353bd857760bb093e30dfa66 Mon Sep 17 00:00:00 2001
From: addsubmuldiv
Date: Thu, 5 Feb 2026 07:22:02 +0000
Subject: [PATCH 3/6] update

---
 docs/source/GetStarted/NPU-setup.md    | 18 ++++++++++++++++--
 docs/source_en/GetStarted/NPU-setup.md | 18 ++++++++++++++++--
 2 files changed, 32 insertions(+), 4 deletions(-)

diff --git a/docs/source/GetStarted/NPU-setup.md b/docs/source/GetStarted/NPU-setup.md
index de2b3a89..60f4dc18 100644
--- a/docs/source/GetStarted/NPU-setup.md
+++ b/docs/source/GetStarted/NPU-setup.md
@@ -56,7 +56,21 @@ cd twinkle
 pip install -e ".[transformers,ray]"
 ```
 
-### 3. 验证安装
+### 3. 安装 vLLM 和 vLLM-Ascend(可选)
+
+如果需要使用 VLLMSampler 进行高效推理,可以安装 vLLM 和 vLLM-Ascend:
+
+```bash
+# 安装 vLLM
+pip install vllm
+
+# 安装 vLLM-Ascend(昇腾适配版本)
+# 请参考官方文档:https://github.com/vllm-project/vllm
+```
+
+**注意**:vLLM-Ascend 的安装可能需要特定的 CANN 版本配套,请参考 vLLM-Ascend 官方文档进行安装。
+
+### 4. 验证安装
 
 创建测试脚本 `verify_npu.py`:
 
@@ -220,7 +234,7 @@ pip install torch_npu-2.7.1-cp311-cp311-linux_aarch64.whl
 | QLoRA | ✅ | ❌ | - | 量化算子暂不支持 |
 | DPO | ✅ | 🚧 | - | 理论支持,待验证 |
 | Megatron TP/PP | ✅ | 🚧 | - | 待适配和验证 |
-| VLLMSampler | ✅ | 🚧 | - | 需 vLLM-Ascend,待验证 |
+| VLLMSampler | ✅ | ✅ | cookbook/grpo/lora_npu.py | 已验证可用 |
 | Flash Attention | ✅ | ⚠️ | - | 部分算子不支持 |

diff --git a/docs/source_en/GetStarted/NPU-setup.md b/docs/source_en/GetStarted/NPU-setup.md
index 03bbee83..c0404209 100644
--- a/docs/source_en/GetStarted/NPU-setup.md
+++ b/docs/source_en/GetStarted/NPU-setup.md
@@ -56,7 +56,21 @@ cd twinkle
 pip install -e ".[transformers,ray]"
 ```
 
-### 3. Verify Installation
+### 3. Install vLLM and vLLM-Ascend (Optional)
+
+If you need to use VLLMSampler for efficient inference, you can install vLLM and vLLM-Ascend:
+
+```bash
+# Install vLLM
+pip install vllm
+
+# Install vLLM-Ascend (Ascend adaptation version)
+# Please refer to the official documentation: https://github.com/vllm-project/vllm
+```
+
+**Note**: vLLM-Ascend installation may require specific CANN version compatibility. Please refer to the official vLLM-Ascend documentation for installation instructions.
+
+### 4. Verify Installation
 
 Create test script `verify_npu.py`:
 
@@ -220,7 +234,7 @@ Feature support matrix based on actual code verification:
 | QLoRA | ✅ | ❌ | - | Quantization operators not supported |
 | DPO | ✅ | 🚧 | - | Theoretically supported, to be verified |
 | Megatron TP/PP | ✅ | 🚧 | - | Under adaptation and verification |
-| VLLMSampler | ✅ | 🚧 | - | Requires vLLM-Ascend, to be verified |
+| VLLMSampler | ✅ | ✅ | cookbook/grpo/lora_npu.py | Verified and working |
 | Flash Attention | ✅ | ⚠️ | - | Some operators unsupported |

From 39e3404c0e1d7810d4cdf49d5c66054247ca173e Mon Sep 17 00:00:00 2001
From: addsubmuldiv
Date: Thu, 5 Feb 2026 07:25:32 +0000
Subject: [PATCH 4/6] update

---
 docs/source/GetStarted/NPU-setup.md    | 17 +++++++++++------
 docs/source_en/GetStarted/NPU-setup.md | 17 +++++++++++------
 2 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/docs/source/GetStarted/NPU-setup.md b/docs/source/GetStarted/NPU-setup.md
index 60f4dc18..f15a7cd2 100644
--- a/docs/source/GetStarted/NPU-setup.md
+++ b/docs/source/GetStarted/NPU-setup.md
@@ -58,17 +58,22 @@ pip install -e ".[transformers,ray]"
 
 ### 3. 安装 vLLM 和 vLLM-Ascend(可选)
 
-如果需要使用 VLLMSampler 进行高效推理,可以安装 vLLM 和 vLLM-Ascend:
+如果需要使用 VLLMSampler 进行高效推理,可以安装 vLLM 和 vLLM-Ascend。
+
+**安装步骤**(参考 [Swift 文档](https://swift.readthedocs.io/zh-cn/latest/BestPractices/NPU-support.html)):
 
 ```bash
-# 安装 vLLM
-pip install vllm
+# 第一步:安装 vLLM
+pip install vllm==0.11.0
 
-# 安装 vLLM-Ascend(昇腾适配版本)
-# 请参考官方文档:https://github.com/vllm-project/vllm
+# 第二步:安装 vLLM-Ascend
+pip install vllm-ascend==0.11.0rc3
 ```
 
-**注意**:vLLM-Ascend 的安装可能需要特定的 CANN 版本配套,请参考 vLLM-Ascend 官方文档进行安装。
+**注意事项**:
+- 按照上述顺序安装,忽略可能的依赖冲突提示
+- 安装前确保已激活 CANN 环境:`source /usr/local/Ascend/ascend-toolkit/set_env.sh`
+- 推荐使用的版本为 vLLM 0.11.0 和 vLLM-Ascend 0.11.0rc3
 
 ### 4. 验证安装

diff --git a/docs/source_en/GetStarted/NPU-setup.md b/docs/source_en/GetStarted/NPU-setup.md
index c0404209..acd3c9a7 100644
--- a/docs/source_en/GetStarted/NPU-setup.md
+++ b/docs/source_en/GetStarted/NPU-setup.md
@@ -58,17 +58,22 @@ pip install -e ".[transformers,ray]"
 
 ### 3. Install vLLM and vLLM-Ascend (Optional)
 
-If you need to use VLLMSampler for efficient inference, you can install vLLM and vLLM-Ascend:
+If you need to use VLLMSampler for efficient inference, you can install vLLM and vLLM-Ascend.
+
+**Installation Steps** (Reference: [Swift Documentation](https://swift.readthedocs.io/zh-cn/latest/BestPractices/NPU-support.html)):
 
 ```bash
-# Install vLLM
-pip install vllm
+# Step 1: Install vLLM
+pip install vllm==0.11.0
 
-# Install vLLM-Ascend (Ascend adaptation version)
-# Please refer to the official documentation: https://github.com/vllm-project/vllm
+# Step 2: Install vLLM-Ascend
+pip install vllm-ascend==0.11.0rc3
 ```
 
-**Note**: vLLM-Ascend installation may require specific CANN version compatibility. Please refer to the official vLLM-Ascend documentation for installation instructions.
+**Important Notes**:
+- Follow the installation order above and ignore potential dependency conflict warnings
+- Ensure CANN environment is activated before installation: `source /usr/local/Ascend/ascend-toolkit/set_env.sh`
+- Recommended versions are vLLM 0.11.0 and vLLM-Ascend 0.11.0rc3
 
 ### 4. Verify Installation
 
 Create test script `verify_npu.py`:

From 478e538d5bde3c0346d816155fac24fe48f91344 Mon Sep 17 00:00:00 2001
From: addsubmuldiv
Date: Thu, 5 Feb 2026 07:35:33 +0000
Subject: [PATCH 5/6] update

---
 docs/source/GetStarted/NPU-setup.md    | 4 ++--
 docs/source_en/GetStarted/NPU-setup.md | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/source/GetStarted/NPU-setup.md b/docs/source/GetStarted/NPU-setup.md
index f15a7cd2..e3cac951 100644
--- a/docs/source/GetStarted/NPU-setup.md
+++ b/docs/source/GetStarted/NPU-setup.md
@@ -177,7 +177,7 @@ Twinkle 在 NPU 上目前支持以下**经过验证**的并行策略:
 - 🚧 待验证:理论上支持但暂无 NPU 验证示例
 - ❌ 不支持:当前版本不可用
 
-### DP + FSDP 示例(已验证)
+### DP + FSDP 示例
 
 以下示例来自 `cookbook/sft/lora_npu.py`,在实际 NPU 环境中验证通过:
 
@@ -235,11 +235,11 @@ pip install torch_npu-2.7.1-cp311-cp311-linux_aarch64.whl
 | FSDP 并行 | ✅ | ✅ | cookbook/sft/lora_npu.py | 已验证可用 |
 | Ray 分布式 | ✅ | ✅ | cookbook/sft/lora_npu.py | 已验证可用 |
 | TorchSampler | ✅ | ✅ | cookbook/grpo/lora_npu.py | 已验证可用 |
+| VLLMSampler | ✅ | ✅ | cookbook/grpo/lora_npu.py | 已验证可用 |
 | 全量微调 | ✅ | 🚧 | - | 理论支持,待验证 |
 | QLoRA | ✅ | ❌ | - | 量化算子暂不支持 |
 | DPO | ✅ | 🚧 | - | 理论支持,待验证 |
 | Megatron TP/PP | ✅ | 🚧 | - | 待适配和验证 |
-| VLLMSampler | ✅ | ✅ | cookbook/grpo/lora_npu.py | 已验证可用 |
 | Flash Attention | ✅ | ⚠️ | - | 部分算子不支持 |

diff --git a/docs/source_en/GetStarted/NPU-setup.md b/docs/source_en/GetStarted/NPU-setup.md
index acd3c9a7..61d3503f 100644
--- a/docs/source_en/GetStarted/NPU-setup.md
+++ b/docs/source_en/GetStarted/NPU-setup.md
@@ -177,7 +177,7 @@ Currently **verified** parallelism strategies on Twinkle NPU:
 - 🚧 To be verified: Theoretically supported but no NPU validation
 - ❌ Not supported: Currently unavailable
 
-### DP + FSDP Example (Verified)
+### DP + FSDP Example
 
 The following example is from `cookbook/sft/lora_npu.py`, verified on actual NPU environment:
 
@@ -235,11 +235,11 @@ Feature support matrix based on actual code verification:
 | FSDP Parallel | ✅ | ✅ | cookbook/sft/lora_npu.py | Verified and working |
 | Ray Distributed | ✅ | ✅ | cookbook/sft/lora_npu.py | Verified and working |
 | TorchSampler | ✅ | ✅ | cookbook/grpo/lora_npu.py | Verified and working |
+| VLLMSampler | ✅ | ✅ | cookbook/grpo/lora_npu.py | Verified and working |
 | Full Fine-tuning | ✅ | 🚧 | - | Theoretically supported, to be verified |
 | QLoRA | ✅ | ❌ | - | Quantization operators not supported |
 | DPO | ✅ | 🚧 | - | Theoretically supported, to be verified |
 | Megatron TP/PP | ✅ | 🚧 | - | Under adaptation and verification |
-| VLLMSampler | ✅ | ✅ | cookbook/grpo/lora_npu.py | Verified and working |
 | Flash Attention | ✅ | ⚠️ | - | Some operators unsupported |

From 76d9f10c71a6b5253b6f047204d70aa77fca8fd5 Mon Sep 17 00:00:00 2001
From: addsubmuldiv
Date: Thu, 5 Feb 2026 07:39:41 +0000
Subject: [PATCH 6/6] update

---
 docs/source/GetStarted/NPU-setup.md    | 2 +-
 docs/source_en/GetStarted/NPU-setup.md | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/GetStarted/NPU-setup.md b/docs/source/GetStarted/NPU-setup.md
index e3cac951..000d38f8 100644
--- a/docs/source/GetStarted/NPU-setup.md
+++ b/docs/source/GetStarted/NPU-setup.md
@@ -60,7 +60,7 @@ pip install -e ".[transformers,ray]"
 
 如果需要使用 VLLMSampler 进行高效推理,可以安装 vLLM 和 vLLM-Ascend。
 
-**安装步骤**(参考 [Swift 文档](https://swift.readthedocs.io/zh-cn/latest/BestPractices/NPU-support.html)):
+**安装步骤**:
 
 ```bash
 # 第一步:安装 vLLM
 pip install vllm==0.11.0

diff --git a/docs/source_en/GetStarted/NPU-setup.md b/docs/source_en/GetStarted/NPU-setup.md
index 61d3503f..78a70c83 100644
--- a/docs/source_en/GetStarted/NPU-setup.md
+++ b/docs/source_en/GetStarted/NPU-setup.md
@@ -60,7 +60,7 @@ pip install -e ".[transformers,ray]"
 
 If you need to use VLLMSampler for efficient inference, you can install vLLM and vLLM-Ascend.
 
-**Installation Steps** (Reference: [Swift Documentation](https://swift.readthedocs.io/zh-cn/latest/BestPractices/NPU-support.html)):
+**Installation Steps**:
 
 ```bash
 # Step 1: Install vLLM