First, a description of the steps:
- k8s + ascend-device-plugin are installed on the node.

Before installing HAMi, test with the following Pod:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: npu-smi
spec:
  volumes:
    - name: host-usr-local
      hostPath:
        path: /usr/local
  containers:
    - name: npu-smi
      image: quay.io/ascend/vllm-ascend:v0.11.0rc0
      imagePullPolicy: IfNotPresent
      command: ["sh", "-c", "npu-smi info; sleep infinity"]
      resources:
        limits:
          huawei.com/Ascend910: 2   # expect to see 2 cards
        requests:
          huawei.com/Ascend910: 2
      volumeMounts:
        - name: host-usr-local
          mountPath: /usr/local   # the image itself has no npu-smi command, so it is mounted in via the volume
```

With this yaml, npu-smi behaves as expected and shows 2 of the 910B2 cards (the host has 8 cards in total).
At this point k8s uses the NPUs normally (vllm is able to start a model).
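For reference, the check was simply re-running npu-smi in that pod, roughly like this (the pod name npu-smi comes from the manifest above):

```shell
# re-run npu-smi inside the running pod; it should list 2 x 910B2 cards
kubectl exec npu-smi -- npu-smi info
```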
- Install HAMi
- Install the hami scheduler (modified the values as required: enable and customresources)
- Added two labels to the node: ascend=on and accelerator=huawei-Ascend910 (label commands are sketched below)

```shell
VER_K8S=v1.26.0
helm install hami ./charts/hami --set scheduler.kubeScheduler.imageTag=$VER_K8S -n kube-system
```

Then install ascend-device-plugin directly with `kubectl apply -f ascend-device-plugin.yaml`.
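For completeness, the two node labels mentioned above were applied roughly like this (the node name 192.168.0.85 is taken from the pod dump further down; substitute your own node name):

```shell
# label the node so the ascend plugin / hami scheduler select it
kubectl label node 192.168.0.85 ascend=on
kubectl label node 192.168.0.85 accelerator=huawei-Ascend910
```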
- Check the status

There is a question here: why is the value 64? The scheduler configmap is set to a 1:10 split, so with 8 physical cards the expected value is 80.
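Assuming the 64 comes from the node's allocatable resources, a rough way to cross-check it against the scheduler configuration is (the configmap name hami-scheduler-device is what my Helm install created; it may differ in other chart versions):

```shell
# vNPU resources advertised by the node after installing HAMi + ascend-device-plugin
kubectl describe node 192.168.0.85 | grep -i ascend

# the device split / memory configuration used by the hami scheduler
kubectl get configmap hami-scheduler-device -n kube-system -o yaml
```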
- Check whether vGPU can be used inside a pod

Modify the resources in the yaml above:

```yaml
# ctr image pull quay.io/ascend/vllm-ascend:v0.11.0rc0
apiVersion: v1
kind: Pod
metadata:
  name: npu-smi
spec:
  volumes:
    - name: host-usr-local
      hostPath:
        path: /usr/local
  containers:
    - name: npu-smi
      image: quay.io/ascend/vllm-ascend:v0.11.0rc0
      imagePullPolicy: IfNotPresent
      command: ["sh", "-c", "npu-smi info; sleep infinity"]
      resources:
        limits:
          huawei.com/Ascend910B2: 1   # expect to see 1 card
      volumeMounts:
        - name: host-usr-local
          mountPath: /usr/local   # the image itself has no npu-smi command, so it is mounted in via the volume
```

The result: one card is shown as allocated, but it cannot be used inside the pod.
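By "cannot be used" I mean that npu-smi inside the pod does not see the allocated card. A rough way to inspect what was actually injected into the container (checking for /dev/davinci* device nodes and Ascend-related environment variables is my assumption about how the card should be exposed):

```shell
# check whether any NPU device nodes were mounted into the container
kubectl exec npu-smi -- sh -c 'ls -l /dev/davinci* /dev/davinci_manager 2>/dev/null'

# check whether any Ascend-related environment variables were injected
kubectl exec npu-smi -- sh -c 'env | grep -i ascend'
```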
kubectl get pod/npu-smi -oyaml

```yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    hami.io/Ascend910B2-devices-allocated: 6874EE64-8060B21D-1818D892-89528485-104301E3,Ascend910B2,65536,0:;
    hami.io/Ascend910B2-devices-to-allocate: 6874EE64-8060B21D-1818D892-89528485-104301E3,Ascend910B2,65536,0:;
    hami.io/bind-phase: allocating
    hami.io/bind-time: "1761711164"
    hami.io/vgpu-node: 192.168.0.85
    hami.io/vgpu-time: "1761711164"
    huawei.com/Ascend910B2: '[{"UUID":"6874EE64-8060B21D-1818D892-89528485-104301E3"}]'
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"npu-smi","namespace":"default"},"spec":{"containers":[{"command":["sh","-c","npu-smi info; sleep infinity"],"image":"quay.io/ascend/vllm-ascend:v0.11.0rc0","imagePullPolicy":"IfNotPresent","name":"npu-smi","resources":{"limits":{"huawei.com/Ascend910B2":1}},"volumeMounts":[{"mountPath":"/usr/local","name":"host-usr-local"}]}],"volumes":[{"hostPath":{"path":"/usr/local"},"name":"host-usr-local"}]}}
    predicate-time: "1761711164"
  creationTimestamp: "2025-10-29T04:12:44Z"
  labels:
    hami.io/vgpu-node: 192.168.0.85
  name: npu-smi
  namespace: default
  resourceVersion: "1208292"
  uid: 7b7cd5ca-1d16-4fe5-835f-3f54bf841152
spec:
  containers:
  - command:
    - sh
    - -c
    - npu-smi info; sleep infinity
    image: quay.io/ascend/vllm-ascend:v0.11.0rc0
    imagePullPolicy: IfNotPresent
    name: npu-smi
    resources:
      limits:
        huawei.com/Ascend910B2: "1"
        huawei.com/Ascend910B2-memory: "65536"
      requests:
        huawei.com/Ascend910B2: "1"
        huawei.com/Ascend910B2-memory: "65536"
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /usr/local
      name: host-usr-local
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-g52dh
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: 192.168.0.85
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: hami-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - hostPath:
      path: /usr/local
      type: ""
    name: host-usr-local
  - name: kube-api-access-g52dh
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2025-10-29T04:12:44Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2025-10-29T04:12:45Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2025-10-29T04:12:45Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2025-10-29T04:12:44Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://e5145389ed3c7b1b32595a5761093d889d42c4cb3bb9028832c833772a88c244
    image: quay.io/ascend/vllm-ascend:v0.11.0rc0
    imageID: sha256:f3e58518611886581cc178f1b77683e4483f665e2f32ed9a324474632dc80132
    lastState: {}
    name: npu-smi
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2025-10-29T04:12:44Z"
  hostIP: 192.168.0.85
  phase: Running
  podIP: 172.20.255.15
  podIPs:
  - ip: 172.20.255.15
  qosClass: BestEffort
  startTime: "2025-10-29T04:12:44Z"
```

I went into the container and looked at the logs; I have posted them here:
https://www.hiascend.com/forum/thread-0278197012733089109-1-1.html
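In addition to the container-side logs above, the HAMi-side logs are probably relevant; I would collect them roughly like this (the pod names depend on how HAMi and ascend-device-plugin were deployed, so the grep and the placeholder names below need to be adjusted):

```shell
# find the HAMi scheduler and ascend device plugin pods
kubectl get pods -n kube-system | grep -Ei 'hami|ascend'

# then dump their logs (pod names here are placeholders)
kubectl logs -n kube-system <hami-scheduler-pod> --all-containers
kubectl logs -n kube-system <ascend-device-plugin-pod>
```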