
Virtualization not working properly #37

@zqWu

Description


First, the steps I followed:

  1. k8s + ascend-device-plugin are installed on the node

Before installing HAMi, a quick test:

apiVersion: v1
kind: Pod
metadata:
  name: npu-smi
spec:
  volumes:
  - name: host-usr-local
    hostPath:
      path: /usr/local

  containers:
  - name: npu-smi
    image: quay.io/ascend/vllm-ascend:v0.11.0rc0
    imagePullPolicy: IfNotPresent
    command: ["sh", "-c", "npu-smi info; sleep infinity"]

    resources:
      limits:
        huawei.com/Ascend910: 2 # expect to see 2 cards
      requests:
        huawei.com/Ascend910: 2

    volumeMounts:
    - name: host-usr-local
      mountPath: /usr/local # the container image has no npu-smi, so it is mounted in via this volume

With this yaml, npu-smi behaves as expected and shows 2 of the 910B2 cards (the host has 8 cards in total).
At this point k8s can use the NPUs normally (vllm is able to bring up a model).
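
For reference, the allocatable count on the node can also be confirmed from k8s before installing HAMi. A minimal check (the node name 192.168.0.85 is taken from the pod output further down; with the plain ascend-device-plugin it should report 8 for the 8 physical cards):

kubectl describe node 192.168.0.85 | grep -i ascend
# expected here: huawei.com/Ascend910: 8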

  2. Install HAMi
  • Installed the hami scheduler (modified the helm values as required: enable and customresources)
  • Labeled the node with two labels: ascend=on and accelerator=huawei-Ascend910 (the label commands are sketched after this step)
VER_K8S=v1.26.0
helm install hami ./charts/hami --set scheduler.kubeScheduler.imageTag=$VER_K8S -n kube-system

Then installed the ascend-device-plugin directly with kubectl apply -f ascend-device-plugin.yaml
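
The two node labels mentioned above were applied with plain kubectl label commands, along these lines (a sketch; the node name 192.168.0.85 again comes from the pod output below):

kubectl label node 192.168.0.85 ascend=on
kubectl label node 192.168.0.85 accelerator=huawei-Ascend910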

  3. Check the status
[screenshots]

There is one question here: why is it 64? The scheduler configmap sets a 1:10 split, so 80 was expected.

[screenshot]
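
To double-check which split ratio the scheduler actually loads, the device config can be dumped from the HAMi ConfigMap (a sketch; the ConfigMap name hami-scheduler-device in kube-system is an assumption based on a default helm install, adjust it to whatever the first command lists):

kubectl get configmap -n kube-system | grep -i hami
kubectl get configmap hami-scheduler-device -n kube-system -o yaml  # assumed name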
  4. Check whether the vGPU can be used inside a pod
    Modify the resources in the yaml above
# ctr image pull quay.io/ascend/vllm-ascend:v0.11.0rc0

apiVersion: v1
kind: Pod
metadata:
  name: npu-smi
spec:
  volumes:
  - name: host-usr-local
    hostPath:
      path: /usr/local

  containers:
  - name: npu-smi
    image: quay.io/ascend/vllm-ascend:v0.11.0rc0
    imagePullPolicy: IfNotPresent
    command: ["sh", "-c", "npu-smi info; sleep infinity"]

    resources:
      limits:
        huawei.com/Ascend910B2: 1 # expect to see 1 card

    volumeMounts:
    - name: host-usr-local
      mountPath: /usr/local # the container image has no npu-smi, so it is mounted in via this volume

The result is as follows:

[screenshot]

One card shows as allocated, but it cannot be used inside the pod.
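
To narrow down where it breaks, a minimal in-container check (assumptions: HAMi is expected to expose the allocated card to the container as /dev/davinci* device nodes plus an ASCEND_VISIBLE_DEVICES-style environment variable; both may differ depending on the runtime setup):

kubectl exec npu-smi -- sh -c 'ls -l /dev/davinci* /dev/davinci_manager'  # are the device nodes present?
kubectl exec npu-smi -- sh -c 'env | grep -i ascend'                      # which ASCEND_* vars were injected?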

kubectl get pod/npu-smi -oyaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    hami.io/Ascend910B2-devices-allocated: 6874EE64-8060B21D-1818D892-89528485-104301E3,Ascend910B2,65536,0:;
    hami.io/Ascend910B2-devices-to-allocate: 6874EE64-8060B21D-1818D892-89528485-104301E3,Ascend910B2,65536,0:;
    hami.io/bind-phase: allocating
    hami.io/bind-time: "1761711164"
    hami.io/vgpu-node: 192.168.0.85
    hami.io/vgpu-time: "1761711164"
    huawei.com/Ascend910B2: '[{"UUID":"6874EE64-8060B21D-1818D892-89528485-104301E3"}]'
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"npu-smi","namespace":"default"},"spec":{"containers":[{"command":["sh","-c","npu-smi info; sleep infinity"],"image":"quay.io/ascend/vllm-ascend:v0.11.0rc0","imagePullPolicy":"IfNotPresent","name":"npu-smi","resources":{"limits":{"huawei.com/Ascend910B2":1}},"volumeMounts":[{"mountPath":"/usr/local","name":"host-usr-local"}]}],"volumes":[{"hostPath":{"path":"/usr/local"},"name":"host-usr-local"}]}}
    predicate-time: "1761711164"
  creationTimestamp: "2025-10-29T04:12:44Z"
  labels:
    hami.io/vgpu-node: 192.168.0.85
  name: npu-smi
  namespace: default
  resourceVersion: "1208292"
  uid: 7b7cd5ca-1d16-4fe5-835f-3f54bf841152
spec:
  containers:
  - command:
    - sh
    - -c
    - npu-smi info; sleep infinity
    image: quay.io/ascend/vllm-ascend:v0.11.0rc0
    imagePullPolicy: IfNotPresent
    name: npu-smi
    resources:
      limits:
        huawei.com/Ascend910B2: "1"
        huawei.com/Ascend910B2-memory: "65536"
      requests:
        huawei.com/Ascend910B2: "1"
        huawei.com/Ascend910B2-memory: "65536"
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /usr/local
      name: host-usr-local
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-g52dh
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: 192.168.0.85
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: hami-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - hostPath:
      path: /usr/local
      type: ""
    name: host-usr-local
  - name: kube-api-access-g52dh
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2025-10-29T04:12:44Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2025-10-29T04:12:45Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2025-10-29T04:12:45Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2025-10-29T04:12:44Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://e5145389ed3c7b1b32595a5761093d889d42c4cb3bb9028832c833772a88c244
    image: quay.io/ascend/vllm-ascend:v0.11.0rc0
    imageID: sha256:f3e58518611886581cc178f1b77683e4483f665e2f32ed9a324474632dc80132
    lastState: {}
    name: npu-smi
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2025-10-29T04:12:44Z"
  hostIP: 192.168.0.85
  phase: Running
  podIP: 172.20.255.15
  podIPs:
  - ip: 172.20.255.15
  qosClass: BestEffort
  startTime: "2025-10-29T04:12:44Z"

I looked at the logs inside the container and posted them here:
https://www.hiascend.com/forum/thread-0278197012733089109-1-1.html
