
Virtualization not working properly #37

@zqWu

Description


First, the steps I followed:

  1. k8s + ascend-device-plugin are installed on the node

Before installing HAMi, a quick test:

apiVersion: v1
kind: Pod
metadata:
  name: npu-smi
spec:
  volumes:
  - name: host-usr-local
    hostPath:
      path: /usr/local

  containers:
  - name: npu-smi
    image: quay.io/ascend/vllm-ascend:v0.11.0rc0
    imagePullPolicy: IfNotPresent
    command: ["sh", "-c", "npu-smi info; sleep infinity"]

    resources:
      limits:
        huawei.com/Ascend910: 2 # expect to see 2 cards
      requests:
        huawei.com/Ascend910: 2

    volumeMounts:
    - name: host-usr-local
      mountPath: /usr/local # the container image has no npu-smi, so it is mounted in via this volume

With this yaml, npu-smi behaves as expected and shows 2 of the 910B2 cards (the host has 8 cards in total).
At this point k8s can use the NPUs normally (vllm is able to bring up a model).
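
For reference, the allocatable count on the node can also be confirmed from k8s before installing HAMi. A minimal check (the node name 192.168.0.85 is taken from the pod output further down; with the plain ascend-device-plugin it should report 8 for the 8 physical cards):

kubectl describe node 192.168.0.85 | grep -i ascend
# expected here: huawei.com/Ascend910: 8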

  2. Install HAMi
  • Installed the hami scheduler (modified the helm values as required: enable and customresources)
  • Labeled the node with two labels: ascend=on and accelerator=huawei-Ascend910 (the label commands are sketched after this step)
VER_K8S=v1.26.0
helm install hami ./charts/hami --set scheduler.kubeScheduler.imageTag=$VER_K8S -n kube-system

Then installed the ascend-device-plugin directly with kubectl apply -f ascend-device-plugin.yaml
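
The two node labels mentioned above were applied with plain kubectl label commands, along these lines (a sketch; the node name 192.168.0.85 again comes from the pod output below):

kubectl label node 192.168.0.85 ascend=on
kubectl label node 192.168.0.85 accelerator=huawei-Ascend910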

  3. Check the status
[screenshots]

There is one question here: why is it 64? The scheduler configmap sets a 1:10 split, so 80 was expected.

[screenshot]
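
To double-check which split ratio the scheduler actually loads, the device config can be dumped from the HAMi ConfigMap (a sketch; the ConfigMap name hami-scheduler-device in kube-system is an assumption based on a default helm install, adjust it to whatever the first command lists):

kubectl get configmap -n kube-system | grep -i hami
kubectl get configmap hami-scheduler-device -n kube-system -o yaml  # assumed name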
  4. Check whether the vGPU can be used inside a pod
    Modify the resources in the yaml above
# ctr image pull quay.io/ascend/vllm-ascend:v0.11.0rc0

apiVersion: v1
kind: Pod
metadata:
  name: npu-smi
spec:
  volumes:
  - name: host-usr-local
    hostPath:
      path: /usr/local

  containers:
  - name: npu-smi
    image: quay.io/ascend/vllm-ascend:v0.11.0rc0
    imagePullPolicy: IfNotPresent
    command: ["sh", "-c", "npu-smi info; sleep infinity"]

    resources:
      limits:
        huawei.com/Ascend910B2: 1 # expect to see 1 card

    volumeMounts:
    - name: host-usr-local
      mountPath: /usr/local # the container image has no npu-smi, so it is mounted in via this volume

The result is as follows:

[screenshot]

One card shows as allocated, but it cannot be used inside the pod.
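
To narrow down where it breaks, a minimal in-container check (assumptions: HAMi is expected to expose the allocated card to the container as /dev/davinci* device nodes plus an ASCEND_VISIBLE_DEVICES-style environment variable; both may differ depending on the runtime setup):

kubectl exec npu-smi -- sh -c 'ls -l /dev/davinci* /dev/davinci_manager'  # are the device nodes present?
kubectl exec npu-smi -- sh -c 'env | grep -i ascend'                      # which ASCEND_* vars were injected?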

kubectl get pod/npu-smi -oyaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    hami.io/Ascend910B2-devices-allocated: 6874EE64-8060B21D-1818D892-89528485-104301E3,Ascend910B2,65536,0:;
    hami.io/Ascend910B2-devices-to-allocate: 6874EE64-8060B21D-1818D892-89528485-104301E3,Ascend910B2,65536,0:;
    hami.io/bind-phase: allocating
    hami.io/bind-time: "1761711164"
    hami.io/vgpu-node: 192.168.0.85
    hami.io/vgpu-time: "1761711164"
    huawei.com/Ascend910B2: '[{"UUID":"6874EE64-8060B21D-1818D892-89528485-104301E3"}]'
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"npu-smi","namespace":"default"},"spec":{"containers":[{"command":["sh","-c","npu-smi info; sleep infinity"],"image":"quay.io/ascend/vllm-ascend:v0.11.0rc0","imagePullPolicy":"IfNotPresent","name":"npu-smi","resources":{"limits":{"huawei.com/Ascend910B2":1}},"volumeMounts":[{"mountPath":"/usr/local","name":"host-usr-local"}]}],"volumes":[{"hostPath":{"path":"/usr/local"},"name":"host-usr-local"}]}}
    predicate-time: "1761711164"
  creationTimestamp: "2025-10-29T04:12:44Z"
  labels:
    hami.io/vgpu-node: 192.168.0.85
  name: npu-smi
  namespace: default
  resourceVersion: "1208292"
  uid: 7b7cd5ca-1d16-4fe5-835f-3f54bf841152
spec:
  containers:
  - command:
    - sh
    - -c
    - npu-smi info; sleep infinity
    image: quay.io/ascend/vllm-ascend:v0.11.0rc0
    imagePullPolicy: IfNotPresent
    name: npu-smi
    resources:
      limits:
        huawei.com/Ascend910B2: "1"
        huawei.com/Ascend910B2-memory: "65536"
      requests:
        huawei.com/Ascend910B2: "1"
        huawei.com/Ascend910B2-memory: "65536"
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /usr/local
      name: host-usr-local
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-g52dh
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: 192.168.0.85
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: hami-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - hostPath:
      path: /usr/local
      type: ""
    name: host-usr-local
  - name: kube-api-access-g52dh
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2025-10-29T04:12:44Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2025-10-29T04:12:45Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2025-10-29T04:12:45Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2025-10-29T04:12:44Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://e5145389ed3c7b1b32595a5761093d889d42c4cb3bb9028832c833772a88c244
    image: quay.io/ascend/vllm-ascend:v0.11.0rc0
    imageID: sha256:f3e58518611886581cc178f1b77683e4483f665e2f32ed9a324474632dc80132
    lastState: {}
    name: npu-smi
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2025-10-29T04:12:44Z"
  hostIP: 192.168.0.85
  phase: Running
  podIP: 172.20.255.15
  podIPs:
  - ip: 172.20.255.15
  qosClass: BestEffort
  startTime: "2025-10-29T04:12:44Z"

I looked at the logs inside the container and posted them here:
https://www.hiascend.com/forum/thread-0278197012733089109-1-1.html
