[Issue]: VLLM_ROCM_USE_AITER_MLA accuracy loss with Kimi DP2TP4 #1455

@bradleyhd

Problem Description

When testing Kimi-K2-Thinking with vLLM on MI300X under DP2TP4 (data-parallel 2, tensor-parallel 4), enabling VLLM_ROCM_USE_AITER_MLA appears to cause a complete accuracy collapse on the gsm8k eval: flexible-extract exact_match drops from 0.9424 to 0.0091, and strict-match drops to 0.0000.
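As a first sanity check that the toggle is actually being read, vLLM's env handling can be queried directly. A minimal sketch, assuming vllm.envs exposes these flags as module attributes (as in recent vLLM releases):

VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_MLA=1 \
python -c "import vllm.envs as envs; print(envs.VLLM_ROCM_USE_AITER, envs.VLLM_ROCM_USE_AITER_MLA)"

If both print True, the flag plumbing is fine and the regression points at the aiter MLA kernel path itself.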

Operating System

CentOS Stream 9

CPU

AMD EPYC 9654 96-Core Processor

GPU

AMD Instinct MI300X

ROCm Version

6.4.43484-123eb5128

ROCm Component

No response

Steps to Reproduce

Baseline (VLLM_ROCM_USE_AITER_MLA=0):

DEBUG_CLR_GRAPH_PACKET_CAPTURE=0 \
SAFETENSORS_FAST_GPU=1 \
VLLM_GPU_MEMORY_UTILIZATION=0.95 \
VLLM_MLA_DISABLE=0 \
VLLM_ROCM_USE_AITER_LINEAR=0 \
VLLM_ROCM_USE_AITER_MHA=0 \
VLLM_ROCM_USE_AITER_MLA=0 \
VLLM_ROCM_USE_AITER_MOE=0 \
VLLM_ROCM_USE_AITER_RMSNORM=0 \
VLLM_ROCM_USE_AITER=1 \
VLLM_TRUST_REMOTE_CODE=1 \
VLLM_USE_V1=1 \
VLLM_MM_ENCODER_ATTN_BACKEND=TORCH_SDPA \
VLLM_ROCM_USE_AITER_FP8BMM=0 \
VLLM_ROCM_FP8_PADDING=0 \
python -m vllm.entrypoints.openai.api_server \
  --port 22234 \
  --served-model-name kimi-k2 \
  --model /data/local/models/Kimi-K2-Thinking \
  --trust-remote-code \
  --max-model-len 100K \
  --max-num-seqs 256 \
  --tensor-parallel-size 4 \
  --data-parallel-size 2 \
  2>&1 | tee /tmp/bradleyhd/vllm.log
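Before running the eval, a quick request against the OpenAI-compatible endpoint confirms the server is up and serving (the prompt is just an illustrative example):

curl -s http://0.0.0.0:22234/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "kimi-k2", "prompt": "Q: 7 * 8 = ", "max_tokens": 8, "temperature": 0}'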

Eval:

lm_eval --model local-completions --model_args "model=kimi-k2,base_url=http://0.0.0.0:22234/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=256" --tasks gsm8k --num_fewshot 8 2>&1 | tee /tmp/bradleyhd/eval.log

local-completions (model=kimi-k2,base_url=http://0.0.0.0:22234/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=256), gen_kwargs: (None), limit: None, num_fewshot: 8, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     8|exact_match|↑  |0.9424|±  |0.0064|
|     |       |strict-match    |     8|exact_match|↑  |0.9424|±  |0.0064|

Reproduce (VLLM_ROCM_USE_AITER_MLA=1):

DEBUG_CLR_GRAPH_PACKET_CAPTURE=0 \
SAFETENSORS_FAST_GPU=1 \
VLLM_GPU_MEMORY_UTILIZATION=0.95 \
VLLM_MLA_DISABLE=0 \
VLLM_ROCM_USE_AITER_LINEAR=0 \
VLLM_ROCM_USE_AITER_MHA=0 \
VLLM_ROCM_USE_AITER_MLA=1 \
VLLM_ROCM_USE_AITER_MOE=0 \
VLLM_ROCM_USE_AITER_RMSNORM=0 \
VLLM_ROCM_USE_AITER=1 \
VLLM_TRUST_REMOTE_CODE=1 \
VLLM_USE_V1=1 \
VLLM_MM_ENCODER_ATTN_BACKEND=TORCH_SDPA \
VLLM_ROCM_USE_AITER_FP8BMM=0 \
VLLM_ROCM_FP8_PADDING=0 \
python -m vllm.entrypoints.openai.api_server \
  --port 22234 \
  --served-model-name kimi-k2 \
  --model /data/local/models/Kimi-K2-Thinking \
  --trust-remote-code \
  --max-model-len 100K \
  --max-num-seqs 256 \
  --tensor-parallel-size 4 \
  --data-parallel-size 2 \
  2>&1 | tee /tmp/bradleyhd/vllm.log

Eval:

lm_eval --model local-completions --model_args "model=kimi-k2,base_url=http://0.0.0.0:22234/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=256" --tasks gsm8k --num_fewshot 8 2>&1 | tee /tmp/bradleyhd/eval.log

local-completions (model=kimi-k2,base_url=http://0.0.0.0:22234/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=256), gen_kwargs: (None), limit: None, num_fewshot: 8, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     8|exact_match|↑  |0.0091|±  |0.0026|
|     |       |strict-match    |     8|exact_match|↑  |0.0000|±  |0.0000|
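A full gsm8k run isn't needed to see the failure mode; diffing a single greedy completion across the two server configurations usually surfaces it immediately. A minimal sketch, assuming the server is relaunched with each flag value and the same request is repeated (the prompt and file names are illustrative):

curl -s http://0.0.0.0:22234/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "kimi-k2", "prompt": "Q: Natalia sold 48 clips in April, and half as many in May. How many clips did she sell in total? A:", "max_tokens": 64, "temperature": 0}' \
  | python -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["text"])' \
  > /tmp/bradleyhd/mla_on.txt   # save as mla_off.txt for the baseline server

diff /tmp/bradleyhd/mla_off.txt /tmp/bradleyhd/mla_on.txt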

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

Using the amd-aiter 0.1.7.post2.dev18 pip package.
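For anyone trying to reproduce with matching versions, the installed packages can be confirmed with:

pip show amd-aiter | grep -i '^version'
python -c "import vllm; print(vllm.__version__)"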
