Problem Description
When testing Kimi-K2-Thinking with vLLM on MI300X (DP2/TP4), enabling VLLM_ROCM_USE_AITER_MLA appears to cause a complete accuracy collapse on the gsm8k eval: exact match drops from 0.9424 to 0.0091 (flexible-extract) and 0.0000 (strict-match). The two runs below differ only in the value of VLLM_ROCM_USE_AITER_MLA.
Operating System
CentOS Stream 9
CPU
AMD EPYC 9654 96-Core Processor
GPU
AMD Instinct MI300X
ROCm Version
6.4.43484-123eb5128
ROCm Component
No response
Steps to Reproduce
Baseline (VLLM_ROCM_USE_AITER_MLA=0):
```bash
DEBUG_CLR_GRAPH_PACKET_CAPTURE=0 \
SAFETENSORS_FAST_GPU=1 \
VLLM_GPU_MEMORY_UTILIZATION=0.95 \
VLLM_MLA_DISABLE=0 \
VLLM_ROCM_USE_AITER_LINEAR=0 \
VLLM_ROCM_USE_AITER_MHA=0 \
VLLM_ROCM_USE_AITER_MLA=0 \
VLLM_ROCM_USE_AITER_MOE=0 \
VLLM_ROCM_USE_AITER_RMSNORM=0 \
VLLM_ROCM_USE_AITER=1 \
VLLM_TRUST_REMOTE_CODE=1 \
VLLM_USE_V1=1 \
VLLM_MM_ENCODER_ATTN_BACKEND=TORCH_SDPA \
VLLM_ROCM_USE_AITER_FP8BMM=0 \
VLLM_ROCM_FP8_PADDING=0 \
python -m vllm.entrypoints.openai.api_server \
  --port 22234 \
  --served-model-name kimi-k2 \
  --model /data/local/models/Kimi-K2-Thinking \
  --trust-remote-code \
  --max-model-len 100K \
  --max-num-seqs 256 \
  --tensor-parallel-size 4 \
  --data-parallel-size 2 \
  2>&1 | tee /tmp/bradleyhd/vllm.log
```
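Before launching the eval, it can help to confirm the server is up and which attention backend vLLM actually selected. A minimal sketch, assuming the OpenAI-compatible server above is listening on port 22234; the exact wording of the backend log line varies across vLLM versions, so the grep is deliberately loose:

```bash
# Confirm the server is serving the model via the OpenAI-compatible endpoint.
curl -s http://0.0.0.0:22234/v1/models

# Look for the attention-backend choice vLLM reported at startup.
# The exact message differs between vLLM versions, hence the loose match.
grep -i "backend" /tmp/bradleyhd/vllm.log | head
```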
Eval:
```bash
lm_eval --model local-completions --model_args "model=kimi-k2,base_url=http://0.0.0.0:22234/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=256" --tasks gsm8k --num_fewshot 8 2>&1 | tee /tmp/bradleyhd/eval.log
```
local-completions (model=kimi-k2,base_url=http://0.0.0.0:22234/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=256), gen_kwargs: (None), limit: None, num_fewshot: 8, batch_size: 1
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 8|exact_match|↑ |0.9424|± |0.0064|
| | |strict-match | 8|exact_match|↑ |0.9424|± |0.0064|
Reproduce (VLLM_ROCM_USE_AITER_MLA=1):
```bash
DEBUG_CLR_GRAPH_PACKET_CAPTURE=0 \
SAFETENSORS_FAST_GPU=1 \
VLLM_GPU_MEMORY_UTILIZATION=0.95 \
VLLM_MLA_DISABLE=0 \
VLLM_ROCM_USE_AITER_LINEAR=0 \
VLLM_ROCM_USE_AITER_MHA=0 \
VLLM_ROCM_USE_AITER_MLA=1 \
VLLM_ROCM_USE_AITER_MOE=0 \
VLLM_ROCM_USE_AITER_RMSNORM=0 \
VLLM_ROCM_USE_AITER=1 \
VLLM_TRUST_REMOTE_CODE=1 \
VLLM_USE_V1=1 \
VLLM_MM_ENCODER_ATTN_BACKEND=TORCH_SDPA \
VLLM_ROCM_USE_AITER_FP8BMM=0 \
VLLM_ROCM_FP8_PADDING=0 \
python -m vllm.entrypoints.openai.api_server \
  --port 22234 \
  --served-model-name kimi-k2 \
  --model /data/local/models/Kimi-K2-Thinking \
  --trust-remote-code \
  --max-model-len 100K \
  --max-num-seqs 256 \
  --tensor-parallel-size 4 \
  --data-parallel-size 2 \
  2>&1 | tee /tmp/bradleyhd/vllm.log
```
Eval:
```bash
lm_eval --model local-completions --model_args "model=kimi-k2,base_url=http://0.0.0.0:22234/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=256" --tasks gsm8k --num_fewshot 8 2>&1 | tee /tmp/bradleyhd/eval.log
```
local-completions (model=kimi-k2,base_url=http://0.0.0.0:22234/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=256), gen_kwargs: (None), limit: None, num_fewshot: 8, batch_size: 1
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 8|exact_match|↑ |0.0091|± |0.0026|
| | |strict-match | 8|exact_match|↑ |0.0000|± |0.0000|
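Summary of the two runs (taken from the tables above); the only configuration difference is the VLLM_ROCM_USE_AITER_MLA flag:

|VLLM_ROCM_USE_AITER_MLA|gsm8k flexible-extract|gsm8k strict-match|
|-----------------------|---------------------:|-----------------:|
|0 (baseline) |0.9424 ± 0.0064|0.9424 ± 0.0064|
|1 (reproduce)|0.0091 ± 0.0026|0.0000 ± 0.0000|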
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
Using the amd-aiter 0.1.7.post2.dev18 pip package.
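To confirm the exact aiter build in the environment, a minimal sketch (pip metadata only; whether the imported module exposes a version attribute depends on the aiter release, hence the getattr fallback):

```bash
# Show the installed package version and install location.
pip show amd-aiter

# Best-effort import check; __version__ may not exist in every release.
python -c "import aiter; print(getattr(aiter, '__version__', 'unknown'))"
```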