[vllm-atom] Fix GLM-5 accuracy in vLLM plugin #669

Open
kliuae-amd wants to merge 1 commit into main from kliuae/plugin_fix_dsa
Conversation

@kliuae-amd
Contributor

Motivation

This PR fixes the accuracy drop of GLM-5 when running in vLLM-ATOM mode.

Technical Details

- Build the ragged layout in every layer.
- Use `top_k_per_row_prefill` from vLLM, which handles indexing more consistently across short and long context lengths.
- Combined with the bugfix for `cp_gather_indexer_k_quant_cache` in aiter (ROCm/aiter#2954), accuracy is restored to baseline.
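As background on the first bullet: a ragged (varlen) layout packs variable-length sequences back to back and addresses them through cumulative offsets instead of padding every sequence to a common length. The sketch below is illustrative only; the function name `build_ragged_layout` is an assumption for this example and not the actual vLLM or aiter API.

```python
# Illustrative sketch of a ragged (varlen) layout, not the PR's actual code:
# sequences are concatenated into one buffer, and cumulative offsets
# (often called cu_seqlens) mark where each request's tokens begin and end.

def build_ragged_layout(seq_lens):
    """Return cumulative start offsets for a batch of variable-length sequences."""
    cu_seqlens = [0]
    for n in seq_lens:
        cu_seqlens.append(cu_seqlens[-1] + n)
    return cu_seqlens

# Three requests of lengths 3, 5, and 2 packed into one 10-token buffer.
cu = build_ragged_layout([3, 5, 2])
print(cu)  # [0, 3, 8, 10]

# Tokens of request i occupy the half-open range [cu[i], cu[i + 1]).
start, end = cu[1], cu[2]
print(start, end)  # 3 8
```

Rebuilding these offsets in every layer (rather than reusing a stale layout) is the kind of per-layer bookkeeping the first bullet refers to.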

Test Plan

Accuracy test with lm_eval

Model: zai-org/GLM-5-FP8

Server command

ATOM_DISABLE_VLLM_PLUGIN=0 \
ATOM_DISABLE_VLLM_PLUGIN_ATTENTION=0 \
vllm serve zai-org/GLM-5-FP8 \
  -tp 8 \
  --gpu-memory-utilization 0.7 \
  --no-enable-prefix-caching \
  --disable-uvicorn-access-log \
  --trust-remote-code \
  --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
  --kv-cache-dtype fp8 \
  --block-size 1

lm_eval command

lm_eval --model local-completions \
  --model_args model=zai-org/GLM-5-FP8,base_url=http://localhost:8000/v1/completions,num_concurrent=64,tokenized_requests=False \
  --tasks gsm8k \
  --num_fewshot 20

Test Result

| Tasks | Version | Filter           | n-shot | Metric      | Value  |   | Stderr |
|-------|---------|------------------|--------|-------------|--------|---|--------|
| gsm8k | 3       | flexible-extract | 20     | exact_match | 0.9431 | ± | 0.0064 |
|       |         | strict-match     | 20     | exact_match | 0.9431 | ± | 0.0064 |
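As a quick plausibility check on the reported numbers (assuming the standard binomial standard error and gsm8k's 1,319-question test split; lm_eval's exact estimator may differ slightly):

```python
import math

# Assumed: exact_match is a per-question 0/1 outcome, so its standard error
# is approximately sqrt(p * (1 - p) / n) for accuracy p over n questions.
p, n = 0.9431, 1319  # reported accuracy; gsm8k test-split size
stderr = math.sqrt(p * (1 - p) / n)
print(round(stderr, 4))  # 0.0064
```

This matches the reported Stderr column, which is consistent with the result coming from a single full pass over the gsm8k test set.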

Submission Checklist

Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>