[fix][acc] fix accuracy of fp8 attn weights model using ptpc quant recipe#670

Open
gbyu-amd wants to merge 4 commits into main from
guanbao/fix_kimi_acc

Conversation

@gbyu-amd
Contributor

Motivation

The Quark models amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 and amd/Kimi-K2-Thinking-MXFP4-AttnFP8 have FP8-weight linear layers in attention and adopt the PTPC (per-token/per-channel) quant recipe. However, the current code in ATOM forces block-scale quantization in `_fuse_rmsnorm_quant`. This PR fixes that issue.
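For context, per-token quantization (the activation side of a PTPC recipe) computes one scale per token row, whereas block-scale quantization shares a scale across a fixed block of elements. A minimal NumPy reference sketch of the per-token case (illustrative only; real kernels quantize to an actual FP8 dtype rather than clipped floats):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # max representable magnitude of FP8 E4M3

def per_token_quant(x: np.ndarray):
    """Reference per-token quantization: one scale per row (token).

    Illustrative sketch, not ATOM's implementation. Returns the
    quantized values (still float here) and the per-row scales.
    """
    # One amax, hence one scale, per token row.
    amax = np.abs(x).max(axis=-1, keepdims=True)
    scale = np.maximum(amax, 1e-12) / FP8_E4M3_MAX
    q = np.clip(np.round(x / scale), -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale
```

Because each token gets its own scale, an outlier in one row does not degrade the quantization resolution of other rows, which is why forcing a block-scale path on a PTPC-calibrated checkpoint can hurt accuracy.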

Technical Details

`_fuse_rmsnorm_quant` should select the correct quant type based on the quant config/recipe. For per-token quant, a new kernel, `fused_qk_rmsnorm_per_token_quant`, is added in aiter; see ROCm/aiter#2958.
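The dispatch described above can be sketched as follows. The config shape and the block-scale kernel name are illustrative assumptions, not ATOM's actual API; only `fused_qk_rmsnorm_per_token_quant` comes from the aiter PR:

```python
from dataclasses import dataclass

@dataclass
class QuantConfig:
    # Hypothetical config field: "ptpc" (per-token/per-channel)
    # or "block" (block-scale). Real ATOM config may differ.
    recipe: str

def select_rmsnorm_quant_kernel(cfg: QuantConfig) -> str:
    """Pick the fused RMSNorm+quant kernel from the quant recipe,
    instead of unconditionally taking the block-scale path."""
    if cfg.recipe == "ptpc":
        # Per-token activation quant -> fused per-token kernel from aiter.
        return "fused_qk_rmsnorm_per_token_quant"
    # Default: block-scale path (name is a placeholder).
    return "fused_qk_rmsnorm_block_scale_quant"
```

The point is that the kernel choice follows the checkpoint's quant recipe rather than a hard-coded default, which is what this PR changes.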

Test Plan

GSM8K accuracy is validated with and without this PR on amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 and amd/Kimi-K2-Thinking-MXFP4-AttnFP8, with both ATOM and vLLM-ATOM.

Test Result

Main branch:

amd/Kimi-K2-Thinking-MXFP4-AttnFP8

ATOM:

local-completions ({'model': '/workspace/shared/data/amd_int/models/Kimi-K2-Thinking-MXFP4-AttnFP8', 'base_url': 'http://localhost:8000/v1/completions', 'num_concurrent': 65, 'max_retries': 3, 'tokenized_requests': False}), gen_kwargs: ({}), limit: None, num_fewshot: 3, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     3|exact_match||0.8529|±  |0.0098|
|     |       |strict-match    |     3|exact_match||0.8514|±  |0.0098|

vLLM-ATOM:

local-completions ({'model': '/workspace/shared/data/amd_int/models/Kimi-K2-Thinking-MXFP4-AttnFP8', 'base_url': 'http://localhost:8000/v1/completions', 'num_concurrent': 65, 'max_retries': 3, 'tokenized_requests': False}), gen_kwargs: ({}), limit: None, num_fewshot: 3, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     3|exact_match||0.8491|±  |0.0099|
|     |       |strict-match    |     3|exact_match||0.8431|±  |0.0100|

On the main branch, both ATOM and vLLM-ATOM drop to ~0.85, which is lower than expected.

amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4

vLLM-ATOM:

local-completions ({'model': '/workspace/shared/data/amd_int/models/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4', 'base_url': 'http://localhost:8000/v1/completions', 'num_concurrent': 65, 'max_retries': 3, 'tokenized_requests': False}), gen_kwargs: ({}), limit: None, num_fewshot: 3, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     3|exact_match||0.9393|±  |0.0066|
|     |       |strict-match    |     3|exact_match||0.9363|±  |0.0067|

This PR:

amd/Kimi-K2-Thinking-MXFP4-AttnFP8

ATOM:

local-completions ({'model': '/workspace/shared/data/amd_int/models/Kimi-K2-Thinking-MXFP4-AttnFP8', 'base_url': 'http://localhost:8000/v1/completions', 'num_concurrent': 65, 'max_retries': 3, 'tokenized_requests': False}), gen_kwargs: ({}), limit: None, num_fewshot: 3, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     3|exact_match||0.9333|±  |0.0069|
|     |       |strict-match    |     3|exact_match||0.9340|±  |0.0068|

vLLM-ATOM:

local-completions ({'model': '/workspace/shared/data/amd_int/models/Kimi-K2-Thinking-MXFP4-AttnFP8', 'base_url': 'http://localhost:8000/v1/completions', 'num_concurrent': 65, 'max_retries': 3, 'tokenized_requests': False}), gen_kwargs: ({}), limit: None, num_fewshot: 3, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     3|exact_match||0.9318|±  |0.0069|
|     |       |strict-match    |     3|exact_match||0.9287|±  |0.0071|

With this PR, both ATOM and vLLM-ATOM recover to ~0.93.

amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4

vLLM-ATOM:

local-completions ({'model': '/workspace/shared/data/amd_int/models/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4', 'base_url': 'http://localhost:8000/v1/completions', 'num_concurrent': 65, 'max_retries': 3, 'tokenized_requests': False}), gen_kwargs: ({}), limit: None, num_fewshot: 3, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     3|exact_match||0.9409|±  |0.0065|
|     |       |strict-match    |     3|exact_match||0.9401|±  |0.0065|

For amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4, there is no obvious accuracy drop even without this PR, but the code changes still apply to it and do not hurt its accuracy.

Submission Checklist
