[fix][acc] fix accuracy of fp8 attn weights model using ptpc quant recipe#670
Motivation
The Quark models amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 and amd/Kimi-K2-Thinking-MXFP4-AttnFP8 have FP8-weight linear layers in attention and adopt the PTPC (per-token, per-channel) quant recipe. However, the current code in ATOM forces block-scale quantization in `_fuse_rmsnorm_quant`. This PR fixes that issue.
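To illustrate why forcing the wrong granularity hurts accuracy, here is a minimal NumPy sketch (not ATOM's actual code) contrasting the two FP8 scaling granularities involved: per-token scaling, where each token row gets its own scale, versus block scaling, where each fixed-width block of columns shares a scale. The FP8 max below assumes the e4m3 format; function names are illustrative only.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 e4m3

def quant_per_token(x):
    """PTPC-style activation quant: one scale per token (row)."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / FP8_E4M3_MAX
    scale = np.maximum(scale, 1e-12)                # guard against all-zero rows
    q = np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale

def quant_block_scale(x, block=128):
    """Block-scale quant: one scale per contiguous block of `block` columns."""
    t, h = x.shape
    xb = x.reshape(t, h // block, block)
    scale = np.abs(xb).max(axis=-1, keepdims=True) / FP8_E4M3_MAX
    scale = np.maximum(scale, 1e-12)
    q = np.clip(xb / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q.reshape(t, h), scale.squeeze(-1)

x = np.random.randn(4, 256).astype(np.float32)
q_tok, s_tok = quant_per_token(x)
q_blk, s_blk = quant_block_scale(x)
assert s_tok.shape == (4, 1)   # one scale per token
assert s_blk.shape == (4, 2)   # one scale per 128-wide block
```

A model calibrated for per-token scales will see a different quantization error profile if the runtime silently applies block scales instead, which matches the accuracy drop observed below.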
Technical Details
`_fuse_rmsnorm_quant` should select the correct quant type based on the quant config/recipe. For per-token quant, a new kernel, `fused_qk_rmsnorm_per_token_quant`, was added in aiter; see ROCm/aiter#2958.
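The selection logic can be sketched as follows. This is a hypothetical simplification, not ATOM's actual code: the recipe keys and the block-scale kernel name are placeholders; only `fused_qk_rmsnorm_per_token_quant` is the real aiter kernel named above.

```python
def select_fused_rmsnorm_quant_kernel(quant_config: dict) -> str:
    """Pick the fused RMSNorm+quant kernel by recipe instead of
    unconditionally forcing the block-scale path."""
    recipe = quant_config.get("recipe")
    if recipe == "ptpc":
        # Per-token quant path, backed by the new aiter kernel
        # (ROCm/aiter#2958).
        return "fused_qk_rmsnorm_per_token_quant"
    if recipe == "block":
        # The previously hard-coded path (placeholder name).
        return "fused_qk_rmsnorm_block_scale_quant"
    raise ValueError(f"unsupported quant recipe: {recipe!r}")

assert (select_fused_rmsnorm_quant_kernel({"recipe": "ptpc"})
        == "fused_qk_rmsnorm_per_token_quant")
```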
Test Plan
gsm8k accuracy is validated with and without this PR on amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 and amd/Kimi-K2-Thinking-MXFP4-AttnFP8, under both ATOM and vLLM-ATOM.
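For reproducibility, the evaluation settings shown in the results below correspond to an lm-evaluation-harness invocation along these lines (a sketch assuming the model is already being served on port 8000; the exact command used is not stated in this PR):

```shell
lm_eval --model local-completions \
  --model_args model=/workspace/shared/data/amd_int/models/Kimi-K2-Thinking-MXFP4-AttnFP8,base_url=http://localhost:8000/v1/completions,num_concurrent=65,max_retries=3,tokenized_requests=False \
  --tasks gsm8k --num_fewshot 3 --batch_size 1
```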
Test Result
Main branch:
amd/Kimi-K2-Thinking-MXFP4-AttnFP8
ATOM:
local-completions ({'model': '/workspace/shared/data/amd_int/models/Kimi-K2-Thinking-MXFP4-AttnFP8', 'base_url': 'http://localhost:8000/v1/completions', 'num_concurrent': 65, 'max_retries': 3, 'tokenized_requests': False}), gen_kwargs: ({}), limit: None, num_fewshot: 3, batch_size: 1

|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|--------|-----:|--------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 3|exact_match|↑ |0.8529|± |0.0098|
| | |strict-match | 3|exact_match|↑ |0.8514|± |0.0098|

vLLM-ATOM:

local-completions ({'model': '/workspace/shared/data/amd_int/models/Kimi-K2-Thinking-MXFP4-AttnFP8', 'base_url': 'http://localhost:8000/v1/completions', 'num_concurrent': 65, 'max_retries': 3, 'tokenized_requests': False}), gen_kwargs: ({}), limit: None, num_fewshot: 3, batch_size: 1

|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|--------|-----:|--------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 3|exact_match|↑ |0.8491|± |0.0099|
| | |strict-match | 3|exact_match|↑ |0.8431|± |0.0100|

Both ATOM and vLLM-ATOM drop to ~0.85, which is lower than expected.
amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4
vLLM-ATOM:
local-completions ({'model': '/workspace/shared/data/amd_int/models/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4', 'base_url': 'http://localhost:8000/v1/completions', 'num_concurrent': 65, 'max_retries': 3, 'tokenized_requests': False}), gen_kwargs: ({}), limit: None, num_fewshot: 3, batch_size: 1

|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|--------|-----:|--------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 3|exact_match|↑ |0.9393|± |0.0066|
| | |strict-match | 3|exact_match|↑ |0.9363|± |0.0067|

This PR:
amd/Kimi-K2-Thinking-MXFP4-AttnFP8
ATOM:
local-completions ({'model': '/workspace/shared/data/amd_int/models/Kimi-K2-Thinking-MXFP4-AttnFP8', 'base_url': 'http://localhost:8000/v1/completions', 'num_concurrent': 65, 'max_retries': 3, 'tokenized_requests': False}), gen_kwargs: ({}), limit: None, num_fewshot: 3, batch_size: 1

|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|--------|-----:|--------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 3|exact_match|↑ |0.9333|± |0.0069|
| | |strict-match | 3|exact_match|↑ |0.9340|± |0.0068|

vLLM-ATOM:

local-completions ({'model': '/workspace/shared/data/amd_int/models/Kimi-K2-Thinking-MXFP4-AttnFP8', 'base_url': 'http://localhost:8000/v1/completions', 'num_concurrent': 65, 'max_retries': 3, 'tokenized_requests': False}), gen_kwargs: ({}), limit: None, num_fewshot: 3, batch_size: 1

|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|--------|-----:|--------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 3|exact_match|↑ |0.9318|± |0.0069|
| | |strict-match | 3|exact_match|↑ |0.9287|± |0.0071|

With this PR, both ATOM and vLLM-ATOM recover to ~0.93.
amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4
vLLM-ATOM:
local-completions ({'model': '/workspace/shared/data/amd_int/models/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4', 'base_url': 'http://localhost:8000/v1/completions', 'num_concurrent': 65, 'max_retries': 3, 'tokenized_requests': False}), gen_kwargs: ({}), limit: None, num_fewshot: 3, batch_size: 1

|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|--------|-----:|--------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 3|exact_match|↑ |0.9409|± |0.0065|
| | |strict-match | 3|exact_match|↑ |0.9401|± |0.0065|

For amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 there is no obvious accuracy drop even without this PR, but the changes here still apply to that model and do not hurt its accuracy.
Submission Checklist