[Cherry-Pick][CI]Fix attention bug in spec decoding(#5460) #5480
Conversation
Thanks for your contribution!
Pull request overview
This cherry-pick PR fixes a bug in speculative decoding where split KV operations cause issues with attention. The fix relocates the max_partition_size configuration from the engine initialization to the AppendAttentionBackend class, where it's set based on whether speculative decoding is enabled.
Key Changes
- Removes the multimodal-specific `max_partition_size` setting from `engine.py`
- Adds a speculative decoding check in `AppendAttentionBackend.__init__()` to disable split KV by setting `max_partition_size = self.max_seq_len` (see the sketch after this list)
- Centralizes the logic for handling split KV behavior in the attention backend
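For orientation, here is a minimal, self-contained sketch of the relocated decision. The helper function is hypothetical; in the PR itself this logic sits inline in `AppendAttentionBackend.__init__()`:

```python
import os

# Hypothetical standalone sketch of the logic this PR moves into the
# attention backend; the real code lives inline in __init__().
def resolve_max_partition_size(speculative_method, max_seq_len):
    # The default still comes from the FLAGS_max_partition_size env var.
    max_partition_size = int(os.getenv("FLAGS_max_partition_size", 1024))
    # Split KV still has a bug in speculative decoding, so disable it by
    # letting a single partition cover the whole sequence.
    if speculative_method is not None:
        max_partition_size = max_seq_len
    return max_partition_size

# With speculative decoding enabled, split KV is effectively turned off.
print(resolve_max_partition_size("mtp", 8192))  # 8192
print(resolve_max_partition_size(None, 8192))   # 1024, unless the flag overrides it
```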
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| fastdeploy/model_executor/layers/attention/append_attn_backend.py | Adds conditional logic to disable split KV (set max_partition_size to max_seq_len) when speculative decoding is enabled |
| fastdeploy/engine/engine.py | Removes the enable_mm-based max_partition_size override that was previously set at engine initialization level |
```python
# split kv still has bug in speculative decoding
if self.speculative_method is not None:
    self.max_partition_size = self.max_seq_len
```
Copilot AI · Dec 10, 2025
The PR description lacks essential information about the bug being fixed. According to the custom guidelines, the description should explain:
- What is the attention bug in speculative decoding?
- Why does split kv cause issues with speculative decoding?
- What problem does setting `max_partition_size = self.max_seq_len` solve?
Please add a more detailed explanation of the bug and the fix to help reviewers and future maintainers understand the changes.
```python
self.head_dim: int = fd_config.model_config.head_dim
self.num_layers: int = fd_config.model_config.num_hidden_layers
self.max_partition_size: int = int(os.getenv("FLAGS_max_partition_size", 1024))
# split kv still has bug in speculative decoding
```
Copilot AI · Dec 10, 2025
The comment "split kv still has bug in speculative decoding" is too vague. Consider providing more specific information about:
- What exactly is the bug (e.g., incorrect results, crashes, memory issues)?
- Reference to related issue or PR for tracking purposes
- Any temporary workaround implications
Example:

```python
# WORKAROUND: Split KV causes incorrect attention outputs in speculative decoding
# when verifying draft tokens. Disable split KV by setting max_partition_size to max_seq_len.
# TODO: Fix the underlying bug in split KV logic for speculative decoding (Issue #XXXX)
```

Suggested change:

```diff
-# split kv still has bug in speculative decoding
+# WORKAROUND: Split KV causes incorrect attention outputs in speculative decoding,
+# particularly when verifying draft tokens. To avoid this, disable split KV by setting
+# max_partition_size to max_seq_len when speculative decoding is enabled.
+# TODO: Fix the underlying bug in split KV logic for speculative decoding (see Issue #XXXX).
```
```python
if self.cfg.scheduler_config.splitwise_role == "prefill":
    variables["FLAGS_fmt_write_cache_completed_signal"] = 1
```
Copilot AI · Dec 10, 2025
The file fastdeploy/engine/async_llm.py still contains similar code that was removed from engine.py:
```python
if self.cfg.model_config.enable_mm:
    variables["FLAGS_max_partition_size"] = 1024
```
(lines 754-755)
For consistency, this code should also be removed from async_llm.py since the same logic is now handled in AppendAttentionBackend.__init__() where max_partition_size is set based on the speculative method. Otherwise, multimodal models may still override the speculative decoding fix.
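For illustration only, a sketch of what the analogous cleanup in async_llm.py might look like. The wrapper function and `cfg` parameter are assumptions made to keep the snippet self-contained, not the file's actual structure:

```python
# Hypothetical sketch: assemble engine environment variables without the
# multimodal FLAGS_max_partition_size override, leaving the decision to
# AppendAttentionBackend.__init__().
def build_env_variables(cfg):
    variables = {}
    if cfg.scheduler_config.splitwise_role == "prefill":
        variables["FLAGS_fmt_write_cache_completed_signal"] = 1
    # The override quoted in the review comment,
    #     if cfg.model_config.enable_mm:
    #         variables["FLAGS_max_partition_size"] = 1024
    # would be removed here for consistency with engine.py.
    return variables
```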
```python
# split kv still has bug in speculative decoding
if self.speculative_method is not None:
```
Copilot AI · Dec 10, 2025
The condition `if self.speculative_method is not None:` might be too broad. According to the config, speculative_method can be "ngram_match", "mtp", or "hybrid_mtp_ngram". Consider:
- Verifying if this bug affects all speculative methods or only specific ones (e.g., just "mtp")
- Being more explicit about which methods are affected, for example:
```python
if self.speculative_method in ["mtp", "hybrid_mtp_ngram"]:
    self.max_partition_size = self.max_seq_len
```
This makes the code more maintainable and self-documenting.
Suggested change:

```diff
-# split kv still has bug in speculative decoding
-if self.speculative_method is not None:
+# split kv still has bug in speculative decoding for some speculative methods
+if self.speculative_method in ["mtp", "hybrid_mtp_ngram"]:
```
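If the bug is confirmed to be method-specific, here is a small hedged sketch of how the narrower condition would behave. The function is illustrative only; the method names come from the config values quoted above:

```python
# Illustrative comparison: with the narrower check, only MTP-based methods
# disable split KV; pure ngram matching keeps the default partition size.
def partition_size_narrow(speculative_method, max_seq_len, default_size=1024):
    if speculative_method in ["mtp", "hybrid_mtp_ngram"]:
        return max_seq_len
    return default_size

print(partition_size_narrow("mtp", 8192))          # 8192: split KV disabled
print(partition_size_narrow("ngram_match", 8192))  # 1024: split KV stays enabled
```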
```python
self.head_dim: int = fd_config.model_config.head_dim
self.num_layers: int = fd_config.model_config.num_hidden_layers
self.max_partition_size: int = int(os.getenv("FLAGS_max_partition_size", 1024))
# split kv still has bug in speculative decoding
```
Copilot AI · Dec 10, 2025
The PR title format is incorrect. It should be:
- Current: `[Cherry-Pick][CI]Fix attention bug in spec decoding(#5460)`
- Expected: `[Cherry-Pick][BugFix] Fix attention bug in spec decoding (#5460)`

Issues:
- The tag should be `[BugFix]`, not `[CI]`, since this is fixing a bug, not a CI change
- There should be a space before the opening parenthesis in `(#5460)`
- Consider a more descriptive title like: `[Cherry-Pick][BugFix] Fix split KV bug in speculative decoding (#5460)`
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

```
@@              Coverage Diff               @@
##    release/online/20251131    #5480   +/- ##
==========================================================
  Coverage          ?   59.04%
==========================================================
  Files             ?      319
  Lines             ?    38871
  Branches          ?     5843
==========================================================
  Hits              ?    22951
  Misses            ?    14092
  Partials          ?     1828
```

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
Merged commit 6715196 into PaddlePaddle:release/online/20251131.
This reverts commit 6715196.
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- Add an appropriate tag to the PR title, chosen from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run `pre-commit` before commit.
- When submitting to a `release` branch, make sure the PR has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.