[Cherry-Pick][CI] Fix attention bug in spec decoding(#5460) #5481
Conversation
Thanks for your contribution!
Pull request overview
This cherry-pick PR fixes an attention bug in speculative decoding by disabling KV cache partitioning when speculative decoding is enabled. The fix involves setting max_partition_size to max_seq_len to work around a bug in split KV operations during speculative decoding, and removes the hardcoded FLAGS_max_partition_size setting for multimodal models from engine.py.
Key Changes:
- Adds a workaround in `AppendAttentionBackend` that disables KV partitioning when speculative decoding is active
- Removes the hardcoded `FLAGS_max_partition_size` for multimodal models from engine.py
Note on PR Description: The PR description lacks detail about what the bug was and how these changes fix it. According to the coding guidelines, the description should at minimum explain why these modifications are being made and what problem is being solved. For a bug fix PR, it would be helpful to include:
- A description of the bug symptoms
- Root cause analysis
- Why disabling KV partitioning resolves the issue
- Any known performance implications
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| fastdeploy/model_executor/layers/attention/append_attn_backend.py | Adds conditional logic to disable KV cache partitioning (by setting max_partition_size to max_seq_len) when speculative decoding is enabled, working around a bug in split KV operations |
| fastdeploy/engine/engine.py | Removes hardcoded FLAGS_max_partition_size=1024 setting for multimodal models, allowing it to be controlled via environment variable or backend-specific logic |
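With the hardcoded override gone, the value falls back to the environment variable that the backend already reads via `os.getenv` (see the snippet below). A minimal sketch, assuming a deployment that wants to reproduce the old multimodal default explicitly; the launch flow itself is not part of this PR:

```python
import os

# Reproduce the value engine.py used to force for multimodal models, but now as an
# explicit, user-controlled environment variable read by the attention backend.
os.environ.setdefault("FLAGS_max_partition_size", "1024")
```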
```python
self.head_dim: int = fd_config.model_config.head_dim
self.num_layers: int = fd_config.model_config.num_hidden_layers
self.max_partition_size: int = int(os.getenv("FLAGS_max_partition_size", 1024))
# split kv still has bug in speculative decoding
```
Copilot AI (Dec 10, 2025)
The comment states "split kv still has bug in speculative decoding", but doesn't provide details about what the bug is or reference any issue/ticket. Consider adding more context:
- What is the specific bug?
- Is there a tracking issue?
- When can this workaround be removed?
Example:
```python
# TODO: Remove this workaround once split KV bug is fixed in speculative decoding (issue #XXXX)
# Currently, split KV causes [describe the issue] when speculative_method is enabled
```
Suggested change:
```diff
-# split kv still has bug in speculative decoding
+# TODO: Remove this workaround once split KV bug is fixed in speculative decoding (see issue #XXXX)
+# Currently, enabling split KV with speculative decoding causes incorrect attention results due to misaligned key/value cache updates.
```
```python
if self.cfg.scheduler_config.splitwise_role == "prefill":
    variables["FLAGS_fmt_write_cache_completed_signal"] = 1
```
Copilot AI (Dec 10, 2025)
The removed logic that sets FLAGS_max_partition_size = 1024 for multimodal models (enable_mm) in engine.py should also be removed from fastdeploy/engine/async_llm.py at lines 754-755 for consistency. Currently, the same conditional logic still exists there:
```python
if self.cfg.model_config.enable_mm:
    variables["FLAGS_max_partition_size"] = 1024
```
This inconsistency means multimodal models will have different behavior depending on whether they use engine.py or async_llm.py.
```python
if self.speculative_method is not None:
    self.max_partition_size = self.max_seq_len
```
Copilot AI (Dec 10, 2025)
The new conditional logic for speculative decoding (lines 152-153) lacks test coverage. While test_append_attention.py has comprehensive tests for the attention backend, none of them verify the behavior when speculative_method is not None.
Consider adding a test case that:
- Creates an AppendAttentionBackend with speculative decoding enabled (e.g., method="mtp")
- Verifies that
max_partition_sizeis set tomax_seq_leninstead of the default or environment variable value - Confirms the attention operations work correctly with this configuration
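A minimal sketch of such a test, exercising only the partition-size selection logic shown in this diff; the helper below is hypothetical and does not construct a real `AppendAttentionBackend`:

```python
import os


def resolve_max_partition_size(speculative_method, max_seq_len):
    """Hypothetical mirror of the backend's partition-size selection."""
    max_partition_size = int(os.getenv("FLAGS_max_partition_size", 1024))
    if speculative_method is not None:
        # split kv still has bug in speculative decoding
        max_partition_size = max_seq_len
    return max_partition_size


def test_speculative_decoding_disables_kv_partitioning():
    assert resolve_max_partition_size("mtp", 8192) == 8192


def test_default_partition_size_without_speculation(monkeypatch):
    monkeypatch.delenv("FLAGS_max_partition_size", raising=False)
    assert resolve_max_partition_size(None, 8192) == 1024
```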
```python
self.head_dim: int = fd_config.model_config.head_dim
self.num_layers: int = fd_config.model_config.num_hidden_layers
self.max_partition_size: int = int(os.getenv("FLAGS_max_partition_size", 1024))
# split kv still has bug in speculative decoding
```
Copilot AI (Dec 10, 2025)
Setting max_partition_size = self.max_seq_len when speculative decoding is enabled effectively disables KV cache partitioning, which may have performance implications for long sequences.
Consider documenting:
- The expected performance impact (if any)
- Whether this workaround is acceptable for production use
- A timeline or plan for fixing the underlying bug so partitioning can be re-enabled
This will help users understand the trade-offs when using speculative decoding with long sequences.
Suggested change:
```diff
-# split kv still has bug in speculative decoding
+# NOTE: Disabling KV cache partitioning for speculative decoding.
+#
+# Setting `self.max_partition_size = self.max_seq_len` when speculative decoding is enabled
+# effectively disables KV cache partitioning. This is a workaround for a known bug with split KV
+# in speculative decoding.
+#
+# Performance impact:
+# - For long sequences, this may significantly increase memory usage and reduce performance,
+#   as the entire KV cache is handled as a single partition.
+# - Users may experience slower inference and higher resource consumption for long inputs.
+#
+# Production suitability:
+# - This workaround is NOT recommended for production use with long sequences.
+# - For short sequences or development/testing, the impact may be acceptable.
+#
+# Plan for fix:
+# - The underlying bug with split KV in speculative decoding is tracked internally.
+# - Re-enabling partitioning will be prioritized in a future release once the bug is resolved.
+# - Please monitor the project release notes or issue tracker for updates.
```
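As a rough, hypothetical illustration of the trade-off being discussed (the exact split-KV scheduling is not shown in this PR), the number of KV partitions per sequence scales with `seq_len / max_partition_size`, so forcing `max_partition_size = max_seq_len` collapses it to one:

```python
def num_kv_partitions(seq_len: int, max_partition_size: int) -> int:
    # Ceiling division: how many partitions split KV would process for this sequence.
    return -(-seq_len // max_partition_size)


seq_len = 32768
print(num_kv_partitions(seq_len, 1024))     # 32 with the default FLAGS_max_partition_size
print(num_kv_partitions(seq_len, seq_len))  # 1 once speculative decoding forces max_seq_len
```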
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files:
```
@@            Coverage Diff             @@
##           release/2.4    #5481   +/- ##
==============================================
  Coverage             ?   58.21%
==============================================
  Files                ?      327
  Lines                ?    40616
  Branches             ?     6165
==============================================
  Hits                 ?    23643
  Misses               ?    15133
  Partials             ?     1840
==============================================
```
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- PR title tag (choose from): [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run `pre-commit` before commit.
- If the PR targets a `release` branch, make sure the PR has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.