[Cherry-Pick][CI]Fix attention bug in spec decoding(#5460) #5480
Conversation
Thanks for your contribution!
Pull request overview
This cherry-pick PR fixes a bug in speculative decoding where split KV operations cause issues with attention. The fix relocates the max_partition_size configuration from the engine initialization to the AppendAttentionBackend class, where it's set based on whether speculative decoding is enabled.
Key Changes
- Removes the multimodal-specific `max_partition_size` setting from `engine.py`
- Adds a speculative decoding check in `AppendAttentionBackend.__init__()` to disable split KV by setting `max_partition_size = self.max_seq_len` (see the sketch after this list)
- Centralizes the logic for handling split KV behavior in the attention backend
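For orientation, here is a minimal, self-contained sketch of the relocated decision. The helper function is hypothetical; in the PR itself this logic sits inline in `AppendAttentionBackend.__init__()`:

```python
import os

# Hypothetical standalone sketch of the logic this PR moves into the
# attention backend; the real code lives inline in __init__().
def resolve_max_partition_size(speculative_method, max_seq_len):
    # The default still comes from the FLAGS_max_partition_size env var.
    max_partition_size = int(os.getenv("FLAGS_max_partition_size", 1024))
    # Split KV still has a bug in speculative decoding, so disable it by
    # letting a single partition cover the whole sequence.
    if speculative_method is not None:
        max_partition_size = max_seq_len
    return max_partition_size

# With speculative decoding enabled, split KV is effectively turned off.
print(resolve_max_partition_size("mtp", 8192))  # 8192
print(resolve_max_partition_size(None, 8192))   # 1024, unless the flag overrides it
```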
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| fastdeploy/model_executor/layers/attention/append_attn_backend.py | Adds conditional logic to disable split KV (set max_partition_size to max_seq_len) when speculative decoding is enabled |
| fastdeploy/engine/engine.py | Removes the enable_mm-based max_partition_size override that was previously set at engine initialization level |
```python
# split kv still has bug in speculative decoding
if self.speculative_method is not None:
    self.max_partition_size = self.max_seq_len
```
Copilot AI · Dec 10, 2025
The PR description lacks essential information about the bug being fixed. According to the custom guidelines, the description should explain:
- What is the attention bug in speculative decoding?
- Why does split kv cause issues with speculative decoding?
- What problem does setting `max_partition_size = self.max_seq_len` solve?
Please add a more detailed explanation of the bug and the fix to help reviewers and future maintainers understand the changes.
```python
self.head_dim: int = fd_config.model_config.head_dim
self.num_layers: int = fd_config.model_config.num_hidden_layers
self.max_partition_size: int = int(os.getenv("FLAGS_max_partition_size", 1024))
# split kv still has bug in speculative decoding
```
Copilot AI · Dec 10, 2025
The comment "split kv still has bug in speculative decoding" is too vague. Consider providing more specific information about:
- What exactly is the bug (e.g., incorrect results, crashes, memory issues)?
- Reference to related issue or PR for tracking purposes
- Any temporary workaround implications
Example:

```python
# WORKAROUND: Split KV causes incorrect attention outputs in speculative decoding
# when verifying draft tokens. Disable split KV by setting max_partition_size to max_seq_len.
# TODO: Fix the underlying bug in split KV logic for speculative decoding (Issue #XXXX)
```

Suggested change:

```diff
-# split kv still has bug in speculative decoding
+# WORKAROUND: Split KV causes incorrect attention outputs in speculative decoding,
+# particularly when verifying draft tokens. To avoid this, disable split KV by setting
+# max_partition_size to max_seq_len when speculative decoding is enabled.
+# TODO: Fix the underlying bug in split KV logic for speculative decoding (see Issue #XXXX).
```
```python
if self.cfg.scheduler_config.splitwise_role == "prefill":
    variables["FLAGS_fmt_write_cache_completed_signal"] = 1
```
Copilot AI · Dec 10, 2025
The file fastdeploy/engine/async_llm.py still contains similar code that was removed from engine.py:
```python
if self.cfg.model_config.enable_mm:
    variables["FLAGS_max_partition_size"] = 1024
```
(lines 754-755)
For consistency, this code should also be removed from async_llm.py since the same logic is now handled in AppendAttentionBackend.__init__() where max_partition_size is set based on the speculative method. Otherwise, multimodal models may still override the speculative decoding fix.
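For illustration only, a sketch of what the analogous cleanup in async_llm.py might look like. The wrapper function and `cfg` parameter are assumptions made to keep the snippet self-contained, not the file's actual structure:

```python
# Hypothetical sketch: assemble engine environment variables without the
# multimodal FLAGS_max_partition_size override, leaving the decision to
# AppendAttentionBackend.__init__().
def build_env_variables(cfg):
    variables = {}
    if cfg.scheduler_config.splitwise_role == "prefill":
        variables["FLAGS_fmt_write_cache_completed_signal"] = 1
    # The override quoted in the review comment,
    #     if cfg.model_config.enable_mm:
    #         variables["FLAGS_max_partition_size"] = 1024
    # would be removed here for consistency with engine.py.
    return variables
```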
```python
# split kv still has bug in speculative decoding
if self.speculative_method is not None:
```
Copilot AI · Dec 10, 2025
The condition `if self.speculative_method is not None:` might be too broad. According to the config, speculative_method can be "ngram_match", "mtp", or "hybrid_mtp_ngram". Consider:
- Verifying if this bug affects all speculative methods or only specific ones (e.g., just "mtp")
- Being more explicit about which methods are affected, for example:
```python
if self.speculative_method in ["mtp", "hybrid_mtp_ngram"]:
    self.max_partition_size = self.max_seq_len
```
This makes the code more maintainable and self-documenting.
Suggested change:

```diff
-# split kv still has bug in speculative decoding
-if self.speculative_method is not None:
+# split kv still has bug in speculative decoding for some speculative methods
+if self.speculative_method in ["mtp", "hybrid_mtp_ngram"]:
```
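If the bug is confirmed to be method-specific, here is a small hedged sketch of how the narrower condition would behave. The function is illustrative only; the method names come from the config values quoted above:

```python
# Illustrative comparison: with the narrower check, only MTP-based methods
# disable split KV; pure ngram matching keeps the default partition size.
def partition_size_narrow(speculative_method, max_seq_len, default_size=1024):
    if speculative_method in ["mtp", "hybrid_mtp_ngram"]:
        return max_seq_len
    return default_size

print(partition_size_narrow("mtp", 8192))          # 8192: split KV disabled
print(partition_size_narrow("ngram_match", 8192))  # 1024: split KV stays enabled
```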
```python
self.head_dim: int = fd_config.model_config.head_dim
self.num_layers: int = fd_config.model_config.num_hidden_layers
self.max_partition_size: int = int(os.getenv("FLAGS_max_partition_size", 1024))
# split kv still has bug in speculative decoding
```
Copilot AI · Dec 10, 2025
The PR title format is incorrect. It should be:
- Current: `[Cherry-Pick][CI]Fix attention bug in spec decoding(#5460)`
- Expected: `[Cherry-Pick][BugFix] Fix attention bug in spec decoding (#5460)`

Issues:
- The tag should be `[BugFix]`, not `[CI]`, since this is fixing a bug, not a CI change
- There should be a space before the opening parenthesis in `(#5460)`
- Consider a more descriptive title like: `[Cherry-Pick][BugFix] Fix split KV bug in speculative decoding (#5460)`
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

```
@@              Coverage Diff               @@
##    release/online/20251131    #5480   +/- ##
==========================================================
  Coverage          ?   59.04%
==========================================================
  Files             ?      319
  Lines             ?    38871
  Branches          ?     5843
==========================================================
  Hits              ?    22951
  Misses            ?    14092
  Partials          ?     1828
```

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
Merged commit 6715196 into PaddlePaddle:release/online/20251131.
This reverts commit 6715196.
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- Add an appropriate tag to the PR title, chosen from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run `pre-commit` before commit.
- When submitting to a `release` branch, make sure the PR has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.