
Conversation

@freeliuzc
Collaborator

Motivation

💡 If this PR is a cherry-pick, the PR title needs to follow the format: add the [Cherry-Pick] label at the very beginning and append the original PR ID at the end, for example [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code and run pre-commit before committing.
  • Add unit tests, or explain in this PR why none are needed.
  • Provide accuracy results.
  • If the current PR targets a release branch, make sure it has first been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings on December 10, 2025 03:03
@paddle-bot

paddle-bot bot commented Dec 10, 2025

Thanks for your contribution!

Copilot AI (Contributor) left a comment


Pull request overview

This cherry-pick PR fixes a bug in speculative decoding where split-KV attention misbehaves. The fix relocates the max_partition_size configuration from engine initialization to the AppendAttentionBackend class, where it is set based on whether speculative decoding is enabled.

Key Changes

  • Removes multimodal-specific max_partition_size setting from engine.py
  • Adds speculative decoding check in AppendAttentionBackend.__init__() to disable split KV by setting max_partition_size = self.max_seq_len
  • Centralizes the logic for handling split KV behavior in the attention backend (see the sketch below)
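
The kernel-level partitioning is not shown in this diff, but the intended effect of the workaround can be illustrated with a small sketch. Assuming the attention kernel chunks each sequence's KV cache into max_partition_size-sized partitions (num_kv_partitions below is an illustrative helper, not FastDeploy code):

import math

def num_kv_partitions(seq_len: int, max_partition_size: int) -> int:
    """Number of chunks one sequence's KV cache is split into."""
    return math.ceil(seq_len / max_partition_size)

# With the default FLAGS_max_partition_size of 1024, a 4096-token sequence
# is attended over 4 partitions. Setting max_partition_size = max_seq_len
# guarantees a single partition, i.e. split KV is effectively disabled,
# which is what this fix relies on when speculative decoding is enabled.
assert num_kv_partitions(4096, 1024) == 4
assert num_kv_partitions(4096, 4096) == 1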

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File: fastdeploy/model_executor/layers/attention/append_attn_backend.py
  Adds conditional logic to disable split KV (set max_partition_size to max_seq_len) when speculative decoding is enabled.
File: fastdeploy/engine/engine.py
  Removes the enable_mm-based max_partition_size override that was previously set at engine initialization.

Comment on lines +151 to +153

# split kv still has bug in speculative decoding
if self.speculative_method is not None:
    self.max_partition_size = self.max_seq_len

Copilot AI Dec 10, 2025


The PR description lacks essential information about the bug being fixed. According to the custom guidelines, the description should explain:

  1. What is the attention bug in speculative decoding?
  2. Why does split kv cause issues with speculative decoding?
  3. What problem does setting max_partition_size = self.max_seq_len solve?

Please add a more detailed explanation of the bug and the fix to help reviewers and future maintainers understand the changes.

self.head_dim: int = fd_config.model_config.head_dim
self.num_layers: int = fd_config.model_config.num_hidden_layers
self.max_partition_size: int = int(os.getenv("FLAGS_max_partition_size", 1024))
# split kv still has bug in speculative decoding

Copilot AI Dec 10, 2025


The comment "split kv still has bug in speculative decoding" is too vague. Consider providing more specific information about:

  1. What exactly is the bug (e.g., incorrect results, crashes, memory issues)?
  2. Reference to related issue or PR for tracking purposes
  3. Any temporary workaround implications

Example:

# WORKAROUND: Split KV causes incorrect attention outputs in speculative decoding
# when verifying draft tokens. Disable split KV by setting max_partition_size to max_seq_len.
# TODO: Fix the underlying bug in split KV logic for speculative decoding (Issue #XXXX)
Suggested change
- # split kv still has bug in speculative decoding
+ # WORKAROUND: Split KV causes incorrect attention outputs in speculative decoding,
+ # particularly when verifying draft tokens. To avoid this, disable split KV by setting
+ # max_partition_size to max_seq_len when speculative decoding is enabled.
+ # TODO: Fix the underlying bug in split KV logic for speculative decoding (see Issue #XXXX).

Comment on lines 483 to 485

if self.cfg.scheduler_config.splitwise_role == "prefill":
    variables["FLAGS_fmt_write_cache_completed_signal"] = 1


Copilot AI Dec 10, 2025


The file fastdeploy/engine/async_llm.py still contains similar code that was removed from engine.py:

if self.cfg.model_config.enable_mm:
    variables["FLAGS_max_partition_size"] = 1024

(lines 754-755)

For consistency, this code should also be removed from async_llm.py since the same logic is now handled in AppendAttentionBackend.__init__() where max_partition_size is set based on the speculative method. Otherwise, multimodal models may still override the speculative decoding fix.
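
As a hedged sketch of that cleanup (build_worker_env is a hypothetical name; the branch contents follow the snippets quoted in this comment):

def build_worker_env(cfg) -> dict:
    """Hypothetical helper mirroring the env-variable setup in async_llm.py."""
    variables = {}
    if cfg.scheduler_config.splitwise_role == "prefill":
        variables["FLAGS_fmt_write_cache_completed_signal"] = 1
    # Removed for consistency with engine.py: AppendAttentionBackend.__init__()
    # now decides max_partition_size, so this override would fight the fix.
    # if cfg.model_config.enable_mm:
    #     variables["FLAGS_max_partition_size"] = 1024
    return variables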

Comment on lines +151 to +152
# split kv still has bug in speculative decoding
if self.speculative_method is not None:

Copilot AI Dec 10, 2025


The condition if self.speculative_method is not None: might be too broad. According to the config, speculative_method can be "ngram_match", "mtp", or "hybrid_mtp_ngram". Consider:

  1. Verifying if this bug affects all speculative methods or only specific ones (e.g., just "mtp")
  2. Being more explicit about which methods are affected, for example:
if self.speculative_method in ["mtp", "hybrid_mtp_ngram"]:
    self.max_partition_size = self.max_seq_len

This makes the code more maintainable and self-documenting.

Suggested change
- # split kv still has bug in speculative decoding
- if self.speculative_method is not None:
+ # split kv still has bug in speculative decoding for some speculative methods
+ if self.speculative_method in ["mtp", "hybrid_mtp_ngram"]:
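
Taking the suggestion a step further, a hedged sketch of a self-documenting variant (SPLIT_KV_BROKEN_SPEC_METHODS and resolve_max_partition_size are hypothetical names, not FastDeploy code):

# Speculative methods known to hit the split-KV bug.
SPLIT_KV_BROKEN_SPEC_METHODS = ("mtp", "hybrid_mtp_ngram")

def resolve_max_partition_size(speculative_method, max_seq_len, default=1024):
    """Force a single KV partition for methods affected by the split-KV bug."""
    if speculative_method in SPLIT_KV_BROKEN_SPEC_METHODS:
        return max_seq_len
    return default

assert resolve_max_partition_size("mtp", 8192) == 8192
assert resolve_max_partition_size("ngram_match", 8192) == 1024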


Copilot AI Dec 10, 2025


The PR title format is incorrect. It should be:

  • Current: [Cherry-Pick][CI]Fix attention bug in spec decoding(#5460)
  • Expected: [Cherry-Pick][BugFix] Fix attention bug in spec decoding (#5460)

Issues:

  1. The tag should be [BugFix] not [CI] since this is fixing a bug, not a CI change
  2. There should be a space before the opening parenthesis in (#5460)
  3. Consider a more descriptive title like: [Cherry-Pick][BugFix] Fix split KV bug in speculative decoding (#5460)

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (release/online/20251131@f08fb25).

Additional details and impacted files
@@                    Coverage Diff                     @@
##             release/online/20251131    #5480   +/-   ##
==========================================================
  Coverage                           ?   59.04%           
==========================================================
  Files                              ?      319           
  Lines                              ?    38871           
  Branches                           ?     5843           
==========================================================
  Hits                               ?    22951           
  Misses                             ?    14092           
  Partials                           ?     1828           
Flag   Coverage Δ
GPU    59.04% <100.00%> (?)

Flags with carried forward coverage won't be shown.


@heavengate merged commit 6715196 into PaddlePaddle:release/online/20251131 on Dec 10, 2025
19 of 20 checks passed
rainyfly added a commit to rainyfly/FastDeploy that referenced this pull request Dec 16, 2025

Labels: none yet
Projects: none yet
4 participants