
Conversation

@freeliuzc
Collaborator

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least one tag to the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code and run pre-commit before committing.
  • Add unit tests, or explain in this PR why no unit tests are included.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings December 10, 2025 03:04

paddle-bot bot commented Dec 10, 2025

Thanks for your contribution!

Copilot AI (Contributor) left a comment

Pull request overview

This cherry-pick PR fixes an attention bug in speculative decoding by disabling KV cache partitioning when speculative decoding is enabled. The fix sets max_partition_size to max_seq_len to work around a bug in split-KV operations during speculative decoding, and removes the hardcoded FLAGS_max_partition_size setting for multimodal models from engine.py.

Key Changes:

  • Adds a workaround in AppendAttentionBackend that disables KV partitioning when speculative decoding is active (a minimal sketch follows this list)
  • Removes hardcoded FLAGS_max_partition_size for multimodal models from engine.py
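
A minimal, self-contained sketch of what the workaround does, using the names that appear in the diff excerpts quoted later in this review; the concrete values of max_seq_len and speculative_method here are only illustrative:

import os

# Illustrative values; in the backend these come from fd_config.
max_seq_len = 8192
speculative_method = "mtp"  # any non-None value means speculative decoding is enabled

# Default partition size is read from the FLAGS_max_partition_size env var (default 1024).
max_partition_size = int(os.getenv("FLAGS_max_partition_size", 1024))

# Workaround: split KV is still buggy under speculative decoding, so use a single
# partition covering the whole sequence, which effectively disables partitioning.
if speculative_method is not None:
    max_partition_size = max_seq_len

print(max_partition_size)  # -> 8192 while speculative decoding is on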

Note on PR Description: The PR description lacks detail about what the bug was and how these changes fix it. According to the coding guidelines, the description should at minimum explain why these modifications are being made and what problem is being solved. For a bug fix PR, it would be helpful to include:

  • A description of the bug symptoms
  • Root cause analysis
  • Why disabling KV partitioning resolves the issue
  • Any known performance implications

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File | Description
fastdeploy/model_executor/layers/attention/append_attn_backend.py | Adds conditional logic to disable KV cache partitioning (by setting max_partition_size to max_seq_len) when speculative decoding is enabled, working around a bug in split KV operations
fastdeploy/engine/engine.py | Removes the hardcoded FLAGS_max_partition_size=1024 setting for multimodal models, allowing it to be controlled via environment variable or backend-specific logic

self.head_dim: int = fd_config.model_config.head_dim
self.num_layers: int = fd_config.model_config.num_hidden_layers
self.max_partition_size: int = int(os.getenv("FLAGS_max_partition_size", 1024))
# split kv still has bug in speculative decoding
Copilot AI Dec 10, 2025

The comment states "split kv still has bug in speculative decoding", but doesn't provide details about what the bug is or reference any issue/ticket. Consider adding more context:

  • What is the specific bug?
  • Is there a tracking issue?
  • When can this workaround be removed?

Example:

# TODO: Remove this workaround once split KV bug is fixed in speculative decoding (issue #XXXX)
# Currently, split KV causes [describe the issue] when speculative_method is enabled
Suggested change:
- # split kv still has bug in speculative decoding
+ # TODO: Remove this workaround once split KV bug is fixed in speculative decoding (see issue #XXXX)
+ # Currently, enabling split KV with speculative decoding causes incorrect attention results due to misaligned key/value cache updates.

Comment on lines 483 to 485
if self.cfg.scheduler_config.splitwise_role == "prefill":
    variables["FLAGS_fmt_write_cache_completed_signal"] = 1

Copilot AI Dec 10, 2025

The logic that sets FLAGS_max_partition_size = 1024 for multimodal models (enable_mm), removed from engine.py in this PR, should also be removed from fastdeploy/engine/async_llm.py at lines 754-755 for consistency. Currently, the same conditional logic still exists there:

if self.cfg.model_config.enable_mm:
    variables["FLAGS_max_partition_size"] = 1024

This inconsistency means multimodal models will have different behavior depending on whether they use engine.py or async_llm.py.
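
To make the divergence concrete, here is a small hypothetical sketch; worker_env_vars and effective_partition_size are invented stand-ins for the env-var assembly in engine.py/async_llm.py and for how the backend reads the flag, and are not FastDeploy APIs:

import os

def worker_env_vars(enable_mm, keep_mm_override):
    # Hypothetical condensation of the env-var assembly in engine.py / async_llm.py.
    variables = {}
    if keep_mm_override and enable_mm:
        variables["FLAGS_max_partition_size"] = 1024  # branch still present in async_llm.py
    return variables

def effective_partition_size(variables):
    # Hypothetical view of what the attention backend ends up reading.
    merged = {**os.environ, **{k: str(v) for k, v in variables.items()}}
    return int(merged.get("FLAGS_max_partition_size", 1024))

os.environ["FLAGS_max_partition_size"] = "2048"  # user-provided value
print(effective_partition_size(worker_env_vars(True, keep_mm_override=False)))  # 2048 (engine.py path)
print(effective_partition_size(worker_env_vars(True, keep_mm_override=True)))   # 1024 (async_llm.py path)

Removing the branch from async_llm.py as well would let both entry points defer to the environment variable and the backend-specific logic.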

Comment on lines +152 to +153
if self.speculative_method is not None:
    self.max_partition_size = self.max_seq_len
Copilot AI Dec 10, 2025

The new conditional logic for speculative decoding (lines 152-153) lacks test coverage. While test_append_attention.py has comprehensive tests for the attention backend, none of them verify the behavior when speculative_method is not None.

Consider adding a test case (a rough sketch follows this list) that:

  1. Creates an AppendAttentionBackend with speculative decoding enabled (e.g., method="mtp")
  2. Verifies that max_partition_size is set to max_seq_len instead of the default or environment variable value
  3. Confirms the attention operations work correctly with this configuration
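
A rough pytest sketch along these lines; the _BackendStandIn class below only reproduces the quoted initialization lines and would be replaced by AppendAttentionBackend constructed from the existing fixtures in test_append_attention.py (whose exact constructor arguments are not shown in this review), and point 3 would still require the full backend:

import os
import pytest

class _BackendStandIn:
    # Minimal stand-in for the quoted initialization; swap in the real AppendAttentionBackend
    # plus its fd_config fixture when adding this to test_append_attention.py.
    def __init__(self, max_seq_len, speculative_method):
        self.max_seq_len = max_seq_len
        self.speculative_method = speculative_method
        self.max_partition_size = int(os.getenv("FLAGS_max_partition_size", 1024))
        if self.speculative_method is not None:
            self.max_partition_size = self.max_seq_len

@pytest.mark.parametrize("env_value", ["512", "2048"])
def test_partitioning_disabled_with_speculative_decoding(monkeypatch, env_value):
    monkeypatch.setenv("FLAGS_max_partition_size", env_value)
    backend = _BackendStandIn(max_seq_len=8192, speculative_method="mtp")
    # With speculative decoding enabled, max_partition_size must track max_seq_len,
    # not the FLAGS_max_partition_size environment variable.
    assert backend.max_partition_size == backend.max_seq_len

def test_env_var_respected_without_speculative_decoding(monkeypatch):
    monkeypatch.setenv("FLAGS_max_partition_size", "512")
    backend = _BackendStandIn(max_seq_len=8192, speculative_method=None)
    assert backend.max_partition_size == 512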

self.head_dim: int = fd_config.model_config.head_dim
self.num_layers: int = fd_config.model_config.num_hidden_layers
self.max_partition_size: int = int(os.getenv("FLAGS_max_partition_size", 1024))
# split kv still has bug in speculative decoding
Copilot AI Dec 10, 2025

Setting max_partition_size = self.max_seq_len when speculative decoding is enabled effectively disables KV cache partitioning, which may have performance implications for long sequences.

Consider documenting:

  1. The expected performance impact (if any)
  2. Whether this workaround is acceptable for production use
  3. A timeline or plan for fixing the underlying bug so partitioning can be re-enabled

This will help users understand the trade-offs when using speculative decoding with long sequences.
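
As a rough back-of-the-envelope illustration of that trade-off (the ceil-division partition count below is a generic estimate of split-KV behavior, not necessarily the exact formula used by the append-attention kernel):

import math

def estimated_kv_partitions(seq_len, max_partition_size):
    # Generic estimate of how many KV partitions a sequence would be split into.
    return math.ceil(seq_len / max_partition_size)

seq_len = 16384
print(estimated_kv_partitions(seq_len, 1024))     # ~16 partitions processed in parallel
print(estimated_kv_partitions(seq_len, seq_len))  # 1 partition: split-KV parallelism is lost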

Suggested change:
- # split kv still has bug in speculative decoding
+ # NOTE: Disabling KV cache partitioning for speculative decoding.
+ #
+ # Setting `self.max_partition_size = self.max_seq_len` when speculative decoding is enabled
+ # effectively disables KV cache partitioning. This is a workaround for a known bug with split KV
+ # in speculative decoding.
+ #
+ # Performance impact:
+ # - For long sequences, this may significantly increase memory usage and reduce performance,
+ #   as the entire KV cache is handled as a single partition.
+ # - Users may experience slower inference and higher resource consumption for long inputs.
+ #
+ # Production suitability:
+ # - This workaround is NOT recommended for production use with long sequences.
+ # - For short sequences or development/testing, the impact may be acceptable.
+ #
+ # Plan for fix:
+ # - The underlying bug with split KV in speculative decoding is tracked internally.
+ # - Re-enabling partitioning will be prioritized in a future release once the bug is resolved.
+ # - Please monitor the project release notes or issue tracker for updates.

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (release/2.4@e9174f2). Learn more about missing BASE report.

Additional details and impacted files
@@              Coverage Diff               @@
##             release/2.4    #5481   +/-   ##
==============================================
  Coverage               ?   58.21%           
==============================================
  Files                  ?      327           
  Lines                  ?    40616           
  Branches               ?     6165           
==============================================
  Hits                   ?    23643           
  Misses                 ?    15133           
  Partials               ?     1840           
Flag | Coverage Δ
GPU | 58.21% <100.00%> (?)

Flags with carried forward coverage won't be shown.


heavengate merged commit c5c43e3 into PaddlePaddle:release/2.4 on Dec 10, 2025 (19 of 21 checks passed).
