
Conversation

@freeliuzc (Collaborator)

…ed and PD-split modes (#5738)

  • fix attn_mask_offset in mtp with multi-step and pd-split-mode

  • fix xpu operator register

  • update pmtp multi-step mtp strategy in pd-split mode

  • add note

  • fix xpu register

Motivation

💡 If this PR is a cherry-pick, the PR title needs to follow the format: add the [Cherry-Pick] label at the very beginning and append the original PR ID at the end, for example [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code; run pre-commit before committing.
  • Add unit tests; if none are added, explain why in this PR.
  • Provide accuracy results.
  • If this PR targets a release branch, make sure it has first been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings (December 26, 2025, 07:37)
@paddle-bot (bot) commented Dec 26, 2025

Thanks for your contribution!

Copilot AI (Contributor) left a comment


Pull request overview

This is a cherry-pick of PR #5738 that fixes the attn_mask_offset calculation for multi-step MTP (Multi-Token Prediction speculative decoding) in mixed and PD-split (prefill-decode split) modes. The fix introduces two new parameters, mask_rollback and recompute_token_num, to properly track and manage attention mask offsets when tokens are recomputed in speculative decoding; a rough sketch of this bookkeeping follows the key changes below.

Key Changes:

  • Adds mask_rollback and recompute_token_num tracking for proper attention mask offset calculation in multi-step scenarios
  • Updates CUDA kernels and Python implementation to handle dynamic recomputation strategies, particularly for PD-split mode
  • Improves test structure by adding unittest.TestCase inheritance and simplifying assertions
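To make the offset bookkeeping concrete, here is a minimal illustrative sketch. It is not the actual kernel logic: only the mask_rollback and recompute_token_num names come from this PR, and the update rule is an assumed reading of the description above, not verified kernel math.

    # Illustrative only: per-request update of the attention mask offset.
    # mask_rollback and recompute_token_num are the tensors added by this
    # PR; the arithmetic is an assumption based on the PR description.
    def next_attn_mask_offset(offset: int,
                              accept_num: int,
                              recompute_token_num: int,
                              mask_rollback: int) -> int:
        # Tokens the next step will recompute are not yet covered by the
        # mask, and PD-split requests roll the offset back one extra step.
        return offset + accept_num - recompute_token_num - mask_rollback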

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File Description
  • custom_ops/gpu_ops/speculate_decoding/draft_model/draft_model_preprocess.cu: adds mask_rollback and recompute_token_num parameters to the CUDA kernel and implements logic to track recomputed tokens and update mask offsets; includes code formatting improvements
  • custom_ops/gpu_ops/cpp_extensions.cc: updates the function signature to include the new mask_rollback and recompute_token_num parameters
  • fastdeploy/spec_decode/mtp.py: initializes new model input tensors for mask rollback and recompute tracking; sets special values for PD-split mode, where prefill-decode coordination requires a one-step rollback
  • tests/operators/test_draft_model_preprocess.py: adds the new parameters to test functions, improves the test class with unittest.TestCase inheritance, and simplifies assertion logic to loop through all results

-        np.testing.assert_allclose(results1[12], results2[12])  # accept_num
-        np.testing.assert_allclose(results1[7], results2[7])  # not_need_stop
+        for i in range(12):
+            np.testing.assert_equal(results1[i].numpy(), results2[i].numpy())

Copilot AI Dec 26, 2025


Using np.testing.assert_equal for all 12 results may be too strict. The original code used assert_allclose for most comparisons, which allows for small floating-point differences. If any of the results contain floating-point values (like logits or probabilities), this change could cause tests to fail due to numerical precision issues. Consider using assert_allclose for floating-point tensors and assert_equal for integer/boolean tensors.

Suggested change
-            np.testing.assert_equal(results1[i].numpy(), results2[i].numpy())
+            arr1 = results1[i].numpy()
+            arr2 = results2[i].numpy()
+            if np.issubdtype(arr1.dtype, np.floating):
+                np.testing.assert_allclose(arr1, arr2)
+            else:
+                np.testing.assert_equal(arr1, arr2)



-class TestDraftModelPreprocess:
+class TestDraftModelPreprocess(unittest.TestCase):

Copilot AI Dec 26, 2025


Adding unittest.TestCase as the base class is a good improvement: it ensures proper test discovery and integration with the unittest framework (a minimal sketch follows). However, this change doesn't appear to be related to the PR's main purpose (fixing attn_mask_offset). Consider mentioning this improvement in the PR description or making it a separate cleanup commit.
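For context, a minimal sketch of the discovery behavior (class and test names here are illustrative, not from the PR):

    import unittest

    class TestDraftModelPreprocess(unittest.TestCase):
        def test_smoke(self):
            # test_* methods on a unittest.TestCase subclass are collected
            # automatically by `python -m unittest` (and by pytest); a plain
            # class relies on runner-specific conventions instead.
            self.assertTrue(True)

    if __name__ == "__main__":
        unittest.main()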

Comment on lines +359 to +360
+        mask_rollback = paddle.zeros([bsz], dtype="int32")
+        recompute_token_num = paddle.zeros([bsz], dtype="int32")

Copilot AI Dec 26, 2025


The test initializes recompute_token_num to zeros, but in the actual implementation (mtp.py) it is initialized to self.num_model_steps - 1. This discrepancy means the test isn't fully exercising the behavior when recompute_token_num has a non-zero initial value. Consider testing both scenarios, as sketched below: an initial value of 0 (for PD-split mode) and an initial value of num_model_steps - 1 (for normal multi-step MTP).
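A sketch of how both initializations could be exercised; the helper name below is hypothetical, while recompute_token_num, num_model_steps, and the PD-split distinction come from the PR:

    import paddle

    def make_recompute_token_num(bsz: int, num_model_steps: int, pd_split_mode: bool):
        # PD-split mode starts from zero; normal multi-step MTP starts from
        # num_model_steps - 1, mirroring the initialization in mtp.py.
        init = 0 if pd_split_mode else num_model_steps - 1
        return paddle.full([bsz], init, dtype="int32")

    # The test could then run the operator once per mode:
    # for pd_split_mode in (False, True):
    #     recompute_token_num = make_recompute_token_num(bsz, num_model_steps, pd_split_mode)
    #     ... invoke draft_model_preprocess and compare against the reference ...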

Comment on lines +303 to +332
    process_splitwise_prefill<BlockSize,
                              TRUNCATE_FIRST_TOKEN,
                              KVCACHE_SCHEDULER_V1>
        <<<1, BlockSize, 0, stream>>>(draft_tokens,
                                      input_ids,
                                      stop_flags,
                                      seq_lens_this_time,
                                      seq_lens_encoder,
                                      seq_lens_decoder,
                                      step_idx,
                                      not_need_stop,
                                      is_block_step,
                                      batch_drop,
                                      pre_ids,
                                      accept_tokens,
                                      accept_num,
                                      base_model_seq_lens_this_time,
                                      base_model_seq_lens_encoder,
                                      base_model_seq_lens_decoder,
                                      base_model_step_idx,
                                      base_model_stop_flags,
                                      base_model_is_block_step,
                                      base_model_draft_tokens,
                                      bsz,
                                      num_model_step,
                                      accept_tokens_len,
                                      draft_tokens_len,
                                      input_ids_len,
                                      base_model_draft_tokens_len,
                                      pre_ids_len);

Copilot AI Dec 26, 2025


The process_splitwise_prefill kernel call is missing the mask_rollback and recompute_token_num parameters that were added to the DispatchRunner function signature. These parameters are passed to draft_model_preprocess_kernel (lines 348-349) but not to process_splitwise_prefill.

If these parameters are not needed for the splitwise_prefill path, they should still be passed to maintain API consistency and avoid potential issues when the logic changes. If they are needed, this is a critical bug that could cause incorrect behavior in splitwise_prefill mode.

Comment on lines +527 to 529
+            # NOTE(liuzichang):
+            # extra 1 : P-D split need rollback one step
             self.model_inputs["mask_rollback"][idx : idx + 1] = 1

Copilot AI Dec 26, 2025


The comment mentions "extra 1: P-D split need rollback one step" but this is misleading. Looking at line 529, mask_rollback is set to 1, which represents the extra rollback step. The comment should be placed above line 529 where mask_rollback is set, not above line 528 where it appears now. This would make the code more clear about which line the comment is describing.
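Concretely, the placement being suggested would look roughly like this (same assignment as the quoted hunk, with the comment moved directly above it; surrounding context omitted):

    # NOTE(liuzichang): P-D split needs to roll back one extra step,
    # hence the value 1 here.
    self.model_inputs["mask_rollback"][idx : idx + 1] = 1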

@codecov-commenter

Codecov Report

❌ Patch coverage is 50.00000% with 1 line in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release/online/20251131@b018c49). Learn more about missing BASE report.

Files with missing lines        Patch %   Lines
fastdeploy/spec_decode/mtp.py   50.00%    1 Missing ⚠️
Additional details and impacted files
@@                    Coverage Diff                     @@
##             release/online/20251131    #5792   +/-   ##
==========================================================
  Coverage                           ?   59.00%           
==========================================================
  Files                              ?      319           
  Lines                              ?    39108           
  Branches                           ?     5893           
==========================================================
  Hits                               ?    23074           
  Misses                             ?    14179           
  Partials                           ?     1855           
Flag   Coverage Δ
GPU    59.00% <50.00%> (?)

Flags with carried forward coverage won't be shown.

@EmmonsCurse merged commit 19a625a into PaddlePaddle:release/online/20251131 on Dec 26, 2025
15 of 21 checks passed
