[XPU] xpu support mm prefix cache #5356
Conversation
CLA assistant: ddchenhao66 seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
Pull request overview
This PR adds multimodal prefix cache support for XPU. The main changes are removing the configuration that disabled prefix cache by default for multimodal models on the XPU platform, and implementing an encoder cache in xpu_model_runner to cache extracted vision features.
Key changes:
- Removed the configuration restriction that disabled prefix cache by default for multimodal models on XPU
- Added an encoder_cache mechanism to cache and reuse extracted vision features (see the sketch after this list)
- Refactored the multimodal input handling into a new _apply_mm_inputs method
- Simplified the CUDAGraph device-type check
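A minimal sketch of how such an encoder cache can work, assuming features are keyed by a per-image content hash; the helper name get_or_extract_features and the extract_fn parameter are illustrative, not taken from the PR:

```python
import paddle

# Hypothetical sketch, not the PR's exact code: cache per-image vision features
# keyed by a content hash so prefix-cached requests can skip the vision encoder.
encoder_cache: dict[str, paddle.Tensor] = {}

def get_or_extract_features(mm_hashes, images, extract_fn):
    """Return concatenated features for all images, running extract_fn only for
    hashes that are not yet cached. extract_fn stands in for the model runner's
    vision feature extraction and maps one image tensor to one feature tensor."""
    for mm_hash, image in zip(mm_hashes, images):
        if mm_hash not in encoder_cache:
            encoder_cache[mm_hash] = extract_fn(image)
    return paddle.concat([encoder_cache[h] for h in mm_hashes], axis=0)
```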
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| fastdeploy/engine/args_utils.py | Removed the code that automatically disabled prefix cache for multimodal models on XPU |
| fastdeploy/config.py | Refactored the CUDAGraph device-type check to use current_platform.is_cuda() instead of reading device_config directly (illustrated after this table) |
| fastdeploy/worker/xpu_model_runner.py | Added encoder_cache initialization and multimodal prefix cache support, including the new get_chunked_inputs, batch_uncached_inputs, scatter_and_cache_features, and _apply_mm_inputs methods |
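A rough illustration of the config.py change described in the table, based only on the table entry; the surrounding condition and the import path are assumptions, not the PR diff:

```python
# Assumed shape of the check before the refactor:
#     if self.device_config.device_type == "cuda":
#         ...enable CUDAGraph-related options...
#
# Assumed shape after the refactor: go through the platform abstraction
# instead of reading device_config directly.
from fastdeploy.platforms import current_platform  # import path assumed

if current_platform.is_cuda():
    # CUDAGraph capture only applies on CUDA devices.
    ...
```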
| elif "qwen" in self.model_config.model_type: | ||
| actual_image_token_num = paddle.sum( | ||
| vision_inputs["input_ids"] == vision_inputs["image_patch_id"] | ||
| ) + paddle.sum(vision_inputs["input_ids"] == vision_inputs["video_patch_id"]) | ||
| else: | ||
| raise ValueError(f"multiple modalities model {self.model_config.model_type} is not supported") |
Copilot (AI) commented on Dec 3, 2025:
The logic for calculating actual_image_token_num for "qwen" model type attempts to access vision_inputs["image_patch_id"] and vision_inputs["video_patch_id"], but these fields are not set in the vision_inputs dictionary in either the batch_uncached_inputs method or the get_chunked_inputs method. This will cause a KeyError at runtime for qwen models.
Looking at the GPU model runner implementation, these patch IDs should be provided in the vision_inputs. Please ensure that these fields are properly populated in the vision_inputs dict before this calculation, or modify the logic to retrieve them from an appropriate source (e.g., from request.multimodal_inputs or self.model_config).
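One way the fix could look, sketched under the assumption that the patch IDs are available on request.multimodal_inputs (the field names and access path are assumptions and would need to be confirmed against the GPU model runner):

```python
# Hypothetical sketch: make sure the patch IDs are present before counting tokens.
inputs = request.multimodal_inputs
vision_inputs["image_patch_id"] = inputs["image_patch_id"]  # field name assumed
vision_inputs["video_patch_id"] = inputs["video_patch_id"]  # field name assumed

# The qwen branch can then count image and video tokens without a KeyError:
actual_image_token_num = paddle.sum(
    vision_inputs["input_ids"] == vision_inputs["image_patch_id"]
) + paddle.sum(vision_inputs["input_ids"] == vision_inputs["video_patch_id"])
```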
```python
    self.encoder_cache: dict[str, paddle.Tensor] = {}
else:
    self.encoder_cache = None
```
Copilot (AI) commented on Dec 3, 2025:
Trailing whitespace detected. Please remove the trailing whitespace for code cleanliness.
```python
inputs = request.multimodal_inputs
if request.with_image:
    if envs.FD_ENABLE_MAX_PREFILL:
        multi_vision_inputs["images_lst"].append(
            inputs["images"][request.image_start : request.image_end].cuda()
        )
        multi_vision_inputs["grid_thw_lst"].extend(
            inputs["grid_thw"][request.num_image_start : request.num_image_end]
        )
        multi_vision_inputs["cu_seqlens"].extend(
            inputs["vit_seqlen"][request.num_image_start : request.num_image_end]
        )
        multi_vision_inputs["vit_position_ids_lst"].extend(
            inputs["vit_position_ids"][request.num_image_start : request.num_image_end]
        )
    else:
        vision_inputs = inputs
        if self.encoder_cache:
            (
                vision_inputs["input_ids"],
                vision_inputs["token_type_ids"],
                vision_inputs["image_type_ids"],
                vision_inputs["images"],
                vision_inputs["grid_thw"],
                vision_inputs["mm_hashes"],
            ) = self.batch_uncached_inputs(request)
            if len(vision_inputs["mm_hashes"]) > 0:
                # uncached multimodal inputs exist
                image_features = self.extract_vision_features(vision_inputs)
                self.scatter_and_cache_features(image_features, vision_inputs)

            full_image_features_lst = []
            for mm_hash in inputs["mm_hashes"][request.num_image_start : request.num_image_end]:
                feature = self.encoder_cache[mm_hash].cuda()
                full_image_features_lst.append(feature)
            image_features = paddle.concat(full_image_features_lst, axis=0)
        else:
            (
                input_ids,
                token_type_ids,
                image_type_ids,
                images,
                grid_thw,
                mm_hashes,
            ) = self.get_chunked_inputs(request)
            vision_inputs["input_ids"] = paddle.to_tensor(input_ids, dtype=paddle.int64)
            vision_inputs["token_type_ids"] = paddle.to_tensor(token_type_ids, dtype=paddle.int64)
            vision_inputs["image_type_ids"] = paddle.to_tensor(image_type_ids, dtype=paddle.int64)
            vision_inputs["images"] = paddle.to_tensor(
                images, dtype="uint8" if "ernie" in self.model_config.model_type else "bfloat16"
            )
            vision_inputs["grid_thw"] = paddle.to_tensor(grid_thw, dtype=paddle.int64)
            vision_inputs["mm_hashes"] = mm_hashes

            image_features = self.extract_vision_features(vision_inputs)

        # part of the first image may be already cached
        if "ernie" in self.model_config.model_type:
            actual_image_token_num = paddle.sum(vision_inputs["input_ids"] == self.model_config.im_patch_id)
        elif "qwen" in self.model_config.model_type:
            actual_image_token_num = paddle.sum(
                vision_inputs["input_ids"] == vision_inputs["image_patch_id"]
            ) + paddle.sum(vision_inputs["input_ids"] == vision_inputs["video_patch_id"])
        else:
            raise ValueError(f"multiple modalities model {self.model_config.model_type} is not supported")
        self.share_inputs["image_features"] = image_features[-actual_image_token_num:]
```
Copilot (AI) commented on Dec 3, 2025:
Missing handling for the case when request.with_image is False. The old code (before refactoring) explicitly set self.share_inputs["image_features"] = None when there were no images. Without this, when multimodal is enabled but a request batch contains no images, self.share_inputs["image_features"] may retain stale data from previous batches.
The recommended fix, following the GPU model runner pattern (gpu_model_runner.py line 566), is to initialize self.share_inputs["image_features"] = None at the beginning of the insert_tasks_v1 method (after line 358), rather than in this method.
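A minimal sketch of that placement, assuming a method named insert_tasks_v1 on the XPU model runner with a per-request loop; the signature and loop body here are simplified assumptions:

```python
def insert_tasks_v1(self, req_dicts):
    # Reset the shared image features at the start of every batch so that a
    # batch without images does not reuse stale features from a previous batch.
    self.share_inputs["image_features"] = None

    for request in req_dicts:
        if request.with_image:
            # Multimodal inputs (and image_features) are populated here.
            self._apply_mm_inputs(request)
        # ... remaining per-request insertion logic ...
```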
I recently rewrote the _apply_mm_inputs method to support multi-batch multimodal prefill and to fix an encoder cache bug. This PR can be merged first; we can check afterwards whether it needs to be updated.
```python
inputs = request.multimodal_inputs
if request.with_image:
    if envs.FD_ENABLE_MAX_PREFILL:
```
envs.FD_ENABLE_MAX_PREFILL is logic for the PaddleOCR-VL model and supports multi-batch prefill; whether this path works correctly on XPU probably needs to be checked.
If envs.FD_ENABLE_MAX_PREFILL is tied to the PaddleOCR-VL model, shouldn't the `if envs.FD_ENABLE_MAX_PREFILL:` check also include a model-type check, or be wrapped in a helper function? XPU also supports the extract_vision_features_paddleocr function, which involves FD_ENABLE_MAX_PREFILL as well, so this change should probably cover it too.
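One possible shape for such a wrapper, sketched as an assumption (the helper name and the model-type substring are illustrative, not from the PR):

```python
def use_max_prefill_path(model_config) -> bool:
    # Gate the multi-batch prefill path on both the env flag and the model type,
    # so the same check can be reused in _apply_mm_inputs and
    # extract_vision_features_paddleocr.
    return bool(envs.FD_ENABLE_MAX_PREFILL) and "paddleocr" in model_config.model_type

# Call sites would then read:
#     if use_max_prefill_path(self.model_config):
#         ...multi-batch prefill path...
```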
Codecov Report
❌ Patch coverage is
Additional details and impacted files:
```
@@            Coverage Diff             @@
##            develop    #5356   +/-   ##
==========================================
  Coverage          ?   59.34%
==========================================
  Files             ?      324
  Lines             ?    40058
  Branches          ?     6051
==========================================
  Hits              ?    23772
  Misses            ?    14402
  Partials          ?     1884
```
Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Force-pushed 2da3b0a to b81da95, then b81da95 to c27ead2.
Thanks for your contribution!
Motivation
Add multimodal prefix cache support on XPU.
Modifications
Usage or Command
Accuracy Tests
Checklist
- Use at least one of the following tags in the PR title: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- If the PR targets the release branch, make sure it has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.