[Feature] mm prefix cache #4554
Thanks for your contribution!
Pull Request Overview
This pull request implements multimodal prefix caching functionality, enabling efficient cache reuse for requests containing images or other multimodal inputs. The key changes involve refactoring the prefix cache manager to handle multimodal data with proper hashing and block management.
- Adds multimodal-aware prefix caching with image hash tracking
- Introduces a `disable_chunked_mm_input` flag to prevent splitting multimodal inputs across cache blocks
- Updates cache hit tracking to use token counts instead of block counts for more accurate metrics
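To illustrate what "hash computation with image keys" could look like, here is a minimal sketch and not the code from this PR. It assumes each full block's hash chains the parent block's hash, the block's token IDs, and the hashes of any images overlapping that block. The names `block_hashes` and `mm_spans` are hypothetical and do not come from FastDeploy.

```python
import hashlib
from typing import List, Sequence, Tuple

# (start, end, image_hash): token range [start, end) occupied by one image.
# This layout is an assumption made for the sketch.
MMSpan = Tuple[int, int, str]


def block_hashes(
    token_ids: Sequence[int], block_size: int, mm_spans: Sequence[MMSpan]
) -> List[str]:
    """Return one chained hash per full block of `token_ids`."""
    hashes: List[str] = []
    parent = ""  # the first block has no parent
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        end = start + block_size
        # Image placeholder tokens can be identical across different images,
        # so mix in the hash of every image that overlaps this block.
        image_keys = tuple(h for s, e, h in mm_spans if s < end and e > start)
        payload = repr((parent, tuple(token_ids[start:end]), image_keys))
        parent = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        hashes.append(parent)
    return hashes
```

With this scheme, two requests that share the same text prefix but carry different images share block hashes only up to the first block that touches an image, so a cached block is never reused for the wrong image.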
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| fastdeploy/cache_manager/prefix_cache_manager.py | Implements core multimodal prefix caching logic including mm_match_block, mm_build_path, and hash computation with image keys |
| fastdeploy/engine/sched/resource_manager_v1.py | Updates cache hit metrics to use token-level granularity instead of block-level |
| fastdeploy/engine/common_engine.py | Simplifies available blocks calculation by removing multimodal-specific logic |
| fastdeploy/engine/args_utils.py | Adds disable_chunked_mm_input configuration option and removes restriction preventing prefix caching with multimodal models |
| fastdeploy/config.py | Adds disable_chunked_mm_input field to CacheConfig |
| tests/v1/test_prefix_cache.py | Adds comprehensive tests for multimodal prefix caching functionality |
| tests/v1/test_revert_blocks.py | Adds tests for block reversion logic when multimodal inputs are chunked |
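The block reversion and token-level metrics described in the table can be sketched as follows. This is an assumption about the intended behavior, not the PR's implementation: when chunking of multimodal input is disabled and the end of the matched prefix falls strictly inside an image's token range, matched blocks are given back until the boundary no longer splits an image. Hits are then counted in tokens. The function names and the `(start, end)` span format are hypothetical.

```python
from typing import Sequence, Tuple


def revert_match_blocks(
    matched_blocks: int, block_size: int, mm_spans: Sequence[Tuple[int, int]]
) -> int:
    """Shrink the match so its end does not fall inside an image span.

    Each span is a token range [start, end) occupied by one image.
    """
    while matched_blocks > 0:
        boundary = matched_blocks * block_size
        split = [s for s, e in mm_spans if s < boundary < e]
        if not split:
            break
        # Fall back to the last block boundary at or before the image start.
        matched_blocks = min(split) // block_size
    return matched_blocks


def token_hit_rate(matched_blocks: int, block_size: int, prompt_len: int) -> float:
    """Cache hit rate measured in tokens rather than in blocks."""
    if prompt_len == 0:
        return 0.0
    return (matched_blocks * block_size) / prompt_len
```

For example, with a block size of 4 and an image occupying tokens 6 to 13, a two-block match ends at token 8, inside the image, so the sketch reverts to one block and reports 4 hit tokens.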
966297e into PaddlePaddle:feature/experimental_feature_20250908
* mm prefix cache
* add _revert_match_blocks
* update code
* update code
* update code
* fix bugs
* add test case
* fix bug
* update code
* update reserved_dec_block_ids
Motivation
mm prefix cache
The following parameter must be added at startup:
Modifications
Usage or Command
Accuracy Tests
Checklist
- [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- `pre-commit` before commit.
- `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.