[CudaGraph] [SOT] Support splitting static graph into piecewise graph with cuda_graph #3478
Conversation
gongshaotian
left a comment
The split interface will need to be exposed later, so that FD can register the Attention Layer.
```python
            f"[CUDA GRAPH] CUDAGraph capture list {self.cudagraph_capture_sizes}, " "Created all real shape entry."
        )

    def run_static_model(self, entry: ConcreteSizeEntry, **kwargs):
```
Is the subgraph now managed directly inside Paddle?
Yes. Exposing it to Python would have a relatively high implementation cost, but there is no functional difference.
gongshaotian
left a comment
Please add a unit test for the static-graph path; the coverage check did not pass.
Pull Request Overview
This PR adds support for splitting static computation graphs into piecewise subgraphs for CUDA Graph capture and execution in SOT (Symbolic Opcode Translator) mode. It enables CUDA Graph optimization at the subgraph level when graph optimization is enabled.
Key changes:
- Introduces a new `Dy2StCudaGraphManager` class to manage CUDA Graph state transitions for static graph execution
- Adds a new execution path `run_static_model` for handling static model execution with CUDA Graph capture and replay
- Integrates the CUDA Graph manager into the existing `CudaGraphPiecewiseBackend` class
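The capture/replay bookkeeping that the overview describes can be sketched as a small state machine. This is an illustrative toy only: the names `Dy2StCudaGraphManager` and `CUDAGraphState` follow the PR description, but the bodies below are simplified stand-ins, not FastDeploy's actual implementation.

```python
from enum import Enum


class CUDAGraphState(Enum):
    DISABLE = 0   # run eagerly, no CUDA Graph
    CAPTURE = 1   # record the subgraph into a CUDA Graph
    REPLAY = 2    # replay a previously captured graph


class Dy2StCudaGraphManager:
    """Toy sketch: tracks which batch sizes already have a captured graph."""

    def __init__(self):
        self.state = CUDAGraphState.DISABLE
        self.captured_batch_size = set()
        self.batch_size = -1

    def effective_state(self):
        # Replay is only valid for a batch size that was captured before;
        # otherwise fall back to eager (DISABLE) execution for this call.
        if (self.state == CUDAGraphState.REPLAY
                and self.batch_size not in self.captured_batch_size):
            return CUDAGraphState.DISABLE
        if self.state == CUDAGraphState.CAPTURE:
            self.captured_batch_size.add(self.batch_size)
        return self.state
```

The same fall-back-then-capture pattern appears in the `run_impl` snippet quoted later in this thread.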
```python
        self.captrued_batch_size = set()
        self.batch_size = -1

    def run_impl(self, original_run_impl, inputs, parameters, attrs):
        run_state = self.state
        prog_attrs, cuda_graph_attrs = attrs
        if run_state == CUDAGraphState.REPLAY:
            if self.batch_size not in self.captrued_batch_size:
                run_state = CUDAGraphState.DISABLE
        elif run_state == CUDAGraphState.CAPTURE:
            self.captrued_batch_size.add(self.batch_size)
```
Copilot
AI
Aug 26, 2025
There's a typo in the variable name. 'captrued_batch_size' should be 'captured_batch_size'.
Suggested change:

```diff
-        self.captrued_batch_size = set()
+        self.captured_batch_size = set()
         self.batch_size = -1

     def run_impl(self, original_run_impl, inputs, parameters, attrs):
         run_state = self.state
         prog_attrs, cuda_graph_attrs = attrs
         if run_state == CUDAGraphState.REPLAY:
-            if self.batch_size not in self.captrued_batch_size:
+            if self.batch_size not in self.captured_batch_size:
                 run_state = CUDAGraphState.DISABLE
         elif run_state == CUDAGraphState.CAPTURE:
-            self.captrued_batch_size.add(self.batch_size)
+            self.captured_batch_size.add(self.batch_size)
```
fastdeploy/model_executor/graph_optimization/cudagraph_piecewise_backend.py
```python
    def run_static_model(self, entry: ConcreteSizeEntry, **kwargs):
        if not entry.captured:
            # Warmup the model
            for n in range(entry.num_finished_warmup, self.warm_up_size):
                entry.num_finished_warmup += 1
                entry.runnable(**kwargs)
```
Copilot
AI
Aug 26, 2025
The entry.captured flag is never set to True after capturing is complete. This will cause the warmup and capture logic to run repeatedly on every call instead of transitioning to replay mode.
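A minimal sketch of the fix this comment suggests: mark the entry as captured once warmup and capture finish, so later calls skip straight to replay. `ConcreteSizeEntry` and the capture step below are simplified stand-ins for illustration, not the real backend code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ConcreteSizeEntry:
    """Toy stand-in for the entry class in cudagraph_piecewise_backend.py."""
    runnable: Callable
    num_finished_warmup: int = 0
    captured: bool = False


def run_static_model(entry: ConcreteSizeEntry, warm_up_size: int = 2, **kwargs):
    if not entry.captured:
        # Warm up the model before capturing.
        for _ in range(entry.num_finished_warmup, warm_up_size):
            entry.num_finished_warmup += 1
            entry.runnable(**kwargs)
        # The actual CUDA Graph capture would happen here; afterwards,
        # record that capture is done so warmup/capture does not repeat
        # on every call -- the missing step the review points out.
        entry.captured = True
    return entry.runnable(**kwargs)
```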
…se_backend.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Codecov Report

✅ All modified and coverable lines are covered by tests.

```text
@@          Coverage Diff           @@
##           develop    #3478   +/- ##
=======================================
  Coverage        ?    93.02%
=======================================
  Files           ?         1
  Lines           ?        43
  Branches        ?         7
=======================================
  Hits            ?        40
  Misses          ?         1
  Partials        ?         2
```
gongshaotian
left a comment
LGTM
Supports splitting the computation graph into subgraphs for CUDA Graph capture and execution in SOT mode.
If you need to split subgraphs around attention, you can configure:
FLAGS_cuda_graph_blacklist="custom_op.static_op_append_attention_"
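A hedged example of applying the flag: Paddle-style `FLAGS_*` options are commonly passed as environment variables before launching the process. The flag value is the one quoted above; the launch command is hypothetical.

```shell
# Blacklist the attention custom op so the static graph is split at its
# boundaries for piecewise CUDA Graph capture (value quoted in this thread).
export FLAGS_cuda_graph_blacklist="custom_op.static_op_append_attention_"

# Then launch inference as usual, e.g. (script name is illustrative):
# python infer.py
```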