Make Megatron-FSDP torch.compile compatible #2425
Conversation
/ok to test f45c937
/ok to test 2237d64
cspades
left a comment
Torch compilation definitely works now, but using compile with Megatron-FSDP seems to hurt performance for a variety of standard model architectures, likely because compilation is skipped for our collectives, so the compiled program has to work around them and ends up slower overall.
| Model | Avg step time (COMPILE=True) | Avg step time (COMPILE=False) |
|---|---|---|
| Linear | 0.0397 | 0.0297 |
| CNN | 0.0306 | 0.0235 |
| Transformer | 0.3653 | 0.2140 |
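As a rough sketch (not the harness actually used for the numbers above), average step time is typically measured with a few warmup iterations excluded, which matters especially under `torch.compile` since the first steps pay compilation overhead. The `average_step_time` helper and the dummy `step_fn` below are hypothetical stand-ins:

```python
import time

def average_step_time(step_fn, num_warmup=3, num_iters=10):
    # Warmup iterations are excluded: with torch.compile enabled, the
    # first steps include compilation cost and would skew the average.
    for _ in range(num_warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(num_iters):
        step_fn()
    return (time.perf_counter() - start) / num_iters

# Illustrative dummy "step" standing in for forward/backward/optimizer.
avg = average_step_time(lambda: sum(i * i for i in range(10_000)))
print(f"AVG STEP TIME: {avg}")
```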
I believe FSDP2 has a performance improvement from compilation, possibly due to:
Functional collectives. If you don't like DTensor, we also support "functional collectives", which are non-mutating versions of collective operations that can be used to manually implement SPMD operations in a compiler-friendly way without needing DTensor. (In fact, if you use traditional collective APIs and compile them, we will silently translate them into functional collectives for compiler passes.) When compiled, functional collectives don't necessarily force allocation of the output buffer as they can be re-inplaced. Importantly, functional collectives currently do NOT support autograd, see https://discuss.pytorch.org/t/supporting-autograd-for-collectives/219430
which is taken from @ezyang's blog: https://blog.ezyang.com/2025/08/state-of-torch-compile-august-2025/, among many other optimizations that make FSDP2 lightly compatible with, and improved by, torch.compile.
We'll likely need more changes to deeply support compilation, unless the model architecture benefits enough from compilation to offset both the uncompiled collectives and the compile-time overhead. I'm not experienced with compilation, so no immediate ideas at the moment; just wanted to write these thoughts down somewhere.
Clicked the "update branch" button since main branch CI was broken. CI should pass now.
What is going on with main branch 👀 No way this is related to our changes...
Updated branch, should be fixed by: #2970
/ok to test 14080c1
/ok to test cad4126
Co-authored-by: Cory Ye <44509866+cspades@users.noreply.github.com>
What does this PR do?
This PR makes Megatron-FSDP compatible with `torch.compile` by disabling compilation for its internal FSDP hooks using `@torch.compiler.disable`. These hooks rely on eager-mode behavior and can confuse the compiler's tracing and graph construction, leading to graph breaks or errors when compiling models wrapped with Megatron-FSDP. By explicitly opting these hook entry points out of compilation, the main model computation remains compilable while Megatron-FSDP continues to manage sharding and communication in eager mode.

Contribution process
```mermaid
flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]
```

Pre-checks
Core 0.8)

Code review
The following process is enforced via the CODEOWNERS file for changes into `megatron/core`. For changes outside of `megatron/core`, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch
(Step 1): Add PR label `Expert Review`
(Step 2): Collect the expert reviewers' reviews
Add the `Expert Review` label when your PR is ready for review. Final Review might get declined if these requirements are not fulfilled.
(Step 3): Final Review
Add the `Final Review` label
(Optional Step 4): Cherry-pick into release branch
If this PR also needs to be merged into `core_r*` release branches, after this PR has been merged, select `Cherry-pick` to open a new PR into the release branch.

For MRs into `dev` branch
The proposed review process for `dev` branch is under active discussion. MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR
Any member of `core-adlr` and `core-nemo` will be able to merge your PR.