
[GraphTrainer] Tag flex_attn via a graph pass in place of fx.annotation for better robustness #2924

Merged
SherlockNoMad merged 4 commits into main from
graph_trainer/annotate_flex_for_regional_inductor
Apr 10, 2026

Conversation

Contributor

@SherlockNoMad SherlockNoMad commented Apr 9, 2026

Summary

Moves flex attention annotation from a pre-tracing function/context-manager
(annotate_flex_for_regional_inductor / annotate_flex_attention_for_regional_inductor
in common_utils.py) to a post-tracing graph pass
(annotate_flex_attention_for_regional_inductor_pass in passes.py).

Why: The previous approach annotated Python-level functions before tracing,
which required a context manager to temporarily patch and restore
FlexAttention._compiled_flex_attn and _compiled_create_block_mask.
A graph pass is simpler — it directly tags the relevant FX nodes after tracing,
with no monkey-patching or cleanup needed.

What the pass does: Annotates three sets of nodes with
compile_with_inductor (including inductor_configs from FlexAttention)
so that regional_inductor correctly scoops and compiles flex attention regions:

  1. The HOP nodes (flex_attention / flex_attention_backward)
  2. The get_attr nodes referencing score_mod / mask_mod submodules
  3. All nodes inside those submodule graphs
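
The three tagging steps above can be sketched as a small pass. This is an illustrative sketch only: `Node` below is a lightweight stand-in for `torch.fx.Node`, and the submodule name `sdpa_score0`, the `meta["custom"]` layout, and the pass signature are assumptions for illustration, not the real torchtitan code.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str       # "call_function", "get_attr", ...
    target: str   # e.g. "flex_attention" or a submodule name
    meta: dict = field(default_factory=dict)

FLEX_HOPS = {"flex_attention", "flex_attention_backward"}

def annotate_flex_attention_for_regional_inductor_pass(nodes, submodules, inductor_configs):
    """Tag (1) flex-attention HOP nodes, (2) get_attr nodes referencing
    score_mod/mask_mod submodules, and (3) every node inside those
    submodule graphs, so regional_inductor can scoop the regions."""
    tag = {"compile_with_inductor": "flex_attention",
           "inductor_configs": inductor_configs}
    tagged_submods = set()
    for n in nodes:
        if n.op == "call_function" and n.target in FLEX_HOPS:
            n.meta["custom"] = tag                 # (1) the HOP node itself
        elif n.op == "get_attr" and n.target in submodules:
            n.meta["custom"] = tag                 # (2) score_mod/mask_mod refs
            tagged_submods.add(n.target)
    for name in tagged_submods:
        for sub_n in submodules[name]:             # (3) nodes inside submodules
            sub_n.meta["custom"] = tag
    return nodes
```

Because the pass runs on the traced graph, unrelated nodes are left untouched, which is the over-annotation the pre-tracing approach could not avoid.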

Changes:

  • Add annotate_flex_attention_for_regional_inductor_pass graph pass in passes.py
  • Remove annotate_flex_for_regional_inductor() and its context manager from common_utils.py
  • Remove pre-tracing annotation calls from llama3/parallelize.py and deepseek_v3/parallelize.py
  • Wire up the pass in graph_utils.py (applied as a joint pass before regional_inductor)
  • Update tests to use the graph pass instead of the context manager
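
The wiring change above amounts to an ordering constraint: the annotation pass must run before `regional_inductor` so the tags already exist when regions are scooped. A toy sketch of that constraint, where the dict-based "graph", `annotate_flex_pass`, and `regional_inductor_stub` are illustrative stand-ins, not the real `graph_utils.py` machinery:

```python
def apply_joint_passes(graph, passes):
    """Run each joint graph pass in order; each returns the graph
    for the next pass to consume."""
    for p in passes:
        graph = p(graph)
    return graph

def annotate_flex_pass(graph):
    # Stand-in annotation pass: tags flex regions for a later pass.
    graph.setdefault("tags", []).append("compile_with_inductor")
    return graph

def regional_inductor_stub(graph):
    # Stand-in for regional_inductor: compiles only regions that an
    # earlier pass already tagged.
    graph["compiled_regions"] = list(graph.get("tags", []))
    return graph
```

With the passes in the reverse order, `compiled_regions` comes out empty, which is exactly the failure mode the "before regional_inductor" placement guards against.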

Test plan

  • test_passes.py — passed
  • test_precompile.py — passed
  • test_trace_module.py — 27/28 passed (1 pre-existing failure in test_peak_memory_identical_fsdp)
  • test_numerics.py — passed
  • test_bitwise_deterministic.py — passed
  • pre-commit run --all-files — passed

…l_inductor

Refactor flex attention annotations to tag `_compiled_flex_attn` and
`_compiled_create_block_mask` with `compile_with_inductor` metadata
including inductor configs, instead of annotating `FlexAttention.forward`.

This ensures bitwise-identical kernels between eager and regional_inductor
paths by propagating the same inductor configs used by
`FlexAttention._compiled_flex_attn`.

- Add `annotate_flex_for_regional_inductor()` for permanent annotations
- Update context manager to use the new function and restore originals
- Unify llama3 and deepseek_v3 parallelize to use the shared helper
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 9, 2026
)
MoE.forward = annotate_fn({"EP": "compute"})(MoE.forward)

FlexAttention.forward = annotate_fn({"compile_with_inductor": "flex_attention"})(
Contributor Author


Due to the change in #2761,

wrapping the entire FlexAttention.forward annotates more nodes than are supposed to be compiled.

Hence this fix.

{"compile_with_inductor": "flex_attention"} so the compiler can apply
regional inductor pass based on the annotation. Regional inductor is now only
supported in AOT mode.
- Flex attention annotation: Tags FlexAttention.forward and compiled flex
Contributor Author


removed.

@SherlockNoMad SherlockNoMad force-pushed the graph_trainer/annotate_flex_for_regional_inductor branch from d9baee0 to c47515f Compare April 10, 2026 06:35
@SherlockNoMad SherlockNoMad changed the title [GraphTrainer] Annotate compiled flex attention functions for regional_inductor [GraphTrainer] annotate_flex_attention_for_regional_inductor_pass Apr 10, 2026
@SherlockNoMad SherlockNoMad marked this pull request as ready for review April 10, 2026 06:57
@SherlockNoMad SherlockNoMad changed the title [GraphTrainer] annotate_flex_attention_for_regional_inductor_pass [GraphTrainer] Changed tagging flex_attn via graph_pass in replace of fx.annotation for better robustness Apr 10, 2026
@SherlockNoMad SherlockNoMad merged commit e24e465 into main Apr 10, 2026
17 of 25 checks passed
TXacs pushed a commit to McmillanTAC/torchtitan that referenced this pull request Apr 13, 2026

Labels

ciflow/8gpu, CLA Signed (managed by the Meta Open Source bot)

3 participants