fix: AC silently skipped on all registered VLMs — flatten ModuleList #1941
Merged
akoumpa merged 1 commit into NVIDIA-NeMo:main on Apr 21, 2026
Conversation
fix: flatten ModuleList in _extract_model_layers so AC applies to individual layers

_reduce_attrs returns ModuleList objects as single items; extending layers with them meant AC code never found self_attn/mlp on a ModuleList and silently skipped all checkpointing. Flatten any ModuleList results so layers contains individual decoder layers, matching the heuristic path.

Signed-off-by: khazic <khazzz1c@gmail.com>
khazic (Contributor, Author) commented:

I want to clarify how this bug slipped through in #1904: when testing with TP+PP, the memory pressure per GPU was already low enough (TP4×PP2 spreads the model across 8 GPUs) that training ran for hundreds of steps without OOM — so I assumed AC was working correctly. In reality, AC had already silently failed at that point. I apologize for not catching this earlier. The bug has been fixed in this PR, and I will be more careful to validate AC behavior explicitly in future work, rather than relying on the absence of OOM as a proxy.
Contributor commented:

/ok to test cdd56e5
Contributor commented:

/claude review
akoumpa approved these changes on Apr 21, 2026
HuiyingLi added a commit that referenced this pull request on Apr 21, 2026:
test: cover _extract_model_layers flatten branches

codecov/patch flagged #1941 at 14.28% (1/7 diff lines hit): every existing test mocks _extract_model_layers, so the new _extend_layers helper and the two modified call sites were unexecuted.

Add six tests over the real function covering: class-keyed single FQN (GPT2), string-keyed arm (NemotronH name match), multi-FQN (Qwen2.5-VL), non-ModuleList element kept as a single entry (ModuleDict post-PP-split shape), and both ModuleList/ModuleDict fallback branches as regression guards.

Uses Cls.__new__ + nn.Module.__init__ to produce instances whose type(model) matches the exact class in MODEL_CLS_TO_LAYERS (identity lookup — subclasses miss the dict) without HF's config-dependent __init__.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
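A sketch of the instantiation trick the commit describes. GPT2 comes from the commit's own test matrix; the attached submodule shapes are illustrative assumptions, not the actual test fixtures:

```python
import torch.nn as nn
from transformers import GPT2LMHeadModel

# Bypass HF's __init__, which requires a config, while keeping the exact class
# so the identity lookup MODEL_CLS_TO_LAYERS[type(model)] still hits.
model = GPT2LMHeadModel.__new__(GPT2LMHeadModel)
nn.Module.__init__(model)  # initialize nn.Module internals (_modules, _parameters, ...)

# Attach just enough structure for an FQN like "transformer.h" to resolve
# (attribute contents here are assumptions for illustration).
model.transformer = nn.Module()
model.transformer.h = nn.ModuleList(nn.Linear(4, 4) for _ in range(2))

assert type(model) is GPT2LMHeadModel  # exact-class match; a subclass would miss the dict
```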
akoumpa pushed a commit that referenced this pull request on Apr 21, 2026:
…List (1941)` into `r0.4.0` (#1958)

fix: AC silently skipped on all registered VLMs — flatten ModuleList (#1941)

fix: flatten ModuleList in _extract_model_layers so AC applies to individual layers

_reduce_attrs returns ModuleList objects as single items; extending layers with them meant AC code never found self_attn/mlp on a ModuleList and silently skipped all checkpointing. Flatten any ModuleList results so layers contains individual decoder layers, matching the heuristic path.

Signed-off-by: khazic <khazzz1c@gmail.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: khazzz1c <khazzz1c@gmail.com>
HuiyingLi added a commit that referenced this pull request on Apr 21, 2026:
test(qwen3_5): expect flattened per-layer modules after #1941

#1941 flattens each ModuleList returned from _reduce_attrs, so _extract_model_layers now yields individual decoder modules instead of the containing ModuleLists. Update the two fallback/None-safety assertions added in #1859 to isinstance-check the inner nn.Linear.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
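A hedged before/after illustration of that assertion change; the module shapes are assumptions, not the actual test fixtures:

```python
import torch.nn as nn

# Post-#1941 shape of `layers`: individual modules, not the containing ModuleList.
layers = [nn.Linear(8, 8), nn.Linear(8, 8)]

# Before #1941 the natural guard was on the container, e.g.
#   isinstance(layers[0], nn.ModuleList)
# After flattening, the guard checks the inner module instead:
assert all(isinstance(layer, nn.Linear) for layer in layers)
```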
akoumpa pushed a commit that referenced this pull request on Apr 21, 2026:
* fix: flatten ModuleList in _extract_model_layers so AC applies to individual layers

  _reduce_attrs returns ModuleList objects as single items; extending layers with them meant AC code never found self_attn/mlp on a ModuleList and silently skipped all checkpointing. Flatten any ModuleList results so layers contains individual decoder layers, matching the heuristic path.

  Signed-off-by: khazic <khazzz1c@gmail.com>

* test: cover _extract_model_layers flatten branches

  codecov/patch flagged #1941 at 14.28% (1/7 diff lines hit): every existing test mocks _extract_model_layers, so the new _extend_layers helper and the two modified call sites were unexecuted. Add six tests over the real function covering: class-keyed single FQN (GPT2), string-keyed arm (NemotronH name match), multi-FQN (Qwen2.5-VL), non-ModuleList element kept as a single entry (ModuleDict post-PP-split shape), and both ModuleList/ModuleDict fallback branches as regression guards. Uses Cls.__new__ + nn.Module.__init__ to produce instances whose type(model) matches the exact class in MODEL_CLS_TO_LAYERS (identity lookup — subclasses miss the dict) without HF's config-dependent __init__.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
  Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* style: ruff format + rename ambiguous `l` loop var

  Fix E741 and apply ruff format on the test file.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
  Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* test(qwen3_5): expect flattened per-layer modules after #1941

  #1941 flattens each ModuleList returned from _reduce_attrs, so _extract_model_layers now yields individual decoder modules instead of the containing ModuleLists. Update the two fallback/None-safety assertions added in #1859 to isinstance-check the inner nn.Linear.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
  Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

---------

Signed-off-by: khazic <khazzz1c@gmail.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: khazic <khazzz1c@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
linnanwang pushed a commit that referenced this pull request on Apr 24, 2026:
…1941)

fix: flatten ModuleList in _extract_model_layers so AC applies to individual layers

_reduce_attrs returns ModuleList objects as single items; extending layers with them meant AC code never found self_attn/mlp on a ModuleList and silently skipped all checkpointing. Flatten any ModuleList results so layers contains individual decoder layers, matching the heuristic path.

Signed-off-by: khazic <khazzz1c@gmail.com>
linnanwang pushed a commit that referenced this pull request on Apr 24, 2026:
* fix: flatten ModuleList in _extract_model_layers so AC applies to individual layers

  _reduce_attrs returns ModuleList objects as single items; extending layers with them meant AC code never found self_attn/mlp on a ModuleList and silently skipped all checkpointing. Flatten any ModuleList results so layers contains individual decoder layers, matching the heuristic path.

  Signed-off-by: khazic <khazzz1c@gmail.com>

* test: cover _extract_model_layers flatten branches

  codecov/patch flagged #1941 at 14.28% (1/7 diff lines hit): every existing test mocks _extract_model_layers, so the new _extend_layers helper and the two modified call sites were unexecuted. Add six tests over the real function covering: class-keyed single FQN (GPT2), string-keyed arm (NemotronH name match), multi-FQN (Qwen2.5-VL), non-ModuleList element kept as a single entry (ModuleDict post-PP-split shape), and both ModuleList/ModuleDict fallback branches as regression guards. Uses Cls.__new__ + nn.Module.__init__ to produce instances whose type(model) matches the exact class in MODEL_CLS_TO_LAYERS (identity lookup — subclasses miss the dict) without HF's config-dependent __init__.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
  Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* style: ruff format + rename ambiguous `l` loop var

  Fix E741 and apply ruff format on the test file.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
  Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* test(qwen3_5): expect flattened per-layer modules after #1941

  #1941 flattens each ModuleList returned from _reduce_attrs, so _extract_model_layers now yields individual decoder modules instead of the containing ModuleLists. Update the two fallback/None-safety assertions added in #1859 to isinstance-check the inner nn.Linear.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
  Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

---------

Signed-off-by: khazic <khazzz1c@gmail.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: khazic <khazzz1c@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
What does this PR do ?
Fixes a silent activation-checkpointing (AC) regression introduced in #1904 that caused Gemma4 31B to OOM when trained with plain FSDP2 + AC (`gemma4_31b.yaml`), tracked in #1927.

Root Cause
Why #1904 (TP+PP) broke AC without touching AC code

`_extract_model_layers` serves a dual purpose: both TP sharding and AC consume the same returned `layers` list. To generate a TP sharding plan for Gemma4, #1904 registered it in `MODEL_CLS_TO_LAYERS`:
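(A hedged sketch of what such a registration looks like; the class below is a local stand-in for the real HF class, and the FQN is an assumption, not the repository's actual entry.)

```python
import torch.nn as nn

class Gemma4ForConditionalGeneration(nn.Module):  # stand-in for the real HF class
    pass

# Maps an exact model class to the FQN(s) of its decoder-layer container(s).
# Lookups use type(model), so only the exact registered class matches.
MODEL_CLS_TO_LAYERS: dict[type, list[str]] = {
    # ... pre-existing VLM entries (Qwen2VL, LlavaNext, Mistral3, Llama4, ...) ...
    Gemma4ForConditionalGeneration: ["model.language_model.layers"],  # added in #1904
}
```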
Before #1904, Gemma4 was not registered, so it fell through to the heuristic path:
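(A minimal sketch of the heuristic path's shape, assuming it walks submodules and iterates any `nn.ModuleList` it finds; the real implementation in NeMo Automodel may differ.)

```python
import torch.nn as nn

def heuristic_layers(model: nn.Module) -> list[nn.Module]:
    # Hypothetical stand-in for the unregistered-model path.
    layers: list[nn.Module] = []
    for module in model.modules():
        if isinstance(module, nn.ModuleList):
            # Iterating the ModuleList yields each decoder layer individually,
            # so every element of `layers` exposes its own self_attn / mlp.
            layers.extend(module)
    return layers
```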
After #1904, Gemma4 hits the registered path instead:
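(A runnable toy of the registered path as described, with a stand-in `_reduce_attrs_sketch` in place of the real helper.)

```python
import torch.nn as nn
from operator import attrgetter

def _reduce_attrs_sketch(model: nn.Module, fqns: list[str]) -> list[nn.Module]:
    # Stand-in for the real _reduce_attrs: resolve each FQN to the module it names.
    return [attrgetter(fqn)(model) for fqn in fqns]

model = nn.Module()
model.language_model = nn.Module()
model.language_model.layers = nn.ModuleList(nn.Linear(4, 4) for _ in range(62))

layers: list[nn.Module] = []
layers.extend(_reduce_attrs_sketch(model, ["language_model.layers"]))
print(len(layers))  # 1: the whole ModuleList is a single entry, not 62 layers
```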
Why the registered path silently breaks AC
`_reduce_attrs` traverses the model by FQN and returns the module at the end of each path as a single item: `layers.extend([ModuleList])` adds the entire `ModuleList` as a single entry. The AC code then iterates `layers` and looks for `self_attn`/`mlp` on each element — `nn.ModuleList` has neither, so all 62 layers are silently skipped. No checkpointing happens, all activations are retained in memory, and the 31B model OOMs at step 1.
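(A condensed, runnable stand-in for the AC eligibility check described above; the real AC loop is not quoted in this PR.)

```python
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.self_attn = nn.Linear(4, 4)
        self.mlp = nn.Linear(4, 4)

# Pre-fix shape of `layers`: one entry holding the entire ModuleList.
layers = [nn.ModuleList(DecoderLayer() for _ in range(62))]

eligible = [m for m in layers if hasattr(m, "self_attn") and hasattr(m, "mlp")]
print(len(eligible))  # 0: the ModuleList exposes neither attribute, so AC wraps nothing
```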
This bug was pre-existing for all registered VLMs

The same `layers.extend(_reduce_attrs(...))` pattern was already in place before #1904 for every model in `MODEL_CLS_TO_LAYERS` (Qwen2VL, LlavaNext, Mistral3, Llama4, etc.). Their AC had also been silently failing, but with TP/PP memory is distributed across many GPUs, so the failure never caused OOM and went unnoticed. Gemma4 31B on plain FSDP2 + AC (8×80GB, no TP/PP) was the first configuration where the memory pressure was high enough to surface it.

Fix
Introduce a small helper `_extend_layers` that flattens any `nn.ModuleList` results from `_reduce_attrs` into individual layers:
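(A minimal sketch consistent with that description; the merged helper may differ in naming and details.)

```python
import torch.nn as nn

def _extend_layers(layers: list[nn.Module], items) -> None:
    """Extend `layers`, flattening any nn.ModuleList into its individual layers."""
    for item in items:
        if isinstance(item, nn.ModuleList):
            layers.extend(item)   # append each decoder layer, matching the heuristic path
        else:
            layers.append(item)   # non-ModuleList results stay as single entries
```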
This replaces both `layers.extend(_reduce_attrs(...))` call sites, fixing AC for all registered model types in one change.

Validation
Reproduced the OOM on `gemma4_31b.yaml` (8× H100, FSDP2 + AC, tp=1 cp=1) on `main` before this fix — it crashes at step 1 with `torch.OutOfMemoryError` in `v_norm` (peak memory jumps to 44.96 GiB, then OOM). After this fix, training runs stably with memory settling at ~40 GiB.
Changelog
- Updated `_extract_model_layers` to flatten `nn.ModuleList` objects returned by `_reduce_attrs` into individual layers, restoring correct AC behavior for all registered VLM model types (Gemma4, Qwen2VL, LlavaNext, Mistral3, Llama4, etc.).

Before your PR is "Ready for review"
Pre checks:
Additional Information