
fix: restore Qwen3.5 + Phi-4-MM nightly CI after transformers v5.5 update#1906

Merged
akoumpa merged 3 commits into main from huiyingl/qwen3_5-phi4mm-transformers-v5.5
Apr 19, 2026

Conversation

@HuiyingLi
Contributor

@HuiyingLi HuiyingLi commented Apr 18, 2026

Summary

Two nightly VLM finetune CI jobs broke after the transformers v5.5 bump (#1734). This PR fixes both.

Changes

nemo_automodel/components/models/qwen3_5_moe/cp_linear_attn.py — port CPAwareGatedDeltaNet._forward_no_cp to the transformers v5.5 per-layer cache API:

  • cache_params.has_previous_state → cache_params.has_previous_state(self.layer_idx) (now a method taking the layer index)
  • Read states only when use_precomputed_states is true
  • Read via cache_params.layers[layer_idx].{conv,recurrent}_states instead of the removed top-level dicts
  • Write via update_conv_state / update_recurrent_state methods instead of conv_states[idx] = ...

Without this, every forward pass with a fresh DynamicCache raised AttributeError: 'DynamicCache' object has no attribute 'conv_states'.
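
For reference, a minimal sketch of the access pattern the port adopts — an illustrative standalone helper, not the actual method body; the real change is woven through CPAwareGatedDeltaNet._forward_no_cp and the gated-delta computation between read and write is elided:

def _read_and_write_states(cache_params, layer_idx, new_conv_state, new_recurrent_state):
    # v5.5: has_previous_state is a method taking the layer index (a plain attribute
    # in <= v5.3), and per-layer state lives under cache_params.layers[layer_idx].
    conv_state = recurrent_state = None
    if cache_params is not None and cache_params.has_previous_state(layer_idx):
        layer = cache_params.layers[layer_idx]
        conv_state = layer.conv_states
        recurrent_state = layer.recurrent_states

    # (in the real method, the gated-delta computation runs here and produces
    # new_conv_state / new_recurrent_state)

    # v5.5: writes go through the update_* methods instead of assigning into the
    # removed top-level conv_states / recurrent_states dicts.
    if cache_params is not None:
        cache_params.update_conv_state(new_conv_state, layer_idx)
        cache_params.update_recurrent_state(new_recurrent_state, layer_idx)
    return conv_state, recurrent_state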

nemo_automodel/_transformers/kernel_patches.py (+ wire-up from utils.py) — bridge the legacy _supports_flash_attn_2 class flag to v5.5's _supports_flash_attn. transformers v5.5 renamed the attribute and switched the _flash_attn_can_dispatch check to the new name only (defaulting to False on PreTrainedModel). Remote-code models pinned against ≤v5.3 (e.g. microsoft/Phi-4-multimodal-instruct sets _supports_flash_attn_2 = True) are unaware of the rename, so their FA2 support becomes invisible to v5.5 and attn_implementation="flash_attention_2" raises ValueError: Phi4MMForCausalLM does not support Flash Attention 2.

Fix: install a property on PreTrainedModel._supports_flash_attn that falls back to the legacy flag when a subclass hasn't set the new one. Subclasses that set _supports_flash_attn directly still shadow the property via normal MRO lookup, so native v5.5 models are unaffected. Called from apply_cache_compatibility_patches() so it runs at the same setup point as the other v5 compat shims.

After the bridge, Phi-4-MM dispatches to FA2 on v5.5 (confirmed by is_flash_attn_greater_or_equal_2_10 being called during forward; memory also drops from ~11.37 GiB SDPA → ~10.97 GiB FA2).
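
A minimal sketch of the bridge, assuming only what is described above (the actual patch is _patch_legacy_flash_attn_flag in kernel_patches.py; this is illustrative, not the verbatim implementation):

from transformers import PreTrainedModel

_UNSET = object()

def _patch_legacy_flash_attn_flag():
    # Idempotent: skip if the bridge property is already installed on the base class.
    if isinstance(vars(PreTrainedModel).get("_supports_flash_attn"), property):
        return

    @property
    def _supports_flash_attn(self):
        # Only reached when no class in the MRO sets the new flag directly; a subclass
        # that does set it shadows this property via normal MRO lookup.
        legacy = getattr(type(self), "_supports_flash_attn_2", _UNSET)
        if legacy is not _UNSET:
            return bool(legacy)  # bridge the <= v5.3 flag
        return False             # preserve the v5.5 base-class default

    PreTrainedModel._supports_flash_attn = _supports_flash_attn

apply_cache_compatibility_patches() then invokes the bridge at the same setup point as the other v5 compat shims.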

Tests

  • TestPatchLegacyFlashAttnFlag (in tests/unit_tests/_transformers/test_auto_model.py): property installed, idempotent, legacy-True bridges, explicit new-flag True/False both shadow, base default preserved, legacy-False does not bridge, nearest-in-MRO wins.
  • TestForwardNoCpV55CacheAPI (in tests/unit_tests/models/qwen3_5_moe/test_cp_linear_attn.py): training-style DynamicCache runs without error, no-cache path still works, update_conv_state / update_recurrent_state invoked with the layer's layer_idx, has_previous_state(layer_idx) called as a method.
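
Continuing the sketch above, a toy check of the bridge semantics those tests pin down (hypothetical subclass names; instances are created without __init__ purely to sidestep config plumbing in this sketch):

class LegacyRemote(PreTrainedModel):
    _supports_flash_attn_2 = True    # what <= v5.3 remote code sets

class NativeV55(PreTrainedModel):
    _supports_flash_attn = False     # explicit new flag shadows the property

_patch_legacy_flash_attn_flag()
assert object.__new__(LegacyRemote)._supports_flash_attn is True    # legacy True bridges
assert object.__new__(NativeV55)._supports_flash_attn is False      # explicit new flag wins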

Linked CI jobs

Test plan

  • Reproduced both failures locally against the same container image, then confirmed the fixes pass
  • torchrun --nproc-per-node=8 examples/vlm_finetune/finetune.py -c examples/vlm_finetune/qwen3_5/qwen3_5_4b.yaml (max_steps=2) — steps 0/1 + validation + checkpoint, exit 0
  • torchrun --nproc-per-node=8 examples/vlm_finetune/finetune.py -c examples/vlm_finetune/phi4/phi4_mm_cv17.yaml (max_steps=1) — step 0 (loss 2.8924, FA2 confirmed) + validation + checkpoint, exit 0
  • New unit tests: pytest tests/unit_tests/_transformers/test_auto_model.py::TestPatchLegacyFlashAttnFlag tests/unit_tests/models/qwen3_5_moe/test_cp_linear_attn.py::TestForwardNoCpV55CacheAPI — 12 passed
  • Nightly VLM CI green

Not in this PR

nemotron_parse_v1_1 nightly (job 300041608) fails offline because the CI's HF cache is missing nvidia/C-RADIOv2-H. That's a cache-seeding fix on the CI side (add huggingface-cli download nvidia/C-RADIOv2-H to the pre-cache step), not a library bug — tracked separately.

🤖 Generated with Claude Code

@copy-pr-bot

copy-pr-bot Bot commented Apr 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@HuiyingLi HuiyingLi force-pushed the huiyingl/qwen3_5-phi4mm-transformers-v5.5 branch 2 times, most recently from f4022ee to de92f81 on April 18, 2026 at 21:25
…date

- Port Qwen3.5 MoE CPAwareGatedDeltaNet._forward_no_cp to the v5.5 per-layer
  cache API (has_previous_state method, cache.layers[idx].{conv,recurrent}_states,
  update_conv_state/update_recurrent_state) — fixes
  AttributeError: 'DynamicCache' object has no attribute 'conv_states' on every
  forward pass.
- Bridge the legacy `_supports_flash_attn_2` class flag to v5.5's
  `_supports_flash_attn` (renamed + default-False on the base). Remote-code
  models pinned against <=v5.3 (e.g. microsoft/Phi-4-multimodal-instruct) only
  set the legacy flag and their FA2 support becomes invisible to v5.5 — FA2
  dispatch then raises ValueError even though the model supports it. Install
  a property on PreTrainedModel that honors the legacy flag as a fallback
  when a subclass has not set the new one; subclasses that set the new flag
  directly still shadow the property via MRO, so native models are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
@HuiyingLi HuiyingLi force-pushed the huiyingl/qwen3_5-phi4mm-transformers-v5.5 branch from de92f81 to 2b30195 on April 18, 2026 at 21:33
- TestPatchLegacyFlashAttnFlag: legacy `_supports_flash_attn_2 = True` bridges
  to `_supports_flash_attn`; explicit new flag (True/False) shadows via MRO;
  `False` legacy flag does not bridge; nearest-in-MRO wins; idempotent.
- TestForwardNoCpV55CacheAPI: `_forward_no_cp` runs with a fresh DynamicCache
  (training path), runs without a cache, calls `update_conv_state` /
  `update_recurrent_state` with the layer's `layer_idx`, and calls
  `has_previous_state(layer_idx)` as a method.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
@HuiyingLi
Contributor Author

/claude review

@claude
Contributor

claude Bot commented Apr 18, 2026

PR description / diff mismatch

The PR description documents changes to nemo_automodel/_transformers/model_init.py (forcing _attn_implementation on remote-code configs, recursing into sub_configs, and kwargs.setdefault("config", hf_config) in the HF fallback path), but none of those changes appear in the diff.

Either those changes were accidentally dropped from this branch, or the description is stale. Worth confirming — the description frames them as load-bearing for the Phi-4-MM fix.

Comment on lines +338 to +340
with (
patch.object(cache, "update_conv_state", wraps=cache.update_conv_state) as mock_update_conv,
patch.object(cache, "update_recurrent_state", wraps=cache.update_recurrent_state) as mock_update_rec,
Contributor


Nit: this assertion is correct but hard to follow. A simpler equivalent:

Suggested change
with (
patch.object(cache, "update_conv_state", wraps=cache.update_conv_state) as mock_update_conv,
patch.object(cache, "update_recurrent_state", wraps=cache.update_recurrent_state) as mock_update_rec,
assert mock_update_conv.call_args.args[1] == module.layer_idx

The call site always passes layer_idx as the second positional arg, so indexing args[1] directly is sufficient and easier to read.

@HuiyingLi
Contributor Author

Updated the PR description to match the current diff — earlier version referred to a prior model_init.py approach that was dropped in favor of the _patch_legacy_flash_attn_flag bridge in kernel_patches.py.

/claude review

Contributor

@claude claude Bot left a comment


LGTM — the code changes are correct and tests are thorough.

One note: the PR description's nemo_automodel/_transformers/model_init.py section describes changes (forcing hf_config._attn_implementation, recursing into sub_configs, kwargs.setdefault("config", hf_config)) that are not present in the diff. The actual Phi-4-MM fix is the FA2 flag bridge in kernel_patches.py, which is a different approach. Consider updating the description to match the actual implementation so reviewers aren't looking for changes that don't exist.

@HuiyingLi
Contributor Author

PR description now updated correctly (previous edit was swallowed by a GraphQL warning on the Projects classic API). Could you re-review?

/claude review

Comment on lines +344 to +347
# Written at the layer_idx owned by the module.
_, call_kwargs = mock_update_conv.call_args
args = mock_update_conv.call_args.args
assert (args + (call_kwargs.get("layer_idx"),))[1] == module.layer_idx
Contributor


Nit: this three-line assertion is correct but hard to parse at a glance. Consider simplifying — the call is always positional (update_conv_state(conv_state, self.layer_idx)), so:

Suggested change
# Written at the layer_idx owned by the module.
_, call_kwargs = mock_update_conv.call_args
args = mock_update_conv.call_args.args
assert (args + (call_kwargs.get("layer_idx"),))[1] == module.layer_idx
args, kwargs = mock_update_conv.call_args
assert args[1] == module.layer_idx

The current code handles both positional and keyword styles, but the production code only ever calls it positionally, and the simpler form makes the intent immediately obvious.

claude Bot previously approved these changes Apr 18, 2026
Contributor

@claude claude Bot left a comment


LGTM — both fixes are correct and well-tested. The FA2 flag bridge handles MRO edge cases cleanly, and the cache API port matches v5.5 semantics. Left one minor readability nit on a test assertion, nothing blocking.

Addresses review nit — the production call is always positional, so the
keyword-fallback branch was dead code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
@HuiyingLi
Contributor Author

/ok to test 215b24f

@akoumpa akoumpa merged commit c72b931 into main Apr 19, 2026
56 checks passed
@akoumpa akoumpa deleted the huiyingl/qwen3_5-phi4mm-transformers-v5.5 branch April 19, 2026 01:54
svcnvidia-nemo-ci pushed a commit that referenced this pull request Apr 19, 2026
…date (#1906)

HuiyingLi added a commit that referenced this pull request Apr 19, 2026
… `r0.4.0` (#1908)

fix: restore Qwen3.5 + Phi-4-MM nightly CI after transformers v5.5 update (#1906)

linnanwang pushed a commit that referenced this pull request Apr 24, 2026
…date (#1906)


Labels

r0.4.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge.
