
fix: restore Qwen3.5 + Phi-4-MM nightly CI after transformers v5.5 update#1906

Merged
akoumpa merged 3 commits into main from huiyingl/qwen3_5-phi4mm-transformers-v5.5
Apr 19, 2026

Conversation

@HuiyingLi
Contributor

@HuiyingLi HuiyingLi commented Apr 18, 2026

Summary

Two nightly VLM finetune CI jobs broke after the transformers v5.5 bump (#1734). This PR fixes both.

Changes

nemo_automodel/components/models/qwen3_5_moe/cp_linear_attn.py — port CPAwareGatedDeltaNet._forward_no_cp to the transformers v5.5 per-layer cache API:

  • cache_params.has_previous_state → cache_params.has_previous_state(self.layer_idx) (now a method taking the layer index)
  • Read states only when use_precomputed_states is true
  • Read via cache_params.layers[layer_idx].{conv,recurrent}_states instead of the removed top-level dicts
  • Write via update_conv_state / update_recurrent_state methods instead of conv_states[idx] = ...

Without this, every forward pass with a fresh DynamicCache raised AttributeError: 'DynamicCache' object has no attribute 'conv_states'.
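
For reference, a minimal sketch of the access pattern the port adopts — an illustrative standalone helper, not the actual method body; the real change is woven through CPAwareGatedDeltaNet._forward_no_cp and the gated-delta computation between read and write is elided:

def _read_and_write_states(cache_params, layer_idx, new_conv_state, new_recurrent_state):
    # v5.5: has_previous_state is a method taking the layer index (a plain attribute
    # in <= v5.3), and per-layer state lives under cache_params.layers[layer_idx].
    conv_state = recurrent_state = None
    if cache_params is not None and cache_params.has_previous_state(layer_idx):
        layer = cache_params.layers[layer_idx]
        conv_state = layer.conv_states
        recurrent_state = layer.recurrent_states

    # (in the real method, the gated-delta computation runs here and produces
    # new_conv_state / new_recurrent_state)

    # v5.5: writes go through the update_* methods instead of assigning into the
    # removed top-level conv_states / recurrent_states dicts.
    if cache_params is not None:
        cache_params.update_conv_state(new_conv_state, layer_idx)
        cache_params.update_recurrent_state(new_recurrent_state, layer_idx)
    return conv_state, recurrent_state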

nemo_automodel/_transformers/kernel_patches.py (+ wire-up from utils.py) — bridge the legacy _supports_flash_attn_2 class flag to v5.5's _supports_flash_attn. transformers v5.5 renamed the attribute and switched the _flash_attn_can_dispatch check to the new name only (defaulting to False on PreTrainedModel). Remote-code models pinned against ≤v5.3 (e.g. microsoft/Phi-4-multimodal-instruct sets _supports_flash_attn_2 = True) are unaware of the rename, so their FA2 support becomes invisible to v5.5 and attn_implementation="flash_attention_2" raises ValueError: Phi4MMForCausalLM does not support Flash Attention 2.

Fix: install a property on PreTrainedModel._supports_flash_attn that falls back to the legacy flag when a subclass hasn't set the new one. Subclasses that set _supports_flash_attn directly still shadow the property via normal MRO lookup, so native v5.5 models are unaffected. Called from apply_cache_compatibility_patches() so it runs at the same setup point as the other v5 compat shims.

After the bridge, Phi-4-MM dispatches to FA2 on v5.5 (confirmed by is_flash_attn_greater_or_equal_2_10 being called during forward; memory also drops from ~11.37 GiB SDPA → ~10.97 GiB FA2).
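
A minimal sketch of the bridge, assuming only what is described above (the actual patch is _patch_legacy_flash_attn_flag in kernel_patches.py; this is illustrative, not the verbatim implementation):

from transformers import PreTrainedModel

_UNSET = object()

def _patch_legacy_flash_attn_flag():
    # Idempotent: skip if the bridge property is already installed on the base class.
    if isinstance(vars(PreTrainedModel).get("_supports_flash_attn"), property):
        return

    @property
    def _supports_flash_attn(self):
        # Only reached when no class in the MRO sets the new flag directly; a subclass
        # that does set it shadows this property via normal MRO lookup.
        legacy = getattr(type(self), "_supports_flash_attn_2", _UNSET)
        if legacy is not _UNSET:
            return bool(legacy)  # bridge the <= v5.3 flag
        return False             # preserve the v5.5 base-class default

    PreTrainedModel._supports_flash_attn = _supports_flash_attn

apply_cache_compatibility_patches() then invokes the bridge at the same setup point as the other v5 compat shims.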

Tests

  • TestPatchLegacyFlashAttnFlag (in tests/unit_tests/_transformers/test_auto_model.py): property installed, idempotent, legacy-True bridges, explicit new-flag True/False both shadow, base default preserved, legacy-False does not bridge, nearest-in-MRO wins.
  • TestForwardNoCpV55CacheAPI (in tests/unit_tests/models/qwen3_5_moe/test_cp_linear_attn.py): training-style DynamicCache runs without error, no-cache path still works, update_conv_state / update_recurrent_state invoked with the layer's layer_idx, has_previous_state(layer_idx) called as a method.
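
Continuing the sketch above, a toy check of the bridge semantics those tests pin down (hypothetical subclass names; instances are created without __init__ purely to sidestep config plumbing in this sketch):

class LegacyRemote(PreTrainedModel):
    _supports_flash_attn_2 = True    # what <= v5.3 remote code sets

class NativeV55(PreTrainedModel):
    _supports_flash_attn = False     # explicit new flag shadows the property

_patch_legacy_flash_attn_flag()
assert object.__new__(LegacyRemote)._supports_flash_attn is True    # legacy True bridges
assert object.__new__(NativeV55)._supports_flash_attn is False      # explicit new flag wins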

Linked CI jobs

Test plan

  • Reproduced both failures locally against the same container image, then confirmed the fixes pass
  • torchrun --nproc-per-node=8 examples/vlm_finetune/finetune.py -c examples/vlm_finetune/qwen3_5/qwen3_5_4b.yaml (max_steps=2) — steps 0/1 + validation + checkpoint, exit 0
  • torchrun --nproc-per-node=8 examples/vlm_finetune/finetune.py -c examples/vlm_finetune/phi4/phi4_mm_cv17.yaml (max_steps=1) — step 0 (loss 2.8924, FA2 confirmed) + validation + checkpoint, exit 0
  • New unit tests: pytest tests/unit_tests/_transformers/test_auto_model.py::TestPatchLegacyFlashAttnFlag tests/unit_tests/models/qwen3_5_moe/test_cp_linear_attn.py::TestForwardNoCpV55CacheAPI — 12 passed
  • Nightly VLM CI green

Not in this PR

nemotron_parse_v1_1 nightly (job 300041608) fails offline because the CI's HF cache is missing nvidia/C-RADIOv2-H. That's a cache-seeding fix on the CI side (add huggingface-cli download nvidia/C-RADIOv2-H to the pre-cache step), not a library bug — tracked separately.

🤖 Generated with Claude Code

@copy-pr-bot

copy-pr-bot Bot commented Apr 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@HuiyingLi HuiyingLi force-pushed the huiyingl/qwen3_5-phi4mm-transformers-v5.5 branch 2 times, most recently from f4022ee to de92f81 on April 18, 2026 at 21:25
…date

- Port Qwen3.5 MoE CPAwareGatedDeltaNet._forward_no_cp to the v5.5 per-layer
  cache API (has_previous_state method, cache.layers[idx].{conv,recurrent}_states,
  update_conv_state/update_recurrent_state) — fixes
  AttributeError: 'DynamicCache' object has no attribute 'conv_states' on every
  forward pass.
- Bridge the legacy `_supports_flash_attn_2` class flag to v5.5's
  `_supports_flash_attn` (renamed + default-False on the base). Remote-code
  models pinned against <=v5.3 (e.g. microsoft/Phi-4-multimodal-instruct) only
  set the legacy flag and their FA2 support becomes invisible to v5.5 — FA2
  dispatch then raises ValueError even though the model supports it. Install
  a property on PreTrainedModel that honors the legacy flag as a fallback
  when a subclass has not set the new one; subclasses that set the new flag
  directly still shadow the property via MRO, so native models are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
@HuiyingLi HuiyingLi force-pushed the huiyingl/qwen3_5-phi4mm-transformers-v5.5 branch from de92f81 to 2b30195 on April 18, 2026 at 21:33
- TestPatchLegacyFlashAttnFlag: legacy `_supports_flash_attn_2 = True` bridges
  to `_supports_flash_attn`; explicit new flag (True/False) shadows via MRO;
  `False` legacy flag does not bridge; nearest-in-MRO wins; idempotent.
- TestForwardNoCpV55CacheAPI: `_forward_no_cp` runs with a fresh DynamicCache
  (training path), runs without a cache, calls `update_conv_state` /
  `update_recurrent_state` with the layer's `layer_idx`, and calls
  `has_previous_state(layer_idx)` as a method.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
@HuiyingLi
Contributor Author

/claude review

@claude
Contributor

claude Bot commented Apr 18, 2026

PR description / diff mismatch

The PR description documents changes to nemo_automodel/_transformers/model_init.py (forcing _attn_implementation on remote-code configs, recursing into sub_configs, and kwargs.setdefault("config", hf_config) in the HF fallback path), but none of those changes appear in the diff.

Either those changes were accidentally dropped from this branch, or the description is stale. Worth confirming — the description frames them as load-bearing for the Phi-4-MM fix.

Comment on lines +338 to +340
with (
patch.object(cache, "update_conv_state", wraps=cache.update_conv_state) as mock_update_conv,
patch.object(cache, "update_recurrent_state", wraps=cache.update_recurrent_state) as mock_update_rec,
Contributor


Nit: this assertion is correct but hard to follow. A simpler equivalent:

Suggested change
with (
patch.object(cache, "update_conv_state", wraps=cache.update_conv_state) as mock_update_conv,
patch.object(cache, "update_recurrent_state", wraps=cache.update_recurrent_state) as mock_update_rec,
assert mock_update_conv.call_args.args[1] == module.layer_idx

The call site always passes layer_idx as the second positional arg, so indexing args[1] directly is sufficient and easier to read.

@HuiyingLi
Contributor Author

Updated the PR description to match the current diff — earlier version referred to a prior model_init.py approach that was dropped in favor of the _patch_legacy_flash_attn_flag bridge in kernel_patches.py.

/claude review

Contributor

@claude claude Bot left a comment


LGTM — the code changes are correct and tests are thorough.

One note: the PR description's nemo_automodel/_transformers/model_init.py section describes changes (forcing hf_config._attn_implementation, recursing into sub_configs, kwargs.setdefault("config", hf_config)) that are not present in the diff. The actual Phi-4-MM fix is the FA2 flag bridge in kernel_patches.py, which is a different approach. Consider updating the description to match the actual implementation so reviewers aren't looking for changes that don't exist.

@HuiyingLi
Contributor Author

PR description now updated correctly (previous edit was swallowed by a GraphQL warning on the Projects classic API). Could you re-review?

/claude review

Comment on lines +344 to +347
# Written at the layer_idx owned by the module.
_, call_kwargs = mock_update_conv.call_args
args = mock_update_conv.call_args.args
assert (args + (call_kwargs.get("layer_idx"),))[1] == module.layer_idx
Contributor


Nit: this three-line assertion is correct but hard to parse at a glance. Consider simplifying — the call is always positional (update_conv_state(conv_state, self.layer_idx)), so:

Suggested change
# Written at the layer_idx owned by the module.
_, call_kwargs = mock_update_conv.call_args
args = mock_update_conv.call_args.args
assert (args + (call_kwargs.get("layer_idx"),))[1] == module.layer_idx
args, kwargs = mock_update_conv.call_args
assert args[1] == module.layer_idx

The current code handles both positional and keyword styles, but the production code only ever calls it positionally, and the simpler form makes the intent immediately obvious.

claude Bot previously approved these changes Apr 18, 2026
Contributor

@claude claude Bot left a comment


LGTM — both fixes are correct and well-tested. The FA2 flag bridge handles MRO edge cases cleanly, and the cache API port matches v5.5 semantics. Left one minor readability nit on a test assertion, nothing blocking.

Addresses review nit — the production call is always positional, so the
keyword-fallback branch was dead code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
@HuiyingLi
Contributor Author

/ok to test 215b24f

@akoumpa akoumpa merged commit c72b931 into main Apr 19, 2026
56 checks passed
@akoumpa akoumpa deleted the huiyingl/qwen3_5-phi4mm-transformers-v5.5 branch April 19, 2026 01:54
svcnvidia-nemo-ci pushed a commit that referenced this pull request Apr 19, 2026
…date (#1906)

HuiyingLi added a commit that referenced this pull request Apr 19, 2026
… `r0.4.0` (#1908)

fix: restore Qwen3.5 + Phi-4-MM nightly CI after transformers v5.5 update (#1906)

linnanwang pushed a commit that referenced this pull request Apr 24, 2026
…date (#1906)


Labels

r0.4.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge.
