fix: restore Qwen3.5 + Phi-4-MM nightly CI after transformers v5.5 update#1906
Conversation
Force-pushed f4022ee to de92f81

fix: restore Qwen3.5 + Phi-4-MM nightly CI after transformers v5.5 update
- Port Qwen3.5 MoE CPAwareGatedDeltaNet._forward_no_cp to the v5.5 per-layer
cache API (has_previous_state method, cache.layers[idx].{conv,recurrent}_states,
update_conv_state/update_recurrent_state) — fixes
AttributeError: 'DynamicCache' object has no attribute 'conv_states' on every
forward pass.
- Bridge the legacy `_supports_flash_attn_2` class flag to v5.5's
`_supports_flash_attn` (renamed + default-False on the base). Remote-code
models pinned against <=v5.3 (e.g. microsoft/Phi-4-multimodal-instruct) only
set the legacy flag and their FA2 support becomes invisible to v5.5 — FA2
dispatch then raises ValueError even though the model supports it. Install
a property on PreTrainedModel that honors the legacy flag as a fallback
when a subclass has not set the new one; subclasses that set the new flag
directly still shadow the property via MRO, so native models are unaffected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
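For context, a minimal, self-contained sketch of the per-layer cache access pattern this commit describes. The `_CacheStub` and `_LayerState` classes are stand-ins invented for illustration (the real objects are transformers' `DynamicCache` and its per-layer entries), and the method signatures follow the PR text (`update_conv_state(conv_state, layer_idx)`, `has_previous_state(layer_idx)`) rather than the upstream source:

```python
import torch


class _LayerState:
    """Stand-in for one per-layer cache entry (illustration only)."""

    def __init__(self):
        self.conv_states = None        # replaces the removed top-level conv_states dict
        self.recurrent_states = None   # replaces the removed top-level recurrent_states dict


class _CacheStub:
    """Stand-in for the v5.5-style DynamicCache as described in this PR."""

    def __init__(self, num_layers: int):
        self.layers = [_LayerState() for _ in range(num_layers)]

    def has_previous_state(self, layer_idx: int) -> bool:
        # v5.5: a method taking layer_idx, no longer a plain attribute.
        return self.layers[layer_idx].conv_states is not None

    def update_conv_state(self, conv_state: torch.Tensor, layer_idx: int) -> torch.Tensor:
        self.layers[layer_idx].conv_states = conv_state
        return conv_state

    def update_recurrent_state(self, recurrent_state: torch.Tensor, layer_idx: int) -> torch.Tensor:
        self.layers[layer_idx].recurrent_states = recurrent_state
        return recurrent_state


# Old pattern (raises AttributeError on a fresh v5.5 DynamicCache):
#     cache_params.conv_states[self.layer_idx] = conv_state
# New pattern, roughly what the ported _forward_no_cp does:
cache, layer_idx = _CacheStub(num_layers=2), 0
cache.update_conv_state(torch.zeros(1, 8, 4), layer_idx)
use_precomputed_states = cache.has_previous_state(layer_idx)
prev_conv = cache.layers[layer_idx].conv_states
```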
Force-pushed de92f81 to 2b30195
test: cover FA2 flag bridge and Qwen3.5 v5.5 cache API

- TestPatchLegacyFlashAttnFlag: legacy `_supports_flash_attn_2 = True` bridges to `_supports_flash_attn`; explicit new flag (True/False) shadows via MRO; `False` legacy flag does not bridge; nearest-in-MRO wins; idempotent.
- TestForwardNoCpV55CacheAPI: `_forward_no_cp` runs with a fresh DynamicCache (training path), runs without a cache, calls `update_conv_state` / `update_recurrent_state` with the layer's `layer_idx`, and calls `has_previous_state(layer_idx)` as a method.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
/claude review
PR description / diff mismatch: the PR description documents changes to `nemo_automodel/_transformers/model_init.py` that are not present in this diff. Either those changes were accidentally dropped from this branch, or the description is stale. Worth confirming — the description frames them as load-bearing for the Phi-4-MM fix.
```python
with (
    patch.object(cache, "update_conv_state", wraps=cache.update_conv_state) as mock_update_conv,
    patch.object(cache, "update_recurrent_state", wraps=cache.update_recurrent_state) as mock_update_rec,
```
Nit: this assertion is correct but hard to follow. A simpler equivalent:
```python
assert mock_update_conv.call_args.args[1] == module.layer_idx
```
The call site always passes layer_idx as the second positional arg, so indexing args[1] directly is sufficient and easier to read.
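For anyone skimming the thread, a tiny self-contained illustration of the `call_args.args` indexing being suggested (plain `unittest.mock`, not the PR's actual test fixture):

```python
from unittest.mock import MagicMock

update_conv_state = MagicMock()
conv_state = object()             # stand-in for the tensor argument
update_conv_state(conv_state, 3)  # production call style: (conv_state, self.layer_idx)

assert update_conv_state.call_args.args[1] == 3  # second positional arg is layer_idx
```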
Updated the PR description to match the current diff — the earlier version described a prior approach.

/claude review
LGTM — the code changes are correct and tests are thorough.
One note: the PR description's `nemo_automodel/_transformers/model_init.py` section describes changes (forcing `hf_config._attn_implementation`, recursing into `sub_configs`, `kwargs.setdefault("config", hf_config)`) that are not present in the diff. The actual Phi-4-MM fix is the FA2 flag bridge in `kernel_patches.py`, which is a different approach. Consider updating the description to match the actual implementation so reviewers aren't looking for changes that don't exist.
PR description now updated correctly (previous edit was swallowed by a GraphQL warning on the Projects classic API). Could you re-review? /claude review |
```python
# Written at the layer_idx owned by the module.
_, call_kwargs = mock_update_conv.call_args
args = mock_update_conv.call_args.args
assert (args + (call_kwargs.get("layer_idx"),))[1] == module.layer_idx
```
Nit: this three-line assertion is correct but hard to parse at a glance. Consider simplifying — the call is always positional (`update_conv_state(conv_state, self.layer_idx)`), so:
```python
# Written at the layer_idx owned by the module.
args, kwargs = mock_update_conv.call_args
assert args[1] == module.layer_idx
```
The current code handles both positional and keyword styles, but the production code only ever calls it positionally, and the simpler form makes the intent immediately obvious.
test: simplify update_conv_state arg assertion

Addresses review nit — the production call is always positional, so the keyword-fallback branch was dead code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
/ok to test 215b24f |
Merged as fix: restore Qwen3.5 + Phi-4-MM nightly CI after transformers v5.5 update (#1906); the same fix was cherry-picked to `r0.4.0` in #1908.
Summary
Two nightly VLM finetune CI jobs broke after the transformers v5.5 bump (#1734). This PR fixes both.
Changes
- `nemo_automodel/components/models/qwen3_5_moe/cp_linear_attn.py` — port `CPAwareGatedDeltaNet._forward_no_cp` to the transformers v5.5 per-layer cache API:
  - `cache_params.has_previous_state` → `cache_params.has_previous_state(self.layer_idx)` (now a method) when deciding whether `use_precomputed_states` is true
  - `cache_params.layers[layer_idx].{conv,recurrent}_states` instead of the removed top-level dicts
  - `update_conv_state` / `update_recurrent_state` methods instead of `conv_states[idx] = ...`

  Without this, every forward pass with a fresh `DynamicCache` raised `AttributeError: 'DynamicCache' object has no attribute 'conv_states'`.

- `nemo_automodel/_transformers/kernel_patches.py` (+ wire-up from `utils.py`) — bridge the legacy `_supports_flash_attn_2` class flag to v5.5's `_supports_flash_attn`. transformers v5.5 renamed the attribute and switched the `_flash_attn_can_dispatch` check to the new name only (defaulting to `False` on `PreTrainedModel`). Remote-code models pinned against ≤v5.3 (e.g. `microsoft/Phi-4-multimodal-instruct` sets `_supports_flash_attn_2 = True`) are unaware of the rename, so their FA2 support becomes invisible to v5.5 and `attn_implementation="flash_attention_2"` raises `ValueError: Phi4MMForCausalLM does not support Flash Attention 2`.

  Fix: install a `PreTrainedModel._supports_flash_attn` property that falls back to the legacy flag when a subclass hasn't set the new one. Subclasses that set `_supports_flash_attn` directly still shadow the property via normal MRO lookup, so native v5.5 models are unaffected. Called from `apply_cache_compatibility_patches()` so it runs at the same setup point as the other v5 compat shims. (A minimal sketch of the bridge idea follows this list.)

  After the bridge, Phi-4-MM dispatches to FA2 on v5.5 (confirmed by `is_flash_attn_greater_or_equal_2_10` being called during forward; memory also drops from ~11.37 GiB SDPA → ~10.97 GiB FA2).
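A minimal sketch of the bridge idea, using a stand-in base class rather than the real `PreTrainedModel` and an assumed getter and helper name (the exact patch in `kernel_patches.py` may differ):

```python
class PreTrainedModelStub:
    """Stand-in for transformers.PreTrainedModel (illustration only)."""

    _supports_flash_attn_2 = False  # legacy pre-v5.5 flag; the new flag defaults to False on the base


def bridge_legacy_flash_attn_flag(base_cls=PreTrainedModelStub):
    """Install a _supports_flash_attn property that falls back to the legacy flag."""
    if isinstance(base_cls.__dict__.get("_supports_flash_attn"), property):
        return  # already patched, so the helper is idempotent

    def _supports_flash_attn(self):
        # Honor the legacy flag from whichever class in the MRO set it.
        return bool(getattr(type(self), "_supports_flash_attn_2", False))

    base_cls._supports_flash_attn = property(_supports_flash_attn)


class LegacyRemoteCodeModel(PreTrainedModelStub):
    _supports_flash_attn_2 = True   # e.g. a remote-code model pinned against <=v5.3


class NativeV55Model(PreTrainedModelStub):
    _supports_flash_attn = True     # plain class attribute shadows the base property via MRO


bridge_legacy_flash_attn_flag()
assert LegacyRemoteCodeModel()._supports_flash_attn      # bridged from the legacy flag
assert NativeV55Model()._supports_flash_attn             # unaffected: shadows the property
assert not PreTrainedModelStub()._supports_flash_attn    # base default stays False
```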
Tests

- `TestPatchLegacyFlashAttnFlag` (in `tests/unit_tests/_transformers/test_auto_model.py`): property installed, idempotent, legacy-True bridges, explicit new-flag True/False both shadow, base default preserved, legacy-False does not bridge, nearest-in-MRO wins.
- `TestForwardNoCpV55CacheAPI` (in `tests/unit_tests/models/qwen3_5_moe/test_cp_linear_attn.py`): training-style `DynamicCache` runs without error, no-cache path still works, `update_conv_state` / `update_recurrent_state` invoked with the layer's `layer_idx`, `has_previous_state(layer_idx)` called as a method.

Linked CI jobs
Test plan
- `torchrun --nproc-per-node=8 examples/vlm_finetune/finetune.py -c examples/vlm_finetune/qwen3_5/qwen3_5_4b.yaml` (max_steps=2) — steps 0/1 + validation + checkpoint, exit 0
- `torchrun --nproc-per-node=8 examples/vlm_finetune/finetune.py -c examples/vlm_finetune/phi4/phi4_mm_cv17.yaml` (max_steps=1) — step 0 (loss 2.8924, FA2 confirmed) + validation + checkpoint, exit 0
- `pytest tests/unit_tests/_transformers/test_auto_model.py::TestPatchLegacyFlashAttnFlag tests/unit_tests/models/qwen3_5_moe/test_cp_linear_attn.py::TestForwardNoCpV55CacheAPI` — 12 passed

Not in this PR
- The `nemotron_parse_v1_1` nightly (job 300041608) fails offline because the CI's HF cache is missing `nvidia/C-RADIOv2-H`. That's a cache-seeding fix on the CI side (add `huggingface-cli download nvidia/C-RADIOv2-H` to the pre-cache step), not a library bug — tracked separately.

🤖 Generated with Claude Code
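If useful for that follow-up, the pre-cache step could equivalently be done from Python via `huggingface_hub` (hypothetical snippet, not part of this PR):

```python
from huggingface_hub import snapshot_download

# Seed the CI's HF cache so the offline nemotron_parse_v1_1 job can find the model.
snapshot_download("nvidia/C-RADIOv2-H")
```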