[dependencies] Upgrade transformers to >=5.0.0,<=5.3.0 #1426
erictang000 merged 15 commits into main
Conversation
|       mhc_expansion_rate: mHC expansion rate. Connectors are trainable when this is > 1.
|       """
|
| -     # Type hints for config attributes
Do we need to remove these? It would be good to keep them for documentation purposes if possible :)
For the tx backend, you will also need to adapt for the change that
| + # Broadcast non-persistent buffers (e.g. inv_freq from RotaryEmbedding) that
| + # are excluded from state_dict. On non-rank-0 meta-init these are still on
| + # meta device with no data; rank 0 has the correctly computed values.
| + _sync_non_persistent_buffers(model, sharded_sd)
I'm curious, do you know why upgrading transformers necessitates this change? Seems a little surprising :)
|   if hasattr(provider, "q_lora_rank") and hasattr(hf_config, "q_lora_rank"):
|       provider.q_lora_rank = hf_config.q_lora_rank
|
| + # Workaround for transformers v5 moving rope_theta into rope_parameters
Curious why this is needed, since megatron-bridge already updated NVIDIA-NeMo/Megatron-Bridge#2068 -- if this is still needed, should we raise an issue against megatron-bridge so we can remove this workaround going forward?
|   def __init__(
|       self,
| -     config: PretrainedConfig | dict,
| +     config: PretrainedConfig | dict | None = None,
Do you know why this is needed now? Who is calling this without passing in a config? I'm also concerned that the defaults in
max_lora_adapters: int = 0,
max_lora_rank: int = 0,
shard_attention_heads: bool = True,
could cause trouble and it would be better to not need to have the **kwargs part, since it can mask problems.
seems like it's this PR in transformers 5.4.0: huggingface/transformers#41250
i'm pinning to <= 5.3.0 so it actually isn't an issue right now (but i guess i was testing with 5.4.0 when originally changing this code). I can revert the changes here for now and we can revisit when upgrading to >=5.4.0.
seems like megatron-bridge caps at <=5.3.0 as well, and there's some relevant activity on transformers so these changes could be avoided in the future: huggingface/transformers#45070
|   tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
|   hf_model = AutoModelForCausalLM.from_pretrained(
| -     model_name, attn_implementation="eager", use_safetensors=True, trust_remote_code=True
| +     model_name, attn_implementation="eager", use_safetensors=True, torch_dtype=torch.float32
this is needed now since from_pretrained in v5 defaults to the model's own dtype (from the checkpoint/config) instead of defaulting to float32.
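In practice that means tests comparing against float32 references should pin the dtype explicitly. A minimal sketch of the new behavior (the model name is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM

# v5 resolves dtype from the checkpoint/config instead of defaulting to fp32,
# so request float32 explicitly when the test math assumes full precision.
hf_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",  # illustrative model name
    attn_implementation="eager",
    torch_dtype=torch.float32,
)
assert next(hf_model.parameters()).dtype == torch.float32
```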
The split invocation with `--with transformers==5.2.0` was added in NovaSky-AI#1228 when pyproject.toml still pinned transformers <5, to let the new Qwen 3.5 test use v5 while the rest of the suite stayed on v4. The project-wide migration to v5 in NovaSky-AI#1426 left this carve-out and its comment behind, so test_qwen3_5.py has been artificially pinned to 5.2.0 while everything else runs on whatever pyproject.toml resolves (now 5.5.4 on this branch). Collapse back to a single pytest invocation — the exact shape the workflow had before NovaSky-AI#1228 — so all tests run on one transformers version. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Upgrade to transformers v5

Summary

Upgrades transformers from >=4.56.1,<5 to >=5.0.0,<=5.3.0 and adapts SkyRL's model initialization, FSDP loading, and test code to accommodate v5 breaking changes.

CI

Round 2 CI: https://github.com/NovaSky-AI/SkyRL/actions/runs/23917102581 -> 10 failing from before
Megatron CI Round 2: https://github.com/NovaSky-AI/SkyRL/actions/runs/23959241150/job/69884903884 -> 1 failing from before
~~Round 1 CI: https://github.com/NovaSky-AI/SkyRL/actions/runs/23876002482~~ -> 17 still failing
Megatron CI: https://github.com/NovaSky-AI/SkyRL/actions/runs/23920479124

Key changes
Meta-device model initialization (fsdp_utils.py, model_wrapper.py, fsdp_worker.py)

v5 disallows from_pretrained() inside accelerate.init_empty_weights() (TypeError: Parameter.__new__() got an unexpected keyword argument '_is_hf_initialized'). Replaced with (sketched below):
- from_pretrained() on rank 0 (loads real weights)
- from_config() inside torch.device("meta") on the other ranks (empty shell; weights broadcast by FSDP)
- rope_scaling, rope_theta, and _attn_implementation are applied to the config before the branch so both paths are consistent.
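A minimal sketch of this branching init, assuming a single shared AutoConfig; init_model and its structure are illustrative, not the exact code in fsdp_utils.py:

```python
import torch
import torch.distributed as dist
from transformers import AutoConfig, AutoModelForCausalLM

def init_model(model_name: str):
    # Apply overrides (rope_scaling, rope_theta, _attn_implementation) here,
    # before the branch, so both paths see the same config.
    config = AutoConfig.from_pretrained(model_name)
    if dist.get_rank() == 0:
        # Rank 0 loads the real weights from disk.
        return AutoModelForCausalLM.from_pretrained(model_name, config=config)
    # Other ranks build an empty shell on the meta device; FSDP broadcasts
    # rank 0's weights during sharded state-dict loading.
    with torch.device("meta"):
        return AutoModelForCausalLM.from_config(config)
```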
FSDP2 non-persistent buffer sync (fsdp_utils.py)

from_config() on meta produces non-persistent buffers (inv_freq in RotaryEmbedding) with no data. These are excluded from state_dict() and never broadcast. Fixes:
- _sync_non_persistent_buffers() broadcasts these from rank 0 after state dict loading (sketched below)
- offload_fsdp2_model_to_cpu() now materializes only meta buffers instead of calling model.to_empty() (which wiped all loaded parameters → NaN)
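A hedged sketch of what such a buffer sync can look like; the helper name matches the diff above, but the body and signature are illustrative (the real helper takes the sharded state dict):

```python
import torch
import torch.distributed as dist

def _sync_non_persistent_buffers(model: torch.nn.Module, device="cuda"):
    # Non-persistent buffers (e.g. RotaryEmbedding.inv_freq) never appear in
    # state_dict(), so they are skipped by normal weight loading/broadcast.
    persistent = set(model.state_dict().keys())
    for name, buf in list(model.named_buffers()):
        if name in persistent:
            continue
        module_path, _, attr = name.rpartition(".")
        module = model.get_submodule(module_path) if module_path else model
        if buf.is_meta:
            # Materialize an empty tensor on non-rank-0 to receive the broadcast.
            buf = torch.empty(buf.shape, dtype=buf.dtype, device=device)
            setattr(module, attr, buf)
        dist.broadcast(buf, src=0)  # rank 0 holds the correctly computed values
```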
CriticModel post_init() (model_wrapper.py)

v5 added all_tied_weights_keys in PreTrainedModel.post_init(). The dynamic CriticModel class now calls self.post_init(), and the meta-init path wraps construction in no_init_weights().
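For illustration, a dynamically-built critic along those lines, assuming no_init_weights is still importable from transformers.modeling_utils as in v4; the class body is a sketch, not the SkyRL implementation:

```python
import torch
from transformers.modeling_utils import no_init_weights

def build_critic(base_cls, config, meta_init: bool = False):
    class CriticModel(base_cls):  # illustrative, not the exact SkyRL class
        def __init__(self, config):
            super().__init__(config)
            self.value_head = torch.nn.Linear(config.hidden_size, 1, bias=False)
            self.post_init()  # v5 expects this (sets up all_tied_weights_keys)

    if meta_init:
        # Skip weight init on the meta path; real weights arrive via broadcast.
        with no_init_weights(), torch.device("meta"):
            return CriticModel(config)
    return CriticModel(config)
```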
Strict dataclass configs (configs.py)

PretrainedConfig is now a strict dataclass. Made ModelConfig.__init__ args optional with defaults; fixed get_text_config() signature for v5.
VLM mm_token_type_ids (model_wrapper.py, VLM tests)

v5 requires mm_token_type_ids for M-RoPE in multimodal models. Threaded through HFModelWrapper.forward() and tests.
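A simplified version of that threading; the wrapper body is illustrative rather than the exact model_wrapper.py code:

```python
import torch

class HFModelWrapper(torch.nn.Module):
    def __init__(self, model: torch.nn.Module):
        super().__init__()
        self.model = model

    def forward(self, input_ids, attention_mask, mm_token_type_ids=None, **kwargs):
        if mm_token_type_ids is not None:
            # v5 multimodal models use this mask to separate text tokens from
            # image/video tokens when computing M-RoPE position ids.
            kwargs["mm_token_type_ids"] = mm_token_type_ids
        return self.model(input_ids=input_ids, attention_mask=attention_mask, **kwargs)
```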
Megatron rope_theta (megatron_worker.py)

v5 moved rope_theta into a rope_parameters dict. Added a workaround to set provider.rotary_base from the new location.
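The workaround amounts to reading the new location with a fallback to the old attribute; a sketch (get_rotary_base is a hypothetical helper name):

```python
def get_rotary_base(hf_config, default: float = 10000.0) -> float:
    # v5 nests rope_theta inside a rope_parameters dict; older configs expose
    # it as a top-level attribute.
    rope_params = getattr(hf_config, "rope_parameters", None)
    if isinstance(rope_params, dict) and "rope_theta" in rope_params:
        return rope_params["rope_theta"]
    return getattr(hf_config, "rope_theta", default)

# e.g. provider.rotary_base = get_rotary_base(hf_config)
```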
Other fixes

- cuda_ipc_strategy.py: .view(-1) → .reshape(-1) for non-contiguous weight tensors (minimal repro below)
- vllm_server.py: guard sock.close() against uvloop TransportSocket AttributeError
- test_remote_inference_client_chat_template.py: use render_chat_completion() for prompt token verification
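The view→reshape change is easy to demonstrate: view(-1) requires contiguous memory, while reshape(-1) falls back to a copy.

```python
import torch

w = torch.randn(4, 8).t()    # transposing makes the tensor non-contiguous
assert not w.is_contiguous()
flat = w.reshape(-1)         # works: reshape copies when a view is impossible
# w.view(-1) would raise a RuntimeError here ("view size is not compatible ...")
```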