Changes for transformers 5 weight conversion #3083

BenjaminBossan merged 15 commits into huggingface:main
Conversation
- better handling of swapped in and out features
- move PEFT config update functions to PEFT

This allows the weight conversion to be correctly applied without going through `transformer_model.load_adapter`.
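To make the in/out feature swapping concrete, here is a toy sketch of the underlying algebra (illustrative only, with made-up names; the PR's actual conversion code differs):

```python
# If transformers 5 stores a base weight transposed, the LoRA factors must
# swap roles so that the low-rank delta matches the new layout.
import torch

def swap_lora_in_out(lora_A: torch.Tensor, lora_B: torch.Tensor):
    # (W + lora_B @ lora_A).T == W.T + lora_A.T @ lora_B.T,
    # so the transposed factors trade places.
    return lora_B.T, lora_A.T  # (new lora_A, new lora_B)

out_features, in_features, r = 6, 4, 2
lora_A = torch.randn(r, in_features)
lora_B = torch.randn(out_features, r)
new_A, new_B = swap_lora_in_out(lora_A, lora_B)
assert torch.allclose((lora_B @ lora_A).T, new_B @ new_A)
```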
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@githubnemo The PR should now be ready for review.
- always apply in/out feature swapping for MoE params
- add a test for this with Qwen3 MoE
- expose swapping argument to provide escape hatch
> Whether to tie weights or not after peft initialization. This will ensure that the adapters added to the
> tied layers are also tied. This is only applicable for layers passed via `modules_to_save` and
> `target_modules`.
> param_wrapper_swap_in_out_features (`bool`, *optional*)
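For illustration, the escape hatch might be used like this, assuming the argument is surfaced on `LoraConfig` as documented above (hedged: later commits switched to automatic detection, so the exact spelling and availability may differ):

```python
from peft import LoraConfig

# `param_wrapper_swap_in_out_features` as documented in the diff above;
# whether it is still exposed may have changed after the switch to
# automatic detection.
config = LoraConfig(
    target_parameters=["experts.gate_up_proj", "experts.down_proj"],
    param_wrapper_swap_in_out_features=True,  # force the in/out feature swap
)
```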
Is this parameter used to resolve #3112? If so, maybe automatic detection would be better?
In my latest commit, I changed the code to use `module.is_transposed`.
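A minimal sketch of what such automatic detection could look like (illustrative names only; the real implementation lives in the PEFT/transformers sources):

```python
def resolve_in_out_features(module, weight):
    # By default treat the last two dims as (out_features, in_features).
    out_features, in_features = weight.shape[-2:]
    if getattr(module, "is_transposed", False):
        # Parameter is stored as (..., in_features, out_features): swap.
        out_features, in_features = in_features, out_features
    return in_features, out_features
```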
I tested your branch; the saved LoRA weights for qwen35-moe still have the same issues, see #3112.
Could you please show a small reproducer for that error?
@BenjaminBossan You should be able to reproduce this issue easily by training Qwen3.5 MoE with LoRA.
@jeejeelee Could you at least describe what goes wrong? Loading the trained LoRA weights or something else? What error do you get? What transformers version are you using?
@BenjaminBossan Hmm, I think I've already described it clearly in #3112.
@jeejeelee LMK if I overlooked something, but I didn't find the information that I would need to try to reproduce the error and fix it. You mentioned a couple of observations and linked to a couple of those weights. But I'm missing the information on how these weights were created, what error you got (full stacktrace), and what versions were used (especially of transformers). I can understand if it's not possible for you to provide a full reproducer, but if you want me to take a look at your issue, you have to provide these missing pieces of information or point me to where you've shared them.
@jeejeelee I wrote a test that loads a small Qwen 3.5 MoE model and applies LoRA to normal linear layers and also to MoE layers. Then it saves the LoRA weights and loads them again, checking that the outputs remain the same. The test passes (I used transformers 5.4.0). So without further information, I cannot replicate any error with this model architecture.

Qwen 3.5 MoE test:

```python
import torch
from transformers import Qwen3_5MoeForConditionalGeneration, Qwen3_5MoeConfig

from peft import LoraConfig, PeftModel, get_peft_model


def create_small_qwen():
    torch.manual_seed(0)
    config = Qwen3_5MoeConfig(
        image_token_id=248056,
        video_token_id=248057,
        vision_start_token_id=248053,
        vision_end_token_id=248054,
        tie_word_embeddings=False,
        text_config={
            "attention_bias": False,
            "attention_dropout": 0.0,
            "attn_output_gate": True,
            "eos_token_id": 248044,
            "full_attention_interval": 4,
            "head_dim": 16,
            "hidden_act": "silu",
            "hidden_size": 64,
            "initializer_range": 0.02,
            "layer_types": [
                "linear_attention",
                "linear_attention",
                "linear_attention",
                "full_attention",
                "linear_attention",
                "linear_attention",
                "linear_attention",
                "full_attention",
            ],
            "linear_conv_kernel_dim": 4,
            "linear_key_head_dim": 16,
            "linear_num_key_heads": 2,
            "linear_num_value_heads": 4,
            "linear_value_head_dim": 16,
            "max_position_embeddings": 1024,
            "mlp_only_layers": [],
            "model_type": "qwen3_5_moe_text",
            "moe_intermediate_size": 32,
            "mtp_num_hidden_layers": 1,
            "mtp_use_dedicated_embeddings": False,
            "num_attention_heads": 4,
            "num_experts": 8,
            "num_experts_per_tok": 2,
            "num_hidden_layers": 8,
            "num_key_value_heads": 2,
            "rms_norm_eps": 1e-06,
            "router_aux_loss_coef": 0.001,
            "shared_expert_intermediate_size": 32,
            "use_cache": True,
            "vocab_size": 248320,
            "mamba_ssm_dtype": "float32",
            "rope_parameters": {
                "mrope_interleaved": True,
                "mrope_section": [3, 3, 2],
                "rope_type": "default",
                "rope_theta": 10000000,
                "partial_rotary_factor": 0.25,
            },
        },
        vision_config={
            "deepstack_visual_indexes": [],
            "depth": 2,
            "hidden_act": "gelu_pytorch_tanh",
            "hidden_size": 64,
            "in_channels": 3,
            "initializer_range": 0.02,
            "intermediate_size": 128,
            "model_type": "qwen3_5_moe",
            "num_heads": 4,
            "num_position_embeddings": 2304,
            "out_hidden_size": 64,
            "patch_size": 16,
            "spatial_merge_size": 2,
            "temporal_patch_size": 2,
        },
    )
    model = Qwen3_5MoeForConditionalGeneration(config).to(0)
    return model


def main():
    inputs = torch.arange(10).view(1, -1).to(0)
    model = create_small_qwen()
    with torch.inference_mode():
        out_base = model(inputs).logits

    config = LoraConfig(
        target_modules=["q_proj", "v_proj", "in_proj_qkv"],
        target_parameters=["experts.gate_up_proj", "experts.down_proj"],
        init_lora_weights=False,
    )
    torch.manual_seed(0)
    model = get_peft_model(model, config)
    model.print_trainable_parameters()
    with torch.inference_mode():
        out_lora = model(inputs).logits

    path = "/tmp/peft/qwen3_5moe"
    model.save_pretrained(path)
    del model

    model = create_small_qwen()
    model = PeftModel.from_pretrained(model, path)
    with torch.inference_mode():
        out_loaded = model(inputs).logits

    assert not torch.allclose(out_base, out_lora)
    assert torch.allclose(out_lora, out_loaded)


if __name__ == "__main__":
    main()
```
GPT-OSS: Let me describe gpt-oss-20b first. The gpt-oss config is:
Qwen3.5: When I tried to generate LoRA weights for qwen35-moe, I got incorrect shapes.
Reproduction code:

```python
import os

import torch
from transformers import Qwen3_5MoeForConditionalGeneration

from peft import get_peft_config, get_peft_model, LoraConfig, TaskType

model_name_or_path = "Qwen/Qwen3.5-35B-A3B"

peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=True,
    r=8,
    lora_alpha=32,
    target_modules="all-linear",
    use_rslora=False,
    use_dora=False,
    target_parameters=[
        "mlp.experts.gate_up_proj",
        "mlp.experts.down_proj",
    ],
)

model = Qwen3_5MoeForConditionalGeneration.from_pretrained(
    model_name_or_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).eval()
print(model)

model = get_peft_model(model, peft_config)
print(model)

model.save_pretrained(
    save_directory="qwen35-moe-lora-moe",
    safe_serialization=True,
    save_embedding_layers=False,
)
```
@jeejeelee Thank you for providing further information. So IIUC, what you take issue with is that the shapes of the LoRA weights are not what you expect -- there is no actual error from running the code, the shapes just look incorrect to you.

These shapes can differ from what you expect for a few reasons. First of all, even for expert layers with 3-dim parameters, PEFT flattens the parameters to 2-dim. This is in keeping with PEFT's conventions for the `lora_A`/`lora_B` weights (see peft/src/peft/tuners/lora/layer.py, lines 2191 to 2192 at 30b7b50). Furthermore, for some models, we have to deal with situations where the original weights were fused and we have to keep the checkpoint compatible. In that case, we employ different fusing strategies, which are defined in the PEFT source.

The latter shouldn't be relevant to Qwen3.5, as it didn't have a change in weight structure, but it's good to be aware that it can factor in. Looking at one concrete example of the experts in Qwen3.5 MoE, the resulting shapes look correct to me.

All of the above can lead to shapes that are unexpected at first. However, with the testing we do, we should hopefully ensure that these operations are correctly applied. If you have a concrete example where the model output is incorrect or the model doesn't train as expected, please let us know; just be aware that pure shapes can be misleading.
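To illustrate the flattening point with made-up shapes (the exact flattening convention lives in `peft/src/peft/tuners/lora/layer.py` and may differ from this sketch):

```python
import torch

num_experts, in_features, out_features, r = 8, 64, 32, 4
# A 3-dim expert parameter such as experts.gate_up_proj.
param = torch.randn(num_experts, in_features, out_features)

# PEFT flattens such parameters to 2-dim, so the LoRA factors are sized
# against the flattened view rather than the 3-dim one.
flat = param.reshape(num_experts * in_features, out_features)
lora_A = torch.randn(r, num_experts * in_features)
lora_B = torch.randn(out_features, r)
assert (lora_B @ lora_A).T.shape == flat.shape  # (512, 32)
```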
githubnemo left a comment:

We discussed this offline often enough. LGTM.

See accompanying huggingface/transformers#44478.
Note that the newly added tests will fail until a new transformers release with the linked PR is out. This should be v5.4, so the corresponding tests only run with that transformers version. I locally tested with the current main branch and the tests pass.
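For reference, such a version gate is commonly written along these lines (illustrative sketch; the PR's actual test markers may differ):

```python
import pytest
import transformers
from packaging import version

# Skip the conversion tests on transformers versions that predate the
# linked PR (assumed to land in v5.4).
requires_transformers_5_4 = pytest.mark.skipif(
    version.parse(transformers.__version__) < version.parse("5.4.0"),
    reason="needs the weight conversion from huggingface/transformers#44478",
)

@requires_transformers_5_4
def test_qwen3_5_moe_lora_save_load_roundtrip():
    ...  # save LoRA weights and reload them, asserting identical outputs
```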