convert: Fix Qwen3.5/Qwen3.5 Moe NVFP4 Conversions #20505
CISC merged 10 commits into ggml-org:master
Conversation
Pull request overview
Fixes Qwen3.5 / Qwen3.5-MoE NVFP4 HF→GGUF conversion failures by improving tensor name mapping and applying Qwen3.5 linear-attention-specific reordering during NVFP4 repacking.
Changes:
- Extend tensor name mapping to handle `model.language_model.*` / `language_model.*` wrapper prefixes.
- Add Qwen3.5 NVFP4 linear-attention weight transforms (row/column reordering) during NVFP4 repack.
- Skip writing NVFP4 auxiliary tensors and already-repacked tensors during the prepare/write loop (sketched below).
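Roughly, the third bullet amounts to a membership/suffix check in the write loop. The sketch below is illustrative only; the suffix tuple and helper names (`NVFP4_AUX_SUFFIXES`, `should_skip`, `repacked`) are assumptions, not identifiers from this PR.

```python
# Illustrative sketch only: names below are assumptions, not this PR's code.
NVFP4_AUX_SUFFIXES = (".weight_scale", ".weight_scale_2")

def should_skip(name: str, repacked: set[str]) -> bool:
    if name in repacked:                       # already emitted by the NVFP4 repack path
        return True
    return name.endswith(NVFP4_AUX_SUFFIXES)   # aux scales are folded into the repacked tensor
```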
@arthurcavalcant I'll post an update for that shortly.
force-pushed from 3e7ded9 to 43b0892
@arthurcavalcant I've fixed that and you should be able to convert and run that model now.
Rebase please.
force-pushed from e64f9eb to 634bb47
@CISC rebased!
Great. Unfortunately it may take a while before I come around to reviewing this; in the meantime, address Copilot's review, it looks sensible.
Sure thing, won't take long to do those and re-test.
force-pushed from 634bb47 to 44a351f
rebased
force-pushed from 44a351f to 3dcc1d9
force-pushed from c7edce3 to bee717a
force-pushed from bee717a to 84c04f0
force-pushed from 7b59d2f to 4fd8311
So I was actually wrong about k_scale and v_scale: it will be impossible to make use of those in llama, and they are not the same thing in Qwen3.5 as I first thought. I was confused by the naming. I added in the input_scale (which will currently do nothing, but at least it's there for future use, and it's trivial in size).
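To make the "trivial in size" point concrete, here is a hedged sketch of what carrying such a per-tensor scale through to the GGUF can look like on the conversion side. The `_in_s` suffix echoes the naming mentioned in the commit list further down, but the helper, tensor name, and call site here are assumptions, not the merged code.

```python
# A sketch, not the merged code: store a one-element F32 input_scale tensor in the GGUF.
import numpy as np
from gguf import GGUFWriter

def emit_input_scale(writer: GGUFWriter, base_name: str, input_scale: float) -> None:
    data = np.array([input_scale], dtype=np.float32)   # trivial in size, unused by current kernels
    writer.add_tensor(f"{base_name}._in_s", data)      # "_in_s" naming is an assumption here
```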
@CISC I've got it all ready. To confirm, I also added a section like this, ...etc, for all the tensors. Is that what you were expecting and are you OK with it?
Yes, though perhaps change the naming as it's ill-fitting, perhaps
Why would they cause errors? The tensors are marked
Fixed it. I was getting `expected x; got y` errors repeatedly, but it was from a local issue.
That could work, but it sort of implies there also exists an
Let me know how that looks
Re-removed input_scale from aux cleanup
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
@ggerganov gentle ping re naming
@michaelw9999 How are
@ggerganov It's not used yet in the first implementation of NVFP4 in the PR, because that only brings NVFP4 x Q8 and nothing is being quantized to NVFP4, so there is no use for `input_scale` yet. But the reason for including it: I have a lot of NVFP4 code for llama.cpp that I have been tuning since last year, and for NVFP4 x NVFP4 (e2m1 weights x e2m1 activations) it really does use `input_scale`, which is a standard recipe output from ModelOpt.

(Side note: My NVFP4 llama-quantizer also emits an `input_scale` that is better suited to integrate with llama.cpp using imatrix. I don't know what the future plans are, as I've read the intention is only to support conversions, but my GGUFs beat the ModelOpt ones I've tested so far; I'll post some soon to HF. It calculates the input scale as … while ModelOpt does …)

In a very imminent future PR, the proposed NVFP4 x NVFP4 design uses … In the loader: … Then in …

Without the input_scale available, you could use 1.0f and lose some quality, or you could calculate something from src1, but it would not be the same without the original weights, so it would be best to use the input scale as intended.
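For illustration, a small sketch of the fallback options mentioned at the end of the comment above. The constants and the amax-based recipe are assumptions about the usual ModelOpt convention (6.0 = max e2m1 magnitude, 448.0 = max e4m3 magnitude), not the elided formulas from the comment, and the real kernels would live in C/CUDA, not Python.

```python
# Hedged illustration of the trade-off described above; not this PR's code.
import numpy as np

E2M1_MAX, E4M3_MAX = 6.0, 448.0

def activation_scale(x: np.ndarray, stored_input_scale: float | None) -> float:
    if stored_input_scale is not None:
        return stored_input_scale                 # calibration-time scale carried in the GGUF
    # Fallbacks when no input_scale was converted:
    #   return 1.0                                # simplest, but loses quality
    return float(np.abs(x).max()) / (E2M1_MAX * E4M3_MAX)  # per-call estimate from src1
```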
Is it correct to say that the
@CISC OK to merge if it is good.
I will give the latest changes a spin later today, then merge.
Tried to convert a Qwen3.5-122B model, it fails with this message:
@drrros the current HF script only supports conversions from ModelOpt. I have a script ready to go for compressed-tensors but am trying not to push out too many pending PRs all at once :)) Maybe I will speed that one up since you are asking.
If you want you can also give my version of Cascade-2 a try, michaelw9999/Nemotron-Cascade-2-30B-A3B-NVFP4-GGUF, and I would love feedback. I made it with an early pre-PR llama-quantizer I've been working on. It has improved ppl and kld over the official version, but I've been trying to complete benchmarks and it's failing them all. It gets the answers correct in its reasoning but seems to reason too long. I'm trying to work out if that's a templating or configuration issue.
@drrros you can give PR #21095 a shot now to convert the compressed-tensors models from HF. Hope that helps!
…l-org#20505)
* convert : fix Qwen3.5 NVFP4 conversion
* Updated copilot concerns and rebased
* move into _LinearAttentionVReorderBase and simplify
* --flake
* new_name not needed
* Added input_scale to gguf
* Fixed input_scale addition as tensor
* Added input scale to loader and named _in_s
* Update convert_hf_to_gguf.py: Re-removed input_scale from aux cleanup

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
This PR fixes several errors that occur when attempting to convert Qwen3.5/Qwen3.5-MoE models. To keep this PR's scope specific and in check, a separate PR #20506 adds loading of these newly converted models.
Bug:
When attempting to use `convert_hf_to_gguf.py` on various Qwen3.5 and Qwen3.5 MoE models, it would abort with the following error(s):

This occurred because these models now have `model.language_model` or `language_model` prefixes. The fix strips the wrappers instead of failing, which allows conversion to continue; a rough sketch of the idea follows below.
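For illustration only (the helper name and exact rewrite are assumptions; in `convert_hf_to_gguf.py` the real fix lives in the tensor-name mapping path), the stripping amounts to something like:

```python
# A minimal sketch of the wrapper-prefix handling, assuming a standalone helper.
def strip_language_model_wrapper(name: str) -> str:
    for prefix in ("model.language_model.", "language_model."):
        if name.startswith(prefix):
            return "model." + name[len(prefix):]
    return name

# e.g. "model.language_model.layers.0.self_attn.q_proj.weight"
#  ->  "model.layers.0.self_attn.q_proj.weight"
```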
But just stripping the names and continuing was not enough to get the models converted properly, and it caused a new error:

This is because Qwen3.5's linear attention weights get reordered in `modify_tensors()`:

However, NVFP4 bypasses `modify_tensors()` and has its own repacking, so `linear_attn.in_proj_a.input_scale` was seen by the repack code as a [num_v_heads] tensor and it tried to reshape it into [16, 3, 1].

This is fixed by skipping tensors in the write loop that were already repacked, and by applying the same reordering to the affected linear-attention tensors (a sketch of the idea follows below).
This will now produce the correct Qwen3.5/Qwen3.5-MoE NVFP4 GGUF file. A separate PR must be applied to load these files.
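A hedged sketch of what "applying the same reordering" during the NVFP4 repack can look like; the helper below is hypothetical, not the PR's code.

```python
# Illustrative only: whatever output-row permutation modify_tensors() applies to the
# linear-attention projections must also be applied to the packed NVFP4 weight and its
# per-row block scales, since the NVFP4 repack path bypasses modify_tensors().
import numpy as np

def reorder_rows(packed_weight: np.ndarray, block_scales: np.ndarray, perm: np.ndarray):
    # Rows (output channels) of the packed e2m1 data and of the block scales move
    # together, so the same permutation is applied to both.
    return packed_weight[perm], block_scales[perm]
```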
This fixed the issue with both Qwen3.5-122B-A10B-NVFP4 and Qwen3.5-27B-NVFP4, and both produced correct output.
Qwen3.5-35B-A3B-NVFP4.gguf was also tested after returning k_scale and v_scale to the skip list.
Note: some Qwen3.5 NVFP4 HF models produce this tokenizer error and others don't, for the same model:

Workaround:
Edit the model's `tokenizer_config.json` and change `tokenizer_class` from `TokenizersBackend` to `Qwen2Tokenizer`.
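For convenience, the same edit as a small script; the model path is a placeholder and editing the file by hand works just as well.

```python
# Apply the tokenizer_class workaround programmatically.
import json
from pathlib import Path

cfg_path = Path("path/to/model/tokenizer_config.json")  # placeholder path
cfg = json.loads(cfg_path.read_text(encoding="utf-8"))
if cfg.get("tokenizer_class") == "TokenizersBackend":
    cfg["tokenizer_class"] = "Qwen2Tokenizer"
    cfg_path.write_text(json.dumps(cfg, indent=2, ensure_ascii=False), encoding="utf-8")
```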