
convert: Fix Qwen3.5/Qwen3.5 Moe NVFP4 Conversions #20505

Merged
CISC merged 10 commits into ggml-org:master from michaelw9999:nvfp4-fix-qwen-conversions on Mar 26, 2026

Conversation

@michaelw9999
Contributor

@michaelw9999 michaelw9999 commented Mar 13, 2026

This PR fixes several errors that occur when attempting to convert Qwen3.5/Qwen3.5 MoE models. To keep this PR's scope in check and specific, a separate PR #20506 enables loading of these newly converted models.

Bug:
When attempting to use convert_hf_to_gguf.py on various Qwen3.5 and Qwen3.5 MoE models, it would abort with the following error(s):

ValueError: Can not map tensor 'model.language_model.layers.0.mlp.shared_expert.down_proj.weight'
ValueError: Can not map tensor 'model.language_model.layers.0.linear_attn.in_proj_a.weight'

This occurred because these models now wrap their tensors in a model.language_model or language_model prefix. The fix strips these wrapper prefixes instead of failing, which allows tensor-name mapping to continue.
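
A minimal sketch of the idea (illustrative only; the exact prefix handling and target naming in convert_hf_to_gguf.py are assumptions here):

# Illustrative only: strip the wrapper prefixes before tensor-name mapping.
WRAPPER_PREFIXES = ("model.language_model.", "language_model.")

def strip_wrapper_prefix(name: str) -> str:
    for prefix in WRAPPER_PREFIXES:
        if name.startswith(prefix):
            return "model." + name[len(prefix):]
    return name

print(strip_wrapper_prefix("model.language_model.layers.0.mlp.shared_expert.down_proj.weight"))
# -> model.layers.0.mlp.shared_expert.down_proj.weight
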
But just stripping the names and continuing was not enough to convert the models properly; it exposed a new error:

RuntimeError: shape '[16, 3, 1]' is invalid for input of size 1

This is because Qwen3.5's linear attention weights get reordered in modify_tensors():

# original order:  [q, k, v, z] * head_count
# corrected order: [q * head_count, k * head_count, v * head_count, z * head_count]

However, NVFP4 bypasses modify_tensors() and has its own repacking, so linear_attn.in_proj_a.input_scale was still seen as a [num_v_heads] tensor and the repack tried to reshape it into [16, 3, 1].
This is fixed by skipping tensors in the write loop that were already repacked:

if self._is_nvfp4:
    # skip weights that were already repacked together with their scales
    if name.endswith(".weight") and name.replace(".weight", ".weight_scale") in self.model_tensors:
        continue
    # skip the NVFP4 auxiliary scale tensors themselves
    if name.endswith((".weight_scale", ".weight_scale_2", ".input_scale", ".k_scale", ".v_scale")):
        continue
 Updated: added k_scale and v_scale above

and by applying the same reordering (sketched after the list below) for:

linear_attn.in_proj_qkv
linear_attn.in_proj_z
linear_attn.in_proj_a
linear_attn.in_proj_b
linear_attn.out_proj
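
A rough sketch of that regrouping, assuming equal per-head row counts for each of the four components (illustrative only; the real repack handles the actual Qwen3.5 shapes and per-tensor differences):

import torch

# Illustrative only: regroup interleaved per-head [q, k, v, z] rows into
# contiguous [all q | all k | all v | all z] blocks.
def regroup_interleaved_rows(t: torch.Tensor, n_heads: int, n_components: int = 4) -> torch.Tensor:
    rows_per_head = t.shape[0] // n_heads       # rows one head contributes (its q, k, v, z)
    per_comp = rows_per_head // n_components    # rows per component within a head
    x = t.reshape(n_heads, n_components, per_comp, *t.shape[1:])
    return x.transpose(0, 1).reshape(t.shape)   # component axis first, then heads

w = torch.arange(16 * 4 * 2, dtype=torch.float32).reshape(16 * 4, 2)
w_regrouped = regroup_interleaved_rows(w, n_heads=16)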

This now produces correct Qwen3.5/Qwen3.5 MoE NVFP4 GGUF files; a separate PR must be applied to load them.
This fixed the issue with both Qwen3.5-122B-A10B-NVFP4 and Qwen3.5-27B-NVFP4, and both produced correct output.
Qwen3.5-35B-A3B-NVFP4.gguf was also tested after returning k_scale and v_scale to the skip list.

Note: some Qwen3.5 NVFP4 HF models produce this tokenizer error and others don't, even for the same model:

ValueError: Tokenizer class TokenizersBackend does not exist or is not currently imported.

Workaround:
Edit the model's tokenizer_config.json and change tokenizer_class from TokenizersBackend to Qwen2Tokenizer
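
For a quick one-off patch, something like this also works (illustrative; the path is a placeholder for your local model directory):

import json
from pathlib import Path

# Illustrative workaround: point tokenizer_class at Qwen2Tokenizer so the
# tokenizer can be loaded during conversion.
cfg_path = Path("/path/to/Qwen3.5-model/tokenizer_config.json")
cfg = json.loads(cfg_path.read_text())
cfg["tokenizer_class"] = "Qwen2Tokenizer"
cfg_path.write_text(json.dumps(cfg, indent=2, ensure_ascii=False))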

@michaelw9999 michaelw9999 requested a review from CISC as a code owner March 13, 2026 12:00
Copilot AI review requested due to automatic review settings March 13, 2026 12:00
@github-actions github-actions Bot added the python (python script changes) label Mar 13, 2026

Copilot AI left a comment


Pull request overview

Fixes Qwen3.5 / Qwen3.5-MoE NVFP4 HF→GGUF conversion failures by improving tensor name mapping and applying Qwen3.5 linear-attention-specific reordering during NVFP4 repacking.

Changes:

  • Extend tensor name mapping to handle model.language_model.* / language_model.* wrapper prefixes.
  • Add Qwen3.5 NVFP4 linear-attention weight transforms (row/column reordering) during NVFP4 repack.
  • Skip writing NVFP4 auxiliary tensors and already-repacked tensors during the prepare/write loop.


@michaelw9999 michaelw9999 changed the title ggml: Fix Qwen3.5/Qwen3.5 Moe NVFP4 Conversions convert: Fix Qwen3.5/Qwen3.5 Moe NVFP4 Conversions Mar 13, 2026
@arthurcavalcant

This comment was marked as resolved.

@michaelw9999
Contributor Author

@arthurcavalcant I'll post an update for that shortly.

@michaelw9999 michaelw9999 force-pushed the nvfp4-fix-qwen-conversions branch from 3e7ded9 to 43b0892 Compare March 13, 2026 18:42
@michaelw9999
Contributor Author

@arthurcavalcant I've fixed that and you should be able to convert and run that model now.

@CISC
Member

CISC commented Mar 16, 2026

Rebase please.

@michaelw9999 michaelw9999 force-pushed the nvfp4-fix-qwen-conversions branch 2 times, most recently from e64f9eb to 634bb47 Compare March 16, 2026 14:36
@michaelw9999
Contributor Author

Rebase please.

@CISC rebased!

@CISC
Member

CISC commented Mar 16, 2026

Rebase please.

@CISC rebased!

Great, unfortunately it may take a while before I get around to reviewing this; in the meantime address Copilot's review, it looks sensible.

@michaelw9999
Contributor Author

michaelw9999 commented Mar 16, 2026 via email

@michaelw9999 michaelw9999 force-pushed the nvfp4-fix-qwen-conversions branch from 634bb47 to 44a351f Compare March 17, 2026 22:03
@michaelw9999
Contributor Author

rebased

@michaelw9999 michaelw9999 force-pushed the nvfp4-fix-qwen-conversions branch from bee717a to 84c04f0 Compare March 20, 2026 08:01
@michaelw9999 michaelw9999 force-pushed the nvfp4-fix-qwen-conversions branch from 7b59d2f to 4fd8311 Compare March 22, 2026 08:55
@michaelw9999
Contributor Author

So I was actually wrong about k_scale and v_scale: it will be impossible to make use of those in llama.cpp, and they're not the same thing in Qwen3.5 as I first thought. I was confused by the naming. I added in the input_scale (which currently does nothing, but at least it's there for future use, and it's trivial in size).

Member

@CISC CISC left a comment


You also need to handle them here, otherwise the model will fail to load:

llama.cpp/src/llama-model.cpp

Lines 7449 to 7451 in 245f5cc

// generic pass: load optional per-tensor/per-expert ".scale" tensors (e.g. NVFP4 scale2)
// this avoids having to add scale loading to every architecture
for (int i = 0; i < n_layer; ++i) {

@michaelw9999
Contributor Author

@CISC I've got it all ready. To confirm, I also added a section like this:

// input scales
if (!layer.wq_input_s && layer.wq) {
    layer.wq_input_s = create_tensor(tn(LLM_TENSOR_ATTN_Q, "input_scale", i), {1}, TENSOR_NOT_REQUIRED);
}
if (!layer.wk_input_s && layer.wk) {
    layer.wk_input_s = create_tensor(tn(LLM_TENSOR_ATTN_K, "input_scale", i), {1}, TENSOR_NOT_REQUIRED);
}
if (!layer.wv_input_s && layer.wv) {
    layer.wv_input_s = create_tensor(tn(LLM_TENSOR_ATTN_V, "input_scale", i), {1}, TENSOR_NOT_REQUIRED);
}

...etc., for all the tensors. Is that what you were expecting and are OK with?
And after that change, any previously made GGUFs won't load, as it errors expecting those tensors. (I've got a dozen+ made already, but that's fine with me!) Not sure how many people have already made NVFP4 GGUFs; it's some, as you've seen. But still better now than later.

@CISC
Member

CISC commented Mar 23, 2026

...etc, for all the tensors. Is that what you were expecting and are OK with?

Yes, though perhaps change the naming as it's ill-fitting, perhaps _is? @ggerganov

And after that change, any previously made GGUFs won't load as it errors expecting those tensors. (I've got a dozen+ made already but that's fine with me!) Not sure how many people already made any NVFP4 GGUFs, it's some as you've seen. But still better now than later.

Why would they cause errors? The tensors are marked TENSOR_NOT_REQUIRED.

@michaelw9999
Contributor Author

michaelw9999 commented Mar 24, 2026

...etc, for all the tensors. Is that what you were expecting and are OK with?

Yes, though perhaps change the naming as it's ill-fitting, perhaps _is? @ggerganov

And after that change, any previously made GGUFs won't load as it errors expecting those tensors. (I've got a dozen+ made already but that's fine with me!) Not sure how many people already made any NVFP4 GGUFs, it's some as you've seen. But still better now than later.

Why would they cause errors? The tensors are marked TENSOR_NOT_REQUIRED.

Fixed it; I was getting "expected x; got y" errors repeatedly, but it was from a local issue.
I renamed it to in_s to make it shorter, if that works; I think _is ... is a bit weird to look at.

@CISC
Member

CISC commented Mar 24, 2026

...etc, for all the tensors. Is that what you were expecting and are OK with?

Yes, though perhaps change the naming as it's ill-fitting, perhaps _is? @ggerganov

I renamed it to in_s to make it shorter, if that works; I think _is ... is a bit weird to look at.

That could work, but it sort of implies there also exists an _in...

@michaelw9999
Contributor Author

...etc, for all the tensors. Is that what you were expecting and are OK with?

Yes, though perhaps change the naming as it's ill-fitting, perhaps _is? @ggerganov

I renamed it to in_s to make it shorter, if that works; I think _is ... is a bit weird to look at.

That could work, but it sort of implies there also exists an _in...

Let me know how that looks

Re-removed input_scale from aux cleanup

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
@CISC
Member

CISC commented Mar 26, 2026

@ggerganov gentle ping re naming

@ggerganov
Member

@michaelw9999 How are input_scale used in the reference implementation?

@michaelw9999
Contributor Author

@ggerganov It's not used yet in the first NVFP4 implementation from that PR, because that only brings NVFP4 x Q8 and nothing is being quantized to NVFP4, so there is no use for input_scale when Q8 is used for activations.

But the reason for including input_scale right now with this PR is to prevent "early v1 GGUFs" that don't retain it in the file. If/when NVFP4xNVFP4 usage is implemented later, any file without the input_scale would not work as designed and would need to be handled differently or suffer some quality loss.

I have a lot of NVFP4 code for llama.cpp that I have been tuning since last year, so for NVFP4 x NVFP4 (e2m1 weights x e2m1 activations) it really does use input_scale, which is a standard recipe output from ModelOpt. (Side note: my NVFP4 llama-quantizer also emits an input_scale that is better suited to integrate with llama.cpp, using imatrix. I don't know what the future plans are, as I've read the intention is only to support conversions, but my GGUFs beat the ModelOpt ones I've tested so far; I'll post some to HF soon. It calculates the input scale as:
file_input_scale = clamp(sqrt(mean(imatrix_entries)), 1/32, 32)

ModelOpt does activation_scaling_factor = amax / (quantizer.maxbound * 448.0), but it can be set to also scan MSE around global_amax. Imatrix is better as there are fewer outliers, so max KLD goes down. A recent side-by-side with Nemotron Cascade-2: ModelOpt max KLD 4.17, my llama-quantizer 1.61.)
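
As a rough sketch of the two recipes described above (function names and the e2m1 maxbound of 6.0 are my assumptions for illustration, not actual llama.cpp or ModelOpt identifiers):

import numpy as np

# Illustrative only: the imatrix-based recipe vs. the ModelOpt-style recipe.
def input_scale_from_imatrix(imatrix_entries: np.ndarray) -> float:
    # clamp(sqrt(mean(imatrix_entries)), 1/32, 32)
    return float(np.clip(np.sqrt(imatrix_entries.mean()), 1.0 / 32.0, 32.0))

def input_scale_modelopt_style(amax: float, quantizer_maxbound: float = 6.0) -> float:
    # activation_scaling_factor = amax / (quantizer.maxbound * 448.0)
    return amax / (quantizer_maxbound * 448.0)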

In a very imminent future PR, the proposed NVFP4xNVFP4 design uses input_scale as intended for vec_dot_nvfp4_nvfp4_mma on Blackwell (mma.sync.aligned.kind::mxf4nvf4.block_scale.scale_vec::4X.m16n8k64.row.col.f32.e2m1.e2m1.f32.ue4m3), as such:

In the loader:

float nvfp4_activation_scale = 1.0f / gguf_nvfp4_input_scale;

Then in quantize_mmq_nvfp4:

float sub_max = use_activation_scale ? amax_raw * activation_scale : amax_raw;
...
const float vals_norm = vals_raw[k] * activation_scale;
const uint8_t qi = best_index_nvfp4(vals_norm, subblock_scale);
const float q = subblock_scale * kvalues_nvfp4_float[qi];
const float e = vals_norm - q;
L += e * e;

Without the input_scale available, you could use 1.0f and lose some quality, or calculate something from src1, but it would not be the same without the original weights, so it would be best to use the input scale as intended.

@ggerganov
Member

Is it correct to say that the input_scale are used by the reference implementation to scale the activations prior to quantizing them for each NVFP4 multiplication?

@michaelw9999
Contributor Author

Is it correct to say that the input_scale are used by the reference implementation to scale the activations prior to quantizing them for each NVFP4 multiplication?

Yes, but to be more precise, it is done during the quantization step, just before each NVFP4xNVFP4 multiplication.

@ggerganov
Member

@CISC OK to merge if it is good.

@CISC
Member

CISC commented Mar 26, 2026

@CISC OK to merge if it is good.

I will give the latest changes a spin later today, then merge.

@CISC CISC merged commit f8d4aba into ggml-org:master Mar 26, 2026
49 of 52 checks passed
@michaelw9999 michaelw9999 deleted the nvfp4-fix-qwen-conversions branch March 26, 2026 17:59
@drrros
Contributor

drrros commented Mar 27, 2026

Tried to convert Qwen3.5-122B model, fails with message:
NotImplementedError: Quant format 'nvfp4-pack-quantized' for method 'compressed-tensors' is not yet supported
Although this one - Nemotron-Cascade-2 - converted fine (generated gguf also worked fine).
Latest build (version: 8559 (59d8402))

(llama.cpp) drros@epyc-ws:~/llama.cpp$ ./convert_hf_to_gguf.py --verbose --outfile ../Qwen3.5-122B-A10B-NVFP4.gguf /mnt/nfs-esxi/LLM/Qwen3.5-122B-A10B-NVFP4/
INFO:hf-to-gguf:Loading model: Qwen3.5-122B-A10B-NVFP4
INFO:hf-to-gguf:Model architecture: Qwen3_5MoeForConditionalGeneration
INFO:hf-to-gguf:heuristics unable to detect tensor dtype, defaulting to --outtype f16
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
Traceback (most recent call last):
  File "/home/drros/llama.cpp/./convert_hf_to_gguf.py", line 12820, in <module>
    main()
  File "/home/drros/llama.cpp/./convert_hf_to_gguf.py", line 12814, in main
    model_instance.write()
  File "/home/drros/llama.cpp/./convert_hf_to_gguf.py", line 934, in write
    self.prepare_tensors()
  File "/home/drros/llama.cpp/./convert_hf_to_gguf.py", line 4602, in prepare_tensors
    super().prepare_tensors()
  File "/home/drros/llama.cpp/./convert_hf_to_gguf.py", line 768, in prepare_tensors
    self.dequant_model()
  File "/home/drros/llama.cpp/./convert_hf_to_gguf.py", line 485, in dequant_model
    raise NotImplementedError(f"Quant format {quant_format!r} for method {quant_method!r} is not yet supported")
NotImplementedError: Quant format 'nvfp4-pack-quantized' for method 'compressed-tensors' is not yet supported

@michaelw9999
Contributor Author

Tried to convert Qwen3.5-122B model, fails with message: NotImplementedError: Quant format 'nvfp4-pack-quantized' for method 'compressed-tensors' is not yet supported Although this one - Nemotron-Cascade-2 - converted fine (generated gguf also worked fine). Latest build (version: 8559 (59d8402))


@drrros the current HF script only supports conversions from ModelOpt. I have a script ready to go for compressed-tensors but am trying not to push out too many pending PRs all at once :)) Maybe I will speed that one up since you are asking.

@michaelw9999
Contributor Author

Although this one - Nemotron-Cascade-2 - converted fine (generated gguf also worked fine).

If you want, you can give my version of Cascade-2 a try as well, michaelw9999/Nemotron-Cascade-2-30B-A3B-NVFP4-GGUF, and I would love feedback. I made it with an early pre-PR llama-quantizer I've been working on. It has improved ppl and kld over the official version, but I've been trying to complete benchmarks and it's failing them all. It gets the answers correct in its reasoning but seems to reason too long. I'm trying to work out if that's a templating or configuration setting.

@michaelw9999
Contributor Author

Tried to convert Qwen3.5-122B model, fails with message: NotImplementedError: Quant format 'nvfp4-pack-quantized' for method 'compressed-tensors' is not yet supported Although this one - Nemotron-Cascade-2 - converted fine (generated gguf also worked fine). Latest build (version: 8559 (59d8402))


@drrros the current HF script only supports conversions from ModelOpt. I have script ready to go for compressed-tensors but trying not to push out too many pending PRs all at once :)) Maybe I will speed that one up since you are asking.

@drrros you can give PR #21095 a shot now to convert the compressed-tensors models from HF. Hope that helps!

slartibardfast pushed a commit to slartibardfast/llama.cpp that referenced this pull request Apr 12, 2026
…l-org#20505)

* convert : fix Qwen3.5 NVFP4 conversion

* Updated copilot concerns and rebased

* move into _LinearAttentionVReorderBase and simplify

* --flake

* new_name not needed

* Added input_scale to gguf

* Fixed input_scale addition as tensor

* Added input scale to loader and named _in_s

* Update convert_hf_to_gguf.py

Re-removed input_scale from aux cleanup

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026

Labels

python (python script changes)


6 participants