convert : support non-mxfp4 HF model #15153
Conversation
Got this error:
Tried with this script:
I deleted this section: the script ran without it. I'm uploading it here and testing it.
looking good
Trying to quantize down to MXFP4 prints out a ton of stuff and then fails
Tried your solution @gabriellarson, but it only seems to produce a 2GB file using q8_0, so I think there's an issue somewhere.
@gabriellarson thanks for testing, please retry to see if
// TODO: temporary sanity check that the F16 -> MXFP4 is lossless
-#if 1
+#if 0
For visibility @ggerganov, I disabled this check because most users will now use this code branch to convert fine-tuned models to MXFP4, which will no longer be lossless.
Though I'm a bit doubtful whether fine-tuned models like the abliterated version should be quantized to something other than MXFP4 or not.
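For illustration, here is a minimal numpy sketch (not the actual ggml/llama.cpp code) of why that round-trip is only lossless for weights that were originally MXFP4: each 32-value block shares one power-of-two (e8m0) scale and stores 4-bit e2m1 values, so re-quantizing already-quantized weights reproduces them exactly, while arbitrary BF16 fine-tuned weights pick up quantization error. The rounding details here are simplified assumptions.

```python
import numpy as np

FP4_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # e2m1 magnitudes

def mxfp4_roundtrip(block: np.ndarray) -> np.ndarray:
    """Quantize one 32-element block to MXFP4 and dequantize it again (sketch)."""
    amax = np.abs(block).max()
    if amax == 0.0:
        return np.zeros_like(block)
    # shared e8m0 scale: a power of two chosen so the largest value lands in fp4 range
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2)
    # round each element to the nearest representable fp4 magnitude
    idx = np.abs(FP4_VALUES[None, :] - np.abs(block / scale)[:, None]).argmin(axis=1)
    return np.sign(block) * FP4_VALUES[idx] * scale

# weights that already came from MXFP4 round-trip exactly (what the old check asserted) ...
w_mxfp4 = mxfp4_roundtrip(np.random.randn(32))
assert np.array_equal(mxfp4_roundtrip(w_mxfp4), w_mxfp4)

# ... while generic BF16 fine-tuned weights do not (quantization is lossy)
w_bf16 = np.random.randn(32)
print(np.abs(mxfp4_roundtrip(w_bf16) - w_bf16).max())  # non-zero error
```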
@gabriellarson Could you also try converting it to Q4_K_M to see if it impacts the quality?
Ah nevermind, it's not possible to quantize to Q4_K since the tensor shape is not divisible by 256
Quantizing works now. Q4_K_M and MXFP4 both create decent output; Q4_K_M has lower perplexity.

llama.cpp/build/bin/llama-perplexity -m model.gguf -f wikitext-2-raw/wiki.test.raw -ngl 99

MXFP4:
Q4_K_M:
Yes that's expected, because the big FFN tensors cannot be quantized to anything other than Q8_0 or MXFP4. For Q4_K_M, these tensors fall back to Q8_0.
I guess we need a new quantization scheme, "Q4_K_FX" or something, that uses MXFP4 as the fallback.
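A rough sketch of that idea, where "Q4_K_FX" is purely hypothetical (not an existing llama.cpp type): keep Q4_K wherever the 256-wide super-blocks fit, and fall back to MXFP4 instead of Q8_0 for rows that are only divisible by 32, taking 2880 as the gpt-oss FFN row width.

```python
QK_K = 256      # Q4_K super-block size
QK_MXFP4 = 32   # MXFP4 block size

def pick_type(n_per_row: int) -> str:
    """Hypothetical "Q4_K_FX" selection: prefer Q4_K, fall back to MXFP4 rather than Q8_0."""
    if n_per_row % QK_K == 0:
        return "Q4_K"
    if n_per_row % QK_MXFP4 == 0:
        return "MXFP4"   # ~4.25 bpw fallback instead of the 8.5 bpw Q8_0 fallback
    return "F16"

print(pick_type(2880))   # gpt-oss FFN width: not divisible by 256 -> "MXFP4"
```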
* convert : support non-mxfp4 HF model
* rm redundant check
* disable debug check

The goal is to fix conversion for https://huggingface.co/huihui-ai/Huihui-gpt-oss-20b-BF16-abliterated
This partially reverts e2c1beb
Closes #15146
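For context, a hedged sketch of the two input layouts the converter now has to accept; the tensor-name suffixes and nibble packing below are illustrative assumptions, not the exact HF schema or the actual convert_hf_to_gguf.py code. The original gpt-oss checkpoint ships MoE weights as packed MXFP4 (uint8 nibble "blocks" plus e8m0 "scales"), while fine-tunes such as the abliterated model ship plain BF16 weights, which is the case this PR adds support for.

```python
import numpy as np

# fp4 (e2m1) lookup: low 3 bits select the magnitude, high bit of the nibble is the sign
FP4_LUT = np.array([0, 0.5, 1, 1.5, 2, 3, 4, 6,
                    -0, -0.5, -1, -1.5, -2, -3, -4, -6], dtype=np.float32)

def dequant_mxfp4(blocks: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """blocks: uint8 [..., 16] (two fp4 values per byte), scales: uint8 [...] e8m0 exponents."""
    lo = FP4_LUT[blocks & 0x0F]
    hi = FP4_LUT[blocks >> 4]
    vals = np.stack([lo, hi], axis=-1).reshape(*blocks.shape[:-1], 32)
    return vals * (2.0 ** (scales.astype(np.float32) - 127))[..., None]

def load_expert_weight(name: str, tensors: dict) -> np.ndarray:
    # MXFP4-packed checkpoint: dequantize blocks + scales to float
    if name + "_blocks" in tensors:
        return dequant_mxfp4(tensors[name + "_blocks"], tensors[name + "_scales"])
    # non-mxfp4 fine-tune (e.g. the abliterated BF16 model): weight is already a plain float tensor
    return np.asarray(tensors[name], dtype=np.float32)
```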