Add Nemotron/Minitron GGUF Conversion & Inference Support #8922
slaren merged 10 commits into ggml-org:master
Conversation
Vaibhavs10
left a comment
Note: As of Transformers 4.44.0 Nemotron is supported, so no need to install transformers from source.
Awesome! Thank you for sharing @Vaibhavs10! I've updated the original PR description.

Thank you @compilade for the comments and suggestions! Committed changes accordingly.
Vaibhavs10
left a comment
Hi @suhara - can you rebase onto main? Specifically, make sure this commit is in - it should fix the failing requirements.txt test.
4bb8d50 to bd76198
Hi @Vaibhavs10, thanks for reviewing! I rebased it onto the latest main branch.
Not sure what I'm missing here, but I wasn't able to get the GGUF to run. I tried to test the PR via this:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && gh pr checkout 8922
huggingface-cli download nvidia/Minitron-4B-Base --local-dir minitron --local-dir-use-symlinks False
python convert_hf_to_gguf.py minitron --outtype f16 --outfile model.gguf
llama-cli -m model.gguf -p "Meaning to life is"
I get error loading model architecture: unknown model architecture: 'nemotron'
EDIT: I'm stupid, I was using an older binary!
Tested it (using the steps mentioned above), it works quite well!
Let's wait for @compilade to review + approve then we can merge! 🤗
// optional MLP bias
layer.ffn_down_b = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_DOWN, "bias", i), {n_embd}, llama_model_loader::TENSOR_NOT_REQUIRED);
layer.ffn_up_b = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_UP, "bias", i), {n_ff}, llama_model_loader::TENSOR_NOT_REQUIRED);
It's not correct to use ctx_split for bias tensors; they should use ctx_layer instead.
Thank you for your comment @slaren !
Sorry for the naive question. What's the difference between ctx_split and ctx_layer?
Something that's not clear to me is that some parts of llama.cpp use ctx_split for bias tensors as well.
For example,
- https://github.com/ggerganov/llama.cpp/blob/master/src/llama.cpp#L6108-L6110
- https://github.com/ggerganov/llama.cpp/blob/master/src/llama.cpp#L6343
Should they be corrected (which is out of the scope of this PR but wanted to ask to have a better understanding of them)?
ctx_split only makes a difference when using tensor parallelism with -sm row, which is only supported on the CUDA backend when using multiple GPUs. When using -sm row, ctx_split splits the rows of the matrix between the available GPUs. This is only supported for matrix multiplication, so it should only be used with the matrix portion of linear/dense layers. The other cases are also wrong and should be corrected as well, but it doesn't need to be done here.
Thanks for the explanation! Updated the two lines accordingly. Agree with you that the other parts should be fixed outside this PR.
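For reference, a sketch of what the two corrected lines would look like after the change, based on the diff quoted above with the bias tensors switched to ctx_layer (identifiers such as ml, tn, n_embd, and n_ff come from the surrounding llama.cpp loader code):

```cpp
// Optional MLP bias: bias vectors are never row-split across GPUs with -sm row,
// so they are created in ctx_layer; only the weight matrices keep using ctx_split.
layer.ffn_down_b = ml.create_tensor(ctx_layer, tn(LLM_TENSOR_FFN_DOWN, "bias", i), {n_embd}, llama_model_loader::TENSOR_NOT_REQUIRED);
layer.ffn_up_b   = ml.create_tensor(ctx_layer, tn(LLM_TENSOR_FFN_UP,   "bias", i), {n_ff},   llama_model_loader::TENSOR_NOT_REQUIRED);
```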
Hi @compilade
Thank you all for your reviews and support @compilade @Vaibhavs10 @ggerganov @slaren! Could anybody help merge this PR? Thank you!
Sorry for disturbing, but when I try to convert the linked minitron-4b model with transformers 4.44.0 and current llama.cpp, it simply complains about missing tokenizer.model. Any idea why that could be?
Hi @schmorp, I think the repo has been updated and you can extract the tokenizer from it. There are two tokenizer files, but they are the same and either can be renamed to tokenizer.model.
@suhara thanks a lot!
Minitron-8B converts, but then can't be used: llm_load_tensors: ggml ctx size = 0.15 MiB
Minitron-4B seems to work. So it seems Minitron-8B is not quite supported yet.
I'll look into this, but I think I know the root cause: 8B uses a configuration that many HF models, including Llama, reject with an assertion.

FYI, for 4B, …
That's good news, thanks for looking into this. I'll have a try at the 340B.
For the 340B, conversion instantly fails because there isn't a config.json file.
I tried, but the only option seems to be using the SafeTensors conversion provided by @mgoin under https://huggingface.co/collections/mgoin/nemotron-in-vllm-66a151b4240bcd9c28735ec5. He unfortunately never shared how he converted NeMo into SafeTensors.
@nicoboss if the conversion steps and script would be useful, I can document this tomorrow!
This would be absolutely awesome, thanks a lot! I'm very interested in how the conversion works. Maybe it would even be possible to implement it inside convert_hf_to_gguf.py. I'm currently working together with @schmorp to GGUF-quantize all Nemotron-3, Nemotron-4 and "Minitron" models. While your collection is great, it unfortunately misses many Nemotron-3 models, which we could convert on our own if you shared your tools and knowledge. Nemotron-4-340B-Instruct is one of my favorite models and I can't thank you enough for converting it into a usable format.
And just to document this here, Llama-3.1-Minitron-4B-Width-Base fails with: cvs/llama.cpp/ggml/src/ggml.c:6399: GGML_ASSERT(c->ne[0] >= n_dims / 2) failed |
The merged commits:
* Add nemotron GGUF conversion & inference support
* Fix formatting issues
* Remove unnecessary write_tensors()
* Update convert_hf_to_gguf.py (Co-authored-by: compilade <git@compilade.net>)
* Update src/llama.cpp (Co-authored-by: compilade <git@compilade.net>)
* Address comments by @compilade
* Replace ggml_mul_mat() -> llm_build_lora_mm()
* Remove mutable variable
* Use ctx_layer for bias tensors
* Cover corner case for rope_scaling not in config.json
This PR adds HF->GGUF conversion & inference support for Nemotron models, including Nemotron-3, Nemotron-4 and "Minitron".
The PR should support any Nemotron/Minitron model but has been primarily tested with the following Minitron model: nvidia/Minitron-4B-Base.
HF support for Nemotron has recently been added: as of Transformers 4.44.0, Nemotron is supported (thank you @Vaibhavs10 for the information!). You may need to install a newer version of the transformers library by running pip install transformers>=4.44.0. Please see this PR for details.
The Nemotron architecture is similar to the Llama-2 architecture with a few key differences:
You can find details about the model architecture in the following papers:
This PR was created in collaboration with @SpaceCowboy850.