Add Nemotron/Minitron GGUF Conversion & Inference Support #8922
slaren merged 10 commits into ggml-org:master
Conversation
Vaibhavs10
left a comment
Note: As of Transformers 4.44.0 Nemotron is supported, so no need to install transformers from source.
Awesome! Thank you for sharing @Vaibhavs10! I've updated the original PR description.

Thank you @compilade for the comments and suggestions! Committed changes accordingly.
Vaibhavs10
left a comment
Hi @suhara - can you rebase onto main? Specifically, make sure this commit is in - it should fix the failing requirements.txt test.
4bb8d50 to bd76198
Hi @Vaibhavs10, thanks for reviewing! I rebased it onto the latest main branch.
Not sure what I'm missing here, but I wasn't able to get the GGUF to run. I tried to test the PR via this:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && gh pr checkout 8922
huggingface-cli download nvidia/Minitron-4B-Base --local-dir minitron --local-dir-use-symlinks False
python convert_hf_to_gguf.py minitron --outtype f16 --outfile model.gguf
llama-cli -m model.gguf -p "Meaning to life is"
I get error loading model architecture: unknown model architecture: 'nemotron'
EDIT: I'm stupid, I was using an older binary!
Tested it (using the steps mentioned above), it works quite well!
Let's wait for @compilade to review + approve then we can merge! 🤗
// optional MLP bias
layer.ffn_down_b = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_DOWN, "bias", i), {n_embd}, llama_model_loader::TENSOR_NOT_REQUIRED);
layer.ffn_up_b = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_UP, "bias", i), {n_ff}, llama_model_loader::TENSOR_NOT_REQUIRED);
It's not correct to use ctx_split for bias tensors; they should use ctx_layer instead.
Thank you for your comment @slaren !
Sorry for the naive question. What's the difference between ctx_split and ctx_layer?
Something that's not clear to me is that some parts of llama.cpp use ctx_split for bias tensors as well.
For example,
- https://github.com/ggerganov/llama.cpp/blob/master/src/llama.cpp#L6108-L6110
- https://github.com/ggerganov/llama.cpp/blob/master/src/llama.cpp#L6343
Should they be corrected (which is out of the scope of this PR but wanted to ask to have a better understanding of them)?
ctx_split only makes a difference when using tensor parallelism with -sm row, which is only supported on the CUDA backend when using multiple GPUs. When using -sm row, ctx_split splits the rows of the matrix between the available GPUs. This is only supported for matrix multiplication, so it should only be used with the matrix portion of linear/dense layers. The other cases are also wrong and should be corrected as well, but it doesn't need to be done here.
Thanks for the explanation! Updated the two lines accordingly. Agree with you that the other parts should be fixed outside this PR.
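For reference, a sketch of what the two corrected lines would look like after the change, based on the diff quoted above with the bias tensors switched to ctx_layer (identifiers such as ml, tn, n_embd, and n_ff come from the surrounding llama.cpp loader code):

```cpp
// Optional MLP bias: bias vectors are never row-split across GPUs with -sm row,
// so they are created in ctx_layer; only the weight matrices keep using ctx_split.
layer.ffn_down_b = ml.create_tensor(ctx_layer, tn(LLM_TENSOR_FFN_DOWN, "bias", i), {n_embd}, llama_model_loader::TENSOR_NOT_REQUIRED);
layer.ffn_up_b   = ml.create_tensor(ctx_layer, tn(LLM_TENSOR_FFN_UP,   "bias", i), {n_ff},   llama_model_loader::TENSOR_NOT_REQUIRED);
```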
Hi @compilade
Thank you all for your reviews and support @compilade @Vaibhavs10 @ggerganov @slaren! Could anybody help merge this PR? Thank you!
Sorry for disturbing, but when I try to convert the linked minitron-4b model with transformers 4.44.0 and current llama.cpp, it simply complains about missing tokenizer.model. Any idea why that could be?
Hi @schmorp, I think the repo has been updated and you can extract the tokenizer from it. There are two tokenizer files, but they are the same and either can be renamed to tokenizer.model.
@suhara thanks a lot!
Minitron-8B converts, but then can't be used: llm_load_tensors: ggml ctx size = 0.15 MiB
Minitron-4B seems to work. So it seems Minitron-8B is not quite supported yet.
I'll look into this, but I think I know the root cause: 8B uses a configuration that many HF models, including Llama, reject with an assertion.

FYI, for 4B, …
That's good news, thanks for looking into this. I'll have a try at the 340B.
For the 340B, conversion instantly fails because there isn't a config.json file.
I tried, but the only option seems to be using the SafeTensors conversion provided by @mgoin under https://huggingface.co/collections/mgoin/nemotron-in-vllm-66a151b4240bcd9c28735ec5. He unfortunately never shared how he converted NeMo into SafeTensors.
@nicoboss if the conversion steps and script would be useful, I can document this tomorrow!
This would be absolutely awesome, thanks a lot! I'm very interested in how the conversion works. Maybe it would even be possible to implement it inside convert_hf_to_gguf.py. I'm currently working together with @schmorp to GGUF-quantize all Nemotron-3, Nemotron-4 and "Minitron" models. While your collection is great, it unfortunately misses many Nemotron-3 models, which we could convert on our own if you shared your tools and knowledge. Nemotron-4-340B-Instruct is one of my favorite models and I can't thank you enough for converting it into a usable format.
And just to document this here, Llama-3.1-Minitron-4B-Width-Base fails with: cvs/llama.cpp/ggml/src/ggml.c:6399: GGML_ASSERT(c->ne[0] >= n_dims / 2) failed |
The merged commits:
* Add nemotron GGUF conversion & inference support
* Fix formatting issues
* Remove unnecessary write_tensors()
* Update convert_hf_to_gguf.py (Co-authored-by: compilade <git@compilade.net>)
* Update src/llama.cpp (Co-authored-by: compilade <git@compilade.net>)
* Address comments by @compilade
* Replace ggml_mul_mat() -> llm_build_lora_mm()
* Remove mutable variable
* Use ctx_layer for bias tensors
* Cover corner case for rope_scaling not in config.json
This PR adds HF->GGUF conversion & inference support for Nemotron models, including Nemotron-3, Nemotron-4 and "Minitron".
The PR should support any Nemotron/Minitron model but has been primarily tested with the following Minitron model: nvidia/Minitron-4B-Base.
HF support for Nemotron has recently been added: as of Transformers 4.44.0, Nemotron is supported (thank you @Vaibhavs10 for the information!). You may need to install a newer version of the transformers library by running pip install transformers>=4.44.0. Please see this PR for details.
The Nemotron architecture is similar to the Llama-2 architecture with a few key differences:
You can find details about the model architecture in the following papers:
This PR was created in collaboration with @SpaceCowboy850.