Add PLM GGUF Conversion & Inference Support by Si1w · Pull Request #12457 · ggml-org/llama.cpp

Si1w · 2025-03-18T20:35:23Z

This PR adds HF-to-GGUF conversion and inference support for the PLM model PLM-1.8B-Instruct.

The model has already been converted to GGUF, quantized, and tested: PLM-1.8B-Instruct-gguf, PLM-1.8B-Instruct-id-gguf

The model architecture is similar to DeepSeek V2 and MiniCPM3. The key points of the model are:

Sparse FFN: PLM uses Squared ReLU (up and down projections)
MLA: PLM uses Multi-head Latent Attention
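
To illustrate the sparse FFN point above, here is a minimal pure-Python sketch of a feed-forward block that applies a squared ReLU between an up and a down projection. The function names, the list-of-lists weight layout, the toy dimensions, and the absence of a gate projection are all assumptions made for this example (the description lists only up and down projections); it is not the code added by this PR.

```python
def relu_squared(v):
    """Squared ReLU: max(0, x)^2 applied element-wise."""
    return [max(0.0, x) ** 2 for x in v]


def matvec(w, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


def squared_relu_ffn(x, w_up, w_down):
    """FFN sketch with up and down projections only.

    Assumed layout (hypothetical): w_up is [n_ff][n_embd],
    w_down is [n_embd][n_ff].
    """
    hidden = relu_squared(matvec(w_up, x))
    return matvec(w_down, hidden)


# Toy sizes for illustration only.
x = [1.0, -2.0]
w_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_down = [[1.0, 1.0, 1.0], [0.5, 0.0, 0.0]]
y = squared_relu_ffn(x, w_up, w_down)
```

In this toy input two of the three hidden units have negative pre-activations and are zeroed by the ReLU, which is the kind of activation sparsity the "Sparse FFN" label refers to.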

The details of the model can be found in the following paper:

PLM: Efficient Peripheral Language Models Hardware-Co-Designed for Ubiquitous Computing

I have read the contributing guidelines

Self-reported review complexity:

Low
Medium
High

arch-btw · 2025-03-19T16:46:50Z

Tested both the premade GGUF and converting to GGUF myself; both work 👍

Looks like it's using the qwen2 tokenizer with the associated chatml prompt template:

llama_model_loader: - kv   0:                       general.architecture str              = plm
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = PLM 1.8B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = PLM
llama_model_loader: - kv   5:                         general.size_label str              = 1.8B
llama_model_loader: - kv   6:                            general.license str              = mit
llama_model_loader: - kv   7:                            plm.block_count u32              = 32
llama_model_loader: - kv   8:                         plm.context_length u32              = 4096
llama_model_loader: - kv   9:                       plm.embedding_length u32              = 2048
llama_model_loader: - kv  10:                    plm.feed_forward_length u32              = 8192
llama_model_loader: - kv  11:                   plm.attention.head_count u32              = 16
llama_model_loader: - kv  12:                plm.attention.head_count_kv u32              = 16
llama_model_loader: - kv  13:                         plm.rope.freq_base f32              = 100000.000000
llama_model_loader: - kv  14:       plm.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  15:                             plm.vocab_size u32              = 151936
llama_model_loader: - kv  16:                 plm.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  17:                   plm.attention.key_length u32              = 192
llama_model_loader: - kv  18:                 plm.attention.value_length u32              = 128
llama_model_loader: - kv  19:                   plm.rope.dimension_count u32              = 64
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  26:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  27:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - kv  30:                          general.file_type u32              = 15
llama_model_loader: - type  f32:   97 tensors
llama_model_loader: - type q4_K:  176 tensors
llama_model_loader: - type q6_K:   17 tensors
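
For anyone who wants to inspect these values programmatically, the following is a small sketch that parses `llama_model_loader: - kv` lines like the ones above into a dictionary. The regular expression and the helper name `parse_kv_lines` are assumptions made for this example and are not part of llama.cpp. The final subtraction assumes that, as in DeepSeek-V2-style MLA, each key head consists of a rotary part plus a non-rotary part; the log itself does not state this.

```python
import re

# Matches lines such as:
# llama_model_loader: - kv  17:   plm.attention.key_length u32   = 192
KV_LINE = re.compile(
    r"^llama_model_loader: - kv\s+(\d+):\s+(\S+)\s+(\S+)\s+=\s+(.*)$"
)


def parse_kv_lines(log_text):
    """Return {key: (type, raw_value)} for every metadata line in the log."""
    meta = {}
    for line in log_text.splitlines():
        m = KV_LINE.match(line.strip())
        if m:
            _, key, value_type, raw = m.groups()
            meta[key] = (value_type, raw.strip())
    return meta


sample = """\
llama_model_loader: - kv  16:                 plm.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  17:                   plm.attention.key_length u32              = 192
llama_model_loader: - kv  18:                 plm.attention.value_length u32              = 128
llama_model_loader: - kv  19:                   plm.rope.dimension_count u32              = 64
"""

meta = parse_kv_lines(sample)
key_length = int(meta["plm.attention.key_length"][1])
rope_dims = int(meta["plm.rope.dimension_count"][1])

# Assumption: the key head is split into rotary and non-rotary parts,
# so the remainder is the non-rotary width per head.
non_rope_dims = key_length - rope_dims
```

Values are kept as raw strings so that truncated array entries (the ones ending in `...`) are stored as logged rather than misparsed.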

The only small error was a brief switch in language, but that's probably not related to this PR:

> hello!
通往成功

> How are you?
I'm doing well, thank you! How about you? How can I help you today?

convert_hf_to_gguf.py output:

python convert_hf_to_gguf.py /home/test/PLM-1.8B-Instruct --outtype f32
.....
INFO:hf-to-gguf:Set meta model
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:gguf: context length = 4096
INFO:hf-to-gguf:gguf: embedding length = 2048
INFO:hf-to-gguf:gguf: feed forward length = 8192
INFO:hf-to-gguf:gguf: head count = 16
INFO:hf-to-gguf:gguf: key-value head count = 16
INFO:hf-to-gguf:gguf: rope theta = 100000.0
INFO:hf-to-gguf:gguf: rms norm epsilon = 1e-06
INFO:hf-to-gguf:gguf: file type = 0
INFO:hf-to-gguf:Set model tokenizer
INFO:gguf.vocab:Adding 151387 merge(s).
INFO:gguf.vocab:Setting special token type eos to 151643
INFO:gguf.vocab:Setting special token type pad to 151643
INFO:gguf.vocab:Setting special token type bos to 151643
INFO:gguf.vocab:Setting chat_template to {% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
You are a helpful assistant<|im_end|>
' }}{% endif %}{{'<|im_start|>' + message['role'] + '
' + message['content'] + '<|im_end|>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
' }}{% endif %}
INFO:hf-to-gguf:Set model quantization version
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:/home/test/PLM-1.8B-Instruct/PLM-1.8B-Instruct-F32.gguf: n_tensors = 290, total_size = 7.3G
Writing: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7.30G/7.30G [00:11<00:00, 643Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to /home/test/PLM-1.8B-Instruct/PLM-1.8B-Instruct-F32.gguf
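
The chat template logged above is the ChatML format. As a sketch of what it produces, the function below is a hand-written Python equivalent of that Jinja template; the name `render_chatml` is hypothetical, and this is not how llama.cpp applies the template internally.

```python
DEFAULT_SYSTEM = "You are a helpful assistant"


def render_chatml(messages, add_generation_prompt=True):
    """Hand-written equivalent of the logged template (not a Jinja engine)."""
    parts = []
    for i, message in enumerate(messages):
        # Inject the default system prompt when the conversation has none.
        if i == 0 and message["role"] != "system":
            parts.append("<|im_start|>system\n" + DEFAULT_SYSTEM + "<|im_end|>\n")
        parts.append(
            "<|im_start|>" + message["role"] + "\n"
            + message["content"] + "<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml([{"role": "user", "content": "hello!"}])
```

For the single user message `hello!`, the rendered prompt contains the default system turn, the user turn, and an open assistant turn, in that order.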

Si1w · 2025-03-27T09:51:50Z

@ggerganov @slaren @ngxson I have fixed the problem and tested the models. Could you review again and merge? Thanks in advance.

ggerganov · 2025-03-27T09:58:55Z

@Si1w What is the difference between the "instruct" and "instruct-id" models?

Si1w · 2025-03-27T10:06:58Z

> @Si1w What is the difference between the "instruct" and "instruct-id" models?

There is no significant difference. "instruct-id" is the same model with an identity, i.e. the model knows that its name is PLM.

ggerganov · 2025-03-27T10:12:20Z

Let's merge if CI is green.

* add edgellm model arch[conversation feature doesn't work]
* remove output.weight layer for edgellm arch
* [Model] update the name of the model
* update the name of model arch in convert gguf
* [Model] Refarctor the model arch into llama-model
* [Bug] Fix the bug in create attn kv
* [Code] Fix editorconfig erros
* [Code] Remove Trailing whitespace
* [Code] Remove Trailing whitespace
* [Code] Change the order of model arch in list
* [Code] Fix flake8 Lint errors
* Remove trailing white space
* [Code] Remove call in model arch

Si1w and others added 23 commits February 1, 2025 18:53

add edgellm model arch[conversation feature doesn't work]

563ec88

Merge branch 'ggerganov:master' into master

f006d42

remove output.weight layer for edgellm arch

c14cad9

Merge branch 'master' of github.com:Si1w/llama.cpp

1a47cee

Merge branch 'ggerganov:master' into master

21ed73d

Merge branch 'ggerganov:master' into master

9a54239

Merge branch 'ggerganov:master' into master

08b5a57

Merge branch 'ggml-org:master' into master

7813da4

Merge branch 'ggml-org:master' into master

731ed0a

Merge branch 'ggml-org:master' into master

b808f00

Merge branch 'ggml-org:master' into master

f687e8e

[Model] update the name of the model

5646eb9

update the name of model arch in convert gguf

2518841

Merge branch 'ggml-org:master' into master

ff3d94f

Merge remote-tracking branch 'upstream/master'

444dfe5

[Model] Refarctor the model arch into llama-model

22d35ac

Merge branch 'ggml-org:master' into master

93cf1e4

Merge branch 'ggml-org:master' into master

850d301

Merge branch 'ggml-org:master' into master

4235644

Merge branch 'ggml-org:master' into master

0fcce31

Merge branch 'ggml-org:master' into master

55b8674

[Bug] Fix the bug in create attn kv

69d61ee

Merge branch 'ggml-org:master' into master

a7f4a68

github-actions bot added the "python" (python script changes) label Mar 18, 2025

Si1w and others added 6 commits March 19, 2025 06:21

Merge branch 'ggml-org:master' into master

066901e

Merge branch 'ggml-org:master' into master

9d47a39

[Code] Fix editorconfig erros

95de3c6

[Code] Remove Trailing whitespace

d7a2fc0

Merge branch 'ggml-org:master' into master

91f06a7

[Code] Remove Trailing whitespace

4bd85c6

Merge branch 'master' of github.com:Si1w/llama.cpp

5f75445

ngxson reviewed Mar 19, 2025

Comment thread src/llama-arch.h

[Code] Change the order of model arch in list

cd460ab

ngxson reviewed Mar 19, 2025

Comment thread src/llama-model.cpp

Si1w and others added 5 commits March 20, 2025 15:28

[Code] Fix flake8 Lint errors

6d3ac9a

Merge branch 'ggml-org:master' into master

0b8de3f

Remove trailing white space

646521e

Merge branch 'master' of github.com:Si1w/llama.cpp

f5b5271

Merge branch 'ggml-org:master' into master

2339115

ngxson requested a review from ggerganov March 21, 2025 20:13

ggerganov reviewed Mar 21, 2025

Comment thread src/llama-model.cpp Outdated

Si1w and others added 5 commits March 21, 2025 21:50

[Code] Remove call in model arch

7772d4f

Merge branch 'ggml-org:master' into master

e9c7ff4

Merge branch 'master' of github.com:Si1w/llama.cpp

1ec1c1e

Merge branch 'ggml-org:master' into master

82889bb

Merge branch 'ggml-org:master' into master

3a07979

Si1w requested review from ggerganov, ngxson and slaren March 27, 2025 09:52

ggerganov approved these changes Mar 27, 2025

ngxson approved these changes Mar 27, 2025

ggerganov merged commit f125b8d into ggml-org:master Mar 27, 2025
50 checks passed

ggerganov mentioned this pull request Apr 12, 2025

DeepSeek V2/V3 MLA implementation #12801

Merged

ggerganov merged 41 commits into ggml-org:master from Si1w:master

5 participants