models : move the token embedding norms to the first layer #20943

Merged

ggerganov merged 2 commits into master from gg/models-move-te-norm-to-device on Mar 24, 2026

Conversation

@ggerganov
Member

Overview

We were keeping the token embedding norms in the input-layer buffers, which results in these operations being performed on the CPU:

make -j && GGML_SCHED_DEBUG=2 ./bin/llama-embedding -hf ggml-org/bge-m3-Q8_0-GGUF -p "Hello world" -lv 5
## SPLIT #0: CPU # 0 inputs0.01.244.604 D 
0.01.244.604 D node #  0 (  GET_ROWS):                 embd (  16K) [  CPU         ] use=1,c=1:0.01.244.605 D     token_embd.weight ( 259M) [  CPU         ]0.01.244.605 D            inp_tokens (   0K) [  CPU         ]0.01.244.605 D 
0.01.244.606 D node #  1 (  GET_ROWS):               node_1 (  16K) [  CPU         ] use=1,c=1:0.01.244.606 D  position_embd.weight (  32M) [  CPU         ]0.01.244.606 D                leaf_5 (   0K) [  CPU         ]0.01.244.606 D 
0.01.244.607 D node #  3 (       ADD):               node_3 (  16K) [  CPU         ] use=1,c=1:0.01.244.607 D                  embd (  16K) [  CPU         ]0.01.244.607 D  token_types.weight ( (   4K) [  CPU         ]0.01.244.607 D 
0.01.244.607 D node #  4 (       ADD):             inp_embd (  16K) [  CPU         ] use=1,c=1:0.01.244.608 D                node_1 (  16K) [  CPU         ]0.01.244.608 D                node_3 (  16K) [  CPU         ]0.01.244.608 D 
0.01.244.608 D node #  5 (      NORM):                 norm (  16K) [  CPU         ] use=1,c=1:0.01.244.609 D              inp_embd (  16K) [  CPU         ]0.01.244.609 D 
0.01.244.609 D node #  6 (       MUL):               norm_w (  16K) [  CPU         ] use=1,c=1:0.01.244.609 D                  norm (  16K) [  CPU         ]0.01.244.609 D  token_embd_norm.weig (   4K) [  CPU         ]0.01.244.609 D 
0.01.244.610 D node #  7 (       ADD):             inp_norm (  16K) [  CPU         ] use=4,c=1:0.01.244.610 D                norm_w (  16K) [  CPU         ]0.01.244.610 D  token_embd_norm.bias (   4K) [  CPU         ]0.01.244.610 D 
0.01.244.610 D 
## SPLIT #1: MTL0 # 4 inputs0.01.244.611 D : 0.01.244.611 D [inp_norm (  16K)] 0.01.244.611 D [leaf_14 (   0K)] 0.01.244.611 D [leaf_387 (   0K)] 0.01.244.611 D [leaf_394 (   0K)] 0.01.244.611 D 
0.01.244.612 D node #  8 (   MUL_MAT):               node_8 (  16K) [ MTL0         ] use=1,c=1:0.01.244.612 D   blk.0.attn_q.weight (   1M) [ MTL0         ]0.01.244.613 D       MTL0#inp_norm#0 (  16K) [ NULL         ]0.01.244.613 D 
0.01.244.613 D node #  9 (       ADD):               node_9 (  16K) [ MTL0         ] use=1,c=1:0.01.244.613 D                node_8 (  16K) [ MTL0         ]0.01.244.613 D     blk.0.attn_q.bias (   4K) [ MTL0         ]0.01.244.614 D 
0.01.244.614 D node # 11 (   MUL_MAT):              node_11 (  16K) [ MTL0         ] use=1,c=1:0.01.244.614 D   blk.0.attn_k.weight (   1M) [ MTL0         ]0.01.244.614 D       MTL0#inp_norm#0 (  16K) [ NULL         ]0.01.244.614 D 
0.01.244.615 D node # 12 (       ADD):              node_12 (  16K) [ MTL0         ] use=1,c=1:0.01.244.615 D               node_11 (  16K) [ MTL0         ]0.01.244.615 D     blk.0.attn_k.bias (   4K) [ MTL0         ]0.01.244.615 D 
0.01.244.616 D node # 14 (   MUL_MAT):              node_14 (  16K) [ MTL0         ] use=1,c=1:0.01.244.616 D   blk.0.attn_v.weight (   1M) [ MTL0         ]0.01.244.616 D       MTL0#inp_norm#0 (  16K) [ NULL         ]0.01.244.616 D 

With this change, we assign the norms to the first layer's buffer, so they can be offloaded to the GPU when possible (a rough sketch of the buffer-selection idea follows the log below):

## SPLIT #0: CPU # 0 inputs0.01.192.580 D 
0.01.192.581 D node #  0 (  GET_ROWS):                 embd (  16K) [  CPU         ] use=1,c=1:0.01.192.581 D     token_embd.weight ( 259M) [  CPU         ]0.01.192.581 D            inp_tokens (   0K) [  CPU         ]0.01.192.581 D 
0.01.192.582 D node #  1 (  GET_ROWS):               node_1 (  16K) [  CPU         ] use=1,c=1:0.01.192.582 D  position_embd.weight (  32M) [  CPU         ]0.01.192.582 D                leaf_5 (   0K) [  CPU         ]0.01.192.582 D 
0.01.192.583 D 
## SPLIT #1: MTL0 # 6 inputs0.01.192.583 D : 0.01.192.583 D [embd (  16K)] 0.01.192.583 D [token_types.weight (view) (   4K)] 0.01.192.583 D [node_1 (  16K)] 0.01.192.584 D [leaf_14 (   0K)] 0.01.192.584 D [leaf_387 (   0K)] 0.01.192.584 D [leaf_394 (   0K)] 0.01.192.584 D 
0.01.192.585 D node #  3 (       ADD):               node_3 (  16K) [ MTL0         ] use=1,c=1:0.01.192.585 D           MTL0#embd#0 (  16K) [ NULL         ]0.01.192.585 D  MTL0#token_types.wei (   4K) [ NULL         ]0.01.192.585 D 
0.01.192.586 D node #  4 (       ADD):             inp_embd (  16K) [ MTL0         ] use=1,c=1:0.01.192.586 D         MTL0#node_1#0 (  16K) [ NULL         ]0.01.192.586 D                node_3 (  16K) [ MTL0         ]0.01.192.586 D 
0.01.192.587 D node #  5 (      NORM):                 norm (  16K) [ MTL0         ] use=1,c=1:0.01.192.587 D              inp_embd (  16K) [ MTL0         ]0.01.192.587 D 
0.01.192.587 D node #  6 (       MUL):               norm_w (  16K) [ MTL0         ] use=1,c=1:0.01.192.588 D                  norm (  16K) [ MTL0         ]0.01.192.588 D  token_embd_norm.weig (   4K) [ MTL0         ]0.01.192.588 D 
0.01.192.588 D node #  7 (       ADD):             inp_norm (  16K) [ MTL0         ] use=4,c=1:0.01.192.589 D                norm_w (  16K) [ MTL0         ]0.01.192.590 D  token_embd_norm.bias (   4K) [ MTL0         ]0.01.192.590 D 
0.01.192.591 D node #  8 (   MUL_MAT):               node_8 (  16K) [ MTL0         ] use=1,c=1:0.01.192.591 D   blk.0.attn_q.weight (   1M) [ MTL0         ]0.01.192.591 D              inp_norm (  16K) [ MTL0         ]0.01.192.591 D 
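
A minimal, self-contained sketch of the underlying idea (not the actual llama.cpp code; all names in it are hypothetical): weights created in the input buffer type stay on the host, so the ops that consume them run on the CPU, while weights created in a layer's buffer type follow that layer when it is offloaded to a device.

```cpp
// Illustrative sketch only -- hypothetical names, not llama.cpp internals.
// It models how a per-tensor "layer category" decides which buffer type a
// weight is created in, and therefore where the ops that use it can run.
#include <cstdio>
#include <string>
#include <vector>

enum tensor_layer { LAYER_INPUT, LAYER_REPEATING, LAYER_OUTPUT };

struct model_buffers {
    std::string              buft_input;   // host buffer, e.g. "CPU"
    std::vector<std::string> buft_layer;   // per-layer buffers, e.g. "MTL0" when offloaded
    std::string              buft_output;
};

// pick the buffer type for a weight, given its layer category and layer index
static const std::string & select_buft(const model_buffers & bufs, tensor_layer cat, int il) {
    switch (cat) {
        case LAYER_INPUT:     return bufs.buft_input;      // pinned to the host
        case LAYER_REPEATING: return bufs.buft_layer[il];  // offloadable per layer
        default:              return bufs.buft_output;
    }
}

int main() {
    model_buffers bufs = { "CPU", { "MTL0", "MTL0" }, "MTL0" };

    // before this PR: token_embd_norm treated as an input tensor -> host buffer
    printf("before: token_embd_norm -> %s\n", select_buft(bufs, LAYER_INPUT,     0).c_str());
    // after this PR: token_embd_norm assigned to the first layer -> device buffer
    printf("after:  token_embd_norm -> %s\n", select_buft(bufs, LAYER_REPEATING, 0).c_str());
    return 0;
}
```

In llama.cpp itself, the analogous decision is driven by the per-tensor layer category recorded in LLM_TENSOR_INFOS.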

Additional information

This is causing some of the roundtrip discrepancies found in #20503 (comment)


@JohannesGaessler JohannesGaessler left a comment
Contributor

I think it's fine to move the embedding norm to the first layer and even preferable as it will probably be marginally faster. However, I am concerned that this is indicative of a deeper issue since I don't understand why the tensor -> backend assignment would be different in the first place.

@ggerganov
Member Author

However, I am concerned that this is indicative of a deeper issue since I don't understand why the tensor -> backend assignment would be different in the first place.

Are you referring to the GET_ROWS issue that I mentioned in #20503 (comment)? I am still not sure, but it seems like we are taking different paths in the logic for assigning the tensors depending on whether we read them from a file or not. Investigating.

@JohannesGaessler
Contributor

No, I mean that if we save and then load the model again, we would expect the tensor assignments to remain the same (my understanding is that this is somehow changing?). To me, it would suggest that llama_model_saver isn't storing some relevant piece of information in the GGUF.

@ggerganov
Member Author

What I am seeing is that saving the model and then loading it again works exactly like a normal model of the given architecture: the tensors are assigned in the same way, according to LLM_TENSOR_INFOS.

The difference comes from the in-memory model - in this case, all tensors are always put on the device. It's as if the LLM_TENSOR_LAYER_INPUT logic is not being considered, and even the token embeddings end up assigned to the Metal buffer instead of staying on the CPU. I am still looking into the root cause.

@CISC CISC left a comment
Member

Am I wrong in thinking this (whisper et al.) also should be moved?

{LLM_TENSOR_CONV1D, {LLM_TENSOR_LAYER_INPUT, GGML_OP_IM2COL}},
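
If so, the corresponding move would presumably be the same kind of one-line category change (a hypothetical sketch, not verified against the follow-up commit):

```cpp
// hypothetical sketch: move the conv1d weights out of the input category so they
// follow the first layer's buffer type instead of staying pinned to the host buffer
// before:
//   {LLM_TENSOR_CONV1D, {LLM_TENSOR_LAYER_INPUT,     GGML_OP_IM2COL}},
// after (assumed):
//   {LLM_TENSOR_CONV1D, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_IM2COL}},
```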

github-actions bot added the model (Model specific) label on Mar 24, 2026
ggerganov merged commit 9f102a1 into master on Mar 24, 2026
47 of 48 checks passed
ggerganov deleted the gg/models-move-te-norm-to-device branch on Mar 24, 2026 at 15:00
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
…20943)

* models : move the token embedding norms to the first layer

* cont : fix LLM_TENSOR_CONV1D + fix il indexing