models : move the token embedding norms to the first layer #20943

Merged

ggerganov merged 2 commits into master from gg/models-move-te-norm-to-device on Mar 24, 2026

Conversation

@ggerganov
Member

Overview

We were keeping the token embedding norms in the input-layer buffers, which results in these operations being performed on the CPU:

make -j && GGML_SCHED_DEBUG=2 ./bin/llama-embedding -hf ggml-org/bge-m3-Q8_0-GGUF -p "Hello world" -lv 5
## SPLIT #0: CPU # 0 inputs0.01.244.604 D 
0.01.244.604 D node #  0 (  GET_ROWS):                 embd (  16K) [  CPU         ] use=1,c=1:0.01.244.605 D     token_embd.weight ( 259M) [  CPU         ]0.01.244.605 D            inp_tokens (   0K) [  CPU         ]0.01.244.605 D 
0.01.244.606 D node #  1 (  GET_ROWS):               node_1 (  16K) [  CPU         ] use=1,c=1:0.01.244.606 D  position_embd.weight (  32M) [  CPU         ]0.01.244.606 D                leaf_5 (   0K) [  CPU         ]0.01.244.606 D 
0.01.244.607 D node #  3 (       ADD):               node_3 (  16K) [  CPU         ] use=1,c=1:0.01.244.607 D                  embd (  16K) [  CPU         ]0.01.244.607 D  token_types.weight ( (   4K) [  CPU         ]0.01.244.607 D 
0.01.244.607 D node #  4 (       ADD):             inp_embd (  16K) [  CPU         ] use=1,c=1:0.01.244.608 D                node_1 (  16K) [  CPU         ]0.01.244.608 D                node_3 (  16K) [  CPU         ]0.01.244.608 D 
0.01.244.608 D node #  5 (      NORM):                 norm (  16K) [  CPU         ] use=1,c=1:0.01.244.609 D              inp_embd (  16K) [  CPU         ]0.01.244.609 D 
0.01.244.609 D node #  6 (       MUL):               norm_w (  16K) [  CPU         ] use=1,c=1:0.01.244.609 D                  norm (  16K) [  CPU         ]0.01.244.609 D  token_embd_norm.weig (   4K) [  CPU         ]0.01.244.609 D 
0.01.244.610 D node #  7 (       ADD):             inp_norm (  16K) [  CPU         ] use=4,c=1:0.01.244.610 D                norm_w (  16K) [  CPU         ]0.01.244.610 D  token_embd_norm.bias (   4K) [  CPU         ]0.01.244.610 D 
0.01.244.610 D 
## SPLIT #1: MTL0 # 4 inputs0.01.244.611 D : 0.01.244.611 D [inp_norm (  16K)] 0.01.244.611 D [leaf_14 (   0K)] 0.01.244.611 D [leaf_387 (   0K)] 0.01.244.611 D [leaf_394 (   0K)] 0.01.244.611 D 
0.01.244.612 D node #  8 (   MUL_MAT):               node_8 (  16K) [ MTL0         ] use=1,c=1:0.01.244.612 D   blk.0.attn_q.weight (   1M) [ MTL0         ]0.01.244.613 D       MTL0#inp_norm#0 (  16K) [ NULL         ]0.01.244.613 D 
0.01.244.613 D node #  9 (       ADD):               node_9 (  16K) [ MTL0         ] use=1,c=1:0.01.244.613 D                node_8 (  16K) [ MTL0         ]0.01.244.613 D     blk.0.attn_q.bias (   4K) [ MTL0         ]0.01.244.614 D 
0.01.244.614 D node # 11 (   MUL_MAT):              node_11 (  16K) [ MTL0         ] use=1,c=1:0.01.244.614 D   blk.0.attn_k.weight (   1M) [ MTL0         ]0.01.244.614 D       MTL0#inp_norm#0 (  16K) [ NULL         ]0.01.244.614 D 
0.01.244.615 D node # 12 (       ADD):              node_12 (  16K) [ MTL0         ] use=1,c=1:0.01.244.615 D               node_11 (  16K) [ MTL0         ]0.01.244.615 D     blk.0.attn_k.bias (   4K) [ MTL0         ]0.01.244.615 D 
0.01.244.616 D node # 14 (   MUL_MAT):              node_14 (  16K) [ MTL0         ] use=1,c=1:0.01.244.616 D   blk.0.attn_v.weight (   1M) [ MTL0         ]0.01.244.616 D       MTL0#inp_norm#0 (  16K) [ NULL         ]0.01.244.616 D 

With this change, we assign the norms to the first layer's buffer, so they can be offloaded to the GPU when possible (a rough sketch of the buffer-selection idea follows the log below):

## SPLIT #0: CPU # 0 inputs0.01.192.580 D 
0.01.192.581 D node #  0 (  GET_ROWS):                 embd (  16K) [  CPU         ] use=1,c=1:0.01.192.581 D     token_embd.weight ( 259M) [  CPU         ]0.01.192.581 D            inp_tokens (   0K) [  CPU         ]0.01.192.581 D 
0.01.192.582 D node #  1 (  GET_ROWS):               node_1 (  16K) [  CPU         ] use=1,c=1:0.01.192.582 D  position_embd.weight (  32M) [  CPU         ]0.01.192.582 D                leaf_5 (   0K) [  CPU         ]0.01.192.582 D 
0.01.192.583 D 
## SPLIT #1: MTL0 # 6 inputs0.01.192.583 D : 0.01.192.583 D [embd (  16K)] 0.01.192.583 D [token_types.weight (view) (   4K)] 0.01.192.583 D [node_1 (  16K)] 0.01.192.584 D [leaf_14 (   0K)] 0.01.192.584 D [leaf_387 (   0K)] 0.01.192.584 D [leaf_394 (   0K)] 0.01.192.584 D 
0.01.192.585 D node #  3 (       ADD):               node_3 (  16K) [ MTL0         ] use=1,c=1:0.01.192.585 D           MTL0#embd#0 (  16K) [ NULL         ]0.01.192.585 D  MTL0#token_types.wei (   4K) [ NULL         ]0.01.192.585 D 
0.01.192.586 D node #  4 (       ADD):             inp_embd (  16K) [ MTL0         ] use=1,c=1:0.01.192.586 D         MTL0#node_1#0 (  16K) [ NULL         ]0.01.192.586 D                node_3 (  16K) [ MTL0         ]0.01.192.586 D 
0.01.192.587 D node #  5 (      NORM):                 norm (  16K) [ MTL0         ] use=1,c=1:0.01.192.587 D              inp_embd (  16K) [ MTL0         ]0.01.192.587 D 
0.01.192.587 D node #  6 (       MUL):               norm_w (  16K) [ MTL0         ] use=1,c=1:0.01.192.588 D                  norm (  16K) [ MTL0         ]0.01.192.588 D  token_embd_norm.weig (   4K) [ MTL0         ]0.01.192.588 D 
0.01.192.588 D node #  7 (       ADD):             inp_norm (  16K) [ MTL0         ] use=4,c=1:0.01.192.589 D                norm_w (  16K) [ MTL0         ]0.01.192.590 D  token_embd_norm.bias (   4K) [ MTL0         ]0.01.192.590 D 
0.01.192.591 D node #  8 (   MUL_MAT):               node_8 (  16K) [ MTL0         ] use=1,c=1:0.01.192.591 D   blk.0.attn_q.weight (   1M) [ MTL0         ]0.01.192.591 D              inp_norm (  16K) [ MTL0         ]0.01.192.591 D 
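
A minimal, self-contained sketch of the underlying idea (not the actual llama.cpp code; all names in it are hypothetical): weights created in the input buffer type stay on the host, so the ops that consume them run on the CPU, while weights created in a layer's buffer type follow that layer when it is offloaded to a device.

```cpp
// Illustrative sketch only -- hypothetical names, not llama.cpp internals.
// It models how a per-tensor "layer category" decides which buffer type a
// weight is created in, and therefore where the ops that use it can run.
#include <cstdio>
#include <string>
#include <vector>

enum tensor_layer { LAYER_INPUT, LAYER_REPEATING, LAYER_OUTPUT };

struct model_buffers {
    std::string              buft_input;   // host buffer, e.g. "CPU"
    std::vector<std::string> buft_layer;   // per-layer buffers, e.g. "MTL0" when offloaded
    std::string              buft_output;
};

// pick the buffer type for a weight, given its layer category and layer index
static const std::string & select_buft(const model_buffers & bufs, tensor_layer cat, int il) {
    switch (cat) {
        case LAYER_INPUT:     return bufs.buft_input;      // pinned to the host
        case LAYER_REPEATING: return bufs.buft_layer[il];  // offloadable per layer
        default:              return bufs.buft_output;
    }
}

int main() {
    model_buffers bufs = { "CPU", { "MTL0", "MTL0" }, "MTL0" };

    // before this PR: token_embd_norm treated as an input tensor -> host buffer
    printf("before: token_embd_norm -> %s\n", select_buft(bufs, LAYER_INPUT,     0).c_str());
    // after this PR: token_embd_norm assigned to the first layer -> device buffer
    printf("after:  token_embd_norm -> %s\n", select_buft(bufs, LAYER_REPEATING, 0).c_str());
    return 0;
}
```

In llama.cpp itself, the analogous decision is driven by the per-tensor layer category recorded in LLM_TENSOR_INFOS.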

Additional information

This is causing some of the roundtrip discrepancies found in #20503 (comment)


@JohannesGaessler JohannesGaessler left a comment
Contributor

I think it's fine to move the embedding norm to the first layer and even preferable as it will probably be marginally faster. However, I am concerned that this is indicative of a deeper issue since I don't understand why the tensor -> backend assignment would be different in the first place.

@ggerganov
Member Author

However, I am concerned that this is indicative of a deeper issue since I don't understand why the tensor -> backend assignment would be different in the first place.

Are you referring to the GET_ROWS issue that I mentioned in #20503 (comment)? I am still not sure, but it seems like we are taking different paths in the logic for assigning the tensors depending on whether we read them from a file or not. Investigating.

@JohannesGaessler
Contributor

No, I mean that if we save and then load the model again, we would expect the tensor assignments to remain the same (my understanding is that this is somehow changing?). To me, it would suggest that llama_model_saver isn't storing some relevant piece of information in the GGUF.

@ggerganov
Member Author

What I am seeing is that saving the model and then loading it again works exactly like a normal model of the given architecture: the tensors are assigned in the same way, according to LLM_TENSOR_INFOS.

The difference comes from the in-memory model - in this case, all tensors are always put on the device. It's as if the LLM_TENSOR_LAYER_INPUT logic is not being considered, and even the token embeddings end up assigned to the Metal buffer instead of staying on the CPU. I am still looking into the root cause.

@CISC CISC left a comment
Member

Am I wrong in thinking this (whisper et al.) also should be moved?

{LLM_TENSOR_CONV1D, {LLM_TENSOR_LAYER_INPUT, GGML_OP_IM2COL}},
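
If so, the corresponding move would presumably be the same kind of one-line category change (a hypothetical sketch, not verified against the follow-up commit):

```cpp
// hypothetical sketch: move the conv1d weights out of the input category so they
// follow the first layer's buffer type instead of staying pinned to the host buffer
// before:
//   {LLM_TENSOR_CONV1D, {LLM_TENSOR_LAYER_INPUT,     GGML_OP_IM2COL}},
// after (assumed):
//   {LLM_TENSOR_CONV1D, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_IM2COL}},
```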

github-actions bot added the model (Model specific) label on Mar 24, 2026
ggerganov merged commit 9f102a1 into master on Mar 24, 2026
47 of 48 checks passed
ggerganov deleted the gg/models-move-te-norm-to-device branch on Mar 24, 2026 at 15:00
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
…20943)

* models : move the token embedding norms to the first layer

* cont : fix LLM_TENSOR_CONV1D + fix il indexing