
model: move load_hparams and load_tensors to per-model definition #22004

Open

ngxson wants to merge 41 commits into ggml-org:master from ngxson:xsn/model_def_self_contained

Conversation

Contributor

@ngxson ngxson commented Apr 16, 2026

Overview

Fix #21966

The migration will be done via a script, 0migrate.py, included in this PR (it will be removed right before this is merged)

Important note: the goal of this PR is to make the migration as deterministic as possible; we do that via the fully heuristic script mentioned above. Any improvements (deduplication, clean-up, etc.) will be done via follow-up PRs

Depends on:

Checklist before merging:

  • Remove migration script
  • Remove the #if 0 pre-migration code blocks

Additional information

Migration rules:

  • Create one file per arch; the file name is the arch enum name in lower case, with _ replaced by -
  • The model class name is the arch name in lower case, prefixed with llama_model_

Given two example archs A and B:

  • If archs A and B BOTH use the same load_tensors AND load_hparams, B inherits from A
  • If either load_tensors OR load_hparams differs, there is no inheritance; one of the two functions will be duplicated
  • If B reuses the same graph as A, the graph definition in B will be using graph = llama_model_a::graph (see the sketch after this list)
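
To illustrate these rules, a minimal sketch of what the generated definitions look like; the archs A/B and the exact signatures are placeholders, the real ones follow the code quoted later in this thread:

// hypothetical src/models/a.cpp for arch LLM_ARCH_A (enum lower-cased, _ replaced by - in the file name)
struct llama_model_a : public llama_model_base {
    void load_hparams(llama_model_loader & ml) override;
    void load_tensors(llama_model_loader & ml) override;

    std::unique_ptr<llm_graph_context> build_graph_context(const llm_graph_params & params) const override;
};

// arch B shares BOTH load_tensors and load_hparams with A, so it inherits from A
struct llama_model_b : public llama_model_a {
    // load_hparams and load_tensors are reused from llama_model_a

    // B also reuses A's graph
    using graph = llama_model_a::graph;

    std::unique_ptr<llm_graph_context> build_graph_context(const llm_graph_params & params) const override;
};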

Side effects:

  • Some parts of the code will be duplicated (i.e. some load_tensors or load_hparams implementations)
  • No more -iswa suffix in file names or model class names

Self-note for updating this PR:

git reset --hard HEAD~1
git merge master
python 0migrate.py
git add -A
git commit -m "auto"
git push -f

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: partially, mostly for the migration script

@github-actions github-actions bot added the python (python script changes) label Apr 16, 2026
@ngxson ngxson marked this pull request as ready for review April 16, 2026 21:20
@ngxson ngxson requested review from CISC and ggerganov as code owners April 16, 2026 21:20
@ngxson ngxson force-pushed the xsn/model_def_self_contained branch from 69104f1 to 31011e6 on April 16, 2026 22:06
Comment thread src/models/models.h
Comment on lines +142 to 150
struct llama_model_llama_embed : public llama_model_llama {
llama_model_llama_embed(const struct llama_model_params & params) : llama_model_llama(params) {}
// reuse load_hparams and load_tensors from llama_model_llama

template <bool embed>
using graph = llama_model_llama::graph<embed>;

std::unique_ptr<llm_graph_context> build_graph_context(const llm_graph_params & params) const override;
};
Contributor Author


The auto-migration script can't be intelligent enough to point out that we can use the <true> specialization of the graph here, but things like this can be improved via a follow-up PR (just noting here for visibility)
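
For example, a possible follow-up simplification could pin the specialization directly (a sketch, not part of this PR):

struct llama_model_llama_embed : public llama_model_llama {
    llama_model_llama_embed(const struct llama_model_params & params) : llama_model_llama(params) {}

    // bind the embed specialization once instead of re-exporting the whole template
    using graph = llama_model_llama::graph<true>;

    std::unique_ptr<llm_graph_context> build_graph_context(const llm_graph_params & params) const override;
};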

@github-actions github-actions bot added the model (Model specific) label Apr 16, 2026
Comment thread src/llama-model.cpp Outdated

// helper function to facilitate migration
// TODO: remove this in the future
auto create_tensor = [&](const LLM_TN_IMPL & tn, const std::initializer_list<int64_t> & ne, int flags) -> ggml_tensor * {
Member


Would rename the lambda; having both the method and the lambda named create_tensor is a bit confusing.

Contributor Author


I think the better way is to completely move the current function into llm_arch_model_i and simply reuse the create_tensor there. Will push a fix for this

I'm trying not to rename things, to reduce the number of changed lines in this PR

Contributor Author


This lambda is removed in the latest version

@ngxson ngxson force-pushed the xsn/model_def_self_contained branch from 31011e6 to b5809e2 on April 17, 2026 14:49
@ngxson
Contributor Author

ngxson commented Apr 17, 2026

Not very urgent, but pinging @ggerganov in case you can have a quick look at the direction

@ggerganov ggerganov self-assigned this Apr 17, 2026
@ngxson ngxson force-pushed the xsn/model_def_self_contained branch from b5809e2 to c3e93a7 on April 18, 2026 21:42
@ngxson ngxson force-pushed the xsn/model_def_self_contained branch from c3e93a7 to a8282e3 on April 18, 2026 21:46
@ngxson ngxson force-pushed the xsn/model_def_self_contained branch from a8282e3 to f4ee25e on April 18, 2026 22:26
Comment thread src/llama-model.h
friend struct llama_model;

llama_model * model;
llama_model_loader * ml = nullptr;
Member

@ggerganov ggerganov Apr 20, 2026


I don't think introducing the llm_arch_model_i here is necessary. You should keep using llama_model for now and inherit the implementations directly from it. The llm_arch_model_i idea is separate - see below. Here you are interested just in localizing the model definitions (loading hparams, tensors, memory and graph creation) into individual files.


There is a separate refactoring task that can be done before or after this PR. The final state should be like this:

//
// llama.h
//

typedef struct llama_model_i * llama_model_t;

...

LLAMA_API int32_t llama_model_n_ctx_train(const llama_model_t model);
LLAMA_API int32_t llama_model_n_embd     (const llama_model_t model);
LLAMA_API int32_t llama_model_n_embd_inp (const llama_model_t model);

...

//
// llama-model.h
//

// pure interface
struct llama_model_i {
    virtual ~llama_model_i() = default;

    // public API mirror of llama.h
    virtual int32_t n_ctx_train() const = 0;
    virtual int32_t n_embd() const = 0;
    virtual int32_t n_embd_inp() const = 0;

    ...

    // internal API
    virtual bool load_hparams(llama_model_loader * ml, ...) = 0;
    virtual bool load_tensors(llama_model_loader * ml, ...) = 0;

    virtual llama_memory_i * create_memory(const llama_memory_params & params, const llama_cparams & cparams) const = 0;

    virtual ggml_cgraph * build_graph(const llm_graph_params & params) const = 0;

    ...
};

// base model (common functionality and data for all models)
class llama_model_base : public llama_model_i {
public:
    int32_t n_ctx_train() const override;
    int32_t n_embd() const override;
    int32_t n_embd_inp() const override;

    ...

protected:
    llm_type type = LLM_TYPE_UNKNOWN;
    llm_arch arch = LLM_ARCH_UNKNOWN;

    std::string name = "n/a";

    llama_hparams hparams = {};
    llama_vocab   vocab;

    ggml_tensor * tok_embd   = nullptr;
    ggml_tensor * type_embd  = nullptr;
    ggml_tensor * pos_embd   = nullptr;
    ggml_tensor * tok_norm   = nullptr;
    ggml_tensor * tok_norm_b = nullptr;

    ...

    ggml_tensor * create_tensor(llama_model_loader * ml, ...);

    // helpers
    void create_tensor_gate_up_exps(llama_model_loader * ml, ...);
    void create_tensor_qkv         (llama_model_loader * ml, ...);

    ...
};

//
// models/models.h
//

class llama_model_qwen3 : public llama_model_base {
public:
    bool load_hparams(llama_model_loader * ml, ...) override;
    bool load_tensors(llama_model_loader * ml, ...) override;

    llama_memory_i * create_memory(const llama_memory_params & params, const llama_cparams & cparams) const override;

    ggml_cgraph * build_graph(const llm_graph_params & params) const override;

    ...
};

After this change, all code in llama_context should use only llama_model_i, similar to how it uses llama_memory_i.
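
For illustration, a minimal sketch of that final state in llama_context (the member names here are assumptions, not the actual fields):

// sketch: llama_context programs only against interfaces
struct llama_context {
    ...

    llama_model_i  * model;  // was: the concrete llama_model
    llama_memory_i * memory; // already interface-based today

    int32_t n_embd() const {
        return model->n_embd(); // dispatched through llama_model_i
    }
};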

Contributor Author


That makes sense. For this PR, I think I will target a state where all the core definitions (load_hparams / load_tensors / build_graph) are moved into src/models. I won't separate llama_model / llama_model_i right now, for simplicity; that can be done in a follow-up.

Just one thing not very clear from your example though: is llama_model_base the same as llama_model_i? For now, I think I will add an alias using llama_model_base = llama_model_i so that the name llama_model_base stays reserved for future use.

Member

@ggerganov ggerganov Apr 20, 2026


Just one thing not very clear from your example though: is llama_model_base the same as llama_model_i?

Had a typo: class llama_model -> renamed to class llama_model_base. The separate model instances will inherit from llama_model_base because there is a lot of common stuff that we want to avoid repeating (e.g. tensors, hparams, devices, buffers, ...). For now, inheriting a base implementation is the easier way to deduplicate.

The alternative is using composition, which is usually considered better architecturally. But there will be a lot of duplicated code:

//
// llama-model.h
//

// pure interface
struct llama_model_i {
    virtual ~llama_model_i() = default;

    // public API mirror of llama.h
    virtual int32_t n_ctx_train() const = 0;
    virtual int32_t n_embd() const = 0;
    virtual int32_t n_embd_inp() const = 0;

    ...

    // internal API
    virtual bool load_hparams(llama_model_loader * ml, ...) = 0;
    virtual bool load_tensors(llama_model_loader * ml, ...) = 0;

    virtual llama_memory_i * create_memory(const llama_memory_params & params, const llama_cparams & cparams) const = 0;

    virtual ggml_cgraph * build_graph(const llm_graph_params & params) const = 0;

    ...
};

//
// models/models.h
//

// in this case, directly implement the interface
class llama_model_qwen3 : public llama_model_i {
public:
    // note each model implements these over and over again
    int32_t n_ctx_train() const override;
    int32_t n_embd() const override;
    int32_t n_embd_inp() const override;

    ...

    bool load_hparams(llama_model_loader * ml, ...) override;
    bool load_tensors(llama_model_loader * ml, ...) override;

    llama_memory_i * create_memory(const llama_memory_params & params, const llama_cparams & cparams) const override;

    ggml_cgraph * build_graph(const llm_graph_params & params) const override;

    ...

private:
    // composition instead of inheritance for reusing common model functionality
    llama_model_base model;
};

Member


I'd say "composition is better structurally" is a commonly repeated modern programming trend, but in this case inheritance seems like the better approach, since there isn't really a "part" of a model that can be conceptualized as something we're delegating to.

Contributor Author


Hmm, yeah, composition would make the code quite verbose; I'd prefer staying with the inheritance pattern for now. So, just to make sure I understand correctly: the current opaque pointer llama_model will be mapped to the to-be-added llama_model_i, right?

I think it makes sense to do inheritance as in your first comment, so that:

  • llama_model_i holds the definition of the model (i.e. mostly hparams)
  • llama_model_base holds the tensors
  • llama_model_* defines how to load tensors and hparams

Member


More like:

  • llama_model_i: abstract interface, does not hold or implement anything. Replaces the old llama_model
  • llama_model_base: holds hparams, tensors, devices, metadata, loras, etc. Implements the common part of the interface (mostly getters for hparams, devices, loras, etc.)
  • llama_model_*: implements the rest of the interface: loading hparams, loading tensors, creating memory, building the graph (see the compact sketch after this list)
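
Condensed into code, the intended layering looks like this (names from the comments above; the member lists are illustrative):

// pure interface: no data, no implementation; replaces the old llama_model
struct llama_model_i { /* virtual getters + load/create/build entry points */ };

// common data and functionality shared by all models
class llama_model_base : public llama_model_i { /* hparams, tensors, devices, metadata, loras, ... */ };

// per-arch definition: the only part each model implements itself
class llama_model_qwen3 : public llama_model_base { /* load_hparams, load_tensors, create_memory, build_graph */ };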

Contributor Author

@ngxson ngxson Apr 20, 2026


One concern about llama_model_i holding nothing (and only being an interface) is that we don't yet have any use case where an implementation other than llama_model_base reuses the same interface.

I think the separation of llama_model_i / _base may still be worth a discussion, but that's not very urgent; it will be done in a follow-up anyway

@ngxson ngxson force-pushed the xsn/model_def_self_contained branch from f4ee25e to 081e8fd on April 21, 2026 22:47
Comment thread src/llama-model.h Outdated
Comment on lines +642 to +666
struct llama_model_base : public llama_model {
friend struct llama_model;

llama_model * model;
llama_model_loader * ml = nullptr;
const LLM_TN tn;

// note: these variables are supposed to be read-only; however, since we can't read them until we load the hparams, they will be set after load_hparams()
int n_layer;
// note: cast to int64_t since we will use these for the tensor dimensions
int64_t n_head;
int64_t n_head_kv;
int64_t n_embd;
int64_t n_embd_k_gqa;
int64_t n_embd_v_gqa;
int64_t n_embd_head_k;
int64_t n_embd_head_v;
int64_t n_ff;
int64_t n_embd_gqa;
int64_t n_vocab;
int64_t n_token_types;
int64_t n_rot;
int64_t n_expert;
int64_t n_expert_used;
int64_t n_ctx_train;
Contributor Author


So I refactored llm_arch_model_i a bit into llama_model_base, which is not 100% what we had in the last comment, but I think it should be enough as a middle step before another migration (which will hopefully produce fewer changes than the current PR)

One main concern: currently these variables are defined inside the struct, so that load_tensors() doesn't need to load them manually. Do you think it would be preferable to define them inside load_tensors() instead? I'm not even sure such a change is possible with my script, but I can try if that's preferable.

CC @ggerganov @CISC @pwilkin if you have any thoughts.

Member


This, on the other hand, might be a good place for composition :) Is there any reason we can't just use the internal hparams here to keep the parameters?

Contributor Author


Because it's made this way from the beginning:

llama.cpp/src/llama-model.cpp

Lines 3065 to 3080 in 750579f

// note: cast to int64_t since we will use these for the tensor dimensions
const int64_t n_head = hparams.n_head();
const int64_t n_head_kv = hparams.n_head_kv();
const int64_t n_embd = hparams.n_embd;
const int64_t n_embd_k_gqa = hparams.n_embd_k_gqa();
const int64_t n_embd_v_gqa = hparams.n_embd_v_gqa();
const int64_t n_embd_head_k = hparams.n_embd_head_k();
const int64_t n_embd_head_v = hparams.n_embd_head_v();
const int64_t n_ff = hparams.n_ff();
const int64_t n_embd_gqa = n_embd_v_gqa;
const int64_t n_vocab = vocab.n_tokens();
const int64_t n_token_types = vocab.n_token_types();
const int64_t n_rot = hparams.n_rot();
const int64_t n_expert = hparams.n_expert;
const int64_t n_expert_used = hparams.n_expert_used;
const int64_t n_ctx_train = hparams.n_ctx_train;

And because I don't want to change too much code in the same PR, I'm keeping this list so that load_tensors() stays the same.

Member


You can make a LLAMA_LOAD_LOCALS macro similar to GGML_TENSOR_LOCALS and use the macro at the start of each load_tensors() implementation.

@ngxson ngxson force-pushed the xsn/model_def_self_contained branch from 081e8fd to e872c47 on April 27, 2026 11:59
Comment thread src/llama-model.h
Comment on lines +685 to +689
// convenience macro for loading local variables for load_tensors() in llama_model_base
// note: cast to int64_t since we will use these for the tensor dimensions
#define LLAMA_LOAD_LOCALS \
const int n_layer = hparams.n_layer; GGML_UNUSED(n_layer); \
const int64_t n_head = hparams.n_head(); GGML_UNUSED(n_head); \
Contributor Author


@ggerganov alright, I added the LLAMA_LOAD_LOCALS as suggested

GGML_UNUSED is a hack here to avoid unused-variable warnings. TBH I don't know if there is a better way, so I'm open to suggestions here
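
For reference, a usage sketch (the qwen3 loader is just an example name and the signature is assumed; the macro body continues for the remaining locals as in the diff above):

bool llama_model_qwen3::load_tensors(llama_model_loader * ml) {
    LLAMA_LOAD_LOCALS // pulls n_layer, n_head, n_embd, ... into scope from hparams
    // GGML_UNUSED(x) expands to (void)(x), so locals that a given model never
    // touches don't trigger -Wunused-variable
    ...
}

If C++17 attributes are acceptable there, declaring each local [[maybe_unused]] inside the macro would be a more conventional way to silence the warning.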

