
model: move load_hparams and load_tensors to per-model definition #22004

Open

ngxson wants to merge 41 commits into ggml-org:master from ngxson:xsn/model_def_self_contained

Conversation

Contributor

@ngxson ngxson commented Apr 16, 2026

Overview

Fix #21966

The migration will be done via a script, 0migrate.py, included in this PR (it will be removed right before this is merged)

Important note: the goal of this PR is to make the migration as deterministic as possible; we do that via the fully heuristic script mentioned above. Any improvements (deduplication, clean-up, etc.) will be done via follow-up PRs

Depends on:

Checklist before merging:

  • Remove migration script
  • Remove the #if 0 pre-migration code blocks

Additional information

Migration rules:

  • Create one file per arch; the file name is the arch enum name in lower case, with _ replaced by -
  • The model class name is the arch name in lower case, prefixed with llama_model_

Given two example archs A and B:

  • If archs A and B BOTH use the same load_tensors AND load_hparams, B inherits from A
  • If either load_tensors OR load_hparams differs, there is no inheritance; one of the two functions will be duplicated
  • If B reuses the same graph as A, the graph definition in B will be using graph = llama_model_a::graph (see the sketch after this list)
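
To illustrate these rules, a minimal sketch of what the generated definitions look like; the archs A/B and the exact signatures are placeholders, the real ones follow the code quoted later in this thread:

// hypothetical src/models/a.cpp for arch LLM_ARCH_A (enum lower-cased, _ replaced by - in the file name)
struct llama_model_a : public llama_model_base {
    void load_hparams(llama_model_loader & ml) override;
    void load_tensors(llama_model_loader & ml) override;

    std::unique_ptr<llm_graph_context> build_graph_context(const llm_graph_params & params) const override;
};

// arch B shares BOTH load_tensors and load_hparams with A, so it inherits from A
struct llama_model_b : public llama_model_a {
    // load_hparams and load_tensors are reused from llama_model_a

    // B also reuses A's graph
    using graph = llama_model_a::graph;

    std::unique_ptr<llm_graph_context> build_graph_context(const llm_graph_params & params) const override;
};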

Side effects:

  • Some parts of the code will be duplicated (i.e. some load_tensors or load_hparams implementations)
  • No more -iswa suffix in file names or model class names

Self-note for updating this PR:

git reset --hard HEAD~1
git merge master
python 0migrate.py
git add -A
git commit -m "auto"
git push -f

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: partially, mostly for the migration script

@github-actions github-actions bot added the python (python script changes) label Apr 16, 2026
@ngxson ngxson marked this pull request as ready for review April 16, 2026 21:20
@ngxson ngxson requested review from CISC and ggerganov as code owners April 16, 2026 21:20
@ngxson ngxson force-pushed the xsn/model_def_self_contained branch from 69104f1 to 31011e6 on April 16, 2026 22:06
Comment thread src/models/models.h
Comment on lines +142 to 150
struct llama_model_llama_embed : public llama_model_llama {
llama_model_llama_embed(const struct llama_model_params & params) : llama_model_llama(params) {}
// reuse load_hparams and load_tensors from llama_model_llama

template <bool embed>
using graph = llama_model_llama::graph<embed>;

std::unique_ptr<llm_graph_context> build_graph_context(const llm_graph_params & params) const override;
};
Contributor Author


The auto-migration script can't be intelligent enough to point out that we can use the <true> specialization of the graph here, but things like this can be improved via a follow-up PR (just noting here for visibility)
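
For example, a possible follow-up simplification could pin the specialization directly (a sketch, not part of this PR):

struct llama_model_llama_embed : public llama_model_llama {
    llama_model_llama_embed(const struct llama_model_params & params) : llama_model_llama(params) {}

    // bind the embed specialization once instead of re-exporting the whole template
    using graph = llama_model_llama::graph<true>;

    std::unique_ptr<llm_graph_context> build_graph_context(const llm_graph_params & params) const override;
};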

@github-actions github-actions bot added the model (Model specific) label Apr 16, 2026
Comment thread src/llama-model.cpp Outdated

// helper function to facilitate migration
// TODO: remove this in the future
auto create_tensor = [&](const LLM_TN_IMPL & tn, const std::initializer_list<int64_t> & ne, int flags) -> ggml_tensor * {
Member


Would rename the lambda; having both the method and the lambda named create_tensor is a bit confusing.

Contributor Author


I think the better way is to completely move the current function into llm_arch_model_i and simply reuse the create_tensor there. Will push a fix for this

I'm trying not to rename things, to reduce the number of changed lines in this PR

Contributor Author


This lambda is removed in the latest version

@ngxson ngxson force-pushed the xsn/model_def_self_contained branch from 31011e6 to b5809e2 on April 17, 2026 14:49
@ngxson
Contributor Author

ngxson commented Apr 17, 2026

Not very urgent, but pinging @ggerganov in case you can have a quick look at the direction

@ggerganov ggerganov self-assigned this Apr 17, 2026
@ngxson ngxson force-pushed the xsn/model_def_self_contained branch from b5809e2 to c3e93a7 on April 18, 2026 21:42
@ngxson ngxson force-pushed the xsn/model_def_self_contained branch from c3e93a7 to a8282e3 on April 18, 2026 21:46
@ngxson ngxson force-pushed the xsn/model_def_self_contained branch from a8282e3 to f4ee25e on April 18, 2026 22:26
Comment thread src/llama-model.h
friend struct llama_model;

llama_model * model;
llama_model_loader * ml = nullptr;
Member

@ggerganov ggerganov Apr 20, 2026


I don't think introducing the llm_arch_model_i here is necessary. You should keep using llama_model for now and inherit the implementations directly from it. The llm_arch_model_i idea is separate - see below. Here you are interested just in localizing the model definitions (loading hparams, tensors, memory and graph creation) into individual files.


There is a separate refactoring task that can be done before or after this PR. The final state should be like this:

//
// llama.h
//

typedef struct llama_model_i * llama_model_t;

...

LLAMA_API int32_t llama_model_n_ctx_train(const llama_model_t model);
LLAMA_API int32_t llama_model_n_embd     (const llama_model_t model);
LLAMA_API int32_t llama_model_n_embd_inp (const llama_model_t model);

...

//
// llama-model.h
//

// pure interface
struct llama_model_i {
    virtual ~llama_model_i() = default;

    // public API mirror of llama.h
    virtual int32_t n_ctx_train() const = 0;
    virtual int32_t n_embd() const = 0;
    virtual int32_t n_embd_inp() const = 0;

    ...

    // internal API
    virtual bool load_hparams(llama_model_loader * ml, ...) = 0;
    virtual bool load_tensors(llama_model_loader * ml, ...) = 0;

    virtual llama_memory_i * create_memory(const llama_memory_params & params, const llama_cparams & cparams) const = 0;

    virtual ggml_cgraph * build_graph(const llm_graph_params & params) const = 0;

    ...
};

// base model (common functionality and data for all models)
class llama_model_base : public llama_model_i {
public:
    int32_t n_ctx_train() const override;
    int32_t n_embd() const override;
    int32_t n_embd_inp() const override;

    ...

protected:
    llm_type type = LLM_TYPE_UNKNOWN;
    llm_arch arch = LLM_ARCH_UNKNOWN;

    std::string name = "n/a";

    llama_hparams hparams = {};
    llama_vocab   vocab;

    ggml_tensor * tok_embd   = nullptr;
    ggml_tensor * type_embd  = nullptr;
    ggml_tensor * pos_embd   = nullptr;
    ggml_tensor * tok_norm   = nullptr;
    ggml_tensor * tok_norm_b = nullptr;

    ...

    ggml_tensor * create_tensor(llama_model_loader * ml, ...);

    // helpers
    void create_tensor_gate_up_exps(llama_model_loader * ml, ...);
    void create_tensor_qkv         (llama_model_loader * ml, ...);

    ...
};

//
// models/models.h
//

class llama_model_qwen3 : public llama_model_base {
public:
    bool load_hparams(llama_model_loader * ml, ...) override;
    bool load_tensors(llama_model_loader * ml, ...) override;

    llama_memory_i * create_memory(const llama_memory_params & params, const llama_cparams & cparams) const override;

    ggml_cgraph * build_graph(const llm_graph_params & params) const override;

    ...
};

After this change, all code in llama_context should use only llama_model_i, similar to how it uses llama_memory_i.
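
For illustration, a minimal sketch of that final state in llama_context (the member names here are assumptions, not the actual fields):

// sketch: llama_context programs only against interfaces
struct llama_context {
    ...

    llama_model_i  * model;  // was: the concrete llama_model
    llama_memory_i * memory; // already interface-based today

    int32_t n_embd() const {
        return model->n_embd(); // dispatched through llama_model_i
    }
};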

Contributor Author


That makes sense. For this PR, I think I will target a state where all the core definitions (load_hparams / load_tensors / build_graph) are moved into src/models. I won't separate llama_model / llama_model_i right now, for simplicity; that can be done in a follow-up.

Just one thing not very clear from your example though: is llama_model_base the same as llama_model_i? For now, I think I will add an alias using llama_model_base = llama_model_i so that the name llama_model_base stays reserved for future use.

Member

@ggerganov ggerganov Apr 20, 2026


Just one thing not very clear from your example though: is llama_model_base the same as llama_model_i?

Had a typo: class llama_model -> renamed to class llama_model_base. The separate model instances will inherit from llama_model_base because there is a lot of common stuff that we want to avoid repeating (e.g. tensors, hparams, devices, buffers, ...). For now, inheriting a base implementation is the easier way to deduplicate.

The alternative is using composition, which is usually considered better architecturally. But there will be a lot of duplicated code:

//
// llama-model.h
//

// pure interface
struct llama_model_i {
    virtual ~llama_model_i() = default;

    // public API mirror of llama.h
    virtual int32_t n_ctx_train() const = 0;
    virtual int32_t n_embd() const = 0;
    virtual int32_t n_embd_inp() const = 0;

    ...

    // internal API
    virtual bool load_hparams(llama_model_loader * ml, ...) = 0;
    virtual bool load_tensors(llama_model_loader * ml, ...) = 0;

    virtual llama_memory_i * create_memory(const llama_memory_params & params, const llama_cparams & cparams) const = 0;

    virtual ggml_cgraph * build_graph(const llm_graph_params & params) const = 0;

    ...
};

//
// models/models.h
//

// in this case, directly implement the interface
class llama_model_qwen3 : public llama_model_i {
public:
    // note each model implements these over and over again
    int32_t n_ctx_train() const override;
    int32_t n_embd() const override;
    int32_t n_embd_inp() const override;

    ...

    bool load_hparams(llama_model_loader * ml, ...) override;
    bool load_tensors(llama_model_loader * ml, ...) override;

    llama_memory_i * create_memory(const llama_memory_params & params, const llama_cparams & cparams) const override;

    ggml_cgraph * build_graph(const llm_graph_params & params) const override;

    ...

private:
    // composition instead of inheritance for reusing common model functionality
    llama_model_base model;
};

Member


I'd say "composition is better structurally" is a commonly repeated modern programming trend, but in this case inheritance seems like the better approach, since there isn't really a "part" of a model that can be conceptualized as something we're delegating to.

Contributor Author


Hmm, yeah, composition would make the code quite verbose; I'd prefer staying with the inheritance pattern for now. So, just to make sure I understand correctly: the current opaque pointer llama_model will be mapped to the to-be-added llama_model_i, right?

I think it makes sense to do inheritance as in your first comment, so that:

  • llama_model_i holds the definition of the model (i.e. mostly hparams)
  • llama_model_base holds the tensors
  • llama_model_* defines how to load tensors and hparams

Member


More like:

  • llama_model_i: abstract interface, does not hold or implement anything. Replaces the old llama_model
  • llama_model_base: holds hparams, tensors, devices, metadata, loras, etc. Implements the common part of the interface (mostly getters for hparams, devices, loras, etc.)
  • llama_model_*: implements the rest of the interface: loading hparams, loading tensors, creating memory, building the graph (see the compact sketch after this list)
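
Condensed into code, the intended layering looks like this (names from the comments above; the member lists are illustrative):

// pure interface: no data, no implementation; replaces the old llama_model
struct llama_model_i { /* virtual getters + load/create/build entry points */ };

// common data and functionality shared by all models
class llama_model_base : public llama_model_i { /* hparams, tensors, devices, metadata, loras, ... */ };

// per-arch definition: the only part each model implements itself
class llama_model_qwen3 : public llama_model_base { /* load_hparams, load_tensors, create_memory, build_graph */ };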

Contributor Author

@ngxson ngxson Apr 20, 2026


One concern about llama_model_i holding nothing (and only being an interface) is that we don't yet have any use case where an implementation other than llama_model_base reuses the same interface.

I think the separation of llama_model_i / _base may still be worth a discussion, but that's not very urgent; it will be done in a follow-up anyway

@ngxson ngxson force-pushed the xsn/model_def_self_contained branch from f4ee25e to 081e8fd on April 21, 2026 22:47
Comment thread src/llama-model.h Outdated
Comment on lines +642 to +666
struct llama_model_base : public llama_model {
friend struct llama_model;

llama_model * model;
llama_model_loader * ml = nullptr;
const LLM_TN tn;

// note: these variables are supposed to be read-only; however, since we can't read them until we load the hparams, they will be set after load_hparams()
int n_layer;
// note: cast to int64_t since we will use these for the tensor dimensions
int64_t n_head;
int64_t n_head_kv;
int64_t n_embd;
int64_t n_embd_k_gqa;
int64_t n_embd_v_gqa;
int64_t n_embd_head_k;
int64_t n_embd_head_v;
int64_t n_ff;
int64_t n_embd_gqa;
int64_t n_vocab;
int64_t n_token_types;
int64_t n_rot;
int64_t n_expert;
int64_t n_expert_used;
int64_t n_ctx_train;
Contributor Author


So I refactored llm_arch_model_i a bit into llama_model_base, which is not 100% what we had in the last comment, but I think it should be enough as a middle step before another migration (which will hopefully produce fewer changes than the current PR)

One main concern: currently these variables are defined inside the struct, so that load_tensors() doesn't need to load them manually. Do you think it would be preferable to define them inside load_tensors() instead? I'm not even sure such a change is possible with my script, but I can try if that's preferable.

CC @ggerganov @CISC @pwilkin if you have any thoughts.

Member


This, on the other hand, might be a good place for composition :) Is there any reason we can't just use the internal hparams here to keep the parameters?

Contributor Author


Because it's made this way from the beginning:

llama.cpp/src/llama-model.cpp

Lines 3065 to 3080 in 750579f

// note: cast to int64_t since we will use these for the tensor dimensions
const int64_t n_head = hparams.n_head();
const int64_t n_head_kv = hparams.n_head_kv();
const int64_t n_embd = hparams.n_embd;
const int64_t n_embd_k_gqa = hparams.n_embd_k_gqa();
const int64_t n_embd_v_gqa = hparams.n_embd_v_gqa();
const int64_t n_embd_head_k = hparams.n_embd_head_k();
const int64_t n_embd_head_v = hparams.n_embd_head_v();
const int64_t n_ff = hparams.n_ff();
const int64_t n_embd_gqa = n_embd_v_gqa;
const int64_t n_vocab = vocab.n_tokens();
const int64_t n_token_types = vocab.n_token_types();
const int64_t n_rot = hparams.n_rot();
const int64_t n_expert = hparams.n_expert;
const int64_t n_expert_used = hparams.n_expert_used;
const int64_t n_ctx_train = hparams.n_ctx_train;

And because I don't want to change too much code in the same PR, I'm keeping this list so that load_tensors() stays the same.

Member


You can make a LLAMA_LOAD_LOCALS macro similar to GGML_TENSOR_LOCALS and use the macro at the start of each load_tensors() implementation.

@ngxson ngxson force-pushed the xsn/model_def_self_contained branch from 081e8fd to e872c47 on April 27, 2026 11:59
Comment thread src/llama-model.h
Comment on lines +685 to +689
// convenience macro for loading local variables for load_tensors() in llama_model_base
// note: cast to int64_t since we will use these for the tensor dimensions
#define LLAMA_LOAD_LOCALS \
const int n_layer = hparams.n_layer; GGML_UNUSED(n_layer); \
const int64_t n_head = hparams.n_head(); GGML_UNUSED(n_head); \
Contributor Author


@ggerganov alright, I added the LLAMA_LOAD_LOCALS as suggested

GGML_UNUSED is a hack here to avoid unused-variable warnings. TBH I don't know if there is a better way, so I'm open to suggestions here
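
For reference, a usage sketch (the qwen3 loader is just an example name and the signature is assumed; the macro body continues for the remaining locals as in the diff above):

bool llama_model_qwen3::load_tensors(llama_model_loader * ml) {
    LLAMA_LOAD_LOCALS // pulls n_layer, n_head, n_embd, ... into scope from hparams
    // GGML_UNUSED(x) expands to (void)(x), so locals that a given model never
    // touches don't trigger -Wunused-variable
    ...
}

If C++17 attributes are acceptable there, declaring each local [[maybe_unused]] inside the macro would be a more conventional way to silence the warning.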

