model: move load_hparams and load_tensors to per-model definition #22004

ngxson wants to merge 41 commits into ggml-org:master

Conversation
Force-pushed from 69104f1 to 31011e6
```cpp
struct llama_model_llama_embed : public llama_model_llama {
    llama_model_llama_embed(const struct llama_model_params & params) : llama_model_llama(params) {}

    // reuse load_hparams and load_tensors from llama_model_llama

    template <bool embed>
    using graph = llama_model_llama::graph<embed>;

    std::unique_ptr<llm_graph_context> build_graph_context(const llm_graph_params & params) const override;
};
```
The auto-migration script isn't smart enough to point out that we can use the `<true>` specialization of the graph here, but things like this can be improved via a follow-up PR (just noting here for visibility).
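For reference, that follow-up cleanup could look something like the sketch below; `graph_embed` is a hypothetical name, and it assumes the parent's graph template stays as-is:

```cpp
struct llama_model_llama_embed : public llama_model_llama {
    llama_model_llama_embed(const struct llama_model_params & params) : llama_model_llama(params) {}

    // bind the <true> (embedding) specialization directly instead of
    // re-exporting the whole template
    using graph_embed = llama_model_llama::graph<true>;

    std::unique_ptr<llm_graph_context> build_graph_context(const llm_graph_params & params) const override;
};
```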
```cpp
// helper function to facilitate migration
// TODO: remove this in the future
auto create_tensor = [&](const LLM_TN_IMPL & tn, const std::initializer_list<int64_t> & ne, int flags) -> ggml_tensor * {
```
I would rename the lambda; having both the method and the lambda named create_tensor is a bit confusing.
I think the better way is to completely move the current function to llm_arch_model_i and simply reuse the create_tensor there. Will push a fix for this.

I'm trying not to rename things, to reduce the number of line changes in this PR.
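As a rough sketch of that direction (the exact placement and signature are assumed here):

```cpp
// sketch: create_tensor as a method on the shared base, so per-model
// load_tensors() implementations call it directly instead of a local lambda
struct llm_arch_model_i {
    // ...
protected:
    ggml_tensor * create_tensor(const LLM_TN_IMPL & tn,
                                const std::initializer_list<int64_t> & ne,
                                int flags);
};
```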
This lambda is removed in the latest version
Force-pushed from 31011e6 to b5809e2
Not very urgent, but pinging @ggerganov if you can have a quick look at the direction.
Force-pushed from b5809e2 to c3e93a7

Force-pushed from c3e93a7 to a8282e3

Force-pushed from a8282e3 to f4ee25e
```cpp
friend struct llama_model;

llama_model * model;
llama_model_loader * ml = nullptr;
```
I don't think introducing the llm_arch_model_i here is necessary. You should keep using llama_model for now and inherit the implementations directly from it. The llm_arch_model_i idea is separate - see below. Here you are interested just in localizing the model definitions (loading hparams, tensors, memory and graph creation) into individual files.
There is a separate refactoring task that can be done before or after this PR. The final state should be like this:
```cpp
//
// llama.h
//

typedef struct llama_model_i * llama_model_t;

...

LLAMA_API int32_t llama_model_n_ctx_train(const llama_model_t model);
LLAMA_API int32_t llama_model_n_embd     (const llama_model_t model);
LLAMA_API int32_t llama_model_n_embd_inp (const llama_model_t model);

...

//
// llama-model.h
//

// pure interface
struct llama_model_i {
    virtual ~llama_model_i() = default;

    // public API mirror of llama.h
    virtual int32_t n_ctx_train() const = 0;
    virtual int32_t n_embd()      const = 0;
    virtual int32_t n_embd_inp()  const = 0;
    ...

    // internal API
    virtual bool load_hparams(llama_model_loader * ml, ...) = 0;
    virtual bool load_tensors(llama_model_loader * ml, ...) = 0;

    virtual llama_memory_i * create_memory(const llama_memory_params & params, const llama_cparams & cparams) const = 0;

    virtual ggml_cgraph * build_graph(const llm_graph_params & params) const = 0;
    ...
};

// base model (common functionality and data for all models)
class llama_model_base : public llama_model_i {
public:
    int32_t n_ctx_train() const override;
    int32_t n_embd()      const override;
    int32_t n_embd_inp()  const override;
    ...

protected:
    llm_type type = LLM_TYPE_UNKNOWN;
    llm_arch arch = LLM_ARCH_UNKNOWN;

    std::string name = "n/a";

    llama_hparams hparams = {};
    llama_vocab   vocab;

    ggml_tensor * tok_embd   = nullptr;
    ggml_tensor * type_embd  = nullptr;
    ggml_tensor * pos_embd   = nullptr;
    ggml_tensor * tok_norm   = nullptr;
    ggml_tensor * tok_norm_b = nullptr;
    ...

    ggml_tensor * create_tensor(llama_model_loader * ml, ...);

    // helpers
    void create_tensor_gate_up_exps(llama_model_loader * ml, ...);
    void create_tensor_qkv         (llama_model_loader * ml, ...);
    ...
};

//
// models/models.h
//

class llama_model_qwen3 : public llama_model_base {
public:
    bool load_hparams(llama_model_loader * ml, ...) override;
    bool load_tensors(llama_model_loader * ml, ...) override;

    llama_memory_i * create_memory(const llama_memory_params & params, const llama_cparams & cparams) const override;

    ggml_cgraph * build_graph(const llm_graph_params & params) const override;
    ...
};
```

After this change, all code in llama_context should use only llama_model_i, similar to how it uses llama_memory_i.
That makes sense. For this PR, I think I will target a state where all the core definitions (load_hparams / load_tensors / build_graph) are moved into src/models. I won't separate llama_model / llama_model_i right now, for simplicity; that can be done in a follow-up.

Just one thing that's not very clear from your example though: is llama_model_base the same as llama_model_i? For now, I think I will add an alias `using llama_model_base = llama_model_i` so that llama_model_base can be reserved for future use.
> Just one thing that's not very clear from your example though: is llama_model_base the same as llama_model_i?
Had a typo: `class llama_model` -> renamed to `class llama_model_base`. The separate model instances will inherit from llama_model_base because there is a lot of common stuff that we want to avoid repeating (e.g. tensors, hparams, devices, buffers, ...). For now, inheriting a base implementation is the easier way to deduplicate.
The alternative is using composition, which is usually considered better architecturally. But there will be a lot of duplicated code:
```cpp
//
// llama-model.h
//

// pure interface
struct llama_model_i {
    virtual ~llama_model_i() = default;

    // public API mirror of llama.h
    virtual int32_t n_ctx_train() const = 0;
    virtual int32_t n_embd()      const = 0;
    virtual int32_t n_embd_inp()  const = 0;
    ...

    // internal API
    virtual bool load_hparams(llama_model_loader * ml, ...) = 0;
    virtual bool load_tensors(llama_model_loader * ml, ...) = 0;

    virtual llama_memory_i * create_memory(const llama_memory_params & params, const llama_cparams & cparams) const = 0;

    virtual ggml_cgraph * build_graph(const llm_graph_params & params) const = 0;
    ...
};

//
// models/models.h
//

// in this case, directly implement the interface
class llama_model_qwen3 : public llama_model_i {
public:
    // note each model implements these over and over again
    int32_t n_ctx_train() const override;
    int32_t n_embd()      const override;
    int32_t n_embd_inp()  const override;
    ...

    bool load_hparams(llama_model_loader * ml, ...) override;
    bool load_tensors(llama_model_loader * ml, ...) override;

    llama_memory_i * create_memory(const llama_memory_params & params, const llama_cparams & cparams) const override;

    ggml_cgraph * build_graph(const llm_graph_params & params) const override;
    ...

private:
    // composition instead of inheritance for reusing common model functionality
    llama_model_base model;
};
```
I'd say "composition being better structurally" is a commonly repeated modern programming trend, but in this case, inheritance seems like the better approach since there isn't really a "part" of a model that can be conceptualized as one we're delegating to.
Hmm, yeah, the composition will make the code quite verbose; I would prefer staying with the inheritance pattern for now. So just to make sure I understand correctly: the current opaque pointer llama_model will be mapped to the to-be-added llama_model_i, right?
I think it makes sense to do inheritance as in your first comment, so that:

- `llama_model_i` holds the definition of the model (i.e. mostly hparams)
- `llama_model_base` holds the tensors
- `llama_model_*` defines how to load tensors and hparams
More like:
- `llama_model_i` — abstract interface, does not hold or implement anything. Replaces the old `llama_model`
- `llama_model_base` — holds hparams, tensors, devices, meta data, loras, etc. Implements the common part of the interface (mostly getters for hparams, devices, loras, etc.)
- `llama_model_*` — implements the rest of the interface: loading hparams, loading tensors, creating memory, building graph
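A minimal sketch of that layering (member lists and signatures assumed for illustration):

```cpp
// abstract interface - holds nothing, replaces the old llama_model
struct llama_model_i {
    virtual ~llama_model_i() = default;
    virtual int32_t n_embd() const = 0;
    virtual bool load_hparams(llama_model_loader * ml) = 0;
    virtual bool load_tensors(llama_model_loader * ml) = 0;
};

// holds hparams, tensors, devices, metadata, loras; implements the common getters
struct llama_model_base : public llama_model_i {
    llama_hparams hparams = {};
    int32_t n_embd() const override { return hparams.n_embd; }
};

// per-arch: loading hparams/tensors, creating memory, building the graph
struct llama_model_qwen3 : public llama_model_base {
    bool load_hparams(llama_model_loader * ml) override;
    bool load_tensors(llama_model_loader * ml) override;
};
```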
One concern about llama_model_i holding nothing (and only being an interface) is that we don't yet have any use case where an implementation other than llama_model_base would reuse the same interface.

I think the separation of llama_model_i / _base may still be worth a discussion, but that's not very urgent; it will be done in a follow-up anyway.
Force-pushed from f4ee25e to 081e8fd
```cpp
struct llama_model_base : public llama_model {
    friend struct llama_model;

    llama_model * model;
    llama_model_loader * ml = nullptr;

    const LLM_TN tn;

    // note: these variables are supposed to be read-only; however, since we can't read them
    //       until we load the hparams, they will be set after load_hparams()
    int n_layer;
    // note: cast to int64_t since we will use these for the tensor dimensions
    int64_t n_head;
    int64_t n_head_kv;
    int64_t n_embd;
    int64_t n_embd_k_gqa;
    int64_t n_embd_v_gqa;
    int64_t n_embd_head_k;
    int64_t n_embd_head_v;
    int64_t n_ff;
    int64_t n_embd_gqa;
    int64_t n_vocab;
    int64_t n_token_types;
    int64_t n_rot;
    int64_t n_expert;
    int64_t n_expert_used;
    int64_t n_ctx_train;
```
So I refactored the llm_arch_model_i a bit into llama_model_base, which is not 100% what we had in the last comment, but I think it should be enough as a middle step before another migration (which will hopefully produce fewer changes than the current PR).

One main concern is that currently these variables are defined inside the struct, so that load_tensors() doesn't need to load them manually. Do you think it would be preferable to define them inside load_tensors() instead? I'm not even sure such a change is possible with my script, but I can try if that's preferable.
CC @ggerganov @CISC @pwilkin if you have any thoughts.
This, on the other hand, might be a good place for composition :) Is there any reason we can't just use the internal hparams here to keep the parameters?
Because it's been made this way from the beginning (see lines 3065 to 3080 in 750579f).
And because I don't want to change too much code in the same PR, I'm keeping this list so that load_tensors() stays the same.
You can make a LLAMA_LOAD_LOCALS macro similar to GGML_TENSOR_LOCALS and use the macro at the start of each load_tensors() implementation.
Force-pushed from 081e8fd to e872c47
```cpp
// convenience macro for loading local variables for load_tensors() in llama_model_base
// note: cast to int64_t since we will use these for the tensor dimensions
#define LLAMA_LOAD_LOCALS \
    const int     n_layer = hparams.n_layer;  GGML_UNUSED(n_layer); \
    const int64_t n_head  = hparams.n_head(); GGML_UNUSED(n_head);  \
```
@ggerganov alright, I added the LLAMA_LOAD_LOCALS as suggested.

GGML_UNUSED is a hack here to avoid unused-variable warnings. Tbh I don't know if there is a better way, so I'm open to suggestions here.
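For context, a hypothetical per-model load_tensors() using the macro might start like this (the arch and exact signature are placeholders, not taken from the PR):

```cpp
bool llama_model_qwen3::load_tensors(llama_model_loader * ml) {
    // brings n_layer, n_head, n_embd, ... into scope as read-only locals,
    // each marked GGML_UNUSED so unused ones don't trigger warnings
    LLAMA_LOAD_LOCALS

    tok_embd = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab}, 0);
    // ... per-layer tensors ...
    return true;
}
```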
Overview
Fix #21966
The migration will be done via a script `0migrate.py` included in this PR (and will be removed right before this is merged).

Important note: the goal of this PR is to make the migration as deterministic as possible; we do that via a completely heuristic script as mentioned above. Any improvements (deduplication, clean-up, etc.) will be done via follow-up PRs.
Depends on:
Checklist before merging:

- `#if 0` pre-migration places

Additional information
Migration rules:

- `_` replaced by `-` in file names
- model classes are prefixed with `llama_model_`

Given 2 example archs A and B:

- if B has the same `load_tensors` AND `load_hparams`, B inherits A
- if `load_tensors` OR `load_hparams` is different, there is no inheritance; one of the 2 functions will be duplicated
- graphs are reused via `using graph = llama_model_a::graph`
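To make the inherit rule concrete, here is a rough sketch with two hypothetical archs A and B (class shapes and signatures are assumed, not taken from the PR):

```cpp
// hypothetical arch A: defines its own hparams and tensor loading
struct llama_model_a : public llama_model_base {
    llama_model_a(const struct llama_model_params & params) : llama_model_base(params) {}

    void load_hparams() override;
    void load_tensors() override;
};

// B matches A on BOTH load_hparams and load_tensors -> B inherits A directly
struct llama_model_b : public llama_model_a {
    llama_model_b(const struct llama_model_params & params) : llama_model_a(params) {}

    // reuse load_hparams and load_tensors from llama_model_a;
    // if only one of the two differed, that function would instead be duplicated
    // into llama_model_b and B would not inherit from A
};
```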
Side effects:

- some functions are duplicated (`load_tensors` or `load_hparams`)
- `iswa` suffix in file name or in model class name

Self-note for updating this PR:
```sh
git reset --hard HEAD~1
git merge master
python 0migrate.py
git add -A
git commit -m "auto"
git push -f
```