
Implement '--keep-split' to quantize model into several shards #6688

Merged
ggerganov merged 6 commits into ggml-org:master from zj040045:jiez/quantize-keep-split on Apr 25, 2024

Conversation

@zj040045 (Contributor)

Fix #6548
--keep-split allows quantize to write its output as shards instead of a single merged model file; the number of output shards matches the number of input model files. A sketch of the resulting shard naming is shown below.
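
For illustration, a minimal sketch (not code from this PR) of what the shard file names look like under the gguf-split naming convention, using the llama_split_path helper that llama.h already exposes; the prefix ggml-model-Q4_K_M and the shard count of 3 are made-up examples:

// Minimal sketch, assuming the gguf-split naming convention; the prefix
// and shard count below are invented for illustration.
#include <cstdio>
#include "llama.h"

int main() {
    char split_path[1024];
    const int n_split = 3; // assumption: the input model came in 3 shards
    for (int i = 0; i < n_split; ++i) {
        llama_split_path(split_path, sizeof(split_path), "ggml-model-Q4_K_M", i, n_split);
        printf("%s\n", split_path); // e.g. ggml-model-Q4_K_M-00001-of-00003.gguf
    }
    return 0;
}

Quantizing without --keep-split would instead merge everything into a single output file.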

@phymbert (Collaborator)

Thanks. Would you mind adding a tests.sh, as we did in #6655?

@phymbert added the split (GGUF split model sharding) label on Apr 17, 2024
@zj040045 (Contributor, Author)

@phymbert Done

@phymbert requested a review from ggerganov on Apr 18, 2024 at 14:25
Review thread on llama.cpp (outdated):
LLAMA_LOG_INFO("%s: meta size = %zu bytes\n", __func__, meta_size);
auto weight = ml.get_weight(i);
struct ggml_tensor * tensor = weight->tensor;
// With --keep-split, open a new output shard whenever the tensor's source
// split index no longer matches the most recently created output context.
if (weight->idx != (ctx_outs.size() - 1) && params->keep_split) {
Collaborator:
This does not feel safe for future evolution, since it assumes the tensors are written in the same order they were read. Could we simply check whether weight->idx is not yet present in ctx_outs, and retrieve the matching ctx_out for the tensor?

Contributor (Author):

You are right. The model-split writing should then follow this logic to support cases like "0 0 0 2 2 1 1". Besides, do you think a non-contiguous order like "0 0 0 2 1 2 1" should also be handled?
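
To make the discussed approach concrete, here is a minimal self-contained sketch (not the PR's final code) of keying the output contexts by split index, so tensors arriving in any of the orders above still land in the right shard; gguf_context and make_output_ctx are stand-ins invented for this sketch:

#include <cstdint>
#include <cstdio>
#include <map>

// Stand-in for ggml's opaque gguf_context, just for this sketch.
struct gguf_context { int shard; };

// Hypothetical helper: opens a fresh output context for one shard.
static gguf_context * make_output_ctx(uint16_t idx) {
    return new gguf_context{idx};
}

// Output contexts keyed by the tensor's source split index, so neither the
// read order nor the contiguity of shard ids matters.
static std::map<uint16_t, gguf_context *> ctx_outs;

static gguf_context * ctx_out_for(uint16_t idx) {
    auto it = ctx_outs.find(idx);
    if (it == ctx_outs.end()) {
        // First tensor seen for this shard: create its output context.
        it = ctx_outs.emplace(idx, make_output_ctx(idx)).first;
    }
    return it->second;
}

int main() {
    const uint16_t order[] = {0, 0, 0, 2, 1, 2, 1}; // out-of-order shard ids
    for (uint16_t idx : order) {
        printf("tensor -> shard %d\n", ctx_out_for(idx)->shard);
    }
    for (auto & kv : ctx_outs) delete kv.second;
    return 0;
}

The merged PR resolved the thread along these lines, as reflected by the "Split model correctly even if tensor id is out-of-order" commit listed below.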

Review thread on llama.h (outdated)
Review thread on examples/quantize/quantize.cpp (outdated)
ggerganov merged commit 1966eb2 into ggml-org:master on Apr 25, 2024
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
…#6688)

* Implement '--keep-split' to quantize model into several shards

* Add test script

* Update examples/quantize/quantize.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Split model correctly even if tensor id is out-of-order

* Update llama_model_quantize_params

* Fix preci failures

---------

Co-authored-by: z5269887 <z5269887@unsw.edu.au>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request Apr 28, 2026

Labels

split (GGUF split model sharding)


Development

Successfully merging this pull request may close these issues.

Re-quantization of a split gguf file produces "invalid split file"
