Model: Add support for Kimi-K2 #14654
Conversation
|
Thanks for the patch, I'm excited to try this model! Running convert_hf_to_gguf based on your branch, will report back with the results. |
|
Uploaded Q2_K and BF16 GGUF to HuggingFace (thanks @danielhanchen for the initial BF16 conversion!) https://huggingface.co/gabriellarson/Kimi-K2-Instruct-GGUF |
|
Nice work @gabriellarson ! I was also trying to get a PR going - master...unslothai:llama.cpp:master - but I was primarily stuck on the K2's special regex handling - will try yours out to see if the regex works! |
|
Has anyone been able to get quantization to work? I got Q3 to quantize successfully but have failed with Q2. It's really a pain with this model: the GGUF ended up being 2 TB when I converted it and I only have 384GB of RAM, and Q3 comes in at about 440GB, so loading it is a nightmare
I believe @gabriellarson made a Q2_K; if you scroll down to the bottom of their huggingface repo files there is a link to a discussion where some folks are sharing information. Going to try this PR shortly and with luck release an imatrix file (or two eventually...) 🤞 |
|
@CISC I think this is ready for review now |
Thank you! The Q2 I made outputs complete gibberish - not sure if my own tweaks to the source caused this issue or if it's inherent to the commit as a whole right now. I saw the recent commits, so I will start over from a clean slate. I need to test some more to see if FA or KV quant was the issue. Gibberish aside, I was able to get about 13-15 t/s on 4x RTX 5090s, which is about on par with what I get with Deepseek. |
|
I've just started converting -> testing (based on my own dequantized BF16). |
|
I haven't tested any edge-cases involving complex character handling in the patch, but it works for English:
|
|
Thanks @anikifoss for the confirmation. I too have been able to cast fp8 to bf16, then run convert_hf_to_gguf to get a bf16 GGUF and use that to quantize a "pure" q8_0, which successfully ran inference with llama-server in a few short chats. I have my methodology details and screenshots in the hf repo discussion and am updating as I go along. Running into an issue now with imatrix dropping a lot of experts due to only 99.74% partial data; might need to look at #9400 (comment) to get a better imatrix here on mainline. |
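Roughly, that pipeline as a small driver script (the paths, checkpoint directory name, and the --pure flag are my assumptions from memory, not exact commands):

```python
# Rough sketch of the BF16 -> "pure" Q8_0 pipeline described above. Assumes a
# built llama.cpp checkout and an already-dequantized BF16 HF checkpoint;
# the paths and the --pure flag are illustrative, not exact.
import subprocess

HF_DIR    = "Kimi-K2-Instruct-BF16"        # dequantized HF checkpoint (assumption)
BF16_GGUF = "Kimi-K2-Instruct-BF16.gguf"
Q8_GGUF   = "Kimi-K2-Instruct-Q8_0.gguf"

# 1) HF safetensors -> BF16 GGUF
subprocess.run(
    ["python", "convert_hf_to_gguf.py", HF_DIR,
     "--outfile", BF16_GGUF, "--outtype", "bf16"],
    check=True)

# 2) BF16 GGUF -> "pure" Q8_0 (no per-tensor type mixing)
subprocess.run(
    ["./build/bin/llama-quantize", "--pure", BF16_GGUF, Q8_GGUF, "Q8_0"],
    check=True)
```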
@thad0ctor How many layers did you have to offload to VRAM, and at what context size did you get that speed? |
|
I tested briefly with the provided Q2_K quant and observed a lot of repetition in the output: whole paragraphs repeating verbatim 3-4 times under different but similar headings ("Bottom Line", "In Summary", etc.), at temp=0.3 and temp=0.6. I'm building Q8 and will try again. |
|
@gabriellarson I can confirm the new regex seems to work well based on tokenization ID matches! @ubergarm Yes you're correct on experts being zeros - I think I also found this to be the case. I also made some 245GB, 281GB (IQ1_S) dynamic quants + Q2_K_XL, Q4_K_XL quants at https://huggingface.co/unsloth/Kimi-K2-Instruct-GGUF - it should work fine with this PR or using my fork https://github.com/unslothai/llama.cpp - guide to run them here: https://docs.unsloth.ai/basics/kimi-k2-how-to-run-locally#run-kimi-k2-tutorials
I am pretty sure it was default context (2k) and 5 layers per card, which left about 4-5 GB headroom on each card (Q4 KV cache). I suspect the llama.cpp splitting may be mapping layers inefficiently, so MoE enhancements in ik_llama or -ot args may add some performance once this model gets fleshed out a bit more. It's unfortunate the Kimi team didn't work with the community pre-release to get the ball rolling on compatibility with common inference engines, for those of us who aren't swimming in VRAM lol
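For illustration only, the kind of launch that -ot idea points at (the model path, expert-tensor regex, and context size below are guesses, not tuned values):

```python
# Illustrative sketch: launch llama-server with layers offloaded to GPU while
# the routed expert tensors are overridden to CPU via -ot / --override-tensor.
# Path, regex, and context size are placeholders, not tuned values.
import subprocess

subprocess.run([
    "./build/bin/llama-server",
    "-m", "Kimi-K2-Instruct-Q2_K.gguf",  # placeholder model path
    "-ngl", "99",                        # offload everything that fits...
    "-ot", r"ffn_.*_exps=CPU",           # ...but keep routed experts in system RAM
    "-c", "8192",
], check=True)
```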
Okay, thanks for confirming! I checked your hf repo but didn't see your [...]. I'm trying @compilade's new [...]
Oh hey we already discussed this, but it looks like your scripts mistakenly named another quant TQ1_0, given it doesn't actually contain that quantization type and is conflating the rough BPW range with an actual ternary-model-only quantization type. It's great there are a lot of options in the smaller size ranges these days, but just trying to keep the naming conventions accurate! Thanks and great job getting this beast of a model going! |
Original patch by @gabriellarson: ggml-org/llama.cpp#14654
|
@RodriMora Yes I saw that as well!
|
I'm getting stuck on adding the template to llm_chat_detect_template() in llama-chat.cpp |
|
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
|
I just got it working with this; my first attempt forgot to [...] |
|
@ubergarm I'll add the [...]. I also include the tool role in mine; is the tool role not necessary?
I wasn't sure myself honestly, but yes, yours does look correct to me given my understanding of the official template. Should be fine, but I don't have a quant to test it and honestly don't know how to use proper tool calling 😅

👈 chat template decoder script

output

$ python chat_template_tester.py moonshotai/Kimi-K2-Instruct
>> chat template <<
<|im_system|>system<|im_middle|>example system prompt<|im_end|><|im_user|>user<|im_middle|>example user turn 1<|im_end|><|im_assistant|>assistant<|im_middle|>example assistant turn 1<|im_end|><|im_user|>user<|im_middle|>example user turn 2<|im_end|><|im_assistant|>assistant<|im_middle|>example assistant turn 2<|im_end|><|im_system|>tool<|im_middle|>## Return of \nsome kind of tool call maybe<|im_end|><|im_assistant|>assistant<|im_middle|>

python script

$ cat chat_template_tester.py
# uv pip install transformers jinja2
# (and sometimes also sentencepiece torch statsmodels, looking at you ERNIE4.5)
from transformers import AutoTokenizer
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("model", help="Name of Hugging Face LLM repo (org/model format)")
args = parser.parse_args()
tokenizer = AutoTokenizer.from_pretrained(args.model, trust_remote_code = True)
chat = [
{"role": "system", "content": "example system prompt"},
{"role": "user", "content": "example user turn 1"},
{"role": "assistant", "content": "example assistant turn 1"},
{"role": "user", "content": "example user turn 2"},
{"role": "assistant", "content": "example assistant turn 2"},
{"role": "tool", "content": "some kind of tool call maybe"},
]
print(">> chat template <<")
print(tokenizer.apply_chat_template(chat, add_generation_prompt=True, tokenize=False))
print(">> end of chat template <<") |
|
I've managed to create a draft model, but unsure yet if I can actually train it:
https://huggingface.co/jukofyork/Kimi-K2-Instruct-DRAFT-0.6B-UNTRAINED
https://huggingface.co/jukofyork/Kimi-K2-Instruct-DRAFT-0.6B-UNTRAINED-GGUF
Even the untrained draft gives some improvements for highly "draftable" refactoring prompts. I'll try to see if I can fine-tune it, but it all depends on whether I can get transformers to load it and/or figure out a similar [...] |
|
I made a small fix to the chat template, would appreciate if anyone can test it: #14852 |
|
There's a good chance we can improve [...]. The tech report seems to suggest that there was no actual long-context training beyond 32k, unlike [...]. It doesn't look like it works that great above 32k anyway, so it's probably a good idea to avoid using the 128k YaRN parameters unless you really have to, and only then use the minimum you actually need >32k, e.g. [...] and so on. |
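To make the "minimum you actually need" point concrete, a quick back-of-the-envelope calculation (assuming the 32k pre-YaRN training window mentioned above):

```python
# Smallest rope/YaRN scaling factor needed for a given target context, assuming
# the reported 32k original (pre-YaRN) training window.
ORIG_CTX = 32 * 1024

def min_yarn_factor(target_ctx: int) -> float:
    """Smallest scaling factor that covers target_ctx."""
    return max(1.0, target_ctx / ORIG_CTX)

for ctx in (32 * 1024, 48 * 1024, 64 * 1024, 128 * 1024):
    print(f"{ctx:>7} tokens -> factor {min_yarn_factor(ctx):.1f}")
```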
|
I raised a question here: MoonshotAI/Kimi-K2#55 |
They replied: [...]
This is the same as what [...] |
|
@jukofyork So during "mid-training" they essentially did long context extension it seems? @ngxson Nice work :) |
* Kimi-K2 conversion
* add Kimi_K2 pre type
* Kimi-K2
* Kimi-K2 unicode
* Kimi-K2
* LLAMA_MAX_EXPERTS 384
* fix vocab iteration
* regex space fix
* add kimi-k2 to pre_computed_hashes
* Updated with kimi-k2 get_vocab_base_pre hash
* fix whitespaces
* fix flake errors
* remove more unicode.cpp whitespaces
* change set_vocab() flow
* add moonshotai-Kimi-K2.jinja to /models/templates/
* update moonshotai-Kimi-K2.jinja
* add kimi-k2 chat template
* add kimi-k2
* update NotImplementedError
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* except Exception
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* LLM_CHAT_TEMPLATE_KIMI_K2 if(add_ass){}
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Original patch by @gabriellarson: ggml-org#14654
Co-authored-by: anikifoss <anikifoss>



I used the same set_vocab approach as the HunYuanMoE, and attempted to accurately represent the kimi_tokenization.py regex in unicode.cpp.
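For context, a minimal sketch of the pre-tokenizer detection this hooks into: the converter hashes the token IDs of a fixed probe string and maps the digest to a name like "kimi-k2". The probe text and digest below are placeholders, not the values actually used in convert_hf_to_gguf.py.

```python
# Sketch of the get_vocab_base_pre() idea: encode a fixed probe string, hash the
# resulting token IDs, and look the digest up in a table of known hashes.
# The probe text and digest here are placeholders, not the converter's real values.
from hashlib import sha256
from transformers import AutoTokenizer

PRE_COMPUTED_HASHES = {
    "0" * 64: "kimi-k2",  # placeholder digest, not the real hash
}

def detect_pre_tokenizer(model_id: str, probe_text: str) -> str:
    tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    chkhsh = sha256(str(tok.encode(probe_text)).encode()).hexdigest()
    return PRE_COMPUTED_HASHES.get(chkhsh, "unknown")

print(detect_pre_tokenizer("moonshotai/Kimi-K2-Instruct", "example probe text"))
```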