
common/gemma4 : handle parsing edge cases #21760

Merged
aldehir merged 5 commits into ggml-org:master from aldehir:gemma4-more-fixes
Apr 13, 2026
Conversation

@aldehir
Contributor

@aldehir aldehir commented Apr 11, 2026

Overview

Fix a few edge cases for Gemma 4 26B A4B. I don't see these artifacts with the 31B variant.

Additional information

Issue 1

If the model generates content + tool call, the template will incorrectly format the prompt without the generation prompt (<|turn>model\n):

...<|tool_call>call:$call<tool_call|><|tool_response>response:$response<tool_response|>$message<turn|>\n

This causes 26B to produce a broken thinking sequence:

thought\n<channel|>

Instead of

<|channel>thought\n<channel|>

This is fixed by adding the generation prompt when it is missing and the prompt ends with <turn|>\n.
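The check can be sketched as follows. This is a hypothetical standalone helper for illustration only (ensure_generation_prompt is not the actual llama.cpp code), using the token strings from the examples above:

```cpp
#include <string>

// Hypothetical helper illustrating the fix: if the templated prompt ends with
// a closed turn ("<turn|>\n"), the generation prompt ("<|turn>model\n") was
// left off, so append it before handing the prompt to the model.
static std::string ensure_generation_prompt(std::string prompt) {
    const std::string turn_end   = "<turn|>\n";
    const std::string gen_prompt = "<|turn>model\n";
    if (prompt.size() >= turn_end.size() &&
        prompt.compare(prompt.size() - turn_end.size(), turn_end.size(), turn_end) == 0) {
        prompt += gen_prompt;
    }
    return prompt;
}
```

Since a prompt ending in <turn|>\n by definition does not end with the generation prompt, the ends-with check alone covers the "if not present" condition in this sketch.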

Issue 2

Occasionally 26B will emit a trailing <channel|>, particularly when it does not reason but produces a content message before a tool call:

<|channel>thought\n<channel|>I will ...<channel|><|tool_call>

Fixed by scanning until <channel|>, then consuming until <|tool_call> or the end of input.
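A rough sketch of that scan (hypothetical helper, not the real PEG rule; it assumes the leading <|channel> token has already been consumed):

```cpp
#include <string>
#include <utility>

// Hypothetical sketch: reasoning runs up to the first "<channel|>";
// everything after it is content, consumed until "<|tool_call>" or end of
// input, with a stray trailing "<channel|>" (as in the example above) dropped.
static std::pair<std::string, std::string> split_channel(const std::string & text) {
    const std::string channel_end = "<channel|>";
    const std::string tool_call   = "<|tool_call>";

    size_t end = text.find(channel_end);
    if (end == std::string::npos) {
        return { "", text };  // no reasoning marker: everything is content
    }
    std::string thought = text.substr(0, end);

    size_t start = end + channel_end.size();
    size_t stop  = text.find(tool_call, start);
    std::string content = (stop == std::string::npos)
        ? text.substr(start)
        : text.substr(start, stop - start);

    // drop a stray trailing "<channel|>" emitted right before the tool call
    if (content.size() >= channel_end.size() &&
        content.compare(content.size() - channel_end.size(), channel_end.size(), channel_end) == 0) {
        content.erase(content.size() - channel_end.size());
    }
    return { thought, content };
}
```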

Issue 3

At the start of the generation, 26B may emit multiple <|channel> tokens.

<|channel><|channel>thought\nI will...

Unsure if this is related to the bad prompt above, but it's easy enough to handle by consuming all <|channel> tokens that do not precede thought.
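A minimal sketch of that handling (hypothetical helper; the actual fix is expressed in the PEG grammar) collapses a leading run of <|channel> tokens down to the single one preceding the thought:

```cpp
#include <string>

// Hypothetical helper: while the text starts with two consecutive
// "<|channel>" tokens, drop the redundant leading one.
static std::string collapse_channel_starts(std::string text) {
    const std::string channel_start = "<|channel>";
    const std::string doubled       = channel_start + channel_start;
    while (text.compare(0, doubled.size(), doubled) == 0) {
        text.erase(0, channel_start.size());
    }
    return text;
}
```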


@aldehir aldehir requested a review from a team as a code owner April 11, 2026 07:22
@aldehir aldehir requested a review from pwilkin as a code owner April 11, 2026 07:31
@github-actions github-actions bot added the testing (Everything test related) label Apr 11, 2026
@En3Tho

En3Tho commented Apr 11, 2026

#21767

Made a discussion, but you're already working on a fix. Thanks!

This is what I'm hitting quite often:
<|channel><|channel>thought <channel|><|tool_call>call:tool_name{..tool_args}<tool_call|>

Is it the same issue?

@aldehir
Contributor Author

aldehir commented Apr 11, 2026

@En3Tho this one might be related to the prompt issue, but I'll add it just in case.

@aldehir aldehir requested a review from ggerganov as a code owner April 11, 2026 19:39
@Dampfinchen

Dampfinchen commented Apr 11, 2026

Encountered these in Hermes Agent. I'm not sure if this is a lcpp or hermes issue, but it doesn't hurt to post it here.

[Screenshot 2026-04-11 225147] [Screenshot 2026-04-11 215539]

Maybe this PR fixes that as well?

@aldehir
Contributor Author

aldehir commented Apr 11, 2026

@Dampfinchen it should for the thought\n<channel|> issues, but that first generation has me worried. Which quants are you using? I want to see if I can reproduce something similar.

@Dampfinchen

Dampfinchen commented Apr 11, 2026

@Dampfinchen it should for the thought\n<channel|> issues, but that first generation has me worried. Which quants are you using? I want to see if I can reproduce something similar.

I was using https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF/commit/42d40426322efc9bdd03f9f32b9fd87bfc63409f

It was the second update I believe. Recently the third update released with the latest chat template.

Since it's an older quant, I downloaded the new chat template from Google and injected it using --chat-template-file. I am also running an up-to-date llama.cpp build that includes the fixes from PR #21704.

The exact lcpp command I have used:

./llama-server -m "google_gemma-4-26B-A4B-it-Q4_K_M.gguf" -c 102144 -fa 1 --host 0.0.0.0 --port 5001 --jinja -ngl 99 --n-cpu-moe 28 -ctv q8_0 -ctk q8_0 -ub 1024 --mmproj "gemma-4-26B-A4B-it.mmproj-q8_0.gguf" --no-mmproj-offload --ctx-checkpoints 6 --no-mmap --chat-template-file "Gemma4\chat_template-google-26b.jinja" --reasoning 0 -np 1 --temp 0.3 --top-p 0.9 --min-p 0.1 --top-k 20

Please note I have not downloaded and tested this PR yet; I simply thought it was a good idea to share in case this is one of those edge cases that can be improved.

In Hermes Agent, it usually seems to work fine, but these issues appeared after I had worked with the agent for a little while. At one point (around 64K of context) I noticed it was stuck, endlessly generating, so I stopped the agent, which led to the printout of the strange generation that has you worried.

@aldehir aldehir marked this pull request as draft April 11, 2026 22:14
@aldehir
Contributor Author

aldehir commented Apr 12, 2026

In Hermes Agent, it usually seems to work fine, but these issues appeared after I had worked with the agent for a little while. At one point (around 64K of context) I noticed it was stuck, endlessly generating, so I stopped the agent, which led to the printout of the strange generation that has you worried.

Yeah, I notice the same thing on occasion at high context. Have you tried another agent? Some of them don't return the reasoning traces properly, and it's difficult to test everything out there.

I've been running this PR for a bit on multiple runs at high context. It feels like it handles all the parsing issues I've noticed.

There is one particularly nasty failure condition of 26B A4B. It will sometimes try to back out of a tool call, reason again, and then generate a new tool call. I had Claude make this visualization: https://cdn.alde.dev/llama.cpp/examples/gemma4/tool-call-derailment.html

This is an intractable problem if we stream tool calls. By the time it reconsiders, we have already sent the tool call deltas to the client. The only real way to address this is to block until completion of the generation and then attempt to parse tool call(s) from the end.

@Dampfinchen

Dampfinchen commented Apr 12, 2026


Not yet, Hermes Agent is my only Agent.

I, too, have been running this PR for a little while now using Hermes Agent, and it appears stable. At one point Hermes printed "empty response after tool calls — using earlier content as final answer", but I didn't see any repetition or leaking like before, so this is a good improvement.

@emansom

emansom commented Apr 12, 2026

@aldehir Working on this too in https://github.com/emansom/llama.cpp/tree/gemma4-fixes

emansom@d12ebae
emansom@76398ed
emansom@a1ed637
emansom@a93c0ff
emansom@c73fe16

Feel free to copy whatever makes sense into your code/branch, and/or tell me what to open as a pull request.

@emansom

emansom commented Apr 12, 2026

I think we really need a proper state machine implementation for the code to make sense.

Or clean up the current PEG parsing so it is more logically structured.

A state machine that follows the possible states (and sub-states) defined in the official prompt format docs:
https://ai.google.dev/gemma/docs/core/prompt-formatting-gemma4

@jacekpoplawski
Contributor

@Dampfinchen I had this with 26B in OpenCode


@aldehir
Contributor Author

aldehir commented Apr 12, 2026

or clean up the current PEG parsing handling so it is more logically structured.

I'm open to recommendations, but a state machine is nothing more than a DFA, which is representable by a PEG. There's an argument to be made for parsing at the token level, which is not impossible, but it requires some fundamental changes.

for llama.cpp to process the Gemma 4 format in the same way the official implementation does:

llama.cpp is more than just Gemma 4. Keep that in mind and keep the conversation productive. I appreciate the links to your changes. I think I handled most of them here, but we could use more testing around tool call parsing.

@emansom

emansom commented Apr 13, 2026

or clean up the current PEG parsing handling so it is more logically structured.

I'm open to recommendations, but a state machine is nothing more than a DFA which is representable by PEG.

Yes, agreed that the PEG parser can act as a state machine. Currently for Gemma 4, however, it seems to be used more as a duct-tape patcher, especially when compared to the full parser implemented in vLLM.

There's an argument to be made about parsing at a token level, which is not impossible, but requires some fundamental changes.

Does that mean that what this test describes is currently not possible with the PEG parser?:
https://github.com/google-ai-edge/LiteRT-LM/blob/main/runtime/conversation/internal_callback_util_test.cc#L672C1-L673C66

Could that be the reason why it is failing? Is it simply not parsing the <channel|> or <|channel> token spread over separate responses from the model?

for llama.cpp to process the Gemma 4 format in the same way the official implementation does:

llama.cpp is more than just Gemma 4. Keep that in mind.

Logic could be implemented to detect Gemma 4 syntax from the PEG and then switch to a full Gemma 4 parser. Or a CLI flag as an indicator of which parser to use, similar to how it's implemented in vLLM?

https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html#launch-server-with-tool-calling

https://github.com/vllm-project/vllm/blob/main/vllm/tool_parsers/gemma4_tool_parser.py
https://github.com/vllm-project/vllm/blob/main/vllm/tool_parsers/gemma4_utils.py
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/gemma4_mm.py
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/gemma4.py
https://github.com/vllm-project/vllm/blob/main/vllm/reasoning/gemma4_reasoning_parser.py
https://github.com/vllm-project/vllm/blob/main/vllm/reasoning/gemma4_utils.py
https://github.com/vllm-project/vllm/blob/main/examples/tool_chat_template_gemma4.jinja
https://github.com/vllm-project/vllm/blob/main/tests/renderers/test_gemma4_chat_template.py
https://github.com/vllm-project/vllm/blob/main/tests/reasoning/test_gemma4_reasoning_parser.py
https://github.com/vllm-project/vllm/blob/main/tests/models/multimodal/processing/test_gemma4.py

Stumbled on this, might be that llama.cpp needs some layer fixing for Gemma 4?:
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/rotary_embedding/gemma4_rope.py

@aldehir
Contributor Author

aldehir commented Apr 13, 2026

Currently for Gemma 4 however it seems to be more used as a duct-tape patcher

A state machine doesn't help if the model output doesn't align with its own format. Unless you're suggesting masking logits in adherence to the state machine, which is a good idea but out of scope for this PR.

Especially when compared to the full parser implemented in vLLM.

Their parsing file is 700+ lines of Python vs. 104 lines of C++ here. A majority of the tool call parsers in vLLM are recursive descent parsers stitched together with regex and edge cases to handle streaming. That's not a criticism; it's a straightforward approach. The point is that Gemma 4 tool call parsing was captured here in relatively few lines of code, with streaming support included.

Could that be the reason why it is failing? It is simply not parsing the <channel|> or <|channel> token spread over separate responses from the model?

Could you clarify what you're seeing fail? These come in as a single token, so the entire piece is added to the prompt. There's no way to receive two halves in separate generations. That said, the PEG parser handles this well by design: if a literal is only partially matched, it waits for more input before appending to the AST.

If you think a different implementation approach would be better for the project overall, I'd suggest opening a discussion to get feedback from the other maintainers. That'd be a better venue for that conversation than this PR.

@aldehir
Contributor Author

aldehir commented Apr 13, 2026

Stumbled on this, might be that llama.cpp needs some layer fixing for Gemma 4?:
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/rotary_embedding/gemma4_rope.py

I'm not familiar enough with that part of the codebase to weigh in. Someone more familiar with the rotary embedding layers might be able to say whether that applies here.

@emansom

emansom commented Apr 13, 2026

A state machine doesn't help if the model output doesn't align with its own format. Unless you're suggesting masking logits in adherence to the state machine, which is a good idea but out of scope for this PR.

I was not suggesting masking logits in adherence to a Gemma 4 format state machine; however, now that you bring it up, that may be an interesting experiment indeed?

Their parsing file is 700+ lines of Python vs. 104 lines of C++ here. A majority of the tool call parsers in vLLM are recursive descent parsers stitched together with regex and edge cases to handle streaming. Not a criticism of their approach, since that deviates from what most models generate. The point is that Gemma 4 tool call parsing was captured here in relatively few lines of code, with streaming support included.

Lines of code are not a determining factor of quality. I wasn't aware their approach relied on regexes; that doesn't seem very robust either. The format parser implementation from Google in LiteRT-LM is a real parser afaict. I am bringing it up to suggest comparing notes with that implementation, to ensure the PEG parser handles the Gemma 4 model the same way.

That said, the PEG parser handles this well by design: if a literal is only partially matched, it waits for more input before appending to the AST.

Is that being tested, in a similar way to the LiteRT-LM conversation tests?

https://github.com/google-ai-edge/LiteRT-LM/blob/main/runtime/conversation/internal_callback_util_test.cc#L672C1-L673C66

If you think a different implementation approach would be better for the project overall, I'd suggest opening a discussion to get feedback from the other maintainers. That'd be a better venue for that conversation than this PR.

I'm not entirely sure. Google intended the Gemma 4 model to be used with stateful parsers (see their LiteRT-LM implementation). If the PEG parser can handle all the niche cases in the same way and produce the same state, then that's a good solution too.

@aldehir
Contributor Author

aldehir commented Apr 13, 2026

The tests/test-chat.cpp file runs the parser against incrementally increasing partial input to ensure that partial parses don't introduce incorrect deltas between parses. The Gemma 4 tests are included, since they use the same test framework.

There's also a series of partial matching tests for the PEG parser in particular:

// Sequences - Partial Match 1
t.test("sequence_partial_match_1", [&](testing & t) {
    auto parser = build_peg_parser([](common_peg_parser_builder & p) { return p.literal("<think>") + p.literal("</think>"); });
    auto ctx = common_peg_parse_context("<thi", COMMON_PEG_PARSE_FLAG_LENIENT);
    auto result = parser.parse(ctx);
    t.assert_equal("sequence_partial_match_1", true, result.need_more_input());
});

// Sequences - Partial Match 2
t.test("sequence_partial_match_2", [&](testing & t) {
    auto parser = build_peg_parser([](common_peg_parser_builder & p) { return p.literal("begin") + p.literal("end"); });
    auto ctx = common_peg_parse_context("begin", COMMON_PEG_PARSE_FLAG_LENIENT);
    auto result = parser.parse(ctx);
    t.assert_equal("sequence_partial_match_2", true, result.need_more_input());
});

// Sequences - Partial Match 3
t.test("sequence_partial_match_3", [&](testing & t) {
    auto parser = build_peg_parser([](common_peg_parser_builder & p) { return p.literal("<think>") + p.literal("</think>"); });
    auto ctx = common_peg_parse_context("<think></", COMMON_PEG_PARSE_FLAG_LENIENT);
    auto result = parser.parse(ctx);
    t.assert_equal("sequence_partial_match_3", true, result.need_more_input());
});

// Sequences - Full Match
t.test("sequence_full_match", [&](testing & t) {
    auto common_chat_combinator_parser = build_peg_parser([](common_peg_parser_builder & p) { return p.literal("hello") + p.literal("world"); });
    auto ctx = common_peg_parse_context("helloworld");
    auto result = common_chat_combinator_parser.parse(ctx);
    t.assert_equal("sequence_full_match", true, result.success());
});

// Sequences - No Match
t.test("sequence_no_match", [&](testing & t) {
    auto parser = build_peg_parser([](common_peg_parser_builder & p) { return p.literal("<think>") + p.literal("</think>"); });
    auto ctx = common_peg_parse_context("<think>I am common_chat_combinator_parser", COMMON_PEG_PARSE_FLAG_LENIENT);
    auto result = parser.parse(ctx);
    t.assert_equal("sequence_no_match", true, result.fail());
});

// Choices - Partial Match 1
t.test("choices_partial_match_1", [&](testing & t) {
    auto parser = build_peg_parser([](common_peg_parser_builder & p) { return p.literal("option1") | p.literal("option2"); });
    auto ctx = common_peg_parse_context("opt", COMMON_PEG_PARSE_FLAG_LENIENT);
    auto result = parser.parse(ctx);
    t.assert_equal("choices_partial_match_1", true, result.need_more_input());
});

// Choices - Partial Match 2
t.test("choices_partial_match_2", [&](testing & t) {
    auto parser = build_peg_parser([](common_peg_parser_builder & p) { return p.literal("choice_a") | p.literal("choice_b"); });
    auto ctx = common_peg_parse_context("choice", COMMON_PEG_PARSE_FLAG_LENIENT);
    auto result = parser.parse(ctx);
    t.assert_equal("choices_partial_match_2", true, result.need_more_input());
});

// Choices - Full Match 1
t.test("choices_full_match_1", [&](testing & t) {
    auto parser = build_peg_parser([](common_peg_parser_builder & p) { return p.literal("first") | p.literal("second"); });
    auto ctx = common_peg_parse_context("first");
    auto result = parser.parse(ctx);
    t.assert_equal("choices_full_match_1", true, result.success());
});

// Choices - Full Match 2
t.test("choices_full_match_2", [&](testing & t) {
    auto parser = build_peg_parser([](common_peg_parser_builder & p) { return p.literal("alpha") | p.literal("beta"); });
    auto ctx = common_peg_parse_context("beta");
    auto result = parser.parse(ctx);
    t.assert_equal("choices_full_match_2", true, result.success());
});

// Choices - No Match
t.test("choices_no_match", [&](testing & t) {
    auto parser = build_peg_parser([](common_peg_parser_builder & p) { return p.literal("good") | p.literal("better"); });
    auto ctx = common_peg_parse_context("best");
    auto result = parser.parse(ctx);
    t.assert_equal("choices_no_match", true, result.fail());
});

// Zero or More - Partial Match 1
t.test("zero_or_more_partial_match_1", [&](testing & t) {
    auto parser = build_peg_parser([](common_peg_parser_builder & p) { return p.zero_or_more(p.literal("ab")); });
    auto ctx = common_peg_parse_context("a", COMMON_PEG_PARSE_FLAG_LENIENT);
    auto result = parser.parse(ctx);
    t.assert_equal("zero_or_more_partial_match_1", true, result.need_more_input());
});

// Zero or More - Partial Match 2
t.test("zero_or_more_partial_match_2", [&](testing & t) {
    auto parser = build_peg_parser([](common_peg_parser_builder & p) { return p.zero_or_more(p.literal("xy")); });
    auto ctx = common_peg_parse_context("xyx", COMMON_PEG_PARSE_FLAG_LENIENT);
    auto result = parser.parse(ctx);
    t.assert_equal("zero_or_more_partial_match_2", true, result.need_more_input());
});

// Zero or More - Full Match
t.test("zero_or_more_full_match", [&](testing & t) {
    auto parser = build_peg_parser([](common_peg_parser_builder & p) { return p.zero_or_more(p.literal("test")); });
    auto ctx = common_peg_parse_context("test");
    auto result = parser.parse(ctx);
    t.assert_equal("zero_or_more_full_match", true, result.success());
});

// One or More - Partial Match 1
t.test("one_or_more_partial_match_1", [&](testing & t) {
    auto parser = build_peg_parser([](common_peg_parser_builder & p) { return p.one_or_more(p.literal("repeat")); });
    auto ctx = common_peg_parse_context("rep", COMMON_PEG_PARSE_FLAG_LENIENT);
    auto result = parser.parse(ctx);
    t.assert_equal("one_or_more_partial_match_1", true, result.need_more_input());
});

// One or More - Partial Match 2
t.test("one_or_more_partial_match_2", [&](testing & t) {
    auto parser = build_peg_parser([](common_peg_parser_builder & p) { return p.one_or_more(p.literal("ab")); });
    auto ctx = common_peg_parse_context("aba", COMMON_PEG_PARSE_FLAG_LENIENT);
    auto result = parser.parse(ctx);
    t.assert_equal("one_or_more_partial_match_2", true, result.need_more_input());
});

// One or More - Full Match
t.test("one_or_more_full_match", [&](testing & t) {
    auto parser = build_peg_parser([](common_peg_parser_builder & p) { return p.one_or_more(p.literal("single")); });
    auto ctx = common_peg_parse_context("single");
    auto result = parser.parse(ctx);
    t.assert_equal("one_or_more_full_match", true, result.success());
});

// One or More - No Match
t.test("one_or_more_no_match", [&](testing & t) {
    auto parser = build_peg_parser([](common_peg_parser_builder & p) { return p.one_or_more(p.literal("()")); });
    auto ctx = common_peg_parse_context("success");
    auto result = parser.parse(ctx);
    t.assert_equal("one_or_more_no_match", true, result.fail());
});

Overall, I think we're in a good place with Gemma 4. There are some interesting edge cases to handle, but I believe PEG can address them quite succinctly especially with the and/not (named peek/negate here) parser combinators, which map to positive/negative lookahead.

That said, PEG has some downsides. Error reporting, in particular, can be a challenge. There is a trade-off.

For context on why we moved to this approach: previously, the parsers looked very similar to vLLM, and grammars for grammar-constrained decoding were created separately, independent of the parsing. Many implementations were incomplete because each model had to support response format and tool calling. tool_choice = required was especially difficult to write grammars for with reasoning models, since you had to constrain from the start of generation.

Now we define something that closely represents a grammar, and it builds both the parser and GBNF grammar. Model support comes much quicker (usually within a day, not a week), and fixes come in as we learn more about how the model behaves in practice versus what's written in documentation.

There's definitely room for improvement, and I have several ideas in the back of my mind. Like I said, it's not perfect but quite good enough (in my opinion).

@emansom

emansom commented Apr 13, 2026

If you think a different implementation approach would be better for the project overall, I'd suggest opening a discussion to get feedback from the other maintainers. That'd be a better venue for that conversation than this PR.

Stumbled on something in the LiteRT-LM source that may be of interest: #21836

@aldehir aldehir marked this pull request as ready for review April 13, 2026 07:16
@aldehir
Contributor Author

aldehir commented Apr 13, 2026

@pwilkin I don't think I have anything more to add at the moment. Note the new gbnf() parser, intended to overwrite the grammar generated for a rule. This is to keep the grammar clean when grammar_lazy = false, especially if using lookahead to handle edge cases.

@strawberrymelonpanda
Contributor

strawberrymelonpanda commented Apr 13, 2026

I've been running this since yesterday after seeing the issues mentioned at the top, like thought\n<channel|>. The PR works great on my end, and local evaluation scores got a nice boost from far fewer failures.

I'm running it locally, but I'd suggest a merge ASAP. 👍

Member

@pwilkin pwilkin left a comment


Let's keep it as it is and if we find more testable edge cases, we can do a follow-up PR.

@arkavo-com arkavo-com mentioned this pull request Apr 13, 2026
@aldehir
Contributor Author

aldehir commented Apr 13, 2026

@ggml-org/llama-common

Contributor

@ngxson ngxson left a comment


I haven't reviewed this, but just giving the approval

@aldehir aldehir merged commit e21cdc1 into ggml-org:master Apr 13, 2026
48 of 50 checks passed
InfernalDread added a commit to InfernalDread/llama.cpp that referenced this pull request Apr 14, 2026
common/gemma4 : handle parsing edge cases (ggml-org#21760)
Vertex-DS pushed a commit to Vertex-DS/llama.cpp that referenced this pull request Apr 14, 2026
@TomTheWise

Is there a possible fix for endless thinking? It happens even without tool calls. While this fixes the parsing, the endless reasoning loops still happen quite often with Gemma 4 26B A4B.

@emansom

emansom commented Apr 15, 2026

Is there a possible fix to endless thinking? It happens even without tool calls. While this fixes the parsing, the endless reasoning loops still happen quite often with Gemma 4 26B A4B

Test from this branch and let me know if that fixes it for you.

https://github.com/emansom/llama.cpp/tree/gemma4-fixes

@Quairon-Nailo

Quairon-Nailo commented Apr 17, 2026

This breaks parsing for chat completion after thinking. When streaming, the reasoning is output and then the stream stops, but the server keeps generating tokens that are never streamed. When not streaming, the reasoning text is output as the actual response, but the logs say more tokens were generated. In both cases, I get a 500 error when I ask it to continue (by prefilling the reasoning):

slot update_slots: id  0 | task 1198 | prompt processing progress, n_tokens = 17682, batch.n_tokens = 18, progress = 0.985729
slot update_slots: id  0 | task 1198 | n_tokens = 17682, memory_seq_rm [17682, end)
slot update_slots: id  0 | task 1198 | prompt processing progress, n_tokens = 17934, batch.n_tokens = 252, progress = 0.999777
slot update_slots: id  0 | task 1198 | created context checkpoint 1 of 2 (pos_min = 16402, pos_max = 17681, n_tokens = 17682, size = 1000.016 MiB)
slot update_slots: id  0 | task 1198 | n_tokens = 17934, memory_seq_rm [17934, end)
slot init_sampler: id  0 | task 1198 | init sampler, took 1.38 ms, tokens: text = 17938, total = 17938
slot update_slots: id  0 | task 1198 | prompt processing done, n_tokens = 17938, batch.n_tokens = 4
slot update_slots: id  0 | task 1198 | created context checkpoint 2 of 2 (pos_min = 16654, pos_max = 17933, n_tokens = 17934, size = 1000.016 MiB)
reasoning-budget: activated, budget=2147483647 tokens
reasoning-budget: deactivated (natural end)
slot print_timing: id  0 | task 1198 | 
prompt eval time =   11225.83 ms / 17938 tokens (    0.63 ms per token,  1597.92 tokens per second)
       eval time =   86411.14 ms /  1900 tokens (   45.48 ms per token,    21.99 tokens per second)
      total time =   97636.97 ms / 19838 tokens
slot      release: id  0 | task 1198 | stop processing: n_tokens = 19837, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
srv  params_from_: Chat format: peg-gemma4
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.956 (> 0.100 thold), f_keep = 0.904
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist 
slot launch_slot_: id  0 | task 3170 | processing task, is_child = 0
slot update_slots: id  0 | task 3170 | new prompt, n_ctx_slot = 39936, n_keep = 0, task.n_tokens = 18771
slot update_slots: id  0 | task 3170 | n_past = 17939, slot.prompt.tokens.size() = 19837, seq_id = 0, pos_min = 18557, n_swa = 1024
slot update_slots: id  0 | task 3170 | Checking checkpoint with [16654, 17933] against 16915...
slot update_slots: id  0 | task 3170 | restored context checkpoint (pos_min = 16654, pos_max = 17933, n_tokens = 17934, n_past = 17933, size = 1000.016 MiB)
slot update_slots: id  0 | task 3170 | n_tokens = 17933, memory_seq_rm [17933, end)
slot update_slots: id  0 | task 3170 | prompt processing progress, n_tokens = 18189, batch.n_tokens = 256, progress = 0.968995
slot update_slots: id  0 | task 3170 | n_tokens = 18189, memory_seq_rm [18189, end)
slot update_slots: id  0 | task 3170 | prompt processing progress, n_tokens = 18445, batch.n_tokens = 256, progress = 0.982633
slot update_slots: id  0 | task 3170 | n_tokens = 18445, memory_seq_rm [18445, end)
slot update_slots: id  0 | task 3170 | prompt processing progress, n_tokens = 18515, batch.n_tokens = 70, progress = 0.986362
slot update_slots: id  0 | task 3170 | n_tokens = 18515, memory_seq_rm [18515, end)
slot update_slots: id  0 | task 3170 | prompt processing progress, n_tokens = 18767, batch.n_tokens = 252, progress = 0.999787
slot update_slots: id  0 | task 3170 | erasing old context checkpoint (pos_min = 16402, pos_max = 17681, n_tokens = 17682, size = 1000.016 MiB)
slot update_slots: id  0 | task 3170 | created context checkpoint 2 of 2 (pos_min = 17235, pos_max = 18514, n_tokens = 18515, size = 1000.016 MiB)
slot update_slots: id  0 | task 3170 | n_tokens = 18767, memory_seq_rm [18767, end)
reasoning-budget: activated, budget=2147483647 tokens
slot init_sampler: id  0 | task 3170 | init sampler, took 1.48 ms, tokens: text = 18771, total = 18771
slot update_slots: id  0 | task 3170 | prompt processing done, n_tokens = 18771, batch.n_tokens = 4
slot update_slots: id  0 | task 3170 | erasing old context checkpoint (pos_min = 16654, pos_max = 17933, n_tokens = 17934, size = 1000.016 MiB)
slot update_slots: id  0 | task 3170 | created context checkpoint 2 of 2 (pos_min = 17487, pos_max = 18766, n_tokens = 18767, size = 1000.016 MiB)
reasoning-budget: deactivated (natural end)
slot print_timing: id  0 | task 3170 | 
prompt eval time =    1334.88 ms /   838 tokens (    1.59 ms per token,   627.77 tokens per second)
       eval time =   48585.09 ms /  1065 tokens (   45.62 ms per token,    21.92 tokens per second)
      total time =   49919.97 ms /  1903 tokens
slot      release: id  0 | task 3170 | stop processing: n_tokens = 19835, truncated = 0
srv  update_slots: all slots are idle
srv          stop: cancel task, id_task = 3170
srv  update_slots: all slots are idle
srv    operator(): got exception: {"error":{"code":500,"message":"Failed to parse input at pos 13: <channel|>[full text of the response after the reasoning]","type":"server_error"}}
srv  log_server_r: done request: POST /v1/chat/completions [IP]

I substituted out the IP and the actual response text, but the logs show that a response was generated; it was just never sent to the frontend.

@aldehir
Contributor Author

aldehir commented Apr 18, 2026

How did you prefill? Prefill is not supported for reasoning models, so this is not a legitimate reproduction unless it was done correctly. Judging by the pos 13:, I am skeptical the prompt was properly prefilled so that the start of the assistant message aligns with what the parser expects.

It does seem like llama.cpp was able to parse the reasoning but then failed to parse the content. However, the failure doesn't propagate as an error, so you see generation occurring but no new content reaching the client. You would need to set --verbose to see the raw generation; without that, I don't know where the gap in the parsing lies.

Which model did you use, 26B A4B?

@Quairon-Nailo

I use 31B. I forgot reasoning and prefill aren't supposed to be used together; I had fixed that on the first day. I'm using ST, and I need chat completion to use vision (it doesn't send images in text completion), but prefill is needed for the "continue generation" function, which I use often. So I added

chat_template_kwargs:
  enable_thinking: false

to the additional body parameters to "disable" reasoning, then prefill with <|channel> so the AI completes the "thought" part itself. That worked perfectly before this update. I assume that's what's messing up the parser now?
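The setup described above can be sketched as a request body. This is a hypothetical illustration, not code from this thread: the field names follow llama.cpp's OpenAI-compatible /v1/chat/completions API, the message contents are made up, and `build_prefill_request` is a helper invented here.

```python
import json

# Hypothetical sketch: reasoning is "disabled" via chat_template_kwargs, and
# the assistant turn is prefilled with a raw "<|channel>" tag for the model
# to complete. Message contents are illustrative only.
def build_prefill_request(history, prefill="<|channel>"):
    return {
        "messages": history + [{"role": "assistant", "content": prefill}],
        "chat_template_kwargs": {"enable_thinking": False},
    }

body = build_prefill_request([{"role": "user", "content": "Hello"}])
print(json.dumps(body))
```

Whether the server continues the prefilled assistant message or opens a fresh turn depends on the chat template and server flags in use.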

@aldehir
Contributor Author

aldehir commented Apr 18, 2026

I'll take a look, thanks. It should work, since we recently started parsing from the start of the generation prompt rather than the start of the generation itself.

I have an idea that could help; it's just hard to reproduce these errors to verify that it works.
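For readers following along, the scanning behavior this PR describes can be sketched in a few lines. This is a simplified Python illustration, not the actual C++ parser in llama.cpp: it tolerates repeated <|channel> openers at the start of generation, takes everything up to the closing <channel|> as reasoning, and consumes content until <|tool_call> or end of text.

```python
# Simplified sketch (not the real llama.cpp implementation) of the
# channel-tag handling described in this PR.
def split_reasoning(text):
    open_tag, close_tag, tool_tag = "<|channel>", "<channel|>", "<|tool_call>"
    # consume duplicated channel openers emitted at the start of generation
    while text.startswith(open_tag):
        text = text[len(open_tag):]
    reasoning = ""
    end = text.find(close_tag)
    if end != -1:
        reasoning = text[:end]
        text = text[end + len(close_tag):]
    # content runs until a tool call starts, or to the end of the text
    cut = text.find(tool_tag)
    content = text if cut == -1 else text[:cut]
    return reasoning, content

r, c = split_reasoning("<|channel><|channel>thought\n<channel|>Hello<|tool_call>...")
# r == "thought\n", c == "Hello"
```

A response with no channel tags at all falls through unchanged: the reasoning comes back empty and the whole text is treated as content.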

@Quairon-Nailo

Quairon-Nailo commented Apr 20, 2026

Thank you for looking into it, and sorry for taking so long to reply; I haven't been home the past few days. I've now run this with --verbose in the hope that it helps, using the same prompt both with and without streaming.

Log
~/Programs/llama.cpp master* ≡
❯ G4-31B.sh --verbose
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 48229 MiB):
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24126 MiB
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24103 MiB
build_info: b8851-e365e658f
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
Running without SSL
init: using 31 threads for HTTP server
start: binding port with default address family
main: loading model
srv    load_model: loading model '/mnt/Speed/AI/Models/bartowski/google_gemma-4-31B-it-GGUF/google_gemma-4-31B-it-Q8_0.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: getting device memory data for initial parameters:
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 23859 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 22602 MiB free
llama_model_loader: loaded meta data with 51 key-value pairs and 833 tensors from /mnt/Speed/AI/Models/bartowski/google_gemma-4-31B-it-GGUF/google_gemma-4-31B-it-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma4
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                     general.sampling.top_k i32              = 64
llama_model_loader: - kv   3:                     general.sampling.top_p f32              = 0.950000
llama_model_loader: - kv   4:                      general.sampling.temp f32              = 1.000000
llama_model_loader: - kv   5:                               general.name str              = Gemma 4 31B It
llama_model_loader: - kv   6:                           general.finetune str              = it
llama_model_loader: - kv   7:                           general.basename str              = gemma-4
llama_model_loader: - kv   8:                         general.size_label str              = 31B
llama_model_loader: - kv   9:                            general.license str              = apache-2.0
llama_model_loader: - kv  10:                       general.license.link str              = https://ai.google.dev/gemma/docs/gemm...
llama_model_loader: - kv  11:                               general.tags arr[str,1]       = ["image-text-to-text"]
llama_model_loader: - kv  12:                         gemma4.block_count u32              = 60
llama_model_loader: - kv  13:                      gemma4.context_length u32              = 262144
llama_model_loader: - kv  14:                    gemma4.embedding_length u32              = 5376
llama_model_loader: - kv  15:                 gemma4.feed_forward_length u32              = 21504
llama_model_loader: - kv  16:                gemma4.attention.head_count u32              = 32
llama_model_loader: - kv  17:             gemma4.attention.head_count_kv arr[i32,60]      = [16, 16, 16, 16, 16, 4, 16, 16, 16, 1...
llama_model_loader: - kv  18:                      gemma4.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  19:                  gemma4.rope.freq_base_swa f32              = 10000.000000
llama_model_loader: - kv  20:    gemma4.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  21:                gemma4.attention.key_length u32              = 512
llama_model_loader: - kv  22:              gemma4.attention.value_length u32              = 512
llama_model_loader: - kv  23:             gemma4.final_logit_softcapping f32              = 30.000000
llama_model_loader: - kv  24:            gemma4.attention.sliding_window u32              = 1024
llama_model_loader: - kv  25:          gemma4.attention.shared_kv_layers u32              = 0
llama_model_loader: - kv  26:    gemma4.embedding_length_per_layer_input u32              = 0
llama_model_loader: - kv  27:    gemma4.attention.sliding_window_pattern arr[bool,60]     = [true, true, true, true, true, false,...
llama_model_loader: - kv  28:            gemma4.attention.key_length_swa u32              = 256
llama_model_loader: - kv  29:          gemma4.attention.value_length_swa u32              = 256
llama_model_loader: - kv  30:                gemma4.rope.dimension_count u32              = 512
llama_model_loader: - kv  31:            gemma4.rope.dimension_count_swa u32              = 256
llama_model_loader: - kv  32:                       tokenizer.ggml.model str              = gemma4
llama_model_loader: - kv  33:                      tokenizer.ggml.tokens arr[str,262144]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  34:                      tokenizer.ggml.scores arr[f32,262144]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  35:                  tokenizer.ggml.token_type arr[i32,262144]  = [3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  36:                      tokenizer.ggml.merges arr[str,514906]  = ["\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n", ...
llama_model_loader: - kv  37:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  38:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  39:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  40:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  41:               tokenizer.ggml.mask_token_id u32              = 4
llama_model_loader: - kv  42:                    tokenizer.chat_template str              = {%- macro format_parameters(propertie...
llama_model_loader: - kv  43:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  44:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  45:               general.quantization_version u32              = 2
llama_model_loader: - kv  46:                          general.file_type u32              = 7
llama_model_loader: - kv  47:                      quantize.imatrix.file str              = /models_out/gemma-4-31B-it-GGUF/googl...
llama_model_loader: - kv  48:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav5.txt
llama_model_loader: - kv  49:             quantize.imatrix.entries_count u32              = 410
llama_model_loader: - kv  50:              quantize.imatrix.chunks_count u32              = 886
llama_model_loader: - type  f32:  422 tensors
llama_model_loader: - type q8_0:  411 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 30.38 GiB (8.50 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: override 'tokenizer.ggml.add_bos_token' to 'true' for Gemma4
load: 0 unused tokens
load: control token: 258884 '<|video|>' is not marked as EOG
load: control token: 255999 '<|image>' is not marked as EOG
load: control token: 258882 '<image|>' is not marked as EOG
load: control token: 258883 '<audio|>' is not marked as EOG
load: control token:     98 '<|think|>' is not marked as EOG
load: control token:    105 '<|turn>' is not marked as EOG
load: control token: 258880 '<|image|>' is not marked as EOG
load: control token:      2 '<bos>' is not marked as EOG
load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
load: control token:      0 '<pad>' is not marked as EOG
load: control-looking token:     50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
load: control token:     46 '<|tool>' is not marked as EOG
load: control token:     47 '<tool|>' is not marked as EOG
load: control token: 256000 '<|audio>' is not marked as EOG
load: control token:      3 '<unk>' is not marked as EOG
load: control token: 258881 '<|audio|>' is not marked as EOG
load: control token:      4 '<mask>' is not marked as EOG
load: printing all EOG tokens:
load:   - 1 ('<eos>')
load:   - 50 ('<|tool_response>')
load:   - 106 ('<turn|>')
load:   - 212 ('</s>')
load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
load: special tokens cache size = 24
load: token to piece cache size = 1.9445 MB
print_info: arch                  = gemma4
print_info: vocab_only            = 0
print_info: no_alloc              = 1
print_info: n_ctx_train           = 262144
print_info: n_embd                = 5376
print_info: n_embd_inp            = 5376
print_info: n_layer               = 60
print_info: n_head                = 32
print_info: n_head_kv             = [16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4]
print_info: n_rot                 = 512
print_info: n_swa                 = 1024
print_info: is_swa_any            = 1
print_info: n_embd_head_k         = 512
print_info: n_embd_head_v         = 512
print_info: n_gqa                 = [2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8]
print_info: n_embd_k_gqa          = [4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048]
print_info: n_embd_v_gqa          = [4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048]
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-06
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 1.0e+00
print_info: n_ff                  = 21504
print_info: n_expert              = 0
print_info: n_expert_used         = 0
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = -1
print_info: rope type             = 2
print_info: rope scaling          = linear
print_info: freq_base_train       = 1000000.0
print_info: freq_scale_train      = 1
print_info: freq_base_swa         = 10000.0
print_info: freq_scale_swa        = 1
print_info: n_embd_head_k_swa     = 256
print_info: n_embd_head_v_swa     = 256
print_info: n_rot_swa             = 256
print_info: n_ctx_orig_yarn       = 262144
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: model type            = 31B
print_info: model params          = 30.70 B
print_info: general.name          = Gemma 4 31B It
print_info: vocab type            = BPE
print_info: n_vocab               = 262144
print_info: n_merges              = 514906
print_info: BOS token             = 2 '<bos>'
print_info: EOS token             = 1 '<eos>'
print_info: UNK token             = 3 '<unk>'
print_info: PAD token             = 0 '<pad>'
print_info: MASK token            = 4 '<mask>'
print_info: LF token              = 107 '
'
print_info: EOG token             = 1 '<eos>'
print_info: EOG token             = 50 '<|tool_response>'
print_info: EOG token             = 106 '<turn|>'
print_info: max token length      = 93
load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
load_tensors: layer   0 assigned to device CUDA0, is_swa = 1
load_tensors: layer   1 assigned to device CUDA0, is_swa = 1
load_tensors: layer   2 assigned to device CUDA0, is_swa = 1
load_tensors: layer   3 assigned to device CUDA0, is_swa = 1
load_tensors: layer   4 assigned to device CUDA0, is_swa = 1
load_tensors: layer   5 assigned to device CUDA0, is_swa = 0
load_tensors: layer   6 assigned to device CUDA0, is_swa = 1
load_tensors: layer   7 assigned to device CUDA0, is_swa = 1
load_tensors: layer   8 assigned to device CUDA0, is_swa = 1
load_tensors: layer   9 assigned to device CUDA0, is_swa = 1
load_tensors: layer  10 assigned to device CUDA0, is_swa = 1
load_tensors: layer  11 assigned to device CUDA0, is_swa = 0
load_tensors: layer  12 assigned to device CUDA0, is_swa = 1
load_tensors: layer  13 assigned to device CUDA0, is_swa = 1
load_tensors: layer  14 assigned to device CUDA0, is_swa = 1
load_tensors: layer  15 assigned to device CUDA0, is_swa = 1
load_tensors: layer  16 assigned to device CUDA0, is_swa = 1
load_tensors: layer  17 assigned to device CUDA0, is_swa = 0
load_tensors: layer  18 assigned to device CUDA0, is_swa = 1
load_tensors: layer  19 assigned to device CUDA0, is_swa = 1
load_tensors: layer  20 assigned to device CUDA0, is_swa = 1
load_tensors: layer  21 assigned to device CUDA0, is_swa = 1
load_tensors: layer  22 assigned to device CUDA0, is_swa = 1
load_tensors: layer  23 assigned to device CUDA0, is_swa = 0
load_tensors: layer  24 assigned to device CUDA0, is_swa = 1
load_tensors: layer  25 assigned to device CUDA0, is_swa = 1
load_tensors: layer  26 assigned to device CUDA0, is_swa = 1
load_tensors: layer  27 assigned to device CUDA0, is_swa = 1
load_tensors: layer  28 assigned to device CUDA0, is_swa = 1
load_tensors: layer  29 assigned to device CUDA0, is_swa = 0
load_tensors: layer  30 assigned to device CUDA0, is_swa = 1
load_tensors: layer  31 assigned to device CUDA0, is_swa = 1
load_tensors: layer  32 assigned to device CUDA0, is_swa = 1
load_tensors: layer  33 assigned to device CUDA0, is_swa = 1
load_tensors: layer  34 assigned to device CUDA0, is_swa = 1
load_tensors: layer  35 assigned to device CUDA0, is_swa = 0
load_tensors: layer  36 assigned to device CUDA0, is_swa = 1
load_tensors: layer  37 assigned to device CUDA1, is_swa = 1
load_tensors: layer  38 assigned to device CUDA1, is_swa = 1
load_tensors: layer  39 assigned to device CUDA1, is_swa = 1
load_tensors: layer  40 assigned to device CUDA1, is_swa = 1
load_tensors: layer  41 assigned to device CUDA1, is_swa = 0
load_tensors: layer  42 assigned to device CUDA1, is_swa = 1
load_tensors: layer  43 assigned to device CUDA1, is_swa = 1
load_tensors: layer  44 assigned to device CUDA1, is_swa = 1
load_tensors: layer  45 assigned to device CUDA1, is_swa = 1
load_tensors: layer  46 assigned to device CUDA1, is_swa = 1
load_tensors: layer  47 assigned to device CUDA1, is_swa = 0
load_tensors: layer  48 assigned to device CUDA1, is_swa = 1
load_tensors: layer  49 assigned to device CUDA1, is_swa = 1
load_tensors: layer  50 assigned to device CUDA1, is_swa = 1
load_tensors: layer  51 assigned to device CUDA1, is_swa = 1
load_tensors: layer  52 assigned to device CUDA1, is_swa = 1
load_tensors: layer  53 assigned to device CUDA1, is_swa = 0
load_tensors: layer  54 assigned to device CUDA1, is_swa = 1
load_tensors: layer  55 assigned to device CUDA1, is_swa = 1
load_tensors: layer  56 assigned to device CUDA1, is_swa = 1
load_tensors: layer  57 assigned to device CUDA1, is_swa = 1
load_tensors: layer  58 assigned to device CUDA1, is_swa = 1
load_tensors: layer  59 assigned to device CUDA1, is_swa = 0
load_tensors: layer  60 assigned to device CUDA1, is_swa = 0
create_tensor: loading tensor token_embd.weight
create_tensor: loading tensor token_embd.weight
create_tensor: loading tensor output_norm.weight
create_tensor: loading tensor blk.0.attn_norm.weight
create_tensor: loading tensor blk.0.attn_q.weight
create_tensor: loading tensor blk.0.attn_k.weight
create_tensor: loading tensor blk.0.attn_v.weight
create_tensor: loading tensor blk.0.attn_output.weight
create_tensor: loading tensor blk.0.attn_q_norm.weight
create_tensor: loading tensor blk.0.attn_k_norm.weight
create_tensor: loading tensor blk.0.post_attention_norm.weight
create_tensor: loading tensor blk.0.layer_output_scale.weight
create_tensor: loading tensor blk.0.ffn_norm.weight
create_tensor: loading tensor blk.0.ffn_gate.weight
create_tensor: loading tensor blk.0.ffn_up.weight
create_tensor: loading tensor blk.0.ffn_down.weight
create_tensor: loading tensor blk.0.post_ffw_norm.weight
create_tensor: loading tensor blk.1.attn_norm.weight
create_tensor: loading tensor blk.1.attn_q.weight
create_tensor: loading tensor blk.1.attn_k.weight
create_tensor: loading tensor blk.1.attn_v.weight
create_tensor: loading tensor blk.1.attn_output.weight
create_tensor: loading tensor blk.1.attn_q_norm.weight
create_tensor: loading tensor blk.1.attn_k_norm.weight
create_tensor: loading tensor blk.1.post_attention_norm.weight
create_tensor: loading tensor blk.1.layer_output_scale.weight
create_tensor: loading tensor blk.1.ffn_norm.weight
create_tensor: loading tensor blk.1.ffn_gate.weight
create_tensor: loading tensor blk.1.ffn_up.weight
create_tensor: loading tensor blk.1.ffn_down.weight
create_tensor: loading tensor blk.1.post_ffw_norm.weight
create_tensor: loading tensor blk.2.attn_norm.weight
create_tensor: loading tensor blk.2.attn_q.weight
create_tensor: loading tensor blk.2.attn_k.weight
create_tensor: loading tensor blk.2.attn_v.weight
create_tensor: loading tensor blk.2.attn_output.weight
create_tensor: loading tensor blk.2.attn_q_norm.weight
create_tensor: loading tensor blk.2.attn_k_norm.weight
create_tensor: loading tensor blk.2.post_attention_norm.weight
create_tensor: loading tensor blk.2.layer_output_scale.weight
create_tensor: loading tensor blk.2.ffn_norm.weight
create_tensor: loading tensor blk.2.ffn_gate.weight
create_tensor: loading tensor blk.2.ffn_up.weight
create_tensor: loading tensor blk.2.ffn_down.weight
create_tensor: loading tensor blk.2.post_ffw_norm.weight
create_tensor: loading tensor blk.3.attn_norm.weight
create_tensor: loading tensor blk.3.attn_q.weight
create_tensor: loading tensor blk.3.attn_k.weight
create_tensor: loading tensor blk.3.attn_v.weight
create_tensor: loading tensor blk.3.attn_output.weight
create_tensor: loading tensor blk.3.attn_q_norm.weight
create_tensor: loading tensor blk.3.attn_k_norm.weight
create_tensor: loading tensor blk.3.post_attention_norm.weight
create_tensor: loading tensor blk.3.layer_output_scale.weight
create_tensor: loading tensor blk.3.ffn_norm.weight
create_tensor: loading tensor blk.3.ffn_gate.weight
create_tensor: loading tensor blk.3.ffn_up.weight
create_tensor: loading tensor blk.3.ffn_down.weight
create_tensor: loading tensor blk.3.post_ffw_norm.weight
create_tensor: loading tensor blk.4.attn_norm.weight
create_tensor: loading tensor blk.4.attn_q.weight
create_tensor: loading tensor blk.4.attn_k.weight
create_tensor: loading tensor blk.4.attn_v.weight
create_tensor: loading tensor blk.4.attn_output.weight
create_tensor: loading tensor blk.4.attn_q_norm.weight
create_tensor: loading tensor blk.4.attn_k_norm.weight
create_tensor: loading tensor blk.4.post_attention_norm.weight
create_tensor: loading tensor blk.4.layer_output_scale.weight
create_tensor: loading tensor blk.4.ffn_norm.weight
create_tensor: loading tensor blk.4.ffn_gate.weight
create_tensor: loading tensor blk.4.ffn_up.weight
create_tensor: loading tensor blk.4.ffn_down.weight
create_tensor: loading tensor blk.4.post_ffw_norm.weight
create_tensor: loading tensor blk.5.attn_norm.weight
create_tensor: loading tensor blk.5.attn_q.weight
create_tensor: loading tensor blk.5.attn_k.weight
create_tensor: loading tensor blk.5.attn_output.weight
create_tensor: loading tensor blk.5.attn_q_norm.weight
create_tensor: loading tensor blk.5.attn_k_norm.weight
create_tensor: loading tensor blk.5.post_attention_norm.weight
create_tensor: loading tensor blk.5.layer_output_scale.weight
create_tensor: loading tensor rope_freqs.weight
create_tensor: loading tensor blk.5.ffn_norm.weight
create_tensor: loading tensor blk.5.ffn_gate.weight
create_tensor: loading tensor blk.5.ffn_up.weight
create_tensor: loading tensor blk.5.ffn_down.weight
create_tensor: loading tensor blk.5.post_ffw_norm.weight
create_tensor: loading tensor blk.6.attn_norm.weight
create_tensor: loading tensor blk.6.attn_q.weight
create_tensor: loading tensor blk.6.attn_k.weight
create_tensor: loading tensor blk.6.attn_v.weight
create_tensor: loading tensor blk.6.attn_output.weight
create_tensor: loading tensor blk.6.attn_q_norm.weight
create_tensor: loading tensor blk.6.attn_k_norm.weight
create_tensor: loading tensor blk.6.post_attention_norm.weight
create_tensor: loading tensor blk.6.layer_output_scale.weight
create_tensor: loading tensor blk.6.ffn_norm.weight
create_tensor: loading tensor blk.6.ffn_gate.weight
create_tensor: loading tensor blk.6.ffn_up.weight
create_tensor: loading tensor blk.6.ffn_down.weight
create_tensor: loading tensor blk.6.post_ffw_norm.weight
create_tensor: loading tensor blk.7.attn_norm.weight
create_tensor: loading tensor blk.7.attn_q.weight
create_tensor: loading tensor blk.7.attn_k.weight
create_tensor: loading tensor blk.7.attn_v.weight
create_tensor: loading tensor blk.7.attn_output.weight
create_tensor: loading tensor blk.7.attn_q_norm.weight
create_tensor: loading tensor blk.7.attn_k_norm.weight
create_tensor: loading tensor blk.7.post_attention_norm.weight
create_tensor: loading tensor blk.7.layer_output_scale.weight
create_tensor: loading tensor blk.7.ffn_norm.weight
create_tensor: loading tensor blk.7.ffn_gate.weight
create_tensor: loading tensor blk.7.ffn_up.weight
create_tensor: loading tensor blk.7.ffn_down.weight
create_tensor: loading tensor blk.7.post_ffw_norm.weight
create_tensor: loading tensor blk.8.attn_norm.weight
create_tensor: loading tensor blk.8.attn_q.weight
create_tensor: loading tensor blk.8.attn_k.weight
create_tensor: loading tensor blk.8.attn_v.weight
create_tensor: loading tensor blk.8.attn_output.weight
create_tensor: loading tensor blk.8.attn_q_norm.weight
create_tensor: loading tensor blk.8.attn_k_norm.weight
create_tensor: loading tensor blk.8.post_attention_norm.weight
create_tensor: loading tensor blk.8.layer_output_scale.weight
create_tensor: loading tensor blk.8.ffn_norm.weight
create_tensor: loading tensor blk.8.ffn_gate.weight
create_tensor: loading tensor blk.8.ffn_up.weight
create_tensor: loading tensor blk.8.ffn_down.weight
create_tensor: loading tensor blk.8.post_ffw_norm.weight
create_tensor: loading tensor blk.9.attn_norm.weight
create_tensor: loading tensor blk.9.attn_q.weight
create_tensor: loading tensor blk.9.attn_k.weight
create_tensor: loading tensor blk.9.attn_v.weight
create_tensor: loading tensor blk.9.attn_output.weight
create_tensor: loading tensor blk.9.attn_q_norm.weight
create_tensor: loading tensor blk.9.attn_k_norm.weight
create_tensor: loading tensor blk.9.post_attention_norm.weight
create_tensor: loading tensor blk.9.layer_output_scale.weight
create_tensor: loading tensor blk.9.ffn_norm.weight
create_tensor: loading tensor blk.9.ffn_gate.weight
create_tensor: loading tensor blk.9.ffn_up.weight
create_tensor: loading tensor blk.9.ffn_down.weight
create_tensor: loading tensor blk.9.post_ffw_norm.weight
create_tensor: loading tensor blk.10.attn_norm.weight
create_tensor: loading tensor blk.10.attn_q.weight
create_tensor: loading tensor blk.10.attn_k.weight
create_tensor: loading tensor blk.10.attn_v.weight
create_tensor: loading tensor blk.10.attn_output.weight
create_tensor: loading tensor blk.10.attn_q_norm.weight
create_tensor: loading tensor blk.10.attn_k_norm.weight
create_tensor: loading tensor blk.10.post_attention_norm.weight
create_tensor: loading tensor blk.10.layer_output_scale.weight
create_tensor: loading tensor blk.10.ffn_norm.weight
create_tensor: loading tensor blk.10.ffn_gate.weight
create_tensor: loading tensor blk.10.ffn_up.weight
create_tensor: loading tensor blk.10.ffn_down.weight
create_tensor: loading tensor blk.10.post_ffw_norm.weight
create_tensor: loading tensor blk.11.attn_norm.weight
create_tensor: loading tensor blk.11.attn_q.weight
create_tensor: loading tensor blk.11.attn_k.weight
create_tensor: loading tensor blk.11.attn_output.weight
create_tensor: loading tensor blk.11.attn_q_norm.weight
create_tensor: loading tensor blk.11.attn_k_norm.weight
create_tensor: loading tensor blk.11.post_attention_norm.weight
create_tensor: loading tensor blk.11.layer_output_scale.weight
create_tensor: loading tensor blk.11.ffn_norm.weight
create_tensor: loading tensor blk.11.ffn_gate.weight
create_tensor: loading tensor blk.11.ffn_up.weight
create_tensor: loading tensor blk.11.ffn_down.weight
create_tensor: loading tensor blk.11.post_ffw_norm.weight
create_tensor: loading tensor blk.12.attn_norm.weight
create_tensor: loading tensor blk.12.attn_q.weight
create_tensor: loading tensor blk.12.attn_k.weight
create_tensor: loading tensor blk.12.attn_v.weight
create_tensor: loading tensor blk.12.attn_output.weight
create_tensor: loading tensor blk.12.attn_q_norm.weight
create_tensor: loading tensor blk.12.attn_k_norm.weight
create_tensor: loading tensor blk.12.post_attention_norm.weight
create_tensor: loading tensor blk.12.layer_output_scale.weight
create_tensor: loading tensor blk.12.ffn_norm.weight
create_tensor: loading tensor blk.12.ffn_gate.weight
create_tensor: loading tensor blk.12.ffn_up.weight
create_tensor: loading tensor blk.12.ffn_down.weight
create_tensor: loading tensor blk.12.post_ffw_norm.weight
create_tensor: loading tensor blk.13.attn_norm.weight
create_tensor: loading tensor blk.13.attn_q.weight
create_tensor: loading tensor blk.13.attn_k.weight
create_tensor: loading tensor blk.13.attn_v.weight
create_tensor: loading tensor blk.13.attn_output.weight
create_tensor: loading tensor blk.13.attn_q_norm.weight
create_tensor: loading tensor blk.13.attn_k_norm.weight
create_tensor: loading tensor blk.13.post_attention_norm.weight
create_tensor: loading tensor blk.13.layer_output_scale.weight
create_tensor: loading tensor blk.13.ffn_norm.weight
create_tensor: loading tensor blk.13.ffn_gate.weight
create_tensor: loading tensor blk.13.ffn_up.weight
create_tensor: loading tensor blk.13.ffn_down.weight
create_tensor: loading tensor blk.13.post_ffw_norm.weight
create_tensor: loading tensor blk.14.attn_norm.weight
create_tensor: loading tensor blk.14.attn_q.weight
create_tensor: loading tensor blk.14.attn_k.weight
create_tensor: loading tensor blk.14.attn_v.weight
create_tensor: loading tensor blk.14.attn_output.weight
create_tensor: loading tensor blk.14.attn_q_norm.weight
create_tensor: loading tensor blk.14.attn_k_norm.weight
create_tensor: loading tensor blk.14.post_attention_norm.weight
create_tensor: loading tensor blk.14.layer_output_scale.weight
create_tensor: loading tensor blk.14.ffn_norm.weight
create_tensor: loading tensor blk.14.ffn_gate.weight
create_tensor: loading tensor blk.14.ffn_up.weight
create_tensor: loading tensor blk.14.ffn_down.weight
create_tensor: loading tensor blk.14.post_ffw_norm.weight
create_tensor: loading tensor blk.15.attn_norm.weight
create_tensor: loading tensor blk.15.attn_q.weight
create_tensor: loading tensor blk.15.attn_k.weight
create_tensor: loading tensor blk.15.attn_v.weight
create_tensor: loading tensor blk.15.attn_output.weight
create_tensor: loading tensor blk.15.attn_q_norm.weight
create_tensor: loading tensor blk.15.attn_k_norm.weight
create_tensor: loading tensor blk.15.post_attention_norm.weight
create_tensor: loading tensor blk.15.layer_output_scale.weight
create_tensor: loading tensor blk.15.ffn_norm.weight
create_tensor: loading tensor blk.15.ffn_gate.weight
create_tensor: loading tensor blk.15.ffn_up.weight
create_tensor: loading tensor blk.15.ffn_down.weight
create_tensor: loading tensor blk.15.post_ffw_norm.weight
create_tensor: loading tensor blk.16.attn_norm.weight
create_tensor: loading tensor blk.16.attn_q.weight
create_tensor: loading tensor blk.16.attn_k.weight
create_tensor: loading tensor blk.16.attn_v.weight
create_tensor: loading tensor blk.16.attn_output.weight
create_tensor: loading tensor blk.16.attn_q_norm.weight
create_tensor: loading tensor blk.16.attn_k_norm.weight
create_tensor: loading tensor blk.16.post_attention_norm.weight
create_tensor: loading tensor blk.16.layer_output_scale.weight
create_tensor: loading tensor blk.16.ffn_norm.weight
create_tensor: loading tensor blk.16.ffn_gate.weight
create_tensor: loading tensor blk.16.ffn_up.weight
create_tensor: loading tensor blk.16.ffn_down.weight
create_tensor: loading tensor blk.16.post_ffw_norm.weight
create_tensor: loading tensor blk.17.attn_norm.weight
create_tensor: loading tensor blk.17.attn_q.weight
create_tensor: loading tensor blk.17.attn_k.weight
create_tensor: loading tensor blk.17.attn_output.weight
create_tensor: loading tensor blk.17.attn_q_norm.weight
create_tensor: loading tensor blk.17.attn_k_norm.weight
create_tensor: loading tensor blk.17.post_attention_norm.weight
create_tensor: loading tensor blk.17.layer_output_scale.weight
create_tensor: loading tensor blk.17.ffn_norm.weight
create_tensor: loading tensor blk.17.ffn_gate.weight
create_tensor: loading tensor blk.17.ffn_up.weight
create_tensor: loading tensor blk.17.ffn_down.weight
create_tensor: loading tensor blk.17.post_ffw_norm.weight
create_tensor: loading tensor blk.18.attn_norm.weight
create_tensor: loading tensor blk.18.attn_q.weight
create_tensor: loading tensor blk.18.attn_k.weight
create_tensor: loading tensor blk.18.attn_v.weight
create_tensor: loading tensor blk.18.attn_output.weight
create_tensor: loading tensor blk.18.attn_q_norm.weight
create_tensor: loading tensor blk.18.attn_k_norm.weight
create_tensor: loading tensor blk.18.post_attention_norm.weight
create_tensor: loading tensor blk.18.layer_output_scale.weight
create_tensor: loading tensor blk.18.ffn_norm.weight
create_tensor: loading tensor blk.18.ffn_gate.weight
create_tensor: loading tensor blk.18.ffn_up.weight
create_tensor: loading tensor blk.18.ffn_down.weight
create_tensor: loading tensor blk.18.post_ffw_norm.weight
create_tensor: loading tensor blk.19.attn_norm.weight
create_tensor: loading tensor blk.19.attn_q.weight
create_tensor: loading tensor blk.19.attn_k.weight
create_tensor: loading tensor blk.19.attn_v.weight
create_tensor: loading tensor blk.19.attn_output.weight
create_tensor: loading tensor blk.19.attn_q_norm.weight
create_tensor: loading tensor blk.19.attn_k_norm.weight
create_tensor: loading tensor blk.19.post_attention_norm.weight
create_tensor: loading tensor blk.19.layer_output_scale.weight
create_tensor: loading tensor blk.19.ffn_norm.weight
create_tensor: loading tensor blk.19.ffn_gate.weight
create_tensor: loading tensor blk.19.ffn_up.weight
create_tensor: loading tensor blk.19.ffn_down.weight
create_tensor: loading tensor blk.19.post_ffw_norm.weight
create_tensor: loading tensor blk.20.attn_norm.weight
create_tensor: loading tensor blk.20.attn_q.weight
create_tensor: loading tensor blk.20.attn_k.weight
create_tensor: loading tensor blk.20.attn_v.weight
create_tensor: loading tensor blk.20.attn_output.weight
create_tensor: loading tensor blk.20.attn_q_norm.weight
create_tensor: loading tensor blk.20.attn_k_norm.weight
create_tensor: loading tensor blk.20.post_attention_norm.weight
create_tensor: loading tensor blk.20.layer_output_scale.weight
create_tensor: loading tensor blk.20.ffn_norm.weight
create_tensor: loading tensor blk.20.ffn_gate.weight
create_tensor: loading tensor blk.20.ffn_up.weight
create_tensor: loading tensor blk.20.ffn_down.weight
create_tensor: loading tensor blk.20.post_ffw_norm.weight
create_tensor: loading tensor blk.21.attn_norm.weight
create_tensor: loading tensor blk.21.attn_q.weight
create_tensor: loading tensor blk.21.attn_k.weight
create_tensor: loading tensor blk.21.attn_v.weight
create_tensor: loading tensor blk.21.attn_output.weight
create_tensor: loading tensor blk.21.attn_q_norm.weight
create_tensor: loading tensor blk.21.attn_k_norm.weight
create_tensor: loading tensor blk.21.post_attention_norm.weight
create_tensor: loading tensor blk.21.layer_output_scale.weight
create_tensor: loading tensor blk.21.ffn_norm.weight
create_tensor: loading tensor blk.21.ffn_gate.weight
create_tensor: loading tensor blk.21.ffn_up.weight
create_tensor: loading tensor blk.21.ffn_down.weight
create_tensor: loading tensor blk.21.post_ffw_norm.weight
create_tensor: loading tensor blk.22.attn_norm.weight
create_tensor: loading tensor blk.22.attn_q.weight
create_tensor: loading tensor blk.22.attn_k.weight
create_tensor: loading tensor blk.22.attn_v.weight
create_tensor: loading tensor blk.22.attn_output.weight
create_tensor: loading tensor blk.22.attn_q_norm.weight
create_tensor: loading tensor blk.22.attn_k_norm.weight
create_tensor: loading tensor blk.22.post_attention_norm.weight
create_tensor: loading tensor blk.22.layer_output_scale.weight
create_tensor: loading tensor blk.22.ffn_norm.weight
create_tensor: loading tensor blk.22.ffn_gate.weight
create_tensor: loading tensor blk.22.ffn_up.weight
create_tensor: loading tensor blk.22.ffn_down.weight
create_tensor: loading tensor blk.22.post_ffw_norm.weight
create_tensor: loading tensor blk.23.attn_norm.weight
create_tensor: loading tensor blk.23.attn_q.weight
create_tensor: loading tensor blk.23.attn_k.weight
create_tensor: loading tensor blk.23.attn_output.weight
create_tensor: loading tensor blk.23.attn_q_norm.weight
create_tensor: loading tensor blk.23.attn_k_norm.weight
create_tensor: loading tensor blk.23.post_attention_norm.weight
create_tensor: loading tensor blk.23.layer_output_scale.weight
create_tensor: loading tensor blk.23.ffn_norm.weight
create_tensor: loading tensor blk.23.ffn_gate.weight
create_tensor: loading tensor blk.23.ffn_up.weight
create_tensor: loading tensor blk.23.ffn_down.weight
create_tensor: loading tensor blk.23.post_ffw_norm.weight
create_tensor: loading tensor blk.24.attn_norm.weight
create_tensor: loading tensor blk.24.attn_q.weight
create_tensor: loading tensor blk.24.attn_k.weight
create_tensor: loading tensor blk.24.attn_v.weight
create_tensor: loading tensor blk.24.attn_output.weight
create_tensor: loading tensor blk.24.attn_q_norm.weight
create_tensor: loading tensor blk.24.attn_k_norm.weight
create_tensor: loading tensor blk.24.post_attention_norm.weight
create_tensor: loading tensor blk.24.layer_output_scale.weight
create_tensor: loading tensor blk.24.ffn_norm.weight
create_tensor: loading tensor blk.24.ffn_gate.weight
create_tensor: loading tensor blk.24.ffn_up.weight
create_tensor: loading tensor blk.24.ffn_down.weight
create_tensor: loading tensor blk.24.post_ffw_norm.weight
create_tensor: loading tensor blk.25.attn_norm.weight
create_tensor: loading tensor blk.25.attn_q.weight
create_tensor: loading tensor blk.25.attn_k.weight
create_tensor: loading tensor blk.25.attn_v.weight
create_tensor: loading tensor blk.25.attn_output.weight
create_tensor: loading tensor blk.25.attn_q_norm.weight
create_tensor: loading tensor blk.25.attn_k_norm.weight
create_tensor: loading tensor blk.25.post_attention_norm.weight
create_tensor: loading tensor blk.25.layer_output_scale.weight
create_tensor: loading tensor blk.25.ffn_norm.weight
create_tensor: loading tensor blk.25.ffn_gate.weight
create_tensor: loading tensor blk.25.ffn_up.weight
create_tensor: loading tensor blk.25.ffn_down.weight
create_tensor: loading tensor blk.25.post_ffw_norm.weight
create_tensor: loading tensor blk.26.attn_norm.weight
create_tensor: loading tensor blk.26.attn_q.weight
create_tensor: loading tensor blk.26.attn_k.weight
create_tensor: loading tensor blk.26.attn_v.weight
create_tensor: loading tensor blk.26.attn_output.weight
create_tensor: loading tensor blk.26.attn_q_norm.weight
create_tensor: loading tensor blk.26.attn_k_norm.weight
create_tensor: loading tensor blk.26.post_attention_norm.weight
create_tensor: loading tensor blk.26.layer_output_scale.weight
create_tensor: loading tensor blk.26.ffn_norm.weight
create_tensor: loading tensor blk.26.ffn_gate.weight
create_tensor: loading tensor blk.26.ffn_up.weight
create_tensor: loading tensor blk.26.ffn_down.weight
create_tensor: loading tensor blk.26.post_ffw_norm.weight
create_tensor: loading tensor blk.27.attn_norm.weight
create_tensor: loading tensor blk.27.attn_q.weight
create_tensor: loading tensor blk.27.attn_k.weight
create_tensor: loading tensor blk.27.attn_v.weight
create_tensor: loading tensor blk.27.attn_output.weight
create_tensor: loading tensor blk.27.attn_q_norm.weight
create_tensor: loading tensor blk.27.attn_k_norm.weight
create_tensor: loading tensor blk.27.post_attention_norm.weight
create_tensor: loading tensor blk.27.layer_output_scale.weight
create_tensor: loading tensor blk.27.ffn_norm.weight
create_tensor: loading tensor blk.27.ffn_gate.weight
create_tensor: loading tensor blk.27.ffn_up.weight
create_tensor: loading tensor blk.27.ffn_down.weight
create_tensor: loading tensor blk.27.post_ffw_norm.weight
create_tensor: loading tensor blk.28.attn_norm.weight
create_tensor: loading tensor blk.28.attn_q.weight
create_tensor: loading tensor blk.28.attn_k.weight
create_tensor: loading tensor blk.28.attn_v.weight
create_tensor: loading tensor blk.28.attn_output.weight
create_tensor: loading tensor blk.28.attn_q_norm.weight
create_tensor: loading tensor blk.28.attn_k_norm.weight
create_tensor: loading tensor blk.28.post_attention_norm.weight
create_tensor: loading tensor blk.28.layer_output_scale.weight
create_tensor: loading tensor blk.28.ffn_norm.weight
create_tensor: loading tensor blk.28.ffn_gate.weight
create_tensor: loading tensor blk.28.ffn_up.weight
create_tensor: loading tensor blk.28.ffn_down.weight
create_tensor: loading tensor blk.28.post_ffw_norm.weight
create_tensor: loading tensor blk.29.attn_norm.weight
create_tensor: loading tensor blk.29.attn_q.weight
create_tensor: loading tensor blk.29.attn_k.weight
create_tensor: loading tensor blk.29.attn_output.weight
create_tensor: loading tensor blk.29.attn_q_norm.weight
create_tensor: loading tensor blk.29.attn_k_norm.weight
create_tensor: loading tensor blk.29.post_attention_norm.weight
create_tensor: loading tensor blk.29.layer_output_scale.weight
create_tensor: loading tensor blk.29.ffn_norm.weight
create_tensor: loading tensor blk.29.ffn_gate.weight
create_tensor: loading tensor blk.29.ffn_up.weight
create_tensor: loading tensor blk.29.ffn_down.weight
create_tensor: loading tensor blk.29.post_ffw_norm.weight
create_tensor: loading tensor blk.30.attn_norm.weight
create_tensor: loading tensor blk.30.attn_q.weight
create_tensor: loading tensor blk.30.attn_k.weight
create_tensor: loading tensor blk.30.attn_v.weight
create_tensor: loading tensor blk.30.attn_output.weight
create_tensor: loading tensor blk.30.attn_q_norm.weight
create_tensor: loading tensor blk.30.attn_k_norm.weight
create_tensor: loading tensor blk.30.post_attention_norm.weight
create_tensor: loading tensor blk.30.layer_output_scale.weight
create_tensor: loading tensor blk.30.ffn_norm.weight
create_tensor: loading tensor blk.30.ffn_gate.weight
create_tensor: loading tensor blk.30.ffn_up.weight
create_tensor: loading tensor blk.30.ffn_down.weight
create_tensor: loading tensor blk.30.post_ffw_norm.weight
create_tensor: loading tensor blk.31.attn_norm.weight
create_tensor: loading tensor blk.31.attn_q.weight
create_tensor: loading tensor blk.31.attn_k.weight
create_tensor: loading tensor blk.31.attn_v.weight
create_tensor: loading tensor blk.31.attn_output.weight
create_tensor: loading tensor blk.31.attn_q_norm.weight
create_tensor: loading tensor blk.31.attn_k_norm.weight
create_tensor: loading tensor blk.31.post_attention_norm.weight
create_tensor: loading tensor blk.31.layer_output_scale.weight
create_tensor: loading tensor blk.31.ffn_norm.weight
create_tensor: loading tensor blk.31.ffn_gate.weight
create_tensor: loading tensor blk.31.ffn_up.weight
create_tensor: loading tensor blk.31.ffn_down.weight
create_tensor: loading tensor blk.31.post_ffw_norm.weight
create_tensor: loading tensor blk.32.attn_norm.weight
create_tensor: loading tensor blk.32.attn_q.weight
create_tensor: loading tensor blk.32.attn_k.weight
create_tensor: loading tensor blk.32.attn_v.weight
create_tensor: loading tensor blk.32.attn_output.weight
create_tensor: loading tensor blk.32.attn_q_norm.weight
create_tensor: loading tensor blk.32.attn_k_norm.weight
create_tensor: loading tensor blk.32.post_attention_norm.weight
create_tensor: loading tensor blk.32.layer_output_scale.weight
create_tensor: loading tensor blk.32.ffn_norm.weight
create_tensor: loading tensor blk.32.ffn_gate.weight
create_tensor: loading tensor blk.32.ffn_up.weight
create_tensor: loading tensor blk.32.ffn_down.weight
create_tensor: loading tensor blk.32.post_ffw_norm.weight
create_tensor: loading tensor blk.33.attn_norm.weight
create_tensor: loading tensor blk.33.attn_q.weight
create_tensor: loading tensor blk.33.attn_k.weight
create_tensor: loading tensor blk.33.attn_v.weight
create_tensor: loading tensor blk.33.attn_output.weight
create_tensor: loading tensor blk.33.attn_q_norm.weight
create_tensor: loading tensor blk.33.attn_k_norm.weight
create_tensor: loading tensor blk.33.post_attention_norm.weight
create_tensor: loading tensor blk.33.layer_output_scale.weight
create_tensor: loading tensor blk.33.ffn_norm.weight
create_tensor: loading tensor blk.33.ffn_gate.weight
create_tensor: loading tensor blk.33.ffn_up.weight
create_tensor: loading tensor blk.33.ffn_down.weight
create_tensor: loading tensor blk.33.post_ffw_norm.weight
create_tensor: loading tensor blk.34.attn_norm.weight
create_tensor: loading tensor blk.34.attn_q.weight
create_tensor: loading tensor blk.34.attn_k.weight
create_tensor: loading tensor blk.34.attn_v.weight
create_tensor: loading tensor blk.34.attn_output.weight
create_tensor: loading tensor blk.34.attn_q_norm.weight
create_tensor: loading tensor blk.34.attn_k_norm.weight
create_tensor: loading tensor blk.34.post_attention_norm.weight
create_tensor: loading tensor blk.34.layer_output_scale.weight
create_tensor: loading tensor blk.34.ffn_norm.weight
create_tensor: loading tensor blk.34.ffn_gate.weight
create_tensor: loading tensor blk.34.ffn_up.weight
create_tensor: loading tensor blk.34.ffn_down.weight
create_tensor: loading tensor blk.34.post_ffw_norm.weight
create_tensor: loading tensor blk.35.attn_norm.weight
create_tensor: loading tensor blk.35.attn_q.weight
create_tensor: loading tensor blk.35.attn_k.weight
create_tensor: loading tensor blk.35.attn_output.weight
create_tensor: loading tensor blk.35.attn_q_norm.weight
create_tensor: loading tensor blk.35.attn_k_norm.weight
create_tensor: loading tensor blk.35.post_attention_norm.weight
create_tensor: loading tensor blk.35.layer_output_scale.weight
create_tensor: loading tensor blk.35.ffn_norm.weight
create_tensor: loading tensor blk.35.ffn_gate.weight
create_tensor: loading tensor blk.35.ffn_up.weight
create_tensor: loading tensor blk.35.ffn_down.weight
create_tensor: loading tensor blk.35.post_ffw_norm.weight
create_tensor: loading tensor blk.36.attn_norm.weight
create_tensor: loading tensor blk.36.attn_q.weight
create_tensor: loading tensor blk.36.attn_k.weight
create_tensor: loading tensor blk.36.attn_v.weight
create_tensor: loading tensor blk.36.attn_output.weight
create_tensor: loading tensor blk.36.attn_q_norm.weight
create_tensor: loading tensor blk.36.attn_k_norm.weight
create_tensor: loading tensor blk.36.post_attention_norm.weight
create_tensor: loading tensor blk.36.layer_output_scale.weight
create_tensor: loading tensor blk.36.ffn_norm.weight
create_tensor: loading tensor blk.36.ffn_gate.weight
create_tensor: loading tensor blk.36.ffn_up.weight
create_tensor: loading tensor blk.36.ffn_down.weight
create_tensor: loading tensor blk.36.post_ffw_norm.weight
create_tensor: loading tensor blk.37.attn_norm.weight
create_tensor: loading tensor blk.37.attn_q.weight
create_tensor: loading tensor blk.37.attn_k.weight
create_tensor: loading tensor blk.37.attn_v.weight
create_tensor: loading tensor blk.37.attn_output.weight
create_tensor: loading tensor blk.37.attn_q_norm.weight
create_tensor: loading tensor blk.37.attn_k_norm.weight
create_tensor: loading tensor blk.37.post_attention_norm.weight
create_tensor: loading tensor blk.37.layer_output_scale.weight
create_tensor: loading tensor blk.37.ffn_norm.weight
create_tensor: loading tensor blk.37.ffn_gate.weight
create_tensor: loading tensor blk.37.ffn_up.weight
create_tensor: loading tensor blk.37.ffn_down.weight
create_tensor: loading tensor blk.37.post_ffw_norm.weight
create_tensor: loading tensor blk.38.attn_norm.weight
create_tensor: loading tensor blk.38.attn_q.weight
create_tensor: loading tensor blk.38.attn_k.weight
create_tensor: loading tensor blk.38.attn_v.weight
create_tensor: loading tensor blk.38.attn_output.weight
create_tensor: loading tensor blk.38.attn_q_norm.weight
create_tensor: loading tensor blk.38.attn_k_norm.weight
create_tensor: loading tensor blk.38.post_attention_norm.weight
create_tensor: loading tensor blk.38.layer_output_scale.weight
create_tensor: loading tensor blk.38.ffn_norm.weight
create_tensor: loading tensor blk.38.ffn_gate.weight
create_tensor: loading tensor blk.38.ffn_up.weight
create_tensor: loading tensor blk.38.ffn_down.weight
create_tensor: loading tensor blk.38.post_ffw_norm.weight
create_tensor: loading tensor blk.39.attn_norm.weight
create_tensor: loading tensor blk.39.attn_q.weight
create_tensor: loading tensor blk.39.attn_k.weight
create_tensor: loading tensor blk.39.attn_v.weight
create_tensor: loading tensor blk.39.attn_output.weight
create_tensor: loading tensor blk.39.attn_q_norm.weight
create_tensor: loading tensor blk.39.attn_k_norm.weight
create_tensor: loading tensor blk.39.post_attention_norm.weight
create_tensor: loading tensor blk.39.layer_output_scale.weight
create_tensor: loading tensor blk.39.ffn_norm.weight
create_tensor: loading tensor blk.39.ffn_gate.weight
create_tensor: loading tensor blk.39.ffn_up.weight
create_tensor: loading tensor blk.39.ffn_down.weight
create_tensor: loading tensor blk.39.post_ffw_norm.weight
create_tensor: loading tensor blk.40.attn_norm.weight
create_tensor: loading tensor blk.40.attn_q.weight
create_tensor: loading tensor blk.40.attn_k.weight
create_tensor: loading tensor blk.40.attn_v.weight
create_tensor: loading tensor blk.40.attn_output.weight
create_tensor: loading tensor blk.40.attn_q_norm.weight
create_tensor: loading tensor blk.40.attn_k_norm.weight
create_tensor: loading tensor blk.40.post_attention_norm.weight
create_tensor: loading tensor blk.40.layer_output_scale.weight
create_tensor: loading tensor blk.40.ffn_norm.weight
create_tensor: loading tensor blk.40.ffn_gate.weight
create_tensor: loading tensor blk.40.ffn_up.weight
create_tensor: loading tensor blk.40.ffn_down.weight
create_tensor: loading tensor blk.40.post_ffw_norm.weight
create_tensor: loading tensor blk.41.attn_norm.weight
create_tensor: loading tensor blk.41.attn_q.weight
create_tensor: loading tensor blk.41.attn_k.weight
create_tensor: loading tensor blk.41.attn_output.weight
create_tensor: loading tensor blk.41.attn_q_norm.weight
create_tensor: loading tensor blk.41.attn_k_norm.weight
create_tensor: loading tensor blk.41.post_attention_norm.weight
create_tensor: loading tensor blk.41.layer_output_scale.weight
create_tensor: loading tensor rope_freqs.weight
create_tensor: loading tensor blk.41.ffn_norm.weight
create_tensor: loading tensor blk.41.ffn_gate.weight
create_tensor: loading tensor blk.41.ffn_up.weight
create_tensor: loading tensor blk.41.ffn_down.weight
create_tensor: loading tensor blk.41.post_ffw_norm.weight
create_tensor: loading tensor blk.42.attn_norm.weight
create_tensor: loading tensor blk.42.attn_q.weight
create_tensor: loading tensor blk.42.attn_k.weight
create_tensor: loading tensor blk.42.attn_v.weight
create_tensor: loading tensor blk.42.attn_output.weight
create_tensor: loading tensor blk.42.attn_q_norm.weight
create_tensor: loading tensor blk.42.attn_k_norm.weight
create_tensor: loading tensor blk.42.post_attention_norm.weight
create_tensor: loading tensor blk.42.layer_output_scale.weight
create_tensor: loading tensor blk.42.ffn_norm.weight
create_tensor: loading tensor blk.42.ffn_gate.weight
create_tensor: loading tensor blk.42.ffn_up.weight
create_tensor: loading tensor blk.42.ffn_down.weight
create_tensor: loading tensor blk.42.post_ffw_norm.weight
create_tensor: loading tensor blk.43.attn_norm.weight
create_tensor: loading tensor blk.43.attn_q.weight
create_tensor: loading tensor blk.43.attn_k.weight
create_tensor: loading tensor blk.43.attn_v.weight
create_tensor: loading tensor blk.43.attn_output.weight
create_tensor: loading tensor blk.43.attn_q_norm.weight
create_tensor: loading tensor blk.43.attn_k_norm.weight
create_tensor: loading tensor blk.43.post_attention_norm.weight
create_tensor: loading tensor blk.43.layer_output_scale.weight
create_tensor: loading tensor blk.43.ffn_norm.weight
create_tensor: loading tensor blk.43.ffn_gate.weight
create_tensor: loading tensor blk.43.ffn_up.weight
create_tensor: loading tensor blk.43.ffn_down.weight
create_tensor: loading tensor blk.43.post_ffw_norm.weight
create_tensor: loading tensor blk.44.attn_norm.weight
create_tensor: loading tensor blk.44.attn_q.weight
create_tensor: loading tensor blk.44.attn_k.weight
create_tensor: loading tensor blk.44.attn_v.weight
create_tensor: loading tensor blk.44.attn_output.weight
create_tensor: loading tensor blk.44.attn_q_norm.weight
create_tensor: loading tensor blk.44.attn_k_norm.weight
create_tensor: loading tensor blk.44.post_attention_norm.weight
create_tensor: loading tensor blk.44.layer_output_scale.weight
create_tensor: loading tensor blk.44.ffn_norm.weight
create_tensor: loading tensor blk.44.ffn_gate.weight
create_tensor: loading tensor blk.44.ffn_up.weight
create_tensor: loading tensor blk.44.ffn_down.weight
create_tensor: loading tensor blk.44.post_ffw_norm.weight
create_tensor: loading tensor blk.45.attn_norm.weight
create_tensor: loading tensor blk.45.attn_q.weight
create_tensor: loading tensor blk.45.attn_k.weight
create_tensor: loading tensor blk.45.attn_v.weight
create_tensor: loading tensor blk.45.attn_output.weight
create_tensor: loading tensor blk.45.attn_q_norm.weight
create_tensor: loading tensor blk.45.attn_k_norm.weight
create_tensor: loading tensor blk.45.post_attention_norm.weight
create_tensor: loading tensor blk.45.layer_output_scale.weight
create_tensor: loading tensor blk.45.ffn_norm.weight
create_tensor: loading tensor blk.45.ffn_gate.weight
create_tensor: loading tensor blk.45.ffn_up.weight
create_tensor: loading tensor blk.45.ffn_down.weight
create_tensor: loading tensor blk.45.post_ffw_norm.weight
create_tensor: loading tensor blk.46.attn_norm.weight
create_tensor: loading tensor blk.46.attn_q.weight
create_tensor: loading tensor blk.46.attn_k.weight
create_tensor: loading tensor blk.46.attn_v.weight
create_tensor: loading tensor blk.46.attn_output.weight
create_tensor: loading tensor blk.46.attn_q_norm.weight
create_tensor: loading tensor blk.46.attn_k_norm.weight
create_tensor: loading tensor blk.46.post_attention_norm.weight
create_tensor: loading tensor blk.46.layer_output_scale.weight
create_tensor: loading tensor blk.46.ffn_norm.weight
create_tensor: loading tensor blk.46.ffn_gate.weight
create_tensor: loading tensor blk.46.ffn_up.weight
create_tensor: loading tensor blk.46.ffn_down.weight
create_tensor: loading tensor blk.46.post_ffw_norm.weight
create_tensor: loading tensor blk.47.attn_norm.weight
create_tensor: loading tensor blk.47.attn_q.weight
create_tensor: loading tensor blk.47.attn_k.weight
create_tensor: loading tensor blk.47.attn_output.weight
create_tensor: loading tensor blk.47.attn_q_norm.weight
create_tensor: loading tensor blk.47.attn_k_norm.weight
create_tensor: loading tensor blk.47.post_attention_norm.weight
create_tensor: loading tensor blk.47.layer_output_scale.weight
create_tensor: loading tensor blk.47.ffn_norm.weight
create_tensor: loading tensor blk.47.ffn_gate.weight
create_tensor: loading tensor blk.47.ffn_up.weight
create_tensor: loading tensor blk.47.ffn_down.weight
create_tensor: loading tensor blk.47.post_ffw_norm.weight
create_tensor: loading tensor blk.48.attn_norm.weight
create_tensor: loading tensor blk.48.attn_q.weight
create_tensor: loading tensor blk.48.attn_k.weight
create_tensor: loading tensor blk.48.attn_v.weight
create_tensor: loading tensor blk.48.attn_output.weight
create_tensor: loading tensor blk.48.attn_q_norm.weight
create_tensor: loading tensor blk.48.attn_k_norm.weight
create_tensor: loading tensor blk.48.post_attention_norm.weight
create_tensor: loading tensor blk.48.layer_output_scale.weight
create_tensor: loading tensor blk.48.ffn_norm.weight
create_tensor: loading tensor blk.48.ffn_gate.weight
create_tensor: loading tensor blk.48.ffn_up.weight
create_tensor: loading tensor blk.48.ffn_down.weight
create_tensor: loading tensor blk.48.post_ffw_norm.weight
create_tensor: loading tensor blk.49.attn_norm.weight
create_tensor: loading tensor blk.49.attn_q.weight
create_tensor: loading tensor blk.49.attn_k.weight
create_tensor: loading tensor blk.49.attn_v.weight
create_tensor: loading tensor blk.49.attn_output.weight
create_tensor: loading tensor blk.49.attn_q_norm.weight
create_tensor: loading tensor blk.49.attn_k_norm.weight
create_tensor: loading tensor blk.49.post_attention_norm.weight
create_tensor: loading tensor blk.49.layer_output_scale.weight
create_tensor: loading tensor blk.49.ffn_norm.weight
create_tensor: loading tensor blk.49.ffn_gate.weight
create_tensor: loading tensor blk.49.ffn_up.weight
create_tensor: loading tensor blk.49.ffn_down.weight
create_tensor: loading tensor blk.49.post_ffw_norm.weight
create_tensor: loading tensor blk.50.attn_norm.weight
create_tensor: loading tensor blk.50.attn_q.weight
create_tensor: loading tensor blk.50.attn_k.weight
create_tensor: loading tensor blk.50.attn_v.weight
create_tensor: loading tensor blk.50.attn_output.weight
create_tensor: loading tensor blk.50.attn_q_norm.weight
create_tensor: loading tensor blk.50.attn_k_norm.weight
create_tensor: loading tensor blk.50.post_attention_norm.weight
create_tensor: loading tensor blk.50.layer_output_scale.weight
create_tensor: loading tensor blk.50.ffn_norm.weight
create_tensor: loading tensor blk.50.ffn_gate.weight
create_tensor: loading tensor blk.50.ffn_up.weight
create_tensor: loading tensor blk.50.ffn_down.weight
create_tensor: loading tensor blk.50.post_ffw_norm.weight
create_tensor: loading tensor blk.51.attn_norm.weight
create_tensor: loading tensor blk.51.attn_q.weight
create_tensor: loading tensor blk.51.attn_k.weight
create_tensor: loading tensor blk.51.attn_v.weight
create_tensor: loading tensor blk.51.attn_output.weight
create_tensor: loading tensor blk.51.attn_q_norm.weight
create_tensor: loading tensor blk.51.attn_k_norm.weight
create_tensor: loading tensor blk.51.post_attention_norm.weight
create_tensor: loading tensor blk.51.layer_output_scale.weight
create_tensor: loading tensor blk.51.ffn_norm.weight
create_tensor: loading tensor blk.51.ffn_gate.weight
create_tensor: loading tensor blk.51.ffn_up.weight
create_tensor: loading tensor blk.51.ffn_down.weight
create_tensor: loading tensor blk.51.post_ffw_norm.weight
create_tensor: loading tensor blk.52.attn_norm.weight
create_tensor: loading tensor blk.52.attn_q.weight
create_tensor: loading tensor blk.52.attn_k.weight
create_tensor: loading tensor blk.52.attn_v.weight
create_tensor: loading tensor blk.52.attn_output.weight
create_tensor: loading tensor blk.52.attn_q_norm.weight
create_tensor: loading tensor blk.52.attn_k_norm.weight
create_tensor: loading tensor blk.52.post_attention_norm.weight
create_tensor: loading tensor blk.52.layer_output_scale.weight
create_tensor: loading tensor blk.52.ffn_norm.weight
create_tensor: loading tensor blk.52.ffn_gate.weight
create_tensor: loading tensor blk.52.ffn_up.weight
create_tensor: loading tensor blk.52.ffn_down.weight
create_tensor: loading tensor blk.52.post_ffw_norm.weight
create_tensor: loading tensor blk.53.attn_norm.weight
create_tensor: loading tensor blk.53.attn_q.weight
create_tensor: loading tensor blk.53.attn_k.weight
create_tensor: loading tensor blk.53.attn_output.weight
create_tensor: loading tensor blk.53.attn_q_norm.weight
create_tensor: loading tensor blk.53.attn_k_norm.weight
create_tensor: loading tensor blk.53.post_attention_norm.weight
create_tensor: loading tensor blk.53.layer_output_scale.weight
create_tensor: loading tensor blk.53.ffn_norm.weight
create_tensor: loading tensor blk.53.ffn_gate.weight
create_tensor: loading tensor blk.53.ffn_up.weight
create_tensor: loading tensor blk.53.ffn_down.weight
create_tensor: loading tensor blk.53.post_ffw_norm.weight
create_tensor: loading tensor blk.54.attn_norm.weight
create_tensor: loading tensor blk.54.attn_q.weight
create_tensor: loading tensor blk.54.attn_k.weight
create_tensor: loading tensor blk.54.attn_v.weight
create_tensor: loading tensor blk.54.attn_output.weight
create_tensor: loading tensor blk.54.attn_q_norm.weight
create_tensor: loading tensor blk.54.attn_k_norm.weight
create_tensor: loading tensor blk.54.post_attention_norm.weight
create_tensor: loading tensor blk.54.layer_output_scale.weight
create_tensor: loading tensor blk.54.ffn_norm.weight
create_tensor: loading tensor blk.54.ffn_gate.weight
create_tensor: loading tensor blk.54.ffn_up.weight
create_tensor: loading tensor blk.54.ffn_down.weight
create_tensor: loading tensor blk.54.post_ffw_norm.weight
create_tensor: loading tensor blk.55.attn_norm.weight
create_tensor: loading tensor blk.55.attn_q.weight
create_tensor: loading tensor blk.55.attn_k.weight
create_tensor: loading tensor blk.55.attn_v.weight
create_tensor: loading tensor blk.55.attn_output.weight
create_tensor: loading tensor blk.55.attn_q_norm.weight
create_tensor: loading tensor blk.55.attn_k_norm.weight
create_tensor: loading tensor blk.55.post_attention_norm.weight
create_tensor: loading tensor blk.55.layer_output_scale.weight
create_tensor: loading tensor blk.55.ffn_norm.weight
create_tensor: loading tensor blk.55.ffn_gate.weight
create_tensor: loading tensor blk.55.ffn_up.weight
create_tensor: loading tensor blk.55.ffn_down.weight
create_tensor: loading tensor blk.55.post_ffw_norm.weight
create_tensor: loading tensor blk.56.attn_norm.weight
create_tensor: loading tensor blk.56.attn_q.weight
create_tensor: loading tensor blk.56.attn_k.weight
create_tensor: loading tensor blk.56.attn_v.weight
create_tensor: loading tensor blk.56.attn_output.weight
create_tensor: loading tensor blk.56.attn_q_norm.weight
create_tensor: loading tensor blk.56.attn_k_norm.weight
create_tensor: loading tensor blk.56.post_attention_norm.weight
create_tensor: loading tensor blk.56.layer_output_scale.weight
create_tensor: loading tensor blk.56.ffn_norm.weight
create_tensor: loading tensor blk.56.ffn_gate.weight
create_tensor: loading tensor blk.56.ffn_up.weight
create_tensor: loading tensor blk.56.ffn_down.weight
create_tensor: loading tensor blk.56.post_ffw_norm.weight
create_tensor: loading tensor blk.57.attn_norm.weight
create_tensor: loading tensor blk.57.attn_q.weight
create_tensor: loading tensor blk.57.attn_k.weight
create_tensor: loading tensor blk.57.attn_v.weight
create_tensor: loading tensor blk.57.attn_output.weight
create_tensor: loading tensor blk.57.attn_q_norm.weight
create_tensor: loading tensor blk.57.attn_k_norm.weight
create_tensor: loading tensor blk.57.post_attention_norm.weight
create_tensor: loading tensor blk.57.layer_output_scale.weight
create_tensor: loading tensor blk.57.ffn_norm.weight
create_tensor: loading tensor blk.57.ffn_gate.weight
create_tensor: loading tensor blk.57.ffn_up.weight
create_tensor: loading tensor blk.57.ffn_down.weight
create_tensor: loading tensor blk.57.post_ffw_norm.weight
create_tensor: loading tensor blk.58.attn_norm.weight
create_tensor: loading tensor blk.58.attn_q.weight
create_tensor: loading tensor blk.58.attn_k.weight
create_tensor: loading tensor blk.58.attn_v.weight
create_tensor: loading tensor blk.58.attn_output.weight
create_tensor: loading tensor blk.58.attn_q_norm.weight
create_tensor: loading tensor blk.58.attn_k_norm.weight
create_tensor: loading tensor blk.58.post_attention_norm.weight
create_tensor: loading tensor blk.58.layer_output_scale.weight
create_tensor: loading tensor blk.58.ffn_norm.weight
create_tensor: loading tensor blk.58.ffn_gate.weight
create_tensor: loading tensor blk.58.ffn_up.weight
create_tensor: loading tensor blk.58.ffn_down.weight
create_tensor: loading tensor blk.58.post_ffw_norm.weight
create_tensor: loading tensor blk.59.attn_norm.weight
create_tensor: loading tensor blk.59.attn_q.weight
create_tensor: loading tensor blk.59.attn_k.weight
create_tensor: loading tensor blk.59.attn_output.weight
create_tensor: loading tensor blk.59.attn_q_norm.weight
create_tensor: loading tensor blk.59.attn_k_norm.weight
create_tensor: loading tensor blk.59.post_attention_norm.weight
create_tensor: loading tensor blk.59.layer_output_scale.weight
create_tensor: loading tensor blk.59.ffn_norm.weight
create_tensor: loading tensor blk.59.ffn_gate.weight
create_tensor: loading tensor blk.59.ffn_up.weight
create_tensor: loading tensor blk.59.ffn_down.weight
create_tensor: loading tensor blk.59.post_ffw_norm.weight
load_tensors: offloading output layer to GPU
load_tensors: offloading 59 repeating layers to GPU
load_tensors: offloaded 61/61 layers to GPU
load_tensors:        CUDA0 model buffer size =     0.00 MiB
load_tensors:        CUDA1 model buffer size =     0.00 MiB
load_tensors:    CUDA_Host model buffer size =     0.00 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 39936
llama_context: n_ctx_seq     = 39936
llama_context: n_batch       = 256
llama_context: n_ubatch      = 256
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (39936) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:  CUDA_Host  output buffer size =     1.00 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 39936 cells
llama_kv_cache: layer   0: filtered
llama_kv_cache: layer   1: filtered
llama_kv_cache: layer   2: filtered
llama_kv_cache: layer   3: filtered
llama_kv_cache: layer   4: filtered
llama_kv_cache: layer   5: dev = CUDA0
llama_kv_cache: layer   6: filtered
llama_kv_cache: layer   7: filtered
llama_kv_cache: layer   8: filtered
llama_kv_cache: layer   9: filtered
llama_kv_cache: layer  10: filtered
llama_kv_cache: layer  11: dev = CUDA0
llama_kv_cache: layer  12: filtered
llama_kv_cache: layer  13: filtered
llama_kv_cache: layer  14: filtered
llama_kv_cache: layer  15: filtered
llama_kv_cache: layer  16: filtered
llama_kv_cache: layer  17: dev = CUDA0
llama_kv_cache: layer  18: filtered
llama_kv_cache: layer  19: filtered
llama_kv_cache: layer  20: filtered
llama_kv_cache: layer  21: filtered
llama_kv_cache: layer  22: filtered
llama_kv_cache: layer  23: dev = CUDA0
llama_kv_cache: layer  24: filtered
llama_kv_cache: layer  25: filtered
llama_kv_cache: layer  26: filtered
llama_kv_cache: layer  27: filtered
llama_kv_cache: layer  28: filtered
llama_kv_cache: layer  29: dev = CUDA0
llama_kv_cache: layer  30: filtered
llama_kv_cache: layer  31: filtered
llama_kv_cache: layer  32: filtered
llama_kv_cache: layer  33: filtered
llama_kv_cache: layer  34: filtered
llama_kv_cache: layer  35: dev = CUDA0
llama_kv_cache: layer  36: filtered
llama_kv_cache: layer  37: filtered
llama_kv_cache: layer  38: filtered
llama_kv_cache: layer  39: filtered
llama_kv_cache: layer  40: filtered
llama_kv_cache: layer  41: dev = CUDA1
llama_kv_cache: layer  42: filtered
llama_kv_cache: layer  43: filtered
llama_kv_cache: layer  44: filtered
llama_kv_cache: layer  45: filtered
llama_kv_cache: layer  46: filtered
llama_kv_cache: layer  47: dev = CUDA1
llama_kv_cache: layer  48: filtered
llama_kv_cache: layer  49: filtered
llama_kv_cache: layer  50: filtered
llama_kv_cache: layer  51: filtered
llama_kv_cache: layer  52: filtered
llama_kv_cache: layer  53: dev = CUDA1
llama_kv_cache: layer  54: filtered
llama_kv_cache: layer  55: filtered
llama_kv_cache: layer  56: filtered
llama_kv_cache: layer  57: filtered
llama_kv_cache: layer  58: filtered
llama_kv_cache: layer  59: dev = CUDA1
llama_kv_cache: reusing layers:
llama_kv_cache: - layer   0: no reuse
llama_kv_cache: - layer   1: no reuse
llama_kv_cache: - layer   2: no reuse
llama_kv_cache: - layer   3: no reuse
llama_kv_cache: - layer   4: no reuse
llama_kv_cache: - layer   5: no reuse
llama_kv_cache: - layer   6: no reuse
llama_kv_cache: - layer   7: no reuse
llama_kv_cache: - layer   8: no reuse
llama_kv_cache: - layer   9: no reuse
llama_kv_cache: - layer  10: no reuse
llama_kv_cache: - layer  11: no reuse
llama_kv_cache: - layer  12: no reuse
llama_kv_cache: - layer  13: no reuse
llama_kv_cache: - layer  14: no reuse
llama_kv_cache: - layer  15: no reuse
llama_kv_cache: - layer  16: no reuse
llama_kv_cache: - layer  17: no reuse
llama_kv_cache: - layer  18: no reuse
llama_kv_cache: - layer  19: no reuse
llama_kv_cache: - layer  20: no reuse
llama_kv_cache: - layer  21: no reuse
llama_kv_cache: - layer  22: no reuse
llama_kv_cache: - layer  23: no reuse
llama_kv_cache: - layer  24: no reuse
llama_kv_cache: - layer  25: no reuse
llama_kv_cache: - layer  26: no reuse
llama_kv_cache: - layer  27: no reuse
llama_kv_cache: - layer  28: no reuse
llama_kv_cache: - layer  29: no reuse
llama_kv_cache: - layer  30: no reuse
llama_kv_cache: - layer  31: no reuse
llama_kv_cache: - layer  32: no reuse
llama_kv_cache: - layer  33: no reuse
llama_kv_cache: - layer  34: no reuse
llama_kv_cache: - layer  35: no reuse
llama_kv_cache: - layer  36: no reuse
llama_kv_cache: - layer  37: no reuse
llama_kv_cache: - layer  38: no reuse
llama_kv_cache: - layer  39: no reuse
llama_kv_cache: - layer  40: no reuse
llama_kv_cache: - layer  41: no reuse
llama_kv_cache: - layer  42: no reuse
llama_kv_cache: - layer  43: no reuse
llama_kv_cache: - layer  44: no reuse
llama_kv_cache: - layer  45: no reuse
llama_kv_cache: - layer  46: no reuse
llama_kv_cache: - layer  47: no reuse
llama_kv_cache: - layer  48: no reuse
llama_kv_cache: - layer  49: no reuse
llama_kv_cache: - layer  50: no reuse
llama_kv_cache: - layer  51: no reuse
llama_kv_cache: - layer  52: no reuse
llama_kv_cache: - layer  53: no reuse
llama_kv_cache: - layer  54: no reuse
llama_kv_cache: - layer  55: no reuse
llama_kv_cache: - layer  56: no reuse
llama_kv_cache: - layer  57: no reuse
llama_kv_cache: - layer  58: no reuse
llama_kv_cache: - layer  59: no reuse
llama_kv_cache:      CUDA0 KV buffer size =     0.00 MiB
llama_kv_cache:      CUDA1 KV buffer size =     0.00 MiB
llama_kv_cache: size = 3120.00 MiB ( 39936 cells,  10 layers,  1/1 seqs), K (f16): 1560.00 MiB, V (f16): 1560.00 MiB
llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 512
llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 512
llama_kv_cache_iswa: creating     SWA KV cache, size = 1280 cells
llama_kv_cache: layer   0: dev = CUDA0
llama_kv_cache: layer   1: dev = CUDA0
llama_kv_cache: layer   2: dev = CUDA0
llama_kv_cache: layer   3: dev = CUDA0
llama_kv_cache: layer   4: dev = CUDA0
llama_kv_cache: layer   5: filtered
llama_kv_cache: layer   6: dev = CUDA0
llama_kv_cache: layer   7: dev = CUDA0
llama_kv_cache: layer   8: dev = CUDA0
llama_kv_cache: layer   9: dev = CUDA0
llama_kv_cache: layer  10: dev = CUDA0
llama_kv_cache: layer  11: filtered
llama_kv_cache: layer  12: dev = CUDA0
llama_kv_cache: layer  13: dev = CUDA0
llama_kv_cache: layer  14: dev = CUDA0
llama_kv_cache: layer  15: dev = CUDA0
llama_kv_cache: layer  16: dev = CUDA0
llama_kv_cache: layer  17: filtered
llama_kv_cache: layer  18: dev = CUDA0
llama_kv_cache: layer  19: dev = CUDA0
llama_kv_cache: layer  20: dev = CUDA0
llama_kv_cache: layer  21: dev = CUDA0
llama_kv_cache: layer  22: dev = CUDA0
llama_kv_cache: layer  23: filtered
llama_kv_cache: layer  24: dev = CUDA0
llama_kv_cache: layer  25: dev = CUDA0
llama_kv_cache: layer  26: dev = CUDA0
llama_kv_cache: layer  27: dev = CUDA0
llama_kv_cache: layer  28: dev = CUDA0
llama_kv_cache: layer  29: filtered
llama_kv_cache: layer  30: dev = CUDA0
llama_kv_cache: layer  31: dev = CUDA0
llama_kv_cache: layer  32: dev = CUDA0
llama_kv_cache: layer  33: dev = CUDA0
llama_kv_cache: layer  34: dev = CUDA0
llama_kv_cache: layer  35: filtered
llama_kv_cache: layer  36: dev = CUDA0
llama_kv_cache: layer  37: dev = CUDA1
llama_kv_cache: layer  38: dev = CUDA1
llama_kv_cache: layer  39: dev = CUDA1
llama_kv_cache: layer  40: dev = CUDA1
llama_kv_cache: layer  41: filtered
llama_kv_cache: layer  42: dev = CUDA1
llama_kv_cache: layer  43: dev = CUDA1
llama_kv_cache: layer  44: dev = CUDA1
llama_kv_cache: layer  45: dev = CUDA1
llama_kv_cache: layer  46: dev = CUDA1
llama_kv_cache: layer  47: filtered
llama_kv_cache: layer  48: dev = CUDA1
llama_kv_cache: layer  49: dev = CUDA1
llama_kv_cache: layer  50: dev = CUDA1
llama_kv_cache: layer  51: dev = CUDA1
llama_kv_cache: layer  52: dev = CUDA1
llama_kv_cache: layer  53: filtered
llama_kv_cache: layer  54: dev = CUDA1
llama_kv_cache: layer  55: dev = CUDA1
llama_kv_cache: layer  56: dev = CUDA1
llama_kv_cache: layer  57: dev = CUDA1
llama_kv_cache: layer  58: dev = CUDA1
llama_kv_cache: layer  59: filtered
llama_kv_cache: reusing layers:
llama_kv_cache: - layer   0: no reuse
llama_kv_cache: - layer   1: no reuse
llama_kv_cache: - layer   2: no reuse
llama_kv_cache: - layer   3: no reuse
llama_kv_cache: - layer   4: no reuse
llama_kv_cache: - layer   5: no reuse
llama_kv_cache: - layer   6: no reuse
llama_kv_cache: - layer   7: no reuse
llama_kv_cache: - layer   8: no reuse
llama_kv_cache: - layer   9: no reuse
llama_kv_cache: - layer  10: no reuse
llama_kv_cache: - layer  11: no reuse
llama_kv_cache: - layer  12: no reuse
llama_kv_cache: - layer  13: no reuse
llama_kv_cache: - layer  14: no reuse
llama_kv_cache: - layer  15: no reuse
llama_kv_cache: - layer  16: no reuse
llama_kv_cache: - layer  17: no reuse
llama_kv_cache: - layer  18: no reuse
llama_kv_cache: - layer  19: no reuse
llama_kv_cache: - layer  20: no reuse
llama_kv_cache: - layer  21: no reuse
llama_kv_cache: - layer  22: no reuse
llama_kv_cache: - layer  23: no reuse
llama_kv_cache: - layer  24: no reuse
llama_kv_cache: - layer  25: no reuse
llama_kv_cache: - layer  26: no reuse
llama_kv_cache: - layer  27: no reuse
llama_kv_cache: - layer  28: no reuse
llama_kv_cache: - layer  29: no reuse
llama_kv_cache: - layer  30: no reuse
llama_kv_cache: - layer  31: no reuse
llama_kv_cache: - layer  32: no reuse
llama_kv_cache: - layer  33: no reuse
llama_kv_cache: - layer  34: no reuse
llama_kv_cache: - layer  35: no reuse
llama_kv_cache: - layer  36: no reuse
llama_kv_cache: - layer  37: no reuse
llama_kv_cache: - layer  38: no reuse
llama_kv_cache: - layer  39: no reuse
llama_kv_cache: - layer  40: no reuse
llama_kv_cache: - layer  41: no reuse
llama_kv_cache: - layer  42: no reuse
llama_kv_cache: - layer  43: no reuse
llama_kv_cache: - layer  44: no reuse
llama_kv_cache: - layer  45: no reuse
llama_kv_cache: - layer  46: no reuse
llama_kv_cache: - layer  47: no reuse
llama_kv_cache: - layer  48: no reuse
llama_kv_cache: - layer  49: no reuse
llama_kv_cache: - layer  50: no reuse
llama_kv_cache: - layer  51: no reuse
llama_kv_cache: - layer  52: no reuse
llama_kv_cache: - layer  53: no reuse
llama_kv_cache: - layer  54: no reuse
llama_kv_cache: - layer  55: no reuse
llama_kv_cache: - layer  56: no reuse
llama_kv_cache: - layer  57: no reuse
llama_kv_cache: - layer  58: no reuse
llama_kv_cache: - layer  59: no reuse
llama_kv_cache:      CUDA0 KV buffer size =     0.00 MiB
llama_kv_cache:      CUDA1 KV buffer size =     0.00 MiB
llama_kv_cache: size = 1000.00 MiB (  1280 cells,  50 layers,  1/1 seqs), K (f16):  500.00 MiB, V (f16):  500.00 MiB
llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 256
llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 256
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 3
llama_context: pipeline parallelism enabled
sched_reserve: reserving ...
sched_reserve: max_nodes = 6680
sched_reserve: reserving full memory module
sched_reserve: worst-case: n_tokens = 256, n_seqs = 1, n_outputs = 1
sched_reserve: resolving fused Gated Delta Net support:
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
sched_reserve: fused Gated Delta Net (autoregressive) enabled
graph_reserve: reserving a graph for ubatch with n_tokens =   16, n_seqs =  1, n_outputs =   16
sched_reserve: fused Gated Delta Net (chunked) enabled
graph_reserve: reserving a graph for ubatch with n_tokens =  256, n_seqs =  1, n_outputs =  256
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
graph_reserve: reserving a graph for ubatch with n_tokens =  256, n_seqs =  1, n_outputs =  256
sched_reserve:      CUDA0 compute buffer size =   291.66 MiB
sched_reserve:      CUDA1 compute buffer size =   383.79 MiB
sched_reserve:  CUDA_Host compute buffer size =   171.54 MiB
sched_reserve: graph nodes  = 2462
sched_reserve: graph splits = 3
sched_reserve: reserve took 2.77 ms, sched copies = 4
llama_memory_breakdown_print: | memory breakdown [MiB] | total    free     self   model   context   compute       unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24126 = 23859 + (21077 = 18293 +    2492 +     291) + 17592186023605 |
llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24103 = 22602 + (14826 = 12814 +    1628 +     383) + 17592186031090 |
llama_memory_breakdown_print: |   - Host               |                   1599 =  1428 +       0 +     171                   |
llama_params_fit_impl: projected memory use with initial parameters [MiB]:
llama_params_fit_impl:   - CUDA0 (NVIDIA GeForce RTX 3090):  24126 total,  21077 used,   2781 free vs. target of   1024
llama_params_fit_impl:   - CUDA1 (NVIDIA GeForce RTX 3090):  24103 total,  14826 used,   7775 free vs. target of   1024
llama_params_fit_impl: projected to use 35904 MiB of device memory vs. 46461 MiB of free device memory
llama_params_fit_impl: targets for free memory can be met on all devices, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.73 seconds
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 23859 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 22602 MiB free
llama_model_loader: loaded meta data with 51 key-value pairs and 833 tensors from /mnt/Speed/AI/Models/bartowski/google_gemma-4-31B-it-GGUF/google_gemma-4-31B-it-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma4
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                     general.sampling.top_k i32              = 64
llama_model_loader: - kv   3:                     general.sampling.top_p f32              = 0.950000
llama_model_loader: - kv   4:                      general.sampling.temp f32              = 1.000000
llama_model_loader: - kv   5:                               general.name str              = Gemma 4 31B It
llama_model_loader: - kv   6:                           general.finetune str              = it
llama_model_loader: - kv   7:                           general.basename str              = gemma-4
llama_model_loader: - kv   8:                         general.size_label str              = 31B
llama_model_loader: - kv   9:                            general.license str              = apache-2.0
llama_model_loader: - kv  10:                       general.license.link str              = https://ai.google.dev/gemma/docs/gemm...
llama_model_loader: - kv  11:                               general.tags arr[str,1]       = ["image-text-to-text"]
llama_model_loader: - kv  12:                         gemma4.block_count u32              = 60
llama_model_loader: - kv  13:                      gemma4.context_length u32              = 262144
llama_model_loader: - kv  14:                    gemma4.embedding_length u32              = 5376
llama_model_loader: - kv  15:                 gemma4.feed_forward_length u32              = 21504
llama_model_loader: - kv  16:                gemma4.attention.head_count u32              = 32
llama_model_loader: - kv  17:             gemma4.attention.head_count_kv arr[i32,60]      = [16, 16, 16, 16, 16, 4, 16, 16, 16, 1...
llama_model_loader: - kv  18:                      gemma4.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  19:                  gemma4.rope.freq_base_swa f32              = 10000.000000
llama_model_loader: - kv  20:    gemma4.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  21:                gemma4.attention.key_length u32              = 512
llama_model_loader: - kv  22:              gemma4.attention.value_length u32              = 512
llama_model_loader: - kv  23:             gemma4.final_logit_softcapping f32              = 30.000000
llama_model_loader: - kv  24:            gemma4.attention.sliding_window u32              = 1024
llama_model_loader: - kv  25:          gemma4.attention.shared_kv_layers u32              = 0
llama_model_loader: - kv  26:    gemma4.embedding_length_per_layer_input u32              = 0
llama_model_loader: - kv  27:    gemma4.attention.sliding_window_pattern arr[bool,60]     = [true, true, true, true, true, false,...
llama_model_loader: - kv  28:            gemma4.attention.key_length_swa u32              = 256
llama_model_loader: - kv  29:          gemma4.attention.value_length_swa u32              = 256
llama_model_loader: - kv  30:                gemma4.rope.dimension_count u32              = 512
llama_model_loader: - kv  31:            gemma4.rope.dimension_count_swa u32              = 256
llama_model_loader: - kv  32:                       tokenizer.ggml.model str              = gemma4
llama_model_loader: - kv  33:                      tokenizer.ggml.tokens arr[str,262144]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  34:                      tokenizer.ggml.scores arr[f32,262144]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  35:                  tokenizer.ggml.token_type arr[i32,262144]  = [3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  36:                      tokenizer.ggml.merges arr[str,514906]  = ["\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n", ...
llama_model_loader: - kv  37:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  38:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  39:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  40:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  41:               tokenizer.ggml.mask_token_id u32              = 4
llama_model_loader: - kv  42:                    tokenizer.chat_template str              = {%- macro format_parameters(propertie...
llama_model_loader: - kv  43:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  44:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  45:               general.quantization_version u32              = 2
llama_model_loader: - kv  46:                          general.file_type u32              = 7
llama_model_loader: - kv  47:                      quantize.imatrix.file str              = /models_out/gemma-4-31B-it-GGUF/googl...
llama_model_loader: - kv  48:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav5.txt
llama_model_loader: - kv  49:             quantize.imatrix.entries_count u32              = 410
llama_model_loader: - kv  50:              quantize.imatrix.chunks_count u32              = 886
llama_model_loader: - type  f32:  422 tensors
llama_model_loader: - type q8_0:  411 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 30.38 GiB (8.50 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: override 'tokenizer.ggml.add_bos_token' to 'true' for Gemma4
load: 0 unused tokens
load: control token: 258884 '<|video|>' is not marked as EOG
load: control token: 255999 '<|image>' is not marked as EOG
load: control token: 258882 '<image|>' is not marked as EOG
load: control token: 258883 '<audio|>' is not marked as EOG
load: control token:     98 '<|think|>' is not marked as EOG
load: control token:    105 '<|turn>' is not marked as EOG
load: control token: 258880 '<|image|>' is not marked as EOG
load: control token:      2 '<bos>' is not marked as EOG
load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
load: control token:      0 '<pad>' is not marked as EOG
load: control-looking token:     50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
load: control token:     46 '<|tool>' is not marked as EOG
load: control token:     47 '<tool|>' is not marked as EOG
load: control token: 256000 '<|audio>' is not marked as EOG
load: control token:      3 '<unk>' is not marked as EOG
load: control token: 258881 '<|audio|>' is not marked as EOG
load: control token:      4 '<mask>' is not marked as EOG
load: printing all EOG tokens:
load:   - 1 ('<eos>')
load:   - 50 ('<|tool_response>')
load:   - 106 ('<turn|>')
load:   - 212 ('</s>')
load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
load: special tokens cache size = 24
load: token to piece cache size = 1.9445 MB
print_info: arch                  = gemma4
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 262144
print_info: n_embd                = 5376
print_info: n_embd_inp            = 5376
print_info: n_layer               = 60
print_info: n_head                = 32
print_info: n_head_kv             = [16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4]
print_info: n_rot                 = 512
print_info: n_swa                 = 1024
print_info: is_swa_any            = 1
print_info: n_embd_head_k         = 512
print_info: n_embd_head_v         = 512
print_info: n_gqa                 = [2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8]
print_info: n_embd_k_gqa          = [4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048]
print_info: n_embd_v_gqa          = [4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048]
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-06
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 1.0e+00
print_info: n_ff                  = 21504
print_info: n_expert              = 0
print_info: n_expert_used         = 0
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = -1
print_info: rope type             = 2
print_info: rope scaling          = linear
print_info: freq_base_train       = 1000000.0
print_info: freq_scale_train      = 1
print_info: freq_base_swa         = 10000.0
print_info: freq_scale_swa        = 1
print_info: n_embd_head_k_swa     = 256
print_info: n_embd_head_v_swa     = 256
print_info: n_rot_swa             = 256
print_info: n_ctx_orig_yarn       = 262144
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: model type            = 31B
print_info: model params          = 30.70 B
print_info: general.name          = Gemma 4 31B It
print_info: vocab type            = BPE
print_info: n_vocab               = 262144
print_info: n_merges              = 514906
print_info: BOS token             = 2 '<bos>'
print_info: EOS token             = 1 '<eos>'
print_info: UNK token             = 3 '<unk>'
print_info: PAD token             = 0 '<pad>'
print_info: MASK token            = 4 '<mask>'
print_info: LF token              = 107 '\n'
print_info: EOG token             = 1 '<eos>'
print_info: EOG token             = 50 '<|tool_response>'
print_info: EOG token             = 106 '<turn|>'
print_info: max token length      = 93
load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
load_tensors: layer   0 assigned to device CUDA0, is_swa = 1
load_tensors: layer   1 assigned to device CUDA0, is_swa = 1
load_tensors: layer   2 assigned to device CUDA0, is_swa = 1
load_tensors: layer   3 assigned to device CUDA0, is_swa = 1
load_tensors: layer   4 assigned to device CUDA0, is_swa = 1
load_tensors: layer   5 assigned to device CUDA0, is_swa = 0
load_tensors: layer   6 assigned to device CUDA0, is_swa = 1
load_tensors: layer   7 assigned to device CUDA0, is_swa = 1
load_tensors: layer   8 assigned to device CUDA0, is_swa = 1
load_tensors: layer   9 assigned to device CUDA0, is_swa = 1
load_tensors: layer  10 assigned to device CUDA0, is_swa = 1
load_tensors: layer  11 assigned to device CUDA0, is_swa = 0
load_tensors: layer  12 assigned to device CUDA0, is_swa = 1
load_tensors: layer  13 assigned to device CUDA0, is_swa = 1
load_tensors: layer  14 assigned to device CUDA0, is_swa = 1
load_tensors: layer  15 assigned to device CUDA0, is_swa = 1
load_tensors: layer  16 assigned to device CUDA0, is_swa = 1
load_tensors: layer  17 assigned to device CUDA0, is_swa = 0
load_tensors: layer  18 assigned to device CUDA0, is_swa = 1
load_tensors: layer  19 assigned to device CUDA0, is_swa = 1
load_tensors: layer  20 assigned to device CUDA0, is_swa = 1
load_tensors: layer  21 assigned to device CUDA0, is_swa = 1
load_tensors: layer  22 assigned to device CUDA0, is_swa = 1
load_tensors: layer  23 assigned to device CUDA0, is_swa = 0
load_tensors: layer  24 assigned to device CUDA0, is_swa = 1
load_tensors: layer  25 assigned to device CUDA0, is_swa = 1
load_tensors: layer  26 assigned to device CUDA0, is_swa = 1
load_tensors: layer  27 assigned to device CUDA0, is_swa = 1
load_tensors: layer  28 assigned to device CUDA0, is_swa = 1
load_tensors: layer  29 assigned to device CUDA0, is_swa = 0
load_tensors: layer  30 assigned to device CUDA0, is_swa = 1
load_tensors: layer  31 assigned to device CUDA0, is_swa = 1
load_tensors: layer  32 assigned to device CUDA0, is_swa = 1
load_tensors: layer  33 assigned to device CUDA0, is_swa = 1
load_tensors: layer  34 assigned to device CUDA0, is_swa = 1
load_tensors: layer  35 assigned to device CUDA0, is_swa = 0
load_tensors: layer  36 assigned to device CUDA0, is_swa = 1
load_tensors: layer  37 assigned to device CUDA1, is_swa = 1
load_tensors: layer  38 assigned to device CUDA1, is_swa = 1
load_tensors: layer  39 assigned to device CUDA1, is_swa = 1
load_tensors: layer  40 assigned to device CUDA1, is_swa = 1
load_tensors: layer  41 assigned to device CUDA1, is_swa = 0
load_tensors: layer  42 assigned to device CUDA1, is_swa = 1
load_tensors: layer  43 assigned to device CUDA1, is_swa = 1
load_tensors: layer  44 assigned to device CUDA1, is_swa = 1
load_tensors: layer  45 assigned to device CUDA1, is_swa = 1
load_tensors: layer  46 assigned to device CUDA1, is_swa = 1
load_tensors: layer  47 assigned to device CUDA1, is_swa = 0
load_tensors: layer  48 assigned to device CUDA1, is_swa = 1
load_tensors: layer  49 assigned to device CUDA1, is_swa = 1
load_tensors: layer  50 assigned to device CUDA1, is_swa = 1
load_tensors: layer  51 assigned to device CUDA1, is_swa = 1
load_tensors: layer  52 assigned to device CUDA1, is_swa = 1
load_tensors: layer  53 assigned to device CUDA1, is_swa = 0
load_tensors: layer  54 assigned to device CUDA1, is_swa = 1
load_tensors: layer  55 assigned to device CUDA1, is_swa = 1
load_tensors: layer  56 assigned to device CUDA1, is_swa = 1
load_tensors: layer  57 assigned to device CUDA1, is_swa = 1
load_tensors: layer  58 assigned to device CUDA1, is_swa = 1
load_tensors: layer  59 assigned to device CUDA1, is_swa = 0
load_tensors: layer  60 assigned to device CUDA1, is_swa = 0
create_tensor: loading tensor token_embd.weight
create_tensor: loading tensor token_embd.weight
create_tensor: loading tensor output_norm.weight
create_tensor: loading tensor blk.0.attn_norm.weight
create_tensor: loading tensor blk.0.attn_q.weight
create_tensor: loading tensor blk.0.attn_k.weight
create_tensor: loading tensor blk.0.attn_v.weight
create_tensor: loading tensor blk.0.attn_output.weight
create_tensor: loading tensor blk.0.attn_q_norm.weight
create_tensor: loading tensor blk.0.attn_k_norm.weight
create_tensor: loading tensor blk.0.post_attention_norm.weight
create_tensor: loading tensor blk.0.layer_output_scale.weight
create_tensor: loading tensor blk.0.ffn_norm.weight
create_tensor: loading tensor blk.0.ffn_gate.weight
create_tensor: loading tensor blk.0.ffn_up.weight
create_tensor: loading tensor blk.0.ffn_down.weight
create_tensor: loading tensor blk.0.post_ffw_norm.weight
create_tensor: loading tensor blk.1.attn_norm.weight
create_tensor: loading tensor blk.1.attn_q.weight
create_tensor: loading tensor blk.1.attn_k.weight
create_tensor: loading tensor blk.1.attn_v.weight
create_tensor: loading tensor blk.1.attn_output.weight
create_tensor: loading tensor blk.1.attn_q_norm.weight
create_tensor: loading tensor blk.1.attn_k_norm.weight
create_tensor: loading tensor blk.1.post_attention_norm.weight
create_tensor: loading tensor blk.1.layer_output_scale.weight
create_tensor: loading tensor blk.1.ffn_norm.weight
create_tensor: loading tensor blk.1.ffn_gate.weight
create_tensor: loading tensor blk.1.ffn_up.weight
create_tensor: loading tensor blk.1.ffn_down.weight
create_tensor: loading tensor blk.1.post_ffw_norm.weight
create_tensor: loading tensor blk.2.attn_norm.weight
create_tensor: loading tensor blk.2.attn_q.weight
create_tensor: loading tensor blk.2.attn_k.weight
create_tensor: loading tensor blk.2.attn_v.weight
create_tensor: loading tensor blk.2.attn_output.weight
create_tensor: loading tensor blk.2.attn_q_norm.weight
create_tensor: loading tensor blk.2.attn_k_norm.weight
create_tensor: loading tensor blk.2.post_attention_norm.weight
create_tensor: loading tensor blk.2.layer_output_scale.weight
create_tensor: loading tensor blk.2.ffn_norm.weight
create_tensor: loading tensor blk.2.ffn_gate.weight
create_tensor: loading tensor blk.2.ffn_up.weight
create_tensor: loading tensor blk.2.ffn_down.weight
create_tensor: loading tensor blk.2.post_ffw_norm.weight
create_tensor: loading tensor blk.3.attn_norm.weight
create_tensor: loading tensor blk.3.attn_q.weight
create_tensor: loading tensor blk.3.attn_k.weight
create_tensor: loading tensor blk.3.attn_v.weight
create_tensor: loading tensor blk.3.attn_output.weight
create_tensor: loading tensor blk.3.attn_q_norm.weight
create_tensor: loading tensor blk.3.attn_k_norm.weight
create_tensor: loading tensor blk.3.post_attention_norm.weight
create_tensor: loading tensor blk.3.layer_output_scale.weight
create_tensor: loading tensor blk.3.ffn_norm.weight
create_tensor: loading tensor blk.3.ffn_gate.weight
create_tensor: loading tensor blk.3.ffn_up.weight
create_tensor: loading tensor blk.3.ffn_down.weight
create_tensor: loading tensor blk.3.post_ffw_norm.weight
create_tensor: loading tensor blk.4.attn_norm.weight
create_tensor: loading tensor blk.4.attn_q.weight
create_tensor: loading tensor blk.4.attn_k.weight
create_tensor: loading tensor blk.4.attn_v.weight
create_tensor: loading tensor blk.4.attn_output.weight
create_tensor: loading tensor blk.4.attn_q_norm.weight
create_tensor: loading tensor blk.4.attn_k_norm.weight
create_tensor: loading tensor blk.4.post_attention_norm.weight
create_tensor: loading tensor blk.4.layer_output_scale.weight
create_tensor: loading tensor blk.4.ffn_norm.weight
create_tensor: loading tensor blk.4.ffn_gate.weight
create_tensor: loading tensor blk.4.ffn_up.weight
create_tensor: loading tensor blk.4.ffn_down.weight
create_tensor: loading tensor blk.4.post_ffw_norm.weight
create_tensor: loading tensor blk.5.attn_norm.weight
create_tensor: loading tensor blk.5.attn_q.weight
create_tensor: loading tensor blk.5.attn_k.weight
create_tensor: loading tensor blk.5.attn_output.weight
create_tensor: loading tensor blk.5.attn_q_norm.weight
create_tensor: loading tensor blk.5.attn_k_norm.weight
create_tensor: loading tensor blk.5.post_attention_norm.weight
create_tensor: loading tensor blk.5.layer_output_scale.weight
create_tensor: loading tensor rope_freqs.weight
create_tensor: loading tensor blk.5.ffn_norm.weight
create_tensor: loading tensor blk.5.ffn_gate.weight
create_tensor: loading tensor blk.5.ffn_up.weight
create_tensor: loading tensor blk.5.ffn_down.weight
create_tensor: loading tensor blk.5.post_ffw_norm.weight
create_tensor: loading tensor blk.6.attn_norm.weight
create_tensor: loading tensor blk.6.attn_q.weight
create_tensor: loading tensor blk.6.attn_k.weight
create_tensor: loading tensor blk.6.attn_v.weight
create_tensor: loading tensor blk.6.attn_output.weight
create_tensor: loading tensor blk.6.attn_q_norm.weight
create_tensor: loading tensor blk.6.attn_k_norm.weight
create_tensor: loading tensor blk.6.post_attention_norm.weight
create_tensor: loading tensor blk.6.layer_output_scale.weight
create_tensor: loading tensor blk.6.ffn_norm.weight
create_tensor: loading tensor blk.6.ffn_gate.weight
create_tensor: loading tensor blk.6.ffn_up.weight
create_tensor: loading tensor blk.6.ffn_down.weight
create_tensor: loading tensor blk.6.post_ffw_norm.weight
create_tensor: loading tensor blk.7.attn_norm.weight
create_tensor: loading tensor blk.7.attn_q.weight
create_tensor: loading tensor blk.7.attn_k.weight
create_tensor: loading tensor blk.7.attn_v.weight
create_tensor: loading tensor blk.7.attn_output.weight
create_tensor: loading tensor blk.7.attn_q_norm.weight
create_tensor: loading tensor blk.7.attn_k_norm.weight
create_tensor: loading tensor blk.7.post_attention_norm.weight
create_tensor: loading tensor blk.7.layer_output_scale.weight
create_tensor: loading tensor blk.7.ffn_norm.weight
create_tensor: loading tensor blk.7.ffn_gate.weight
create_tensor: loading tensor blk.7.ffn_up.weight
create_tensor: loading tensor blk.7.ffn_down.weight
create_tensor: loading tensor blk.7.post_ffw_norm.weight
create_tensor: loading tensor blk.8.attn_norm.weight
create_tensor: loading tensor blk.8.attn_q.weight
create_tensor: loading tensor blk.8.attn_k.weight
create_tensor: loading tensor blk.8.attn_v.weight
create_tensor: loading tensor blk.8.attn_output.weight
create_tensor: loading tensor blk.8.attn_q_norm.weight
create_tensor: loading tensor blk.8.attn_k_norm.weight
create_tensor: loading tensor blk.8.post_attention_norm.weight
create_tensor: loading tensor blk.8.layer_output_scale.weight
create_tensor: loading tensor blk.8.ffn_norm.weight
create_tensor: loading tensor blk.8.ffn_gate.weight
create_tensor: loading tensor blk.8.ffn_up.weight
create_tensor: loading tensor blk.8.ffn_down.weight
create_tensor: loading tensor blk.8.post_ffw_norm.weight
create_tensor: loading tensor blk.9.attn_norm.weight
create_tensor: loading tensor blk.9.attn_q.weight
create_tensor: loading tensor blk.9.attn_k.weight
create_tensor: loading tensor blk.9.attn_v.weight
create_tensor: loading tensor blk.9.attn_output.weight
create_tensor: loading tensor blk.9.attn_q_norm.weight
create_tensor: loading tensor blk.9.attn_k_norm.weight
create_tensor: loading tensor blk.9.post_attention_norm.weight
create_tensor: loading tensor blk.9.layer_output_scale.weight
create_tensor: loading tensor blk.9.ffn_norm.weight
create_tensor: loading tensor blk.9.ffn_gate.weight
create_tensor: loading tensor blk.9.ffn_up.weight
create_tensor: loading tensor blk.9.ffn_down.weight
create_tensor: loading tensor blk.9.post_ffw_norm.weight
create_tensor: loading tensor blk.10.attn_norm.weight
create_tensor: loading tensor blk.10.attn_q.weight
create_tensor: loading tensor blk.10.attn_k.weight
create_tensor: loading tensor blk.10.attn_v.weight
create_tensor: loading tensor blk.10.attn_output.weight
create_tensor: loading tensor blk.10.attn_q_norm.weight
create_tensor: loading tensor blk.10.attn_k_norm.weight
create_tensor: loading tensor blk.10.post_attention_norm.weight
create_tensor: loading tensor blk.10.layer_output_scale.weight
create_tensor: loading tensor blk.10.ffn_norm.weight
create_tensor: loading tensor blk.10.ffn_gate.weight
create_tensor: loading tensor blk.10.ffn_up.weight
create_tensor: loading tensor blk.10.ffn_down.weight
create_tensor: loading tensor blk.10.post_ffw_norm.weight
create_tensor: loading tensor blk.11.attn_norm.weight
create_tensor: loading tensor blk.11.attn_q.weight
create_tensor: loading tensor blk.11.attn_k.weight
create_tensor: loading tensor blk.11.attn_output.weight
create_tensor: loading tensor blk.11.attn_q_norm.weight
create_tensor: loading tensor blk.11.attn_k_norm.weight
create_tensor: loading tensor blk.11.post_attention_norm.weight
create_tensor: loading tensor blk.11.layer_output_scale.weight
create_tensor: loading tensor blk.11.ffn_norm.weight
create_tensor: loading tensor blk.11.ffn_gate.weight
create_tensor: loading tensor blk.11.ffn_up.weight
create_tensor: loading tensor blk.11.ffn_down.weight
create_tensor: loading tensor blk.11.post_ffw_norm.weight
create_tensor: loading tensor blk.12.attn_norm.weight
create_tensor: loading tensor blk.12.attn_q.weight
create_tensor: loading tensor blk.12.attn_k.weight
create_tensor: loading tensor blk.12.attn_v.weight
create_tensor: loading tensor blk.12.attn_output.weight
create_tensor: loading tensor blk.12.attn_q_norm.weight
create_tensor: loading tensor blk.12.attn_k_norm.weight
create_tensor: loading tensor blk.12.post_attention_norm.weight
create_tensor: loading tensor blk.12.layer_output_scale.weight
create_tensor: loading tensor blk.12.ffn_norm.weight
create_tensor: loading tensor blk.12.ffn_gate.weight
create_tensor: loading tensor blk.12.ffn_up.weight
create_tensor: loading tensor blk.12.ffn_down.weight
create_tensor: loading tensor blk.12.post_ffw_norm.weight
create_tensor: loading tensor blk.13.attn_norm.weight
create_tensor: loading tensor blk.13.attn_q.weight
create_tensor: loading tensor blk.13.attn_k.weight
create_tensor: loading tensor blk.13.attn_v.weight
create_tensor: loading tensor blk.13.attn_output.weight
create_tensor: loading tensor blk.13.attn_q_norm.weight
create_tensor: loading tensor blk.13.attn_k_norm.weight
create_tensor: loading tensor blk.13.post_attention_norm.weight
create_tensor: loading tensor blk.13.layer_output_scale.weight
create_tensor: loading tensor blk.13.ffn_norm.weight
create_tensor: loading tensor blk.13.ffn_gate.weight
create_tensor: loading tensor blk.13.ffn_up.weight
create_tensor: loading tensor blk.13.ffn_down.weight
create_tensor: loading tensor blk.13.post_ffw_norm.weight
create_tensor: loading tensor blk.14.attn_norm.weight
create_tensor: loading tensor blk.14.attn_q.weight
create_tensor: loading tensor blk.14.attn_k.weight
create_tensor: loading tensor blk.14.attn_v.weight
create_tensor: loading tensor blk.14.attn_output.weight
create_tensor: loading tensor blk.14.attn_q_norm.weight
create_tensor: loading tensor blk.14.attn_k_norm.weight
create_tensor: loading tensor blk.14.post_attention_norm.weight
create_tensor: loading tensor blk.14.layer_output_scale.weight
create_tensor: loading tensor blk.14.ffn_norm.weight
create_tensor: loading tensor blk.14.ffn_gate.weight
create_tensor: loading tensor blk.14.ffn_up.weight
create_tensor: loading tensor blk.14.ffn_down.weight
create_tensor: loading tensor blk.14.post_ffw_norm.weight
create_tensor: loading tensor blk.15.attn_norm.weight
create_tensor: loading tensor blk.15.attn_q.weight
create_tensor: loading tensor blk.15.attn_k.weight
create_tensor: loading tensor blk.15.attn_v.weight
create_tensor: loading tensor blk.15.attn_output.weight
create_tensor: loading tensor blk.15.attn_q_norm.weight
create_tensor: loading tensor blk.15.attn_k_norm.weight
create_tensor: loading tensor blk.15.post_attention_norm.weight
create_tensor: loading tensor blk.15.layer_output_scale.weight
create_tensor: loading tensor blk.15.ffn_norm.weight
create_tensor: loading tensor blk.15.ffn_gate.weight
create_tensor: loading tensor blk.15.ffn_up.weight
create_tensor: loading tensor blk.15.ffn_down.weight
create_tensor: loading tensor blk.15.post_ffw_norm.weight
create_tensor: loading tensor blk.16.attn_norm.weight
create_tensor: loading tensor blk.16.attn_q.weight
create_tensor: loading tensor blk.16.attn_k.weight
create_tensor: loading tensor blk.16.attn_v.weight
create_tensor: loading tensor blk.16.attn_output.weight
create_tensor: loading tensor blk.16.attn_q_norm.weight
create_tensor: loading tensor blk.16.attn_k_norm.weight
create_tensor: loading tensor blk.16.post_attention_norm.weight
create_tensor: loading tensor blk.16.layer_output_scale.weight
create_tensor: loading tensor blk.16.ffn_norm.weight
create_tensor: loading tensor blk.16.ffn_gate.weight
create_tensor: loading tensor blk.16.ffn_up.weight
create_tensor: loading tensor blk.16.ffn_down.weight
create_tensor: loading tensor blk.16.post_ffw_norm.weight
create_tensor: loading tensor blk.17.attn_norm.weight
create_tensor: loading tensor blk.17.attn_q.weight
create_tensor: loading tensor blk.17.attn_k.weight
create_tensor: loading tensor blk.17.attn_output.weight
create_tensor: loading tensor blk.17.attn_q_norm.weight
create_tensor: loading tensor blk.17.attn_k_norm.weight
create_tensor: loading tensor blk.17.post_attention_norm.weight
create_tensor: loading tensor blk.17.layer_output_scale.weight
create_tensor: loading tensor blk.17.ffn_norm.weight
create_tensor: loading tensor blk.17.ffn_gate.weight
create_tensor: loading tensor blk.17.ffn_up.weight
create_tensor: loading tensor blk.17.ffn_down.weight
create_tensor: loading tensor blk.17.post_ffw_norm.weight
create_tensor: loading tensor blk.18.attn_norm.weight
create_tensor: loading tensor blk.18.attn_q.weight
create_tensor: loading tensor blk.18.attn_k.weight
create_tensor: loading tensor blk.18.attn_v.weight
create_tensor: loading tensor blk.18.attn_output.weight
create_tensor: loading tensor blk.18.attn_q_norm.weight
create_tensor: loading tensor blk.18.attn_k_norm.weight
create_tensor: loading tensor blk.18.post_attention_norm.weight
create_tensor: loading tensor blk.18.layer_output_scale.weight
create_tensor: loading tensor blk.18.ffn_norm.weight
create_tensor: loading tensor blk.18.ffn_gate.weight
create_tensor: loading tensor blk.18.ffn_up.weight
create_tensor: loading tensor blk.18.ffn_down.weight
create_tensor: loading tensor blk.18.post_ffw_norm.weight
create_tensor: loading tensor blk.19.attn_norm.weight
create_tensor: loading tensor blk.19.attn_q.weight
create_tensor: loading tensor blk.19.attn_k.weight
create_tensor: loading tensor blk.19.attn_v.weight
create_tensor: loading tensor blk.19.attn_output.weight
create_tensor: loading tensor blk.19.attn_q_norm.weight
create_tensor: loading tensor blk.19.attn_k_norm.weight
create_tensor: loading tensor blk.19.post_attention_norm.weight
create_tensor: loading tensor blk.19.layer_output_scale.weight
create_tensor: loading tensor blk.19.ffn_norm.weight
create_tensor: loading tensor blk.19.ffn_gate.weight
create_tensor: loading tensor blk.19.ffn_up.weight
create_tensor: loading tensor blk.19.ffn_down.weight
create_tensor: loading tensor blk.19.post_ffw_norm.weight
create_tensor: loading tensor blk.20.attn_norm.weight
create_tensor: loading tensor blk.20.attn_q.weight
create_tensor: loading tensor blk.20.attn_k.weight
create_tensor: loading tensor blk.20.attn_v.weight
create_tensor: loading tensor blk.20.attn_output.weight
create_tensor: loading tensor blk.20.attn_q_norm.weight
create_tensor: loading tensor blk.20.attn_k_norm.weight
create_tensor: loading tensor blk.20.post_attention_norm.weight
create_tensor: loading tensor blk.20.layer_output_scale.weight
create_tensor: loading tensor blk.20.ffn_norm.weight
create_tensor: loading tensor blk.20.ffn_gate.weight
create_tensor: loading tensor blk.20.ffn_up.weight
create_tensor: loading tensor blk.20.ffn_down.weight
create_tensor: loading tensor blk.20.post_ffw_norm.weight
create_tensor: loading tensor blk.21.attn_norm.weight
create_tensor: loading tensor blk.21.attn_q.weight
create_tensor: loading tensor blk.21.attn_k.weight
create_tensor: loading tensor blk.21.attn_v.weight
create_tensor: loading tensor blk.21.attn_output.weight
create_tensor: loading tensor blk.21.attn_q_norm.weight
create_tensor: loading tensor blk.21.attn_k_norm.weight
create_tensor: loading tensor blk.21.post_attention_norm.weight
create_tensor: loading tensor blk.21.layer_output_scale.weight
create_tensor: loading tensor blk.21.ffn_norm.weight
create_tensor: loading tensor blk.21.ffn_gate.weight
create_tensor: loading tensor blk.21.ffn_up.weight
create_tensor: loading tensor blk.21.ffn_down.weight
create_tensor: loading tensor blk.21.post_ffw_norm.weight
create_tensor: loading tensor blk.22.attn_norm.weight
create_tensor: loading tensor blk.22.attn_q.weight
create_tensor: loading tensor blk.22.attn_k.weight
create_tensor: loading tensor blk.22.attn_v.weight
create_tensor: loading tensor blk.22.attn_output.weight
create_tensor: loading tensor blk.22.attn_q_norm.weight
create_tensor: loading tensor blk.22.attn_k_norm.weight
create_tensor: loading tensor blk.22.post_attention_norm.weight
create_tensor: loading tensor blk.22.layer_output_scale.weight
create_tensor: loading tensor blk.22.ffn_norm.weight
create_tensor: loading tensor blk.22.ffn_gate.weight
create_tensor: loading tensor blk.22.ffn_up.weight
create_tensor: loading tensor blk.22.ffn_down.weight
create_tensor: loading tensor blk.22.post_ffw_norm.weight
create_tensor: loading tensor blk.23.attn_norm.weight
create_tensor: loading tensor blk.23.attn_q.weight
create_tensor: loading tensor blk.23.attn_k.weight
create_tensor: loading tensor blk.23.attn_output.weight
create_tensor: loading tensor blk.23.attn_q_norm.weight
create_tensor: loading tensor blk.23.attn_k_norm.weight
create_tensor: loading tensor blk.23.post_attention_norm.weight
create_tensor: loading tensor blk.23.layer_output_scale.weight
create_tensor: loading tensor blk.23.ffn_norm.weight
create_tensor: loading tensor blk.23.ffn_gate.weight
create_tensor: loading tensor blk.23.ffn_up.weight
create_tensor: loading tensor blk.23.ffn_down.weight
create_tensor: loading tensor blk.23.post_ffw_norm.weight
create_tensor: loading tensor blk.24.attn_norm.weight
create_tensor: loading tensor blk.24.attn_q.weight
create_tensor: loading tensor blk.24.attn_k.weight
create_tensor: loading tensor blk.24.attn_v.weight
create_tensor: loading tensor blk.24.attn_output.weight
create_tensor: loading tensor blk.24.attn_q_norm.weight
create_tensor: loading tensor blk.24.attn_k_norm.weight
create_tensor: loading tensor blk.24.post_attention_norm.weight
create_tensor: loading tensor blk.24.layer_output_scale.weight
create_tensor: loading tensor blk.24.ffn_norm.weight
create_tensor: loading tensor blk.24.ffn_gate.weight
create_tensor: loading tensor blk.24.ffn_up.weight
create_tensor: loading tensor blk.24.ffn_down.weight
create_tensor: loading tensor blk.24.post_ffw_norm.weight
create_tensor: loading tensor blk.25.attn_norm.weight
create_tensor: loading tensor blk.25.attn_q.weight
create_tensor: loading tensor blk.25.attn_k.weight
create_tensor: loading tensor blk.25.attn_v.weight
create_tensor: loading tensor blk.25.attn_output.weight
create_tensor: loading tensor blk.25.attn_q_norm.weight
create_tensor: loading tensor blk.25.attn_k_norm.weight
create_tensor: loading tensor blk.25.post_attention_norm.weight
create_tensor: loading tensor blk.25.layer_output_scale.weight
create_tensor: loading tensor blk.25.ffn_norm.weight
create_tensor: loading tensor blk.25.ffn_gate.weight
create_tensor: loading tensor blk.25.ffn_up.weight
create_tensor: loading tensor blk.25.ffn_down.weight
create_tensor: loading tensor blk.25.post_ffw_norm.weight
create_tensor: loading tensor blk.26.attn_norm.weight
create_tensor: loading tensor blk.26.attn_q.weight
create_tensor: loading tensor blk.26.attn_k.weight
create_tensor: loading tensor blk.26.attn_v.weight
create_tensor: loading tensor blk.26.attn_output.weight
create_tensor: loading tensor blk.26.attn_q_norm.weight
create_tensor: loading tensor blk.26.attn_k_norm.weight
create_tensor: loading tensor blk.26.post_attention_norm.weight
create_tensor: loading tensor blk.26.layer_output_scale.weight
create_tensor: loading tensor blk.26.ffn_norm.weight
create_tensor: loading tensor blk.26.ffn_gate.weight
create_tensor: loading tensor blk.26.ffn_up.weight
create_tensor: loading tensor blk.26.ffn_down.weight
create_tensor: loading tensor blk.26.post_ffw_norm.weight
create_tensor: loading tensor blk.27.attn_norm.weight
create_tensor: loading tensor blk.27.attn_q.weight
create_tensor: loading tensor blk.27.attn_k.weight
create_tensor: loading tensor blk.27.attn_v.weight
create_tensor: loading tensor blk.27.attn_output.weight
create_tensor: loading tensor blk.27.attn_q_norm.weight
create_tensor: loading tensor blk.27.attn_k_norm.weight
create_tensor: loading tensor blk.27.post_attention_norm.weight
create_tensor: loading tensor blk.27.layer_output_scale.weight
create_tensor: loading tensor blk.27.ffn_norm.weight
create_tensor: loading tensor blk.27.ffn_gate.weight
create_tensor: loading tensor blk.27.ffn_up.weight
create_tensor: loading tensor blk.27.ffn_down.weight
create_tensor: loading tensor blk.27.post_ffw_norm.weight
create_tensor: loading tensor blk.28.attn_norm.weight
create_tensor: loading tensor blk.28.attn_q.weight
create_tensor: loading tensor blk.28.attn_k.weight
create_tensor: loading tensor blk.28.attn_v.weight
create_tensor: loading tensor blk.28.attn_output.weight
create_tensor: loading tensor blk.28.attn_q_norm.weight
create_tensor: loading tensor blk.28.attn_k_norm.weight
create_tensor: loading tensor blk.28.post_attention_norm.weight
create_tensor: loading tensor blk.28.layer_output_scale.weight
create_tensor: loading tensor blk.28.ffn_norm.weight
create_tensor: loading tensor blk.28.ffn_gate.weight
create_tensor: loading tensor blk.28.ffn_up.weight
create_tensor: loading tensor blk.28.ffn_down.weight
create_tensor: loading tensor blk.28.post_ffw_norm.weight
create_tensor: loading tensor blk.29.attn_norm.weight
create_tensor: loading tensor blk.29.attn_q.weight
create_tensor: loading tensor blk.29.attn_k.weight
create_tensor: loading tensor blk.29.attn_output.weight
create_tensor: loading tensor blk.29.attn_q_norm.weight
create_tensor: loading tensor blk.29.attn_k_norm.weight
create_tensor: loading tensor blk.29.post_attention_norm.weight
create_tensor: loading tensor blk.29.layer_output_scale.weight
create_tensor: loading tensor blk.29.ffn_norm.weight
create_tensor: loading tensor blk.29.ffn_gate.weight
create_tensor: loading tensor blk.29.ffn_up.weight
create_tensor: loading tensor blk.29.ffn_down.weight
create_tensor: loading tensor blk.29.post_ffw_norm.weight
create_tensor: loading tensor blk.30.attn_norm.weight
create_tensor: loading tensor blk.30.attn_q.weight
create_tensor: loading tensor blk.30.attn_k.weight
create_tensor: loading tensor blk.30.attn_v.weight
create_tensor: loading tensor blk.30.attn_output.weight
create_tensor: loading tensor blk.30.attn_q_norm.weight
create_tensor: loading tensor blk.30.attn_k_norm.weight
create_tensor: loading tensor blk.30.post_attention_norm.weight
create_tensor: loading tensor blk.30.layer_output_scale.weight
create_tensor: loading tensor blk.30.ffn_norm.weight
create_tensor: loading tensor blk.30.ffn_gate.weight
create_tensor: loading tensor blk.30.ffn_up.weight
create_tensor: loading tensor blk.30.ffn_down.weight
create_tensor: loading tensor blk.30.post_ffw_norm.weight
create_tensor: loading tensor blk.31.attn_norm.weight
create_tensor: loading tensor blk.31.attn_q.weight
create_tensor: loading tensor blk.31.attn_k.weight
create_tensor: loading tensor blk.31.attn_v.weight
create_tensor: loading tensor blk.31.attn_output.weight
create_tensor: loading tensor blk.31.attn_q_norm.weight
create_tensor: loading tensor blk.31.attn_k_norm.weight
create_tensor: loading tensor blk.31.post_attention_norm.weight
create_tensor: loading tensor blk.31.layer_output_scale.weight
create_tensor: loading tensor blk.31.ffn_norm.weight
create_tensor: loading tensor blk.31.ffn_gate.weight
create_tensor: loading tensor blk.31.ffn_up.weight
create_tensor: loading tensor blk.31.ffn_down.weight
create_tensor: loading tensor blk.31.post_ffw_norm.weight
create_tensor: loading tensor blk.32.attn_norm.weight
create_tensor: loading tensor blk.32.attn_q.weight
create_tensor: loading tensor blk.32.attn_k.weight
create_tensor: loading tensor blk.32.attn_v.weight
create_tensor: loading tensor blk.32.attn_output.weight
create_tensor: loading tensor blk.32.attn_q_norm.weight
create_tensor: loading tensor blk.32.attn_k_norm.weight
create_tensor: loading tensor blk.32.post_attention_norm.weight
create_tensor: loading tensor blk.32.layer_output_scale.weight
create_tensor: loading tensor blk.32.ffn_norm.weight
create_tensor: loading tensor blk.32.ffn_gate.weight
create_tensor: loading tensor blk.32.ffn_up.weight
create_tensor: loading tensor blk.32.ffn_down.weight
create_tensor: loading tensor blk.32.post_ffw_norm.weight
create_tensor: loading tensor blk.33.attn_norm.weight
create_tensor: loading tensor blk.33.attn_q.weight
create_tensor: loading tensor blk.33.attn_k.weight
create_tensor: loading tensor blk.33.attn_v.weight
create_tensor: loading tensor blk.33.attn_output.weight
create_tensor: loading tensor blk.33.attn_q_norm.weight
create_tensor: loading tensor blk.33.attn_k_norm.weight
create_tensor: loading tensor blk.33.post_attention_norm.weight
create_tensor: loading tensor blk.33.layer_output_scale.weight
create_tensor: loading tensor blk.33.ffn_norm.weight
create_tensor: loading tensor blk.33.ffn_gate.weight
create_tensor: loading tensor blk.33.ffn_up.weight
create_tensor: loading tensor blk.33.ffn_down.weight
create_tensor: loading tensor blk.33.post_ffw_norm.weight
create_tensor: loading tensor blk.34.attn_norm.weight
create_tensor: loading tensor blk.34.attn_q.weight
create_tensor: loading tensor blk.34.attn_k.weight
create_tensor: loading tensor blk.34.attn_v.weight
create_tensor: loading tensor blk.34.attn_output.weight
create_tensor: loading tensor blk.34.attn_q_norm.weight
create_tensor: loading tensor blk.34.attn_k_norm.weight
create_tensor: loading tensor blk.34.post_attention_norm.weight
create_tensor: loading tensor blk.34.layer_output_scale.weight
create_tensor: loading tensor blk.34.ffn_norm.weight
create_tensor: loading tensor blk.34.ffn_gate.weight
create_tensor: loading tensor blk.34.ffn_up.weight
create_tensor: loading tensor blk.34.ffn_down.weight
create_tensor: loading tensor blk.34.post_ffw_norm.weight
create_tensor: loading tensor blk.35.attn_norm.weight
create_tensor: loading tensor blk.35.attn_q.weight
create_tensor: loading tensor blk.35.attn_k.weight
create_tensor: loading tensor blk.35.attn_output.weight
create_tensor: loading tensor blk.35.attn_q_norm.weight
create_tensor: loading tensor blk.35.attn_k_norm.weight
create_tensor: loading tensor blk.35.post_attention_norm.weight
create_tensor: loading tensor blk.35.layer_output_scale.weight
create_tensor: loading tensor blk.35.ffn_norm.weight
create_tensor: loading tensor blk.35.ffn_gate.weight
create_tensor: loading tensor blk.35.ffn_up.weight
create_tensor: loading tensor blk.35.ffn_down.weight
create_tensor: loading tensor blk.35.post_ffw_norm.weight
create_tensor: loading tensor blk.36.attn_norm.weight
create_tensor: loading tensor blk.36.attn_q.weight
create_tensor: loading tensor blk.36.attn_k.weight
create_tensor: loading tensor blk.36.attn_v.weight
create_tensor: loading tensor blk.36.attn_output.weight
create_tensor: loading tensor blk.36.attn_q_norm.weight
create_tensor: loading tensor blk.36.attn_k_norm.weight
create_tensor: loading tensor blk.36.post_attention_norm.weight
create_tensor: loading tensor blk.36.layer_output_scale.weight
create_tensor: loading tensor blk.36.ffn_norm.weight
create_tensor: loading tensor blk.36.ffn_gate.weight
create_tensor: loading tensor blk.36.ffn_up.weight
create_tensor: loading tensor blk.36.ffn_down.weight
create_tensor: loading tensor blk.36.post_ffw_norm.weight
create_tensor: loading tensor blk.37.attn_norm.weight
create_tensor: loading tensor blk.37.attn_q.weight
create_tensor: loading tensor blk.37.attn_k.weight
create_tensor: loading tensor blk.37.attn_v.weight
create_tensor: loading tensor blk.37.attn_output.weight
create_tensor: loading tensor blk.37.attn_q_norm.weight
create_tensor: loading tensor blk.37.attn_k_norm.weight
create_tensor: loading tensor blk.37.post_attention_norm.weight
create_tensor: loading tensor blk.37.layer_output_scale.weight
create_tensor: loading tensor blk.37.ffn_norm.weight
create_tensor: loading tensor blk.37.ffn_gate.weight
create_tensor: loading tensor blk.37.ffn_up.weight
create_tensor: loading tensor blk.37.ffn_down.weight
create_tensor: loading tensor blk.37.post_ffw_norm.weight
create_tensor: loading tensor blk.38.attn_norm.weight
create_tensor: loading tensor blk.38.attn_q.weight
create_tensor: loading tensor blk.38.attn_k.weight
create_tensor: loading tensor blk.38.attn_v.weight
create_tensor: loading tensor blk.38.attn_output.weight
create_tensor: loading tensor blk.38.attn_q_norm.weight
create_tensor: loading tensor blk.38.attn_k_norm.weight
create_tensor: loading tensor blk.38.post_attention_norm.weight
create_tensor: loading tensor blk.38.layer_output_scale.weight
create_tensor: loading tensor blk.38.ffn_norm.weight
create_tensor: loading tensor blk.38.ffn_gate.weight
create_tensor: loading tensor blk.38.ffn_up.weight
create_tensor: loading tensor blk.38.ffn_down.weight
create_tensor: loading tensor blk.38.post_ffw_norm.weight
create_tensor: loading tensor blk.39.attn_norm.weight
create_tensor: loading tensor blk.39.attn_q.weight
create_tensor: loading tensor blk.39.attn_k.weight
create_tensor: loading tensor blk.39.attn_v.weight
create_tensor: loading tensor blk.39.attn_output.weight
create_tensor: loading tensor blk.39.attn_q_norm.weight
create_tensor: loading tensor blk.39.attn_k_norm.weight
create_tensor: loading tensor blk.39.post_attention_norm.weight
create_tensor: loading tensor blk.39.layer_output_scale.weight
create_tensor: loading tensor blk.39.ffn_norm.weight
create_tensor: loading tensor blk.39.ffn_gate.weight
create_tensor: loading tensor blk.39.ffn_up.weight
create_tensor: loading tensor blk.39.ffn_down.weight
create_tensor: loading tensor blk.39.post_ffw_norm.weight
create_tensor: loading tensor blk.40.attn_norm.weight
create_tensor: loading tensor blk.40.attn_q.weight
create_tensor: loading tensor blk.40.attn_k.weight
create_tensor: loading tensor blk.40.attn_v.weight
create_tensor: loading tensor blk.40.attn_output.weight
create_tensor: loading tensor blk.40.attn_q_norm.weight
create_tensor: loading tensor blk.40.attn_k_norm.weight
create_tensor: loading tensor blk.40.post_attention_norm.weight
create_tensor: loading tensor blk.40.layer_output_scale.weight
create_tensor: loading tensor blk.40.ffn_norm.weight
create_tensor: loading tensor blk.40.ffn_gate.weight
create_tensor: loading tensor blk.40.ffn_up.weight
create_tensor: loading tensor blk.40.ffn_down.weight
create_tensor: loading tensor blk.40.post_ffw_norm.weight
create_tensor: loading tensor blk.41.attn_norm.weight
create_tensor: loading tensor blk.41.attn_q.weight
create_tensor: loading tensor blk.41.attn_k.weight
create_tensor: loading tensor blk.41.attn_output.weight
create_tensor: loading tensor blk.41.attn_q_norm.weight
create_tensor: loading tensor blk.41.attn_k_norm.weight
create_tensor: loading tensor blk.41.post_attention_norm.weight
create_tensor: loading tensor blk.41.layer_output_scale.weight
create_tensor: loading tensor rope_freqs.weight
create_tensor: loading tensor blk.41.ffn_norm.weight
create_tensor: loading tensor blk.41.ffn_gate.weight
create_tensor: loading tensor blk.41.ffn_up.weight
create_tensor: loading tensor blk.41.ffn_down.weight
create_tensor: loading tensor blk.41.post_ffw_norm.weight
create_tensor: loading tensor blk.42.attn_norm.weight
create_tensor: loading tensor blk.42.attn_q.weight
create_tensor: loading tensor blk.42.attn_k.weight
create_tensor: loading tensor blk.42.attn_v.weight
create_tensor: loading tensor blk.42.attn_output.weight
create_tensor: loading tensor blk.42.attn_q_norm.weight
create_tensor: loading tensor blk.42.attn_k_norm.weight
create_tensor: loading tensor blk.42.post_attention_norm.weight
create_tensor: loading tensor blk.42.layer_output_scale.weight
create_tensor: loading tensor blk.42.ffn_norm.weight
create_tensor: loading tensor blk.42.ffn_gate.weight
create_tensor: loading tensor blk.42.ffn_up.weight
create_tensor: loading tensor blk.42.ffn_down.weight
create_tensor: loading tensor blk.42.post_ffw_norm.weight
create_tensor: loading tensor blk.43.attn_norm.weight
create_tensor: loading tensor blk.43.attn_q.weight
create_tensor: loading tensor blk.43.attn_k.weight
create_tensor: loading tensor blk.43.attn_v.weight
create_tensor: loading tensor blk.43.attn_output.weight
create_tensor: loading tensor blk.43.attn_q_norm.weight
create_tensor: loading tensor blk.43.attn_k_norm.weight
create_tensor: loading tensor blk.43.post_attention_norm.weight
create_tensor: loading tensor blk.43.layer_output_scale.weight
create_tensor: loading tensor blk.43.ffn_norm.weight
create_tensor: loading tensor blk.43.ffn_gate.weight
create_tensor: loading tensor blk.43.ffn_up.weight
create_tensor: loading tensor blk.43.ffn_down.weight
create_tensor: loading tensor blk.43.post_ffw_norm.weight
create_tensor: loading tensor blk.44.attn_norm.weight
create_tensor: loading tensor blk.44.attn_q.weight
create_tensor: loading tensor blk.44.attn_k.weight
create_tensor: loading tensor blk.44.attn_v.weight
create_tensor: loading tensor blk.44.attn_output.weight
create_tensor: loading tensor blk.44.attn_q_norm.weight
create_tensor: loading tensor blk.44.attn_k_norm.weight
create_tensor: loading tensor blk.44.post_attention_norm.weight
create_tensor: loading tensor blk.44.layer_output_scale.weight
create_tensor: loading tensor blk.44.ffn_norm.weight
create_tensor: loading tensor blk.44.ffn_gate.weight
create_tensor: loading tensor blk.44.ffn_up.weight
create_tensor: loading tensor blk.44.ffn_down.weight
create_tensor: loading tensor blk.44.post_ffw_norm.weight
create_tensor: loading tensor blk.45.attn_norm.weight
create_tensor: loading tensor blk.45.attn_q.weight
create_tensor: loading tensor blk.45.attn_k.weight
create_tensor: loading tensor blk.45.attn_v.weight
create_tensor: loading tensor blk.45.attn_output.weight
create_tensor: loading tensor blk.45.attn_q_norm.weight
create_tensor: loading tensor blk.45.attn_k_norm.weight
create_tensor: loading tensor blk.45.post_attention_norm.weight
create_tensor: loading tensor blk.45.layer_output_scale.weight
create_tensor: loading tensor blk.45.ffn_norm.weight
create_tensor: loading tensor blk.45.ffn_gate.weight
create_tensor: loading tensor blk.45.ffn_up.weight
create_tensor: loading tensor blk.45.ffn_down.weight
create_tensor: loading tensor blk.45.post_ffw_norm.weight
create_tensor: loading tensor blk.46.attn_norm.weight
create_tensor: loading tensor blk.46.attn_q.weight
create_tensor: loading tensor blk.46.attn_k.weight
create_tensor: loading tensor blk.46.attn_v.weight
create_tensor: loading tensor blk.46.attn_output.weight
create_tensor: loading tensor blk.46.attn_q_norm.weight
create_tensor: loading tensor blk.46.attn_k_norm.weight
create_tensor: loading tensor blk.46.post_attention_norm.weight
create_tensor: loading tensor blk.46.layer_output_scale.weight
create_tensor: loading tensor blk.46.ffn_norm.weight
create_tensor: loading tensor blk.46.ffn_gate.weight
create_tensor: loading tensor blk.46.ffn_up.weight
create_tensor: loading tensor blk.46.ffn_down.weight
create_tensor: loading tensor blk.46.post_ffw_norm.weight
create_tensor: loading tensor blk.47.attn_norm.weight
create_tensor: loading tensor blk.47.attn_q.weight
create_tensor: loading tensor blk.47.attn_k.weight
create_tensor: loading tensor blk.47.attn_output.weight
create_tensor: loading tensor blk.47.attn_q_norm.weight
create_tensor: loading tensor blk.47.attn_k_norm.weight
create_tensor: loading tensor blk.47.post_attention_norm.weight
create_tensor: loading tensor blk.47.layer_output_scale.weight
create_tensor: loading tensor blk.47.ffn_norm.weight
create_tensor: loading tensor blk.47.ffn_gate.weight
create_tensor: loading tensor blk.47.ffn_up.weight
create_tensor: loading tensor blk.47.ffn_down.weight
create_tensor: loading tensor blk.47.post_ffw_norm.weight
create_tensor: loading tensor blk.48.attn_norm.weight
create_tensor: loading tensor blk.48.attn_q.weight
create_tensor: loading tensor blk.48.attn_k.weight
create_tensor: loading tensor blk.48.attn_v.weight
create_tensor: loading tensor blk.48.attn_output.weight
create_tensor: loading tensor blk.48.attn_q_norm.weight
create_tensor: loading tensor blk.48.attn_k_norm.weight
create_tensor: loading tensor blk.48.post_attention_norm.weight
create_tensor: loading tensor blk.48.layer_output_scale.weight
create_tensor: loading tensor blk.48.ffn_norm.weight
create_tensor: loading tensor blk.48.ffn_gate.weight
create_tensor: loading tensor blk.48.ffn_up.weight
create_tensor: loading tensor blk.48.ffn_down.weight
create_tensor: loading tensor blk.48.post_ffw_norm.weight
create_tensor: loading tensor blk.49.attn_norm.weight
create_tensor: loading tensor blk.49.attn_q.weight
create_tensor: loading tensor blk.49.attn_k.weight
create_tensor: loading tensor blk.49.attn_v.weight
create_tensor: loading tensor blk.49.attn_output.weight
create_tensor: loading tensor blk.49.attn_q_norm.weight
create_tensor: loading tensor blk.49.attn_k_norm.weight
create_tensor: loading tensor blk.49.post_attention_norm.weight
create_tensor: loading tensor blk.49.layer_output_scale.weight
create_tensor: loading tensor blk.49.ffn_norm.weight
create_tensor: loading tensor blk.49.ffn_gate.weight
create_tensor: loading tensor blk.49.ffn_up.weight
create_tensor: loading tensor blk.49.ffn_down.weight
create_tensor: loading tensor blk.49.post_ffw_norm.weight
create_tensor: loading tensor blk.50.attn_norm.weight
create_tensor: loading tensor blk.50.attn_q.weight
create_tensor: loading tensor blk.50.attn_k.weight
create_tensor: loading tensor blk.50.attn_v.weight
create_tensor: loading tensor blk.50.attn_output.weight
create_tensor: loading tensor blk.50.attn_q_norm.weight
create_tensor: loading tensor blk.50.attn_k_norm.weight
create_tensor: loading tensor blk.50.post_attention_norm.weight
create_tensor: loading tensor blk.50.layer_output_scale.weight
create_tensor: loading tensor blk.50.ffn_norm.weight
create_tensor: loading tensor blk.50.ffn_gate.weight
create_tensor: loading tensor blk.50.ffn_up.weight
create_tensor: loading tensor blk.50.ffn_down.weight
create_tensor: loading tensor blk.50.post_ffw_norm.weight
create_tensor: loading tensor blk.51.attn_norm.weight
create_tensor: loading tensor blk.51.attn_q.weight
create_tensor: loading tensor blk.51.attn_k.weight
create_tensor: loading tensor blk.51.attn_v.weight
create_tensor: loading tensor blk.51.attn_output.weight
create_tensor: loading tensor blk.51.attn_q_norm.weight
create_tensor: loading tensor blk.51.attn_k_norm.weight
create_tensor: loading tensor blk.51.post_attention_norm.weight
create_tensor: loading tensor blk.51.layer_output_scale.weight
create_tensor: loading tensor blk.51.ffn_norm.weight
create_tensor: loading tensor blk.51.ffn_gate.weight
create_tensor: loading tensor blk.51.ffn_up.weight
create_tensor: loading tensor blk.51.ffn_down.weight
create_tensor: loading tensor blk.51.post_ffw_norm.weight
create_tensor: loading tensor blk.52.attn_norm.weight
create_tensor: loading tensor blk.52.attn_q.weight
create_tensor: loading tensor blk.52.attn_k.weight
create_tensor: loading tensor blk.52.attn_v.weight
create_tensor: loading tensor blk.52.attn_output.weight
create_tensor: loading tensor blk.52.attn_q_norm.weight
create_tensor: loading tensor blk.52.attn_k_norm.weight
create_tensor: loading tensor blk.52.post_attention_norm.weight
create_tensor: loading tensor blk.52.layer_output_scale.weight
create_tensor: loading tensor blk.52.ffn_norm.weight
create_tensor: loading tensor blk.52.ffn_gate.weight
create_tensor: loading tensor blk.52.ffn_up.weight
create_tensor: loading tensor blk.52.ffn_down.weight
create_tensor: loading tensor blk.52.post_ffw_norm.weight
create_tensor: loading tensor blk.53.attn_norm.weight
create_tensor: loading tensor blk.53.attn_q.weight
create_tensor: loading tensor blk.53.attn_k.weight
create_tensor: loading tensor blk.53.attn_output.weight
create_tensor: loading tensor blk.53.attn_q_norm.weight
create_tensor: loading tensor blk.53.attn_k_norm.weight
create_tensor: loading tensor blk.53.post_attention_norm.weight
create_tensor: loading tensor blk.53.layer_output_scale.weight
create_tensor: loading tensor blk.53.ffn_norm.weight
create_tensor: loading tensor blk.53.ffn_gate.weight
create_tensor: loading tensor blk.53.ffn_up.weight
create_tensor: loading tensor blk.53.ffn_down.weight
create_tensor: loading tensor blk.53.post_ffw_norm.weight
create_tensor: loading tensor blk.54.attn_norm.weight
create_tensor: loading tensor blk.54.attn_q.weight
create_tensor: loading tensor blk.54.attn_k.weight
create_tensor: loading tensor blk.54.attn_v.weight
create_tensor: loading tensor blk.54.attn_output.weight
create_tensor: loading tensor blk.54.attn_q_norm.weight
create_tensor: loading tensor blk.54.attn_k_norm.weight
create_tensor: loading tensor blk.54.post_attention_norm.weight
create_tensor: loading tensor blk.54.layer_output_scale.weight
create_tensor: loading tensor blk.54.ffn_norm.weight
create_tensor: loading tensor blk.54.ffn_gate.weight
create_tensor: loading tensor blk.54.ffn_up.weight
create_tensor: loading tensor blk.54.ffn_down.weight
create_tensor: loading tensor blk.54.post_ffw_norm.weight
create_tensor: loading tensor blk.55.attn_norm.weight
create_tensor: loading tensor blk.55.attn_q.weight
create_tensor: loading tensor blk.55.attn_k.weight
create_tensor: loading tensor blk.55.attn_v.weight
create_tensor: loading tensor blk.55.attn_output.weight
create_tensor: loading tensor blk.55.attn_q_norm.weight
create_tensor: loading tensor blk.55.attn_k_norm.weight
create_tensor: loading tensor blk.55.post_attention_norm.weight
create_tensor: loading tensor blk.55.layer_output_scale.weight
create_tensor: loading tensor blk.55.ffn_norm.weight
create_tensor: loading tensor blk.55.ffn_gate.weight
create_tensor: loading tensor blk.55.ffn_up.weight
create_tensor: loading tensor blk.55.ffn_down.weight
create_tensor: loading tensor blk.55.post_ffw_norm.weight
create_tensor: loading tensor blk.56.attn_norm.weight
create_tensor: loading tensor blk.56.attn_q.weight
create_tensor: loading tensor blk.56.attn_k.weight
create_tensor: loading tensor blk.56.attn_v.weight
create_tensor: loading tensor blk.56.attn_output.weight
create_tensor: loading tensor blk.56.attn_q_norm.weight
create_tensor: loading tensor blk.56.attn_k_norm.weight
create_tensor: loading tensor blk.56.post_attention_norm.weight
create_tensor: loading tensor blk.56.layer_output_scale.weight
create_tensor: loading tensor blk.56.ffn_norm.weight
create_tensor: loading tensor blk.56.ffn_gate.weight
create_tensor: loading tensor blk.56.ffn_up.weight
create_tensor: loading tensor blk.56.ffn_down.weight
create_tensor: loading tensor blk.56.post_ffw_norm.weight
create_tensor: loading tensor blk.57.attn_norm.weight
create_tensor: loading tensor blk.57.attn_q.weight
create_tensor: loading tensor blk.57.attn_k.weight
create_tensor: loading tensor blk.57.attn_v.weight
create_tensor: loading tensor blk.57.attn_output.weight
create_tensor: loading tensor blk.57.attn_q_norm.weight
create_tensor: loading tensor blk.57.attn_k_norm.weight
create_tensor: loading tensor blk.57.post_attention_norm.weight
create_tensor: loading tensor blk.57.layer_output_scale.weight
create_tensor: loading tensor blk.57.ffn_norm.weight
create_tensor: loading tensor blk.57.ffn_gate.weight
create_tensor: loading tensor blk.57.ffn_up.weight
create_tensor: loading tensor blk.57.ffn_down.weight
create_tensor: loading tensor blk.57.post_ffw_norm.weight
create_tensor: loading tensor blk.58.attn_norm.weight
create_tensor: loading tensor blk.58.attn_q.weight
create_tensor: loading tensor blk.58.attn_k.weight
create_tensor: loading tensor blk.58.attn_v.weight
create_tensor: loading tensor blk.58.attn_output.weight
create_tensor: loading tensor blk.58.attn_q_norm.weight
create_tensor: loading tensor blk.58.attn_k_norm.weight
create_tensor: loading tensor blk.58.post_attention_norm.weight
create_tensor: loading tensor blk.58.layer_output_scale.weight
create_tensor: loading tensor blk.58.ffn_norm.weight
create_tensor: loading tensor blk.58.ffn_gate.weight
create_tensor: loading tensor blk.58.ffn_up.weight
create_tensor: loading tensor blk.58.ffn_down.weight
create_tensor: loading tensor blk.58.post_ffw_norm.weight
create_tensor: loading tensor blk.59.attn_norm.weight
create_tensor: loading tensor blk.59.attn_q.weight
create_tensor: loading tensor blk.59.attn_k.weight
create_tensor: loading tensor blk.59.attn_output.weight
create_tensor: loading tensor blk.59.attn_q_norm.weight
create_tensor: loading tensor blk.59.attn_k_norm.weight
create_tensor: loading tensor blk.59.post_attention_norm.weight
create_tensor: loading tensor blk.59.layer_output_scale.weight
create_tensor: loading tensor blk.59.ffn_norm.weight
create_tensor: loading tensor blk.59.ffn_gate.weight
create_tensor: loading tensor blk.59.ffn_up.weight
create_tensor: loading tensor blk.59.ffn_down.weight
create_tensor: loading tensor blk.59.post_ffw_norm.weight
warning: failed to mlock 1497366528-byte buffer (after previously locking 0 bytes): Cannot allocate memory
Try increasing RLIMIT_MEMLOCK ('ulimit -l' as root).
load_tensors: offloading output layer to GPU
load_tensors: offloading 59 repeating layers to GPU
load_tensors: offloaded 61/61 layers to GPU
load_tensors:        CUDA0 model buffer size = 18293.86 MiB
load_tensors:        CUDA1 model buffer size = 12814.96 MiB
load_tensors:    CUDA_Host model buffer size =  1428.00 MiB
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
........................................................load_all_data: using async uploads for device CUDA1, buffer type CUDA1, backend CUDA1
....................................load_all_data: buffer type CUDA_Host is not the default buffer type for device CUDA0 for async uploads
.
common_init_result: added <eos> logit bias = -inf
common_init_result: added <|tool_response> logit bias = -inf
common_init_result: added <turn|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 39936
llama_context: n_ctx_seq     = 39936
llama_context: n_batch       = 256
llama_context: n_ubatch      = 256
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (39936) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:  CUDA_Host  output buffer size =     1.00 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 39936 cells
llama_kv_cache: layer   0: filtered
llama_kv_cache: layer   1: filtered
llama_kv_cache: layer   2: filtered
llama_kv_cache: layer   3: filtered
llama_kv_cache: layer   4: filtered
llama_kv_cache: layer   5: dev = CUDA0
llama_kv_cache: layer   6: filtered
llama_kv_cache: layer   7: filtered
llama_kv_cache: layer   8: filtered
llama_kv_cache: layer   9: filtered
llama_kv_cache: layer  10: filtered
llama_kv_cache: layer  11: dev = CUDA0
llama_kv_cache: layer  12: filtered
llama_kv_cache: layer  13: filtered
llama_kv_cache: layer  14: filtered
llama_kv_cache: layer  15: filtered
llama_kv_cache: layer  16: filtered
llama_kv_cache: layer  17: dev = CUDA0
llama_kv_cache: layer  18: filtered
llama_kv_cache: layer  19: filtered
llama_kv_cache: layer  20: filtered
llama_kv_cache: layer  21: filtered
llama_kv_cache: layer  22: filtered
llama_kv_cache: layer  23: dev = CUDA0
llama_kv_cache: layer  24: filtered
llama_kv_cache: layer  25: filtered
llama_kv_cache: layer  26: filtered
llama_kv_cache: layer  27: filtered
llama_kv_cache: layer  28: filtered
llama_kv_cache: layer  29: dev = CUDA0
llama_kv_cache: layer  30: filtered
llama_kv_cache: layer  31: filtered
llama_kv_cache: layer  32: filtered
llama_kv_cache: layer  33: filtered
llama_kv_cache: layer  34: filtered
llama_kv_cache: layer  35: dev = CUDA0
llama_kv_cache: layer  36: filtered
llama_kv_cache: layer  37: filtered
llama_kv_cache: layer  38: filtered
llama_kv_cache: layer  39: filtered
llama_kv_cache: layer  40: filtered
llama_kv_cache: layer  41: dev = CUDA1
llama_kv_cache: layer  42: filtered
llama_kv_cache: layer  43: filtered
llama_kv_cache: layer  44: filtered
llama_kv_cache: layer  45: filtered
llama_kv_cache: layer  46: filtered
llama_kv_cache: layer  47: dev = CUDA1
llama_kv_cache: layer  48: filtered
llama_kv_cache: layer  49: filtered
llama_kv_cache: layer  50: filtered
llama_kv_cache: layer  51: filtered
llama_kv_cache: layer  52: filtered
llama_kv_cache: layer  53: dev = CUDA1
llama_kv_cache: layer  54: filtered
llama_kv_cache: layer  55: filtered
llama_kv_cache: layer  56: filtered
llama_kv_cache: layer  57: filtered
llama_kv_cache: layer  58: filtered
llama_kv_cache: layer  59: dev = CUDA1
llama_kv_cache: reusing layers:
llama_kv_cache: - layer   0: no reuse
llama_kv_cache: - layer   1: no reuse
llama_kv_cache: - layer   2: no reuse
llama_kv_cache: - layer   3: no reuse
llama_kv_cache: - layer   4: no reuse
llama_kv_cache: - layer   5: no reuse
llama_kv_cache: - layer   6: no reuse
llama_kv_cache: - layer   7: no reuse
llama_kv_cache: - layer   8: no reuse
llama_kv_cache: - layer   9: no reuse
llama_kv_cache: - layer  10: no reuse
llama_kv_cache: - layer  11: no reuse
llama_kv_cache: - layer  12: no reuse
llama_kv_cache: - layer  13: no reuse
llama_kv_cache: - layer  14: no reuse
llama_kv_cache: - layer  15: no reuse
llama_kv_cache: - layer  16: no reuse
llama_kv_cache: - layer  17: no reuse
llama_kv_cache: - layer  18: no reuse
llama_kv_cache: - layer  19: no reuse
llama_kv_cache: - layer  20: no reuse
llama_kv_cache: - layer  21: no reuse
llama_kv_cache: - layer  22: no reuse
llama_kv_cache: - layer  23: no reuse
llama_kv_cache: - layer  24: no reuse
llama_kv_cache: - layer  25: no reuse
llama_kv_cache: - layer  26: no reuse
llama_kv_cache: - layer  27: no reuse
llama_kv_cache: - layer  28: no reuse
llama_kv_cache: - layer  29: no reuse
llama_kv_cache: - layer  30: no reuse
llama_kv_cache: - layer  31: no reuse
llama_kv_cache: - layer  32: no reuse
llama_kv_cache: - layer  33: no reuse
llama_kv_cache: - layer  34: no reuse
llama_kv_cache: - layer  35: no reuse
llama_kv_cache: - layer  36: no reuse
llama_kv_cache: - layer  37: no reuse
llama_kv_cache: - layer  38: no reuse
llama_kv_cache: - layer  39: no reuse
llama_kv_cache: - layer  40: no reuse
llama_kv_cache: - layer  41: no reuse
llama_kv_cache: - layer  42: no reuse
llama_kv_cache: - layer  43: no reuse
llama_kv_cache: - layer  44: no reuse
llama_kv_cache: - layer  45: no reuse
llama_kv_cache: - layer  46: no reuse
llama_kv_cache: - layer  47: no reuse
llama_kv_cache: - layer  48: no reuse
llama_kv_cache: - layer  49: no reuse
llama_kv_cache: - layer  50: no reuse
llama_kv_cache: - layer  51: no reuse
llama_kv_cache: - layer  52: no reuse
llama_kv_cache: - layer  53: no reuse
llama_kv_cache: - layer  54: no reuse
llama_kv_cache: - layer  55: no reuse
llama_kv_cache: - layer  56: no reuse
llama_kv_cache: - layer  57: no reuse
llama_kv_cache: - layer  58: no reuse
llama_kv_cache: - layer  59: no reuse
llama_kv_cache:      CUDA0 KV buffer size =  1872.00 MiB
llama_kv_cache:      CUDA1 KV buffer size =  1248.00 MiB
llama_kv_cache: size = 3120.00 MiB ( 39936 cells,  10 layers,  1/1 seqs), K (f16): 1560.00 MiB, V (f16): 1560.00 MiB
llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 512
llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 512
llama_kv_cache_iswa: creating     SWA KV cache, size = 1280 cells
llama_kv_cache: layer   0: dev = CUDA0
llama_kv_cache: layer   1: dev = CUDA0
llama_kv_cache: layer   2: dev = CUDA0
llama_kv_cache: layer   3: dev = CUDA0
llama_kv_cache: layer   4: dev = CUDA0
llama_kv_cache: layer   5: filtered
llama_kv_cache: layer   6: dev = CUDA0
llama_kv_cache: layer   7: dev = CUDA0
llama_kv_cache: layer   8: dev = CUDA0
llama_kv_cache: layer   9: dev = CUDA0
llama_kv_cache: layer  10: dev = CUDA0
llama_kv_cache: layer  11: filtered
llama_kv_cache: layer  12: dev = CUDA0
llama_kv_cache: layer  13: dev = CUDA0
llama_kv_cache: layer  14: dev = CUDA0
llama_kv_cache: layer  15: dev = CUDA0
llama_kv_cache: layer  16: dev = CUDA0
llama_kv_cache: layer  17: filtered
llama_kv_cache: layer  18: dev = CUDA0
llama_kv_cache: layer  19: dev = CUDA0
llama_kv_cache: layer  20: dev = CUDA0
llama_kv_cache: layer  21: dev = CUDA0
llama_kv_cache: layer  22: dev = CUDA0
llama_kv_cache: layer  23: filtered
llama_kv_cache: layer  24: dev = CUDA0
llama_kv_cache: layer  25: dev = CUDA0
llama_kv_cache: layer  26: dev = CUDA0
llama_kv_cache: layer  27: dev = CUDA0
llama_kv_cache: layer  28: dev = CUDA0
llama_kv_cache: layer  29: filtered
llama_kv_cache: layer  30: dev = CUDA0
llama_kv_cache: layer  31: dev = CUDA0
llama_kv_cache: layer  32: dev = CUDA0
llama_kv_cache: layer  33: dev = CUDA0
llama_kv_cache: layer  34: dev = CUDA0
llama_kv_cache: layer  35: filtered
llama_kv_cache: layer  36: dev = CUDA0
llama_kv_cache: layer  37: dev = CUDA1
llama_kv_cache: layer  38: dev = CUDA1
llama_kv_cache: layer  39: dev = CUDA1
llama_kv_cache: layer  40: dev = CUDA1
llama_kv_cache: layer  41: filtered
llama_kv_cache: layer  42: dev = CUDA1
llama_kv_cache: layer  43: dev = CUDA1
llama_kv_cache: layer  44: dev = CUDA1
llama_kv_cache: layer  45: dev = CUDA1
llama_kv_cache: layer  46: dev = CUDA1
llama_kv_cache: layer  47: filtered
llama_kv_cache: layer  48: dev = CUDA1
llama_kv_cache: layer  49: dev = CUDA1
llama_kv_cache: layer  50: dev = CUDA1
llama_kv_cache: layer  51: dev = CUDA1
llama_kv_cache: layer  52: dev = CUDA1
llama_kv_cache: layer  53: filtered
llama_kv_cache: layer  54: dev = CUDA1
llama_kv_cache: layer  55: dev = CUDA1
llama_kv_cache: layer  56: dev = CUDA1
llama_kv_cache: layer  57: dev = CUDA1
llama_kv_cache: layer  58: dev = CUDA1
llama_kv_cache: layer  59: filtered
llama_kv_cache: reusing layers:
llama_kv_cache: - layer   0: no reuse
llama_kv_cache: - layer   1: no reuse
llama_kv_cache: - layer   2: no reuse
llama_kv_cache: - layer   3: no reuse
llama_kv_cache: - layer   4: no reuse
llama_kv_cache: - layer   5: no reuse
llama_kv_cache: - layer   6: no reuse
llama_kv_cache: - layer   7: no reuse
llama_kv_cache: - layer   8: no reuse
llama_kv_cache: - layer   9: no reuse
llama_kv_cache: - layer  10: no reuse
llama_kv_cache: - layer  11: no reuse
llama_kv_cache: - layer  12: no reuse
llama_kv_cache: - layer  13: no reuse
llama_kv_cache: - layer  14: no reuse
llama_kv_cache: - layer  15: no reuse
llama_kv_cache: - layer  16: no reuse
llama_kv_cache: - layer  17: no reuse
llama_kv_cache: - layer  18: no reuse
llama_kv_cache: - layer  19: no reuse
llama_kv_cache: - layer  20: no reuse
llama_kv_cache: - layer  21: no reuse
llama_kv_cache: - layer  22: no reuse
llama_kv_cache: - layer  23: no reuse
llama_kv_cache: - layer  24: no reuse
llama_kv_cache: - layer  25: no reuse
llama_kv_cache: - layer  26: no reuse
llama_kv_cache: - layer  27: no reuse
llama_kv_cache: - layer  28: no reuse
llama_kv_cache: - layer  29: no reuse
llama_kv_cache: - layer  30: no reuse
llama_kv_cache: - layer  31: no reuse
llama_kv_cache: - layer  32: no reuse
llama_kv_cache: - layer  33: no reuse
llama_kv_cache: - layer  34: no reuse
llama_kv_cache: - layer  35: no reuse
llama_kv_cache: - layer  36: no reuse
llama_kv_cache: - layer  37: no reuse
llama_kv_cache: - layer  38: no reuse
llama_kv_cache: - layer  39: no reuse
llama_kv_cache: - layer  40: no reuse
llama_kv_cache: - layer  41: no reuse
llama_kv_cache: - layer  42: no reuse
llama_kv_cache: - layer  43: no reuse
llama_kv_cache: - layer  44: no reuse
llama_kv_cache: - layer  45: no reuse
llama_kv_cache: - layer  46: no reuse
llama_kv_cache: - layer  47: no reuse
llama_kv_cache: - layer  48: no reuse
llama_kv_cache: - layer  49: no reuse
llama_kv_cache: - layer  50: no reuse
llama_kv_cache: - layer  51: no reuse
llama_kv_cache: - layer  52: no reuse
llama_kv_cache: - layer  53: no reuse
llama_kv_cache: - layer  54: no reuse
llama_kv_cache: - layer  55: no reuse
llama_kv_cache: - layer  56: no reuse
llama_kv_cache: - layer  57: no reuse
llama_kv_cache: - layer  58: no reuse
llama_kv_cache: - layer  59: no reuse
llama_kv_cache:      CUDA0 KV buffer size =   620.00 MiB
llama_kv_cache:      CUDA1 KV buffer size =   380.00 MiB
llama_kv_cache: size = 1000.00 MiB (  1280 cells,  50 layers,  1/1 seqs), K (f16):  500.00 MiB, V (f16):  500.00 MiB
llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 256
llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 256
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 3
llama_context: pipeline parallelism enabled
sched_reserve: reserving ...
sched_reserve: max_nodes = 6680
sched_reserve: reserving full memory module
sched_reserve: worst-case: n_tokens = 256, n_seqs = 1, n_outputs = 1
sched_reserve: resolving fused Gated Delta Net support:
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
sched_reserve: fused Gated Delta Net (autoregressive) enabled
graph_reserve: reserving a graph for ubatch with n_tokens =   16, n_seqs =  1, n_outputs =   16
sched_reserve: fused Gated Delta Net (chunked) enabled
graph_reserve: reserving a graph for ubatch with n_tokens =  256, n_seqs =  1, n_outputs =  256
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
graph_reserve: reserving a graph for ubatch with n_tokens =  256, n_seqs =  1, n_outputs =  256
sched_reserve:      CUDA0 compute buffer size =   291.66 MiB
sched_reserve:      CUDA1 compute buffer size =   362.79 MiB
sched_reserve:  CUDA_Host compute buffer size =   171.54 MiB
sched_reserve: graph nodes  = 2462
sched_reserve: graph splits = 3
sched_reserve: reserve took 20.65 ms, sched copies = 4
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
set_warmup: value = 1
set_warmup: value = 0
clip_model_loader: model name:   Gemma 4 31B It
clip_model_loader: description:  
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    356
clip_model_loader: n_kv:         26

clip_model_loader: has vision encoder
clip_model_loader: tensor[0]: n_dims = 2, name = mm.input_projection.weight, tensor_size=12386304, offset=0, shape:[1152, 5376, 1, 1], type = bf16
clip_model_loader: tensor[1]: n_dims = 1, name = v.blk.0.ln1.weight, tensor_size=4608, offset=12386304, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[2]: n_dims = 2, name = v.blk.0.ffn_down.weight, tensor_size=9916416, offset=12390912, shape:[4304, 1152, 1, 1], type = bf16
clip_model_loader: tensor[3]: n_dims = 2, name = v.blk.0.ffn_gate.weight, tensor_size=9916416, offset=22307328, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[4]: n_dims = 2, name = v.blk.0.ffn_up.weight, tensor_size=9916416, offset=32223744, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[5]: n_dims = 1, name = v.blk.0.attn_post_norm.weight, tensor_size=4608, offset=42140160, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[6]: n_dims = 1, name = v.blk.0.ffn_post_norm.weight, tensor_size=4608, offset=42144768, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[7]: n_dims = 1, name = v.blk.0.ln2.weight, tensor_size=4608, offset=42149376, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[8]: n_dims = 1, name = v.blk.0.attn_k_norm.weight, tensor_size=288, offset=42153984, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[9]: n_dims = 2, name = v.blk.0.attn_k.weight, tensor_size=2654208, offset=42154272, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[10]: n_dims = 2, name = v.blk.0.attn_out.weight, tensor_size=2654208, offset=44808480, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[11]: n_dims = 1, name = v.blk.0.attn_q_norm.weight, tensor_size=288, offset=47462688, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[12]: n_dims = 2, name = v.blk.0.attn_q.weight, tensor_size=2654208, offset=47462976, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[13]: n_dims = 2, name = v.blk.0.attn_v.weight, tensor_size=2654208, offset=50117184, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[14]: n_dims = 1, name = v.blk.1.ln1.weight, tensor_size=4608, offset=52771392, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[15]: n_dims = 2, name = v.blk.1.ffn_down.weight, tensor_size=9916416, offset=52776000, shape:[4304, 1152, 1, 1], type = bf16
clip_model_loader: tensor[16]: n_dims = 2, name = v.blk.1.ffn_gate.weight, tensor_size=9916416, offset=62692416, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[17]: n_dims = 2, name = v.blk.1.ffn_up.weight, tensor_size=9916416, offset=72608832, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[18]: n_dims = 1, name = v.blk.1.attn_post_norm.weight, tensor_size=4608, offset=82525248, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[19]: n_dims = 1, name = v.blk.1.ffn_post_norm.weight, tensor_size=4608, offset=82529856, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[20]: n_dims = 1, name = v.blk.1.ln2.weight, tensor_size=4608, offset=82534464, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[21]: n_dims = 1, name = v.blk.1.attn_k_norm.weight, tensor_size=288, offset=82539072, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[22]: n_dims = 2, name = v.blk.1.attn_k.weight, tensor_size=2654208, offset=82539360, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[23]: n_dims = 2, name = v.blk.1.attn_out.weight, tensor_size=2654208, offset=85193568, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[24]: n_dims = 1, name = v.blk.1.attn_q_norm.weight, tensor_size=288, offset=87847776, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[25]: n_dims = 2, name = v.blk.1.attn_q.weight, tensor_size=2654208, offset=87848064, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[26]: n_dims = 2, name = v.blk.1.attn_v.weight, tensor_size=2654208, offset=90502272, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[27]: n_dims = 1, name = v.blk.10.ln1.weight, tensor_size=4608, offset=93156480, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[28]: n_dims = 2, name = v.blk.10.ffn_down.weight, tensor_size=9916416, offset=93161088, shape:[4304, 1152, 1, 1], type = bf16
clip_model_loader: tensor[29]: n_dims = 2, name = v.blk.10.ffn_gate.weight, tensor_size=9916416, offset=103077504, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[30]: n_dims = 2, name = v.blk.10.ffn_up.weight, tensor_size=9916416, offset=112993920, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[31]: n_dims = 1, name = v.blk.10.attn_post_norm.weight, tensor_size=4608, offset=122910336, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[32]: n_dims = 1, name = v.blk.10.ffn_post_norm.weight, tensor_size=4608, offset=122914944, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[33]: n_dims = 1, name = v.blk.10.ln2.weight, tensor_size=4608, offset=122919552, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[34]: n_dims = 1, name = v.blk.10.attn_k_norm.weight, tensor_size=288, offset=122924160, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[35]: n_dims = 2, name = v.blk.10.attn_k.weight, tensor_size=2654208, offset=122924448, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[36]: n_dims = 2, name = v.blk.10.attn_out.weight, tensor_size=2654208, offset=125578656, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[37]: n_dims = 1, name = v.blk.10.attn_q_norm.weight, tensor_size=288, offset=128232864, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[38]: n_dims = 2, name = v.blk.10.attn_q.weight, tensor_size=2654208, offset=128233152, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[39]: n_dims = 2, name = v.blk.10.attn_v.weight, tensor_size=2654208, offset=130887360, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[40]: n_dims = 1, name = v.blk.11.ln1.weight, tensor_size=4608, offset=133541568, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[41]: n_dims = 2, name = v.blk.11.ffn_down.weight, tensor_size=9916416, offset=133546176, shape:[4304, 1152, 1, 1], type = bf16
clip_model_loader: tensor[42]: n_dims = 2, name = v.blk.11.ffn_gate.weight, tensor_size=9916416, offset=143462592, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[43]: n_dims = 2, name = v.blk.11.ffn_up.weight, tensor_size=9916416, offset=153379008, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[44]: n_dims = 1, name = v.blk.11.attn_post_norm.weight, tensor_size=4608, offset=163295424, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[45]: n_dims = 1, name = v.blk.11.ffn_post_norm.weight, tensor_size=4608, offset=163300032, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[46]: n_dims = 1, name = v.blk.11.ln2.weight, tensor_size=4608, offset=163304640, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[47]: n_dims = 1, name = v.blk.11.attn_k_norm.weight, tensor_size=288, offset=163309248, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[48]: n_dims = 2, name = v.blk.11.attn_k.weight, tensor_size=2654208, offset=163309536, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[49]: n_dims = 2, name = v.blk.11.attn_out.weight, tensor_size=2654208, offset=165963744, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[50]: n_dims = 1, name = v.blk.11.attn_q_norm.weight, tensor_size=288, offset=168617952, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[51]: n_dims = 2, name = v.blk.11.attn_q.weight, tensor_size=2654208, offset=168618240, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[52]: n_dims = 2, name = v.blk.11.attn_v.weight, tensor_size=2654208, offset=171272448, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[53]: n_dims = 1, name = v.blk.12.ln1.weight, tensor_size=4608, offset=173926656, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[54]: n_dims = 2, name = v.blk.12.ffn_down.weight, tensor_size=9916416, offset=173931264, shape:[4304, 1152, 1, 1], type = bf16
clip_model_loader: tensor[55]: n_dims = 2, name = v.blk.12.ffn_gate.weight, tensor_size=9916416, offset=183847680, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[56]: n_dims = 2, name = v.blk.12.ffn_up.weight, tensor_size=9916416, offset=193764096, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[57]: n_dims = 1, name = v.blk.12.attn_post_norm.weight, tensor_size=4608, offset=203680512, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[58]: n_dims = 1, name = v.blk.12.ffn_post_norm.weight, tensor_size=4608, offset=203685120, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[59]: n_dims = 1, name = v.blk.12.ln2.weight, tensor_size=4608, offset=203689728, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[60]: n_dims = 1, name = v.blk.12.attn_k_norm.weight, tensor_size=288, offset=203694336, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[61]: n_dims = 2, name = v.blk.12.attn_k.weight, tensor_size=2654208, offset=203694624, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[62]: n_dims = 2, name = v.blk.12.attn_out.weight, tensor_size=2654208, offset=206348832, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[63]: n_dims = 1, name = v.blk.12.attn_q_norm.weight, tensor_size=288, offset=209003040, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[64]: n_dims = 2, name = v.blk.12.attn_q.weight, tensor_size=2654208, offset=209003328, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[65]: n_dims = 2, name = v.blk.12.attn_v.weight, tensor_size=2654208, offset=211657536, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[66]: n_dims = 1, name = v.blk.13.ln1.weight, tensor_size=4608, offset=214311744, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[67]: n_dims = 2, name = v.blk.13.ffn_down.weight, tensor_size=9916416, offset=214316352, shape:[4304, 1152, 1, 1], type = bf16
clip_model_loader: tensor[68]: n_dims = 2, name = v.blk.13.ffn_gate.weight, tensor_size=9916416, offset=224232768, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[69]: n_dims = 2, name = v.blk.13.ffn_up.weight, tensor_size=9916416, offset=234149184, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[70]: n_dims = 1, name = v.blk.13.attn_post_norm.weight, tensor_size=4608, offset=244065600, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[71]: n_dims = 1, name = v.blk.13.ffn_post_norm.weight, tensor_size=4608, offset=244070208, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[72]: n_dims = 1, name = v.blk.13.ln2.weight, tensor_size=4608, offset=244074816, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[73]: n_dims = 1, name = v.blk.13.attn_k_norm.weight, tensor_size=288, offset=244079424, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[74]: n_dims = 2, name = v.blk.13.attn_k.weight, tensor_size=2654208, offset=244079712, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[75]: n_dims = 2, name = v.blk.13.attn_out.weight, tensor_size=2654208, offset=246733920, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[76]: n_dims = 1, name = v.blk.13.attn_q_norm.weight, tensor_size=288, offset=249388128, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[77]: n_dims = 2, name = v.blk.13.attn_q.weight, tensor_size=2654208, offset=249388416, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[78]: n_dims = 2, name = v.blk.13.attn_v.weight, tensor_size=2654208, offset=252042624, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[79]: n_dims = 1, name = v.blk.14.ln1.weight, tensor_size=4608, offset=254696832, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[80]: n_dims = 2, name = v.blk.14.ffn_down.weight, tensor_size=9916416, offset=254701440, shape:[4304, 1152, 1, 1], type = bf16
clip_model_loader: tensor[81]: n_dims = 2, name = v.blk.14.ffn_gate.weight, tensor_size=9916416, offset=264617856, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[82]: n_dims = 2, name = v.blk.14.ffn_up.weight, tensor_size=9916416, offset=274534272, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[83]: n_dims = 1, name = v.blk.14.attn_post_norm.weight, tensor_size=4608, offset=284450688, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[84]: n_dims = 1, name = v.blk.14.ffn_post_norm.weight, tensor_size=4608, offset=284455296, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[85]: n_dims = 1, name = v.blk.14.ln2.weight, tensor_size=4608, offset=284459904, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[86]: n_dims = 1, name = v.blk.14.attn_k_norm.weight, tensor_size=288, offset=284464512, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[87]: n_dims = 2, name = v.blk.14.attn_k.weight, tensor_size=2654208, offset=284464800, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[88]: n_dims = 2, name = v.blk.14.attn_out.weight, tensor_size=2654208, offset=287119008, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[89]: n_dims = 1, name = v.blk.14.attn_q_norm.weight, tensor_size=288, offset=289773216, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[90]: n_dims = 2, name = v.blk.14.attn_q.weight, tensor_size=2654208, offset=289773504, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[91]: n_dims = 2, name = v.blk.14.attn_v.weight, tensor_size=2654208, offset=292427712, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[92]: n_dims = 1, name = v.blk.15.ln1.weight, tensor_size=4608, offset=295081920, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[93]: n_dims = 2, name = v.blk.15.ffn_down.weight, tensor_size=9916416, offset=295086528, shape:[4304, 1152, 1, 1], type = bf16
clip_model_loader: tensor[94]: n_dims = 2, name = v.blk.15.ffn_gate.weight, tensor_size=9916416, offset=305002944, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[95]: n_dims = 2, name = v.blk.15.ffn_up.weight, tensor_size=9916416, offset=314919360, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[96]: n_dims = 1, name = v.blk.15.attn_post_norm.weight, tensor_size=4608, offset=324835776, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[97]: n_dims = 1, name = v.blk.15.ffn_post_norm.weight, tensor_size=4608, offset=324840384, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[98]: n_dims = 1, name = v.blk.15.ln2.weight, tensor_size=4608, offset=324844992, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[99]: n_dims = 1, name = v.blk.15.attn_k_norm.weight, tensor_size=288, offset=324849600, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[100]: n_dims = 2, name = v.blk.15.attn_k.weight, tensor_size=2654208, offset=324849888, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[101]: n_dims = 2, name = v.blk.15.attn_out.weight, tensor_size=2654208, offset=327504096, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[102]: n_dims = 1, name = v.blk.15.attn_q_norm.weight, tensor_size=288, offset=330158304, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[103]: n_dims = 2, name = v.blk.15.attn_q.weight, tensor_size=2654208, offset=330158592, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[104]: n_dims = 2, name = v.blk.15.attn_v.weight, tensor_size=2654208, offset=332812800, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[105]: n_dims = 1, name = v.blk.16.ln1.weight, tensor_size=4608, offset=335467008, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[106]: n_dims = 2, name = v.blk.16.ffn_down.weight, tensor_size=9916416, offset=335471616, shape:[4304, 1152, 1, 1], type = bf16
clip_model_loader: tensor[107]: n_dims = 2, name = v.blk.16.ffn_gate.weight, tensor_size=9916416, offset=345388032, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[108]: n_dims = 2, name = v.blk.16.ffn_up.weight, tensor_size=9916416, offset=355304448, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[109]: n_dims = 1, name = v.blk.16.attn_post_norm.weight, tensor_size=4608, offset=365220864, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[110]: n_dims = 1, name = v.blk.16.ffn_post_norm.weight, tensor_size=4608, offset=365225472, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[111]: n_dims = 1, name = v.blk.16.ln2.weight, tensor_size=4608, offset=365230080, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[112]: n_dims = 1, name = v.blk.16.attn_k_norm.weight, tensor_size=288, offset=365234688, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[113]: n_dims = 2, name = v.blk.16.attn_k.weight, tensor_size=2654208, offset=365234976, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[114]: n_dims = 2, name = v.blk.16.attn_out.weight, tensor_size=2654208, offset=367889184, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[115]: n_dims = 1, name = v.blk.16.attn_q_norm.weight, tensor_size=288, offset=370543392, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[116]: n_dims = 2, name = v.blk.16.attn_q.weight, tensor_size=2654208, offset=370543680, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[117]: n_dims = 2, name = v.blk.16.attn_v.weight, tensor_size=2654208, offset=373197888, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[118]: n_dims = 1, name = v.blk.17.ln1.weight, tensor_size=4608, offset=375852096, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[119]: n_dims = 2, name = v.blk.17.ffn_down.weight, tensor_size=9916416, offset=375856704, shape:[4304, 1152, 1, 1], type = bf16
clip_model_loader: tensor[120]: n_dims = 2, name = v.blk.17.ffn_gate.weight, tensor_size=9916416, offset=385773120, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[121]: n_dims = 2, name = v.blk.17.ffn_up.weight, tensor_size=9916416, offset=395689536, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[122]: n_dims = 1, name = v.blk.17.attn_post_norm.weight, tensor_size=4608, offset=405605952, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[123]: n_dims = 1, name = v.blk.17.ffn_post_norm.weight, tensor_size=4608, offset=405610560, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[124]: n_dims = 1, name = v.blk.17.ln2.weight, tensor_size=4608, offset=405615168, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[125]: n_dims = 1, name = v.blk.17.attn_k_norm.weight, tensor_size=288, offset=405619776, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[126]: n_dims = 2, name = v.blk.17.attn_k.weight, tensor_size=2654208, offset=405620064, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[127]: n_dims = 2, name = v.blk.17.attn_out.weight, tensor_size=2654208, offset=408274272, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[128]: n_dims = 1, name = v.blk.17.attn_q_norm.weight, tensor_size=288, offset=410928480, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[129]: n_dims = 2, name = v.blk.17.attn_q.weight, tensor_size=2654208, offset=410928768, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[130]: n_dims = 2, name = v.blk.17.attn_v.weight, tensor_size=2654208, offset=413582976, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[131]: n_dims = 1, name = v.blk.18.ln1.weight, tensor_size=4608, offset=416237184, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[132]: n_dims = 2, name = v.blk.18.ffn_down.weight, tensor_size=9916416, offset=416241792, shape:[4304, 1152, 1, 1], type = bf16
clip_model_loader: tensor[133]: n_dims = 2, name = v.blk.18.ffn_gate.weight, tensor_size=9916416, offset=426158208, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[134]: n_dims = 2, name = v.blk.18.ffn_up.weight, tensor_size=9916416, offset=436074624, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[135]: n_dims = 1, name = v.blk.18.attn_post_norm.weight, tensor_size=4608, offset=445991040, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[136]: n_dims = 1, name = v.blk.18.ffn_post_norm.weight, tensor_size=4608, offset=445995648, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[137]: n_dims = 1, name = v.blk.18.ln2.weight, tensor_size=4608, offset=446000256, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[138]: n_dims = 1, name = v.blk.18.attn_k_norm.weight, tensor_size=288, offset=446004864, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[139]: n_dims = 2, name = v.blk.18.attn_k.weight, tensor_size=2654208, offset=446005152, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[140]: n_dims = 2, name = v.blk.18.attn_out.weight, tensor_size=2654208, offset=448659360, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[141]: n_dims = 1, name = v.blk.18.attn_q_norm.weight, tensor_size=288, offset=451313568, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[142]: n_dims = 2, name = v.blk.18.attn_q.weight, tensor_size=2654208, offset=451313856, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[143]: n_dims = 2, name = v.blk.18.attn_v.weight, tensor_size=2654208, offset=453968064, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[144]: n_dims = 1, name = v.blk.19.ln1.weight, tensor_size=4608, offset=456622272, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[145]: n_dims = 2, name = v.blk.19.ffn_down.weight, tensor_size=9916416, offset=456626880, shape:[4304, 1152, 1, 1], type = bf16
clip_model_loader: tensor[146]: n_dims = 2, name = v.blk.19.ffn_gate.weight, tensor_size=9916416, offset=466543296, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[147]: n_dims = 2, name = v.blk.19.ffn_up.weight, tensor_size=9916416, offset=476459712, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[148]: n_dims = 1, name = v.blk.19.attn_post_norm.weight, tensor_size=4608, offset=486376128, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[149]: n_dims = 1, name = v.blk.19.ffn_post_norm.weight, tensor_size=4608, offset=486380736, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[150]: n_dims = 1, name = v.blk.19.ln2.weight, tensor_size=4608, offset=486385344, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[151]: n_dims = 1, name = v.blk.19.attn_k_norm.weight, tensor_size=288, offset=486389952, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[152]: n_dims = 2, name = v.blk.19.attn_k.weight, tensor_size=2654208, offset=486390240, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[153]: n_dims = 2, name = v.blk.19.attn_out.weight, tensor_size=2654208, offset=489044448, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[154]: n_dims = 1, name = v.blk.19.attn_q_norm.weight, tensor_size=288, offset=491698656, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[155]: n_dims = 2, name = v.blk.19.attn_q.weight, tensor_size=2654208, offset=491698944, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[156]: n_dims = 2, name = v.blk.19.attn_v.weight, tensor_size=2654208, offset=494353152, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[157]: n_dims = 1, name = v.blk.2.ln1.weight, tensor_size=4608, offset=497007360, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[158]: n_dims = 2, name = v.blk.2.ffn_down.weight, tensor_size=9916416, offset=497011968, shape:[4304, 1152, 1, 1], type = bf16
clip_model_loader: tensor[159]: n_dims = 2, name = v.blk.2.ffn_gate.weight, tensor_size=9916416, offset=506928384, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[160]: n_dims = 2, name = v.blk.2.ffn_up.weight, tensor_size=9916416, offset=516844800, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[161]: n_dims = 1, name = v.blk.2.attn_post_norm.weight, tensor_size=4608, offset=526761216, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[162]: n_dims = 1, name = v.blk.2.ffn_post_norm.weight, tensor_size=4608, offset=526765824, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[163]: n_dims = 1, name = v.blk.2.ln2.weight, tensor_size=4608, offset=526770432, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[164]: n_dims = 1, name = v.blk.2.attn_k_norm.weight, tensor_size=288, offset=526775040, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[165]: n_dims = 2, name = v.blk.2.attn_k.weight, tensor_size=2654208, offset=526775328, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[166]: n_dims = 2, name = v.blk.2.attn_out.weight, tensor_size=2654208, offset=529429536, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[167]: n_dims = 1, name = v.blk.2.attn_q_norm.weight, tensor_size=288, offset=532083744, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[168]: n_dims = 2, name = v.blk.2.attn_q.weight, tensor_size=2654208, offset=532084032, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[169]: n_dims = 2, name = v.blk.2.attn_v.weight, tensor_size=2654208, offset=534738240, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[170]: n_dims = 1, name = v.blk.20.ln1.weight, tensor_size=4608, offset=537392448, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[171]: n_dims = 2, name = v.blk.20.ffn_down.weight, tensor_size=9916416, offset=537397056, shape:[4304, 1152, 1, 1], type = bf16
clip_model_loader: tensor[172]: n_dims = 2, name = v.blk.20.ffn_gate.weight, tensor_size=9916416, offset=547313472, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[173]: n_dims = 2, name = v.blk.20.ffn_up.weight, tensor_size=9916416, offset=557229888, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[174]: n_dims = 1, name = v.blk.20.attn_post_norm.weight, tensor_size=4608, offset=567146304, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[175]: n_dims = 1, name = v.blk.20.ffn_post_norm.weight, tensor_size=4608, offset=567150912, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[176]: n_dims = 1, name = v.blk.20.ln2.weight, tensor_size=4608, offset=567155520, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[177]: n_dims = 1, name = v.blk.20.attn_k_norm.weight, tensor_size=288, offset=567160128, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[178]: n_dims = 2, name = v.blk.20.attn_k.weight, tensor_size=2654208, offset=567160416, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[179]: n_dims = 2, name = v.blk.20.attn_out.weight, tensor_size=2654208, offset=569814624, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[180]: n_dims = 1, name = v.blk.20.attn_q_norm.weight, tensor_size=288, offset=572468832, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[181]: n_dims = 2, name = v.blk.20.attn_q.weight, tensor_size=2654208, offset=572469120, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[182]: n_dims = 2, name = v.blk.20.attn_v.weight, tensor_size=2654208, offset=575123328, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[183]: n_dims = 1, name = v.blk.21.ln1.weight, tensor_size=4608, offset=577777536, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[184]: n_dims = 2, name = v.blk.21.ffn_down.weight, tensor_size=9916416, offset=577782144, shape:[4304, 1152, 1, 1], type = bf16
clip_model_loader: tensor[185]: n_dims = 2, name = v.blk.21.ffn_gate.weight, tensor_size=9916416, offset=587698560, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[186]: n_dims = 2, name = v.blk.21.ffn_up.weight, tensor_size=9916416, offset=597614976, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[187]: n_dims = 1, name = v.blk.21.attn_post_norm.weight, tensor_size=4608, offset=607531392, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[188]: n_dims = 1, name = v.blk.21.ffn_post_norm.weight, tensor_size=4608, offset=607536000, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[189]: n_dims = 1, name = v.blk.21.ln2.weight, tensor_size=4608, offset=607540608, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[190]: n_dims = 1, name = v.blk.21.attn_k_norm.weight, tensor_size=288, offset=607545216, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[191]: n_dims = 2, name = v.blk.21.attn_k.weight, tensor_size=2654208, offset=607545504, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[192]: n_dims = 2, name = v.blk.21.attn_out.weight, tensor_size=2654208, offset=610199712, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[193]: n_dims = 1, name = v.blk.21.attn_q_norm.weight, tensor_size=288, offset=612853920, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[194]: n_dims = 2, name = v.blk.21.attn_q.weight, tensor_size=2654208, offset=612854208, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[195]: n_dims = 2, name = v.blk.21.attn_v.weight, tensor_size=2654208, offset=615508416, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[196]: n_dims = 1, name = v.blk.22.ln1.weight, tensor_size=4608, offset=618162624, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[197]: n_dims = 2, name = v.blk.22.ffn_down.weight, tensor_size=9916416, offset=618167232, shape:[4304, 1152, 1, 1], type = bf16
clip_model_loader: tensor[198]: n_dims = 2, name = v.blk.22.ffn_gate.weight, tensor_size=9916416, offset=628083648, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[199]: n_dims = 2, name = v.blk.22.ffn_up.weight, tensor_size=9916416, offset=638000064, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[200]: n_dims = 1, name = v.blk.22.attn_post_norm.weight, tensor_size=4608, offset=647916480, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[201]: n_dims = 1, name = v.blk.22.ffn_post_norm.weight, tensor_size=4608, offset=647921088, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[202]: n_dims = 1, name = v.blk.22.ln2.weight, tensor_size=4608, offset=647925696, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[203]: n_dims = 1, name = v.blk.22.attn_k_norm.weight, tensor_size=288, offset=647930304, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[204]: n_dims = 2, name = v.blk.22.attn_k.weight, tensor_size=2654208, offset=647930592, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[205]: n_dims = 2, name = v.blk.22.attn_out.weight, tensor_size=2654208, offset=650584800, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[206]: n_dims = 1, name = v.blk.22.attn_q_norm.weight, tensor_size=288, offset=653239008, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[207]: n_dims = 2, name = v.blk.22.attn_q.weight, tensor_size=2654208, offset=653239296, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[208]: n_dims = 2, name = v.blk.22.attn_v.weight, tensor_size=2654208, offset=655893504, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[209]: n_dims = 1, name = v.blk.23.ln1.weight, tensor_size=4608, offset=658547712, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[210]: n_dims = 2, name = v.blk.23.ffn_down.weight, tensor_size=9916416, offset=658552320, shape:[4304, 1152, 1, 1], type = bf16
clip_model_loader: tensor[211]: n_dims = 2, name = v.blk.23.ffn_gate.weight, tensor_size=9916416, offset=668468736, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[212]: n_dims = 2, name = v.blk.23.ffn_up.weight, tensor_size=9916416, offset=678385152, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[213]: n_dims = 1, name = v.blk.23.attn_post_norm.weight, tensor_size=4608, offset=688301568, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[214]: n_dims = 1, name = v.blk.23.ffn_post_norm.weight, tensor_size=4608, offset=688306176, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[215]: n_dims = 1, name = v.blk.23.ln2.weight, tensor_size=4608, offset=688310784, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[216]: n_dims = 1, name = v.blk.23.attn_k_norm.weight, tensor_size=288, offset=688315392, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[217]: n_dims = 2, name = v.blk.23.attn_k.weight, tensor_size=2654208, offset=688315680, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[218]: n_dims = 2, name = v.blk.23.attn_out.weight, tensor_size=2654208, offset=690969888, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[219]: n_dims = 1, name = v.blk.23.attn_q_norm.weight, tensor_size=288, offset=693624096, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[220]: n_dims = 2, name = v.blk.23.attn_q.weight, tensor_size=2654208, offset=693624384, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[221]: n_dims = 2, name = v.blk.23.attn_v.weight, tensor_size=2654208, offset=696278592, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[222]: n_dims = 1, name = v.blk.24.ln1.weight, tensor_size=4608, offset=698932800, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[223]: n_dims = 2, name = v.blk.24.ffn_down.weight, tensor_size=9916416, offset=698937408, shape:[4304, 1152, 1, 1], type = bf16
clip_model_loader: tensor[224]: n_dims = 2, name = v.blk.24.ffn_gate.weight, tensor_size=9916416, offset=708853824, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[225]: n_dims = 2, name = v.blk.24.ffn_up.weight, tensor_size=9916416, offset=718770240, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[226]: n_dims = 1, name = v.blk.24.attn_post_norm.weight, tensor_size=4608, offset=728686656, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[227]: n_dims = 1, name = v.blk.24.ffn_post_norm.weight, tensor_size=4608, offset=728691264, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[228]: n_dims = 1, name = v.blk.24.ln2.weight, tensor_size=4608, offset=728695872, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[229]: n_dims = 1, name = v.blk.24.attn_k_norm.weight, tensor_size=288, offset=728700480, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[230]: n_dims = 2, name = v.blk.24.attn_k.weight, tensor_size=2654208, offset=728700768, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[231]: n_dims = 2, name = v.blk.24.attn_out.weight, tensor_size=2654208, offset=731354976, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[232]: n_dims = 1, name = v.blk.24.attn_q_norm.weight, tensor_size=288, offset=734009184, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[233]: n_dims = 2, name = v.blk.24.attn_q.weight, tensor_size=2654208, offset=734009472, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[234]: n_dims = 2, name = v.blk.24.attn_v.weight, tensor_size=2654208, offset=736663680, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[235]: n_dims = 1, name = v.blk.25.ln1.weight, tensor_size=4608, offset=739317888, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[236]: n_dims = 2, name = v.blk.25.ffn_down.weight, tensor_size=9916416, offset=739322496, shape:[4304, 1152, 1, 1], type = bf16
clip_model_loader: tensor[237]: n_dims = 2, name = v.blk.25.ffn_gate.weight, tensor_size=9916416, offset=749238912, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[238]: n_dims = 2, name = v.blk.25.ffn_up.weight, tensor_size=9916416, offset=759155328, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[239]: n_dims = 1, name = v.blk.25.attn_post_norm.weight, tensor_size=4608, offset=769071744, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[240]: n_dims = 1, name = v.blk.25.ffn_post_norm.weight, tensor_size=4608, offset=769076352, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[241]: n_dims = 1, name = v.blk.25.ln2.weight, tensor_size=4608, offset=769080960, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[242]: n_dims = 1, name = v.blk.25.attn_k_norm.weight, tensor_size=288, offset=769085568, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[243]: n_dims = 2, name = v.blk.25.attn_k.weight, tensor_size=2654208, offset=769085856, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[244]: n_dims = 2, name = v.blk.25.attn_out.weight, tensor_size=2654208, offset=771740064, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[245]: n_dims = 1, name = v.blk.25.attn_q_norm.weight, tensor_size=288, offset=774394272, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[246]: n_dims = 2, name = v.blk.25.attn_q.weight, tensor_size=2654208, offset=774394560, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[247]: n_dims = 2, name = v.blk.25.attn_v.weight, tensor_size=2654208, offset=777048768, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[248]: n_dims = 1, name = v.blk.26.ln1.weight, tensor_size=4608, offset=779702976, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[249]: n_dims = 2, name = v.blk.26.ffn_down.weight, tensor_size=9916416, offset=779707584, shape:[4304, 1152, 1, 1], type = bf16
clip_model_loader: tensor[250]: n_dims = 2, name = v.blk.26.ffn_gate.weight, tensor_size=9916416, offset=789624000, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[251]: n_dims = 2, name = v.blk.26.ffn_up.weight, tensor_size=9916416, offset=799540416, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[252]: n_dims = 1, name = v.blk.26.attn_post_norm.weight, tensor_size=4608, offset=809456832, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[253]: n_dims = 1, name = v.blk.26.ffn_post_norm.weight, tensor_size=4608, offset=809461440, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[254]: n_dims = 1, name = v.blk.26.ln2.weight, tensor_size=4608, offset=809466048, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[255]: n_dims = 1, name = v.blk.26.attn_k_norm.weight, tensor_size=288, offset=809470656, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[256]: n_dims = 2, name = v.blk.26.attn_k.weight, tensor_size=2654208, offset=809470944, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[257]: n_dims = 2, name = v.blk.26.attn_out.weight, tensor_size=2654208, offset=812125152, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[258]: n_dims = 1, name = v.blk.26.attn_q_norm.weight, tensor_size=288, offset=814779360, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[259]: n_dims = 2, name = v.blk.26.attn_q.weight, tensor_size=2654208, offset=814779648, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[260]: n_dims = 2, name = v.blk.26.attn_v.weight, tensor_size=2654208, offset=817433856, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[261]: n_dims = 1, name = v.blk.3.ln1.weight, tensor_size=4608, offset=820088064, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[262]: n_dims = 2, name = v.blk.3.ffn_down.weight, tensor_size=9916416, offset=820092672, shape:[4304, 1152, 1, 1], type = bf16
clip_model_loader: tensor[263]: n_dims = 2, name = v.blk.3.ffn_gate.weight, tensor_size=9916416, offset=830009088, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[264]: n_dims = 2, name = v.blk.3.ffn_up.weight, tensor_size=9916416, offset=839925504, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[265]: n_dims = 1, name = v.blk.3.attn_post_norm.weight, tensor_size=4608, offset=849841920, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[266]: n_dims = 1, name = v.blk.3.ffn_post_norm.weight, tensor_size=4608, offset=849846528, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[267]: n_dims = 1, name = v.blk.3.ln2.weight, tensor_size=4608, offset=849851136, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[268]: n_dims = 1, name = v.blk.3.attn_k_norm.weight, tensor_size=288, offset=849855744, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[269]: n_dims = 2, name = v.blk.3.attn_k.weight, tensor_size=2654208, offset=849856032, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[270]: n_dims = 2, name = v.blk.3.attn_out.weight, tensor_size=2654208, offset=852510240, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[271]: n_dims = 1, name = v.blk.3.attn_q_norm.weight, tensor_size=288, offset=855164448, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[272]: n_dims = 2, name = v.blk.3.attn_q.weight, tensor_size=2654208, offset=855164736, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[273]: n_dims = 2, name = v.blk.3.attn_v.weight, tensor_size=2654208, offset=857818944, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[274]: n_dims = 1, name = v.blk.4.ln1.weight, tensor_size=4608, offset=860473152, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[275]: n_dims = 2, name = v.blk.4.ffn_down.weight, tensor_size=9916416, offset=860477760, shape:[4304, 1152, 1, 1], type = bf16
clip_model_loader: tensor[276]: n_dims = 2, name = v.blk.4.ffn_gate.weight, tensor_size=9916416, offset=870394176, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[277]: n_dims = 2, name = v.blk.4.ffn_up.weight, tensor_size=9916416, offset=880310592, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[278]: n_dims = 1, name = v.blk.4.attn_post_norm.weight, tensor_size=4608, offset=890227008, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[279]: n_dims = 1, name = v.blk.4.ffn_post_norm.weight, tensor_size=4608, offset=890231616, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[280]: n_dims = 1, name = v.blk.4.ln2.weight, tensor_size=4608, offset=890236224, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[281]: n_dims = 1, name = v.blk.4.attn_k_norm.weight, tensor_size=288, offset=890240832, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[282]: n_dims = 2, name = v.blk.4.attn_k.weight, tensor_size=2654208, offset=890241120, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[283]: n_dims = 2, name = v.blk.4.attn_out.weight, tensor_size=2654208, offset=892895328, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[284]: n_dims = 1, name = v.blk.4.attn_q_norm.weight, tensor_size=288, offset=895549536, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[285]: n_dims = 2, name = v.blk.4.attn_q.weight, tensor_size=2654208, offset=895549824, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[286]: n_dims = 2, name = v.blk.4.attn_v.weight, tensor_size=2654208, offset=898204032, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[287]: n_dims = 1, name = v.blk.5.ln1.weight, tensor_size=4608, offset=900858240, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[288]: n_dims = 2, name = v.blk.5.ffn_down.weight, tensor_size=9916416, offset=900862848, shape:[4304, 1152, 1, 1], type = bf16
clip_model_loader: tensor[289]: n_dims = 2, name = v.blk.5.ffn_gate.weight, tensor_size=9916416, offset=910779264, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[290]: n_dims = 2, name = v.blk.5.ffn_up.weight, tensor_size=9916416, offset=920695680, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[291]: n_dims = 1, name = v.blk.5.attn_post_norm.weight, tensor_size=4608, offset=930612096, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[292]: n_dims = 1, name = v.blk.5.ffn_post_norm.weight, tensor_size=4608, offset=930616704, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[293]: n_dims = 1, name = v.blk.5.ln2.weight, tensor_size=4608, offset=930621312, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[294]: n_dims = 1, name = v.blk.5.attn_k_norm.weight, tensor_size=288, offset=930625920, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[295]: n_dims = 2, name = v.blk.5.attn_k.weight, tensor_size=2654208, offset=930626208, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[296]: n_dims = 2, name = v.blk.5.attn_out.weight, tensor_size=2654208, offset=933280416, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[297]: n_dims = 1, name = v.blk.5.attn_q_norm.weight, tensor_size=288, offset=935934624, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[298]: n_dims = 2, name = v.blk.5.attn_q.weight, tensor_size=2654208, offset=935934912, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[299]: n_dims = 2, name = v.blk.5.attn_v.weight, tensor_size=2654208, offset=938589120, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[300]: n_dims = 1, name = v.blk.6.ln1.weight, tensor_size=4608, offset=941243328, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[301]: n_dims = 2, name = v.blk.6.ffn_down.weight, tensor_size=9916416, offset=941247936, shape:[4304, 1152, 1, 1], type = bf16
clip_model_loader: tensor[302]: n_dims = 2, name = v.blk.6.ffn_gate.weight, tensor_size=9916416, offset=951164352, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[303]: n_dims = 2, name = v.blk.6.ffn_up.weight, tensor_size=9916416, offset=961080768, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[304]: n_dims = 1, name = v.blk.6.attn_post_norm.weight, tensor_size=4608, offset=970997184, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[305]: n_dims = 1, name = v.blk.6.ffn_post_norm.weight, tensor_size=4608, offset=971001792, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[306]: n_dims = 1, name = v.blk.6.ln2.weight, tensor_size=4608, offset=971006400, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[307]: n_dims = 1, name = v.blk.6.attn_k_norm.weight, tensor_size=288, offset=971011008, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[308]: n_dims = 2, name = v.blk.6.attn_k.weight, tensor_size=2654208, offset=971011296, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[309]: n_dims = 2, name = v.blk.6.attn_out.weight, tensor_size=2654208, offset=973665504, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[310]: n_dims = 1, name = v.blk.6.attn_q_norm.weight, tensor_size=288, offset=976319712, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[311]: n_dims = 2, name = v.blk.6.attn_q.weight, tensor_size=2654208, offset=976320000, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[312]: n_dims = 2, name = v.blk.6.attn_v.weight, tensor_size=2654208, offset=978974208, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[313]: n_dims = 1, name = v.blk.7.ln1.weight, tensor_size=4608, offset=981628416, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[314]: n_dims = 2, name = v.blk.7.ffn_down.weight, tensor_size=9916416, offset=981633024, shape:[4304, 1152, 1, 1], type = bf16
clip_model_loader: tensor[315]: n_dims = 2, name = v.blk.7.ffn_gate.weight, tensor_size=9916416, offset=991549440, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[316]: n_dims = 2, name = v.blk.7.ffn_up.weight, tensor_size=9916416, offset=1001465856, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[317]: n_dims = 1, name = v.blk.7.attn_post_norm.weight, tensor_size=4608, offset=1011382272, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[318]: n_dims = 1, name = v.blk.7.ffn_post_norm.weight, tensor_size=4608, offset=1011386880, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[319]: n_dims = 1, name = v.blk.7.ln2.weight, tensor_size=4608, offset=1011391488, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[320]: n_dims = 1, name = v.blk.7.attn_k_norm.weight, tensor_size=288, offset=1011396096, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[321]: n_dims = 2, name = v.blk.7.attn_k.weight, tensor_size=2654208, offset=1011396384, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[322]: n_dims = 2, name = v.blk.7.attn_out.weight, tensor_size=2654208, offset=1014050592, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[323]: n_dims = 1, name = v.blk.7.attn_q_norm.weight, tensor_size=288, offset=1016704800, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[324]: n_dims = 2, name = v.blk.7.attn_q.weight, tensor_size=2654208, offset=1016705088, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[325]: n_dims = 2, name = v.blk.7.attn_v.weight, tensor_size=2654208, offset=1019359296, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[326]: n_dims = 1, name = v.blk.8.ln1.weight, tensor_size=4608, offset=1022013504, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[327]: n_dims = 2, name = v.blk.8.ffn_down.weight, tensor_size=9916416, offset=1022018112, shape:[4304, 1152, 1, 1], type = bf16
clip_model_loader: tensor[328]: n_dims = 2, name = v.blk.8.ffn_gate.weight, tensor_size=9916416, offset=1031934528, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[329]: n_dims = 2, name = v.blk.8.ffn_up.weight, tensor_size=9916416, offset=1041850944, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[330]: n_dims = 1, name = v.blk.8.attn_post_norm.weight, tensor_size=4608, offset=1051767360, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[331]: n_dims = 1, name = v.blk.8.ffn_post_norm.weight, tensor_size=4608, offset=1051771968, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[332]: n_dims = 1, name = v.blk.8.ln2.weight, tensor_size=4608, offset=1051776576, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[333]: n_dims = 1, name = v.blk.8.attn_k_norm.weight, tensor_size=288, offset=1051781184, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[334]: n_dims = 2, name = v.blk.8.attn_k.weight, tensor_size=2654208, offset=1051781472, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[335]: n_dims = 2, name = v.blk.8.attn_out.weight, tensor_size=2654208, offset=1054435680, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[336]: n_dims = 1, name = v.blk.8.attn_q_norm.weight, tensor_size=288, offset=1057089888, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[337]: n_dims = 2, name = v.blk.8.attn_q.weight, tensor_size=2654208, offset=1057090176, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[338]: n_dims = 2, name = v.blk.8.attn_v.weight, tensor_size=2654208, offset=1059744384, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[339]: n_dims = 1, name = v.blk.9.ln1.weight, tensor_size=4608, offset=1062398592, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[340]: n_dims = 2, name = v.blk.9.ffn_down.weight, tensor_size=9916416, offset=1062403200, shape:[4304, 1152, 1, 1], type = bf16
clip_model_loader: tensor[341]: n_dims = 2, name = v.blk.9.ffn_gate.weight, tensor_size=9916416, offset=1072319616, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[342]: n_dims = 2, name = v.blk.9.ffn_up.weight, tensor_size=9916416, offset=1082236032, shape:[1152, 4304, 1, 1], type = bf16
clip_model_loader: tensor[343]: n_dims = 1, name = v.blk.9.attn_post_norm.weight, tensor_size=4608, offset=1092152448, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[344]: n_dims = 1, name = v.blk.9.ffn_post_norm.weight, tensor_size=4608, offset=1092157056, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[345]: n_dims = 1, name = v.blk.9.ln2.weight, tensor_size=4608, offset=1092161664, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[346]: n_dims = 1, name = v.blk.9.attn_k_norm.weight, tensor_size=288, offset=1092166272, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[347]: n_dims = 2, name = v.blk.9.attn_k.weight, tensor_size=2654208, offset=1092166560, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[348]: n_dims = 2, name = v.blk.9.attn_out.weight, tensor_size=2654208, offset=1094820768, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[349]: n_dims = 1, name = v.blk.9.attn_q_norm.weight, tensor_size=288, offset=1097474976, shape:[72, 1, 1, 1], type = f32
clip_model_loader: tensor[350]: n_dims = 2, name = v.blk.9.attn_q.weight, tensor_size=2654208, offset=1097475264, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[351]: n_dims = 2, name = v.blk.9.attn_v.weight, tensor_size=2654208, offset=1100129472, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[352]: n_dims = 4, name = v.patch_embd.weight, tensor_size=3538944, offset=1102783680, shape:[16, 16, 3, 1152], type = f32
clip_model_loader: tensor[353]: n_dims = 3, name = v.position_embd.weight, tensor_size=94371840, offset=1106322624, shape:[1152, 10240, 2, 1], type = f32
clip_model_loader: tensor[354]: n_dims = 1, name = v.std_bias, tensor_size=4608, offset=1200694464, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[355]: n_dims = 1, name = v.std_scale, tensor_size=4608, offset=1200699072, shape:[1152, 1, 1, 1], type = f32
clip_ctx: CLIP using CUDA0 backend
load_hparams: projector:          gemma4v
load_hparams: n_embd:             1152
load_hparams: n_head:             16
load_hparams: n_ff:               4304
load_hparams: n_layer:            27
load_hparams: ffn_op:             gelu_quick
load_hparams: projection_dim:     5376

--- vision hparams ---
load_hparams: image_size:         224
load_hparams: patch_size:         16
load_hparams: has_llava_proj:     0
load_hparams: minicpmv_version:   0
load_hparams: n_merge:            3
load_hparams: n_wa_pattern: 0
load_hparams: image_min_pixels:   1152000 (custom value)
load_hparams: image_max_pixels:   1152000 (custom value)

load_hparams: model size:         1145.08 MiB
load_hparams: metadata size:      0.12 MiB
load_tensors: loaded 356 tensors from /mnt/Speed/AI/Models/bartowski/google_gemma-4-31B-it-GGUF/mmproj-google_gemma-4-31B-it-bf16.gguf
warmup: warmup with image size = 768 x 768
alloc_compute_meta:      CUDA0 compute buffer size =   140.50 MiB
alloc_compute_meta:        CPU compute buffer size =     6.77 MiB
alloc_compute_meta: graph splits = 1, nodes = 1569
warmup: flash attention is enabled
srv    load_model: loaded multimodal model, '/mnt/Speed/AI/Models/bartowski/google_gemma-4-31B-it-GGUF/mmproj-google_gemma-4-31B-it-bf16.gguf'
srv    load_model: initializing slots, n_slots = 1
CUDA Graph id 43 reused
ggml_backend_cuda_graph_compute: CUDA graph warmup complete
CUDA Graph id 44 reused
ggml_backend_cuda_graph_compute: CUDA graph warmup complete
no implementations specified for speculative decoding
slot   load_model: id  0 | task -1 | new slot, n_ctx = 39936
slot        reset: id  0 | task -1 | 
srv    load_model: prompt cache is enabled, size limit: 2048 MiB
srv    load_model: use `--cache-ram 0` to disable the prompt cache
srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
srv          init: init: --clear-idle requires --kv-unified, disabling
init: chat template, example_format: '<|turn>system
<|think|>
You are a helpful assistant<turn|>
<|turn>user
Hello<turn|>
<|turn>model
Hi there<turn|>
<|turn>user
How are you?<turn|>
<|turn>model
'
srv          init: init: chat template, thinking = 1
main: model loaded
main: server is listening on http://0.0.0.0:8085
main: starting the main loop...
que    start_loop: processing new tasks
que    start_loop: update slots
srv  update_slots: all slots are idle
que    start_loop: waiting for new tasks
add_text: <|turn>system
Marcal is **male**, and uses both he/him and they/them pronouns interchangeably, but identifies as male and presents as such. Marcal is bisexual Marcal is caucasian/white.<turn|>
<|turn>user
Summary of the story so far:

<summary>

</summary>

Dialogue examples:

<examples><turn|>
<|turn>system
</examples>

This is the end of the examples, the examples are NOT CANON.

The canon story:

<story><turn|>
<|turn>system
[Start a new Chat]<turn|>
<|turn>user
<turn="Marcal">
How many 'r's in the word 'strawberry'?
</turn><turn|>
<|turn>model
<|channel>
srv  params_from_: Grammar lazy: false
srv  params_from_: Chat format: peg-gemma4
srv  params_from_: Generation prompt: '<|turn>model
'
srv  params_from_: Preserved token: 100
srv  params_from_: Preserved token: 101
srv  params_from_: Preserved token: 48
srv  params_from_: Preserved token: 49
srv  params_from_: Preserved token: 105
srv  params_from_: reasoning budget: tokens=-1, generation_prompt='<|turn>model
', start=2 toks, end=1 toks, forced=1 toks
res  add_waiting_: add task 0 to waiting list. current waiting = 0 (before add)
que          post: new task, id = 0/1, front = 0
que    start_loop: processing new tasks
que    start_loop: processing task, id = 0
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
srv  get_availabl: updating prompt cache
srv          load:  - looking for better prompt, base f_keep = -1.000, sim = 0.000
srv        update:  - cache state: 0 prompts, 0.000 MiB (limits: 2048.000 MiB, 39936 tokens, 2147483648 est)
srv  get_availabl: prompt cache update took 0.01 ms
slot launch_slot_: id  0 | task -1 | launching slot : {"id":0,"n_ctx":39936,"speculative":false,"is_processing":false}
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist 
slot launch_slot_: id  0 | task 0 | processing task, is_child = 0
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 1, front = 0
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 39936, n_keep = 0, task.n_tokens = 149
slot update_slots: id  0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 145, batch.n_tokens = 145, progress = 0.973154
slot update_slots: id  0 | task 0 | main/do_checkpoint = no, pos_min = -1, pos_max = -1
srv  update_slots: decoding batch, n_tokens = 145
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
ggml_backend_cuda_graph_compute: CUDA graph warmup reset
ggml_backend_cuda_graph_compute: CUDA graph warmup reset
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 1
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 2, front = 0
slot update_slots: id  0 | task 0 | n_tokens = 145, memory_seq_rm [145, end)
slot init_sampler: id  0 | task 0 | init sampler, took 0.02 ms, tokens: text = 149, total = 149
slot update_slots: id  0 | task 0 | prompt processing done, n_tokens = 149, batch.n_tokens = 4
slot update_slots: id  0 | task 0 | main/do_checkpoint = yes, pos_min = 0, pos_max = 144
slot create_check: id  0 | task 0 | created context checkpoint 1 of 2 (pos_min = 0, pos_max = 144, n_tokens = 145, size = 113.284 MiB)
srv  update_slots: decoding batch, n_tokens = 4
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
res          send: sending result for task id = 0
res          send: task id = 0 pushed to result queue
slot process_toke: id  0 | task 0 | n_decoded = 1, n_remaining = 4095, next token: 45518 'thought'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 2
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 3, front = 0
slot update_batch: id  0 | task 0 | slot decode token, id=45518, n_ctx = 39936, n_tokens = 149, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
srv          stop: all tasks already finished, no need to cancel
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
srv  log_server_r: request:  {"messages":[{"role":"system","content":"Marcal is **male**, and uses both he/him and they/them pronouns interchangeably, but identifies as male and presents as such. Marcal is bisexual Marcal is caucasian/white."},{"role":"user","content":"Summary of the story so far:\n\n<summary>\n\n</summary>\n\nDialogue examples:\n\n<examples>"},{"role":"system","content":"</examples>\n\nThis is the end of the examples, the examples are NOT CANON.\n\nThe canon story:\n\n<story>"},{"role":"system","content":"[Start a new Chat]"},{"role":"user","content":"<turn=\"Marcal\">\nHow many 'r's in the word 'strawberry'?\n</turn>"},{"role":"assistant","content":"<|channel>"}],"model":"G4-31","temperature":0.75,"max_tokens":4096,"stream":true,"presence_penalty":0,"frequency_penalty":0,"top_p":0.99,"stop":["\nYou:","\n``...``","\n''.''","<|user|>"],"chat_template_kwargs":{"enable_thinking":false}}
srv  log_server_r: response: 
srv    operator(): http: streamed chunk: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1776650875,"id":"chatcmpl-4fwUOdBIhvbvn4w0pHc8aHYBjZBOjYRN","model":"G4-31","system_fingerprint":"b8851-e365e658f","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"thought"}}],"created":1776650875,"id":"chatcmpl-4fwUOdBIhvbvn4w0pHc8aHYBjZBOjYRN","model":"G4-31","system_fingerprint":"b8851-e365e658f","object":"chat.completion.chunk"}


res          send: sending result for task id = 0
res          send: task id = 0 pushed to result queue
slot process_toke: id  0 | task 0 | n_decoded = 2, n_remaining = 4094, next token:   107 '
'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 3
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 4, front = 0
slot update_batch: id  0 | task 0 | slot decode token, id=107, n_ctx = 39936, n_tokens = 150, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
CUDA Graph id 57 reused
ggml_backend_cuda_graph_compute: CUDA graph warmup complete
CUDA Graph id 58 reused
ggml_backend_cuda_graph_compute: CUDA graph warmup complete
res          send: sending result for task id = 0
res          send: task id = 0 pushed to result queue
slot process_toke: id  0 | task 0 | n_decoded = 3, n_remaining = 4093, next token:   101 '<channel|>'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 4
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 5, front = 0
slot update_batch: id  0 | task 0 | slot decode token, id=101, n_ctx = 39936, n_tokens = 151, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
srv    operator(): http: streamed chunk: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"\n"}}],"created":1776650875,"id":"chatcmpl-4fwUOdBIhvbvn4w0pHc8aHYBjZBOjYRN","model":"G4-31","system_fingerprint":"b8851-e365e658f","object":"chat.completion.chunk"}


CUDA Graph id 57 reused
CUDA Graph id 58 reused
res          send: sending result for task id = 0
res          send: task id = 0 pushed to result queue
slot process_toke: id  0 | task 0 | n_decoded = 4, n_remaining = 4092, next token:  3810 'There'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 5
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 6, front = 0
slot update_batch: id  0 | task 0 | slot decode token, id=3810, n_ctx = 39936, n_tokens = 152, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
CUDA Graph id 57 reused
CUDA Graph id 58 reused
res          send: sending result for task id = 0
res          send: task id = 0 pushed to result queue
slot process_toke: id  0 | task 0 | n_decoded = 5, n_remaining = 4091, next token:   659 ' are'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 6
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 7, front = 0
slot update_batch: id  0 | task 0 | slot decode token, id=659, n_ctx = 39936, n_tokens = 153, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
CUDA Graph id 57 reused
CUDA Graph id 58 reused
res          send: sending result for task id = 0
res          send: task id = 0 pushed to result queue
slot process_toke: id  0 | task 0 | n_decoded = 6, n_remaining = 4090, next token: 236743 ' '
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 7
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 8, front = 0
slot update_batch: id  0 | task 0 | slot decode token, id=236743, n_ctx = 39936, n_tokens = 154, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
CUDA Graph id 57 reused
CUDA Graph id 58 reused
res          send: sending result for task id = 0
res          send: task id = 0 pushed to result queue
slot process_toke: id  0 | task 0 | n_decoded = 7, n_remaining = 4089, next token: 236800 '3'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 8
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 9, front = 0
slot update_batch: id  0 | task 0 | slot decode token, id=236800, n_ctx = 39936, n_tokens = 155, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
CUDA Graph id 57 reused
CUDA Graph id 58 reused
res          send: sending result for task id = 0
res          send: task id = 0 pushed to result queue
slot process_toke: id  0 | task 0 | n_decoded = 8, n_remaining = 4088, next token:   756 ' ''
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 9
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 10, front = 0
slot update_batch: id  0 | task 0 | slot decode token, id=756, n_ctx = 39936, n_tokens = 156, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
CUDA Graph id 57 reused
CUDA Graph id 58 reused
res          send: sending result for task id = 0
res          send: task id = 0 pushed to result queue
slot process_toke: id  0 | task 0 | n_decoded = 9, n_remaining = 4087, next token: 236750 'r'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 10
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 11, front = 0
slot update_batch: id  0 | task 0 | slot decode token, id=236750, n_ctx = 39936, n_tokens = 157, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
CUDA Graph id 57 reused
CUDA Graph id 58 reused
res          send: sending result for task id = 0
res          send: task id = 0 pushed to result queue
slot process_toke: id  0 | task 0 | n_decoded = 10, n_remaining = 4086, next token: 236789 '''
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 11
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 12, front = 0
slot update_batch: id  0 | task 0 | slot decode token, id=236789, n_ctx = 39936, n_tokens = 158, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
CUDA Graph id 57 reused
CUDA Graph id 58 reused
res          send: sending result for task id = 0
res          send: task id = 0 pushed to result queue
slot process_toke: id  0 | task 0 | n_decoded = 11, n_remaining = 4085, next token: 236751 's'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 12
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 13, front = 0
slot update_batch: id  0 | task 0 | slot decode token, id=236751, n_ctx = 39936, n_tokens = 159, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
CUDA Graph id 57 reused
CUDA Graph id 58 reused
res          send: sending result for task id = 0
res          send: task id = 0 pushed to result queue
slot process_toke: id  0 | task 0 | n_decoded = 12, n_remaining = 4084, next token:   528 ' in'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 13
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 14, front = 0
slot update_batch: id  0 | task 0 | slot decode token, id=528, n_ctx = 39936, n_tokens = 160, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
CUDA Graph id 57 reused
CUDA Graph id 58 reused
res          send: sending result for task id = 0
res          send: task id = 0 pushed to result queue
slot process_toke: id  0 | task 0 | n_decoded = 13, n_remaining = 4083, next token: 35324 ' strawberry'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 14
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 15, front = 0
slot update_batch: id  0 | task 0 | slot decode token, id=35324, n_ctx = 39936, n_tokens = 161, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
CUDA Graph id 57 reused
CUDA Graph id 58 reused
res          send: sending result for task id = 0
res          send: task id = 0 pushed to result queue
slot process_toke: id  0 | task 0 | n_decoded = 14, n_remaining = 4082, next token: 236761 '.'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 15
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 16, front = 0
slot update_batch: id  0 | task 0 | slot decode token, id=236761, n_ctx = 39936, n_tokens = 162, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
CUDA Graph id 57 reused
CUDA Graph id 58 reused
res          send: sending result for task id = 0
res          send: task id = 0 pushed to result queue
slot process_toke: id  0 | task 0 | stopped by EOS
slot process_toke: id  0 | task 0 | n_decoded = 15, n_remaining = 4081, next token:   106 ''
slot print_timing: id  0 | task 0 | 
prompt eval time =     776.52 ms /   149 tokens (    5.21 ms per token,   191.88 tokens per second)
       eval time =     598.26 ms /    15 tokens (   39.88 ms per token,    25.07 tokens per second)
      total time =    1374.78 ms /   164 tokens
res          send: sending result for task id = 0
res          send: task id = 0 pushed to result queue
slot      release: id  0 | task 0 | stop processing: n_tokens = 163, truncated = 0
slot        reset: id  0 | task 0 | 
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 16
que    start_loop: update slots
srv  update_slots: all slots are idle
que    start_loop: waiting for new tasks
Parsed message: {"role":"assistant","content":"thought\n"}
srv    operator(): http: streamed chunk: data: {"choices":[{"finish_reason":"stop","index":0,"delta":{}}],"created":1776650875,"id":"chatcmpl-4fwUOdBIhvbvn4w0pHc8aHYBjZBOjYRN","model":"G4-31","system_fingerprint":"b8851-e365e658f","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":149,"prompt_ms":776.522,"prompt_per_token_ms":5.211557046979866,"prompt_per_second":191.88123453038034,"predicted_n":15,"predicted_ms":598.256,"predicted_per_token_ms":39.88373333333333,"predicted_per_second":25.07287850017384},"__verbose":{"index":0,"content":"","tokens":[],"id_slot":0,"stop":true,"model":"G4-31","tokens_predicted":15,"tokens_evaluated":149,"generation_settings":{"seed":4294967295,"temperature":0.75,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":64,"top_p":0.9900000095367432,"min_p":0.05000000074505806,"top_n_sigma":-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":39936,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":["\nYou:","\n``...``","\n''.''","<|user|>"],"max_tokens":4096,"n_predict":4096,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":true,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[48,49,100,101,105],"chat_format":"peg-gemma4","reasoning_format":"deepseek","reasoning_in_content":false,"generation_prompt":"<|turn>model\n","samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.n_max":16,"speculative.n_min":0,"speculative.p_min":0.75,"speculative.type":"none","speculative.ngram_size_n":1024,"speculative.ngram_size_m":1024,"speculative.ngram_m_hits":1024,"timings_per_token":false,"post_sampling_probs":false,"backend_sampling":false,"lora":[]},"prompt":"
<bos><|turn>system\nMarcal is **male**, and uses both he/him and they/them pronouns interchangeably, but identifies as male and presents as such. Marcal is bisexual Marcal is caucasian/white.<turn|>\n<|turn>user\nSummary of the story so far:\n\n<summary>\n\n</summary>\n\nDialogue examples:\n\n<examples><turn|>\n<|turn>system\n</examples>\n\nThis is the end of the examples, the examples are NOT CANON.\n\nThe canon story:\n\n<story><turn|>\n<|turn>system\n[Start a new Chat]<turn|>\n<|turn>user\n<turn=\"Marcal\">\nHow many 'r's in the word'strawberry'?\n</turn><turn|>\n<|turn>model\n<|channel>","has_new_line":true,"truncated":false,"stop_type":"eos","stopping_word":"","tokens_cached":163,"timings":{"cache_n":0,"prompt_n":149,"prompt_ms":776.522,"prompt_per_token_ms":5.211557046979866,"prompt_per_second":191.88123453038034,"predicted_n":15,"predicted_ms":598.256,"predicted_per_token_ms":39.88373333333333,"predicted_per_second":25.07287850017384}}}


srv    operator(): all results received, terminating stream
srv    operator(): http: streamed chunk: data: [DONE]


srv    operator(): http: stream ended
res  remove_waiti: remove task 0 from waiting list. current waiting = 1 (before remove)
srv          stop: all tasks already finished, no need to cancel
add_text: <|turn>system
Marcal is **male**, and uses both he/him and they/them pronouns interchangeably, but identifies as male and presents as such. Marcal is bisexual Marcal is caucasian/white.<turn|>
<|turn>user
Summary of the story so far:

<summary>

</summary>

Dialogue examples:

<examples><turn|>
<|turn>system
</examples>

This is the end of the examples, the examples are NOT CANON.

The canon story:

<story><turn|>
<|turn>system
[Start a new Chat]<turn|>
<|turn>user
<turn="Marcal">
How many 'r's in the word 'strawberry'?
</turn><turn|>
<|turn>model
<|channel>
srv  params_from_: Grammar lazy: false
srv  params_from_: Chat format: peg-gemma4
srv  params_from_: Generation prompt: '<|turn>model
'
srv  params_from_: Preserved token: 100
srv  params_from_: Preserved token: 101
srv  params_from_: Preserved token: 48
srv  params_from_: Preserved token: 49
srv  params_from_: Preserved token: 105
srv  params_from_: reasoning budget: tokens=-1, generation_prompt='<|turn>model
', start=2 toks, end=1 toks, forced=1 toks
res  add_waiting_: add task 17 to waiting list. current waiting = 0 (before add)
que          post: new task, id = 17/1, front = 0
que    start_loop: processing new tasks
que    start_loop: processing task, id = 17
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 1.000 (> 0.100 thold), f_keep = 0.914
slot launch_slot_: id  0 | task -1 | launching slot : {"id":0,"n_ctx":39936,"speculative":false,"is_processing":false,"id_task":0,"params":{"seed":4294967295,"temperature":0.75,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":64,"top_p":0.9900000095367432,"min_p":0.05000000074505806,"top_n_sigma":-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":39936,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":["\nYou:","\n``...``","\n''.''","<|user|>"],"max_tokens":4096,"n_predict":4096,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":true,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[48,49,100,101,105],"chat_format":"peg-gemma4","reasoning_format":"deepseek","reasoning_in_content":false,"generation_prompt":"<|turn>model\n","samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.n_max":16,"speculative.n_min":0,"speculative.p_min":0.75,"speculative.type":"none","speculative.ngram_size_n":1024,"speculative.ngram_size_m":1024,"speculative.ngram_m_hits":1024,"timings_per_token":false,"post_sampling_probs":false,"backend_sampling":false,"lora":[]},"next_token":[{"has_next_token":false,"has_new_line":false,"n_remain":4081,"n_decoded":15}],"prompt":"<bos><|turn>system\nMarcal is **male**, and uses both he/him and they/them pronouns interchangeably, but identifies as male and presents as such. 
Marcal is bisexual Marcal is caucasian/white.<turn|>\n<|turn>user\nSummary of the story so far:\n\n<summary>\n\n</summary>\n\nDialogue examples:\n\n<examples><turn|>\n<|turn>system\n</examples>\n\nThis is the end of the examples, the examples are NOT CANON.\n\nThe canon story:\n\n<story><turn|>\n<|turn>system\n[Start a new Chat]<turn|>\n<|turn>user\n<turn=\"Marcal\">\nHow many 'r's in the word'strawberry'?\n</turn><turn|>\n<|turn>model\n<|channel>","generated":""}
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist 
slot launch_slot_: id  0 | task 17 | processing task, is_child = 0
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 18, front = 0
slot update_slots: id  0 | task 17 | new prompt, n_ctx_slot = 39936, n_keep = 0, task.n_tokens = 149
slot update_slots: id  0 | task 17 | n_past = 149, slot.prompt.tokens.size() = 163, seq_id = 0, pos_min = 0, n_swa = 1024
slot update_slots: id  0 | task 17 | Checking checkpoint with [0, 144] against 0...
state_read_meta: cell_count = 145, dest_seq_id = 0
slot update_slots: id  0 | task 17 | restored context checkpoint (pos_min = 0, pos_max = 144, n_tokens = 145, n_past = 144, size = 113.284 MiB)
slot update_slots: id  0 | task 17 | n_tokens = 144, memory_seq_rm [144, end)
slot update_slots: id  0 | task 17 | prompt processing progress, n_tokens = 145, batch.n_tokens = 1, progress = 0.973154
slot update_slots: id  0 | task 17 | main/do_checkpoint = no, pos_min = 0, pos_max = 143
srv  update_slots: decoding batch, n_tokens = 1
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 18
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 19, front = 0
slot update_slots: id  0 | task 17 | n_tokens = 145, memory_seq_rm [145, end)
slot init_sampler: id  0 | task 17 | init sampler, took 0.01 ms, tokens: text = 149, total = 149
slot update_slots: id  0 | task 17 | prompt processing done, n_tokens = 149, batch.n_tokens = 4
slot update_slots: id  0 | task 17 | main/do_checkpoint = no, pos_min = 0, pos_max = 144
srv  update_slots: decoding batch, n_tokens = 4
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
slot process_toke: id  0 | task 17 | n_decoded = 1, n_remaining = 4095, next token: 45518 'thought'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 19
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 20, front = 0
slot update_batch: id  0 | task 17 | slot decode token, id=45518, n_ctx = 39936, n_tokens = 149, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
slot process_toke: id  0 | task 17 | n_decoded = 2, n_remaining = 4094, next token:   107 '
'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 20
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 21, front = 0
slot update_batch: id  0 | task 17 | slot decode token, id=107, n_ctx = 39936, n_tokens = 150, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
CUDA Graph id 69 reused
ggml_backend_cuda_graph_compute: CUDA graph warmup complete
CUDA Graph id 70 reused
ggml_backend_cuda_graph_compute: CUDA graph warmup complete
slot process_toke: id  0 | task 17 | n_decoded = 3, n_remaining = 4093, next token:   101 '<channel|>'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 21
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 22, front = 0
slot update_batch: id  0 | task 17 | slot decode token, id=101, n_ctx = 39936, n_tokens = 151, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
CUDA Graph id 69 reused
CUDA Graph id 70 reused
slot process_toke: id  0 | task 17 | n_decoded = 4, n_remaining = 4092, next token:  3810 'There'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 22
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 23, front = 0
slot update_batch: id  0 | task 17 | slot decode token, id=3810, n_ctx = 39936, n_tokens = 152, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
CUDA Graph id 69 reused
CUDA Graph id 70 reused
slot process_toke: id  0 | task 17 | n_decoded = 5, n_remaining = 4091, next token:   659 ' are'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 23
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 24, front = 0
slot update_batch: id  0 | task 17 | slot decode token, id=659, n_ctx = 39936, n_tokens = 153, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
CUDA Graph id 69 reused
CUDA Graph id 70 reused
slot process_toke: id  0 | task 17 | n_decoded = 6, n_remaining = 4090, next token: 236743 ' '
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 24
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 25, front = 0
slot update_batch: id  0 | task 17 | slot decode token, id=236743, n_ctx = 39936, n_tokens = 154, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
CUDA Graph id 69 reused
CUDA Graph id 70 reused
slot process_toke: id  0 | task 17 | n_decoded = 7, n_remaining = 4089, next token: 236800 '3'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 25
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 26, front = 0
slot update_batch: id  0 | task 17 | slot decode token, id=236800, n_ctx = 39936, n_tokens = 155, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
CUDA Graph id 69 reused
CUDA Graph id 70 reused
slot process_toke: id  0 | task 17 | n_decoded = 8, n_remaining = 4088, next token:   756 ' ''
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 26
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 27, front = 0
slot update_batch: id  0 | task 17 | slot decode token, id=756, n_ctx = 39936, n_tokens = 156, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
CUDA Graph id 69 reused
CUDA Graph id 70 reused
slot process_toke: id  0 | task 17 | n_decoded = 9, n_remaining = 4087, next token: 236750 'r'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 27
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 28, front = 0
slot update_batch: id  0 | task 17 | slot decode token, id=236750, n_ctx = 39936, n_tokens = 157, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
CUDA Graph id 69 reused
CUDA Graph id 70 reused
slot process_toke: id  0 | task 17 | n_decoded = 10, n_remaining = 4086, next token: 236789 '''
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 28
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 29, front = 0
slot update_batch: id  0 | task 17 | slot decode token, id=236789, n_ctx = 39936, n_tokens = 158, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
CUDA Graph id 69 reused
CUDA Graph id 70 reused
slot process_toke: id  0 | task 17 | n_decoded = 11, n_remaining = 4085, next token: 236751 's'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 29
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 30, front = 0
slot update_batch: id  0 | task 17 | slot decode token, id=236751, n_ctx = 39936, n_tokens = 159, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
CUDA Graph id 69 reused
CUDA Graph id 70 reused
slot process_toke: id  0 | task 17 | n_decoded = 12, n_remaining = 4084, next token:   528 ' in'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 30
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 31, front = 0
slot update_batch: id  0 | task 17 | slot decode token, id=528, n_ctx = 39936, n_tokens = 160, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
CUDA Graph id 69 reused
CUDA Graph id 70 reused
slot process_toke: id  0 | task 17 | n_decoded = 13, n_remaining = 4083, next token: 35324 ' strawberry'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 31
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 32, front = 0
slot update_batch: id  0 | task 17 | slot decode token, id=35324, n_ctx = 39936, n_tokens = 161, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
CUDA Graph id 69 reused
CUDA Graph id 70 reused
slot process_toke: id  0 | task 17 | n_decoded = 14, n_remaining = 4082, next token: 236761 '.'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 32
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 33, front = 0
slot update_batch: id  0 | task 17 | slot decode token, id=236761, n_ctx = 39936, n_tokens = 162, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
CUDA Graph id 69 reused
CUDA Graph id 70 reused
slot process_toke: id  0 | task 17 | stopped by EOS
slot process_toke: id  0 | task 17 | n_decoded = 15, n_remaining = 4081, next token:   106 ''
slot print_timing: id  0 | task 17 | 
prompt eval time =     729.34 ms /     5 tokens (  145.87 ms per token,     6.86 tokens per second)
       eval time =     599.69 ms /    15 tokens (   39.98 ms per token,    25.01 tokens per second)
      total time =    1329.02 ms /    20 tokens
res          send: sending result for task id = 17
res          send: task id = 17 pushed to result queue
slot      release: id  0 | task 17 | stop processing: n_tokens = 163, truncated = 0
slot        reset: id  0 | task 17 | 
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 33
que    start_loop: update slots
srv  update_slots: all slots are idle
que    start_loop: waiting for new tasks
Parsed message: {"role":"assistant","content":"thought\n"}
srv          stop: all tasks already finished, no need to cancel
res  remove_waiti: remove task 17 from waiting list. current waiting = 1 (before remove)
srv          stop: all tasks already finished, no need to cancel
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
srv  log_server_r: request:  {"messages":[{"role":"system","content":"Marcal is **male**, and uses both he/him and they/them pronouns interchangeably, but identifies as male and presents as such. Marcal is bisexual Marcal is caucasian/white."},{"role":"user","content":"Summary of the story so far:\n\n<summary>\n\n</summary>\n\nDialogue examples:\n\n<examples>"},{"role":"system","content":"</examples>\n\nThis is the end of the examples, the examples are NOT CANON.\n\nThe canon story:\n\n<story>"},{"role":"system","content":"[Start a new Chat]"},{"role":"user","content":"<turn=\"Marcal\">\nHow many 'r's in the word 'strawberry'?\n</turn>"},{"role":"assistant","content":"<|channel>"}],"model":"G4-31","temperature":0.75,"max_tokens":4096,"stream":false,"presence_penalty":0,"frequency_penalty":0,"top_p":0.99,"stop":["\nYou:","\n``...``","\n''.''","<|user|>"],"chat_template_kwargs":{"enable_thinking":false}}
srv  log_server_r: response: {"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"thought\n"}}],"created":1776650913,"model":"G4-31","system_fingerprint":"b8851-e365e658f","object":"chat.completion","usage":{"completion_tokens":15,"prompt_tokens":149,"total_tokens":164,"prompt_tokens_details":{"cached_tokens":144}},"id":"chatcmpl-keO5OkL3B59m3xRhGIhhxd49fh88taLE","__verbose":{"index":0,"content":"thought\n<channel|>There are 3 'r's in strawberry.","tokens":[],"id_slot":0,"stop":true,"model":"G4-31","tokens_predicted":15,"tokens_evaluated":149,"generation_settings":{"seed":4294967295,"temperature":0.75,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":64,"top_p":0.9900000095367432,"min_p":0.05000000074505806,"top_n_sigma":-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":39936,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":["\nYou:","\n``...``","\n''.''","<|user|>"],"max_tokens":4096,"n_predict":4096,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[48,49,100,101,105],"chat_format":"peg-gemma4","reasoning_format":"deepseek","reasoning_in_content":false,"generation_prompt":"<|turn>model\n","samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.n_max":16,"speculative.n_min":0,"speculative.p_min":0.75,"speculative.type":"none","speculative.ngram_size_n":1024,"speculative.ngram_size_m":1024,"speculative.ngram_m_hits":1024,"timings_per_token":false,"post_sampling_probs":false,"backend_sampling":false,"lora":[]},"prompt":"<bos><|turn>system\nMarcal is **male**, and uses both he/him and they/them 
pronouns interchangeably, but identifies as male and presents as such. Marcal is bisexual Marcal is caucasian/white.<turn|>\n<|turn>user\nSummary of the story so far:\n\n<summary>\n\n</summary>\n\nDialogue examples:\n\n<examples><turn|>\n<|turn>system\n</examples>\n\nThis is the end of the examples, the examples are NOT CANON.\n\nThe canon story:\n\n<story><turn|>\n<|turn>system\n[Start a new Chat]<turn|>\n<|turn>user\n<turn=\"Marcal\">\nHow many 'r's in the word'strawberry'?\n</turn><turn|>\n<|turn>model\n<|channel>","has_new_line":true,"truncated":false,"stop_type":"eos","stopping_word":"","tokens_cached":163,"timings":{"cache_n":144,"prompt_n":5,"prompt_ms":729.335,"prompt_per_token_ms":145.86700000000002,"prompt_per_second":6.855560202101914,"predicted_n":15,"predicted_ms":599.688,"predicted_per_token_ms":39.9792,"predicted_per_second":25.01300676351703}},"timings":{"cache_n":144,"prompt_n":5,"prompt_ms":729.335,"prompt_per_token_ms":145.86700000000002,"prompt_per_second":6.855560202101914,"predicted_n":15,"predicted_ms":599.688,"predicted_per_token_ms":39.9792,"predicted_per_second":25.01300676351703}}

Currently, the latest version no longer returns the 500 error, but it still parses the thought as the response, and the actual response doesn't display.

@aldehir
Contributor Author

aldehir commented Apr 20, 2026

I am quite confused by your prompt. Are you using a custom template, or are you intentionally injecting Gemma 4 tokens into your conversation?

e.g.,

Dialogue examples:

<examples><turn|>
<|turn>system
</examples>
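
For context, a minimal sketch (an assumption simplified from the rendered prompt visible in the server log above, not the actual Jinja chat template) of how role messages map onto the Gemma 4 `<|turn>...<turn|>` format, including the `<|turn>model\n` generation prompt this PR ensures is present:

```python
# Simplified illustration of the Gemma 4 turn format, reconstructed from
# the prompt strings in the log above. Not the real template logic.
def render_gemma4_prompt(messages, add_generation_prompt=True):
    out = ["<bos>"]
    for msg in messages:
        # Each message becomes: <|turn>{role}\n{content}<turn|>\n
        out.append(f"<|turn>{msg['role']}\n{msg['content']}<turn|>\n")
    if add_generation_prompt:
        # The generation prompt the model continues from.
        out.append("<|turn>model\n")
    return "".join(out)

prompt = render_gemma4_prompt([
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "How many 'r's in 'strawberry'?"},
])
```

Injecting raw tokens such as `<|channel>` into message content bypasses this structure, which is why it can confuse the parser.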

@Quairon-Nailo

Quairon-Nailo commented Apr 20, 2026

I'm using SillyTavern's chat completion, which calls the OpenAI-compatible endpoint at /v1/chat/completions. What I'm sending is JSON containing a list of system, user, and assistant messages. The only Gemma 4 token I'm intentionally adding is <|channel>; everything else you see is added by llama.cpp when it transforms that call into a text-completion prompt for the model.
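
The shape of that request can be sketched as follows (an assumption reduced from the full request logged above; the model name and message contents are illustrative). The trailing assistant message acts as a prefill that llama.cpp appends to the prompt without a closing `<turn|>`:

```python
import json

# Reduced sketch of the chat-completions payload SillyTavern sends.
# "G4-31" and the message texts are illustrative, taken from the log above.
payload = {
    "model": "G4-31",
    "stream": False,
    "messages": [
        {"role": "system", "content": "Marcal is **male**, ..."},
        {"role": "user", "content": "How many 'r's in the word 'strawberry'?"},
        # Assistant prefill: injects a raw Gemma 4 token into the prompt,
        # so generation resumes immediately after <|channel>.
        {"role": "assistant", "content": "<|channel>"},
    ],
}
body = json.dumps(payload)
```

Because the prefill ends the prompt mid-turn, the parser sees `<|channel>thought\n<channel|>...` as content rather than a reasoning block.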

mengqin pushed a commit to mengqin/llama.cpp that referenced this pull request Apr 20, 2026
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request Apr 21, 2026
@Quairon-Nailo

I created a fix for that problem: #22325. I've tested it, and it works both in my case, with more extensive prefills, and when generating normally with no prefill.
