Eval bug: CPU usage is abnormally high when using the CUDA backend to infer GLM-4.7-Flash #18948

@lingyezhixing

Description

Name and Version

D:\LLM\LLM-Manager\backend\llama.cpp>llama-cli.exe --version
ggml_cuda_init: found 1 CUDA devices:
Device 0: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes
load_backend: loaded CUDA backend from D:\LLM\LLM-Manager\backend\llama.cpp\ggml-cuda.dll
load_backend: loaded RPC backend from D:\LLM\LLM-Manager\backend\llama.cpp\ggml-rpc.dll
load_backend: loaded CPU backend from D:\LLM\LLM-Manager\backend\llama.cpp\ggml-cpu-zen4.dll
version: 7779 (6df686b)
built with Clang 19.1.5 for Windows x86_64

Operating systems

Windows

GGML backends

CUDA

Hardware

AMD R9 7940H + V100-32G-SXM2

Models

GLM 4.7 Flash UD Q5_K_XL (quantized by Unsloth)

Problem description & steps to reproduce

CPU usage is abnormally high during generation, even though the model is fully offloaded (the log below shows 48/48 layers on the GPU). The log also shows CUDA graphs being disabled for this GPU architecture (Volta, compute capability 7.0), which may be relevant. Reproduction command:

llama-completion.exe -fit off -m E:\models\LLM\GGUF\GLM-4.7-Flash-UD-Q5_K_XL.gguf -c 4096 --jinja -ngl 99 -fa on --verbose -p "Write a Python program for a snake game"
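
To quantify "abnormally high", the per-process CPU counter can be sampled from a second terminal while the command above is generating. A minimal sketch using the built-in Windows performance counters (the instance name "llama-completion" is an assumption; check the exact name with tasklist or Get-Process):

:: Sample the llama-completion process CPU counter once per second, 30 samples.
:: Note: this counter sums across logical cores, so e.g. 800 means roughly 8 cores busy.
powershell -Command "Get-Counter '\Process(llama-completion)\% Processor Time' -SampleInterval 1 -MaxSamples 30"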

First Bad Commit

No response
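
If a bisect is attempted later, a rough workflow for finding the first bad commit would look like the following sketch (the CMake flags are assumptions for a typical CUDA build; the known-good commit must be filled in):

git bisect start
git bisect bad 6df686b
git bisect good <known-good-commit>
:: at each step: rebuild, re-run the reproduction command, then mark the result
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
:: git bisect good   (if CPU usage is normal)
:: git bisect bad    (if CPU usage is abnormally high)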

Relevant log output

D:\LLM\LLM-Manager\backend\llama.cpp>set CUDA_VISIBLE_DEVICES=1

D:\LLM\LLM-Manager\backend\llama.cpp>llama-completion.exe -fit off -m E:\models\LLM\GGUF\GLM-4.7-Flash-UD-Q5_K_XL.gguf -c 4096 --jinja -ngl 99 -fa on --verbose -p "Write a Python program for a snake game"
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes
load_backend: loaded CUDA backend from D:\LLM\LLM-Manager\backend\llama.cpp\ggml-cuda.dll
load_backend: loaded RPC backend from D:\LLM\LLM-Manager\backend\llama.cpp\ggml-rpc.dll
load_backend: loaded CPU backend from D:\LLM\LLM-Manager\backend\llama.cpp\ggml-cpu-zen4.dll
build: 7779 (6df686bee) with Clang 19.1.5 for Windows x86_64
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device CUDA0 (Tesla V100-SXM2-32GB) (0000:08:00.0) - 31292 MiB free
llama_model_loader: direct I/O is enabled, disabling mmap
llama_model_loader: loaded meta data with 58 key-value pairs and 844 tensors from E:\models\LLM\GGUF\GLM-4.7-Flash-UD-Q5_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                      general.sampling.temp f32              = 1.000000
llama_model_loader: - kv   3:                               general.name str              = Glm-4.7-Flash
llama_model_loader: - kv   4:                           general.basename str              = Glm-4.7-Flash
llama_model_loader: - kv   5:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   6:                         general.size_label str              = 64x2.6B
llama_model_loader: - kv   7:                            general.license str              = mit
llama_model_loader: - kv   8:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv   9:                   general.base_model.count u32              = 1
llama_model_loader: - kv  10:                  general.base_model.0.name str              = GLM 4.7 Flash
llama_model_loader: - kv  11:          general.base_model.0.organization str              = Zai Org
llama_model_loader: - kv  12:              general.base_model.0.repo_url str              = https://huggingface.co/zai-org/GLM-4....
llama_model_loader: - kv  13:                               general.tags arr[str,2]       = ["unsloth", "text-generation"]
llama_model_loader: - kv  14:                          general.languages arr[str,2]       = ["en", "zh"]
llama_model_loader: - kv  15:                      deepseek2.block_count u32              = 47
llama_model_loader: - kv  16:                   deepseek2.context_length u32              = 202752
llama_model_loader: - kv  17:                 deepseek2.embedding_length u32              = 2048
llama_model_loader: - kv  18:              deepseek2.feed_forward_length u32              = 10240
llama_model_loader: - kv  19:             deepseek2.attention.head_count u32              = 20
llama_model_loader: - kv  20:          deepseek2.attention.head_count_kv u32              = 1
llama_model_loader: - kv  21:                   deepseek2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  22: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  23:                deepseek2.expert_used_count u32              = 4
llama_model_loader: - kv  24:               deepseek2.expert_group_count u32              = 1
llama_model_loader: - kv  25:          deepseek2.expert_group_used_count u32              = 1
llama_model_loader: - kv  26:        deepseek2.leading_dense_block_count u32              = 1
llama_model_loader: - kv  27:                       deepseek2.vocab_size u32              = 154880
llama_model_loader: - kv  28:            deepseek2.attention.q_lora_rank u32              = 768
llama_model_loader: - kv  29:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  30:             deepseek2.attention.key_length u32              = 576
llama_model_loader: - kv  31:           deepseek2.attention.value_length u32              = 512
llama_model_loader: - kv  32:         deepseek2.attention.key_length_mla u32              = 256
llama_model_loader: - kv  33:       deepseek2.attention.value_length_mla u32              = 256
llama_model_loader: - kv  34:       deepseek2.expert_feed_forward_length u32              = 1536
llama_model_loader: - kv  35:                     deepseek2.expert_count u32              = 64
llama_model_loader: - kv  36:              deepseek2.expert_shared_count u32              = 1
llama_model_loader: - kv  37:             deepseek2.expert_weights_scale f32              = 1.800000
llama_model_loader: - kv  38:              deepseek2.expert_weights_norm bool             = true
llama_model_loader: - kv  39:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  40:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  41:                         tokenizer.ggml.pre str              = glm4
llama_model_loader: - kv  42:                      tokenizer.ggml.tokens arr[str,154880]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  43:                  tokenizer.ggml.token_type arr[i32,154880]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  44:                      tokenizer.ggml.merges arr[str,321649]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  45:                tokenizer.ggml.eos_token_id u32              = 154820
llama_model_loader: - kv  46:            tokenizer.ggml.padding_token_id u32              = 154821
llama_model_loader: - kv  47:                tokenizer.ggml.bos_token_id u32              = 154822
llama_model_loader: - kv  48:                tokenizer.ggml.eot_token_id u32              = 154827
llama_model_loader: - kv  49:            tokenizer.ggml.unknown_token_id u32              = 154820
llama_model_loader: - kv  50:                tokenizer.ggml.eom_token_id u32              = 154829
llama_model_loader: - kv  51:                    tokenizer.chat_template str              = [gMASK]<sop>\n{%- if tools -%}\n<|syste...
llama_model_loader: - kv  52:               general.quantization_version u32              = 2
llama_model_loader: - kv  53:                          general.file_type u32              = 17
llama_model_loader: - kv  54:                      quantize.imatrix.file str              = GLM-4.7-Flash-GGUF/imatrix_unsloth.gguf
llama_model_loader: - kv  55:                   quantize.imatrix.dataset str              = unsloth_calibration_GLM-4.7-Flash.txt
llama_model_loader: - kv  56:             quantize.imatrix.entries_count u32              = 607
llama_model_loader: - kv  57:              quantize.imatrix.chunks_count u32              = 85
llama_model_loader: - type  f32:  281 tensors
llama_model_loader: - type q8_0:  374 tensors
llama_model_loader: - type q4_K:   10 tensors
llama_model_loader: - type q5_K:  147 tensors
llama_model_loader: - type q6_K:   32 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q5_K - Medium
print_info: file size   = 20.11 GiB (5.77 BPW)
init_tokenizer: initializing tokenizer for type 2
load: 0 unused tokens
load: control token: 154825 '<eop>' is not marked as EOG
load: control token: 154822 '[gMASK]' is not marked as EOG
load: control token: 154853 '<|end_of_box|>' is not marked as EOG
load: control token: 154834 '<|begin_of_audio|>' is not marked as EOG
load: control token: 154826 '<|system|>' is not marked as EOG
load: control token: 154836 '<|begin_of_transcription|>' is not marked as EOG
load: control token: 154835 '<|end_of_audio|>' is not marked as EOG
load: control token: 154827 '<|user|>' is not marked as EOG
load: control token: 154823 '[sMASK]' is not marked as EOG
load: control token: 154837 '<|end_of_transcription|>' is not marked as EOG
load: control token: 154821 '[MASK]' is not marked as EOG
load: control token: 154824 '<sop>' is not marked as EOG
load: control token: 154828 '<|assistant|>' is not marked as EOG
load: control token: 154829 '<|observation|>' is not marked as EOG
load: control token: 154830 '<|begin_of_image|>' is not marked as EOG
load: control token: 154831 '<|end_of_image|>' is not marked as EOG
load: control token: 154832 '<|begin_of_video|>' is not marked as EOG
load: control token: 154833 '<|end_of_video|>' is not marked as EOG
load: control token: 154838 '<|code_prefix|>' is not marked as EOG
load: control token: 154839 '<|code_middle|>' is not marked as EOG
load: control token: 154840 '<|code_suffix|>' is not marked as EOG
load: control token: 154852 '<|begin_of_box|>' is not marked as EOG
load: control token: 154854 '<|image|>' is not marked as EOG
load: control token: 154855 '<|video|>' is not marked as EOG
load: special_eot_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special_eom_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 154820 ('<|endoftext|>')
load:   - 154827 ('<|user|>')
load:   - 154829 ('<|observation|>')
load: special tokens cache size = 36
load: token to piece cache size = 0.9811 MB
print_info: arch                  = deepseek2
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 202752
print_info: n_embd                = 2048
print_info: n_embd_inp            = 2048
print_info: n_layer               = 47
print_info: n_head                = 20
print_info: n_head_kv             = 1
print_info: n_rot                 = 64
print_info: n_swa                 = 0
print_info: is_swa_any            = 0
print_info: n_embd_head_k         = 576
print_info: n_embd_head_v         = 512
print_info: n_gqa                 = 20
print_info: n_embd_k_gqa          = 576
print_info: n_embd_v_gqa          = 512
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-05
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: n_ff                  = 10240
print_info: n_expert              = 64
print_info: n_expert_used         = 4
print_info: n_expert_groups       = 1
print_info: n_group_used          = 1
print_info: causal attn           = 1
print_info: pooling type          = 0
print_info: rope type             = 0
print_info: rope scaling          = linear
print_info: freq_base_train       = 1000000.0
print_info: freq_scale_train      = 1
print_info: n_ctx_orig_yarn       = 202752
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: model type            = ?B
print_info: model params          = 29.94 B
print_info: general.name          = Glm-4.7-Flash
print_info: n_layer_dense_lead    = 1
print_info: n_lora_q              = 768
print_info: n_lora_kv             = 512
print_info: n_embd_head_k_mla     = 256
print_info: n_embd_head_v_mla     = 256
print_info: n_ff_exp              = 1536
print_info: n_expert_shared       = 1
print_info: expert_weights_scale  = 1.8
print_info: expert_weights_norm   = 1
print_info: expert_gating_func    = softmax
print_info: vocab type            = BPE
print_info: n_vocab               = 154880
print_info: n_merges              = 321649
print_info: BOS token             = 154822 '[gMASK]'
print_info: EOS token             = 154820 '<|endoftext|>'
print_info: EOT token             = 154827 '<|user|>'
print_info: EOM token             = 154829 '<|observation|>'
print_info: UNK token             = 154820 '<|endoftext|>'
print_info: PAD token             = 154821 '[MASK]'
print_info: LF token              = 198 'Ċ'
print_info: FIM PRE token         = 154838 '<|code_prefix|>'
print_info: FIM SUF token         = 154840 '<|code_suffix|>'
print_info: FIM MID token         = 154839 '<|code_middle|>'
print_info: EOG token             = 154820 '<|endoftext|>'
print_info: EOG token             = 154827 '<|user|>'
print_info: EOG token             = 154829 '<|observation|>'
print_info: max token length      = 1024
load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = true)
load_tensors: layer   0 assigned to device CUDA0, is_swa = 0
load_tensors: layer   1 assigned to device CUDA0, is_swa = 0
load_tensors: layer   2 assigned to device CUDA0, is_swa = 0
load_tensors: layer   3 assigned to device CUDA0, is_swa = 0
load_tensors: layer   4 assigned to device CUDA0, is_swa = 0
load_tensors: layer   5 assigned to device CUDA0, is_swa = 0
load_tensors: layer   6 assigned to device CUDA0, is_swa = 0
load_tensors: layer   7 assigned to device CUDA0, is_swa = 0
load_tensors: layer   8 assigned to device CUDA0, is_swa = 0
load_tensors: layer   9 assigned to device CUDA0, is_swa = 0
load_tensors: layer  10 assigned to device CUDA0, is_swa = 0
load_tensors: layer  11 assigned to device CUDA0, is_swa = 0
load_tensors: layer  12 assigned to device CUDA0, is_swa = 0
load_tensors: layer  13 assigned to device CUDA0, is_swa = 0
load_tensors: layer  14 assigned to device CUDA0, is_swa = 0
load_tensors: layer  15 assigned to device CUDA0, is_swa = 0
load_tensors: layer  16 assigned to device CUDA0, is_swa = 0
load_tensors: layer  17 assigned to device CUDA0, is_swa = 0
load_tensors: layer  18 assigned to device CUDA0, is_swa = 0
load_tensors: layer  19 assigned to device CUDA0, is_swa = 0
load_tensors: layer  20 assigned to device CUDA0, is_swa = 0
load_tensors: layer  21 assigned to device CUDA0, is_swa = 0
load_tensors: layer  22 assigned to device CUDA0, is_swa = 0
load_tensors: layer  23 assigned to device CUDA0, is_swa = 0
load_tensors: layer  24 assigned to device CUDA0, is_swa = 0
load_tensors: layer  25 assigned to device CUDA0, is_swa = 0
load_tensors: layer  26 assigned to device CUDA0, is_swa = 0
load_tensors: layer  27 assigned to device CUDA0, is_swa = 0
load_tensors: layer  28 assigned to device CUDA0, is_swa = 0
load_tensors: layer  29 assigned to device CUDA0, is_swa = 0
load_tensors: layer  30 assigned to device CUDA0, is_swa = 0
load_tensors: layer  31 assigned to device CUDA0, is_swa = 0
load_tensors: layer  32 assigned to device CUDA0, is_swa = 0
load_tensors: layer  33 assigned to device CUDA0, is_swa = 0
load_tensors: layer  34 assigned to device CUDA0, is_swa = 0
load_tensors: layer  35 assigned to device CUDA0, is_swa = 0
load_tensors: layer  36 assigned to device CUDA0, is_swa = 0
load_tensors: layer  37 assigned to device CUDA0, is_swa = 0
load_tensors: layer  38 assigned to device CUDA0, is_swa = 0
load_tensors: layer  39 assigned to device CUDA0, is_swa = 0
load_tensors: layer  40 assigned to device CUDA0, is_swa = 0
load_tensors: layer  41 assigned to device CUDA0, is_swa = 0
load_tensors: layer  42 assigned to device CUDA0, is_swa = 0
load_tensors: layer  43 assigned to device CUDA0, is_swa = 0
load_tensors: layer  44 assigned to device CUDA0, is_swa = 0
load_tensors: layer  45 assigned to device CUDA0, is_swa = 0
load_tensors: layer  46 assigned to device CUDA0, is_swa = 0
load_tensors: layer  47 assigned to device CUDA0, is_swa = 0
create_tensor: loading tensor token_embd.weight
create_tensor: loading tensor output_norm.weight
create_tensor: loading tensor output.weight
create_tensor: loading tensor blk.0.attn_norm.weight
create_tensor: loading tensor blk.0.attn_q_a_norm.weight
create_tensor: loading tensor blk.0.attn_kv_a_norm.weight
create_tensor: loading tensor blk.0.attn_q_a.weight
create_tensor: loading tensor blk.0.attn_q_b.weight
create_tensor: loading tensor blk.0.attn_kv_a_mqa.weight
create_tensor: loading tensor blk.0.attn_k_b.weight
create_tensor: loading tensor blk.0.attn_v_b.weight
create_tensor: loading tensor blk.0.attn_output.weight
create_tensor: loading tensor blk.0.ffn_norm.weight
create_tensor: loading tensor blk.0.ffn_gate.weight
create_tensor: loading tensor blk.0.ffn_down.weight
create_tensor: loading tensor blk.0.ffn_up.weight
...... (tensor loading continues without error through blk.45)
create_tensor: loading tensor blk.46.attn_norm.weight
create_tensor: loading tensor blk.46.attn_q_a_norm.weight
create_tensor: loading tensor blk.46.attn_kv_a_norm.weight
create_tensor: loading tensor blk.46.attn_q_a.weight
create_tensor: loading tensor blk.46.attn_q_b.weight
create_tensor: loading tensor blk.46.attn_kv_a_mqa.weight
create_tensor: loading tensor blk.46.attn_k_b.weight
create_tensor: loading tensor blk.46.attn_v_b.weight
create_tensor: loading tensor blk.46.attn_output.weight
create_tensor: loading tensor blk.46.ffn_norm.weight
create_tensor: loading tensor blk.46.ffn_gate_inp.weight
create_tensor: loading tensor blk.46.exp_probs_b.bias
create_tensor: loading tensor blk.46.ffn_gate_exps.weight
create_tensor: loading tensor blk.46.ffn_down_exps.weight
create_tensor: loading tensor blk.46.ffn_up_exps.weight
create_tensor: loading tensor blk.46.ffn_gate_shexp.weight
create_tensor: loading tensor blk.46.ffn_down_shexp.weight
create_tensor: loading tensor blk.46.ffn_up_shexp.weight
load_tensors: tensor 'token_embd.weight' (q5_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
load_tensors: offloading output layer to GPU
load_tensors: offloading 46 repeating layers to GPU
load_tensors: offloaded 48/48 layers to GPU
load_tensors:          CPU model buffer size =   207.97 MiB
load_tensors:        CUDA0 model buffer size = 20383.21 MiB
load_all_data: no device found for buffer type CPU for async uploads
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
....................................................................................................
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|user|> logit bias = -inf
common_init_result: added <|observation|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_seq     = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (4096) < n_ctx_train (202752) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:  CUDA_Host  output buffer size =     0.59 MiB
llama_kv_cache: layer   0: dev = CUDA0
llama_kv_cache: layer   1: dev = CUDA0
llama_kv_cache: layer   2: dev = CUDA0
llama_kv_cache: layer   3: dev = CUDA0
llama_kv_cache: layer   4: dev = CUDA0
llama_kv_cache: layer   5: dev = CUDA0
llama_kv_cache: layer   6: dev = CUDA0
llama_kv_cache: layer   7: dev = CUDA0
llama_kv_cache: layer   8: dev = CUDA0
llama_kv_cache: layer   9: dev = CUDA0
llama_kv_cache: layer  10: dev = CUDA0
llama_kv_cache: layer  11: dev = CUDA0
llama_kv_cache: layer  12: dev = CUDA0
llama_kv_cache: layer  13: dev = CUDA0
llama_kv_cache: layer  14: dev = CUDA0
llama_kv_cache: layer  15: dev = CUDA0
llama_kv_cache: layer  16: dev = CUDA0
llama_kv_cache: layer  17: dev = CUDA0
llama_kv_cache: layer  18: dev = CUDA0
llama_kv_cache: layer  19: dev = CUDA0
llama_kv_cache: layer  20: dev = CUDA0
llama_kv_cache: layer  21: dev = CUDA0
llama_kv_cache: layer  22: dev = CUDA0
llama_kv_cache: layer  23: dev = CUDA0
llama_kv_cache: layer  24: dev = CUDA0
llama_kv_cache: layer  25: dev = CUDA0
llama_kv_cache: layer  26: dev = CUDA0
llama_kv_cache: layer  27: dev = CUDA0
llama_kv_cache: layer  28: dev = CUDA0
llama_kv_cache: layer  29: dev = CUDA0
llama_kv_cache: layer  30: dev = CUDA0
llama_kv_cache: layer  31: dev = CUDA0
llama_kv_cache: layer  32: dev = CUDA0
llama_kv_cache: layer  33: dev = CUDA0
llama_kv_cache: layer  34: dev = CUDA0
llama_kv_cache: layer  35: dev = CUDA0
llama_kv_cache: layer  36: dev = CUDA0
llama_kv_cache: layer  37: dev = CUDA0
llama_kv_cache: layer  38: dev = CUDA0
llama_kv_cache: layer  39: dev = CUDA0
llama_kv_cache: layer  40: dev = CUDA0
llama_kv_cache: layer  41: dev = CUDA0
llama_kv_cache: layer  42: dev = CUDA0
llama_kv_cache: layer  43: dev = CUDA0
llama_kv_cache: layer  44: dev = CUDA0
llama_kv_cache: layer  45: dev = CUDA0
llama_kv_cache: layer  46: dev = CUDA0
llama_kv_cache:      CUDA0 KV buffer size =   399.50 MiB
llama_kv_cache: size =  399.50 MiB (  4096 cells,  47 layers,  1/1 seqs), K (f16):  211.50 MiB, V (f16):  188.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
sched_reserve: reserving ...
sched_reserve: max_nodes = 6752
sched_reserve: reserving full memory module
sched_reserve: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
ggml_cuda_graph_set_enabled: disabling CUDA graphs due to GPU architecture
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
sched_reserve:      CUDA0 compute buffer size =   330.88 MiB
sched_reserve:  CUDA_Host compute buffer size =    58.51 MiB
sched_reserve: graph nodes  = 3411
sched_reserve: graph splits = 96
sched_reserve: reserve took 15.54 ms, sched copies = 1
clear_adapter_lora: call
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
set_warmup: value = 1
set_warmup: value = 0
main: llama threadpool init, n_threads = 8
attach_threadpool: call
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
*** User-specified prompt will pre-start conversation, did you mean to set --system-prompt (-sys) instead?
main: chat template example:
[gMASK]<sop><|system|>You are a helpful assistant<|user|>Hello<|assistant|></think>Hi there<|user|>How are you?<|assistant|><think>

system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CUDA : ARCHS = 500,610,700,750,800,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

n_ctx: 4096, add_bos: 0
formatted: '[gMASK]<sop><|user|>Write a Python program for a snake game<|assistant|><think>'
tokenize the prompt
prompt: "[gMASK]<sop><|user|>Write a Python program for a snake game<|assistant|><think>"
tokens: [ '[gMASK]':154822, '<sop>':154824, '<|user|>':154827, 'Write':7984, ' a':264, ' Python':13020, ' program':2025, ' for':369, ' a':264, ' snake':25187, ' game':1809, '<|assistant|>':154828, '<think>':154841 ]
recalculate the cached logits (check): embd_inp.size() 13, n_matching_session_tokens 0, embd_inp.size() 13, session_tokens.size() 0
main: interactive mode on.
sampler seed: 3371144253
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 1.000
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000, adaptive_target = -1.000, adaptive_decay = 0.900
sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Not using system message. To change it, set a different value via -sys PROMPT

embd_inp.size(): 13, n_consumed: 0
Write a Python program for a snake game<think>eval: [ '[gMASK]':154822, '<sop>':154824, '<|user|>':154827, 'Write':7984, ' a':264, ' Python':13020, ' program':2025, ' for':369, ' a':264, ' snake':25187, ' game':1809, '<|assistant|>':154828, '<think>':154841 ]
n_past = 13
n_remain: -2
1eval: [ '1':16 ]
n_past = 14
n_remain: -3
.eval: [ '.':13 ]
n_past = 15
n_remain: -4
 eval: [ ' ':220 ]
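
The attached log ends mid-generation, where the high CPU usage is observed. A possible isolation test (not part of the original report; -t / --threads is the standard llama.cpp thread-count flag): since system_info above reports n_threads = 8, re-running with a single CPU thread would show whether the busy CPU time scales with the thread-pool size:

llama-completion.exe -fit off -m E:\models\LLM\GGUF\GLM-4.7-Flash-UD-Q5_K_XL.gguf -c 4096 --jinja -ngl 99 -fa on -t 1 -p "Write a Python program for a snake game"

If usage stays near one core's worth even with -t 1, the time is more likely spent in the main thread's CUDA launch/synchronization path than in ggml CPU compute.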
