$ llama-cli -hf mradermacher/translategemma-27b-it-i1-GGUF:Q4_K_S --jinja --chat-template-kwargs '{"source_lang_code": "ja", "target_lang_code": "en-GB"}' -v --verbose-prompt
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
common_download_file_single_online: no previous model file found /persistent/ai/llama.cpp/mradermacher_translategemma-27b-it-i1-GGUF_preset.ini
common_download_file_single_online: HEAD invalid http status code received: 404
no remote preset found, skipping
common_download_file_single_online: using cached file: /persistent/ai/llama.cpp/mradermacher_translategemma-27b-it-i1-GGUF_translategemma-27b-it.i1-Q4_K_S.gguf
build: 7823 (bb02f74) with GNU 15.2.0 for Linux x86_64
Loading model...
srv load_model: loading model '/persistent/ai/llama.cpp/mradermacher_translategemma-27b-it-i1-GGUF_translategemma-27b-it.i1-Q4_K_S.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: getting device memory data for initial parameters:
llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics (RADV GFX1151)) (0000:c2:00.0) - 50100 MiB free
llama_model_loader: loaded meta data with 50 key-value pairs and 808 tensors from /persistent/ai/llama.cpp/mradermacher_translategemma-27b-it-i1-GGUF_translategemma-27b-it.i1-Q4_K_S.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.sampling.top_k i32 = 64
llama_model_loader: - kv 3: general.sampling.top_p f32 = 0.950000
llama_model_loader: - kv 4: general.name str = Translategemma 27b It
llama_model_loader: - kv 5: general.finetune str = it
llama_model_loader: - kv 6: general.basename str = translategemma
llama_model_loader: - kv 7: general.size_label str = 27B
llama_model_loader: - kv 8: general.license str = gemma
llama_model_loader: - kv 9: general.tags arr[str,1] = ["image-text-to-text"]
llama_model_loader: - kv 10: gemma3.block_count u32 = 62
llama_model_loader: - kv 11: gemma3.context_length u32 = 131072
llama_model_loader: - kv 12: gemma3.embedding_length u32 = 5376
llama_model_loader: - kv 13: gemma3.feed_forward_length u32 = 21504
llama_model_loader: - kv 14: gemma3.attention.head_count u32 = 32
llama_model_loader: - kv 15: gemma3.attention.head_count_kv u32 = 16
llama_model_loader: - kv 16: gemma3.rope.scaling.type str = linear
llama_model_loader: - kv 17: gemma3.rope.scaling.factor f32 = 8.000000
llama_model_loader: - kv 18: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 19: gemma3.attention.key_length u32 = 128
llama_model_loader: - kv 20: gemma3.attention.value_length u32 = 128
llama_model_loader: - kv 21: gemma3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 22: gemma3.attention.sliding_window u32 = 1024
llama_model_loader: - kv 23: tokenizer.ggml.model str = llama
llama_model_loader: - kv 24: tokenizer.ggml.pre str = default
llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,262208] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv 26: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 28: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 30: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 33: tokenizer.ggml.add_sep_token bool = false
llama_model_loader: - kv 34: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 35: tokenizer.chat_template str = {%- set languages = {\n "aa": "Afar...
llama_model_loader: - kv 36: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 37: general.quantization_version u32 = 2
llama_model_loader: - kv 38: general.file_type u32 = 14
llama_model_loader: - kv 39: general.url str = https://huggingface.co/mradermacher/t...
llama_model_loader: - kv 40: mradermacher.quantize_version str = 2
llama_model_loader: - kv 41: mradermacher.quantized_by str = mradermacher
llama_model_loader: - kv 42: mradermacher.quantized_at str = 2026-01-16T10:37:05+01:00
llama_model_loader: - kv 43: mradermacher.quantized_on str = nico1
llama_model_loader: - kv 44: general.source.url str = https://huggingface.co/google/transla...
llama_model_loader: - kv 45: mradermacher.convert_type str = hf
llama_model_loader: - kv 46: quantize.imatrix.file str = translategemma-27b-it-i1-GGUF/transla...
llama_model_loader: - kv 47: quantize.imatrix.dataset str = imatrix-training-full-3
llama_model_loader: - kv 48: quantize.imatrix.entries_count u32 = 434
llama_model_loader: - kv 49: quantize.imatrix.chunks_count u32 = 320
llama_model_loader: - type f32: 373 tensors
llama_model_loader: - type q4_K: 423 tensors
llama_model_loader: - type q5_K: 11 tensors
llama_model_loader: - type q6_K: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Small
print_info: file size = 14.59 GiB (4.64 BPW)
init_tokenizer: initializing tokenizer for type 1
load: 6242 unused tokens
load: control token: 262144 '<image_soft_token>' is not marked as EOG
load: control token: 105 '<start_of_turn>' is not marked as EOG
load: control token: 2 '<bos>' is not marked as EOG
load: control token: 0 '<pad>' is not marked as EOG
load: control token: 3 '<unk>' is not marked as EOG
load: control token: 1 '<eos>' is not marked as EOG
load: control token: 4 '<mask>' is not marked as EOG
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 1 ('<eos>')
load: - 106 ('<end_of_turn>')
load: special tokens cache size = 6415
load: token to piece cache size = 1.9446 MB
print_info: arch = gemma3
print_info: vocab_only = 0
print_info: no_alloc = 1
print_info: n_ctx_train = 131072
print_info: n_embd = 5376
print_info: n_embd_inp = 5376
print_info: n_layer = 62
print_info: n_head = 32
print_info: n_head_kv = 16
print_info: n_rot = 128
print_info: n_swa = 1024
print_info: is_swa_any = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 2
print_info: n_embd_k_gqa = 2048
print_info: n_embd_v_gqa = 2048
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 7.7e-02
print_info: n_ff = 21504
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 0.125
print_info: freq_base_swa = 10000.0
print_info: freq_scale_swa = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_yarn_log_mul = 0.0000
print_info: rope_finetuned = unknown
print_info: model type = 27B
print_info: model params = 27.01 B
print_info: general.name = Translategemma 27b It
print_info: vocab type = SPM
print_info: n_vocab = 262208
print_info: n_merges = 0
print_info: BOS token = 2 '<bos>'
print_info: EOS token = 1 '<eos>'
print_info: EOT token = 106 '<end_of_turn>'
print_info: UNK token = 3 '<unk>'
print_info: PAD token = 0 '<pad>'
print_info: LF token = 248 '<0x0A>'
print_info: EOG token = 1 '<eos>'
print_info: EOG token = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = true)
# ...
load_tensors: layer 62 assigned to device Vulkan0, is_swa = 0
create_tensor: loading tensor token_embd.weight
# ...
create_tensor: loading tensor blk.61.post_ffw_norm.weight
load_tensors: offloading output layer to GPU
load_tensors: offloading 61 repeating layers to GPU
load_tensors: offloaded 63/63 layers to GPU
load_tensors: Vulkan0 model buffer size = 0.00 MiB
load_tensors: Vulkan_Host model buffer size = 0.00 MiB
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 131072
llama_context: n_ctx_seq = 131072
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 0.125
set_abort_callback: call
llama_context: Vulkan_Host output buffer size = 1.00 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 131072 cells
llama_kv_cache: layer 0: filtered
# ...
llama_kv_cache: layer 61: dev = Vulkan0
llama_kv_cache: Vulkan0 KV buffer size = 0.00 MiB
llama_kv_cache: size = 624.00 MiB ( 1536 cells, 52 layers, 1/1 seqs), K (f16): 312.00 MiB, V (f16): 312.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
sched_reserve: reserving ...
sched_reserve: max_nodes = 6472
sched_reserve: reserving full memory module
sched_reserve: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens = 1, n_seqs = 1, n_outputs = 1
sched_reserve: Flash Attention was auto, set to enabled
graph_reserve: reserving a graph for ubatch with n_tokens = 512, n_seqs = 1, n_outputs = 512
graph_reserve: reserving a graph for ubatch with n_tokens = 1, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens = 512, n_seqs = 1, n_outputs = 512
sched_reserve: Vulkan0 compute buffer size = 522.62 MiB
sched_reserve: Vulkan_Host compute buffer size = 280.03 MiB
sched_reserve: graph nodes = 2489
sched_reserve: graph splits = 2
sched_reserve: reserve took 5.39 ms, sched copies = 1
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - Vulkan0 (8060S Graphics (RADV GFX1151)) | 108512 = 50099 + (26328 = 14941 + 10864 + 522) + 32084 |
llama_memory_breakdown_print: | - Host | 1382 = 1102 + 0 + 280 |
llama_params_fit_impl: projected to use 26328 MiB of device memory vs. 50099 MiB of free device memory
llama_params_fit_impl: will leave 23770 >= 1024 MiB of free device memory, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.24 seconds
llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics (RADV GFX1151)) (0000:c2:00.0) - 50100 MiB free
llama_model_loader: direct I/O is enabled, disabling mmap
llama_model_loader: loaded meta data with 50 key-value pairs and 808 tensors from /persistent/ai/llama.cpp/mradermacher_translategemma-27b-it-i1-GGUF_translategemma-27b-it.i1-Q4_K_S.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.sampling.top_k i32 = 64
llama_model_loader: - kv 3: general.sampling.top_p f32 = 0.950000
llama_model_loader: - kv 4: general.name str = Translategemma 27b It
llama_model_loader: - kv 5: general.finetune str = it
llama_model_loader: - kv 6: general.basename str = translategemma
llama_model_loader: - kv 7: general.size_label str = 27B
llama_model_loader: - kv 8: general.license str = gemma
llama_model_loader: - kv 9: general.tags arr[str,1] = ["image-text-to-text"]
llama_model_loader: - kv 10: gemma3.block_count u32 = 62
llama_model_loader: - kv 11: gemma3.context_length u32 = 131072
llama_model_loader: - kv 12: gemma3.embedding_length u32 = 5376
llama_model_loader: - kv 13: gemma3.feed_forward_length u32 = 21504
llama_model_loader: - kv 14: gemma3.attention.head_count u32 = 32
llama_model_loader: - kv 15: gemma3.attention.head_count_kv u32 = 16
llama_model_loader: - kv 16: gemma3.rope.scaling.type str = linear
llama_model_loader: - kv 17: gemma3.rope.scaling.factor f32 = 8.000000
llama_model_loader: - kv 18: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 19: gemma3.attention.key_length u32 = 128
llama_model_loader: - kv 20: gemma3.attention.value_length u32 = 128
llama_model_loader: - kv 21: gemma3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 22: gemma3.attention.sliding_window u32 = 1024
llama_model_loader: - kv 23: tokenizer.ggml.model str = llama
llama_model_loader: - kv 24: tokenizer.ggml.pre str = default
llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,262208] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  26:                      tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 28: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 30: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 33: tokenizer.ggml.add_sep_token bool = false
llama_model_loader: - kv 34: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 35: tokenizer.chat_template str = {%- set languages = {\n "aa": "Afar...
llama_model_loader: - kv 36: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 37: general.quantization_version u32 = 2
llama_model_loader: - kv 38: general.file_type u32 = 14
llama_model_loader: - kv 39: general.url str = https://huggingface.co/mradermacher/t...
llama_model_loader: - kv 40: mradermacher.quantize_version str = 2
llama_model_loader: - kv 41: mradermacher.quantized_by str = mradermacher
llama_model_loader: - kv 42: mradermacher.quantized_at str = 2026-01-16T10:37:05+01:00
llama_model_loader: - kv 43: mradermacher.quantized_on str = nico1
llama_model_loader: - kv 44: general.source.url str = https://huggingface.co/google/transla...
llama_model_loader: - kv 45: mradermacher.convert_type str = hf
llama_model_loader: - kv 46: quantize.imatrix.file str = translategemma-27b-it-i1-GGUF/transla...
llama_model_loader: - kv 47: quantize.imatrix.dataset str = imatrix-training-full-3
llama_model_loader: - kv 48: quantize.imatrix.entries_count u32 = 434
llama_model_loader: - kv 49: quantize.imatrix.chunks_count u32 = 320
llama_model_loader: - type f32: 373 tensors
llama_model_loader: - type q4_K: 423 tensors
llama_model_loader: - type q5_K: 11 tensors
llama_model_loader: - type q6_K: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Small
print_info: file size = 14.59 GiB (4.64 BPW)
init_tokenizer: initializing tokenizer for type 1
load: 6242 unused tokens
load: control token: 262144 '<image_soft_token>' is not marked as EOG
load: control token: 105 '<start_of_turn>' is not marked as EOG
load: control token: 2 '<bos>' is not marked as EOG
load: control token: 0 '<pad>' is not marked as EOG
load: control token: 3 '<unk>' is not marked as EOG
load: control token: 1 '<eos>' is not marked as EOG
load: control token: 4 '<mask>' is not marked as EOG
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 1 ('<eos>')
load: - 106 ('<end_of_turn>')
load: special tokens cache size = 6415
load: token to piece cache size = 1.9446 MB
print_info: arch = gemma3
print_info: vocab_only = 0
print_info: no_alloc = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 5376
print_info: n_embd_inp = 5376
print_info: n_layer = 62
print_info: n_head = 32
print_info: n_head_kv = 16
print_info: n_rot = 128
print_info: n_swa = 1024
print_info: is_swa_any = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 2
print_info: n_embd_k_gqa = 2048
print_info: n_embd_v_gqa = 2048
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 7.7e-02
print_info: n_ff = 21504
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 0.125
print_info: freq_base_swa = 10000.0
print_info: freq_scale_swa = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_yarn_log_mul = 0.0000
print_info: rope_finetuned = unknown
print_info: model type = 27B
print_info: model params = 27.01 B
print_info: general.name = Translategemma 27b It
print_info: vocab type = SPM
print_info: n_vocab = 262208
print_info: n_merges = 0
print_info: BOS token = 2 '<bos>'
print_info: EOS token = 1 '<eos>'
print_info: EOT token = 106 '<end_of_turn>'
print_info: UNK token = 3 '<unk>'
print_info: PAD token = 0 '<pad>'
print_info: LF token = 248 '<0x0A>'
print_info: EOG token = 1 '<eos>'
print_info: EOG token = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = true)
load_tensors: layer 0 assigned to device Vulkan0, is_swa = 1
# ...
create_tensor: loading tensor blk.61.post_ffw_norm.weight
load_tensors: offloading output layer to GPU
load_tensors: offloading 61 repeating layers to GPU
load_tensors: offloaded 63/63 layers to GPU
load_tensors: Vulkan0 model buffer size = 14941.69 MiB
load_tensors: Vulkan_Host model buffer size = 1102.77 MiB
load_all_data: using async uploads for device Vulkan0, buffer type Vulkan0, backend Vulkan0
load_all_data: buffer type Vulkan_Host is not the default buffer type for device Vulkan0 for async uploads
common_init_result: added <eos> logit bias = -inf
common_init_result: added <end_of_turn> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 131072
llama_context: n_ctx_seq = 131072
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 0.125
set_abort_callback: call
llama_context: Vulkan_Host output buffer size = 1.00 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 131072 cells
llama_kv_cache: layer 0: filtered
# ...
llama_kv_cache: layer 61: dev = Vulkan0
llama_kv_cache: Vulkan0 KV buffer size = 624.00 MiB
llama_kv_cache: size = 624.00 MiB ( 1536 cells, 52 layers, 1/1 seqs), K (f16): 312.00 MiB, V (f16): 312.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
sched_reserve: reserving ...
sched_reserve: max_nodes = 6472
sched_reserve: reserving full memory module
sched_reserve: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens = 1, n_seqs = 1, n_outputs = 1
sched_reserve: Flash Attention was auto, set to enabled
graph_reserve: reserving a graph for ubatch with n_tokens = 512, n_seqs = 1, n_outputs = 512
graph_reserve: reserving a graph for ubatch with n_tokens = 1, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens = 512, n_seqs = 1, n_outputs = 512
sched_reserve: Vulkan0 compute buffer size = 522.62 MiB
sched_reserve: Vulkan_Host compute buffer size = 280.03 MiB
sched_reserve: graph nodes = 2489
sched_reserve: graph splits = 2
sched_reserve: reserve took 48.32 ms, sched copies = 1
clear_adapter_lora: call
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
set_warmup: value = 1
set_warmup: value = 0
srv load_model: initializing slots, n_slots = 1
slot load_model: id 0 | task -1 | new slot, n_ctx = 131072
slot reset: id 0 | task -1 |
srv load_model: prompt cache is enabled, size limit: 8192 MiB
srv load_model: use `--cache-ram 0` to disable the prompt cache
srv load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
Merging system prompt into next message
init: chat template, example_format: '<start_of_turn>user
You are a professional Japanese (ja) to English (en-GB) translator. Your goal is to accurately convey the meaning and nuances of the original Japanese text while adhering to English grammar, vocabulary, and cultural sensitivities.
Produce only the English translation, without any additional explanations or commentary. Please translate the following Japanese text into English:
You are a helpful assistant
Hello<end_of_turn>
<start_of_turn>model
Hi there<end_of_turn>
<start_of_turn>user
You are a professional Japanese (ja) to English (en-GB) translator. Your goal is to accurately convey the meaning and nuances of the original Japanese text while adhering to English grammar, vocabulary, and cultural sensitivities.
Produce only the English translation, without any additional explanations or commentary. Please translate the following Japanese text into English:
How are you?<end_of_turn>
<start_of_turn>model
'
srv init: init: chat template, thinking = 0
▄▄ ▄▄
██ ██
██ ██ ▀▀█▄ ███▄███▄ ▀▀█▄ ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██ ██ ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
██ ██
▀▀ ▀▀
build : b7823-bb02f74
model : mradermacher/translategemma-27b-it-i1-GGUF:Q4_K_S
modalities : text
available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read add a text file
> こんにちは、世界!
res add_waiting_: add task 0 to waiting list. current waiting = 0 (before add)
que post: new task, id = 0, front = 0
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 0 | task -1 | launching slot : {"id":0,"n_ctx":131072,"speculative":false,"is_processing":false}
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id 0 | task 0 | processing task, is_child = 0
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 1, front = 0
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 83
slot update_slots: id 0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 19, batch.n_tokens = 19, progress = 0.228916
srv update_slots: decoding batch, n_tokens = 19
clear_adapter_lora: call
set_embeddings: value = 0
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 1
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 2, front = 0
slot update_slots: id 0 | task 0 | n_tokens = 19, memory_seq_rm [19, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 83, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_tokens = 83, batch.n_tokens = 64
slot init_sampler: id 0 | task 0 | init sampler, took 0.01 ms, tokens: text = 83, total = 83
srv update_slots: decoding batch, n_tokens = 64
clear_adapter_lora: call
set_embeddings: value = 0
res send: sending result for task id = 0
res send: task id = 0 pushed to result queue
slot process_toke: id 0 | task 0 | n_decoded = 1, n_remaining = -1, next token: 9259 'Hello'
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 2
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 3, front = 0
slot update_slots: id 0 | task 0 | slot decode token, n_ctx = 131072, n_tokens = 84, truncated = 0
srv update_slots: decoding batch, n_tokens = 1
clear_adapter_lora: call
set_embeddings: value = 0
srv update_chat_: Parsing chat message: Hello
Parsing input with format Generic: Hello
Failed to parse up to error: [json.exception.parse_error.101] parse error at line 1, column 1: attempting to parse an empty input; check that your input string or stream contains the expected JSON: <<<>>>
Partial parse: JSON
res send: sending result for task id = 0
res send: task id = 0 pushed to result queue
slot process_toke: id 0 | task 0 | n_decoded = 2, n_remaining = -1, next token: 236764 ','
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 3
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 4, front = 0
srv update_chat_: Parsing chat message: Hello,
slot update_slots: id 0 | task 0 | slot decode token, n_ctx = 131072, n_tokens = 85, truncated = 0
Parsing input with format Generic: Hello,
srv update_slots: decoding batch, n_tokens = 1
clear_adapter_lora: call
set_embeddings: value = 0
Failed to parse up to error: [json.exception.parse_error.101] parse error at line 1, column 1: attempting to parse an empty input; check that your input string or stream contains the expected JSON: <<<>>>
Partial parse: JSON
res send: sending result for task id = 0
res send: task id = 0 pushed to result queue
slot process_toke: id 0 | task 0 | n_decoded = 3, n_remaining = -1, next token: 1902 ' world'
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 4
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 5, front = 0
srv update_chat_: Parsing chat message: Hello, world
slot update_slots: id 0 | task 0 | slot decode token, n_ctx = 131072, n_tokens = 86, truncated = 0
srv update_slots: decoding batch, n_tokens = 1
Parsing input with format Generic: Hello, world
clear_adapter_lora: call
set_embeddings: value = 0
Failed to parse up to error: [json.exception.parse_error.101] parse error at line 1, column 1: attempting to parse an empty input; check that your input string or stream contains the expected JSON: <<<>>>
Partial parse: JSON
res send: sending result for task id = 0
res send: task id = 0 pushed to result queue
slot process_toke: id 0 | task 0 | n_decoded = 4, n_remaining = -1, next token: 236888 '!'
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 5
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 6, front = 0
srv update_chat_: Parsing chat message: Hello, world!
slot update_slots: id 0 | task 0 | slot decode token, n_ctx = 131072, n_tokens = 87, truncated = 0
srv update_slots: decoding batch, n_tokens = 1
Parsing input with format Generic: Hello, world!
clear_adapter_lora: call
set_embeddings: value = 0
Failed to parse up to error: [json.exception.parse_error.101] parse error at line 1, column 1: attempting to parse an empty input; check that your input string or stream contains the expected JSON: <<<>>>
Partial parse: JSON
res send: sending result for task id = 0
res send: task id = 0 pushed to result queue
slot process_toke: id 0 | task 0 | stopped by EOS
slot process_toke: id 0 | task 0 | n_decoded = 5, n_remaining = -1, next token: 106 ''
slot print_timing: id 0 | task 0 |
prompt eval time = 663.78 ms / 83 tokens ( 8.00 ms per token, 125.04 tokens per second)
eval time = 312.94 ms / 5 tokens ( 62.59 ms per token, 15.98 tokens per second)
total time = 976.72 ms / 88 tokens
srv update_chat_: Parsing chat message: Hello, world!
Parsing input with format Generic: Hello, world!
res send: sending result for task id = 0
res send: task id = 0 pushed to result queue
slot release: id 0 | task 0 | stop processing: n_tokens = 87, truncated = 0
slot reset: id 0 | task 0 |
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 6
que start_loop: update slots
srv update_slots: all slots are idle
que start_loop: waiting for new tasks
Failed to parse up to error: [json.exception.parse_error.101] parse error at line 1, column 1: attempting to parse an empty input; check that your input string or stream contains the expected JSON: <<<>>>
Partial parse: JSON
srv update_chat_: Parsing chat message: Hello, world!
Parsing input with format Generic: Hello, world!
Failed to parse up to error: [json.exception.parse_error.101] parse error at line 1, column 1: attempting to parse an empty input; check that your input string or stream contains the expected JSON: <<<>>>
Partial parse: JSON
Parsed message: {"role":"assistant","content":"Hello, world!"}
res remove_waiti: remove task 0 from waiting list. current waiting = 1 (before remove)
srv stop: all tasks already finished, no need to cancel
[ Prompt: 125.0 t/s | Generation: 16.0 t/s ]
>
Name and Version
llama-cli, build 7823 (bb02f74), built with GNU 15.2.0 for Linux x86_64 (see the build line in the log above)
Operating systems
Linux
GGML backends
Vulkan
Hardware
Framework Desktop - AMD Ryzen AI Max 300 Series & Radeon 8060S Graphics
Models
mradermacher/translategemma-27b-it-i1-GGUF:Q4_K_S
Problem description & steps to reproduce
Translation appears to happen successfully: looking at the logs, こんにちは、世界! is converted to "Hello, world!", but JSON parsing seems to break and results in no output to the user. If this is user error, let me know, but the resulting template does seem accurate, correctly putting "Japanese" and "English" in for the input and output languages respectively.
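For what it's worth, the repeated parse failure in the log is the error nlohmann::json (the JSON library llama.cpp uses) raises for an empty input string, so it looks like the Generic chat-format parser is probing each partial assistant message for JSON and finding nothing, rather than the model emitting malformed output. A minimal sketch reproducing just the library behaviour seen in the log (assuming nlohmann/json is available; this is not llama.cpp's actual parsing path):

#include <iostream>
#include <nlohmann/json.hpp>

int main() {
    try {
        // empty input, matching the "<<<>>>" empty payload shown in the server log
        auto j = nlohmann::json::parse("");
        (void)j;
    } catch (const nlohmann::json::parse_error& e) {
        // prints: [json.exception.parse_error.101] parse error at line 1, column 1:
        // attempting to parse an empty input; check that your input string or stream
        // contains the expected JSON
        std::cerr << e.what() << '\n';
    }
    return 0;
}

Interestingly, the final "Parsed message" line near the end of the log does show the full content ({"role":"assistant","content":"Hello, world!"}), so the text appears to be recovered eventually but lost before it reaches the user.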
I believe my llama.cpp version b7823 includes the recent TranslateGemma-related PRs: #19019 and #19052
I would have tried b7885 to double-check, but I couldn't get it to build ad hoc.
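For reference, I was attempting roughly the standard Vulkan CMake build from the llama.cpp docs (exact flags/toolchain on my machine may differ):

$ git clone https://github.com/ggml-org/llama.cpp
$ cd llama.cpp
$ git checkout b7885
$ cmake -B build -DGGML_VULKAN=ON
$ cmake --build build --config Release -j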
First Bad Commit
No response
Relevant log output
See the full verbose log at the top of this report.