Translategemma: Enhance common/jinja's ability to detect model chat templates #19043

Closed
xiaobing318 wants to merge 10 commits into ggml-org:master from xiaobing318:master-translategemma-chatTemplateProbeBug

Conversation

@xiaobing318
Contributor

Make sure to read the contributing guidelines before submitting a PR

Questions and motivation

llama-server -m "E:\data\llama.cpp-data\gguf-models\translategemma-4b-it.Q8_0.gguf" -mm "E:\data\llama.cpp-data\gguf-models\translategemma-4b-it.mmproj-Q8_0.gguf" --jinja -c 2048 -ngl 20 --media-path "E:\data\llama.cpp-data\translategemma-data" --chat-template-kwargs "{\"source_lang_code\":\"zh\",\"target_lang_code\":\"en\"}"

When running llama-server with this command, the logs (shown below) indicate that an error occurred while parsing TranslateGemma’s built-in chat template, so it fell back to the default chatml chat template. After tracing the issue, I found that capability probing in common/jinja for the TranslateGemma chat template produced a false negative, so I enhanced common/jinja’s detection logic. In addition, I added initialization for the TranslateGemma chat template.
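The enhanced detection is essentially a substring probe of the chat template source. A minimal, self-contained sketch of that idea follows; the helper name is hypothetical, and the actual change (quoted later in the review thread) lives in common/chat.cpp.

// Illustrative sketch only: detect the TranslateGemma template by probing
// its source text for markers distinctive to it. The helper name is
// hypothetical; it mirrors the heuristic quoted in the review thread below.
#include <string>

static bool is_translategemma_template(const std::string & src) {
    return src.find("source_lang_code")       != std::string::npos &&
           src.find("target_lang_code")       != std::string::npos &&
           src.find("You are a professional") != std::string::npos;
}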

main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build: 7815 (091a46cb8) with Clang 19.1.5 for Windows x86_64
system info: n_threads = 6, n_threads_batch = 6, total_threads = 12

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CUDA : ARCHS = 500,610,700,750,800,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

Running without SSL
init: using 11 threads for HTTP server
start: binding port with default address family
main: loading model
srv    load_model: loading model 'E:\data\llama.cpp-data\gguf-models\translategemma-4b-it.Q8_0.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: projected to use 3166 MiB of device memory vs. 3292 MiB of free device memory
llama_params_fit_impl: cannot meet free memory target of 1024 MiB, need to reduce device memory by 898 MiB
llama_params_fit_impl: context size set by user to 2048 -> no change
llama_params_fit: failed to fit params to free device memory: n_gpu_layers already set by user to 20, abort
llama_params_fit: fitting params to free memory took 0.56 seconds
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce GTX 1650) (0000:01:00.0) - 3292 MiB free
llama_model_loader: direct I/O is enabled, disabling mmap
llama_model_loader: loaded meta data with 46 key-value pairs and 444 tensors from E:\data\llama.cpp-data\gguf-models\translategemma-4b-it.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                     general.sampling.top_k i32              = 64
llama_model_loader: - kv   3:                     general.sampling.top_p f32              = 0.950000
llama_model_loader: - kv   4:                               general.name str              = Translategemma 4b It
llama_model_loader: - kv   5:                           general.finetune str              = it
llama_model_loader: - kv   6:                           general.basename str              = translategemma
llama_model_loader: - kv   7:                         general.size_label str              = 4B
llama_model_loader: - kv   8:                            general.license str              = gemma
llama_model_loader: - kv   9:                               general.tags arr[str,1]       = ["image-text-to-text"]
llama_model_loader: - kv  10:                         gemma3.block_count u32              = 34
llama_model_loader: - kv  11:                      gemma3.context_length u32              = 131072
llama_model_loader: - kv  12:                    gemma3.embedding_length u32              = 2560
llama_model_loader: - kv  13:                 gemma3.feed_forward_length u32              = 10240
llama_model_loader: - kv  14:                gemma3.attention.head_count u32              = 8
llama_model_loader: - kv  15:             gemma3.attention.head_count_kv u32              = 4
llama_model_loader: - kv  16:                   gemma3.rope.scaling.type str              = linear
llama_model_loader: - kv  17:                 gemma3.rope.scaling.factor f32              = 8.000000
llama_model_loader: - kv  18:    gemma3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  19:                gemma3.attention.key_length u32              = 256
llama_model_loader: - kv  20:              gemma3.attention.value_length u32              = 256
llama_model_loader: - kv  21:                      gemma3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  22:            gemma3.attention.sliding_window u32              = 1024
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,262208]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  26:                      tokenizer.ggml.scores arr[f32,262208]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,262208]  = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  30:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  31:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  33:               tokenizer.ggml.add_sep_token bool             = false
llama_model_loader: - kv  34:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  35:                    tokenizer.chat_template str              = {%- set languages = {\n    "aa": "Afar...
llama_model_loader: - kv  36:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  37:               general.quantization_version u32              = 2
llama_model_loader: - kv  38:                          general.file_type u32              = 7
llama_model_loader: - kv  39:                                general.url str              = https://huggingface.co/mradermacher/t...
llama_model_loader: - kv  40:              mradermacher.quantize_version str              = 2
llama_model_loader: - kv  41:                  mradermacher.quantized_by str              = mradermacher
llama_model_loader: - kv  42:                  mradermacher.quantized_at str              = 2026-01-16T07:59:57+01:00
llama_model_loader: - kv  43:                  mradermacher.quantized_on str              = nico1
llama_model_loader: - kv  44:                         general.source.url str              = https://huggingface.co/google/transla...
llama_model_loader: - kv  45:                  mradermacher.convert_type str              = hf
llama_model_loader: - type  f32:  205 tensors
llama_model_loader: - type q8_0:  239 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 3.84 GiB (8.50 BPW) 
load: 6242 unused tokens
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 1 ('<eos>')
load:   - 106 ('<end_of_turn>')
load: special tokens cache size = 6415
load: token to piece cache size = 1.9446 MB
print_info: arch                  = gemma3
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 131072
print_info: n_embd                = 2560
print_info: n_embd_inp            = 2560
print_info: n_layer               = 34
print_info: n_head                = 8
print_info: n_head_kv             = 4
print_info: n_rot                 = 256
print_info: n_swa                 = 1024
print_info: is_swa_any            = 1
print_info: n_embd_head_k         = 256
print_info: n_embd_head_v         = 256
print_info: n_gqa                 = 2
print_info: n_embd_k_gqa          = 1024
print_info: n_embd_v_gqa          = 1024
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-06
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 6.2e-02
print_info: n_ff                  = 10240
print_info: n_expert              = 0
print_info: n_expert_used         = 0
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = 0
print_info: rope type             = 2
print_info: rope scaling          = linear
print_info: freq_base_train       = 1000000.0
print_info: freq_scale_train      = 0.125
print_info: freq_base_swa         = 10000.0
print_info: freq_scale_swa        = 1
print_info: n_ctx_orig_yarn       = 131072
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: model type            = 4B
print_info: model params          = 3.88 B
print_info: general.name          = Translategemma 4b It
print_info: vocab type            = SPM
print_info: n_vocab               = 262208
print_info: n_merges              = 0
print_info: BOS token             = 2 '<bos>'
print_info: EOS token             = 1 '<eos>'
print_info: EOT token             = 106 '<end_of_turn>'
print_info: UNK token             = 3 '<unk>'
print_info: PAD token             = 0 '<pad>'
print_info: LF token              = 248 '<0x0A>'
print_info: EOG token             = 1 '<eos>'
print_info: EOG token             = 106 '<end_of_turn>'
print_info: max token length      = 48
load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = true)
load_tensors: offloading output layer to GPU
load_tensors: offloading 19 repeating layers to GPU
load_tensors: offloaded 20/35 layers to GPU
load_tensors:        CUDA0 model buffer size =  2497.83 MiB
load_tensors:    CUDA_Host model buffer size =  2115.16 MiB
..........................................................................
common_init_result: added <eos> logit bias = -inf
common_init_result: added <end_of_turn> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max     = 4
llama_context: n_ctx         = 2048
llama_context: n_ctx_seq     = 2048
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = true
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 0.125
llama_context: n_ctx_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     4.00 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 2048 cells
llama_kv_cache:        CPU KV buffer size =    16.00 MiB
llama_kv_cache:      CUDA0 KV buffer size =    24.00 MiB
llama_kv_cache: size =   40.00 MiB (  2048 cells,   5 layers,  4/1 seqs), K (f16):   20.00 MiB, V (f16):   20.00 MiB
llama_kv_cache_iswa: creating     SWA KV cache, size = 2048 cells
llama_kv_cache:        CPU KV buffer size =   104.00 MiB
llama_kv_cache:      CUDA0 KV buffer size =   128.00 MiB
llama_kv_cache: size =  232.00 MiB (  2048 cells,  29 layers,  4/1 seqs), K (f16):  116.00 MiB, V (f16):  116.00 MiB
sched_reserve: reserving ...
sched_reserve: Flash Attention was auto, set to enabled
sched_reserve:      CUDA0 compute buffer size =   517.12 MiB
sched_reserve:  CUDA_Host compute buffer size =    13.02 MiB
sched_reserve: graph nodes  = 1369
sched_reserve: graph splits = 227 (with bs=512), 32 (with bs=1)
sched_reserve: reserve took 13.74 ms, sched copies = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
clip_model_loader: model name:   Translategemma 4b It
clip_model_loader: description:  
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    439
clip_model_loader: n_kv:         25

clip_model_loader: has vision encoder
clip_ctx: CLIP using CUDA0 backend
load_hparams: projector:          gemma3
load_hparams: n_embd:             1152
load_hparams: n_head:             16
load_hparams: n_ff:               4304
load_hparams: n_layer:            27
load_hparams: ffn_op:             gelu
load_hparams: projection_dim:     2560

--- vision hparams ---
load_hparams: image_size:         896
load_hparams: patch_size:         14
load_hparams: has_llava_proj:     0
load_hparams: minicpmv_version:   0
load_hparams: n_merge:            4
load_hparams: n_wa_pattern: 0

load_hparams: model size:         563.96 MiB
load_hparams: metadata size:      0.15 MiB
warmup: warmup with image size = 896 x 896
alloc_compute_meta:      CUDA0 compute buffer size =   121.25 MiB
alloc_compute_meta:        CPU compute buffer size =     9.19 MiB
alloc_compute_meta: graph splits = 1, nodes = 863
warmup: flash attention is enabled
srv    load_model: loaded multimodal model, 'E:\data\llama.cpp-data\gguf-models\translategemma-4b-it.mmproj-Q8_0.gguf'
srv    load_model: initializing slots, n_slots = 4
slot   load_model: id  0 | task -1 | new slot, n_ctx = 2048
slot   load_model: id  1 | task -1 | new slot, n_ctx = 2048
slot   load_model: id  2 | task -1 | new slot, n_ctx = 2048
slot   load_model: id  3 | task -1 | new slot, n_ctx = 2048
srv    load_model: prompt cache is enabled, size limit: 8192 MiB
srv    load_model: use `--cache-ram 0` to disable the prompt cache
srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
srv          init: init: chat template parsing error: 
------------
While executing CallExpression at line 601, column 31 in source:
... != 1 -%}↵            {{ raise_exception(↵                "User role must provid...
                                           ^
Error: Jinja Exception: User role must provide `content` as an iterable with exactly one item. That item must be a `mapping(type:'text' | 'image', source_lang_code:string, target_lang_code:string, text:string | none, image:string | none)`.
srv          init: init: please consider disabling jinja via --no-jinja, or use a custom chat template via --chat-template
srv          init: init: for example: --no-jinja --chat-template chatml
srv    operator(): operator(): cleaning up before exit...
main: exiting due to model loading error

Test examples

Note

For flexibility (rather than targeting only this specific TranslateGemma model), llama.cpp does not accept TranslateGemma's officially recommended request format; the request body has to be adjusted to the format llama.cpp expects. The differences are shown below.

Recommended request format

{
  "model": "translategemma-4b-it",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image",
          "source_lang_code": "en",
          "target_lang_code": "zh",
          "url": "https://c7.alamy.com/comp/2YAX36N/traffic-signs-in-czech-republic-pedestrian-zone-2YAX36N.jpg"
        }
      ]
    }
  ],
  "temperature": 0,
  "max_tokens": 2048
}

Request format compatible with llama.cpp

{
  "model": "translategemma-4b-it",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://c7.alamy.com/comp/2YAX36N/traffic-signs-in-czech-republic-pedestrian-zone-2YAX36N.jpg",
            "detail": "auto"
          }
        }
      ]
    }
  ],
  "chat_template_kwargs": {
    "source_lang_code": "en",
    "target_lang_code": "zh"
  },
  "temperature": 0,
  "max_tokens": 2048
}

Text data

1. Single message

{
  "model": "translategemma-4b-it",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "welcome, translategemma model."
        }
      ]    
    }
  ],
  "chat_template_kwargs": {
    "source_lang_code": "en",
    "target_lang_code": "ar-QA"
  },
  "temperature": 0,
  "max_tokens": 2048
}

2. Multiple messages

{
  "model": "translategemma-4b-it",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "welcome, translategemma model."
        }
      ]    
    },
    {
      "role": "assistant",
      "content": "أهلاً وسهلاً، نموذج ترجمة جيما."
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Parameter support can differ depending on the model used to generate the response, particularly for newer reasoning models. Parameters that are only supported for reasoning models are noted below. "
        }
      ]    
    }
  ],
  "chat_template_kwargs": {
    "source_lang_code": "en",
    "target_lang_code": "ar-QA"
  },
  "temperature": 0,
  "max_tokens": 2048
}

Internet image data

1. Single message

{
  "model": "translategemma-4b-it",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://c7.alamy.com/comp/2YAX36N/traffic-signs-in-czech-republic-pedestrian-zone-2YAX36N.jpg",
            "detail": "auto"
          }
        }
      ]
    }
  ],
  "chat_template_kwargs": {
    "source_lang_code": "cs",
    "target_lang_code": "zh"
  },
  "temperature": 0,
  "max_tokens": 2048
}

Local image data

1. Single message

{
  "model": "translategemma-4b-it",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "file://cs.png",
            "detail": "auto"
          }
        }
      ]
    }
  ],
  "chat_template_kwargs": { 
    "source_lang_code": "cs",
    "target_lang_code": "zh"
  },
  "temperature": 0,
  "max_tokens": 2048
}

2. Multiple messages

{
  "model": "translategemma-4b-it",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "file://cs.png",
            "detail": "auto"
          }
        }
      ]
    },
    {
      "role": "assistant",
      "content": "Pedestrian zone"
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Toto je čínská věta."
        }
      ]    
    }

  ],
  "chat_template_kwargs": {
    "source_lang_code": "cs",
    "target_lang_code": "zh"
  },
  "temperature": 0,
  "max_tokens": 2048
}

Comment thread on common/jinja/caps.cpp
Contributor

Adding a cap for a single-use-case chat template is a bad idea. We already know that the chat template with this feature is translategemma, so why do we need a cap?

Caps are only for detecting features that are shared by multiple templates.

Comment thread on common/chat.cpp

    // TranslateGemma format detection
    if (src.find("source_lang_code") != std::string::npos &&
        src.find("target_lang_code") != std::string::npos &&
        src.find("You are a professional") != std::string::npos) {
Contributor

Why add this?

@ngxson
Contributor

ngxson commented Jan 23, 2026

I think you haven't yet looked into this PR: #19019

I prefer the implementation in the mentioned PR as it is simpler. I will close this one to avoid duplicate effort.

@ngxson ngxson closed this Jan 23, 2026
@xiaobing318
Contributor Author

I think you haven't yet looked into this PR: #19019

I prefer the implementation in the mentioned PR as it is simpler. I will close this one to avoid duplicate effort.

You are correct, I did not notice #19019. Thank you for reviewing.

@xiaobing318
Contributor Author

I think you haven't yet looked into this PR: #19019

I prefer the implementation in the mentioned PR as it is simpler. I will close this one to avoid duplicate effort.

@ngxson
I compiled and ran the #19019 branch locally, using the same commands and test cases as in #19043, but it did not produce the expected results. Multiple test runs show that the target language code is always en-GB. Am I missing some crucial information?

@ngxson
Contributor

ngxson commented Jan 23, 2026

The language is included per message, not as a global-level kwarg.

@xiaobing318
Contributor Author

The language is included per message, not as a global-level kwarg.

@ngxson

The following is the test case I used and the result I obtained.
Request Body

{
  "model": "translategemma-4b-it",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "source_lang_code": "en",
          "target_lang_code": "cs",
          "text": "this is a test"
        }
      ]
    }
  ],
  "temperature": 0,
  "max_tokens": 2048
}

Response Body

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "This is a test."
      }
    }
  ],
  "created": 1769170032,
  "model": "translategemma-4b-it.Q8_0.gguf",
  "system_fingerprint": "b7827-de4e19dde",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 6,
    "prompt_tokens": 83,
    "total_tokens": 89
  },
  "id": "chatcmpl-B5OPiRcZpSzkVg8rdlmmbhdXHRu6ES6M",
  "timings": {
    "cache_n": 0,
    "prompt_n": 83,
    "prompt_ms": 96.262,
    "prompt_per_token_ms": 1.1597831325301204,
    "prompt_per_second": 862.2301635120815,
    "predicted_n": 6,
    "predicted_ms": 185.518,
    "predicted_per_token_ms": 30.919666666666668,
    "predicted_per_second": 32.34187518192305
  }
}

@ngxson
Contributor

ngxson commented Jan 23, 2026

Hmm, ok, it seems the language fields are discarded when converting from/to the internal representation of common_chat_msg.

I think the simplest way is to allow them to be passed via a global kwarg; it should be ~10 lines of code. I will push a fix.
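A minimal sketch of what such a global-kwarg fallback could look like, purely as an illustration and not the actual change in #19052; it assumes the request is parsed with nlohmann::json and that the template reads source_lang_code / target_lang_code from each content part.

// Hypothetical illustration only, not the actual llama.cpp patch.
// Propagate global chat_template_kwargs into content parts that do not
// already carry their own language codes.
#include <nlohmann/json.hpp>

using json = nlohmann::json;

static void apply_global_lang_kwargs(json & messages, const json & kwargs) {
    if (!kwargs.contains("source_lang_code") || !kwargs.contains("target_lang_code")) {
        return; // nothing to propagate
    }
    for (auto & msg : messages) {
        if (!msg.contains("content") || !msg["content"].is_array()) {
            continue; // plain string content is left untouched
        }
        for (auto & part : msg["content"]) {
            if (!part.contains("source_lang_code")) {
                part["source_lang_code"] = kwargs.at("source_lang_code");
            }
            if (!part.contains("target_lang_code")) {
                part["target_lang_code"] = kwargs.at("target_lang_code");
            }
        }
    }
}

With something along these lines, request bodies that carry chat_template_kwargs at the top level (like the ones earlier in this PR) would end up with the per-part language fields the template expects.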

@xiaobing318
Contributor Author

Hmm, ok, it seems the language fields are discarded when converting from/to the internal representation of common_chat_msg.

I think the simplest way is to allow them to be passed via a global kwarg; it should be ~10 lines of code. I will push a fix.

Thank you very much. It would be great if you could also provide examples of using images and text (via the request body).

@ngxson
Contributor

ngxson commented Jan 23, 2026

PTAL: #19052

Labels

jinja parser: Issues related to the jinja parser
