Bug: embeddings endpoint broken #7842

@skoulik

Description

What happened?

A warning not seen previously has recently begun to appear on each embedding request:

WARN [              json_value] Wrong type supplied for parameter 'prompt'. Expected 'string', using default value. | tid="40148" timestamp=1717981910 prompt=["Test"]
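
For reference, a hypothetical minimal request that should reproduce the warning, assuming the text is posted as a JSON array (the log's prompt=["Test"] suggests an array reaches json_value). The field name "content" is an assumption based on the server's embedding endpoint docs and may differ for your client:

curl http://127.0.0.1:8081/embeddings -H "Content-Type: application/json" -d "{\"content\": [\"Test\"]}"

Presumably, posting the same text as a plain string ("content": "Test") would match the expected 'string' type and avoid the warning.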

Name and Version

Broken: 10ceba3
Still OK: b90dc56

I'll try to pinpoint the exact version.
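
A minimal bisect sketch between the two builds, assuming a local llama.cpp checkout (rebuild the server and re-test /embeddings at each step):

git bisect start
git bisect bad  10ceba354a3b152ff425e9fa97f9caaef99a46b1
git bisect good b90dc566c1c615289b05b50d61680f23744a21e7
REM rebuild, send an embeddings request, then mark the commit:
REM   git bisect good  (no warning)   or   git bisect bad  (warning appears)
git bisect reset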

What operating system are you seeing the problem on?

No response

Relevant log output

Log BAD (10ceba354a3b152ff425e9fa97f9caaef99a46b1):
(llm) D:\build\llama.cpp>bin\release\server --n-gpu-layers 13 --model D:\code\test_llm\models\embedding\nomic-embed-text-v1.5.f16.gguf --ctx-size 2048 --batch-size 2048 --ubatch-size 2048 --port 8081 --embeddings
INFO [                    main] build info | tid="40148" timestamp=1717981849 build=3119 commit="10ceba35"
INFO [                    main] system info | tid="40148" timestamp=1717981849 n_threads=16 n_threads_batch=-1 total_threads=32 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llama_model_loader: loaded meta data with 22 key-value pairs and 112 tensors from D:\code\test_llm\models\embedding\nomic-embed-text-v1.5.f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = nomic-bert
llama_model_loader: - kv   1:                               general.name str              = nomic-embed-text-v1.5
llama_model_loader: - kv   2:                     nomic-bert.block_count u32              = 12
llama_model_loader: - kv   3:                  nomic-bert.context_length u32              = 2048
llama_model_loader: - kv   4:                nomic-bert.embedding_length u32              = 768
llama_model_loader: - kv   5:             nomic-bert.feed_forward_length u32              = 3072
llama_model_loader: - kv   6:            nomic-bert.attention.head_count u32              = 12
llama_model_loader: - kv   7:    nomic-bert.attention.layer_norm_epsilon f32              = 0.000000
llama_model_loader: - kv   8:                          general.file_type u32              = 1
llama_model_loader: - kv   9:                nomic-bert.attention.causal bool             = false
llama_model_loader: - kv  10:                    nomic-bert.pooling_type u32              = 1
llama_model_loader: - kv  11:                  nomic-bert.rope.freq_base f32              = 1000.000000
llama_model_loader: - kv  12:            tokenizer.ggml.token_type_count u32              = 2
llama_model_loader: - kv  13:                tokenizer.ggml.bos_token_id u32              = 101
llama_model_loader: - kv  14:                tokenizer.ggml.eos_token_id u32              = 102
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = bert
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,30522]   = ["[PAD]", "[unused0]", "[unused1]", "...
llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,30522]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,30522]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 100
llama_model_loader: - kv  20:          tokenizer.ggml.seperator_token_id u32              = 102
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - type  f32:   51 tensors
llama_model_loader: - type  f16:   61 tensors
llm_load_vocab: special tokens cache size = 5
llm_load_vocab: token to piece cache size = 0.2032 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = nomic-bert
llm_load_print_meta: vocab type       = WPM
llm_load_print_meta: n_vocab          = 30522
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 768
llm_load_print_meta: n_head           = 12
llm_load_print_meta: n_head_kv        = 12
llm_load_print_meta: n_layer          = 12
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 768
llm_load_print_meta: n_embd_v_gqa     = 768
llm_load_print_meta: f_norm_eps       = 1.0e-12
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 3072
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 0
llm_load_print_meta: pooling type     = 1
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 137M
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 136.73 M
llm_load_print_meta: model size       = 260.86 MiB (16.00 BPW)
llm_load_print_meta: general.name     = nomic-embed-text-v1.5
llm_load_print_meta: BOS token        = 101 '[CLS]'
llm_load_print_meta: EOS token        = 102 '[SEP]'
llm_load_print_meta: UNK token        = 100 '[UNK]'
llm_load_print_meta: SEP token        = 102 '[SEP]'
llm_load_print_meta: PAD token        = 0 '[PAD]'
llm_load_print_meta: CLS token        = 101 '[CLS]'
llm_load_print_meta: MASK token       = 103 '[MASK]'
llm_load_print_meta: LF token         = 0 '[PAD]'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors: offloading 12 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 13/13 layers to GPU
llm_load_tensors:        CPU buffer size =    44.72 MiB
llm_load_tensors:      CUDA0 buffer size =   216.15 MiB
.......................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 2048
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =    72.00 MiB
llama_new_context_with_model: KV self size  =   72.00 MiB, K (f16):   36.00 MiB, V (f16):   36.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.00 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   260.01 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    38.02 MiB
llama_new_context_with_model: graph nodes  = 453
llama_new_context_with_model: graph splits = 2
INFO [                    init] initializing slots | tid="40148" timestamp=1717981850 n_slots=1
INFO [                    init] new slot | tid="40148" timestamp=1717981850 id_slot=0 n_ctx_slot=2048
INFO [                    main] model loaded | tid="40148" timestamp=1717981850
INFO [                    main] chat template | tid="40148" timestamp=1717981850 chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n" built_in=true
INFO [                    main] HTTP server listening | tid="40148" timestamp=1717981850 hostname="127.0.0.1" port="8081" n_threads_http="31"
INFO [            update_slots] all slots are idle | tid="40148" timestamp=1717981850
WARN [              json_value] Wrong type supplied for parameter 'prompt'. Expected 'string', using default value. | tid="40148" timestamp=1717981910 prompt=["Test"]
INFO [   launch_slot_with_task] slot is processing task | tid="40148" timestamp=1717981910 id_slot=0 id_task=0
INFO [            update_slots] kv cache rm [p0, end) | tid="40148" timestamp=1717981910 id_slot=0 id_task=0 p0=0
INFO [            update_slots] slot released | tid="40148" timestamp=1717981910 id_slot=0 id_task=0 n_ctx=2048 n_past=3 n_system_tokens=0 n_cache_tokens=0 truncated=false
INFO [            update_slots] all slots are idle | tid="40148" timestamp=1717981910
INFO [      log_server_request] request | tid="39808" timestamp=1717981910 remote_addr="127.0.0.1" remote_port=50789 status=200 method="POST" path="/embeddings" params={}
WARN [              json_value] Wrong type supplied for parameter 'prompt'. Expected 'string', using default value. | tid="40148" timestamp=1717981910 prompt=["Test"]
INFO [   launch_slot_with_task] slot is processing task | tid="40148" timestamp=1717981910 id_slot=0 id_task=2
INFO [            update_slots] kv cache rm [p0, end) | tid="40148" timestamp=1717981910 id_slot=0 id_task=2 p0=0
INFO [            update_slots] slot released | tid="40148" timestamp=1717981910 id_slot=0 id_task=2 n_ctx=2048 n_past=3 n_system_tokens=0 n_cache_tokens=0 truncated=false
INFO [            update_slots] all slots are idle | tid="40148" timestamp=1717981910
INFO [      log_server_request] request | tid="39808" timestamp=1717981910 remote_addr="127.0.0.1" remote_port=50789 status=200 method="POST" path="/embeddings" params={}
WARN [              json_value] Wrong type supplied for parameter 'prompt'. Expected 'string', using default value. | tid="40148" timestamp=1717981910 prompt=["Test"]
INFO [   launch_slot_with_task] slot is processing task | tid="40148" timestamp=1717981910 id_slot=0 id_task=4
INFO [            update_slots] kv cache rm [p0, end) | tid="40148" timestamp=1717981910 id_slot=0 id_task=4 p0=0
WARN [              json_value] Wrong type supplied for parameter 'prompt'. Expected 'string', using default value. | tid="40148" timestamp=1717981910 prompt=["Test"]
WARN [              json_value] Wrong type supplied for parameter 'prompt'. Expected 'string', using default value. | tid="40148" timestamp=1717981910 prompt=["Test"]
INFO [      log_server_request] request | tid="38504" timestamp=1717981910 remote_addr="127.0.0.1" remote_port=50799 status=200 method="POST" path="/embeddings" params={}
WARN [              json_value] Wrong type supplied for parameter 'prompt'. Expected 'string', using default value. | tid="40148" timestamp=1717981910 prompt=["Test"]
INFO [            update_slots] slot released | tid="40148" timestamp=1717981910 id_slot=0 id_task=4 n_ctx=2048 n_past=3 n_system_tokens=0 n_cache_tokens=0 truncated=false
INFO [            update_slots] all slots are idle | tid="40148" timestamp=1717981910
WARN [              json_value] Wrong type supplied for parameter 'prompt'. Expected 'string', using default value. | tid="40148" timestamp=1717981910 prompt=["Test"]
INFO [   launch_slot_with_task] slot is processing task | tid="40148" timestamp=1717981910 id_slot=0 id_task=6

===
Log GOOD (b90dc566c1c615289b05b50d61680f23744a21e7):
(llm) D:\build\llama.cpp>d:\opt\llama.cpp\bin\server --n-gpu-layers 13 --model D:\code\test_llm\models\embedding\nomic-embed-text-v1.5.f16.gguf --ctx-size 2048 --batch-size 2048 --ubatch-size 2048 --port 8081 --embeddings
INFO [                    main] build info | tid="33428" timestamp=1717982359 build=3088 commit="b90dc566"
INFO [                    main] system info | tid="33428" timestamp=1717982359 n_threads=16 n_threads_batch=-1 total_threads=32 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llama_model_loader: loaded meta data with 22 key-value pairs and 112 tensors from D:\code\test_llm\models\embedding\nomic-embed-text-v1.5.f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = nomic-bert
llama_model_loader: - kv   1:                               general.name str              = nomic-embed-text-v1.5
llama_model_loader: - kv   2:                     nomic-bert.block_count u32              = 12
llama_model_loader: - kv   3:                  nomic-bert.context_length u32              = 2048
llama_model_loader: - kv   4:                nomic-bert.embedding_length u32              = 768
llama_model_loader: - kv   5:             nomic-bert.feed_forward_length u32              = 3072
llama_model_loader: - kv   6:            nomic-bert.attention.head_count u32              = 12
llama_model_loader: - kv   7:    nomic-bert.attention.layer_norm_epsilon f32              = 0.000000
llama_model_loader: - kv   8:                          general.file_type u32              = 1
llama_model_loader: - kv   9:                nomic-bert.attention.causal bool             = false
llama_model_loader: - kv  10:                    nomic-bert.pooling_type u32              = 1
llama_model_loader: - kv  11:                  nomic-bert.rope.freq_base f32              = 1000.000000
llama_model_loader: - kv  12:            tokenizer.ggml.token_type_count u32              = 2
llama_model_loader: - kv  13:                tokenizer.ggml.bos_token_id u32              = 101
llama_model_loader: - kv  14:                tokenizer.ggml.eos_token_id u32              = 102
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = bert
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,30522]   = ["[PAD]", "[unused0]", "[unused1]", "...
llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,30522]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,30522]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 100
llama_model_loader: - kv  20:          tokenizer.ggml.seperator_token_id u32              = 102
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - type  f32:   51 tensors
llama_model_loader: - type  f16:   61 tensors
llm_load_vocab: special tokens cache size = 5
llm_load_vocab: token to piece cache size = 0.2032 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = nomic-bert
llm_load_print_meta: vocab type       = WPM
llm_load_print_meta: n_vocab          = 30522
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 768
llm_load_print_meta: n_head           = 12
llm_load_print_meta: n_head_kv        = 12
llm_load_print_meta: n_layer          = 12
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 768
llm_load_print_meta: n_embd_v_gqa     = 768
llm_load_print_meta: f_norm_eps       = 1.0e-12
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 3072
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 0
llm_load_print_meta: pooling type     = 1
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 137M
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 136.73 M
llm_load_print_meta: model size       = 260.86 MiB (16.00 BPW)
llm_load_print_meta: general.name     = nomic-embed-text-v1.5
llm_load_print_meta: BOS token        = 101 '[CLS]'
llm_load_print_meta: EOS token        = 102 '[SEP]'
llm_load_print_meta: UNK token        = 100 '[UNK]'
llm_load_print_meta: SEP token        = 102 '[SEP]'
llm_load_print_meta: PAD token        = 0 '[PAD]'
llm_load_print_meta: CLS token        = 101 '[CLS]'
llm_load_print_meta: MASK token       = 103 '[MASK]'
llm_load_print_meta: LF token         = 0 '[PAD]'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors: offloading 12 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 13/13 layers to GPU
llm_load_tensors:        CPU buffer size =    44.72 MiB
llm_load_tensors:      CUDA0 buffer size =   216.15 MiB
.......................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 2048
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =    72.00 MiB
llama_new_context_with_model: KV self size  =   72.00 MiB, K (f16):   36.00 MiB, V (f16):   36.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.00 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   260.01 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    38.02 MiB
llama_new_context_with_model: graph nodes  = 453
llama_new_context_with_model: graph splits = 2
INFO [                    init] initializing slots | tid="33428" timestamp=1717982360 n_slots=1
INFO [                    init] new slot | tid="33428" timestamp=1717982360 id_slot=0 n_ctx_slot=2048
INFO [                    main] model loaded | tid="33428" timestamp=1717982360
INFO [                    main] chat template | tid="33428" timestamp=1717982360 chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n" built_in=true
INFO [                    main] HTTP server listening | tid="33428" timestamp=1717982360 hostname="127.0.0.1" port="8081" n_threads_http="31"
INFO [            update_slots] all slots are idle | tid="33428" timestamp=1717982360
INFO [   launch_slot_with_task] slot is processing task | tid="33428" timestamp=1717982369 id_slot=0 id_task=0
INFO [            update_slots] kv cache rm [p0, end) | tid="33428" timestamp=1717982369 id_slot=0 id_task=0 p0=0
INFO [            update_slots] slot released | tid="33428" timestamp=1717982369 id_slot=0 id_task=0 n_ctx=2048 n_past=3 n_system_tokens=0 n_cache_tokens=0 truncated=false
INFO [            update_slots] all slots are idle | tid="33428" timestamp=1717982369
INFO [      log_server_request] request | tid="36500" timestamp=1717982369 remote_addr="127.0.0.1" remote_port=50918 status=200 method="POST" path="/embeddings" params={}
INFO [   launch_slot_with_task] slot is processing task | tid="33428" timestamp=1717982369 id_slot=0 id_task=2
INFO [            update_slots] kv cache rm [p0, end) | tid="33428" timestamp=1717982369 id_slot=0 id_task=2 p0=0
INFO [            update_slots] slot released | tid="33428" timestamp=1717982370 id_slot=0 id_task=2 n_ctx=2048 n_past=3 n_system_tokens=0 n_cache_tokens=0 truncated=false
INFO [            update_slots] all slots are idle | tid="33428" timestamp=1717982370
INFO [   launch_slot_with_task] slot is processing task | tid="33428" timestamp=1717982370 id_slot=0 id_task=4
INFO [            update_slots] kv cache rm [p0, end) | tid="33428" timestamp=1717982370 id_slot=0 id_task=4 p0=0
INFO [      log_server_request] request | tid="17636" timestamp=1717982370 remote_addr="127.0.0.1" remote_port=50919 status=200 method="POST" path="/embeddings" params={}
INFO [            update_slots] slot released | tid="33428" timestamp=1717982370 id_slot=0 id_task=4 n_ctx=2048 n_past=3 n_system_tokens=0 n_cache_tokens=0 truncated=false
INFO [            update_slots] all slots are idle | tid="33428" timestamp=1717982370
INFO [      log_server_request] request | tid="18084" timestamp=1717982370 remote_addr="127.0.0.1" remote_port=50920 status=200 method="POST" path="/embeddings" params={}
INFO [   launch_slot_with_task] slot is processing task | tid="33428" timestamp=1717982370 id_slot=0 id_task=6
INFO [            update_slots] kv cache rm [p0, end) | tid="33428" timestamp=1717982370 id_slot=0 id_task=6 p0=0
INFO [            update_slots] slot released | tid="33428" timestamp=1717982370 id_slot=0 id_task=6 n_ctx=2048 n_past=3 n_system_tokens=0 n_cache_tokens=0 truncated=false
INFO [            update_slots] all slots are idle | tid="33428" timestamp=1717982370
INFO [   launch_slot_with_task] slot is processing task | tid="33428" timestamp=1717982370 id_slot=0 id_task=8
INFO [      log_server_request] request | tid="29004" timestamp=1717982370 remote_addr="127.0.0.1" remote_port=50927 status=200 method="POST" path="/embeddings" params={}
INFO [            update_slots] kv cache rm [p0, end) | tid="33428" timestamp=1717982370 id_slot=0 id_task=8 p0=0
INFO [            update_slots] slot released | tid="33428" timestamp=1717982370 id_slot=0 id_task=8 n_ctx=2048 n_past=3 n_system_tokens=0 n_cache_tokens=0 truncated=false
INFO [            update_slots] all slots are idle | tid="33428" timestamp=1717982370
INFO [      log_server_request] request | tid="26856" timestamp=1717982370 remote_addr="127.0.0.1" remote_port=50928 status=200 method="POST" path="/embeddings" params={}
INFO [   launch_slot_with_task] slot is processing task | tid="33428" timestamp=1717982370 id_slot=0 id_task=9
INFO [            update_slots] kv cache rm [p0, end) | tid="33428" timestamp=1717982370 id_slot=0 id_task=9 p0=0
INFO [            update_slots] slot released | tid="33428" timestamp=1717982370 id_slot=0 id_task=9 n_ctx=2048 n_past=3 n_system_tokens=0 n_cache_tokens=0 truncated=false
INFO [            update_slots] all slots are idle | tid="33428" timestamp=1717982370
INFO [      log_server_request] request | tid="36484" timestamp=1717982370 remote_addr="127.0.0.1" remote_port=50929 status=200 method="POST" path="/embeddings" params={}

Labels

bug-unconfirmed, medium severity