The `/v1/chat/completions` response HTTP headers are doubled when I run `llama-server` in multiple-models mode (with the `--models-dir` parameter). For the following request:
```
POST /v1/chat/completions HTTP/1.1
Host: 127.0.0.1:8080
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:140.0) Gecko/20100101 Firefox/140.0
Accept: */*
Accept-Language: en-US,en;q=0.7,cs;q=0.3
Accept-Encoding: gzip, deflate
Referer: http://127.0.0.1:8080/
Content-Type: application/json
Content-Length: 522
Origin: http://127.0.0.1:8080
DNT: 1
Connection: keep-alive
Cookie: sidebar:state=true
Priority: u=4

{"messages":[{"role":"user","content":"test"}],"stream":true,"model":"Qwen3-0.6B-Q8_0","reasoning_format":"auto","temperature":0.8,"max_tokens":-1,"dynatemp_range":0,"dynatemp_exponent":1,"top_k":40,"top_p":0.95,"min_p":0.05,"xtc_probability":0,"xtc_threshold":0.1,"typ_p":1,"repeat_last_n":64,"repeat_penalty":1,"presence_penalty":0,"frequency_penalty":0,"dry_multiplier":0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":-1,"samplers":["top_k","typ_p","top_p","min_p","temperature"],"timings_per_token":true}
```
the server returns:
```
HTTP/1.1 200 OK
Server: llama.cpp
Server: llama.cpp
Access-Control-Allow-Origin: http://127.0.0.1:8080
Access-Control-Allow-Origin: http://127.0.0.1:8080
Content-Type: application/json; charset=utf-8
Content-Type: text/event-stream
Transfer-Encoding: chunked
Transfer-Encoding: chunked
Keep-Alive: timeout=5, max=100
Keep-Alive: timeout=5, max=100
```
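llama-server's HTTP layer is cpp-httplib, whose `Response::set_header()` appends into a header multimap instead of replacing an existing value, so any code path that sets the same header twice on one response emits the line twice. My guess is that the multi-model routing does exactly that. The snippet below is only an illustration of the mechanism, not llama.cpp code; the `/demo` route and port 8081 are made up:

```cpp
// Illustration only (not llama.cpp code): cpp-httplib keeps response headers
// in a std::multimap, and Response::set_header() emplaces into it without
// removing an existing entry, so setting a header on two code paths
// produces a doubled header line like the ones above.
#include <httplib.h>

int main() {
    httplib::Server svr;

    svr.Get("/demo", [](const httplib::Request &, httplib::Response &res) {
        res.set_header("Server", "llama.cpp"); // e.g. common response setup
        res.set_header("Server", "llama.cpp"); // e.g. a second, per-model path
        res.set_content("{}", "application/json");
    });

    // `curl -v http://127.0.0.1:8081/demo` then shows two "Server:" lines.
    svr.listen("127.0.0.1", 8081);
    return 0;
}
```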
Everything seems to work if I connect directly; however, it becomes a problem when a proxy server (nginx in my case) is used:
```
2025/12/02 14:42:25 [error] 21668#0: *2408 upstream sent duplicate header line: "Transfer-Encoding: chunked", previous value: "Transfer-Encoding: chunked" while reading response header from upstream, client: 172.0.0.1, server: llm.home, request: "POST /v1/chat/completions HTTP/1.1", upstream: "http://127.0.0.1:8080/v1/chat/completions", host: "llm.home", referrer: "http://llm.home/"
```
Similarly, the `llm` tool does not work either, failing with the following error:
The issue is related to PR #17470.
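For reference, the duplication can also be confirmed without nginx using a small standalone client. This is a hypothetical helper, not part of llama.cpp: it assumes the server is still listening on 127.0.0.1:8080, reuses the `Qwen3-0.6B-Q8_0` model name from the request above, and lowers `max_tokens` so the streamed response finishes quickly:

```cpp
// Hypothetical check (not part of llama.cpp): replay a trimmed-down version of
// the failing request and report every response header name that occurs more
// than once. cpp-httplib stores response headers in a multimap, so duplicate
// header lines sent by the server are preserved.
#include <httplib.h>
#include <iostream>
#include <map>
#include <string>

int main() {
    httplib::Client cli("127.0.0.1", 8080);

    // Same endpoint and model as in the captured request; max_tokens reduced.
    const std::string body =
        R"({"messages":[{"role":"user","content":"test"}],)"
        R"("stream":true,"model":"Qwen3-0.6B-Q8_0","max_tokens":16})";

    auto res = cli.Post("/v1/chat/completions", body, "application/json");
    if (!res) {
        std::cerr << "request failed: " << httplib::to_string(res.error()) << "\n";
        return 1;
    }

    std::map<std::string, int> counts;
    for (const auto &h : res->headers) counts[h.first]++;

    for (const auto &kv : counts) {
        if (kv.second > 1) {
            std::cout << kv.first << " sent " << kv.second << " times\n";
        }
    }
    return 0;
}
```

Against the response captured above it would report `Server`, `Access-Control-Allow-Origin`, `Transfer-Encoding`, and `Keep-Alive` as duplicated.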