When serving the model DeepSeek-Distill-Qwen-7b through vllm with 16 threads (OMP_NUM_THREADS=16), the response to the prompt below is garbage.
xfastertransformer 1.8.2
vllm-xft 0.5.5.0
The result is correct when running with 12 threads (OMP_NUM_THREADS=12).
The prompt (which asks the model to generate a Gomoku game, with all code in a single HTML file) and the erroneous response:
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "deepseek-qwen-7b-xft",
"messages": [{"role": "user", "content": "请帮我用 HTML 生成一个五子棋游戏,所有代码都保存在一个 HTML 中。"}],
"max_tokens": 256,
"temperature": 0.7
}'
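For convenience, the same reproduction can be sketched in Python using only the standard library. This is an assumption-laden sketch: it presumes the vllm-xft server is listening on localhost:8000 with the model name above, and the helper names (`send_chat_request`, `looks_degenerate`) are hypothetical, not part of any library.

```python
import json
import urllib.request

# Same payload as the curl command above.
payload = {
    "model": "deepseek-qwen-7b-xft",
    "messages": [{
        "role": "user",
        "content": "请帮我用 HTML 生成一个五子棋游戏,所有代码都保存在一个 HTML 中。",
    }],
    "max_tokens": 256,
    "temperature": 0.7,
}

def send_chat_request(url="http://localhost:8000/v1/chat/completions"):
    # Hypothetical helper: POST the payload to the (assumed) local server.
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def looks_degenerate(response):
    # The failure mode seen above: the completion degenerates into a
    # run of "!" characters instead of HTML.
    text = response["choices"][0]["message"]["content"]
    return len(text) > 0 and set(text) == {"!"}
```

With 12 threads, `looks_degenerate(send_chat_request())` would be expected to return False; with 16 threads it returns True for this prompt.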
{"id":"chat-9dc50d6d9c8b499f9b4e13c0f9cd7644","object":"chat.completion","created":1739864206,"model":"deepseek-qwen-7b-xft","choices":[{"index":0,"message":{"role":"assistant","content":"!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":23,"total_tokens":279,"completion_tokens":256},"prompt_logprobs":null}(base)